What Is Logistic Regression?

Logistic regression is a statistical method used for binary classification problems, where the goal is to predict the probability that an observation belongs to one of two possible classes. In scikit-learn, logistic regression is implemented in the LogisticRegression class from the linear_model module.
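As a minimal sketch of that API, the snippet below fits LogisticRegression on a tiny one-feature dataset and asks for class probabilities. The toy data and variable names are made up for illustration and are not part of the credit scoring example later in this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, label 1 for the larger values.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

queries = np.array([[2.0], [5.0]])

# predict_proba returns one column per class, ordered as in model.classes_.
proba = model.predict_proba(queries)
print(model.classes_)
print(proba)

# predict returns the class with the highest estimated probability.
labels = model.predict(queries)
print(labels)
```

Each row of the probability array sums to 1, and the second column is the estimated probability of class 1.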

Key Concepts

  • Binary Classification: Logistic regression is used to predict a binary outcome (1/0, True/False, Yes/No) based on one or more predictor variables.

  • Sigmoid Function: Logistic regression uses the sigmoid function, sigmoid(z) = 1 / (1 + e^(-z)), to map any real-valued score z to a probability strictly between 0 and 1.

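  • Odds and Log-Odds: The odds of the positive class are p / (1 - p), and the log-odds are the natural logarithm of that ratio. The model expresses the log-odds as a linear combination of the predictor variables, and the sigmoid function converts the result to a probability.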
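The relationship between log-odds and probability can be checked numerically. This is a small sketch using only the standard library; the helper names sigmoid and log_odds are chosen here for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p: float) -> float:
    """Inverse of the sigmoid: convert a probability back to log-odds."""
    return math.log(p / (1.0 - p))


# Round trip: log-odds -> probability -> log-odds.
for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    print(f"log-odds {z:+.1f} -> probability {p:.3f} -> log-odds {log_odds(p):+.1f}")
```

A log-odds of 0 corresponds to a probability of 0.5, which is the usual decision boundary between the two classes.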

 

The following program builds a simple credit scoring application using logistic regression in scikit-learn. The goal is to predict whether a loan applicant is likely to default, based on features such as income, loan amount, and credit history. It proceeds in six steps:


1. Import Libraries: Import the necessary libraries for data manipulation, model building, and evaluation.


2. Load Dataset: Load and preprocess the dataset.


3. Split Dataset: Split the data into training and testing sets.


4. Train Model: Train a logistic regression model.


5. Evaluate Model: Evaluate the model using metrics such as accuracy, confusion matrix, etc.


6. Make Predictions: Make predictions on new data.


 Example Program


Let's assume we have a dataset with the following columns (three features and one target):


 Income: Applicant's income.


 LoanAmount: Amount of loan requested.


 CreditHistory: Binary indicator of credit history (1 for good, 0 for bad).


 LoanDefault: Target variable indicating if the applicant defaulted (1 for default, 0 for no default).




Here's a complete example in Python using scikit-learn:


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Sample data for demonstration purposes
data = {
    'Income': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000],
    'LoanAmount': [10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 55000],
    'CreditHistory': [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    'LoanDefault': [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Features and target variable
X = df[['Income', 'LoanAmount', 'CreditHistory']]
y = df['LoanDefault']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')

# Example of making predictions on new data
new_data = pd.DataFrame({
    'Income': [125000, 65000],
    'LoanAmount': [35000, 20000],
    'CreditHistory': [1, 0]
})

new_predictions = model.predict(new_data)
print(f'Predictions for new data: {new_predictions}')

Explanation


1. Import Libraries: We import the necessary libraries including pandas for data manipulation, train_test_split for splitting the dataset, LogisticRegression for the model, and accuracy_score and confusion_matrix for evaluation.


2. Load Dataset: We create a simple hypothetical dataset with features Income, LoanAmount, CreditHistory, and the target variable LoanDefault.


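3. Split Dataset: We split the dataset into training and testing sets using an 80-20 split, which gives 8 training rows and 2 test rows here. Setting random_state=42 makes the split reproducible.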


4. Train Model: We initialize and train a logistic regression model on the training data.


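5. Evaluate Model: We evaluate the model on the test data using accuracy and the confusion matrix. With only two test rows, these numbers are illustrative rather than a reliable estimate of performance.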


6. Make Predictions: We demonstrate making predictions on new data points.
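To make the evaluation step more concrete, here is a sketch of how a binary confusion matrix can be unpacked into its four counts. The y_true and y_pred lists below are hypothetical labels written by hand for illustration; they are not output from the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, not produced by the model above (1 = default, 0 = no default).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes and columns are predicted classes, so for
# labels [0, 1] ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f'True negatives: {tn}, false positives: {fp}')
print(f'False negatives: {fn}, true positives: {tp}')

# Precision: of the applicants flagged as defaulters, how many actually defaulted.
precision = precision_score(y_true, y_pred)
# Recall: of the applicants who defaulted, how many the model flagged.
recall = recall_score(y_true, y_pred)
print(f'Precision: {precision:.2f}, Recall: {recall:.2f}')
```

In a credit scoring setting, false negatives (defaulters predicted as safe) and false positives (safe applicants predicted as defaulters) usually carry different costs, so looking at these counts separately can be more informative than accuracy alone.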




This example provides a basic implementation of a credit scoring application using logistic regression in scikit-learn. In a real-world scenario, the dataset would be larger and more complex, and you might need to perform additional preprocessing steps such as handling missing values, scaling features, and more.
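As a rough sketch of what those extra preprocessing steps might look like, the snippet below chains median imputation, feature scaling, and logistic regression in a scikit-learn pipeline. It reuses the sample columns from the example, and one Income value is deliberately replaced with a missing value for illustration; the choice of median imputation and StandardScaler is an assumption, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same columns as the example; one Income value is set to NaN on purpose (hypothetical).
df = pd.DataFrame({
    'Income': [50000, 60000, np.nan, 80000, 90000, 100000, 110000, 120000, 130000, 140000],
    'LoanAmount': [10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 55000],
    'CreditHistory': [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    'LoanDefault': [0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
})

X = df[['Income', 'LoanAmount', 'CreditHistory']]
y = df['LoanDefault']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each step is fitted on the training data only, then applied to any later input.
pipeline = make_pipeline(
    SimpleImputer(strategy='median'),  # fill missing values with the column median
    StandardScaler(),                  # rescale features to zero mean, unit variance
    LogisticRegression(),
)
pipeline.fit(X_train, y_train)

new_data = pd.DataFrame({
    'Income': [125000, 65000],
    'LoanAmount': [35000, 20000],
    'CreditHistory': [1, 0],
})
preds = pipeline.predict(new_data)
print(f'Predictions for new data: {preds}')
```

Wrapping the steps in a pipeline keeps the imputation and scaling statistics tied to the training set, so the same transformations are applied consistently at prediction time.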


