What Is Logistic Regression?
Logistic regression is a statistical method used for binary
classification problems, where the goal is to predict the probability that an
observation belongs to one of two possible classes. In scikit-learn, logistic
regression is implemented in the LogisticRegression class from the linear_model module.
Key Concepts
- Binary Classification: Logistic regression is used to predict a binary outcome (1/0, True/False, Yes/No) based on one or more predictor variables.
- Sigmoid Function: Logistic regression uses the sigmoid function to map predicted values to probabilities. The sigmoid function outputs values between 0 and 1.
- Odds and Log-Odds: The logistic regression model predicts the log-odds of the positive class, which is then converted to a probability.
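To make the last two concepts concrete, here is a minimal sketch of the sigmoid function and its inverse, the log-odds (logit). The input values are hypothetical and chosen only for illustration; the helper names sigmoid and log_odds are not part of scikit-learn.

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: convert a probability to log-odds."""
    return np.log(p / (1.0 - p))

# Hypothetical log-odds values, chosen only for illustration
z = np.array([-2.0, 0.0, 2.0])
p = sigmoid(z)

print(p)            # probabilities strictly between 0 and 1
print(log_odds(p))  # recovers the original log-odds values
```

A log-odds of 0 corresponds to a probability of 0.5, which is why 0.5 is the natural cut-off between the two classes.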
The following program creates a simple credit scoring application using logistic regression in scikit-learn. The goal is to predict whether a loan applicant is likely to default, based on features such as income, loan amount, and credit history. It follows these steps:
1. Import Libraries: Import the necessary libraries for data manipulation, model building, and evaluation.
2. Load Dataset: Load and preprocess the dataset.
3. Split Dataset: Split the data into training and testing sets.
4. Train Model: Train a logistic regression model.
5. Evaluate Model: Evaluate the model using metrics such as accuracy, confusion matrix, etc.
6. Make Predictions: Make predictions on new data.
Example Program
Let's assume we have a dataset with the following columns:
- Income: Applicant's income.
- LoanAmount: Amount of loan requested.
- CreditHistory: Binary indicator of credit history (1 for good, 0 for bad).
- LoanDefault: Target variable indicating if the applicant defaulted (1 for default, 0 for no default).
Here's a complete example in Python using scikit-learn:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# Sample data for demonstration purposes
data = {
    'Income': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000],
    'LoanAmount': [10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 55000],
    'CreditHistory': [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    'LoanDefault': [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Features and target variable
X = df[['Income', 'LoanAmount', 'CreditHistory']]
y = df['LoanDefault']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the logistic regression model
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
# Example of making predictions on new data
new_data = pd.DataFrame({
    'Income': [125000, 65000],
    'LoanAmount': [35000, 20000],
    'CreditHistory': [1, 0]
})
new_predictions = model.predict(new_data)
print(f'Predictions for new data: {new_predictions}')
Explanation
1. Import Libraries: We import the necessary libraries including pandas for data manipulation, train_test_split for splitting the dataset, LogisticRegression for the model, and accuracy_score and confusion_matrix for evaluation.
2. Load Dataset: We create a simple hypothetical dataset with features Income, LoanAmount, CreditHistory, and the target variable LoanDefault.
3. Split Dataset: We split the dataset into training and testing sets using an 80-20 split. With 10 rows, that gives 8 rows for training and 2 for testing.
4. Train Model: We initialize and train a logistic regression model on the training data.
5. Evaluate Model: We evaluate the model on the test data using accuracy and the confusion matrix.
6. Make Predictions: We demonstrate making predictions on new data points.
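The program above prints class labels only, while the key concepts describe probabilities and log-odds. As a minimal sketch of how these relate in a fitted model, the block below refits on the same toy data and compares predict_proba with the sigmoid of decision_function. Two assumptions are made here that are not in the original program: the amounts are expressed in thousands and max_iter is raised, both to keep the solver well-behaved, and the model is fit on all ten rows for brevity.

```python
import numpy as np
import pandas as pd
from scipy.special import expit  # numerically stable sigmoid
from sklearn.linear_model import LogisticRegression

# Same toy data as the example program, with amounts in thousands
# (an assumption made here, not part of the original program)
df = pd.DataFrame({
    'Income': [50, 60, 70, 80, 90, 100, 110, 120, 130, 140],
    'LoanAmount': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55],
    'CreditHistory': [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    'LoanDefault': [0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
})
X = df[['Income', 'LoanAmount', 'CreditHistory']]
y = df['LoanDefault']

# Fit on all rows for brevity (no train/test split in this sketch)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Column 1 of predict_proba is the estimated probability of class 1 (default)
proba_default = model.predict_proba(X)[:, 1]

# decision_function returns the log-odds of class 1;
# applying the sigmoid converts them to probabilities
log_odds = model.decision_function(X)
manual_proba = expit(log_odds)

# predict() assigns class 1 whenever the log-odds are positive,
# which corresponds to a probability above 0.5
labels = model.predict(X)

print(np.allclose(proba_default, manual_proba))
print(np.array_equal(labels, (log_odds > 0).astype(int)))
```

Working with probabilities rather than labels is often useful in credit scoring, since the 0.5 cut-off can then be replaced by a threshold that suits the lender's risk tolerance.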
The credit scoring program above is a basic implementation using logistic regression in scikit-learn. With only ten rows, two of which are held out for testing, the reported accuracy is not a reliable estimate of real performance. In a real-world scenario, the dataset would be larger and more complex, and you would likely need additional preprocessing steps such as handling missing values and scaling features.
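The preprocessing steps mentioned above can be sketched with a scikit-learn pipeline. This is one possible arrangement, not the only one: it assumes median imputation for missing values and standard scaling, and it introduces a single hypothetical missing Income value into the toy data purely to exercise the imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data from the example program, with one Income value replaced
# by NaN (a hypothetical gap added only for this illustration)
df = pd.DataFrame({
    'Income': [50000, 60000, np.nan, 80000, 90000, 100000, 110000, 120000, 130000, 140000],
    'LoanAmount': [10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 55000],
    'CreditHistory': [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    'LoanDefault': [0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
})
X = df[['Income', 'LoanAmount', 'CreditHistory']]
y = df['LoanDefault']

# Chain the steps so the same preprocessing is applied at fit and predict time
pipeline = make_pipeline(
    SimpleImputer(strategy='median'),  # fill missing values with the column median
    StandardScaler(),                  # rescale each feature to zero mean, unit variance
    LogisticRegression(),
)
pipeline.fit(X, y)

new_data = pd.DataFrame({
    'Income': [125000, 65000],
    'LoanAmount': [35000, 20000],
    'CreditHistory': [1, 0],
})
print(pipeline.predict(new_data))              # predicted class labels
print(pipeline.predict_proba(new_data)[:, 1])  # estimated probability of default
```

Bundling the preprocessing with the model means the imputer and scaler learn their statistics from the training data only, which avoids leaking information from the test set when the pipeline is used with train_test_split or cross-validation.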