Bank client behavior prediction

The goal of this analysis is to compare the performance of two machine learning models, Logistic Regression and Random Forest, in predicting whether a client will have overdue payments. By using features from client application and credit records, the analysis aims to assess which model can better predict a client's likelihood of having overdue payments, based on accuracy metrics.

The ultimate goal is to determine whether the relationships between the input features and the target variable are linear or nonlinear, and to select the model that provides the most accurate predictions for identifying at-risk clients.

Skills:

  • Python

  • Machine learning model training

  • Feature engineering

  • Model evaluation & comparison

  • Model overfitting detection

Data cleaning

This part involved importing the datasets into DataFrames and determining the number of rows and unique bank clients in each. I then merged the two datasets on a common ID column, using an inner join to keep only matching records, which produced a consolidated DataFrame.
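A minimal sketch of this step is below; the file names and the ID column name are assumptions for illustration, not taken from the project.

```python
import pandas as pd

# Load the two source datasets (file names assumed for illustration)
df_application = pd.read_csv("application_record.csv")
df_credit = pd.read_csv("credit_record.csv")

# Rows and unique clients in each dataset
print(len(df_application), df_application["ID"].nunique())
print(len(df_credit), df_credit["ID"].nunique())

# Inner join on the shared ID column keeps only clients present in both files
df_merged = pd.merge(df_application, df_credit, on="ID", how="inner")
print(len(df_merged), df_merged["ID"].nunique())
```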

This allowed me to explore the merged data, analyze the number of rows and unique clients, and investigate differences in multiple rows associated with the same ID.

Data Wrangling and Feature Engineering

After merging these datasets on the common ID column, I performed a series of data wrangling tasks to identify clients with overdue payments in the last 12 months. This involved mapping the STATUS column to indicate whether a client had past-due payments and filtering the data accordingly. I used NumPy to extract the unique client IDs with overdue statuses and created a new DataFrame containing these clients, with duplicates removed and missing values handled.
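A hedged sketch of this logic, assuming that STATUS values of '1' through '5' denote past-due buckets and that MONTHS_BALANCE counts months back from the present; the exact coding used in the project may differ.

```python
import numpy as np

# Keep only the last 12 months of credit records (MONTHS_BALANCE assumed 0..-11)
recent = df_merged[df_merged["MONTHS_BALANCE"] >= -11].copy()

# Map STATUS to a binary past-due flag (statuses '1'-'5' assumed to mean overdue)
recent["past_due"] = recent["STATUS"].isin(["1", "2", "3", "4", "5"]).astype(int)

# Unique client IDs with at least one overdue record in the window
overdue_ids = np.unique(recent.loc[recent["past_due"] == 1, "ID"].to_numpy())

# One row per overdue client, duplicates removed
df_overdue = df_merged[df_merged["ID"].isin(overdue_ids)].drop_duplicates(subset="ID")
```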

I then appended additional rows to bring the dataset up to the required size, marking each client with the appropriate target variable.
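One way this could look, assuming a binary target (1 for overdue clients, 0 otherwise) and an illustrative required size; the actual size and sampling rule come from the project itself.

```python
# Overdue clients get target 1
df_overdue = df_overdue.copy()
df_overdue["target"] = 1

# Clients without overdue payments, one row each, get target 0
df_ok = df_merged[~df_merged["ID"].isin(overdue_ids)].drop_duplicates(subset="ID").copy()
df_ok["target"] = 0

# Pad the dataset to the required size (placeholder value, assumed for illustration)
required_size = 20_000
n_extra = required_size - len(df_overdue)
df_final = pd.concat(
    [df_overdue, df_ok.sample(n=n_extra, random_state=10)],
    ignore_index=True,
)
```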

Imputing missing values and dealing with categorical features

The DataFrame contains 5 numeric variables, 12 nominal variables, and 1 ordinal variable.

In this part, I handled missing values in the dataset by applying different imputation techniques based on the nature of the variables. For numerical data, I used the median to fill in missing values, ensuring robustness against outliers. For nominal data, I applied the mode to keep the most common category. Similarly, for the ordinal variable NAME_EDUCATION_TYPE, I used the mode to preserve the order and frequency of its values. This imputation process ensured that the dataset was complete and ready for accurate analysis while maintaining the integrity of each variable type.
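A sketch of the imputation step, assuming df_final holds the merged data and that numeric and categorical columns can be picked out by dtype.

```python
# Numeric columns: median imputation (robust to outliers)
numeric_cols = df_final.select_dtypes(include="number").columns
df_final[numeric_cols] = df_final[numeric_cols].fillna(df_final[numeric_cols].median())

# Nominal columns and the ordinal NAME_EDUCATION_TYPE: mode imputation
categorical_cols = df_final.select_dtypes(include="object").columns
for col in categorical_cols:
    df_final[col] = df_final[col].fillna(df_final[col].mode()[0])
```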

One-Hot Encoding of Nominal Features

In this step, I applied one-hot encoding to the nominal features in the df_final DataFrame, converting them into dummy variables so the data is suitable for machine learning models. First, I identified the nominal features, which were originally stored as strings, and used the get_dummies() function to create binary variables for each category. To avoid multicollinearity, I set drop_first=True, which removes the first column from each set of dummy variables so the encoded variables are not redundant. After creating the dummy variables, I dropped the original nominal columns from df_final and added the newly created dummy variables back into the DataFrame. This transformed the categorical data into a numerical format, ready for analysis and machine learning.
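A minimal sketch of the encoding, assuming the nominal features are the remaining string columns other than the ordinal NAME_EDUCATION_TYPE.

```python
# Nominal (string) columns, excluding the ordinal NAME_EDUCATION_TYPE
nominal_cols = [c for c in df_final.select_dtypes(include="object").columns
                if c != "NAME_EDUCATION_TYPE"]

# drop_first=True drops one dummy per feature to avoid multicollinearity
dummies = pd.get_dummies(df_final[nominal_cols], drop_first=True)

# Replace the original nominal columns with their dummy variables
df_final = pd.concat([df_final.drop(columns=nominal_cols), dummies], axis=1)
```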

Data standardization and training

In this part, I prepared the dataset for machine learning by splitting it into training and testing sets and standardizing the feature values. First, I separated the target variable y from the feature matrix X. Using train_test_split from sklearn.model_selection, I split the data into 75% training data and 25% testing data, stratifying on the target variable y to maintain the same class proportions in both sets.
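A sketch of the split, assuming the target column is named target and reusing the random_state of 10 mentioned later for the models.

```python
from sklearn.model_selection import train_test_split

# Separate target and features (target column name assumed)
y = df_final["target"]
X = df_final.drop(columns=["target"])

# 75/25 split, stratified on y so both sets keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=10
)
```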

After splitting, I standardized the feature values using StandardScaler. This scaling technique ensures that all features have a mean of 0 and a standard deviation of 1, which is essential for many machine learning algorithms to perform optimally. The scaler was fitted on the training data and then applied to both the training and test sets to avoid data leakage. This process ensures that the model is trained on normalized data, improving its efficiency and performance during training and testing.
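The scaling step, as described, would look roughly like this:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # no refit here, which avoids data leakage
```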

Logistic Regression and Random Forest Classifiers and Accuracies

In this part of the project, I trained two machine learning models—a Logistic Regression classifier and a Random Forest classifier—on standardized data to evaluate their performance. First, I trained a Logistic Regression model using the training data, with random_state set to 10 for reproducibility. After fitting the model, I calculated the accuracy for both the training and test datasets, which provided insight into how well the model generalizes to unseen data.
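A minimal sketch of the Logistic Regression step, using sklearn's defaults apart from random_state=10:

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=10)
lr.fit(X_train_scaled, y_train)

# Comparing training and test accuracy shows how well the model generalizes
print("Logistic Regression train accuracy:", lr.score(X_train_scaled, y_train))
print("Logistic Regression test accuracy: ", lr.score(X_test_scaled, y_test))
```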

Next, I trained a Random Forest classifier, also with random_state set to 10, to compare its performance with the Logistic Regression model. I similarly calculated and printed the training and test accuracies to evaluate the model's fit and ability to predict on new data.
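And the Random Forest counterpart, again with defaults apart from random_state=10; a large gap between training and test accuracy here would signal overfitting:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=10)
rf.fit(X_train_scaled, y_train)

print("Random Forest train accuracy:", rf.score(X_train_scaled, y_train))
print("Random Forest test accuracy: ", rf.score(X_test_scaled, y_test))
```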

This comparison between two different models showcases my ability to implement, train, and evaluate multiple machine learning algorithms, leveraging accuracy metrics to assess their performance and draw insights on model effectiveness.

Visit full project here