LogisticRegression Algorithm in ML
Let’s use a LogisticRegression model to train and test the data.
In this exercise, we will use the data.csv file as the data source for training the machine learning algorithm. You can also download this data and use it for the ML task.
Example: Import the necessary libraries and read the data from the CSV file.
Python
import numpy as np
import pandas as pd

# Load the data into a Pandas DataFrame
df = pd.read_csv("data.csv")
df
The output of the above code is shown below:
Let’s get the count of Males and Females in the DataFrame.
Python
# Get the unique values and their corresponding counts
df["Gender"].value_counts()
The output of the above code is shown below:
We found that there are 6 Females and 3 Males in the source data. So, we extract the Male rows into a variable named minority and the Female rows into a variable named majority.
Python
# Filter the DataFrame by gender
minority = df[df['Gender'] == "Male"]
majority = df[df['Gender'] == "Female"]
We are going to resample the data so that the number of Males and Females is equal, which avoids gender bias in the trained model.
Python
# Import the resample function
from sklearn.utils import resample

# Upsample the minority class to match the size of the majority class
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
minority_upsampled
The output of the above code is shown below:
Now we are going to concatenate the majority and minority_upsampled data.
Python
# Combine the majority data with the upsampled minority data
df_balanced = pd.concat([majority, minority_upsampled])
df_balanced
The output of the above code is shown below:
Let’s again check the number of Males and Females in the transformed data.
Python
df_balanced['Gender'].value_counts()
The output of the above code is shown below:
So, now we can see 6 Females and 6 Males.
We have a column, Company, which we are going to use in the machine learning model. To use it, we need to convert its values into numerical values. To do so, we can use the pandas get_dummies function.
Python
# One-hot encode the Company column
df_encoded = pd.get_dummies(df_balanced, columns=['Company'], dtype=int)
df_encoded
The output of the above code is shown below:
The Gender column still contains text values. Since it is the column we will predict, we convert it into numerical values using LabelEncoder.
Python
# Import the module
from sklearn.preprocessing import LabelEncoder

# Initialize the encoder
label_encoder = LabelEncoder()

# Encode the Gender column as numbers
df_encoded['Gender'] = label_encoder.fit_transform(df_encoded['Gender'])
df_encoded
The output of the above code is shown below:
Now we select the columns from the transformed data for the features and the target, storing them in the X and y variables.
Python
# Load the feature and target data that we will use to train the model
# Note: the feature data should contain only numerical data types,
# otherwise the algorithm will raise an error
X = df_encoded.drop(columns=['Name', 'Country', 'Gender'])
y = df_encoded['Gender']
X
The output of the above code is shown below:
And in the y variable we have the data as shown below:
Now it’s time to split the data into training and testing sets. Here we use 60% of the data for training and 40% for testing.
Python
# Import the module
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
The train_test_split function is from the sklearn.model_selection module and is used to split a dataset into training and testing sets.
The following are the parameters in the function:
- X: Feature matrix (independent variables or predictors).
- y: Target vector (dependent variable or labels).
- test_size=0.4: Specifies the proportion of the dataset to include in the test split (40% of the data for testing, 60% for training in this case).
- random_state=42: Ensures reproducibility by setting a seed for the random number generator. This guarantees the same split each time the code runs.
In the output, the function returns:
- X_train: Training set features (60% of X).
- X_test: Test set features (40% of X).
- y_train: Training set labels corresponding to X_train.
- y_test: Test set labels corresponding to X_test.
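To confirm the 60/40 split, you can inspect the number of rows in each set. This is a minimal sketch; the exact counts depend on the data you load.
Python
# Verify the split by checking the number of rows in each set
print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])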
Now let’s train the model on the training data using the LogisticRegression algorithm.
LogisticRegression Algorithm
We can initialize the algorithm using either of the syntaxes below.
a) LogisticRegression()
b) LogisticRegression(max_iter=1000)
The parameter max_iter specifies the maximum number of iterations taken for the solvers to converge. The default value of this parameter is 100.
Python
# Import the module
from sklearn.linear_model import LogisticRegression

# Initialize the model
# model = LogisticRegression()
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)
The output of the above code is shown below:
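Once fitting completes, you can inspect the learned parameters and how many iterations the solver actually needed. This is a quick sketch; the attribute values depend on your data.
Python
# Inspect the learned coefficients and intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# Number of iterations the solver actually ran (should be well below max_iter=1000)
print("Iterations used:", model.n_iter_)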
Now it is time to make the predictions on the testing data.
Python
# Predict class labels for the samples in X_test
y_pred = model.predict(X_test)
y_pred
The output of the above code is shown below:
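To see how the predictions line up with the actual labels, you can place them side by side. This is a minimal sketch assuming the y_test and y_pred variables from above.
Python
import pandas as pd

# Put the actual and predicted labels side by side for inspection
comparison = pd.DataFrame({"Actual": y_test.values, "Predicted": y_pred})
print(comparison)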
Let’s generate the classification_report for the LogisticRegression model.
Python
# Import the module
from sklearn.metrics import classification_report

# Generate the classification report
report = classification_report(y_test, y_pred)

# Print the classification report
print("Classification Report:\n", report)
The output of the above code is shown below:
Let’s check the accuracy_score of the LogisticRegression model.
Python
# Import the module
from sklearn.metrics import accuracy_score

# Print the accuracy score
print("Accuracy:", accuracy_score(y_test, y_pred))
The output of the above code is shown below:
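As a sanity check, accuracy is simply the fraction of test samples that were predicted correctly, so it can be computed by hand. This sketch assumes the y_test and y_pred variables from above.
Python
import numpy as np

# Accuracy is the fraction of test samples predicted correctly
manual_accuracy = np.mean(y_pred == y_test.values)
print("Manual accuracy:", manual_accuracy)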