KNeighborsClassifier Algorithm in ML
In this exercise, we will learn about the KNeighborsClassifier Algorithm in ML.
K-Nearest Neighbors (KNN) is a simple, yet effective supervised learning algorithm used for both classification and regression. Below is an example of implementing KNN for classification using the scikit-learn library in Python.
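Before working through the full exercise, here is a minimal sketch of the core idea: a new point is assigned the majority label of its k nearest training points. The points and labels below are made up purely for illustration and are not part of the data.csv exercise.
Python
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2D points and labels, purely for illustration
X_toy = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_toy = ["blue", "blue", "blue", "red", "red", "red"]

# With k=3, a new point gets the majority label of its 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_toy, y_toy)
print(knn.predict([[2, 2], [9, 9]]))  # expected: ['blue' 'red']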
Syntax
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
The parameter n_neighbors specifies the number of neighbors to use by default for k-neighbors queries. The default value for this parameter is 5. Since n_neighbors is the first parameter, it can also be passed positionally, without the parameter name.
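For example, the following two constructors create equivalent classifiers (a minimal sketch; the value 3 is only illustrative):
Python
from sklearn.neighbors import KNeighborsClassifier

# Passing n_neighbors by keyword and by position gives the same model
knn_keyword = KNeighborsClassifier(n_neighbors=3)
knn_positional = KNeighborsClassifier(3)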
In this exercise, we will use the data.csv file as the data source for training the machine learning model. You can also download this data and use it for the ML task.
Example: Import the necessary libraries and read the data from the CSV file.
Python
import numpy as np
import pandas as pd

# Load the data into a pandas DataFrame
df = pd.read_csv("data.csv")
df
The output of the above code is shown below:
Let’s get the count of Males and Females in the DataFrame.
Python
df["Gender"].value_counts()
The output of the above code is shown below:
We found that there are 6 Females and 3 Males in the source data, so we extract the Male rows into a variable named minority and the Female rows into a variable named majority.
Python
# Filter the DataFrame by gender
minority = df[df['Gender'] == "Male"]
majority = df[df['Gender'] == "Female"]
We resample the data so that the number of Males and Females is equal and there is no gender imbalance in the trained model. Here, resample with replace=True draws rows from the minority class with replacement (bootstrapping) until it matches the size of the majority class.
Python
# Import the resample utility
from sklearn.utils import resample

# Upsample the minority class to match the majority class size
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)
minority_upsampled
The output of the above code is shown below:
Now we concatenate the majority and minority_upsampled data.
Python
# Combine the majority class with the upsampled minority class
df_balanced = pd.concat([majority, minority_upsampled])
df_balanced
The output of the above code is shown below:
Let’s again check the number of Males and Females in the transformed data.
Python
df_balanced['Gender'].value_counts()
The output of the above code is shown below:
So now we can see 6 Females and 6 Males.
We have a column Company that we are going to use in the machine learning model. To use it, we need to convert its values into numerical values. To do so, we can use the pandas get_dummies function, which creates one indicator (0/1) column per category.
Python
# One-hot encode the Company column
df_encoded = pd.get_dummies(df_balanced, columns=['Company'], dtype=int)
df_encoded
The output of the above code is shown below:
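To see what get_dummies does in isolation, here is a minimal sketch on a small made-up DataFrame (the company values 'A' and 'B' are only placeholders, not the categories in data.csv):
Python
import pandas as pd

# Illustrative values only; the real categories come from data.csv
toy = pd.DataFrame({'Company': ['A', 'B', 'A']})

# Produces indicator columns Company_A and Company_B with 0/1 values
pd.get_dummies(toy, columns=['Company'], dtype=int)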
Now we select the feature columns and the target column from the transformed data and store them in the X and y variables.
Python
# Select the feature columns (X) and the target column (y) used to train the model
# Note: the features must contain only numerical values,
# otherwise the algorithm will raise an error
X = df_encoded.drop(columns=['Name', 'Country', 'Gender'])
y = df_encoded['Gender']
X
The output of the above code is shown below:
And in the y variable we have the data as shown below:
Now it’s time to split the data into training and testing sets. Here we use 60% of the data for training and 40% for testing.
Python
# Import the module
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
The train_test_split function is from the sklearn.model_selection module and is used to split a dataset into training and testing sets.
The following are the parameters in the function:
- X: Feature matrix (independent variables or predictors).
- y: Target vector (dependent variable or labels).
- test_size=0.4: Specifies the proportion of the dataset to include in the test split (40% of the data for testing, 60% for training in this case).
- random_state=42: Ensures reproducibility by setting a seed for the random number generator. This guarantees the same split each time the code runs.
In the output, the function returns:
- X_train: Training set features (60% of X).
- X_test: Test set features (40% of X).
- y_train: Training set labels corresponding to X_train.
- y_test: Test set labels corresponding to X_test.
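To verify the split proportions, you can inspect the shapes of the returned arrays (a quick sketch using the variables created above):
Python
# The row counts should reflect the 60/40 split
print("Training features:", X_train.shape)
print("Testing features:", X_test.shape)
print("Training labels:", y_train.shape)
print("Testing labels:", y_test.shape)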
Now train the model on the training data using the KNeighborsClassifier algorithm.
Python
# Import the module
from sklearn.neighbors import KNeighborsClassifier

# Initialize the model with k=2
kn = KNeighborsClassifier(n_neighbors=2)

# Train (fit) the model on the training data
kn.fit(X_train, y_train)
The output of the above code is shown below:
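The choice of n_neighbors affects the model. One common way to compare candidate values of k is to score each model on the test set; the loop below is a sketch and not part of the original exercise, which uses k=2.
Python
from sklearn.neighbors import KNeighborsClassifier

# Compare a few values of k using the mean accuracy on the test set
for k in range(1, 6):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    print(f"k={k}: accuracy={model.score(X_test, y_test):.2f}")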
Get the predictions from the model trained with the KNN algorithm.
Python
# Predict the class labels for the test features
y_pred = kn.predict(X_test)
y_pred
The output of the above code is shown below:
Let’s evaluate the model using classification_report, accuracy_score, and other metrics.
Python
# Import the module
from sklearn.metrics import classification_report

# Generate the classification report
report = classification_report(y_test, y_pred)

# Print the report
print("Classification Report:\n", report)
The output of the above code is shown below:
Python
# Import the module
from sklearn.metrics import accuracy_score

# Print the accuracy score
print("Accuracy:", accuracy_score(y_test, y_pred))
The output of the above code is shown below:
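Another useful metric for a classifier is the confusion matrix, which shows how the predictions are distributed across the actual classes. The sketch below reuses the same y_test and y_pred from above.
Python
# Import the module
from sklearn.metrics import confusion_matrix

# Rows correspond to actual classes, columns to predicted classes
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))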