An Introduction to Artificial Intelligence
In this exercise, we will learn about Artificial Intelligence and its related terms.
Artificial intelligence (AI)
AI is a broad field that encompasses the development of intelligent systems capable of performing tasks that typically require human intelligence, such as perception, reasoning, learning, problem-solving, and decision-making. AI serves as an umbrella term for various techniques and approaches, including machine learning, deep learning, and generative AI, among others.
Machine learning (ML)
ML is a type of AI focused on understanding and building methods that make it possible for machines to learn. These methods use data to improve computer performance on a set of tasks.
Deep learning (DL)
Deep learning uses the concept of neurons and synapses, similar to how our brain is wired. An example of a deep learning application is Amazon Rekognition, which can analyze millions of images and streaming and stored videos within seconds.
Generative AI
Generative AI is a subset of deep learning: it can adapt models built using deep learning without retraining or fine-tuning them.
Generative AI systems are capable of generating new data based on the patterns and structures learned from training data.
Classification Model
Definition
A classification model is a type of machine learning algorithm used to predict discrete class labels based on input features. It maps input data to a set of predefined categories or classes.
Key Characteristics
- Output: A class label (e.g., "Yes"/"No", "Dog"/"Cat"/"Bird").
- Task Type: Predicts categorical variables.
- Goal: To assign the input data to one of the predefined classes.
Common Algorithms
- Logistic Regression: A statistical method for binary classification.
- Decision Trees: Splits data into subsets based on feature values.
- Random Forest: An ensemble of decision trees.
Applications
- Fraud detection in transactions.
- Diagnosing diseases (e.g., cancer detection).
- Sentiment analysis of text (e.g., positive, negative, neutral).
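To make this concrete, here is a minimal sketch of a classification model using scikit-learn's LogisticRegression. The hours-studied data and pass/fail labels are invented purely for illustration.
Python
# A minimal classification sketch (toy, invented data)
from sklearn.linear_model import LogisticRegression

# Feature: hours studied; labels: whether the student passed
X = [[1], [2], [3], [8], [9], [10]]
y = ["No", "No", "No", "Yes", "Yes", "Yes"]

model = LogisticRegression()
model.fit(X, y)
print(model.predict([[7]]))  # Outputs a class label, e.g., "Yes"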
Regression Model
Definition
A regression model is a type of machine learning algorithm used to predict continuous numerical values. It learns the relationship between the dependent variable (target) and the independent variables (features).
Key Characteristics
- Output: A continuous value (e.g., house price, stock price, temperature).
- Task Type: Predicts numerical variables.
- Goal: To predict a value as accurately as possible.
Examples:
- Predicting house prices based on square footage, location, and number of bedrooms.
- Estimating the number of sales a store will make next month.
- Forecasting stock market trends.
Common Algorithms
- Linear Regression: Models the relationship as a straight line (linear function).
- Decision Trees and Random Forest: Can also handle regression tasks.
- Neural Networks: Useful for complex and non-linear problems.
Applications
- Predicting weather conditions like temperature or rainfall.
- Estimating life expectancy based on demographic data.
- Predicting real estate prices.
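To make this concrete, here is a minimal sketch of a regression model using scikit-learn's LinearRegression. The square-footage and price numbers are invented purely for illustration.
Python
# A minimal regression sketch (toy, invented data)
from sklearn.linear_model import LinearRegression

# Feature: square footage; target: house price (a continuous value)
X = [[1000], [1500], [2000], [2500]]
y = [200000, 290000, 410000, 500000]

model = LinearRegression()
model.fit(X, y)
print(model.predict([[1800]]))  # Outputs a continuous value, roughly 360000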
Note: In a regression model, both the features (independent variables) and the target (dependent variable) should typically be numerical.
If we have categorical data, it must be converted into a numerical format using encoding techniques like:
* One-hot encoding: Creates binary columns for each category.
* Label encoding: Assigns a unique integer to each category.
If the target is categorical (e.g., labels or classes), a classification model should be used instead.
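Here is a small sketch showing both encoding techniques with pandas and scikit-learn; the Color column is invented purely for illustration.
Python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# An invented categorical column for illustration
df_demo = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# One-hot encoding: creates one binary column per category
one_hot = pd.get_dummies(df_demo, columns=['Color'], dtype=int)
print(one_hot)

# Label encoding: assigns a unique integer to each category
df_demo['Color_label'] = LabelEncoder().fit_transform(df_demo['Color'])
print(df_demo)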
Random Forest Classifier Algorithm
The Random Forest Classifier is an ensemble learning method primarily used for classification (and regression) tasks. It constructs multiple decision trees during training and outputs the class that is the mode of the classes predicted by the individual trees.
Example: Implementing Random Forest in Python
Let’s see how we can use the RandomForestClassifier from the sklearn library.
In this exercise, we will use the data.csv file as the data source for training the machine learning algorithm. You can also download this data and use it for the ML task.
Importing Necessary Libraries
Python
import numpy as np
import pandas as pd
Load the data into a pandas DataFrame.
Python
# Load data into a pandas DataFrame
# df = pd.read_csv("file_path")
df = pd.read_csv("data.csv")
df
The output of the above code is shown below:
Let’s get the count of Males and Females in the DataFrame.
Python
df["Gender"].value_counts()
The output of the above code is shown below:
We found that there are 6 Females and 3 Males in the source data, so we store the Male rows in a variable named minority and the Female rows in a variable named majority.
Python
minority = df[df['Gender'] == "Male"]
majority = df[df['Gender'] == "Female"]
We will resample the data so that the numbers of males and females are equal, avoiding gender bias in the trained model.
Python
from sklearn.utils import resample

# Upsample the minority class so it matches the majority class size
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
minority_upsampled
The output of the above code is shown below:
Now we are going to concatenate the majority and minority_upsampled data.
Python
df_balanced = pd.concat([majority, minority_upsampled])
df_balanced
The output of the above code is shown below:
Let’s again check the number of Males and Females in the transformed data.
Python
df_balanced['Gender'].value_counts()
The output of the above code is shown below:
So now we can see 6 Females and 6 Males.
We have a Company column that we are going to use in the machine learning task. To use it, we need to convert its values to numerical form. We can do so with the pandas get_dummies function.
Python
# One-hot encode the Company column
df_encoded = pd.get_dummies(df_balanced, columns=['Company'], dtype=int)
df_encoded
The output of the above code is shown below:
Now we select the feature columns and the target column from the transformed data into the X and y variables.
Python
# Select the features (X) and the target (y) used to train our model
# Note: the features should contain only numerical data;
# otherwise, the algorithm will raise an error
X = df_encoded.drop(columns=['Name', 'Country', 'Gender'])
y = df_encoded['Gender']
X
The output of the above code is shown below:
And in the y variable we have the data as shown below:
Now it’s time to split the data into training and testing sets. Here we use 60% of the data for training and 40% for testing.
Python
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
The train_test_split function is from the sklearn.model_selection module and is used to split a dataset into training and testing sets.
The following are the parameters in the function:
- X: Feature matrix (independent variables or predictors).
- y: Target vector (dependent variable or labels).
- test_size=0.4: Specifies the proportion of the dataset to include in the test split (40% of the data for testing, 60% for training in this case).
- random_state=42: Ensures reproducibility by setting a seed for the random number generator. This guarantees the same split each time the code runs.
In the output, the function returns:
- X_train: Training set features (60% of X).
- X_test: Test set features (40% of X).
- y_train: Training set labels corresponding to X_train.
- y_test: Test set labels corresponding to X_test.
The names X_train, X_test, y_train, and y_test are just standard convention names, but we can use any variable names that make sense for our application.
For example, you can write:
Python
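# The same split as above, using descriptive variable names
features_train, features_test, labels_train, labels_test = train_test_split(X, y, test_size=0.4, random_state=42)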
Here, the variables are renamed to:
- features_train for the training features.
- features_test for the testing features.
- labels_train for the training labels.
- labels_test for the testing labels.
Now train the model on the training data using the RandomForestClassifier algorithm.
RandomForestClassifier
We can initialize the algorithm using the syntax below.
Syntax
model = RandomForestClassifier(n_estimators=100, random_state=42)
The parameter n_estimators has the default value 100. It specifies the number of trees in the forest.
Python
# Training the Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# Initialize the classifier
rf = RandomForestClassifier(random_state=42)

# Fit the model on the training data
rf.fit(X_train, y_train)
The output of the above code is shown below:
Now it’s time to make predictions on the testing data.
Python
# Make predictions on the test set
y_pred = rf.predict(X_test)
y_pred
The output of the above code is shown below:
Now let’s evaluate the model.
Python
from sklearn.metrics import classification_report

# Generate the classification report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)
The output of the above code is shown below:
Q: What is a Classification Report? A Classification Report is a detailed performance evaluation tool provided by the scikit-learn library in Python. It summarizes the key metrics used to assess the effectiveness of a classification model, helping you understand how well your model is performing on each class.
The classification report includes:
- Precision
- Recall (Sensitivity)
- F1 Score
- Support
These metrics are calculated for each class in your dataset and provide a deeper insight than just looking at accuracy, especially for imbalanced datasets.
Key Metrics Explained
Here's what each metric in the classification report represents:
1. Precision:
- The proportion of correctly predicted positive observations to the total predicted positives.
- Formula: Precision=TP/(TP+FP)
- High precision indicates a low false positive rate.
2. Recall (Sensitivity or True Positive Rate):
- The proportion of correctly predicted positive observations to all actual positives.
- Formula: Recall=TP/(TP+FN)
- High recall indicates a low false negative rate.
3. F1 Score:
- The harmonic mean of precision and recall. It balances the two, especially useful when you need a single metric to compare models.
- Formula: F1=(2×Precision×Recall)/(Precision+Recall)
- A high F1 score indicates both high precision and recall.
4. Support:
- The number of actual occurrences of each class in your dataset.
- This is not a performance metric but rather a count of the samples available for each class.
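As a quick worked example with invented counts: if a class has 8 true positives, 2 false positives, and 4 false negatives, then Precision = 8/(8+2) = 0.80, Recall = 8/(8+4) ≈ 0.67, and F1 = (2×0.80×0.67)/(0.80+0.67) ≈ 0.73.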
DecisionTreeClassifier Algorithm
Let’s use a DecisionTreeClassifier model to train and test the data.
Example:
Python
from sklearn.tree import DecisionTreeClassifier

# Initialize the model
model = DecisionTreeClassifier()

# Train the model
model.fit(X_train, y_train)

# Predict the class for each sample in X_test
predictions = model.predict(X_test)
predictions
Let’s check the score of the DecisionTreeClassifier model.
Python
# Return the mean accuracy on the given test data and true labels
result = model.score(X_test, y_test)
result
The output of the above code is shown below:
LogisticRegression Algorithm
Let’s use a LogisticRegression model to train and test the data.
Example:
Python
from sklearn.linear_model import LogisticRegression

# Initialize the model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Predict class labels for the samples in X_test
predictions = model.predict(X_test)
predictions
The output of the above code is shown below:
Let’s check the score of the LogisticRegression model.
Python
# Return the mean accuracy on the given test data and true labels
result = model.score(X_test, y_test)
result
The output of the above code is shown below:
Machine Learning using ColumnTransformer
Python Syntax
from sklearn.compose import ColumnTransformer

# Create a ColumnTransformer
column_transformer = ColumnTransformer(
    transformers=[
        ('transformer_name1', transformer1, column_list1),
        ('transformer_name2', transformer2, column_list2),
        ...
    ],
    remainder='drop'  # Optional: specifies what to do with unselected columns
)
Creating a Pipeline in ML
Python Syntax
from sklearn.pipeline import Pipeline

# Create a pipeline
pipeline = Pipeline(steps=[
    ('step_name1', transformer_or_model1),
    ('step_name2', transformer_or_model2),
    ...
])
Explanation:
- Pipeline: The scikit-learn class used to create the pipeline.
- steps: A list of tuples specifying the sequence of steps.
- step_name: A string that names the step (e.g., "preprocessor", "scaler").
- transformer_or_model: A transformer (e.g., scaler, encoder) or model (e.g., RandomForestClassifier) to be applied in this step.
- Order: Steps are executed sequentially in the order they appear in the list.
Pipeline with ColumnTransformer
Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Define the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),  # Scale numerical columns
        ('cat', OneHotEncoder(), ['gender', 'city'])   # Encode categorical columns
    ])

# Create a pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),           # Step 1: Preprocess the data
    ('classifier', RandomForestClassifier())  # Step 2: Train the classifier
])
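As a usage sketch, suppose we have a DataFrame df with age, income, gender, and city feature columns and a target Series y; these names match the ColumnTransformer above but are assumptions for illustration.
Python
# Hypothetical usage of the pipeline above (assumed column names)
X = df[['age', 'income', 'gender', 'city']]  # Features expected by the preprocessor
pipeline.fit(X, y)                 # Preprocesses the data, then trains the classifier
predictions = pipeline.predict(X)  # Applies the same preprocessing before predicting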
Key Methods in Pipelines:
1. fit(X, y) Fits all steps (e.g., preprocessing and model) to the training data.
Python Syntax
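pipeline.fit(X_train, y_train)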
2. predict(X) Applies the transformations and makes predictions using the model.
Python Syntax
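predictions = pipeline.predict(X_test)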
3. fit_predict(X, y) Fits the pipeline and directly predicts results.
Python Syntax
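# Note: available only when the pipeline's final estimator implements fit_predict
predictions = pipeline.fit_predict(X_train, y_train)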
4. score(X, y) Evaluates the model's performance (e.g., accuracy for classifiers).
Python Syntax
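accuracy = pipeline.score(X_test, y_test)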