An Introduction to Artificial Intelligence

In this exercise, we will learn about artificial intelligence (AI) and its related terms.

Artificial intelligence (AI)
AI is a broad field concerned with building systems that perform tasks that typically require human intelligence, such as perception, reasoning, learning, problem-solving, and decision-making. AI is an umbrella term for various techniques and approaches, including machine learning, deep learning, and generative AI.

Machine learning (ML)
ML is a branch of AI concerned with understanding and building methods that let machines learn. These methods use data to improve a computer's performance on a set of tasks.

Deep learning (DL)
Deep learning is a branch of ML that uses layered artificial neural networks, with neurons and connections loosely modeled on how the brain is wired. An example of a deep learning application is Amazon Rekognition, which can analyze millions of images as well as streaming and stored videos within seconds.

Generative AI
Generative AI is a subset of deep learning: it can adapt models built using deep learning to new tasks without retraining or fine-tuning.

Generative AI systems are capable of generating new data based on the patterns and structures learned from training data.


Classification Model
Definition
A classification model is a type of machine learning algorithm used to predict discrete class labels based on input features. It maps input data to a set of predefined categories or classes.
Key Characteristics

Common Algorithms

Applications

Regression Model
Definition
A regression model is a type of machine learning algorithm used to predict continuous numerical values. It learns the relationship between the dependent variable (target) and independent variables (features).
Key Characteristics

Examples:

  1. Predicting house prices based on square footage, location, and number of bedrooms.
  2. Estimating the number of sales a store will make next month.
  3. Forecasting stock market trends.
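
The first example above can be sketched with scikit-learn's LinearRegression. This is a minimal illustration, not part of the exercise: the house data is made up, the prices are generated from a fixed formula so the result is easy to check, and the names X_houses, y_prices, and reg are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: [square footage, bedrooms] for five houses.
X_houses = np.array([
    [1000, 2],
    [1500, 3],
    [2000, 3],
    [2500, 4],
    [3000, 5],
])

# Assumption for illustration: price = 100 * sqft + 5000 * bedrooms exactly.
y_prices = 100 * X_houses[:, 0] + 5000 * X_houses[:, 1]

# Fit the regression model on the numerical features and target
reg = LinearRegression()
reg.fit(X_houses, y_prices)

# Predict a continuous value for an unseen house (1800 sq ft, 3 bedrooms)
predicted_price = reg.predict(np.array([[1800, 3]]))[0]
print(round(predicted_price))
```

Because the made-up prices follow a linear formula, the model recovers it and the prediction matches the formula's value for the new house. Real price data is noisy, so predictions there are only estimates.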

Common Algorithms

Applications

Note: In a regression model, both the features (independent variables) and the target (dependent variable) should typically be numerical.
Categorical features must be converted into a numerical format using encoding techniques such as:
* One-hot encoding: Creates a binary column for each category.
* Label encoding: Assigns a unique integer to each category. The integers imply an order, so this suits ordinal categories best.
If the target is categorical (e.g., labels or classes), a classification model should be used instead.
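
The two encoding techniques can be compared side by side on a small example. This is a sketch with made-up company names; df_demo, one_hot, and label_encoded are hypothetical names, and label encoding is done here with pandas category codes rather than any particular encoder class.

```python
import pandas as pd

# Made-up data with a single categorical column
df_demo = pd.DataFrame({"Company": ["Acme", "Globex", "Acme", "Initech"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df_demo, columns=["Company"], dtype=int)

# Label encoding: one integer per category
# (pandas assigns codes in sorted category order)
label_encoded = df_demo["Company"].astype("category").cat.codes

print(one_hot)
print(label_encoded.tolist())
```

One-hot encoding adds a column per category but implies no order, while label encoding keeps a single column at the cost of suggesting that one category is "greater" than another.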

Random Forest Classifier Algorithm
Random forest is an ensemble learning method used for classification and regression tasks. It constructs multiple decision trees during training. For classification, the classic algorithm outputs the class that is the mode of the classes predicted by the individual trees; scikit-learn's RandomForestClassifier instead averages the trees' predicted class probabilities and picks the most probable class.

Example: Implementing Random Forest in Python
Let’s see how we can use the RandomForestClassifier class from the sklearn (scikit-learn) library.

In this exercise, we will use the data.csv file as the data source for training the machine learning algorithm. You can also download this data and use it for the ML task.

Importing Necessary Libraries

Python

import numpy as np
import pandas as pd

Load the data into a pandas DataFrame.

Python

# Load Data into Pandas DataFrame
# df = pd.read_csv("file_path")
df = pd.read_csv("data.csv")
df

The output of the above code is shown below:


Let’s count the Males and Females in the DataFrame.

Python

df["Gender"].value_counts()  

The output of the above code is shown below:


The source data contains 6 Females and 3 Males. We store the Male rows in a variable named minority and the Female rows in a variable named majority.

Python

minority = df[df['Gender'] == "Male"]
majority = df[df['Gender'] == "Female"]

Next, we resample the data so that the number of Males and Females is equal. Because Gender is the target here, balancing the classes keeps the trained model from being biased toward the majority class.

Python

from sklearn.utils import resample
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
minority_upsampled

The output of the above code is shown below:


Now we concatenate the majority and minority_upsampled data.

Python

df_balanced = pd.concat([majority, minority_upsampled])
df_balanced

The output of the above code is shown below:


Let’s again check the number of Males and Females in the transformed data.

Python

df_balanced['Gender'].value_counts()

The output of the above code is shown below:


Now the data contains 6 Females and 6 Males.
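
The balancing steps above can be collected into one self-contained sketch. It does not use data.csv: the DataFrame is made up to have the same 6-to-3 imbalance, and names such as df_demo, majority_rows, and balanced are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up data with the same 6-to-3 class imbalance as in the exercise
df_demo = pd.DataFrame({
    "Gender": ["Female"] * 6 + ["Male"] * 3,
    "Score": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Find the majority and minority classes from the counts
counts = df_demo["Gender"].value_counts()
majority_label = counts.idxmax()
minority_label = counts.idxmin()

majority_rows = df_demo[df_demo["Gender"] == majority_label]
minority_rows = df_demo[df_demo["Gender"] == minority_label]

# Sample the minority class with replacement until it matches the majority size
minority_up = resample(
    minority_rows, replace=True, n_samples=len(majority_rows), random_state=42
)

# Combine both groups into one balanced DataFrame
balanced = pd.concat([majority_rows, minority_up])
print(balanced["Gender"].value_counts())
```

Upsampling duplicates existing minority rows rather than creating new information, so the balanced data still contains only three distinct Male records.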

The Company column will be used as a feature, so its values must first be converted to numbers. To do so, we can use the pandas get_dummies function, which one-hot encodes the column.

Python

df_encoded = pd.get_dummies(df_balanced, columns=['Company'], dtype=int)
df_encoded

The output of the above code is shown below:


Now we select the feature columns and the target column from the transformed data and store them in the X and y variables.

Python

# Select the features (X) and the target (y) used to train the model.
# The features must be numerical; otherwise, the algorithm will raise an error.

X = df_encoded.drop(columns=['Name', 'Country', 'Gender'])
y = df_encoded['Gender']
X

The output of the above code is shown below:


And in the y variable we have the data as shown below:


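Now it’s time to split the data into training and testing sets. Here we use 60% of the data for training and 40% for testing. Because the minority class was upsampled before the split, copies of the same row can appear in both sets, which can make test scores look better than they are; on real projects, split first and resample only the training set.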

Python

from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

The train_test_split function is from the sklearn.model_selection module and is used to split a dataset into training and testing sets.

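The following are the parameters in the function:
* X: the feature data.
* y: the target labels.
* test_size: the proportion of the data held out for testing (0.4 means 40%).
* random_state: a seed for the shuffling, so that the split is reproducible.

In the output, the function returns:
* X_train, X_test: the training and testing portions of X.
* y_train, y_test: the training and testing portions of y.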
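
To see what the split returns, it can help to check the shapes on a tiny dataset. This is a sketch with made-up arrays; X_demo, y_demo, and the shortened variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features, binary labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Hold out 40% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)

# The same random_state reproduces the same split
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)
same_split = np.array_equal(X_tr, X_tr_again)
```

With 10 rows and test_size=0.4, 6 rows go to training and 4 to testing, and each feature row stays paired with its label.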

The names X_train, X_test, y_train, and y_test are only a convention; we can use any variable names that make sense for our application.
For example, you can write:

Python

features_train, features_test, labels_train, labels_test = train_test_split(X, y, test_size=0.4, random_state=42)

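Here, the variables are renamed to:
* features_train, features_test: in place of X_train and X_test.
* labels_train, labels_test: in place of y_train and y_test.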

Now train the model on the training data using the RandomForestClassifier algorithm.

RandomForestClassifier
We can initialize the algorithm using the syntax below.
Syntax
model = RandomForestClassifier(n_estimators=100, random_state=42)

The parameter n_estimators specifies the number of trees in the forest. Its default value is 100, so the code below, which omits it, also builds 100 trees.
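
The effect of n_estimators can be checked through the fitted model's estimators_ attribute, which holds the individual trees. This is a small sketch on synthetic data from make_classification, not the exercise data; default_forest and small_forest are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only so the forests have something to fit
X_demo, y_demo = make_classification(n_samples=60, n_features=4, random_state=42)

# Default number of trees
default_forest = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Explicitly smaller forest
small_forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

print(len(default_forest.estimators_))
print(len(small_forest.estimators_))
```

More trees generally give more stable predictions but take longer to train, so a small forest can be handy for quick experiments.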

Python

# Train the Random Forest classifier
# Initialize the classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42)

# Fit the model on the training data
rf.fit(X_train, y_train)

The output of the above code is shown below:


Now it’s time to make predictions on the testing data.

Python

# Make predictions on the test set
y_pred = rf.predict(X_test)
y_pred

The output of the above code is shown below:


Now we are evaluating the model.

Python

from sklearn.metrics import classification_report
# Generate the classification report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

The output of the above code is shown below:


Q: What is a Classification Report?
A classification report is a detailed performance summary provided by the scikit-learn library in Python. It summarizes the key metrics used to assess a classification model, helping you understand how well the model performs on each class.

The classification report includes:
* Precision
* Recall
* F1 score
* Support

These metrics are calculated for each class in your dataset and provide a deeper insight than just looking at accuracy, especially for imbalanced datasets.

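Key Metrics Explained
Here's what each metric in the classification report represents. For a given class, TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.

1. Precision: Of all the samples predicted as the class, the fraction that actually belong to it: TP / (TP + FP).

2. Recall (Sensitivity or True Positive Rate): Of all the samples that actually belong to the class, the fraction the model identified: TP / (TP + FN).

3. F1 Score: The harmonic mean of precision and recall: 2 * (Precision * Recall) / (Precision + Recall).

4. Support: The number of actual samples of the class in the test data.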
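
The four metrics can be worked out by hand for one class and compared with scikit-learn's own functions. This is a sketch on made-up labels, with "Male" chosen arbitrarily as the class of interest; y_true and y_hat are hypothetical names.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up true labels and predictions
y_true = ["Male", "Male", "Male", "Female", "Female", "Female"]
y_hat = ["Male", "Male", "Female", "Male", "Female", "Female"]

# Count outcomes for the "Male" class
pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == "Male" and p == "Male")  # true positives
fp = sum(1 for t, p in pairs if t != "Male" and p == "Male")  # false positives
fn = sum(1 for t, p in pairs if t == "Male" and p != "Male")  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support = sum(1 for t in y_true if t == "Male")

# The same metrics from scikit-learn, for comparison
sk_precision = precision_score(y_true, y_hat, pos_label="Male")
sk_recall = recall_score(y_true, y_hat, pos_label="Male")
sk_f1 = f1_score(y_true, y_hat, pos_label="Male")

print(precision, recall, f1, support)
```

classification_report repeats this calculation for every class and prints the results in one table.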

DecisionTreeClassifier Algorithm
Let’s use a DecisionTreeClassifier model to train and test the data.
Example:

Python

from sklearn.tree import DecisionTreeClassifier

# Initialize the model
model = DecisionTreeClassifier()

# Train the Model
model.fit(X_train, y_train)

# Predict class labels for the test set
predictions = model.predict(X_test)
predictions

Let’s check the score of the DecisionTreeClassifier model.

Python

# Return the mean accuracy on the given test data and labels
result = model.score(X_test, y_test)
result 

The output of the above code is shown below:

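For a classifier, score returns the mean accuracy, which is the same number accuracy_score gives for the model's predictions. The sketch below checks this on synthetic data from make_classification rather than the exercise data; tree, score_value, and manual_accuracy are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, split 60/40 as in the exercise
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Accuracy via the model's score method
score_value = tree.score(X_te, y_te)

# Accuracy computed from the predictions directly
manual_accuracy = accuracy_score(y_te, tree.predict(X_te))

print(score_value, manual_accuracy)
```

Accuracy alone can be misleading when classes are imbalanced, which is why the classification report is often read alongside it.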

LogisticRegression Algorithm
Let’s use a LogisticRegression model to train and test the data.
Example:

Python

from sklearn.linear_model import LogisticRegression

# Initialize the model
model = LogisticRegression()

# Training the model
model.fit(X_train, y_train) 

# Make predictions: predict class labels for the samples in X_test
predictions = model.predict(X_test)  
predictions 

The output of the above code is shown below:


Let’s check the score of the LogisticRegression model.

Python

# Return the mean accuracy on the given test data and labels
result = model.score(X_test, y_test)
result 

The output of the above code is shown below:

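Since all three classifiers share the same fit, predict, and score methods, they can be compared in a single loop. This is a sketch on synthetic data from make_classification, so the scores say nothing about the exercise data; models and scores are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, split 60/40 as in the exercise
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

# The three algorithms used in this exercise
models = {
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(),
}

# Train each model and record its mean accuracy on the test set
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)

for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Using the same split for every model keeps the comparison fair, although a single split on a small dataset gives only a rough ranking.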

Machine Learning using ColumnTransformer

Python Syntax

from sklearn.compose import ColumnTransformer

# Create a ColumnTransformer
column_transformer = ColumnTransformer(
    transformers=[
        ('transformer_name1', transformer1, column_list1),
        ('transformer_name2', transformer2, column_list2),
        ...
    ],
    remainder='drop'  # Optional: Specifies what to do with unselected columns
)  

Create a Pipeline in the ML

Python Syntax

from sklearn.pipeline import Pipeline

# Create a pipeline
pipeline = Pipeline(steps=[
    ('step_name1', transformer_or_model1),
    ('step_name2', transformer_or_model2),
    ...
])  

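Explanation: Each step is a (name, object) tuple, and the steps run in the order listed. Every step except the last must be a transformer (an object with fit and transform methods); the last step can be a transformer or a model.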

Pipeline with ColumnTransformer

Python

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Define the preprocessors
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),        # Scale numerical columns
        ('cat', OneHotEncoder(), ['gender', 'city'])         # Encode categorical columns
    ])

# Create a pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),                         # Step 1: Preprocess the data
    ('classifier', RandomForestClassifier())                # Step 2: Train the classifier
]) 
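
The pipeline above can be run end to end on a small DataFrame. This is a sketch under stated assumptions: the people data and the bought target are made up, the demo_ names are hypothetical, and handle_unknown="ignore" and n_estimators=10 are choices made for this example rather than part of the original code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with the column names used in the pipeline above
people = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44, 36],
    "income": [30000, 42000, 80000, 95000, 60000, 35000, 72000, 58000],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "city": ["Paris", "Rome", "Paris", "Oslo", "Rome", "Oslo", "Paris", "Rome"],
})
bought = [0, 0, 1, 1, 1, 0, 1, 1]  # made-up target

demo_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "income"]),
    # Ignore categories at predict time that were not seen during fit
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "city"]),
])

demo_pipeline = Pipeline(steps=[
    ("preprocessor", demo_preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# One call fits the preprocessing and the classifier
demo_pipeline.fit(people, bought)

# Inspect what the classifier actually receives
transformed = demo_pipeline.named_steps["preprocessor"].transform(people)
print(transformed.shape)

demo_predictions = demo_pipeline.predict(people)
print(demo_predictions)
```

The transformed data has 7 columns: 2 scaled numerical columns, 2 one-hot columns for gender, and 3 for city. Keeping the preprocessing inside the pipeline means the same transformations are applied automatically at prediction time.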

Key Methods in Pipelines:

1. fit(X, y) Fits all steps (e.g., preprocessing and model) to the training data.

Python Syntax

pipeline.fit(X_train, y_train)

2. predict(X) Applies the transformations and makes predictions using the model.

Python Syntax

y_pred = pipeline.predict(X_test)

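3. fit_predict(X, y) Fits the pipeline and directly predicts results on the same data. It is available only when the final step implements fit_predict (for example, a clustering estimator such as KMeans); a pipeline that ends in RandomForestClassifier, like the one above, does not provide it.

Python Syntax

# Valid only when the final step implements fit_predict (e.g. KMeans)
y_pred = pipeline.fit_predict(X_train)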
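
The difference can be seen by building two pipelines, one ending in a clustering step and one ending in a classifier. This is a sketch with made-up points; cluster_pipeline, classifier_pipeline, and the choice of KMeans are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up points forming two well-separated groups
points = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])

# A pipeline whose final step implements fit_predict
cluster_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("cluster", KMeans(n_clusters=2, n_init=10, random_state=42)),
])
cluster_labels = cluster_pipeline.fit_predict(points)
print(cluster_labels)

# A pipeline ending in a classifier does not expose fit_predict
classifier_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
has_fit_predict = hasattr(classifier_pipeline, "fit_predict")
print(has_fit_predict)
```

For a classifier pipeline, the usual pattern is to call fit on the training data and then predict on the test data, as in methods 1 and 2.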

4. score(X, y) Evaluates the model's performance (e.g., accuracy for classifiers).

Python Syntax

pipeline.score(X_test, y_test)