Filter rows from a Pandas Dataframe
Now in this exercise, we are using the datasource employees.csv. You can download the datasource and use for the transformation.
Let’s extract a series from a dataframe and compare it. In the output we will see it returns a boolean series.
Example: Load the CSV file and use the dataframe for the further transformation.
Python
import pandas as pd # Read the CSV file mydata=pd.read_csv("employees.csv") # We are comparing Series to a value, it returns a boolean series mydata["Gender"] == "Male" # Basically, the condition is working over all the rows of the series
The output of the above code is shown below:
To select rows from a dataframe based on a conditional expression, use a condition inside the square brackets []. As boolean series has true or false for each row, if it is true, it will return that row and if false, it will not return that row.
Example: Filter dataframe with the Male employees.
Python
# Extract only Male Employees mydata[mydata["Gender"] == "Male"]
Or we can use the variable to specify the condition.
Python
# Here we are specifying the condition output to a variable # Using the variable, we can filter the dataframe male_employees=mydata["Gender"] == "Male" mydata[male_employees]
The output of the above code is shown below:
Similarly, we can specify the condition on other columns also. For example, now we want to filter the employees whose salary is above 40000.
Example: Filter the dataframe based on the Salary column.
Python
# Here we are specifying the condition output to a variable # Using the variable, we can filter the dataframe high_salary_employees=mydata["Salary"] > 40000 mydata[high_salary_employees]
The output of the above code is shown below:
Multiple Conditions When combining multiple conditional statements, each condition must be surrounded by parentheses (). Moreover, we cannot use or/and but need to use the or operator | and the and operator &.
Example: Filter the dataframe on the basis of Salary and Country column.
Python
# Here we are specifying the condition output to a variable # Using the variable, we can filter the dataframe high_salary_employees=mydata["Salary"] > 30000 indian_employees=mydata["Country"] == "India" mydata[high_salary_employees & indian_employees]
Or the above code can be rewritten as the following:
Python
mydata[(mydata["Salary"] > 30000) & (mydata["Country"] == "India")]
The output of the above code is shown below:
Only rows for which the value is True will be selected.
The isin Method The isin Series method accepts a collection object like a list, tuple, or Series. The method returns True for a row if its value is found in the collection.
Syntax pandas.Series.isin(values)
Example: Filter the dataframe on the basis of multiple column values.
Python
# Here we are specifying the condition output to a variable # Using the variable, we can filter the dataframe google_employees=mydata["Company"] == "Google" tcs_employees=mydata["Company"] == "TCS" microsoft_employees=mydata["Company"] == "Microsoft" mydata[google_employees | tcs_employees | microsoft_employees]
Or the above code can be rewritten as the following:
Python
filtered_values = mydata["Company"].isin(["TCS", "Google", "Microsoft"]) mydata[filtered_values]
Or the above code can be rewritten as the following:
Python
mydata[mydata["Company"].isin(["TCS", "Google", "Microsoft"])]
The output of the above code is shown below:
pandas.Series.between The pandas.Series.between function is used to specify the condition range.
Syntax Series.between(left, right)
Example: If we are interested in the employees having salary in between 20000 and 40000. By default, both values are inclusive.
Python
mydata[mydata["Salary"].between(20000, 40000)]
The output of the above code is shown below: