Filter a DataFrame in PySpark

In this exercise, we will learn how to filter DataFrames in PySpark.

The dataframe.filter() method is used to filter rows using a given condition. where() is an alias for filter().

Syntax 1: dataframe.filter(condition)
Syntax 2: dataframe.where(condition)
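For example, here is a minimal sketch (using a small in-memory DataFrame with made-up names and salaries, since data.csv is introduced later) showing that both calls return the same rows:

Python

# Import SparkSession and build a local session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FilterDemo").getOrCreate()

# A tiny DataFrame with illustrative (not real) values
df_demo = spark.createDataFrame(
    [("Alice", 20000), ("Bob", 30000)],
    ["Name", "Salary"],
)

# filter() and where() are equivalent; both keep only the matching rows
df_demo.filter(df_demo["Salary"] > 25000).show()
df_demo.where(df_demo["Salary"] > 25000).show()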

In this exercise, we use the data source data.csv. You can download the file and use it for the transformations below.

Example: First, create the SparkSession.

Python

# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()   

Read the data from the CSV file and show the data after reading.

Python

# Import the Data
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show()  

The output of the above code is shown below:

[Output: the contents of data.csv displayed by df.show()]

Next, we filter the rows whose Salary column value is greater than 25000.

Python

df.filter(df["Salary"] > 25000).show()

The output of the above code is shown below:

[Output: rows where Salary is greater than 25000]

We can also specify multiple conditions in the filter() function by combining them with the logical operators & (and), | (or), and ~ (not).

Python

df.filter((df["Salary"] > 25000) & (df["Company"]=="BDO")).show()

Note: Always wrap each condition in parentheses (), because & and | have higher precedence than comparison operators such as > and ==.

[Output: rows where Salary is greater than 25000 and Company is "BDO"]
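As a further sketch (assuming the same df with the Salary and Company columns), the | and ~ operators are used in the same way:

Python

# Rows where Salary is greater than 25000 OR Company is "BDO"
df.filter((df["Salary"] > 25000) | (df["Company"] == "BDO")).show()

# Rows where Company is NOT "BDO"
df.filter(~(df["Company"] == "BDO")).show()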

between() method in PySpark

We can use the between() method to filter for values in a specified range.

Example:

Python

df = df.filter(df["Salary"].between(30000, 40000))

This keeps only the rows where Salary is between 30000 and 40000 (both bounds inclusive).

[Output: rows where Salary is between 30000 and 40000]
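The between() call above is equivalent to combining two comparisons with &, as in this sketch (assuming the same df):

Python

# Equivalent to df["Salary"].between(30000, 40000); both bounds are inclusive
df.filter((df["Salary"] >= 30000) & (df["Salary"] <= 40000)).show()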