Filter DataFrame in PySpark
In this exercise, we will learn about filtering DataFrames in PySpark.
The dataframe.filter() method is used to filter rows using the given condition. where() is an alias for filter().
Syntax 1: dataframe.filter(conditions)
Syntax 2: dataframe.where(conditions)
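As a quick illustration that the two calls are interchangeable, here is a minimal, self-contained sketch using a small hypothetical DataFrame (the steps below use data.csv instead):
Python
# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("FilterSyntaxSketch").getOrCreate()

# Hypothetical toy DataFrame with a Salary column (not the data.csv used below)
df = spark.createDataFrame([("Alice", 20000), ("Bob", 30000)], ["Name", "Salary"])

# filter() and where() return the same rows for the same condition
df.filter(df["Salary"] > 25000).show()
df.where(df["Salary"] > 25000).show()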
In this exercise, we use the data source data.csv. You can download the data source and use it for the transformations below.
Example: First create the SparkSession.
Python
# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()
Read the data from the CSV file and show the data after reading.
Python
# Import the data
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show()
The output of the above code is shown below:

Here, we filter the rows where the Salary column value is greater than 25000.
Python
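A minimal sketch of this filter, assuming data.csv contains a Salary column as described above:
# Keep only the rows where the Salary column value is greater than 25000
filtered_df = df.filter(df["Salary"] > 25000)

# where() is an alias for filter(), so this is equivalent:
# filtered_df = df.where(df["Salary"] > 25000)

# Show the filtered rows
filtered_df.show()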
The output of the above code is shown below:

Here, we can specify multiple conditions in the filter() function using logical operators such as & (and), | (or), and ~ (not).
Python
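A minimal sketch of combining conditions on the Salary column (the exact conditions are assumptions for illustration):
# Keep rows where Salary is greater than 25000 AND less than 40000
df.filter((df["Salary"] > 25000) & (df["Salary"] < 40000)).show()

# Keep rows where Salary is less than 25000 OR greater than 40000
df.filter((df["Salary"] < 25000) | (df["Salary"] > 40000)).show()

# Keep rows where Salary is NOT greater than 25000
df.filter(~(df["Salary"] > 25000)).show()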
Note: Always wrap each condition in parentheses (), because &, |, and ~ have higher precedence than the comparison operators.

between() method in PySpark
We can use the between() method to filter for values in a specified range.
Example:
Python
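A minimal sketch, assuming the same df and Salary column as above:
# Keep only the rows where Salary is between 30000 and 40000 (inclusive)
df.filter(df["Salary"].between(30000, 40000)).show()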
This keeps only the rows where Salary is between 30000 and 40000 (inclusive).
