Select specific columns from a DataFrame in PySpark

In this exercise, we will learn how to select specific columns from a DataFrame in PySpark.

The DataFrame.select() function selects the specified columns from a DataFrame and returns a new, transformed DataFrame.

Syntax: df.select("column1", "column2")
where column1 and column2 are the names of the columns we want to select from the DataFrame.

Example: First, create a SparkSession.

Python

# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate() 

Read the data from a CSV file and display its contents.

Python

# Import the Data
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show()  

The output of the above code shows the full contents of the DataFrame.

Let’s select the columns “Name” and “Company” from the DataFrame.

Python

df.select("Name", "Company").show()  

The output of the above code now contains only the “Name” and “Company” columns.