Write a DataFrame to CSV in PySpark

In PySpark, we can write a DataFrame to a CSV file using the write.csv() method, i.e. the csv() method of the DataFrameWriter returned by the DataFrame's write attribute.

Syntax:

dataframe.write.csv("path/to/save/csvfile.csv", header=True)

The method has the following commonly used parameters:

- path: the destination path for the output.
- header: whether to write the column names as the first line (default False).
- mode: the behavior when the path already exists, such as 'overwrite', 'append', 'ignore', or 'error' (the default).
- sep: the field separator character (default ',').
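As a minimal sketch of these parameters in use: the keyword-argument form and the option()/mode() chain on the writer are equivalent, and "path/to/save/csvfile.csv" is just a placeholder path.

Python

# Keyword-argument form
dataframe.write.csv("path/to/save/csvfile.csv", header=True, mode="overwrite")

# Equivalent form using the option()/mode() chain on the writer
dataframe.write.option("header", True).mode("overwrite").csv("path/to/save/csvfile.csv")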

In this exercise, we use the data source data.csv. You can download the data source and use it for the transformation.
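If the download is not available, a small stand-in data.csv can be generated locally. This is only a sketch: the Company and Salary columns are assumed because the transformation below uses them, and the Name column and all values are invented for illustration.

Python

# Create a stand-in data.csv; the schema (Name, Company, Salary) and the rows
# are assumptions for illustration, not the real downloadable data set
import csv

rows = [
    ("Alice", "Acme", 5000),
    ("Bob", "Acme", 6000),
    ("Carol", "Globex", 5500),
]

with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Company", "Salary"])
    writer.writerows(rows)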

Example: First, create a SparkSession and read the data from the CSV file.

Python

# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()

# Read the CSV file; header=True uses the first row as column names,
# inferSchema=True infers the column data types
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show() 

The output of the above code is shown below:

[Output: the contents of data.csv displayed as a DataFrame table]
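Because inferSchema=True was passed, you can optionally confirm the inferred column types before transforming the data; with the assumed columns above, Salary would come back as an integer type.

Python

# Print the schema that Spark inferred from the CSV file
df.printSchema()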

Let’s group the data by company, compute the total salary per company, and then save the transformed result.

Python

# Import the aggregate function; aliasing it avoids shadowing Python's built-in sum
from pyspark.sql.functions import sum as spark_sum

# Group by company and compute the total salary per company
groupdf = df.groupBy(df["Company"]).agg(spark_sum("Salary").alias("Total Salary"))
groupdf.sort("Total Salary").show()

The output of the above code is shown below:

[Output: each company listed with its Total Salary, sorted in ascending order]

Finally, write the transformed DataFrame to the path “csvfile.csv”. Note that Spark writes the output as a directory of that name containing one or more part files, not as a single CSV file.

Python

# mode='overwrite' replaces any existing output at the path
groupdf.write.csv("csvfile.csv", header=True, mode='overwrite')

The output of the above code is shown below:

[Output: a csvfile.csv directory containing the part file(s) with the grouped data]
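Since Spark writes a directory of part files rather than a single file, a common follow-up (a sketch, not part of the original exercise) is to coalesce the DataFrame to one partition so only one part file is produced, and to read the result back to verify it; “csvfile_single.csv” is a hypothetical path used for illustration.

Python

# Coalesce to one partition so the output directory holds a single part file;
# "csvfile_single.csv" is a hypothetical output path
groupdf.coalesce(1).write.csv("csvfile_single.csv", header=True, mode="overwrite")

# Read the saved data back to verify the round trip
spark.read.csv("csvfile.csv", header=True, inferSchema=True).show()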