Write the DataFrame to CSV in PySpark
In PySpark, we can write a DataFrame to CSV files using the write.csv() method of the DataFrame's writer.
Syntax: dataframe.write.csv("path/to/save/csvfile.csv", header=True)
The function has the following parameters:
- path: The directory where the CSV files will be saved. Note that Spark creates a directory at this path containing one or more part files, not a single CSV file.
- header=True: Writes the DataFrame's column names as the first row of each CSV file. This is important for preserving the schema when the data is read back or used elsewhere.
- mode: How to handle existing data. Options include:
- "overwrite": Overwrites the existing files.
- "append": Appends to the existing files.
- "ignore": Ignores the write operation if files already exist.
- "error" or "errorifexists": Fails if files already exist (default).
In this exercise, we are using the datasource data.csv. You can download the datasource and use it for the transformation steps below.
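If you don't have the downloaded file at hand, a small sample data.csv can be generated with the standard csv module. The column names and rows below are assumptions chosen to match the Company and Salary columns used later in this exercise, not the original datasource.

```python
import csv

# Hypothetical sample rows matching the columns the exercise relies on
rows = [
    ("Alice", "Google", 120000),
    ("Bob",   "Google", 110000),
    ("Carol", "Meta",   130000),
    ("Dan",   "Meta",    90000),
]

with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Company", "Salary"])  # header row
    writer.writerows(rows)
```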
Example: First create the SparkSession and read the data from the CSV file.
Python
# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()

# Read the data from the CSV file
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show()
[Output: the DataFrame's rows printed by df.show()]
Let's group the data by company, compute the total salary per company, and then save the transformed data.
Python
# Import the aggregate function
from pyspark.sql.functions import sum

# Total salary per company, sorted ascending by the total
groupdf = df.groupBy(df["Company"]).agg(sum("Salary").alias("Total Salary"))
groupdf.sort("Total Salary").show()
[Output: one row per company with its Total Salary, sorted ascending]
Finally, write the DataFrame out as "csvfile.csv". Keep in mind that Spark creates a directory named csvfile.csv containing one or more part files, rather than a single CSV file.
Python
# We have set the mode to 'overwrite' to replace any existing output
groupdf.write.csv("csvfile.csv", header=True, mode='overwrite')
[Output: a directory named csvfile.csv containing the CSV part files]