Create a new column or update an existing column in a DataFrame

In this exercise, we use the data source data.csv. You can download it and use it for the transformations below.

The withColumn() method creates a new column in the DataFrame, or replaces an existing column that has the same name.

Syntax: df.withColumn("newColumnName", columnExpression)

Example: First create the SparkSession.

Python

# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate() 

Read the data from the CSV file and show the data after reading.

Python

# Import the Data
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show() 

The output of the above code is shown below:

[Screenshot: the DataFrame printed by df.show()]

Create a new column named “New Salary” in the DataFrame.

Python

# Create a new column in the dataframe
newdf = df.withColumn("New Salary", df["Salary"] * 2)

# Display the DataFrame with the new column
newdf.show()  

The output of the above code is shown below:

[Screenshot: the DataFrame with the added "New Salary" column]

Using complex expressions: We can use other functions and transformations within withColumn(), such as string operations, aggregations, or conditional logic.

Python

# Import the when function
from pyspark.sql.functions import when

# Add a new column with conditional logic
newdf = df.withColumn("New Salary", when(df["Salary"] > 25000, "High").otherwise("Low"))

# Show the DataFrame
newdf.show()

The output of the above code is shown below:

[Screenshot: the DataFrame with the conditional "New Salary" column of High/Low values]