Create a new column or update an existing column in a DataFrame
In this exercise, we use the data source data.csv. You can download it and use it for the transformations below.
The withColumn() method creates a new column in the DataFrame, or replaces an existing column that has the same name. It returns a new DataFrame and leaves the original unchanged.
Syntax: df.withColumn("NewColumnName", column_expression)
Example: First create the SparkSession.
Python
# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()
Read the data from the CSV file and show the data after reading.
Python
# Import the data
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show()
The output of the above code is shown below:

Create a new column named “New Salary” in the DataFrame.
Python
# Create a new column in the DataFrame
newdf = df.withColumn("New Salary", df["Salary"] * 2)

# Display the DataFrame with the new column
newdf.show()
The output of the above code is shown below:

Using complex expressions: withColumn() accepts any column expression, so we can use other functions and transformations inside it, such as string operations, conditional logic, or aggregations over a window.
Python
# Import only the function we need (a wildcard import would shadow Python built-ins such as sum and max)
from pyspark.sql.functions import when

# Add a new column with conditional logic
newdf = df.withColumn("New Salary", when(df["Salary"] > 25000, "High").otherwise("Low"))

# Show the DataFrame
newdf.show()
The output of the above code is shown below:
