groupBy() and agg() methods in DataFrame in PySpark

The groupBy() method groups the DataFrame by the specified columns so that we can run aggregations on them. groupby() is an alias for groupBy().

Syntax: DataFrame.groupBy(*cols)
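As a quick sketch of the call shape (assuming a hypothetical DataFrame df with "Department" and "Role" columns), groupBy() accepts one or more columns, and groupby() behaves identically:

Python

# Group by a single column and count the rows in each group
df.groupBy("Department").count().show()

# Group by multiple columns at once
df.groupBy("Department", "Role").count().show()

# groupby() is an alias for groupBy() and produces the same result
df.groupby("Department").count().show()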

In this exercise, we are using the data source data.csv. You can download the data source and use it for the transformations.

Example: First, create a SparkSession and read the data from the CSV file.

Python

# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()

# Import the Data
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show the data in the DataFrame
df.show() 

Running this code displays the contents of the DataFrame.


Let’s group the DataFrame by the “Company” column and compute the sum of “Salary” for each company.

Python

# Import the aggregate functions under a namespace to avoid shadowing Python builtins
from pyspark.sql import functions as F

# Group by Company and sum the Salary column within each group
df.groupBy(df["Company"]).agg(F.sum("Salary")).show()

This displays one row per company with the summed salary.


We can also call alias() on an aggregate expression inside agg(); the alias() function names the resulting column.

Python

# Rename the aggregated column with alias()
df.groupBy(df["Company"]).agg(F.sum("Salary").alias("Total Salary")).show()

This displays the same totals, with the aggregate column renamed to “Total Salary”.


In PySpark, the agg() function is used to perform aggregate operations on DataFrame columns. It allows us to compute multiple aggregates at once, such as sum(), avg(), count(), min(), and max(). We can use any combination of aggregate functions inside agg().
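As a minimal sketch of combining aggregates in one call (reusing the Company and Salary columns from the example data; the alias names are illustrative):

Python

# Compute several aggregates over Salary in a single agg() call
df.groupBy("Company").agg(
    F.sum("Salary").alias("Total Salary"),
    F.avg("Salary").alias("Average Salary"),
    F.min("Salary").alias("Min Salary"),
    F.max("Salary").alias("Max Salary"),
    F.count("Salary").alias("Employee Count"),
).show()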