The pandas.DataFrame.groupby Method
To demonstrate the groupby method, we use the data source employees.csv. You can download the file and follow along with the transformations.
Example: Load the employees.csv file.
Python
import pandas as pd
mydata = pd.read_csv("employees.csv")
mydata
The output of the above code is shown below:
Now let’s group the dataframe by the column named “Country”.
Python
grouped_data = mydata.groupby("Country")
grouped_data
The output of the above code is shown below:
The dataframe is now grouped by the values in the Country column, which produces a DataFrameGroupBy object.
Let’s get the length of the DataFrameGroupBy object.
Python
# It returns the number of groups in the object
len(grouped_data)
The output of the above code is shown below:
It returns four, which means the DataFrameGroupBy object contains four groups, one for each of the four unique values in the Country column of the original dataframe.
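Since employees.csv is not reproduced here, the following is a minimal sketch that uses a small hypothetical DataFrame in its place (the names, countries, and salaries are made up, and the real file's columns may differ). It illustrates that len() counts groups rather than rows. The ngroups attribute and the size() method are additional pandas features that the walkthrough above does not use.

```python
import pandas as pd

# Hypothetical stand-in for employees.csv; the real file may look different.
mydata = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dana", "Emil", "Farah"],
    "Country": ["India", "USA", "China", "USA", "Germany", "India"],
    "Salary": [50000, 72000, 61000, 68000, 59000, 54000],
})

grouped_data = mydata.groupby("Country")

# len() on a GroupBy object counts the groups, not the rows.
group_count = len(grouped_data)

# The same number is exposed as ngroups and equals the number of
# unique values in the grouping column.
print(group_count, grouped_data.ngroups, mydata["Country"].nunique())

# size() reports how many rows fall into each group.
group_sizes = grouped_data.size()
print(group_sizes)
```

With this sample data there are six rows but only four groups, because India and USA each appear twice.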
Retrieve a Group with the get_group Method
The get_group method on the DataFrameGroupBy object retrieves a nested DataFrame belonging to a specific group/category.
Syntax pandas.core.groupby.DataFrameGroupBy.get_group(name)
The parameter name specifies the name of the group to get as a DataFrame.
Example: Get all the rows of the group named “India”. Because the original dataframe is grouped by the Country column, this returns all the rows where the country is India.
Python
grouped_data.get_group("India")
The output of the above code is shown below:
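As a rough illustration of what get_group returns, the sketch below runs it on a small hypothetical DataFrame (made-up values standing in for employees.csv). It also compares the result with a plain boolean filter and shows what happens when the requested group does not exist; "Brazil" is simply a value chosen because it is absent from the sample data.

```python
import pandas as pd

# Hypothetical stand-in for employees.csv; the real file may look different.
mydata = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dana", "Emil", "Farah"],
    "Country": ["India", "USA", "China", "USA", "Germany", "India"],
    "Salary": [50000, 72000, 61000, 68000, 59000, 54000],
})
grouped_data = mydata.groupby("Country")

# Rows that belong to the "India" group, with their original index labels.
india_rows = grouped_data.get_group("India")
print(india_rows)

# A boolean filter on the original dataframe selects the same rows.
filtered = mydata[mydata["Country"] == "India"]
same_rows = india_rows.index.equals(filtered.index)

# Asking for a group that does not exist raises KeyError.
try:
    grouped_data.get_group("Brazil")
    missing_group_raised = False
except KeyError:
    missing_group_raised = True
```

The boolean filter is usually enough for a one-off lookup; get_group is convenient when the GroupBy object already exists and several groups need to be inspected.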
Methods on the GroupBy Object
- Use square brackets on the DataFrameGroupBy object to "extract" a column from the original DataFrame.
- The resulting SeriesGroupBy object will have aggregation methods available on it.
- Pandas will perform the calculation on every group within the collection.
- For example, the sum method adds up the Salary values of the rows in each group/category.
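The points above can be sketched on a small hypothetical DataFrame (made-up values, not the contents of employees.csv). The sketch checks the type of the object returned by the square-bracket selection and shows that the aggregated result is a Series indexed by the group keys.

```python
import pandas as pd

# Hypothetical stand-in for employees.csv; the real file may look different.
mydata = pd.DataFrame({
    "Country": ["India", "USA", "China", "USA", "Germany", "India"],
    "Salary": [50000, 72000, 61000, 68000, 59000, 54000],
})
grouped_data = mydata.groupby("Country")

# Selecting one column from a DataFrameGroupBy gives a SeriesGroupBy.
salary_groups = grouped_data["Salary"]
print(type(salary_groups).__name__)

# The aggregation runs once per group; the result is a Series whose
# index holds the group keys (sorted by default).
salary_totals = salary_groups.sum()
print(salary_totals)
```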
Example: Get the SeriesGroupBy object.
Python
# Select the column in square brackets []
grouped_data["Salary"]
The output of the above code is shown below:
Let’s get the sum of the “Salary” column for each group.
Python
# Select the column in square brackets []
# on which we want to apply the aggregation
grouped_data["Salary"].sum()
The output of the above code is shown below:
Let’s get the minimum value in the Salary column of each group.
Python
# Select the column in square brackets []
# on which we want to apply the aggregation
grouped_data["Salary"].min()
The output of the above code is shown below:
Let’s get the maximum value in the Salary column of each group.
Python
# Select the column in square brackets []
# on which we want to apply the aggregation
grouped_data["Salary"].max()
The output of the above code is shown below:
Let’s get the mean of the Salary column of each group.
Python
grouped_data["Salary"].mean()
The output of the above code is shown below:
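The four aggregations above can also be computed in one call. The sketch below uses the agg method, which is not part of the original walkthrough, on a small hypothetical DataFrame (made-up values standing in for employees.csv).

```python
import pandas as pd

# Hypothetical stand-in for employees.csv; the real file may look different.
mydata = pd.DataFrame({
    "Country": ["India", "USA", "China", "USA", "Germany", "India"],
    "Salary": [50000, 72000, 61000, 68000, 59000, 54000],
})
grouped_data = mydata.groupby("Country")

# Passing a list of aggregation names returns a DataFrame with one
# row per group and one column per aggregation.
summary = grouped_data["Salary"].agg(["sum", "min", "max", "mean"])
print(summary)
```

Each column of the result matches what the corresponding single-method call would return, so this is mainly a convenience when several statistics are needed side by side.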
pandas.core.groupby.DataFrameGroupBy.first() This function returns the first non-null entry of each column within each group. When a group has no missing values, this is the same as the group’s first row.
Python
# It returns the first non-null entry of each column, per group
grouped_data.first()
pandas.core.groupby.DataFrameGroupBy.last() This function returns the last non-null entry of each column within each group. When a group has no missing values, this is the same as the group’s last row.
Python
# It returns the last non-null entry of each column, per group
grouped_data.last()
The output of the above code is shown below:
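One detail worth checking is how first() and last() treat missing values: by default they skip nulls column by column, so the result is not always a literal row of the dataframe. The sketch below uses a small hypothetical DataFrame with one missing salary (made-up values, not the contents of employees.csv) and contrasts first() with head(1), a separate GroupBy method that returns the actual first row of each group.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing Salary in India's first row.
mydata = pd.DataFrame({
    "Name": ["Asha", "Ben", "Farah", "Dana"],
    "Country": ["India", "USA", "India", "USA"],
    "Salary": [np.nan, 72000.0, 54000.0, 68000.0],
})
grouped_data = mydata.groupby("Country")

# first() takes the first non-null entry of each column separately,
# so India's Name comes from row 0 and its Salary from row 2.
first_values = grouped_data.first()
print(first_values)

# last() works the same way from the other end of each group.
last_values = grouped_data.last()
print(last_values)

# head(1) keeps the literal first row of each group, nulls included,
# and preserves the original index labels.
first_rows = grouped_data.head(1)
print(first_rows)
```

If the data has no missing values, first() and head(1) describe the same rows and differ only in how the result is indexed.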