PySpark - An Introduction to Apache Spark
PySpark is the Python API for Apache Spark. It enables us to perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.
First, we must install PySpark. To install it with pip, use the following command:
pip install pyspark
In PySpark, the pyspark.sql.functions module is a collection of built-in functions that are essential for data manipulation and transformation when working with Spark DataFrames. By importing these functions, we can perform a wide range of operations on our data, such as filtering, aggregating, modifying, and creating new columns.
Use the following command to import the module.
from pyspark.sql.functions import *
The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder:
Python
# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
OR
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("App Name") \
    .getOrCreate()
OR
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("App Name").getOrCreate()
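With a SparkSession available, the functions imported earlier from pyspark.sql.functions can be used to filter, transform, and aggregate DataFrame columns. The following is a minimal sketch; the sample data and column names are invented purely for illustration:
Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper, avg

spark = SparkSession.builder.appName("App Name").getOrCreate()

# Hypothetical sample data, used only to demonstrate the functions
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Filter rows, create a new column, and aggregate
adults = df.filter(col("age") > 30).withColumn("name_upper", upper(col("name")))
adults.groupBy().agg(avg("age").alias("avg_age")).show()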
Use the coalesce() function
The coalesce() function in PySpark is used to reduce the number of partitions in a DataFrame. It's a key function for optimizing the performance and resource utilization of your distributed data processing.
What is Partitioning?
When we create a DataFrame or read data into Spark, Spark automatically splits the data into partitions. These partitions are distributed across multiple nodes (in a cluster) or cores (on a local machine) for parallel processing. The number of partitions determines how much parallelism Spark can use.
However, in certain scenarios, having too many partitions can cause inefficiency because each partition introduces some overhead. This is where coalesce() comes in.
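As a quick illustration (assuming a SparkSession named spark, as created above), the sketch below checks how many partitions Spark assigned to a DataFrame and then reduces the count with coalesce():
Python
# Assumes an existing SparkSession named `spark`
df = spark.range(0, 1000)               # a simple DataFrame with 1000 rows

print(df.rdd.getNumPartitions())        # partitions Spark chose automatically

df_small = df.coalesce(2)               # reduce to 2 partitions
print(df_small.rdd.getNumPartitions())  # 2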
Example: Let's assume that after some transformations, our DataFrame data has 10 partitions, but we want to write it into a single CSV file. Spark will write each partition as a separate file by default, but we can use coalesce(1) to combine all the data into one partition, so only one file will be written.
Python
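# A minimal sketch of the write step; `data` is the DataFrame from the example
# above and `output_path` is a placeholder for your target directory
output_path = "path/to/output_folder"
data.coalesce(1).write.csv(output_path, header=True, mode='overwrite')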
Explanation:
- coalesce(1): This reduces the number of partitions to 1, meaning all the data will be written into a single file (which can be useful when you want only one output file, such as for CSV).
- .write.csv(): This writes the data to the specified output_path.
- header=True: Includes the column names as the header in the output CSV file.
- mode='overwrite': This overwrites any existing files at the specified output path.
Stop Spark Session
To stop the Spark session, call spark.stop(). This releases the resources used by the Spark application. It's good practice to stop the session when all operations are complete.
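For example, at the end of a script:
Python
# Release the resources held by the Spark application
spark.stop()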