PySpark - An Introduction to Apache Spark
PySpark is the Python API for Apache Spark. It enables us to perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.
First, we must install PySpark. To install it with pip, use the following command:
pip install pyspark
In PySpark, the pyspark.sql.functions module is a collection of built-in functions that are essential for data manipulation and transformation when working with Spark DataFrames. By importing these functions, we can perform a wide range of operations on our data, such as filtering, aggregating, modifying, and creating new columns.
Use the following command to import the module.
from pyspark.sql.functions import *
The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder:
Python
# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
OR
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("App Name") \
    .getOrCreate()
OR
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("App Name").getOrCreate()
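With a SparkSession available, the functions imported earlier from pyspark.sql.functions can be used to filter, transform, and aggregate DataFrame columns. The following is a minimal sketch; the sample data and column names are invented purely for illustration:
Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper, avg

spark = SparkSession.builder.appName("App Name").getOrCreate()

# Hypothetical sample data, used only to demonstrate the functions
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Filter rows, create a new column, and aggregate
adults = df.filter(col("age") > 30).withColumn("name_upper", upper(col("name")))
adults.groupBy().agg(avg("age").alias("avg_age")).show()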
Use the coalesce() function
The coalesce() function in PySpark is used to reduce the number of partitions in a DataFrame. It's a key function for optimizing the performance and resource utilization of your distributed data processing.
What is Partitioning?
When we create a DataFrame or read data into Spark, Spark automatically splits the data into partitions. These partitions are distributed across multiple nodes (in a cluster) or cores (on a local machine) for parallel processing. The number of partitions determines how much parallelism Spark can use.
However, in certain scenarios, having too many partitions can cause inefficiency because each partition introduces some overhead. This is where coalesce() comes in.
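As a quick illustration (assuming a SparkSession named spark, as created above), the sketch below checks how many partitions Spark assigned to a DataFrame and then reduces the count with coalesce():
Python
# Assumes an existing SparkSession named `spark`
df = spark.range(0, 1000)               # a simple DataFrame with 1000 rows

print(df.rdd.getNumPartitions())        # partitions Spark chose automatically

df_small = df.coalesce(2)               # reduce to 2 partitions
print(df_small.rdd.getNumPartitions())  # 2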
Example: Let's assume that after some transformations, our DataFrame data has 10 partitions, but we want to write it into a single CSV file. Spark will write each partition as a separate file by default, but we can use coalesce(1) to combine all the data into one partition, so only one file will be written.
Python
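# A minimal sketch of the write step; `data` is the DataFrame from the example
# above and `output_path` is a placeholder for your target directory
output_path = "path/to/output_folder"
data.coalesce(1).write.csv(output_path, header=True, mode='overwrite')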
Explanation:
- coalesce(1): This reduces the number of partitions to 1, meaning all the data will be written into a single file (which can be useful when you want only one output file, such as for CSV).
- .write.csv(): This writes the data to the specified output_path.
- header=True: Includes the column names as the header in the output CSV file.
- mode='overwrite': This overwrites any existing files at the specified output path.
Stop Spark Session
To stop the Spark session, call spark.stop(). This releases the resources used by the Spark application. It's good practice to stop the session when all operations are complete.
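For example, at the end of a script:
Python
# Release the resources held by the Spark application
spark.stop()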