Create a DataFrame in PySpark
In general, a DataFrame can be defined as a data structure that is tabular in nature, organized into rows and named columns.
Features of DataFrames
- DataFrames are distributed in nature, which makes them fault-tolerant and highly available data structures.
- Lazy evaluation is an evaluation strategy that holds the evaluation of an expression until its value is needed.
- DataFrames are immutable, meaning their state cannot be modified after they are created; transformations instead produce new DataFrames.
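As an illustrative analogy in plain Python (not Spark itself), lazy evaluation can be sketched with a generator: defining the computation does no work until a value is actually requested.

```python
# A lazy "transformation": building the generator performs no work yet
def lazy_double(values):
    for v in values:
        print(f"computing {v}")  # the print reveals when evaluation actually happens
        yield v * 2

gen = lazy_double([1, 2, 3])  # nothing printed yet -- evaluation is deferred
first = next(gen)             # only now is the first element computed
print(first)                  # prints "computing 1" then 2
```

Spark applies the same idea at scale: transformations such as `select` or `filter` only record a plan, and work happens when an action like `show()` or `count()` is called.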
To create a DataFrame, use the following PySpark syntax:
Python
spark.createDataFrame(data, schema)
Example: First, create a SparkSession.
Python
# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()
The following example shows how to create a DataFrame:
Python
data = [
    (15779, "small_business", 1.204, "high"),
    (87675, "large_business", 0.167, "low"),
]
columns = ["Salary", "Business Type", "Score", "Standard"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show DataFrame
df.show()
The output of the above command is shown below:

+------+--------------+-----+--------+
|Salary| Business Type|Score|Standard|
+------+--------------+-----+--------+
| 15779|small_business|1.204|    high|
| 87675|large_business|0.167|     low|
+------+--------------+-----+--------+