Create a DataFrame in PySpark
In general, a DataFrame can be defined as a tabular data structure: a distributed collection of rows organized into named columns, similar to a table in a relational database.
Features of a DataFrame
- DataFrames are distributed in nature: the data is partitioned across the nodes of a cluster, which makes them fault tolerant and highly available.
- Lazy evaluation is an evaluation strategy that delays computing an expression until its value is needed; transformations on a DataFrame only build up an execution plan, which runs when an action is called.
- DataFrames are immutable, meaning their state cannot be modified after creation; transformations return a new DataFrame rather than changing the original.
To create a DataFrame, use the following PySpark syntax:
Python
spark.createDataFrame(data, schema)
Example: First create the SparkSession.
Python
# Import the SparkSession module
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()
The following shows an example of how to create a DataFrame:
Python
data = [
(15779, "small_business", 1.204, "high"),
(87675, "large_business", 0.167, "low")
]
columns = ["Salary", "Business Type", "Score", "Standard"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show DataFrame
df.show()
The output of the above command is shown below:
+------+--------------+-----+--------+
|Salary| Business Type|Score|Standard|
+------+--------------+-----+--------+
| 15779|small_business|1.204|    high|
| 87675|large_business|0.167|     low|
+------+--------------+-----+--------+
