Create a DataFrame in PySpark
In general, a DataFrame can be defined as a data structure that is tabular in nature, organized into rows and named columns.
Features of DataFrames
- DataFrames are distributed in nature, which makes them fault-tolerant and highly available data structures.
- Lazy evaluation is an evaluation strategy that holds the evaluation of an expression until its value is needed.
- DataFrames are immutable, meaning their state cannot be modified after they are created; transformations instead produce new DataFrames.
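As an illustrative analogy in plain Python (not Spark itself), lazy evaluation can be sketched with a generator: defining the computation does no work until a value is actually requested.

```python
# A lazy "transformation": building the generator performs no work yet
def lazy_double(values):
    for v in values:
        print(f"computing {v}")  # the print reveals when evaluation actually happens
        yield v * 2

gen = lazy_double([1, 2, 3])  # nothing printed yet -- evaluation is deferred
first = next(gen)             # only now is the first element computed
print(first)                  # prints "computing 1" then 2
```

Spark applies the same idea at scale: transformations such as `select` or `filter` only record a plan, and work happens when an action like `show()` or `count()` is called.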
To create a DataFrame, use the following PySpark syntax:
Python
spark.createDataFrame(data, schema)
Example: First, create a SparkSession.
Python
# Import the SparkSession module
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("App Name").getOrCreate()
The following example shows how to create a DataFrame:
Python
data = [
    (15779, "small_business", 1.204, "high"),
    (87675, "large_business", 0.167, "low"),
]
columns = ["Salary", "Business Type", "Score", "Standard"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show DataFrame
df.show()
The output of the above command is shown below:

+------+--------------+-----+--------+
|Salary| Business Type|Score|Standard|
+------+--------------+-----+--------+
| 15779|small_business|1.204|    high|
| 87675|large_business|0.167|     low|
+------+--------------+-----+--------+