[Spark By Example] Schema

Spark can infer the structure of the data, but you can also specify it explicitly by providing a schema when you create the DataFrame.


  • You can use “StructType” to define Schema.
  • You can define nested “StructType”s.
  • A “StructType” is a collection of “StructField”s.
  • A “StructField” defines a column:
    • name
    • data type
    • nullable
  • Commonly used “StructField” data types (not an exhaustive list)
    • StringType
    • BooleanType
    • IntegerType
    • BinaryType
    • ArrayType
    • MapType
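
As a rough illustration of how these pieces nest, the sketch below writes a schema out in the JSON form that Spark uses to serialize schemas. Each field is a name, a data type, and a nullable flag, and a nested struct, array, or map is simply a data type that contains further types. The column names come from the applications in this post, except `attributes`, which is a hypothetical column added only to show a MapType. The helpers `field` and `struct` are also hypothetical names, not part of any library. In PySpark, `StructType.fromJson(schema_dict)` should turn a dict like this back into a schema object.

```python
import json


def field(name, data_type, nullable):
    """One StructField in JSON form: name, data type, nullable."""
    return {"name": name, "type": data_type, "nullable": nullable, "metadata": {}}


def struct(fields):
    """A StructType in JSON form: a collection of StructFields."""
    return {"type": "struct", "fields": fields}


schema_dict = struct([
    field("id", "integer", False),
    field("noTax", "boolean", True),
    # a nested StructType used as the data type of a column
    field("manager", struct([
        field("firstname", "string", False),
        field("lastname", "string", False),
    ]), False),
    # ArrayType of strings
    field("products",
          {"type": "array", "elementType": "string", "containsNull": True},
          True),
    # hypothetical column, not in the sample data: shows a MapType
    field("attributes",
          {"type": "map", "keyType": "string", "valueType": "string",
           "valueContainsNull": True},
          True),
])

print(json.dumps(schema_dict, indent=2))
```

Keeping the schema as plain JSON like this can be handy when the same schema has to be shared between applications written in different languages.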

Data File

  • Create a JSON file with the following content.
      [
        {
          "id": 1,
          "noTax": false,
          "manager": { "firstname": "Paul", "lastname": "Henderson" },
          "products": ["Washer","Dryer","Refrigerator"]
        },
        {
          "id": 2,
          "manager": { "firstname": "Grace", "lastname": "Carr" },
          "products": ["Sweater","Jacket"]
        },
        {
          "id": 3,
          "noTax": true,
          "manager": { "firstname": "Julia", "lastname": "Jackson" },
          "products": ["Bread","Coffee","Milk"]
        }
      ]


Python Application
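
Before running the application, it may help to generate the data file and see which schema columns each record leaves out, since that is where the nullable flag matters. The plain-Python sketch below uses only the standard library. The records are the three from the Data File section, the column list is taken from the schema in the application, and the temporary output location and the helper `missing_columns` are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# The three records from the Data File section.
records = [
    {"id": 1, "noTax": False,
     "manager": {"firstname": "Paul", "lastname": "Henderson"},
     "products": ["Washer", "Dryer", "Refrigerator"]},
    {"id": 2,
     "manager": {"firstname": "Grace", "lastname": "Carr"},
     "products": ["Sweater", "Jacket"]},
    {"id": 3, "noTax": True,
     "manager": {"firstname": "Julia", "lastname": "Jackson"},
     "products": ["Bread", "Coffee", "Milk"]},
]

# Top-level columns named in the application's schema.
schema_columns = ["id", "category", "noTax", "manager", "products"]


def missing_columns(record, columns):
    """Return the schema columns that a record does not provide."""
    return [c for c in columns if c not in record]


# Write the records as one multi-line JSON array (temporary location assumed).
data_file = Path(tempfile.mkdtemp()) / "products.json"
data_file.write_text(json.dumps(records, indent=2), encoding="utf-8")

# Read the file back and list the gaps per record id.
loaded = json.loads(data_file.read_text(encoding="utf-8"))
gaps = {r["id"]: missing_columns(r, schema_columns) for r in loaded}
print(gaps)
```

With the data as shown, record 2 has no `noTax` value, and none of the records carries a `category`, so those columns come back as null when the file is read with the schema.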

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

if __name__ == "__main__":
  # create a session with the builder
  # assumption: no extra builder options (app name, master) are set here
  spark = (SparkSession
    .builder
    .getOrCreate())
  data_file = "test/data/products.json"

  # schema
  custom_schema = StructType([
        StructField('id', IntegerType(), False),
        StructField('category', StringType(), False),
        StructField('noTax', BooleanType(), True),
        StructField('manager', StructType([
            StructField('firstname', StringType(), False),
            StructField('lastname', StringType(), False)
        ]), False),
        StructField('products', ArrayType(StringType()), True)])

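  # all data
  df = (spark.read
    .option('multiline', True) # important: the JSON records span multiple lines
    .schema(custom_schema)
    .json(data_file))
  df.show() # display step assumed

  # products with no tax
  # assumption: "no tax" means noTax is true; rows with a null noTax are not matched
  df_no_tax = df.where(col('noTax') == True)
  df_no_tax.show() # display step assumed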
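
One detail worth spelling out is how a filter on a nullable column treats missing values. In Spark SQL a comparison against null evaluates to null, and `where` keeps only rows whose condition is true, so the record with id 2, which has no noTax value, drops out of a noTax filter whichever way the comparison is written. The plain-Python sketch below imitates that behaviour on the sample records. It assumes the filter above keeps rows where noTax is true, and `where_true` and `equals` are hypothetical helpers, not Spark functions.

```python
# Sample records reduced to the two fields the filter needs.
rows = [
    {"id": 1, "noTax": False},
    {"id": 2, "noTax": None},  # noTax is missing in the data file, so it reads as null
    {"id": 3, "noTax": True},
]


def equals(column, value):
    """Null-aware equality: comparing a null column yields None (unknown)."""
    def predicate(row):
        return None if row[column] is None else row[column] == value
    return predicate


def where_true(rows, predicate):
    """Keep rows whose predicate is exactly True; None and False are dropped."""
    return [r for r in rows if predicate(r) is True]


no_tax_ids = [r["id"] for r in where_true(rows, equals("noTax", True))]
taxed_ids = [r["id"] for r in where_true(rows, equals("noTax", False))]
print(no_tax_ids, taxed_ids)
```

To include rows with an unknown noTax in either result, the condition would have to test for null explicitly, for example with `isNull()` on the column in PySpark.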


C# Application

using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;
using Microsoft.Spark.Sql.Types;

namespace MySpark.Examples.Basics
{
    internal class DataFrameSchema
    {
        public static void Run()
        {
            // assumption: no extra builder options (app name, master) are set here
            SparkSession spark =
                SparkSession
                    .Builder()
                    .GetOrCreate();

            string filePath = "data/products.json";

            // schema
            StructField[] fields = {
                new StructField("id", new IntegerType(), false),
                new StructField("category", new StringType(), false),
                new StructField("noTax", new BooleanType(), true),
                new StructField("manager", new StructType( new StructField[]{
                    new StructField("firstname", new StringType(), false),
                    new StructField("lastname", new StringType(), false)
                    }), false),
                new StructField("products", new ArrayType(new StringType()), true)
            };
            StructType schema = new StructType(fields);

            // all data
            DataFrame df = spark.Read()
                .Option("multiline", true)
                .Schema(schema)
                .Json(filePath);
            df.Show(); // display step assumed

            // products with no tax
            // assumption: "no tax" means noTax is true; rows with a null noTax are not matched
            DataFrame dfNoTax = df
                .Where(Col("noTax").EqualTo(true));
            dfNoTax.Show(); // display step assumed
        }
    }
}

