
I want to create an empty DataFrame with a specified schema in Scala. I have tried reading an empty JSON file, but I don't think that's the best approach.

1 Answer

Let's assume you want a DataFrame with the following schema:

root
 |-- k: string (nullable = true)
 |-- v: integer (nullable = false)

Simply define the schema and pass an empty RDD[Row] to createDataFrame:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row

// Two columns: a nullable string "k" and a non-nullable integer "v"
val schema = StructType(
    StructField("k", StringType, true) ::
    StructField("v", IntegerType, false) :: Nil)

// Spark < 2.0
// sqlContext.createDataFrame(sc.emptyRDD[Row], schema)

spark.createDataFrame(sc.emptyRDD[Row], schema)

The PySpark equivalent is almost identical:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("k", StringType(), True),
    StructField("v", IntegerType(), False)
])

# Spark < 2.0
# sqlContext.createDataFrame([], schema)

df = spark.createDataFrame([], schema)

# or: df = sc.parallelize([]).toDF(schema)
