Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I have a Cassandra table that for simplicity looks something like:

key: text
jsonData: text
blobData: blob


I can create a basic data frame for this using spark and the spark-cassandra-connector using:

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "mytable", "keyspace" -> "ks1"))
  .load()


I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?

1 Answer

0 votes
by (32.3k points)

Answering your question "Is this currently possible"?

As far as I am concerned, it is not directly possible. I would suggest you to try something similar to this:

val df = sc.parallelize(Seq(

  ("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),

  ("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")

)).toDF("key", "jsonData", "blobData")

I assume that in JSON, blob field cannot be represented . Otherwise you can omit splitting and joining:

import org.apache.spark.sql.Row

val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")

Then, follow the below steps:

 

image

 

For Spark >= 2.4

If needed, schema can be determined using schema_of_json function (this assumes that an arbitrary row is a valid representative of the schema).

import org.apache.spark.sql.functions.{lit, schema_of_json}

val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))

df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]())) 

...