XML processing in Spark

Question

asked Aug 20, 2019 in Big Data Hadoop & Spark by ParasSharma1 (19k points)

Scenario: My input will be multiple small XML files, and I am supposed to read these XMLs as RDDs, join them with another dataset to form a new RDD, and send the output as XML.

Is it possible to read XML using Spark and load the data as an RDD? If so, how will the XML be read?

Sample XML:

<root>
<users>
<user>
<account>1234</account>
<name>name_1</name>
<number>34233</number>
</user>
<user>
<account>58789</account>
<name>name_2</name>
<number>54697</number>
</user>
</users>
</root>

How will this be loaded into the RDD?

1 Answer

Anurag · Answer 1 · 2019-08-20T12:35:24+0000

You can process XML in Spark, and depending on the approach you may need additional dependencies.

If your files are small, the simplest solution is to load them with SparkContext.wholeTextFiles. It returns an RDD of pairs, where the first element is the file path and the second is the file content.
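A minimal sketch of the wholeTextFiles approach, assuming well-formed input (closing tags written as </user>) and that the scala-xml module is on the classpath. The input path, the object name and the User case class are made up for illustration; the field names follow the sample XML in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.xml.XML

object WholeFilesXml {
  // Hypothetical record type mirroring the <user> element of the sample XML.
  case class User(account: String, name: String, number: String)

  /** Parses one whole XML document (passed as a string) into User records. */
  def parseUsers(content: String): Seq[User] =
    (XML.loadString(content) \\ "user").map { u =>
      User((u \ "account").text.trim, (u \ "name").text.trim, (u \ "number").text.trim)
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("whole-files-xml").setMaster("local[*]"))

    // "input/*.xml" is a hypothetical path; each element is a (path, content) pair.
    val users = sc.wholeTextFiles("input/*.xml")
      .flatMap { case (_, content) => parseUsers(content) }

    // Key by account so the result can be joined with another pair RDD.
    val byAccount = users.keyBy(_.account)
    byAccount.collect().foreach(println)

    sc.stop()
  }
}
```

Each file is parsed in one piece on a single executor, so this only suits files that fit comfortably in memory.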

But for larger files, you can use Hadoop input formats.

If the structure is simple, you can split records by setting textinputformat.record.delimiter in the Hadoop configuration. The input in the referenced example is not XML, but it should give you an idea of how to proceed.
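As a rough sketch of the delimiter idea applied to the sample XML: the closing </user> tag is used as the record delimiter, and each chunk is then trimmed back to a full record. The file path, object name and toRecord helper are hypothetical, and this assumes <user> elements are not nested.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object DelimiterSplit {
  val StartTag = "<user>"
  val EndTag = "</user>"

  /** Turns one delimiter-separated chunk back into a full record, if it holds one.
    * The delimiter itself is consumed by the reader, so it is appended again. */
  def toRecord(chunk: String): Option[String] = {
    val start = chunk.indexOf(StartTag)
    if (start >= 0) Some(chunk.substring(start).trim + EndTag) else None
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("delimiter-split").setMaster("local[*]"))

    // Copy the Hadoop configuration so the setting stays local to this read.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", EndTag)

    // "users.xml" is a hypothetical path.
    val records = sc
      .newAPIHadoopFile("users.xml", classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text], conf)
      .map { case (_, text) => text.toString } // copy out of the reused Text object
      .flatMap(chunk => toRecord(chunk).toList)

    records.collect().foreach(println)
    sc.stop()
  }
}
```

The trailing chunk that only holds </users> and </root> contains no start tag, so it is dropped.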

Apache Mahout also provides an XmlInputFormat that can be used for this.

Finally, it is possible to read the file using SparkContext.textFile and afterwards fix up the records that span partition boundaries.

To do that, run a first mapPartitionsWithIndex pass to identify the records that are broken between partitions, and collect those fragments.

Then use a second mapPartitionsWithIndex pass to repair the broken records.
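One way the two passes could look, as a sketch rather than a definitive implementation. It assumes each tag sits on its own line (as in the sample XML) and that <user> records are not nested. The object, the helper names splitPartition and stitch, and the file path are all made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

object RepairBrokenRecords {
  val StartTag = "<user>"
  val EndTag = "</user>"

  /** Splits a partition's lines into (head, complete records, tail).
    * head: lines before the first start tag (possibly the end of an earlier record);
    * tail: lines of a record that starts here but does not end here. */
  def splitPartition(lines: Seq[String]): (Seq[String], Seq[String], Seq[String]) = {
    val (head, rest) = lines.span(!_.contains(StartTag))
    val complete = ArrayBuffer.empty[String]
    var open = Vector.empty[String]
    rest.foreach { line =>
      if (line.contains(StartTag)) open = Vector(line)
      else if (open.nonEmpty) open :+= line
      if (open.nonEmpty && line.contains(EndTag)) {
        complete += open.mkString("\n")
        open = Vector.empty
      }
    }
    (head, complete.toVector, open)
  }

  /** Joins each unfinished tail with the following head(s).
    * Returns partition index -> repaired record that ends in that partition. */
  def stitch(fragments: Seq[(Int, Seq[String], Seq[String])]): Map[Int, String] = {
    var carry = Vector.empty[String]
    val repaired = Map.newBuilder[Int, String]
    fragments.sortBy(_._1).foreach { case (idx, head, tail) =>
      if (carry.nonEmpty) {
        val end = head.indexWhere(_.contains(EndTag))
        if (end >= 0) {
          repaired += (idx -> (carry ++ head.take(end + 1)).mkString("\n"))
          carry = Vector.empty
        } else carry ++= head // record runs through this whole partition
      }
      if (tail.nonEmpty) carry = tail.toVector
    }
    repaired.result()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("repair-xml").setMaster("local[*]"))
    val lines = sc.textFile("users.xml").cache() // hypothetical path

    // Pass 1: collect only the broken fragments at partition edges.
    val fragments = lines.mapPartitionsWithIndex { (idx, it) =>
      val (head, _, tail) = splitPartition(it.toVector)
      Iterator((idx, head, tail))
    }.collect().toSeq
    val repaired = sc.broadcast(stitch(fragments))

    // Pass 2: emit the repaired record (if any) plus the complete ones.
    val records = lines.mapPartitionsWithIndex { (idx, it) =>
      val (_, complete, _) = splitPartition(it.toVector)
      (repaired.value.get(idx).toSeq ++ complete).iterator
    }
    records.collect().foreach(println)
    sc.stop()
  }
}
```

This materializes each partition in memory, which keeps the sketch short; a streaming version would be needed for very large partitions.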

There is also a relatively new spark-xml package, which lets you extract specific records by tag:

// Requires the spark-xml package on the classpath.
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "foo") // each <foo> element becomes one row
  .load("bar.xml")
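Applied to the scenario in the question, the whole round trip (read XML, join, write XML) might look roughly like this. It assumes the spark-xml package is on the classpath and that the other dataset is a CSV file with an account column. The paths, the object name and the joinOnAccount helper are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkXmlRoundTrip {
  /** Inner-joins the parsed users with another dataset on the shared "account" column. */
  def joinOnAccount(users: DataFrame, other: DataFrame): DataFrame =
    users.join(other, Seq("account"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-xml-round-trip")
      .master("local[*]")
      .getOrCreate()

    // Each <user> element of the sample XML becomes one row.
    val users = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "user")
      .load("input/users.xml") // hypothetical path

    // Hypothetical second dataset; inferSchema keeps "account" numeric on both sides.
    val other = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("accounts.csv")

    val joined = joinOnAccount(users, other)

    // Write the result back out as XML: <users> wraps one <user> per row.
    joined.write
      .format("com.databricks.spark.xml")
      .option("rootTag", "users")
      .option("rowTag", "user")
      .save("output/users_xml") // hypothetical output directory

    // If an RDD is required, it is available from the DataFrame.
    val asRdd = joined.rdd
    println(asRdd.count())

    spark.stop()
  }
}
```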

Hope this answer helps you!
