You can process XML in Spark, but the right approach, and the dependencies you need for it, depend on the size and structure of the input.
If your files are small, the simplest solution is to load the data using SparkContext.wholeTextFiles. It returns an RDD of pairs, where the first element is the file path and the second is the whole file content.
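As a rough sketch of the wholeTextFiles approach: the helper below parses each file in memory and pulls out the text of one element. The name titlesPerFile and the tag "title" are made up for illustration, and it assumes the scala-xml module is on the classpath.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.xml.XML

// Hypothetical helper: returns (path, text of every <title> element) per file.
// "title" is a placeholder tag, not something from the original question.
def titlesPerFile(sc: SparkContext, dir: String): RDD[(String, Seq[String])] =
  sc.wholeTextFiles(dir).map { case (path, content) =>
    // Each file is parsed as a whole, so it must fit in executor memory.
    val doc = XML.loadString(content)
    (path, (doc \\ "title").map(_.text))
  }
```

Because every file becomes a single record, this only makes sense when individual files are small.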
But for larger files, you can use Hadoop input formats.
If the structure is simple, you can split records by setting textinputformat.record.delimiter. The example input for this is not XML, but it should give you an idea of how to proceed.
Mahout also provides XmlInputFormat.
Finally, it is possible to read the file using SparkContext.textFile and fix up afterwards the records that span partition boundaries.
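A minimal sketch of the delimiter approach, assuming records are not nested and every record ends with the same closing tag. The function name and the startTag/endTag parameters are illustrative only.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: splits the input on the closing tag of a record.
def readDelimited(sc: SparkContext, path: String,
                  startTag: String, endTag: String): RDD[String] = {
  // Copy the configuration so the delimiter does not leak into other reads.
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("textinputformat.record.delimiter", endTag)

  sc.newAPIHadoopFile(path, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    .map { case (_, text) => text.toString } // Text objects are reused, so copy out
    .flatMap { chunk =>
      // Drop anything before the record (XML header, root element) and
      // skip trailing chunks that contain no record at all.
      val i = chunk.indexOf(startTag)
      if (i >= 0) Iterator(chunk.substring(i) + endTag) // delimiter is stripped, so restore it
      else Iterator.empty
    }
}
```

This treats the file as plain text, so it will break on nested records with the same tag or on start tags that carry attributes.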
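A sketch of how XmlInputFormat might be wired in. It assumes the class lives at org.apache.mahout.text.wikipedia.XmlInputFormat and reads its tags from the xmlinput.start and xmlinput.end configuration keys; check both against the Mahout version you actually use. The function name is illustrative.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat // assumed package, may differ by version
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: one RDD element per startTag ... endTag block.
def readWithXmlInputFormat(sc: SparkContext, path: String,
                           startTag: String, endTag: String): RDD[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("xmlinput.start", startTag) // assumed key name
  conf.set("xmlinput.end", endTag)     // assumed key name

  sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    .map { case (_, text) => text.toString } // copy out of the reused Text object
}
```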
Use mapPartitionsWithIndex once to identify the records that are broken across partition boundaries and collect the broken pieces.
Then use a second mapPartitionsWithIndex pass to repair those records.
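The two passes could look roughly like this. It is only a sketch under strong assumptions: records are not nested, start tags carry no attributes, and no line holds the end of one record followed by the start of another. All names (readRecords, startTag, endTag, minPartitions) are illustrative.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

def readRecords(sc: SparkContext, path: String, startTag: String,
                endTag: String, minPartitions: Int = 2): RDD[String] = {
  val lines = sc.textFile(path, minPartitions)
  val numParts = lines.partitions.length

  // Pass 1: the lines before a partition's first start tag may be the
  // tail of a record that was opened in an earlier partition.
  val heads: Map[Int, List[String]] = lines
    .mapPartitionsWithIndex { (idx, it) =>
      Iterator((idx, it.takeWhile(!_.contains(startTag)).toList))
    }
    .collect()
    .toMap
  val headsBc = sc.broadcast(heads)

  // Pass 2: emit complete records; finish a record left open at the end
  // of a partition with the head lines of the following partitions.
  lines.mapPartitionsWithIndex { (idx, it) =>
    val out = ArrayBuffer.empty[String]
    val cur = ArrayBuffer.empty[String]
    var open = false

    def feed(line: String): Unit = {
      if (!open && line.contains(startTag)) open = true
      if (open) {
        cur += line
        if (line.contains(endTag)) {
          out += cur.mkString("\n")
          cur.clear()
          open = false
        }
      }
    }

    // Lines seen while no record is open are skipped; they are either
    // preamble or belong to a record repaired by an earlier partition.
    it.foreach(feed)

    var next = idx + 1
    while (open && next < numParts) {
      val h = headsBc.value(next).iterator
      while (open && h.hasNext) feed(h.next())
      next += 1
    }
    out.iterator
  }
}
```

The collected heads are held on the driver and broadcast, so this is only reasonable when records are small relative to a partition.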
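There is also a relatively new spark-xml package, which allows you to extract specific records by tag:
val df = sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "foo").load("bar.xml")  // "foo" and "bar.xml" are placeholders for your tag and path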
Hope this answer helps you!