0 votes
in Big Data Hadoop & Spark by (11.4k points)

Quoting the Spark DataFrames, Datasets and SQL manual:

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.

Being new to Spark, I'm a bit baffled by this for two reasons:

  1. Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?

  2. Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z" type to take forever instead of returning immediately.
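To make the concern in point 2 concrete, here is a toy illustration (plain Python, not Spark) of the asymptotic difference between answering "SELECT x WHERE y=z" with a full scan versus with a precomputed hash index. The table shape and column names are made up for the example:

```python
# Toy table: 100,000 rows with columns x and y (hypothetical data)
rows = [{"x": i * 2, "y": i % 1000} for i in range(100_000)]

def full_scan(z):
    # O(n): touches every row, like an unindexed table scan
    return [r["x"] for r in rows if r["y"] == z]

# Build a hash index on column y once, up front
index = {}
for r in rows:
    index.setdefault(r["y"], []).append(r["x"])

def indexed_lookup(z):
    # O(1) on average: jumps straight to the matching rows
    return index.get(z, [])

assert full_scan(42) == indexed_lookup(42)
```

Both return the same answer, but the scan cost grows with the table size while the index lookup does not, which is exactly why the absence of indexes hurts at billions of rows.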

I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.

I'm wondering, then, why the Spark SQL team considers indexes unimportant enough to leave them off the road map entirely.


1 Answer

0 votes
by (32.3k points)

Spark SQL is mostly about distributed in-memory computation at massive scale.

Spark SQL provides Spark programmers with the benefits of relational processing (e.g., declarative queries and optimized storage), and allows SQL users to call complex analytics libraries in Spark (e.g., machine learning).

Now, talking about indexes:

An index exists to let you move the disk head to the precise location of the record you want and read just that record, instead of scanning everything around it.

Indexing input data:

  • The reason indexing over external data sources is not supported is that Spark is not a data management system but a batch data-processing engine. The data Spark operates on comes from different sources and is not owned by Spark, so Spark cannot reliably monitor changes to it and, as a consequence, cannot maintain indexes.

  • If the data source that feeds Spark supports indexing, Spark can take advantage of it indirectly through mechanisms like predicate pushdown.
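The pushdown idea can be sketched in a few lines of plain Python (this is a conceptual illustration, not Spark's actual DataSource API): the engine hands the filter down to the source, and an index-capable source uses its own index so that only matching rows ever cross the boundary.

```python
class IndexedSource:
    """Stand-in for a source (e.g. a JDBC database) that keeps its own index."""

    def __init__(self, rows):
        # Hypothetical single-column index on "y"
        self.index = {}
        for row in rows:
            self.index.setdefault(row["y"], []).append(row)

    def scan(self, pushed_filter=None):
        if pushed_filter is not None:
            _col, value = pushed_filter          # e.g. ("y", 3)
            return self.index.get(value, [])     # index-backed lookup at the source
        # No pushdown: the engine gets every row and must filter itself
        return [r for group in self.index.values() for r in group]

source = IndexedSource([{"x": i, "y": i % 5} for i in range(100)])

# With pushdown the source returns only the 20 matching rows,
# instead of shipping all 100 rows to the engine.
matching = source.scan(pushed_filter=("y", 3))
assert len(matching) == 20
```

This is why Spark can still benefit from an indexed source even though it maintains no indexes of its own.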

Indexing Distributed Data Structures:

  • Standard indexing techniques require a persistent, well-defined data distribution, but data in Spark is ephemeral and its exact distribution is nondeterministic.

  • A high-level data layout achieved through proper partitioning, combined with columnar storage and compression, provides efficient distributed access without the overhead of creating, storing, and maintaining indexes. This pattern is common among in-memory columnar systems.
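The partitioning point can be sketched as follows (plain Python, with a made-up hash-partitioned layout): if the data is partitioned on the query key, a lookup only scans the one partition that can contain the value, with no per-row index involved.

```python
from collections import defaultdict

NUM_PARTITIONS = 4

def partition_by(rows, key):
    # Hash-partition rows on the given column (toy stand-in for a
    # partitioned layout in a distributed store)
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % NUM_PARTITIONS].append(row)
    return parts

rows = [{"x": i, "y": i % 10} for i in range(1000)]
parts = partition_by(rows, "y")

def query(parts, value):
    # Partition pruning: only the partition that can hold `value`
    # is scanned; the other partitions are skipped entirely
    pid = hash(value) % NUM_PARTITIONS
    return [r for r in parts[pid] if r["y"] == value]

result = query(parts, 7)
assert len(result) == 100
```

Pruning narrows the scan by a constant factor (the number of partitions) rather than giving O(1) lookups like a true index, but it requires no index maintenance at all, which is the trade-off the answer describes.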

That being said, some forms of indexed structures do exist in the Spark ecosystem.

Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random-access support.

And of course this raises a question: if a use case requires efficient random access, why not use a system designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark is an evolving project, and its future may hold anything.
