Explore Courses Blog Tutorials Interview Questions
0 votes
in Big Data Hadoop & Spark by (11.4k points)

Lots of "BAW"s (big ass-websites) are using data storage and retrieval techniques that rely on huge tables with indexes, and using queries that won't/can't use JOINs in their queries (BigTable, HQL, etc) to deal with scalability and sharding databases. How does that work when you have lots and lots of data that is very related?

What if you have to make several queries to several different tables to get information to compile? Isn't hitting the database that many times starting to get more expensive than just using joins in the first place? I guess it depends on how much data you've got?

1 Answer

0 votes
by (32.3k points)
edited by

A relational database is a general-purpose tool that organizes your data and makes it easily accessible. For fast modern computers, and RDBMS' that are well-optimized, you can scale your data to quite a respectable size on a single box. RDBMS gives you flexible access to your data, and it also comes with powerful correctness constraints that make the coding part easy against a large amount of data.

Now take a scenario where you are going to have to scale beyond the size of a single DB server. Here, you will need to start identifying bottlenecks and remove them. The RDBMS will look like a harder approach now, which will require a lot of fixing. The more interconnected your data will be, the more work you'll have to do, but maybe you won't have to completely solve the whole thing. Maybe there are just some problem tables that can be moved to a more scalable data store, your user profile might be very cache-friendly and you can just migrate the load to a giant memory cached cluster.

But when the scalable key-value stores like BigTable come into the picture, none of the above possibilities work, and you have so much data of a single type that even when it's denormalized a single table is more than enough for one server. At this point, you need to be able to partition it and still have a clean API to access it. Naturally, when the data is spread out across so many machines you can't have algorithms that require these machines to talk to each other much, which would be required by many of the standard relational algorithms. As you suggest, these distributed querying algorithms have the potential to require more total processing power than the equivalent JOIN in a properly indexed relational database, but because they are parallelized the real-time performance is orders of magnitude better than any single machine could do (assuming a machine that could hold the entire index even exists).

You can scale your massive data set horizontally by just plugging in more servers. But ongoing operations and development at this scale are a lot harder than the single-server app. The point is that as long as the application servers will be getting the data they need( from time to time, the scaling of these servers will remain to be a trivial task.

To answer your question about how commonly used ORMs handle the inability to use JOINs, the short answer is they don't. ORM stands for Object Relational Mapping, and the work of  ORM is to convert data between incompatible type systems using object-oriented programming languages. Most of the value of what they give you is simply not going to be possible from a key-value store. In practice, you will probably need to build up and maintain your own data-access layer that's suited to your particular needs because data profiles are going to be very different at these scales and I think there are many compromises that a general-purpose tool has to do in order to emerge and become dominant as RDBMS. In short, you'll always have to do more legwork at this scale.

If you are a beginner in Hadoop, refer the following video tutorial that will teach you Hadoop from scratch:

Browse Categories