Let’s discuss how Uber dealt with the 100 petabytes of analytical data that accumulated in its systems as its demand for insights grew over time.
Identification of Big Data at Uber
Before Uber recognized that it had a big data problem, its data was stored in tables in traditional database systems such as MySQL and PostgreSQL. In 2014, the company’s total data size was only a few terabytes, so data could be accessed quickly, typically in less than a minute.
Here is what Uber’s data storage architecture looked like in 2014:
As the business grew rapidly, the volume of data grew exponentially. This led Uber to build an analytical data warehouse that held all the data in one place and made it easily accessible to analysts. As part of this effort, data users were categorized into three main groups:
- City Operations Team: On-ground crews responsible for managing and scaling Uber’s transport system
- Data Scientists and Analysts: Analysts and data scientists who need data to help deliver a good transportation service
- Engineering Team: Engineers focused on building automated data applications
Uber chose Vertica as its data warehouse software because it was fast, scalable, and column-oriented. Multiple ad-hoc ETL jobs were created to copy data from different sources into Vertica. To give users access, Uber set up an online query service that accepted users’ SQL queries and ran them against Vertica.
The launch of the Vertica-based warehouse was a huge success for Uber. Users had a global view of all the data they needed in one place. Within just a few months, however, the data was again growing exponentially as the number of users increased.
Since SQL was the interface, the City Operations team found it easy to work with whatever data they needed without any knowledge of the underlying technologies. The Engineering team, meanwhile, began building services and products around user needs identified through analysis of the data.
Although everything was going well and Uber was attracting more customers and profit, there were still a few limitations:
- The data warehouse became too expensive to use because it had to hold more and more data. To free up space for new data, older and obsolete data had to be deleted.
- Uber’s Big Data platform wasn’t horizontally scalable. Its primary goal had been to serve the critical business need for centralized data access.
- Uber’s data warehouse was being used like a data lake, where all the data piled up. Multiple copies of the same data even existed, which increased storage costs.
- Data quality was also a problem. Backfilling was laborious and time-consuming, and the ad-hoc ETL jobs were source-dependent. Data projections and transformations were performed at ingestion time and, because there were no standardized ingestion jobs, it was difficult to ingest new datasets and data types.
Introduction of Apache Hadoop in Uber’s System
To address these problems, Uber decided to re-architect its Big Data platform on top of Hadoop. It built an Apache Hadoop data lake and ingested all the raw data from its various online data stores into the lake once, with no transformation during ingestion. This design reduced the load on the online data stores and let Uber move from ad-hoc ingestion jobs to a scalable ingestion platform.
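The idea of ingesting raw data once and modeling it later can be sketched in a few lines of Python. This is only an illustration, not Uber’s implementation: the sample records, the in-memory list standing in for the data lake, and the function names `ingest_raw` and `model_fares_by_city` are all hypothetical.

```python
import json

# Hypothetical raw records, as they might arrive from an online data store.
RAW_EVENTS = [
    '{"trip_id": 1, "city": "SF", "fare": "12.50"}',
    '{"trip_id": 2, "city": "NYC", "fare": "30.00"}',
    '{"trip_id": 3, "city": "SF", "fare": "8.25"}',
]


def ingest_raw(events, lake):
    """Ingestion step: append records to the lake untouched (no parsing, no projection)."""
    lake.extend(events)
    return lake


def model_fares_by_city(lake):
    """Modeling step: runs later, on the lake, and can be re-run from the raw data."""
    totals = {}
    for line in lake:
        record = json.loads(line)
        totals[record["city"]] = totals.get(record["city"], 0.0) + float(record["fare"])
    return totals


# A plain list stands in for the data lake in this sketch.
lake = ingest_raw(RAW_EVENTS, [])
fares_by_city = model_fares_by_city(lake)
print(fares_by_city)
```

Because the lake keeps the untransformed records, a modeled table such as `fares_by_city` can be rebuilt at any time without going back to the source systems.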
Then, Uber introduced a series of technologies, such as Presto, Apache Spark, and Apache Hive, to enable interactive user queries, give users access to the data, and serve even larger queries. Together, these made Uber’s Big Data platform more flexible.
To keep the platform scalable, all data modeling and transformation was performed only in Apache Hadoop. This enabled quick data recovery whenever issues arose.
Another thing that really helped Uber was ensuring that only modeled tables were transferred to its data warehouse. This, in turn, reduced the operational cost of running a large data warehouse. This design is referred to as the second generation of Uber’s Big Data platform.
The ad-hoc data ingestion jobs were now replaced with a standard platform that transferred all the data, in its original and nested formats, into the Hadoop data lake.
As Uber’s business grew rapidly, tens of terabytes of data were generated and added to the Hadoop data lake every day. Soon, its Big Data platform grew to over 10,000 vCores, with approximately 100,000 batch jobs running per day. As a result, the Hadoop data lake became the centralized source of truth for Uber’s analytical data.
The following image summarizes how snapshot-based data ingestion moved through Uber’s Big Data platform.
This is how Uber managed its big data with the help of the Hadoop ecosystem.
The question ‘What is Hadoop?’ cannot be answered completely without discussing its features. So, let’s move on with that now in this Hadoop tutorial.
Features of Hadoop
Let’s now look at a few features of Big Data Hadoop:
1. Enables Flexible Data Processing
The most prominent problem organizations face is handling unstructured data. Hadoop plays a key role here because it can manage data of any kind, whether structured or unstructured.
2. Highly Scalable
Since Hadoop is an open-source platform that runs on industry-standard hardware, it is highly scalable: new nodes can easily be added to the system to store and replicate data blocks.
3. Fault Tolerant
In Hadoop, data is stored in HDFS, where each block is automatically replicated to three different locations by default. Therefore, even if two of the systems holding a copy fail, the file is still available on the third.
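A small simulation makes the replication argument concrete. It is a minimal sketch under stated assumptions: the node names are made up, and the round-robin `place_replicas` function is a simplification for illustration, not HDFS’s actual rack-aware placement policy.

```python
REPLICATION_FACTOR = 3  # HDFS keeps three copies of each block by default


def place_replicas(block_id, nodes, replication=REPLICATION_FACTOR):
    """Pick `replication` distinct nodes for a block.

    Simplified round-robin placement (an assumption for this sketch);
    it requires len(nodes) >= replication so the chosen nodes are distinct.
    """
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]


def is_block_available(replica_nodes, failed_nodes):
    """A block is readable as long as at least one replica sits on a live node."""
    return any(node not in failed_nodes for node in replica_nodes)


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
replicas = place_replicas(block_id=0, nodes=nodes)

# Two of the three replica holders fail: the block is still readable.
survives_two = is_block_available(replicas, failed_nodes={"node-a", "node-b"})
# All three replica holders fail: the block is lost.
survives_three = is_block_available(replicas, failed_nodes={"node-a", "node-b", "node-c"})
print(replicas, survives_two, survives_three)
```

With three replicas, the block stays readable through any two failures among the nodes that hold it, which is the guarantee described above.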
4. Faster in Data Processing
Hadoop is remarkably efficient at high-volume batch processing because it processes data in parallel. It can run batch processes about 10 times faster than a single-threaded server or mainframe.
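The parallel model behind this is map and reduce: each chunk of input is processed independently, and the partial results are then merged. The sketch below imitates that structure on one machine with Python’s standard library. It is not Hadoop’s API; the sample text and the function names are hypothetical, and the thread pool only stands in for the separate cluster nodes that would each process a chunk.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count words in one chunk, independently of all other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical input, already split into chunks (as a file is split into blocks).
chunks = [
    ["big data on hadoop", "hadoop stores data"],
    ["hadoop processes data in parallel"],
]

# Each chunk is mapped by a separate worker; in a cluster these would be separate nodes.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(map_chunk, chunks))

word_counts = reduce_counts(partial_counts)
print(word_counts["hadoop"], word_counts["data"])
```

Because no chunk depends on another during the map step, adding more workers (or nodes) lets more chunks be processed at the same time.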
5. Robust Ecosystem
Hadoop has a pretty robust ecosystem that suitably aligns with the analytical requirements of developers and of small or large organizations.
6. Cost-Effective
Hadoop brings significant cost benefits. Running parallel computation on commodity servers results in a noticeable reduction in the cost per terabyte of storage.
Next in this Hadoop tutorial, let’s look at the various domains that use Hadoop.
Various Domains That Use Hadoop
Hadoop is being used in a large number of sectors to manage data effectively. Some of these major domains are as follows:
- Banking
Banks store a huge amount of data in their servers and databases that needs to be managed effectively and secured at the same time. Meanwhile, they have to meet customer requirements and reduce risk while maintaining regulatory compliance.
How does Hadoop pitch in?
The vast financial data residing in banks’ extensive databases can be turned into a goldmine of information, provided that there is a suitable tool to analyze it, for example, Cloudera.
- Government
Government sectors mainly use big data to manage their huge stock of resources and utilities and to gain insights from large-scale surveys. They need to manage huge databases containing the records of billions of people.
How does Hadoop cater to this problem?
- Preventing fraud and waste: Apache Hadoop is a tool that can be used to detect fraud and analyze data by creating new data models focused on fraud, waste, and abuse.
- Identifying terror threats on social media: Terrorist organizations often communicate through social networks in order to circulate instructions. Hadoop-based tools can be used not only to identify such data but also, with filtering and matching algorithms, to help detect accomplices working with such organizations.
- Storing government records: It’s hard to store extensive amounts of data, such as the records related to the Aadhaar card, in traditional databases. Thus, the government uses various Big Data Analytics tools, especially Hadoop, to sort and manage such huge volumes of data effectively and efficiently.
- Education
The education sector has to maintain a huge volume of data that may be segregated into several fields. Managing this data and providing access to it according to users’ interests is a huge challenge.
How is Hadoop used in the education sector?
- Examination records and results: Analyzing each student’s results to better understand the student’s behavior and thus create the best possible learning environment
- Analytics for educators: Programs can be created around individuals’ interests, and reports can be generated on that basis. Educators can then be assigned according to their skills and interests.
- Career prediction: A thorough analysis of a student’s report card, done using various Hadoop tools, can suggest appropriate courses the student could pursue in the future according to their areas of interest.
- Healthcare
With a large population, it becomes increasingly difficult to manage all the medical data in the healthcare sector and analyze it to suggest a suitable treatment for each patient. Several Machine Learning algorithms and Data Analytics tools can be used to analyze patients’ medical history, gain insights from it, and, in turn, treat patients appropriately.
Use cases of Big Data Hadoop in the healthcare sector
- Predictive analytics in healthcare: Several Big Data tools are available to analyze and assess patients’ medical history and predict what kind of treatment is likely to work for them in the future.
- Electronic health records: Electronic health records have become one of the main applications of big data in healthcare as they enable every patient to possess his/her own medical records such as prescribed medicines, medical reports, lab test results, etc. This data can be modified over time by doctors and shared efficiently.
- E-commerce
Big Data tools can help e-commerce businesses understand what customers need and what the market currently requires. These tools can also help enterprises build customer relationships and develop new strategies and ideas for expanding their business further.
- Service Improvement: Various Big Data tools can analyze market demand and, based on these insights, predict what customers are likely to want more of in the future.
- Personalized Approach: This involves sending selected emails and offers to customers who are interested in a particular domain.
- Price Formation: Various tools can be used to analyze the dynamic nature of the market and the balance of demand and supply, and to predict the pricing of a particular product so that its sales are not affected.
- Social media
Social media today is the largest data producer, and it contains a lot of sensitive data that needs to be managed efficiently and securely. This data also needs to be optimized and stored effectively.
Usage of Big Data Tools in Social Media
- Bitly: Bitly is a link-management tool that creates short links that can be tracked across the web and provides analytics on them. It can be used to shorten any URL so that it fits neatly on any social media webpage.
- Everypost: Everypost is a platform for managing multiple social networks in parallel. A user can curate content in one place and organize it in another. It enables massive amounts of data from various social websites to be stored in a single large repository.
We have now come to the end of this section on ‘What is Hadoop?’
In this section of the Hadoop tutorial, we learned ‘What is Hadoop?’, the need for it, and how Hadoop solved the problem of big data, and we also saw how Uber dealt with its big data with the help of the Hadoop ecosystem.
To answer the question ‘What is Hadoop?’ comprehensively, we need to know about Big Data. What is this Big Data that we have been talking about throughout this tutorial? In the next section of this Big Data Hadoop tutorial, we shall learn what Big Data is.
- Intellipaat’s Hadoop tutorial is designed for Programming Developers and System Administrators
- Project Managers eager to learn new techniques of maintaining large datasets
- Experienced working professionals aiming to become Big Data Analysts
- Mainframe Professionals, Architects & Testing Professionals
- Entry-level programmers and working professionals in Java, Python, C++, eager to learn the latest Big Data technology.
- Before starting this Hadoop tutorial, it is advisable to have prior experience with the Java programming language and the Linux operating system.
- Basic knowledge of UNIX commands and SQL scripting can help in understanding the Big Data concepts in Hadoop applications.