The last five years have seen an absolute explosion of data in the world. There are around 6000 tweets every second, which calculates to over 350,000 tweets per minute and 50 million tweets per day. Similarly, Facebook has over 1.55 billion active users per month and around 1.39 billion mobile active users. Every minute on Facebook, 510 comments are posted, 293,000 statuses are updated and 136,000 photos are uploaded.
Companies of all sizes providing all kinds of services and products – whether IT and software, Manufacturing, E-commerce or Medical, use Hadoop at present time. The primary goal of Hadoop is to extract valuable information from the structured and unstructured data available within the organization and on their digital sources.
Ultimately, big data analytics helps enterprises in taking improved and more informed business decisions as it includes data from several resources like web server logs, Internet clickstream data, social media content, email content and responses from customers, reports from social network activities, mobile phone data and also captured from Internet of Things.
Hadoop is an open source technology with a distributed processing framework and node-cluster hardware structure. It is, in fact, a collection of open source technologies and thus, its development is in the hands of not a single Apache Software Foundation. The major components of Hadoop are:
Hadoop Distributed File System (HDFS): encompasses a conventional hierarchy of distributed file system that distributes files across Datanodes (storage nodes) in a cluster.
MapReduce: is a programming model and software framework based on Java for creating applications that process massive volumes of data across thousands of servers in single Hadoop cluster. It is better known as the heart of Hadoop.
In large and small scale enterprises, Hadoop is not only storage space/framework but considered important for data warehousing, data modeling, data analytics, data scalability and data computations. Only the challenges these companies face today are lack of appropriate skills, fragile business support and unreliable open source tools, to which Hadoop vendor Apache is continually upgrading its hardware and processing systems. The latest release of Apache is Hadoop 2.7.2, which is built upon its previous Ver. 2.7.1 in the 2.x.y series.
We have listed and explained here several leading companies that use Hadoop and related platforms
Facebook uses Hadoop and Hive: We can’t even imagine and calculate the exact amount of data generated by Facebook posts, images, videos, profiles and every activity happening on FB. According to experts and company professionals, Hadoop functions on every product of Facebook and in a variety of ways. User actions including ‘like,’ ‘status update’ or ‘add comment’ are stored and saved in an excellent distributed and personalized database, MySQL. Similarly, Facebook messenger application runs on HBase, and all the messages sent and received on Facebook are gathered and hoarded in HBase.
In addition, all of the external advertisers’ and developers’ campaigns and applications running on this social media platform use Hive to generate their success reports. Facebook has built a higher level data warehousing infrastructure using features of Hive that help in querying the database using SQL language-HiveQL.
Amazon uses Elastic MapReduce(EMR) and Elastic Cloud Compute(EC2)
Amazon web services, the top E-commerce today, simplifies its big data processing and analytics using Elastic MapReduce web service. EMR provides a managed framework of Hadoop that employs easy, fast and cost-effective mechanism to distribute and compute vast amounts of data across Amazon EC2 instances. The major functions performed by Hadoop in Amazon web services include log analysis, data warehousing, web indexing, financial analysis, machine learning, scientific simulation and bioinformatics.
eBay works with 532 nodes cluster (8 * 532 cores and 5.petabytes)
Another global online retailer, eBay makes heavy usage of big data hadoop components like Java MapReduce , Apache HBase, Apache Hive and Apache Pig for its Search optimization and Research.
Adobe uses Apache HBase and Apache Hadoop
Planning a deployment of around 80 nodes cluster, Adobe’s processes currently have 30 nodes running on HDFS, HBase and Hadoop in clusters in the range of 5-14 nodes for its production and development operations. Adobe is a well-known international enterprise whose products and services are used worldwide; one of them is its Digital Marketing business Unit. Hadoop ecosystem has been deployed on Adobe’s VMware vSphere for several Adobe users. This deployment has reduced time to insight data and costs by using existing servers.
This report is a thought through and cogent signal for all those Software Developers and Administrators aspiring to get a level ahead in their careers that Hadoop is the real winner for big data analytics. There are thousands of newly available options that Big Data Hadoop provides for real-world applications and enterprises, delivering maximum business value and productive decision making.
The opinions expressed in this article are the author’s own and do not reflect the view of the organization.