Frequently Asked Questions from clients planning to adopt Hadoop:
1. Who are the players in the market, and what services / solutions do they provide?
There are many players in the market and they fall broadly into five categories. The first category is the pure technology players like Hortonworks, Cloudera, Yahoo, etc. These players provide technology and consulting around Hadoop; they are directly responsible for Hadoop projects like Hive, HBase, Hadoop Core, Hadoop MapReduce, Hadoop HDFS, Avro, Thrift, Pig, etc. The second category is the solution providers on Hadoop; these are companies like Infosys, Wipro, etc. who build centers of excellence so that they can handle customers' Hadoop / Big Data projects.
The third category is the integration vendors, like Informatica. They provide tools to make Hadoop and the rest of the ecosystem co-exist: they help two systems communicate with each other, and they help with data migration and management between Hadoop and other native / legacy systems.
The fourth category is the players who build frameworks on top of Hadoop to make developing applications with Hadoop simpler; examples are Cascading, Google FlumeJava, Crunch, etc. The fifth category is the analytics industry, which has already built analytics / data mining systems that can now work with Hadoop; examples are Greenplum, Pentaho, Karmasphere, etc. They provide simple visual tools with which your data analysis needs can be described without explicitly writing MapReduce programs in Java.
2. What are the areas to focus on to optimize a big data deployment from an analytical point of view?
The domain you focus on initially should be one you already know, so that what you have to learn is only the technology and not the domain. For example, if you have good domain knowledge in eCommerce analytics, it is better to focus on that domain rather than entering high-frequency trading, even though the big data concepts are the same. Technology is only a few pieces of the puzzle; most of the rest is hard-earned domain knowledge. When attempting analytics on big data, enter a domain in which you have a high level of confidence, so that when something fails you know it is the technology and not the domain knowledge. Slowly you can leverage the technology expertise in other domains. This is the only way to break what otherwise looks like a catch-22 situation.
3. How is it useful in terms of structured, semi-structured and unstructured data?
Big data comes in three flavors, structured, semi-structured and unstructured, and we need to process these data to come to some conclusion / generate reports. Hadoop helps in writing simple code to handle these volumes of data and analyze them. It enables parallel computation on this huge data volume and gets results much faster.
4. What expertise does one need to develop to be able to use it from an analytical point of view?
An understanding of how a distributed file system works, how big data is split and stored across multiple machines, the concept of pushing code to where the data is stored, the concept of Single Instruction Multiple Data, map and reduce as they appear in functional programming, the concept of key and value, how map, reduce, key and value are inter-related, and how all of these together can be used to solve problems.
One does not really need to understand all of this in depth; one can work with R / Hive / Pig and still get the relevant answers, but he / she should know when to use R vs Hive vs Pig vs Mahout. From the analytics point of view he / she should be comfortable with mathematics, as it is the base. Apart from that, those who are going to write Java code should be very good at programming and core MapReduce concepts; otherwise they can work with R / Hive without having to think much about Hadoop, or use Parallel R.
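To make the key / value idea concrete, here is a minimal sketch of the classic word count job written with the standard Hadoop Java MapReduce API; the input and output HDFS paths are assumptions passed on the command line.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: each input line is turned into (word, 1) key/value pairs.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: all values for the same key (word) arrive together and are summed.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The mapper runs wherever the data blocks live (code pushed to data), and the framework groups every value emitted under the same key before handing them to the reducer.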
5. Why should a client move to Big Data and when?
They should move once their data needs become unmanageable with traditional technology, both for storing and retrieving the data and for processing it to get insights.
6. What is lacking in the entire Big Data space – tech, people, senior consulting?
Everything, and this is one reason the market is open for anyone to participate and make money / create value; if it were a mature market it would be dominated by a few players. Hadoop / Big Data is a new market, and it gives plenty of chances for new players to enter and make it big.
7. To continuously train people, what could be a good training schedule?
Give the core members an initial training of 3 to 5 days on the technology, with hands-on programming and execution, and then get them onto projects where they learn on the job. Any new skill or additional knowledge can be covered in a one- or two-day session, and they can learn the simpler things online from the many blogs, tutorials, books and academic papers.
8. What are the free / paid certifications we can get people to do?
Cloudera has one, and I think Hortonworks is also working on a certification. Certification matters more to a services company, because they have to win projects from clients and they can say they have engineers who are Cloudera certified; for a product organization it does not matter, as long as the engineer is good with the technology and can work with it comfortably.
9. The ability to integrate Hadoop with other Apache projects like Apache UIMA for text / audio mining, or with other software like R, SAS etc. Essentially – how flexible is Hadoop when it comes to addressing a variety of analytical tasks?
Hadoop has multiple related projects now, and most of them were created with the intent of making analytics better and better. A framework like Mahout helps with machine learning – collaborative filtering, and it has LDA algorithms that can be used to work on images. Since it is open source, Mahout and UIMA can be easily integrated, and there are already projects doing this, that is, making UIMA work on Hadoop; the project is called Behemoth.
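As a small illustration of the collaborative filtering side, here is a hedged sketch using Mahout's Taste recommender API; the input file ratings.csv (one userId,itemId,rating triple per line), the neighborhood size and the user id are assumptions, not part of the discussion above.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserBasedRecommendation {
        public static void main(String[] args) throws Exception {
            // ratings.csv is a hypothetical preference file: userId,itemId,rating per line.
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Recommend 5 items for user 42 based on what similar users rated highly.
            List<RecommendedItem> items = recommender.recommend(42L, 5);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " -> " + item.getValue());
            }
        }
    }

The same user-based collaborative filtering logic can also be run as MapReduce jobs on the cluster when the preference data no longer fits on one machine.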
10. How much MapReduce knowledge is required to work with Hadoop?
The more the better. It is like working in an engineering organization without knowing mathematics: people can survive as long as it is simple math, but the moment they need to work with multiple variables and the effect of one variable on the others, simple math is not going to help; they need to know calculus and linear algebra. It is the same with Big Data: as long as it is simple models it is fine, but the moment they need to work on or tune some algorithms they need knowledge of MapReduce.
11. When to use Cassandra v/s Hadoop v/s MongoDB, and what parameters did you consider before asking clients to move to these platforms?
NoSQL or big data databases fall into three categories based on the CAP theorem, essentially on what they promise in terms of consistency, availability and partition tolerance. These databases can also be classified based on the way they represent data.
By the CAP theorem model they are either Consistent and Available, Consistent and Tolerant to Partition, or Available and Tolerant to Partition. The following picture illustrates these categories and which databases fall into each of them.
[Figure: NoSQL databases grouped by CAP theorem category and by data representation model]
From a pure data representation perspective they are classified as either Relational, Key-Value, Column Oriented or Document Oriented. The picture above illustrates these two classifications in a single diagram.
13. Talk about the merits and demerits of all three – also the cost of implementation for the client?
There is no such thing as absolute merits and demerits; we have to use each system for what it is designed for, which puts a lot of pressure on us to know what each system is designed for. Not understanding the architecture before using these systems puts the product / application at risk.
14. Brief us more on Pig v/s Hive?
Hive is used if your big data is structured and most of the requirements can be expressed by writing SQL-like queries and some simple Python / Perl programs. Hive cannot work on unstructured data, and we need to have metadata (a table schema) defined for it.
Pig is more of a data pipelining system and can be used to work on both structured and unstructured data; it is more expressive than Hive. You can use Pig or Hive whenever writing MapReduce is not necessary, that is, whenever the work can be done by Pig or Hive; there will be scenarios where some algorithms cannot be expressed using Hive or Pig, and in such cases we need to write either Java MapReduce or R.
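To illustrate the "SQL-like queries" point, below is a hedged sketch that runs a Hive query from Java through the HiveServer2 JDBC driver; the connection URL and the page_visits table are assumptions, not part of the original discussion.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Host, port and database are assumptions; adjust for your cluster.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = conn.createStatement()) {
                // A SQL-like aggregation over a hypothetical page_visits table;
                // Hive compiles this into MapReduce jobs behind the scenes.
                ResultSet rs = stmt.executeQuery(
                        "SELECT country, COUNT(*) AS visits "
                        + "FROM page_visits GROUP BY country");
                while (rs.next()) {
                    System.out.println(rs.getString("country") + "\t" + rs.getLong("visits"));
                }
            }
        }
    }

A Pig script covering the same pipeline would read the raw files directly and not require table metadata up front, which is the main practical difference between the two.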
15. Elaborate more on HBase and its applications?
HBase is a highly scalable, large-scale, massively distributed database and it is used in many applications; even search at Google is powered by a similar system (actually HBase is based on a system called Bigtable, which was created inside Google to solve their search problem; Google released a paper on it, and HBase is the open source implementation of Bigtable).
Any application that needs data to be retrieved without even having to make a disk seek can benefit from HBase; high-volume read and write applications can benefit from HBase. The UID AADHAAR project is one example of HBase in use.
HBase is one of the most widely used NoSQL / Big Data databases, as it integrates directly with Hadoop and MapReduce jobs can be run against HBase directly.
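For a sense of what those high-volume reads and writes look like in code, here is a minimal sketch using the standard HBase Java client API; the visits table, its info column family and the row key are hypothetical and would need to exist on the cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table visits = connection.getTable(TableName.valueOf("visits"))) {

                // Write: one row keyed by cookie id, with a column in the "info" family.
                Put put = new Put(Bytes.toBytes("c31600a9-d8df-4612-9efe-e880b266d5dc"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("visitCount"), Bytes.toBytes(1L));
                visits.put(put);

                // Read: fetch the same row back by its key.
                Result row = visits.get(new Get(Bytes.toBytes("c31600a9-d8df-4612-9efe-e880b266d5dc")));
                long visitCount = Bytes.toLong(row.getValue(Bytes.toBytes("info"), Bytes.toBytes("visitCount")));
                System.out.println("visitCount = " + visitCount);
            }
        }
    }

The same table can also be used as the input or output of a MapReduce job via TableMapReduceUtil, which is what "running MapReduce jobs against HBase directly" means in practice.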
16. What are the pros and cons of integrating Cassandra with Hadoop?
Cassandra was not very actively developed for a while, but there is activity now after it became an Apache top-level project, and it is now possible to run Cassandra with Hadoop; however, one should have a solid understanding of the leader election and sharding architecture of Cassandra. A master-slave architecture is easily understood by everyone and is simple, hence HBase / Bigtable followed that model, but Cassandra follows a peer model in which any node can take over the master role, so the runtime deployment has to be well understood. Even Facebook has migrated all their Cassandra deployments to HBase.
17. Kindly show us an example of the "no schema" approach, which is said to be the biggest takeaway of these platforms, and how structures can be altered on the fly?
This is a typical example of a NoSQL record (taken from MongoDB); the data represents an online user and the visit he made to a website called MassMutual.
{
    "_id" : ObjectId("50981cf8e4b08e6b37bb842d"),
    "cookieType" : 3,
    "pixelId" : "000007",
    "cookieId" : "c31600a9-d8df-4612-9efe-e880b266d5dc",
    "ipAddress" : "160.81.118.42",
    "pageURL" : "http://www.massmutual.com/",
    "visitCount" : 1,
    "lastVisitedTime" : NumberLong("1352146168365"),
    "userAgent" : "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; GomezAgent 3.0)",
    "browser" : "Internet Explorer",
    "operatingSystem" : "Windows",
    "geoTagInfo" : {
        "countryCode" : "US",
        "countryName" : "United States",
        "region" : "NY",
        "city" : "New York",
        "coordinates" : [ -74.00599670410156, 40.71429443359375 ],
        "dma_code" : 0,
        "area_code" : 212,
        "metro_code" : 501
    },
    "version" : NumberLong("1352146168366")
}
The advantage of this model is that it is a pure JSON object which can have any level of nesting, and there is no need for a static schema; if we want more fields we can add them at runtime.
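As a hedged sketch of altering the structure on the fly, the snippet below uses the MongoDB Java driver to insert a document and later add a field that was never declared anywhere; the connection string, database and collection names and the referrer field are assumptions for illustration.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Updates;
    import org.bson.Document;

    public class SchemaOnTheFly {
        public static void main(String[] args) {
            // Connection string, database and collection are assumptions for this sketch.
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> visits =
                        client.getDatabase("tracking").getCollection("visits");

                // Insert a document with only the fields we know about today.
                visits.insertOne(new Document("cookieId", "c31600a9-d8df-4612-9efe-e880b266d5dc")
                        .append("pageURL", "http://www.massmutual.com/")
                        .append("visitCount", 1));

                // Later, add a brand-new field to the same document at runtime;
                // no ALTER TABLE or schema migration step is required.
                visits.updateOne(
                        Filters.eq("cookieId", "c31600a9-d8df-4612-9efe-e880b266d5dc"),
                        Updates.set("referrer", "http://www.google.com/"));
            }
        }
    }

Documents written before and after the update can coexist in the same collection, which is exactly the flexibility a static relational schema does not give you.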