Wednesday, February 10, 2016

Big Data and Hadoop

What is Big Data
Big Data is a very large, loosely structured data set that defies traditional storage.



"Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time”. – wiki



So big that a single data set may contain anywhere from a few terabytes to many petabytes of data.

Human Generated Data and Machine Generated Data

Human Generated Data consists of emails, documents, photos and tweets. We are generating this data faster than ever. Just imagine the number of videos uploaded to YouTube and tweets swirling around. This data can be Big Data too.



Machine Generated Data is a new breed of data. This category consists of sensor data, and logs generated by 'machines' such as email logs, click stream logs, etc. Machine generated data is orders of magnitude larger than Human Generated Data.



Before Hadoop came on the scene, machine generated data was mostly ignored and not captured, because dealing with the volume was either NOT possible or NOT cost effective.

Where does Big Data come from

The original big data was web data -- as in the entire Internet! Remember that Hadoop was built to index the web. These days Big Data comes from multiple sources.


  • Web data -- it is still big data
  • Social media data: sites like Facebook, Twitter and LinkedIn generate a large amount of data
  • Click stream data: when users navigate a website, the clicks are logged for further analysis (like navigation patterns). Click stream data is important in online advertising and e-commerce
  • Sensor data: sensors embedded in roads to monitor traffic, and many other applications, generate a large volume of data



Examples of Big Data in the Real world

  • Facebook: has 40 PB of data and captures 100 TB / day
  • Yahoo: 60 PB of data
  • Twitter: 8 TB / day
  • eBay: 40 PB of data, captures 50 TB / day



Challenges of Big Data

Size of Big Data

Big data is... well... big in size! How much data constitutes Big Data is not very clear cut, so let's not get bogged down in that debate. For a small company that is used to dealing with data in gigabytes, 10 TB of data would be BIG. However, for companies like Facebook and Yahoo, petabytes is big.

Just the size of big data makes it impossible (or at least cost prohibitive) to store in traditional storage like databases or conventional filers.

We are talking about the cost to store every gigabyte of data: using traditional storage filers to hold Big Data can cost a lot of money.

Big Data is unstructured or semi-structured

A lot of Big Data is unstructured. For example, a click stream log record might look like: timestamp, user_id, page, referrer_page.
This lack of structure makes relational databases not well suited to storing Big Data.
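As a concrete illustration (the field layout and sample values here are made up), a single click event might be logged as one flat, comma-separated line of text and parsed into a simple record in Java:

// Hypothetical click-stream record with the fields described above:
// timestamp, user_id, page, referrer_page
public class ClickEvent {
    public final long timestamp;     // epoch seconds
    public final String userId;
    public final String page;
    public final String referrerPage;

    public ClickEvent(long timestamp, String userId, String page, String referrerPage) {
        this.timestamp = timestamp;
        this.userId = userId;
        this.page = page;
        this.referrerPage = referrerPage;
    }

    // Parses one log line such as: 1455100800,user42,/products/123,/home
    public static ClickEvent parse(String line) {
        String[] f = line.split(",", 4);
        return new ClickEvent(Long.parseLong(f[0].trim()),
                f[1].trim(), f[2].trim(), f[3].trim());
    }
}

Note that no schema is enforced anywhere: the log is just lines of text, and whatever structure exists is imposed by the code that reads it.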

Plus, not many databases can cope with storing billions of rows of data.

No point in just storing Big Data if we can't process it

Storing Big Data is only part of the game. We have to process it to mine intelligence out of it. Traditional storage systems are pretty 'dumb', in the sense that they just store bits -- they don't offer any processing power.

The traditional data processing model has data stored in a 'storage cluster', which is copied over to a 'compute cluster' for processing, and the results are written back to the storage cluster.



This model, however, doesn't quite work for Big Data, because copying so much data out to a compute cluster might be too time consuming or outright impossible. So what is the answer?

One solution is to process Big Data 'in place' -- as in a storage cluster doubling as a compute cluster.



How Hadoop solves the Big Data problem

Hadoop clusters scale horizontally

More storage and compute power can be achieved by adding more nodes to a Hadoop cluster. This eliminates the need to buy more and more powerful and expensive hardware.

Hadoop can handle unstructured / semi-structured data

Hadoop doesn't enforce a 'schema' on the data it stores. It can handle arbitrary text and binary data. So Hadoop can 'digest' any unstructured data easily.
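For example, here is a minimal sketch that writes raw, schema-less bytes straight into HDFS using the standard Hadoop FileSystem Java API; the target path and the sample record are placeholders:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // HDFS stores the bytes as-is: no schema or format is enforced,
        // so the same call works for log lines, JSON, images, etc.
        Path target = new Path("/data/raw/clicks/2016-02-10.log"); // hypothetical path
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.write("1455100800,user42,/products/123,/home\n"
                    .getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}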

Hadoop clusters provide storage and computing

We saw how having separate storage and processing clusters is not the best fit for Big Data. Hadoop clusters provide storage and distributed computing all in one.
 
 

Why do I Need Hadoop

Too Much Data


Hadoop provides storage for Big Data at reasonable cost
Storing Big Data using traditional storage can be expensive. Hadoop is built around commodity hardware, so it can provide fairly large storage for a reasonable cost. Hadoop has been used in the field at petabyte scale.

Hadoop allows you to capture new or more data
Sometimes organizations don't capture a type of data because it is too cost prohibitive to store it. Since Hadoop provides storage at reasonable cost, this type of data can now be captured and stored.
One example would be website click logs. Because the volume of these logs can be very high, not many organizations captured them. Now with Hadoop it is possible to capture and store the logs.

With Hadoop, you can store data longer
To manage the volume of data stored, companies periodically purge older data. For example, only logs for the last three months might be kept while older logs are deleted. With Hadoop it is possible to store the historical data longer. This allows new analytics to be done on older historical data.
For example, take click logs from a website. A few years ago, these logs were stored for a brief period of time to calculate statistics like popular pages, etc. Now with Hadoop it is viable to store these click logs for a longer period of time.

Hadoop provides scalable analytics
There is no point in storing all the data if we can't analyze it. Hadoop provides not only distributed storage but also distributed processing, meaning we can crunch a large volume of data in parallel. The compute framework of Hadoop is called MapReduce. MapReduce has been proven at petabyte scale.
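To make that concrete, here is the classic word count example written against the Hadoop MapReduce Java API -- a minimal sketch, not production code. Map tasks run in parallel, one per input split, and emit (word, 1) pairs; reduce tasks then sum the counts for each word:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: called once per input line; emits (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives all the counts emitted for one word and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Because each map task works only on its own slice of the input, the same code scales from a single test file to petabytes simply by adding nodes. A small driver that submits this job to the cluster is sketched further down, in the "What is Hadoop" section.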
Hadoop provides rich analytics
Native MapReduce supports Java as the primary programming language. Other languages like Ruby, Python and R can be used as well.
Of course, writing custom MapReduce code is not the only way to analyze data in Hadoop. Higher-level abstractions over MapReduce are available. For example, a tool named Pig takes an English-like data flow language and translates it into MapReduce jobs. Another tool, Hive, takes SQL-like queries and runs them using MapReduce.
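As a rough illustration, a Hive query can also be submitted from plain Java over JDBC. This sketch assumes a HiveServer2 endpoint and a click_logs table, both of which are invented names, and it requires the Hive JDBC driver jar on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Loads the HiveServer2 JDBC driver (optional with JDBC 4 auto-loading).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Host, port, database, user and table below are assumptions.
        String url = "jdbc:hive2://hive-server.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM click_logs " +
                     "GROUP BY page ORDER BY hits DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}

Under the hood Hive compiles the query into MapReduce jobs, so analysts write SQL while the cluster still does the distributed heavy lifting.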
Business Intelligence (BI) tools can provide an even higher level of analysis. Quite a few BI tools can work with Hadoop and analyze data stored in Hadoop.
 
 
 
 

Hadoop - It's All About The Data

A key point to understand about Hadoop is that it's all about the data. Don't lose focus. It's easy to get hung up on Hive, Pig, HBase, HCatalog and lose sight of designing the right data architecture. Also, if you have a strong background in data warehouse design, BI, analytics, etc., all those skills are transferable to Hadoop. Hadoop just takes data warehousing to new levels of scalability and agility, reducing business latency while working with data sets ranging from structured to unstructured. Hadoop 2.0 and YARN are going to move Hadoop deep into the enterprise and allow organizations to make faster and more accurate business decisions. The ROI of Hadoop is multiple factors higher than that of the traditional data warehouse. Companies should be extremely nervous about being out-Hadooped by their competition.
Newbies often look at Hadoop with wide eyes instead of realizing that Hadoop is built from components they already understand, such as clustering, distributed file systems, parallel processing, and batch and stream processing.

A few key success factors for a Hadoop project are:

  • Start with a good data design using a scalable reference architecture.
  • Build successful analytical models that provide business value.
  • Be aggressive in reducing the latency between data hitting the disk and leveraging business value from that data. 

The ETL strategies and data set generation used in your existing data warehouse are similar to what you are going to want to do in your Hadoop cluster.  It's important to look at your data warehouse and understand how your enterprise data strategy is going to evolve with Hadoop now a part of the ecosystem.



"Hadoop cannot be an island, it must integrate with Enterprise Data Architecture". - HadoopSummit






"Apache Hadoop is a set of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network." - Merv Adrian at Gartner Research



This is a sample Hadoop 1.x cluster so you can see the key software processes that make up Hadoop.  The good point of this diagram is that if you understand it you are probably worth another $20-30k.  :)





YARN (Hadoop 2.0) is the distributed operating system of the future. YARN allows you to run multiple applications in Hadoop, all sharing common resource management.  YARN is going to disrupt the data industry to a level not seen since the .com days.




A Hadoop cluster will usually have multiple data layers.


  • Batch Layer: Raw data is loaded into a data set that is immutable so it becomes your source of truth. Data scientists and analysts can start working with this data as soon as it hits the disk.  
  • Serving Layer: Just as in a traditional data warehouse, this data is often massaged, filtered and transformed into a data set that is easier to do analytics on.  Unstructured and semi-structured data will be put into a data set that is easier to work with. Metadata is then attached to this data layer using HCatalog so users can access the data in the HDFS files using abstract table definitions.   
  • Speed Layer: To optimize data access and performance, additional data sets (views) are often calculated to create a speed layer.  HBase can be used for this layer, depending on the requirements (a sketch follows this list).
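As a rough sketch of the speed layer idea, the snippet below uses the HBase 1.x Java client to write and then read back a precomputed page-view count. The table name, column family and row-key format are invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SpeedLayerExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster settings.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table views = conn.getTable(TableName.valueOf("page_view_counts"))) { // hypothetical table

            // Write a precomputed view: hourly hit count keyed by page + hour.
            byte[] rowKey = Bytes.toBytes("/products/123#2016021014");
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("v"), Bytes.toBytes("hits"), Bytes.toBytes(4217L));
            views.put(put);

            // Low-latency point read served directly from the speed layer.
            Result r = views.get(new Get(rowKey));
            long hits = Bytes.toLong(r.getValue(Bytes.toBytes("v"), Bytes.toBytes("hits")));
            System.out.println("hits=" + hits);
        }
    }
}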


This diagram emphasizes two key points:
  • The different data layers you will have in your Hadoop cluster.
  • The importance of building your metadata layer (HCatalog).


With the massive scalability of Hadoop, you need to be able to automate as much as possible and manage the data in your cluster.  This is where Falcon is going to play a key role.  Falcon is a data lifecycle management framework that provides the data orchestration, disaster recovery and data retention you need to manage your data.


 
 

Weaknesses in Traditional Data Platforms

Everyone understands that Hadoop brings high performance commercial computing to organizations using relatively low cost commodity storage.   What is accelerating the move to Hadoop are weaknesses in traditional relational and data warehouse platforms in meeting today's business needs.  Some key weaknesses of traditional platforms include:

  • Late binding with schemas greatly increases the latency between receiving new data sources and deriving business value from this data.
  • The significantly high cost and complexity of SAN storage.  This high cost forces organizations to aggregate and remove a lot of data that contains high business value.  Important details and information are getting thrown out or hidden in aggregated data.
  • The complexity of working with semi-structured and unstructured data.
  • The incredible cost, complexity and ramifications of maintaining database administrators, storage and networking teams in traditional platforms.  There are lots of silos of expertise and software required in traditional environments that have dramatic effects on agility and cost.  It's gotten to the point that vendors are now delivering extremely expensive engineered systems to deal with the complexity of these silos.  These expensive engineered systems require even more specialized expertise to maintain and make customers ever more dependent on the vendors.  What's funny is that you hear the old phrase "one throat to choke", but it's the customer who's choking on the cost.  With Hadoop's self-healing and fault tolerance, a small team can manage thousands of servers.  A single Hadoop administrator can manage 1000 - 3000 nodes, all on relatively inexpensive commodity hardware.

While the above highlights the need for Hadoop, it's also important to understand that traditional relational databases and data warehouses still have the same role and are still needed.  A relational database provides a completely different function than a Hadoop cluster.  Also, a company is not going to throw out all their existing data warehouses or the expertise and reporting they've built around them.  Hadoop today is usually used to add new capabilities to an enterprise data environment, not to replace existing platforms.

The old line that no one ever gets fired for buying IBM is a thing of the past with Hadoop.  An entire organization may go under if your competition is effectively using big data and you are not.  Hadoop is the most disruptive technology since the .com days.

Hadoop Reference Architectures

A key initial factor for success when building a Hadoop cluster is to build a solid foundation.  This includes:
  • Selecting the right hardware.  In working with a hardware vendor make sure you are working from their hardware compatibility lists (HCL) and you are making the right decisions for your cluster.  Commodity hardware does not have to be generic.  You can select commodity hardware that is customized for running Hadoop.
  • Build an enterprise OS platform.  Whether you are using Linux or Windows, customize and tune your operating system using enterprise best practices and standards for Hadoop.
  • Design your Hadoop clusters using reference architectures.  Don't reinvent the wheel unless you have to.  Vendors are publishing Hadoop reference architectures that give you a great starting place.

I've included a few HDP reference architectures to give you a feel for what a Hadoop platform may look like:

  • HP reference architecture
  • Rackspace reference architecture
  • StackIQ reference architecture
  • Cisco UCS reference architecture

I'd like to keep building this reference architecture list.
 
 

How to Learn Hadoop

Big data is one of the hottest areas in IT, so everyone wants to learn Hadoop.  I am constantly being asked how to learn Hadoop, so I want to share an approach I've been recommending that has gotten a lot of positive feedback.  It's often hard to learn a new technology because a lot of the terminology, concepts and architecture approaches are new.  Books, white papers and blogs usually try to teach Hadoop from the perspective of someone who already knows it.  These resources use terms, concepts and context that a newbie does not understand, which can make it very hard to learn a completely new subject.  So here is my recommendation for a way to learn Hadoop.
Learn the Basic Concepts First
Everyone gets in a hurry to learn a new technology, so they try to learn all the tricks and fancy stuff right away and do not build a solid base first.

            Big data books.  Hadoop is all about the data.  Learn big data concepts before looking at Hadoop in any depth.  These books will build core data concepts around Hadoop.

  • Disruptive Possibilities: How Big Data Changes Everything - This is a must read for anyone getting started in Big Data.
  • Big Data Now, 2012 Edition:  Easy read and good insights on Big Data.  Some of this content on companies is out of date, but there is a lot of valuable information here so this is still a must read.
  • Big Data, by Nathan Marz - The book does a great job of teaching core concepts and fundamentals and provides a great perspective on Big Data.  This book will build a solid foundation and helps you understand the Lambda architecture.  You may need to get this book from MEAP if it has not been released yet. (http://www.manning.com/marz/)

The following books are a good way to learn basic concepts and terminology before looking at the technology in more depth.  Both are short reads.  These basic books will gently introduce you to the subject, so that when you read the more technical books you will understand them better.

  • Hadoop for Dummies, by Tim Jones - An easy introduction to basic concepts and terms.
  • Big Data for Dummies - This is a very gentle introduction to Big Data and the concepts and technologies surrounding it.

Three Defining Whitepapers to Read
These are excellent papers for building fundamental knowledge around Hadoop and Hive.  Even though they are a few years old, the concepts and perspectives discussed are excellent.  They will provide foundational insights into Hadoop.


Professional Training
Professional training is the quickest and easiest way to learn core concepts and fundamentals and get some hands-on experience.  I do work at Hortonworks, but there are specific reasons I recommend Hortonworks University.  The reason is that Hortonworks is all open source, so you are not learning someone's proprietary or open-proprietary distribution.  By learning from the 100% open source base at Hortonworks, what you learn is applicable to any distribution.  Also, Hadoop 2 has YARN, which is a key foundational component, and Hortonworks is driving the innovation and roadmap around YARN.

Additional Resources
Once you get the fundamental concepts down you will want to learn in more detail.  The books below are good for taking that next step.  However, I recommend reading them in parallel and bouncing back and forth, because each has areas that I believe it does a better job on.  Each book has sections that I prefer, and using them together was very helpful for me.

  • Apache Hadoop YARN (not released yet), by Arun Murthy, Jeffrey Markham, Vinod Vavilapalli, Doug Eadline
  • Hadoop The Definitive Guide (3rd Edition), by Tom White
  • Hadoop Operations, by Eric Sammer

Getting Hands on Experience and Learning Hadoop in Detail
A great way to start getting hands-on experience and learning Hadoop through tutorials, videos and demonstrations is with the Hortonworks Sandbox.  The Hortonworks Sandbox is designed for beginners, so it is an excellent platform for learning and skill development.  The tutorials, videos and demonstrations will be updated on a regular basis.  The sandbox is available as a VirtualBox or VMware virtual machine.  An additional 4GB of RAM and 2GB of storage is recommended for either of the virtual machines.  If you have a laptop that does not have a lot of memory, you can go to the VM settings and cut the RAM for the VM down to about 1.5 - 2GB.  This is likely to impact the performance of the VM, but it will help it at least run on a minimally configured laptop.

Other books to consider:
  • Programming Hive, by Edward Capriolo, Dean Wampler, ...
  • Programming Pig, by Alan Gates

Engineering Blogs:
  • http://engineering.linkedin.com/hadoop
  • http://engineering.twitter.com

Hadoop Ecosystem


Hadoop Distributions

Below are the companies offering commercial implementations of and/or providing support for Apache Hadoop, which is the base for all of them.

  • Cloudera offers CDH (Cloudera's Distribution including Apache Hadoop) and Cloudera Enterprise.
  • Hortonworks (formed by Yahoo and Benchmark Capital), whose focus is on making Hadoop more robust and easier to install, manage and use for enterprise users. Hortonworks provides Hortonworks Data Platform (HDP).
  • MapR Technologies offers a distributed filesystem and MapReduce engine, the MapR Distribution for Apache Hadoop.
  • Oracle announced the Big Data Appliance, which integrates Cloudera's Distribution Including Apache Hadoop (CDH).
  • IBM offers InfoSphere BigInsights based on Hadoop in both a basic and enterprise edition.
  • Greenplum, A Division of EMC, offers Hadoop in Community and Enterprise editions.
  • Intel - the Intel Distribution for Apache Hadoop is a product that includes the Intel Manager for Apache Hadoop for managing a cluster.
  • Amazon Web Services - Amazon offers a version of Apache Hadoop on their EC2 infrastructure, sold as Amazon Elastic MapReduce.
  • VMware - initiated an open source project and product to enable easy and efficient deployment and use of Hadoop on virtual infrastructure.
  • Bigtop - project for the development of packaging and tests of the Apache Hadoop ecosystem.
  • DataStax - DataStax provides a Hadoop product which fully integrates Apache Hadoop with Apache Cassandra and Apache Solr in its DataStax Enterprise platform.
  • Cascading - A popular feature-rich API for defining and executing complex and fault tolerant data processing workflows on an Apache Hadoop cluster. 
  • Mahout - Apache project using Hadoop to build scalable machine learning algorithms like canopy clustering, k-means and many more.
  • Cloudspace - uses Apache Hadoop to scale client and internal projects on Amazon's EC2 and bare metal architectures.
  • Datameer - Datameer Analytics Solution (DAS) is a Hadoop-based solution for big data analytics that includes data source integration, storage, an analytics engine and visualization.
  • Data Mine Lab - Developing solutions based on Hadoop, Mahout, HBase and Amazon Web Services.
  • Debian - A Debian package of Apache Hadoop is available.
  • HStreaming - offers real-time stream processing and continuous advanced analytics built into Hadoop, available as free community edition, enterprise edition, and cloud service.
  • Impetus
  • Karmasphere - Distributes Karmasphere Studio for Hadoop, which allows cross-version development and management of Apache Hadoop jobs.
  • Nutch - Apache Nutch, flexible web search engine software.
  • NGDATA - Makes available Lily Open Source that builds upon Hadoop, HBase and SOLR. Distributes Lily Enterprise.
  • Pentaho – Pentaho provides a complete, end-to-end open-source BI and offers an easy-to-use, graphical ETL tool that is integrated with Apache Hadoop for managing data and coordinating Hadoop related tasks in the broader context of ETL and Business Intelligence workflow.
  • Pervasive Software - Provides Pervasive DataRush, a parallel dataflow framework which improves performance of Apache Hadoop and MapReduce jobs by exploiting fine-grained parallelism on multicore servers.
  • Platform Computing - Provides an Enterprise Class MapReduce solution for Big Data Analytics with high scalability and fault tolerance. Platform MapReduce provides unique scheduling capabilities and its architecture is based on almost two decades of distributed computing research and development.
  • Sematext International - Provides consulting services around Apache Hadoop and Apache HBase, along with large-scale search using Apache Lucene, Apache Solr, and Elastic Search.
  • Talend - Talend Platform for Big Data includes support and management tools for all the major Apache Hadoop distributions. Talend Open Studio for Big Data is an Apache License Eclipse IDE, which provides a set of graphical components for HDFS, HBase, Pig, Sqoop and Hive.
  • Think Big Analytics - Offers expert consulting services specializing in Apache Hadoop, MapReduce and related data processing architectures.
  • Tresata - Financial Industry's first software platform architected from the ground up on Hadoop. Data storage, processing, analytics and visualization all done on Hadoop.
  • WANdisco is a committed member & sponsor of the Apache Software community and has active committers on several projects including Apache Hadoop


What is Hadoop


         Apache Hadoop is an open-source software framework, written in Java by Doug Cutting and Michael J. Cafarella, that supports data-intensive distributed applications and is licensed under the Apache v2 license.  It supports the running of applications on large clusters of commodity hardware.  Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
         The Hadoop framework transparently provides both reliability and data motion to applications.  Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.  It provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.  Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.  This enables applications to work with thousands of computation-independent computers and petabytes of data.  The entire Apache Hadoop platform is commonly considered to consist of the Hadoop kernel, MapReduce and the Hadoop Distributed File System (HDFS), plus a number of related projects including Apache Hive, Apache HBase, Apache Pig, ZooKeeper, etc.
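To tie this together with the word count mapper and reducer sketched earlier, here is a minimal job driver; the input and output arguments are HDFS paths and are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output live in HDFS; the framework splits the input,
        // schedules map tasks near the blocks that hold the data, and
        // re-runs failed tasks on other nodes automatically.
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/raw/clicks
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/out/wordcount
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}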
 
 
 
 
