While a number of technologies fall under the Big Data label, Hadoop is the Big Data mascot. Remember, it stands front and center in the discussion of how to implement a big data strategy. The early adopters were Google, Facebook, LinkedIn and Yahoo, but it is now becoming mainstream. So what is it?
First, some context. Who here has big data? So, Big Data: what does it mean? Most often, big data refers to size; if it is data and it is big, it must be BIG DATA. However, there are other elements of big data not necessarily associated with size. The way I like to describe it is this: big data is anything about the data that challenges the constraints of a system's capability. For example:
-A 20 MB PowerPoint file you can't send via email; this could be big data.
-A 100 GB x-ray image you can't accurately display on a remote screen in real time for a consultation; this could be big data.
-A 1 TB movie whose edits you can't render within the time constraints of the business; this could be big data.
With the explosion of data in our environments, the likelihood of you having big data is higher than you think. So, who here now has big data? As you can see, in 2000 we generated 2 exabytes, or 2,000 PB, of new information in the year.
Fast forward to 2011, and we generated 2 exabytes of new information every day! That is huge, and most of it could be a result of big data.
This unstructured, file-based data comes from many sources such as email, audio, video, images, Word documents, machine-generated logs, and so on. This type of file-based data is growing at an especially accelerated rate and is expected to grow 50x in the next 10 years. So we have gone from 2 exabytes in a year in 2000, to 2 exabytes a day in 2011, and then 50x that in the next 10 years. To me, that means we need to do things differently; otherwise we won't be able to store that data, let alone extract any type of business value from it.
-Unprompted, an airline or a telephone company offers you a spiff, perhaps just a few days after you've had an unpleasant experience on one of their flights. How did they do that?
-A retailer seems to offer you products closer and closer to the ones you'd actually like to buy, and at prices better than you had seen in the store or on your favorite web site.
-Your physician is increasingly able to predict how you, individually, will respond to a particular course of treatment.
-You're seeing fewer and fewer dropped calls on your mobile.
-Your power company is now delivering exact assessments of how energy-efficient your dwelling is.
We arrive at what many have called the next great productivity revolution: the age in which research and the scientific process become automated, in the sense that so much data is available for analysis that we no longer need to speculate about what may be taking place; the trends, the hypotheses, are already produced by the data itself. (See the article here: http://www.wired.com/wired/issue/16-07) A story to illustrate: the Australian government recently announced funding for a National Mental Health Council. Their job is to collect, annually, all the metrics on mental health from around Australia (this is currently done on an ad-hoc basis by researchers, but now systematically). So they will gather data from the states on suicide rates, hospital admissions, substance abuse, medically reported rates of depression, and so on, all manually collated by emailing and asking for spreadsheets: a huge undertaking. The old way. Consider now that you can install an iPhone app that uses voice pattern recognition to determine your mood; that is, your mental health can be roughly determined by your smartphone listening to your intonation. So in the age of big data, all this manual assembly of data becomes redundant; the process can be automated. And not only do we leave behind the manual assembly of data, we can leave behind the process of speculation, hypothesis, and then testing the hypothesis: the mental health of all Australians who have a smartphone could be collected (this would need to be voluntary, obviously), and researchers could then dig into patterns of mood on a daily basis, across demographics, and so on. An outbreak of short-term depression across Queensland the day after the Maroons got beaten by the Blues, for example. The point is, there is a revolution in research productivity here. That is why Wired magazine called it the end of science.
Big Data is massive new data volumes. This is a typical Australian electricity bill. How often do I get one of these? (Guess.) Every 3 months. (Why? Because a person has to physically walk up to the meter and read it.) Very manual, and it can only be done a few times a year. Now, the electricity company has a data warehouse which captures all their billing data. They use it to analyse usage patterns across parts of the network (and not much else). Their data warehouse might be 3 terabytes. Not huge. This is a smart meter. The smart meter provides readings directly to the utility via wireless or mobile phone networks every 5 minutes. So instead of one reading per customer every three months, we can access a record per customer every 5 minutes, and the data has just grown roughly 3000 times. (That's big data.) So suddenly the utility, to retain the same level of customer data, needs 9 petabytes of data warehouse. Most of it becomes exhaust data, but there is tremendous value in this data if you can keep it and analyse it:
-Network load analysis over time
-Better decisions on where to increase network capacity
-Real-time alerts when a particular power node is approaching saturation
-You can also provide the information back to your customers
This is the SilverSpring web site, a screenshot taken about 2 weeks ago. Notice the promise here to the consumer: "See your energy use in real time." You only want to make that promise if your database can perform, if it can handle huge numbers of ad-hoc queries from consumers accessing your website 24x7. SilverSpring use Greenplum to capture all the readings from their smart meters into a database, and they make this data available to their customers.
An example of what consumer-facing real-time electricity usage looks like.
But it's not just for the consumers; the real value for the utility is what they can do with the data themselves:
-Predictive maintenance
-Usage trends over time, down to the suburb and street level
-Geo-spatial mappings over streets, looking at weather-related incidents, maintenance cost anomalies, and so on
-(I worked with a utility in Sydney where, using their data warehouse, they were able to identify some motors and pumps that were starting and stopping several times an hour, and others that only cycled once or twice a day or week. So instead of blindly sending around a maintenance crew every 3 months, they could maintain some pumps every month and other pumps only twice a year. Savings of several $M.)
And here are some examples of the kind of analytics that can be run against different types of data, and the kind of insights you can expect to gain. His point is that traditional data warehousing architectures don't cater for these types of analytics.
First, let's cover the name. Hadoop was created by Doug Cutting, who named his new project after his son's favorite toy, a stuffed elephant for which the boy had made up a name. The elephant's name was Hadoop! So it is not an acronym and it doesn't have a special hidden meaning; it is simply the name of a toy elephant.
I am sure we have all heard the hype. Hadoop is increasingly generating the attention of the press and the IT industry as a transformative technology that can be used to obtain a competitive advantage. If you read some of these quotes, you would be saying, "Where can I get one!"
On the flip side, when you dig a little deeper you find some hidden facts. Although Hadoop does have real momentum in the market today, as it says on the slide, only a few enterprise vendors have adopted Hadoop. Companies may say they have it on their fast track, or are evaluating it, but as we shall soon find out, traditional Hadoop is difficult and lacks some of the features, functionality, and deployment options needed to be easily adopted by enterprises.
While a number of technologies fall under the Big Data label, Hadoop is the Big Data mascot. Remember, it stands front and center in the discussion of how to implement a big data strategy. The early adopters were Google, Facebook, LinkedIn and Yahoo, but it is now becoming mainstream. So what is it?
Trying to understand Hadoop concepts in the context of the wrong architecture design can actually hurt our brains. Hadoop was not designed with the typical enterprise environment in mind. Hadoop was designed for a cluster architecture built out of commodity x86 hardware with DAS, using open source software based on papers written by Google. Why base Hadoop on a cluster architecture?
-Linearly and horizontally scalable
-Open source software based on papers written by Google
-Able to store and query massive amounts of unstructured data
-Fault tolerant, reliable storage
-Inexpensive to build and maintain
-Hides complexity behind a common interface
These are the elements that were a critical part of the Hadoop design.
Here it is, this is Hadoop in all its glory. Hadoop harnesses the cluster architecture by addressing two key points: the layout of the data across the cluster, ensuring that data is evenly distributed; and then presenting that evenly distributed data back to the applications so they can benefit from data locality. That brings us to Hadoop's two main mechanisms:
-Hadoop Distributed File System (HDFS)
-Hadoop MapReduce
HDFS is the distributed file system that lets Hadoop scale across commodity servers. That part is easy: we all know what filesystems are, and HDFS is no different; it stores data across all nodes that participate in a Hadoop cluster. MapReduce is the parallel-processing engine that allows Hadoop to churn through that data quickly. These are the two must-have components for any Hadoop cluster, and together they take care of all the complexity of leveraging the parallel processing power of a cluster for us. You now know what Hadoop is at a high level; let's look at the detail, starting with a quick sketch of what a MapReduce job looks like in practice.
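To make MapReduce concrete, here is a minimal word-count job written for Hadoop Streaming in Python. It is a sketch only: the script names and any paths are illustrative, and the exact location of the streaming jar depends on your distribution.

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))

#!/usr/bin/env python
# reducer.py: sum the counts for each word.
# Hadoop Streaming sorts mapper output by key before the reducer runs,
# so all lines for the same word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

You would copy the input files into HDFS with hadoop fs -put, then submit the job with hadoop jar and the hadoop-streaming jar, passing mapper.py and reducer.py via the -mapper and -reducer options. Hadoop runs the mappers on the nodes that already hold the data blocks, which is the data locality benefit mentioned above.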
In a traditional Hadoop architecture, we can now visualise the major components:
1. MapReduce (JobTracker and TaskTrackers)
2. NameNode and Secondary NameNode (an HDFS NameNode stores the edit log and the filesystem image)
3. DataNodes (run on the slaves)
In a traditional architecture the compute nodes and the storage nodes can be one and the same. A toy sketch of how the NameNode and DataNodes divide the work follows below.
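As a rough way to picture the NameNode/DataNode split, here is a toy Python sketch. It is not real Hadoop code; the class and method names are invented for illustration. The NameNode keeps only metadata, which blocks make up a file and which DataNodes hold each block, while the DataNodes hold the actual bytes.

import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size (64 MB)

class DataNode:
    """Holds raw block data; knows nothing about files."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    """Holds only metadata: file name -> list of (block_id, DataNode)."""
    def __init__(self, datanodes):
        self.placement = itertools.cycle(datanodes)   # naive round-robin placement
        self.namespace = {}       # file -> [(block_id, DataNode), ...]
        self._next_id = 0

    def write(self, filename, data):
        blocks = []
        for offset in range(0, len(data), BLOCK_SIZE):
            dn = next(self.placement)
            block_id = self._next_id
            self._next_id += 1
            dn.store(block_id, data[offset:offset + BLOCK_SIZE])
            blocks.append((block_id, dn))
        self.namespace[filename] = blocks

    def read(self, filename):
        # A client asks the NameNode where the blocks live,
        # then fetches the bytes directly from the DataNodes.
        return b"".join(dn.blocks[bid] for bid, dn in self.namespace[filename])

The sketch deliberately ignores replication and failure handling, which is exactly where the challenges on the next slide come from.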
So now that we are clear on what Hadoop is and how it fits within an environment, it does come with some challenges:
-NameNode failover is not seamless
-Poor utilisation of storage and CPU resources in Hadoop clusters
-Inefficient data staging and loading processes
-Backup and disaster recovery are missing
-Servers with Direct Attached Storage (DAS) become islands of storage (remember the 90s?)
-Data protection: 3x mirror of all data (see the capacity sketch below)
-Data ingest: difficult and tool dependent (no common protocol access)
-Scaling: add more servers with DAS
-Single point of failure: the NameNode
-Replication: none
-Data recovery: recreate data from other sources
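To put the 3x mirroring in perspective, here is a quick back-of-envelope sketch in Python. It reuses the 9 PB smart-meter figure from earlier; the triple replication factor is the classic HDFS default (dfs.replication), but treat the numbers as illustrative.

# Rough cost of HDFS's default 3x replication for a given logical data set.
PB = 1000 ** 5                      # 1 petabyte in bytes (decimal)
logical_data = 9 * PB               # e.g. the smart-meter warehouse from earlier
replication_factor = 3              # classic HDFS default (dfs.replication)

raw_capacity_needed = logical_data * replication_factor
print("Logical data:       %5.1f PB" % (logical_data / PB))
print("Raw disk required:  %5.1f PB" % (raw_capacity_needed / PB))
print("Usable fraction:    %.0f%%" % (100 / replication_factor))

In other words, for every petabyte of data you want to keep, you buy roughly three petabytes of disk, which is part of the poor storage utilisation called out above.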