Release Ultimately, big data is more about attitude than tools; data-driven organizations look at big data as a solution, not a problem.



Release 2.0, Issue 11, February 2009

"Ultimately, big data is more about attitude than tools; data-driven organizations look at big data as a solution, not a problem."
Roger Magoulas and Ben Lorica, from "Big Data: Technologies and Techniques for Large-Scale Data," page 32

Release 2.0, Issue 11, February 2009

Published six times a year by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA. This newsletter covers the world of information technology and the Internet and the business and societal issues they raise.

Executive editor: Tim O'Reilly
Editor and publisher: Sara Winge
Art director: Mark Paglietti
Copy editor: Sarah Schneider
Contributing writers: Brady Forrest, Jerry Michalski, Sarah Milstein, Peter Morville, Nathan Torkington, David Weinberger

© 2009, O'Reilly Media, Inc. All rights reserved. No material in this publication may be reproduced without prior written permission; however, we gladly arrange for reprints, bulk orders, or site licenses. Individual subscriptions cost $495 per year.

Contents

Big Data: Technologies and Techniques for Large-Scale Data, by Roger Magoulas and Ben Lorica
Preface: Stories from the Field
Introduction to Big Data
Why Big Data Matters
Big Data Technologies
Massively Parallel Processing (MPP)
Column-Oriented Databases
How Column Stores Work
MapReduce
Key Technology Dimensions
Single Server and Distributed Data/Parallel Processing Clusters
A Data Architecture for Fast Platforms
Data Partitioning
MapReduce and SQL
Relational and Key/Value Pairs
Reliability and Resilience
Hardware Options
Big Data Tool Feature Grid
Big Data Roadmaps
How Science Handles Big Data
Disclosure
Boldly Going Where No Data Has Gone Before
Acknowledgments
Calendar

Subscription information: Release 2.0, PO Box, North Hollywood, CA; customer service

Roger Magoulas is the Director of Research at O'Reilly. Ben Lorica is a Senior Analyst in O'Reilly's Research group.

Big Data: Technologies and Techniques for Large-Scale Data

Preface: Stories from the Field

You take a leave of absence from an organization known for handling big data to work on the data analysis systems for the Obama campaign. You're faced with one big server and five terabytes of messy voter registration data from multiple sources in multiple formats. You're tasked with optimizing "get out the vote" efforts by finding out who has already voted and removing those names from the call-bank lists used by canvassers, in real time, on election day. With only a few weeks to build the system, you assemble a small team of people comfortable with several aspects of big data management, i.e., the size and state of the data, analytics, and serving the data to many users and many devices. In the end, while there are problems on election day, you are able to clean 1.6 million voters from the call lists the campaign distributes to canvassers that afternoon, making those lists 25% shorter, on average. Your knowledge and experience with big data management make a complex task manageable in a tight timeframe with a small team. And you spare 1.6 million supporters an unnecessary phone call.

Alternatively, you work for a large social networking website and you're tasked with creating an analysis infrastructure that serves a unique group of users. The data is already large and growing fast. There's no real business plan and the data is interesting beyond just its business insights; there's fodder for real sociology research in your social graph. There's no time and no guidance, but everyone thinks the data is important. You determine that the Hadoop implementation of MapReduce provides the scaling, performance, and flexibility you need. There's no requirement to predefine a schema, so you can just throw data into the Hadoop platform. Your developers and analysts can build and use the various access points, with the help of a few tutorials.
Over time, the most effective and repeatable analysis patterns become clear and you start to develop access tools that support different classes of users, e.g., programming language APIs for developers and SQL-like access for analysts. Your now data-driven organization has the infrastructure to capture all the data generated by your website, run experiments and one-off analyses, and identify opportunities to build more formal, repeatable data analysis interfaces when needed.

We heard these stories from folks doing leading-edge big data implementations, and their experiences aren't isolated or unusual. More and more organizations are facing the challenges and opportunities of working with big data, and they're transforming themselves in the process. We've seen that the organizations that best handle their big data challenges can gain competitive advantage and improve their product and service offerings.

For those organizations facing new challenges and new opportunities regarding big data, we present a roadmap of choices and trade-offs for large-scale data management. There's a lot to make sense of and many competing perspectives. The high-level descriptions and guidance regarding what to consider can inform a deeper dive into making decisions about your big data environment.

Introduction to Big Data

Big Data: when the size and performance requirements for data management become significant design and decision factors for implementing a data management and analysis system. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.

We're at the front edge of a data deluge, brought on by new, pervasive data sources. A few years ago, a retailer analyzing thousands of T-shirt sales to discern customer behavior thought it was dealing with big data.
Today, social networking companies examine hundreds of millions of personal interactions to identify social trends and relationships, and energy utilities plow through petabytes of sensor data to understand use trends and demand projections. Given the scale of today's data sets, traditional approaches to data acquisition, management, and analysis don't always measure up.

Over the past year, we've noticed a number of faint signals, from the people we talk to and the data we research, that making sense of large-scale data stores is increasingly interesting and important. What we find most notable is the broad array of organizations, from large enterprises and government agencies down to startups, that are tackling big data; the size of the organization no longer directly correlates with the size of the data challenges.

We're also seeing the role of data becoming more central to business strategy. Companies like Google (where analytics is at the heart of how they manage ad revenue), Facebook (which is attempting to harness the power of data on its social graph of users to develop its business plan), or Twitter (focusing on analyzing its micro-messaging data as the basis for a business model) are examples of companies organized around data insight. Vendors and various open source communities are responding with a new set of tools and techniques to handle this emerging focus on data.

We see new big data challenges, growing interest in the topic, and an increasingly diverse set of tools available to address these challenges. Big data is a big topic. To help make sense of all that big data entails, we divide the topic into three broad activities:

- Data acquisition
- Data management
- Analysis and insight

Data acquisition (whether from data-collecting sensors, increasingly computerized systems, web content, telematics, social networks, or ubiquitous computing) leads to the need to store and manage more data, data that can become valuable with access and iterative, repeatable analysis. This puts data management at the center of big data: scaling to acquire more data and providing fast, convenient access and sophisticated analysis to all that data.

In this report we focus on data management as a critical link in the big data story. We'll investigate different approaches to handling large-scale data, describe the technology, identify key trade-offs, and address resource requirements.

Why Big Data Matters

We believe that organizations need to embrace and understand data to make better sense of the world.
(We believe it so much that O'Reilly co-sponsored an unconference covering Collective Intelligence topics, e.g., data mining and analytics around human behavior.)

Big data matters because:

- The world is increasingly awash in sensors that create more data: both explicit sensors like point-of-sale scanners and RFID tags, and implicit sensors like cell phones with GPS and search activity.

Key Takeaway: Building and making sense of massive databases is the core competency of the information age. Being better at data is why Google beat Yahoo! and Microsoft in search and one reason why Barack Obama beat John McCain. A bad economy accelerates the importance of big data; companies without big data competencies will be left behind.

- Harnessing both explicit and implicit human contribution leads to far more profound and powerful insights than traditional data analysis alone, e.g.:
  - Google can detect regional flu outbreaks seven to ten days faster than the Centers for Disease Control and Prevention by monitoring increased search term activity for phrases associated with flu symptoms. [M. Helft, "Google Uses Searches to Track Flu's Spread," New York Times, 11/11/2008]
  - MIT researchers were able to predict location and social interactions by analyzing patterns in geo/spatial/proximity data collected from students using GPS-enabled cell phones for a semester. [N. Eagle and A. Pentland, "Reality mining: sensing complex social systems," Personal and Ubiquitous Computing, Vol 10, #4]
  - IMMI captures media rating data by giving participants special cell phones that monitor ambient noise and identify where and what media (e.g., TV, radio, music, video games) a person is watching, listening to, or playing. [J. Pontin, "Are Those Commercials Working? Just Listen," New York Times, 9/9/2007]
- Competitive advantage comes from capturing data more quickly, and building systems to respond automatically to that data. The practice of sensing, processing, and responding (based on pre-built models of what matters, "the database of expectations," so to speak) is arguably the hallmark of living things. We're now starting to build computers that work the same way. And we're building enterprises around this new kind of sense-and-respond computing infrastructure.
- As our aggregate behavior is measured and monitored, it becomes feedback that improves the overall intelligence of the system, a phenomenon Tim O'Reilly refers to as harnessing collective intelligence.
- With more data becoming publicly available (from the Web, from public data sharing sites like Infochimps, Swivel, and IBM's Many Eyes, from increasingly transparent government sources, from science organizations, from data analysis contests such as Netflix's, and so on), there are more opportunities for mashing data together and open sourcing analysis. Bringing disparate data sources together can provide context and deeper insights than what's available from the data in any one organization.
- Experimentation and models drive the analysis culture. At Google, the search quality team has the authority and mandate to fine-tune search rankings and results. To boost search quality and relevancy, they focus on tweaking the algorithms, not analyzing the data.

- Models improve as more data becomes available, e.g., Google's automatic language translation tools keep getting better over time as they absorb more data*.
- Models and algorithms become the focus, not data management.

* FOOTNOTE: "How Google translates without understanding; Most of the right words, in mostly the right order," by Bill Softky, The Register, May 15, google_translation/

Big data repositories provide the opportunity, via analysis, for insights that can help you understand and guide your organization's activities and behaviors. You can improve results by combining more data from more sources with more sophisticated analysis and models.

The power of big data needs to be tempered with the responsibility of protecting privacy and civil liberties, preventing sensitive data from getting hacked or inappropriately shared, and treating the people generating the data fairly. The insights gained from big data can be used to improve products and customer service, but they can also be used in ways that creep out customers and make them feel uncomfortable or watched. The industry doesn't have all the answers, e.g., academic research shows it's difficult to create truly anonymous data. There are techniques, such as aggregating data beyond the level of an individual, that can protect privacy while still allowing insightful analysis. Although it's beyond the scope of this report to address privacy issues in detail, you'll need to consider them as you work with big data.

Big Data Technologies

Key Takeaway: Big data is driving new approaches: MPP, MapReduce, and column-oriented data are all becoming essential parts of the database toolkit. The relational model is no longer the only database that matters. For many problems, MapReduce-style processing is superior. What's more, it's easier for many programmers to understand and implement than SQL.

"Parallelism is the answer to big data challenges: it lets you divide and conquer, and it's built to scale."

Surveying the technology around big data, there are three fundamental strategies for storing and providing fast access to large data sets:

- Improved hardware performance and capacity
  - Faster CPUs
  - More CPU cores (requires parallel/threaded operations to take advantage of multicore CPUs)
  - Increased disk capacity and data transfer throughput
  - Increased network throughput
- Reducing the size of data accessed
  - Data compression
  - Data structures that, by design, limit the amount of data required for queries (e.g., bitmaps, column-oriented databases)
- Distributing data and parallel processing
  - Putting data on more disks to parallelize disk I/O
  - Various schemes to put slices of data on separate compute nodes that can work on these smaller slices in parallel; includes custom hardware and commodity server architectures
  - Massively distributed architectures lead to an emphasis on fault tolerance and performance monitoring: as the number of nodes in a cluster increases, failures or slowdowns are inevitable and the system needs resiliency to recover from faults in a reliable manner
  - Higher-throughput networks to improve data transfer between nodes

These three technology strategies undergird the discussion that follows. Because they are not discrete, mutually exclusive approaches, we don't offer simple apples-to-apples comparisons; each, separately and in combination, can provide effective platforms for handling big data challenges.

Massively Parallel Processing (MPP)

Key Takeaway: Massively Parallel Processing (MPP) can dramatically improve query and load performance for all data types. It works with existing relational tools and infrastructure, and you can throw hardware at a cluster to improve performance.

The MPP relational/SQL database architecture spreads data over a number of independent servers, or nodes, in a manner transparent to those using the database. We focus on analytic MPP systems usually called shared-nothing databases, as the nodes that make up the cluster operate independently, communicating via a network but not sharing disk or memory resources (see sidebar). With modern multi-core CPUs, MPP databases can be configured to treat each core as a node and run tasks in parallel on a single server. By distributing data across nodes and running database operations across those nodes in parallel, MPP databases are able to provide fast performance even when handling very large data stores.
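As a rough illustration of the shared-nothing idea described above, the sketch below hashes rows onto a handful of in-memory "nodes", lets each node aggregate only its own slice, and then merges the partial results. It is a toy model, not how any particular MPP product works: the node count, the `order_id` distribution key, the sample rows, and all function names are assumptions made for the example, and threads stand in for separate servers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_NODES = 4  # hypothetical cluster size


def node_for(key, num_nodes=NUM_NODES):
    """Pick a node from a stable hash of the distribution key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_field, num_nodes=NUM_NODES):
    """Place each row on exactly one node; nodes share nothing."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key_field], num_nodes)].append(row)
    return nodes


def local_sum(node_rows):
    """Work one node does alone: SUM(amount) GROUP BY region on its slice."""
    totals = defaultdict(int)
    for row in node_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


def parallel_sum(nodes):
    """Run the local aggregation on every node, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(local_sum, nodes))
    merged = defaultdict(int)
    for partial in partials:
        for region, total in partial.items():
            merged[region] += total
    return dict(merged)


# Hypothetical sample data: 1,000 orders alternating between two regions.
rows = [
    {"order_id": i, "region": ("east", "west")[i % 2], "amount": i}
    for i in range(1000)
]
nodes = distribute(rows, "order_id")
result = parallel_sum(nodes)
print([len(n) for n in nodes])  # rows held by each node
print(result)
```

Each node touches only the rows it holds, so adding nodes shrinks the slice each one has to scan; the final merge step plays the part of the coordinating node that aggregates results.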
The massively parallel, or shared-nothing, architecture allows MPP databases to scale performance in a near-linear fashion as nodes are added, i.e., a cluster with eight nodes will run roughly twice as fast as a cluster with four nodes for the same data.

The collection of servers that make up an MPP system is known as a cluster. Within an MPP cluster there are two topologies:

- Master node: a single point for all cluster connections, for aggregating results and orchestrating activities on the rest of the nodes in the cluster.

- Peer architecture: all nodes in the cluster can be used to connect to data, and can aggregate results and coordinate activity across the cluster.

The topologies represent trade-offs involving the number of connections, ease of adding and removing compute or data nodes, and availability. Peer architectures offer more flexibility with more coordination overhead than master node architectures.

A key to MPP performance is distributing the data evenly across all the nodes in the cluster. This requires identifying a key whose value is random enough that, even over time, the data does not concentrate in one or a subset of nodes. The MPP databases have algorithms that help keep the data distributed; in practice, using fields like dates or states for distribution may lead to a skewed data distribution and poor performance.

MPP databases are available in three flavors: tightly coupled with proprietary hardware as a data appliance, loosely coupled with a specially configured server platform as a data appliance, and independent as software. Data appliances help reduce installation and configuration complexity for clients and help vendors optimize performance; they are a popular option for MPP databases (see the Hardware Options section for more detail). The focus of modern shared-nothing MPP system vendors is on easy implementation and operations coupled with SQL compliance.

MPP systems can create more work for system administrators and designers:

- The need to administer all server nodes in a cluster and a dedicated network (more parts to break and take care of)
- More complex database performance monitoring and problem resolution
- Identifying keys that evenly distribute the data across the nodes in the clusters, especially over time and to support joins
- Rebalancing/redistributing data if the number of nodes in the cluster is changed (some MPP systems offer options to help redistribute data to new nodes)

Sidebar: MPP Architectures

There are shared-everything architectures for MPP from Oracle (RAC) and IBM.
Shared-everything is optimized to support OLTP (transactional) operations and requires synchronization overhead between nodes to ensure data integrity (i.e., transactions are atomic and complete with no collisions). Shared-nothing MPP databases are optimized for read-intensive tasks associated with analysis.

Shared-nothing MPP databases are not new. There were versions available from a number of vendors in the mid-'80s. In the '90s, NCR's Teradata became the most prominent MPP database vendor, selling an MPP appliance to mostly enterprise customers. Using custom hardware, Teradata's appliance became a popular choice for handling big data needs that extended beyond what one machine could handle.

In recent years, a number of vendors have emerged with shared-nothing MPP architectures, including Netezza, Greenplum, Kognitio WX2, and Aster nCluster. These new entrants compete by offering better value, either via running on commodity servers or clever hardware configurations or by removing restrictions on data warehouse design.

Shared-nothing MPP databases are not designed for OLTP workloads and they don't have the connections, robust transaction support, and other features associated with transactional systems.

MPP database advantages:

- Fast query and load performance; scalable by throwing more hardware at the cluster
- Standard SQL
- Easy integration with ETL (extract/transform/load), visualization, and display tools
- No new skills required for SQL or SQL abstraction layer-savvy developers

- Generally fast to install, configure, and run
- Parallelization available via standard SQL; no special coding required

Other considerations:

- Performance is affected by the choice of distribution keys; skewed data distributions slow queries
- Hardware costs and energy costs, even with low-cost commodity servers
- Limits to the number of connections (hundreds of users at the upper limit)
  - Managing simultaneous operations can be difficult, as orchestrating parallel operations is complex and analysis environments tend to run many table scans
- Queries with complex, multi-table joins can run slowly when too much data needs transferring between nodes
  - Attention to distribution keys can improve join performance by co-locating commonly joined data
  - MPP vendors concentrate their engineering effort on reducing and speeding up interconnect traffic and on co-locating related data on the same nodes
- Data needs to be redistributed and rebalanced when new nodes are added
- MPP databases benefit from multi-core CPUs (parallelism is available for each core), large amounts of RAM, and direct-attached disks
- Clusters of heterogeneous nodes are limited by the performance of the slowest node with evenly distributed data
- Check with vendors on recommended network options; some MPP database vendors suggest high-speed networks for optimal performance

MPP databases are generally sold in two flavors:

- As an appliance with hardware and software bundled together
- As software that runs on commodity hardware. Software MPP databases can be packaged with commodity hardware and sold as an appliance.

The feature set for MPP databases is evolving to include MapReduce and column-oriented storage, putting the MPP database at the center of a big data architecture that supports multiple operating modes. Here's a summary of recently introduced or planned features for MPP databases:

- Integrating MapReduce with the database
- Adding column-oriented storage options
- Increased monitoring and adaptive operations

- Support for increasing the number of simultaneous connections, via improved queueing algorithms
- Fast data loading via direct writes to files
- Options for rebalancing data
  - Making new compute nodes available before the data is rebalanced
  - Background rebalancing (to avoid a complete dump and reload of data; not a simple task when dealing with terabytes or more of data)
- Options to keep data in-memory to increase performance
- More compression options, geared towards reducing the amount of storage needed while minimizing the impact on load and query performance, a tricky balancing act that depends on data and system load

MPP databases are a relatively easy transition for an organization already steeped in relational database technologies and resources.

Column-Oriented Databases

Relational Database Management Systems (RDBMSs) typically store table data as rows, i.e., all the columns associated with a row are stored and retrieved together regardless of the number of columns in the row used. In a column-oriented database, the data is stored by columns, and, when possible, turned into bitmaps or compressed in other ways to reduce the amount of data stored (see sidebar). Compressing columns reduces how much data needs storing; the combination of compressed data and retrieving only the columns requested speeds query performance by reducing the amount of I/O required and increasing the amount of query data that can be stored in fast memory.

The techniques for reducing the data footprint of column data work best on integer columns with few distinct values. More complex data types and more complex relationships between columns reduce compression opportunities, increasing data sizes and slowing query performance.

Column-oriented databases are relational, using SQL as the language for accessing and manipulating data, and the same set-theory concepts that undergird conventional RDBMSs, i.e., tables can be joined, filtered, grouped, and ordered. The column orientation exists under the covers for most users of the database.
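To make the row-versus-column contrast concrete, here is a small sketch that stores the same hypothetical table both ways and counts how many stored values a single-column aggregate has to touch in each layout. The table contents and function names are invented for illustration, and a count of values touched is only a stand-in for the I/O a real database would perform.

```python
# Hypothetical order table, stored row by row.
rows = [
    {"id": 1, "state": "CA", "qty": 3, "note": "gift wrap"},
    {"id": 2, "state": "CA", "qty": 1, "note": ""},
    {"id": 3, "state": "NY", "qty": 2, "note": "rush"},
    {"id": 4, "state": "NY", "qty": 5, "note": ""},
]


def to_columns(table_rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in table_rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def row_store_sum(table_rows, field):
    """Sum one field in a row layout; every column of every row is read."""
    total = 0
    values_touched = 0
    for row in table_rows:
        values_touched += len(row)  # the whole row comes back together
        total += row[field]
    return total, values_touched


def column_store_sum(columns, field):
    """Sum one field in a column layout; only that column is read."""
    column = columns[field]
    return sum(column), len(column)


columns = to_columns(rows)
print(row_store_sum(rows, "qty"))        # total and values touched, row layout
print(column_store_sum(columns, "qty"))  # total and values touched, column layout
```

Both layouts return the same total, but the column layout reads one value per row instead of four. The gap widens as tables gain columns that a given query never asks for.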
Holding the columns together into table entities requires join indices, an extra layer of storage and abstraction compared to traditional row-oriented relational databases. This difference is felt most by designers and administrators, as they need to map query patterns and requirements to column index and compression strategies.

Key Takeaway: Column stores are impressively fast when used with the right type of data, e.g., time series and trial data. They've gained adherents in the past two years, and are overcoming their undeserved reputation for requiring extra engineering effort.

Because the different column index strategies have different performance characteristics, designers create multiple indexes on the same column and let the optimizer pick the best choice. Using multiple indexes increases how much data needs to be stored, which limits the overall reduction in data size from compressing the columns.

Column-oriented databases provide fast query performance for analysis environments with mostly integer data, e.g., time series and bioinformatics, and when most queries focus on a single column or a small subset of columns. The advantages of column-oriented DBMSs diminish for queries that require many columns and complex table joins, due to the extra overhead of bringing all the columns together. The column-oriented database folks we interviewed designed around complex table joins to avoid the impact on performance.

Column-oriented databases speed performance by limiting the amount of data needed to process queries. With less data to move around, disk throughput and latency become less critical and network storage devices* become an option. Network storage devices provide the following scaling options: 1) increase storage capacity by adding disks to the storage unit; 2) improve disk I/O by adding more disk spindles to increase parallel reads; 3) increase data throughput between the server and storage unit with a high-speed, dedicated network (e.g., Fibre Channel); 4) add compute capacity with reader servers attached to the storage unit (additional servers connect to a single storage device).

Column-oriented databases have been around for more than a decade, first popularized by Sybase IQ and joined by a number of commercial and open source offerings in the last few years.

Column-oriented databases are designed to be easy to install.
Resource impacts include:

- DBAs and designers need to determine the index strategy for each column; vendors provide index analysis tools that recommend index strategies based on column data
- Developers and analysts should be aware that query performance tuning includes limiting the number of requested columns
- Using network storage devices requires more attention to network and storage unit administration

* FOOTNOTE: We are using "network storage devices" as a generic term for two types of shared, network-attached storage devices: Storage Area Networks (SAN) and Network-Attached Storage (NAS).

Column-oriented database advantages:

- Fast read query performance: data size, memory requirements, and I/O are reduced by data compression and by only accessing requested columns
- Standard SQL integrates with relational database tools and interfaces for database design, data access, query, and analysis
- Compression and fast query performance can allow a smaller hardware platform
- Works best with integer data and queries that access one or a few columns, e.g., time series, finance data, sensor data, or bioinformatics
- Compression reduces the amount of disk storage required
  - Best compression with low cardinality (data with few distinct values), integer data
  - Compression gains can be partially offset by the need for extra indexes to support different query requirements and high cardinality data
- Architectures are available for handling many connections

Other considerations:

- Compressing data and index building can slow write performance
  - Vendors all have fast data-loading options and focus engineering resources on improving write performance
- Performance is impacted by large unstructured text, complex join logic, and high cardinality data (i.e., columns with many distinct values)
- Scaling for performance and data size can be complex, requires planning, depends on hardware topology and DBMS options, and can be expensive
  - Column-oriented databases perform best on systems with lots of RAM and multi-core CPUs
  - For systems with a network storage device, adding disk spindles, RAM caching, and high-throughput, dedicated connections between the storage unit and the server can help improve performance
  - Scaling MPP column-oriented databases is similar to scaling row-oriented MPP databases; for MPP architecture, direct-attached disks are recommended

Future features of column-oriented DBMSs:

- MPP configurations
- Faster data-loading algorithms
- Increased support for text and unstructured data

Column-oriented DBMSs create performance advantages on a given hardware platform by reducing the amount of data that needs to be processed. With the right data environment, column-oriented DBMSs can be an option to meet performance requirements while maintaining a fit with existing relational database tools and infrastructure, especially for organizations trying to minimize the number of servers required to handle a given data volume.

How Column Stores Work

Understanding how columns can be compressed or turned into bitmaps helps explain the technology underlying column-oriented databases or adding column-oriented support to other database schemes. These examples show how column databases manage column data for compression and fast reads (see "C-Store: A column-oriented DBMS," Proceedings of the 31st VLDB Conference, 2005, by Stonebraker, Abadi, et al.). Commercial column-oriented databases also have more complex encoding and compression schemes than those outlined below.

For data that can be ordered and has few distinct values, the column can be encoded by storing the value, when the value first appears, and how many times the value appears, reducing the number of rows required to represent a column of many rows to one row for each value. The original figure shows how the original values map to the new (v, s, n) encoding (v: value, s: start, n: number of times v appears). The value 303 is stored as (303, 7, 5), i.e., the first instance of the value 303 is row seven and the value appears five times.
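The (v, s, n) encoding just described can be sketched in a few lines. This is a minimal illustration, not the encoding any particular product uses: the sample column is invented so that the value 303 starts at row seven and runs for five rows, matching the (303, 7, 5) example in the text, and the function names are assumptions.

```python
from itertools import groupby


def rle_encode(column):
    """Encode a sorted column as (value, start, count) triples.

    start is the 1-based row number where the run of value begins.
    """
    triples = []
    start = 1
    for value, run in groupby(column):
        count = sum(1 for _ in run)
        triples.append((value, start, count))
        start += count
    return triples


def rle_decode(triples):
    """Rebuild the original column from its (value, start, count) triples."""
    column = []
    for value, _start, count in triples:
        column.extend([value] * count)
    return column


# Hypothetical sorted column with four distinct values over twelve rows.
column = [301] * 2 + [302] * 4 + [303] * 5 + [304] * 1

encoded = rle_encode(column)
print(encoded)
# [(301, 1, 2), (302, 3, 4), (303, 7, 5), (304, 12, 1)]
```

Twelve stored rows shrink to four triples, and `rle_decode(encoded)` returns the original column. The saving depends on long runs, which is why the technique suits sorted columns with few distinct values.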

When the column sort depends on a foreign column, the column can be encoded with the value and a bitmap of the relative position of where the value is stored. The bitmaps are typically sparse and can be more efficiently stored and indexed. The original diagram shows the original, unsorted column data and the map to the value bitmap. In this example, the bitmap for the value 303 has bits set in the 3rd and 9th positions, i.e., the value 303 occurs in the 3rd and the 9th positions of the column.

These compression techniques lend themselves best to integer data and to columns with low cardinality (i.e., columns with few distinct values). Internal dictionaries and other lookup strategies can be used that process text fields more like integers. Column store strategies don't work as well on columns with many distinct values or on unstructured data.

MapReduce

Key Takeaway: MapReduce is the next, new thing in big data, scaling to meet the biggest data processing environments, e.g., petabytes at Google, Yahoo!, Facebook.

In the last few years, we've been hearing about and have been intrigued by the buzz surrounding MapReduce from startups and large technology companies. Many of the data-intense startups we meet with are using or plan to use Hadoop for some or all of their data management. We see the enthusiasm about MapReduce coming from:

- The massive scale of data Google is processing with their MapReduce infrastructure as proof the technology works

- The success of Hadoop at Facebook, Yahoo!, The New York Times, and others
- The availability of Amazon Web Services (AWS) as a convenient and cheap platform for trying MapReduce
- Small organizations looking for an affordable, scalable way to manage and analyze big data
- The expense of commercial big data products

MapReduce refers both to a style of programming and to a parallel data processing engine for managing large-scale, distributed data.

As a style of programming, the concept is based on combining functions (map and reduce) common to many programming languages. The map function performs filtering or transformations before creating output records as a key/value pair. A split function, most typically a hash (but any deterministic function that guarantees the data always lands on the same server will work), distributes these records for storage to the servers that make up the system. The reduce function performs some type of aggregate function on all the records in the bucket associated with a key. The map functions can be distributed to run on different nodes in a cluster, with each map given a portion of the input data to process. By analogy to SQL, the map is like group by and the reduce is like an aggregate function (e.g., sum or count) for an aggregate query. MapReduce programming can be applied in other contexts, for example, against a distributed relational database (both Aster Data and Greenplum have released versions of their MPP databases with MapReduce functionality) or key/value pair data stores like CouchDB.

As a parallel data processing engine, MapReduce is most closely associated with Google's implementation and with Hadoop, the open source clone of MapReduce supported by Yahoo!, Facebook, Cloudera, and others. Google implemented MapReduce as a stack that was then used as the inspiration for Hadoop.
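The programming style described above (map, then split, then reduce) can be sketched in a single process. This is a toy model and not Hadoop's or Google's API: the record layout, the three-server count, and the names `map_fn`, `split_fn`, and `reduce_fn` are assumptions chosen for the example, which computes the equivalent of a SQL `SELECT state, SUM(amount) ... GROUP BY state`.

```python
from collections import defaultdict
import zlib


def map_fn(record):
    """Filter and transform one input record, emitting (key, value) pairs."""
    state, amount = record
    if amount > 0:  # filtering step: skip empty sales
        yield state, amount


def split_fn(key, num_servers):
    """Deterministic hash so a given key always lands on the same server."""
    return zlib.crc32(key.encode("utf-8")) % num_servers


def reduce_fn(key, values):
    """Aggregate every value in the bucket associated with one key."""
    return key, sum(values)


def map_reduce(records, num_servers=3):
    # Each "server" holds buckets of values, one bucket per key it owns.
    servers = [defaultdict(list) for _ in range(num_servers)]
    for record in records:
        for key, value in map_fn(record):
            servers[split_fn(key, num_servers)][key].append(value)
    # Every server reduces its own buckets independently.
    results = {}
    for buckets in servers:
        for key, values in buckets.items():
            reduced_key, total = reduce_fn(key, values)
            results[reduced_key] = total
    return results


# Hypothetical input: (state, sale amount) records.
sales = [("CA", 10), ("NY", 5), ("CA", 7), ("TX", 0), ("NY", 1)]
totals = map_reduce(sales)
print(totals)
```

Because the split function is deterministic, all values for one key end up in one bucket on one server, so each reduce call sees the complete set of values for its key. In a real cluster the map and reduce calls would run on different machines; here they run in one loop to keep the data flow visible.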
Since much of the Hadoop documentation references the Google MapReduce stack, we describe the stack in more detail:

- Google File System (GFS): GFS triplicates all data across a cluster. If a node in the cluster becomes unavailable, the data it held is then automatically replicated.
- BigTable: a distributed data storage system with a multidimensional data structure. BigTable uses the terms rows and columns differently than they are used in relational databases; the rows and columns are really elements of a map (hash in Perl or Ruby, dictionary in Python, associative array in PHP, object in JavaScript), and BigTable is described as a multidimensional sparse map. A spreadsheet analogy may help: values are accessed by using row and column as an index. All columns are timestamped to allow data versioning, with automatic retrieval of the most recent items. Users often serialize fields into a single column, creating a simpler key/value pair data structure that they can deserialize on retrieval. A complete description of the BigTable data structure is beyond the scope of this report. For more in-depth information, see "Bigtable: A Distributed Storage System for Structured Data" by Chang, Dean, et al. (http:// for the official explanation, or, for a simpler explanation, see Jim Wilson's "Understanding HBase and BigTable," at Hbase_and_BigTable.
- MapReduce: a client for performing parallel MapReduce on data stored in BigTable tables on GFS
- Sawzall: a higher-level query and analysis language that runs on top of MapReduce, simplifying filtering, aggregation, and statistics analysis; analogous to SQL
- Workqueue: schedules tasks and restarts jobs that fail
- Chubby: system for coordinating distributed applications, including configuration and synchronization

The MapReduce platform includes node monitoring, fault detection, and queuing processes that help manage MapReduce jobs. From the user perspective, MapReduce provides a platform for operating on many data items in parallel while isolating the user from the details of running a distributed program, i.e., data distribution, replication, fault tolerance, and scheduling. Google also uses the proprietary compression algorithms BMDiff and Zippy to shrink the size of data stored.

Hadoop, an open source Java framework for running applications in parallel across large clusters of commodity hardware, was created by Doug Cutting, the developer of Lucene (search tool) and Nutch (distributed web crawler). Cutting was influenced by what he learned about GFS and MapReduce, and the Hadoop project grew out of his work on Nutch.
In 2006 Cutting was hired by Yahoo!, got a team of engineers, and started the open source Apache Foundation Hadoop project to give Yahoo! the same type of distributed processing that Google was enjoying with their MapReduce platform. By early 2008 Hadoop was able to hit web-scale distribution ( com/files/cutting.pdf). Hadoop has become the primary MapReduce platform used outside of Google. A Hadoop Summit in March 2008 drew more than 400 people. Over half were running Hadoop, with at least 15% running Hadoop on a minimum of 100 nodes.

Hadoop has one or more counterparts to each part of the Google MapReduce stack, as shown in the following table:

| Google MapReduce | Hadoop | Notes |
|---|---|---|
| GFS | Hadoop Distributed File System (HDFS) | Keeps triplicates of all data distributed across cluster nodes |
| BigTable | HBase | Alternatives include HyperTable |
| MapReduce | Hadoop MapReduce | Hadoop Core bundles MapReduce and HDFS |
| | Hadoop Streaming | Provides Hadoop access to any stdin/stdout binary, including interpreted languages like shell script, Python, Ruby, and Perl |
| Sawzall | Pig (Pig Latin) | Data flow and execution language |
| | Hive | Facebook open source query and analysis framework built on top of Hadoop |
| Workqueue | JobTracker | Schedules and manages jobs |
| Chubby | ZooKeeper | Coordination system for distributed applications, including configuration and synchronization |
| Compression: BMDiff, Zippy | zlib/gzip, LZO, bzip2 | Google compression optimized for processing speed, not compression ratio |

Hadoop has a vibrant and engaged user and developer community. We expect Hadoop and related offerings to continue to improve, add functionality, and generate new businesses that support the Hadoop community.

Implementing MapReduce as a methodology and as a data processing engine is best served by considering your staff's collective skills and experience:

- Hadoop requires learning new system administration skills for installing, configuring, monitoring, tuning, and maintenance
  - Administering multiple servers; adding new servers to scale
- Simpler, more flexible data structures
  - MapReduce table structures should be familiar to developers who work with programming language data structures

  - Developers and analysts most familiar with relational databases will need to learn the different MapReduce data structures
- Getting buy-in from technical resources and/or finding experienced MapReduce resources can help with adoption
- Training and/or pilot projects can help staff learn MapReduce and new approaches to data management
- Programming resources may be required to build high-level user interfaces to replace RDBMS-oriented tools

MapReduce advantages:

- Fast performance enabled by parallel processing on distributed data
- Transparent, fault-tolerant execution of parallel data-handling processes
  - Built-in resiliency/fault tolerance coherent with scaling to large clusters
- Scaling to large-scale data volumes
  - Potential for thousands of nodes
  - Largest Hadoop clusters have 2,000 nodes; Google's MapReduce is rumored to use more than 10K servers (perhaps many more) for MapReduce jobs
- Scaling and performance on commodity hardware
  - Can increase and decrease size of cluster via reconfiguration
  - Proven cloud computing options
- Simpler, more flexible data structure
  - No requirement to predefine data structure design
- Parallelism and resiliency come free to developers and analysts, i.e., no explicit coding for parallelism is required

Other considerations:

- Hardware and energy costs rise as the number of servers increases
- Administration, design, and analysis infrastructure tools are available, but still maturing
  - RDBMS-oriented DBA, design, and query tools don't work with MapReduce
- Less compute efficiency compared to more heavily indexed alternatives in some circumstances
  - Depends on data structure design and query complexity
  - Can lead to needing more servers or servers that consume more energy per process

Future features of MapReduce:

- More SQL-like and easier-to-use query and analysis tools (see Hive)
- Increased scaling
  - Hadoop has a design target of 10K node clusters in 2009
- Tools to simplify administration, configuration, installation, and monitoring processes

The organizations we spoke with who use MapReduce had a consistently high opinion of their experience. They liked the scaling and flexible data structures. Their developers and analysts were quickly trained and up to speed, stayed engaged with the data, and remain enthusiastic about using MapReduce. MapReduce complements an experimental approach towards data, i.e., loading raw data into simple data structures and running ad hoc and one-off analysis until query patterns emerge that can be turned into more formal and easy-to-analyze data structures. This experimental, discovery-oriented approach helps make MapReduce a good fit for organizations trying to make data central to business strategy and decision making. Drawbacks noted include trial-and-error fiddling to get configuration optimized and maturity issues around documentation and features, all considered relatively insignificant and not a hindrance to adoption.

Key Takeaway: For most databases, a single server properly configured is enough. It's a big leap to move from one to two servers, but once you do, growing a cluster is relatively easy. Commodity hardware works, rides the mass market innovation curve, and avoids vendor lock-in.

Key Technology Dimensions

There's a lot to keep in mind when investigating big data technology. Along with careful attention to data, staff skills, and usage, we recommend considering the following technology dimensions to make the best decision for your organization.

Single Server and Distributed Data/Parallel Processing Clusters

Single Server

A single box with one or more single or multi-core CPUs and direct-attached or network disk storage.

Assumptions:

- Server performance and data capacity continue to improve, i.e., CPUs gain cores and capacity (per Moore's law), hard disk density increases, memory density increases, and other factors make high-end servers powerful enough to handle ever-increasing big data loads
