Big Data Storage

Transcription

1 HBase IntroductionandNewDevelopments AndrewPurtell

2 Outline BigDataandCloudComputing HBaseIntroduction NewFeatures ACIDGuarantees MultiDataCenterReplication Security Coprocessors WrapUp

3 BigDataandCloudComputing

4 BigData BigDatadefined Scale beyond the limits of conventional data storage solutions today either in terms of capacity or of the sustainablecostofthesolution Trillions(10^12)ofdataitems Petabytes(10^15bytes)ofdatavolume Google encountered Big Data in their operations and devisedarchitecturalsolutionsforit,includingbigtable BigTableistheinspirationforHBase TheBigDataAge Accordingtooneestimate,globally the world created 150 exabytes of datain2005 This year, the world may create morethan1,500exabytesofdata

5 MediumData What about Medium Data? We like to say that Facebook doesn t run Hadoop because it has a lot of data, but that Facebook has a lot of data because it runs Hadoop. Businesses that use Hadoop find that keeping data is worthwhile because Hadoop helps themprocessitinnewways. MikeOlson,CEO,Cloudera Manybusiness,evensmallones,duringthecourseof their normal operations can generate petabytes of dataperyear Iftheyretainit,theycanmineitandgaininsights Open source analytics enablers like the Hadoop software ecosystem of which HBase is a part makethisanemergingreality

6 CloudComputing "Internet based access to highly scalable pay per use ITcapabilities" YnemaMangum,SUNMicrosystems Anevolutionofnetworkcomputing Describesinfrastructure as a service(iaas)prettywelll Workstation Network Grid Cloud Cloudcomputingisclient servercomputingthatabstracsthe detailsoftheserveraway Scalefree Resourcesanywhere/everywhere Looselycoupledcomputing Decentralized,openstandards Opentechnologies Newownershipmodel

7 CloudComputing Scalefreecomputing A limitless pool of on demand resources is a game changer Paybyuseinsteadofprovisioningforpeak Elasticscalingupanddown,scaleupverylarge Optimizecoststoactualservicedemand Worry only about the application or service, not about infrastructure Capacity Capacity Demand Demand Time Time Staticdatacenter Datacenterinthecloud Unusedresources

8 Convergence As we enter the age of Big Data we have the scale (and scale free computational nature) of the Cloud to manageit TheCloudisadriverofBigDataevenasitisameans todealwithit HadoopandHBaseareCloudscalearchitectures ContainerforBig(andMedium)Data Scalefreecomputationalframeworkfor managingit

9 HBase(andHadoop)Introduction

10 HadoopThePlatform Atransparentlyscalablecomputingplatform Cloudscale griddataprocessing 2009GraySortwinner:0.578terabytes/minute,anew worldrecord 10Knodes,100millionfiles,10petabytes Sortaterabyte(1,000,000,000,000bytes)in62seconds Sortapetabyte(1,000,000,000,000,000bytes)in16.25hours

11 HadoopTheEcosystem RDBMS BI,ETL ETL Sqoop Hive Pig MapReduce(Computation) HBase(DistributedMap) HDFS (HadoopDistributedFileSystem) Avro(Serialization) MapReduceframework(Core) Pluggableclustertaskscheduler(Core) Distributedreplicatedfaulttolerantfilesystem(HDFS) Horizontallyscalabledistributedfaulttolerantdatabase (HBase) Various value adds: Add on packages (analytics, management),distributions,dashboards,etc. Zookeepr(Coordination)

12 SeekVersusSortandMerge Atscale,disktimedominatesstorageandcomputation Twodatabaseparadigms CPU,RAM,anddisksizedoubleevery18 24months Seektimeremainsnearlyconstant(~5%peryear) Seekdominant:Indexed(B Tree)seekandreplace(RDBMS) Transferdominant:sort/merge(MapReduce,Bigtable) Seekisinefficientcomparedtotransferatscale Given: Updating1%ofentries(100,000,000)takes: 10MB/secondtransferbandwidth 10millisecondsdiskseektime 100bytesperentry(10billionentries) 10kBperpage(1billionpages) 1,000dayswithrandomB Treeupdates 100dayswithbatchedB Treeupdates 1daywithsortandmerge Logstructureddataaccessonstreamingfilesystem

13 HBaseTheHadoopDatabase Apersistentdistributedhashmap...andseparatenamespaces Tables...andanindex Rows...andlocalityofI/Oreferences Columnfamilies...andtimeranges Timestamps

14 HBaseTheHadoopDatabase Google: BigTable is a distributed storage system for managingstructureddatathatisdesignedtoscaletoa verylargesize:petabytesofdataacrossthousandsof commodityservers Goal: Store billions of rows * millions of columns * thousandsofversions An open source version of BigTable, enhanced with additionalfeaturesdevelopedbythecommunity FastfaultrecoveryviaZookeeper Querypushdownviaserversidequeryandscannerfilters Optimizationsforrealtimequeries Rollingrestarts PushmetricstologfilesorGanglia( Valuetimetolives(TTLs) AdministrativeGUIandcommandlineshell AHadoopsubproject TheusualASFthingsapply(license,JIRA,etc)

15 HBaseTheHadoopDatabase Designed,liketheHadoopplatform,forBigData To handle Big Data, we discard transactions and relationaldatamodels Nodistributedtransactions Nocomplexlocking Nowaitsordeadlocks Updatethroughsortandmergeinsteadofseekandreplace Incontrast,RDBMSsystems Distributedtransactions,complexlocking Seekandreplaceupdatestrategy Waits and deadlocks rise non linearly with transaction size andconcurrency Squareofconcurrency Thirdpoweroftransactionsize Relationalfeaturesareabandonedatscaleanyway Applicationlevelshardingasalastresortisnotasolution

16 DataModel Distributedpersistentsparsemap Multidimensionalkeys <row>,<column>:<qualifier>,<timestamp> Keysarearbitrarystrings Datagroupedbycolumns Accesstorowdataisatomic Multiversioning and timestamps avoid edit conflicts causedbyconcurrentdecoupledprocesses

17 GroupedbyColumns? Notaspreadsheet Instead,thinkoftags Valuesofanylength,nopredefinednamesorwidths

18 ColumnOrientedStores Atableconsistsofoneormore columnfamilies Columnnamesmayhaveanoptional qualifier family:qualifier Qualifiersareadditionallevelofindexing Eachfamilyinseparatestorefiles First,binarysearchoverindex(memory resident)intocolumnstoretofindany matchingrow Thenscantheblocktofindvalueswith matchingqualifiers Valuesarestoredinsortedorder Optionalfilelevelcompression (GZIP,LZO) Lexiographicallysimilarvaluesare packedadjacenttoeachotherfor goodlocalityofi/o;itisfastandcheap toscanadjacentrowsandcolumns

19 TablesAndRegions Rowsarestoredinbyte lexographicsortedorder Tablesaredynamicallysplitintoregions Regionsarehostedonanumberofregionservers As regions grow, they are split and distributed evenly amongthestorageclustertolevelload Splitsare almost instantaneous Finegrainedloadbalancing Regionsaremigratedawayfromhighly loadednodes Enablesfastrecovery Masterrapidlyredeploysregionsfrom failednodestoothers

20 "StrongC"(Consistency) CAPTheorem Simplifieddefinitions: C:Consistency Thesystemcanhandle thefailureofsome nodesandlossofsome messages HBase (BigTable) Youcanhaveonly2of3 BigTableisa"CP"architecture Thesystemalways returnsananswer (immediately) P:PartitionTolerance Writesarevisibletoall readersatthesametime A:Availability Brewer,2000 Strongconsistencywithfault(partition)tolerance Valuestorageiscanonical:Everyvalueappearsinoneregiononlyand eachregionisassignedtoonlyoneregionserveratatime

21 MoreonConsistency HBaseisstronglyconsistent In comparison with other "NoSQL" systems, HBase hasworkableatomicprimitives Rowlocks Atomic compareandset (CAS) and increment/decrement operators Allmutationsareatomicintherow Multiversioning and timestamps can help with applicationlayerconsistencyconcerns Every value appears in one region only, within the appropriateboundary[startkey,endkey]foritsrowkey Eachregionisassignedtoonlyoneregionserveratatime All edits are timestamped and the storage system supports storageandretrievalofmultipleversionsofavalue Applications can query for the latest version according to timestamp, time range, or most recent N versions, dependingonrequirements

22 MoreonConsistency Versionpool Asynchronousprocessesupdatetableindependently Microsecond resolution timestamps make edit conflicts unlikely UsedinconjunctionwithNTPclocksynchronization,providesglobal ordering Crosssitereplicationpreservestimestamps ApplicationcanqueryNversionsorbytimestamprangefor independentlystorededitsandmergeorresolvethem CURRENT O (O,t2) O (O,t1) O VERSION POOL (Olderversions thatmaybe useful)

23 IntegrationwithHadoop WithHDFS WithZooKeeper HBase relies on DFS replication for data durability and availability WALusesappendfeature WithoutHDFS,regionscouldnotbemigrated HBase compaction interacts faviorably with HDFS block placement Trackclustermembershipanddetectdeadservers Supports master election and recovery in multi master deployments AutomaticMasterfailover Rollingupgradesofpointreleases Modifysomeclusterconfigurationwithoutfullclusterrestart

24 IntegrationwithHadoop WithMapReduce TableInputFormat TableOutputFormat SplitscorrespondtoregionsforoptimalI/O TasksscheduledonRegionServershostingthetableregions Thisisfirstclass integrationintothe Hadoopstack Imagecredit:LarsGeorge

25 IntegrationwithHadoop Imagecredit:LarsGeorge

26 MoreDetail:WritePath Whenawriteoccurs Itiswrittentoawriteaheadlog Itisbufferedinmemory("memstore") Deletesarejustanotherkindofwrite,adeletemarker Periodically,thecacheiswrittentodisk,creatinganewfileineachstore forthecolumnsbeingflushed A compaction occurs when the number of files in an store exceeds a threshold:allstoresaremergesortedintoasinglenewstorefile Deletedandexpiredvaluesaregarbagecollectedduringcompactions Compactionsaredonein thebackground Periodically,thelogfile isclosedandanewone created Oldlogfilesaregarbage collected Updatingviarewritingis 100x1000xfasterthan updateviaseekand replaceatlargescale

27 MoreDetail:ReadPath Whenareadoccurs Checkfordatainmemstore Lookfordatainpersisteddata,fromnewesttooldest Stopbackwardssearchforagiven{table,row,column,[timestamp]} coordinateifadeletemarkerisfound Stop search if enough versions have been found to satisfy the search criteria Ignore expired values if TTLs are configured on the relevant column families Expired values will be garbage collected at next compaction, just like deletes

28 NewFeatures DataDurabilityandACIDGuarantees

29 ACIDGuarantees ACID Atomicity Consistency Isolation Durability Up to now, the ACID guarantees that HBase has providedhavenotbeenwellenumerated,butwenow haveaspecification In JavaDocin HBASE 2294 ( 2294) Many of these guarantees have always been informallyprovided Somehavebeenclarifiedforthisrelease Next: Additions to the unit test suite to continuously validatetheimplementationandtestforregressions

30 ACIDGuarantees Allmutationsareatomicwithinarow The checkandput API happens atomically like the typicalcompareandset(cas)operation The order of mutations is seen to happen in a well definedorderforeachrow,withnointerleaving All rows returned via any access API will consist of a completerowthatexistedatsomepoint Thisistrueacrosscolumnfamilies A scan is not a consistent view of a table; scans do notexhibitsnapshotisolation;instead: Anyputwilleithertotallysucceedortotallyfail APIsthatmutateseveralrowswillnotbeatomicacrossthe multiplerows Anyrowreturnedbythescanwillbeaconsistentview Ascanwillalwaysreflectaviewofthedataatleastasnew asthebeginningofthescan

31 ACIDGuarantees When a client receives a success response for any mutation, that mutation is immediately visible to both thatclientandanyotherclient Arowwillneverexhibit"time travel"properties A series of mutations moves a row sequentially through a seriesofstates Anysequenceofconcurrentreadswillreturnasubsequence Anyversionofacellthathasbeenreturnedtoaread operationisguaranteedtobedurablystored

32 DataDurability HDFS Aworkingappend This feature is critical for HBase to be able to guaranteetoclientsthatwriteshavebeenpersistedto disk(inthewriteaheadlog) Normally not a problem but there are some narrow failurecasesremaining HDFS 200 provides working append and HBase has supportforitwhichsolvestheproblem In Developed Intesting About1 2monthsaway HBase with a Hadoop release including HDFS 200 willinsurethatifastoresucceedsthedataisguaranteedto bepersisted

33 NewFeatures MultiDataCenterReplication

34 MultiDataCenterReplication Logshipping Lazy/asynchronous Updateeverywhere ConvergencewithmultiversioninginsteadofACID Use column and/or table descriptor attributes to specify the scopeofdataitemsstoredthere Supportsarbitrarytopologies:mesh,spoke and wheel,tree, pipeline,etc. RegionServersdothework Local:Donotreplicate Global:Replicateeverywhere Onlyreplicategloballyscopedcells Scopeisspecifiedasanintegertoenablemorecomplexpoliciesas theyaredevelopedinthefuture Replicationispeertopeer(cluster) In A subset of RS nominate themselves via ZK to act as endpointsforinter clusterreplication Shiplogsbetweenthemselvesinthebackground

35 MultiDataCenterReplication 2)Sendeditstopeerclusters 3)DeleteWALoncealleditsareacked 1)CollectglobaleditsfromnewlyrolledWAL LogBackup

36 NewFeatures SecureHBase

37 SecureHBase Isolationguarantees Auditsupport,authentication,authorization NewsecurityfeaturesinHadoopfromY! AddrolebasedaccesscontrolmodeltoHBase Kerberosstrongauthentication DataisolationattheHDFSlayer SecureRPC GetHBaseworkingonthissubstratebottomup LeverageKerberostoestablishuseridentity Manageametatablethatassociatesuserswithroles ACLsontables,possiblyalsooncolumnfamilies Superuserprivilegeforadministration IntegratethisintoHBasetopdown IntegratewithHDFSlayersecurity? Meetefficientlyandeffectivelyinthemiddle

38 SecureHBase Motiviations Multitenant cloud platforms and hosted services based (in part)onhbaseandhdfs Weneedtobeabletoreasonaboutuserisolationanddata integritycontrolsforcertification,e.g.sas 70 Isolationand Integrity DataAccessLayer HBase HDFS (HadoopDistributedFileSystem) SecureRPC AuditTrace Zookeeper

39 SecureHBase AuthenticationandAuthorization HBase Master krb(user) krb(user) blocktoken Clients (Servicefront ends) krb(user) krb(hdfs) blocktoken blocktoken HBase RegionServer HDFS NameNode HDFS DataNode Authentication Server(AS) LDAP Bootstrap DB TicketGranting Server(TGS) LDAP blocktoken KerberosKDC ClientsaccessHDFSandHBaseservicesindependently AllactionsrequiresavalidHDFSblocktokenacquiredviakrbauthenticationtoNameNode HDFSDataNodeswillnotservereadsandwritesunlessgivenvalidblocktokenforblock(s) AllRPCissecure(SASL+GSSAPI)

40 SecureHBase Audit HBase Master HBase HBase RegionServer RegionServer HDFS NameNode HDFS HDFS HDFS DataNode DataNode DataNode HDFS HDFS DataNode ZooKeeper DataNode JobTracker HDFS HDFS DataNode TaskTrackers DataNode Clients (Servicefront ends) HDFS HDFS DataNode MRtasks DataNode ChukwaLog4JAppender LogaggregationsolutionproposalbasedonApacheChukwa SubsetofplatformresourcesmustbereservedforprivateChukwa/MRcluster

41 NewFeatures Coprocessors

42 Coprocessors In0.21 AnewBigTableFeature NewSinceOSDI'06 Arbitrarycodethatrunsrunnexttoeachtabletintable As tablets split and move, coprocessor code automatically splits/movestoo High levelcallinterfaceforclients Callsaddressedtorowsorrangesofrows "Veryflexiblemodelforbuildingdistributedservices" "Automaticscaling,loadbalancing,requestrouting" ExampleCoprocessorUses Coprocessorclientlibraryresolvestoactuallocations Calls across multiple rows automatically split into multiple parallelizedrpcs ScalablemetadatamanagementforColossus Distributed language model serving for machine translation system Distributedqueryprocessingforfull textindexingsupport Regularexpressionsearchsupportforcoderepository

43 HBaseCoprocessors InspiredbyBigTableCoprocessors Genericextensionmechanism Coprocessorsareassociatedwithtablesviaatableattribute Tableattributeisapath(e.g.HDFSURI)tojarfile Jar is loaded into the regionservers when table regions are openedandnewfunctionbecomespartoftheregionserver Motiviation Currently extending HBase means HRegionServerandHRegionInterface Basic Hadoop architectural computationwithdata Resultingextensionsaremutuallyexclusive principle subclassing of colocating Computationherecanbe Calculationofaggregatesoverregiondata:count(),sum(),etc. Managementofsecondaryindices Dynamicindexing MorecomplexdatamodelslayeredonHBaseforscalability Querypushdownwitharbitrarilycomplexpredicates

44 HBaseCoprocessors RegionObserver If the coprocessor implements this interface, it will be interposedinallregionactionsviaupcalls Chainingofmultipleobservers(bypriority) Canmediate(veto)actions ManyextensionscanbebuiltontopofRegionObserver If the coprocessor implements this interface, it can receive arbitrarymethodinvocationsfromclients Combine RegionObserver and CommandTarget to extendhbaseinarbitraryways Secondaryindexes Filters Propagationconstraints? CommandTarget Enablessecuritypolicyextensions Mediatorscanbechainedaheadofobservers Mappinglayers,e.g.ORMs NativeRDFtuplestore Cloudfilesystem

45 HBaseCoprocessors MapReduce Ifthecoprocessorimplementsthisinterface,clientscancall uptoitexecuteparallelmapreduceoverthehbasecluster Runsconcurrentlyonallregionsofthetable LikeHadoopMapReduce Mappers,reducers,partitioners,intermediates UnlikeHadoopMapReduce NottableMapReduce,parallelregionMapReduce Sharedmemory Multithreaded(workerpools) Concurrencyofmappersandreducersisspecifiedseparately Scannerlikeinterfacetoretrieveresults Usesleases Forefficientimplementationofaggregatingfunctions Ajobisonlyaliveaslongasithasalease Forlongrunningjobstheclientmustperiodicallypollstatustokeepit alive;jobswithoutinterestwillbecancelled Retrievalby"scanner"willalsorenewthelease

46 WrapUp

47 ForMoreInformation HBaseWebsiteandWiki MailingList #hbaseonfreenode Committers and core contributors are here on a regular basis MoreactivethantheHadoopforums FollowHBaseonTwitter! IRCChannel

48 TheEnd Q&A Contactinfo: AndrewPurtell