Distributed Computing and Big Data: Hadoop and MapReduce
Bill Keenan, Director
Terry Heinze, Architect
Thomson Reuters Research & Development

Agenda
- R&D Overview
- Hadoop and MapReduce Overview
- Use Case: Clustering Legal Documents

Thomson Reuters
- Leading source of intelligent information for the world's businesses and professionals.
- 55,000+ employees across more than 100 countries
- Financial, Legal, Tax and Accounting, Healthcare, Science and Media markets
- Powered by the world's most trusted news organization (Reuters).

Overview of Corporate R&D
- 40+ computer scientists
  - Research scientists, Ph.D. or equivalent
  - Software engineers, architects, project managers
- Highly focused areas of expertise
  - Information retrieval, text categorization, financial research
  - Financial analysis
  - Text & data mining, machine learning
  - Web service development, Hadoop

Our International Roots

Role of Corporate R&D
- Anticipate
- Research
- Partner
- Deliver

Hadoop and MapReduce

Big Data and Distributed Computing
- Big Data at Thomson Reuters
  - More than 10 petabytes in Eagan alone
  - Major data centers around the globe: financial markets, tick history, healthcare, public records, legal documents
- Distributed Computing
  - Multiple architectures and use cases
  - Focus today: using multiple servers, each working on part of a job, each doing the same task
- Key Challenges:
  - Work distribution and orchestration
  - Error recovery
  - Scalability and management

Hadoop & MapReduce
- Hadoop: A software framework that supports distributed computing using MapReduce
  - Distributed, redundant file system (HDFS)
  - Job distribution, balancing, recovery, scheduler, etc.
- MapReduce: A programming paradigm that is composed of two functions (~ relations)
  - Map
  - Reduce
  - Both are quite similar to their functional programming cousins
- Many add-ons

Hadoop Clusters
- NameNode: stores location of all data blocks
- Job Tracker: work manager
- Task Tracker: manages tasks on one Data Node
- Client accesses data on HDFS, sends jobs to Job Tracker
[Diagram: Client; NameNode, Job Tracker, Secondary NameNode; Task Tracker / Data Node 1, Task Tracker / Data Node 2, ..., Task Tracker / Data Node N]

HDFS Key Concepts
- Google File System
- Small # of large files
- Streaming batch processes
- Redundant, rack aware
- Failure resistant
- Write-once (usually), read many
- Single point of failure
- Incomplete security
- Not only for MapReduce

Map/Reduce Key Concepts
- <key,value>
- Mappers: input -> intermediate kv pairs
- Reducers: intermediate -> output kv pairs
- InputSplits
- Progress reporting
- Shuffling, partitioning
- Scheduling
- Task distribution
- Topology aware
- Distributed cache
- Recovery
- Compression
- Bad Records
- Speculative execution
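
The mapper / shuffle / reducer flow listed above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the paradigm, not Hadoop's API: the function names `mapper`, `shuffle`, `reducer`, and `word_count` are hypothetical, and in a real cluster the shuffle step is performed by the framework.

```python
from collections import defaultdict
from functools import reduce


def mapper(doc_id, text):
    """Map: one input record -> intermediate (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group intermediate values by key (the framework's job in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce: all values for one key -> one output (word, total) pair."""
    return word, reduce(lambda a, b: a + b, counts)


def word_count(documents):
    """Run map, shuffle, and reduce over a {doc_id: text} dict."""
    intermediate = [kv for doc_id, text in documents.items()
                    for kv in mapper(doc_id, text)]
    return dict(reducer(k, v) for k, v in shuffle(intermediate).items())
```

For example, `word_count({"d1": "a b a", "d2": "b c"})` returns `{"a": 2, "b": 2, "c": 1}`. The `reducer` uses `functools.reduce` on purpose, to show the link to the functional programming cousins.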

Use Cases
- Query log processing
- Query mining
- Text Mining
- XML transformations
- Classification
- Document Clustering
- Entity Extraction

Case Study: Large Scale ETL
- Big Data: Public Records
- Warehouse loading: long process, expensive infrastructure, complex management
- Combine data from multiple repositories (extract, transform, load)
- Idea:
  - Use Hadoop's natural ETL capabilities
  - Use existing shared infrastructure

Why Hadoop
- Big data: billions of documents
- Needed to process each document, combine information
- Expected multiple passes, multiple types of transformations
- Minimal workflow coding

Use Case: Language Modeling
- Build Language Models from clusters of legal documents
- Large initial corpus: 7,000,000 XML documents
- Corpus grows over time

Process
- Prepare the input
  - Remove duplicates from the corpus
  - Remove stop words (common English, high frequency terms)
  - Stem
  - Convert to binary (sparse TF vector)
  - Create centroids for seed clusters

Process
[Diagram: seed clusters encoded]
- Input: list of document IDs belonging to the seed clusters; encoded documents
- Output: cluster centroids; C-values for each document
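
A minimal sketch of the preparation steps, under several assumptions not stated on the slides: the stop list is a tiny hypothetical one, `toy_stem` is a stand-in for a real stemmer, a sparse TF vector is represented as a `{term: count}` dict, and a centroid is taken to be the per-term mean of the vectors in a seed cluster.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "to", "in"}  # hypothetical, tiny stop list


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def tf_vector(text):
    """Remove stop words, stem, and count terms into a sparse {term: count} dict."""
    tokens = [toy_stem(t) for t in text.lower().split() if t not in STOP_WORDS]
    return dict(Counter(tokens))


def centroid(vectors):
    """Average a list of sparse vectors term by term (assumed centroid definition)."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds the counts of each term
    n = len(vectors)
    return {term: count / n for term, count in total.items()}
```

For example, `tf_vector("the courts of appeal and the court")` gives `{"court": 2, "appeal": 1}`, and averaging that vector with `{"court": 1}` gives `{"court": 1.5, "appeal": 0.5}`.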

Process
- Clustering
  - Iterate until the number of clusters equals the goal
  - Multiply matrix of document vectors and matrix of cluster centroids
  - Assign document to best cluster
  - Merge clusters and re-compute centroids

Process
[Diagram: clustering loop]
- Input: seed clusters, C-vectors
- Generate cluster vectors
- Run algorithm: merge list, W-values
- Merge clusters
- Repeat the loop until all the clusters are merged.
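
The assignment step can be illustrated as follows. The slides do not define how the "best" cluster is scored, so this sketch assumes it is the centroid with the largest dot product, which is the row-wise maximum of the document-matrix times centroid-matrix product. The names `dot` and `assign` are hypothetical.

```python
def dot(u, v):
    """Sparse dot product of two {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def assign(docs, centroids):
    """Return {doc_id: cluster_id}, picking the highest-scoring centroid per document."""
    assignment = {}
    for doc_id, vec in docs.items():
        scores = {cid: dot(vec, c) for cid, c in centroids.items()}
        assignment[doc_id] = max(scores, key=scores.get)
    return assignment
```

With `docs = {"d1": {"court": 2, "appeal": 1}, "d2": {"tax": 3}}` and `centroids = {"c1": {"court": 1.0}, "c2": {"tax": 1.0, "court": 0.1}}`, `assign(docs, centroids)` returns `{"d1": "c1", "d2": "c2"}`.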

Process
- Validate and Analyze Clusters
  - Create classifier from clusters
  - Assign all non-clustered documents to clusters using the classifier
  - Build Language Model for each cluster

HDFS Sample Flow
[Diagram: HDFS -> six task nodes, each running a Mapper -> HDFS]

Prepare Input using Hadoop
- Fits the Map/Reduce paradigm
  - Each document is atomic: documents can be equally distributed within the HDFS
  - Each mapper removes stop words, tokenizes, and stems
  - Mappers emit token counts, hashes, and tokenized documents
  - Reducers build the Document Frequency dictionary (basically, the word count example)
  - Reducers reduce the hashes to a single document (de-duplication)
  - Additional Map/Reduce converts tokenized documents to sparse vectors using the DF dictionary
  - Additional Map maps document vectors and seed cluster ids, and Reduce generates centroids

HDFS Sample Flow
[Diagram: HDFS -> task nodes -> Mappers -> HDFS]
- Distribute documents
- Remove stop words, stem
- Emit filtered documents, word counts
- Emit dictionary, filtered documents
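
The de-duplication and Document Frequency steps can be sketched as two small map/reduce jobs. Everything here is illustrative: `run_job` is a hypothetical in-memory stand-in for a Hadoop job, SHA-1 over the joined tokens is an assumed choice of content hash, and keeping the smallest document id among duplicates is an arbitrary tie-break.

```python
import hashlib
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one Hadoop job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(reducer(k, groups[k]))
    return out


def hash_mapper(doc_id, tokens):
    """Emit the content hash as key so identical documents meet in one reducer."""
    digest = hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()
    yield digest, (doc_id, tokens)


def dedup_reducer(digest, docs):
    """Keep a single document per hash (smallest doc id, for determinism)."""
    yield min(docs)


def df_mapper(doc_id, tokens):
    """Emit each distinct term once per document."""
    for term in set(tokens):
        yield term, 1


def df_reducer(term, ones):
    """Sum the per-document flags into a document frequency."""
    yield term, sum(ones)
```

A usage example: with `corpus = [("d1", ["court", "appeal"]), ("d2", ["court", "appeal"]), ("d3", ["court", "tax"])]`, `unique = run_job(corpus, hash_mapper, dedup_reducer)` keeps `d1` and `d3`, and `dict(run_job(unique, df_mapper, df_reducer))` yields `{"appeal": 1, "court": 2, "tax": 1}`.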

HDFS Sample Flow
[Diagram: HDFS -> task nodes -> Mappers -> HDFS]
- Distribute filtered documents
- Compute vectors
- Emit vectors by seed cluster id
- Compute cluster centroids
- Emit vectors, seed cluster centroids

Clustering using Hadoop
- Map/Reduce paradigm
  - Each document vector is atomic: documents can be equally distributed within the HDFS
  - Mapper initialization required loading a large matrix of cluster centroids
  - Large memory utilization to hold matrix multiplications
  - Decompose matrices into smaller chunks and run multiple map/reduce steps to obtain the final result matrix (new clusters)
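
One way to picture the matrix decomposition is to split the shared dimension into chunks, multiply each pair of blocks independently (the map side), and sum the partial products per output cell (the reduce side). The slides do not say how the matrices were actually partitioned, so this is a sketch under that assumption, with hypothetical function names and dense Python lists standing in for the real data structures.

```python
from collections import defaultdict


def split_columns(matrix, chunk):
    """Yield (start, block) where block holds columns start..start+chunk-1."""
    width = len(matrix[0])
    for start in range(0, width, chunk):
        yield start, [row[start:start + chunk] for row in matrix]


def map_partial(d_block, c_block):
    """Map: multiply one column block of D by the matching row block of C."""
    for i, d_row in enumerate(d_block):
        for j in range(len(c_block[0])):
            yield (i, j), sum(d_row[k] * c_block[k][j] for k in range(len(c_block)))


def chunked_matmul(d, c, chunk):
    """Compute D x C by summing partial products from independent chunks."""
    cells = defaultdict(float)
    for start, d_block in split_columns(d, chunk):
        c_block = c[start:start + chunk]
        for key, partial in map_partial(d_block, c_block):
            cells[key] += partial  # Reduce: sum partial products per output cell
    rows, cols = len(d), len(c[0])
    return [[cells[(i, j)] for j in range(cols)] for i in range(rows)]
```

For `d = [[1, 2, 3], [4, 5, 6]]` and `c = [[1, 0], [0, 1], [1, 1]]`, `chunked_matmul(d, c, 2)` returns `[[4.0, 5.0], [10.0, 11.0]]`, the same as the unchunked product. Each chunk needs only its own blocks in memory, at the cost of extra passes over the data.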

Validate and Analyze Clusters using Hadoop
- Map/Reduce paradigm
  - A document classifier based on the documents within the clusters was built
    - n.b. the classifier itself was trained using Hadoop
  - Un-clustered documents (still in the HDFS) are classified in a mapper and assigned a cluster id.
  - A reduction step then takes each set of original documents in a cluster and creates a language model for each cluster

HDFS Sample Flow
[Diagram: HDFS -> task nodes -> Mappers -> HDFS]
- Distribute documents
- Extract n-grams, emit by cluster id
- Build language model for each cluster
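
The n-gram flow can be sketched as below. The slides do not specify the kind of language model, so this assumes the simplest one, unsmoothed relative n-gram frequencies per cluster. The function names and the default `n=2` are hypothetical.

```python
from collections import Counter, defaultdict


def ngram_mapper(cluster_id, tokens, n=2):
    """Map: emit (cluster_id, n-gram) for every n-gram in one document."""
    for i in range(len(tokens) - n + 1):
        yield cluster_id, tuple(tokens[i:i + n])


def lm_reducer(cluster_id, ngrams):
    """Reduce: turn all n-grams of one cluster into relative frequencies."""
    counts = Counter(ngrams)
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def build_models(clustered_docs, n=2):
    """Group n-grams by cluster id, then build one model per cluster."""
    grouped = defaultdict(list)
    for cluster_id, tokens in clustered_docs:
        for key, gram in ngram_mapper(cluster_id, tokens, n):
            grouped[key].append(gram)
    return {cid: lm_reducer(cid, grams) for cid, grams in grouped.items()}
```

Given two `c1` documents `["the", "court", "held"]` and `["the", "court", "ruled"]`, the `c1` model assigns `("the", "court")` a frequency of 0.5 and each of the other two bigrams 0.25.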

Using Hadoop: Other Experiments
- WestlawNext Log Processing
  - Billions of raw usage events are generated
  - Used Hadoop to map raw events to a user's individual session
  - Reducers created complex session objects
  - Session objects reducible to XML for XPath queries for mining user behavior
- Remote Logging
  - Provide a way to create and search centralized Hadoop job logs, by host, job, and task ids
  - Send the logs to a message queue
  - Browse the queue, or
  - Pull the logs from the queue and retain them in a db

Remote Logging: Browsing Client

Lessons Learned
- State of Hadoop
  - Weak security model, changes in the works
  - Cluster configuration, management and optimization still sometimes difficult
  - Users can overload a cluster. Need to balance optimization and safety.
- Learning curve moderate
  - Quick to run first naïve MR programs
  - Skill/experience required for advanced or optimized processes

Lessons Learned
- Loading HDFS is time consuming:
  - Wrote a multithreaded loader to reduce the I/O-bound load time
- Multiple step process needed to be re-run using different test corpuses:
  - Wrote a parameterized Perl script to submit jobs to the Hadoop cluster
- Test Hadoop on a single node cluster first:
  - Install Hadoop locally
  - Local mode within Eclipse (Windows, Mac)
  - Pseudo-distributed mode (Mac, Cygwin, VMWare) using Hadoop plugin (Karmasphere)

Lessons Learned
- Tracking intermediate results:
  - Detect bad or inconsistent results after each iteration
  - Record messages to Hadoop node logs
  - Create remote logger (event detector) to broadcast status
- Regression tests:
  - Small, sample corpus run through local Hadoop and distributed Hadoop.
  - Intermediate and final results compared against reference results created by baseline Matlab application

Lessons Learned
- Performance evaluations
  - Detect bad or inconsistent results after each iteration
  - Don't accept long duration tasks as normal
    - Regression test on Hadoop took 4 hours while the same Matlab test took seconds (because of the need to spread the matrix operations over several map/reduce steps)
    - Re-evaluated core algorithm and found ways to eliminate and compress steps related to cluster merging
    - Direct conversion of mathematics, as developed, to Java structures and map/reduce was not efficient
    - New clustering process no longer uses Hadoop: 6 hours on a single machine vs. 6 days on a 20 node Hadoop cluster
    - As the corpus size grows, we will need to migrate the new cluster algorithm back to Hadoop

Lessons Learned
- Performance evaluations
  - Leverage combiners and mapper statics
  - Reduce the amount of data during shuffle

Lessons Learned
- Releases are still volatile
  - Core API changed significantly from release .19 to .20
  - Functionality related to distributed cache changed (application files loaded to each node at runtime)
- Eclipse Hadoop plugins
  - Source code only with release .20 and only works in older Eclipse versions on Windows
  - Karmasphere plugin (Eclipse and NetBeans) more mature but still more of a concept than productive
  - Just use Eclipse for testing Hadoop code in local mode
  - Develop alternatives for handling distributed cache

Reading
- Hadoop: The Definitive Guide, Tom White, O'Reilly
- Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, University of Maryland
  http://www.umiacs.umd.edu/~jimmylin/book.html

Questions?
Thank you.