Case Study: 3 Different Hadoop Cluster Deployments
Lee Moon Soo (moon@nflabs.com)
HDFS as a Storage
Over the last 4 years, our HDFS clusters:
- Stored 1,500 TB+ of customer data safely
- Served 375,000 TB+ of data to customers over HTTP, RTSP, RTMP, MMS, and FTP
- Survived 336 disk failures and 3 name node failures
HDFS as a Storage
Pros:
- Easy to scale
- Runs on commodity hardware
- Fault tolerant
- High throughput
- Cost dropped to 20% of SAN, including maintenance
Cons:
- Not mountable
- Does not handle a large number of files well
- I/O is unpredictable (replication, MapReduce)
- Most existing delivery servers cannot access HDFS
- Couldn't store MANY small files like jpg, gif, html, txt
- Buffering on video streaming
Is HDFS Mountable?
- hdfs-fuse: under very high load we faced memory leaks and hangs; Windows systems cannot use FUSE
- WebHDFS
- NFS gateway
- CloudVFS: we built our own. It runs as a Java daemon, supports FUSE, CIFS, NFS, and FTP, includes a cache, and can serve Windows clients via CIFS
(Diagram: Application → OS → FUSE / CIFS / FTP → CloudVFS)
Many Small Files
- When a disk or a datanode fails, re-replication starts at a speed of (number of nodes * 2) blocks per dfs.replication.interval (default 3) seconds
- If you have 1M jpg files to replicate on a 10-node cluster, it'll take more than 40 hours
- Datanodes periodically scan their blocks, which does not handle many files well
- The namenode keeps metadata in its memory, and memory is limited
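The "more than 40 hours" figure above follows directly from the scheduling formula on the slide; a quick back-of-the-envelope check:

```python
# Re-replication time estimate, using the formula from the slide:
# (number of nodes * 2) blocks are scheduled per
# dfs.replication.interval (default 3) seconds.
def rereplication_hours(num_blocks, num_nodes, interval_sec=3):
    blocks_per_interval = num_nodes * 2
    intervals = num_blocks / blocks_per_interval
    return intervals * interval_sec / 3600.0

# 1M small jpg files, roughly one block each, on a 10-node cluster
hours = rereplication_hours(1_000_000, 10)
print(round(hours, 1))  # ~41.7 hours, i.e. "more than 40 hours"
```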
Handling Small Files
Replace the HDFS implementation:

  <property>
    <name>fs.hdfs.impl</name>
    <value>com.nflabs.cloudvfs.hdfs.smallhdfs</value>
  </property>

The SmallHDFS driver first looks for a ._dir_.har archive:
- /dir contains file1, file2, ... file10000
- A MapReduce job scans the directory tree and creates the ._dir_.har archive
- Result: /dir/._dir_.har
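A minimal sketch of the lookup order described above. Only the archive name ._dir_.har comes from the slide; the dict-backed "filesystem" is a stand-in for HDFS, since the real driver is a Hadoop FileSystem implementation whose code is not shown:

```python
import posixpath

# Toy stand-in for HDFS: path -> contents. Illustrates lookup order only.
fs = {
    "/dir/._dir_.har": {"file1": b"a", "file2": b"b"},  # archived small files
    "/other/file1": b"c",                                # un-archived file
}

def read(path):
    """Open a small file, preferring the per-directory ._dir_.har archive."""
    parent, name = posixpath.split(path)
    archive = posixpath.join(parent, "._dir_.har")
    if archive in fs:             # driver first looks for ._dir_.har
        return fs[archive][name]  # serve the entry out of the archive
    return fs[path]               # fall back to the plain file

print(read("/dir/file1"))    # served from the archive
print(read("/other/file1"))  # served directly
```

Packing many small files into one archive per directory keeps the block count (and hence the namenode's in-memory metadata) low.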
Large Scale Log Analysis System
Delivery servers started generating logs, so we built a log analysis system.
Large Scale Log Analysis System
The first log analysis system:
- A Python script calculates simple statistics like throughput and hits/sec
- Results are sent via HTTP PUT to a web server with an RRD database / graph
- As the service grew, the web server's processing speed couldn't keep up with the log generation speed
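The slides don't show the script itself; as a hedged sketch, the hits/sec and throughput statistics it computes might look like this (the "epoch-second, bytes, URL" log format is an assumption, not taken from the deck):

```python
from collections import Counter

# Hypothetical access-log lines: "<epoch_second> <bytes_sent> <url>".
# The real log format is not shown in the slides.
lines = [
    "1000 512 /a.jpg",
    "1000 2048 /b.mp4",
    "1001 1024 /a.jpg",
]

hits_per_sec = Counter()
bytes_per_sec = Counter()  # throughput, bytes served per second
for line in lines:
    ts, nbytes, _url = line.split()
    hits_per_sec[int(ts)] += 1
    bytes_per_sec[int(ts)] += int(nbytes)

print(dict(hits_per_sec))   # {1000: 2, 1001: 1}
print(dict(bytes_per_sec))  # {1000: 2560, 1001: 1024}
```

A single-process loop like this is exactly what stops scaling as log volume grows, which motivates the MapReduce rewrite on the next slide.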
Large Scale Log Analysis System
- Logs are uploaded to HDFS via HTTP PUT from a cron job
- The Python code is converted to MapReduce Java code running on a Hadoop cluster; results still go to the RRD database / graph on a web server
- Now processing speed catches up with log generation speed
- We could add more analysis, like top URL rank and where clients come from
- But the RRD database wasn't flexible enough, and the RRD files became too big
Large Scale Log Analysis System
- Processed results are saved into HBase, and the web server renders them with the Google Chart API
- (Pipeline: server → HTTP PUT → cron job → MapReduce on a Hadoop cluster → HBase → web server → Google Chart API)
- HBase provides better flexibility and scalability
- But now writing MapReduce jobs becomes a pain
Large Scale Log Analysis System
- Hive helped a lot to quickly develop new statistics features
- (Pipeline: server → HTTP PUT → cron job → Hive / MapReduce on a Hadoop cluster → HBase → web server → Google Chart API)
- As more Hive jobs were added, controlling and scheduling jobs became complicated and problematic
Large Scale Log Analysis System
- Oozie replaces the cron job
- (Pipeline: server → HTTP PUT → Oozie → Hive / MapReduce on a Hadoop cluster → HBase → web server → Google Chart API)
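Replacing cron with Oozie typically means a coordinator app that triggers a workflow on a schedule. A minimal sketch (the app name, dates, and HDFS path are assumptions; the deck shows no Oozie configuration):

```xml
<!-- Hourly coordinator: frequency is in minutes in the 0.1 schema.
     Name and app-path below are hypothetical, for illustration only. -->
<coordinator-app name="hourly-log-stats" frequency="60"
                 start="2013-01-01T00:00Z" end="2014-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <!-- workflow.xml would chain the Hive / MapReduce statistics jobs -->
      <app-path>hdfs:///apps/log-stats/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Unlike cron, Oozie can express dependencies between the Hive and MapReduce jobs, retry failures, and wait for input data to land in HDFS.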
Large Scale Log Analysis System
- Flume replaces the HTTP log collector, writing logs directly to HDFS
- (Pipeline: server → Flume → HDFS; Oozie schedules Hive / MapReduce on a Hadoop cluster → HBase → web server → Google Chart API)
Large Scale Log Analysis System
Over the last 4 years: 1,328 TB+ of logs, 3,658,400M+ records
Hadoop for Data Scientists
What is Hadoop for a data scientist? A data scientist is a human, and humans want:
- An analytical language & environment
- Many libraries
- Interactivity
- Visualization
- Sharing
Hadoop for Data Scientists
Tools and languages: MapReduce (Java), Hive, Pig, R, Scala (Spark), ...
Recently, many open-source ML libraries have been born:
- Mahout (http://mahout.apache.org/)
- cloudera-ml (https://github.com/cloudera/ml)
- MLbase (http://mlbase.org/)
- Cascading Pattern (http://www.cascading.org/pattern/)
Demonstration
Hadoop Landscape
MLbase, Cloudera-ML, HCatalog, MRQL, Stinger, Pig, Drill, Shark, Hive, Impala, Tajo
An Open-Source Analytical Tool/Environment for Hadoop
Is there any? Zeppelin:
- Interactive data visualization
- Runtime environment (abstracts and connects different libraries / computing platforms)
- A sharing network like CRAN or CPAN
https://github.com/nflabs/zeppelin
https://groups.google.com/forum/#!forum/zeppelin-developers
Thanks!