Case Study: 3 different Hadoop cluster deployments

1 Case Study: 3 different Hadoop cluster deployments
Lee moon soo [email protected]

2 HDFS as a Storage
Over the last 4 years, our HDFS clusters:
- safely stored 1500 TB+ of customer data
- served 375,000 TB+ of data to customers
- delivered it using the HTTP, RTSP, RTMP, MMS, and FTP protocols
- survived 336 disk failures and 3 name node failures
(Diagram: Customer, Storage, Delivery)

3 HDFS as a Storage
Pros
- Easy to scale
- Commodity hardware
- Fault tolerant
- High throughput
- Cost dropped to 20% of SAN, including maintenance
Cons
- Not mountable
- Does not handle a lot of files well
- IO is unpredictable (replication, map-reduce)
- Most existing delivery servers cannot access HDFS
- Couldn't store MANY small files like jpg, gif, html, txt
- Buffering on video streaming

4 HDFS Mountable?
- hdfs-fuse
  - Under very high load, we faced memory leaks and hangs
  - Windows systems cannot use FUSE
- WebHDFS
- NFS gateway
- CloudVFS (we built our own)
  - Runs as a Java daemon
  - Supports FUSE, CIFS, NFS, FTP
  - Includes a cache
  - Can run on Windows (CIFS)
(Diagram: Application, OS, FUSE / CIFS / FTP, CloudVFS)
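Of the options listed on this slide, WebHDFS exposes HDFS over plain HTTP, so a delivery server that cannot mount HDFS can still read a file with an ordinary HTTP client. The sketch below only builds the request URL for the WebHDFS `OPEN` operation; the host name, port, and file path are placeholders, not values from the talk, and the right port depends on the Hadoop version in use.

```python
from urllib.parse import quote, urlencode


def webhdfs_open_url(host, port, hdfs_path, user=None):
    """Build the WebHDFS URL that opens (reads) a file.

    host, port and user are placeholders supplied by the caller.
    """
    params = {"op": "OPEN"}
    if user is not None:
        # Optional user name for clusters running without security enabled.
        params["user.name"] = user
    # WebHDFS paths live under the /webhdfs/v1 prefix.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{urlencode(params)}"


if __name__ == "__main__":
    print(webhdfs_open_url("namenode.example", 50070, "/logs/a.log"))
```

The name node answers such a request with a redirect to a data node, so the HTTP client has to follow redirects.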

5 Many small files
When a disk or a datanode fails:
- Re-replication starts, at a speed of (number of nodes * 2) blocks per dfs.replication.interval (default 3) seconds
- If you have 1M jpg files to replicate on a 10-node cluster, it'll take more than 40 hours
The datanode periodically scans its blocks:
- it does not handle many files well
The namenode keeps metadata in its memory:
- and memory is limited
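The 40-hour figure follows directly from the rate quoted on this slide. The sketch below reproduces the arithmetic, assuming each small jpg file occupies exactly one HDFS block (the slide does not state this explicitly); the function name is made up for illustration.

```python
def rereplication_hours(num_blocks, num_nodes, interval_seconds=3.0):
    """Estimate the hours needed to re-replicate num_blocks.

    Uses the rate from the slide: (number of nodes * 2) blocks
    per replication interval (default 3 seconds).
    """
    blocks_per_interval = num_nodes * 2
    blocks_per_second = blocks_per_interval / interval_seconds
    return num_blocks / blocks_per_second / 3600.0


if __name__ == "__main__":
    # Assumption: one block per small file, so 1M files = 1M blocks.
    hours = rereplication_hours(num_blocks=1_000_000, num_nodes=10)
    print(f"{hours:.1f} hours")  # prints "41.7 hours"
```

With 10 nodes the cluster re-replicates 20 blocks every 3 seconds, so 1,000,000 blocks need 150,000 seconds, a little under 42 hours.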

6 Handling small files
Replace the hdfs implementation:
<property>
  <name>fs.hdfs.impl</name>
  <value>com.nflabs.cloudvfs.hdfs.smallhdfs</value>
</property>
The SmallHDFS driver first looks for ._dir_.har.
/dir (file1, file2, ... file10000)
Scan the directory tree and create a ._dir_.har archive using MapReduce.
/dir (._dir_.har)
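The lookup order described on this slide can be sketched as follows. This is only a guess at the behaviour, not the CloudVFS code: it assumes the archive is literally named `._dir_.har`, sits inside the directory it replaces, and is addressed by appending the file name to the archive path. The `resolve` function and the `exists` predicate are hypothetical names.

```python
import posixpath

ARCHIVE_NAME = "._dir_.har"  # archive name as written on the slide


def resolve(path, exists):
    """Return the location a read of `path` would be served from.

    `exists` is a caller-supplied predicate (hypothetical) reporting
    whether a path is present in the underlying file system.
    """
    parent, name = posixpath.split(path)
    archive = posixpath.join(parent, ARCHIVE_NAME)
    if exists(archive):
        # Archive found first: read the file from inside it.
        return posixpath.join(archive, name)
    # No archive yet: fall back to the plain file.
    return path


if __name__ == "__main__":
    present = {"/dir/._dir_.har"}
    print(resolve("/dir/file1", present.__contains__))
    print(resolve("/other/file1", present.__contains__))
```

Packing 10,000 files into one archive leaves the name node tracking a handful of archive files instead of 10,000 entries, which is the point of the approach.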

7 Large scale log analysis system
Delivery servers started generating logs, so we built a log analysis system.

8 Large scale log analysis system
The first log analysis system
- Calculates simple statistics such as throughput and hits/sec
(Diagram: a web server, RRD database / graph, HTTP PUT, a python script)
As the service grew, the processing speed of a single web server couldn't keep up with the log generation speed.
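The kind of statistics the first system produced can be sketched in a few lines. The talk does not show the original script or its log format, so the record layout below (a request time in epoch seconds and a byte count) and all names are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record layout: request time (epoch seconds) and bytes sent.
LogRecord = namedtuple("LogRecord", ["timestamp", "bytes_sent"])


def simple_stats(records):
    """Return (hits_per_sec, bytes_per_sec) over the span covered by records."""
    if not records:
        return 0.0, 0.0
    start = min(r.timestamp for r in records)
    end = max(r.timestamp for r in records)
    # Guard against a zero-length window when all records share one timestamp.
    duration = max(end - start, 1)
    total_bytes = sum(r.bytes_sent for r in records)
    return len(records) / duration, total_bytes / duration


if __name__ == "__main__":
    sample = [LogRecord(0, 100), LogRecord(5, 300), LogRecord(10, 600)]
    print(simple_stats(sample))
```

A single process like this has to read every log line itself, which is why it stops scaling once logs arrive faster than one machine can parse them.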

9 Large scale log analysis system
- Logs are uploaded to HDFS
- Python code is converted to M/R Java code
(Diagram: a web server, RRD database / graph, HTTP PUT, cron job, Map-Reduce, a hadoop cluster)
- Now the processing speed keeps up with the log generation speed
- We could add more analyses, such as top URL rank and where clients come from
- The RRD database wasn't flexible enough
- The RRD file becomes too big
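An analysis such as top URL rank fits the map-reduce shape naturally: the map step emits one (url, 1) pair per log line and the reduce step sums the pairs per URL. The sketch below runs both steps in a single Python process purely to show the data flow; the production jobs in the talk were Java M/R, and the log layout (URL as the first whitespace-separated field) is an assumption.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit (url, 1) for one log line; assumes the URL is the first field."""
    fields = line.split()
    return [(fields[0], 1)] if fields else []


def reduce_phase(pairs):
    """Sum the counts emitted for each URL."""
    counts = defaultdict(int)
    for url, n in pairs:
        counts[url] += n
    return dict(counts)


def top_urls(lines, k):
    """Return the k most requested URLs as (url, count) pairs."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    counts = reduce_phase(pairs)
    # Sort by descending count, then by URL so ties rank deterministically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    logs = ["/a 200", "/b 200", "/a 404", "", "/c 200", "/a 200", "/b 200"]
    print(top_urls(logs, 2))
```

On a cluster the map step runs in parallel over HDFS blocks and the framework groups pairs by URL before the reduce step, which is what lets processing keep pace with log generation.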

10 Large scale log analysis system
- Processed results are saved into HBase
- The web server implements the Google Chart API
(Diagram: a server, HTTP PUT, cron job, HBase, web server, Google Chart API, Map-Reduce, a hadoop cluster)
- HBase provides better flexibility and scalability
- Now, writing M/R becomes a pain

11 Large scale log analysis system
- Hive helped a lot in quickly developing new statistics features
(Diagram: a server, HTTP PUT, cron job, HBase, web server, Hive, Google Chart API, Map-Reduce, a hadoop cluster)
- As more Hive jobs are added, controlling and scheduling jobs becomes complicated and problematic

12 Large scale log analysis system
Oozie replaces the cron job.
(Diagram: a server, HTTP PUT, Oozie, HBase, web server, Hive, Google Chart API, Map-Reduce, a hadoop cluster)

13 Large scale log analysis system
Flume replaces the HTTP log collector.
(Diagram: a server, Oozie, HBase, web server, Hive, Map-Reduce, Google Chart API, Flume, HDFS, a hadoop cluster)

14 Large scale log analysis system
Over the last 4 years: 1328TB+, M+ records

15 Hadoop for data scientists
What is Hadoop for a data scientist? A data scientist is a human, and humans want:
- An analytical language & environment
- Many libraries
- Interactive visualization
- Sharing

16 Hadoop for data scientists
Tools, languages:
- M/R Java, Hive, Pig, R, Scala (Spark), ...
Recently many open-source ML libraries have been born:
- Mahout
- cloudera-ml
- MLbase
- Cascade Pattern

17 Demonstration

18 Hadoop landscape
ML-base, Cloudera-ML, HCatalog, MRQL, Stinger, Pig, Drill, Shark, Hive, Impala, Tajo

19 An open-source analytical tool/environment for Hadoop
Is there any?
Zeppelin
- Interactive data visualization
- Runtime environment (abstracts and connects different libraries / computing platforms)
- Sharing network like CRAN or CPAN

20 Thanks!
