Lecture 4: Introduction to Hadoop & GAE
Cloud Application Development (SE808, School of Software, Sun Yat-Sen University)
Yabo (Arber) Xu
Outline
  Introduction to Hadoop
    - The Hadoop ecosystem
    - Related projects
    - How to start
  Introduction to GAE
    - What is GAE
    - Overview of runtime environment
    - Scalable services
    - Advantages and limitations
    - Billing and free quotas
    - Demo, and how to start
Hadoop Stack & Google's Equivalents
  Google      Hadoop             Role
  MapReduce   Hadoop MapReduce   Programming framework
  GFS         HDFS               Distributed file system
  BigTable    HBase              Distributed column database
  Sawzall     Pig / Hive         High-level language
  Chubby      ZooKeeper          Distributed consensus engine
Pig
  - Data-flow oriented language: "Pig Latin"
  - Datatypes include sets, associative arrays, tuples
  - High-level language for routing data; allows easy integration of Java for complex tasks
  - Developed at Yahoo!
Hive
  - SQL-based data warehousing app
  - Feature set is similar to Pig
  - Language is more strictly SQL-esque: supports SELECT, JOIN, GROUP BY, etc.
  - Features for analyzing very large data sets: partition columns, sampling, buckets
  - Developed at Facebook
HBase
  - Row/column store
  - Billions of rows * millions of columns
  - Column-oriented: nulls are free
  - Untyped: stores bytes
  - Constrained access model: (key, value) lookup
  - Limited transactions (only one row)
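The points above can be illustrated with a toy in-memory table. This is a minimal sketch, not the HBase API: the class `SparseTable` and its method names are hypothetical, chosen only to show sparse cells, byte values, key lookup, and a write scoped to a single row.

```python
class SparseTable:
    """Toy sparse row/column store (illustration only, not the HBase API)."""

    def __init__(self):
        # row key -> {column name -> value bytes}; a missing cell stores nothing
        self._rows = {}

    def put(self, row_key, column, value):
        # Values are untyped: the store only ever sees bytes.
        if not isinstance(value, bytes):
            raise TypeError("values must be bytes")
        self._rows.setdefault(row_key, {})[column] = value

    def put_row(self, row_key, cells):
        # Several cells of ONE row are validated first, then written together.
        # There is deliberately no multi-row equivalent.
        for value in cells.values():
            if not isinstance(value, bytes):
                raise TypeError("values must be bytes")
        self._rows.setdefault(row_key, {}).update(cells)

    def get(self, row_key, column):
        # Lookup by key; an absent cell simply returns None.
        return self._rows.get(row_key, {}).get(column)

    def stored_cells(self):
        # Only cells that were actually written take up space.
        return sum(len(cells) for cells in self._rows.values())


table = SparseTable()
table.put(b"row1", "info:name", b"Ann")
table.put_row(b"row2", {"info:name": b"Bob", "info:city": b"Guangzhou"})
```

Here `row1` has no `info:city` cell at all, so nothing is stored for it; that is what "nulls are free" means in this sketch.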
HBase Data Model
  [Figures: data schema; disk storage]
HBase Design & Features
  Design (similar to GFS)
    - Name node: Master server
    - Data node: Region server, organized in columns and cells
  Features
    - Fault tolerant, with automatic load balancing
    - Fast access to cells, and fast scans over ranges of rows
    - More flexible schema than a traditional database
    - Less transaction support and weaker consistency guarantees
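Fast scans over row ranges follow from keeping rows ordered by row key. The sketch below shows that idea with a sorted key list and binary search; `SortedRows` is a hypothetical name, and real HBase region servers are far more involved than this.

```python
import bisect


class SortedRows:
    """Toy store that keeps row keys sorted so range scans are cheap."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._data = {}   # row key -> value

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)  # insert while preserving order
        self._data[key] = value

    def scan(self, start, stop):
        """Yield (key, value) for every key with start <= key < stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._data[key]


rows = SortedRows()
for name in ["user03", "user01", "page07", "user02"]:
    rows.put(name, name.upper())

users = list(rows.scan("user", "user99"))
```

Because the keys are sorted, the scan locates both ends of the range by binary search and then reads a contiguous slice, instead of examining every row.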
HBase as a MapReduce Input
  - Each row is an input record to MapReduce
  - MapReduce jobs can sort/search/index/query data in bulk
  * If you are interested in knowing more about HBase, you may take a look at Cloudera's training video on HBase.
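"Each row is an input record" can be sketched with a tiny in-process MapReduce loop. Everything here is hypothetical illustration: `run_mapreduce`, the sample rows, and the `info:city` column name are assumptions, and no Hadoop or HBase API is involved.

```python
from collections import defaultdict


def run_mapreduce(rows, mapper, reducer):
    """Call mapper once per row, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for row_key, columns in rows.items():        # one map call per table row
        for key, value in mapper(row_key, columns):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical table contents: row key -> {column -> value}
rows = {
    "user1": {"info:city": "Guangzhou"},
    "user2": {"info:city": "Shenzhen"},
    "user3": {"info:city": "Guangzhou"},
    "user4": {},                                 # sparse row: no city cell
}


def city_mapper(row_key, columns):
    city = columns.get("info:city")
    if city is not None:
        yield city, 1


def count_reducer(key, values):
    return sum(values)


users_per_city = run_mapreduce(rows, city_mapper, count_reducer)
```

The job counts users per city in bulk; the row without a city cell contributes nothing.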
ZooKeeper
  - Distributed consensus engine
  - Provides well-defined concurrent access semantics:
      Leader election
      Service discovery
      Distributed locking / mutual exclusion
      Message board / mailboxes
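Leader election is commonly built on ZooKeeper by having each participant create a sequentially numbered node and treating the lowest number as the leader. The sketch below simulates only that rule in memory; `ElectionGroup` is a hypothetical stand-in, not the ZooKeeper client API, and it ignores sessions, watches, and failures.

```python
import itertools


class ElectionGroup:
    """In-memory simulation of lowest-sequence-number leader election."""

    def __init__(self):
        self._counter = itertools.count()  # hands out increasing sequence numbers
        self._members = {}                 # sequence number -> member name

    def join(self, name):
        seq = next(self._counter)
        self._members[seq] = name
        return seq

    def leave(self, seq):
        # Models a member disconnecting; its entry disappears.
        self._members.pop(seq, None)

    def leader(self):
        if not self._members:
            return None
        return self._members[min(self._members)]


group = ElectionGroup()
a = group.join("worker-a")
b = group.join("worker-b")
first_leader = group.leader()
group.leave(a)
second_leader = group.leader()
```

When the current leader leaves, the member with the next-lowest sequence number takes over without any further negotiation.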
Pipes, Streaming
  - Multi-language connector libraries for MapReduce
  - Pipes: write native-code MapReduce in C++
  - Streaming: write MapReduce passes in arbitrary scripting languages
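A Streaming job passes records to scripts as text lines on standard input, with key and value separated by a tab, and the reducer sees its input sorted by key. Below is a minimal word-count sketch in that style. The function names and the `map` / `reduce` command-line argument are assumptions made for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    Assumes the input is already sorted by key.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Hypothetical usage: the same file acts as mapper or reducer
    # depending on its first argument ("map" or "reduce").
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role in ("map", "reduce"):
        step = map_lines if role == "map" else reduce_lines
        for out in step(sys.stdin):
            print(out)
```

The two steps can be tried locally without Hadoop by sorting the mapper output before handing it to the reducer, which mimics the sort that happens between the map and reduce phases.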
Hadoop Related Projects
  - Avro: a data serialization system
  - Chukwa: Hadoop log aggregation
  - Scribe: more general log aggregation
  - Mahout: machine learning library
  - Cassandra: column store database on a P2P backend
Hadoop Status
  - Still under active development
  - Current stable release: 0.20.2 (Hadoop official website)
  - Other well-maintained distributions:
      Cloudera's CDH2
      Yahoo's distribution: Hadoop 0.20.10
  - Supported platforms: Linux as the production platform, Win32 as a development platform
  - Get yourself started (also Lab 1's task):
      Download a Hadoop stable release
      Set up a single-node Hadoop installation
      Try out the HDFS operations
      Read the WordCount example code, and run your first MR job on Hadoop
Introduction to Google App Engine (GAE)
  - SaaS: Software as a Service
  - PaaS: Platform as a Service (#2: application development on top of a PaaS platform)
  - IaaS: Infrastructure as a Service (#1: the technology that drives PaaS)
What is Google App Engine
  - A PaaS platform for hosting web applications in Google-managed data centers
  - Released in April 2008 with Python support; Java support added in May 2009
  - Google App Engine + Java Language + Google Web Toolkit = Google App Engine for Java
A Traditional Scalable Website
  [Diagram]
A GAE Scalable Website
  [Diagram]
GAE Advantages
  - Easy to use, scale and manage
  - Run your application on Google's infrastructure
  - Forget the worries of managing your own servers
  - Focus on developing more features for your web application, and let Google manage the rest
  - No server restarts, no network issues
GAE Architecture
  [Diagram]
GAE Java Runtime Environment
  - Java 6 VM
  - Servlet 2.5 container
  - HTTP session support (needs to be enabled explicitly)
  - JDO/JPA for the Datastore API
  - JSR 107 for the Memcache API
  - javax.mail for the Mail API
  - java.net.URLConnection for the URLFetch API
  http://code.google.com/appengine/docs/java/runtime.html
Java Standards on GAE
  http://code.google.com/appengine/docs/java/runtime.html
Datastore API
  - Data storage and manipulation
  - Based on Bigtable
  - Not a relational database
  - GQL (Google Query Language)
  - Accessed through JDO/JPA
  http://code.google.com/appengine/docs/java/datastore/
Memcache
  - Faster than Datastore: data is held in memory rather than on disk
  - Arbitrary key-value pair mapping
  - Implements the JCache interface
  - 1MB limit per entry
  - Free quota: 8.6M/day, 800 requests/sec
  http://code.google.com/appengine/docs/java/memcache/
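A typical way to use an in-memory cache in front of a slower store is to read through the cache and fall back to the store on a miss. The sketch below is a plain-Python illustration of that pattern with the per-entry size limit applied. `TinyCache`, `read_through`, and the loader callback are hypothetical names, not the GAE Memcache API, and treating 1MB as 1024 * 1024 bytes is an assumption.

```python
MAX_ENTRY_BYTES = 1024 * 1024  # assumed interpretation of the 1MB per-entry limit


class TinyCache:
    """Toy in-memory key-value cache (illustration only)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, value):
        # Entries over the size limit are rejected rather than stored.
        if len(value) > MAX_ENTRY_BYTES:
            return False
        self._store[key] = value
        return True


def read_through(cache, key, load_from_datastore):
    """Return the cached value, loading and caching it on a miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_datastore(key)   # slow path: hit the backing store
        cache.put(key, value)              # may be rejected if too large
    return value


datastore_reads = []


def fake_datastore_load(key):
    # Hypothetical stand-in for a Datastore query; records each slow read.
    datastore_reads.append(key)
    return b"profile-for-" + key.encode()


cache = TinyCache()
first = read_through(cache, "user1", fake_datastore_load)
second = read_through(cache, "user1", fake_datastore_load)
```

The second call is served from memory, so the backing store is read only once. A cache like this is not durable, so the store stays the source of truth.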
Users & Authentication
  - @gmail.com address
  - Apps for Domain
  - Admin privileges
  http://code.google.com/appengine/docs/java/users/
URLFetch
  - Loads external URLs
  - Asynchronous support
  - HTTP/HTTPS
  - Maximum response time: 10 seconds
  - Maximum data size: 1MB
  http://code.google.com/appengine/docs/java/urlfetch/
Even More
  - Datastore: database storage and operations
  - Memcache API: high-performance in-memory key-value cache
  - User Accounts: using Google accounts for authentication
  - URLFetch: invoking external URLs
  - Mail: sending mail from your application
  - XMPP: sending/receiving XMPP-compatible instant messages
  - Task Queues: for invoking background processes
  - Images: for image manipulation
  - Cron Jobs: scheduled tasks at defined times
  http://code.google.com/appengine/docs/java/apis.html
Who is using GAE?
  http://code.google.com/appengine/casestudies.html
GAE Demo
  - Demo site: http://shen-ma.appspot.com/
  - Source available at: https://code.google.com/p/shenma-wish/
How Do You Start
  - The best way to learn is by practice!
  - Follow GAE's "Getting Started: Java" guide, and have your first application online in 2 hours (also Lab 1's task)
  - We recommend Eclipse as the development IDE; GAE offers a very nice plugin for it
  - Other GAE examples are available on our course website
Intro done, ready to get your hands dirty!