Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Lecturer: Timo Aaltonen, University Lecturer, timo.aaltonen@tut.fi Assistants: Henri Terho and Antti Kallonen, helping with technicalities and answering questions Course work Exam!
Communication Communication between students and course personnel is carried out with Slack, a team-collaboration tool All students will get an e-mail invitation Slack is the main channel for asking questions
Exam The popularity of the course surprised us For the course work we are already on plan C; grading 150 students with our current plan does not work Therefore, the course will have an exam Sorry!
Course Work A hands-on course Course work in groups of three students Java and Python (maybe Scala) Hadoop, HDFS, MapReduce, Spark, Hue Traffic data by the Finnish Transport Agency Report
Course Work Cloudera CDH 5 virtual machine Each group should install it on their own laptop (in VirtualBox, for instance) Data preparation, import, analysis, dissemination The tasks will be published during the course This week's task: form a group of three
Today Big data Data Science Hadoop HDFS
Big Data The world is drowning in data click-stream data is collected by web servers the NYSE generates 1 TB of trade data every day MTC collects 5,000 attributes for each call smart marketers collect purchasing habits More data usually beats better algorithms
Three Vs of Big Data Volume: amount of data transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to-machine data Velocity: speed of data in and out streaming data from RFID, sensors, etc. Variety: range of data types and sources structured, unstructured
Big Data Variability Data flows can be highly inconsistent, with periodic peaks Complexity Data comes from multiple sources; linking, matching, cleansing and transforming data across systems is a complex task
Data Science Definition: data science is an activity that extracts insights from messy data Facebook analyzes location data to identify global migration patterns and to find the fan bases of different sports teams A retailer might track purchases both online and in-store to target marketing
Data Science
New Challenges Compute-intensiveness raw computing power Challenges of data-intensiveness amount of data complexity of data speed at which data is changing
Data Storage Analysis A hard drive from 1990: stores 1,370 MB, reads at 4.4 MB/s A hard drive from the 2010s: stores 1 TB, reads at 100 MB/s
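The imbalance above is easy to quantify with a back-of-the-envelope calculation (a sketch using the figures from the slide; 1 TB is approximated as 1,000,000 MB):

```python
# How long does it take to read an entire drive sequentially?
# Capacity grew roughly 700x, but transfer speed only about 23x.

def full_read_seconds(capacity_mb: float, speed_mb_s: float) -> float:
    """Seconds needed to read the whole drive end to end."""
    return capacity_mb / speed_mb_s

drive_1990 = full_read_seconds(1_370, 4.4)        # about 5 minutes
drive_2010 = full_read_seconds(1_000_000, 100.0)  # 10,000 s, almost 3 hours

print(f"1990 drive:  {drive_1990 / 60:.1f} min")
print(f"2010s drive: {drive_2010 / 3600:.1f} h")
```

So although drives store vastly more, reading all of that data from a single disk has become far slower in relative terms, which motivates the parallel approach on the next slides.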
Scalability The system grows without requiring developers to re-architect their algorithms/applications Horizontal scaling: add more machines Vertical scaling: add resources to a single machine
Parallel Approach Reading from multiple disks in parallel 100 drives, each holding 1/100 of the data => 1/100 of the reading time Problem: hardware failures => replication Problem: most analysis tasks need to be able to combine data in some way => MapReduce Hadoop provides both
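The 1/100 claim can be sketched numerically (assuming the data is split evenly and all drives are read concurrently, and ignoring coordination overhead):

```python
# Reading 1 TB: one drive vs. 100 drives in parallel.
TOTAL_MB = 1_000_000     # 1 TB, as in the earlier slide
SPEED_MB_S = 100.0       # per-drive transfer speed
DRIVES = 100

single_drive = TOTAL_MB / SPEED_MB_S              # 10,000 s (~2.8 h)
parallel = TOTAL_MB / DRIVES / SPEED_MB_S         # 100 s

print(f"one drive:  {single_drive:.0f} s")
print(f"{DRIVES} drives: {parallel:.0f} s")
```

In practice the speedup is less than linear (failures, stragglers, combining partial results), which is exactly what MapReduce and replication are designed to manage.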
Hadoop Hadoop is a framework of tools, libraries and methodologies Operates on large unstructured datasets Open source (Apache License) Simple programming model Scalable
Hadoop A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license) Core Hadoop has two main systems: Hadoop Distributed File System: self-healing, high-bandwidth clustered storage MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction
Hadoop Administrators: installation, monitoring/managing systems, tuning systems End users: designing MapReduce applications, importing/exporting data, working with various Hadoop tools
Hadoop Developed by Doug Cutting and Michael J. Cafarella Based on Google's MapReduce technology Designed to handle large amounts of data and to be robust Donated to the Apache Software Foundation in 2006 by Yahoo
Hadoop Design Principles Moving computation is cheaper than moving data Hardware will fail; manage it Hide execution details from the user Use streaming data access Use a simple file-system coherency model Hadoop is not: a replacement for SQL, always fast and efficient, or suited for quick ad-hoc querying
Hadoop MapReduce Collocate data with the compute node data access is fast since it is local (data locality) Network bandwidth is the most precious resource in the data center, so MR implementations explicitly model the network topology MR operates at a high level of abstraction: the programmer thinks in terms of functions over key-value pairs
Hadoop MapReduce MR is a shared-nothing architecture tasks do not depend on each other, so failed tasks can be rescheduled by the system Invented by Google, originally for producing search indexes, but applicable to many other problems too
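The key-value programming model can be illustrated with a tiny pure-Python simulation of a MapReduce word count. This is a sketch, not the Hadoop API: the shuffle phase is modeled with an in-memory dict, whereas Hadoop performs it across the network.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reducer: sum all counts emitted for one key."""
    return (word, sum(counts))

def mapreduce_wordcount(lines):
    # Shuffle/sort: group intermediate values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    # Each group is reduced independently -- this independence is what
    # makes the model shared-nothing and trivially parallelizable.
    return dict(reduce_phase(k, v) for k, v in groups.items())

result = mapreduce_wordcount(["to be or not to be"])
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because neither mappers nor reducers share state, any failed task can simply be re-run on another node, which is the fault-tolerance argument from the slide.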
Hadoop Ecosystem Common A set of components and interfaces for distributed file systems and general I/O Avro A serialization system for efficient, cross-language RPC and persistent storage MapReduce Distributed data processing model and execution environment
Hadoop Ecosystem HDFS A distributed filesystem Other ecosystem projects: Pig, Hive, HBase, ZooKeeper, Sqoop, Oozie
RDBMS vs HDFS Schema-on-Write (RDBMS) Schema must be created before any data can be loaded An explicit load operation transforms data to the DB-internal structure New columns must be added explicitly before new data for such columns can be loaded into the DB Schema-on-Read (HDFS) Data is simply copied to the file store; no transformation is needed A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding) New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it
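The schema-on-read idea can be sketched in plain Python: raw lines are stored untouched, and a toy SerDe function imposes the schema only when the data is read. The field names below are invented purely for illustration.

```python
# Raw data is copied to the file store as-is -- no load-time transformation.
raw_storage = [
    "2016-09-01,helsinki,412",
    "2016-09-02,tampere,317",
]

def serde(line):
    """A toy SerDe applied at *read* time (late binding).
    The column names (date, city, count) are made up for this sketch."""
    date, city, count = line.split(",")
    return {"date": date, "city": city, "count": int(count)}

# The schema is only imposed when the data is read:
rows = [serde(line) for line in raw_storage]
print(rows[1])  # {'date': '2016-09-02', 'city': 'tampere', 'count': 317}
```

If the raw format changes, only the SerDe needs updating; the already-stored data stays untouched, which is the contrast with an RDBMS-style load step.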
Flexibility: Complex Data Processing 1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop). 2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce. 3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava). 4. Pig Latin: A high-level language out of Yahoo, suitable for batch data-flow workloads. 5. Hive: A SQL interpreter out of Facebook; also includes a metastore mapping files to their schemas and associated SerDes. 6. Oozie: An XML Process Definition Language (PDL) workflow engine that enables creating a workflow of jobs composed of any of the above.
Hadoop Distributed File System Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System) Based on Google's GFS (Google File System) HDFS provides redundant storage for massive amounts of data using commodity hardware Data in HDFS is distributed across all data nodes, which enables efficient MapReduce processing
HDFS Design A file system on commodity hardware Survives even high failure rates of individual components Supports lots of large files file sizes of hundreds of GB or several TB Main design principles Write once, read many times Streaming reads rather than frequent random access High throughput is more important than low latency
HDFS Architecture HDFS operates on top of an existing file system Files are stored as blocks (default size 64 MB; different from the underlying file-system blocks) File reliability is based on block-level replication Each block of a file is typically replicated across several DataNodes (default replication factor is 3) The NameNode stores metadata, manages replication and provides access to files No data caching (because of the large datasets); data is read/streamed directly from a DataNode to the client
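Block splitting and replication can be quantified with a small helper, a sketch using the defaults from the slide (64 MB blocks, replication factor 3):

```python
import math

BLOCK_MB = 64      # default HDFS block size in CDH 5-era Hadoop
REPLICATION = 3    # default replication factor

def hdfs_footprint(file_mb):
    """Number of blocks a file splits into, and raw storage consumed
    once every block is replicated REPLICATION times."""
    blocks = math.ceil(file_mb / BLOCK_MB)
    raw_mb = file_mb * REPLICATION
    return blocks, raw_mb

blocks, raw = hdfs_footprint(1_000)   # a 1 GB file
print(blocks, raw)  # 16 blocks, 3000 MB of raw storage
```

Note that a small file still occupies one block's worth of NameNode metadata, which is part of why HDFS dislikes lots of small files (see the Conclusions slide).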
HDFS Architecture The NameNode stores HDFS metadata filenames, locations of blocks, file attributes Metadata is kept in RAM for fast lookups The number of files in HDFS is therefore limited by the amount of RAM available on the NameNode HDFS NameNode federation can help with RAM limits: several NameNodes, each of which manages a portion of the file system namespace
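The RAM limit can be estimated with a commonly cited rule of thumb (hedged: the exact figure varies by Hadoop version and object type) that each filesystem object costs on the order of 150 bytes of NameNode heap:

```python
# Rough rule of thumb: ~150 bytes of NameNode heap per HDFS object
# (file, directory, or block). This is an approximation, not a spec.
BYTES_PER_OBJECT = 150

def max_objects(namenode_ram_gb):
    """Approximate number of filesystem objects a NameNode heap can track."""
    return int(namenode_ram_gb * 1024**3 // BYTES_PER_OBJECT)

print(f"{max_objects(64):,}")  # a few hundred million objects on a 64 GB heap
```

This is why millions of small files are far more expensive than the same bytes in a few large files: metadata cost scales with object count, not data size.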
HDFS Architecture A DataNode stores file contents as blocks Different blocks of the same file are typically stored on different DataNodes The same block is typically replicated across several DataNodes for redundancy Each DataNode periodically sends a report of all its blocks to the NameNode DataNodes also send regular heartbeats to the NameNode
HDFS Architecture Built-in protection against DataNode failure If the NameNode does not receive a heartbeat from a DataNode within a certain time period, the DataNode is assumed to be lost When a DataNode fails, block replication is actively maintained The NameNode determines which blocks were on the lost DataNode, finds the remaining copies of those blocks and replicates them to other nodes
High-Availability (HA) Issues: NameNode Failure A NameNode failure corresponds to losing all files on a file system % sudo rm --dont-do-this / For recovery, Hadoop provides two options: backing up the files that make up the persistent state of the file system, or running a secondary NameNode Some more advanced techniques also exist
HA Issues: the Secondary NameNode The secondary NameNode is not a mirrored NameNode; it performs memory-intensive administrative functions The NameNode keeps metadata in memory and writes changes to an edit log The secondary NameNode periodically merges the previous namespace image and the edit log into a new namespace image, preventing the log from growing too large It keeps a copy of the merged namespace image, which can be used in the event of a NameNode failure Recommended to run on a separate machine; requires as much RAM as the primary NameNode
Network Topology HDFS is aware of how close two nodes are in the network From closest to furthest 0: processes on the same node 2: different nodes in the same rack 4: nodes in different racks in the same data center 6: nodes in different data centers
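The distances above can be reproduced with a short sketch, assuming node locations are written as /datacenter/rack/node paths (a common textbook formulation; the actual cluster topology is supplied by configuration):

```python
# HDFS-style network distance: the number of hops from each node
# up to their closest common ancestor in the topology tree.

def distance(a, b):
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    while common < min(len(pa), len(pb)) and pa[common] == pb[common]:
        common += 1
    return (len(pa) - common) + (len(pb) - common)

print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```

This distance metric is what lets clients read from the closest replica and guides the block-placement strategy on the next slide.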
File Block Placement Clients always read from the closest node Default placement strategy First replica on the same node as the client Second replica in a different rack Third replica on a different, randomly selected node in the same rack as the second replica Any additional replicas are placed on random nodes
Balancing Hadoop works best when blocks are evenly spread out Supports DataNodes of different sizes In the optimal case the disk-usage percentage is at approximately the same level on all DataNodes Hadoop provides a balancer daemon that re-distributes blocks It should be run when new DataNodes are added
Accessing Data Data can be accessed using various methods Java API (demo) C API Command line / POSIX (FUSE mount) Command line / HDFS client (demo) HTTP
HDFS URI All HDFS (CLI) commands take path URIs as arguments URI format: scheme://authority/path The scheme and authority are optional; if not specified, the defaults (set in the configuration file) are used
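The scheme://authority/path structure can be inspected with Python's standard urllib. The hostname and port below are invented for illustration; a real cluster's defaults come from its Hadoop configuration.

```python
from urllib.parse import urlparse

# A fully qualified HDFS path URI (hostname and port are made-up examples).
uri = urlparse("hdfs://namenode.example.com:8020/user/alice/data.csv")

print(uri.scheme)   # hdfs
print(uri.netloc)   # namenode.example.com:8020  (the "authority")
print(uri.path)     # /user/alice/data.csv

# A bare path is also a valid argument -- scheme and authority then fall
# back to the defaults from the configuration file:
bare = urlparse("/user/alice/data.csv")
print(bare.scheme or "<default>", bare.path)
```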
Conclusions Pros Support for very large files Designed for streaming data Runs on commodity hardware Cons Not designed for low-latency data access The architecture does not support lots of small files No support for multiple writers or arbitrary file modifications (writes always append to the end of a file)