Hadoop Development for Big Data Solutions: Hands-On

You Will Learn How To:
- Implement Hadoop jobs to extract business value from large and varied data sets
- Write, customize and deploy MapReduce jobs to summarize data
- Develop Hive and Pig queries to simplify data analysis
- Test and debug jobs using MRUnit
- Monitor task execution and cluster health

What is this course about?
The availability of large data sets presents new opportunities and challenges to organizations of all sizes. This course provides the hands-on programming skills needed to process a variety of Big Data efficiently on the Apache Hadoop platform, and to test and deploy Big Data solutions on commodity clusters. It also covers Pig, Hive, HBase and other components of the Hadoop ecosystem, along with the testing, deployment and best practices needed to architect and develop a complete Big Data solution.

Who will benefit from this course?
This course is for developers, architects and testers who want hands-on experience writing code for Hadoop. It can also be helpful to technical managers interested in the development process.

What background do I need?
You should have Java experience at the level of Course 471, Java Programming Introduction: Hands-On, or equivalent experience. Exposure to SQL is helpful.

Will there be any programming in the course?
Yes! Approximately 40 percent of the course time is devoted to hands-on programming.

What tools and platforms are used?
The platform is Java running on Red Hat Linux. The tools used include Eclipse and various text editors.
Which Big Data products does this course use?
The course covers a number of Big Data products, including Apache Hadoop, MapReduce, the Hadoop Distributed File System (HDFS), HBase, Hive and Pig. Additional parts of the Hadoop ecosystem, such as Sqoop, Oozie and MRUnit, are also covered, and other data stores are mentioned for comparison.

What is Big Data?
Big Data refers to data sets that can grow so large, so quickly, that they become unmanageable with conventional tools. The Big Data movement encompasses new tools and ways of storing information that allow efficient processing and analysis for informed business decision-making.

What is Hadoop?
Hadoop is the Apache Software Foundation's open source implementation of MapReduce and the most widely used platform for processing large, complex data sets that would otherwise be intractable by conventional means. It is a high-performance distributed storage and processing system that fills a gap in the market by cost-effectively storing and providing computational capabilities for substantial amounts of data. Commercial support is available from multiple vendors, as are prepackaged cloud solutions.

What is MapReduce?
MapReduce is a parallel programming model for distributed processing of large data sets on a cluster of computers. It was originally implemented by Google as part of its searching and indexing of the Internet, and it has since been widely adopted across many industries.

How are Hadoop programs developed?
Programs are primarily written in Java, although Hadoop also has facilities for handling programs written in other languages such as C++, Python and .NET languages. Programs can also be written in scripting languages such as Pig, and data in HDFS can be queried with a SQL-like syntax using Hive.
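As a small illustration of the Java programming model, here is a minimal sketch of the classic word-count mapper and reducer, written against the org.apache.hadoop.mapreduce API; the class and field names are illustrative only, not course code.

```java
// A minimal word-count sketch, assuming the newer org.apache.hadoop.mapreduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The mapper emits a (word, 1) pair for every token in its input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reducer sums the counts for each word after the shuffle phase.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

The framework shuffles all pairs with the same key to a single reduce call, so each reducer sees every count for one word and simply sums them.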
What are the advantages of using Hadoop?
- It processes and analyzes more data than was previously possible, at a lower cost
- It runs on scalable commodity clusters
- It has self-healing capabilities to survive hardware failures
- It operates on many types of data and adapts to varying degrees of structure
- HDFS automatically provides redundancy for performance and reliability
- Many associated projects enhance the Hadoop ecosystem and ease development

Course Outline

Introduction to Hadoop
- Identifying the business benefits of Hadoop
- Surveying the Hadoop ecosystem
- Selecting a suitable distribution

Parallelizing Program Execution

Meeting the challenges of parallel programming
- Investigating parallelizable challenges: algorithms, data and information exchange
- Estimating the storage and complexity of Big Data

Parallel programming with MapReduce
- Dividing and conquering large-scale problems
- Uncovering jobs suitable for MapReduce
- Solving typical business problems
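To show how such divide-and-conquer work is handed to the cluster, here is a minimal driver sketch that configures and submits the word-count job from the earlier example; Hadoop then divides the input into splits and runs map and reduce tasks in parallel across the nodes. The class name and command-line arguments are illustrative assumptions.

```java
// A minimal driver sketch, assuming the WordCount classes sketched above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Hadoop splits the input, runs one map task per split, then shuffles
        // the intermediate pairs to the reduce tasks.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```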
Implementing Real-World MapReduce Jobs

Applying the Hadoop MapReduce paradigm
- Configuring the development environment
- Exploring the Hadoop distribution
- Creating the components of MapReduce jobs
- Introducing the Hadoop daemons
- Analyzing the stages of MapReduce processing: splitting, mapping, shuffling and reducing

Building complex MapReduce jobs
- Selecting and employing multiple mappers and reducers
- Leveraging built-in mappers, reducers and partitioners
- Coordinating jobs with the Oozie workflow scheduler
- Streaming tasks through various programming languages

Customizing MapReduce

Solving common data manipulation problems
- Executing algorithms: parallel sorts, joins and searches
- Analyzing log files, social media data and e-mails

Implementing partitioners and combiners
- Identifying network-bound, CPU-bound and disk I/O-bound parallel algorithms
- Reducing network traffic with combiners
- Dividing the workload efficiently using partitioners
- Collecting metrics with counters
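The sketch below gives a flavor of these customization topics, again assuming the word-count classes from earlier; the partitioning rule, the counter enum and the class names are illustrative assumptions rather than course code.

```java
// A minimal sketch of job customization: a combiner, a custom partitioner and a counter.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountCustomization {

    // A counter group for job-wide metrics, visible alongside the built-in counters.
    public enum Metrics { EMPTY_LINES }

    // Send words beginning with a-m to reducer 0 and everything else to reducer 1,
    // dividing the workload deterministically by key.
    public static class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String word = key.toString();
            if (numPartitions < 2 || word.isEmpty()) {
                return 0;
            }
            return Character.toLowerCase(word.charAt(0)) <= 'm' ? 0 : 1;
        }
    }

    // Wire the extra components into an existing Job before submitting it.
    public static void customize(Job job) {
        // Reusing the reducer as a combiner cuts shuffle traffic because counts
        // are pre-summed on the map side (safe here since summing is associative).
        job.setCombinerClass(WordCount.SumReducer.class);
        job.setPartitionerClass(AlphabetPartitioner.class);
        job.setNumReduceTasks(2);
    }
}
```

Inside a map() method, a call such as context.getCounter(Metrics.EMPTY_LINES).increment(1) would record how many blank lines were skipped, and the total appears with the job's other counters in the web UI.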
Persisting Big Data with Distributed Data Stores

Making the case for distributed data
- Achieving high-performance data throughput
- Recovering from media failure through redundancy

Interfacing with the Hadoop Distributed File System (HDFS)
- Breaking down the structure and organization of HDFS
- Loading raw data and retrieving results
- Reading and writing data programmatically
- Partitioning text or binary data
- Manipulating Hadoop SequenceFile types

Structuring data with HBase
- Migrating from structured to unstructured storage
- Applying NoSQL concepts with schema on read
- Connecting to HBase from MapReduce jobs
- Comparing HBase to other types of NoSQL data stores

Simplifying Data Analysis with Query Languages

Unleashing the power of SQL with Hive
- Structuring data with the Hive MetaStore
- Extracting, Transforming and Loading (ETL) data
- Querying with HiveQL
- Accessing Hive servers through JDBC
- Extending HiveQL with User-Defined Functions (UDF)
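As a taste of the query-language material, here is a minimal sketch of running a HiveQL query from Java through JDBC. It assumes a HiveServer2 instance listening on localhost:10000 and a hypothetical orders table; the connection URL, credentials, table and column names are illustrative assumptions.

```java
// A minimal Hive-over-JDBC sketch; host, port, table and columns are assumed.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL but is compiled into jobs that run on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT customer_id, SUM(total) AS revenue " +
                     "FROM orders GROUP BY customer_id")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```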
Executing workflows with Pig
- Developing Pig Latin scripts to consolidate workflows
- Integrating Pig queries with Java
- Interacting with data through the grunt console
- Extending Pig with User-Defined Functions (UDF)

Managing and Deploying Big Data Solutions

Testing and debugging Hadoop code
- Logging significant events for auditing and debugging
- Debugging in local mode
- Validating requirements with MRUnit (see the test sketch below)

Deploying, monitoring and tuning performance
- Deploying to a production cluster
- Optimizing performance with administrative tools
- Monitoring job execution through web user interfaces
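To illustrate the testing topic, here is a minimal MRUnit sketch that validates the TokenizerMapper from the earlier word-count example; the test class name and input line are illustrative, and it assumes the MRUnit library built for the newer mapreduce API is on the classpath along with JUnit.

```java
// A minimal MRUnit test sketch for the TokenizerMapper sketched earlier.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TokenizerMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // MapDriver runs the mapper in memory, with no cluster required.
        mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
    }

    @Test
    public void splitsLineIntoWords() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("big data"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .runTest();
    }
}
```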