Data Modeling Considerations in Hadoop and Hive
Technical Paper

Data Modeling Considerations in Hadoop and Hive

Clark Bradley, Ralph Hollinshead, Scott Kraus, Jason Lefler, Roshan Taheri

October 2013
Table of Contents

- Introduction
- Understanding HDFS and Hive
- Project Environment
  - Hardware
  - Software
  - The Hadoop Cluster
  - The Client Server
  - The RDBMS Server
- Data Environment Setup
- Approach for Our Experiments
- Results
  - Experiment 1: Flat File versus Star Schema
  - Experiment 2: Compressed Sequence Files
  - Experiment 3: Indexes
  - Experiment 4: Partitioning
  - Experiment 5: Impala
- Interpretation of Results
- Conclusions
- Appendix: Queries Used in Testing Flat Tables
- References
Introduction

It would be an understatement to say that there is a lot of buzz these days about big data. Because of the proliferation of new data sources such as machine sensor data, medical images, financial data, retail sales data, radio frequency identification, and web tracking data, we are challenged to decipher trends and make sense of data that is orders of magnitude larger than ever before. Almost every day, we see another article on the role that big data plays in improving profitability, increasing productivity, solving difficult scientific questions, as well as many other areas where big data is solving problems and helping us make better decisions.

One of the technologies most often associated with the era of big data is Apache Hadoop. Although there is much technical information about Hadoop, there is not much information about how to effectively structure data in a Hadoop environment. Even though the nature of parallel processing and the MapReduce system provide an optimal environment for processing big data quickly, the structure of the data itself plays a key role. As opposed to relational data modeling, structuring data in the Hadoop Distributed File System (HDFS) is a relatively new domain.

In this paper, we explore the techniques used for data modeling in a Hadoop environment. Specifically, the intent of the experiments described in this paper was to determine the best structure and physical modeling techniques for storing data in a Hadoop cluster using Apache Hive to enable efficient data access. Although other software interacts with Hadoop, our experiments focused on Hive. The Hive infrastructure is most suitable for traditional data warehousing-type applications. We do not cover Apache HBase, another type of Hadoop database, which uses a different style of modeling data and different use cases for accessing the data.
In this paper, we explore a data partition strategy and investigate the role that indexing, data types, file types, and other data architecture decisions play in designing data structures in Hive. To test the different data structures, we focused on typical queries used for analyzing web traffic data. These included web analyses such as counts of visitors, most referring sites, and other typical business questions used with weblog data. The primary measure for selecting the optimal structure for data in Hive is based on the performance of web analysis queries. For comparison purposes, we measured the performance in Hive and the performance in an RDBMS. The reason for this comparison is to better understand how the techniques that we are familiar with using in an RDBMS work in the Hive environment. We explored techniques, such as storing data as a compressed sequence file in Hive, that are particular to the Hive architecture. Through these experiments, we attempted to show that how data is structured (in effect, data modeling) is just as important in a big data environment as it is in the traditional database world.

Understanding HDFS and Hive

Similar to massively parallel processing (MPP) databases, the power of Hadoop is in the parallel access to data that can reside on a single node or on thousands of nodes. In general, MapReduce provides the mechanism that enables access to each of the nodes in the cluster. Within the Hadoop framework, Hive provides the ability to create and query data on a large scale with a familiar SQL-based language called HiveQL. It is important to note that in these experiments, we strictly used Hive within the Hadoop environment. For our tests, we simulated a typical data warehouse-type workload where data is loaded in batch, and then queries are executed to answer strategic (not operational) business questions.
According to the Apache Software Foundation, here is the definition of Hive:

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

To demonstrate how to structure data in Hadoop, our examples used the Hive environment. Using the SAS/ACCESS engine, we were able to run our test queries through the SAS interface, with execution in the Hive environment within our cluster. In addition, we performed a cursory examination of Impala, the SQL on top of Hadoop tool offered by Cloudera. All queries executed through SAS/ACCESS to Hadoop were submitted via the Hive environment and were translated into MapReduce jobs. Although it is beyond the scope of this paper to detail the inner workings of MapReduce, it is important to understand how data is stored in HDFS when using Hive to better understand how we should structure our tables in Hadoop. By gaining some understanding in this area, we are able to appreciate the effect data modeling techniques have in HDFS.

In general, all data stored in HDFS is broken into blocks of data. We used Cloudera's distribution of version 4.2 of Hadoop for these experiments. The default size of each data block in Cloudera 4.2 is 128 MB. As shown in Figure 1, the same blocks of data were replicated across multiple nodes to provide reliability if a node failed, and also to increase the performance during MapReduce jobs. Each block of data is replicated three times by default in the Hadoop environment. The NameNode in the cluster serves as the metadata repository that describes where blocks of data are located for each file stored in HDFS.
Figure 1: HDFS Data Storage[5]
At a higher level, when a table is created through Hive, a directory is created in HDFS on each node that represents the table. Files that contain the data for the table are created on each of the nodes, and the Hive metadata keeps track of where the files that make up each table are located. These files are located in a directory with the name of the table in HDFS, in the /user/hive/warehouse folder by default. For example, in our tests, we created a table named BROWSER_DIM. We can use an HDFS command to see the new table located in the /user/hive/warehouse directory. By using the command hadoop fs -ls, the contents of the browser_dim directory are listed. In this directory, we find a file named browser_dim.csv. HDFS commands are similar to standard Linux commands. By default, Hadoop distributes the contents of the browser_dim table across all of the nodes in the cluster. The hadoop fs -tail command lists the last kilobyte of the file; the sample rows (truncated here) show browser and platform values such as Safari and Macintosh.

The important takeaway is to understand at a high level how data is stored in HDFS and managed in the Hive environment. The physical data modeling experiments that we performed ultimately affect how the data is stored in blocks in HDFS, the nodes where the data is located, and how the data is accessed. This is particularly true for the tests in which we partitioned the data, using partitioning to redistribute the data based on the buckets or ranges defined in the partitions.

Project Environment

Hardware

The project hardware was designed to emulate a small-scale cluster for testing purposes, not a large-scale production environment. Our blades had only two CPUs each. Normally, cluster nodes have more. However, the size of the cluster and the data that we used are large enough to make conclusions about physical data modeling techniques.
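As a minimal sketch of the same inspection done from inside the Hive CLI (which passes dfs commands straight to HDFS), assuming the default warehouse path described above; the file name is illustrative:

```sql
-- Run from the Hive CLI; equivalent to the hadoop fs commands above.
-- List the files that back the browser_dim table.
dfs -ls /user/hive/warehouse/browser_dim;

-- Show the last kilobyte of one backing file (file name illustrative).
dfs -tail /user/hive/warehouse/browser_dim/browser_dim.csv;
```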
As shown in Figure 2, our hardware configuration was as follows:

Overall hardware configuration:
- 1 Dell M1000e server rack
- 10 Dell M610 blades
- Juniper EX GbE switch
Blade configuration:
- Intel Xeon X GHz processor
- Dell PERC H700 Integrated RAID controller
- Disk size: 543 GB
- FreeBSD iSCSI Initiator driver
- HP P2000 G3 iSCSI dual controller
- Memory: 94.4 GB
- Linux

Figure 2: The Project Hardware Environment

Software

The project software created a small-scale Hadoop cluster and included a standard RDBMS server and a client server with release 9.3 of Base SAS software and supporting software.
The project software included the following components:
- CDH (Cloudera's Distribution Including Apache Hadoop) version 4.2
  - Apache Hadoop
  - Apache Hive
  - HUE (Hadoop User Experience)
  - Impala 1.0
  - Apache MapReduce
  - Apache Oozie
  - Apache ZooKeeper
- Apache Sqoop
- Base SAS 9.3
- A major relational database
Figure 3: The HDFS Architecture

The Hadoop Cluster

The Hadoop cluster can logically be divided into two areas: HDFS, which stores the data, and MapReduce, which processes all of the computations on the data (with the exception of a few tests where we used Impala). The NameNodes on Nodes 1 and 2 and the JobTracker on Node 1 (in the next figure) serve as the master nodes. The other six nodes act as slaves.[1]

Figure 3 shows the daemon processes of the HDFS architecture, which consist of two NameNodes, seven DataNodes, two Failover Controllers, three JournalNodes, one HTTP FS, one Balancer, and one Hive Metastore. The NameNode located on blade Node 1 is designated as the active NameNode. The NameNode on Node 2 serves as the standby. Only one NameNode can be active at a time. It is responsible for controlling the data storage for the cluster. When the NameNode on Node 2 is active, the DataNode on Node 2 is disabled in accordance with accepted HDFS procedure. The DataNodes act as instructed by the active NameNode to coordinate the storage of data. The Failover Controllers are daemons that monitor the NameNodes in a high-availability environment. They are responsible for updating the ZooKeeper session information and initiating state transitions if the health of the associated NameNode wavers.[2] The
JournalNodes are written to by the active NameNode whenever it performs any modifications in the cluster, so the standby NameNode has access to all of the modifications if it needs to transition to an active state.[3] The HTTP FS provides the interface between the operating system on the server and HDFS.[4] The Balancer utility distributes the data blocks across the nodes evenly.[5] The Hive Metastore contains the information about the Hive tables and partitions in the cluster.[6]

Figure 4 depicts the system's MapReduce architecture. The JobTracker is responsible for controlling the parallel processing of the MapReduce functionality. The TaskTrackers act as instructed by the JobTracker to process the MapReduce jobs.[1]
Figure 4: The MapReduce Architecture

The Client Server

The client server (Node 9, not pictured) had Base SAS 9.3, Hive 0.8.0, a Hadoop client, and a standard RDBMS client installed. The SAS installation included Base SAS software and SAS/ACCESS products.

The RDBMS Server

A relational database was installed on Node 10 (not pictured) and was used for comparison purposes in our experiments.

Data Environment Setup

The data for our experiments was generated to resemble a technical company's support website. The company sells its products worldwide and uses Unicode to support foreign character sets. We created 25 million original weblog sessions featuring 90 million clicks, and then duplicated the data 90 times by adding unique session identifiers to each row. This bulked-up flat file was loaded into the RDBMS and Hadoop via SAS/ACCESS and Sqoop. For our tests, we needed both a flat file representation of the data and a typical star schema design of the same data. Figure 5 shows the data in the flat file representation.
Figure 5: The Flat File Representation

A star schema was created to emulate the standard data mart architecture. Its tables are depicted in Figure 6.

Figure 6: The Entity-Relationship Model for the Star Schema
To load the fact and dimension tables in the star schema, surrogate keys were generated and added to the flat file data in SAS before loading the star schema tables in the RDBMS and Hadoop. The dimension tables and the PAGE_CLICK_FACT table were created through a SAS program and loaded directly into the RDBMS through the SAS/ACCESS engine. The surrogate keys from the dimension tables were added to the PAGE_CLICK_FACT table via SQL in the RDBMS. The star schema tables were then loaded directly from the RDBMS into Hadoop using the Sqoop tool. The entire process for loading the data in both star schemas is illustrated in Figure 7.
Figure 7: The Data Load Process (SAS Data Collection, Data Duplication, Bulked-Up Data, SAS/ACCESS to RDBMS Flat File, Apache Sqoop, Dimension Table Data Sets Created with Surrogate Keys, Star Schema Data Sets, SAS/ACCESS to RDBMS Star Schema, Apache Sqoop)

As a side note, we uncovered a quirk that occurs when loading data from an RDBMS to HDFS or vice versa through Hive. Hive uses the Ctrl+A ASCII control character (also known as the start of heading or SOH control character in Unicode) as its default delimiter when creating a table. Our data had ^A sprinkled in the text fields. When we used the Hive default delimiter, Hive was not able to tell where a column started and ended due to the dirty data. All of our data loaded, but when we queried the data, we discovered the issue. To fix this, we redefined the delimiter. The takeaway is that you need to be data-aware before choosing delimiters to load data into Hadoop using the Sqoop utility.

Once the data was loaded, the number of rows in each table was observed, as shown in Figure 8.

Figure 8: Table Row Counts

Table Name        Rows
PAGE_CLICK_FACT   1.45 billion
PAGE_DIM          2.23 million
REFERRER_DIM      million
BROWSER_DIM       thousand
STATUS_CODE       70
PAGE_CLICK_FLAT   1.45 billion
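To make the fix concrete, here is a hedged sketch of the kind of DDL we are describing. The column list is abbreviated and illustrative, not the actual BROWSER_DIM definition; the point is the ROW FORMAT clause that replaces the default Ctrl+A delimiter:

```sql
-- Redefine the field delimiter at table-creation time so that stray ^A
-- (\001) characters inside text fields cannot be mistaken for column
-- boundaries. Columns shown are illustrative.
CREATE TABLE browser_dim (
  browser_sk        BIGINT,
  browser_nm        STRING,
  flash_enabled_flg STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'    -- tab instead of Hive's default '\001'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```

The Sqoop side has to agree with the table definition, for example by passing --fields-terminated-by '\t' on the sqoop import command line.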
In terms of the actual size of the data, we compared the size of the fact tables and flat tables in both the RDBMS and Hadoop environments. Because our experiments tested both the text file version of the Hive tables and the compressed sequence file version, we measured the size of the compressed version of the tables as well. Figure 9 shows the resulting sizes of these tables.

Figure 9: Table Sizes

Table Name        RDBMS    Hadoop (Text File)    Hadoop (Compressed Sequence File)
PAGE_CLICK_FACT   GB       GB                    GB
PAGE_CLICK_FLAT   GB       GB                    GB

Approach for Our Experiments

To test the various data modeling techniques, we wrote queries to simulate the typical types of questions business users might ask of clickstream data. The full SQL queries are available in the Appendix. Here are the questions that each query answers:

1. What are the most visited top-level directories on the customer support website for a given week and year?
2. What are the most visited pages that are referred from a Google search for a given month?
3. What are the most common search terms used on the customer support website for a given year?
4. What is the total number of visitors per page using the Safari browser?
5. How many visitors spend more than 10 seconds viewing each page for a given week and year?

As part of the criteria for the project, the SQL statements were used to determine the optimal structure for storing the clickstream data in Hadoop and in an RDBMS. We investigated techniques in Hive to improve the performance of the queries. The intent of these experiments was to investigate how traditional data modeling techniques apply to the Hadoop and Hive environment. We included an RDBMS only to measure the effect of tuning techniques within the Hadoop and Hive environment and to see how comparable techniques work in an RDBMS. It is important to note that there was no intent to compare the performance of the RDBMS to the Hadoop and Hive environment, and the results apply to our particular hardware and software environment only.
To determine the optimal design for our data architecture, we had the following criteria:

- There would be no unnecessary duplication of data. For example, we did not want to create two different flat files tuned for different queries.
- The data structures would be progressively tuned to get the best overall performance for the average of most of the queries, not just for a single query.

We began our experiments without indexes, partitions, or statistics in both schemas and in both environments. The intent of the first experiment was to determine whether a star schema or flat table performed better in Hive or in the RDBMS for our queries. During subsequent rounds of testing, we used compression and added indexes and partitions to tune the data
structures. As a final test, we ran the same queries against our final data structures using Impala. Impala bypasses the MapReduce layer used by Hive. The queries were run using Base SAS on the client node with an explicit SQL pass-through for both environments. All queries were run three times in a quiet environment to obtain accurate performance information. We captured the timings through the SAS logs for the Hive and RDBMS tests. Client session timing was captured in Impala for the Impala tests because SAS does not currently support Impala.

Results

Experiment 1: Flat File versus Star Schema

The intent of this first experiment was to determine whether the star schema or flat table structure performed better in each environment in a series of use cases. The tables in this first experiment did not have any tuning applied, such as indexing. We used standard text files for the tables.

Results for Experiment 1

Hadoop Flat File (MM:SS)
Query   Min. Time   Max. Time   Average
1       51:42       52:13       52:00
2       :55         50:46       50:36
3       :53         56:36       55:54
4       :37         52:37       51:28
5       :43         50:25       50:00

Hadoop Star Schema (MM:SS)
Query   Min. Time   Max. Time   Average
1       09:40       11:22       10:33
2       :08         09:57       09:35
3       :53         55:37       52:46
4       :04         15:14       14:33
5       9:57        10:32       10:13

RDBMS Flat File (H:MM:SS)
Query   Min. Time   Max. Time   Average
1       1:04:35     1:17:49     1:09:13
2       1:09:26     1:09:52     1:09:35
3       1:08:14     1:08:53     1:08:32
4       1:06:22     1:07:44     1:07:06
5       1:04:20     1:04:57     1:04:31

RDBMS Star Schema (MM:SS)
Query   Min. Time   Max. Time   Average
1       33:03       33:41       33:26
2       :19         33:35       33:28
3       :28         34:27       34:02
4       :58         33:09       33:03
5       :00         33:56       33:35
Analysis of Experiment 1

Hadoop Schema Difference
Query   Flat File Average   Star Schema Average   Improvement (Flat to Star)
1       52:00               10:33                 42:
2       50:36               09:35                 41:01
3       55:54               52:46                 03:08
4       51:28               14:33                 36:55
5       50:00               10:33                 39:

RDBMS Schema Difference
Query   Flat File Average (H:MM:SS)   Star Schema Average (MM:SS)   Improvement (Flat to Star)
1       1:09:13                       33:26                         35:47
2       1:09:35                       33:28                         36:07
3       1:08:32                       34:02                         34:30
4       1:07:06                       33:03                         34:03
5       1:04:31                       33:35                         30:56
As you can see, both the Hive table and the RDBMS table in the star schema structure performed significantly faster compared to the flat file structure. These results for Hive were surprising, given the more efficient practice in HDFS of storing data in a denormalized structure to optimize I/O. Although the star schema was faster in the text file environment, we decided to complete the remaining experiments for Hadoop using the flat file structure because it is the more efficient data structure for Hadoop and Hive. The book Programming Hive says, "The primary reason to avoid normalization is to minimize disk seeks, such as those typically required to navigate foreign key relations. Denormalizing data permits it to be scanned from or written to large, contiguous sections of disk drives, which optimizes I/O performance. However, you pay the penalty of denormalization, data duplication and the greater risk of inconsistent data."[8]

Experiment 2: Compressed Sequence Files

The second experiment applied only to the Hive environment. In this experiment, the data in HDFS was converted from uncompressed text files to compressed sequence files to determine whether the type of file for the table in HDFS made a difference in query performance.

Results for Experiment 2

Hadoop Sequence File (MM:SS)
Query   Min. Time   Max. Time   Average
1       04:44       04:48       04:47
2       :27         04:41       05:34
3       :51         06:04       05:57
4       :35         05:47       05:40
5       :30         05:40       05:35
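The conversion from text file to compressed sequence file can be sketched in HiveQL as follows. This is a hedged sketch, not the paper's exact script: the target table name is illustrative, and the property names are the MR1-era settings that shipped with CDH4; the choice of block compression with SnappyCodec follows what the interpretation section reports.

```sql
-- Request block-compressed output for the rewrite job.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- CTAS writes the query result using the requested file format,
-- producing a compressed sequence file copy of the text-file table.
CREATE TABLE page_click_flat_seq
STORED AS SEQUENCEFILE
AS SELECT * FROM page_click_flat;
```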
Hadoop Schema Difference
Query   Text File Average   Sequence File Average   Improvement (Text to Sequence)
1       52:00               04:47                   47:13
2       50:36               05:34                   45:02
3       55:54               05:57                   49:57
4       51:28               05:40                   45:48
5       50:00               05:35                   44:25

The results of this experiment clearly show that the compressed sequence file was a much better file format for our queries than the uncompressed text file.

Experiment 3: Indexes

In this experiment, indexes were applied to the appropriate columns in the Hive flat table and in the RDBMS fact table. Statistics were gathered for the fourth set of tests. In Hive, a B-tree index was added to each of the six columns (BROWSER_NM, DETAIL_TM, DOMAIN_NM, FLASH_ENABLED_FLG, QUERY_STRING_TXT, and REFERRER_DOMAIN_NM) used in the queries. In the RDBMS, a bitmap index was added to each foreign key in the PAGE_CLICK_FACT table, and a B-tree index was added to each of the five columns (DOMAIN_NM, FLASH_ENABLED_FLG, REFERRER_DOMAIN_NM, QUERY_STRING_TXT, and SECONDS_SPENT_ON_PAGE_CNT) used in the queries that were not already indexed.

Results for Experiment 3

Hadoop Flat File (MM:SS)
Query   Min. Time   Max. Time   Average
1       01:17       01:28       01:22
2       :25         01:33       01:29
3       :55         06:03       05:59
4       :32         01:37       01:34
5       :42         04:45       04:43
RDBMS Star Schema (MM:SS)
Query   Min. Time   Max. Time   Average
1       00:04       00:04       00:04
2       :25         01:01       00:39
3       :25         00:43       00:
4       :07         00:07       00:07
5       :25         00:31       00:27

Hadoop Schema Difference
Query   No Indexes Average   Indexed Average   Improvement (No Indexes to Indexed)
1       04:47                01:22             03:25
2       05:34                01:29             04:05
3       05:57                05:59             (00:02)
4       05:40                01:34             04:06
5       05:35                04:43             00:52

RDBMS Schema Difference
Query   No Indexes Average   Indexed Average   Improvement (No Indexes to Indexed)
1       33:26                00:04             33:22
2       33:28                00:39             32:49
3       34:02                26:29             07:33
4       33:03                00:07             32:56
5       33:35                00:27             33:08
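For reference, Hive 0.x index DDL looks like the following hedged sketch for one of the six columns. The index name is illustrative, and the handler class is an assumption on our part: Hive's bundled handler at this release was the compact index, so the exact implementation behind the paper's "B-tree" description may differ.

```sql
-- Create an index on one of the query columns; nothing is built until
-- the explicit REBUILD step runs its MapReduce job.
CREATE INDEX idx_browser_nm
ON TABLE page_click_flat_seq (browser_nm)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Populate (and later refresh) the index data.
ALTER INDEX idx_browser_nm ON page_click_flat_seq REBUILD;
```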
Analysis of Experiment 3: With the notable exception of the third query in the Hadoop environment, adding indexes provided a significant increase in performance across all of the queries.

Experiment 4: Partitioning

In experiment 4, we added partitioning on the DETAIL_DT column to both the flat table in Hive and the fact table in the star schema in the RDBMS. A partition was created for every date value.

Results for Experiment 4

Hadoop Flat File (MM:SS)
Query   Min. Time   Max. Time   Average
1       00:50       01:00       00:55
2       :04         01:10       01:06
3       :42         07:41       07:04
4       :07         01:13       01:09
5       :25         02:28       02:26

RDBMS Star Schema (MM:SS)
Query   Min. Time   Max. Time   Average
1       00:01       00:03       00:02
2       :02         00:06       00:04
3       :33         45:40       45:32
4       :02         00:46       00:17
5       :01         00:03       00:02
Hadoop Schema Difference
Query   No Partitions Average   Partitioned Average   Improvement (No Partitions to Partitioned)
1       01:22                   00:55                 00:27
2       01:29                   01:06                 00:23
3       05:59                   07:04                 (01:05)
4       01:34                   01:09                 00:25
5       04:43                   02:26                 02:17

RDBMS Schema Difference
Query   No Partitions Average   Partitioned Average   Improvement (No Partitions to Partitioned)
1       26:01                   00:02                 25:59
2       00:39                   00:04                 00:35
3       00:31                   45:32                 (45:01)
4       17:28                   00:17                 17:11
5       00:27                   00:02                 00:25
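The partitioning step can be sketched in HiveQL as follows. The column list is abbreviated, and the dynamic-partition settings and the to_date() derivation are our assumptions, since the paper states only that the table was partitioned by every DETAIL_DT date value:

```sql
-- Allow Hive to create one partition per distinct detail_dt value.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE page_click_flat_part (
  visitor_id     STRING,
  requested_file STRING,
  detail_tm      STRING
  -- ...remaining columns omitted for brevity
)
PARTITIONED BY (detail_dt STRING)
STORED AS SEQUENCEFILE;

-- Each distinct detail_dt lands in its own HDFS subdirectory under the
-- table directory, e.g. .../page_click_flat_part/detail_dt=2012-11-26/
INSERT OVERWRITE TABLE page_click_flat_part PARTITION (detail_dt)
SELECT visitor_id, requested_file, detail_tm,
       to_date(detail_tm) AS detail_dt
FROM page_click_flat_seq;
```

A query whose WHERE clause names detail_dt then touches only the matching subdirectories (partition elimination); a query without it, like query 3, must open every partition.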
Partitioning significantly improved all queries except for the third query. Query 3 was slightly slower in Hive and significantly slower in the RDBMS.

Experiment 5: Impala

In experiment 5, we ran the queries using Impala on the Hive compressed sequence file table with compression and indexes. Impala bypasses MapReduce to use its own distributed query access engine.

Results for Experiment 5

Impala Flat File (MM:SS)
Query   Time
1       00:10
2       00:04
3       15:48
4       00:04
5       06:46

Analysis of Experiment 5

Query   Hive Average   Impala Time   Improvement (Hive to Impala)
1       01:22          00:10         01:12
2       01:29          00:04         01:25
3       05:59          15:48         (09:49)
4       01:34          00:04         01:30
5       04:43          06:46         (02:03)
The results for Impala were mixed. Three queries ran significantly faster, and two queries ran longer.

Interpretation of Results

The results of the first experiment were surprising. When we began the tests, we fully expected that the flat file structure would perform better than the star schema structure in the Hadoop and Hive environment. In the following table, we provide information that helps explain the differences in the amounts of time spent processing the queries. For example, the amount of memory required is significantly higher in the flat table structure for the query. Moreover, the number of mappers and reducers needed to run the query was significantly higher for the flat table structure. Altering system settings, such as TaskTracker heap sizes, showed benefits in the denormalized table structure. However, the goal of the experiment was to work with the default system settings in Cloudera 4.2 and investigate the effects of structural changes on the data.

Unique Visitors per Page for Safari
                     DENORMALIZED               NORMALIZED                 DIFF
Virtual Memory (GB)  7,452                      2,927                      4,525
Heap (GB)            3,912                      1,512                      2,400
Read (GB)
Table Size (GB)      1,
Execution Plan       3967 maps / 999 reducers   1279 maps / 352 reducers
Time (minutes)

Our second experiment showed the performance increase that emerged from transitioning from text files to sequence files in Hive. This performance improvement was expected. However, the magnitude of the improvement was not. The queries ran about ten times faster when the data was stored in compressed sequence files than when the data was stored in uncompressed text files. The compressed sequence file optimizes disk space usage and I/O bandwidth performance by using binary encoding and splittable compression. This proved to be the single biggest factor with regard to data structures in Hive. For this experiment, we used block compression with SnappyCodec.
In our third experiment, we added indexes to the fact table in the RDBMS and to the flat table in Hive. As expected, the indexes generally improved the performance of the queries. The one exception was the third query, where adding the indexes did not show any improvement in Hive. The Hive explain plan helps explain why this is happening. In the highlighted section of the explain plan, we see that no indexes are used in the predicate of the query. Given the characteristics of the data, this makes sense because almost all of the values of DOMAIN_NM were the support site itself, and the referring domain was predominantly a single domain as well.

Hive Explain Plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-3 depends on stages: Stage-2
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        visits_pp_ph_summary:page_click_flat_seq_ns
          TableScan
            alias: page_click_flat_seq_ns
            Filter Operator
              predicate:
                expr: ((domain_nm = 'support.foo.com') and (referrer_domain_nm = '
                type: boolean

Experiment 4 added partitioning to the table structures. Both the fact table and the flat table were partitioned by date on the DETAIL_DT column. Partitioning tables changes how Hive structures the data storage. In addition to the directory for each table, Hive creates subdirectories reflecting the partitioning structure. When a query is executed, Hive does not need to scan the entire table directory. Rather, partition elimination enables the query to go directly to the subdirectory or subdirectories where the data is located to retrieve the results. Because many of our queries used DETAIL_DT in the WHERE clause, execution time improved. The same improvement was seen in the RDBMS, which was able to use partition elimination for most of the queries. In the case of the third query, the predicate does not include DETAIL_DT.
In this case, having partitions actually hurt query performance because the query needed to examine each partition individually to locate the relevant rows. The decrease in performance was significant in the RDBMS.

Experiment 5 explored Impala and gauged its performance. Impala is a query engine that provides more SQL functionality in the Hive environment. Impala does not use MapReduce to process the data. Rather, it provides direct access to the data in HDFS through its own proprietary engine. Overall, three of the queries ran significantly faster in Impala, and two of the queries ran slower. Interestingly, in our query for the top referrers, we needed to add a LIMIT clause following the ORDER BY clause because this is currently a requirement for Impala queries. Similar to the issue with query 3 in the Hive environment, query 3 was slow because a full table scan of all of the partitions was required to retrieve the data.
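The Impala adjustment we mention can be sketched as follows (column names follow the flat table; the LIMIT value is arbitrary). In Impala 1.0, an ORDER BY was rejected unless paired with a LIMIT, because each node sorts only its own fragment and the coordinator needs a bound on what it must merge:

```sql
-- Top-referrers pattern rewritten for Impala 1.0: the LIMIT clause is
-- mandatory whenever ORDER BY is present.
SELECT referrer_domain_nm, COUNT(*) AS visits
FROM page_click_flat_seq
GROUP BY referrer_domain_nm
ORDER BY visits DESC
LIMIT 100;
```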
Conclusions

Through our experiments, we have shown that structuring data properly in Hive is as important as it is in an RDBMS. The decision to store data in a sequence file format alone accounted for a performance improvement of more than 1,000%. The judicious use of indexes and partitions resulted in significant performance gains by reducing the amount of data processed. For data architects working in the Hive environment, the good news is that many of the same techniques used in a traditional RDBMS environment, such as indexing, are applicable. For those of us who are familiar with MPP databases, the concept of data partitioned across nodes is very familiar.

The key takeaway is that we need to understand our data and the underlying technology in Hadoop to effectively tune our data structures. Simply creating a flat table or star schema does not result in optimized structures. We need to understand how our data is distributed, and we need to create data structures that work well for the access patterns of our environment. Being able to decipher MapReduce job logs and run explain plans are key skills for effectively modeling data in Hive. We also need to be aware that tuning for some queries might have an adverse impact on other queries, as we saw with partitioning.

Future experimentation should look into the performance enhancements offered by other Hive file formats, such as RCFile, which organizes data by column rather than by row. Another data modeling test could examine how well collection data types in Hive work compared to traditional data types for storing data. As big data technology continues to advance, the features that are available for structuring data will continue to improve, and further options for improving data structures will become available.
Appendix: Queries Used in Testing Flat Tables

1.
select top_directory, count(*) as unique_visits
from (select distinct visitor_id,
             split(requested_file, '[\\/]')[1] as top_directory
      from page_click_flat
      where domain_nm = 'support.sas.com'
        and flash_enabled = '1'
        and weekofyear(detail_tm) = 48
        and year(detail_tm) = 2012
     ) directory_summary
group by top_directory
order by unique_visits;

2.
select domain_nm, requested_file, count(*) as unique_visitors, month
from (select distinct domain_nm, requested_file, visitor_id,
             month(detail_tm) as month
      from page_click_flat
      where domain_nm = 'support.sas.com'
        and referrer_domain_nm = '
     ) visits_pp_ph_summary
group by domain_nm, requested_file, month
order by domain_nm, requested_file, unique_visitors desc, month asc;

3.
select query_string_txt, count(*) as count
from page_click_flat
where query_string_txt <> ''
  and domain_nm = 'support.sas.com'
  and year(detail_tm) = '2012'
group by query_string_txt
order by count desc;

4.
select domain_nm, requested_file, count(*) as unique_visitors
from (select distinct domain_nm, requested_file, visitor_id
      from page_click_flat
      where domain_nm = 'support.sas.com'
        and browser_nm like '%Safari%'
        and weekofyear(detail_tm) = 48
        and year(detail_tm) = 2012
     ) uv_summary
group by domain_nm, requested_file
order by unique_visitors desc;

5.
select domain_nm, requested_file, count(*) as unique_visits
from (select distinct domain_nm, requested_file, visitor_id
      from page_click_flat
      where domain_nm = 'support.sas.com'
        and weekofyear(detail_tm) = 48
        and year(detail_tm) = 2012
        and seconds_spent_on_page_cnt > 10
     ) visits_summary
group by domain_nm, requested_file
order by unique_visits desc;
References

[1] "Understanding Hadoop Clusters and the Network." Available online. Accessed June 1, 2013.
[2] Sammer, E. 2012. Hadoop Operations. Sebastopol, CA: O'Reilly Media.
[3] "HDFS High Availability Using the Quorum Journal Manager." Apache Software Foundation. Available online. Accessed June 5, 2013.
[4] "HDFS over HTTP - Documentation Sets (alpha)." Apache Software Foundation. Available online. Accessed June 5, 2013.
[5] "Yahoo! Hadoop Tutorial." Yahoo! Developer Network. Available online. Accessed June 4, 2013.
[6] "Configuring the Hive Metastore." Cloudera, Inc. Available online. Accessed June 18, 2013.
[7] Kestelyn, J. "Introducing Parquet: Efficient Columnar Storage for Apache Hadoop." Available online. Accessed August 2, 2013.
[8] Capriolo, E., D. Wampler, and J. Rutherglen. 2012. Programming Hive. Sebastopol, CA: O'Reilly Media.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright © 2013 SAS Institute Inc., Cary, NC, USA. All rights reserved.