The Fast-Track to Hands-On Understanding of Big Data Technology
Stream Social Media to Hadoop & Create Reports in Less Than a Day (Part 2)

COLLABORATIVE WHITEPAPER SERIES
Big Data might be intimidating even to the most seasoned IT professional. It's not simply the charged nature of the term "Big" that is ominous; the underlying technology is also app-centric in a very open-source way. If you are like most professionals, you don't have a working knowledge of MapReduce, JSON, Hive, or Flume, and diving into the deep end of the Big Data technology pool may seem like a time-consuming process. Even if you possess these skill sets, the prospect of launching a Hadoop environment and deploying an application that streams Twitter data into it in a way that is accessible through standard ODBC tools would seem like a task measured in weeks, not days.

It may surprise most people looking to get hands-on with Big Data technology that each of us can do so in short order: with the right approach, you can stream live social data to your own Hadoop cluster and report on the information through Excel in less than one day. This whitepaper series takes an instructive, fast-track approach to creating your personal Big Data lab environment powered by Apache Hadoop. The first part in the series engaged IT professionals with a passing interest in Big Data by providing them with:

- Reasons to explore the world of Big Data, and the Big Data skills gap.
- A practical, lightweight approach to getting hands-on with Big Data technology.

This second part will:

- Describe the use case and the supporting technical components in more detail.
- Provide step-by-step instructions for setting up the lab environment, and direct individuals to Cloudera's streaming Twitter agent tutorial.

We will enhance Cloudera's tutorial in the following ways:

- Make the tutorial real-time.
- Provide steps to establish ODBC connectivity and to execute Cloudera's sample queries in Excel.
- Configure and register libraries at an overall environment level.
- Provide sample code and troubleshooting tips.
I. A reason to explore the universe of Big Data options before taking the first step

For those of us with day jobs, there aren't enough hours in the day to invest in dissecting the various Big Data technology players, or in building open source components that won't necessarily prove anything from a technology or business point of view. Fortunately, the following game plan provides a universal use case as a starting point, and practical ways to get a lab environment running, so that you can mobilize business sponsors and technical staff around Big Data capabilities in less than eight hours.

Before beginning this exercise, the first question an IT professional may ask is: why would one care to explore the universe of Big Data? The fact is that the universe of data is expanding at an accelerating rate, and increasingly the growth is driven by sources of unstructured or machine-generated Big Data (e.g. social media, blogs, the Internet of Things). The latest IDC Digital Universe Study reveals an explosion of stored information: more than 2.8 zettabytes, or roughly 3 billion terabytes, of information was created and replicated in 2012 alone. To put this number in perspective, 2.8 x 10^9 terabytes spread over the roughly 3.15 x 10^7 seconds in a year means that about 90 terabytes of information was produced every second over the course of the year.

Figure 1: Why choose Hadoop

Organizations are increasingly aware that this unrefined data represents an opportunity to gain valuable insight into areas such as ongoing clinical research and financial risk monitoring. From a practical perspective, this means that business people will be asking IT for answers to questions that can be supported by sources of Big Data, and in some cases processed by Big Data technologies. A recent Harvard Business Review survey suggests that 85 percent of organizations had funded Big Data initiatives in process or in the planning stage, but the survey also reveals a severe gap in analytical skills: 70 percent of respondents describe finding qualified data scientists as challenging to very difficult [1]. Thus, Big Data introduces an opportunity to the business, but exposes a skills and technology gap for IT. This gap must be filled quickly, otherwise businesses will find themselves at a competitive disadvantage, and IT's ability to support the business will be questioned.

II. The right approach

If you are convinced that an understanding of Big Data is important to your business and IT initiatives, in most cases you need to formulate a practical, low-cost, and ultimately relevant approach to understanding the technology and the conventional use cases that resonate with the business. After all, IT resources are stretched thin, and many of us in IT who are new to the world of Big Data could spend weeks getting up to speed on the various options.
Figure 2: Why CDH was used

Consideration: Building the Hadoop environment from scratch as opposed to using CDH.
Direction and Rationale: We considered building our Hadoop environment from scratch through the Apache Hadoop projects. If your learning objectives include understanding what it takes to ensure compatibility of each Hadoop project, or if you need to tweak the source code, then you should include this step in the approach. Given the time commitment, it seemed more useful to take an existing distribution that ensured interoperability and compatibility of the projects.

Consideration: Using CDH over HortonWorks or MapR.
Direction and Rationale: Alternatives from HortonWorks and MapR were considered, specifically Microsoft's HDInsight distribution that uses HortonWorks. Ultimately, Cloudera's software and support resources and its Twitter feed example are used here, all of which are available for download and general use. Cloudera also provides VM images with a free edition of Cloudera Manager and Hadoop, including every Apache Hadoop project required by this scenario.

Consideration: Deploying Hadoop in the cloud.
Direction and Rationale: Deploying the environment to the cloud was considered, and in some cases it may be preferred. An instance of Microsoft HDInsight running in Azure was used, and would have been pursued at greater length, but unfortunately the lease on the Azure instance expired and inquiries on how to extend the lease went unanswered.

For learning purposes, it makes sense to pursue a fairly common scenario across industries. In our case, we will stream social media (specifically tweets from Twitter) to a lab environment, and then report on the data through everyone's favorite BI tool, Excel. The use case is described in more detail later on. From a technical training perspective, the approach relies heavily on Apache Hadoop. Figure 1 lists the reasons why Hadoop is the preferred platform for learning Big Data and for implementing this scenario.

There are many ways to deploy Apache Hadoop. Our example relies on Cloudera's distribution of Apache Hadoop (CDH) running in a Linux VM image. Figure 2 lists key considerations and why CDH was used. The CDH stack (Figure 3) summarizes the core projects included with CDH, and the projects relevant to the use case are captioned [2].

It should be noted that this is a learning exercise, not a performance benchmark; a single-node Hadoop cluster running inside a Linux VM is deemed sufficient for those of us wanting to learn Hadoop. If performance tuning is crucial to your learning objectives, then a more robust environment would be required. The business use case would still be relevant, since streaming live social data will generate millions of transactions, depending on the keywords you have specified. The specifications of the VM image and the host machine are listed in the Appendix.
Figure 3: Cloudera's distribution including Apache Hadoop (CDH) [3]

The CDH stack (100% open source) spans: Whirr (cloud); Hue (UI) and Oozie (workflow); Sqoop (access integration); Hive, Pig, Mahout, and DataFu (batch processing); Flume, Fuse-DFS, and WebHDFS/HttpFS (real-time access); MapReduce and MapReduce2 (batch compute); YARN and ZooKeeper (resource management and coordination); Impala (SQL); and HDFS and HBase (storage), with a Metastore capturing metadata. The projects most relevant to the use case are described below.

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of datasets stored in HDFS. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL.

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Coordinator jobs are recurrent Oozie workflow jobs triggered by time (frequency) and data availability.

The Metastore captures the Hive metadata, transparent to the overall application.

ODBC and JDBC connectivity is supported for interfacing with other platforms such as databases, BI, and data integration tools.

MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.

ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. All of these kinds of services are used in some form or another by distributed applications.

Finally, there are ways to make this use case more comprehensive. For instance, once the streaming data is captured, you could use Apache Mahout to cluster and classify it using the various algorithms available. Since much of the classification would depend on business input, it seems reasonable to take a first iteration through the use case as presented, and then proceed with next steps in concert with more involvement or direction from the business.

III. Use case and supporting Hadoop components

Streaming social media data is a fairly common use case for Big Data and applies across industries.
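The raw input throughout this use case is the JSON document Twitter emits for each tweet. For reference, the following is a heavily abridged sketch showing only the nested fields used by the queries later in this paper; the field values are illustrative, and real tweets carry dozens more fields:

{
  "created_at": "Mon Jan 07 08:15:00 +0000 2013",
  "text": "Learning #hadoop in my own lab today",
  "retweet_count": 2,
  "entities": { "hashtags": [ { "text": "hadoop" } ] },
  "user": { "screen_name": "some_user", "time_zone": "Eastern Time (US & Canada)" }
}

Hive's JSON SerDe exposes these nested objects as STRUCT and ARRAY columns, which is why the queries in the appendix can reference fields like user.time_zone and entities.hashtags directly.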
Figure 4: Streaming social media use case and supporting technical components

1. Twitter users generate tweets. Twitter users tweet about various topics. In this example, we want to capture in real time any tweet related to Big Data keywords such as hadoop, analytics, big data, and mapreduce.

2. Flume organizes keywords. A Flume agent runs on our CDH VM. It streams any tweet containing the Big Data keywords listed above to HDFS, placing the tweets in a folder structure organized by year, month, day, and hour, and creating a new subfolder as it rolls forward each time interval.

3. Oozie process runs Hive script. An Oozie workflow runs in the background; once a new subdirectory is created by the Flume agent, the workflow executes a Hive Q script that adds a new hourly partition to the tweets table corresponding to the year, month, day, and hour of the directory.

4. JAR files are created and registered in Hive. There are two JARs that will be registered in Hive. The first (a JSON SerDe) defines the JSON extensions for the Hive external table tweets. The second (a custom MapReduce PathFilter) instructs MapReduce to exclude any TMP files created by Flume, since Flume writes to a TMP file that cannot be accessed by MapReduce.

5. Hive aggregates data from Excel. Hive contains an external table named tweets that is modeled on the JSON message from Twitter. In addition to partitioning the table by year, month, day, and hour through the Q script executed within Oozie, Hive will be used to aggregate tweets from Excel using HQL (Hive Query Language): Big Data tweets by time zone and day, the top 15 Big Data hashtags, and the top 10 retweeted users on Big Data topics. These queries are executed through ODBC.

6. Hive initiates MapReduce program. Hive will translate any query that requires aggregation, sorting, or filtering into a MapReduce program. MapReduce will grab the data from HDFS and pass the result set back to Hive. The environment is a CDH V4.1.1 single-node cluster running on 64-bit Linux under VMware.

Cloudera provides a tutorial that represents an implementation of this use case. This paper will build on Cloudera's tutorial and extend it by making the data available in real time and reporting on the data in Excel. Using the approach described above and by following the instructions, you will have tweets streaming into your Hadoop sandbox, reportable in Excel, in less than a business day. Cloudera's tutorial is documented thoroughly in a series of blog postings and the source code is available on GitHub [4-7]. Figure 4 represents our version of the streaming Twitter tutorial; the major components are numbered and their purpose explained above.
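To make component 5 concrete, the following is an abridged sketch of the external table definition from the Cloudera tutorial. The full create-table script ships with the GitHub project; the column list here is shortened to the fields used by the appendix queries, so treat it as illustrative rather than complete:

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  retweet_count INT,
  entities STRUCT<hashtags:ARRAY<STRUCT<text:STRING>>>,
  user STRUCT<screen_name:STRING, time_zone:STRING>,
  retweeted_status STRUCT<text:STRING, user:STRUCT<screen_name:STRING>>
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';

Because the table is EXTERNAL and partitioned, Hive never moves the data; it simply overlays structure onto whatever Flume has already written under /user/flume/tweets.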
IV. How to stream social data to Hadoop in less than a day

Now that we have established a rationale, an approach, and a use case for learning Big Data, we can get started. Despite the many moving parts listed in the use case, you can have the streaming social media use case operational in your own lab environment in less than a working day. The following lists the steps to make this happen, along with any non-obvious instructions that are not provided in the tutorial. Where appropriate, explanations have been added to ensure the significant concepts and mechanics are understood and reinforced.

1. First and foremost, you need a CDH lab environment. Building and configuring this environment from the OS up could take time. Fortunately, Cloudera provides a VM image, available for download from Cloudera's website, with all of the necessary Hadoop projects pre-installed and pre-configured.

2. To run the lab environment, you will also need VMware Player, which you can download and install from the VMware website.

3. Verify you have sufficient resources to run the VM image on your host machine. Please refer to Cloudera's system requirements, and to the appendix for the host and guest machine specifications used in this example.

4. Start the VM. Once started, you can begin the Cloudera Twitter tutorial. For the most part, you can follow the instructions exactly as provided in the GitHub tutorial. The following instructions should be followed in addition to those provided by Cloudera, and the rationale for the amendments is also provided:

4.1. Unless you have a need to build the JARs from scratch, you should find that the JARs referenced in the tutorial already exist on the VM image provided by Cloudera. If you build the JARs from source, you will probably need two days to get the tutorial operational.

4.2. The GCC library was missing from the VM image. To include the library (which is required to install other libraries):

sudo su -
yum install gcc

4.3. When following the steps under "Configuring Flume":

Step 3: We had to manually create the flume-ng-agent file with the following contents:

# FLUME_AGENT_NAME=kings-river-flume
FLUME_AGENT_NAME=TwitterAgent

Step 4: If you are not familiar with the details of your Twitter app, this step may cause confusion. All that is required is a Twitter account. Once you have a Twitter account, you need to register the Flume Twitter agent with Twitter, so that Twitter has a record of your agent and can govern the various third parties that stream Twitter data.

To register your Twitter app, go to the Twitter developer site:
Sign in with your Twitter account, click "Create a new Application," and complete the application registration form.
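Before filling in the tokens, it helps to see the overall shape of flume.conf. The following abridged sketch follows the structure of the file in the GitHub project, with one source, one memory channel, and one HDFS sink; the values shown are illustrative:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost.localdomain:8020/user/flume/tweets/%Y/%m/%d/%H/

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

The %Y/%m/%d/%H tokens in the sink path are what produce the year/month/day/hour folder structure that the Oozie workflow later maps to Hive partitions.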
Your new application will provide you with four security tokens that will be specified in the flume.conf file. Using the values of those application properties, enter the following parameters in flume.conf. If flume.conf does not exist under /etc/flume-ng/conf, please download it from the GitHub project:

TwitterAgent.sources.Twitter.consumerKey = <consumer_key_from_twitter>
TwitterAgent.sources.Twitter.consumerSecret = <consumer_secret_from_twitter>
TwitterAgent.sources.Twitter.accessToken = <access_token_from_twitter>
TwitterAgent.sources.Twitter.accessTokenSecret = <access_token_secret_from_twitter>

In flume.conf, modify the following parameter according to the keywords on which you want to filter tweets. Note that the default flume.conf provided by Cloudera misspelled "data scientist"; the correct spelling is listed below:

TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

At this point you probably realize the importance of flume.conf. In addition to containing the details of the Twitter app and the keywords, it contains the following parameters, which govern how big the Flume files grow before Flume rolls to a new file. These parameters are significant because as you change them, the latency of the tweets will also change. The complete listing of the Flume parameters can be found on Cloudera's website.

TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000 # number of events written to file before it is flushed to HDFS
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 # file size to trigger roll (in bytes)
TwitterAgent.sinks.HDFS.hdfs.rollCount = # number of events written to file before it is rolled

Place flume.conf under /etc/flume-ng/conf as instructed in Step 5.

4.4. When following the steps under "Setting up Hive":

Copy the hive-serdes-1.0-SNAPSHOT.jar from Step 1 to /usr/lib/hadoop:

cp hive-serdes-1.0-SNAPSHOT.jar /usr/lib/hadoop

After Step 4, you will want to create a new Java package using the following steps. No Java programming knowledge is required; simply follow these instructions. It is necessary to create this Java class and JAR it so that you can exclude the temporary Flume files created as tweets are streamed to HDFS [8]:

mkdir com
mkdir com/twitter
mkdir com/twitter/util
export CLASSPATH=/usr/lib/hadoop/hadoop-common-2.0.0-cdh4.1.2.jar:hadoop-common.jar
vi com/twitter/util/FileFilterExcludeTmpFiles.java

Copy the Java source code from the appendix into the file and save it. Then compile it, package it as a JAR, and copy it to /usr/lib/hadoop:

javac com/twitter/util/FileFilterExcludeTmpFiles.java
jar cf TwitterUtil.jar com
cp TwitterUtil.jar /usr/lib/hadoop

Edit the file /etc/hive/conf/hive-site.xml and add the following tags. The first property ensures that you won't have to add the JSON SerDe package and the new custom package that excludes Flume temporary files for each Hive session; they become part of the overall Hive configuration available to every Hive session. The second property tells MapReduce the class name and location of the new Java class that we created and compiled above.

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>
</property>
<property>
  <name>mapred.input.pathFilter.class</name>
  <value>com.twitter.util.FileFilterExcludeTmpFiles</value>
</property>

Bounce the Hive servers:

sudo service hive-server stop
sudo service hive-server2 stop
sudo service hive-server start
sudo service hive-server2 start

4.5. When following the steps under "Prepare the Oozie workflow":

For all steps, download the Oozie files from the Cloudera GitHub site. Before Step 4, edit the job.properties file accordingly.

Make sure the following parameters reference localhost.localdomain, not just localhost:

namenode=hdfs://localhost.localdomain:8020
jobtracker=localhost.localdomain:8021

The jobStart, jobEnd, tzOffset, and initialDataset parameters require explanation. Say Flume is streaming the tweets to an HDFS folder, /user/flume/tweets/*. The parameter initialDataset tells the workflow the earliest year, month, day, and hour for which you have data, and therefore the first partition that can be added to the Hive tweets table. jobStart should be set to the initialDataset plus or minus the tzOffset. Finally, jobEnd tells the Oozie workflow when to wind down, so it can be set well into the future. In the following example, the parameters specify that the first set of tweets lives on HDFS under /user/flume/tweets/2013/01/07/08; once that directory is available, the workflow will execute the Hive Query Language script addpartition.q:

jobStart=2013-01-07T13:00Z
jobEnd=<end_date>T23:00Z
initialDataset=2013-01-07T08:00Z
tzOffset=-5
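The addpartition.q script itself is small but central: it maps each new Flume directory to an hourly Hive partition. The following sketch conveys the idea, assuming the tutorial's datehour partition column; the ${...} variables stand for values Oozie substitutes at run time and are illustrative here:

ALTER TABLE tweets
  ADD IF NOT EXISTS PARTITION (datehour = ${DATEHOUR})
  LOCATION '/user/flume/tweets/${YEAR}/${MONTH}/${DAY}/${HOUR}';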
Edit coord-app.xml:

a. Change the timezone from America/Los_Angeles to America/New_York (or the corresponding timezone for your location):

initial-instance="${initialDataset}" timezone="America/New_York">

b. Remove the following tags. This is extremely important in making the tutorial as real-time as possible. The default Oozie workflow defines a readyIndicator, which acts as a wait event: it instructs the workflow to create a new partition only after an hour completes. Thus, if you leave this configuration as-is, there will be a lag of up to one hour between tweets arriving and the tweets becoming queryable. The reason for this default configuration is that the original tutorial did not define the custom JAR we built and deployed for Hive, which instructs MapReduce to omit temporary Flume files. Because we have deployed this custom package, we do not have to wait for a full hour to complete before querying tweets.

<data-in name="readyIndicator" dataset="tweets">
  <!-- I've done something here that is a little bit of a hack. Since Flume
       doesn't have a good mechanism for notifying an application of when it
       has rolled to a new directory, we can just use the next directory as an
       input event, which instructs Oozie not to kick off a coordinator action
       until the next dataset starts being available. -->
  <instance>${coord:current(1 + (coord:tzoffset() / 60))}</instance>
</data-in>

If you haven't done so already, enable the Oozie web console according to the Cloudera documentation. Doing so allows Oozie coordinator jobs and workflows to be monitored from the console located at http://localhost.localdomain:11000/oozie/.
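If you prefer the command line, the stock Oozie client can report the same status. For example (the URL assumes the default console location above, and the job ID is whatever your coordinator was assigned at submission):

oozie jobs -oozie http://localhost.localdomain:11000/oozie -jobtype coordinator
oozie job -oozie http://localhost.localdomain:11000/oozie -info <coordinator_job_id>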
Once you have started the Flume agent (under "Starting the data pipeline"), you will see tweets streaming into your HDFS. You can browse the HDFS directory structure from the Hadoop NameNode console on your cluster, located at http://localhost.localdomain:50070/dfshealth.jsp. If you are experiencing technical issues, please reference the Troubleshooting Guide in the appendix.
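You can also confirm from the command line that Flume is landing files. The paths below assume the tutorial's defaults, with the date path matching whenever your agent started:

hadoop fs -ls /user/flume/tweets/
hadoop fs -ls /user/flume/tweets/2013/01/07/08/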
5. Set up ODBC connectivity through Excel:

5.1. ODBC connectivity to Hive from an application is a logical extension of the Cloudera Twitter tutorial.

5.2. There are several ODBC drivers for Hive, but many were not compatible with Excel (e.g. Cloudera's ODBC driver for Tableau) or not compatible with Cloudera's environment (Microsoft's ODBC driver for Hive, which only worked when connecting to Microsoft HDInsight).

5.3. We successfully used MapR's ODBC driver for Windows. Since we are running 32-bit Excel, we needed to download the 32-bit ODBC driver for Hive, but MapR has a 64-bit driver as well.

5.4. Download and install the appropriate ODBC driver from MapR's website.

5.5. Configure an ODBC connection to the Hive database. We recommend adding an entry to your Windows hosts file (C:\Windows\System32\drivers\etc\hosts) to alias the IP address of your VM, for example an entry named cloudera-vm. You can get the IP address from your VM by typing the command ifconfig.

5.6. Create an ODBC connection to the Hive database. Open a new Excel workbook. From the Data tab, select From Other Sources, then From Data Connection Wizard. Select ODBC DSN and click Next.
Select the DSN you set up using the MapR driver (Cloudera Hive VM MapR). Select the tweets table, select Finish, and then select Properties.
This is the important part: we must override the HQL in order for the query to execute. At the time this article was written, the major ODBC drivers append "default" to the Hive query, and the MapR ODBC driver was the only one able to establish connectivity while allowing us to override the HQL.

Select the Definition tab. Using one of the Hive queries provided in the appendix, copy the HQL and paste it into the Command Text field; also check Save password. Hit OK to import the data.

Repeat for the remaining queries in the appendix, and create as many queries as you see fit. HQL is very SQL-like, and those of us who know SQL will find it easy to adapt the queries in the appendix into other statements that provide the views you need.
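As one illustration of how readily the appendix queries can be adapted, the variant below charts overall tweet volume by day of week. It uses only fields already present in the tweets table, following the same SUBSTR(created_at, 0, 3) idiom the appendix queries use to pull the day abbreviation:

SELECT SUBSTR(created_at, 0, 3) AS day_of_week,
       COUNT(*) AS total_count
FROM tweets
GROUP BY SUBSTR(created_at, 0, 3)
ORDER BY total_count DESC;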
V. Summary

Once you have successfully completed this tutorial, you should have a clearer understanding of Hadoop, specifically:

1. A quick overview of core Hadoop projects and how each is used to support streaming social media and reporting through a standard ODBC connection.
2. An operational Hadoop sandbox that you can navigate and explore, usable for training, local development, and proofs of concept.
3. A real-world reference model for a use case illustrating the streaming capabilities in Hadoop.
4. How to model semi-structured JSON data in Hive and query it in a conventional manner.

Lastly, this exercise should leave you wanting to take the Hadoop experience to the next level. Independently, you can layer in a Mahout program to cluster and classify the tweets, thereby simulating some form of sentiment analysis. You may also want to layer geospatial data into the set to provide more advanced analytics, or consider streaming data from other social media sites. Above all, you may want to show someone from the business what this new technology can do. By demystifying Big Data technology, you can take your understanding, and your ability to support additional business use cases, to these next levels.

Appendix: Custom Java code for MapReduce PathFilter

package com.twitter.util;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Excludes Flume's in-flight files from MapReduce input: any file whose name
// starts with "_" or "." or ends in ".tmp" is skipped.
public class FileFilterExcludeTmpFiles implements PathFilter {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".") && !name.endsWith(".tmp");
    }
}
Appendix: Hardware/software environment

Figure 5

Host: Windows 7 Enterprise 64-bit; Intel Core 2 Duo CPU 2.26GHz/2.27GHz; 8GB memory (7.9 addressable); 300GB disk; VMware Player, Microsoft Office 32-bit.
Guest: CentOS 6.2 Linux 64-bit; Intel Core 2 Duo CPU 2.26GHz; 2.98GB memory; 23.5GB disk; Cloudera Manager Free Edition, CDH4.1.2.

Appendix: Troubleshooting guide

Figure 6

Error: FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException SerDe com.cloudera.hive.serde.JSONSerDe does not exist)
Cause: Hive cannot find hive-serdes-1.0-SNAPSHOT.jar.
Resolution: 1. Place hive-serdes-1.0-SNAPSHOT.jar in /usr/lib/hadoop. 2. Edit /etc/hive/conf/hive-site.xml and add the following: <property> <name>hive.aux.jars.path</name> <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value> </property> 3. Stop and restart the Hive services.

Error: INFO org.apache.oozie.command.coord.CoordActionInputCheckXCommand: ... ActionInputCheck:: In checkListOfPaths: hdfs://localhost.localdomain:8020/user/flume/tweets/2013/01/17/10 is Missing.
Cause: Permissions on /user/flume/*.
Resolution: Change permissions on /user/flume: sudo -u flume hadoop fs -chmod -R 777 /user/flume

Error: Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [10001]
Cause: Missing MySQL driver.
Resolution: cp /var/lib/oozie/mysql-connector-java.jar oozie-workflows/lib

Error: OLE DB or ODBC error: [MapR][Hardy] (22) Error from ThriftHiveClient: Query returned non-zero code: 2, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask; HY000. An error occurred while the partition, with the ID of Tweets By Timezone_cbf7182e-a7a6-416c-a3fd-d7f484952cc6, Name of Tweets By Timezone, was being processed. The current operation was cancelled because another operation in the transaction failed.
Cause: Flume temp file permissions issue.
Resolution: Walk through the instructions under "Setting up Hive" to ensure the custom Java class that sets the MapReduce PathFilter is built, deployed, and referenced in Hive as specified.
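When several of these symptoms overlap, a quick end-to-end check from the shell helps isolate whether Hive, the SerDe, and the partitions are healthy before involving ODBC at all. This assumes at least one partition has been added by the Oozie workflow:

hive -e "SELECT text FROM tweets LIMIT 5"

If tweet text comes back, the SerDe, path filter, and partitions are working, and any remaining problem is likely on the ODBC or Excel side.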
Appendix: Excel queries [9]

Top 15 Big Data hashtags:

SELECT LOWER(hashtags.text), COUNT(*) AS total_count
FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15

Tweets by time zone and day:

SELECT user.time_zone, SUBSTR(created_at, 0, 3), COUNT(*) AS total_count
FROM tweets
WHERE user.time_zone IS NOT NULL
GROUP BY user.time_zone, SUBSTR(created_at, 0, 3)
ORDER BY total_count DESC
LIMIT 15
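A note on the hashtag query: LATERAL VIEW EXPLODE flattens the entities.hashtags array so that each hashtag becomes its own row before grouping. You can see the effect in isolation with a minimal probe such as:

SELECT LOWER(hashtags.text)
FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
LIMIT 5;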
Top 10 retweeted users on Big Data topics:

SELECT t.retweeted_screen_name, SUM(retweets) AS total_retweets, COUNT(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
             retweeted_status.text,
             MAX(retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10

Top 200 most active users on Big Data topics:

SELECT user.screen_name, COUNT(*) AS tweet_cnt
FROM tweets
GROUP BY user.screen_name
ORDER BY tweet_cnt DESC
LIMIT 200

References

1. gap_no_pan.html
2. Definitions from the Apache Hadoop website for each respective package.
3. cdh.html
4-7. datawith-hive/
8. Known issue with Flume; see jira/browse/flume
9. Adapted from analyzing-twitter-data-with-hadoop/
Collaborative Consulting is a leading information technology services firm dedicated to helping our clients achieve business advantage through the use of strategy and technology. We deliver a comprehensive set of solutions across multiple industries, with a focus on business process and program management, information management, software solutions, and software performance and quality. We also have a set of offerings specific to the life sciences and financial services industries. Our unique model offers both onsite management and IT consulting as well as U.S.-based remote solution delivery. To learn more about Collaborative, please visit our website or email us at [email protected].

Copyright 2014 Collaborative Consulting, LLC. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws.