The Fast-Track to Hands-On Understanding of Big Data Technology
Stream Social Media to Hadoop & Create Reports in Less Than a Day (Part 2)

COLLABORATIVE WHITEPAPER SERIES
Big Data might be intimidating even to the most seasoned IT professional. It's not simply the charged nature of the term "Big" that is ominous; the underlying technology is also app-centric in a very open-source way. If you are like most professionals, you don't have a working knowledge of MapReduce, JSON, Hive, or Flume, and diving into the deep end of the Big Data technology pool may seem like a time-consuming process. Even if you possess these skill sets, the prospect of launching a Hadoop environment and deploying an application that streams Twitter data into it in a way that is accessible through standard ODBC tools would seem like a task measured in weeks, not days.

It may surprise most people looking to get hands-on with Big Data technology that each of us can do so in short order: with the right approach, you can stream live social data to your own Hadoop cluster and report on the information through Excel in less than one day. This whitepaper series takes an instructive, fast-track approach to creating your personal Big Data lab environment powered by Apache Hadoop. The first part in the series engaged IT professionals with a passing interest in Big Data by providing them with:

- Reasons to explore the world of Big Data, and the Big Data skills gap.
- A practical, lightweight approach to getting hands-on with Big Data technology.

This second part will:

- Describe the use case and the supporting technical components in more detail.
- Provide step-by-step instructions for setting up the lab environment, and direct individuals to Cloudera's streaming Twitter agent tutorial.

We will enhance Cloudera's tutorial in the following ways:

- Make the tutorial real-time.
- Provide steps to establish ODBC connectivity and to execute Cloudera's sample queries in Excel.
- Configure and register libraries at an overall environment level.
- Provide sample code and troubleshooting tips.
I. A reason to explore the universe of Big Data options before taking the first step

For those of us with day jobs, there aren't enough hours in the day to invest in dissecting the various Big Data technology players, or in building open source components that won't necessarily prove anything from a technology or business point of view. Fortunately, the following game plan provides a universal use case as a starting point, and practical ways to get a lab environment running, so that you can mobilize business sponsors and technical staff around Big Data capabilities in less than eight hours.

Before beginning this exercise, the first question an IT professional may ask is: why would one care to explore the universe of Big Data? The fact is that the universe of data is expanding at an accelerating rate, and increasingly the growth is driven by sources of unstructured or machine-generated Big Data (e.g. social media, blogs, the Internet of Things). The latest IDC Digital Universe Study reveals an explosion of stored information: more than 2.8 zettabytes, or roughly 3 billion terabytes, of information was created and replicated in 2012 alone. To put this number in perspective, 2.8 x 10^9 terabytes spread over the roughly 3.15 x 10^7 seconds in a year means that about 90 terabytes of information was produced every second over the course of the year.

Figure 1: Why choose Hadoop

Organizations are increasingly aware that this unrefined data represents an opportunity to gain valuable insight into areas such as ongoing clinical research and financial risk monitoring. From a practical perspective, this means that business people will be asking IT for answers to questions that can be supported by sources of Big Data, and in some cases processed by Big Data technologies. A recent Harvard Business Review survey suggests that 85 percent of organizations had funded Big Data initiatives in process or in the planning stage, but the survey also reveals a severe gap in analytical skills: 70 percent of respondents describe finding qualified data scientists as challenging to very difficult [1]. Thus, Big Data introduces an opportunity to the business, but exposes a skills and technology gap for IT. This gap must be filled quickly, otherwise businesses will find themselves at a competitive disadvantage, and IT's ability to support the business will be questioned.

II. The right approach

If you are convinced that an understanding of Big Data is important to your business and IT initiatives, in most cases you need to formulate a practical, low-cost, and ultimately relevant approach to understanding the technology and the conventional use cases that resonate with the business. After all, IT resources are stretched thin, and many of us in IT who are new to the world of Big Data could spend weeks getting up to speed on the various options.
Figure 2: Why CDH was used

Consideration: Building the Hadoop environment from scratch as opposed to using CDH.
Direction and Rationale: We considered building our Hadoop environment from scratch through the Apache Hadoop projects. If your learning objectives include understanding what it takes to ensure compatibility of each Hadoop project, or if you need to tweak the source code, then you should include this step in the approach. Given the time commitment, it seemed more useful to take an existing distribution that ensured interoperability and compatibility of the projects.

Consideration: Using CDH over HortonWorks or MapR.
Direction and Rationale: Alternatives from HortonWorks and MapR were considered, specifically Microsoft's HDInsight distribution that uses HortonWorks. Ultimately, Cloudera's software and support resources and its Twitter feed example are used here, all of which are available for download and general use. Cloudera also provides VM images with a free edition of Cloudera Manager and Hadoop, including every Apache Hadoop project required by this scenario.

Consideration: Deploying Hadoop in the cloud.
Direction and Rationale: Deploying the environment to the cloud was considered, and in some cases it may be preferred. An instance of Microsoft HDInsight running in Azure was used, and would have been pursued at greater length, but unfortunately the lease on the Azure instance expired and inquiries on how to extend the lease went unanswered.

For learning purposes, it makes sense to pursue a fairly common scenario across industries. In our case, we will stream social media (specifically tweets from Twitter) to a lab environment, and then report on the data through everyone's favorite BI tool, Excel. The use case is described in more detail later on. From a technical training perspective, the approach relies heavily on Apache Hadoop. Figure 1 lists the reasons why Hadoop is the preferred platform for learning Big Data and for implementing this scenario.

There are many ways to deploy Apache Hadoop. Our example relies on Cloudera's distribution of Apache Hadoop (CDH) running in a Linux VM image. Figure 2 lists key considerations and why CDH was used. The CDH stack (Figure 3) summarizes the core projects included with CDH, and the projects relevant to the use case are captioned [2].

It should be noted that this is a learning exercise, not a performance benchmark; a single-node Hadoop cluster running inside a Linux VM is deemed sufficient for those of us wanting to learn Hadoop. If performance tuning is crucial to your learning objectives, then a more robust environment would be required. The business use case would still be relevant, since streaming live social data will generate millions of transactions, depending on the keywords you have specified. The specifications of the VM image and the host machine are listed in the Appendix.
Figure 3: Cloudera's distribution including Apache Hadoop (CDH) [3]

The CDH stack (100% open source) spans: Whirr (cloud); Hue (UI) and Oozie (workflow); Sqoop (access integration); Hive, Pig, Mahout, and DataFu (batch processing); Flume, Fuse-DFS, and WebHDFS/HttpFS (real-time access); MapReduce and MapReduce2 (batch compute); YARN and ZooKeeper (resource management and coordination); Impala (SQL); and HDFS and HBase (storage), with a Metastore capturing metadata. The projects most relevant to the use case are described below.

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of datasets stored in HDFS. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL.

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Coordinator jobs are recurrent Oozie workflow jobs triggered by time (frequency) and data availability.

The Metastore captures the Hive metadata, transparent to the overall application.

ODBC and JDBC connectivity is supported for interfacing with other platforms such as databases, BI, and data integration tools.

MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.

ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. All of these kinds of services are used in some form or another by distributed applications.

Finally, there are ways to make this use case more comprehensive. For instance, once the streaming data is captured, you could use Apache Mahout to cluster and classify it using the various algorithms available. Since much of the classification would depend on business input, it seems reasonable to take a first iteration through the use case as presented, and then proceed with next steps in concert with more involvement or direction from the business.

III. Use case and supporting Hadoop components

Streaming social media data is a fairly common use case for Big Data and applies across industries.
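The raw input throughout this use case is the JSON document Twitter emits for each tweet. For reference, the following is a heavily abridged sketch showing only the nested fields used by the queries later in this paper; the field values are illustrative, and real tweets carry dozens more fields:

{
  "created_at": "Mon Jan 07 08:15:00 +0000 2013",
  "text": "Learning #hadoop in my own lab today",
  "retweet_count": 2,
  "entities": { "hashtags": [ { "text": "hadoop" } ] },
  "user": { "screen_name": "some_user", "time_zone": "Eastern Time (US & Canada)" }
}

Hive's JSON SerDe exposes these nested objects as STRUCT and ARRAY columns, which is why the queries in the appendix can reference fields like user.time_zone and entities.hashtags directly.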
Figure 4: Streaming social media use case and supporting technical components

1. Twitter users generate tweets. Twitter users tweet about various topics. In this example, we want to capture in real time any tweet related to Big Data keywords such as hadoop, analytics, big data, and mapreduce.

2. Flume organizes keywords. A Flume agent runs on our CDH VM. It streams any tweet containing the Big Data keywords listed above to HDFS, placing the tweets in a folder structure organized by year, month, day, and hour, and creating a new subfolder as it rolls forward each time interval.

3. Oozie process runs Hive script. An Oozie workflow runs in the background; once a new subdirectory is created by the Flume agent, the workflow executes a Hive Q script that adds a new hourly partition to the tweets table corresponding to the year, month, day, and hour of the directory.

4. JAR files are created and registered in Hive. There are two JARs that will be registered in Hive. The first (a JSON SerDe) defines the JSON extensions for the Hive external table tweets. The second (a custom MapReduce PathFilter) instructs MapReduce to exclude any TMP files created by Flume, since Flume writes to a TMP file that cannot be accessed by MapReduce.

5. Hive aggregates data from Excel. Hive contains an external table named tweets that is modeled on the JSON message from Twitter. In addition to partitioning the table by year, month, day, and hour through the Q script executed within Oozie, Hive will be used to aggregate tweets from Excel using HQL (Hive Query Language): Big Data tweets by time zone and day, the top 15 Big Data hashtags, and the top 10 retweeted users on Big Data topics. These queries are executed through ODBC.

6. Hive initiates MapReduce program. Hive will translate any query that requires aggregation, sorting, or filtering into a MapReduce program. MapReduce will grab the data from HDFS and pass the result set back to Hive. The environment is a CDH V4.1.1 single-node cluster running on 64-bit Linux under VMware.

Cloudera provides a tutorial that represents an implementation of this use case. This paper will build on Cloudera's tutorial and extend it by making the data available in real time and reporting on the data in Excel. Using the approach described above and by following the instructions, you will have tweets streaming into your Hadoop sandbox, reportable in Excel, in less than a business day. Cloudera's tutorial is documented thoroughly in a series of blog postings and the source code is available on GitHub [4-7]. Figure 4 represents our version of the streaming Twitter tutorial; the major components are numbered and their purpose explained above.
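To make component 5 concrete, the following is an abridged sketch of the external table definition from the Cloudera tutorial. The full create-table script ships with the GitHub project; the column list here is shortened to the fields used by the appendix queries, so treat it as illustrative rather than complete:

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  retweet_count INT,
  entities STRUCT<hashtags:ARRAY<STRUCT<text:STRING>>>,
  user STRUCT<screen_name:STRING, time_zone:STRING>,
  retweeted_status STRUCT<text:STRING, user:STRUCT<screen_name:STRING>>
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';

Because the table is EXTERNAL and partitioned, Hive never moves the data; it simply overlays structure onto whatever Flume has already written under /user/flume/tweets.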
IV. How to stream social data to Hadoop in less than a day

Now that we have established a rationale, an approach, and a use case for learning Big Data, we can get started. Despite the many moving parts listed in the use case, you can have the streaming social media use case operational in your own lab environment in less than a working day. The following lists the steps to make this happen, along with any non-obvious instructions that are not provided in the tutorial. Where appropriate, explanations have been added to ensure the significant concepts and mechanics are understood and reinforced.

1. First and foremost, you need a CDH lab environment. Building and configuring this environment from the OS up could take time. Fortunately, Cloudera provides a VM image, available for download from Cloudera's website, with all of the necessary Hadoop projects pre-installed and pre-configured.

2. To run the lab environment, you will also need VMware Player, which you can download and install from the VMware website.

3. Verify you have sufficient resources to run the VM image on your host machine. Please refer to Cloudera's system requirements, and to the appendix for the host and guest machine specifications used in this example.

4. Start the VM. Once started, you can begin the Cloudera Twitter tutorial. For the most part, you can follow the instructions exactly as provided in the GitHub tutorial. The following instructions should be followed in addition to those provided by Cloudera, and the rationale for the amendments is also provided:

4.1. Unless you have a need to build the JARs from scratch, you should find that the JARs referenced in the tutorial already exist on the VM image provided by Cloudera. If you build the JARs from source, you will probably need two days to get the tutorial operational.

4.2. The GCC library was missing from the VM image. To include the library (which is required to install other libraries):

sudo su -
yum install gcc

4.3. When following the steps under "Configuring Flume":

Step 3: We had to manually create the flume-ng-agent file with the following contents:

# FLUME_AGENT_NAME=kings-river-flume
FLUME_AGENT_NAME=TwitterAgent

Step 4: If you are not familiar with the details of your Twitter app, this step may cause confusion. All that is required is a Twitter account. Once you have a Twitter account, you need to register the Flume Twitter agent with Twitter, so that Twitter has a record of your agent and can govern the various third parties that stream Twitter data.

To register your Twitter app, go to the Twitter developer site:
Sign in with your Twitter account, click "Create a new Application," and complete the application registration form.
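Before filling in the tokens, it helps to see the overall shape of flume.conf. The following abridged sketch follows the structure of the file in the GitHub project, with one source, one memory channel, and one HDFS sink; the values shown are illustrative:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost.localdomain:8020/user/flume/tweets/%Y/%m/%d/%H/

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

The %Y/%m/%d/%H tokens in the sink path are what produce the year/month/day/hour folder structure that the Oozie workflow later maps to Hive partitions.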
Your new application will provide you with four security tokens that will be specified in the flume.conf file. Using the values of those application properties, enter the following parameters in flume.conf. If flume.conf does not exist under /etc/flume-ng/conf, please download it from the GitHub project:

TwitterAgent.sources.Twitter.consumerKey = <consumer_key_from_twitter>
TwitterAgent.sources.Twitter.consumerSecret = <consumer_secret_from_twitter>
TwitterAgent.sources.Twitter.accessToken = <access_token_from_twitter>
TwitterAgent.sources.Twitter.accessTokenSecret = <access_token_secret_from_twitter>

In flume.conf, modify the following parameter according to the keywords on which you want to filter tweets. Note that the default flume.conf provided by Cloudera misspelled "data scientist"; the correct spelling is listed below:

TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

At this point you probably realize the importance of flume.conf. In addition to containing the details of the Twitter app and the keywords, it contains the following parameters, which govern how big the Flume files grow before Flume rolls to a new file. These parameters are significant because as you change them, the latency of the tweets will also change. The complete listing of the Flume parameters can be found on Cloudera's website.

TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000 # number of events written to file before it is flushed to HDFS
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 # file size to trigger roll (in bytes)
TwitterAgent.sinks.HDFS.hdfs.rollCount = # number of events written to file before it is rolled

Place flume.conf under /etc/flume-ng/conf as instructed in Step 5.

4.4. When following the steps under "Setting up Hive":

Copy the hive-serdes-1.0-SNAPSHOT.jar from Step 1 to /usr/lib/hadoop:

cp hive-serdes-1.0-SNAPSHOT.jar /usr/lib/hadoop

After Step 4, you will want to create a new Java package using the following steps. No Java programming knowledge is required; simply follow these instructions. It is necessary to create this Java class and JAR it so that you can exclude the temporary Flume files created as tweets are streamed to HDFS [8]:

mkdir com
mkdir com/twitter
mkdir com/twitter/util
export CLASSPATH=/usr/lib/hadoop/hadoop-common-2.0.0-cdh4.1.2.jar:hadoop-common.jar
vi com/twitter/util/FileFilterExcludeTmpFiles.java

Copy the Java source code from the appendix into the file and save it. Then compile it, package it as a JAR, and copy it to /usr/lib/hadoop:

javac com/twitter/util/FileFilterExcludeTmpFiles.java
jar cf TwitterUtil.jar com
cp TwitterUtil.jar /usr/lib/hadoop

Edit the file /etc/hive/conf/hive-site.xml and add the following tags. The first property ensures that you won't have to add the JSON SerDe package and the new custom package that excludes Flume temporary files for each Hive session; they become part of the overall Hive configuration available to every Hive session. The second property tells MapReduce the class name and location of the new Java class that we created and compiled above.

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>
</property>
<property>
  <name>mapred.input.pathFilter.class</name>
  <value>com.twitter.util.FileFilterExcludeTmpFiles</value>
</property>

Bounce the Hive servers:

sudo service hive-server stop
sudo service hive-server2 stop
sudo service hive-server start
sudo service hive-server2 start

4.5. When following the steps under "Prepare the Oozie workflow":

For all steps, download the Oozie files from the Cloudera GitHub site. Before Step 4, edit the job.properties file accordingly.

Make sure the following parameters reference localhost.localdomain, not just localhost:

namenode=hdfs://localhost.localdomain:8020
jobtracker=localhost.localdomain:8021

The jobStart, jobEnd, tzOffset, and initialDataset parameters require explanation. Say Flume is streaming the tweets to an HDFS folder, /user/flume/tweets/*. The parameter initialDataset tells the workflow the earliest year, month, day, and hour for which you have data, and therefore the first partition that can be added to the Hive tweets table. jobStart should be set to the initialDataset plus or minus the tzOffset. Finally, jobEnd tells the Oozie workflow when to wind down, so it can be set well into the future. In the following example, the parameters specify that the first set of tweets lives on HDFS under /user/flume/tweets/2013/01/07/08; once that directory is available, the workflow will execute the Hive Query Language script addpartition.q:

jobStart=2013-01-07T13:00Z
jobEnd=<end_date>T23:00Z
initialDataset=2013-01-07T08:00Z
tzOffset=-5
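The addpartition.q script itself is small but central: it maps each new Flume directory to an hourly Hive partition. The following sketch conveys the idea, assuming the tutorial's datehour partition column; the ${...} variables stand for values Oozie substitutes at run time and are illustrative here:

ALTER TABLE tweets
  ADD IF NOT EXISTS PARTITION (datehour = ${DATEHOUR})
  LOCATION '/user/flume/tweets/${YEAR}/${MONTH}/${DAY}/${HOUR}';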
Edit coord-app.xml:

a. Change the timezone from America/Los_Angeles to America/New_York (or the corresponding timezone for your location):

initial-instance="${initialDataset}" timezone="America/New_York">

b. Remove the following tags. This is extremely important in making the tutorial as real-time as possible. The default Oozie workflow defines a readyIndicator, which acts as a wait event: it instructs the workflow to create a new partition only after an hour completes. Thus, if you leave this configuration as-is, there will be a lag of up to one hour between tweets arriving and the tweets becoming queryable. The reason for this default configuration is that the original tutorial did not define the custom JAR we built and deployed for Hive, which instructs MapReduce to omit temporary Flume files. Because we have deployed this custom package, we do not have to wait for a full hour to complete before querying tweets.

<data-in name="readyIndicator" dataset="tweets">
  <!-- I've done something here that is a little bit of a hack. Since Flume
       doesn't have a good mechanism for notifying an application of when it
       has rolled to a new directory, we can just use the next directory as an
       input event, which instructs Oozie not to kick off a coordinator action
       until the next dataset starts being available. -->
  <instance>${coord:current(1 + (coord:tzoffset() / 60))}</instance>
</data-in>

If you haven't done so already, enable the Oozie web console according to the Cloudera documentation. Doing so allows Oozie coordinator jobs and workflows to be monitored from the console located at http://localhost.localdomain:11000/oozie/.
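If you prefer the command line, the stock Oozie client can report the same status. For example (the URL assumes the default console location above, and the job ID is whatever your coordinator was assigned at submission):

oozie jobs -oozie http://localhost.localdomain:11000/oozie -jobtype coordinator
oozie job -oozie http://localhost.localdomain:11000/oozie -info <coordinator_job_id>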
Once you have started the Flume agent (under "Starting the data pipeline"), you will see tweets streaming into your HDFS. You can browse the HDFS directory structure from the Hadoop NameNode console on your cluster, located at http://localhost.localdomain:50070/dfshealth.jsp. If you are experiencing technical issues, please reference the Troubleshooting Guide in the appendix.
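You can also confirm from the command line that Flume is landing files. The paths below assume the tutorial's defaults, with the date path matching whenever your agent started:

hadoop fs -ls /user/flume/tweets/
hadoop fs -ls /user/flume/tweets/2013/01/07/08/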
5. Set up ODBC connectivity through Excel:

5.1. ODBC connectivity to Hive from an application is a logical extension of the Cloudera Twitter tutorial.

5.2. There are several ODBC drivers for Hive, but many were not compatible with Excel (e.g. Cloudera's ODBC driver for Tableau) or not compatible with Cloudera's environment (Microsoft's ODBC driver for Hive, which only worked when connecting to Microsoft HDInsight).

5.3. We successfully used MapR's ODBC driver for Windows. Since we are running 32-bit Excel, we needed to download the 32-bit ODBC driver for Hive, but MapR has a 64-bit driver as well.

5.4. Download and install the appropriate ODBC driver from MapR's website.

5.5. Configure an ODBC connection to the Hive database. We recommend adding an entry to your Windows hosts file (C:\Windows\System32\drivers\etc\hosts) to alias the IP address of your VM, for example an entry named cloudera-vm. You can get the IP address from your VM by typing the command ifconfig.

5.6. Create an ODBC connection to the Hive database. Open a new Excel workbook. From the Data tab, select From Other Sources, then From Data Connection Wizard. Select ODBC DSN and click Next.
Select the DSN you set up using the MapR driver (Cloudera Hive VM MapR). Select the tweets table, select Finish, and then select Properties.
This is the important part: we must override the HQL in order for the query to execute. At the time this article was written, the major ODBC drivers append "default" to the Hive query, and the MapR ODBC driver was the only one able to establish connectivity while allowing us to override the HQL.

Select the Definition tab. Using one of the Hive queries provided in the appendix, copy the HQL and paste it into the Command Text field; also check Save password. Hit OK to import the data.

Repeat for the remaining queries in the appendix, and create as many queries as you see fit. HQL is very SQL-like, and those of us who know SQL will find it easy to adapt the queries in the appendix into other statements that provide the views you need.
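As one illustration of how readily the appendix queries can be adapted, the variant below charts overall tweet volume by day of week. It uses only fields already present in the tweets table, following the same SUBSTR(created_at, 0, 3) idiom the appendix queries use to pull the day abbreviation:

SELECT SUBSTR(created_at, 0, 3) AS day_of_week,
       COUNT(*) AS total_count
FROM tweets
GROUP BY SUBSTR(created_at, 0, 3)
ORDER BY total_count DESC;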
V. Summary

Once you have successfully completed this tutorial, you should have a clearer understanding of Hadoop, specifically:

1. A quick overview of core Hadoop projects and how each is used to support streaming social media and reporting through a standard ODBC connection.
2. An operational Hadoop sandbox that you can navigate and explore, usable for training, local development, and proofs of concept.
3. A real-world reference model for a use case illustrating the streaming capabilities in Hadoop.
4. How to model semi-structured JSON data in Hive and query it in a conventional manner.

Lastly, this exercise should leave you wanting to take the Hadoop experience to the next level. Independently, you can layer in a Mahout program to cluster and classify the tweets, thereby simulating some form of sentiment analysis. You may also want to layer geospatial data into the set to provide more advanced analytics, or consider streaming data from other social media sites. Above all, you may want to show someone from the business what this new technology can do. By demystifying Big Data technology, you can take your understanding, and your ability to support additional business use cases, to these next levels.

Appendix: Custom Java code for MapReduce PathFilter

package com.twitter.util;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Excludes Flume's in-flight files from MapReduce input: any file whose name
// starts with "_" or "." or ends in ".tmp" is skipped.
public class FileFilterExcludeTmpFiles implements PathFilter {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".") && !name.endsWith(".tmp");
    }
}
Appendix: Hardware/software environment

Figure 5

Host: Windows 7 Enterprise 64-bit; Intel Core 2 Duo CPU 2.26GHz/2.27GHz; 8GB memory (7.9 addressable); 300GB disk; VMware Player, Microsoft Office 32-bit.
Guest: CentOS 6.2 Linux 64-bit; Intel Core 2 Duo CPU 2.26GHz; 2.98GB memory; 23.5GB disk; Cloudera Manager Free Edition, CDH4.1.2.

Appendix: Troubleshooting guide

Figure 6

Error: FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException SerDe com.cloudera.hive.serde.JSONSerDe does not exist)
Cause: Hive cannot find hive-serdes-1.0-SNAPSHOT.jar.
Resolution: 1. Place hive-serdes-1.0-SNAPSHOT.jar in /usr/lib/hadoop. 2. Edit /etc/hive/conf/hive-site.xml and add the following: <property> <name>hive.aux.jars.path</name> <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value> </property> 3. Stop and restart the Hive services.

Error: INFO org.apache.oozie.command.coord.CoordActionInputCheckXCommand: ... ActionInputCheck:: In checkListOfPaths: hdfs://localhost.localdomain:8020/user/flume/tweets/2013/01/17/10 is Missing.
Cause: Permissions on /user/flume/*.
Resolution: Change permissions on /user/flume: sudo -u flume hadoop fs -chmod -R 777 /user/flume

Error: Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [10001]
Cause: Missing MySQL driver.
Resolution: cp /var/lib/oozie/mysql-connector-java.jar oozie-workflows/lib

Error: OLE DB or ODBC error: [MapR][Hardy] (22) Error from ThriftHiveClient: Query returned non-zero code: 2, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask; HY000. An error occurred while the partition, with the ID of Tweets By Timezone_cbf7182e-a7a6-416c-a3fd-d7f484952cc6, Name of Tweets By Timezone, was being processed. The current operation was cancelled because another operation in the transaction failed.
Cause: Flume temp file permissions issue.
Resolution: Walk through the instructions under "Setting up Hive" to ensure the custom Java class that sets the MapReduce PathFilter is built, deployed, and referenced in Hive as specified.
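When several of these symptoms overlap, a quick end-to-end check from the shell helps isolate whether Hive, the SerDe, and the partitions are healthy before involving ODBC at all. This assumes at least one partition has been added by the Oozie workflow:

hive -e "SELECT text FROM tweets LIMIT 5"

If tweet text comes back, the SerDe, path filter, and partitions are working, and any remaining problem is likely on the ODBC or Excel side.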
Appendix: Excel queries [9]

Top 15 Big Data hashtags:

SELECT LOWER(hashtags.text), COUNT(*) AS total_count
FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15

Tweets by time zone and day:

SELECT user.time_zone, SUBSTR(created_at, 0, 3), COUNT(*) AS total_count
FROM tweets
WHERE user.time_zone IS NOT NULL
GROUP BY user.time_zone, SUBSTR(created_at, 0, 3)
ORDER BY total_count DESC
LIMIT 15
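A note on the hashtag query: LATERAL VIEW EXPLODE flattens the entities.hashtags array so that each hashtag becomes its own row before grouping. You can see the effect in isolation with a minimal probe such as:

SELECT LOWER(hashtags.text)
FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
LIMIT 5;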
Top 10 retweeted users on Big Data topics:

SELECT t.retweeted_screen_name, SUM(retweets) AS total_retweets, COUNT(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
             retweeted_status.text,
             MAX(retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10

Top 200 most active users on Big Data topics:

SELECT user.screen_name, COUNT(*) AS tweet_cnt
FROM tweets
GROUP BY user.screen_name
ORDER BY tweet_cnt DESC
LIMIT 200

References

1. gap_no_pan.html
2. Definitions from the Apache Hadoop website for each respective package.
3. cdh.html
4-7. datawith-hive/
8. Known issue with Flume; see jira/browse/flume
9. Adapted from analyzing-twitter-data-with-hadoop/
Collaborative Consulting is a leading information technology services firm dedicated to helping our clients achieve business advantage through the use of strategy and technology. We deliver a comprehensive set of solutions across multiple industries, with a focus on business process and program management, information management, software solutions, and software performance and quality. We also have a set of offerings specific to the life sciences and financial services industries. Our unique model offers both onsite management and IT consulting as well as U.S.-based remote solution delivery. To learn more about Collaborative, please visit our website or email us at [email protected].

Copyright 2014 Collaborative Consulting, LLC. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws.