Hadoop, Hive & Spark Tutorial
1 Introduction

This tutorial covers the basic principles of Hadoop MapReduce, Apache Hive and Apache Spark for the processing of structured datasets. For more information about the systems, you are referred to the corresponding documentation pages of:

- Apache Hadoop 2.6
- Apache Hive 1.2
- Apache Spark 1.4

1.1 Virtual Machine and First Steps

A Virtual Machine (VM) is provided with all the software already installed and configured. The machine also contains the source code and jars of the examples, along with the datasets used. As a first step you have to download the VM and open it with VirtualBox.

The VM holds an Ubuntu LTS operating system with the essential packages needed for the execution of the examples. The VM is configured with 4GB RAM by default, but you can modify the corresponding settings to best fit your host system. The VM is linked to 2 hard disks, one containing the OS and the other configured to be used as swap if necessary.

The VM's name is hadoopvm. Be careful if you change this name, as it is used in the configuration files of the provided systems. The user created for the tests is hadoop, with password hadoop. You may also connect to the VM with ssh.

Hadoop 2.6, Apache Hive 1.2 and Apache Spark 1.4 are installed in the user account under the bin directory. You can use the corresponding start/stop scripts (e.g., start-dfs.sh, start-yarn.sh), but a pair of scripts is provided to start and stop all the services. Their usage is as follows:

    # start_servers.sh
    ... use Hadoop, Hive or Spark ...
    # stop_servers.sh

Hadoop's distributed filesystem (HDFS) is already formatted. If you want to reformat it from scratch you may use the following command prior to starting the servers:

    # hadoop namenode -format
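Once the servers have been started, a quick way to check that the Hadoop daemons are running (not part of the provided scripts, just a convenience) is the jps tool shipped with the JDK, which should list processes such as NameNode, DataNode, ResourceManager and NodeManager:

    # jps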
1.2 Examples and Datasets

The examples are included in the $HOME/examples directory. There you can find the sources of all the examples of the tutorial, as well as the jar used for the MapReduce examples. The datasets are included in the $HOME/datasets directory. Two scripts are given to extract, transform and load the datasets into HDFS.

1.3 NASA web logs

This dataset contains 205MB of uncompressed logs extracted from the HTTP requests submitted to the NASA Kennedy Space Center web server in July 1995. This is an example of an HTTP request log:

    [01/Jul/1995:00:00: ] "GET /hist/apollo/ HTTP/1.0"

Each line contains the requesting host, the date, the HTTP request (HTTP method, URL and optionally the HTTP version), the HTTP return code and the number of returned bytes. More information about the dataset can be found in NASA-HTTP.html.

In order to load the dataset into HDFS, move to the datasets directory (/home/hadoop/datasets) and execute:

    # ./load_nasa_access_log.sh

As a result, the uncompressed file will be uploaded to the /data/logs HDFS directory. You may check this with the following command:

    # hadoop fs -ls -R /data

1.4 MovieLens dataset

The second dataset corresponds to the MovieLens 1M dataset, which contains 1 million ratings from 6000 users on 4000 movies. The description of the dataset is provided as part of the zip file and is also available online. The dataset is divided into three main files:

- Users: each line contains information about a user: its identifier, gender, age, occupation and zip-code.
- Movies: each line contains information about a movie: identifier, title (as it appears in IMDB, including release year) and a list of genres.
- Ratings: each line contains a user rating with the following information: user identifier, movie identifier, rating (5-star scale) and timestamp.

Originally the fields within the files are separated by two colons (::), but in order to facilitate importing into Apache Hive, those are replaced by asterisks (*) before loading the data into HDFS. Genres are separated by a vertical bar (|). Additionally, the movies data is also provided in JSON format, as this format will be used in Apache Spark.

In order to load the dataset into HDFS, move to the datasets directory and execute:

    # ./load_movieles.sh
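The provided load scripts already perform the separator substitution described above, but purely as an illustration (a minimal sketch, not the exact commands used by the scripts), such a replacement could be done with sed before uploading the files:

    # sed 's/::/*/g' ratings.dat > ratings_simple.dat
    # hadoop fs -put ratings_simple.dat /data/movielens_1m_simple/ratings/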
2 Hadoop MapReduce

In the first part of the tutorial we are going to use Apache Hadoop MapReduce to process the information contained in the NASA logs dataset. Before executing the examples, make sure you have successfully started the HDFS and YARN servers.

2.1 Introduction

MapReduce programs basically consist of two user-defined functions called respectively map and reduce. These functions are applied to records of key-value pairs and are executed in parallel. They also produce key-value pairs, the only restriction being that the output key-value pairs of the map function are of the same type as the input key-value pairs of the reduce function (more precisely, the reduce function receives a key and a set of values of the same type as the map output). Between the execution of both functions, an intermediate phase, called shuffle, is executed by the framework, in which intermediate records are grouped and sorted by key.

Hadoop MapReduce provides two different APIs to express MapReduce jobs, called mapred and mapreduce. In this tutorial, the first API is used.

2.2 Basic Examples

Methods Count. In the first example we are going to count how many requests have been submitted for each HTTP method. This example is very close to the typical WordCount example given in many MapReduce tutorials. The source code of this example can be found in the class fr.inria.zenith.hadoop.tutorial.logs.MethodsCount.

MapReduce allows the user to specify the InputFormat in charge of reading the files and producing the input key-value pairs. Throughout this tutorial, we will use TextInputFormat, which generates a record for each line, where the key is the offset of the beginning of the line and the value is the content of the line.

The map function for this example just extracts the HTTP method from the line and emits a key-value pair containing the method and ONE, meaning that this method has been found one time. Here is the main part of the code of the map function:

    private final static IntWritable ONE = new IntWritable(1);
    private Text method = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String lineStr = value.toString();
        int startRequest = lineStr.indexOf('"');
        int endRequest = lineStr.indexOf('"', startRequest + 1);
        String httpRequest = lineStr.substring(startRequest + 1, endRequest);
        method.set(httpRequest.split(" ")[0]);
        output.collect(method, ONE);
    }

Input types (LongWritable and Text) are given by the InputFormat. The intermediate key-value pairs are sent to the OutputCollector and their types are defined in the class.
Note that in the source code provided in the VM, additional sanity checks are included to deal with incorrect inputs.

The reduce function will receive, for each method, a collection with the counts of appearances. It just has to sum them and emit the sum.

    public void reduce(Text code, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(code, new IntWritable(sum));
    }

Finally, a MapReduce job has to be created and launched. The job is represented by a JobConf. This object is used to specify all the configuration of the job, including map and reduce classes, input and output formats, etc. The map and reduce functions have been included in two classes called Map and Reduce, respectively. A combiner, which is in charge of performing partial aggregates, is also specified (in this case, the same class as the reduce is used). Input and output paths are specified through the FileInputFormat and FileOutputFormat classes.

    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), MethodsCount.class);
        conf.setJobName("methods_count");

        String inputPath = args[0];
        String outputPath = args[1];

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(inputPath));
        FileOutputFormat.setOutputPath(conf, new Path(outputPath));

        JobClient.runJob(conf);
        return 0;
    }

To execute this program you can use the following command:

    # hadoop jar HadoopTutorial.jar fr.inria.zenith.hadoop.tutorial.logs.MethodsCount /data/logs /out/logs/methods_count

The command contains the jar with the program, the class to be executed, and the input and output paths specified as parameters.
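Once the job has finished, its output can be inspected directly from HDFS (assuming the default naming of the output files, one part file per reducer):

    # hadoop fs -cat /out/logs/methods_count/part-00000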
Web Page Popularity. The next example obtains, for each web page, the number of requests received during the month. The code is very similar to the previous one and is included in the class WebPagePopularity in the same package as before. The only difference is the addition of a filter, as we only consider as web pages those resources ending with .html or /. The addition of filters in MapReduce is done programmatically and simply consists in emitting or not a given key-value pair. This is represented by this block of code:

    String url = httpRequest.split(" ")[1];
    if (url.endsWith(".html") || url.endsWith("/")) {
        pageUrl.set(url);
        output.collect(pageUrl, ONE);
    }

Parameters and Counters

Image Traffic. In the last example we want to obtain the total amount of traffic (returned bytes) generated by the images (resources ending with .gif). Only the images generating more than a specific amount of traffic (specified as a parameter) are returned. This example contains several additional elements.

Firstly, in this case, all the fields of the logs are obtained with a regular expression. Whenever the expression does not find any match we increment a counter. We proceed the same way for entries with no information about the returned bytes. When executing the program, those counters will be appended to the information given to the user.

    enum Counters {MALFORMED_LOGS, NO_BYTES_LOGS}
    ...
    if (m.find()) {
        ...
    } else {
        reporter.incrCounter(Counters.MALFORMED_LOGS, 1);
    }

The second novelty of this example is the inclusion of a user-defined parameter. In this case, the parameter filters out the pages generating less than a given amount of traffic, specified in MBs. Parameters are passed in the job configuration like this:

    conf.setInt("img_traffic.min_traffic", Integer.parseInt(args[2]));

Parameters are read in the map or reduce class by overriding the configure() method:

    public void configure(JobConf job) {
        minTraffic = job.getInt("img_traffic.min_traffic", 0) * (1024 * 1024);
    }

To specify the minimum traffic when executing the job, we add it to the command line arguments:

    # hadoop jar HadoopTutorial.jar fr.inria.zenith.hadoop.tutorial.logs.ImageTraffic /data/logs /out/logs/image_traffic 2
Additional queries

Note that the result of the last query is sorted by image URL. It would be more interesting to sort the pages by descending order of traffic. We cannot do this in a single MapReduce job; we need to execute an additional one. In the method main(), we should create an additional JobConf to be executed after the first job, taking as input path the output path of the first job. The map and reduce functions needed to sort the output would be very simple (a possible sketch is given below).
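As an illustration only (this sketch is not part of the provided sources, and it assumes the first job wrote lines of the form "<url>\t<traffic>"), the second job could emit the traffic as the key and the URL as the value, letting the shuffle sort the records by key:

    // Sketch of a second, sorting job (not included in the VM sources).
    // It assumes the first job's output lines contain "<url>\t<traffic>".
    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> output, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split("\t");
        output.collect(new LongWritable(Long.parseLong(fields[1])), new Text(fields[0]));
    }

    // The reduce function simply swaps the pair back when writing the output.
    public void reduce(LongWritable traffic, Iterator<Text> urls,
                       OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        while (urls.hasNext()) {
            output.collect(new Text(urls.next()), new LongWritable(traffic.get()));
        }
    }

Since keys are sorted in ascending order by default, the JobConf of this second job would also set a decreasing comparator, e.g. conf.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class), to obtain the descending order.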
3 Apache Hive

In this part of the tutorial we are going to use Apache Hive to execute the same queries as in the previous section and observe that it is much easier to write them in this framework. Additionally, we are going to use the MovieLens dataset to execute other types of queries.

Hive provides an interactive shell that we are going to use to execute the queries. To enter the shell, just type:

    # hive

Alternatively, you can create a file for each query and execute it independently with the option hive -f <filename>.

3.1 Introduction

Hive is a data warehouse solution to be used on top of Hadoop. It allows accessing the files in HDFS the same way as MapReduce and querying them using an SQL-like query language, called HiveQL. The queries are transformed into MapReduce jobs and executed in the Hadoop framework. The metadata containing the information about the databases and schemas is stored in a relational database called the metastore. You can plug in any relational database providing a JDBC driver. In this tutorial we are using the simpler embedded Derby database provided with Hive.

All the queries used in this part of the tutorial can be found in the directory examples/src/hive.

3.2 NASA Logs

Databases and Tables

As in any traditional database, in Hive we can create and explore databases and tables. We are going to create a database for the NASA logs dataset.

    create database if not exists nasa_logs;

We can get information about this database by using the command describe.

    describe database nasa_logs;

As a result of this command we can learn the location of this database, which should be the HDFS directory hive/warehouse/nasa_logs.db, as it is the default location indicated in the configuration files. Of course this default location can be changed. It is also possible to specify a particular location for a given database. Now we move to the database by typing:

    use nasa_logs;

The next step is to create a table containing the information of the logs. We will specify the fields and types as in standard SQL. Additionally, we are going to tell Hive how to parse the input files. In this case, we use a regular expression as in the ImageTraffic example.
    create external table logs (
        host string,
        req_time string,
        req_code string,
        req_url string,
        rep_code string,
        rep_bytes string
    )
    row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    with serdeproperties (
        "input.regex" = "([^ ]*) - - \\[(.*)\\] \"([^ ]*) ([^ ]*).*\" ([^ ]*) ([^ ]*)"
    )
    stored as textfile
    location "/data/logs";

Different row formats and file types can be specified. The default row format for textfile is based on separators. In this case we use a SerDe provided with Hive and based on regular expressions. You could create your own SerDe and use it to parse the tables in a personalized way. We have defined the table as external and specified its location. This keeps the table in that location and makes Hive assume that it does not own the data, meaning that the data is not removed when the database is dropped.

Now that the table is created we can query it directly using SQL commands, for instance:

    select * from logs limit 10;

Note: if we want the output to include the headers of the columns we can set the corresponding property:

    set hive.cli.print.header=true;

Queries

At this point we are ready to execute the same examples we have implemented previously with MapReduce.

Methods Count. We use a simple SQL expression with a group by:

    select req_code, count(*) as num_requests from logs group by req_code;

Note that this query generates a MapReduce job, as you can see from the output logs.

Web Page Popularity. We have enriched the query by sorting the output and limiting the number of pages. Note that here we use a HiveQL-specific expression, rlike, which allows using standard regular expressions.
    select req_url, count(*) as num_requests
    from logs
    where req_url rlike "html$" or req_url rlike "/$"
    group by req_url
    order by num_requests desc
    limit 20;

Image Traffic. Again we sort the output by traffic, as suggested at the end of the MapReduce section. As explained before, we need two jobs to perform that, and we can observe this in the output logs.

    select req_url, sum(rep_bytes) as traffic
    from logs
    where req_url like "%gif"
    group by req_url
    having traffic > 5 * 1024 * 1024
    order by traffic desc;

3.3 MovieLens Dataset

Databases and Tables

As before, we create a database and move to it:

    create database if not exists movielens;
    use movielens;

Now we have to create 3 tables: one for the users, one for the movies and one for the ratings. In this case we are going to use the default SerDe for textfile. We just have to specify the field separator and the array separator. Indeed, Hive, as opposed to standard SQL, allows using complex structures as types, such as arrays, structs and maps.

    create table movies (
        id int,
        title string,
        genres array<string>
    )
    row format delimited
        fields terminated by "*"
        collection items terminated by "|"
        lines terminated by "\n"
    stored as textfile;

Note that in this case we do not specify a location; Hive creates a directory for the table in the database directory. Additionally, we have to load the data into it.

    load data inpath "/data/movielens_1m_simple/movies" into table movies;
As a result of the load command, Hive has moved the files into the table directory. Since we need the other tables for the last part of the tutorial, we define them as external and specify their location.

    create external table users (
        id int,
        sex string,
        age int,
        occupation int,
        zipcode string
    )
    row format delimited
        fields terminated by "*"
        collection items terminated by "|"
        lines terminated by "\n"
    stored as textfile
    location "/data/movielens_1m_simple/users";

    create external table ratings (
        user_id int,
        movie_id int,
        rating int,
        ts bigint
    )
    row format delimited
        fields terminated by "*"
        collection items terminated by "|"
        lines terminated by "\n"
    stored as textfile
    location "/data/movielens_1m_simple/ratings";

Arrays

We have seen that Hive also allows arrays as types in the tables. It is possible to access specific positions of the array as in a typical programming language. It is also possible to generate a tuple for each element of the array with the command explode:

    select explode(genres) as genre from movies;

In order to combine explode with a projection of any other column we need to use lateral view. In this way, if we want to show the movie title and the genres at the same time, we have to execute a query like this:

    select title, genre
    from movies lateral view explode(genres) genre_view as genre;
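As mentioned above, individual positions of the array can also be accessed directly with the usual bracket notation (a simple illustration, not part of the original query set):

    select title, genres[0] as first_genre from movies limit 10;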
Most Popular Genre. With these new expressions we can write some interesting queries, for instance to find the most popular genre. We also use another syntax feature of HiveQL:

    from (select explode(genres) as genre from movies) genres
    select genres.genre, count(*) as popularity
    group by genres.genre
    order by popularity desc;

Joins

Hive also allows performing joins on tables. The syntax is similar to that of standard SQL. Note that the order of the tables in the join matters, as Hive assumes that the last table is the largest in order to optimize the resulting MapReduce jobs.

Titles of the Best Rated Movies. With this join query we show the titles of the 20 movies with the best average rating.

    select m.title as movie, avg(r.rating) as avg_rating
    from movies m join ratings r on m.id = r.movie_id
    group by m.title
    order by avg_rating desc
    limit 20;
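If reordering the tables is not convenient, Hive also provides the STREAMTABLE hint to indicate which table should be streamed through the reducers; for instance (a variant not used in the rest of this tutorial):

    select /*+ STREAMTABLE(r) */ m.title as movie, avg(r.rating) as avg_rating
    from movies m join ratings r on m.id = r.movie_id
    group by m.title
    order by avg_rating desc
    limit 20;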
4 Apache Spark

In the last part of the tutorial we use Spark to execute the same queries previously executed with MapReduce and Hive. Spark provides APIs to execute jobs in Java, Python, Scala and R, and two interactive shells in Python and Scala. The examples of this tutorial are executed with the Python API. To open the corresponding shell just type:

    # pyspark

Alternatively, if you want to benefit from IPython features you can use the following command:

    # IPYTHON=1 pyspark

4.1 Introduction

Apache Spark is a general-purpose cluster computing system which provides efficient execution by performing in-memory transformations of data. As opposed to MapReduce, it allows arbitrary workflows with any number of stages and provides a broader choice of transformations, including map and reduce. Spark can work with different cluster managers, such as Hadoop's YARN or Apache Mesos. In this tutorial we are using the standalone mode.

The first thing to do when executing a program in Spark is to create a SparkContext. When using the shell this is done automatically, so this step is not necessary; in that case the context is stored in the variable sc.

    conf = SparkConf().setAppName("spark_tutorial").setMaster("spark://hadoopvm:7077")
    sc = SparkContext(conf=conf)

4.2 NASA Logs

RDDs

The core structures in Spark are the Resilient Distributed Datasets (RDDs), fault-tolerant collections of elements that can be operated on in parallel. RDDs can be created from external data sources, such as HDFS. They can also be obtained by applying transformations to another RDD. For instance, we can create an RDD from the NASA logs file:

    raw_logs = sc.textFile("hdfs://localhost:9000/data/logs/nasa_access_log_jul95")

Note that in Spark, execution is done lazily: only when results are required are the executions performed. We can force Spark to read the file, for instance, by getting the first 10 elements:

    raw_logs.take(10)

Now we can start to apply operations to the data. First, we are going to parse the data so that it is conveniently structured. We define a parsing function with the same regular expression as before:
    import re

    def parse_log_line(line):
        m = re.search('([^ ]*) - - \\[(.*)\\] "([^ ]*) ([^ ]*).*" ([^ ]*) ([^ ]*)', line)
        if m:
            record = (m.group(1), m.group(2), m.group(3), m.group(4))
            record += (int(m.group(5)),) if m.group(5).isdigit() else (None,)
            record += (int(m.group(6)),) if m.group(6).isdigit() else (0,)
            return record
        else:
            return None

Then we apply this function to each line and filter out void tuples. As we can see, it is possible to chain transformations, as the result of one transformation is also an RDD. Here we use two types of transformations: map, which applies a function to all the elements of the RDD, and filter, which filters out those elements for which the provided function returns false.

    logs = raw_logs.map(parse_log_line).filter(lambda r: r).cache()

As said before, Spark executes operations lazily. If we want something to happen, we can execute an action like showing the first element:

    logs.first()

Another important thing we have done is caching. Since logs is an RDD that we are going to reuse, we can cache it in memory so that all the previous transformations done to generate it do not need to be executed each time. To indicate this we use the method cache().

At this point we can re-execute the examples shown before with Spark.

Method Count. We just apply the same map and reduce functions as in MapReduce, but with Spark's syntax.

    count_methods = logs.map(lambda record: (record[2], 1))\
                        .reduceByKey(lambda a, b: a + b)

We can show all the results with the action collect(). (We illustrate the use of collect() here because we know the number of results is low. As a rule of thumb, avoid this command, as it loads all the RDD elements into a single variable and can lead to memory problems.)

    count_methods.collect()

Web Page Popularity. To enrich this example, after performing the grouping and count, we sort the results by popularity. We need to swap the elements of each tuple and then use the transformation sortByKey(). The option passed to this method indicates that the ordering should be done in descending order.

    web_page_urls = logs.map(lambda record: record[3])\
                        .filter(lambda url: url.endswith(".html") or url.endswith("/"))

    web_page_popularity = web_page_urls.map(lambda url: (url, 1))\
                                       .reduceByKey(lambda a, b: a + b)\
                                       .map(lambda kv: (kv[1], kv[0]))\
                                       .sortByKey(False)

    web_page_popularity.take(20)

Image Traffic. For the image traffic example we use the same strategy as before, with an additional filter to remove images with less than the minimum traffic.

    min_traffic = 150

    bytes_per_image = logs.map(lambda r: (r[3], int(r[5])))\
                          .filter(lambda url_bytes: url_bytes[0].endswith(".gif"))

    img_traffic = bytes_per_image.reduceByKey(lambda a, b: a + b)\
                                 .map(lambda kv: (kv[1] / (1024 * 1024), kv[0]))\
                                 .sortByKey(False)

    selected_img_traffic = img_traffic.filter(lambda a: a[0] > min_traffic)
    selected_img_traffic.take(20)

Spark SQL

Apache Spark also offers the user the possibility of querying the RDDs by using SQL statements. For that purpose, we first need to create an SQLContext.

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)

Spark SQL provides different mechanisms to define SQL tables. In this first example we are going to automatically infer the table schema. Other possible methods to define SQL tables will be presented with the MovieLens dataset.

We first load the data and store it in an RDD. We need a different parsing function, as we need to create elements of type Row. Note that the names of the optional parameters passed in the Row construction define the names of the columns of the SQL table.

    import re
    from pyspark.sql import Row

    def parse_log_line(line):
        m = re.search('([^ ]*) - - \\[(.*)\\] "([^ ]*) ([^ ]*).*" ([^ ]*) ([^ ]*)', line)
        if m:
            return Row(host=m.group(1),
                       req_date=m.group(2),
                       req_method=m.group(3),
                       req_url=m.group(4),
                       rep_code=int(m.group(5)) if m.group(5).isdigit() else None,
                       rep_bytes=int(m.group(6)) if m.group(6).isdigit() else 0)
        else:
            return None

    logs = raw_logs.map(parse_log_line).filter(lambda r: r).cache()

Then we can use the schema inference mechanism to generate the table. We also need to register the table so that it can be accessed by SQL statements. Finally, we show the obtained schema:
    schemaLogs = sqlContext.inferSchema(logs)
    schemaLogs.registerTempTable("logs")
    schemaLogs.printSchema()

Now we can use SQL to query the data. For instance, the first example would be done like this:

    count_methods = sqlContext.sql(
        "select req_method, count(*) as num_requests "
        "from logs "
        "group by req_method")
    count_methods.collect()

You can try to execute the other examples by reusing the SQL commands executed in the Hive part of the tutorial.

4.3 MovieLens Dataset

For the MovieLens dataset we are going to use three additional mechanisms to load each of the three tables. Additionally, we use a HiveContext, which is a superset of SQLContext, as it provides the same methods plus HiveQL compatibility.

    from pyspark.sql import HiveContext
    sqlContext = HiveContext(sc)

User-defined Schema

In Spark SQL, the user can manually specify the schema of the table. The first step, as always, is to load the RDD and parse it. We define a new parsing function that applies the corresponding type conversions.

    def parse_user(line):
        fields = line.split("*")
        fields[0] = int(fields[0])
        fields[2] = int(fields[2])
        fields[3] = int(fields[3])
        return fields

    users_lines = \
        sc.textFile("hdfs://localhost:9000/data/movielens_1m_simple/users/users.dat")
    users_tuples = users_lines.map(parse_user)

Now we can manually define the columns by specifying their name, type and whether the values can be null or not. Then, we use this schema to create and register the table.

    from pyspark.sql.types import StructField, StringType, IntegerType, StructType

    fields = [
        StructField("id", IntegerType(), False),
        StructField("gender", StringType(), False),
        StructField("age", IntegerType(), False),
        StructField("occupation", IntegerType(), True),
        StructField("zipcode", StringType(), False),
    ]
    users_schema = StructType(fields)
    users = sqlContext.applySchema(users_tuples, users_schema)
    users.registerTempTable("users")
    users.printSchema()

JSON

Spark, either within its core or through third-party libraries, provides mechanisms to load standard file formats, such as JSON, Parquet or CSV files. Here we will load the movies table from a JSON file. We can observe that this mechanism is very simple to use.

    movies = sqlContext.jsonFile(hdfs_prefix + "/data/movielens_1m_mod/movies.json")
    movies.registerTempTable("movies")
    movies.printSchema()

Hive Table Loading

Finally, tables can be loaded as in Hive. It is not the case in this tutorial, but if Hive and Spark are correctly linked and this mechanism is used, changes made in both systems will be reflected in the same metastore and both systems can operate over the same schemas.

    sqlContext.sql(
        "create table ratings ( "
        "user_id int, "
        "movie_id int, "
        "rating int, "
        "ts bigint) "
        "row format delimited "
        "fields terminated by '*' "
        "lines terminated by '\\n' "
        "stored as textfile")

    sqlContext.sql(
        "load data inpath '/data/movielens_1m_simple/ratings/ratings.dat' "
        "into table ratings")

Hive Queries

When using a HiveContext we can use the HiveQL-specific syntax features to access our tables. For instance, we can use the expression explode that we have seen before:

    genres = sqlContext.sql(
        "select explode(genres) as genre "
        "from movies")
    genres.take(10)

Data Frames

Maybe you have noticed that when creating a schema, a DataFrame is created instead of an RDD. Apart from allowing the usage of SQL statements, DataFrames also provide additional methods to query their contents.
    age_oc2 = users.filter("gender = 'M'")\
                   .select("occupation", "age")\
                   .groupBy("occupation")\
                   .avg("age")\
                   .orderBy("avg(age)", ascending=0)
    age_oc2.take(10)
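For comparison (a sketch reusing the users table registered earlier, not part of the original examples), the same result can also be obtained with a plain SQL statement:

    age_oc_sql = sqlContext.sql(
        "select occupation, avg(age) as avg_age "
        "from users "
        "where gender = 'M' "
        "group by occupation "
        "order by avg_age desc")
    age_oc_sql.take(10)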