Hadoop for HPCers: A Hands-On Introduction. Part I: Overview, MapReduce. Jonathan Dursi, SciNet Michael Nolta, CITA. Agenda
1 Hadoop for HPCers: A Hands-On Introduction. Jonathan Dursi, SciNet; Michael Nolta, CITA. Part I: Overview, MapReduce. Agenda: VM test; high-level overview; Hadoop FS; MapReduce; Hadoop MR + Python; Hadoop MR; break; hands-on with examples: word count, inverted index, document similarity, matrix multiplication, diffusion
2 Detailed VM instructions Install VirtualBox (free) for your system. Download and unzip the course VM from SciNetHadoopVM.zip Start VirtualBox; click New; give your VM a name. Select Linux as Type, and Ubuntu as Version. Give your VM at least 2048 MB RAM; more would be better. Select Use an existing virtual hard drive, and choose the .vdi file you downloaded. Click Create. Before starting your VM, enable easy network access between the host and VM. Go into the VirtualBox app preferences (VirtualBox > Preferences > Network) and, if one doesn't already exist, add a host-only network. Select the new VM and click Settings. Under System, make sure Enable IO APIC is checked. Then under Network, select Adapter 2, enable it, and attach it to Host-only adapter. Click OK. This will allow you to easily transfer files between your laptop and the virtual machine. Also under System, then Processor, give your VM a couple of cores to play with; for safety, you might want to bring down the Execution cap to 50% or so. Start the VM; username is hadoop-user, password is hadoop. Open a terminal; run source ./init.sh
3 Let's Get Started! Fire up your course VM Open a terminal; source init.sh cd wordcount make You've run your (maybe) first Hadoop job!
4 Hadoop 2007 OSS implementation of the 2004 Google MapReduce paper Consists of the distributed filesystem HDFS, a core runtime, and an implementation of MapReduce. Hardest to understand for HPCers: Java Pronounced Hay-doop.
5 Hadoop Ecosystem Usage exploded; creation of many tools building atop the Hadoop infrastructure: Mahout (data mining), Spark/Shark (iterative computing), Impala (interactive DB). Met a real need
6 Data Intensive Computing Data volumes increasing massively Clusters, storage capacity increasing massively 1000x! Disk speeds are not keeping pace. Seek speeds even worse than read/write (Chart: disk bandwidth in MB/s vs. CPU in MIPS over time.)
7 Scale-Out Disk streaming speed ~50 MB/s: 3 TB = 17.5 hrs, 1 PB = 8 months Scale-out (weak scaling) - filesystem distributes data on ingest
8 Scale-Out Seeking too slow Batch processing: pass through the entire data set in one (or a small number) of passes
9 Combining results Each node preprocesses its local data Shuffles its data to a small number of other nodes Final processing, output is done there
10 Fault Tolerance Data also replicated upon ingest Runtime watches for dead tasks, restarts them on live nodes Re-replicates
11 What is it good at? Classic Hadoop is all about batch processing of massive amounts of data (not much point below ~1TB) Map-Reduce is relatively loosely coupled; one shuffle phase. Very strong weak scaling in this model - more data, more nodes. Batch: process all data in one go with classic MapReduce (New Hadoop has many other capabilities besides batch) (Chart: databases, MPI, and Hadoop MapReduce placed on axes of tightly/loosely coupled vs. interactive/batch.)
12 What is it good at? Compare with databases: very good at working on small subsets of large databases; very interactive for many tasks; have been difficult to scale. (Chart: DB, MPI, and Hadoop MapReduce on axes of tightly/loosely coupled vs. serial/highly parallel.)
13 What is it good at? Compare with HPC (MPI): also typically batch; can (and does) go up to enormous scales. Works extremely well for very tightly coupled problems: zillions of iterations/timesteps/exchanges. (Chart: DB, MPI, and Hadoop MapReduce on axes of tightly/loosely coupled vs. serial/highly parallel.)
14 Hadoop vs HPC We HPCers might be tempted to an unseemly smugness. They solved the problem of disk-limited, loosely-coupled, data analysis by throwing more disks at it and weak scaling? Ooooooooh.
15 Hadoop vs HPC We'd be wrong. A single novice developer can write real, scalable data-processing tasks in Hadoop-family tools in an afternoon. MPI... less so. (Chart: developer productivity, high to low, vs. serial to highly parallel; Hadoop + ecosystem scores high, MPI low.)
16 Data Distribution: Disk Hadoop and similar architectures handle one hard part of parallelism for you - data distribution. On disk: HDFS distributes, replicates data as it comes in Keeps track; computations local to data
17 Data Distribution: Network On the network: MapReduce works in terms of key-value pairs. The preprocessing (map) phase ingests data and emits (k,v) pairs: (key1,17) (key5,23) (key1,99) (key2,12) (key5,83) (key2,9) The shuffle phase assigns reducers and gets all pairs with the same key onto that reducer: (key1,[17,99]) (key5,[23,83]) (key2,[12,9]) The programmer does not have to design communication patterns
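In Python terms, the shuffle is just a group-by-key. A minimal in-memory sketch (illustrative only; real Hadoop does this across the network, one group per reducer):

```python
from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key, as the shuffle phase does.
    Each key ends up on exactly one reducer with all of its values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

pairs = [("key1", 17), ("key5", 23), ("key1", 99),
         ("key2", 12), ("key5", 83), ("key2", 9)]
print(shuffle(pairs))
# {'key1': [17, 99], 'key5': [23, 83], 'key2': [12, 9]}
```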
18 Makes the problem easier Decomposing the problem, and getting the intermediate data where it needs to go, are the hardest parts of parallel programming with HPC tools. Hadoop does that for you automatically for a wide range of problems. (Chart: developer productivity vs. parallelism for DB, Hadoop + ecosystem, and MPI.)
19 Built a reusable substrate The filesystem (HDFS) and the MapReduce layer were very well architected. Enables many higher-level tools: Mahout (data mining), Spark/Shark (iterative computing), Impala (interactive DB); data analysis, machine learning, NoSQL DBs,... Extremely productive environment And Hadoop 2.x (YARN) is now much much more than just MapReduce
20 and Hadoop vs HPC Not either-or anyway Use HPC to generate big / many simulations, Hadoop to analyze results Use Hadoop to preprocess huge input data sets (ETL), and HPC to do the tightly coupled computation afterwards. Besides,...
21 1: Everything's Converging These models are all converging at the largest scales Good ideas are good ideas. MPI is trying to grow fault tolerance (but MPI codes?) DBs are trying to scale up (Chart: DBs, Hadoop, and MPI converging along axes of fault tolerance, massive parallelism, and tight coupling.)
22 1: Everything's Converging People are building tools for tightly coupled computation atop Hadoop-like frameworks (cf. Spark, Giraph) Hadoop will probably do tightly coupled well long before MPI codes do fault tolerance (Chart: DBs, Hadoop, and MPI converging along axes of fault tolerance, massive parallelism, and tight coupling.)
23 2: Computation is Computation These models of computation aren't that different Different problems fall in different models' sweet spots. In the VM: cd ~/diffusion; make (will take a while) Distributed 1d diffusion PDE Will also look at (more reasonably) sparse matrix multiplication
24 Hadoop Job Workflow Let's take a look at the Makefile in the wordcount example Basic tasks: building the program; copying files in; running; getting output
25 Hadoop Job Workflow Building the program: compile to bytecode against the current version of Hadoop Build a .jar file which contains all the relevant classes; this .jar file gets shipped off in its entirety to the workers
26 Hadoop Job Workflow Running the program: must first copy the input files (hdfs dfs -put) onto the Hadoop file system Remove (hdfs dfs -rm -r) the output directory if it exists Run the program by specifying the jar file and the class of the program, and give it any arguments Type out (cat) the output file.
27 The Hadoop Filesystem hdfs dfs -[cmd] HDFS is a distributed parallel filesystem Not a general purpose file system: doesn't implement POSIX; can't just mount it and view files Access via hdfs dfs commands Also a programmatic (Java) API Security slowly improving Commands: cat chgrp chmod chown copyFromLocal copyToLocal cp du dus expunge get getmerge ls lsr mkdir moveFromLocal mv put rm rmr setrep stat tail test text touchz
28 The Hadoop Filesystem Required to be: able to deal with large files, large amounts of data scalable reliable in the presence of failures fast at reading contiguous streams of data only need to write to new files or append to files require only commodity hardware
29 The Hadoop Filesystem As a result: Replication Supports mainly high bandwidth, not low latency No caching (what's the point if just streaming reads) Poor support for seeking around files Poor support for zillions of files Have to use a separate API to see the filesystem Modeled after the Google File System (2004 Map Reduce paper)
30 Blocks in HDFS HDFS is a block-based file system. A file is broken into blocks, and these blocks are distributed across nodes Blocks are large; 64MB is the default, many installations use 128MB or larger Large block size: the time to stream a block is much larger than the disk seek time to access it. hdfs fsck / -files -blocks lists all blocks in all files
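The arithmetic behind large blocks can be sketched with assumed (not measured) figures - a 64 MB block, ~100 MB/s sequential bandwidth, and a ~10 ms seek:

```python
# Rough arithmetic behind large HDFS blocks (illustrative numbers).
block_mb = 64
stream_mb_per_s = 100.0   # assumed sequential read bandwidth
seek_s = 0.010            # assumed cost of one disk seek, ~10 ms

stream_s = block_mb / stream_mb_per_s
# Seek cost is a tiny fraction of the time spent streaming the block,
# so per-byte overhead stays negligible.
print(f"stream 64 MB block: {stream_s:.2f} s, seek overhead: {seek_s / stream_s:.1%}")
# stream 64 MB block: 0.64 s, seek overhead: 1.6%
```

With a tiny (say 1 MB) block, the same seek would be over 10x more of the total time, which is why block-based streaming filesystems keep blocks big.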
31 Datanodes and Namenode Two different types of nodes in the filesystem Namenode - stores all metadata and block locations in memory. Updates are stored to a persistent journal. Lots of files = bad Datanodes - store and retrieve blocks for the client or namenode (Diagram: namenode holding all metadata for /user/ljdursi/wordcount and /user/ljdursi/diffuse; datanode1-3 holding blocks of bigdata.dat.)
32 Datanodes and Namenode Newer versions of Hadoop - federation (different namenodes for /user, /data, /project, etc.) Newer versions of Hadoop - High Availability namenode pairs (Diagram: namenode with all metadata; datanode1-3 with blocks of bigdata.dat.)
33-38 Writing a file Writing a file is a multiple-stage process: 1. Create file 2. Get nodes for blocks from the namenode 3. Start writing to the datanodes 4. Datanodes coordinate replication 5. Get ack back (while writing) 6. Complete (Diagram sequence: client writing newdata.dat via the namenode and datanode1-3.)
39 Where to Replicate? Tradeoff in choosing replication locations Close: faster updates, less network bandwidth Further: better failure tolerance Default strategy: first copy on the writer's node, second on a different rack (switch), third on the same rack as the second but a different node. Strategy is configurable. Need to configure the Hadoop file system to know the location of nodes (Diagram: switch 1/switch 2, rack1/rack2.)
40-42 Reading a file Reading a file is shorter: 1. Open 2. Get block locations from the namenode 3. Read blocks from the datanodes (Diagram sequence: client reading lines from bigdata.dat.)
43 Configuring HDFS Need to tell HDFS how to set up the filesystem: data.dir, name.dir - where on the local system (e.g., local disk) to write data; parameters like replication - how many copies to make; default name - default file system to use. Can specify multiple.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your.server.name.com:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/username/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/username/hdfs/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
44 Configuring HDFS $HADOOP_PREFIX/etc/hadoop/core-site.xml For us: only one node to be used, our laptops; default: localhost
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
45 Configuring HDFS $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml Since there is only one node, need to specify a replication factor of 1, or writes will always fail
<configuration>
  ...
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
46 Configuring HDFS ~/.bashrc Also need to make sure that environment variables are set: path to Java, path to Hadoop
export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_VERSION=1.1.2
export HADOOP_PREFIX=/path/to/hadoop-${HADOOP_VERSION}
$HADOOP_PREFIX/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/default-java
47 Configuring HDFS $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys Finally, have to make sure that passwordless login is enabled Can start processes on various FS nodes
48 Configuring HDFS Once the configuration files are set up, can format the namenode, then start up just the file system: Done for you in init.sh
$ hdfs namenode -format
$ start-dfs.sh
49 Using HDFS Now once the file system is up and running, you can copy files back and forth: get/put, copyFromLocal/copyToLocal Default working directory is /user/${username} Nothing like a cd Try copying a Makefile or something to HDFS, doing an ls, then copying it back and making sure it's stayed the same. hadoop fs -[cmd] cat chgrp chmod chown copyFromLocal copyToLocal cp du dus expunge get getmerge ls lsr mkdir moveFromLocal mv put rm rmr setrep stat tail test text touchz
50 Hadoop Job Workflow Building the program Running a Map Reduce program...
51 MapReduce Two classes of compute tasks: a Map and a Reduce Map processes one element at a time, emits results as (key, value) pairs. All results with same key are gathered to the same reducers Reducers process list of values, emit results as (key, value) pairs. (key1,17) (key5, 23) (key1,99) (key2, 12) (key5,83) (key2, 9) (key1,[17,99]) (key5,[23,83]) (key2,[12,9])
52 Map All coupling is done during the shuffle phase Embarrassingly parallel task - all map Take input, map it to output, done. (Famous case: NYT using Hadoop to convert 11 million image files to PDFs - almost pure serial farm job)
53 Reduce Reducing gives the coupling In the case of the NYT task, not quite embarrassingly parallel; images from multi-page articles Convert a page at a time, gather images with same article id onto node for conversion. (key1,17) (key5, 23) (key1,99) (key2, 12) (key5,83) (key2, 9) (key1,[17,99]) (key5,[23,83]) (key2,[12,9])
54 Shuffle The shuffle is part of the Hadoop magic By default, keys are hashed and hash space is partitioned between reducers On reducer, gathered (k,v) pairs from mappers are sorted by key, then merged together by key Reducer then runs on one (k,[v]) tuple at a time (key1,17) (key5, 23) (key1,99) (key2, 12) (key5,83) (key2, 9) (key1,[17,99]) (key5,[23,83]) (key2,[12,9])
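The default key-to-reducer assignment is just hash-then-mod. A sketch in Python; here `zlib.crc32` stands in for the Java `hashCode()` that Hadoop's default HashPartitioner actually uses (Python's built-in `hash()` is randomized per process, so a deterministic hash makes the point better):

```python
import zlib

def partition(key, num_reducers):
    """Assign a key to a reducer by hashing it and taking the remainder.
    All pairs with the same key land on the same reducer."""
    return zlib.crc32(key.encode()) % num_reducers

for k in ["Hello", "World", "Bye", "Hadoop", "Goodbye"]:
    print(k, "-> reducer", partition(k, 2))
```

The partition of the hash space is roughly even, but says nothing about which keys are "similar" - hence the custom-partitioner escape hatch on the next slide.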
55 Shuffle If you do know something about the structure of the problem, can supply your own partitioner Assign keys that are similar to each other to same node Reducer still only sees one (k, [v]) tuple at a time. (key1,17) (key5, 23) (key1,99) (key2, 12) (key5,83) (key2, 9) (key1,[17,99]) (key5,[23,83]) (key2,[12,9])
56 Word Count Was used as an example in the original MapReduce paper Now basically the hello world of map reduce Do a count of words over some set of documents. A simple model of many actual web analytics problems file01: Hello World Bye World file02: Hello Hadoop Goodbye Hadoop output/part: Hello 2, World 2, Bye 1, Hadoop 2, Goodbye 1
57 Word Count How would you do this with a huge document? Each time you see a word, if it's already on your list, add a tick mark beside it; otherwise add the new word with one tick...but hard to parallelize (updating the list) file01: Hello World Bye World file02: Hello Hadoop Goodbye Hadoop output/part: Hello 2, World 2, Bye 1, Hadoop 2, Goodbye 1
58 Word Count MapReduce way - all the hard work is done by the shuffle, i.e., automatically. Map: just emit a 1 for each word you see file01 (Hello World Bye World) maps to (Hello,1) (World,1) (Bye,1) (World,1) file02 (Hello Hadoop Goodbye Hadoop) maps to (Hello,1) (Hadoop,1) (Goodbye,1) (Hadoop,1)
59 Word Count Shuffle assigns keys (words) to each reducer, sends (k,v) pairs to the appropriate reducer Reducer just has to sum up the ones (Hello,1) (World,1) (Bye,1) (World,1) and (Hello,1) (Hadoop,1) (Goodbye,1) (Hadoop,1) become (Hello,[1,1]) (World,[1,1]) (Bye,1) on one reducer and (Hadoop,[1,1]) (Goodbye,1) on the other, giving Hello 2, World 2, Bye 1 and Hadoop 2, Goodbye 1
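The whole word-count pipeline fits in a few lines of in-memory Python - a sketch of the logic only, not of Hadoop's distributed execution:

```python
from collections import defaultdict

def wc_map(line):
    # Map: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    # Reduce: sum the ones gathered for this word.
    return (word, sum(counts))

lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
groups = defaultdict(list)
for line in lines:
    for word, one in wc_map(line):   # map phase
        groups[word].append(one)     # shuffle: group by key
result = dict(wc_reduce(w, c) for w, c in groups.items())
print(result)
# {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```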
60 Hadoop Job Workflow Building the program Class is expected to have particular methods Let s look at WordCount.java
61 main() The main() routine in a MapReduce computation creates a Job with a Configuration Set details of Input/Output, etc Then runs the job.
62 main() The heart of doing work in Hadoop originally was MapReduce Create a Map routine and a Reduce routine Wire those into the job. (Reduce is optional)
63 Python/Hadoop count map.py Before getting into the Java, let's look at a language probably more familiar to most of us. A mapper task just reads the stdin stream pointed at it and spits out tab-separated lines: (word, 1)
64 Python/Hadoop count reduce.py A simple reducer gets partitioned, sorted streams of (Hello,1) (Hello,1) (Goodbye,1)..., sums the counts, and prints (word, sum) at the end
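A sketch of what map.py and reduce.py amount to (the files in the VM may differ in detail). Note that the reducer relies on its input arriving sorted by key, so all counts for one word are contiguous:

```python
import io

def stream_map(stdin, stdout):
    """map.py-style mapper: one tab-separated (word, 1) line per word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def stream_reduce(stdin, stdout):
    """reduce.py-style reducer: input is sorted by key, so emit
    (word, total) each time the word changes."""
    current, total = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")

# Demo on the wordcount input; sorted() stands in for the shell's sort.
mapped = io.StringIO()
stream_map(io.StringIO("Hello World Bye World\nHello Hadoop Goodbye Hadoop\n"), mapped)
shuffled = "".join(sorted(mapped.getvalue().splitlines(keepends=True)))
reduced = io.StringIO()
stream_reduce(io.StringIO(shuffled), reduced)
print(reduced.getvalue())
```

Running this gives the same counts as piping the real scripts through sort in the shell.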
65 Python/Hadoop count Can use this approach in serial using standard shell tools: $ cd wordcount-streaming $ cat input/* Hello World Bye World Hello Hadoop Goodbye Hadoop $ cat input/* | ./map.py | sort | ./reduce.py Bye 1 Goodbye 1 Hadoop 2 Hello 2 World 2
66 Python/Hadoop count Can also fire this off in parallel with the Hadoop streaming interface, designed to work with other languages Hadoop decides how many maps and reduces to fire off $ hadoop jar $(STREAMING_DIR)hadoop-streaming-$(HADOOP_VERSION).jar \ -file ./map.py -mapper ./map.py \ -file ./reduce.py -reducer ./reduce.py \ -input $(INPUT_DIR) \ -output $(OUTPUT_DIR) Other, more programmatic interfaces exist (Pipes - C++; Dumbo - better Python interface, etc.) Streaming seems to work roughly as well or better
67 Number of mappers Mapping is tightly tied to the Hadoop file system Block-oriented Input splits - blocks of the underlying input files One mapper handles all the records in one split One mapper per input split Usually only one replica is mapped
68 Mapper and I/O The code for your mapper processes one record The map process executes it for every record in the split It gets passed in one (key, value) pair, and updates an Output Collector with a new (key, value) pair.
69 Mapper and I/O Mapper works one record at a time That means the input file format must have a way to indicate end of record. We re going to be using plain text file, because easy to understand, but there are others (often more appropriate for our examples)
70 Mapper and I/O If record crosses block boundary, must be sent across network Another good reason for large blocks - small fraction of data has to be sent
71 Mapper and I/O Mappers can work with compressed files But obviously works best if the compression algorithm is splittable - do you need to read the whole file to understand a chunk? bzip2 - slow but splittable Other possibilities
72 Mapper and I/O Mapper doesn t explicitly do any I/O Input is wired up at job configuration time Set Input format and input paths
73 Reducer and I/O Similarly, the reducer doesn't explicitly do any I/O Set the output format, and the output Key/Value types that will be written. Send output to an OutputCollector, and the output gets sent out. At the end, each reducer writes out its own file, part-r-n
74 Number of reducers Number of mappers is set by the input splits; can suggest, but not set, that number The number of reducers is by default chosen based on input size, amongst other things Our problems here - always so small that only one is used (only part-r-00000) (Files: part-r-00000, part-r-00001, part-r-00002.)
75 Number of reducers Can explicitly set the number of reduce tasks Try this - in the streaming example, do make run-2reducers or in WordCount.java's main, add the line job.setNumReduceTasks(2); Different reducers get different words (keys), hence different output files hdfs dfs -getmerge: gets all files in a directory and cats them part-00000: Goodbye 1, Hadoop 2 part-00001: Bye 1, Hello 2, World 2
76 MapReduce in Java In a strongly typed language, we have to pay a bit more attention to types than with just text streams Everything's a key-value pair, but they don't have to have the same type. In our examples, always using TextInputFormat, so (k1,v1) is always going to be Object (line # within the split) and Text, but others could change (k1,v1) (k2,v2) (k2,[v2]) (k3,v3)
77 MapReduce in Java Input types are determined by the input format Reduce outputs are specified by the Output Key/Value classes If not specified, the output of the mapper (= input of the reduce) is assumed to be the same as the output of the reduce (k2=k3, v2=v3)
78 Map in Java Map implements Mapper<k1,v1,k2,v2> Note special types - IntWritable, not Integer; Text, not String Hadoop comes with its own set of classes which wrap standard classes but implement Write methods for serialization (to network or disk).
79 Map in Java k2,v2 - Text, IntWritable, e.g., (word, 1) Actual work done is very minimal: get the string out of the Text value; tokenize it (split it by spaces); while there are more tokens, emit (word, one)
80 Reduce in Java k2,v2 - Text, IntWritable (check) k3,v3 also Text, IntWritable Incoming values for a given key are pre-concatenated into an iterable (couldn't do this for the streaming interface; don't know enough about the structure of keys/values.)
81 Reduce in Java Work is very simple. Operates on a single (k,[v]). Loop over values (have to.get() the Integer from the IntWritable) sum them up Make a new IntWritable with value from sum Collect (key,sum)
82-83 Combiners One more useful thing to know: you can have a combiner, run by each mapper on the output of that mapper before it's fed to the shuffle. Required signature: takes (k2,[v2]), emits (k2,v2). Dumb to send every (the,1) over the network; a combiner lets you collate the output of each mapper individually before feeding the reducers: Hello World Bye World maps to (Hello,1) (World,1) (Bye,1) (World,1), which combines to (Hello,1) (World,2) (Bye,1); Hello Hadoop Goodbye Hadoop maps to (Hello,1) (Hadoop,1) (Goodbye,1) (Hadoop,1), which combines to (Hello,1) (Hadoop,2) (Goodbye,1)
84 Combiners In this case, the combiner is just the reducer Not all problems lend themselves to the obvious use of a combiner, and in general it won't be identical to the reducer. If the reducer is commutative and associative, it can be used as the combiner.
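A sketch of the combiner idea in Python: pre-summing within one mapper's output before the shuffle, legal here precisely because addition is commutative and associative:

```python
from collections import Counter

def combine(pairs):
    """Pre-sum (word, count) pairs within a single mapper's output,
    shrinking what must cross the network to the reducers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

mapper1 = [("Hello", 1), ("World", 1), ("Bye", 1), ("World", 1)]
print(combine(mapper1))               # [('Bye', 1), ('Hello', 1), ('World', 2)]
print(len(mapper1), "->", len(combine(mapper1)))   # 4 -> 3 pairs shuffled
```

The reducer then sums the partial sums; the final answer is unchanged, only less data moves.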
85 First hands-on More to get you into the mode of writing Java We have the same example in wordcount-worksheet, but with the guts of map and reduce left out. Practice writing the code. Feel free to google for how to do things in Java, but don't just blast in the lines from the examples... Can use your favourite local editor and scp the file to the VM
86 First hands-on VM: To copy files back and forth, find the IP address of the VM (we enabled this in VirtualBox with the IO APIC/Adapter 2 stuff) hadoop-user/hadoop $ ifconfig | grep 192 inet addr: [...] Host $ scp WordCount.java hadoop-user@ : hadoop-user@ 's password: hadoop
87 Web Monitor Open browser on laptop go to (e.g.) Look at the previous jobs run Hadoop has to keep track of the running of individual map, reduce tasks and job status for fault-tolerance reasons Presents a nice web interface to the hadoop cluster
88 Beyond WordCount Let's start going a little bit beyond simple wordcount cd ~/inverted-index make run First, take a look at word count broken down by document 5 new papers each from 8 disciplines, taken from arxiv, pdftotext astro_01: abstract galaxy supernova star genomics_03: abstract gene expression dna output/part: astro_01 abstract 1, astro_01 galaxy 1, genomics_03 abstract 1, genomics_03 gene 1
89 WordCount by Doc Map is a little more sophisticated - strips out stop words ( the, and,...) Also only pay attention to words > 3 letters (strip out noise from pdf-to-text conversion - eqns, etc)
90 WordCount by Doc Mapper: while the value here is still one, the key is now filename + " " + word (why?)
91 WordCount by Doc Reducer is exactly the same
92 Inverted Index: astro_01 genomics_03 Want to use this as a starting point to build an inverted index For each word, in what documents does it occur? What is going to be the key out of the mapper? The value? What is going to be the reduction operation? abstract galaxy! supernova star output/part abstract gene! expression dna abstract astro_01 genomics_03! galaxy astro_01! gene genomics_03! supernova astro_01! expression genomics_03
93 Hands on: Implement the inverted index For now, don't worry about repeated items InvertedIndex.java Test with make runinverted astro_01: abstract galaxy supernova star genomics_03: abstract gene expression dna output/part: abstract astro_01 genomics_03; galaxy astro_01; gene genomics_03; supernova astro_01; expression genomics_03
94 Document Similarity Wordcount-by-document: bag of words approach Document is characterized by its wordcounts Can find similarity of two documents through the normalized dot product of their vector representations. astro_01: abstract galaxy supernova star, i.e. {abstract:1, galaxy:1, supernova:1, star:1} genomics_03: abstract gene expression dna, i.e. {abstract:1, gene:1, expression:1, dna:1} S_ag = (w_a . w_g) / (|w_a| |w_g|)
95 Document Similarity astro_01: abstract galaxy supernova star, i.e. {abstract:1, galaxy:1, supernova:1, star:1} genomics_03: abstract gene expression dna, i.e. {abstract:1, gene:1, expression:1, dna:1} cd ~/document-similarity make S_ag = (w_a . w_g) / (|w_a| |w_g|)
96 Document Similarity Wordcount-by-document: bag of words approach Document is characterized by its wordcounts Can find similarity of two documents through the normalized dot product of their vector representations. With dimensions (abstract, galaxy, expression, gene, supernova, dna, star): a = (1,1,0,0,1,0,1), g = (1,0,1,1,0,1,0), S_ag = (w_a . w_g) / (|w_a| |w_g|) = 1 / (2 x 2) = 1/4
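The normalized dot product is easy to check in Python, using the slide's two documents as word-count dicts:

```python
import math

def similarity(wc_a, wc_b):
    """Cosine similarity of two bag-of-words dicts:
    dot product over the shared words, divided by the vector norms."""
    dot = sum(wc_a[w] * wc_b[w] for w in wc_a if w in wc_b)
    norm_a = math.sqrt(sum(c * c for c in wc_a.values()))
    norm_b = math.sqrt(sum(c * c for c in wc_b.values()))
    return dot / (norm_a * norm_b)

astro_01    = {"abstract": 1, "galaxy": 1, "supernova": 1, "star": 1}
genomics_03 = {"abstract": 1, "gene": 1, "expression": 1, "dna": 1}
print(similarity(astro_01, genomics_03))  # 0.25
```

Only "abstract" is shared, so the dot product is 1, and each vector has norm 2, giving 1/4 as on the slide.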
97 Document Similarity So taking the bags-of-words as a given, how do we do the computation? What's the map phase, and what's the reduce phase? a = (1,1,0,0,1,0,1), g = (1,0,1,1,0,1,0), S_ag = (w_a . w_g) / (|w_a| |w_g|) = 1/4
98 Document Similarity Easiest to think about the reduce phase first. What is going to be the single computation done by a single reducer? And what information does it need to perform that computation?
99 Document Similarity: Reducer The single piece of computation that needs to be done at the reduce stage is the matrix element S_ag. The computation is straightforward: S_ag = (w_a . w_g) / (|w_a| |w_g|) What is the key? What data does it need? Means the key is...? Means the data it needs is...?
100 Document Similarity: Mapper It's the mapper's job to read in the data (astro_01 abstract 1, astro_01 galaxy 1, genomics_03 abstract 1, ...) and direct it to the correct reducer by setting the key So the mapper reads in (astro_01, abstract 1). Which reducer needs that information?
101 Document Similarity: Mapper Map phase: broadcast (astro_01, abstract 1) to all key pairs that will need astro_01 key: astro_01 x, x = astro_02, astro_03, ... genomics_01, ... value: astro_01 abstract 1 (We're just putting everything in text strings here, but we could have keys and values which were tuples...)
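A sketch of the broadcast idea (names and record layout are illustrative, not the course code): each record is re-emitted once per document pair that will need it, with a sorted pair as the key so both documents' data meet at one reducer.

```python
def similarity_map(record, all_docs):
    """For one (doc, word, count) record, emit it under every
    document-pair key that needs this doc's vector.  Sorting the pair
    makes (a,b) and (b,a) the same key, i.e. the same reducer."""
    doc, word, count = record
    out = []
    for other in all_docs:
        if other != doc:
            key = tuple(sorted((doc, other)))
            out.append((key, (doc, word, count)))
    return out

docs = ["astro_01", "astro_02", "genomics_03"]
for key, value in similarity_map(("astro_01", "abstract", 1), docs):
    print(key, value)
```

With N documents each record is emitted N-1 times, which is why this brute-force pairwise scheme only makes sense for modest document counts.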
102 Document Similarity: Reducer Reducer: Collect all (say) astro_01 genomics_03 keys Sort into elements for the two documents Calculate the result
103 Document Similarity: Mapper Map: loop over documents emit value for each document pair
104 Document Similarity: Reducer Reducer: Put values into the appropriate sparse vector (Parsing is just because we're using text for everything, which you really wouldn't do)
105 Document Similarity: Reducer Then the computation is easy.
106 Document Similarity: But where did we get.. But we need as input: The wordcounts by document The list of documents Where do they come from? astro_01 genomics_03 {abstract:1, galaxy:1, supernova:1, star:1} {abstract:1, gene:1, expression:1, dna:1}
107 Document Similarity: But where did we get.. Chains of Map-Reduce jobs! 1st pass - wordcounts, document list 2nd pass - similarity scores Can do this programmatically (within main), or just by running 2 hadoop jobs...
108 Document Similarity: But where did we get.. Chains of Map-Reduce Jobs! 1st pass - wordcounts 2nd pass - similarity scores
109 A note on similarity Ignore the normalization for a second Just the dot products What we've done is a sparse matrix multiplication entirely in Hadoop. S_ij = w_i . w_j, i.e., S = W W^T
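A quick check that the pairwise dot products really are the entries of W W^T, on toy data (made up here, not from the slides):

```python
# W is the documents-by-words count matrix: row i is document i's
# word-count vector.  S[i][j] = w_i . w_j is then (W W^T)[i][j].
W = [[1, 1, 0],   # doc 0
     [1, 0, 2]]   # doc 1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

S = [[dot(u, v) for v in W] for u in W]
print(S)  # [[2, 1], [1, 5]]
```

The diagonal entries are the squared norms, so dividing S[i][j] by sqrt(S[i][i] * S[j][j]) recovers the normalized similarities.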
110 Matrix multiplication cd ~/matmult! Reads in matrix name, rows, columns Hands on - fill in the map, reduce. input/part A ! A ! A ! A ! A ! A ! A ! A !...! B ! B ! B
111 One-D Diffusion cd ~/diffuse make clean make! Implements a 1d diffusion PDE
112 One-D Diffusion Inputs: Pre-broken up domain 1d gaussian constant diffusion - should maintain Gaussianity What is the map? What is the reduce? 0: ! 1: ! 2: ! 3: ! 4:
113 Questions?
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
Extreme computing lab exercises Session one
Extreme computing lab exercises Session one Michail Basios ([email protected]) Stratis Viglas ([email protected]) 1 Getting started First you need to access the machine where you will be doing all
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015
CS242 PROJECT Presented by Moloud Shahbazi Spring 2015 AGENDA Project Overview Data Collection Indexing Big Data Processing PROJECT- PART1 1.1 Data Collection: 5G < data size < 10G Deliverables: Document
Introduction to HDFS. Prasanth Kothuri, CERN
Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. HDFS
How To Use Hadoop
Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop
Introduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology [email protected] Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Distributed Filesystems
Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science [email protected] April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls
MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
27/04/2015 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 2015 Lecture 4: DFS & MapReduce I Aidan Hogan [email protected] Inside Google circa 1997/98 MASSIVE DATA PROCESSING (THE
Data Science Analytics & Research Centre
Data Science Analytics & Research Centre Data Science Analytics & Research Centre 1 Big Data Big Data Overview Characteristics Applications & Use Case HDFS Hadoop Distributed File System (HDFS) Overview
Extreme computing lab exercises Session one
Extreme computing lab exercises Session one Miles Osborne (original: Sasa Petrovic) October 23, 2012 1 Getting started First you need to access the machine where you will be doing all the work. Do this
MapReduce. Tushar B. Kute, http://tusharkute.com
MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Hadoop Shell Commands
Table of contents 1 FS Shell...3 1.1 cat... 3 1.2 chgrp... 3 1.3 chmod... 3 1.4 chown... 4 1.5 copyFromLocal...4 1.6 copyToLocal...4 1.7 cp... 4 1.8 du... 4 1.9 dus...5 1.10 expunge...5 1.11 get...5 1.12
From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian
From Relational to Hadoop Part 1: Introduction to Hadoop Gwen Shapira, Cloudera and Danil Zburivsky, Pythian Tutorial Logistics 2 Got VM? 3 Grab a USB USB contains: Cloudera QuickStart VM Slides Exercises
Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015
Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop
Hadoop Shell Commands
Table of contents 1 DFShell... 3 2 cat...3 3 chgrp...3 4 chmod...3 5 chown...4 6 copyFromLocal... 4 7 copyToLocal... 4 8 cp...4 9 du...4 10 dus... 5 11 expunge... 5 12 get... 5 13 getmerge... 5 14 ls...
HDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
HDFS File System Shell Guide
Table of contents 1 Overview...3 1.1 cat... 3 1.2 chgrp... 3 1.3 chmod... 3 1.4 chown... 4 1.5 copyFromLocal...4 1.6 copyToLocal...4 1.7 count... 4 1.8 cp... 4 1.9 du... 5 1.10 dus...5 1.11 expunge...5
HDFS Installation and Shell
2012 coreservlets.com and Dima May HDFS Installation and Shell Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized Hadoop training courses
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Cafarella in 2005. Cutting named the program after
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has
File System Shell Guide
Table of contents 1 Overview...3 1.1 cat... 3 1.2 chgrp... 3 1.3 chmod... 3 1.4 chown... 4 1.5 copyFromLocal...4 1.6 copyToLocal...4 1.7 count... 4 1.8 cp... 5 1.9 du... 5 1.10 dus...5 1.11 expunge...6
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Processing LARGE data sets
Processing LARGE data sets. Framework for reliable, scalable, distributed computation of large data sets
Hadoop at Yahoo! Owen O'Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O'Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
TP1: Getting Started with Hadoop
TP1: Getting Started with Hadoop Alexandru Costan MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development of web
Hadoop Installation MapReduce Examples Jake Karnes
Big Data Management Hadoop Installation MapReduce Examples Jake Karnes These slides are based on materials / slides from Cloudera.com Amazon.com Prof. P. Zadrozny's Slides Prerequistes You must have an
Tutorial- Counting Words in File(s) using MapReduce
Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
HADOOP MOCK TEST HADOOP MOCK TEST II
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
cloud-kepler Documentation
cloud-kepler Documentation Release 1.2 Scott Fleming, Andrea Zonca, Jack Flowers, Peter McCullough, El July 31, 2014 Contents 1 System configuration 3 1.1 Python and Virtualenv setup.......................................
Cloud Computing. Chapter 8. 8.1 Hadoop
Chapter 8 Cloud Computing In cloud computing, the idea is that a large corporation that has many computers could sell time on them, for example to make profitable use of excess capacity. The typical customer
To reduce or not to reduce, that is the question
To reduce or not to reduce, that is the question 1 Running jobs on the Hadoop cluster For part 1 of assignment 8, you should have gotten the word counting example from class compiling. To start with, let
Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps
Distributed Recommenders Fall 2010 Distributed Recommenders Distributed Approaches are needed when: Dataset does not fit into memory Need for processing exceeds what can be provided with a sequential algorithm
USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2
USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 (Using HDFS on Discovery Cluster for Discovery Cluster Users email [email protected] if you have questions or need more clarifications. Nilay
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
Hadoop (pseudo-distributed) installation and configuration
Hadoop (pseudo-distributed) installation and configuration 1. Operating systems. Linux-based systems are preferred, e.g., Ubuntu or Mac OS X. 2. Install Java. For Linux, you should download JDK 8 under
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
RDMA for Apache Hadoop 0.9.9 User Guide
0.9.9 User Guide HIGH-PERFORMANCE BIG DATA TEAM http://hibd.cse.ohio-state.edu NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY Copyright (c)
Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer [email protected] Assistants: Henri Terho and Antti
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Introduction to Hadoop
Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh [email protected] October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples
Getting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
MapReduce, Hadoop and Amazon AWS
MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables
The Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.
Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software
Research Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze
Research Laboratory Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze 1. Java Web Crawler Description Java Code 2. MapReduce Overview Example of mapreduce
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval's two main stages: Indexing process
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don't move data to workers; move workers to the data! Store data on the local disks of nodes
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Big Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
and HDFS for Big Data Applications Serge Blazhievsky Nice Systems
Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted
How To Write A Mapreduce Program On An Ipad Or Ipad (For Free)
Course NDBI040: Big Data Management and NoSQL Databases Practice 01: MapReduce Martin Svoboda Faculty of Mathematics and Physics, Charles University in Prague MapReduce: Overview MapReduce Programming
Introduction to Cloud Computing
Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services
map/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
NSC R&D Project Source Code Installation and Operation Manual. Version 0.1
2013 National Science Council Cloud Computing and Information Security Technology R&D Project: Source Code Installation and Operation Manual, Version 0.1. Overall project: Research and Innovative Applications of Dynamic Group Services for Mobile Cloud Environments. Subproject 1: Mobile Cloud Group Service Architecture and Dynamic Group Management (NSC 102-2218-E-259-003)
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Big Data Primer. 1 Why Big Data? Alex Sverdlov [email protected]
Big Data Primer Alex Sverdlov [email protected] 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.
Basic Hadoop Programming Skills
Basic Hadoop Programming Skills Basic commands of Ubuntu Open file explorer Basic commands of Ubuntu Open terminal Basic commands of Ubuntu Open new tabs in terminal Typically, one tab for compiling source
How to Choose Between Hadoop, NoSQL and RDBMS
How to Choose Between Hadoop, NoSQL and RDBMS Keywords: Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance Introduction A
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Intro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Rajaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
Map Reduce / Hadoop / HDFS
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
Introduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Storage Architectures for Big Data in the Cloud
Storage Architectures for Big Data in the Cloud Sam Fineberg HP Storage CT Office/ May 2013 Overview Introduction What is big data? Big Data I/O Hadoop/HDFS SAN Distributed FS Cloud Summary Research Areas
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
Xiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box
Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box By Kavya Mugadur W1014808 1 Table of contents 1.What is CDH? 2. Hadoop Basics 3. Ways to install CDH 4. Installation and
HADOOP MOCK TEST HADOOP MOCK TEST I
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
MapReduce and the New Software Stack
20 Chapter 2 MapReduce and the New Software Stack Modern data-mining applications, often called big-data analysis, require us to manage immense amounts of data quickly. In many of these applications, the
