Big Data Analytics Opportunities and Challenges

1 Big Data Analytics Opportunities and Challenges Anup Kumar, Professor and Director of MINDS Lab Computer Engineering and Computer Science Department University of Louisville

2 Road Map: Introduction; Hadoop Ecosystem; Data Analytics Approaches; Processing of Structured Data; Processing of Unstructured Data; Example Healthcare Applications; Concluding Observations

3 Big Data Applications: Advertising and marketing (customer shopping patterns, response to promotional campaigns); Manufacturing (maintenance of machine health); Social media (browsing and sentiment analysis, impact on buying patterns); Government data (efficient process management)

4 Big Data Applications (cont'd): Stock market (stock performance prediction); Healthcare management (patient health monitoring, impact of preventive care); Financial institutions (fraud detection and mitigation); Weather prediction (impact analysis and better disaster management)

5 Big Data Analytics: Benefits Saving money Fraud detection in medical industry Risk Management Lower cost and better outcomes Real time decision making Sensor data analysis Influence of social sentiment on patient health Getting to know your customer better Targeted advertisement Better recommendations 5

6 Components of Big Data: Volume: data generated by 2020 will be measured in zettabytes (10^21 bytes), and soon after that in yottabytes (10^24) and brontobytes (10^27). Variety: structured (transactional data) and unstructured (image, video, and text data). Velocity: rate of data generation; increase in the ability to process data. Variability/Veracity: trustworthiness of data; quality of data.

7 Key Components of Big Data Analytics Understanding Advanced Analytics Association analysis Clustering, Classification, Regression Recommendation framework Time series analysis Handling volume and velocity of data Map/Reduce framework Cloud computing framework Custom frameworks Dealing with Variety of Data Structured Data Analytics (Advanced Analytics) Unstructured Data Analytics (Text Processing) 7

8 Data Analysis Options: Descriptive analytics specifies the data characteristics: How to describe the system? What happened in the system and when? What are the parameters in the system? What is the impact of a parameter on the system? Is there any correlation between the parameters? Predictive analytics uses data mining and predictive modeling. It can answer the following questions: What are the future trends? What is the decision based on past history? It can also perform what-if analysis.

9 Handling Large Volume Hadoop Based Implementation of Analytics In Database Analytics 9

10 Hadoop Based Analytics: Allows analytics to be carried out on any type of data store. Provides a standard framework for computation, either in a cloud environment or on an on-premises network. Advantages: vendor independent; allows any type of analytics to be carried out. Disadvantages: complex implementation.

11 In Database Analytics Allows analytic computation to be carried out in the database Uses: SQL SQL extensions Advantages Higher analytics efficiency, easy usability, better database manageability Analytic centralization may allow easy security, data, and version management Disadvantages Vendor dependent Cost 11

12 Handling of Variety of Data Structured Data Analytics Association analysis Clustering Classification Regression Recommendation framework Time series analysis Unstructured Data Analytics Association analysis Clustering Classification Text processing 12

13 Road Map: Introduction; Hadoop Ecosystem; Data Analytics Approaches; Processing of Structured Data; Processing of Unstructured Data; Example Healthcare Applications; Concluding Observations

14 Introduction to Hadoop Open source from Apache Free download from Runs on commodity hardware Lowers procurement, licensing, and operational costs Scalable Incrementally add hardware nodes as needed Can accommodate growth in data and processing requirements Fault tolerant Automatic data replication and task reallocation 14

15 Core Components of Hadoop Big Data Architecture Integration Data Processing (MapReduce) Security Operations Data Storage (HDFS) 15

16 Hadoop Distributed File System (HDFS) Designed for large scale data processing Allows storage of large data files Runs on top of native file system on each node Data is stored in units known as blocks 64MB (default) although normally larger Large files are stored in many blocks over many nodes Blocks are replicated to multiple nodes 16

17 HDFS Blocks and Replication [Figure: a file is split into Block 1, Block 2, and Block 3, and copies of the blocks are distributed across Node 1, Node 2, and Node 3]
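
The block-and-replica idea on slides 16 and 17 can be sketched in a few lines of Python. This is a minimal illustration, not how HDFS itself decides placement: the round-robin policy, the function names, the replication factor of 2, and the 150 MB file size are all assumptions made for the example. Only the 64 MB default block size comes from the slide.

```python
# Illustrative sketch only: real HDFS uses its own placement policy, not round-robin.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as stated on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block needed to store a file."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (hypothetical round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(150 * 1024 * 1024)  # a 150 MB file needs three blocks
layout = place_replicas(len(blocks), ["node1", "node2", "node3"], replication=2)
for block, holders in layout.items():
    print(f"block {block + 1}: {holders}")
```

Because every block lives on more than one node, losing a single node still leaves at least one copy of each block available.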

18 Hadoop Node Types in a Cluster Nodes in Hadoop have different roles to play HDFS nodes NameNodes DataNodes MapReduce nodes JobTracker TaskTracker 18

19 Hadoop Node Interaction [Figure: when the Hadoop cluster starts, the MapReduce layer consists of a JobTracker and TaskTrackers, and the HDFS layer consists of a NameNode and DataNodes]

20 Role of JobTracker: Receives jobs from clients and manages them. Acts as the master in a master-slave architecture with TaskTrackers. Determines the job execution plan. Assigns nodes to different processing tasks. Monitors tasks as they execute. Detects failures and restarts/re-runs tasks.

21 Role of TaskTracker: Manages the individual tasks that the JobTracker issues, which can be map or reduce tasks. A single TaskTracker can run many map or reduce tasks in parallel. Keeps the JobTracker updated with task progress using a heartbeat signal. Failure to send a heartbeat results in the JobTracker resubmitting the task to another TaskTracker node.
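
The heartbeat-and-resubmit behaviour described on slides 20 and 21 can be sketched as follows. This is a simplified model under stated assumptions: the timeout value, the tracker and task names, and the round-robin reassignment are hypothetical and are not taken from Hadoop's source or configuration.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; hypothetical value chosen for illustration


def find_lost_trackers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the trackers whose most recent heartbeat is older than the timeout."""
    return {name for name, seen in last_heartbeat.items() if now - seen > timeout}


def reassign_tasks(assignments, lost, healthy):
    """Move every task held by a lost tracker to a healthy one, round-robin."""
    if not healthy:
        raise RuntimeError("no healthy TaskTracker available")
    updated = dict(assignments)
    moved = 0
    for task in sorted(assignments):
        if assignments[task] in lost:
            updated[task] = healthy[moved % len(healthy)]
            moved += 1
    return updated


# Timestamp of the last heartbeat received from each (hypothetical) TaskTracker.
last_heartbeat = {"tt1": 100.0, "tt2": 91.0, "tt3": 85.0}
assignments = {"map-0": "tt1", "map-1": "tt3", "reduce-0": "tt3"}

lost = find_lost_trackers(last_heartbeat, now=100.0)
healthy = sorted(set(last_heartbeat) - lost)
new_assignments = reassign_tasks(assignments, lost, healthy)
print(lost, new_assignments)
```

Here only tt3 has been silent for longer than the timeout, so its two tasks are handed to the remaining trackers while map-0 stays where it is.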

22 Role of NameNode: HDFS is also a master-slave architecture. The NameNode is the HDFS master, directing the DataNode slave nodes; DataNodes perform the low-level input/output and are where data is stored. The NameNode keeps track of files: how they are broken into blocks and which nodes the blocks are stored on. There is only one NameNode per Hadoop cluster. It is a single point of failure, so it should not run on commodity hardware.

23 Role of DataNode Performs low level work reading and writing HDFS blocks to local file system Communicates with: NameNode Provides information about which blocks they are storing DataNodes to replicate data blocks Many per Hadoop cluster Normally run on inexpensive commodity hardware 23

24 Hadoop Execution Modes: Hadoop can be configured to run in one of three modes: local (standalone) mode, pseudo-distributed mode, or fully distributed mode. Key configuration files define which mode to run in: core-site.xml, hdfs-site.xml, mapred-site.xml.

25 Local (Standalone) Mode This is the default mode for Hadoop No assumptions are made about hardware Does not use HDFS Used for developing and debugging application logic of Hadoop programs 25

26 Pseudo Distributed Mode Hadoop runs in clustered mode Cluster size is one Useful for developing and debugging Hadoop applications Enables examination of: Memory and HDFS usage Configuration files used to set port numbers used for NameNode and JobTracker 26

27 Fully Distributed Mode This is the mode for running Hadoop applications in production Configuration requires setup of Master nodes Host of NameNode and JobTracker components Slave nodes Machines running DataNode and TaskTracker components NameNode hosts a report on HDFS status on port number Enables checking of each DataNode in cluster JobTracker provides status report on MapReduce jobs on port Reports about ongoing jobs 27

28 Introduction to Pig: Hadoop is powerful for processing large datasets but requires significant programming skills to develop applications. Pig is a Hadoop extension that simplifies Hadoop programming; it is easy to use and highly scalable. There are two major components of Pig: a data processing language called Pig Latin, and a compiler that translates Pig Latin into Hadoop programs.

29 Introduction to Hive A data warehousing package built on top of Hadoop Originally developed at Facebook Target audience for Hive is data analysts comfortable with SQL Provides access to ad hoc queries and data analysis On large scale datasets Focuses on structured data Enables optimizations to be performed Provides SQL like language HiveQL Based on tables, rows, columns No need to know about Hadoop programming 29

30 Working With Hive: Hive requires a metastore service, which provides schemas to structure the data, helps with efficient querying and processing, and is implemented using tables in a relational database. By default, Hive uses Java's built-in Derby database. Hive can be accessed from the command-line interface or from a web interface known as the Hive Web Interface.

31 Introduction to Apache Sentry Features Secure authorization Ability to control and enforce access to data Grant privileges on data for authenticated users. Fine grained access control Control at the database, table, and view level As an example of fine grained control One group can see all data Another group can see only non sensitive data filtered by a view Sentry provides unified security platform: Existing Hadoop Kerberos security for authentication Sentry access policy applied to all applications e.g. Pig, Hive etc 31

32 Introduction to Oozie Oozie is a Hadoop workflow engine Manages data processing activities Available from Simplifies the management of Recurring and data driven jobs Re execution of failed jobs Comprised of A workflow engine Can execute different types of jobs A coordinator engine 32

33 Oozie Workflows: Oozie workflows are directed acyclic graphs (DAGs). Graph nodes are either action nodes, which perform a workflow task, or control flow nodes, which determine the logic between tasks. There is always exactly one start node and at least one terminal node. [Figure: a start node leading to action nodes, which lead to terminal nodes]
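
The structural rules stated on this slide (a directed acyclic graph, exactly one start node, at least one terminal node) can be checked with a short sketch. The code below is not Oozie's own validation logic. It is a generic topological sort (Kahn's algorithm), and the node names in the example workflow are hypothetical.

```python
from collections import deque


def execution_order(edges):
    """Validate a workflow graph and return its nodes in a runnable order.

    `edges` maps each node name to the list of nodes it transitions to.
    """
    nodes = set(edges) | {n for targets in edges.values() for n in targets}
    indegree = {n: 0 for n in nodes}
    for targets in edges.values():
        for n in targets:
            indegree[n] += 1

    starts = sorted(n for n in nodes if indegree[n] == 0)
    if len(starts) != 1:
        raise ValueError("workflow must have exactly one start node")
    if not any(not edges.get(n) for n in nodes):
        raise ValueError("workflow needs at least one terminal node")

    # Kahn's algorithm: repeatedly release nodes whose predecessors are all done.
    queue, order = deque(starts), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in edges.get(node, []):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("workflow contains a cycle")
    return order


# Hypothetical workflow: one start node, two action nodes, two terminal nodes.
workflow = {"start": ["extract"], "extract": ["report", "fail"], "report": ["end"]}
order = execution_order(workflow)
print(order)
```

A graph in which an action transitions back to an earlier action is rejected, because some nodes can then never be released for execution.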

34 Pig, Hive, and Impala in Big Data Architecture [Figure: architecture diagram with Integration, Security, and Operations areas, showing Pig, Hive, Impala, Mahout, MapReduce, Sqoop, HDFS, Sentry, Flume, and Oozie]

35 Road Map: Introduction; Hadoop Ecosystem; Data Analytics Approaches; Processing of Structured Data; Processing of Unstructured Data; Example Healthcare Applications; Concluding Observations

36 Hadoop Based Analytics: Allows analytics to be carried out on any type of data store. Provides a standard framework for computation, either in a cloud environment or on an on-premises network. Advantages: vendor independent; allows any type of analytics to be carried out; provides a flexible and adaptable architecture for implementation. Disadvantages: complex implementation.

37 MapReduce Features Allows use of large computational resources Inter cluster communication is managed by MapReduce MapReduce architecture supports Data and task distribution Fault monitoring Task and data replication Simple programming model Limitations of MapReduce Cannot solve all the problems 37

38 MapReduce: A Pragmatic Approach It can solve many Big Data problems Data filtering Statistics and aggregation Graph analytics Decision Tree and classification Clustering and recommendations Practical Distributed API Easier to understand and use Higher level APIs exist To reduce the complexity of programming Ability to schedule multi stage jobs 38

39 Phases in MapReduce Processing: Data processing with MapReduce goes through three phases. Map phase: processes the data and generates <key, value> pairs. Shuffle phase: moves the <key, value> pairs to the appropriate processing node for reduction. Reduce phase: processes the <key, value> pairs to generate the final output. Hadoop can use multiple machines for each phase.

40 Association Rule Example
In order to compute support, the number of times each product and each combination of products occurs in the data has to be calculated.
The original transaction file format:
1, milk (M), bread (B)
2, bread (B), butter (T)
3, milk (M), bread (B), butter (T)
4, milk (M)
The input data file would be:
M B MB
B T BT
M B T MB MT BT MBT
M

41 Association Rule MapReduce
[Figure: the input passes through four stages]
Input and splitting (one split per transaction): M B MB | B T BT | M B T MB MT BT MBT | M
Mapping: (M,1) (B,1) (MB,1) | (B,1) (T,1) (BT,1) | (M,1) (B,1) (T,1) (MB,1) (MT,1) (BT,1) (MBT,1) | (M,1)
Shuffling: M: 1,1,1; B: 1,1,1; T: 1,1; MB: 1,1; BT: 1,1; MT: 1; MBT: 1
Reducing: M,3 B,3 T,2 MB,2 BT,2 MT,1 MBT,1
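
The three phases from slide 39 can be traced on the four transactions of slide 40 with a small single-process sketch. This only imitates the data flow in plain Python and is not a Hadoop job; the function names are chosen for this illustration. Note that the pair MT occurs only in transaction 3, so its count is 1.

```python
from collections import defaultdict
from itertools import combinations

# Transactions from slide 40, items always listed in the order M, B, T.
TRANSACTIONS = [["M", "B"], ["B", "T"], ["M", "B", "T"], ["M"]]


def map_phase(items):
    """Emit (itemset, 1) for every non-empty combination of items in one transaction."""
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield "".join(combo), 1


def shuffle_phase(pairs):
    """Group all values that share the same key, as the shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values to obtain the occurrence count of each itemset."""
    return {key: sum(values) for key, values in groups.items()}


pairs = [pair for transaction in TRANSACTIONS for pair in map_phase(transaction)]
support_counts = reduce_phase(shuffle_phase(pairs))
print(support_counts)
```

Dividing each count by the number of transactions (4) gives the support of the corresponding itemset.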

42 Road Map: Introduction; Hadoop Ecosystem; Data Analytics Approaches; Processing of Structured Data; Processing of Unstructured Data; Example Healthcare Applications; Concluding Observations

43 Steps in Structured Data Analytics Step 1: Business domain analysis Step 2: Data exploration and investigation Step 3: Data preparation and cleaning Step 4: Model design and development Step 5: Model verification and testing Step 6: Analyze the output 43

44 Descriptive and Predictive Analytics Descriptive analytics provides knowledge discovered from the data set Examples include Clustering Correlation analysis Pattern discovery Association rules Predictive analytics builds a model to predict future behavior Examples include Regression models Time series analysis 44

45 Introduction to Mahout Open source machine learning library Leveraging the power of MapReduce Hadoop Currently available algorithms are Various clustering algorithms Singular value decomposition Parallel Frequent Pattern mining Naive Bayes classifier Random forest decision tree based classifier Collaborative filtering User and item based recommenders Others 45

46 Mahout Architecture Application Application Application Application Application Mahout: Machine Learning Library Clustering, Classification, Recommendation, Filtering, etc. Hadoop/MapReduce Storage Storage Storage Storage Compute Compute Compute Compute Storage Storage Storage Storage Compute Compute Compute Compute 46

47 Mahout Machine Learning Library Interaction Command line interaction Allows execution of machine learning algorithms using Mahout commands Pros: Easy use of algorithms Cons: Have to specify large number of options; fixed number of options are available Application Program Interface based interaction Allows execution of machine learning algorithms in the source code Pros: More flexibility is available in using the algorithms Cons: Have to know programming 47

48 Clustering: Introduction Clustering is the grouping of available data based on similarities on some pre specified criteria Uses unsupervised learning techniques Uses statistical methods A way of looking for patterns or structure in the data that are of interest Allows partitioning of data in different groups Identifies a set of patterns and structures Can be used as a standalone technique to gain insight into data distribution 48

49 Clustering in Mahout Two core components of clustering are An algorithm to group items in clusters Kmeans clustering Hierarchical clustering Many others Concept of similarity and dissimilarity This specifies how close various data locations are 49

50 Business Applications for Clustering Market research Segmentation Targeted advertisements Customer categorization Government management Crime hot spot analysis Census data for economic and social analysis Inventory management Location based product stocking 50

51 Basic Steps in Clustering Select features from the data sets Choose clustering approach Select parameters for clustering 51

52 Executing Mahout Clustering Algorithm: /bin/mahout kmeans -i input-file -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
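
To make the roles of the main options concrete, here is a minimal pure-Python k-means sketch using cosine distance. It is not Mahout's implementation: the parameters `k`, `max_iter`, and `delta` merely play the roles of `-k`, `-x`, and `-cd`, and the data points, the seed, and the tighter convergence threshold are assumptions chosen for a tiny example.

```python
import math
import random


def cosine_distance(a, b):
    """1 minus the cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points, k, max_iter=5, delta=0.001, seed=0):
    """Return (labels, centroids) after at most `max_iter` assignment/update rounds."""
    centroids = random.Random(seed).sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            updated.append(mean(members) if members else centroids[c])
        shift = max(cosine_distance(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift < delta:  # stop once the centroids have effectively stopped moving
            break
    return labels, centroids


points = [[1.0, 0.1], [1.0, 0.2], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The first three points lie close to the x-axis and the last three close to the y-axis, so a direction-based measure such as cosine distance separates them into two clusters.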

53 Association Rules: Introduction: Association rule (AR) mining involves finding one of the following in a set of data: frequent patterns, associations, correlations, or causal structures. AR is used to find interesting relationships between variables in transaction-oriented data sets. Association rules are of the form {X} -> {Y} with [support, confidence, lift].
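
The three measures attached to a rule can be computed directly from their standard definitions: support is the fraction of transactions containing both X and Y, confidence is support(X and Y) divided by support(X), and lift is confidence divided by support(Y). The sketch below applies them to the four milk/bread/butter transactions from slide 40; the function names are chosen for this illustration.

```python
# The four transactions from slide 40.
TRANSACTIONS = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(x, y, transactions):
    """Return (support, confidence, lift) for the rule {x} -> {y}."""
    s_xy = support(x | y, transactions)
    confidence = s_xy / support(x, transactions)
    lift = confidence / support(y, transactions)
    return s_xy, confidence, lift


sup, conf, lift = rule_metrics({"milk"}, {"bread"}, TRANSACTIONS)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

For {milk} -> {bread} the lift comes out below 1, which indicates that in this tiny data set buying milk makes bread slightly less likely than its baseline frequency.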

54 Business Applications of Association Rules Cross selling and up selling Market basket analysis Catalog design Keeping the related items together Planning and monitoring Network design based on web logs Planning web design using web usage log mining In general, any decision based on knowledge of previous transactions by sets of users 54

55 Introduction to Classification Classification separates data into various categories Comprises assigning a class label to a set of unclassified data sets The classification of an unknown observation to a class is based on the model Thus, classification is categorized as supervised learning Classification is a two step process Model training and creation Based on training data set where the observations have already been classified Model testing Apply the model to test data to classify them in appropriate categories 55

56 Classification Terminology: Training examples: complete set of inputs for model development and testing. Training data: subset of training examples used for building the model. Model: set of rules generated after training the classifier. Test data: remaining training examples, not used as training data, used to study how effective the model is. Predictor variables: set of variables used to decide the output for the target variable. Target variable: the variable the classifier is estimating, based on the predictor variables.

57 The Structure of Input for Mahout Classifier: The general structure of training input is TV, PV1, PV2, PV3, ..., PVn. The general structure of test input is ??, PV1, PV2, PV3, ..., PVn, where ?? refers to the target variable that needs to be evaluated based on the predictor variables (PV = predictor variable, TV = target variable). [Figure: training records (TV, PV1, ..., PVn) feed classifier model generation; the generated model then assigns a TV to each test record (??, PV1, ..., PVn)]

58 Business Applications of Classification Loan application processing Categorization of customer types Identification of product classes Classification of a privacy policy as safe or unsafe Medical diagnosis Targeted marketing 58

59 Classification Modeling Techniques Decision tree Random forest Many others 59

60 Decision Tree Modeling This provides tree structure with Internal nodes representing a test on a certain attribute The result of the test is represented by the branch at internal node A class is represented by a leaf node Starts with the root node and splits into multiple branches May further split into new nodes Ends in leaf nodes that contain the decisions Model can be in the form of a set of rules or a decision tree Rules specify the decision making process of the model Uses recursive partitioning approach (divide and conquer) The process stops when partitioning does not improve the outcome 60

61 Algorithms Supported by Mahout for Classification: Stochastic Gradient Descent (SGD): can run in sequential, online, and incremental mode; suitable for data sets of less than tens of millions. Naive Bayes: can run in parallel mode; suitable for data sets of millions to hundreds of millions. Random forest: can run in parallel mode; suitable for data sets of less than tens of millions.

62 Command for Training the SGD Classifier
Training: bin/mahout trainlogistic [options]
Major options and their explanations:
--input <file>: input file
--output <file>: output file
--target <variable>: target variable
--categories <n>: number of possible values of the TV
--predictors <pv1> ... <pvn>: predictor variables
--types <tpv1> ... <tpvn>: types of the PVs
--passes: number of times the input should be used
--rate: initial learning rate
--quiet: generate less detail during execution

63 Command for Evaluating the SGD Classifier
Evaluating: bin/mahout runlogistic [options]
Options and their explanations:
--auc: show AUC score
--scores: show TV and scores for each input
--confusion: display confusion matrix
--input <file>: input file
--model <model>: read model from the specified file
--quiet: generate less detail during execution
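
What the training and evaluation commands do conceptually can be sketched with a small logistic regression trained by stochastic gradient descent. This does not reproduce Mahout's implementation. The `passes` and `rate` parameters only mirror the meaning of `--passes` and `--rate`, the confusion matrix mirrors `--confusion`, and the data, the seed, and the 0.5 decision threshold are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, predictors):
    """Estimated probability that the target variable equals 1."""
    x = [1.0] + list(predictors)  # leading 1.0 pairs with the intercept weight
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(rows, passes=200, rate=0.5, seed=0):
    """rows: list of (target, predictors) with target in {0, 1}."""
    rng = random.Random(seed)
    weights = [0.0] * (len(rows[0][1]) + 1)
    for _ in range(passes):  # each pass uses the whole input once
        order = rows[:]
        rng.shuffle(order)
        for target, predictors in order:
            error = target - predict(weights, predictors)
            x = [1.0] + list(predictors)
            # One SGD step on the log-loss for this single example.
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
    return weights


def confusion_matrix(weights, rows, threshold=0.5):
    """matrix[actual][predicted] counts for a two-class problem."""
    matrix = [[0, 0], [0, 0]]
    for target, predictors in rows:
        matrix[target][int(predict(weights, predictors) >= threshold)] += 1
    return matrix


# Records in the slide's layout: TV first, then PV1 and PV2.
training = [(0, [0.1, 0.2]), (0, [0.3, 0.1]), (0, [0.2, 0.4]),
            (1, [0.8, 0.9]), (1, [0.9, 0.7]), (1, [0.7, 0.8])]
test = [(0, [0.2, 0.2]), (1, [0.8, 0.8])]

model = train_logistic(training)
matrix = confusion_matrix(model, test)
print(matrix)
```

The diagonal of the confusion matrix counts correctly classified test records, and the off-diagonal cells count the two kinds of error.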

64 Applications of the Recommendation Framework: Medicine (disease recommendation, drug recommendation, case-based search); Marketing (cell phone companies identifying users who may switch, recommending books at Amazon, recommending products on web sites); Education (universities guiding students on what courses to take, conference organizers assigning papers to reviewers)

65 Road Map: Introduction; Hadoop Ecosystem; Data Analytics Approaches; Processing of Structured Data; Processing of Unstructured Data; Example Healthcare Applications; Concluding Observations

66 Scope of Text Mining Data Mining Statistics Natural Language Processing Text Mining Web Mining Information Retrieval Computational Linguistics 66

67 Challenges in Text Mining: Each document may contain a large amount of text. High dimensionality. Ambiguity of content due to language features. Semantic issues: words and phrases may not be semantically independent. Complexity of natural language processing.

68 General Steps in Text Mining: Feature selection: determine the n-grams necessary. Text preprocessing: removal of numbers; removal of punctuation marks; text case conversion as needed; stop word removal (can use a pre-specified list or a generic list); stemming (identify a word by its root).

69 Stop Words Most common words in English that do not contribute to classification, clustering or association are: Articles a, an, the Conjunctions and, or Prepositions as, by, of Pronouns you, she, he, it Text documents are high dimension data Removal of stop words acts as technique for dimensionality reduction Other non context related words can also be removed 69

70 Stemming: The process of reducing inflected (or sometimes derived) words to their stem, base, or root form, typically achieved by removing suffixes such as -ing, -s, -er, and -ed. For example, mining, miner, mines, and mined all stem to mine. Common algorithms are Porter's algorithm, the KSTEM algorithm, and Snowball stemming.
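
The preprocessing steps from slides 68 to 70 can be chained as in the sketch below. The stop-word set contains only the examples listed on slide 69, and the stemmer is a deliberately naive suffix stripper written for this illustration, not Porter's algorithm: it maps mining, miner, mines, and mined to the common stem "min" rather than "mine".

```python
import re
import string

# Stop words taken from the examples on slide 69; a real list would be longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "as", "by", "of", "you", "she", "he", "it"}
SUFFIXES = ("ing", "ed", "er", "es", "s")  # checked in this order


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop numbers and punctuation, remove stop words, then stem."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return [naive_stem(token) for token in tokens]


tokens = preprocess("The miner mined 3 mines, and she is mining data.")
print(tokens)
```

Four different surface forms collapse into one term, which is how stemming, together with stop-word removal, reduces the dimensionality of text data.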

71 Steps in Association Mining Loading the data Text preprocessing (as needed) Cleaning Punctuation removal Number removal Stop word removal Stemming Building term document matrix Finding frequent term association 71

72 Sample Twitter Data Set

73 Building a Term Document Matrix
OUTPUT:
A term-document matrix (16 terms, 20 documents)
Non-/sparse entries: 11/309
Sparsity: 97%
Maximal term length: 21
Weighting: term frequency (tf)
Terms (rows, with Docs as columns): data, databases, dataframe, datasets, datasetshttptconxrbuh, datastream, datatable, details, detection, development
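
The figures in this output can be related to a small sketch that builds a term-document matrix with term-frequency weighting. In the slide's output, 11 non-zero and 309 zero entries make up the 16 x 20 = 320 cells, and 309/320 rounds to the reported 97% sparsity. The three example documents and the helper names below are assumptions made for illustration and are unrelated to the tweet data set.

```python
from collections import Counter


def term_document_matrix(docs):
    """Map each term to its list of per-document frequencies (tf weighting)."""
    counts = [Counter(doc) for doc in docs]
    terms = sorted({term for doc in docs for term in doc})
    return {term: [count[term] for count in counts] for term in terms}


def sparsity(matrix):
    """Fraction of matrix cells that are zero."""
    cells = [value for row in matrix.values() for value in row]
    return sum(value == 0 for value in cells) / len(cells)


def frequent_terms(matrix, min_count):
    """Terms whose total frequency across all documents reaches `min_count`."""
    return [term for term, row in matrix.items() if sum(row) >= min_count]


# Hypothetical, already preprocessed documents (lists of tokens).
docs = [["big", "data", "analytics"], ["data", "mining"], ["text", "mining", "data"]]

matrix = term_document_matrix(docs)
for term, row in matrix.items():
    print(f"{term:10s} {row}")
print(f"sparsity: {sparsity(matrix):.0%}")
print(frequent_terms(matrix, min_count=2))
```

Each row is a term and each column a document, which matches the orientation of the slide's output; frequent terms are then simply the rows with large totals.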

74 Word Frequency in Tweets 74

75 Displaying the Aggregate Output 75

76 Road Map: Introduction; Hadoop Ecosystem; Data Analytics Approaches; Processing of Structured Data; Processing of Unstructured Data; Example Healthcare Applications; Concluding Observations

77 Healthcare Expenses Hersh, W., Jacko, J. A., Greenes, R., Tan, J., Janies, D., Embi, P. J., & Payne, P. R. (2011). Health-care hit or miss? Nature, 470(7334),

78 Healthcare Data Types EHR Public Health Public Health Social Behavioral 78

79 Healthcare Data: Billing data: International Classification of Diseases (ICD); Current Procedural Terminology (CPT). Lab results: Logical Observation Identifiers Names and Codes (LOINC). Medication: National Drug Code (NDC) by the Food and Drug Administration (FDA).

80 Healthcare Data (cont'd): Clinical notes (unstructured text data); image data (unstructured data); social interaction data (unstructured data)

81 Analytic Platform Structured EHR Clustering Patients / Context Feature Selection Classification Unstructured EHR Recommendation 81

82 Impact of Data Driven Features Jimeng Sun, Jianying Hu, Dijun Luo, Marianthi Markatou, Fei Wang, Shahram Ebadollahi, Steven E. Steinhubl, Zahra Daar, Walter F. Stewart. Combining Knowledge and Data Driven Insights for Identifying Risk Factors using Electronic Health Records. AMIA

83 Applications of Patient Similarity Heart Failure Prediction Likelihood of Blood Pressure onset Disease recommendation Medicine recommendation 83

84 Information Analysis Options Image based Retrieval Case based Retrieval 84

85 Image Query: Image-based retrieval: given a query image, find the most similar images. Case-based retrieval: given a case description (details of the symptoms and tests, including images), find similar cases, including images with case descriptions.

86 Road Map: Introduction; Hadoop Ecosystem; Data Analytics Approaches; Processing of Structured Data; Processing of Unstructured Data; Example Healthcare Applications; Concluding Observations

87 Benefits of Health Care Analytics Better diagnosis Better health care delivery Better value for patient, provider and payer Better innovation Better living 87

88 Big Data Analysis Challenge Machine Learning Background Domain Expertise Distributed Programming Skills 88

89 Big Data Analytics: Barriers: Cost of analytics. Lack of skilled talent. Difficulty of architecting a Big Data solution. Big data scalability issues. Limited capability of existing database analytics. Tangible business justification. Lack of understanding of Big Data benefits.

90 Thank you! Questions? 90


Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Complete Java Classes Hadoop Syllabus Contact No: 8888022204

Complete Java Classes Hadoop Syllabus Contact No: 8888022204 1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What

More information

HadoopRDF : A Scalable RDF Data Analysis System

HadoopRDF : A Scalable RDF Data Analysis System HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn

More information

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

How To Use Hadoop

How To Use Hadoop Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop

More information

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Microsoft SQL Server 2012 with Hadoop

Microsoft SQL Server 2012 with Hadoop Microsoft SQL Server 2012 with Hadoop Debarchan Sarkar Chapter No. 1 "Introduction to Big Data and Hadoop" In this package, you will find: A Biography of the author of the book A preview chapter from the

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

HADOOP MOCK TEST HADOOP MOCK TEST II

HADOOP MOCK TEST HADOOP MOCK TEST II http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

How to Hadoop Without the Worry: Protecting Big Data at Scale

How to Hadoop Without the Worry: Protecting Big Data at Scale How to Hadoop Without the Worry: Protecting Big Data at Scale SESSION ID: CDS-W06 Davi Ottenheimer Senior Director of Trust EMC Corporation @daviottenheimer Big Data Trust. Redefined Transparency Relevance

More information

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Highly competitive enterprises are increasingly finding ways to maximize and accelerate

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

More information

Exploration and Visualization of Post-Market Data

Exploration and Visualization of Post-Market Data Exploration and Visualization of Post-Market Data Jianying Hu, PhD Joint work with David Gotz, Shahram Ebadollahi, Jimeng Sun, Fei Wang, Marianthi Markatou Healthcare Analytics Research IBM T.J. Watson

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo

Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo Sensor Network Messaging Service Hive/Hadoop Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo Contents 1 Introduction 2 What & Why Sensor Network

More information

Oracle Big Data Fundamentals Ed 1 NEW

Oracle Big Data Fundamentals Ed 1 NEW Oracle University Contact Us: +90 212 329 6779 Oracle Big Data Fundamentals Ed 1 NEW Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big

More information

Certified Big Data and Apache Hadoop Developer VS-1221

Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification

More information

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012 PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping Version 1.0, Oct 2012 This document describes PaRFR, a Java package that implements a parallel random

More information

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE Anjali P P 1 and Binu A 2 1 Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi. M G University, Kerala

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering MySQL and Hadoop: Big Data Integration Shubhangi Garg & Neha Kumari MySQL Engineering 1Copyright 2013, Oracle and/or its affiliates. All rights reserved. Agenda Design rationale Implementation Installation

More information

Big Data Introduction

Big Data Introduction Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights Conventional infrastructure 2 Copyright 2012, Oracle and/or its affiliates. All rights

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

More information

Hadoop Parallel Data Processing

Hadoop Parallel Data Processing MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant

More information

BIG DATA - HADOOP PROFESSIONAL amron

BIG DATA - HADOOP PROFESSIONAL amron 0 Training Details Course Duration: 30-35 hours training + assignments + actual project based case studies Training Materials: All attendees will receive: Assignment after each module, video recording

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services

More information

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand? BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand? The Big Data Buzz big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database

More information

Big Data on Microsoft Platform

Big Data on Microsoft Platform Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley Disclaimer: This material is protected under copyright act AnalytixLabs, 2011. Unauthorized use and/ or duplication of this material or

More information

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

MySQL and Hadoop. Percona Live 2014 Chris Schneider

MySQL and Hadoop. Percona Live 2014 Chris Schneider MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for

More information

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop

More information

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features Charlie Berger, MS Eng, MBA Sr. Director Product Management, Data Mining and Advanced Analytics charlie.berger@oracle.com www.twitter.com/charliedatamine

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Communicating with the Elephant in the Data Center

Communicating with the Elephant in the Data Center Communicating with the Elephant in the Data Center Who am I? Instructor Consultant Opensource Advocate http://www.laubersoltions.com sml@laubersolutions.com Twitter: @laubersm Freenode: laubersm Outline

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect IT Insight podcast This podcast belongs to the IT Insight series You can subscribe to the podcast through

More information