Introduction to Big Data & Basic Data Analysis. Freddy Wetjen, National Library of Norway.

Similar documents
Big Data Explained. An introduction to Big Data Science.

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Modern (Computational) Approaches to Big Data Analytics. CSC 576 Computer Science, University of Rochester Instructor: Ji Liu

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

CPS 216: Advanced Database Systems (Data-intensive Computing Systems) Shivnath Babu

Hadoop IST 734 SS CHUNG

Hadoop Ecosystem B Y R A H I M A.

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Large scale processing using Hadoop. Ján Vaňo

BIG DATA TRENDS AND TECHNOLOGIES

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

How To Scale Out Of A Nosql Database

BIG DATA What it is and how to use?

Big Data and Apache Hadoop s MapReduce

Hadoop. Sunday, November 25, 12

Big Data Operations: Basis for Benchmarking Big Data Systems

Hadoop Big Data for Processing Data and Performing Workload

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Big Data on Microsoft Platform

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Outline. What is Big data and where they come from? How we deal with Big data?

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Hadoop implementation of MapReduce computational model. Ján Vaňo

Alternatives to HIVE SQL in Hadoop File Structure

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

BIG DATA TECHNOLOGY. Hadoop Ecosystem

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Open source Google-style large scale data analysis with Hadoop

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Using Big Data to Explore New Opportunities. Fandhy Haristha Siregar, M.Kom, CIA, CRMA, CISA, CISM, CISSP, CEH, CEP-PM, QIA, COBIT5

Analysis of Web Archives. Vinay Goel Senior Data Engineer

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

HDFS. Hadoop Distributed File System

Data processing goes big

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Chapter 7. Using Hadoop Cluster and MapReduce

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM APACHE HADOOP WORKSHOP

Generic Log Analyzer Using Hadoop Mapreduce Framework

Chase Wu New Jersey Ins0tute of Technology

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Hadoop and Map-Reduce. Swati Gore

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Big data for the Masses The Unique Challenge of Big Data Integration

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Map Reduce & Hadoop Recommended Text:

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Manifest for Big Data Pig, Hive & Jaql

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Accelerating and Simplifying Apache

Enhancing Massive Data Analytics with the Hadoop Ecosystem

Apache Hadoop FileSystem and its Usage in Facebook

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

What is cloud computing?

Real Time Big Data Processing

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

WHITE PAPER. Four Key Pillars To A Big Data Management Solution

Hadoop & its Usage at Facebook

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Oracle Big Data SQL Technical Update

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

NoSQL and Hadoop Technologies On Oracle Cloud

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Internals of Hadoop Application Framework and Distributed File System

Indian Journal of Science The International Journal for Science ISSN EISSN Discovery Publication. All Rights Reserved

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Using Hadoop for Webscale Computing. Ajay Anand Yahoo! Usenix 2008

Distributed File Systems An Overview. Nürnberg, Dr. Christian Boehme, GWDG

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Challenges and Opportunities in Data Mining: Personalization

INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS

Big Data and Industrial Internet

Big Data: Tools and Technologies in Big Data

BIRT in the World of Big Data

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Log Mining Based on Hadoop s Map and Reduce Technique

Native Connectivity to Big Data Sources in MSTR 10

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies

Hadoop Architecture. Part 1

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Keywords: Big Data, Hadoop, cluster, heterogeneous, HDFS, MapReduce

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

INTRODUCTION TO CASSANDRA

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Transcription:

Introduction to Big Data & Basic Data Analysis Freddy Wetjen, National Library of Norway.

Big Data EveryWhere! Lots of data may be collected and warehoused Web data, e-commerce purchases at department/ grocery stores Bank/Credit Card transactions Social Network Mobile usage & tracing

How much data? Google processes 20 PB a day (2008) Wayback Machine has 3 PB + 100 TB/month (3/2009) Facebook has 2.5 PB of user data + 15 TB/day (4/2009) ebay has 6.5 PB of user data + 50 TB/day (5/2009) CERN s Large Hydron Collider (LHC) generates 15 PB a year 640K ought to be enough for anybody.

Type of Data Relational Data (Tables/Transaction/Legacy Data) Text Data (Web) Semi-structured Data (XML) Record structured (programming language) Graph Data Social Network, Semantic Web (RDF), Usage data Streaming Data You can only scan the data once

What to do with these data? Aggregation and Statistics Data warehouse and OLAP Indexing, Searching, and Querying Keyword based search Pattern matching (XML/RDF) Knowledge discovery Data Mining Statistical Modeling Reasoning?

Examples of Big data applications Analysis of customer usage patterns (Netflix). Finding customer habits and do recommandations. Data warehousing. Seismic data analysis Creating and connecting large collections of mixed media. Gaining knowledge and experience across large heterogenous datasets. Knowledge based reasoning after information extraction. The problem is to be able to connect the large number of (different types of) dots to understand the pattern.

Google Hadoop;Big data enabler GFS (Global File System) Language for expressing distributed processing Other tools/frameworks Monitoring/alarming Tools & libraries Data access Map/Reduce Hadoop FS Hadoop world

Global File System (GFS); Examplified with Hadoop Scalable open source framework for distributed storage of data. Scalable to petabytes and 1000's of nodes; heavily paralizable Started out from google... Buildt on top of commodity filesystems. Fault tolerant,scalable, in use...

Benefits of a Hadoop Global file system All data available from a single mount point. Single point of management of filesystem. Single point of job management. Security,scalabililty and redundancy buildt into the system

Map Reduce;Expressing distributed processing for Hadoop Two steps in expressing Processing; Map and Reduce. Map step to express the processing for each data unit. Reduce step to coordinate the different result sets. Specialized language for expressing this MapReduce Pig Java based framework Pig latin

Map Reduce is the key processing concept of Hadoop Controlling processing in the global environment from a single point Pig script language similar syntax to unix shell scripts Pig extendable with java programs (JRE/jar) distribution of the jar file HADOOP handles distribution of both data and processing.

Other commonly used tools with Hadoop Hive key value database system, Limited sql support. Lucene Advanced file based indexing system (from the google index..). Mahout decision support framework for implementing reasoning with large data. AWS Implementing web services on top. Monitoring tools

Language processing and big data; New possibilities Language processing is embarissingly easy to distribute And to run massively in paralell Tokenization is the heart of language processing. Different language processing goals requires different tokenization Add Machine learning techniques and semantic knowledge.

Language processing and big data with hadoop; perfect match Large collections of data in a distributed storage Different configurable tokenizations (rule based,tree bank,regexp,topic shifts,words,sentences,actions,scenary..) Mahout for modeling of semantic knowledge. Hugely paralizable processing engine at your fingertips.

Starting up N gram Tokenization and indexing all text in the National library (newspapers,books, publications). Everything we have digitized and will digitize :-) Representing metadata and semantic data about the different texts Providing instant N gram analysis to all text with references back to text. Extending the facilities to open for more relational analysis (Ngrams given...) Reasoning with this using Mahout (Apache Reasoning framework). Circumstancial processing.

Ngram example in NB texts; Term usage through the years

NLP loopback Using Ngrams to enrich the quality of production lines. Improving quality of text with more complex character settings error correction. Building wordlists with different approaches. For instance, a particular time frame, newspaper genre etc.

Where do we go from here? Text based reasoning engine? Towards a framework for representing and doing machine extrable knowledge(semantics). From text to metadata to knowledge. Better understanding of the user.