Hurtownie Danych i Business Intelligence: Big Data



Similar documents
Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Big Data and Data Science: Behind the Buzz Words

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Lecture Data Warehouse Systems

MapReduce with Apache Hadoop Analysing Big Data

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop IST 734 SS CHUNG

How To Scale Out Of A Nosql Database

Big Data Technologies Compared June 2014

BIG DATA TRENDS AND TECHNOLOGIES

<Insert Picture Here> Big Data

So What s the Big Deal?

Hadoop Ecosystem B Y R A H I M A.

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Tap into Hadoop and Other No SQL Sources

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

NoSQL Data Base Basics

Large scale processing using Hadoop. Ján Vaňo

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

BIG DATA What it is and how to use?

NoSQL for SQL Professionals William McKnight

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Structured Data Storage

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql

Introduction to NOSQL

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Open source large scale distributed data management with Google s MapReduce and Bigtable

Applications for Big Data Analytics

Real Time Big Data Processing

Constructing a Data Lake: Hadoop and Oracle Database United!

Big Data Explained. An introduction to Big Data Science.

Implement Hadoop jobs to extract business value from large and varied data sets

How To Handle Big Data With A Data Scientist

Play with Big Data on the Shoulders of Open Source

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Big Data Analytics - Accelerated. stream-horizon.com

Big Data: Tools and Technologies in Big Data

An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

Data Services Advisory

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

What Next for DBAs in the Big Data Era

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores

Big Data and Apache Hadoop s MapReduce

Big Data Analytics Nokia

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

BIG DATA: STORAGE, ANALYSIS AND IMPACT GEDIMINAS ŽYLIUS

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Enterprise Operational SQL on Hadoop Trafodion Overview

Big Data Management and Security

The Future of Data Management

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Microsoft Analytics Platform System. Solution Brief

NoSQL and Hadoop Technologies On Oracle Cloud

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Data processing goes big

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies

Native Connectivity to Big Data Sources in MSTR 10

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Introducing Oracle Exalytics In-Memory Machine

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Hadoop implementation of MapReduce computational model. Ján Vaňo

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

Cloud Scale Distributed Data Storage. Jürmo Mehine

In-Memory Analytics for Big Data

Big Data With Hadoop

Survival Tips for Big Data Impact on Performance Share Pittsburgh Session 15404

Cost-Effective Business Intelligence with Red Hat and Open Source

Putting Apache Kafka to Use!

Hadoop & its Usage at Facebook

Oracle Big Data SQL Technical Update

Luncheon Webinar Series May 13, 2013

Challenges for Data Driven Systems

How To Write A Database Program

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Can the Elephants Handle the NoSQL Onslaught?

Apache Hadoop FileSystem and its Usage in Facebook

Hadoop and Map-Reduce. Swati Gore

Federated Cloud-based Big Data Platform in Telecommunications

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

Performance and Scalability Overview

Transcription:

Hurtownie Danych i Business Intelligence: Big Data Robert Wrembel Politechnika Poznańska Instytut Informatyki Robert.Wrembel@cs.put.poznan.pl www.cs.put.poznan.pl/rwrembel Outline Introduction to Big Data Big Data Architectures GFS, HDFS, Hadoop Some other hot trends 2

Big Data Big Data storage internal company BI system hobbies + behaviour of customers targeted marketing sentiment analysis brand image job seeking employment trends prices of goods inflation APR Big Data analytics 3 Big Data Huge Volume Every minute: 48 hours of video are uploaded onto Youtube 204 million e-mail messages are sent 600 new websites are created 600000 pieces of content are created over 100000 tweets are sent (~ 80GB daily) Sources: social data web logs machine generated 4

Big Data Sensors mechanical installations (refineries, jet engines, crude oil platforms, traffic monitoring, utility installations, irrigation systems) one sensor on a blade of a turbine generates 520GB daily a single jet engine can generate 10TB of data in 30 minutes telemedicine telecommunication 5 Big Data High Velocity of data volume growth uploading the data into an analytical system Variety (heterogeneity) of data formats structured - relational data and multidimensional cube data unstructured or semistructured - text data semantic Web XML/RDF/OWL data geo-related data sensor data Veracity (Value) - the quality or reliability of data 6

Big Data - Problems Storage volume fast data access fast processing Real-time analysis analyzing fast-arriving streams of data 7 Types of processing Batch processing - standard DW refreshing Real-time / near real-time data analytics answers with the most updated data up to the moment the query was sent the analytical results are updated after a query has been executed Streaming analytics a system automatically updates results about the data analysis as new pieces of data flow into the system as-it-occurs signals from incoming data without the need to manually query for anything 8

Real-time / Near real-time architecture data stream active component main-memory engine OLTP + OLAP 9 Real-time / Near real-time refreshing refreshing users traditional DW refreshing users real-time DW 10

Big Data Architecture massive data processing server - MDPS (aggregation, filtering) clicks tweets facebook likes location information... real-time decision engine - RTDE analtytics server - AS complex event processor - CEP reporting server - RS 11 Big Data Architecture Scalability RTDE - nb of events handled MDPS - volume of data and frequency of data processing AS - complexity of computation, frequency of queries RS - types of queries, nb of users CEP - # events handled Type of data RTDE - unstructured, semistructured (texts, tweets) MDPS - structured AS - structured RS - structured CEP - unstructured and structured 12

Big Data Architecture Workload RTDE - high write throughput MDPS - long-running data processing (I/O and CPU intensive): data transformations,... AS - compute intensive (I/O and CPU intensive) RS - various types of queries Technologies RTDE - key-value, in-memory MDPS - Hadoop AS - analytic appliences RS - in-memory, columnar DBs Conclusion very complex architecture with multiple components the need of integration 13 IBM Architecture Data warehouse augmentation: the queryable data store. IBM software solution brief. 14

Big Data Architecture 15 Big Data Architecture columnar storage and query coordinate and schedule workflows high level language for processing MapReduce data ingest coordinate and manage all the components http://www.cloudera.com/content/cloudera/en/resources/library/training/ap ache-hadoop-ecosystem.html 16

Big Data Architecture user interface to Hive SQL-like language for data analysis supports selection, join, group by,... high level language for processing MapReduce columnar storage and query based on BigTable manages TB of data web log loader (log scrapper), periodical loading into Hadoop aggregating log entries (for offline analysis) coordinate and manage all the components service discovery, distributed locking,... 17 Big Data Architecture coordinate and schedule workflows schedule and manage Pig, Hive, Java, HDFS actions RDB-like interface to data stored in Hadoop high level languages for processing MapReduce distributed web log loader (log scrapper), periodical loading into Hadoop (for offline analysis) 18

Big Data Architecture workflow (batch job) scheduler (e.g., data extraction, loading into Hadoop) SQL to Hadoop: command line tool for importing any JDBC data source into Hadoop distributed web log loader and aggregator (in real time) table-like data storage + in memory caching managing the services, e.g., detercting addition or removal of Kafka's brokers and consumers, load balancing 19 Big Data Architecture high level languages for processing MapReduce UI for Hadoop (e.g., HDFS file browser, MapReduce job designer and browser, query interfaces for Hive, Pig, Impala, Oozie, application for creating workflows, Hadoop API) workflow coordination and scheduling distributed web log loader (log scrapper), periodical loading into Hadoop (for offline analysis) coordinate and manage all the components 20

Big Data Architecture Windows Azure Java OM Streaming OM HiveQL PigLatin.NET/C#/F (T)SQL NOSQL ETL Tomasz Kopacz - Microsoft Polska: prezentacja Windows Azure, Politechnika Poznańska, czerwiec 2013 21 Data Stores NoSQL Key-value DB data structure collection, represented as a pair: key and value data have no defined internal structure the interpretation of complex values must be made by an application processing the values operations create, read, update (modify), and delete (remove) individual data - CRUD the operations process only a single data item selected by the value of its key Voldemort, Riak, Redis, Scalaris, Tokyo Cabinet, MemcacheDB, DynamoDB 22

Data Stores Column family (column oriented, extensible record, wide column) definition of a data structure includes key definition column definitions column family definitions column family stored separately, common to all data items (~ shared schema) column stored with a data item, specific for the data item CRUD interface H-Base, HyperTable, Cassandra, BigTable, Accumulo, SimpleDB 23 Data Stores Document DB typically JSON-based structure of documents SimpleDB, MongoDB, CouchDB, Terrastore, RavenDB, Cloudant Graph DB nodes, edges, and properties to represent and store data every node contains a direct pointer to its adjacent element Neo4j, FlockDB, GraphBase, RDF Meronymy SPARQL 24

GFS Google implementation of DFS (cf. The Google File System - whitepaper) Distributed FS For distributed data intensive applications Storage for Google data Installation hundreds of TBs of storage, thousands of disks, over a thousand cheep commodity machines The architecture is failure sensitive fault tolerance error detection automatic recovery constant monitoring is required 25 GFS Typical file size: multiple GB Operations on files mostly appending new data multiple large sequential writes no updates of already appended data mostly large sequential reads small random reads occur rarely file size at least 100MB millions of files 26

GFS Files are organized hierarchically in directories Files are identified by their pathnames Operations on files: create, delete, open, close, read, write, snapshot (creates a copy of a file or a directory tree), record append (appends data to the same file concurrently by multiple clients) GFS cluster includes single master multiple chunk servers 27 GFS client 1: file name, chunk index master 2: chunk handle, chunk replica locations 3: chunk handle, byte range sent to one replica management + heartbit messages 4: data............ chunk server chunk server chunk server S. Ghemawat, H. Gobioff, S-T. Leung. The Google File System. http://research.google.com/archive/gfs.html 28

Hadoop Apache implementation of DFS http://hadoop.apache.org/docs/stable/hdfs_design.html 29 Example In 2010 Facebook stored over 30PB in Hadoop Assuming: 30,000 1TB drives for storage typical drive has a mean time between failure of 300,000 hours 2.4 disk drive fails daily 30

Integration with Hadoop IBM BigInsights Cloudera distribution + IBM custom version of Hadoop called GPFS Oracle BigData appliance based on Cloudera for storing unstructured content Informatica HParser to launch Informatica process in a MapReduce mode, distributed on the Hadoop servers Microsoft dedicated Hadoop version supported by Apache for Microsoft Windows and for Azure EMC Greenplum, HP Vertica, Teradata Aster Data, SAP Sybase IQ provide connectors directly to HDFS 31 Technology Application % of organizations surveyed 32

Programming Languages Top Languages for analytics, data mining, data science Sept 2013, source: http://www.datasciencecentral.com/profiles/blogs/top-languages-foranalytics-data-mining-data-science The most popular languages continue to be R (61%) Python (39%) SQL (37%) SAS (20%) 33 Programming Languages Growth from 2012 to 2013 Pig Latin/Hive/other Hadoop-based languages 19% R 16% SQL 14% (the result of increasing number of SQL interfaces to Hadoop and other Big Data systems?) Decline from 2012 to 2013 Lisp/Clojure 77% Perl 50% Ruby 41% C/C++ 35% Unix shell/awk/sed 25% Java 22% 34

Big Data Questionnaire: 339 experts of data management (XII, 2012) Question: what are the plans of using Big Data in their organizations Answers: 14% highly probable 19% no plans Problems 21% not enough knowledge on Big Data 15% no clear profits from using Big Data 9% poor data quality 35 Data Scientist 36

RDBMS vs. NoSQL: the Future? TechTarget: Relational database management system guide: RDBMS still on top http://searchdatamanagement.techtarget.com/essentialg uide/relational-database-management-system-guide- RDBMS-still-on-top "While NoSQL databases are getting a lot of attention, relational database management systems remain the technology of choice for most applications" 37 RDBMS vs. NoSQL: the Future? R. Zicari: Big Data Management at American Express. Interview with Sastry Durvasula and Kevin Murray. ODBMS Industry Watch. Trends and Information on Big Data, New Data Management Technologies, and Innovation. Oct, 2014, available at: http://www.odbms.org/blog/2014/10/big-datamanagement-american-express-interview-sastry-durvasulakevin-murray/ "The Hadoop platform indeed provides the ability to efficiently process large-scale data at a price point we haven t been able to justify with traditional technology. That said, not every technology process requires Hadoop; therefore, we have to be smart about which processes we deploy on Hadoop and which are a better fit for traditional technology (for example, RDBMS)." Kevin Murray. 38

RDBMS Conceptual and logical modeling methodologies and tools Rich SQL functionality Query optimization Concurrency control Data integrity management Backup and recovery Performance optimization buffers' tuning storage tuning advanced indexing in-memory processing Application development tools 39 NoSQL Flexible "schema" suitable for unstructured data Massively parallel processing Cheap hardware + open source software 40

Some other trends Apache Derby: Java-based ANSI SQL database Splice Machine Derby (redesigned query optimizer to support parallel processing) on HBase (parallel processing) + Hadoop (parallel storage and processing) Web Table: https://research.google.com/tables?hl=en 41 Trends cont. Analyzing Twitter posts Google flu trend maps http://www.slate.com/blogs/moneybox/2014/04/04/ twitter_economics_university_of_michigan_stanford_r esearchers_are_using.html "Google tracks flu activity around the world by monitoring how many people are searching flu-related terms in different areas, and with what frequency. We have found a close relationship between how many people search for flu-related topics and how many people actually have flu symptoms " tweets on unemployment well correlate with real governmental data http://www.washingtonpost.com/blogs/wonkblog/wp /2014/05/30/the-weird-google-searches-of-theunemployed-and-what-they-say-about-the-economy/ 42

Trends cont. Big Data integration and cleaning to get correct data Extracting text analytics Tracking the evolution of entities in the Internet over time ACM SIGMOD Blog http://wp.sigmod.org/ 43