Big Data. A general approach to process external multimedia datasets. David Mera


1 Big Data A general approach to process external multimedia datasets David Mera Laboratory of Data Intensive Systems and Applications (DISA) Masaryk University Brno, Czech Republic 7/10/2014

2 Table of Contents

3 Table of Contents

4 Introduction: Big Data. Huge new datasets are constantly being created: "90% of the data in the world today has been created in the last two years" (1). Organizations have potential access to a wealth of information, but they do not know how to get value out of it. (1) Source: SINTEF, Big Data - for better or worse, 2013.


5 Introduction: Big Data phenomenon. Volume refers to the vast amount of data generated every second. Variety refers to the different forms of data. Velocity refers to the speed at which new data are generated. Veracity refers to the reliability of the data. [Figure labels: Value, Volume, Velocity, Variety.]


6 Introduction: Multimedia Big Data. 100 hours of video are uploaded to YouTube every minute. 350 million photos are uploaded to Facebook every day (2012). Each day, 60 million photos are uploaded to Instagram. [Chart labels: 70% Non-Structured Data; 60% Internet Traffic (2).] (2) Source: IBM 2013 Global Technological Outlook report.


7 Introduction: Multimedia Big Data. Getting information from large volumes of multimedia data: content-based retrieval techniques; findability problem; extraction of suitable features; time-consuming task. Feature extraction approaches: a sequential approach is not affordable. Distributed computing (cluster computing, Grid computing): high computer skills; ad-hoc approaches; low reusability; lack of failure handling. Distributed computing (Big Data approaches): batch data with the Map-Reduce paradigm (Apache Hadoop); stream data with S4 and Apache Storm.


8 Table of Contents

9 Big Data processing frameworks: Apache Hadoop. Characteristics (Map-Reduce paradigm): batch data processing system; commodity computing; no specialized distributed-computing skills are required; machine communication; task scheduling; scalability; failure handling; automatic partitioning of the input data.


10 Big Data processing frameworks: Hadoop Map-Reduce paradigm. Input pairs: tuples (key, value). Intermediate pairs: tuples (I-key, I-value). Output pairs: tuples (O-key, O-value). [Diagram: the input data is divided into Split 0 to Split n; each split feeds a Map task, and the Map outputs feed the Reduce tasks.]
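The three kinds of pairs on this slide can be illustrated with a word count. The following is only a single-process sketch in plain Java; it does not use the Hadoop API, and the class name `WordCountSketch` and the choice of word counting are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of the Map-Reduce data flow; not the Hadoop API.
class WordCountSketch {

    // Map: one input pair (split id, text) -> intermediate pairs (word, 1).
    static List<Map.Entry<String, Integer>> map(int splitId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: one intermediate key with all its values -> one output value.
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    static Map<String, Integer> run(List<String> splits) {
        // Shuffle: group the intermediate values by intermediate key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < splits.size(); i++) {
            for (Map.Entry<String, Integer> pair : map(i, splits.get(i))) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Output pairs: (word, total count).
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            output.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // Two splits of a tiny input data set.
        System.out.println(run(Arrays.asList("big data big", "data stream")));
    }
}
```

In Hadoop the Map and Reduce calls would run on different machines and the grouping step would be done by the framework; here everything runs in one loop so that the flow of pairs stays visible.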


11 Big Data processing frameworks: Apache Hadoop. Weaknesses and limitations: optimized for large files; batch data processing; response time; hard configuration process (iterative optimization); lack of real-time processing; the parallelization level cannot be altered at run time.


12 Big Data processing frameworks: Apache Storm. Characteristics: real-time processing system; commodity computing; no specialized distributed-computing skills are required; set of generic tools to build distributed graphs of computation; machine communication; task scheduling; scalability; failure handling; the parallelization can be adapted while the topology is processing.


13 Big Data processing frameworks: Apache Storm. Storm runs topologies. Streams: unbounded sequences of tuples. Spouts: sources of streams. Bolts: input streams -> some processing -> new streams. [Diagram: spouts emit streams of data into bolts A to E, each drawn as several parallel instances (A1 to An, B1 to Bn, and so on), which emit new streams of data.]
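The relationship between spouts, streams and the processing steps that consume them can be imitated in a few lines. This is a rough sketch in plain Java that does not use the Storm API: the spout is a finite iterator (a real stream is unbounded), the processing steps run one after another in a single thread (Storm runs them in parallel), and the names `TopologySketch` and `Bolt` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// In-memory imitation of a linear topology; it does not use the Storm API.
class TopologySketch {

    // A bolt consumes one tuple and may emit zero or more new tuples.
    interface Bolt extends Function<String, List<String>> {}

    // Pull every tuple from the spout and push it through the bolts in order.
    static List<String> run(Iterator<String> spout, List<Bolt> bolts) {
        List<String> results = new ArrayList<>();
        while (spout.hasNext()) {
            List<String> stream = new ArrayList<>();
            stream.add(spout.next());
            for (Bolt bolt : bolts) {
                List<String> next = new ArrayList<>();
                for (String tuple : stream) {
                    next.addAll(bolt.apply(tuple));
                }
                stream = next; // the bolt's output is the next bolt's input
            }
            results.addAll(stream);
        }
        return results;
    }

    public static void main(String[] args) {
        Bolt split = line -> Arrays.asList(line.split(" "));
        Bolt upper = word -> Arrays.asList(word.toUpperCase());
        Iterator<String> spout = Arrays.asList("stream of data", "more data").iterator();
        // prints [STREAM, OF, DATA, MORE, DATA]
        System.out.println(run(spout, Arrays.asList(split, upper)));
    }
}
```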


14 Big Data processing frameworks: Apache Storm. Weaknesses and limitations: lack of support for processing batch data; low-level framework; pull mode; specific scenario configurations.

15 Table of Contents

16 Prototype: General overview. Prototype goals: efficient processing of huge external datasets; heterogeneous data management; processing of arbitrary functions; infrastructure flexibility; failure handling.

17 Prototype: General overview. [Architecture diagram labels: Job; job definition; Jar files; Server; Parser; Topology creator; Topology Manager; Distributed Infrastructure; Storm; Storm topology; Job Output; Stream of data; External data source; Cluster; Distributed File System.]


18 Prototype: General overview. [Architecture diagram labels: Job; job definition; Jar files; Job Interface; Server; Parser; Topology creator; Topology Manager; Storm; Storm topology; Job Output; Stream of data; External data source; Cluster; Distributed File System.]


19 Prototype: Job definition
<job>
  <name>...</name>
  <datasource>...</datasource>
  <data save="bool">
    <operators>
      <operator>*
        <class>
          <name>...</name>
          <method>...</method>
        </class>
        <data save="bool">
          <operators>...</operators>
        </data>
      </operator>
    </operators>
  </data>
</job>

20 Prototype: Job definition (job XML as on slide 19). Callout: Topology name.

21 Prototype: Job definition (job XML as on slide 19). Callouts: Topology name; Spout.

22 Prototype: Job definition, spouts. Available spouts: Socket; Apache Kafka (distributed messaging system). [Diagram labels: Topology name; Spout; byte[].]

23 Prototype: Job definition (job XML as on slide 19). Callouts: Topology name; Spout; Stream of data; Save.

24 Prototype: Job definition (job XML as on slide 19). Callouts: Topology name; Spout; Stream of data; Save; Stream processing. Operation: class name (inside the Jar file) and a method with the signature public byte[] methodname(byte[]).
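Since the job definition names a class and a method, and the method must have the shape `public byte[] methodname(byte[])`, one plausible way to run such an operation is Java reflection. The sketch below is an assumption about how that could be done, not the prototype's actual code; `OperatorInvoker`, `UpperCaseOperator` and `process` are hypothetical names.

```java
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;

// Looks up an operator by class name and method name, then applies it to one tuple.
class OperatorInvoker {

    static byte[] invoke(String className, String methodName, byte[] tuple) {
        try {
            Class<?> operatorClass = Class.forName(className);
            Object operator = operatorClass.getDeclaredConstructor().newInstance();
            // Only methods that take a single byte[] argument are accepted.
            Method method = operatorClass.getMethod(methodName, byte[].class);
            return (byte[]) method.invoke(operator, (Object) tuple);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot run operator " + className, e);
        }
    }

    public static void main(String[] args) {
        byte[] input = "stream of data".getBytes(StandardCharsets.UTF_8);
        byte[] output = invoke(UpperCaseOperator.class.getName(), "process", input);
        System.out.println(new String(output, StandardCharsets.UTF_8));
    }
}

// Hypothetical user operator, as it might be packaged in a job's Jar file.
class UpperCaseOperator {
    public UpperCaseOperator() {}

    // Follows the contract from the slide: public byte[] methodname(byte[]).
    public byte[] process(byte[] tuple) {
        String text = new String(tuple, StandardCharsets.UTF_8);
        return text.toUpperCase().getBytes(StandardCharsets.UTF_8);
    }
}
```

Working on raw `byte[]` keeps the framework independent of the data type: the operator alone decides whether the bytes are text, an image or a feature vector.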

25 Prototype: Job definition, bolts. Save bolts: data storage into HDFS; buffer; Hadoop SequenceFiles. Worker bolts: processing tuples with public byte[] methodname(byte[]). [Diagram labels: Topology name; Spout; byte[]; Worker; Worker; Save.]

26 Prototype: Job definition (job XML as on slide 19). Callouts: Topology name; Spout; Stream of data; Save; Stream processing; Operation: class name (inside the Jar file) and public byte[] methodname(byte[]); Stream of data; Save...
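To make the nesting of the job definition concrete, the sketch below reads one such XML document with the JDK's DOM parser and lists the stages it implies. It rests on several assumptions: that `save` takes the values `true` or `false`, that `save="true"` means the stream at that point is stored, and that each nested `<data>` describes the output stream of its enclosing operator. All element values in `SAMPLE` are placeholders and `JobDefinitionSketch` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Reads a job definition and lists the topology stages it implies.
class JobDefinitionSketch {

    // Hypothetical job: every value below is a placeholder, not taken from the slides.
    static final String SAMPLE =
        "<job><name>demo-topology</name><datasource>example-source</datasource>"
      + "<data save=\"true\"><operators><operator>"
      + "<class><name>org.example.Extractor</name><method>extract</method></class>"
      + "<data save=\"true\"><operators/></data>"
      + "</operator></operators></data></job>";

    // Direct child elements of parent that carry the given tag name.
    static List<Element> children(Element parent, String tag) {
        List<Element> found = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && ((Element) n).getTagName().equals(tag)) {
                found.add((Element) n);
            }
        }
        return found;
    }

    static String text(Element parent, String tag) {
        return children(parent, tag).get(0).getTextContent().trim();
    }

    // Walk a <data> element: an optional save stage, then one stage per operator.
    static void describe(Element data, String source, List<String> stages) {
        if ("true".equals(data.getAttribute("save"))) {
            stages.add("save <- " + source);
        }
        for (Element operators : children(data, "operators")) {
            for (Element operator : children(operators, "operator")) {
                Element cls = children(operator, "class").get(0);
                String stage = text(cls, "name") + "#" + text(cls, "method");
                stages.add(stage + " <- " + source);
                for (Element nested : children(operator, "data")) {
                    describe(nested, stage, stages); // output stream of this operator
                }
            }
        }
    }

    static List<String> parse(String xml) {
        Element job;
        try {
            job = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid job definition", e);
        }
        List<String> stages = new ArrayList<>();
        stages.add("topology " + text(job, "name"));
        String spout = "spout[" + text(job, "datasource") + "]";
        for (Element data : children(job, "data")) {
            describe(data, spout, stages);
        }
        return stages;
    }

    public static void main(String[] args) {
        for (String stage : parse(SAMPLE)) {
            System.out.println(stage);
        }
    }
}
```

For the sample job this lists the topology name, a save stage fed by the spout, the operator fed by the spout, and a second save stage fed by the operator.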

27 Prototype: Job definition. [Diagram labels: Topology name; Spout; byte[]; Worker; Worker; Save.]

28 Prototype: Job definition. [Architecture diagram labels: Job; job definition; Jar files; Job Interface; Parser; Topology creator; Topology Manager; Topology monitor; Activate; Topology; Topology deployment; Storm; Storm topology; Kafka; Tuples; Stream of data; Job Output; External data source; Cluster; Distributed File System; Hadoop File System.]

29 Prototype: Monitoring system. Internal monitoring system: tunes the max pending tuples parameter. The topology starts with a low parameter value. Every X seconds the monitor checks the acked tuples. In the first iteration the monitor increases the parameter value. In the following iterations: if current acked tuples > previous acked tuples, it increases the parameter value; if current acked tuples < previous acked tuples, it decreases the parameter value; if current acked tuples == previous acked tuples, it does nothing, unless this scenario has been repeated X times, in which case it increases the parameter value.
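The adjustment loop of the internal monitor can be written down as a small state machine. The sketch below is one reading of the slide, not the prototype's code: it assumes a fixed additive step, a lower bound of 1, and that the acked-tuple count is measured per monitoring period. `MaxPendingTuner`, `step` and `patience` are hypothetical names.

```java
// Adaptive tuning of a "max pending tuples" value from acked-tuple counts.
class MaxPendingTuner {
    private final int step;          // assumed: fixed amount added or removed per adjustment
    private final int patience;      // the slide's "X times" for the unchanged case
    private int maxPending;          // current parameter value
    private long previousAcked = -1; // -1 marks "no previous sample yet"
    private int unchangedRuns = 0;

    MaxPendingTuner(int initialValue, int step, int patience) {
        this.maxPending = initialValue;
        this.step = step;
        this.patience = patience;
    }

    // Called once per monitoring period with the tuples acked in that period.
    int update(long acked) {
        if (previousAcked < 0 || acked > previousAcked) {
            // First iteration, or throughput improved: increase.
            maxPending += step;
            unchangedRuns = 0;
        } else if (acked < previousAcked) {
            // Throughput dropped: decrease, but never below 1 (assumed floor).
            maxPending = Math.max(1, maxPending - step);
            unchangedRuns = 0;
        } else if (++unchangedRuns >= patience) {
            // Unchanged for `patience` periods in a row: probe upwards again.
            maxPending += step;
            unchangedRuns = 0;
        }
        previousAcked = acked;
        return maxPending;
    }

    public static void main(String[] args) {
        MaxPendingTuner tuner = new MaxPendingTuner(10, 5, 2);
        long[] ackedPerPeriod = {100, 150, 120, 120, 120};
        for (long acked : ackedPerPeriod) {
            System.out.println(acked + " acked -> max pending " + tuner.update(acked));
        }
    }
}
```

With the sample counts the value moves 15, 20, 15, 15, 20: up on the first sample, up on the improvement, down on the drop, unchanged once, then up again after two unchanged periods.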

30 Prototype: Monitoring system. External monitoring system: the administrator can add rules, where Rule = (metric, operator, value, action). The monitor gets the topology metrics every X seconds. Each bolt produces a set of metrics. The monitor evaluates each rule using the bolt metrics. The monitor applies the rule action to every bolt that has triggered it.

31 Prototype: Monitoring system, example. Rule1: (capacity, <, 0.4, -1). Rule2: (capacity, >, 0.8, +2). [Diagram: Spout emitting byte[]; Worker-A and Worker-B with Capacity=0.75 and Capacity=0.86; Save-A and Save-B with Capacity=0.25 and Capacity=0.3.]
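The example can be replayed with a small rule evaluator. This is a sketch under assumptions: the action is read as a change in a bolt's parallelism, and the capacities are assigned to the bolts in the order they appear on the slide (Worker-A 0.75, Worker-B 0.86, Save-A 0.25, Save-B 0.3). `RuleMonitorSketch` and its members are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Evaluates (metric, operator, value, action) rules against per-bolt metrics.
class RuleMonitorSketch {

    static class Rule {
        final String metric;
        final char operator;
        final double value;
        final int action; // assumed meaning: change in the bolt's parallelism

        Rule(String metric, char operator, double value, int action) {
            this.metric = metric;
            this.operator = operator;
            this.value = value;
            this.action = action;
        }

        boolean triggeredBy(double observed) {
            switch (operator) {
                case '<': return observed < value;
                case '>': return observed > value;
                default: throw new IllegalArgumentException("unsupported operator: " + operator);
            }
        }
    }

    // Returns, for each bolt that triggered at least one rule, the summed action.
    static Map<String, Integer> evaluate(List<Rule> rules, String metric,
                                         Map<String, Double> valueByBolt) {
        Map<String, Integer> actions = new LinkedHashMap<>();
        for (Map.Entry<String, Double> bolt : valueByBolt.entrySet()) {
            for (Rule rule : rules) {
                if (rule.metric.equals(metric) && rule.triggeredBy(bolt.getValue())) {
                    actions.merge(bolt.getKey(), rule.action, Integer::sum);
                }
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule("capacity", '<', 0.4, -1),   // Rule1
                new Rule("capacity", '>', 0.8, +2));  // Rule2
        Map<String, Double> capacity = new LinkedHashMap<>();
        capacity.put("Worker-A", 0.75);
        capacity.put("Worker-B", 0.86);
        capacity.put("Save-A", 0.25);
        capacity.put("Save-B", 0.3);
        // prints {Worker-B=2, Save-A=-1, Save-B=-1}
        System.out.println(evaluate(rules, "capacity", capacity));
    }
}
```

Under these assumptions Worker-A triggers no rule, Worker-B triggers Rule2, and both Save bolts trigger Rule1.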

32 Table of Contents

33 Prototype: General overview. Goals: efficient processing of huge external datasets; heterogeneous data management; processing of arbitrary functions; infrastructure flexibility; failure handling; data relations management; efficient processing of huge internal datasets.

34 Big Data A general approach to process external multimedia datasets Thank you for your attention!
