Data science for the datacenter: processing logs with Apache Spark. William Benton Red Hat Emerging Technology

Size: px
Start display at page:

Download "Data science for the datacenter: processing logs with Apache Spark. William Benton Red Hat Emerging Technology"

Transcription

1 Data science for the datacenter: processing logs with Apache Spark William Benton Red Hat Emerging Technology

2 BACKGROUND

3 Challenges of log data

4 Challenges of log data SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='crit' AND msg LIKE '%failure%' GROUP BY hostname, hour

5 Challenges of log data SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='crit' AND msg LIKE '%failure%' GROUP BY hostname, hour 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00

6 Challenges of log data postgres WARN CRIT DEBUG httpd GET GET GET POST GET (404) syslog WARN WARN (ca. 2000)

7 Challenges of log data postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT 2016) Django WARN syslog (ca. WARN haproxy k8s WARN DEBUG CRIT WARN WARN

8 Challenges of log data postgres WARN CRIT Cassandra CRIT postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN syslog WARN Rails WARN CouchDB CRIT redis CRIT CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT httpd GET GET (404) POST httpd PUT (500) GET PUT Django WARN syslog WARN Django WARN syslog WARN haproxy postgres httpd syslog k8s WARN CRIT GET GET GET POST WARN WARN DEBUG CRIT WARN WARN Cassandra nginx Rails CRIT GET POST PUT POST WARN haproxy postgres httpd syslog k8s WARN CRIT GET GET GET POST WARN WARN DEBUG CRIT WARN WARN Cassandra nginx Rails CRIT GET POST PUT POST WARN How many services are CouchDB httpd GET CRIT GET (404) POST redis CRIT httpd PUT (500) GET PUT CouchDB httpd GET CRIT GET (404) POST redis CRIT httpd PUT (500) GET PUT generating logs in your Django WARN syslog WARN Django WARN syslog WARN haproxy WARN DEBUG CRIT haproxy WARN DEBUG CRIT k8s WARN WARN k8s WARN WARN datacenter today? postgres WARN CRIT Cassandra CRIT postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN syslog WARN Rails WARN CouchDB CRIT redis CRIT CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT httpd GET GET (404) POST httpd PUT (500) GET PUT Django WARN syslog WARN Django WARN syslog WARN haproxy WARN DEBUG CRIT haproxy WARN DEBUG CRIT k8s WARN WARN k8s WARN WARN

9 Apache Spark Graph Query ML Streaming Spark core (distributed collections, scheduler) ad hoc Mesos YARN

10 DATA INGEST

11 Collecting log data collecting Ingesting live log data via rsyslog, logstash, fluentd

12 Collecting log data collecting normalizing Ingesting live log data via rsyslog, logstash, fluentd Reconciling log record metadata across sources

13 Collecting log data collecting normalizing warehousing Ingesting live log data via rsyslog, logstash, fluentd Reconciling log record metadata across sources Storing normalized records in ES indices

14 Collecting log data warehousing Storing normalized records in ES indices

15 Collecting log data warehousing analysis Storing normalized records in ES indices cache warehoused data as Parquet files on Gluster volume local to Spark cluster

16 Schema mediation

17 Schema mediation

18 Schema mediation

19 Schema mediation timestamp, level, host, IP addresses, message, &c. rsyslog-style metadata, like app name, facility, &c.

20 Structured queries at scale logs.select("level").distinct.as[string].collect

21 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert

22 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert logs.groupby($"level", $"rsyslog.app_name").agg(count("level").as("total")).orderby($"total".desc)

23 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert logs.groupby($"level", $"rsyslog.app_name").agg(count("level").as("total")).orderby($"total".desc) kubelet kube-proxy ERR journal systemd

24 FEATURE ENGINEERING

25 From log records to vectors What does it mean for two log records to be similar?

26 From log records to vectors What does it mean for two log records to be similar? red green blue orange

27 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001

28 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns

29 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > 00010

30 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > red pancakes orange waffles

31 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > red pancakes orange waffles -> >

32 From log records to vectors What does it mean for two log records to be similar?

33 From log records to vectors What does it mean for two log records to be similar? {level:, hostname: fred, timestamp: " :00:01",...} {level: DEBUG, hostname: barney, timestamp: " :00:02",...} {level:, hostname: fred, timestamp: " :00:03",...} {level: ERR, hostname: barney, timestamp: " :00:03",...} {level: ALERT, hostname: wilma, timestamp: " :00:03",...}

34 From log records to vectors What does it mean for two log records to be similar? {level:, hostname: fred, timestamp: " :00:01",...} {level: DEBUG, hostname: barney, timestamp: " :00:02",...} {level:, hostname: fred, timestamp: " :00:03",...} {level: ERR, hostname: barney, timestamp: " :00:03",...} {level: ALERT, hostname: wilma, timestamp: " :00:03",...} -> > > > >

35 Similarity and distance

36 Similarity and distance

37 Similarity and distance (q - p) (q - p)

38 Similarity and distance n p i - q i i=1

39 Similarity and distance n p i - q i i=1

40 Similarity and distance p q p q

41 Similarity and distance

42 Similarity and distance

43 Similarity and distance

44 Similarity and distance =.7

45 Other interesting features host01 WARN WAR host02 WARN DEBUG host03 WARN WARN

46 Other interesting features host01 WARN host02 DEBUG WARN DEBUG host03 FO WARN

47 Other interesting features host01 DEBUG WARN BUG host02 WARN host03 WARN WARN WARN WARN

48 Other interesting features : Everything is great! Just checking in to let you know I m OK.

49 Other interesting features : Everything is great! Just checking in to let you know I m OK. CRIT: No requests in last hour; suspending running app containers.

50 Other interesting features : Everything is great! Just checking in to let you know I m OK. CRIT: No requests in last hour; suspending running app containers. : Phoenix datacenter is on fire; may not rise from ashes.

51 NATURAL LANGUAGE and LOG DATA

52 Introducing word2vec Madrid Spain France

53 Introducing word2vec Madrid Spain France

54 Introducing word2vec Madrid Spain France

55 Introducing word2vec Madrid Spain France

56 Introducing word2vec Madrid Spain France

57 Introducing word2vec Madrid Spain Paris France v("madrid") - v("spain") + v("france") v("paris")

58 Preprocessing text What is a word? How are log words different from words found in conventional prose?

59 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot

60 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd etcd ZooKeeper

61 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd /dev/null etcd ZooKeeper

62 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd /dev/null OutOfMemoryError Kernel::exec etcd ZooKeeper

63 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT systemd OPEN_MAX /dev/null OutOfMemoryError Kernel::exec etcd ZooKeeper

64 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT systemd /dev/null OPEN_MAX OutOfMemoryError Kernel::exec etcd ZooKeeper

65 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT QEMU systemd /dev/null OPEN_MAX OutOfMemoryError Kernel::exec etcd ZooKeeper

66 // assume messages is an RDD of log message texts val oneletter = new scala.util.matching.regex(".*([a-za-z]).*") def tokens(s: String, post: String=>String): Seq[String] = collapsewhitespace(s).split(" ").map(s => post(strippunctuation(s))).collect { case oneletter(_) => token } val tokenseqs = messages.map(line => tokens(line), identity[string]) val w2v = new org.apache.spark.mllib.feature.word2vec val model = w2v.fit(tokenseqs)

67 Insights from log messages

68 Insights from log messages nova

69 Insights from log messages containers glance nova vm images instances

70 Insights from log messages password containers glance nova vm images instances

71 Insights from log messages opened password publickey accepted containers glance nova IPMI instances vm images

72 Insights from log messages opened password publickey accepted containers glance nova IPMI instances vm images sh

73 Insights from log messages opened password publickey accepted containers glance nova IPMI vm images instances AUDIT_SESSION /usr/bin/bash sh

74 VISUALIZING STRUCTURE and FINDING OUTLIERS

75 Multidimensional data

76 Multidimensional data [4,7]

77 Multidimensional data [4,7]

78 Multidimensional data [4,7] [2,3,5]

79 Multidimensional data [4,7] [2,3,5]

80 Multidimensional data [7,1,6,5,12, [4,7] [2,3,5] 8,9,2,2,4, 7,11,6,1,5]

81 Multidimensional data [7,1,6,5,12, [4,7] [2,3,5] 8,9,2,2,4, 7,11,6,1,5]

82 A linear approach: PCA

83 A linear approach: PCA

84

85 A nonlinear approach: t-sne

86 A nonlinear approach: t-sne p( )

87 A nonlinear approach: t-sne p( ) p( )

88

89 Tree-based approaches

90 Tree-based approaches if red yes if orange if!red no if!gray yes if!orange if!gray no

91 Tree-based approaches yes yes yes yes no no no no no yes no yes if red yes yes no yes no yes yes yes yes if orange if!red no no no no yes no no no yes yes no yes no if!gray yes yes no yes no yes no yes no if!orange no yes no yes yes no yes no no if!gray

92 Tree-based approaches yes yes yes yes no no no no no yes no yes if red yes yes no yes no yes yes yes yes if orange if!red no no no no yes no no no yes yes no yes no if!gray yes yes no yes no yes no yes no if!orange no yes no yes yes no yes no no if!gray

93 Self-organizing maps

94 Self-organizing maps

95 Training SOMs

96 Training SOMs

97 Training SOMs

98 Training SOMs

99 Self-organizing maps

100 Self-organizing maps

101 Self-organizing maps

102 Self-organizing maps

103 Self-organizing maps

104 Self-organizing maps

105 Self-organizing maps

106 Self-organizing maps

107 Self-organizing maps

108 Finding outliers with SOMs

109 Finding outliers with SOMs

110 Finding outliers with SOMs

111 Finding outliers with SOMs

112 Finding outliers with SOMs

113 Outliers in log data

114 Outliers in log data 0.95

115 Outliers in log data

116 Outliers in log data

117 Outliers in log data An outlier is any record whose best match was at least 4σ below the mean

118

119 Out of 310 million log records, we identified % as outliers.

120

121

122 Thirty most extreme outliers 10 Can not communicate with power supply 2. 9 Power supply 2 failed. 8 Power supply redundancy is lost. 1 Drive A is removed. 1 Can not communicate with power supply 1. 1 Power supply 1 failed.

123 LESSONS LEARNED and KEY TAKEAWAYS

124 Spark and ElasticSearch Data locality is an issue and caching is even more important than when running from local storage. The best practice is to warehouse ES indices to Parquet files and write Spark jobs against them.

125 Structured queries in Spark Always program defensively: mediate schemas, explicitly convert null values, etc. Use the Dataset API whenever possible to minimize boilerplate and benefit from query planning without (entirely) forsaking type safety.

126 Natural language tips Conventional preprocessing pipelines probably won t get you the best results on log message texts. Consider the sorts of words and word variants you ll want to recognize with your preprocessing pipeline. Use an approximate whitelist for unusual words.

127 Memory and partitioning Large JVM heaps can lead to appalling GC pauses. You probably won t be able to use all of your memory in a single JVM without executor timeouts. Consider memory budget at the driver when choosing a partitioning strategy.

128 Feature engineering Design your feature engineering pipeline so you can translate feature vectors back to factor values. Favor feature engineering effort over complex or novel learning algorithms. Prefer approaches that train interpretable models.

129

Log management with Logstash and Elasticsearch. Matteo Dessalvi

Log management with Logstash and Elasticsearch. Matteo Dessalvi Log management with Logstash and Elasticsearch Matteo Dessalvi HEPiX 2013 Outline Centralized logging. Logstash: what you can do with it. Logstash + Redis + Elasticsearch. Grok filtering. Elasticsearch

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1 Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010

More information

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

More information

xpaaerns on Spark, Shark, Tachyon and Mesos

xpaaerns on Spark, Shark, Tachyon and Mesos xpaaerns on Spark, Shark, Tachyon and Mesos Spark Summit 2014 Claudiu Barbura Sr. Director of Engineering A>geo Agenda xpa&erns Architecture From Hadoop to BDAS & our contribu

More information

Spark Application Carousel. Spark Summit East 2015

Spark Application Carousel. Spark Summit East 2015 Spark Application Carousel Spark Summit East 2015 About Today s Talk About Me: Vida Ha - Solutions Engineer at Databricks. Goal: For beginning/early intermediate Spark Developers. Motivate you to start

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came

More information

Kafka & Redis for Big Data Solutions

Kafka & Redis for Big Data Solutions Kafka & Redis for Big Data Solutions Christopher Curtin Head of Technical Research @ChrisCurtin About Me 25+ years in technology Head of Technical Research at Silverpop, an IBM Company (14 + years at Silverpop)

More information

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election

More information

Information Retrieval Elasticsearch

Information Retrieval Elasticsearch Information Retrieval Elasticsearch IR Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches

More information

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

Logitoring : log driven monitroing. the Rocket science. and. Eugene Istomin. IT Architect. e.istomin@edss.ee. Cone Center,Tallinn

Logitoring : log driven monitroing. the Rocket science. and. Eugene Istomin. IT Architect. e.istomin@edss.ee. Cone Center,Tallinn Logitoring : log driven monitroing and the Rocket science Eugene Istomin IT Architect e.istomin@edss.ee Cone Center,Tallinn Topic goal: talking about a common way of delivering, storing and analyzing monitoring/log/trace

More information

Simba Apache Cassandra ODBC Driver

Simba Apache Cassandra ODBC Driver Simba Apache Cassandra ODBC Driver with SQL Connector 2.2.0 Released 2015-11-13 These release notes provide details of enhancements, features, and known issues in Simba Apache Cassandra ODBC Driver with

More information

Spark: Cluster Computing with Working Sets

Spark: Cluster Computing with Working Sets Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs

More information

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН Zettabytes Petabytes ABC Sharding A B C Id Fn Ln Addr 1 Fred Jones Liberty, NY 2 John Smith?????? 122+ NoSQL Database

More information

Architectures for massive data management

Architectures for massive data management Architectures for massive data management Apache Spark Albert Bifet albert.bifet@telecom-paristech.fr October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache

More information

Real Time Data Processing using Spark Streaming

Real Time Data Processing using Spark Streaming Real Time Data Processing using Spark Streaming Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume Committer, Apache Sqoop Contributor, Apache Spark Author, Using Flume (O

More information

Log infrastructure & Zabbix. logging tools integration

Log infrastructure & Zabbix. logging tools integration Log infrastructure & Zabbix logging tools integration About me Me Linux System Architect @ ICTRA from Belgium (...) IT : Linux & SysAdmin work, Security, ICTRA ICT for Rail for Transport Mobility Security

More information

SUMMARY. Satellite is an application Two Sigma wrote to monitor, alert, and auto-administer our Mesos clusters.

SUMMARY. Satellite is an application Two Sigma wrote to monitor, alert, and auto-administer our Mesos clusters. SUMMARY Example of what Satellite can do: When swap falls below 1 day, turn the host off; if 20% of the cluster is turned off, send a pagerduty alert. Satellite is an application Two Sigma wrote to monitor,

More information

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC http://www.ignite.incubator.apache.org @apacheignite @dsetrakyan Agenda About In- Memory

More information

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385 brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and

More information

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015 COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Buzzwords Berlin - 2015 Big data analytics / machine

More information

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce

More information

Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

More information

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

More information

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing Evaluating NoSQL for Enterprise Applications Dirk Bartels VP Strategy & Marketing Agenda The Real Time Enterprise The Data Gold Rush Managing The Data Tsunami Analytics and Data Case Studies Where to go

More information

Scaling Pinterest. Yash Nelapati Ascii Artist. Pinterest Engineering. Saturday, August 31, 13

Scaling Pinterest. Yash Nelapati Ascii Artist. Pinterest Engineering. Saturday, August 31, 13 Scaling Pinterest Yash Nelapati Ascii Artist Pinterest is... An online pinboard to organize and share what inspires you. Growth March 2010 Page views per day Mar 2010 Jan 2011 Jan 2012 May 2012 Growth

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

Package RCassandra. R topics documented: February 19, 2015. Version 0.1-3 Title R/Cassandra interface

Package RCassandra. R topics documented: February 19, 2015. Version 0.1-3 Title R/Cassandra interface Version 0.1-3 Title R/Cassandra interface Package RCassandra February 19, 2015 Author Simon Urbanek Maintainer Simon Urbanek This packages provides

More information

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra January 2014 Legal Notices Apache Cassandra, Spark and Solr and their respective logos are trademarks or registered trademarks

More information

Using Logstash and Elasticsearch analytics capabilities as a BI tool

Using Logstash and Elasticsearch analytics capabilities as a BI tool Using Logstash and Elasticsearch analytics capabilities as a BI tool Pashalis Korosoglou, Pavlos Daoglou, Stefanos Laskaridis, Dimitris Daskopoulos Aristotle University of Thessaloniki, IT Center Outline

More information

WEBAPP PATTERN FOR APACHE TOMCAT - USER GUIDE

WEBAPP PATTERN FOR APACHE TOMCAT - USER GUIDE WEBAPP PATTERN FOR APACHE TOMCAT - USER GUIDE Contents 1. Pattern Overview... 3 Features 3 Getting started with the Web Application Pattern... 3 Accepting the Web Application Pattern license agreement...

More information

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015 Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015 Big Data Trends Bigger data volumes More data sources DBs, logs, behavioral & business event streams, sensors Faster analysis Next day to hours

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

Logging on a Shoestring Budget

Logging on a Shoestring Budget UNIVERSITY OF NEBRASKA AT OMAHA Logging on a Shoestring Budget James Harr jharr@unomaha.edu Agenda The Tools ElasticSearch Logstash Kibana redis Composing a Log System Q&A, Conclusions, Lessons Learned

More information

Apache Flink. Fast and Reliable Large-Scale Data Processing

Apache Flink. Fast and Reliable Large-Scale Data Processing Apache Flink Fast and Reliable Large-Scale Data Processing Fabian Hueske @fhueske 1 What is Apache Flink? Distributed Data Flow Processing System Focused on large-scale data analytics Real-time stream

More information

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH Real-time Data Analytics mit Elasticsearch Bernhard Pflugfelder inovex GmbH Bernhard Pflugfelder Big Data Engineer @ inovex Fields of interest: search analytics big data bi Working with: Lucene Solr Elasticsearch

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Quantifind s story: Building custom interactive data analytics infrastructure

Quantifind s story: Building custom interactive data analytics infrastructure Quantifind s story: Building custom interactive data analytics infrastructure Ryan LeCompte @ryanlecompte Scala Days 2015 Background Software Engineer at Quantifind @ryanlecompte ryan@quantifind.com http://github.com/ryanlecompte

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

More information

VMware Identity Manager Connector Installation and Configuration

VMware Identity Manager Connector Installation and Configuration VMware Identity Manager Connector Installation and Configuration VMware Identity Manager This document supports the version of each product listed and supports all subsequent versions until the document

More information

CS312 Solutions #6. March 13, 2015

CS312 Solutions #6. March 13, 2015 CS312 Solutions #6 March 13, 2015 Solutions 1. (1pt) Define in detail what a load balancer is and what problem it s trying to solve. Give at least two examples of where using a load balancer might be useful,

More information

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang The Big Data Ecosystem at LinkedIn Presented by Zhongfang Zhuang Based on the paper The Big Data Ecosystem at LinkedIn, written by Roshan Sumbaly, Jay Kreps, and Sam Shah. The Ecosystems Hadoop Ecosystem

More information

Big Data Research in the AMPLab: BDAS and Beyond

Big Data Research in the AMPLab: BDAS and Beyond Big Data Research in the AMPLab: BDAS and Beyond Michael Franklin UC Berkeley 1 st Spark Summit December 2, 2013 UC BERKELEY AMPLab: Collaborative Big Data Research Launched: January 2011, 6 year planned

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

More information

Ankush Cluster Manager - Hadoop2 Technology User Guide

Ankush Cluster Manager - Hadoop2 Technology User Guide Ankush Cluster Manager - Hadoop2 Technology User Guide Ankush User Manual 1.5 Ankush User s Guide for Hadoop2, Version 1.5 This manual, and the accompanying software and other documentation, is protected

More information

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Highly competitive enterprises are increasingly finding ways to maximize and accelerate

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark. Fast, Interactive, Language- Integrated Cluster Computing Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC

More information

ITG Software Engineering

ITG Software Engineering Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.

More information

Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc

Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc Big Data, Fast Data, Complex Data Jans Aasman Franz Inc Private, founded 1984 AI, Semantic Technology, professional services Now in Oakland Franz Inc Who We Are (1 (2 3) (4 5) (6 7) (8 9) (10 11) (12

More information

NoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases

NoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases NoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases Background Inspiration: postgresapp.com demo.beatstream.fi (modern desktop browsers without

More information

The Melvyl Recommender Project: Final Report

The Melvyl Recommender Project: Final Report Performance Testing Testing Goals Current instances of XTF used in production serve thousands to hundreds of thousands of documents. One of our investigation goals was to experiment with a much larger

More information

HADOOP. Revised 10/19/2015

HADOOP. Revised 10/19/2015 HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...

More information

Linux A first-class citizen in Windows Azure. Bruno Terkaly bterkaly@microsoft.com Principal Software Engineer Mobile/Cloud/Startup/Enterprise

Linux A first-class citizen in Windows Azure. Bruno Terkaly bterkaly@microsoft.com Principal Software Engineer Mobile/Cloud/Startup/Enterprise Linux A first-class citizen in Windows Azure Bruno Terkaly bterkaly@microsoft.com Principal Software Engineer Mobile/Cloud/Startup/Enterprise 1 First, I am software developer (C/C++, ASM, C#, Java, Node.js,

More information

Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera

Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera Designing Agile Data Pipelines Ashish Singh Software Engineer, Cloudera About Me Software Engineer @ Cloudera Contributed to Kafka, Hive, Parquet and Sentry Used to work in HPC @singhasdev 204 Cloudera,

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

Big Data Analytics with Cassandra, Spark & MLLib

Big Data Analytics with Cassandra, Spark & MLLib Big Data Analytics with Cassandra, Spark & MLLib Matthias Niehoff AGENDA Spark Basics In A Cluster Cassandra Spark Connector Use Cases Spark Streaming Spark SQL Spark MLLib Live Demo CQL QUERYING LANGUAGE

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!

More information

Spark Job Server. Evan Chan and Kelvin Chu. Date

Spark Job Server. Evan Chan and Kelvin Chu. Date Spark Job Server Evan Chan and Kelvin Chu Date Overview Why We Needed a Job Server Created at Ooyala in 2013 Our vision for Spark is as a multi-team big data service What gets repeated by every team: Bastion

More information

A Scalable Data Transformation Framework using the Hadoop Ecosystem

A Scalable Data Transformation Framework using the Hadoop Ecosystem A Scalable Data Transformation Framework using the Hadoop Ecosystem Raj Nair Director Data Platform Kiru Pakkirisamy CTO AGENDA About Penton and Serendio Inc Data Processing at Penton PoC Use Case Functional

More information

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services

More information

Efficient Management of System Logs using a Cloud Radoslav Bodó, Daniel Kouřil CESNET. ISGC 2013, March 2013

Efficient Management of System Logs using a Cloud Radoslav Bodó, Daniel Kouřil CESNET. ISGC 2013, March 2013 Efficient Management of System Logs using a Cloud Radoslav Bodó, Daniel Kouřil CESNET ISGC 2013, March 2013 Agenda Introduction Collecting logs Log Processing Advanced analysis Resume Introduction Status

More information

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy Native Connectivity to Big Data Sources in MicroStrategy 10 Presented by: Raja Ganapathy Agenda MicroStrategy supports several data sources, including Hadoop Why Hadoop? How does MicroStrategy Analytics

More information

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12 Distributed Calculus with Hadoop MapReduce inside Orange Search Engine What is Big Data? $ 5 billions (2012) to $ 50 billions (by 2017) Forbes «Big Data is the new definitive source of competitive advantage

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

Trend Micro Big Data Platform and Apache Bigtop. 葉 祐 欣 (Evans Ye) Big Data Conference 2015

Trend Micro Big Data Platform and Apache Bigtop. 葉 祐 欣 (Evans Ye) Big Data Conference 2015 Trend Micro Big Data Platform and Apache Bigtop 葉 祐 欣 (Evans Ye) Big Data Conference 2015 Who am I Apache Bigtop PMC member Apache Big Data Europe 2015 Speaker Software Engineer @ Trend Micro Develop big

More information

Streaming items through a cluster with Spark Streaming

Streaming items through a cluster with Spark Streaming Streaming items through a cluster with Spark Streaming Tathagata TD Das @tathadas CME 323: Distributed Algorithms and Optimization Stanford, May 6, 2015 Who am I? > Project Management Committee (PMC) member

More information

Graylog2 Lennart Koopmann, OSDC 2014. @_lennart / www.graylog2.org

Graylog2 Lennart Koopmann, OSDC 2014. @_lennart / www.graylog2.org Graylog2 Lennart Koopmann, OSDC 2014 @_lennart / www.graylog2.org About me 25 years old Living in Hamburg, Germany @_lennart on Twitter Co-Founder of TORCH - The Graylog2 company. Graylog2 history Started

More information

Using the VCDS Application Monitoring Tool

Using the VCDS Application Monitoring Tool CHAPTER 5 This chapter describes how to use Cisco VQE Client Configuration Delivery Server (VCDS) Application Monitoring Tool (AMT). VCDS is a software component installed on each VQE Tools server, the

More information

Java Monitoring. Stuff You Can Get For Free (And Stuff You Can t) Paul Jasek Sales Engineer

Java Monitoring. Stuff You Can Get For Free (And Stuff You Can t) Paul Jasek Sales Engineer Java Monitoring Stuff You Can Get For Free (And Stuff You Can t) Paul Jasek Sales Engineer A Bit About Me Current: Past: Pre-Sales Engineer (1997 present) WaveMaker Wily Persistence GemStone Application

More information

Processing millions of logs with Logstash

Processing millions of logs with Logstash and integrating with Elasticsearch, Hadoop and Cassandra November 21, 2014 About me My name is Valentin Fischer-Mitoiu and I work for the University of Vienna. More specificaly in a group called Domainis

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here> s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline

More information

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC http://www.ignite.incubator.apache.org #apacheignite Agenda Apache Ignite (tm) In- Memory

More information

Introduction to Spark

Introduction to Spark Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks) Quick Demo Quick Demo API Hooks Scala / Java All Java libraries *.jar http://www.scala- lang.org Python Anaconda: https://

More information

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Realtime Apache Hadoop at Facebook Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Agenda 1 Why Apache Hadoop and HBase? 2 Quick Introduction to Apache HBase 3 Applications of HBase at

More information

Data Services Advisory

Data Services Advisory Data Services Advisory Modern Datastores An Introduction Created by: Strategy and Transformation Services Modified Date: 8/27/2014 Classification: DRAFT SAFE HARBOR STATEMENT This presentation contains

More information

Log Analysis with the ELK Stack (Elasticsearch, Logstash and Kibana) Gary Smith, Pacific Northwest National Laboratory

Log Analysis with the ELK Stack (Elasticsearch, Logstash and Kibana) Gary Smith, Pacific Northwest National Laboratory Log Analysis with the ELK Stack (Elasticsearch, Logstash and Kibana) Gary Smith, Pacific Northwest National Laboratory A Little Context! The Five Golden Principles of Security! Know your system! Principle

More information

Creating Big Data Applications with Spring XD

Creating Big Data Applications with Spring XD Creating Big Data Applications with Spring XD Thomas Darimont @thomasdarimont THE FASTEST PATH TO NEW BUSINESS VALUE Journey Introduction Concepts Applications Outlook 3 Unless otherwise indicated, these

More information

Apache Spark and Distributed Programming

Apache Spark and Distributed Programming Apache Spark and Distributed Programming Concurrent Programming Keijo Heljanko Department of Computer Science University School of Science November 25th, 2015 Slides by Keijo Heljanko Apache Spark Apache

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

Building Your Big Data Team

Building Your Big Data Team Building Your Big Data Team With all the buzz around Big Data, many companies have decided they need some sort of Big Data initiative in place to stay current with modern data management requirements.

More information

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016 Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016 Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible

More information

Native Connectivity to Big Data Sources in MSTR 10

Native Connectivity to Big Data Sources in MSTR 10 Native Connectivity to Big Data Sources in MSTR 10 Bring All Relevant Data to Decision Makers Support for More Big Data Sources Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single

More information