Data science for the datacenter: processing logs with Apache Spark. William Benton Red Hat Emerging Technology
|
|
- Peter Weaver
- 8 years ago
- Views:
Transcription
1 Data science for the datacenter: processing logs with Apache Spark William Benton Red Hat Emerging Technology
2 BACKGROUND
3 Challenges of log data
4 Challenges of log data SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='crit' AND msg LIKE '%failure%' GROUP BY hostname, hour
5 Challenges of log data SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='crit' AND msg LIKE '%failure%' GROUP BY hostname, hour 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00
6 Challenges of log data postgres WARN CRIT DEBUG httpd GET GET GET POST GET (404) syslog WARN WARN (ca. 2000)
7 Challenges of log data postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT 2016) Django WARN syslog (ca. WARN haproxy k8s WARN DEBUG CRIT WARN WARN
8 Challenges of log data postgres WARN CRIT Cassandra CRIT postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN syslog WARN Rails WARN CouchDB CRIT redis CRIT CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT httpd GET GET (404) POST httpd PUT (500) GET PUT Django WARN syslog WARN Django WARN syslog WARN haproxy postgres httpd syslog k8s WARN CRIT GET GET GET POST WARN WARN DEBUG CRIT WARN WARN Cassandra nginx Rails CRIT GET POST PUT POST WARN haproxy postgres httpd syslog k8s WARN CRIT GET GET GET POST WARN WARN DEBUG CRIT WARN WARN Cassandra nginx Rails CRIT GET POST PUT POST WARN How many services are CouchDB httpd GET CRIT GET (404) POST redis CRIT httpd PUT (500) GET PUT CouchDB httpd GET CRIT GET (404) POST redis CRIT httpd PUT (500) GET PUT generating logs in your Django WARN syslog WARN Django WARN syslog WARN haproxy WARN DEBUG CRIT haproxy WARN DEBUG CRIT k8s WARN WARN k8s WARN WARN datacenter today? postgres WARN CRIT Cassandra CRIT postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN syslog WARN Rails WARN CouchDB CRIT redis CRIT CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT httpd GET GET (404) POST httpd PUT (500) GET PUT Django WARN syslog WARN Django WARN syslog WARN haproxy WARN DEBUG CRIT haproxy WARN DEBUG CRIT k8s WARN WARN k8s WARN WARN
9 Apache Spark Graph Query ML Streaming Spark core (distributed collections, scheduler) ad hoc Mesos YARN
10 DATA INGEST
11 Collecting log data collecting Ingesting live log data via rsyslog, logstash, fluentd
12 Collecting log data collecting normalizing Ingesting live log data via rsyslog, logstash, fluentd Reconciling log record metadata across sources
13 Collecting log data collecting normalizing warehousing Ingesting live log data via rsyslog, logstash, fluentd Reconciling log record metadata across sources Storing normalized records in ES indices
14 Collecting log data warehousing Storing normalized records in ES indices
15 Collecting log data warehousing analysis Storing normalized records in ES indices cache warehoused data as Parquet files on Gluster volume local to Spark cluster
16 Schema mediation
17 Schema mediation
18 Schema mediation
19 Schema mediation timestamp, level, host, IP addresses, message, &c. rsyslog-style metadata, like app name, facility, &c.
20 Structured queries at scale logs.select("level").distinct.as[string].collect
21 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert
22 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert logs.groupby($"level", $"rsyslog.app_name").agg(count("level").as("total")).orderby($"total".desc)
23 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert logs.groupby($"level", $"rsyslog.app_name").agg(count("level").as("total")).orderby($"total".desc) kubelet kube-proxy ERR journal systemd
24 FEATURE ENGINEERING
25 From log records to vectors What does it mean for two log records to be similar?
26 From log records to vectors What does it mean for two log records to be similar? red green blue orange
27 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001
28 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns
29 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > 00010
30 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > red pancakes orange waffles
31 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > red pancakes orange waffles -> >
32 From log records to vectors What does it mean for two log records to be similar?
33 From log records to vectors What does it mean for two log records to be similar? {level:, hostname: fred, timestamp: " :00:01",...} {level: DEBUG, hostname: barney, timestamp: " :00:02",...} {level:, hostname: fred, timestamp: " :00:03",...} {level: ERR, hostname: barney, timestamp: " :00:03",...} {level: ALERT, hostname: wilma, timestamp: " :00:03",...}
34 From log records to vectors What does it mean for two log records to be similar? {level:, hostname: fred, timestamp: " :00:01",...} {level: DEBUG, hostname: barney, timestamp: " :00:02",...} {level:, hostname: fred, timestamp: " :00:03",...} {level: ERR, hostname: barney, timestamp: " :00:03",...} {level: ALERT, hostname: wilma, timestamp: " :00:03",...} -> > > > >
35 Similarity and distance
36 Similarity and distance
37 Similarity and distance (q - p) (q - p)
38 Similarity and distance n p i - q i i=1
39 Similarity and distance n p i - q i i=1
40 Similarity and distance p q p q
41 Similarity and distance
42 Similarity and distance
43 Similarity and distance
44 Similarity and distance =.7
45 Other interesting features host01 WARN WAR host02 WARN DEBUG host03 WARN WARN
46 Other interesting features host01 WARN host02 DEBUG WARN DEBUG host03 FO WARN
47 Other interesting features host01 DEBUG WARN BUG host02 WARN host03 WARN WARN WARN WARN
48 Other interesting features : Everything is great! Just checking in to let you know I m OK.
49 Other interesting features : Everything is great! Just checking in to let you know I m OK. CRIT: No requests in last hour; suspending running app containers.
50 Other interesting features : Everything is great! Just checking in to let you know I m OK. CRIT: No requests in last hour; suspending running app containers. : Phoenix datacenter is on fire; may not rise from ashes.
51 NATURAL LANGUAGE and LOG DATA
52 Introducing word2vec Madrid Spain France
53 Introducing word2vec Madrid Spain France
54 Introducing word2vec Madrid Spain France
55 Introducing word2vec Madrid Spain France
56 Introducing word2vec Madrid Spain France
57 Introducing word2vec Madrid Spain Paris France v("madrid") - v("spain") + v("france") v("paris")
58 Preprocessing text What is a word? How are log words different from words found in conventional prose?
59 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot
60 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd etcd ZooKeeper
61 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd /dev/null etcd ZooKeeper
62 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd /dev/null OutOfMemoryError Kernel::exec etcd ZooKeeper
63 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT systemd OPEN_MAX /dev/null OutOfMemoryError Kernel::exec etcd ZooKeeper
64 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT systemd /dev/null OPEN_MAX OutOfMemoryError Kernel::exec etcd ZooKeeper
65 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT QEMU systemd /dev/null OPEN_MAX OutOfMemoryError Kernel::exec etcd ZooKeeper
66 // assume messages is an RDD of log message texts val oneletter = new scala.util.matching.regex(".*([a-za-z]).*") def tokens(s: String, post: String=>String): Seq[String] = collapsewhitespace(s).split(" ").map(s => post(strippunctuation(s))).collect { case oneletter(_) => token } val tokenseqs = messages.map(line => tokens(line), identity[string]) val w2v = new org.apache.spark.mllib.feature.word2vec val model = w2v.fit(tokenseqs)
67 Insights from log messages
68 Insights from log messages nova
69 Insights from log messages containers glance nova vm images instances
70 Insights from log messages password containers glance nova vm images instances
71 Insights from log messages opened password publickey accepted containers glance nova IPMI instances vm images
72 Insights from log messages opened password publickey accepted containers glance nova IPMI instances vm images sh
73 Insights from log messages opened password publickey accepted containers glance nova IPMI vm images instances AUDIT_SESSION /usr/bin/bash sh
74 VISUALIZING STRUCTURE and FINDING OUTLIERS
75 Multidimensional data
76 Multidimensional data [4,7]
77 Multidimensional data [4,7]
78 Multidimensional data [4,7] [2,3,5]
79 Multidimensional data [4,7] [2,3,5]
80 Multidimensional data [7,1,6,5,12, [4,7] [2,3,5] 8,9,2,2,4, 7,11,6,1,5]
81 Multidimensional data [7,1,6,5,12, [4,7] [2,3,5] 8,9,2,2,4, 7,11,6,1,5]
82 A linear approach: PCA
83 A linear approach: PCA
84
85 A nonlinear approach: t-sne
86 A nonlinear approach: t-sne p( )
87 A nonlinear approach: t-sne p( ) p( )
88
89 Tree-based approaches
90 Tree-based approaches if red yes if orange if!red no if!gray yes if!orange if!gray no
91 Tree-based approaches yes yes yes yes no no no no no yes no yes if red yes yes no yes no yes yes yes yes if orange if!red no no no no yes no no no yes yes no yes no if!gray yes yes no yes no yes no yes no if!orange no yes no yes yes no yes no no if!gray
92 Tree-based approaches yes yes yes yes no no no no no yes no yes if red yes yes no yes no yes yes yes yes if orange if!red no no no no yes no no no yes yes no yes no if!gray yes yes no yes no yes no yes no if!orange no yes no yes yes no yes no no if!gray
93 Self-organizing maps
94 Self-organizing maps
95 Training SOMs
96 Training SOMs
97 Training SOMs
98 Training SOMs
99 Self-organizing maps
100 Self-organizing maps
101 Self-organizing maps
102 Self-organizing maps
103 Self-organizing maps
104 Self-organizing maps
105 Self-organizing maps
106 Self-organizing maps
107 Self-organizing maps
108 Finding outliers with SOMs
109 Finding outliers with SOMs
110 Finding outliers with SOMs
111 Finding outliers with SOMs
112 Finding outliers with SOMs
113 Outliers in log data
114 Outliers in log data 0.95
115 Outliers in log data
116 Outliers in log data
117 Outliers in log data An outlier is any record whose best match was at least 4σ below the mean
118
119 Out of 310 million log records, we identified % as outliers.
120
121
122 Thirty most extreme outliers 10 Can not communicate with power supply 2. 9 Power supply 2 failed. 8 Power supply redundancy is lost. 1 Drive A is removed. 1 Can not communicate with power supply 1. 1 Power supply 1 failed.
123 LESSONS LEARNED and KEY TAKEAWAYS
124 Spark and ElasticSearch Data locality is an issue and caching is even more important than when running from local storage. The best practice is to warehouse ES indices to Parquet files and write Spark jobs against them.
125 Structured queries in Spark Always program defensively: mediate schemas, explicitly convert null values, etc. Use the Dataset API whenever possible to minimize boilerplate and benefit from query planning without (entirely) forsaking type safety.
126 Natural language tips Conventional preprocessing pipelines probably won t get you the best results on log message texts. Consider the sorts of words and word variants you ll want to recognize with your preprocessing pipeline. Use an approximate whitelist for unusual words.
127 Memory and partitioning Large JVM heaps can lead to appalling GC pauses. You probably won t be able to use all of your memory in a single JVM without executor timeouts. Consider memory budget at the driver when choosing a partitioning strategy.
128 Feature engineering Design your feature engineering pipeline so you can translate feature vectors back to factor values. Favor feature engineering effort over complex or novel learning algorithms. Prefer approaches that train interpretable models.
129
Log management with Logstash and Elasticsearch. Matteo Dessalvi
Log management with Logstash and Elasticsearch Matteo Dessalvi HEPiX 2013 Outline Centralized logging. Logstash: what you can do with it. Logstash + Redis + Elasticsearch. Grok filtering. Elasticsearch
More informationUnified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationBig Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
More informationSpark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1
Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010
More informationThe Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
More informationxpaaerns on Spark, Shark, Tachyon and Mesos
xpaaerns on Spark, Shark, Tachyon and Mesos Spark Summit 2014 Claudiu Barbura Sr. Director of Engineering A>geo Agenda xpa&erns Architecture From Hadoop to BDAS & our contribu
More informationSpark Application Carousel. Spark Summit East 2015
Spark Application Carousel Spark Summit East 2015 About Today s Talk About Me: Vida Ha - Solutions Engineer at Databricks. Goal: For beginning/early intermediate Spark Developers. Motivate you to start
More informationMoving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
More informationCS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving
More informationSpark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
More informationBig Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic
Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the
More informationThe Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
More informationApache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came
More informationKafka & Redis for Big Data Solutions
Kafka & Redis for Big Data Solutions Christopher Curtin Head of Technical Research @ChrisCurtin About Me 25+ years in technology Head of Technical Research at Silverpop, an IBM Company (14 + years at Silverpop)
More informationFacebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation
Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election
More informationInformation Retrieval Elasticsearch
Information Retrieval Elasticsearch IR Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches
More informationApache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack
Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets
More informationApache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
More informationLogitoring : log driven monitroing. the Rocket science. and. Eugene Istomin. IT Architect. e.istomin@edss.ee. Cone Center,Tallinn
Logitoring : log driven monitroing and the Rocket science Eugene Istomin IT Architect e.istomin@edss.ee Cone Center,Tallinn Topic goal: talking about a common way of delivering, storing and analyzing monitoring/log/trace
More informationSimba Apache Cassandra ODBC Driver
Simba Apache Cassandra ODBC Driver with SQL Connector 2.2.0 Released 2015-11-13 These release notes provide details of enhancements, features, and known issues in Simba Apache Cassandra ODBC Driver with
More informationSpark: Cluster Computing with Working Sets
Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs
More informationВовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН
Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН Zettabytes Petabytes ABC Sharding A B C Id Fn Ln Addr 1 Fred Jones Liberty, NY 2 John Smith?????? 122+ NoSQL Database
More informationArchitectures for massive data management
Architectures for massive data management Apache Spark Albert Bifet albert.bifet@telecom-paristech.fr October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache
More informationReal Time Data Processing using Spark Streaming
Real Time Data Processing using Spark Streaming Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume Committer, Apache Sqoop Contributor, Apache Spark Author, Using Flume (O
More informationLog infrastructure & Zabbix. logging tools integration
Log infrastructure & Zabbix logging tools integration About me Me Linux System Architect @ ICTRA from Belgium (...) IT : Linux & SysAdmin work, Security, ICTRA ICT for Rail for Transport Mobility Security
More informationSUMMARY. Satellite is an application Two Sigma wrote to monitor, alert, and auto-administer our Mesos clusters.
SUMMARY Example of what Satellite can do: When swap falls below 1 day, turn the host off; if 20% of the cluster is turned off, send a pagerduty alert. Satellite is an application Two Sigma wrote to monitor,
More informationApache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source
Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC http://www.ignite.incubator.apache.org @apacheignite @dsetrakyan Agenda About In- Memory
More informationbrief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and
More informationCOSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015
COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt
More informationIn-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Buzzwords Berlin - 2015 Big data analytics / machine
More informationSOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce
More informationBayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
More informationMonitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
More informationEvaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing
Evaluating NoSQL for Enterprise Applications Dirk Bartels VP Strategy & Marketing Agenda The Real Time Enterprise The Data Gold Rush Managing The Data Tsunami Analytics and Data Case Studies Where to go
More informationScaling Pinterest. Yash Nelapati Ascii Artist. Pinterest Engineering. Saturday, August 31, 13
Scaling Pinterest Yash Nelapati Ascii Artist Pinterest is... An online pinboard to organize and share what inspires you. Growth March 2010 Page views per day Mar 2010 Jan 2011 Jan 2012 May 2012 Growth
More informationInfomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
More informationPackage RCassandra. R topics documented: February 19, 2015. Version 0.1-3 Title R/Cassandra interface
Version 0.1-3 Title R/Cassandra interface Package RCassandra February 19, 2015 Author Simon Urbanek Maintainer Simon Urbanek This packages provides
More informationJVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra
JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra January 2014 Legal Notices Apache Cassandra, Spark and Solr and their respective logos are trademarks or registered trademarks
More informationUsing Logstash and Elasticsearch analytics capabilities as a BI tool
Using Logstash and Elasticsearch analytics capabilities as a BI tool Pashalis Korosoglou, Pavlos Daoglou, Stefanos Laskaridis, Dimitris Daskopoulos Aristotle University of Thessaloniki, IT Center Outline
More informationWEBAPP PATTERN FOR APACHE TOMCAT - USER GUIDE
WEBAPP PATTERN FOR APACHE TOMCAT - USER GUIDE Contents 1. Pattern Overview... 3 Features 3 Getting started with the Web Application Pattern... 3 Accepting the Web Application Pattern license agreement...
More informationPulsar Realtime Analytics At Scale. Tony Ng April 14, 2015
Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015 Big Data Trends Bigger data volumes More data sources DBs, logs, behavioral & business event streams, sensors Faster analysis Next day to hours
More informationLambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
More informationLogging on a Shoestring Budget
UNIVERSITY OF NEBRASKA AT OMAHA Logging on a Shoestring Budget James Harr jharr@unomaha.edu Agenda The Tools ElasticSearch Logstash Kibana redis Composing a Log System Q&A, Conclusions, Lessons Learned
More informationApache Flink. Fast and Reliable Large-Scale Data Processing
Apache Flink Fast and Reliable Large-Scale Data Processing Fabian Hueske @fhueske 1 What is Apache Flink? Distributed Data Flow Processing System Focused on large-scale data analytics Real-time stream
More informationReal-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH
Real-time Data Analytics mit Elasticsearch Bernhard Pflugfelder inovex GmbH Bernhard Pflugfelder Big Data Engineer @ inovex Fields of interest: search analytics big data bi Working with: Lucene Solr Elasticsearch
More informationProgramming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
More informationQuantifind s story: Building custom interactive data analytics infrastructure
Quantifind s story: Building custom interactive data analytics infrastructure Ryan LeCompte @ryanlecompte Scala Days 2015 Background Software Engineer at Quantifind @ryanlecompte ryan@quantifind.com http://github.com/ryanlecompte
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
More informationVMware Identity Manager Connector Installation and Configuration
VMware Identity Manager Connector Installation and Configuration VMware Identity Manager This document supports the version of each product listed and supports all subsequent versions until the document
More informationCS312 Solutions #6. March 13, 2015
CS312 Solutions #6 March 13, 2015 Solutions 1. (1pt) Define in detail what a load balancer is and what problem it s trying to solve. Give at least two examples of where using a load balancer might be useful,
More informationThe Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang
The Big Data Ecosystem at LinkedIn Presented by Zhongfang Zhuang Based on the paper The Big Data Ecosystem at LinkedIn, written by Roshan Sumbaly, Jay Kreps, and Sam Shah. The Ecosystems Hadoop Ecosystem
More informationBig Data Research in the AMPLab: BDAS and Beyond
Big Data Research in the AMPLab: BDAS and Beyond Michael Franklin UC Berkeley 1 st Spark Summit December 2, 2013 UC BERKELEY AMPLab: Collaborative Big Data Research Launched: January 2011, 6 year planned
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationBig Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
More informationAnkush Cluster Manager - Hadoop2 Technology User Guide
Ankush Cluster Manager - Hadoop2 Technology User Guide Ankush User Manual 1.5 Ankush User s Guide for Hadoop2, Version 1.5 This manual, and the accompanying software and other documentation, is protected
More informationCapitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes
Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Highly competitive enterprises are increasingly finding ways to maximize and accelerate
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationHadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
More informationSpark. Fast, Interactive, Language- Integrated Cluster Computing
Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC
More informationITG Software Engineering
Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.
More informationBig Data, Fast Data, Complex Data. Jans Aasman Franz Inc
Big Data, Fast Data, Complex Data Jans Aasman Franz Inc Private, founded 1984 AI, Semantic Technology, professional services Now in Oakland Franz Inc Who We Are (1 (2 3) (4 5) (6 7) (8 9) (10 11) (12
More informationNoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases
NoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases Background Inspiration: postgresapp.com demo.beatstream.fi (modern desktop browsers without
More informationThe Melvyl Recommender Project: Final Report
Performance Testing Testing Goals Current instances of XTF used in production serve thousands to hundreds of thousands of documents. One of our investigation goals was to experiment with a much larger
More informationHADOOP. Revised 10/19/2015
HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...
More informationLinux A first-class citizen in Windows Azure. Bruno Terkaly bterkaly@microsoft.com Principal Software Engineer Mobile/Cloud/Startup/Enterprise
Linux A first-class citizen in Windows Azure Bruno Terkaly bterkaly@microsoft.com Principal Software Engineer Mobile/Cloud/Startup/Enterprise 1 First, I am software developer (C/C++, ASM, C#, Java, Node.js,
More informationDesigning Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera
Designing Agile Data Pipelines Ashish Singh Software Engineer, Cloudera About Me Software Engineer @ Cloudera Contributed to Kafka, Hive, Parquet and Sentry Used to work in HPC @singhasdev 204 Cloudera,
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationLarge-Scale Test Mining
Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding
More informationBig Data Analytics with Cassandra, Spark & MLLib
Big Data Analytics with Cassandra, Spark & MLLib Matthias Niehoff AGENDA Spark Basics In A Cluster Cassandra Spark Connector Use Cases Spark Streaming Spark SQL Spark MLLib Live Demo CQL QUERYING LANGUAGE
More informationChase Wu New Jersey Ins0tute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
More informationHadoop: The Definitive Guide
FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!
More informationSpark Job Server. Evan Chan and Kelvin Chu. Date
Spark Job Server Evan Chan and Kelvin Chu Date Overview Why We Needed a Job Server Created at Ooyala in 2013 Our vision for Spark is as a multi-team big data service What gets repeated by every team: Bastion
More informationA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem Raj Nair Director Data Platform Kiru Pakkirisamy CTO AGENDA About Penton and Serendio Inc Data Processing at Penton PoC Use Case Functional
More informationNear Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya
Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services
More informationEfficient Management of System Logs using a Cloud Radoslav Bodó, Daniel Kouřil CESNET. ISGC 2013, March 2013
Efficient Management of System Logs using a Cloud Radoslav Bodó, Daniel Kouřil CESNET ISGC 2013, March 2013 Agenda Introduction Collecting logs Log Processing Advanced analysis Resume Introduction Status
More informationNative Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy
Native Connectivity to Big Data Sources in MicroStrategy 10 Presented by: Raja Ganapathy Agenda MicroStrategy supports several data sources, including Hadoop Why Hadoop? How does MicroStrategy Analytics
More informationDistributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12
Distributed Calculus with Hadoop MapReduce inside Orange Search Engine What is Big Data? $ 5 billions (2012) to $ 50 billions (by 2017) Forbes «Big Data is the new definitive source of competitive advantage
More informationHADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction
More informationTrend Micro Big Data Platform and Apache Bigtop. 葉 祐 欣 (Evans Ye) Big Data Conference 2015
Trend Micro Big Data Platform and Apache Bigtop 葉 祐 欣 (Evans Ye) Big Data Conference 2015 Who am I Apache Bigtop PMC member Apache Big Data Europe 2015 Speaker Software Engineer @ Trend Micro Develop big
More informationStreaming items through a cluster with Spark Streaming
Streaming items through a cluster with Spark Streaming Tathagata TD Das @tathadas CME 323: Distributed Algorithms and Optimization Stanford, May 6, 2015 Who am I? > Project Management Committee (PMC) member
More informationGraylog2 Lennart Koopmann, OSDC 2014. @_lennart / www.graylog2.org
Graylog2 Lennart Koopmann, OSDC 2014 @_lennart / www.graylog2.org About me 25 years old Living in Hamburg, Germany @_lennart on Twitter Co-Founder of TORCH - The Graylog2 company. Graylog2 history Started
More informationUsing the VCDS Application Monitoring Tool
CHAPTER 5 This chapter describes how to use Cisco VQE Client Configuration Delivery Server (VCDS) Application Monitoring Tool (AMT). VCDS is a software component installed on each VQE Tools server, the
More informationJava Monitoring. Stuff You Can Get For Free (And Stuff You Can t) Paul Jasek Sales Engineer
Java Monitoring Stuff You Can Get For Free (And Stuff You Can t) Paul Jasek Sales Engineer A Bit About Me Current: Past: Pre-Sales Engineer (1997 present) WaveMaker Wily Persistence GemStone Application
More informationProcessing millions of logs with Logstash
and integrating with Elasticsearch, Hadoop and Cassandra November 21, 2014 About me My name is Valentin Fischer-Mitoiu and I work for the University of Vienna. More specificaly in a group called Domainis
More informationCOURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
More informationOracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
More informationApache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source
Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC http://www.ignite.incubator.apache.org #apacheignite Agenda Apache Ignite (tm) In- Memory
More informationIntroduction to Spark
Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks) Quick Demo Quick Demo API Hooks Scala / Java All Java libraries *.jar http://www.scala- lang.org Python Anaconda: https://
More informationRealtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens
Realtime Apache Hadoop at Facebook Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Agenda 1 Why Apache Hadoop and HBase? 2 Quick Introduction to Apache HBase 3 Applications of HBase at
More informationData Services Advisory
Data Services Advisory Modern Datastores An Introduction Created by: Strategy and Transformation Services Modified Date: 8/27/2014 Classification: DRAFT SAFE HARBOR STATEMENT This presentation contains
More informationLog Analysis with the ELK Stack (Elasticsearch, Logstash and Kibana) Gary Smith, Pacific Northwest National Laboratory
Log Analysis with the ELK Stack (Elasticsearch, Logstash and Kibana) Gary Smith, Pacific Northwest National Laboratory A Little Context! The Five Golden Principles of Security! Know your system! Principle
More informationCreating Big Data Applications with Spring XD
Creating Big Data Applications with Spring XD Thomas Darimont @thomasdarimont THE FASTEST PATH TO NEW BUSINESS VALUE Journey Introduction Concepts Applications Outlook 3 Unless otherwise indicated, these
More informationApache Spark and Distributed Programming
Apache Spark and Distributed Programming Concurrent Programming Keijo Heljanko Department of Computer Science University School of Science November 25th, 2015 Slides by Keijo Heljanko Apache Spark Apache
More informationHDP Hadoop From concept to deployment.
HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some
More informationQsoft Inc www.qsoft-inc.com
Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:
More informationMapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues
More informationBuilding Your Big Data Team
Building Your Big Data Team With all the buzz around Big Data, many companies have decided they need some sort of Big Data initiative in place to stay current with modern data management requirements.
More informationBig Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016
Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016 Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible
More informationNative Connectivity to Big Data Sources in MSTR 10
Native Connectivity to Big Data Sources in MSTR 10 Bring All Relevant Data to Decision Makers Support for More Big Data Sources Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single
More information