REAL TIME ANALYTICS WITH SPARK AND KAFKA
1 REAL TIME ANALYTICS WITH SPARK AND KAFKA
Nixon Patel, Chief Data Scientist, Brillio LLC.
Predictive Analytics & Business Insights
September 15th, 2015
2 AGENDA
Who Am I?
What is Real-Time?
What is Spark? Spark Overview
Why Spark? Why Spark Over Hadoop MapReduce?
How Does Spark Work?
What is Spark Streaming?
What is Kafka? What is MLlib?
What is Lambda Architecture? Data Flow in Lambda Architecture
Spark Streaming as Real-Time Micro-Batching
Kafka as Distributed, Fault-Tolerant RT Broker
Real-Time Data Streaming Analytics Architecture
Spark Streaming Terminology: Discretized Stream Processing (DStream), Windowed DStream, Mapped DStream, Network Input DStream
Architecture for RT Analytics with Spark Streaming and Kafka
Technology Stack for the Demo
Analytics Model
Inferences Drawn with the Tableau Dashboard
3 WHO AM I?
Chief Data Scientist, Brillio.
SVP - Business Head, Emerging Technologies, Collabera, 2014.
Also a serial & parallel entrepreneur:
YantraSoft, a speech products company.
Bhrigus, a speech & ERP solutions company.
YantraeSolar, a 5 MW solar photovoltaic plant generating electricity from solar energy.
ecomserver, a global solutions & products company for telecom, IVR and web servers.
Objectware, a boutique company specialized in the design of high-end derivatives trading systems.
IBM Research Engineer, Research Triangle Park, NC.
IIT BTech (Hons), Chemical Engineering. NJIT MS, Computer Science. Rutgers Master of Business and Science in Analytics (ongoing).
4 WHAT IS REAL TIME (RT)?
Just as there is no truly unstructured data, there is no true RT, only Near Real Time (NRT).
NRT systems can respond to data as they receive it, without first persisting it in a DB ("the present").
"Present" means different things in different scenarios: for an options trader it is milliseconds; for an ecommerce site it is the shopper's attention span.
RT IS ABOUT THE ABILITY TO MAKE BETTER DECISIONS & TAKE MEANINGFUL ACTIONS AT THE RIGHT TIME:
DETECTING FRAUD WHILE SOMEONE IS SWIPING A CREDIT CARD
TRIGGERING AN OFFER WHILE THE SHOPPER IS STANDING IN THE CHECKOUT LINE
PLACING AN AD ON A WEBSITE WHILE SOMEONE IS READING A SPECIFIC ARTICLE
COMBINING & ANALYZING DATA SO YOU CAN TAKE THE RIGHT ACTION AT THE RIGHT TIME & THE RIGHT PLACE
5 WHAT IS SPARK?
Brief relevant historical information on Spark:
2004: Google publishes the MapReduce paper.
2008: Hadoop Summit (Yahoo).
Spark paper published by the AMP Lab.
Apache Spark becomes an Apache Top-Level Project.
July 15: Spark release.
Aug 18: Spark 1.5 preview release in Databricks.
6 SPARK OVERVIEW
Spark Stack: Spark SQL, Spark Streaming, MLlib (machine learning) and GraphX (graph) on top of Apache Spark.
Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Ease of Use: Write applications quickly in Java, Scala, Python, R. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python and R shells.
Generality: Combine SQL, streaming, and complex analytics. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming for stream processing of live data. You can combine these libraries seamlessly in the same application.
Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud; you can run it using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can access diverse data sources including HDFS, Cassandra, HBase, Hive, Tachyon, S3, and any Hadoop data source.
7 WHY SPARK?
Limitations of the MapReduce model:
The MR programming model is very complex and has high overhead in launching new M/R jobs.
Performance bottlenecks, since streams need multiple transformations.
Most machine learning algorithms are iterative, as multiple iterations improve results; with a disk-based approach, each iteration's output is written to disk, making it slow.
Spark unifies batch, real-time (interactive) and iterative apps in a single framework.
Spark's lazy evaluation of the lineage graph reduces wait states through better pipelining.
Spark makes optimized heap use of large memory spaces.
Use of the functional programming model makes creation and maintenance of apps simple.
HDFS friendly, unified toolset, pipeline oriented.
8 WHY SPARK OVER HADOOP MAPREDUCE?
Hadoop execution flow: Input Data on Disk -> MR1 -> Tuples (on disk) -> MR2 -> Tuples (on disk) -> MR3 -> Tuples (on disk) -> MR4 -> Output Data on Disk.
Spark execution flow: Input Data on Disk -> t1 -> RDD1 (in memory) -> t2 -> RDD2 (in memory) -> t3 -> RDD3 (in memory) -> t4 -> Output Data on Disk.

Daytona GraySort benchmark (databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-inlarge-scale-sorting.html):

                          Hadoop MR Record          Spark Record                Spark 1 PB
Data Size                 102.5 TB                  100 TB                      1000 TB
Elapsed Time              72 mins                   23 mins                     234 mins
# Nodes                   2100                      206                         190
# Cores                   50400 physical            6592 virtualized            6080 virtualized
Cluster Disk Throughput   3150 GB/s (est.)          618 GB/s                    570 GB/s
Network                   Dedicated DC, 10Gbps      Virtualized (EC2), 10Gbps   Virtualized (EC2), 10Gbps
Sort Benchmark Daytona Rules  Yes                   Yes                         No
Sort Rate                 1.42 TB/min               4.27 TB/min                 4.27 TB/min
Sort Rate/Node            0.67 GB/min               20.7 GB/min                 22.5 GB/min
9 HOW SPARK WORKS?
SparkContext: the user (developer) writes a driver program (sc = new SparkContext; rdd = sc.textFile("hdfs://..."); rdd.filter(...); rdd.cache; rdd.count; rdd.map), which talks through the Cluster Manager to Worker Nodes. Each Worker Node runs an Executor with Tasks and a Cache, reading from a DataNode in HDFS.
RDD (Resilient Distributed Dataset): partitions of data, with dependencies between partitions. A big collection of data with the following properties: immutable, lazily evaluated, cacheable, type inferred. Fault tolerant, with controlled partitioning to optimize data placement, and manipulated using a rich set of operators. Storage types: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY.
RDD operations:
Transformations create a new dataset from an existing one. Lazy in nature, executed only when some action is performed. Examples: map(func), filter(func), distinct().
Actions return a value or export data after performing a computation. Examples: count(), reduce(func), collect(), take().
Persistence caches a dataset in memory for future operations; store on disk or RAM or mixed. Examples: persist(), cache().
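To make the driver-program flow above concrete, here is a minimal runnable PySpark sketch; the local master, the HDFS path (borrowed from the Flume configuration later in the deck) and the filter predicate are illustrative choices, not part of the original slides:

from pyspark import SparkContext

# The driver program creates the SparkContext, which talks to the cluster manager.
sc = SparkContext(master="local[2]", appName="RDDBasics")

# Transformations (textFile, filter, map) are lazy: they only extend the lineage graph.
rdd = sc.textFile("hdfs://localhost:9000/user/analysis/tweets")
matched = rdd.filter(lambda line: "#obama" in line)
matched.cache()  # MEMORY_ONLY persistence, reused by the two actions below

# Actions trigger the actual distributed computation.
print(matched.count())
print(matched.map(lambda line: line.lower()).take(5))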
10 WHAT ARE THE CHALLENGES OF REAL-TIME STREAMING DATA PROCESSING?
RTSD processing challenges:
Complex collection of RT events, millions per second.
Needs event correlation using Complex Event Processing.
Flexible data input, data fail-over, snapshots.
Besides Volume & Variety, needs to handle Velocity & Veracity.
Needs parallel processing of the data being collected.
Needs fault tolerance.
Needs to handle events with low latency.
Needs to handle data in a distributed environment.
Real-time, fault-tolerant, scalable, in-stream processing with guaranteed data delivery.
[Diagram: listeners take incoming HTTP streams, application logs, data-partner feeds, batch uploads and custom connectors through queues into the stream processing application, whose outgoing destinations include a NoSQL database, Hadoop and an RDBMS.]
11 LAMBDA ARCHITECTURE*
LA satisfies the need for a robust system that is fault-tolerant, both against hardware failures and human mistakes, able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.
[Diagram: (1) New Data feeds both (2) the Batch Layer, which holds the Master Dataset, and (4) the Speed Layer, which holds the Real-time Views; (3) the Serving Layer holds the Batch Views; (5) queries merge Batch Views and Real-time Views.]
* (Nathan Marz)
12 DATA FLOW IN LAMBDA ARCHITECTURE
1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer has two functions: managing the master dataset (an immutable, append-only set of raw data) and pre-computing the batch views.
3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
5. Any incoming query can be answered by merging results from the batch views and the real-time views.
Seems like a complex solution. Besides, there is a general notion that the real-time layer is prone to FAILURE! Why?
Computation happens in memory, so a system crash will erase the state of the real-time system.
Bad deployments.
Out of memory.
Delayed upstream data will give inaccurate metrics.
13 SPARK STREAMING AS AN RT MICRO-BATCHING SYSTEM COMES TO THE RESCUE
"Since the real-time layer only compensates for the last few hours of data, everything the real-time layer computes is eventually overridden by the batch layer. So if you make a mistake or something goes wrong in the real-time layer, the batch layer will correct it." (Nathan Marz, Big Data: Principles and Best Practices of Scalable Real-Time Data Systems)
Spark Streaming, a micro-batch layer on top of core Spark, corrects the fault-tolerance issue.
Spark Streaming provides efficient and fault-tolerant stateful stream processing, and integrates with Spark's batch and interactive processing.
Wait a minute: has the issue raised in component #1 of the Lambda Architecture been addressed? Not yet!!!!
14 APACHE KAFKA COMES TO THE RESCUE FOR FAULT #1 IN LA
This is very well resolved by "The Log" from Jay Kreps.
The Log turns this mess: Espresso, Voldemort, Oracle, user tracking, operational logs and operational metrics each wired point-to-point into Hadoop, log search, monitoring, the data warehouse, the social graph, the rec engine, search, security and production services...
...into this: the same sources feeding a single Unified Log, from which Hadoop, log search, monitoring, the data warehouse, the social graph, the rec engine, search, security and production services all consume.
15 COMPONENTS OF THE REAL-TIME DATA STREAMING ANALYTICS ARCHITECTURE
Lambda Architecture stack:
1. Unified Log: Apache Kafka
2. Batch Layer: Hadoop for storage
3. Serving Layer: MySQL, Cassandra, NoSQL or other KV stores
4. Real-Time Layer: Spark Streaming
5. Visualization Layer: Tableau
[Diagram: the Lambda Architecture flow from slide 11 (New Data -> Batch Layer with Master Dataset -> Serving Layer with Batch Views; Speed Layer with Real-time Views; queries merge both), annotated with these components.]
16 WHAT IS SPARK STREAMING?
Extends Spark for doing large-scale stream processing.
Integrates with Spark's batch and interactive processing.
Scales to 100s of nodes and achieves second-scale latencies.
Provides a simple batch-like API for implementing complex algorithms.
Efficient and fault-tolerant stateful stream processing.
These batches are called Discretized Streams (DStreams).
Inputs (Kafka, Flume, HDFS, ZeroMQ, Twitter) -> Spark Streaming -> outputs (HDFS, databases, dashboards).
17 SPARK STREAMING TERMINOLOGY
DStreams: a sequence of mini-batches, where each mini-batch is represented as a Spark RDD.
Defined by its input source: Kafka, Twitter, HDFS, Flume, TCP sockets, Akka Actor.
Defined by a time window called the batch interval.
Each RDD in the stream contains the records received by Spark Streaming during one batch interval.
Each batch of the DStream is replicated as an RDD, and RDDs are replicated across the cluster for fault tolerance.
Each DStream operation results in an RDD transformation.
There are APIs to access these RDDs directly, as the sketch below shows.
Streaming flow: input data stream -> Spark Streaming -> batches of input data -> Spark Engine -> batches of processed data.
DStream to RDD: the RDD at time 1 holds data from time 0 to 1, the RDD at time 2 holds data from time 1 to 2, and so on.
RDDs from batch and stream can be combined.
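A minimal sketch of that direct RDD access; the socket source and local master are illustrative assumptions (the demo itself reads from Kafka):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(master="local[2]", appName="DStreamBasics")
ssc = StreamingContext(sc, 1)  # batch interval = 1 second

# A TCP socket input DStream, one of the input sources listed above.
lines = ssc.socketTextStream("localhost", 9999)

# Each batch interval yields one RDD; foreachRDD exposes that RDD directly.
def show_count(rdd):
    print("records in this batch: %d" % rdd.count())

lines.foreachRDD(show_count)

ssc.start()
ssc.awaitTermination()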
18 DSTREAM PROCESSING
Run a streaming computation as a series of very small, deterministic batch jobs:
Chop up the live stream into batches of X seconds.
Spark treats each batch of data as an RDD and processes it using RDD operations.
Finally, the processed results of the RDD operations are returned in batches.
Live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results
19 DSTREAM PROCESSING
Run a streaming computation as a series of very small, deterministic batch jobs:
Batch sizes as low as ½ second, latency ~1 second.
Potential for combining batch processing and streaming processing in the same system.
Live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results
20 EXAMPLE: WINDOWED DSTREAM
Windowing over the stream (0s 1s 2s 3s 4s 5s 6s 7s) is defined by a window and a slide.
By default: window = slide = batch duration.
21 EXAMPLE: WINDOWED DSTREAM
Window = 3s, Slide = 2s (over batches at 0s 1s 2s 3s 4s 5s 6s 7s).
The resulting DStream consists of 3-second micro-batches.
Each resulting micro-batch overlaps the preceding one by 1 second, as in the sketch below.
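A sketch of the same 3s/2s window in PySpark; the socket source and local master are illustrative assumptions:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(master="local[2]", appName="WindowedDStream")
ssc = StreamingContext(sc, 1)  # batch interval = 1 second
lines = ssc.socketTextStream("localhost", 9999)

# Window = 3 seconds, slide = 2 seconds, as in the slide above: each emitted
# micro-batch covers the last 3 batches and overlaps the previous window by 1 second.
windowed = lines.window(3, 2)
windowed.count().pprint()

ssc.start()
ssc.awaitTermination()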
22 EXAMPLE: MAPPED DSTREAM
Dependencies: a single parent DStream.
Slide interval: same as the parent DStream.
Compute function for time t: create a new RDD by applying the map function on the parent DStream's RDD of time t.

override def compute(time: Time): Option[RDD[U]] = {
  // getOrCompute returns the RDD of time t if it was already computed once, or generates it;
  // the map function is then applied to generate the new RDD.
  parent.getOrCompute(time).map(_.map[U](mapFunc))
}
23 ARCHITECTURE FOR REAL-TIME ANALYTICS WITH SPARK STREAMING & KAFKA
Batch-oriented layer: Twitter -> Kafka Producer -> Kafka Broker (ZK) -> Flume -> HDFS -> Hive, with scikit-learn building a model from the tweets in HDFS.
Streaming layer: Kafka Broker -> Spark Streaming (applying the scikit-learn model) -> MySQL -> Tableau Dashboard.
- The above architecture uses a combination of a batch-oriented layer and a streaming layer, as in the Lambda architecture.
- For the batch-oriented layer, Hive, an abstraction on top of MapReduce, is used.
- scikit-learn is used to create the model from the tweets in HDFS, after the tweets have been labelled.
- For processing the streaming tweets, the Spark framework is used.
24 TECHNOLOGY STACK USED FOR THE DEMO
Operating System: Ubuntu Server on Microsoft Azure
Languages: Java 1.8, Scala 2.10, Python
Database: MySQL
Big Data Software: Apache Flume, Apache Hive, Apache Hadoop, Apache Kafka, Apache Spark 1.4.1
25 KAFKA PRODUCER
[Architecture diagram from slide 23, with the Kafka Producer stage highlighted.]
26 KAFKA PRODUCER
The Kafka producer uses the Hosebird Client (hbc), developed by Twitter, to pull the data for the topics of interest, like the presidential candidates of the 2016 elections. Hbc is a Java HTTP client for consuming Twitter's Streaming API.
Once the tweets are pulled by hbc, they are published to the Kafka topic, where different subscribers can pull the tweets for further analysis.
The tweets are pulled and published in real time: as soon as someone tweets on the relevant topics, the tweet is pulled by hbc and published to the Kafka topic.
27 GETTING THE RELEVANT TWEETS FOR THE 2016 PRESIDENTIAL ELECTIONS
Users tweet about politics, movies, sports and other interesting topics. On the StatusesFilterEndpoint, the appropriate hashtags, keywords and Twitter user IDs have to be specified as below to get the relevant tweets. The StatusesFilterEndpoint is part of the Hbc API provided by Twitter. More details about the StatusesFilterEndpoint API can be found at
Below are the hashtags and keywords for which the tweets are being pulled:

endpoint.trackTerms(Lists.newArrayList("#election2016", "#2016", "#FeelTheBern", "#bernie2016", "#berniesanders", "#Walker16", "#scottwalker", "#gop", "#hillaryclinton", "#politics", "#donaldtrump", "#trump", "#teaparty", "#obama", "#libertarians", "#democrats", "#randpaul", "#tedcruz", "#republicans", "#jebbush", "#hillary2016", "#gopdebate", "#bencarson", "#usapresidentialelection", "Abortion", "Budget", "Economy", "Civil Rights", "Corporations", "Crime", "Defence spending", "Drugs", "Education", "Energy and Oil", "Environment", "Families and Children", "Foreign Policy", "Free Trade", "Government Reform", "Gun Control", "Health Care", "Homeland Security", "Immigration", "Infrastructure and Technology", "Jobs", "Social Security", "Tax Reform", "State welfare"));

Below are the Twitter user IDs for the presidential candidates:

endpoint.followings(Lists.newArrayList(new Long( ), new Long( ), new Long( ), new Long(939091), new Long( ), new Long( ), new Long( L), new Long( ), new Long( L), new Long( ), new Long( ), new Long( ), new Long( ), new Long( L), new Long( ), new Long( L), new Long( ), new Long( ), new Long( ), new Long( ), new Long( L), new Long( ), new Long( ), new Long( ),
28 CODE FOR PUSHING THE MESSAGES FROM THE KAFKA PRODUCER TO THE BROKER

// Create an instance of the Kafka producer
Producer<String, String> producer = new Producer<String, String>(producerConfig);

// Create an instance of Kafka KeyedMessage, which is passed to the Kafka broker later
KeyedMessage<String, String> message = null;
try {
    // Populate the KeyedMessage with the topic, key and the message
    message = new KeyedMessage<String, String>(topic, "key", queue.take().trim());
} catch (InterruptedException e) {
    e.printStackTrace();
}

// Send the message from the producer to the broker
producer.send(message);
29 KAFKA BROKER
[Architecture diagram from slide 23, with the Kafka Broker highlighted.]
30 KAFKA BROKER
The Kafka Broker provides a unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs. Kafka was developed by LinkedIn and is now a Top-Level Project of the Apache Software Foundation.
Producers -> Kafka Broker -> Consumers
The previous slide covered the Kafka producer. The Kafka consumers are Flume and Spark Streaming in this scenario.
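In this demo the consumers are Flume and Spark Streaming, but for illustration, a standalone consumer for the same topic could be written with the kafka-python package (an assumption; this package is not part of the demo stack):

from kafka import KafkaConsumer

# Subscribe to the topic the producer publishes to (see the Flume configuration).
consumer = KafkaConsumer("twitter-topic", bootstrap_servers="localhost:9092")
for record in consumer:
    print(record.value)  # one raw tweet per Kafka message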
31 FLUME
[Architecture diagram from slide 23, with the Flume stage highlighted.]
32 FLUME
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Flume has better connectivity to the Big Data ecosystem (HDFS, HBase) than Kafka, which makes it easy to develop applications with Flume. But Flume is not as reliable a system as Kafka: Kafka automatically replicates the events across machines, so the chances of losing data are minimized.
Flume is configured to pull the messages from Kafka (kafka-source), pass them to a memory channel, and have the channel send them to HDFS: Kafka Source -> Memory Channel -> HDFS Sink.
The next slide has the Flume configuration showing how the different Flume components (Kafka Source, Memory Channel, HDFS Sink) are defined and chained together into the flow shown above.
33 FLUME CONFIGURATION FILE

# Sources, channels, and sinks are defined per
# agent name, in this case flume1.
flume1.sources = kafka-source-1
flume1.channels = memory-channel-1
flume1.sinks = hdfs-sink-1

# For each source, channel, and sink, set standard properties.
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = localhost:2181
flume1.sources.kafka-source-1.topic = twitter-topic
flume1.sources.kafka-source-1.batchSize = 100
flume1.sources.kafka-source-1.batchDurationMillis = 100
flume1.sources.kafka-source-1.channels = memory-channel-1

# Other properties are specific to each type of
# source, channel, or sink. In this case, we
# specify the capacity of the memory channel.
flume1.channels.memory-channel-1.type = memory
flume1.channels.memory-channel-1.capacity =
flume1.channels.memory-channel-1.transactionCapacity = 100

# Specify the HDFS sink properties.
flume1.sinks.hdfs-sink-1.channel = memory-channel-1
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.hdfs.filePrefix = tweets
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.path = hdfs://localhost:9000/user/analysis/tweets
flume1.sinks.hdfs-sink-1.hdfs.rollCount = 10
flume1.sinks.hdfs-sink-1.hdfs.rollSize = 0
flume1.sinks.hdfs-sink-1.hdfs.rollInterval = 0
34 HIVE
[Architecture diagram from slide 23, with the Hive stage highlighted.]
35 HIVE
The tweets in HDFS are in JSON format. Hive, which provides a SQL layer of abstraction, is used for analysis of the JSON tweets in a batch-oriented fashion.
Out of the box, Hive can only understand CSV, TSV and a few other formats, but not JSON data, so a custom JSON SerDe (Serializer/Deserializer) has to be used. A SerDe allows Hive to read the data from a table, and write it back to HDFS, in any custom format.
Queries through Hive are batch-oriented in nature, so Spark Streaming has been used as discussed in the coming slides.
36 SPARK STREAMING
[Architecture diagram from slide 23, with the Spark Streaming stage highlighted.]
37 SPARK STREAMING
Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications. The main advantage of Spark Streaming is that it lets you reuse the same code for batch processing as well as for processing streaming data.
Spark has been configured to receive data from Kafka. More about the integration at
The Spark Streaming program has been developed in Python and uses scikit-learn to figure out the sentiment of each tweet in real time and populate the same in the MySQL database.
38 SPARK STREAMING CODE SNIPPET

import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Create the Spark context and Spark Streaming context
sc = SparkContext(master="spark://demo-ubuntu:7077", appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 1)

# Get the tweets from Kafka; Spark Streaming is a consumer of the Kafka broker
zkQuorum, topic = sys.argv[1:]
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})

# Figure out the sentiment of each tweet and write it to MySQL
kvs.map(lambda x: x[1]).map(classify_tweet).pprint()

ssc.start()
ssc.awaitTermination()
39 EXAMPLE: GET HASHTAGS FROM TWITTER

kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})

This creates a new DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data.
Flow: Twitter Streaming API -> Kafka topic (via ZooKeeper) -> tweets DStream, with batches at t, t+1, t+2, each stored in memory as an RDD (an immutable, distributed dataset).
40 EXAMPLE: GET HASHTAGS FROM TWITTER

kvs.map(lambda x: x[1]).map(classify_tweet).pprint()

Transformation: the classify_tweet Python function calculates the sentiment of the tweet and writes it to MySQL.
tweets DStream (t, t+1, t+2) -> map -> tweet sentiment is calculated and put in the MySQL database.
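The deck does not show classify_tweet itself; a minimal sketch of what such a function could look like, assuming the trained scikit-learn pipeline is pickled to disk and MySQL is reached via pymysql (both are assumptions, as are the credentials and the table and column names):

import pickle
import pymysql

# Assumption: the trained scikit-learn pipeline (n-gram vectorizer + random forest)
# was pickled to a file available on every Spark worker.
with open("sentiment_model.pkl", "rb") as f:
    model = pickle.load(f)

def classify_tweet(tweet):
    # Predict the sentiment of one tweet; the pipeline extracts n-grams internally.
    sentiment = "Positive" if model.predict([tweet])[0] == 1 else "Negative"
    # Opening a connection per tweet keeps the sketch simple; a pooled
    # connection per partition would be more efficient.
    conn = pymysql.connect(host="localhost", user="analysis",
                           password="secret", db="tweets")  # hypothetical credentials
    try:
        with conn.cursor() as cur:
            cur.execute("INSERT INTO classified_tweets (tweet, sentiment) VALUES (%s, %s)",
                        (tweet, sentiment))
        conn.commit()
    finally:
        conn.close()
    return (tweet, sentiment)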
41 SCIKIT-LEARN
[Architecture diagram from slide 23, with the scikit-learn model-building stage highlighted.]
42 ANALYTICS MODEL
In order to classify tweets as positive or negative, we built a model using the Random Forest classifier in Python's scikit-learn package.
We used a publicly available dataset of pre-classified tweets as our training data set.
We extracted n-grams of the text (uni-grams, bi-grams and tri-grams) and created dummy variables from them, where 1 indicates that an n-gram is present in a tweet. For each n-gram, the count indicates the number of times the string appears.
Using this, we can compute the TF-IDF to select important words:
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF(t) = log_e(total number of documents / number of documents with term t in it)
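A minimal sketch of the n-gram extraction and TF-IDF weighting described above, assuming scikit-learn's TfidfVectorizer (the three-tweet corpus is illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["jobs and the economy", "great debate tonight", "the economy needs jobs"]

# Uni-, bi- and tri-grams, as described above; the binary dummy variables would
# come from CountVectorizer(binary=True) instead of TF-IDF weights.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(tweets)        # sparse matrix: tweets x n-grams

print(X.shape)
print(vectorizer.get_feature_names()[:10])  # part of the extracted n-gram vocabulary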
43 ANALYTICS MODEL
The data set contained around 1500 variables, out of which the Random Forest selected a maximum of 100 in each tree. There were 30 trees in total (this was the point at which the error rate converged).
We split the data set in an 80:20 ratio for training and testing, trained the RF model on the 80% dataset, and used the 20% dataset for testing the results (a sample is shown below).
Using K-fold cross validation (with K=5), we tested the model across multiple splits, averaged out the results, and obtained an overall accuracy of ~80%.

Sentiment   Count   Prediction Accuracy
Negative            75.4%
Positive            84.2%
Overall             79.7%
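A sketch of the training setup described above, using scikit-learn's RandomForestClassifier; the randomly generated placeholder data stands in for the real labeled n-gram matrix:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split, cross_val_score

# Placeholder data: in the demo, X holds ~1500 n-gram dummy variables per tweet
# and y the pre-classified sentiment labels.
X = np.random.randint(0, 2, size=(1000, 1500))
y = np.random.randint(0, 2, size=1000)

# 30 trees, each considering at most 100 features, as described above.
rf = RandomForestClassifier(n_estimators=30, max_features=100)

# 80:20 split for training and testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
rf.fit(X_train, y_train)
print("held-out accuracy:", rf.score(X_test, y_test))

# 5-fold cross validation, averaged across the splits.
print("cv accuracy:", cross_val_score(rf, X, y, cv=5).mean())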
44 RANKING CANDIDATES ON THEIR SIMILARITIES
1. A similarity matrix is an n x n matrix of n candidates, where each candidate has a score with respect to every other candidate.
2. A candidate has the maximum score with himself, and a gradually decreasing score with candidates less like him/her. (In the matrix shown, Chris Christie has the max score with himself.)
3. The scoring used some of the following metrics:
Overall popularity of the candidate.
The positive sentiment towards the candidate.
The issues being discussed, and the sentiment towards the candidate for his/her stand on each issue.
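The exact scoring formula is not given in the deck; as a toy sketch, one way to produce such a matrix is to reduce each candidate to a feature vector (popularity, positive-sentiment share, per-issue sentiment) and take pairwise cosine similarities, rescaled to the 0-6 range the dashboard uses. All names and numbers here are illustrative assumptions:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

candidates = ["Clinton", "Sanders", "Bush", "Trump"]
# One hypothetical feature vector per candidate:
# (popularity, positive sentiment share, sentiment on Jobs, sentiment on Economy).
features = np.array([
    [0.90, 0.60, 0.40, 0.50],
    [0.70, 0.70, 0.60, 0.40],
    [0.60, 0.50, 0.30, 0.60],
    [0.95, 0.40, 0.20, 0.70],
])

# Pairwise similarity; each candidate scores highest against himself (the diagonal).
sim = 6 * cosine_similarity(features)  # rescaled to the dashboard's 0-6 range
print(np.round(sim, 2))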
45 MYSQL DATABASE
1. The output of the analytics model, and the classification of tweets into positive and negative sentiments, are stored in a table on the database.
2. The similarity matrix is also stored as another table on the database.
3. The two tables are linked by the candidate's name.
4. Whenever a new tweet is streamed in real time, both tables are updated.
Incoming Tweet -> RF Classifier + Scoring Model -> INSERT INTO the Classified Tweets table (Timestamp, Tweet, Sentiment, Candidate, Party, Keyword) and UPDATE the Similarity Matrix table (Candidate, Candidate_2, Relationship, Similarity Score), linked by Candidate.
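A sketch of the two tables and the per-tweet writes, assuming MySQL is reached via pymysql; the slide gives the column names, but the exact types, keys and credentials are assumptions:

import pymysql

conn = pymysql.connect(host="localhost", user="analysis",
                       password="secret", db="tweets")  # hypothetical credentials
with conn.cursor() as cur:
    # Classified tweets: one row per incoming tweet.
    cur.execute("""CREATE TABLE IF NOT EXISTS classified_tweets (
        ts TIMESTAMP, tweet TEXT, sentiment VARCHAR(8),
        candidate VARCHAR(64), party VARCHAR(16), keyword VARCHAR(64))""")
    # Similarity matrix: one row per candidate pair, updated as tweets arrive.
    cur.execute("""CREATE TABLE IF NOT EXISTS similarity_matrix (
        candidate VARCHAR(64), candidate_2 VARCHAR(64),
        relationship VARCHAR(16), similarity_score FLOAT)""")
    # A new tweet inserts into one table and updates the other.
    cur.execute("INSERT INTO classified_tweets VALUES (NOW(), %s, %s, %s, %s, %s)",
                ("Great debate!", "Positive", "Joe Biden", "Democrat", "debate"))
    cur.execute("""UPDATE similarity_matrix SET similarity_score = %s
                   WHERE candidate = %s AND candidate_2 = %s""",
                (4.5, "Joe Biden", "Hillary Clinton"))
conn.commit()
conn.close()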
46 TABLEAU DASHBOARD
1. A connection is created between the database and Tableau, and it gets refreshed in real time when the database is updated.
2. The front page gives an overview of tweet counts, tracking the daily number of tweets, the most popular issues being discussed, and the number of tweets for each candidate.
The Tweet Counter shows the number of tweets each party receives, with the break-up of positives and negatives.
47 TABLEAU DASHBOARD
Clicking on the positive or negative portion of a party's pie chart will drill down each of the graphs. Here, we select the Positives of the Democrats, and can see their daily progress, their most important issues, and how their candidates are faring.
Selecting Democrats filters the word cloud and candidate plots.
48 TABLEAU DASHBOARD
If we click on a word in the word cloud, we are able to view all candidates who are talking about that issue. We can even see the positive and negative reception of each candidate on that issue.
Clicking on "Jobs" displays all candidates talking about Jobs.
49 TABLEAU DASHBOARD
The second page is a network graph of candidates, where each one is linked to the others based on their similarity score.
The score ranges from 0 to 6, with the links varying from red to green.
Using the buttons on the top right, we can see the candidates that are most alike.
Displays the similar-candidate links. Here, Joe Biden is similar to Lincoln Chafee and Jim Webb, but dissimilar to Hillary Clinton.
50 TABLEAU DASHBOARD
Clicking on a candidate's bubble displays all the other candidates that are similar to him.
Clicking on Joe Biden displays his information, and all the candidates similar to him.
51 TABLEAU DASHBOARD
Similarly, we can change the view and see the candidates most unlike a particular candidate, by toggling the button on the top right.
If we select dissimilar candidates, and select Jeb Bush, we see his information displayed, and all the candidates most unlike him.
52 THANK YOU
Predictive Analytics & Business Insights Team
All Attendees
Everyone on the next slide.
53 REFERENCES AND CITATIONS FOR THIS PRESENTATION
My personal thanks to Praveen Sripati, Jim D'Souza, Sandeep Devagiri, Deepak Vinoth, Divya, Sanisha, Mayank Pant, Srinivas Guntur and the rest of the team at Brillio who made this presentation and demo possible.
1. Spark Streaming. Apache Software Foundation.
2. databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
5. Marz, Nathan. Big Data: Principles and Best Practices of Scalable Real-Time Data Systems.
6. Hausenblas, Michael, and Nathan Bijnens. "Lambda Architecture". N.p.
7. Kreps, Jay. "The Log".
8. jjmalina/pygotham
9. Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing.