Research Statement
Joseph E. Gonzalez
As we scale to increasingly parallel and distributed architectures and explore new algorithms and machine learning techniques, the fundamental computational models and abstractions that once separated systems and machine learning research are beginning to fail. Some of the recent advances in machine learning have come from new systems that can apply complex models to big data problems. Likewise, some of the recent advances in systems have exploited fundamental properties of machine learning and analytics to reach new points in the system design space. By considering the design of scalable learning systems from both perspectives, we can address larger problems, expose new opportunities in algorithm and system design, and define the new fundamental computational models and abstractions that will accelerate research in these complementary fields.

Research Summary: My research spans the design of scalable machine learning algorithms and data-intensive systems and has introduced:

- new machine learning algorithms that leverage advances in asynchronous scheduling and transaction processing to achieve efficient parallelization with strong guarantees;
- new systems that exploit statistical properties to execute machine learning algorithms orders of magnitude faster than contemporary distributed systems;
- new abstractions that have redefined the boundaries between machine learning and systems.

Machine Learning: The future of machine learning hinges on our ability to learn from vast amounts of high-dimensional data and to train the big models that such data supports. To create scalable machine learning algorithms that fully utilize advances in hardware and system design, I have adopted a systems approach to machine learning research. By decomposing machine learning algorithms into smaller parts and identifying and exploiting common patterns, I have developed new, highly scalable algorithms that leverage advances in system design and provide strong guarantees.
In my thesis, I developed, analyzed, and implemented a set of Bayesian inference algorithms that exploit the conditional independence structure of graphical models to parallelize computation across distributed asynchronous systems. As a postdoc, I decomposed nonparametric inference and submodular optimization algorithms into exchangeable transactions and applied techniques in scalable transaction processing to derive parallel algorithms with strong guarantees.

Systems: Many recent developments in large-scale system design were driven by applications in analytics and machine learning that exposed new opportunities for parallelism, scheduling, concurrency control, and asynchrony and led to new points in the system design space. My approach to systems research builds on my knowledge of machine learning to identify patterns that yield new abstractions and system design opportunities. My thesis work introduced the graph-parallel abstraction, which has served as a platform for subsequent parallel algorithms and applications and has become the foundation for a wide range of graph-processing systems. Building on variations of the graph-parallel abstraction, I created the GraphLab and PowerGraph systems, which leverage fundamental properties of data to execute machine learning and graph analytics algorithms orders of magnitude faster than contemporary data processing systems. As a postdoc, I created GraphX, which unifies data-parallel and graph-parallel systems by developing new distributed join optimizations and new tabular representations of graph data.

Impact: From graph-parallel systems to the parameter server, my research has produced some of the most widely used open-source systems for large-scale machine learning. The combination of work on algorithms and systems led to the creation of GraphLab Inc., which has already commercialized my research, enabling applications ranging from product targeting to cybersecurity.
Thesis Research

Summary: My thesis work introduced algorithms, abstractions, and systems for scalable inference in graphical models and played a key role in defining the space of parallel inference algorithms and graph-processing systems and abstractions.

Graphical Model Inference Algorithms: I developed, analyzed, and implemented a collection of message passing [1, 2, 3] and MCMC [4] Bayesian inference algorithms that leverage the Markov Random Field structure to efficiently utilize parallel processing resources. By asynchronously constructing prioritized spanning trees, these algorithms expose substantial parallelism and also accelerate convergence. The work on MCMC inference combined techniques in graph coloring with new Markov blanket locking protocols to preserve ergodicity while enabling the application of parallel resources. By abandoning the popular bulk synchronous parallel model and envisioning a new, more asynchronous graph-centric model, I was able to design more scalable algorithms with stronger guarantees. This early work on algorithm design laid the foundation for the design of graph-parallel abstractions and systems and was a key factor in the broad adoption of the GraphLab open-source project.

Abstractions: Guided by the work on parallel inference, I introduced the graph-parallel [5, 6], gather-apply-scatter (GAS) [7], and parameter server [8] abstractions, which capture the fundamental computational and data-access patterns in a wide range of machine learning algorithms. These abstractions isolate the design of machine learning algorithms from the challenges of large-scale distributed asynchronous systems, enabling research into algorithms and systems to proceed in parallel. The graph-parallel and GAS abstractions exploit common properties of graph algorithms to expose new opportunities to leverage asynchrony and optimize communication and data layout.
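To illustrate the gather-apply-scatter pattern, the following is a minimal, single-machine sketch using PageRank as the running example. The function names and data layout are illustrative assumptions, not the actual GraphLab or PowerGraph API; a real system would distribute the graph and execute the three phases per active vertex.

```python
# Minimal sketch of the gather-apply-scatter (GAS) pattern via PageRank.
# Illustrative only; not the API of any real graph-processing system.

def pagerank_gas(edges, num_vertices, iterations=20, damping=0.85):
    out_degree = [0] * num_vertices
    for src, dst in edges:
        out_degree[src] += 1
    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        # Gather: each vertex sums weighted contributions from in-neighbors.
        acc = [0.0] * num_vertices
        for src, dst in edges:
            acc[dst] += rank[src] / out_degree[src]
        # Apply: each vertex computes its new value from the gathered sum.
        rank = [(1 - damping) / num_vertices + damping * a for a in acc]
        # Scatter: activate neighbors of vertices whose value changed
        # (elided here; this is where asynchronous scheduling enters).
    return rank
```

Because the gather phase is a commutative, associative sum, a system is free to reorder and partially aggregate it, which is exactly the property the abstractions above exploit to optimize communication.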
The parameter server abstraction exploits the abelian group structure and sparse parameter updates in many machine learning algorithms to expose new opportunities for distributed caching and aggregation. The work on graph-parallel abstractions has played a key role in recent graph systems and algorithms research and has influenced systems including GraphChi, X-Stream, and Naiad. The simplicity and widespread applicability of the parameter server has led to its adoption in many large-scale machine learning systems, including many of the recent systems for deep learning and language modeling. The work on abstractions reduced the complexity of machine learning algorithms to simple computational patterns that could be analyzed for their parallelism and communication overhead and rendered into a range of systems spanning multicore and distributed architectures.

Systems: Building on these abstractions and common properties of data, I developed the GraphLab [6] and PowerGraph [7] systems. The GraphLab system introduced asynchronous prioritized scheduling in conjunction with concurrency control primitives to enable efficient parallel execution of graph algorithms while ensuring serializability. The PowerGraph system exploited the power-law structure of real-world graphs and vertex-cut partitioning to efficiently compute on large graphs in a distributed environment. These systems achieved several orders-of-magnitude performance gains over contemporary map-reduce systems and remain the gold standard for general-purpose graph-processing systems. Through these projects, I helped lead a group of junior graduate students to develop their research and cultivate a growing open-source community. Inspired by the widespread adoption of the GraphLab open-source project and commercial interest in applications ranging from product targeting to language modeling, I co-founded GraphLab Inc.,
which has successfully commercialized these systems with customers in industries as diverse as e-commerce and defense.
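The parameter server pattern discussed earlier can be sketched minimally: workers buffer sparse additive updates (which commute and associate, the abelian group structure) and push pre-aggregated deltas to a server that merges them in any order. The class and method names here are illustrative, not any real system's API.

```python
# Minimal sketch of the parameter server pattern. Workers pre-aggregate
# sparse additive updates locally, then push them; the server can merge
# pushes in any order because the updates form an abelian group.
from collections import defaultdict

class ParameterServer:
    def __init__(self):
        self.params = defaultdict(float)

    def push(self, sparse_update):
        for key, delta in sparse_update.items():
            self.params[key] += delta  # order-independent merge

    def pull(self, keys):
        return {k: self.params[k] for k in keys}

class Worker:
    def __init__(self, server):
        self.server = server
        self.buffer = defaultdict(float)  # local aggregation cache

    def update(self, key, delta):
        self.buffer[key] += delta  # aggregate before communicating

    def flush(self):
        self.server.push(dict(self.buffer))
        self.buffer.clear()
```

Because pushes commute, the server state is the same regardless of which worker flushes first, which is what makes aggressive caching and delayed aggregation safe.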
Post-doctoral Research

Summary: As a postdoc advising an exceptional group of graduate students, I adopted a database systems perspective on the design of machine learning systems to unify graph-processing and general-purpose dataflow systems and to introduce scalable transaction processing techniques to the design of parallel machine learning algorithms.

The GraphX System: Driven by the need to support the entire graph analytics pipeline, which combines tabular pre-processing and post-processing with complex graph algorithms, I led the development of the GraphX system [9] to unify tables and graphs. I revisited my earlier work in the design of graph-processing systems through the lens of distributed database systems. GraphX recast the essential patterns and optimizations in graph-processing systems as new distributed join strategies and new techniques for incremental materialized view maintenance. Through this alternative perspective, GraphX integrates graph computation into a general-purpose distributed dataflow system, enabling users to view data as tables or graphs without data movement or duplication and to execute graph algorithms at performance parity with specialized graph-processing systems. GraphX is now part of Apache Spark and has been put into production at major technology companies (e.g., Alibaba Taobao).

Transaction Processing for Machine Learning: I introduced techniques from scalable transaction processing to the design of new parallel algorithms for nonparametric clustering and submodular optimization. By applying classic ideas in optimistic concurrency control to the nonparametric DP-means clustering algorithm, I developed a parallel inference algorithm that preserves the guarantees of the original serial algorithm [10].
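A minimal sketch of the idea, in the spirit of [10] but greatly simplified: points in a batch optimistically check only the current cluster centers, and proposals to create new clusters are validated serially so the result matches a serial execution. The function names and the serial simulation of the parallel phase are assumptions for illustration.

```python
# Sketch of optimistic concurrency control applied to DP-means clustering.
# The optimistic phase simulates parallel workers; the validation phase is
# the serial conflict check. Illustrative only.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dp_means_occ(points, lam, batch_size=4):
    centers = []
    for start in range(0, len(points), batch_size):
        batch = points[start:start + batch_size]
        # Optimistic phase: each point reads only the committed centers.
        proposals = [p for p in batch
                     if not centers or min(dist(p, c) for c in centers) > lam]
        # Validation phase: re-check each proposal against centers committed
        # earlier in this batch; conflicting proposals are rejected, exactly
        # as a serial execution would have rejected them.
        for p in proposals:
            if not centers or min(dist(p, c) for c in centers) > lam:
                centers.append(p)
    return centers
```

The point of the design is that the expensive distance computations run without coordination, and only the rare "new cluster" events pass through the serial validation step.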
Inspired by escrow techniques, which maintain bounds on the global state, I derived a parallelization of the double greedy algorithm for non-monotone submodular maximization that retains the original approximation guarantees [11]. One of the key contributions of this work is to flip the design and analysis of asynchronous algorithms from fast and sometimes correct to correct and often fast. As a consequence, this work opens a new line of approaches to parallel algorithm design and analysis.

Future Research

In addition to continuing research on the design of scalable Bayesian inference algorithms, graph-processing systems, and transaction techniques for machine learning, I plan to explore two new directions in the design of systems for machine learning: machine learning lifecycle management and the unification of synchronous and asynchronous abstractions.

Model Serving and Management: Machine learning and systems research have largely focused on the design of algorithms and systems to train models at scale on static datasets. However, training models is only a small part of the greater lifecycle of machine learning, which spans training, model serving, performance evaluation, exploration, and eventually re-training. Furthermore, in many real-world applications there are often many models and modeling tasks which may share common data and incorporate user-level personalization (e.g., spam prediction and content filtering). This bigger picture introduces a wide range of exciting new challenges in both the design of models and algorithms and the systems needed to support each of these stages. Below I enumerate just a few of these opportunities:
Low-Latency Model Serving: Serving predictions can be costly, requiring the evaluation of complex feature functions, retrieval of user personalization information, and potentially slow mathematical operations. Existing systems resort to pre-materialized predictions or specialized, bespoke prediction services, limiting their scalability and broader adoption. I believe that we can leverage advances in distributed query processing, caching, partial materialization, and model approximations to enable more general-purpose low-latency prediction serving.

Hybrid Online/Offline Learning: Typically, models are trained offline at fixed intervals (e.g., every night), resulting in stale models. While online learning algorithms exist, there is limited systems support, and offline re-training can often be more efficient and improve estimation quality (e.g., by iterating multiple times). I believe that by splitting models into online and offline components we can achieve a compromise, enabling fast updates to rapidly changing parameters (e.g., personalization parameters) while leveraging existing offline training systems for slowly evolving parameters.

Managing Multiple Models: Often there will be multiple models for the same task (e.g., models built by different employees). Choosing the right model and sharing training and prediction computation across models can improve accuracy and system performance. Model selection is often accomplished using A/B testing; however, this approach is deficient because model performance can vary across user groups and over time. I believe that by applying active learning techniques we can adaptively select the best models for different groups at different times. This will require the development of new systems to support active learning in a low-latency serving environment.
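The online/offline split above can be sketched as a model with two sets of parameters: shared feature weights that a batch system replaces wholesale at long intervals, and per-user personalization weights updated after every observation. All class and method names here are hypothetical illustrations, not an existing system's interface.

```python
# Sketch of a hybrid online/offline model: shared weights are retrained
# offline (e.g., nightly), per-user weights are updated online per event.
# A linear model is used purely for illustration.

class HybridModel:
    def __init__(self, num_features, lr=0.1):
        self.shared = [0.0] * num_features   # slow path: batch retraining
        self.personal = {}                   # fast path: per-user weights
        self.lr = lr

    def predict(self, user, x):
        w = self.personal.get(user, [0.0] * len(x))
        return sum((s + p) * xi for s, p, xi in zip(self.shared, w, x))

    def online_update(self, user, x, y):
        # Fast path: a single gradient step on this user's parameters only.
        w = self.personal.setdefault(user, [0.0] * len(x))
        err = y - self.predict(user, x)
        for i, xi in enumerate(x):
            w[i] += self.lr * err * xi

    def offline_retrain(self, new_shared):
        # Slow path: atomically swap in weights from the batch system.
        self.shared = list(new_shared)
```

Because online updates touch only one user's parameters, they can be served with low latency and without coordinating with the (much larger) offline training job.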
I have already started initial work on systems for model management and serving [12], and I believe that answering these questions could fill a decade of exciting research and shape the future of machine learning systems. By developing the algorithms and systems needed to address the entire machine learning lifecycle, we will be able to make better use of data, incorporate predictive analytics into a wide range of services, and enable the new highly responsive, personalized intelligent services that will drive everything from advertising to health care.

Hybrid Synchronous and Asynchronous Systems: Much of my early work on graph-processing systems, the parameter server, and even transactional models for machine learning leveraged nondeterminism and asynchrony to improve performance. Meanwhile, many contemporary data processing systems have abandoned asynchrony in favor of the simpler Bulk Synchronous Parallel (BSP) execution model and the determinism it affords. This leads to the question: can we combine the benefits of asynchronous systems with the simplicity and determinism of synchronous systems? I have already begun [13] to explore a hybrid approach to algorithm and system design that provides more frequent, fine-grained communication while retaining determinism at different levels of granularity. Building on the concept of mini-batch algorithms in machine learning, I believe there is a pattern and corresponding abstraction that can interpolate between the extremes, enabling users to choose the level of coordination that leads to the optimal trade-off between algorithm convergence rates and system performance. Understanding how and when to exploit nondeterminism while preserving guarantees on algorithm correctness will be a key part of managing the machine learning lifecycle and exploiting the high-performance systems of the future.
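The interpolation between the synchronous and asynchronous extremes can be made concrete with a mini-batch parameter: one coordinated update per batch, where the full batch recovers BSP-style synchronous gradient descent and a batch of one approaches the fine-grained asynchronous extreme. This single-machine one-dimensional simulation is purely illustrative of the trade-off, not a distributed implementation.

```python
# Sketch of the sync/async interpolation via a mini-batch parameter.
# batch_size = len(data): one coordinated update per pass (BSP-like).
# batch_size = 1: one update per point (fine-grained, async-like).

def sgd_1d(data, batch_size, lr=0.1, epochs=50):
    """Minimize sum((w - x)^2) over x in data; the optimum is the mean."""
    w = 0.0
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # One coordinated step per batch: smaller batches mean more
            # frequent, finer-grained communication per pass over the data.
            grad = sum(2.0 * (w - x) for x in batch) / len(batch)
            w -= lr * grad
    return w
```

Both extremes converge toward the same optimum here, but with different per-step communication costs; the interesting systems question is how a fixed batch size trades coordination overhead against convergence rate on real workloads.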
References

[1] J. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. In Artificial Intelligence and Statistics (AISTATS), April.

[2] J. Gonzalez, Y. Low, C. Guestrin, and D. O'Hallaron. Distributed parallel inference on large factor graphs. In Conference on Uncertainty in Artificial Intelligence (UAI).

[3] Joseph Gonzalez, Yucheng Low, and Carlos Guestrin. Scaling Up Machine Learning, chapter Parallel Inference on Large Factor Graphs. Cambridge University Press.

[4] Joseph Gonzalez, Yucheng Low, Arthur Gretton, and Carlos Guestrin. Parallel Gibbs sampling: From colored fields to thin junction trees. In Artificial Intelligence and Statistics (AISTATS), May.

[5] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI).

[6] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of Very Large Data Bases (PVLDB).

[7] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI 12.

[8] Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alex Smola. Scalable inference in latent variable models. In Conference on Web Search and Data Mining (WSDM).

[9] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. GraphX: Graph processing in a distributed dataflow framework. In Symposium on Operating Systems Design and Implementation (OSDI 14).

[10] Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, and Michael I. Jordan. Optimistic concurrency control for distributed unsupervised learning. In NIPS 13.

[11] Xinghao Pan, Stefanie Jegelka, Joseph E. Gonzalez, Joseph K. Bradley, and Michael I. Jordan. Parallel double greedy submodular maximization. In NIPS 14.

[12] Daniel Crankshaw, Peter Bailis, Joseph E. Gonzalez, Haoyuan Li, Zhao Zhang, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan. The missing piece in complex analytics: Low latency, scalable model management and serving with Velox. In CIDR 15.

[13] Peter Bailis, Joseph E. Gonzalez, Ali Ghodsi, Michael J. Franklin, Joseph M. Hellerstein, Michael I. Jordan, and Ion Stoica. Asynchronous complex analytics in a distributed dataflow architecture. Under review for SIGMOD 15.

[14] Veronika Strnadova, Aydin Buluc, Leonid Oliker, Joseph Gonzalez, Stefanie Jegelka, Jarrod Chapman, and John Gilbert. Fast clustering methods for genetic mapping in plants. In 16th SIAM Conference on Parallel Processing for Scientific Computing.

[15] David Bader, Aydın Buluç, John Gilbert, Joseph Gonzalez, Jeremy Kepner, and Timothy Mattson. The Graph BLAS effort and its implications for exascale. In SIAM Workshop on Exascale Applied Mathematics Challenges and Opportunities (EX14).

[16] Evan Sparks, Ameet Talwalkar, Virginia Smith, Xinghao Pan, Joseph Gonzalez, Tim Kraska, Michael I. Jordan, and Michael J. Franklin. MLI: An API for distributed machine learning. In International Conference on Data Mining (ICDM). IEEE, December.

[17] Reynold Xin, Joseph E. Gonzalez, Michael Franklin, and Ion Stoica. GraphX: A resilient distributed graph system on Spark. In SIGMOD GRADES Workshop.
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.
Interactive data analytics drive insights
Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has
Towards Resource-Elastic Machine Learning
Towards Resource-Elastic Machine Learning Shravan Narayanamurthy, Markus Weimer, Dhruv Mahajan Tyson Condie, Sundararajan Sellamanickam, Keerthi Selvaraj Microsoft [shravan mweimer dhrumaha tcondie ssrajan
Dell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert [email protected]/
WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS
WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS Managing and analyzing data in the cloud is just as important as it is anywhere else. To let you do this, Windows Azure provides a range of technologies
Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
IBM System x reference architecture solutions for big data
IBM System x reference architecture solutions for big data Easy-to-implement hardware, software and services for analyzing data at rest and data in motion Highlights Accelerates time-to-value with scalable,
Fast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction
Human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011 2 NA = 6.022 1023 mol 1 Paul Burkhardt, Chris Waring An NSA Big Graph experiment Fast Iterative Graph Computation with Resource
Cray: Enabling Real-Time Discovery in Big Data
Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis Pelle Jakovits Outline Problem statement State of the art Approach Solutions and contributions Current work Conclusions
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
High-Volume Data Warehousing in Centerprise. Product Datasheet
High-Volume Data Warehousing in Centerprise Product Datasheet Table of Contents Overview 3 Data Complexity 3 Data Quality 3 Speed and Scalability 3 Centerprise Data Warehouse Features 4 ETL in a Unified
REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])
305 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference
BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS
WHITEPAPER BASHO DATA PLATFORM BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS INTRODUCTION Big Data applications and the Internet of Things (IoT) are changing and often improving our
IMPACT OF DISTRIBUTED SYSTEMS IN MANAGING CLOUD APPLICATION
INTERNATIONAL JOURNAL OF ADVANCED RESEARCH IN ENGINEERING AND SCIENCE IMPACT OF DISTRIBUTED SYSTEMS IN MANAGING CLOUD APPLICATION N.Vijaya Sunder Sagar 1, M.Dileep Kumar 2, M.Nagesh 3, Lunavath Gandhi
Master of Science in Computer Science
Master of Science in Computer Science Background/Rationale The MSCS program aims to provide both breadth and depth of knowledge in the concepts and techniques related to the theory, design, implementation,
Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data
Spark and Shark High- Speed In- Memory Analytics over Hadoop and Hive Data Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li,
MEng, BSc Computer Science with Artificial Intelligence
School of Computing FACULTY OF ENGINEERING MEng, BSc Computer Science with Artificial Intelligence Year 1 COMP1212 Computer Processor Effective programming depends on understanding not only how to give
Learning outcomes. Knowledge and understanding. Competence and skills
Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges
How To Get A Computer Science Degree At Appalachian State
118 Master of Science in Computer Science Department of Computer Science College of Arts and Sciences James T. Wilkes, Chair and Professor Ph.D., Duke University [email protected] http://www.cs.appstate.edu/
An Approach Towards Customized Multi- Tenancy
I.J.Modern Education and Computer Science, 2012, 9, 39-44 Published Online September 2012 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijmecs.2012.09.05 An Approach Towards Customized Multi- Tenancy
Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
/35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Simplified Management With Hitachi Command Suite. By Hitachi Data Systems
Simplified Management With Hitachi Command Suite By Hitachi Data Systems April 2015 Contents Executive Summary... 2 Introduction... 3 Hitachi Command Suite v8: Key Highlights... 4 Global Storage Virtualization
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
MEng, BSc Applied Computer Science
School of Computing FACULTY OF ENGINEERING MEng, BSc Applied Computer Science Year 1 COMP1212 Computer Processor Effective programming depends on understanding not only how to give a machine instructions
Doctor of Philosophy in Computer Science
Doctor of Philosophy in Computer Science Background/Rationale The program aims to develop computer scientists who are armed with methods, tools and techniques from both theoretical and systems aspects
Databricks. A Primer
Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically
Evaluating partitioning of big graphs
Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist [email protected], [email protected], [email protected] Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
MLlib: Machine Learning in Apache Spark
Journal of Machine Learning Research 17 (2016) 1-7 Submitted 5/15; Published 4/16 MLlib: Machine Learning in Apache Spark Xiangrui Meng [email protected] Joseph Bradley [email protected] Burak Yavuz
DRAFT. 1 Proposed System. 1.1 Abstract
Doctoral Thesis Proposal A language-level approach to distributed computation with weak synchronization and its application to cloud and edge computing environments. Christopher Meiklejohn Université catholique
Scalable Machine Learning - or what to do with all that Big Data infrastructure
- or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1 Complex Data Analysis at Scale Click-through prediction Personalized Spam Detection
