Big Data Science. Prof. Lise Getoor University of Maryland, College Park. October 17, 2013


 Stephen Poole
 2 years ago
 Views:
Transcription
1 Big Data Science Prof Lise Getoor University of Maryland, College Park October 17, 2013
2 BIG Data is not flat lonnitaylor
3 Data is multimodal, multirelational, spatiotemporal, multimedia Entities and relationships are important!
4 NEED: Data Science for Graphs New V: Vinculate
5 Diverse applications areas: social networks & social media, computational biology, computer vision, computational linguistics, computational sustainability, computational social science, personalized medicine, cybersecurity, etc Research Overview Combining statistical & logical methods for machine learning and reasoning under uncertainty ML Visual Analytic Tools MIT Press 2007 Probabilistic databases, data integration and alignment DB DDupe VIZ CGroup
6 Statistical Relational Learning (SRL) AI/DB representations + statistics for multirelational data l Entities can be of different types l Entities can participate in a variety of relationships l examples: Markov logic networks, relational dependency networks, Bayesian logic programs, probabilistic relational models, many others Key ideas l Relational feature construction l Collective reasoning l Lifted representation, inference and learning For more details, see NIPS 2012 Tutorial, nips2012pdf
7 Common Graph Data Analysis Patterns Joint inference over large networks for: l Collective Classification inferring labels of nodes in graph l Link Prediction inferring the existence of edges in graph l Entity Resolution clustering nodes that refer to the same underlying entity
8 Common Graph Data Analysis Patterns Joint inference over large networks for: l Collective Classification inferring labels of nodes in graph l Link Prediction inferring the existence of edges in graph l Entity Resolution clustering nodes that refer to the same underlying entity
9 Common Graph Data Analysis Patterns Joint inference over large networks for: l Collective Classification inferring labels of nodes in graph l Link Prediction inferring the existence of edges in graph l Entity Resolution clustering nodes that refer to the same underlying entity
10 Common Graph Data Analysis Patterns Joint inference over large networks for: l Collective Classification inferring labels of nodes in graph l Link Prediction inferring the existence of edges in graph l Entity Resolution clustering nodes that refer to the same underlying entity
11 What s Needed Next? Methods which can perform and interleave these tasks Desiderata: Flexible, scalable, declarative support for collective classification, link prediction, entity resolution and other information fusion and information alignment problems
12 Probabilistic Soft Logic (PSL) Stephen Bach Matthias Broecheler Alex Memory Lily Mihalkova Stanley Kok Bert Huang Angelika Kimmig
13 Probabilistic Soft Logic (PSL) Declarative language based on logics to express collective probabilistic inference problems  Predicate = relationship or property  Atom = (continuous) random variable  Rule = capture dependency or constraint  Set = define aggregates PSL Program = Rules + Input DB
14 Probabilistic Soft Logic (PSL) Declarative language based on logics to express collective probabilistic inference problems  Predicate = relationship or property  Atom = (continuous) random variable  Rule = capture dependency or constraint  Set = define aggregates PSL Program = Rules + Input DB
15 Node Labeling?
16 Voter Opinion Modeling? Status update $ $ Tweet
17 Voter Opinion Modeling spouse colleague friend friend spouse friend spouse friend colleague
18 Voter Opinion Modeling vote(a,p) friend(b,a) à vote(b,p) : 03 spouse colleague friend friend spouse friend spouse friend colleague vote(a,p) spouse(b,a) à vote(b,p) : 08
19 Link Prediction Entities  People, s Attributes  Words in s Relationships  communication, work relationship Goal: Identify work relationships  Supervisor, subordinate, colleague #
20 Link Prediction People, s, words, communication, relations Use rules to express evidence  If content suggests type X, it is of type X  If A sends deadline s to B, then A is the supervisor of B  If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues #
21 Link Prediction People, s, words, communication, relations Use rules to express evidence  If content suggests type X, it is of type X  If A sends deadline s to B, then A is the supervisor of B  If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues # complete by due
22 Link Prediction People, s, words, communication, relations Use rules to express evidence  If content suggests type X, it is of type X  If A sends deadline s to B, then A is the supervisor of B  If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues #
23 Link Prediction People, s, words, communication, relations Use rules to express evidence  If content suggests type X, it is of type X  If A sends deadline s to B, then A is the supervisor of B  If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues #
24 Entity Resolution Entities  People References Attributes  Name Relationships  Friendship Goal: Identify references that denote the same person John Smith J Smith name name C E A B friend friend D F G = = H
25 Entity Resolution References, names, friendships Use rules to express evidence  If two people have similar names, they are probably the same  If two people have similar friends, they are probably the same  If A=B and B=C, then A and C must also denote the same person John Smith J Smith name name C E A B friend friend D F G = H =
26 Entity Resolution References, names, friendships Use rules to express evidence  If two people have similar names, they are probably the same  If two people have similar friends, they are probably the same  If A=B and B=C, then A and C must also denote the same person Aname {str_sim} Bname => A B : 08 C John Smith A E name friend J Smith B name D F G = = H friend
27 Entity Resolution References, names, friendships Use rules to express evidence  If two people have similar names, they are probably the same  If two people have similar friends, they are probably the same  If A=B and B=C, then A and C must also denote the same person John Smith J Smith name name C E A B friend friend D F G = H {Afriends} {} {Bfriends} => A B : 06 =
28 Entity Resolution References, names, friendships Use rules to express evidence  If two people have similar names, they are probably the same  If two people have similar friends, they are probably the same  If A=B and B=C, then A and C must also denote the same person John Smith J Smith name name C E A B friend friend D F G = H = A B ^ B C => A C :
29 Mathematical Foundation
30 Probabilistic Model Probability density over interpretation I Ground rules distance to satisfaction d r (I) = max{i r,body I r,head, 0} # P (I) = 1 Z exp " X r2r w r (d r (I)) p r Normalization constant Ground rules Rule weight Distance exponent (in {1, 2})
31 Hingeloss MRFs
32 Hingeloss Markov Random Fields 2 P (Y X) = 1 Z exp 4 3 mx w j max{`j(y, X), 0} p j 5 j=1 Continuous variables in [0,1] Potentials are hingeloss functions Subject to arbitrary linear constraints Logconcave!
33 Inference as Convex Optimization Maximum Aposteriori Probability (MAP) Objective: arg max P (Y X) Y mx = arg min Y This is convex! Can solve using offtheshelf convex optimization packages or custom solver j=1 w j max{`j(y, X), 0} p j
34 Consensus Optimization Idea: Decompose problem and solve subproblems independently (in parallel), then merge results  Subproblems are ground rules  Auxiliary variables enforce consensus across subproblems Framework: Alternating direction method of multipliers (ADMM) [Boyd, 2011] Inference with ADMM is fast, scalable, and straightforward to implement [Bach et al, NIPS 2012, UAI 2013]
35 [Bach et al, UAI 2013; London et al, 2013] Speed Average running time Cora Citeseer Epinions Ac/vity Discrete MRF 1109 s 1843 s 2124 s 3442 s HL MRF 04 s 07 s 12 s 06 s Variables 10k 10k 1k 8k Poten/als 14k 19k 18k 75k Inference in HLMRFs is orders of magnitude faster than in discrete MRFs which use MCMC approximate inference In practice, scales linearly with the number of potentials
36 Distributed Processing
37 Distributed MAP Inference ADMM consensus optimization problem can be implemented naturally in distributed setting For k+1 iteration, it consists three steps in which sub problems can run independently (1 st and 2 nd step): 1 Update Lagrangian multiplier Miao, Liu, Huang, Getoor, IEEE Big Data Update each sub problem OR 3 Update the global variables
38 Distributed MAP: MapReduce Job Bootstrap Mapper sub problem load global variable X as side data read global variable X read/write subproblem Pros: Straightforward Design Cons: Job bootstrapping cost between iterations Difficult to schedule subset of nodes to run local variable copy z1 z2 zq zp z1 zq z2 z1 Reducer update global component z1 z1 z1 z2 z2 zq zq zp HDFS or HBase write new global variable
39 Distributed MAP: GraphLab sub problem node global variable component Advantages: No need to touch disk, no job bootstrapping cost Easy to express local convergence conditions to schedule only subset of nodes gather get z gather get local z,y apply update y update x scatter notify z apply update z scatter unless converge notify X update i update i+1
40 Experimental Results Using PSL for knowledge graph cleaning task  16M+ vertices, 22M+ edges, for small running instances  Takes 100 minutes to finish in Java single machine implementation using 40G+ memory  Distributed GraphLab implementation takes less than 15 minutes using 4 smaller machines  Possible to use commodity machines on large models!
41 Miao, Liu, Huang, Getoor, BigData 13 Experimental Results Voter model using commodity machines Name Subproblem Consensus Edge Fit in One Machine? Run time (sec) m = 8 SN 1M 33M 11M 6M Yes 2230 SN 2M 66M 21M 12M No 3997 SN 3M 10M 31M 18M No 4395 SN 4M 13M 42M 24M No 5376 Machine: Intel Core2 Quad CPU 266GHz machines with 4GB RAM running Ubuntu 1204 Linux Running time (sec) Hyper Greedy Number of Machines (SN 2M ) Running time (sec) Hyper Greedy Number of machines Strong scaling with fixed dataset Weak scaling with increasing size
42 Applications
43 Application Domains Computer Vision  Lowlevel image reconstruction  Activity recognition in Video Computational Biology & Health Informatics  Drugtarget prediction  Event discovery in EMR data Computational Social Science  Voter Opinion Analysis & discovery of Latent Groups in Social Media  Inferring bias in political discourse  Psychological modeling on online social networks Information Integration & Extraction  Entity resolution  Ontology alignment & schema mapping  Knowledge graph identification Upcoming: climate graphs, discourse analysis, MOOCs, more!
44 Closing Comments Great opportunities to do good work and do useful things in the current era of big data, information overload and network science big data science Statistical relational learning provides some of the tools, much work still needed, developing theoretical bounds for relational learning, scalability, etc Compelling applications abound! Looking for students, programmers, & postdocs
45 pslumiacsumdedu
SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs
SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large
More informationBig Data from a Database Theory Perspective
Big Data from a Database Theory Perspective Martin Grohe Lehrstuhl Informatik 7  Logic and the Theory of Discrete Systems A CS View on Data Science Applications Data System Users 2 Us Data HUGE heterogeneous
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationModeling Learner Engagement in MOOCs using Probabilistic Soft Logic
Modeling Learner Engagement in MOOCs using Probabilistic Soft Logic Arti Ramesh, Dan Goldwasser, Bert Huang, Hal Daumé III, Lise Getoor Department of Computer Science, University of Maryland, MD, USA {artir,
More informationMachine Learning over Big Data
Machine Learning over Big Presented by Fuhao Zou fuhao@hust.edu.cn Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed
More informationCell Phone based Activity Detection using Markov Logic Network
Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel sxs104721@utdallas.edu 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart
More informationA Learning Based Method for SuperResolution of Low Resolution Images
A Learning Based Method for SuperResolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method
More informationHow Companies are! Using Spark
How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
More informationLearning is a very general term denoting the way in which agents:
What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);
More informationDistributed forests for MapReducebased machine learning
Distributed forests for MapReducebased machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationMapReduce Approach to Collective Classification for Networks
MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty
More informationGraph Database Performance: An Oracle Perspective
Graph Database Performance: An Oracle Perspective Xavier Lopez, Ph.D. Senior Director, Product Management 1 Copyright 2012, Oracle and/or its affiliates. All rights reserved. Program Agenda Broad Perspective
More informationParallel Programming MapReduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming MapReduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationProbabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014
Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about
More informationCourse: Model, Learning, and Inference: Lecture 5
Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.
More informationPerformance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications
Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introducon Open source MapReduce
More informationBIG DATA PROBLEMS AND LARGESCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION
BIG DATA PROBLEMS AND LARGESCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION Ş. İlker Birbil Sabancı University Ali Taylan Cemgil 1, Hazal Koptagel 1, Figen Öztoprak 2, Umut Şimşekli
More informationJoin Bayes Nets: A New Type of Bayes net for Relational Data
Join Bayes Nets: A New Type of Bayes net for Relational Data Bayes nets, statisticalrelational learning, knowledge discovery in databases Abstract Many realworld data are maintained in relational format,
More informationE6895 Big Data Analytics: Lecture 12. Mobile Vision. ChingYung Lin, Ph.D. IBM Chief Scientist, Graph Computing. December 1st, 2016
E6895 Big Data Analytics: Lecture 12 Mobile Vision ChingYung Lin, Ph.D. IBM Chief Scientist, Graph Computing Adjunct Professor, Dept. of Electrical Engineering and Computer Science December 1st, 2016
More informationFast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction
Human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011 2 NA = 6.022 1023 mol 1 Paul Burkhardt, Chris Waring An NSA Big Graph experiment Fast Iterative Graph Computation with Resource
More informationIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Largescale Computation Traditional solutions for computing large quantities of data
More informationMaximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
More informationSanjeev Kumar. contribute
RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi110012 sanjeevk@iasri.res.in 1. Introduction The field of data mining and knowledgee discovery is emerging as a
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationDuke University http://www.cs.duke.edu/starfish
Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University http://www.cs.duke.edu/starfish Practitioners of Big Data Analytics Google Yahoo! Facebook ebay Physicists Biologists Economists
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationEnergy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
More informationBig Data and Analytics: Challenges and Opportunities
Big Data and Analytics: Challenges and Opportunities Dr. Amin Beheshti Lecturer and Senior Research Associate University of New South Wales, Australia (Service Oriented Computing Group, CSE) Talk: Sharif
More informationBig Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
More informationDistributed Structured Prediction for Big Data
Distributed Structured Prediction for Big Data A. G. Schwing ETH Zurich aschwing@inf.ethz.ch T. Hazan TTI Chicago M. Pollefeys ETH Zurich R. Urtasun TTI Chicago Abstract The biggest limitations of learning
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationAn Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.
An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute
More informationCOMP9321 Web Application Engineering
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411
More informationMachine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC
Machine Learning for Medical Image Analysis A. Criminisi & the InnerEye team @ MSRC Medical image analysis the goal Automatic, semantic analysis and quantification of what observed in medical scans Brain
More informationTopics in basic DBMS course
Topics in basic DBMS course Database design Transaction processing Relational query languages (SQL), calculus, and algebra DBMS APIs Database tuning (physical database design) Basic query processing (ch
More informationSampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data
Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian
More informationEuropean Archival Records and Knowledge Preservation Database Archiving in the EARK Project
European Archival Records and Knowledge Preservation Database Archiving in the EARK Project Janet Delve, University of Portsmouth Kuldar Aas, National Archives of Estonia Rainer Schmidt, Austrian Institute
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GSASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GSASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationBayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
More informationSome Research Challenges for Big Data Analytics of Intelligent Security
Some Research Challenges for Big Data Analytics of Intelligent Security YuhJong Hu hu at cs.nccu.edu.tw Emerging Network Technology (ENT) Lab. Department of Computer Science National Chengchi University,
More informationbigdata Managing Scale in Ontological Systems
Managing Scale in Ontological Systems 1 This presentation offers a brief look scale in ontological (semantic) systems, tradeoffs in expressivity and data scale, and both information and systems architectural
More informationProcessing Large Amounts of Images on Hadoop with OpenCV
Processing Large Amounts of Images on Hadoop with OpenCV Timofei Epanchintsev 1,2 and Andrey Sozykin 1,2 1 IMM UB RAS, Yekaterinburg, Russia, 2 Ural Federal University, Yekaterinburg, Russia {eti,avs}@imm.uran.ru
More informationComparision of kmeans and kmedoids Clustering Algorithms for Big Data Using MapReduce Techniques
Comparision of kmeans and kmedoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,
More informationLearning Latent Engagement Patterns of Students in Online Courses
Learning Latent Engagement Patterns of Students in Online Courses 1 Arti Ramesh, 1 Dan Goldwasser, 1 Bert Huang, 1 Hal Daumé III, 2 Lise Getoor 1 University of Maryland, College Park 2 University of California,
More informationANALYTICS IN BIG DATA ERA
ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNT OF DATA MAURIZIO SALUSTI SAS Copyr i g ht 2012, SAS Ins titut
More informationData Mining with Hadoop at TACC
Data Mining with Hadoop at TACC Weijia Xu Data Mining & Statistics Data Mining & Statistics Group Main activities Research and Development Developing new data mining and analysis solutions for practical
More informationBigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduceHadoop
More informationHIGH PERFORMANCE BIG DATA ANALYTICS
HIGH PERFORMANCE BIG DATA ANALYTICS Kunle Olukotun Electrical Engineering and Computer Science Stanford University June 2, 2014 Explosion of Data Sources Sensors DoD is swimming in sensors and drowning
More informationPerformance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationCI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationProtein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks YoungRae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
More informationSYSTAP / bigdata. Open Source High Performance Highly Available. 1 http://www.bigdata.com/blog. bigdata Presented to CSHALS 2/27/2014
SYSTAP / Open Source High Performance Highly Available 1 SYSTAP, LLC Small Business, Founded 2006 100% Employee Owned Customers OEMs and VARs Government TelecommunicaHons Health Care Network Storage Finance
More informationUnsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
More informationHadoop SNS. renren.com. Saturday, December 3, 11
Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December
More informationCSEE5430 Scalable Cloud Computing Lecture 2
CSEE5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.92015 1/36 Google MapReduce A scalable batch processing
More informationMining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
More informationJOURNAL OF COMPUTER SCIENCE AND ENGINEERING
Exploration on Service Matching Methodology Based On Description Logic using Similarity Performance Parameters K.Jayasri Final Year Student IFET College of engineering nishajayasri@gmail.com R.Rajmohan
More informationStatistical Learning Theory Meets Big Data
Statistical Learning Theory Meets Big Data Randomized algorithms for frequent itemsets Eli Upfal Brown University Data, data, data In God we trust, all others (must) bring data Prof. W.E. Deming, Statistician,
More informationISSN: 23201363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS
CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant
More informationA Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
More informationStorage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.
More informationBig Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI580, Bo Wu Graphs
More informationPredicting CPU Availability of a Multicore Processor Executing Concurrent Java Threads
Predicting CPU Availability of a Multicore Processor Executing Concurrent Java Threads Khondker Shajadul Hasan 1, Nicolas G. Grounds 2, John K. Antonio 1 1 School of Computer Science, University of Oklahoma,
More informationMapReduce for Machine Learning on Multicore
MapReduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers  dual core to 12+core Shift to more concurrent programming paradigms and languages Erlang,
More informationProgramming Tools based on Big Data and Conditional Random Fields
Programming Tools based on Big Data and Conditional Random Fields Veselin Raychev Martin Vechev Andreas Krause Department of Computer Science ETH Zurich Zurich Machine Learning and Data Science Meetup,
More informationRegression Using Support Vector Machines: Basic Foundations
Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering
More informationGraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction
More informationMapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially
More informationAnalyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers
Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers Sonal Gupta Christopher Manning Natural Language Processing Group Department of Computer Science Stanford University Columbia
More informationAsking Hard Graph Questions. Paul Burkhardt. February 3, 2014
Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate  R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
More informationIntegrating Benders decomposition within Constraint Programming
Integrating Benders decomposition within Constraint Programming Hadrien Cambazard, Narendra Jussien email: {hcambaza,jussien}@emn.fr École des Mines de Nantes, LINA CNRS FRE 2729 4 rue Alfred Kastler BP
More informationInfluence Discovery in Semantic Networks: An Initial Approach
2014 UKSimAMSS 16th International Conference on Computer Modelling and Simulation Influence Discovery in Semantic Networks: An Initial Approach Marcello Trovati and Ovidiu Bagdasar School of Computing
More informationNAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju
NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE Venu Govindaraju BIOMETRICS DOCUMENT ANALYSIS PATTERN RECOGNITION 8/24/2015 ICDAR 2015 2 Towards a Globally Optimal Approach for Learning Deep Unsupervised
More informationChallenges and Lessons from NIST Data Science Prepilot Evaluation in Introduction to Data Science Course Fall 2015
Challenges and Lessons from NIST Data Science Prepilot Evaluation in Introduction to Data Science Course Fall 2015 Dr. Daisy Zhe Wang Director of Data Science Research Lab University of Florida, CISE
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More informationBert Huang, Ph.D. Education. Employment History. Teaching. Awards
Bert Huang, Ph.D. Assistant Professor, Department of Computer Science Virginia Tech, Blacksburg, VA 24060 bhuang@vt.edu, (646) 8758838 http://berthuang.com Research Interests Machine learning, structured
More informationCollective Stance Classification of Posts in Online Debate Forums
Collective Stance Classification of Posts in Online Debate Forums Dhanya Sridhar Computer Science Dept. UC Santa Cruz dsridhar@soe.ucsc.edu Lise Getoor Computer Science Dept. UC Santa Cruz getoor@soe.ucsc.edu
More informationWinter 2016 Course Timetable. Legend: TIME: M = Monday T = Tuesday W = Wednesday R = Thursday F = Friday BREATH: M = Methodology: RA = Research Area
Winter 2016 Course Timetable Legend: TIME: M = Monday T = Tuesday W = Wednesday R = Thursday F = Friday BREATH: M = Methodology: RA = Research Area Please note: Times listed in parentheses refer to the
More informationIn Memory Accelerator for MongoDB
In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000
More informationArchitecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,
More informationBig Data: Tools and Technologies in Big Data
Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can
More informationArchitectures for massive data management
Architectures for massive data management Apache Spark Albert Bifet albert.bifet@telecomparistech.fr October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache
More informationComplexity and Scalability in Semantic Graph Analysis Semantic Days 2013
Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013 James Maltby, Ph.D 1 Outline of Presentation Semantic Graph Analytics Database Architectures Inmemory Semantic Database Formulation
More informationBig Data Mining Services and Knowledge Discovery Applications on Clouds
Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy talia@dimes.unical.it Data Availability or Data Deluge? Some decades
More informationGraph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang
Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *
More informationVolume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 33 Outline
More informationSimple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRALab,
More informationLambda Architecture. Near RealTime Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com
Lambda Architecture Near RealTime Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
More informationKEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS
ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,
More informationHadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
More informationIcepak HighPerformance Computing at Rockwell Automation: Benefits and Benchmarks
Icepak HighPerformance Computing at Rockwell Automation: Benefits and Benchmarks Garron K. Morris Senior Project Thermal Engineer gkmorris@ra.rockwell.com Standard Drives Division Bruce W. Weiss Principal
More informationTracking Groups of Pedestrians in Video Sequences
Tracking Groups of Pedestrians in Video Sequences Jorge S. Marques Pedro M. Jorge Arnaldo J. Abrantes J. M. Lemos IST / ISR ISEL / IST ISEL INESCID / IST Lisbon, Portugal Lisbon, Portugal Lisbon, Portugal
More informationShared Parallel File System
Shared Parallel File System Fangbin Liu fliu@science.uva.nl System and Network Engineering University of Amsterdam Shared Parallel File System Introduction of the project The PVFS2 parallel file system
More informationTRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More information