Big Data Benchmark Suite
|
|
|
- Jayson Baldwin
- 10 years ago
- Views:
Transcription
1 BigDataBench: An Open source Big Data Benchmark Suite Jianfeng Zhan Professor, ICT, Chinese Academy of Sciences and University of Chinese Academy of Sciences WBDB 2015 Toronto, Canada INSTITUT TE OF COMPUTING TECHNOLOGY
2 What is BigDataBench? An open source big data benchmarking project Search Google using BigDataBench
3 Why BigDataBench? Specifi Application Workload Work Scalable data Multiple Multi Sub Simulat cation domains Types loads sets s (from real implemen pe e tenancy sets s or data) tations version BigDataBench Y Five Five 33 8 Y Y Y Y(3) BigBench Y One Three 10 3 N N N N CloudSuite N N/A Two 8 3 N N N Y(1) HiBench N N/A Two 10 3 N N N N CALDA Y N/A One 5 N/A Y N N N YCSB Y N/A One 6 N/A Y N N N LinkBench Y N/A One 10 1 Y N N N AMP Benchmarks N N/A One 4 N/A Y N N N
4 BigDataBench evolution application domains: 14 data sets and 33 workloads Same specifications: diverse implementations Multi tenancy version BigDataBench subset and simulator version BigDataBench Multidisciplinary effort 32 workloads: diverse implementations BigDataBench Typical Internet service domains An architectural perspective 19 workloads & data generation tools BigDataBench Search engine 6 workloads BigDataBench data analytics workloads DCBench 1.0 Mixed data analytics workloads CloudRank 1.0
5 BigDataBench Users fi /Bi B h/ / Industry users Accenture, BROADCOM, SAMSUMG, Huawei, IBM About 20 academia groups published papers using BigDataBench
6 Industry Standard: BigDataBench DCADCA China s first industry standard big data benchmark suite benchmarks/ Telecom Research Institute of Ministry of Industry and Information Technology, ICT, CAS, Huawei, China Mobile, Sina, ZTE, Intel (China), Microsoft (China), IBM CDL, Baidu, INSPUR, ZTE, 21viane and UCloud
7 BigDataBench Publications BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE International Symposium On High Performance Computer Architecture (HPCA 2014). Characterizing data analysis workloads in data centers IEEE International Symposium on Workload Characterization (IISWC 2013)(Best paper award) BigOP: generating comprehensive big data workloads as a benchmarking framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014) BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. The Fourth workshop on big data benchmarking (WBDB 2014) Identifying i Dwarfs Workloads in Big Data Analytics arxivpreprint i arxiv: BigDataBench MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads arxiv preprint arxiv:
8 Outline BigDataBench 3.1 Overview Dwarf Workloads in Big Data Analytics Insights
9 What s New in BigDataBench Methodology New application domains Specification for each domain Implementation Specificpurpose Version New real world data sets New big data workloads Multi tenancy version BigDataBench B subset version
10 What is new? Application Domains Previous version Internet service: search engine, social network, e commerce Newly added: bioinformatics, multimedia Specification Guidelines for benchmark implementation
11 Methodology evolution Typical Application Domains Data BigDataBench 3.0 Complexity Functions of Abstraction Scalable data sets and workloads System BigDataBench p charististics Different implementations BigDataBench 3.1 Application Domain 1 Application Domain Application Domain N Data models of different types & semantics Data operations & workload patterns Benchmark specification i 1 Benchmark specification Benchmark specification N Real world data sets Data generation tools Workloads with diverse implementations Multi tenancy version Mix with different percentages Reduce benchmarking cost BigDataBench subset
12 Specification Search Search Engine General search and vertical search Online server and Offline analytics
13 Parsing: Search Engine: Parsing Extract the text content and links from the raw web pages
14 Indexing Search Engine: Indexing create the mapping of terms to document id lists
15 Search Engine: PageRank PageRank Compute the importance of the page according to the web link graph using PageRank
16 Search Engine: Search query Querying online web search server serving users requests
17 Sorting Search Engine: Sorting Sort the results according the page pg ranks and the relevance of between the query and the document
18 Search Engine: Recommendation Recommendation Recommend related queries to users by mining the search log
19 Search Engine: Statistic counting Statistic counting Countingthe the word frequency to extract the key words which represent the features of the page
20 Search Engine: Classification Classification Classify text content into different categories.
21 Search Engine: Filter & Semantic Filter Identify pages with specific topic which can be used for vertical search extraction ti Semantic extraction extract semantic information
22 Search Engine: Data access Data access operations Read, write, and scan the semantic information.
23 Specification Social network Data sets User table Relation table Article table Workloads Offline analytics
24 Social network: Data schema User table Relation table Tweet table
25 Social network: Workloads Hot review topic Select the top N tweets by the number of review Hot ttransmit ittopic Select the tweets which are transmitted more than N times. Active user Select the top N person who post the largest number of tweets. Leader of opinion Select top ones whose number of review and transmit are both large than N.
26 Social network: Workloads Topic classify Classify the tweets to certain category according to the topic. Sentiment classify Classify the tweets to negative or positive according to the sentiment. Friend recommendation Recommend friend to person according the relational graph. Community detection Detecting clusters or communities in large social networks. Breadth first search Sort persons according to the distance between two people.
27 Specification: E commerce Order table Item table
28 E commerce Data sets Order table Item table Workloads Offline analytics
29 E commerce: Workloads Sl Select query Find the items of which the sales amount is over 100 in a single order. Aggregation query Count the sales number of each goods. Join query Count the number of each goods that each buyer purchased between certain period of time. Recommendation Predict the preferences of the buyer and recommend goods to them. Sensitive classification Identify positive or negative review. Basic dt data operation Unit of operation of the data The workloads of select, aggregation, g and join are similar to the queries in A. Pavlo s sigmod09 pp paper,but are specified in the e commence environment
30 Specification Multimedia Voice Data Extraction Speech Recognition Video Data MPEG Decoder Frame Data Extraction Feature Extraction Image Segmentation Face Detection Three Dimensional Reconstruction Tracing
31 Multimedia: Workloads MPEG Decoder. Decode video streams using MPEG 2 standard. Feature extraction Fora given video frame, extract features which are invariant to scale, noise, and illumination. Speech Recognition. For a given audio file, recognize the content of the file and find whether there are sensitive words.
32 Ray Tracing. Multimedia: Workloads Render a 2 Dimensional video frame to a 3 Dimensional scene. Image Segmentation. Segmentthe the input video frame according to color, intensity, and texture, and extract concerned regions. Face Dt Detection. ti Detect whether face exists in the input data, if exists, then extract the face. Deep Learning. The input imagesare are classified into different categories, and then detect human face.
33 Specification Bioinformatics Sequence assembly. Assemble scattered and repetitive DNA fragments to originallong long sequence. Sequence alignment. Align assembled DNA sequence to known sequences in the database, and detect disease. Gene Sequencing Genome Sequence Data Sequence Assembly Sequence Mapping Sequence Alignment Detection Result
34 What s New in BigDataBench Methodology New application domains Specification for each domain Implementation Specificpurpose Version New real world data sets New big data workloads Multi tenancy version BigDataBench B subset version
35 BigDataBench 3.1 Overview BDGS(Big Data Generator Suite) for scalable data Wikipedia Entries Amazon Movie Reviews Google Web Graph Facebook Social NetworkE commerce Transaction ImageNet English broadcasting audio ProfSearch Resumes DVD Input Streams Image scene SoGou Data Genome sequence data Assembly of the human genome MNIST 14 Real world Data Sets Search Engine Multimedia Social E-commerce Network Bioinformatics 33 Workloads NoSql Impala Shark Hadoop RDMA MPI DataMPI Software Stacks
36 What s New in BigDataBench Methodology New application domains Specification for each domain Implementation Specificpurpose Version New real world data sets New big data workloads Multi tenancy version BigDataBench B subset version
37 BigDataBench Multi tenancy Version Scenarios of multiple tenants running heterogeneous applications in cloud datacenters Latency critical online services Latency insensitive offline batch applications Benchmarking scenarios Mining real world Workload traces (Google and Facebook) Profiling Realworld Workload traces Workload matching using Machine learning techniques Parametric workload generation tool Mixed workloads in public clouds Data analytical workloads in private clouds
38 BigDataBench Subset Motivation It is expensive to run all the benchmarks for system and architecture researches multiplied by different implementations BigDataBench 3.0 provides about 77 workloads
39 Methodology of Subsetting Identify a comprehensive set of workload characteristics from a specific perspective Eliminatethe the correlation data in those metrics Map the high dimension metrics to a low dimension Use the clustering method to classify Choose representative tti workloads from each category
40 Outline BigDataBench 3.1 Overview Dwarf Workloads in Big Data Analytics Insights
41 Why Dwarf Workloads in Big Data One attempt Ui Using popular workloads Subjective Another attempt Analytics How to define a representative big data benchmark? Fail to cover wide coverage and great complexity of big data Specific domains of systems Little analysis about representativen ess of workloads How can we construct a benchmark suite using a minimum set of units of computation to represent diversity of big data analytics? Dwarf workloads!
42 Significance A minimum set of necessary functionality A direction for evaluation and performance optimization A highly abstraction of patterns Dwarf Workloads Strong expression power pdf
43 Inspiration Successful Compute Abstractions Relational algebra 5 primitive operations Select, Project, Product, Unio n, Difference Successful Benchmarks TPC C C OLTP domain Functions of abstraction Parallel lcomputing HPCC Computational & High performance computing communication patterns Seven basically tests 13 dwarfs
44 Fundamental Issues What is the dwarf workloads in big data analytics? How to find?
45 Challenges #1 Massive Application Domains Big Data Massive application domains make us wonder where to start or how to achieve a wide range of coverage
46 Challenges #2 Multiple Research Fields and Techniques Deep learning Data mining Classification Machine learning Computer vision Clustering Data Mining Dimension reduction Natural language processing Signal processi ng Regression Association rule mining Many techniques for processing big data exist, which bring greater complexity for identifying dwarfs workloads
47 Challenges #3 Large Numbers of Algorithms and Variants A machine learning library scikitlearn, implements so many algorithms, which is still much less than the total number of algorithms learn.org/stable/tutorial/machine_learning_map/
48 Challenge #4 Unstructured Data with Complicated Operations 80% data growth are unstructured data Operations on big data are complicated Pipeline? Parallel? it off/2014/07/are you effectively using big data
49 Methodology of Dwarfs Application Big Data Analytics Representative Domain 1 Algorithms Processing Techniques Application Domain 2 Application Domain N Machine learning Data mining Deep learning Computer vision Natural language processing... Libraries Mllib,Mahout Frameworks Spark, Hadoop, GraphLab Frequently appearing operations Different combinations Dwarfs 1 Dwarfs 2 Benchmarks BigBench, LinkBench... Dwarfs M
50 Internet Services Search Engine Electronic Commerce Others 15% 5% Social Network Media Streaming 40% Taking up 80% of internet services according to page views and daily visitors 15% 25% Top 20 websites
51 The Explosive Growth of Multimedia Data new VIDEOS on YouTube every minute hours MUSIC streaming on PANDORA every minute new PHOTOS on FLICKR every minute VIDEO feeds from minutes VOICE calls on are surveillance cameras Skype every minute uments, content/uploads/2014/11/whatisbigdata DKB v2.pdf l / t t/ l d /2014/11/ h ti d t 2df data growth IMAGES, VIDEOS, doc
52 The Explosive Growth of Bioinformatics Data DDBJ/EMBL/GenBank database Growth Nucleotides Entries Nucleot tides (billion) Entr ries (million) e.html#dbgrowth graph
53 The Explosion Growth of Astronomy The last decade Data hundreds of terabytes of astronomical data for hundreds of millions of sources The next decade Large Synoptic Survey Telescope (LSST) One 6 Gigabyte 30 Terabytes image every every night 20 seconds for 10years 100 Petabyte final image data archive anticipated
54 Chosen Application Domain Nucleotides (b billion) Search Engine Social Network Electronic Commerce Media Streaming Others 15% 5% 40% Internet 15% tservice Search engine, Social network, E commerce 25% 50 Top 20 websites DDBJ/EMBL/GenBank database Growth Nucleotides 0 Bioinformatics lion) Entries (mill new VIDEOS on YouTube every minute VIDEO feeds from surveillance cameras hours MUSIC streaming on PANDORA every minute minutes VOICE calls on Skype every minute new PHOTOS on FLICKR every minute Multimedia data growth are IMAGES, VIDEOS, d ocuments, Large Synoptic Survey Telescope (LSST) One 6 Gigabyte image every 20 seconds 30 Terabytes every night for 10 years Astronomy 100 Petabyte final image data archive anticipated
55 Methodology of Dwarfs Application Big Data Analytics Representative Domain 1 Algorithms Processing Techniques Application Domain 2 Application Domain N Machine learning Data mining Deep learning Computer vision Natural language processing... Libraries Mllib,Mahout Frameworks Spark, Hadoop, GraphLab Frequently appearing operations Different combinations Dwarfs 1 Dwarfs 2 Benchmarks BigBench, LinkBench... Dwarfs M
56 Big Data Analytics Structured Semi Structured Unstructured Text Graph Table Multimedia Data Model Semantics Processing Techniques Open Source Projects Data Mining Machine Learning Libraries Frameworks Benchmarks
57 Libraries & Frameworks & Benchmarks Libraries Opencv, Mlib, lb Weka, AstroML Frameworks Spark, Hadoop, Graphlab Benchmarks BigDataBench, Linkbench
58 Methodology of Dwarfs Application Domain 1 Application Domain 2 Application Domain N Big Data Analytics Processing Techniques Machine learning Data mining Deep learning Computer vision Representative Algorithms Frequently appearing operations Dwarfs 1 Natural language processing... Dwarfs 2 Libraries Mllib,Mahout Frameworks Spark, Hadoop, GraphLab Different combinations Benchmarks BigBench, LinkBench... Dwarfs M
59 Algorithms Chosen to Investigate Computer Vision Bioinformatics Deep Learning MPEG 2, Scale invariant feature transform, Image segmentation, Ray Tracing Needleman Wunsch, Smith Waterman, BLAST CNN, DBN Data Mining & Machine Learning Natural Language Processing Recommendation C4.5/CART/ID3, Logistic regression, SVM, KNN, HMM, Maximumentropy markov model, Conditional random field, PageRank, HITS, Aporiori, FPgrowth, Principal component analysis, Back Propagation, Adaboost, MCMC, Connect edcomponent, Random forest Latent semantic indexing, plsi, Latent dirichlet allocation, Index, Porter Stemming, Sphinx speech recognition Demographic/Content based recommendation, Collaborative Filtering
60 Frequently appearing Operations Similarity Analysis KNN, K means, Recommendation, Feature matching, Image segmentation Neural Network Back propagation, CNN, DBN, Neural network Linear Algebra Multimedia Representation Sphinx speech recognition, MPEG 2, SIFT, Image segmentation, Ray tracing Matrix/Vector Calculation SVM, HMM, MEMM, CRF, PageRank, HITS, Logistic regression
61 Frequently appearing Operations(cont ) Similarity Analysis Jaccard similarity Locality sensitive hashing Set Operations Association Rules Mining Aporiori FP Growth Database St Set union Set difference
62 Frequently appearing Operations(cont ) Sampling MCMC, LDA, Rando m sampling, Downsam pling Transform Operation FFT, Convolution computation, DCT, S peech recognition, MPEG 2 Graph Operation BFS, DFS, Decision tree, Connected Component Logic Operation Encrpytion, index, Fin gerprint, SimHash, Mi nhash
63 Frequently appearing Operations(cont ) Two more primitive operations Statistic Operation Probability calculation LSI, plsi, Latent dirichlet allocation Sort Partial sort, quick sort, Top k sort K means, Decision tree
64 Methodology of Dwarfs Application Big Data Analytics Representative Domain 1 Algorithms Processing Techniques Application Domain 2 Application Domain N Machine learning Data mining Deep learning Computer vision Natural language processing... Libraries Mllib,Mahout Frameworks Spark, Hadoop, GraphLab Frequently appearing operations Different combinations Dwarfs 1 Dwarfs 2 Benchmarks BigBench, LinkBench... Dwarfs M
65 Finalized 8 Dwarfs workloads Linear Algebra Sort Sampling Transform operation Graph operation Logic operation Set operation Statistic operation
66 Properties Composability Algorithm can be composed of one or more dwarfs Dwarfs (one or more) Flow control Algorit hm Irreversibility combination is sensitive to the order of dwarfs Dwarfs A Dwarfs B Dwarfs B Dwarfs A Uniqueness Represent different patterns
67 Outline BigDataBench 3.1 Overview Dwarf Workloads in Big Data Analytics Insights
68 System Behaviors Diversified system level behaviors: 100% 80% 60% 40% 20% CPU utilization I/O wait ratio Percentage High CPU utilization & less I/O time Low CPU utilization relatively and lots of I/O time Mdi Medium CPU utilization 0% H-Grep(7) S-Kmeans(1) S-PageRank(1) H-Wor dcount(1) H-Bayes(1) M-Bayes M-Kmeans M-PageRank H- H-Diff I-Select S-Wor S- -Read(10) ference(9) tquery(9) dcount(8) -Project(4) S-OrderBy(3) H S-Grep(1) M-Grep -TPC-DS- I-OrderBy(7) S S -TPC-DS- -TPC-DS- S-Sort(1) M-WordCount M-Sort AVG_S_BigData time and I/O H-Grep(7) S-Kmeans(1) S-PageRank(1) H-WordCount(1) H-Bayes(1) M-Bayes M-Kmeans M-PageRank H-Read(10) H-Difference(9) I-SelectQuery(9) S-WordCount(8) S-Project(4) S-OrderBy(3) S-Grep(1) M-Grep H-TPC-DS- I-OrderBy(7) S-TPC-DS- S-TPC-DS- S-Sort(1) M-WordCount M-Sort AVG_S_BigData Weighted I/O Weighted disk I/O time ratio
69 Workloads Classification Finding from system behaviors System behaviors vary across different workloads Workloads can be divided into 3 categories:
70 IPC of BigDataBench vs. other benchmarks IPC H-Grep(7) S-Kmeans(1) S-PageRank(1) H-WordCount(1) H-Bayes(1) M-Bayes M-Kmeans M-PageRank H-Read(10) H-Difference(9) I-SelectQuery(9) S-WordCount(8) S-Project(4) S-OrderBy(3) S-Grep(1) M-Grep H-TPC-DS-query3(9) I-OrderBy(7) S-TPC-DS- S-TPC-DS-query8(1) S-Sort(1) M-WordCount M-Sort AVG_S_BigData TPC-C AVG_CloudSuite Avg_HPCC Avg_PARSEC AVG_SPECfp AVG_SPECint The average IPC of the big data workloads larger than CloudSuite, SPECFP and SPECINT, same like PARSEC and slightly smaller than HPCC The avrerage IPC of BigDataBench is 1.3 times of that of CloudSuite Some workloads have high IPC (M_Kmeans, S TPC DS Query8)
71 Instructions Mix of BigDataBench vs. other benchmarks Big data workloads are data movement dominated dcomputing with ihmore branch operations 92% percentage in terms of instruction mix (Load + Store + Branch + data movements of INT)
72 Cache Behaviors of BigDataBench vs. other benchmarks CPU intensive I/O intensive hybrid L1I MPKI Larger than traditional benchmarks, but lower than that of CloudSuite (12 Vs. 31) Different among big data workloads CPU intensive(8), I/O intensive(22), and hybrid workloads(9) One order of magnitude differences among diverse implementations M_WordCount is 2, while H_WordCount is 17
73 Cache Behaviors L2 Cache: The IO intensive workloads undergo more L2 MPKI L3 Cache: The average L3 MPKI of the big data workloads is smaller than all of the other workloads The underlying software stacks impact data locality MPI workloads have better data locality and less cache misses
74 Locality Instructions Cache miss ratio versus Cache size Data Cache miss ratio versus Cache size. Hadoop workloads have larger instructions footprint
75 Our observation from BigDataBench Unique characteristic data movement dominated computing with more branch operations 92% percentage in terms of instruction mix Locality Hadoop workloads have larger instructions footprint Different behaviors among Big Data workloads Disparity of ILP and memory access behaviors CloudSuite is a subclass of Big Data Software stacks impacts The L1I cache miss rates have one order of magnitude differences among diverse implementations with different software stacks.
76
Benchmarking and Ranking Big Data Systems
Benchmarking and Ranking Big Data Systems Xinhui Tian ICT, Chinese Academy of Sciences and University of Chinese Academy of Sciences INSTITUTE OF COMPUTING TECHNOLOGY Outline n BigDataBench n BigDataBench
Modern Processors using BigDataBench
Understanding Big Data Workloads on Modern Processors using BigDataBench Jianfeng Zhan http://prof.ict.ac.cn/bigdatabench Professor, ICT, Chinese Academy of Sciences and University of Chinese Academy of
How To Use A Webmail On A Pc Or Macodeo.Com
Big data workloads and real-world data sets Gang Lu Institute of Computing Technology, Chinese Academy of Sciences BigDataBench Tutorial MICRO 2014 Cambridge, UK INSTITUTE OF COMPUTING TECHNOLOGY 1 Five
BPOE Research Highlights
BPOE Research Highlights Jianfeng Zhan ICT, Chinese Academy of Sciences 2013-10- 9 http://prof.ict.ac.cn/jfzhan INSTITUTE OF COMPUTING TECHNOLOGY What is BPOE workshop? B: Big Data Benchmarks PO: Performance
BigDataBench: a Big Data Benchmark Suite from Internet Services
BigDataBench: a Big Data Benchmark Suite from Internet Services Lei Wang 1,7, Jianfeng Zhan 1, Chunjie Luo 1, Yuqing Zhu 1, Qiang Yang 1, Yongqiang He 2, Wanling Gao 1, Zhen Jia 1, Yingjie Shi 1, Shujie
arxiv:1505.06872v1 [cs.db] 26 May 2015
2015-5 arxiv:1505.06872v1 [cs.db] 26 May 2015 IDENTIFYING DWARFS WORKLOADS IN BIG DATA ANALYTICS Wanling Gao, Chunjie Luo, Jianfeng Zhan, Hainan Ye, Xiwen He, Lei Wang, Yuqing Zhu and Xinhui Tian Institute
Introducing EEMBC Cloud and Big Data Server Benchmarks
Introducing EEMBC Cloud and Big Data Server Benchmarks Quick Background: Industry-Standard Benchmarks for the Embedded Industry EEMBC formed in 1997 as non-profit consortium Defining and developing application-specific
On Big Data Benchmarking
On Big Data Benchmarking 1 Rui Han and 2 Xiaoyi Lu 1 Department of Computing, Imperial College London 2 Ohio State University [email protected], [email protected] Abstract Big data systems address
BigDataBench. Khushbu Agarwal
BigDataBench Khushbu Agarwal Last Updated: May 23, 2014 CONTENTS Contents 1 What is BigDataBench? [1] 1 1.1 SUMMARY.................................. 1 1.2 METHODOLOGY.............................. 1 2
On Big Data Benchmarking
On Big Data Benchmarking 1 Rui Han and 2 Xiaoyi Lu 1 Department of Computing, Imperial College London 2 Ohio State University [email protected], [email protected] Abstract Big data systems address
LeiWang, Jianfeng Zhan, ZhenJia, RuiHan
2015-6 CHARACTERIZATION AND ARCHITECTURAL IMPLICATIONS OF BIG DATA WORKLOADS arxiv:1506.07943v1 [cs.dc] 26 Jun 2015 LeiWang, Jianfeng Zhan, ZhenJia, RuiHan Institute Of Computing Technology Chinese Academy
How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first
The Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
CloudRank-D:A Benchmark Suite for Private Cloud Systems
CloudRank-D:A Benchmark Suite for Private Cloud Systems Jing Quan Institute of Computing Technology, Chinese Academy of Sciences and University of Science and Technology of China HVC tutorial in conjunction
Big Data: Image & Video Analytics
Big Data: Image & Video Analytics How it could support Archiving & Indexing & Searching Dieter Haas, IBM Deutschland GmbH The Big Data Wave 60% of internet traffic is multimedia content (images and videos)
IT services for analyses of various data samples
IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical
Big Data Systems CS 5965/6965 FALL 2015
Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html
Doing Multidisciplinary Research in Data Science
Doing Multidisciplinary Research in Data Science Assoc.Prof. Abzetdin ADAMOV CeDAWI - Center for Data Analytics and Web Insights Qafqaz University [email protected] http://ce.qu.edu.az/~aadamov 16 May
Big Data Systems CS 5965/6965 FALL 2014
Big Data Systems CS 5965/6965 FALL 2014 Today General course overview Q&A Introduction to Big Data Data Collection Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2014.html
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15
Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15 GENIVI is a registered trademark of the GENIVI Alliance in the USA and other countries Copyright GENIVI Alliance
AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW
AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
How To Write A Bigbench Benchmark For A Retailer
BigBench Overview Towards a Comprehensive End-to-End Benchmark for Big Data - bankmark UG (haftungsbeschränkt) 02/04/2015 @ SPEC RG Big Data The BigBench Proposal End to end benchmark Application level
Machine Learning over Big Data
Machine Learning over Big Presented by Fuhao Zou [email protected] Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed
Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.
Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics
Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme
Big Data Analytics Prof. Dr. Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie,
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
Concept and Project Objectives
3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the
Introduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
MACHINE LEARNING BASICS WITH R
MACHINE LEARNING [Hands-on Introduction of Supervised Machine Learning Methods] DURATION 2 DAY The field of machine learning is concerned with the question of how to construct computer programs that automatically
W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
NextGen Infrastructure for Big DATA Analytics.
NextGen Infrastructure for Big DATA Analytics. So What is Big Data? Data that exceeds the processing capacity of conven4onal database systems. The data is too big, moves too fast, or doesn t fit the structures
Graph Database Proof of Concept Report
Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment
Data Centric Computing Revisited
Piyush Chaudhary Technical Computing Solutions Data Centric Computing Revisited SPXXL/SCICOMP Summer 2013 Bottom line: It is a time of Powerful Information Data volume is on the rise Dimensions of data
Big Data Challenges in Bioinformatics
Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres [email protected] Talk outline! We talk about Petabyte?
Information Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli ([email protected])
Big Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,
The 3 questions to ask yourself about BIG DATA
The 3 questions to ask yourself about BIG DATA Do you have a big data problem? Companies looking to tackle big data problems are embarking on a journey that is full of hype, buzz, confusion, and misinformation.
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database
Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a
Sanjeev Kumar. contribute
RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 [email protected] 1. Introduction The field of data mining and knowledgee discovery is emerging as a
ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS
CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant
Big Data. Fast Forward. Putting data to productive use
Big Data Putting data to productive use Fast Forward What is big data, and why should you care? Get familiar with big data terminology, technologies, and techniques. Getting started with big data to realize
Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health
Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining
Advanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A SURVEY ON BIG DATA ISSUES AMRINDER KAUR Assistant Professor, Department of Computer
Big Data and Marketing
Big Data and Marketing Professor Venky Shankar Coleman Chair in Marketing Director, Center for Retailing Studies Mays Business School Texas A&M University http://www.venkyshankar.com [email protected]
AN EFFICIENT SELECTIVE DATA MINING ALGORITHM FOR BIG DATA ANALYTICS THROUGH HADOOP
AN EFFICIENT SELECTIVE DATA MINING ALGORITHM FOR BIG DATA ANALYTICS THROUGH HADOOP Asst.Prof Mr. M.I Peter Shiyam,M.E * Department of Computer Science and Engineering, DMI Engineering college, Aralvaimozhi.
Sense Making in an IOT World: Sensor Data Analysis with Deep Learning
Sense Making in an IOT World: Sensor Data Analysis with Deep Learning Natalia Vassilieva, PhD Senior Research Manager GTC 2016 Deep learning proof points as of today Vision Speech Text Other Search & information
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
The 4 Pillars of Technosoft s Big Data Practice
beyond possible Big Use End-user applications Big Analytics Visualisation tools Big Analytical tools Big management systems The 4 Pillars of Technosoft s Big Practice Overview Businesses have long managed
01219211 Software Development Training Camp 1 (0-3) Prerequisite : 01204214 Program development skill enhancement camp, at least 48 person-hours.
(International Program) 01219141 Object-Oriented Modeling and Programming 3 (3-0) Object concepts, object-oriented design and analysis, object-oriented analysis relating to developing conceptual models
BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. Aayush Agrawal
BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking Aayush Agrawal Last Updated: May 21, 2014 text CONTENTS Contents 1 Philosophy : 1 2 Requirements : 1 3 Observations : 2 3.1 Text Generator
Large-Scale Data Processing
Large-Scale Data Processing Eiko Yoneki [email protected] http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase
Using an In-Memory Data Grid for Near Real-Time Data Analysis
SCALEOUT SOFTWARE Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 IN today s competitive world, businesses
Business Challenges and Research Directions of Management Analytics in the Big Data Era
Business Challenges and Research Directions of Management Analytics in the Big Data Era Abstract Big data analytics have been embraced as a disruptive technology that will reshape business intelligence,
TE's Analytics on Hadoop and SAP HANA Using SAP Vora
TE's Analytics on Hadoop and SAP HANA Using SAP Vora Naveen Narra Senior Manager TE Connectivity Santha Kumar Rajendran Enterprise Data Architect TE Balaji Krishna - Director, SAP HANA Product Mgmt. -
Topics in basic DBMS course
Topics in basic DBMS course Database design Transaction processing Relational query languages (SQL), calculus, and algebra DBMS APIs Database tuning (physical database design) Basic query processing (ch
Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected]
Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected] Introduction http://stevenhoi.org/ Finance Recommender Systems Cyber Security Machine Learning Visual
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2
DATA SCIENCE CURRICULUM Before class even begins, students start an at-home pre-work phase. When they convene in class, students spend the first eight weeks doing iterative, project-centered skill acquisition.
NetView 360 Product Description
NetView 360 Product Description Heterogeneous network (HetNet) planning is a specialized process that should not be thought of as adaptation of the traditional macro cell planning process. The new approach
Big Data. Lyle Ungar, University of Pennsylvania
Big Data Big data will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus. McKinsey Data Scientist: The Sexiest Job of the 21st Century -
Data Science, Predictive Analytics & Big Data Analytics Solutions. Service Presentation
Data Science, Predictive Analytics & Big Data Analytics Solutions Service Presentation Did You Know That According to the new research from GE and Accenture*: 87% of companies believe Big Data analytics
Big Data: Rethinking Text Visualization
Big Data: Rethinking Text Visualization Dr. Anton Heijs [email protected] Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important
Introduction to Engineering Using Robotics Experiments Lecture 17 Big Data
Introduction to Engineering Using Robotics Experiments Lecture 17 Big Data Yinong Chen 2 Big Data Big Data Technologies Cloud Computing Service and Web-Based Computing Applications Industry Control Systems
Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014
Four Orders of Magnitude: Running Large Scale Accumulo Clusters Aaron Cordova Accumulo Summit, June 2014 Scale, Security, Schema Scale to scale 1 - (vt) to change the size of something let s scale the
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
What is Data Science? Data, Databases, and the Extraction of Knowledge Renée T., @becomingdatasci, November 2014
What is Data Science? { Data, Databases, and the Extraction of Knowledge Renée T., @becomingdatasci, November 2014 Let s start with: What is Data? http://upload.wikimedia.org/wikipedia/commons/f/f0/darpa
Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap
Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap 3 key strategic advantages, and a realistic roadmap for what you really need, and when 2012, Cognizant Topics to be discussed
Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
Industry Impact of Big Data in the Cloud: An IBM Perspective
Industry Impact of Big Data in the Cloud: An IBM Perspective Inhi Cho Suh IBM Software Group, Information Management Vice President, Product Management and Strategy email: [email protected] twitter: @inhicho
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Using In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
Hadoop-BAM and SeqPig
Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3 1 Department of Computer
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])
305 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference
Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc. 2011 2014. All Rights Reserved
Hortonworks & SAS Analytics everywhere. Page 1 A change in focus. A shift in Advertising From mass branding A shift in Financial Services From Educated Investing A shift in Healthcare From mass treatment
Network Big Data: Facing and Tackling the Complexities Xiaolong Jin
Network Big Data: Facing and Tackling the Complexities Xiaolong Jin CAS Key Laboratory of Network Data Science & Technology Institute of Computing Technology Chinese Academy of Sciences (CAS) 2015-08-10
COMP9321 Web Application Engineering
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411
HPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox [email protected] http://www.infomall.org
Machine Learning. 01 - Introduction
Machine Learning 01 - Introduction Machine learning course One lecture (Wednesday, 9:30, 346) and one exercise (Monday, 17:15, 203). Oral exam, 20 minutes, 5 credit points. Some basic mathematical knowledge
