Big data workloads and real-world data sets
Gang Lu
Institute of Computing Technology, Chinese Academy of Sciences
BigDataBench Tutorial, MICRO 2014, Cambridge, UK
Five domains
- Search engine
- Social network
- E-commerce
- Multimedia
- Bioinformatics
Search Engine
General search and vertical search
Online server and offline analytics
Search Engine: Parsing
- Parsing
  - Extract the text content and out-links from the raw web pages
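A minimal sketch of the parsing step, assuming the raw pages are HTML strings. The slides do not name a parser; this example uses Python's standard `html.parser`, and the names `PageParser` and `parse_page` are hypothetical.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects visible text and out-link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.out_links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.out_links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html):
    """Return (text_content, out_links) for one raw web page."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.out_links
```

The returned text would feed the indexing step, and the out-links would feed the link graph used for page ranking.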
Search Engine: Indexing
- Indexing
  - The process of creating the mapping from each term to its list of document IDs
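The term-to-document-ID mapping is commonly called an inverted index. A minimal in-memory sketch, assuming documents arrive as a dictionary from ID to text and that whitespace tokenization is good enough for illustration (the function name is hypothetical):

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it.

    docs: dict of {doc_id: text}. Tokenization here is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```

For example, indexing `{1: "big data bench", 2: "big graph"}` maps `"big"` to `[1, 2]` and `"graph"` to `[2]`.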
Search Engine: PageRank
- PageRank
  - Compute the importance of each page from the web link graph using PageRank
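A small power-iteration sketch of PageRank on an adjacency-list graph. The damping factor of 0.85 and the fixed iteration count are conventional choices assumed here, not values taken from the slides, and the function name is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict of {page: list of pages it links to}.
    Pages with no out-links spread their rank evenly over all pages.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by dangling pages is redistributed uniformly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank
```

The ranks sum to 1, so they can be read as the probability that a random surfer is on each page.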
Search Engine: Search query
- Querying
  - The online web search server serving users' requests
Search Engine: Sorting
- Sorting
  - Sort the results according to the page ranks and the relevance between the query and the document
Search Engine: Recommendation
- Recommendation
  - Recommend related queries to users by mining the search log
Search Engine: Statistic counting
- Statistic counting
  - Count word frequencies to extract the keywords that represent the features of the page
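Word-frequency counting can be sketched in a few lines. The stop-word list below is a tiny illustrative assumption, not part of the benchmark, and `top_keywords` is a hypothetical name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption of this sketch).
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def top_keywords(text, k=3):
    """Return the k most frequent non-stop-words as (word, count) pairs."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(k)
```

On a cluster the same counting is typically expressed as a map step that emits words and a reduce step that sums them; the single-process version above only shows the logic.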
Search Engine: Classification
- Classification
  - Classify text content into different categories so that users can filter the results to a specific category they are interested in
Search Engine: Filter & Semantic
- Filter
  - Identify pages with a specific topic, which can be used for vertical search
- Semantic extraction
  - Extract semantic information
Search Engine: Data access
- Data access operations
  - Read, write, and scan the semantic information
Social network
- Data sets
  - User table
  - Relation table
  - Article table
- Workloads
  - Offline analytics
Social network: Data schema
[Figure: schemas of the User table, Relation table, and Tweet table]
Social network: Workloads
- Hot review topic
  - Select the top N tweets by number of reviews.
- Hot transmit topic
  - Select the tweets that are transmitted more than N times.
- Active user
  - Select the top N people who post the largest number of tweets.
- Leader of opinion
  - Select the top ones whose numbers of reviews and transmits are both larger than N.
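The four selection workloads above can be sketched over a list of tweet records. The field names (`id`, `user`, `reviews`, `transmits`) are hypothetical stand-ins for the tweet table columns, and "leader of opinion" is interpreted here as a per-user total, which is an assumption because the slide does not say whether the threshold applies per tweet or per user.

```python
import heapq
from collections import Counter


def hot_review_tweets(tweets, n):
    """Top n tweets by number of reviews."""
    return heapq.nlargest(n, tweets, key=lambda t: t["reviews"])


def hot_transmit_tweets(tweets, n):
    """Tweets transmitted more than n times."""
    return [t for t in tweets if t["transmits"] > n]


def active_users(tweets, n):
    """Top n users by number of tweets posted, as (user, count) pairs."""
    return Counter(t["user"] for t in tweets).most_common(n)


def opinion_leaders(tweets, n):
    """Users whose total reviews and total transmits both exceed n."""
    reviews, transmits = Counter(), Counter()
    for t in tweets:
        reviews[t["user"]] += t["reviews"]
        transmits[t["user"]] += t["transmits"]
    return sorted(u for u in reviews if reviews[u] > n and transmits[u] > n)
```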
Social network: Workloads
- Topic classify
  - Classify the tweets into categories according to their topic.
- Sentiment classify
  - Classify the tweets as negative or positive according to their sentiment.
- Friend recommendation
  - Recommend friends to a person according to the relational graph.
- Community detection
  - Detect clusters or communities in large social networks.
- Breadth first search
  - Sort persons according to the distance between two people.
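The breadth-first search workload can be illustrated on a small friendship graph. This sketch assumes the relation table has already been loaded as an adjacency list and measures distance in hops from one starting person; both function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, start):
    """Hop distance from start to every reachable person.

    graph: dict of {person: list of friends}.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph.get(person, ()):
            if friend not in dist:
                dist[friend] = dist[person] + 1
                queue.append(friend)
    return dist


def sort_by_distance(graph, start):
    """Reachable people ordered by distance from start, ties broken by name."""
    dist = bfs_distances(graph, start)
    return sorted(dist, key=lambda p: (dist[p], p))
```

People who are not reachable from the starting person are simply absent from the result.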
Specification: E-commerce
[Figure: schemas of the Order table and Item table]
E-commerce
- Data sets
  - Order table
  - Item table
- Workloads
  - Offline analytics
E-commerce: Workloads
- Select query
  - Find the items whose sales amount is over 100 in a single order.
- Aggregation query
  - Count the number of sales of each good.
- Join query
  - Count the number of each good that each buyer purchased within a certain period of time.
- Recommendation
  - Predict the preferences of buyers and recommend goods to them.
- Sentiment classification
  - Identify positive or negative reviews.
- Basic data operation
  - Unit of operation on the data

The select, aggregation, and join workloads are similar to the queries used in A. Pavlo's SIGMOD '09 paper, but are specified by BigDataBench for the e-commerce environment.
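To make the three relational workloads concrete, here is a sketch that runs them on an in-memory SQLite database. The column names and sample rows are assumptions for illustration only, not the benchmark's schema or its actual query text, and the order table is named `orders` because `ORDER` is a reserved word in SQL.

```python
import sqlite3


def run_queries(start_date, end_date):
    """Run toy select, aggregation, and join queries; return their rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema and sample data.
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER, buyer_id INTEGER, create_date TEXT);
        CREATE TABLE item (item_id INTEGER, order_id INTEGER, goods_id INTEGER,
                           goods_number INTEGER, goods_amount REAL);
        INSERT INTO orders VALUES (1, 10, '2014-01-05'), (2, 11, '2014-02-10'),
                                  (3, 10, '2014-03-01');
        INSERT INTO item VALUES (1, 1, 100, 2, 150.0), (2, 1, 101, 1, 20.0),
                                (3, 2, 100, 1, 75.0), (4, 3, 101, 3, 60.0);
    """)
    # Select: items whose sales amount in a single order exceeds 100.
    select_rows = conn.execute(
        "SELECT item_id, goods_id FROM item "
        "WHERE goods_amount > 100 ORDER BY item_id").fetchall()
    # Aggregation: number of units sold per good.
    aggregation_rows = conn.execute(
        "SELECT goods_id, SUM(goods_number) FROM item "
        "GROUP BY goods_id ORDER BY goods_id").fetchall()
    # Join: units of each good bought by each buyer within a date range.
    join_rows = conn.execute(
        "SELECT o.buyer_id, i.goods_id, SUM(i.goods_number) "
        "FROM item i JOIN orders o ON i.order_id = o.order_id "
        "WHERE o.create_date BETWEEN ? AND ? "
        "GROUP BY o.buyer_id, i.goods_id "
        "ORDER BY o.buyer_id, i.goods_id",
        (start_date, end_date)).fetchall()
    conn.close()
    return select_rows, aggregation_rows, join_rows
```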
Multimedia
[Figure: multimedia processing workflow. Labels, in order: Voice Data Extraction, Speech Recognition, Video Data, MPEG Decoder, Frame Data Extraction, Feature Extraction, Image Segmentation, Face Detection, Three-Dimensional Reconstruction, Tracing]
Multimedia: Workloads
- MPEG Decoder
  - Decode video streams using the MPEG-2 standard.
- Feature extraction
  - For a given video frame, extract features that are invariant to scale, noise, and illumination.
- Speech Recognition
  - For a given audio file, recognize the content of the file and determine whether it contains sensitive words.
Multimedia: Workloads
- Ray Tracing
  - Render a 2-dimensional video frame into a 3-dimensional scene.
- Image Segmentation
  - Segment the input video frame according to color, intensity, and texture, and extract the regions of interest.
- Face Detection
  - Detect whether a face exists in the input data; if so, extract the face.
- Deep Learning
  - Classify the input images into different categories, and then detect human faces.
Bioinformatics
- Sequence assembly
  - Assemble scattered and repetitive DNA fragments into the original long sequence.
- Sequence alignment
  - Align the assembled DNA sequence to known sequences in the database, and detect disease.

[Figure: workflow. Gene Sequencing, Genome Sequence Data, Sequence Assembly, Sequence Mapping, Sequence Alignment, Detection Result]
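The slides do not name an alignment algorithm. As one illustration of what sequence alignment computes, the sketch below scores a global alignment with the classic Needleman-Wunsch dynamic program; the scoring parameters (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b.

    Uses two rolling rows of the dynamic-programming table, so memory
    grows with len(b) rather than len(a) * len(b).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

A higher score means the assembled sequence is closer to the known reference sequence it is compared against.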
Summary: Real data sets
Summary: Search Engine (various implementations)
Summary: Social network
Summary: E-commerce
Summary: Multimedia
Summary: Bioinformatics
Any questions?