BIG DATA IN SCIENCE & EDUCATION

1 BIG DATA IN SCIENCE & EDUCATION SURFsara Data & Computing Infrastructure Event, 12 March 2014 Djoerd Hiemstra

2 WHY BIG DATA?

3 Source: Jimmy Lin &

4 19 May 2012: 234 people reach the top

5 James Hays and Alexei Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH), 26(3),


9 THE PROGRAM VS. THE DATA...

10 Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2),

11 THERE IS NO DATA LIKE MORE DATA

12 Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the ACL,

13 Thorsten Brants, Ashok Popat, Peng Xu, Franz Och, Jeffrey Dean. Large Language Models in Machine Translation. In: Proceedings of EMNLP,

14 How to get here if you are (not) Google? Thorsten Brants, Ashok Popat, Peng Xu, Franz Och, Jeffrey Dean. Large Language Models in Machine Translation. In: Proceedings of EMNLP,

15 HOW? TEACH A COURSE (and get a DIY datacenter)


17 COURSE: MANAGING BIG DATA
M.Sc. course, Computer Science, with Maarten Fokkinga and Robin Aly
First edition: Nov Feb

18 COURSE: MANAGING BIG DATA
File systems (Google File System)
New storage model (BigTable, Cassandra)
Programming paradigm (MapReduce)
Programming languages (Haskell, Java, ...)
New query languages (Sawzall, Pig, ...)
...

19 MAP/REDUCE
"A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs."
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th OSDI Symposium,

20 MAP/REDUCE
More simply, MapReduce is a parallel programming model (and implementation).

21 MAP/REDUCE PROGRAMMING MODEL
Process data using map() and reduce() functions.
The map() function is called on every item in the input and emits intermediate key/value pairs.
All values associated with a given key are grouped together.
The reduce() function is called on every unique key and its value list, and emits output values.

22 MAP/REDUCE: PROGRAMMING MODEL
More formally:
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)
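The programming model above can be sketched in a few lines of single-process Python. This is a minimal illustration, not Hadoop's or Google's implementation: the name run_mapreduce, the in-memory grouping step, and the temperature example are all assumptions made for this sketch, and keys are sorted only to make the output deterministic.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")


def run_mapreduce(
    records: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], Iterable[V2]],
) -> List[Tuple[K2, V2]]:
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    # Map phase: call mapper on every input item and group the emitted
    # intermediate values by their key.
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)

    # Reduce phase: call reducer once per unique key with its value list.
    output: List[Tuple[K2, V2]] = []
    for k2 in sorted(groups):
        for value in reducer(k2, groups[k2]):
            output.append((k2, value))
    return output


# Illustrative input: (record id, (city, temperature)) pairs.
records = [(1, ("Enschede", 12)), (2, ("Amsterdam", 9)), (3, ("Enschede", 15))]


def city_temp_mapper(record_id, reading):
    city, temp = reading
    yield city, temp


def max_reducer(city, temps):
    yield max(temps)


result = run_mapreduce(records, city_temp_mapper, max_reducer)
print(result)
```

The user only writes the two small functions; everything about grouping and ordering lives in the driver, which is the part a real runtime distributes over machines.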

23 MAP/REDUCE: WORD COUNT EXAMPLE
mapper (DocId, DocText) =
  FOREACH Word IN DocText
    OUTPUT(Word, 1)
reducer (Word, Counts) =
  Sum = 0
  FOREACH Count IN Counts
    Sum = Sum + Count
  OUTPUT(Word, Sum)
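A runnable Python version of the word count example might look as follows. It is a sketch under stated assumptions: words are split on whitespace and counted case-sensitively, and the word_count helper that stands in for the grouping step is hypothetical, not part of any MapReduce library. The reducer emits the accumulated total for each word.

```python
from collections import defaultdict


def mapper(doc_id, doc_text):
    # Emit (word, 1) for every word in the document.
    for word in doc_text.split():
        yield word, 1


def reducer(word, counts):
    # Sum all counts for one word and emit the total.
    total = 0
    for count in counts:
        total += count
    yield word, total


def word_count(docs):
    """Hypothetical stand-in for the runtime: group mapper output by word."""
    grouped = defaultdict(list)
    for doc_id, doc_text in docs.items():
        for word, one in mapper(doc_id, doc_text):
            grouped[word].append(one)

    result = {}
    for word, counts in grouped.items():
        for key, total in reducer(word, counts):
            result[key] = total
    return result


docs = {"d1": "How now brown cow", "d2": "How does it work now"}
counts = word_count(docs)
print(counts)
```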

24 MAP/REDUCE: WORD COUNT EXAMPLE
[Diagram: map (M) and reduce (R) tasks]
Input (distributed file system): How now brown cow How does it work now
Map output (local file systems): <How,1> <now,1> <brown,1> <cow,1> <How,1> <does,1> <it,1> <work,1> <now,1>
Grouped by key: <How,[1,1]> <now,[1,1]> <brown,[1]> <cow,[1]> <does,[1]> <it,[1]> <work,[1]>
Reduce output (distributed file system): brown 1, cow 1, does 1, How 2, it 1, now 2, work 1

25 HADOOP RUNTIME SYSTEM
1. Partitions input data
2. Schedules execution across machines
3. Handles machine failure
4. Manages interprocess communication
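To make the first three responsibilities concrete, here is a toy single-machine analogue in Python. It is not how Hadoop is implemented: threads stand in for machines, a RuntimeError stands in for a machine failure, interprocess communication is not modelled, and the names split_input, run_with_retry, and run_map_tasks are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def split_input(items, num_splits):
    """1. Partition the input into roughly equal contiguous splits."""
    size = max(1, -(-len(items) // num_splits))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_with_retry(task, split, max_attempts=3):
    """3. Re-run a failed task, giving up after max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(split)
        except RuntimeError:
            if attempt == max_attempts:
                raise


def run_map_tasks(items, task, num_splits=4, max_workers=2):
    """2. Schedule one task per split on a small pool of workers."""
    splits = split_input(items, num_splits)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the splits.
        return list(pool.map(lambda s: run_with_retry(task, s), splits))


# Simulated flaky task: the first attempt on every split fails.
failed_once = set()
lock = threading.Lock()


def flaky_sum(split):
    key = tuple(split)
    with lock:
        first_time = key not in failed_once
        failed_once.add(key)
    if first_time:
        raise RuntimeError("simulated machine failure")
    return sum(split)


partials = run_map_tasks(list(range(10)), flaky_sum, num_splits=3)
total = sum(partials)
print(partials, total)
```

Because each task only reads its own split and has no side effects on the others, re-running it after a failure is safe; that property is what lets a runtime hide failures from the programmer.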

26 CASE STUDY: CLUEWEB09
Web crawl of 1 billion pages (25 TB), crawled in Jan. Feb
Using only the English pages (0.5 billion)
Rebuild Google's experimental infrastructure
Jeffrey Dean. Challenges in building large-scale information retrieval systems. In Proceedings WSDM

27 BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM
1. Less time coding and debugging
2. Easy to include new information that is not in the engine's standard inverted index
3. Oversee all the code in the experiment
4. Large-scale experiments in reasonable time
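One way to picture brute-force search without an inverted index is a mapper that scans every document against every query and a reducer that keeps the best documents per query. The sketch below is an illustration only: the scoring function (a raw count of query-term occurrences), the top-k cutoff, and all function names are assumptions, not the method or code of the experiments described here.

```python
import heapq
from collections import defaultdict


def search_mapper(doc_id, doc_text, queries):
    # Sequentially score this document against every query.
    terms = doc_text.lower().split()
    for query_id, query in queries.items():
        # Assumed toy score: number of occurrences of the query terms.
        score = sum(terms.count(t) for t in query.lower().split())
        if score > 0:
            yield query_id, (score, doc_id)


def search_reducer(query_id, scored_docs, k=2):
    # Keep the k highest-scoring documents for this query.
    top = heapq.nlargest(k, scored_docs)
    yield query_id, [doc_id for score, doc_id in top]


def brute_force_search(docs, queries, k=2):
    """Hypothetical in-memory driver standing in for the MapReduce runtime."""
    grouped = defaultdict(list)
    for doc_id, doc_text in docs.items():
        for query_id, scored in search_mapper(doc_id, doc_text, queries):
            grouped[query_id].append(scored)

    results = {}
    for query_id, scored_docs in grouped.items():
        for qid, ranking in search_reducer(query_id, scored_docs, k):
            results[qid] = ranking
    return results


docs = {
    "d1": "brown cow eats grass",
    "d2": "how now brown cow brown cow",
    "d3": "map reduce on large clusters",
}
queries = {"q1": "brown cow", "q2": "large clusters"}
results = brute_force_search(docs, queries)
print(results)
```

Trying a different ranking idea means changing only the scoring line in the mapper, which is the sense in which this style keeps all the code of an experiment in view.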

28 CONCLUSION
Brute-force sequential search is feasible.
Faster turnaround of the experimental cycle:
Faster coding = more experiments and more data = more improvement of search quality = better system!

29 MORE INFO
Djoerd Hiemstra and Claudia Hauff. MapReduce for information retrieval evaluation. In: CLEF, Multilingual and Multimodal Information Access Evaluation, pages 64-69, 2010
Software open source at
Wait: whereabouts is this on Google's graph?

30 We were here in 2010! Thorsten Brants, Ashok Popat, Peng Xu, Franz Och, Jeffrey Dean. Large Language Models in Machine Translation. In: Proceedings of EMNLP,

31 TODAY
CommonCrawl: 7 billion web pages, > 100 TB uncompressed

32 YOU CAN BE GOOGLE! (in 3 to 4 years)


35 ACKNOWLEDGEMENTS
Yahoo Research, Barcelona
Netherlands Organization for Scientific Research (NWO), grant
Common Crawl
SURFsara
