Veracity in Big Data: Reliability of Routes
Dr. Tobias Emrich, Post-Doctoral Scholar
Integrated Media Systems Center (IMSC), Viterbi School of Engineering, University of Southern California
Los Angeles, CA 90089-0781
emrich@usc.edu
OUTLINE: Big (Uncertain) Data; Reliability in Traffic Networks; From Uncertainty to Reliability; Outlook
BIG DATA
"Big data is like [...]: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it..." (Dan Ariely, Center for Advanced Hindsight at Duke University)
BIG DATA
Studies* on Microsoft and Yahoo production clusters: the median Hadoop job is ~13 GB; 90% of the jobs are < 100 GB.
*A. Rowstron, D. Narayanan, A. Donnelly, G. O'Shea and A. Douglas: "Nobody ever got fired for using Hadoop on a cluster", Proceedings of HotCDP, April 2012.
BIG DATA: Variety, Volume, Veracity, Velocity, Value
Uncertainty in (not so big) Data
J. Niedermayer, A. Züfle, T. Emrich, M. Renz, N. Mamoulis, L. Chen, H.-P. Kriegel: Probabilistic Nearest Neighbor Queries on Uncertain Moving Object Trajectories. In Proceedings of the 40th International Conference on Very Large Data Bases (VLDB), Hangzhou, China: 205-216, 2014.
P. Zhang, R. Cheng, N. Mamoulis, M. Renz, A. Züfle, Y. Tang, T. Emrich: Voronoi-based Nearest Neighbor Search for Multi-Dimensional Uncertain Databases. In Proceedings of the 29th International Conference on Data Engineering (ICDE), Brisbane, Australia: 158-169, 2013.
J. Niedermayer, A. Züfle, T. Emrich, M. Renz, N. Mamoulis, L. Chen, H.-P. Kriegel: Similarity Search on Uncertain Spatio-temporal Data. In Proceedings of the 6th International Conference on Similarity Search and Applications (SISAP), Coruna, Spain: 43-49, 2013.
T. Emrich, H.-P. Kriegel, J. Niedermayer, M. Renz, A. Suhartha, A. Züfle: Exploration of Monte-Carlo based probabilistic query processing in uncertain graphs. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM), Maui, HI: 2728-2730, 2012.
T. Emrich, H.-P. Kriegel, N. Mamoulis, M. Renz, A. Züfle: Indexing uncertain spatio-temporal data. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM), Maui, HI: 395-404, 2012.
T. Bernecker, T. Emrich, H.-P. Kriegel, M. Renz, A. Züfle: Probabilistic Ranking in Fuzzy Object Databases. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM), Maui, HI: 2647-2650, 2012.
N. Hubig, A. Züfle, T. Emrich, M. A. Nascimento, M. Renz, H.-P. Kriegel: Continuous Probabilistic Sum Queries in Wireless Sensor Networks with Ranges. In Proceedings of the 24th International Conference on Scientific and Statistical Database Management (SSDBM), Chania, Crete, Greece: 96-105, 2012.
T. Emrich, H.-P. Kriegel, N. Mamoulis, M. Renz, A. Züfle: Querying Uncertain Spatio-Temporal Data. In Proceedings of the 28th International Conference on Data Engineering (ICDE), Washington, DC, 2012.
T. Bernecker, T. Emrich, H.-P. Kriegel, N. Mamoulis, M. Renz, A. Züfle: A Novel Probabilistic Pruning Approach to Speed Up Similarity Queries in Uncertain Databases. In Proceedings of the 27th International Conference on Data Engineering (ICDE), Hannover, Germany: 339-350, 2011.
T. Bernecker, L. Chen, T. Emrich, H.-P. Kriegel, N. Mamoulis, A. Züfle: Managing Uncertain Spatio-Temporal Data. In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Querying and Mining Uncertain Spatio-Temporal Data (QUeST), Chicago, IL: 16-20, 2011.
T. Bernecker, T. Emrich, H.-P. Kriegel, M. Renz, S. Zankl, A. Züfle: Efficient Probabilistic Reverse Nearest Neighbor Query Processing on Uncertain Data. In Proceedings of the 37th International Conference on Very Large Data Bases (VLDB), Seattle, WA: 669-680, 2011.
Uncertainty
Databases -> Query / Data Mining -> Result: 1) Efficient, 2) No confidence
Uncertainty is inherent in many datasets: automated extraction of information from HTML, sensor readings, human observations, predictions
Result under uncertainty: 1) Efficient algorithms needed, 2) Confidence attached
Reliability in Traffic Networks
Route A: ~53 min; Route B: ~58 min
Usually, I take route A. When I have a meeting in the morning, I take route B.
Reliability in Traffic Networks
Travel time prediction, incorporating uncertainty (Route A: ~53 min; Route B: ~58 min)
[Figure: probability distributions of travel time for Route A and Route B; x-axis: Travel Time (35 to 80), y-axis: Probability (0 to 0.3)]
Reliability in Traffic Networks
The predicted travel time of a route varies due to: imprecise prediction of traffic flow; unpredictable accidents; changing weather conditions
Routes differ in the variance of their travel time
Starting at 9am: Route A (~53 min) arrives before 10am with a probability of 89%; Route B (~58 min) arrives before 10am with a probability of 99.2%
To reach 99.2% on route A, I have to leave 10 minutes earlier
[Figure: probability distributions of travel time for Route A and Route B; x-axis: Travel Time (35 to 80), y-axis: Probability (0 to 0.3)]
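The arrival probabilities above can be read off a travel-time distribution as cumulative probabilities. The following is a minimal sketch, assuming discrete distributions; the two histograms and the names `ROUTE_A`, `ROUTE_B`, `on_time_probability` and `required_budget` are hypothetical, with numbers invented only so that they roughly reproduce the figures on the slide.

```python
# Hypothetical discrete travel-time distributions: minutes -> probability.
# The numbers are invented so that they roughly match the slide's figures.
ROUTE_A = {45: 0.10, 50: 0.35, 55: 0.30, 60: 0.14, 65: 0.06, 70: 0.05}
ROUTE_B = {55: 0.30, 58: 0.50, 60: 0.192, 65: 0.008}

EPS = 1e-9  # tolerance for floating-point rounding


def on_time_probability(dist, budget):
    """Probability that the travel time does not exceed `budget` minutes."""
    return sum(p for minutes, p in dist.items() if minutes <= budget)


def required_budget(dist, target):
    """Smallest time budget whose on-time probability reaches `target`."""
    cumulative = 0.0
    for minutes in sorted(dist):
        cumulative += dist[minutes]
        if cumulative + EPS >= target:
            return minutes
    raise ValueError("target probability cannot be reached")


if __name__ == "__main__":
    budget = 60  # leave at 9am, arrive by 10am
    print(f"A on time: {on_time_probability(ROUTE_A, budget):.3f}")
    print(f"B on time: {on_time_probability(ROUTE_B, budget):.3f}")
    extra = required_budget(ROUTE_A, 0.992) - budget
    print(f"Leave {extra} min earlier on A to reach 99.2%")
```

The route with the lower typical travel time is not necessarily the more reliable one: what matters for a deadline is the probability mass below the time budget, not the peak of the distribution.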
From Uncertainty to Reliability
Current approaches: Traditional
[Figure: route from source S (departure 9.00pm) to destination D; TT = 14 min]
From Uncertainty to Reliability
Current approaches: Traditional (TT = 14 min); ClearPath (TT = 12 min)
[Figure: routes from source S (departure 9.00pm) to destination D]
From Uncertainty to Reliability
Considering uncertainty of predictions
[Figure: route from source S (departure 9.00pm) to destination D]
How to predict? How to model?
From Uncertainty to Reliability
Considering uncertainty of predictions
[Figure: route from source S (departure 9.00pm) to destination D]
How to predict? How to model? What time to use?
From Uncertainty to Reliability
Considering uncertainty of predictions
[Figure: route from source S (departure 9.00pm) to destination D; TT = ? min]
How to predict? How to model? What time to use? How to add up uncertain travel times?
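One possible way to approach "What time to use?" and "How to add up uncertain travel times?" is to propagate a distribution over arrival times edge by edge, looking up each edge's travel-time distribution at every possible entry time. The sketch below is an illustration under stated assumptions, not the method of the talk: the function `propagate`, the toy edge models `first_edge` and `second_edge`, and all numbers are hypothetical, and the edges are assumed to be independent given the entry time.

```python
from collections import defaultdict


def propagate(arrival, edge_travel_time):
    """Arrival-time distribution at the end of an edge.

    arrival: dict mapping entry minute at the edge start -> probability
    edge_travel_time: function(minute) -> dict of travel minutes -> probability
    Assumes the edge travel time depends only on the entry time
    (no further correlation between edges).
    """
    result = defaultdict(float)
    for t, p_t in arrival.items():
        for tt, p_tt in edge_travel_time(t).items():
            result[t + tt] += p_t * p_tt
    return dict(result)


# Hypothetical edge models (all numbers are made up for illustration).
def first_edge(minute):
    return {5: 0.5, 10: 0.5}


def second_edge(minute):
    # In this toy model, entering later means hitting more congestion.
    if minute < 8:
        return {4: 0.8, 8: 0.2}
    return {4: 0.4, 8: 0.6}


if __name__ == "__main__":
    start = {0: 1.0}  # departure at minute 0 with certainty
    after_first = propagate(start, first_edge)
    after_second = propagate(after_first, second_edge)
    for minute in sorted(after_second):
        print(minute, round(after_second[minute], 3))
```

When an edge's distribution does not depend on the entry time, `propagate` reduces to the ordinary convolution of two independent discrete distributions.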
From Uncertainty to Reliability
Considering uncertainty of predictions
[Figure: route from source S (departure 9.00pm) to destination D; TT = ? min]
How to predict? How to model? What time to use? How to add up uncertain travel times? How to deal with correlations?
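Why correlations matter can be seen on a toy example with two edges that have identical marginal distributions but different joint behavior. This is a minimal sketch, assuming the correlation is given as an explicit joint distribution; the names `sum_distribution`, `prob_within`, `INDEPENDENT` and `CORRELATED` and all numbers are hypothetical.

```python
from collections import defaultdict


def sum_distribution(joint):
    """Distribution of t1 + t2 given a joint distribution {(t1, t2): p}."""
    total = defaultdict(float)
    for (t1, t2), p in joint.items():
        total[t1 + t2] += p
    return dict(total)


def prob_within(dist, budget):
    """Probability that the total travel time does not exceed `budget`."""
    return sum(p for t, p in dist.items() if t <= budget)


# Two hypothetical edges, each taking 5 or 10 minutes with probability 0.5.
INDEPENDENT = {(5, 5): 0.25, (5, 10): 0.25, (10, 5): 0.25, (10, 10): 0.25}
# Same marginals, but both edges are always congested together.
CORRELATED = {(5, 5): 0.5, (10, 10): 0.5}

if __name__ == "__main__":
    for name, joint in [("independent", INDEPENDENT), ("correlated", CORRELATED)]:
        dist = sum_distribution(joint)
        print(name, dist, "P(total <= 15) =", prob_within(dist, 15))
```

Both cases have the same expected total travel time, but the probability of staying within a 15-minute budget differs, so treating correlated edges as independent can misstate the reliability of a route.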
Outlook
Evaluation of the quality of the result
Efficient online prediction
Extension to new query mechanisms: When do I have to start (and which route do I have to take) when I want to be at USC at 8.00am (or before) with a probability of 99%?
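The departure-time query above could, in a strongly simplified setting, be answered by computing for each route the time budget needed to reach the target probability and subtracting it from the deadline. The sketch below assumes time-independent, discrete travel-time distributions, which ignores the time dependence and correlations discussed earlier; `ROUTES`, `required_budget`, `latest_departure` and `as_clock` are hypothetical names, and the numbers are invented.

```python
def required_budget(dist, target):
    """Smallest travel-time budget (minutes) met with probability >= target."""
    cumulative = 0.0
    for minutes in sorted(dist):
        cumulative += dist[minutes]
        if cumulative + 1e-9 >= target:  # tolerate floating-point rounding
            return minutes
    raise ValueError("target probability cannot be reached")


def latest_departure(routes, deadline, target):
    """Return (route name, latest departure minute) meeting the target."""
    best = None
    for name, dist in routes.items():
        departure = deadline - required_budget(dist, target)
        if best is None or departure > best[1]:
            best = (name, departure)
    return best


def as_clock(minute_of_day):
    """Format minutes since midnight as H:MM."""
    return f"{minute_of_day // 60}:{minute_of_day % 60:02d}"


# Hypothetical, time-independent distributions: minutes -> probability.
ROUTES = {
    "A": {45: 0.10, 50: 0.35, 55: 0.30, 60: 0.14, 65: 0.06, 70: 0.05},
    "B": {55: 0.30, 58: 0.50, 60: 0.192, 65: 0.008},
}

if __name__ == "__main__":
    route, departure = latest_departure(ROUTES, deadline=8 * 60, target=0.99)
    print(f"Take route {route}, leave by {as_clock(departure)}")
```

With time-dependent travel times the budget itself depends on the departure time, so a realistic answer would need a search over candidate departure times rather than a single subtraction.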
Questions?
Dr. Tobias Emrich, emrich@usc.edu
Uncertainty in Databases
Uncertainty is inherent in many datasets:
Automated extraction of information from HTML (e.g. "John works at Google" vs. "John works at Microsoft")
Sensor readings (e.g. RFID sensors tracking the position of customers)
Human observations (e.g. the bird seen was either a raven (75%) or a crow (25%))
Predictions (e.g. tomorrow it is going to rain (10%) or not (90%))
Two approaches to solve this:
Cleaning (i.e. get rid of the uncertainty)
Management (i.e. handle the uncertainty)
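The "management" approach can be illustrated by keeping the alternatives of each uncertain record together with their probabilities and returning query results with a confidence attached. This is a minimal sketch, not a description of any particular uncertain database system: `OBSERVATIONS`, `select_with_confidence` and `prob_any` are hypothetical names, the second observation is invented, and the observations are assumed to be independent.

```python
# Each uncertain observation lists mutually exclusive alternatives
# together with their probabilities.
OBSERVATIONS = {
    "obs1": {"Raven": 0.75, "Crow": 0.25},  # example from the slide
    "obs2": {"Raven": 0.40, "Crow": 0.60},  # hypothetical second observation
}


def select_with_confidence(observations, value, threshold=0.0):
    """Return (id, confidence) pairs for observations that may equal `value`."""
    hits = []
    for obs_id, alternatives in observations.items():
        confidence = alternatives.get(value, 0.0)
        if confidence > threshold:
            hits.append((obs_id, confidence))
    return hits


def prob_any(observations, value):
    """Probability that at least one observation equals `value`,
    assuming the observations are independent."""
    none = 1.0
    for alternatives in observations.values():
        none *= 1.0 - alternatives.get(value, 0.0)
    return 1.0 - none


if __name__ == "__main__":
    print(select_with_confidence(OBSERVATIONS, "Raven", threshold=0.5))
    print(round(prob_any(OBSERVATIONS, "Raven"), 3))
```

Cleaning would instead replace each observation by a single value (for example its most likely alternative) and discard the probabilities, which keeps queries simple but loses the confidence information.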