BIG DATA @ EMITLAB & CIDSE K. Selçuk Candan candan@asu.edu
Name: K. Selçuk Candan
  Professor of Computer Science and Engineering at CIDSE, ASU
  Director, Enterprise, Media, and Information Technologies Labs (EmitLab)
  Fulton Schools of Engineering Exemplar Faculty
  Senior Sustainability Scientist, Global Institute of Sustainability
EmitLab
  K. Selçuk Candan (KSC)
  Xiaolan Wang, Ex-MS (now at U.Mass)
  Sriram Rathinavelu, Ex-MS
  Mijung Kim, Ex-PhD (now at HP Labs)
  Aneesha Bhat, MS
  Jung Hyun Kim, PhD
  Mithila Nagendra, Ex-PhD (now at Akamai)
  Yash Garg, MS
  Parth Nagarkar, PhD
  Xinsheng Liu, PhD
  Sicong Liu, PhD
  Marco Berchiatti, MS (U. Torino)
  Shengyu Huang, PhD
  Adam Tse, Undergrad
  Xilun Chen, PhD
  Leonardo Allisio, MS (U. Torino)
  Silvestro Poccia, Research Technologist
  Ilaria Dal Grande, MS (U. Torino)
  Rosaria Rossini, PhD (U. Torino)
  Maria Luisa Sapino, Professor (U. Torino)
  Claudio Schifanella, Ex-Post-doc (now at RAI)
  Antonio Penta, Post-doc (U. Torino)
Research Overview
Recent Relevant Grants/Projects:
  [NSF] National Science Digital Library (NSDL) Middleware for Network- and Context-aware Recommendations
  [KRA] A Framework for Real-time Context Monitoring in Sensor-rich Personal Mobile Environments
  [NSF] AURA: Design of Dense RFID Systems for Indexing in the Physical World across Space, Time, and Human Experience
Ongoing Grants/Projects:
  [with SHESC, NSF] Data Management for Real-Time Data Driven Epidemic Simulations
  [with SHESC, NSF] Understanding the Evolution Patterns of the Ebola Outbreak in West-Africa and Supporting Real-Time Decision Making and Hypothesis Testing through Large Scale Simulations
  [NSF] RanKloud: Partitioning and Resource Allocation Strategies for Scalable Multimedia and Social Media Analysis
  [with JCI, NSF] Analysis and Optimization for Building Energy Management
  [NSF] An Infrastructure to Support Complex Financial Patterns (CFP) based Real-Time Services Delivery and Visual Analytics
  [NSF] One Size Does Not Fit All: Empowering the User with User-Driven Integration
  [NSF-IGERT] Person-centered Technologies and Practices for Individuals with Disabilities
What do I do?
  Executive Committee member, ACM Special Interest Group on Management of Data (SIGMOD)
  Associate editor, ACM Transactions on Database Systems (TODS)
  Associate editor, IEEE Transactions on Multimedia
  Associate editor, the Very Large Data Bases (VLDB) Journal (2005-2012)
  Associate editor, Journal of Multimedia
  General Chair, IEEE International Conference on Cloud Engineering (IC2E) 2015
  Workshops Chair, International Conference on Extending Database Technology (EDBT) 2014
  Organizing Committee Member, ACM SIG Multimedia Conference 2013
  Panels Chair, Very Large Data Bases (VLDB) Conference 2012
  Publicity Chair, ACM SIG Multimedia Conference 2012
  General Chair, ACM SIGMOD Conference 2012
  General Chair, ACM SIG Multimedia Conference 2011
  Program Group Leader, ACM SIG Management of Data (SIGMOD) Conference 2010
  PC Chair, ACM International Conference on Image and Video Retrieval (CIVR) 2010
  PC Chair, Workshop on Information & Software as Services (WISS) 2010
  Chair, Workshop on Information & Software as Services (WISS) 2009
  Chair, Workshop on Real-Time Business Intelligence (RTBI) 2009
  PC Chair, ACM Workshop on Ambient Media Computing (IWAM) 2009
  PC Chair, ACM SIG Multimedia Conference 2008
Today, the amount of data being generated is massive. This calls for new data architectures, with enough processing power and tooling to match the scale of the data and to support split-second decision making. Through data fusion, integration, analysis, and forecasting algorithms, these architectures should help non-data-experts (both government and commercial) make decisions and generate value.
Source: "Hunting for the Value Gaps in Data Management, Services, and Analytics", ACM SIGMOD blog; http://wp.sigmod.org/
Challenges
  Cisco estimates we'll see 1.3 zettabytes of traffic annually over the internet in 2016
  Sensors from a Boeing jet engine create 20 terabytes of data every hour
  500 terabytes of new data of all forms are ingested in Facebook every day
ISQP: [I]mprecision, [S]parsity, [Q]uality, [P]rivacy
3Vs: [V]olume, [V]elocity, [V]ariety
HMLE: [H]igh-dimensional, [M]ulti-modal, inter-[L]inked, [E]volving
[Figure: topic map of data management and analytics]
Central nodes: Management, Analytics
Topics shown: dimensionality reduction/feature selection; classification, clustering; summarization; visual analytics; feature extraction/media analysis; temporal/spatial analysis; text analysis/NLP; Web/social networks; recommender systems; scalable/real time; performance and scalability; consistency, quality, cleaning; models; organization; data and schema integration; cloud, DaaS; streaming; parallel/distributed DM; MapReduce/Hadoop; Pregel/Hama; other parallel DBMS; multitenant, virtualization; security, privacy, assurance; mobile, sensor; visualization; extraction, filtering; row stores; column stores; key-value stores; NoSQL; relational; OO; XML; spatial; temporal; sequence; graph; fuzzy/uncertain; text, image, video
[Figure: the data management and analytics topic map, repeated with two callouts]
  Data management/mining techniques for supporting scalable, real-time, distributed analysis and retrieval systems
  Systems for scalable data/query processing and data streaming/mining/fusion
[Figure: the data management and analytics topic map, repeated with two callouts]
  Most data in the real world are imprecise, multi-modal, and subjective anyhow.
  So can we leverage techniques from data and media analysis to tackle the so-called traditional data management/mining challenges?
CENTER/CONSORTIUM FOR ASSURED AND SCALABLE DATA ENGINEERING (CASCADE) (CONSTRUCTION STAGE)
Focus and vision
CASCADE NSF I/UCRC Center (Proposal)
Academic partners:
  Arizona State Univ. (KS Candan, H Davulcu, G Ahn, M Sapino)
  University of Maryland, College Park (Louiqa Raschid)
Potential industrial members of the proposed NSF I/UCRC Center for Assured and SCAlable Data Engineering (CASCADE) include:
  ASU site members: American Express, Early Warning, JCI, HP Labs, MapR, NEC America Labs, Oracle, Computational Analysis & Network Enterprise Solutions (CAaNES), Arizona Cyber Threat Response Alliance (ACTRA)
  UMD site members: Unscrambl, Leidos, JP Morgan Chase, Applied Communication Sciences (ACS), John Bottega, State Street, IBM
Other potential partners: Rengen, Orion Health
Core CS Faculty working on Data
Name | Title | Area(s) of specialization as they relate to the proposed concentration
K. Selçuk Candan | Professor | Scalable data management and analysis
Hasan Davulcu | Assoc. Professor | Databases and data extraction
Huan Liu | Professor | Data mining and analysis
Ross Maciejewski | Assistant Professor | Data visualization
Baoxin Li | Professor | Statistical machine learning, visual data
Rao Kambhampati | Professor | Data integration, data cleaning
Chitta Baral | Professor | Knowledge representation, NLP
Dijiang Huang | Associate Professor | Data clouds
Hanghang Tong | Assistant Professor | Graph structured data
Mohamed Sarwat | Assistant Professor | Data management systems
Jingrui He | Assistant Professor | Data analysis and sparse learning
Paulo Shakarian | Assistant Professor | Data and network analysis
Relevant faculty at CIDSE/ASU
1. Gail-Joon Ahn: risk management, access control, and security architecture for distributed systems
2. Ron Askin: scheduling, operations research, applied statistics
3. Chitta Baral: knowledge representation, bioinformatics, and text analysis
4. Rida Bazzi: distributed computing, fault tolerance, dynamic schema update in data clouds
5. K. Selçuk Candan: scalable data management, integration and retrieval, data management and processing systems, multimedia retrieval, accessibility
6. Partha Dasgupta: distributed systems, security, and resilience
7. Sandeep Gupta: parallel and distributed computing, data centers, energy-efficient, reliable data dissemination, and caching
8. Dijiang Huang: security, virtualization, mobile cloud computing
9. Subbarao Kambhampati: data integration, data cleaning, and planning
10. Baoxin Li: statistical inference for visual tracking, feature selection for data/sensor fusion, image/video retrieval
11. Huan Liu: data mining, machine learning, feature selection, classification, subspace clustering, and social computing
12. Ross Maciejewski: geo-spatial and spatio-temporal visualization, visual analytics for healthcare/pandemics, law enforcement
13. Pitu Mirchandani: water distribution systems, urban planning, transportation, forecasting, dynamic systems, remote sensing
14. Sethuraman Panchanathan: ubiquitous multimedia analysis, accessibility
15. Andrea Richa: ad hoc networks, algorithms, self-organizing systems, wireless communication
16. George Runger: statistical learning, process control, data mining for massive, multivariate data sets
17. Arunabha Sen: network analysis; social, biological, transportation, and communication networks
18. Esma Gel: applied probability techniques for modeling, design, and control of production systems and supply chains
19. Hari Sundaram: multimedia and social-media analytics
20. Yalin Wang: data visualization, medical imaging, statistical pattern recognition
21. Peter Wonka: data visualization, geo-spatial visualization, modelling, image analysis
22. Teresa Wu: decision making under uncertainty, biomedical informatics
23. Guoliang Xue: privacy, smart grid, cloud computing, network science
24. Steve Yau: service-based systems, information assurance, security, QoS monitoring
25. Jieping Ye: machine learning, data mining, dimensionality reduction, biomedical informatics
26. Nong Ye: cyber- and network security
Big Data Systems Concentration for MS in Computer Science
CIDSE MS/MCS Concentration in Big Data Systems
15 credits of coursework in data engineering and data analytics
Required:
  Database Management System (DBMS) Implementation
  Distributed and Parallel Systems
  Data Mining
Elective (2 out of 5):
  Virtualization and Cloud Computing
  Semantic Web Mining
  Data Visualization
  Multimedia and Web Databases
  Statistical Machine Learning
Key knowledge gaps
The six most critical knowledge competency groups (in terms of the value gap, i.e., the difference between the current and desired states of the knowledge area): temporal and spatial analyses, summarization, cleaning, visualization, anomaly detection, real-time processing for streaming data, media analytics, representations and fusion for unstructured/structured data, semantic Web, making unstructured data queryable, prioritizing and ranking data, correlating and identifying the gaps in the data, graph-based models, social networks, entity analytics, (social and other) network analytics, performance and scalability, distributed architectures.
Source: "Hunting for the Value Gaps in Data Management, Services, and Analytics", ACM SIGMOD blog; http://wp.sigmod.org/
Key Tools
Tools that can
  support federated and scalable data storage, analysis, and modeling
  make unstructured data queryable, prioritize and rank data, correlate and identify the gaps in the data
  support entity analytics, (social and other) network analytics, and media analytics
  take into account known models, but also adapt to new emerging patterns
  go back in history to validate models and go forward into the future to support forecasting and if-then hypothesis testing
Engineers must have a solid algorithmic and mathematical background, complemented with excellent data management, programming, and system development/integration skills.
Engineers should be able to make informed architectural decisions based on a good understanding of how available technologies differ and complement each other.
[Figure: a MapReduce pipeline (four Map tasks feeding a Reduce task) surrounded by example technologies and tasks]
Labels shown: MapReduce/Hadoop, Hadoop-Online, Spark, RanKloud, MongoDB, GraphLab, NetworkX, MADLib; feature extraction, clustering/classification
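To make the Map/Reduce boxes in the figure concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the task. It only illustrates the programming model; it is not Hadoop or Spark code, and the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are hypothetical, chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values that share the same key."""
    return key, sum(values)


def word_count(splits):
    """Run the three steps in sequence over a list of input splits."""
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: each string stands in for one split handled by one mapper.
splits = ["big data systems", "data in motion", "data at rest"]
counts = word_count(splits)
print(counts["data"])
```

In a real deployment the mappers and reducers would run on different machines and the shuffle would move data over the network; that cost is one of the trade-offs an engineer weighs when choosing among the technologies listed above.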
Engineers should also be able to identify data that is important, restructure data to make it useful, interpret data, formulate observation strategies and relevant data queries, and ask new questions based on the observations and results, including "what happened?", "why did it happen?", and "what happens next?".
Engineers need to have the necessary skills to communicate with co-workers who are not data scientists or engineers, including domain experts and business executives.
Key learning outcomes
  make informed architectural decisions based on a good understanding of how available technologies differ and complement each other and what scalability/consistency trade-offs they provide.
  be able to pick and deploy the appropriate data management, processing, and analysis systems (including commercial and open-source) with the suitable structured or unstructured data model for the particular task and domain application needs.
  make informed decisions regarding data storage, indexing, querying, and retrieval.
  reason about optimization and execution alternatives and be able to plan within the trade-offs introduced by concurrency control, transaction management, and recovery protocols and algorithms.
Key learning outcomes
  use tools and develop frameworks for federated and cloud-based data storage, analysis, and modeling and mediated data services delivery.
  use as well as develop high-performance distributed and/or parallel data architectures that can match the scale of the data and support split-second decision making, through data fusion and integration and analysis and forecasting algorithms.
  use as well as develop real-time, on-line data processing systems for temporally and spatially distributed observations for data in motion, in applications including mobile applications, location-aware services, and human behavior modeling at individual and population scales.
  use as well as develop scalable batch processing systems for data at rest.
Key learning outcomes
  have knowledge regarding cutting-edge algorithms and systems for temporal and spatial data analyses, summarization, cleaning, anomaly detection, representations and fusion for unstructured/structured data, semantic Web, graph-based models, social networks, and multi-dimensional data visualization.
  use as well as develop tools that support entity analytics, (social and other) network analytics, text analytics, and media analytics not only for traditional applications like monitoring and security, but also for emerging applications, including enabling interest detection for retail/advertisement, social media, energy, healthcare, and finance.
Key learning outcomes
  use and develop algorithms, techniques, and tools for reducing the size and/or dimensionality of the data to make data amenable to analysis.
  make unstructured data queryable, prioritize and rank data, correlate and identify the gaps in the data, highlight what is normal and not normal, and automate the ingest of the data.
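As one small illustration of reducing dimensionality, the sketch below drops features whose variance falls under a threshold. This is only one simple feature-selection technique picked for the example, not a method the slides prescribe; the function name select_features, the threshold, and the sample rows are all hypothetical.

```python
from statistics import pvariance


def select_features(rows, min_variance):
    """Keep only the columns whose population variance reaches min_variance.

    Returns the indices of the kept columns and the reduced rows.
    """
    columns = list(zip(*rows))  # transpose: one tuple per feature
    keep = [i for i, col in enumerate(columns) if pvariance(col) >= min_variance]
    reduced = [[row[i] for i in keep] for row in rows]
    return keep, reduced


# Hypothetical data: column 0 is constant, column 2 barely changes.
rows = [
    [1.0, 5.0, 0.1],
    [1.0, 7.0, 0.2],
    [1.0, 9.0, 0.1],
    [1.0, 11.0, 0.2],
]
keep, reduced = select_features(rows, min_variance=0.01)
print(keep, reduced)
```

A variance filter ignores relationships between features, so in practice it would usually be a first pass before projection-based methods such as PCA.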
Key learning outcomes
  The graduates will be able to design and develop adaptive systems that take into account known models, but also adapt the models to new emerging patterns.
  The graduates will be able to use tools and develop systems that can go back in history to validate models and go forward into the future to support forecasting and if-then hypothesis testing.
  The graduates will have the necessary skills to communicate with technical and non-technical co-workers.
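The idea of going back in history to validate a model and then going forward to forecast can be sketched as a simple backtest. The example below assumes a moving-average forecaster purely for illustration; the model, the function names (moving_average_forecast, backtest), and the sample series are hypothetical and not taken from the slides.

```python
def moving_average_forecast(history, window):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def backtest(series, window):
    """Replay history: predict each point from the points before it.

    Returns the mean absolute error of those one-step-ahead predictions.
    """
    errors = []
    for t in range(window, len(series)):
        predicted = moving_average_forecast(series[:t], window)
        errors.append(abs(series[t] - predicted))
    return sum(errors) / len(errors)


# Hypothetical observations with a steady upward trend.
series = [10, 12, 14, 16, 18, 20]
mae = backtest(series, window=2)                         # validate on the past
next_value = moving_average_forecast(series, window=2)   # forecast forward
print(mae, next_value)
```

Here the backtest shows the model lagging a trending series by a constant error, which is the kind of evidence that would prompt adapting the model to the emerging pattern before trusting its forecasts.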