Big Data Challenges: Data Management, Analytics & Security
Ivo D. Dinov
Statistics Online Computational Resource, University of Michigan
www.socr.umich.edu
Big Data Challenges
- Availability, Sharing, Aggregation and Services
- Classical Data Science vs. Innovative Big Data Science
- Amateur Scientists vs. Experts
- Data Scientists vs. Practitioners
- Domain-specific vs. Trans-disciplinary knowledge
- Commercial vs. Open-source Resourceome
- Rapid Big Data Evolution
- Big Data IT proliferation
- Big Data Security risks
- Centralization won't work in Big Data Space
- Big Data is incredibly time, space, protocol and context dependent!
Big Data Characteristics
* Mixture of quantitative & qualitative estimates
Dinov et al. (2013)
Availability, Sharing, Aggregation & Services
- Cisco: "By the end of 2012, the number of mobile-connected devices [exceeded] the number of people on Earth. There will be over 10 billion mobile-connected devices in 2016; i.e., there will be 1.3 mobile devices per capita."
[Figure: Big Data Value Potential Index by industry sector. Labels: Industry Sector, Percent Growth, Big Data Value Potential Index; bubble size ~ relative size of GDP. Sectors shown: Computer & Electronic Products; Information Services; Manufacturing; Admin, Support & Waste Management; Transportation & Warehousing; Wholesale Trade; Professional Services; Healthcare Providers; Real Estate and Rental; Finance and Insurance; Utilities; Retail Trade; Government; Accommodation & Food; Arts & Entertainment; Corporate Management; Other Services; Construction; Education Services; Natural Resources. Sources: U.S. Bureau of Labor Statistics; McKinsey Global Institute.]
Amateur Scientists vs. Experts
- Democratization of Big Data Science
- Doctoral studies or certification are neither mandatory nor a guarantee of appropriate Big Data expertise
- Lower barriers to entry
- Demand for constant continuing education and self-training
- Dichotomy between theoretical and empirical sciences
- Differences between fundamental knowledge and experimental skills (big data properties closely approximate core scientific principles)
Domain-specific vs. Trans-disciplinary knowledge
[Diagram: Big Data Science shown together with the following groups of disciplines]
- Math/Stats, Physics, Biology, Chemistry, ...
- Medical Sciences, Social Sciences, Environmental Sciences, ...
- Engineering, Computer Science, Bioinformatics, Biomath/Biostats, ...
Commercial vs. Open-source Resourceome
- There is an explosion of open-data-science resources
  - www.data.gov
  - www.ncbi.nlm.nih.gov/gap
- Spawning of a number of industries and enterprises blending proprietary and open-source data, code, documentation, expert-support, infrastructure and services
- Big Data to Knowledge: www.bd2k.org
- Google Cloud Platform (GCP)
- Amazon Web Services (AWS)
Rapid Big Data Evolution
- Millions of grass-roots initiatives addressing Big Data challenges
- Big Data complexities require truly innovative, collaborative, trans-disciplinary solutions
- Increase of data complexity
  - Sources
  - Heterogeneity
  - Datum-elements
  - Incongruent sampling
Data Scientists vs. Practitioners
- Modelers, Engineers, (Applied) Users
- No one user completely understands the entire pipeline of data provenance, processing protocols, analytic strategies, or results interpretation
- Black boxes
- Accuracy
- Privacy concerns
- Consistency
- Infrastructure
Big Data Security Risks
- Big Data Fusion provides enormous opportunities and presents significant challenges
- Privacy, security and legal concerns; authenticity, accuracy, consistency, reliability, availability
- Healthcare
  - Cloud services enable the sharing of big data
  - Significant security and privacy concerns exist: Health Insurance Portability and Accountability Act (HIPAA), EMR/EHR
  - Federal, state and local regulations/policies (IRBMED)
- Genetics
- Viral: Dual-use research of concern (DURC), doi:10.1126/science.1223995
  - De novo synthesis of polio virus, the Australian mousepox experiment, the Penn State aerosolization study
Kryder's law: Exponential Growth of Data
[Figure: Increase of Imaging Resolution. Horizontal axis: resolution from 1 µm to 1 cm (1 µm, 10 µm, 100 µm, 1 mm, 1 cm); series: Cryo_Byte, Cryo_Short, Cryo_Color.]
[Figure: Data volume increases faster than computational power. Horizontal axis: five-year periods from 1985-1989 to 2015-2019 (estimated); series: Neuroimaging (GB), Genomics_BP (GB), Moore's Law (1000's Trans/CPU).]
Dinov et al., 2013
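The slide's claim that data volume grows faster than computational power can be illustrated with simple doubling-time arithmetic. The sketch below is only an illustration: the two doubling times are hypothetical placeholders, not values taken from the figure or from Dinov et al. (2013), and the function names are introduced here for the example.

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Multiplicative growth after `years` for a quantity that doubles
    every `doubling_time_years` (pure exponential growth)."""
    if doubling_time_years <= 0:
        raise ValueError("doubling_time_years must be positive")
    return 2.0 ** (years / doubling_time_years)


# Hypothetical doubling times, chosen only to illustrate the argument.
DATA_DOUBLING_YEARS = 1.0      # assumed: data volume doubles yearly
COMPUTE_DOUBLING_YEARS = 2.0   # assumed: compute doubles every two years


def data_to_compute_gap(years: float) -> float:
    """Ratio of data growth to compute growth after `years`."""
    return (growth_factor(years, DATA_DOUBLING_YEARS)
            / growth_factor(years, COMPUTE_DOUBLING_YEARS))


if __name__ == "__main__":
    for horizon in (5, 10, 20):
        print(horizon, "years: gap factor =", data_to_compute_gap(horizon))
```

Under these assumed rates the gap itself grows exponentially: whenever the data doubling time is shorter than the compute doubling time, the ratio keeps widening, whatever the exact values are.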
Alzheimer's Case Study: Stable-MCI vs. MCI-Converters
- Goals: assess the predictive power of combinations of biomarkers and imaging-derived measures as reliable predictors of conversion from MCI to Alzheimer's disease
- Data
  - MCI converters to AD (24-month period) and stable non-converters, matched for age, gender, handedness and education level
  - Imaging (sMRI), Behavioral, Clinical, Neuropsychiatric and Biological data
- Approach: Qualitative Exploratory Data Analysis and Quantitative Statistical Analysis (morphometric imaging correlates with clinical and genetics markers)
MCI = Mild Cognitive Impairment (prelude to dementia of Alzheimer's type)
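Alzheimer's Case Study: Stable-MCI vs. MCI-Converters
Data layout (shown with one row per variable and one column per subject index)

Group          Variable                         Subject 1   Subject 2   ...   Subject N
Demographics   Age                              65          73          ...   64
Demographics   Kg                               59          93          ...   63
Demographics   Sex                              F           M           ...   F
Genetics       APOE A1                          3           3           ...   3
Genetics       APOE A2                          4           3           ...   3
Clinical       NPI SCORE                        0           7           ...   3
Clinical       MMSE                             23          19          ...   29
Clinical       GD TOTAL                         1           1           ...   6
Clinical       CDR                              0.5         1           ...   0.5
Clinical       FAQ TOTAL                        7           8           ...   2
Neuroimaging   L Gyrus Rectus BL                1695        1333        ...   2237
Neuroimaging   L Superior Occipital Gyrus BL    3976        6016        ...   6887
Neuroimaging   R Fusiform Gyrus BL              8363        13290       ...   16109
Neuroimaging   L Caudate BL                     1296        835         ...   1223
Neuroimaging   R Caudate BL                     1992        2137        ...   2222
Neuroimaging   L Putamen BL                     1749        2290        ...   2525
Neuroimaging   R Putamen BL                     2776        4327        ...   4110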
Alzheimer's Case Study: Stable-MCI vs. MCI-Converters
Classification Results Using Baseline Data

Hierarchical Clustering        True State (Dx at 24 month follow up)
Prediction Ana (7 Regions)     Converter   Stable   Total
Converter                      TP          FP       TP+FP
Stable                         FN          TN       FN+TN
Total                          TP+FN       FP+TN    N

Metric                         Top 7 Regions   Top 20 Regions
Sensitivity                    0.81            1.0
Specificity                    0.61            0.87
Power to detect Converters     0.91            1.0
Accuracy                       0.70            0.93
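Sensitivity, specificity and accuracy can be computed directly from the confusion-matrix cells (TP, FP, FN, TN) named on this slide. The following is a minimal sketch assuming the standard textbook definitions; the slide itself does not spell out its formulas, and it does not define "Power to detect Converters", so that metric is left out. The class name and the example counts are hypothetical and are not the study's data.

```python
from dataclasses import dataclass


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that fails loudly on an empty denominator."""
    if denominator == 0:
        raise ValueError("metric undefined: denominator is zero")
    return numerator / denominator


@dataclass(frozen=True)
class ConfusionMatrix:
    """2x2 table: rows are predictions, columns are the true state."""
    tp: int  # predicted Converter, truly Converter
    fp: int  # predicted Converter, truly Stable
    fn: int  # predicted Stable, truly Converter
    tn: int  # predicted Stable, truly Stable

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def sensitivity(self) -> float:
        # Fraction of true converters that were predicted as converters.
        return _ratio(self.tp, self.tp + self.fn)

    def specificity(self) -> float:
        # Fraction of truly stable subjects that were predicted as stable.
        return _ratio(self.tn, self.fp + self.tn)

    def accuracy(self) -> float:
        # Fraction of all subjects that were classified correctly.
        return _ratio(self.tp + self.tn, self.total)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    cm = ConfusionMatrix(tp=8, fp=2, fn=2, tn=8)
    print(cm.sensitivity(), cm.specificity(), cm.accuracy())
```

Keeping the four cells in one immutable record makes it easy to recompute the metrics for different feature sets, such as the 7-region and 20-region analyses, from their respective counts.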