What Is Big Data? Craig C. Douglas University of Wyoming
What Is Big Data?... It Depends

Unit             Approximately   10^n   Related to
Kilobyte (KB)    1,000 bytes     3      Circa 1952 computer memory
Megabyte (MB)    1,000 KB        6      Circa 1976 supercomputer memory
Gigabyte (GB)    1,000 MB        9      2013 typical 16 GB memory stick
Terabyte (TB)    1,000 GB        12     2012 largest SSD in a laptop
Petabyte (PB)    1,000 TB        15     250,000 DVDs or the entire digital library of all known books written in all known languages
Exabyte (EB)     1,000 PB        18     175 EB copied to disk in 2010 (est.)
Zettabyte (ZB)   1,000 EB        21     2 ZB copied to disk in 2011 (est.)

For comparison:
32 KB: Apollo 11 computer memory (1969)
32 GB: Smart phone memory (2014)

What Is Big Data?... It Depends
- What if time counts?
  - Given a time period t, how much data can be read and written?
  - This changes over time as technology changes.
- What if the quantity of data counts?
  - How long does it take to read and write data?
  - This changes over time as technology changes.
- Definition of Big Data is fluid, not static.

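The two questions above are the same arithmetic read in opposite directions, and a minimal sketch may make that concrete. The sustained rate of 500 MB/s used here is purely an illustrative assumption, not a figure from the slides, and the function names are hypothetical.

```python
# Decimal units, matching the table of units (1 KB = 1,000 bytes, and so on).
MB = 10**6
TB = 10**12

# ASSUMPTION: a hypothetical sustained read/write rate, chosen only for illustration.
ASSUMED_RATE = 500 * MB  # bytes per second


def seconds_to_transfer(num_bytes, bytes_per_second):
    """If the quantity of data counts: how long does it take to read or write it?"""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return num_bytes / bytes_per_second


def bytes_in_period(period_seconds, bytes_per_second):
    """If time counts: how much data can be read or written in a period t?"""
    if period_seconds < 0:
        raise ValueError("period_seconds must be non-negative")
    return period_seconds * bytes_per_second


if __name__ == "__main__":
    # Time to move 1 TB at the assumed rate, in seconds.
    print(seconds_to_transfer(1 * TB, ASSUMED_RATE))
    # Data moved in one hour at the assumed rate, in TB.
    print(bytes_in_period(3600, ASSUMED_RATE) / TB)
```

Changing only `ASSUMED_RATE` changes both answers, which is one way to see why the definition shifts as technology changes.
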
Some Sources of Big Data
- Interactions with dynamic databases
- Internet data
- City or regional transportation flow control
- Environment and disaster management
- Oil/gas fields or pipelines, seismic imaging
- Credit cards and online businesses
- Government or industry regulation/statistics
- Dynamic data-driven apps

Why is Big Data a Hot Topic?
- Open positions in data analytics by 2020 (USA):
  - up to 200,000 open positions
  - might only be 140,000 open positions
- Bureau of Labor Statistics projects that 70% of all newly created jobs across all STEM fields during the 2010s, across engineering, the physical sciences, the life sciences, and the social sciences, will be in computer science.

Unprecedented Opportunities
Significant contributions to the development of these transformative technologies have been made from diverse fields including:
- mathematics, natural sciences
- engineering
- social sciences
- arts and entertainment industries
- business world

Unprecedented Opportunities
Algorithm and software development have belonged to computer science over the past 50 years:
- Computer science researchers have designed and implemented the algorithms and data structures, languages, models, tools, and abstractions that have enabled these transformational technology developments.

Quick summary
- Simulation oriented computational science is transformational science, but is only a niche in the grand scheme of things.
- Big data computing capabilities must be broadly available in any institution that strives to compete in the coming decade.
- If not, an institution will simply cease to be competitive, similar to not joining the ARPAnet or CSnet in the 1970s and 1980s.

Some Interesting Problems
An Open Source, secure Hadoop replacement suitable for hospitals and medical records.
- Must be HIPAA compliant.
- Must scale well for very large databases.
- Must have individual access capabilities.
- Must not have complexity O(disk access) on a DFS.
- Should use OpenMP and MPI.
- Should use cache aware hashing methods.
- Will be useful well beyond medical records.

Some Interesting Problems
Dynamic Data-Driven Application Systems and Big Data
- A natural fit, yet there is no agreed upon software for DDDAS or DDDAS-BD or DBDDAS.
- DDDAS has been applied to many, many fields.
- DDDAS researchers agree something should be produced, but it is not considered an application and is too applied to be considered networking research.
- Need to find a niche or a program officer in a funding agency willing to think outside of the box.
- Many Big Data issues have long been common to DDDAS.

Some Interesting Problems
Sensors and telemetry
- SensorML was supposed to provide a standard way of describing sensor data and be able to get the data and deliver it to applications.
- It went commercial ($$$...$$$) after the original PI retired.
- A true Open Source, internationally recognized standard would benefit one area of Big Data and DDDAS.

Some Interesting Problems
Reservoirs (oil, gas, water)
- Dynamic reservoir meshing
  - Vertical wells with micro sensors provide updates to fracked reservoirs.
  - Speed up the meshing enough to include it in reservoir simulator time (e.g., go from a year to a day).
  - Dynamically improve predictions.
- Corporate oil/gas fields or pipelines (even small ones) produce excessive amounts of data.
  - Open Source data mining tools for specific problems

Some Interesting Problems
Audio and photographic data mining
- World's largest databases are based on VoIP and phone monitoring by many governments (e.g., P.R. China, France, Germany, Kingdom of Saudi Arabia, United Kingdom, USA, ...).
- Keeps disk drive makers in business and lowers hard disk prices very significantly.
- Another problem: Find all file duplicates in a file system efficiently.
  - Similar to sentence problem earlier.
- Has commercial (e.g., Bing, satellite transmission) and research ramifications that are not nefarious.

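For the file duplicate problem mentioned above, one common approach (not necessarily the one the author has in mind) is to group files by size first and hash only the files whose size collides, so most files are never read at all. The sketch below assumes a local file system and treats equal SHA-256 digests as equal content; the function names `file_digest` and `find_duplicates` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a sorted list of groups; each group is a sorted list of paths
    under root whose contents hash to the same digest."""
    # Pass 1: bucket by size. This needs only metadata, not file contents.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files that share a size with at least one other file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)


if __name__ == "__main__":
    for group in find_duplicates("."):
        print(group)
```

On a distributed file system the same two-pass idea would still apply, but the cost model differs, so this is only a single-machine illustration.
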