Big Data in the context of Preservation and Value Adding R. Leone, R. Cosac, I. Maggio, D. Iozzino ESRIN 06/11/2013 ESA UNCLASSIFIED
Big Data Background
ESA/ESRIN organized a 'Big Data from Space' event on 5-7 June 2013 to address the barriers that hamper effective use of large volumes of Earth Observation (EO) data. The meeting involved some 250 science, industry and policymaking representatives and national delegates from Europe, the US, Australia, China and Africa. Over 50 presentations during the three-day conference stimulated discussion between the different communities in the business of providing and manipulating very large-scale data and complex analyses of satellite and in situ Earth observations.
Big Data Event
- Günther Kohlhammer, H/EO Ground Segment and Missions Operations Department: Big Data from Space, (Big?) Data and Earth Observation
- Kostas Glinos, Head of Unit - e-infrastructure, DG CONNECT, European Commission: e-infrastructures for big data
- Reinhard Schulte-Braucks, Head GMES Unit, DG ENTR, EC: Copernicus and Big Data
- Gilles Ollier, Head of Sector Earth observation, Directorate General Research & Innovation, European Commission: Earth Observation data and the EC Environmental Research and Innovation program
Big Data Technology Mosaic
Big Data Agencies and Research Institution
Big Data Industry
Big Data Background
The event covered diverse aspects of handling large-scale data and complex analysis of Earth observation data products, including:
- Typical order of data volumes involved and their trends, primarily with respect to the streaming of data from presently available and upcoming satellite capabilities, and from ubiquitous ground devices.
- Challenges of data access, including timeliness, needs and policies for dissemination, data capture, storage, search, sharing (including use of interoperability standards), transfer capacity, mining and analysis (including identification of representative samples), fusion, systematic and peak processing, and visualization.
- The cost and weighing factors against identified challenges and in support of continuous evolution of techniques and technologies, in the short and long term.
Big Data Topics
Event papers were mostly, but not exclusively, devoted to the following topics:
- applied multivariate analysis, data mining
- computing power and storage scalability
- costs and weighting factors
- data access and use policies, licensing of derivative work
- data capturing and description
- data interoperability, retrieval, navigation
- data protection and trustworthiness
- data delivery timeliness, distribution services, network capacity
- data slicing, sub-setting, extraction
- data variety, fusion, correlation
- data visualization, rendering, video streaming
- peak data processing
- performance indicators for big Earth data services
- systematic data processing
- spatial on-line analytical processing systems
- sustainability of big Earth data services
Big Data and The Fourth V Paradigm
Big Data consists of:
- Volume: the size of the data is increasing fast
- Value: the amount of value that can be derived from the data (through innovative analysis techniques, and through combined use of diverse EO data and long-term data series analysis, with old data gaining value from new ones)
- Variety: diversity and complexity of the data (format, type and storage medium)
- Velocity: data is arriving at a faster rate and technology is always advancing
Big Data Current Challenges
- At this point in time the main challenge is not only the volume of data, but its diversity (in terms of data product content, format and type).
- Older EO data is recorded on various media, in different formats. Recovering, reformatting and reprocessing such data is a huge task, as is transcribing the associated information needed to understand and use the data.
- Challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
- A large proportion of users are no longer domain experts, so data discovery tools, documentation and support are needed.
Big Data Solutions
- Digital storytelling to open up to a larger audience and ease understanding for users who may not be domain experts.
- Improved algorithms to lower processing time.
- Cloud services to support Big Data and ease accessibility.
- EO platforms where data can be easily accessed, shared and manipulated (e.g. Google Earth Engine, Nephelae).
- Cloud services and platforms could increase the revisiting rate and enhancement of EO data archives.
Big Data Presentations Examples
From
- Google: Google Earth Engine: A Global Scale Geospatial Analysis Platform
- NOAA: Big Data for a Big Ocean
- ECMWF: Experience with managing a Multi-Petabyte Meteorological Archive
- DLR (GeoFarm): Enabling EO Data Exploitation
- CloudEO: An Open Cloud Based EO-Services Production Platform and Marketplace
- AVHRR TIMELINE Project..
To
- Spacemetric - ESA LDCM repository, data processing and dissemination
Big Data Future Challenges
- Are we prepared to satisfy the requirements of growing user communities (from policy makers to scientists)?
- Are we prepared to accommodate and process "Big Data" volumes of EO data?
- Are we ready to share knowledge and tools on collaborative platforms?
- What are the security issues with regard to cloud services? Can we trust the cloud? How can we protect sensitive data?
- One idea could be to break Big Data into pieces. While this could make it more secure, how will it affect data processing and analysis times?
Big Data Conclusions and Recommendations
- The Big Data from Space event has set a new paradigm and a more advanced perspective on Big Data issues.
- The event ended with a strong call by all parties for the ability to handle and use big EO data. This could open new opportunities for research and for international cooperation schemes such as programmatic and industrial coordination.
- There was also unanimous support for promoting the development of processing capabilities and making data more accessible to users, complementing more traditional web service approaches.
- The excellent feedback and contributions received during the event paved the way for ESA to manage future EO data, and form the basis for discussion among Earth observation data owners and suppliers.
Big Data Conclusions and Recommendations
- Scientists regularly encounter limitations due to large data sets in many areas. One possible solution is to bring data processing directly to the data collection devices and/or facilities. By storing only the meaningful information extracted from the huge data collection, the dataset to be archived, possibly long-term, would be orders of magnitude smaller than the data volume actually collected by the sensors.
- Big Data can also mean big changes to storage infrastructure, or working smarter with the available ones (object storage systems vs. cloud).
- Big Data activities boost private business entrepreneurial efforts, also at a huge level of infrastructure and investments (see, e.g., the papers by Google and Microsoft representatives).
- A new professional figure is emerging: the Data Scientist. This is a new type of specialist with a solid foundation typically in computer science and applications, as well as in modeling, statistics, analytics and math. A Data Scientist is a practitioner of data science, able to extract the meaningful information from the data deluge.
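As a rough illustration of the idea of processing data at the collection site and archiving only the extracted information, the sketch below reduces a stream of raw samples to one summary record per block. It is not taken from the event papers: the function names (`summarize_blocks`, `reduction_factor`), the choice of mean/min/max as the "meaningful information", and the synthetic input are all assumptions made for this example.

```python
from statistics import mean

# Numbers stored per summary record: start, count, mean, min, max.
VALUES_PER_SUMMARY = 5


def summarize_blocks(samples, block_size):
    """Reduce raw samples to one summary record per block (hypothetical helper)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    summaries = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        summaries.append({
            "start": start,        # index of the first raw sample in the block
            "count": len(block),   # the last block may be shorter
            "mean": mean(block),
            "min": min(block),
            "max": max(block),
        })
    return summaries


def reduction_factor(raw_count, summaries):
    """Ratio of raw sample count to the number of values kept in the archive."""
    if not summaries:
        raise ValueError("no summaries to compare against")
    return raw_count / (len(summaries) * VALUES_PER_SUMMARY)


# Synthetic stand-in for sensor readings (assumption, not real EO data).
raw_samples = [float(i % 100) for i in range(10_000)]
summaries = summarize_blocks(raw_samples, block_size=1000)
print(len(summaries), reduction_factor(len(raw_samples), summaries))
```

With the synthetic input above, 10,000 samples are reduced to 10 records, that is 50 stored numbers. In practice the achievable reduction depends entirely on what counts as meaningful information for a given mission, and the raw data can no longer be reprocessed once only the summaries are kept.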
Big Data versus Long Term Data Preservation and Value
- The event indirectly addressed Long Term Data Preservation and Value issues: the handling of very large data sets, and their curation, valorization, retrieval, manipulation and finally visualization. Progress on all these issues will contribute to solving the problems that always arise when dealing with large data sets such as those considered in LTDP activities.
- One of the most relevant points was a new way of carrying out scientific research. Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.
- After experimental, theoretical, and computational science, a Fourth Paradigm is emerging in scientific research. It refers to the data management techniques and the computational systems needed to manipulate, visualize, and manage those large amounts of scientific data.
Next Steps from LTDP to LTDP4V
Valorize the past being inspired by the future
1. Foster interactions with user communities to gather requirements and collect feedback on data needs and discoverability requirements
2. Review the LTDP operations concept and scenarios to accommodate the 4V paradigm
3. Reinforce cooperation and federation with all involved parties (data producer, archive and consumer)
4. Identify technological areas for innovation in data exploitation
5. Improve communication to strengthen the value of data preservation, understandability and usage
6. Create a center of excellence for thematic sensor applications and Fundamental Data Records exploitation