BIG DATA: STORAGE, ANALYSIS AND IMPACT
Gediminas Žylius
WHAT IS BIG DATA?
- Describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information
- Data sets so large or complex that traditional data processing applications are inadequate
- Many other definitions exist
5 V ATTRIBUTES
- Volume: Data at Scale (terabytes or petabytes of data)
- Variety: Data in Many Forms (structured, unstructured, text, multimedia)
- Velocity: Data in Motion (analysis of streaming data to enable decisions in real time)
- Veracity: Data Uncertainty (managing the reliability and predictability of inherently imprecise data types)
- Value: Data into Money
GARTNER'S HYPE CYCLE: BIG DATA IS OUT
(comparison of Gartner's 2014 and 2015 Hype Cycle charts: "big data" appeared as an entry in 2014 but was dropped as a separate entry in 2015)
BIG DATA RELATED TECHNOLOGIES
- Autonomous vehicles
- Internet of Things
- Natural language question answering
- Machine learning
- Digital humanism
- Citizen data scientist
- Enterprise 3D printing
- Gesture control
- Digital dexterity
- Data security
TOP 10 BIG DATA TECHNOLOGIES
COLUMN-ORIENTED DATABASES
- Store data tables as sections of columns of data rather than as rows of data, allowing for huge data compression and very fast query times (see the sketch below)
- Traditional, row-oriented databases are excellent for online transaction processing with high update speeds
- But they fall short on query performance as data volumes grow and as data becomes more unstructured
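A minimal sketch, in plain Python, of why the column layout helps analytic queries; the table and its values are made up for illustration:

```python
rows = [  # row-oriented: each record is stored together
    {"id": 1, "name": "Alice", "amount": 120.0},
    {"id": 2, "name": "Bob",   "amount": 75.5},
    {"id": 3, "name": "Carol", "amount": 210.0},
]

columns = {  # column-oriented: each column is stored together
    "id":     [1, 2, 3],
    "name":   ["Alice", "Bob", "Carol"],
    "amount": [120.0, 75.5, 210.0],
}

# An analytic query such as SUM(amount) scans one contiguous column,
# which is also highly compressible since all values share a type:
total = sum(columns["amount"])

# The row layout must walk every field of every record for the same query:
total_rows = sum(r["amount"] for r in rows)

assert total == total_rows
```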
NOSQL DATABASES
- A mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases
- NoSQL classification based on data model (toy examples of each model follow below):
  - Column: Accumulo, Cassandra, Druid, HBase, Vertica
  - Document: Apache CouchDB, Clusterpoint, Couchbase, DocumentDB, HyperDex, Lotus Notes, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
  - Key-value: Aerospike, CouchDB, Dynamo, FairCom c-treeACE, FoundationDB, HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak
  - Graph: AllegroGraph, InfiniteGraph, MarkLogic, Neo4j, OrientDB, Virtuoso, Stardog
  - Multi-model: Alchemy Database, ArangoDB, CortexDB, FoundationDB, MarkLogic, OrientDB
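A toy illustration of the four basic data models using plain Python structures; all keys and values here are invented for illustration, not tied to any particular product:

```python
# Key-value: an opaque value addressed by a single key (Redis, Riak style)
kv_store = {"session:42": b"serialized-blob"}

# Document: nested, schema-flexible records (MongoDB, CouchDB style)
doc_store = {"user:42": {"name": "Alice", "tags": ["admin"], "age": 30}}

# Wide-column: rows hold sparse families of columns (Cassandra, HBase style)
column_store = {"user:42": {"profile:name": "Alice", "stats:logins": 17}}

# Graph: nodes connected by typed edges (Neo4j style)
graph = {"nodes": {"alice", "bob"},
         "edges": [("alice", "FOLLOWS", "bob")]}
```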
MAPREDUCE
- A programming paradigm that allows for massive job-execution scalability against thousands of servers or clusters of servers
- The "Map" task: an input dataset is converted into a different set of key/value pairs, or tuples
- The "Reduce" task: several of the outputs of the "Map" task are combined to form a reduced set of tuples (see the word-count sketch below)
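A minimal sketch of the paradigm in plain Python, using the classic word-count example; the two-document input is invented, and the sort stands in for the shuffle phase a real framework performs across the cluster:

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data is big", "data in motion"]

# "Map": convert each input record into a set of (key, value) tuples
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group tuples by key (handled by the framework on a real cluster)
mapped.sort(key=itemgetter(0))

# "Reduce": combine the values of each key into a reduced set of tuples
reduced = {key: sum(v for _, v in group)
           for key, group in groupby(mapped, key=itemgetter(0))}

print(reduced)  # {'big': 2, 'data': 2, 'in': 1, 'is': 1, 'motion': 1}
```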
HADOOP
- The most popular implementation of MapReduce; an entirely open-source platform for handling Big Data (see the streaming example below)
- The base Apache Hadoop framework is composed of the following modules:
  - Hadoop Common: contains libraries and utilities needed by other Hadoop modules
  - Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
  - Hadoop YARN: a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications
  - Hadoop MapReduce: an implementation of the MapReduce programming model for large-scale data processing
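Hadoop Streaming lets any executable that reads stdin and writes stdout act as a mapper or reducer. Below is a minimal word-count pair in Python; the file names are illustrative and the cluster setup is assumed, not prescribed:

```python
# mapper.py -- Hadoop Streaming mapper: reads raw text lines on stdin
# and emits tab-separated (word, 1) pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

The framework sorts the mapper output by key before it reaches the reducer, so equal words arrive consecutively:

```python
# reducer.py -- Hadoop Streaming reducer: sums counts for each
# consecutive run of identical keys.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{count}")
        count = 0
    current_word = word
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

On a cluster, the pair would typically be launched with the hadoop-streaming jar, passing -mapper and -reducer along with HDFS -input and -output paths.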
HIVE
- An open-source "SQL-like" bridge that allows conventional BI applications to run queries against a Hadoop cluster
- It amplifies the reach of Hadoop, making it more familiar to BI users
- The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage
- Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL (see the sketch below)
- At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL
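A minimal sketch of issuing HiveQL from Python via the PyHive client library; it assumes a HiveServer2 endpoint at localhost:10000, a user named "analyst", and a table named pageviews, all of which are hypothetical:

```python
from pyhive import hive  # pip install pyhive

# Connect to a (hypothetical) HiveServer2 instance
conn = hive.connect(host="localhost", port=10000, username="analyst")
cur = conn.cursor()

# HiveQL looks like SQL; Hive compiles it into jobs over the cluster
cur.execute("""
    SELECT url, COUNT(*) AS hits
    FROM pageviews
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")

for url, hits in cur.fetchall():
    print(url, hits)

conn.close()
```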
PIG
- Similar to Hive; open source
- Unlike Hive, Pig provides a "Perl-like" dataflow language (Pig Latin) rather than a "SQL-like" one for expressing query execution over data stored on a Hadoop cluster
SPARK
- A framework for performing general data analytics on a distributed computing cluster such as Hadoop
- It provides in-memory computation for increased speed of data processing over MapReduce; Spark offers dramatically faster data processing than Hadoop and is now the largest big-data open-source project
- It is an alternative to the traditional batch map/reduce model that can be used for real-time stream data processing and fast interactive queries that finish within seconds
- Spark uses more RAM instead of network and disk I/O: Spark stores data in memory, whereas Hadoop stores data on disk
- Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets (RDDs), a clever way of guaranteeing fault tolerance that minimizes network I/O (see the PySpark sketch below)
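A minimal PySpark sketch of the same word count as an RDD pipeline; the local master and the input path input.txt are assumptions for illustration:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")  # local mode for illustration

lines = sc.textFile("input.txt")  # hypothetical input file
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# cache() keeps the RDD in memory, so subsequent actions reuse it
# instead of recomputing from disk -- the key speed advantage over
# the batch map/reduce model
counts.cache()

print(counts.take(10))
sc.stop()
```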
DEEP LEARNING
http://open-data.europa.eu/
The European Union Open Data Portal is the single point of access to a growing range of data from the institutions and other bodies of the European Union (EU)
BIG DATA AND HORIZON 2020
- Horizon 2020 is the biggest EU Research and Innovation programme ever, with nearly €80 billion of funding available over 7 years (2014 to 2020)
- It promises more breakthroughs, discoveries and world-firsts by taking great ideas from the lab to the market
- Big data is one of the main directions in the Horizon 2020 ICT work programme
BIG DATA AND HORIZON 2020: MAIN TOPICS
ICT-15-2014: BIG DATA AND OPEN DATA INNOVATION AND TAKE-UP
Specific Challenge: to improve the ability of European companies to build innovative multilingual data products and services
Expected Impact:
- Enhanced access to, and value generation on, open data
- Viable cross-border, cross-lingual and cross-sector data supply chains
- Tens of business-ready innovative data analytics solutions
- Availability of deployable educational material
- Effective networking and consolidation
ICT-16-2015: BIG DATA - RESEARCH
Specific Challenge: contribute to the Big Data challenge by addressing the fundamental research problems related to the scalability and responsiveness of analytics capabilities
Expected Impact:
- Ability to track publicly and quantitatively the progress in the performance and optimization of very-large-scale data analytics technologies
- Advanced real-time and predictive data analytics technologies, thoroughly validated
- Demonstrated ability of the developed technologies to keep abreast of growth in data volumes and variety, shown by validation experiments
- Demonstration of the technological and value-generation potential of European Open Data, documenting improvements in the market position and job creation of hundreds of European data-intensive companies
ICT-14-2016-2017: BIG DATA PPP: CROSS-SECTORIAL AND CROSS-LINGUAL DATA INTEGRATION AND EXPERIMENTATION
Specific Challenge: create a stimulating, encouraging and safe environment for experiments where not only data assets but also knowledge and technologies can be shared
Expected Impact:
- Data integration activities will simplify data analytics carried out over datasets independently produced by different companies and shorten time to market for new products and services
- Substantial increase in the number and size of data sets processed and integrated by the data integration activities
- Substantial increase in the number of competitive services provided for integrating data across sectors
- Increase in revenue by 20% (by 2020) generated by European data companies through selling integrated data and the data integration services offered
- At least 100 SMEs and web entrepreneurs, including start-ups, participating in data experimentation incubators
- 30% annual increase in the number of Big Data Value use cases supported by the data experimentation incubators
- Substantial increase in the total amount of data made available in the data experimentation incubators, including closed data
- Emergence of innovative incubator concepts and business models that allow the incubator to continue operations past the end of the funded duration
ICT-15-2016-2017: BIG DATA PPP: LARGE-SCALE PILOT ACTIONS IN SECTORS BEST BENEFITTING FROM DATA-DRIVEN INNOVATION
Specific Challenge: stimulate effective piloting and targeted demonstrations in large-scale sectorial actions, in data-intensive sectors
Expected Impact:
- Demonstrated increase of productivity in the main target sector of the Large Scale Pilot Action by at least 20%
- Increase of market share of Big Data technology providers of at least 25% if implemented commercially within the main target sector of the Large Scale Pilot Action
- Doubling of the use of Big Data technology in the main target sector of the Large Scale Pilot Action
- Leveraging of additional target-sector investments, equal to at least the EC investment
- At least 100 organizations participating actively in Big Data demonstrations
ICT-16-2017: BIG DATA PPP: RESEARCH ADDRESSING MAIN TECHNOLOGY CHALLENGES OF THE DATA ECONOMY
Specific Challenge: fundamentally improve the technology, methods, standards and processes, building on a solid scientific basis, and responding to real needs
Expected Impact:
- Powerful (Big) Data processing tools and methods that demonstrate their applicability in real-world settings, including the data experimentation/integration (ICT-14) and Large Scale Pilot (ICT-15) projects
- Demonstrated, significant increase of speed of data throughput and access, as measured against relevant, industry-validated benchmarks
- Substantial increase in the definition and uptake of standards fostering data sharing, exchange and interoperability
ICT-17-2016-2017: BIG DATA PPP: SUPPORT, INDUSTRIAL SKILLS, BENCHMARKING AND EVALUATION
Specific Challenge: the newly created Big Data Value contractual public-private partnership (cPPP) needs strong operational support for community outreach, coordination and consolidation, as well as widely recognized benchmarks and performance evaluation schemes, to avoid fragmentation or overlaps and to allow measuring progress in (Big) Data challenges with a solid methodology. There is also an urgent need to improve education, professional training and career dynamics
Expected Impact:
- At least 10 major sectors and major domains supported by Big Data technologies and applications developed in the PPP
- 50% annual increase in the number of organizations that participate actively in the PPP
- Significant involvement of SMEs and web entrepreneurs in the PPP
- Constant increase in the number of data professionals in different sectors, domains and various operational functions within businesses
- Networking of national centres of excellence and industry, contributing to industrially valid training programmes
- Availability of solid, relevant, consistent and comparable metrics for measuring progress in Big Data processing and analytics performance
- Availability of metrics for measuring the quality, diversity and value of data assets
- Sustainable and globally supported and recognized Big Data benchmarks of industrial significance