Leveraging Big Data Technologies to Support Research in Unstructured Data Analytics
BY FRANÇOYS LABONTÉ, GENERAL MANAGER
JUNE 16, 2015
Major financial partner
WWW.CRIM.CA
ABOUT CRIM
- Applied research centre in IT
- Dual mission:
  - Provide expertise in IT to support enterprises and organisations in developing innovative products and solutions
  - Contribute to the creation of new knowledge through scientific activities and publications
THREE MAJOR AREAS OF EXPERTISE
1. INTERACTION AND HUMAN-SYSTEMS INTERFACES
  - Voice, movement, emotions
  - Augmented reality
  - User activity-related aspects
2. ADVANCED DATA ANALYTICS
  - Analysis and processing of video, imagery, audio, text
  - Semantics, natural language processing
  - Geospatial imaging
3. ADVANCED ARCHITECTURES AND TECHNOLOGIES FOR DEVELOPMENT AND TESTING
  - Client / cloud / mobile architectural approaches
  - Test modeling and automation
  - Code generation, model inference
  - Development, test and technological management methodologies
THE BIG DATA HYPE
[Figure from Gartner]
At CRIM, for many years:
- Volume: we have been dealing with large data sets: videos, satellite imagery, large text corpora
- Variety: we have been processing multi-modal data sets (text, images, audio, video)
- Velocity: we have been working on analyzing continuous data streams (surveillance)
- Visualisation: we have been investigating and developing human-machine interfaces
- Value (actionable items): we have been developing intelligent decision-support systems
SO WHAT IS IT ALL ABOUT?
BIG DATA TECHNOLOGIES
Open up new possibilities to solve complex problems in much simpler ways than before
- Hadoop and other related technologies:
  - No limitation on computing resources
  - No need to worry about scaling up
- NoSQL and other related technologies:
  - No need to know in advance the relations between the elements in a database
  - Capacity to combine various heterogeneous data sources as needed
- Dynamic data processing (streams):
  - Moving away from the batch processing approach
  - Capacity to develop more adaptive and reactive systems
  - Emergence of machine-to-machine / connected objects / Internet of Things applications
- Data centers and cloud technologies:
  - Data storage and file management are simplified
Promising technologies that do not yet offer simple, stable and mature solutions.
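
The move from batch to stream processing mentioned on this slide can be illustrated with a minimal sketch. It assumes a hypothetical stream of numeric sensor readings; the function name `running_mean` and the sample values are illustrative only and do not come from the presentation. The point is that each result is produced as soon as a value arrives, without storing the whole data set first.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one result per incoming value."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: earlier values do not need to be kept in memory.
        mean += (value - mean) / count
        yield mean


# Hypothetical sensor readings arriving one at a time (assumed values).
readings = [2.0, 4.0, 6.0, 8.0]
means = list(running_mean(readings))
print(means)
```

A batch version would wait for all readings and compute a single mean at the end; the generator above emits an updated answer after every reading, which is what makes reactive systems possible.
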
CRIM AND BIG DATA
- To continue developing our expertise by leveraging Big Data technologies in advanced analytics, but also in human-systems interactions and in architectures and advanced technologies for software development and testing
  - New ways to think about complex problems
  - Emphasis on problems involving unstructured data
- Empirical knowledge of Big Data technologies to accompany enterprises and organisations
  - Application-driven, with concrete use cases
  - Looking for the 5th V: Value
  - We prefer talking about SMART DATA
- Multidisciplinary approach:
  - Data science
  - Advanced analytics / machine learning
  - Visualisation and interaction
  - Business analysts
  - Governance and data quality
  - Product management
  - Data governance
  - Architecture and software development
SMART DATA: ADVANCED ANALYTICS
[Figure: four types of analytics, plotted by value against difficulty]
- Descriptive: What happened?
- Diagnostic: Why?
- Predictive: What will happen?
- Prescriptive: How to make it happen?
THE A2DI PROJECT (ADVANCED ANALYTICS FOR DATA INTELLIGENCE)
Goals
- Develop practical expertise with Big Data technologies (analytics, interaction, visualisation)
- Consolidate CRIM's advanced analytics components
- Build concrete use cases that can be used as an interactive technology showcase
- Foster multidisciplinary projects
- Develop new collaborations and partnerships
THE A2DI INFRASTRUCTURE
- Data collection and preparation
- Storage
- Data enrichment
- Metadata
- Analytics, data mining, machine learning, inference, fusion, statistical, heuristics
- Visualisation
- Decision support
- Configurable environment: specific deployments for selected use cases
- OpenStack
- Hadoop / Spark
- Data analytics tools
- Partners and external environment
DATA SET FROM OCEAN NETWORKS CANADA
Data types:
- Streaming Data
- Text Data
- Multi-dimensional
- Time Series
- Geo Spatial
- Video & Image
- Audio
- Relational
- Social Network
- RT Monitoring
Data sources:
- Video & audio streams
- Manual annotations, log files
- Spectrogram, echo sounder, hydrophone
- Vertical profiling system, sonar
- Navigation information, bathymetry, maps
- Fixed cameras and cameras mounted on a rover
- Narrative description
- Ontologies
USE CASE #1
- Keyword detection from the audio track of submarine maintenance videos
- Approximately 300 hours to process
- Specialized vocabulary in biology and submarine navigation
APACHE SPARK
- High-level library for processing very large data sets
- Developed at AMPLab (Berkeley) in 2009
- Generalized MapReduce paradigm: 30x faster, with low latency for streaming applications
- Distributed in-memory computing
- Now more popular than Hadoop
- Native integration with: Hadoop, ElasticSearch, Cassandra, RDBMS, Play!, etc.
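
The MapReduce paradigm named on this slide can be sketched in plain Python. This is not the Spark API and not the project's actual pipeline: it assumes the audio has already been transcribed to text, and the vocabulary `KEYWORDS` and the `SEGMENTS` are hypothetical examples. It only shows the three conceptual steps (map, group by key, reduce) that a distributed framework would run in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical specialised vocabulary and transcript segments (assumed data).
KEYWORDS = {"sonar", "hydrophone", "crab"}
SEGMENTS = [
    "the sonar picked up a crab",
    "hydrophone cable checked",
    "crab near the sonar mount",
]


def map_phase(segment: str) -> List[Tuple[str, int]]:
    """Emit (keyword, 1) for every keyword occurrence in one segment."""
    return [(word, 1) for word in segment.lower().split() if word in KEYWORDS]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would between the two phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each keyword."""
    return {key: sum(values) for key, values in grouped.items()}


mapped = [pair for segment in SEGMENTS for pair in map_phase(segment)]
keyword_counts = reduce_phase(shuffle(mapped))
print(keyword_counts)
```

Because `map_phase` looks at one segment at a time and `reduce_phase` looks at one key at a time, both steps can be spread over many workers, which is what makes the approach suitable for hundreds of hours of material.
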
ELASTIC SEARCH
- Distributed search engine
- NoSQL document database
- High availability
- Linear horizontal scalability
- Widely used in industry
- Features:
  - Full-text advanced search (Lucene)
  - Geospatial queries
  - Approximate string matching
  - Real-time analytics
- Native integration with: Hadoop (HDFS), Spark, etc.
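
Approximate string matching, listed among the features above, is commonly based on edit distance. The sketch below is a plain-Python illustration of that idea, not the search engine's own implementation or API. The term list `TERMS`, the misspelled query and the `max_edits` default are assumptions chosen for the example.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_search(query: str, terms: List[str], max_edits: int = 2) -> List[str]:
    """Return the indexed terms within max_edits edits of the query."""
    return [t for t in terms if edit_distance(query.lower(), t.lower()) <= max_edits]


# Hypothetical indexed vocabulary and a misspelled query (assumed data).
TERMS = ["hydrophone", "bathymetry", "sonar", "rover"]
matches = fuzzy_search("hydrofone", TERMS)
print(matches)
```

A tolerance of one or two edits is usually enough to absorb typing or transcription errors in specialised vocabulary without matching unrelated terms.
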
USE CASE #2
- Integration of geolocation data:
  - Keyword position
  - Rover position
  - Satellite imagery
  - Sonar location
GEOMESA
- Spatio-temporal layer for Accumulo (NoSQL)
- GeoMesa + Accumulo = big data + PostGIS + PostgreSQL
- Storage, querying and processing of vector spatio-temporal big data
- OGC standards support: WMS, WFS, WPS
- Use cases:
  - Density heatmaps
  - Batch or streaming analytics
  - Spatio-temporal predictive analytics
- Native integration with:
  - Spark (analytics and clustering)
  - GeoServer (web mapping) and OpenLayers (front end)
  - GeoTrellis for raster geospatial data (satellite imagery, etc.)
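
The density heatmap use case above amounts to counting observations per spatial cell. The following is a minimal plain-Python sketch of that binning step; it does not use GeoMesa or Accumulo, and the coordinates, the `cell_size` value and the function name `density_grid` are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees


def density_grid(points: Iterable[Point], cell_size: float) -> Counter:
    """Count points per square grid cell; keys are (column, row) cell indices."""
    counts: Counter = Counter()
    for lon, lat in points:
        cell = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        counts[cell] += 1
    return counts


# Hypothetical rover positions (assumed values, not real survey data).
positions = [
    (-125.05, 48.31),
    (-125.04, 48.32),
    (-125.31, 48.45),
    (-125.06, 48.39),
]
grid = density_grid(positions, cell_size=0.1)
for cell, count in grid.most_common():
    print(cell, count)
```

Each cell count can then be mapped to a colour to draw the heatmap. In a distributed setting the same counting step would be applied per partition and the partial counts summed.
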
USE CASE #3
- Keyword search enhancement with ontologies from Web resources
- Natural language processing
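
One common way to enhance keyword search with an ontology is query expansion: the query term is replaced by the term plus its narrower or related terms. The sketch below assumes a toy ontology stored as a dictionary; the entries in `ONTOLOGY`, the `DOCUMENTS` and the function names are hypothetical and are not taken from the project.

```python
from typing import Dict, List, Set

# Toy ontology: each term maps to narrower or related terms (assumed entries).
ONTOLOGY: Dict[str, List[str]] = {
    "crustacean": ["crab", "shrimp"],
    "crab": ["tanner crab"],
    "instrument": ["hydrophone", "sonar"],
}

# Hypothetical annotation texts to search.
DOCUMENTS = [
    "A tanner crab crosses the frame",
    "Shrimp swarm near the vent",
    "Hydrophone deployed",
]


def expand_query(term: str, ontology: Dict[str, List[str]]) -> Set[str]:
    """Return the term plus every term reachable from it in the ontology."""
    expanded: Set[str] = set()
    to_visit = [term]
    while to_visit:
        current = to_visit.pop()
        if current in expanded:
            continue  # already handled; also protects against cycles
        expanded.add(current)
        to_visit.extend(ontology.get(current, []))
    return expanded


def search(term: str, documents: List[str], ontology: Dict[str, List[str]]) -> List[str]:
    """Return the documents that mention the term or any of its expansions."""
    terms = expand_query(term, ontology)
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]


results = search("crustacean", DOCUMENTS, ONTOLOGY)
print(results)
```

A plain keyword search for "crustacean" would return nothing here, since no document contains that word; the expanded query finds the documents that mention crabs and shrimp.
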
PLATFORM DEMONSTRATION
ANOTHER BIG DATA PROJECT
- VESTA: Video Evaluation System for Task Analysis
- LEADS research network: Learning Environments Across Disciplines
  - Education sciences: How do students learn?
  - 6 universities and 11 partner organizations (Canada)
  - 13 universities and 4 partner organizations (North America, Europe, Australia)
  - Led by Dr Susanne Lajoie (McGill University)
LEADS CONTEXT
- Video analysis of students in learning situations
  - Video content: typically one student, many tasks
  - Audio content: think aloud, reading, conversation, answering questions
- Video
- Local sources
- Access rights management
- Manual transcripts
- Manual coding
- Data sharing
THE VESTA PLATFORM FEATURES
- A Web-based platform relying on some of the most recent HTML5 features
- 5 semi-automated annotation services:
  - Speaker identification
  - Transcription
  - Audio-text correspondence
  - Transition detection (video)
  - Face detection
- 3 utility services:
  - Annotation storage
  - Load balancing / task dispatching
  - Multimedia file storage
- Access rights management taking into account ethics approval for research protocols
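
The slide does not say how the transition detection service works. One classic approach, shown here purely as an illustration and not as the VESTA implementation, compares the intensity histograms of consecutive frames and flags a transition when they differ sharply. The frames below are tiny hypothetical lists of grayscale pixel values, and the `bins` and `threshold` defaults are assumptions.

```python
from typing import List, Sequence

Frame = Sequence[int]  # flattened grayscale pixel values in the range 0..255


def histogram(frame: Frame, bins: int = 4) -> List[float]:
    """Normalised intensity histogram of one frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def detect_transitions(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Return each index i where frame i differs sharply from frame i - 1."""
    cuts: List[int] = []
    previous = histogram(frames[0])
    for index in range(1, len(frames)):
        current = histogram(frames[index])
        # L1 distance between the histograms, halved so it ranges from 0 to 1.
        distance = sum(abs(a - b) for a, b in zip(previous, current)) / 2
        if distance > threshold:
            cuts.append(index)
        previous = current
    return cuts


# Hypothetical four-pixel frames: two dark shots followed by two bright ones.
dark = [10, 20, 30, 40]
dark_variant = [12, 22, 28, 45]
bright = [200, 220, 240, 250]
cuts = detect_transitions([dark, dark_variant, bright, bright])
print(cuts)
```

Small changes within a shot leave the histogram almost unchanged, so only the jump from the dark frames to the bright ones is reported. A semi-automated service would present such candidates to a human annotator for confirmation.
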
THE VESTA PLATFORM
CONCLUSIONS
- Big Data offers huge potential, largely underexploited at this time
- Like numerous fundamental changes, expect a long journey
- Establish an ambitious vision; accomplish modest first steps, but with tangible value
- There is no one-size-fits-all approach; it must be tailored to the specific use case
- The question is not "Too Big or not Too Big"; what is important is data intelligence ("Smart Data") that brings concrete value to the organisation
- Big Data technologies can also be used in other contexts
- On top of technological challenges, human challenges will dominate and determine the success or failure of specific initiatives
PITFALLS
"A wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it."
Herbert Simon, Designing Organizations for an Information-Rich World, 1971
- Not planning enough
- Planning too much
- Weak commitment
- Thinking it will be easy to implement
- Minimising issues related to change management