The BigIaS Platfrm Simplifying Big Data Integratin A Sftware as a Service Apprach ~ Preliminary Analysis and Design ~ September 5 th, 2013 Sgndal, Nrway Dumitru Rman Claudia Daniela Pp Rxana Iana Rman Bjørn Magnus Mathisen Cntact: dumitru.rman@sintef.n 1
Cntext: Big Data and Our Primary Fcus Addresses things that can be dne at a large scale but cannt be dne at a smaller ne Extract new insights r create new frms f value in ways that change stakehlders and relatinships between them Causality vs. crrelatins: nt knwing why but nly what Challenges the basic traditinal understanding f hw t make decisins Hetergeneity Change 2
Overview The prblem Data integratin - a cmplex, unslved prblem Tls addressing varius aspects f data integratin prcess can hardly be used tgether fr mre cmplex, interesting integratin tasks => High cst f data integratin at large scale, rather cmplicated and time cnsuming prcess The gal: Simplify data integratin at large scale! Enable users with limited technical data integratin skills t get frm raw data t insightful data with minimal effrt The apprach Semantic-based data integratin Flexible and custmizable wrkflws f data integratin tls (applicatin integratin) => A Sftware-as-a-Service fr data integratin at large scale 3
The BigIaS Platfrm UX Prtal and widgets Platfrm Layer Data peratins Entity extractin Data crawling Data linking Visualizatin Streaming Talend Data Integratin C SPARQL Karma Grk/ NUPIC Open Refine Trident ML Tabula UIMA/ Gate Talend ESB Silk Map4Rdf LdSpider LdLive Other Data Layer Strage services Streaming services Sesame OWLIM Streaming Data Sets 3 rd party 3 rd party 3 rd party Open Data LOD 4
Evaluated tls/appraches 1. Applicatin Integratin Talend ESB 2. Data Prcessing Talend Data Integratin Tabula Karma Open Refine UIMA GATE Silk LdSpider 3. Strage Sesame OWLIM 4. Visualizatin Map4Rdf LdLive 5. Real-time Machine Learning Trident ML Grk/NUPIC 5. Streaming C-SPARQL 5
Prpsed Integratin Wrkflws 1. Data Cntextualizatin 2. Entity Discvery 3. Data Linking 4. Data Visualizatin 5. Real-time Machine Learning 6. RDF Streaming 6
Data Cntextualizatin *OWL Excel, CSV, XML, etc. Integratin with external data (Karma, Open Refine) PDF Excel Relatinal DB X2CSV (Tabula PDF, Open Refine, Karma) Exprt RDF (Karma, Open Refine) Open Refine: files: TSV, CSV, *SV, Excel (.xls and.xlsx), JSON, XML, RDF as XML and Ggle Dcs Web Services Karma: databases: MySQL, SQL Server, ORACLE, PstGIS files: Excel, CSV, XML, JSON, KML Web Services Tabula: PDF Data Cleaning (Open refine) Stre RDF (Sesame + OWLIM) SPARQL endpint * OWL file can be imprted r cnfigured 7
Entity Discvery *OWL Exprt RDF Stre RDF SPARQL endpint (Karma, Open Refine) (Sesame + OWLIM) CSV Excel Entity Extractin (UIMA, Gate) Recncile against: Gate: files: text files UIMA: files: text, audi, vide Entity Cntextualizatin Open Refine (Recncile Plug In) Sparql Endpint RDF file Sindice service (Twitter, Freebase, DBpedia, and ther datasets, relevant fr the data) * OWL file can be imprted r cnfigured 8
Data Linking SPARQL endpint CSV Excel Relatinal DB X2RDF (Open Refine, Karma) Stre RDF (Sesame + OWLIM) Link Discvery (SILK) Open Refine: files: TSV, CSV, *SV, Excel (.xls and.xlsx), JSON, XML, RDF as XML and Ggle Dcs Web Services Karma: databases: MySQL, SQL Server, ORACLE, PstGIS files: Excel, CSV, XML, JSON, KML Web Services Stre RDF (Sesame + OWLIM) SPARQL endpint 9
Data Visualizatin Data Cntextualizatin Sparql Endpint CSV Excel Relatinal DB Entity Discvery Data Visualizatin (Map4RDF, LdLive) GUI Data Linking 10
Real time Machine Learning Data Cntextualizatin Sparql Endpint CSV Excel Relatinal DB Entity Discvery Data Linking Real time Machine Learning (GROK/NUPIC, Trident ML) Patterns C Sparql Query Generatr C Sparql queries Streaming Service (DB, e.g. JSON) Real Time RDFizatin Service 11
RDF Streaming C Sparql queries Real Time RDFizatin Service Streaming Service (DB, e.g. JSON) Real time Machine Learning (GROK/NUPIC, Trident ML) Streaming Engine (C Sparql) RDF 12
Cre Technlgical Prerequisites Tl Open Refine Karma UIMA Silk C Sparql Prerequisites GREL Functins GREL Examples Create ntlgies Ontlgies frm Karma Web Interface Regular Expressins Silk Link Specificatin Language (Silk LSL) C Sparql language 13
Relevant upcming research prjects (currently under negtiatins) Data Publishing thrugh the Clud: A Data- and Platfrm-as-a- Service Apprach fr Efficient Data Publicatin and Cnsumptin (DaPaaS) The DaPaaS prject aims t deliver an integrated DaaS and PaaS envirnment fr pen data the DaPaaS platfrm tgether with supprting activities fr effective and efficient publicatin and cnsumptin f data and creatin f applicatins using the data. Expected t start Nv 2013 Budget ~2.1M (~1.5M ) fr 2 years (EC funded) 14
15
Relevant upcming research prjects (currently under negtiatins cnt ) PraSense The Practive Sensing Enterprise The gal is t prvide a very scalable, distributed architecture fr the management and prcessing f big-data that will enable cntinuus mnitring f the need fr the service adaptatin and prpse crrespnding changes in an (semi-) autmatic way. Expected t start Nv 2013 Budget ~4.2M (~3.2M ) fr 3 years (EC funded) 16
17
Relevant upcming research prjects (currently under negtiatins cnt ) SmartOpenData Open Linked Data fr envirnment prtectin in Smart Regins SmartOpenData aims t define mechanisms fr acquiring, adapting and using Open Data prvided by existing surces fr envirnment prtectin in Eurpean prtected areas Expected t start Nv 2013 Budget ~3.4M (~2.5M ) fr 2 years (EC funded) 18
Relevant upcming research prjects (currently under negtiatins cnt ) INFRARISK Nvel Indicatrs fr identifying critical INFRAstructure at RISK frm natural Hazards Develp reliable stress tests n Eurpean critical infrastructure using integrated mdelling tls fr decisin-supprt. It will lead t higher infrastructure netwrks resilience t rare and lw prbability extreme events, knwn as black swans. Expected t start Oct 2013 Budget ~3.6M (~2.8M ) fr 2 years (EC funded) 19
Relevant nging research prjects BigFut Analyzing Big Data: Preparing fr the Future f Intelligent Infrmatin Management SINTEF Internal prject Gals: Analyze and integrate a suit f technlgical appraches and techniques Advance dem/prttype implementatin t penetrate the market in the shrt term Jan-Dec 2013 20
Relevant nging research prjects (cnt ) CITI-SENSE develp Citizen s Observatries t empwer citizens t: Cntribute t and participate in envirnmental gvernance Supprt and influence cmmunity and plicy pririties and assciated decisin making Cntribute t Glbal Earth Observatin System f Systems (GEOSS) 27 partners (EC fuded) http://www.citi sense.eu/ 21
Summary and Outlk What s new here The platfrm itself, implementing flexible data integratin wrkflws A set f cmpnents (e.g. Real-time RDFizatin f streams, C-Sparql Query Generatr) What s challenging Applicatin integratin Cnsistent scalability thrughut wrkflws Platfrm deplyment n clud envirnments Use f new, unprven technlgies and many thers Shrt-term plan (end f September) Get the first prttype implementing the prpsed wrkflws Experiment with sme data / simple use cases (e.g. CITI-SENSE data) 22
Thank yu! Q&A 23