Data Big and Small: How Publishers Gain Value out of Data in the Future WS 4: Frank Föge (MarkLogic), Oliver Zmorek (De Gruyter), Stefan Schwedt (Newbooks Solutions)
Agenda
- Introduction De Gruyter, Newbooks & MarkLogic
- Big Data Definition
- Data Big and Small
- Publishers' perspective: Challenge, Approach & Solution
- De Gruyter, Use Case: Internal Data Analysis
- Newbooks, Use Case: VLB-TIX
SLIDE: 2
De Gruyter - Innovative, agile responses to the challenges of tomorrow SLIDE: 3
Only a holistic view on publishing and a subsequent strategy will satisfy customers' new demands SLIDE: 4
The only constant is change SLIDE: 5
BIG DATA DEFINITION
The very early days of Big Data 1854 Cholera in London from assumption to knowledge SLIDE: 7
Big Data = One C and Three Vs: Complexity, Volume, Variety, Velocity SLIDE: 9
How did we get here? The early 80s SLIDE: 10
The world as it was Data: Regular & Tabular Compute & Storage: Slow & Expensive Pace of Change: Glacial SLIDE: 11
We had an elegant model that met our needs
EMP
EMPNO  ENAME   DEPTNO
7782   CLARK   10
7934   MILLER  10
7876   ADAMS   20
7902   FORD    20
7900   SMITH   30
DEPT
DEPTNO  DNAME
10      ACCOUNTING
20      RESEARCH
30      SALES
40      SHIPPING
SLIDE: 12
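The EMP/DEPT model on this slide can be tried out with a minimal sketch using Python's built-in sqlite3 module. The rows are taken from the slide; the column types, the key constraints and the choice of SQLite are assumptions made only for illustration.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Column types and constraints are assumed, the slide only shows names and values.
conn.executescript("""
CREATE TABLE dept (deptno INTEGER PRIMARY KEY, dname TEXT);
CREATE TABLE emp (
    empno  INTEGER PRIMARY KEY,
    ename  TEXT,
    deptno INTEGER REFERENCES dept(deptno)
);
""")

conn.executemany(
    "INSERT INTO dept VALUES (?, ?)",
    [(10, "ACCOUNTING"), (20, "RESEARCH"), (30, "SALES"), (40, "SHIPPING")],
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(7782, "CLARK", 10), (7934, "MILLER", 10), (7876, "ADAMS", 20),
     (7902, "FORD", 20), (7900, "SMITH", 30)],
)

# Resolve each employee's department name through the shared DEPTNO key.
rows = conn.execute(
    "SELECT e.ename, d.dname FROM emp e "
    "JOIN dept d ON d.deptno = e.deptno ORDER BY e.empno"
).fetchall()

for ename, dname in rows:
    print(ename, dname)
```

The join works because every record has the same regular, tabular shape, which is the premise of "the world as it was".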
Fast forward to today IT Budgets 33% on Innovation & Growth 67% on Maintenance (keeping the lights on) SLIDE: 13
We end up with the wrong technology for the job DATA When all you have is a hammer, everything looks like a nail SLIDE: 14
The Three Vs of Big Data: VOLUME, VARIETY, VELOCITY
IT faces the challenge of leveraging both heterogeneous and unstructured data: 12% structured, 88% unstructured. Reference Data, OLTP, Warehouse, Archives, Data Marts? SLIDE: 16
The Three Vs of Big Data: VOLUME, VARIETY, VELOCITY
Variety More of the same things Lots of different things SHAPE SOURCES FORMATS QUESTIONS SLIDE: 18
The result High Costs Crushing Complexity Strangled Innovation SLIDE: 19
DATA BIG AND SMALL
The Big and Small Data Opportunity: Is it possible to utilize all your data in a cost-effective way and realign for the future? Image Source: http://www.20x200.com/blog/blogimages/get_excited_and_make_things_with_creative_commons/1206_artworkimage.jpg
CHALLENGE, APPROACH & SOLUTION
Today's Data Landscape: Massive growth of information with various data types. [Diagram labels: RDBMS, Search Engine; Volume of Information; Information Continuum (Structured, Semi-structured, Unstructured); XML, Metadata / ONIX, Geospatial, Graph, Relational, PDF, Content, Emails, Documents, Free text, Hierarchical]
Challenge: Things publishers need to deal with
- Different file formats and schemas (XML, CSV, Excel, binaries)
- Different information transfer technologies (REST/SOAP, MBS, file ex-/imports, file transfer protocols)
- Growing data amounts = scalability
- Current situation: acquire knowledge of your data streams
- Desired situation: data-driven decision making
SLIDE: 24
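To make the "different file formats and schemas" challenge concrete, here is a minimal sketch that reads the same kind of title record from an XML feed and from a CSV export and maps both onto one common shape. The feed contents, the element and column names (`isbn`, `title`) and the sample ISBNs are invented for illustration and do not describe any real De Gruyter or Newbooks interface.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample feeds; real feeds would arrive via REST, file import, etc.
XML_FEED = """<products>
  <product><isbn>9783110000001</isbn><title>Example Title A</title></product>
</products>"""

CSV_FEED = "isbn,title\n9783110000002,Example Title B\n"


def records_from_xml(text):
    """Map each <product> element onto a plain dict."""
    root = ET.fromstring(text)
    return [
        {"isbn": p.findtext("isbn"), "title": p.findtext("title"), "source": "xml"}
        for p in root.iter("product")
    ]


def records_from_csv(text):
    """Map each CSV row onto the same dict shape."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"isbn": row["isbn"], "title": row["title"], "source": "csv"}
        for row in reader
    ]


# One list of uniformly shaped records, regardless of the original format.
records = records_from_xml(XML_FEED) + records_from_csv(CSV_FEED)
for record in records:
    print(record)
```

Every additional format needs its own adapter of this kind, which is one way to see why the number of interfaces grows with each new system.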
Information and System Flow Chart SLIDE: 25
Approach: What do publishers need?
- more systems to cover all our requirements
  = increasing costs, maintenance, support
  = further interfaces, mid-/long-term integration
  = inflexible, not agile
OR
- one target system that allows interdisciplinary reporting and offers NoSQL technologies?
SLIDE: 26
Source: http://www.langsonenergy.com/what-is-disruptive-technology
MarkLogic Enterprise NoSQL Database Application Services DATABASE SEARCH Semantics SLIDE: 28
Solution: NoSQL Database
- don't worry about your data
- different applications for different target groups
- XML-oriented searches, queries and indexes
- manage large volumes
De Gruyter uses MarkLogic for:
- De Gruyter Online platform
- Content Management Systems
SLIDE: 29
Use Case: Correlation of usage and sales (1/2)
2 exemplary questions:
- What is being used but not sold?
- What is being sold but not used?
=> early insight into customers' behavior; allows the business to react accordingly and address customers
Increasing the attractiveness of a publication requires a better understanding of the connection between sales and usage
SLIDE: 30
Use Case: Correlation of usage and sales (2/2)
Source 1: Usage statistics of De Gruyter Online give an overview of database, book (chapter) and journal (article) usage
Source 2: Sales figures from the data warehouse contain overall sales statistics: webshop, mail/telephone order (customer service)
DEMO
SLIDE: 31
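The two exemplary questions of this use case (used but not sold, sold but not used) can be sketched in a few lines of Python once both sources are keyed by a shared title identifier. This is not the demo shown in the workshop. The identifiers and figures are invented, and the dictionaries merely stand in for the usage statistics and the data warehouse export.

```python
# Hypothetical sample data: title identifier -> count.
usage = {"book-001": 120, "book-002": 0, "journal-003": 45, "book-004": 300}
sales = {"book-001": 15, "book-002": 8, "journal-003": 0}


def correlate(usage, sales):
    """Return (used_but_not_sold, sold_but_not_used) as sorted lists of titles.

    A title missing from one source is treated as a count of zero there.
    """
    titles = sorted(set(usage) | set(sales))
    used_not_sold = [
        t for t in titles if usage.get(t, 0) > 0 and sales.get(t, 0) == 0
    ]
    sold_not_used = [
        t for t in titles if sales.get(t, 0) > 0 and usage.get(t, 0) == 0
    ]
    return used_not_sold, sold_not_used


used_not_sold, sold_not_used = correlate(usage, sales)
print("Used but not sold:", used_not_sold)
print("Sold but not used:", sold_not_used)
```

Treating a missing title as zero is a design choice of this sketch. In practice it matters whether a title is absent from a source or present with a count of zero, and the two cases may need to be reported separately.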
Group Question
Think about a use case which will allow you to break up data silos to do ad-hoc analysis of heterogeneous data types coming from various sources that can be useful for your business
- What kind of data?
- What data sources?
- Questions that take a lot of effort to find answers to?
- Which decision making process can this data / answer support?
- What new insights can be derived from this?
SLIDE: 32
Q&A SLIDE: 33