1 Big Data Complexities for Scientific Computing in the Oil and Gas Industry: noSQL, SQL, and moSQL
  http://www.limitpoint.com/images/publications/bigdatainoilandgas.pdf
  David M. Butler, President, Limit Point Systems, Inc.
2 Outline Big Data in oil & gas exploration & production Field theory for data scientists The data model paradigm The sheaf data model A query language for the sheaf data model
3 The oil and gas business Adapted from [Krebbers] Upstream is exploration and production ( E&P ) (upper left) Downstream is transportation, refining, and marketing (lower right)
4 Major Acquired Upstream Data Types Time lapse raw seismic Time lapse prestack seismic image Time lapse poststack seismic image Well logs Production monitoring dozens of other data types
5 Time lapse raw seismic data each sensor gives amplitude as a function of time ~10K sensors moving towards ~1M ~10K shots ~5K samples/shot ~4 12 bytes/sample time lapse: repeat ~2/year ~10 years from [KrisEnergy] ~10 TB/project*~100 projects/year/major company ~1PB/year/major
6 Time lapse prestack seismic image data clean up seismic data remove noise remove artifacts other signal processing operations migrate data focus signal energy convert time to position up to 5D array of data reflectivity as a function of 3D position source-sensor 2D offset ~same size as raw seismic
7 Poststack seismic image data stack of prestack data aggregate over 1 or more array indices reduces size ~100x 2D or 3D image reflectivity as function of position similar to medical ultrasound image [epmag 1] interpret to produce model of subsurface
8 Well logs lower sensor package into well measure various properties as a function of depth ~10k samples ~1k components simple numbers bore hole images others typically done once before production starts ~100MB/well*~1K wells/year/major ~ 100GB/year/major [decogeo]
9 Production monitoring Classical methods at well head flow volumes gas/oil/water composition temperature pressure Distributed sensing methods fiber optic cables in well acoustic sensing temperature sensing ~1000 equivalent discrete sensors ~1k samples/sec continuous monitoring ~10-100GB/day/well function of time and position along well path [epmag 2] [slb 1] ~1K wells (growing rapidly) ~1PB/year/major
10 Major interpreted/modeled data types Geological structure model Velocity model Basin model Reservoir models geological quantitative engineering Geomechanical model dozens of other data types
11 Geological structure model geologist interprets seismic image identifies surfaces defining rock strata and faults very complex networks of intersecting surfaces iterative process seismic image depends on acoustic velocity acoustic velocity depends on rock type rock type interpreted from seismic image and well data ~1GB/structure ~1K structures/year/major ~1TB/year/major
12 Velocity model velocity of sound as a function of position in volume corresponding to geological structure scalar, vector, or tensor models used to produce seismic images accurate velocity model key to good seismic image ~1-10GB/model [geosoft] ~1K models/year/major ~1TB/year/major [pdgm 1]
13 Basin model dynamic model of entire sedimentary basin rock movement fluid movement study history of hydrocarbon deposits generation expulsion migration to reservoir entrapment useful in predicting whether structure contains oil or gas [outernode] ~100GB/model*~100/year/major ~10TB/year/major
14 Reservoir models static models prior to production estimate volume and other properties dynamic models fluid flow fluid composition function of position and time used to guide drilling & production keep wells producing ~100GB/project many fields, many versions/year/major ~100 TB/year/major [dgi]
15 Geomechanical model simulation of mechanical stresses and strains whole subsurface specific reservoirs stress, strain, deformation as function of position and time used to anticipate mechanical changes around bore hole and in reservoir ~1-10GB/model ~100 models/year/major ~100GB/year/major [slb 3]
16 Summary of Upstream Data Types (order-of-magnitude estimates)
  Variety                 Volume (/object)   Velocity (/year/major)
  Raw seismic             ~1TB               ~1PB
  Prestack seismic        ~1TB               ~1PB
  Poststack seismic       ~10GB              ~10TB
  Well logs               ~100MB/well        ~100GB
  Production monitoring   ~10GB              ~1PB
  Geological structure    ~1GB               ~1TB
  Velocity model          ~1GB               ~1TB
  Basin model             ~100GB             ~10TB
  Reservoir models        ~100GB             ~100TB
  Geomechanical model     ~1GB               ~100GB
  dozens of other data types, all important
  variety rather than volume or velocity is dominant feature
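The yearly totals are a per-object size multiplied by a count of objects per year. A minimal Python sketch of that arithmetic follows, using the per-object sizes and yearly counts quoted on the individual data-type slides (for example ~10 TB/project and ~100 projects/year for raw seismic, and the lower end of the ~1-10GB range for geomechanical models). Decimal units and the helper names `yearly_volume` and `humanize` are assumptions made for illustration only.

```python
# Decimal (SI) byte units; an assumption, the slides do not say which convention is used.
UNITS = {"MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}

def yearly_volume(size, unit, objects_per_year):
    """Bytes per year = size of one object * number of objects per year."""
    return size * UNITS[unit] * objects_per_year

def humanize(n_bytes):
    """Render a byte count in the largest unit that keeps the value >= 1."""
    for unit in ("PB", "TB", "GB", "MB"):
        if n_bytes >= UNITS[unit]:
            return f"~{n_bytes / UNITS[unit]:g} {unit}"
    return f"{n_bytes} B"

# (size per object, unit, objects per year per major company), taken from the slides.
ESTIMATES = {
    "raw seismic": (10, "TB", 100),
    "well logs": (100, "MB", 1000),
    "geological structure": (1, "GB", 1000),
    "basin model": (100, "GB", 100),
    "geomechanical model": (1, "GB", 100),
}

for name, (size, unit, count) in ESTIMATES.items():
    print(f"{name}: {humanize(yearly_volume(size, unit, count))}/year/major")
```

These are order-of-magnitude figures, so the sketch is only meant to show how the per-object and per-year columns relate.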
17 Upstream Data Flow (partial) [cda] complex interoperation between data types
18 Shared Earth Model concept integrated data base for evolving models of subsurface all data types multiple scales structure reservoir basin multiple interpretations and versions per object uncertainty quantification for everything provenance for everything constantly evolving holy grail of Exploration and Production ( E&P ) data integration in practice: still mostly vendor proprietary islands of integration
19 Shared Earth Model conceptually similar to conventional enterprise data warehouse
  analysis and report oriented rather than transaction oriented
  integrates data from many different applications
  Extract-Transform-Load (ETL) processes a critical component
  conventional warehouse and ETL
    relational data model provides conceptual framework
  Shared Earth Model for E&P data
    relational data model has not proven particularly useful
    why not? most data is physicist's field data
20 Outline (next section): Field theory for data scientists
21 Field Theory for Data Scientists
  physicist's field not same as database admin's field
  field describes some physical property as function of position and/or time in some physical object
    position in a physical object
    physical property
    physical property as a function of position
  use a simple example to introduce these ideas
22 A simple example: branched well
  [Figure: branched well with labels derrick floor, upper well, well junction, lower well, bore 1, bore 2]
23 Position in a physical object
  position represented by coordinate vector $r = \begin{pmatrix} x(p) \\ y(p) \end{pmatrix} \in \mathbb{R}^2$
  [Figure: point p of the object located at (x(p), y(p)) on x-y axes]
24 Physical property
  physical property types specified by mathematical physics
  family of types jointly referred to as multilinear algebra
  scalar types: single number $F$
  vector types: column of numbers $F = \begin{pmatrix} F_0 \\ F_1 \end{pmatrix}$
  tensor types: matrix of numbers $F = \begin{pmatrix} F_{00} & F_{01} \\ F_{10} & F_{11} \end{pmatrix}$
  each has important algebraic properties
  a few dozen standard types, many more app specific types
25 Physical property as a function of position
  function (map) from physical space to property space
  associates a value of F with each p in the object
  infinite number of points
  infinite number of property values
  how do we represent this on the computer?
  [Figure: property value F(r) shown at points p of the object in the x-y plane, $\mathbb{R}^2$]
26 How do we represent a field on the computer? numerous methods small industry busy creating new methods makes interoperation and integration difficult some common features decompose physical object into simple pieces approximate by simple function on each piece
27 Decompose physical object into simple pieces
  mathematicians call each piece a cell
  decomposition is a cell complex
  more commonly called a mesh
  [Figure: branched well decomposed into cells, with labels df, j, s0 to s5, v1, v3 to v6]
28 Approximate by simple function on each cell
  for each cell c:
    store a data tuple
    specify an evaluation method
  evaluation method: $F(p) = \mathrm{eval}_{c(p)}(p, \text{data tuple})$, where $c(p)$ is the cell containing $p$
  data tuple may or may not correspond to value of field at some point
    depends on evaluation method
  data for entire field is an array of tuples
  example: linear interpolation on a cell with vertices $v_0$, $v_1$, stored values $F_0$, $F_1$, and local coordinate $u(p)$
    $\mathrm{value}(p) = u F_1 + (1 - u) F_0$
29 Data for entire field is an array of tuples
  one tuple per cell (cell 0, cell 1, ..., cell n-1); indices read component, cell
  scalar: $F_0, F_1, F_2, \ldots, F_{n-1}$
  vector: $(F_{0,0}, F_{1,0}), (F_{0,1}, F_{1,1}), (F_{0,2}, F_{1,2}), \ldots, (F_{0,n-1}, F_{1,n-1})$
  tensor: $(F_{00,0}, F_{01,0}, F_{10,0}, F_{11,0}), (F_{00,1}, F_{01,1}, F_{10,1}, F_{11,1}), \ldots, (F_{00,n-1}, F_{01,n-1}, F_{10,n-1}, F_{11,n-1})$
  tuple components typically real (float or double) but may be of any type
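The per-cell scheme on this slide can be sketched in a few lines of Python. The sketch below assumes a 1D cell with a local coordinate u in [0, 1]. The names `eval_linear`, `eval_constant`, `field`, and `evaluate` are hypothetical and are not taken from any library mentioned in the talk. The piecewise-constant evaluator is added only to illustrate that a data tuple need not hold values at specific points.

```python
def eval_linear(u, data_tuple):
    """Linear interpolation on a 1D cell: value = u*F1 + (1 - u)*F0."""
    f0, f1 = data_tuple
    return u * f1 + (1.0 - u) * f0

def eval_constant(u, data_tuple):
    """Piecewise-constant evaluation: one stored number for the whole cell."""
    (f,) = data_tuple
    return f

# The field is an array with one (evaluation method, data tuple) pair per cell.
field = [
    (eval_linear, (2.0, 4.0)),   # cell 0: values at its two vertices
    (eval_linear, (4.0, 3.0)),   # cell 1
    (eval_constant, (3.0,)),     # cell 2: a single value, e.g. a cell average
]

def evaluate(field, cell_index, u):
    """F(p) = eval_c(p, data tuple), with p given as (cell index, local coordinate u)."""
    method, data_tuple = field[cell_index]
    return method(u, data_tuple)

print(evaluate(field, 0, 0.5))  # midpoint of cell 0
```

Locating the cell that contains a given point p is a separate step that this sketch leaves out.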
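One way to hold such an array of tuples in memory is a single contiguous buffer with a fixed number of components per cell. The Python sketch below uses only the standard `array` module. The cell-major layout and the helper names `pack_field` and `cell_tuple` are assumptions for illustration, not a description of how any particular application stores its fields.

```python
from array import array

def pack_field(tuples):
    """Flatten per-cell tuples into one contiguous array of doubles."""
    width = len(tuples[0])
    if any(len(t) != width for t in tuples):
        raise ValueError("all tuples must have the same number of components")
    flat = array("d")
    for t in tuples:
        flat.extend(t)
    return flat, width

def cell_tuple(flat, width, cell):
    """Recover the data tuple of one cell from the flat array."""
    start = cell * width
    return tuple(flat[start:start + width])

# A vector field with two components per cell, on three cells.
flat, width = pack_field([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)])
print(cell_tuple(flat, width, 1))
```

A scalar field is the special case of one component per cell, and a 2x2 tensor field uses four components per cell.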
30 How do we want to use field data? operations specified by mathematical physics five main categories topological operations compose and decompose geometric operations change the shape functional operations set and get the value at a point move field from one mesh to another algebraic operations add, subtract, multiply, divide, diagonalize,... calculus operations differentiate and integrate
31 Why isn't the relational model useful for field data?
  doesn't fit the way we want to store field data
    relational schema can't directly capture field entity
      captures data tuple entity instead of entire field entity
      field entity has to be reconstructed by queries
    normalization forces introduction of surrogate keys
    may require recursive queries
  doesn't fit the way we want to use field data
    table operations are too low level
    aren't useful for high level field operations
  no pay-off to using relational model
    most field data is stored in app-specific, proprietary flat files
  so what data model is useful for field data?
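To make the first point concrete, here is a small Python sketch using the standard `sqlite3` module. The schema (tables `field` and `field_value`, surrogate keys `field_id` and `cell_id`) is hypothetical and chosen only for illustration. It shows that the table rows are individual tuple components, and that the field as a whole exists only after a join and regrouping in application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE field (field_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE field_value (
    field_id  INTEGER REFERENCES field(field_id),  -- surrogate key
    cell_id   INTEGER,                             -- surrogate key for the mesh cell
    component INTEGER,
    value     REAL,
    PRIMARY KEY (field_id, cell_id, component)
);
""")
conn.execute("INSERT INTO field VALUES (1, 'velocity')")
rows = [(1, cell, comp, float(10 * cell + comp))
        for cell in range(3) for comp in range(2)]
conn.executemany("INSERT INTO field_value VALUES (?, ?, ?, ?)", rows)

def load_field(conn, name):
    """Rebuild the array of per-cell tuples from individual component rows."""
    cur = conn.execute(
        """SELECT v.cell_id, v.component, v.value
           FROM field_value v JOIN field f ON f.field_id = v.field_id
           WHERE f.name = ?
           ORDER BY v.cell_id, v.component""",
        (name,),
    )
    tuples = {}
    for cell_id, _component, value in cur:
        tuples.setdefault(cell_id, []).append(value)
    return [tuple(tuples[c]) for c in sorted(tuples)]

print(load_field(conn, "velocity"))
```

Nothing in the schema says how to evaluate, differentiate, or re-mesh the field. Those operations stay entirely in application code.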
32 Outline (next section): The data model paradigm
33 The data model paradigm Data model [Codd] specifies class of mathematical objects operations on those objects constraints valid instances must satisfy Languages, libraries, tools based on data model Applications developed on top of tools Numerous benefits
34 Benefits of data model paradigm Increases level of abstraction for application development Increases capability of applications Facilitates interoperation and integration Increases productivity of programmers But
35 But Benefits only accrue if model captures application structure The more structure captured the bigger the benefit Important to capture as much structure as possible
36 Spectrum of mathematical structure captured by various data models
  most noSQL models capture less structure than relational
    the "no" in noSQL should perhaps be "less"
  scientific apps have way more mathematical structure
    relational model isn't nearly structured enough
  scientific apps don't need "no" Structured Query Language
    need a (much) "more" Structured Query Language: moSQL
37 Data model/moSQL requirements
  must capture common math structure of scientific data
    scalars, vectors, tensors
    topology and geometry
    fields
    algebra and calculus operations
  must describe how math entities are represented/stored
    decomposition into primitive types and operations
    decomposition for parallelism
  must maintain rigorous connection between high level semantics and low level implementation
  need a new data model
38 Outline (next section): The sheaf data model
39 Sheaf data model objects are discrete sheaves over finite distributive lattices math details: http://www.limitpoint.com/images/publications/the%20sheaf%20data%20model.pdf finite distributive lattice part space all distinct composite parts formed from set of basic parts discrete sheaf describes association of attributes with parts algebraic description of decomposition of abstract data types into tuples of primitive attributes
40 Visualizing a finite distributive lattice
  directed acyclic graph (Hasse diagram)
  two kinds of nodes
    composite parts
    basic parts
  links represent covers
    covers := immediately includes
    A covers B if and only if
      A includes B
      there is no C such that A includes C and C includes B
  draw graph so that if A covers B, B is lower on page
  [Figure: example Hasse diagram in which composite part A covers basic part B and basic part C]
41 Example: branched well
  [Figure: branched well (derrick floor, upper well, well junction, lower well, bore 1, bore 2) beside the Hasse diagram of its parts: well, upper well, lower well, bore 1, bore 2, df, junction]
  basic parts are independent objects
  composite parts are precisely the sum of their basic parts
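The covers relation that defines the links of a Hasse diagram can be computed directly from inclusion. The Python sketch below represents each part as the set of basic parts it contains, in line with the statement that composite parts are the sum of their basic parts. The particular choice of basic parts (upper well, bore 1, bore 2) is a simplifying assumption that leaves out the derrick floor and junction, and it lists only a few named parts rather than the full lattice. The function name `covers` is hypothetical.

```python
def covers(parts):
    """Return the set of pairs (A, B) such that A covers B.

    parts maps a part name to the frozenset of basic parts it contains.
    A covers B iff B is strictly included in A and no part C lies
    strictly between them.
    """
    links = set()
    for a, set_a in parts.items():
        for b, set_b in parts.items():
            if set_b < set_a and not any(
                set_b < set_c < set_a for set_c in parts.values()
            ):
                links.add((a, b))
    return links

# Simplified branched well; an illustrative assumption, not the slide's exact diagram.
well_parts = {
    "upper well": frozenset({"upper well"}),
    "bore 1": frozenset({"bore 1"}),
    "bore 2": frozenset({"bore 2"}),
    "lower well": frozenset({"bore 1", "bore 2"}),
    "well": frozenset({"upper well", "bore 1", "bore 2"}),
}

for upper, lower in sorted(covers(well_parts)):
    print(f"{upper} covers {lower}")
```

Here the well includes bore 1 but does not cover it, because the lower well lies strictly between them.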
42 Sheaf table metaphor data base is a set of tables each table represents a type each row an instance each column an attribute rows carry client-defined lattice order col lattice is row lattice of some other table schema are first class objects unified algebraic framework for all common scientific data types
43 Unified framework for scientific data types tabular types contains relational model as limiting case row lattice is a boolean lattice physical property types scalars, vectors, tensors object-oriented types with multiple inheritance col lattice is subobject inclusion hierarchy spatial types (meshes) any decomposition of space row lattice represents spatial inclusion field types any property, any mesh, any evaluation method col lattice = tensor(mesh row lattice, property col lattice) rigorous connection between abstract math types and numeric reps from high level specification to tuples of primitives
44 Open Source Implementation SheafSystem Community Edition C++ libraries with Java, Python, and C# bindings Field API field types pushers refiners Geometry API coordinate sections (invertible sections) point locators spatial types Fiber Bundle Data Model API physical property types tensors groups Jacobians Sheaf Data Model API section types sheaf storage agent HDF5 www.sheafsystem.org or github
45 Outline (next section): A query language for the sheaf data model
46 Query language for sheaf data model
  work in progress with Prof Magne Haveraaen, Bergen Language Design Laboratory, University of Bergen
  started with initial guess at operators
    extension of relational operators
    experience with implementation
  formalizing and refining definitions
  goal is moSQL
47 Acknowledgements
  Mark Verschuren, Shell, provided many useful comments and other input for this presentation
  Original research and development funded by subcontracts B347785, B515090, and B560973 of prime contract W-7405-ENG-48 with the Department of Energy National Nuclear Security Administration (DOE/NNSA)
  Ongoing development has been funded by Shell GameChanger and Shell TaCIT
48 END
49 References 1
  [Krebbers] Big Data & Analytics: Exploiting it, Johan Krebbers, VP Architecture, Shell. http://cdn.osisoft.com/corp/en/media/presentations/2013/UsersConference2013/PDF/UC2013_Shell_Krebbers_GlobalITArchitecture_1.pdf
  [KrisEnergy] http://www.krisenergy.com/company/aboutoil-and-gas/exploration/
  [epmag 1] http://www.epmag.com/exploration-geology-Geophysics/Three-D-Seismic-Advances-Improve-Exploration-Success_90469
  [decogeo] http://www.decogeo.com/upload/image/log1_bigl.jpg
50 References 2
  [epmag 2] http://www.epmag.com/item/das-enablessimultaneous-multiwell-vsp_121593
  [slb 1] http://www.slb.com/resources/case_studies/completions/~/media/images/completions/intelligent/wellwatcher_neon_tp_01tn.jpg
  [slb 2] System of subsurface faults and horizons in the Gullfaks oil field in the Norwegian sector of the North Sea. Data set courtesy of Schlumberger Limited.
  [geosoft] http://blogs.geosoft.com/exploringwithdata/2012/08/3dmodelling-with-velocity-volumes-in-gm-sys.html
  [pdgm 1] http://www.pdgm.com/getmedia/c72b49d9-571b-4fe8-ae3f-bfd00f862b0d/skua-salt-2010.jpg.aspx?width=1024&height=650&ext=.jpg
51 References 3
  [slb 3] http://www.software.slb.com/publishingimages/totalstress.jpg
  [dgi] http://www.dgi.com/images/cvslideshow/fullsize/coviz4d_slideshow_003.jpg
  [outernode] http://outernode.pir.sa.gov.au/ data/assets/image/0020/119009/Curnamona_3D.jpg
  [cda] http://www.oilandgasuk.co.uk/cmsfiles/custom/html/report-14.png
  [Codd] E. F. Codd. 1970. A relational model of data for large shared data banks. Commun. ACM 13, 6 (June 1970), 377-387. DOI=10.1145/362384.362685 http://doi.acm.org/10.1145/362384.362685