MANAGING SCIENTIFIC DATA WITH NDN
  Chengyu Fan, Susmit Shannigrahi, Steve DiBenedetto, Catherine Olschanowsky, Christos Papadopoulos
  NDNcomm 2015, Sept 28, 2015, Los Angeles, CA
  Supported by NSF #13410999 and NSF #1345236
Introduction
  Scientific data is often very large and complex
    Climate - CMIP5: 3.5 PB, CMIP6: 350 PB-3 EB
    Physics - ATLAS: 4 PB/year
    Astronomy, bioinformatics, others
  Science infrastructure
    Cutting-edge hardware but often incompatible domain software (ESGF, xrootd, etc.)
    Complexity, replication, redundancy
Our Project
  Build and deploy software to evaluate NDN in scientific applications over a dedicated hardware infrastructure
  Evaluate NDN in the context of:
    Application services: publishing, discovery, retrieval, access control, load balancing, failover, caching, etc.
    Network integration (OSCARS, SDN, etc.)
  Metrics
    Performance, reduced complexity, ease of deployment, interoperability, reuse, efficiency, routing, security/trust, etc.
NDN Layer Structure
  [Figure: protocol stacks of two hosts and two routers. Each host runs APP over NDN, each router runs NDN, and NDN runs over a LINK layer that can be ETH, UDP/IP, or other.]
Methodology
  Investigate the use of NDN as a common platform for scientific data applications by:
    Understanding data management challenges of various scientific domains
    Developing and evaluating prototype applications that leverage NDN's features
    Using prototypes to further drive NDN research
First Step: Build a Catalog
  Create a shared resource: a distributed, synchronized catalog of names over NDN
    Provide common operations such as publishing, discovery, access control
    Catalog only deals with name management, not dataset retrieval
  Platform for further research and experimentation
  Research questions:
    Namespace construction, distributed publishing, key management, UI design, failover, etc.
    Functional services such as subsetting
    Mapping of name-based routing to tunneling services (VPN, OSCARS, MPLS)
Overview of Catalog Workflow
  [Figure: catalog nodes 1, 2, and 3, a publisher, a consumer, and data storage, all connected over NDN.]
  (1) Publish dataset names
  (2) Sync changes
  (3) Query for dataset names
  (4) Retrieve data
NDN-Science Testbed
  NSF CC-NIE campus infrastructure award
  10G testbed (courtesy of ESnet, UCAR, and CSU Research LAN)
  Currently ~50 TB of CMIP5, ~70 TB of HEP data
Demos
  Search
  Publication and Sync
  Access control
  Retrieval and failover
Conclusions
  IP encourages common host access, not common data access methods
    Does not encourage interoperability at the application level
  NDN has the potential to unify the service interface required by scientific applications
  Science testbed and prototypes to test hypotheses and drive research and experimentation
  Ready-to-try catalog; we invite you to try it with your data
    Catalog is general, supports a variety of applications
    Currently CMIP5 and HEP applications
    UI for data search and retrieval
Our sponsors: NSF and ESnet
  Join us @ http://www.netsec.colostate.edu/mailman/listinfo/ndn-sci
Backup Slides
Current Example: xrootd
  [Figure: a client; a manager (a.k.a. redirector) running xrootd and cmsd; data servers A, B, and C, each running xrootd and cmsd, with /my/file shown at A and C. Visible step label: "4: Try open() at A".]
  Fragile, fairly complex middleware
xrootd under NDN
  [Figure: the client asks the NDN network for /my/file; the network connects to data servers A, B, and C, each running xrootd and cmsd, with /my/file shown at A and C. No manager node appears.]
  Significantly reduced system complexity
  Better service abstraction
Data Publication
  [Figure: message exchange between Catalog and Publisher.]
  1) Catalog: listening on /<catalog-prefix>/publish
  2) Publisher: generate NDN names for datasets/services
  3) Publisher: request publish
  4) Catalog: fetch published name list
  5) Catalog: authenticate the Data and validate data name against trust model
  6) Catalog: share names with other catalogs
Keys for ndn-atmos
  Root key
    /cmip5/key (self-signed; signs the site keys)
  Site's keys
    /cmip5/lbl/key
    /cmip5/nwsc/key
  Application's keys
    /cmip5/lbl/<datapublisher>/key (dataset names publishing)
    /cmip5/nwsc/<operator>/key (NLSR)
    /cmip5/nwsc/<router>/key (NLSR)
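A minimal sketch of how a verifier might walk such a key hierarchy from an application key up to the self-signed root. It models only the naming rule (a signer's namespace must be a prefix of the namespace of the key it signs) and performs no cryptography. The names "alice", "op1", and "router1" are hypothetical stand-ins for <datapublisher>, <operator>, and <router>, and having the site key sign both the operator and router keys is an assumption, not something the slide states.

```python
# Signer table modelled on the slide. "alice", "op1" and "router1" are
# hypothetical; which key signs which application key is an assumption.
SIGNED_BY = {
    "/cmip5/key": "/cmip5/key",  # self-signed root
    "/cmip5/lbl/key": "/cmip5/key",
    "/cmip5/nwsc/key": "/cmip5/key",
    "/cmip5/lbl/alice/key": "/cmip5/lbl/key",
    "/cmip5/nwsc/op1/key": "/cmip5/nwsc/key",
    "/cmip5/nwsc/router1/key": "/cmip5/nwsc/key",
}


def components(name):
    """Split a slash-separated name into its components."""
    return tuple(part for part in name.split("/") if part)


def key_namespace(key_name):
    """Return the namespace a key speaks for: its name minus the final 'key'."""
    parts = components(key_name)
    if not parts or parts[-1] != "key":
        raise ValueError(f"not a key name: {key_name}")
    return parts[:-1]


def chain_to_root(key_name, signed_by, max_depth=8):
    """Follow signer links until a self-signed key is reached.

    Raises ValueError if a signer is unknown, if a signer's namespace is not
    a prefix of the signed key's namespace, or if the chain is too long.
    """
    chain = [key_name]
    current = key_name
    for _ in range(max_depth):
        signer = signed_by.get(current)
        if signer is None:
            raise ValueError(f"no signer recorded for {current}")
        if signer == current:
            return chain  # reached the self-signed root
        signer_ns = key_namespace(signer)
        if key_namespace(current)[: len(signer_ns)] != signer_ns:
            raise ValueError(f"{signer} may not sign {current}")
        chain.append(signer)
        current = signer
    raise ValueError("signing chain too long")


if __name__ == "__main__":
    print(chain_to_root("/cmip5/lbl/alice/key", SIGNED_BY))
```

A real deployment would also verify each signature along the chain; the table lookup here only stands in for that step.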
Trust Model
  Only namespace owners are allowed to publish data
  Data provenance built into the data packet
  Valid publish message
    Content name: /PublisherA/publish
    Signature: Publisher A's signature
    Data payload:
      - /PublisherA/publish/file/1
      - /PublisherA/publish/file/2
      + /PublisherA/publish/file/3
      + /PublisherA/publish/file/4
  Invalid publish message
    Content name: /PublisherA/publish
    Signature: Publisher A's signature
    Data payload:
      - /PublisherB/publish/file
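The namespace rule on this slide can be sketched as a small check: every name in a publish message's payload must fall under the message's own content name. This is an illustrative sketch, not the catalog's actual validator. The function names are hypothetical, "+" and "-" are read as add and remove, and signature verification is assumed to have succeeded before this check runs.

```python
def components(name):
    """Split a slash-separated name into its components."""
    return tuple(part for part in name.split("/") if part)


def is_under(prefix, name):
    """True if `name` lies in the namespace rooted at `prefix` (component-wise)."""
    p = components(prefix)
    return components(name)[: len(p)] == p


def validate_publish(content_name, payload):
    """Check a publish message against the namespace-ownership rule.

    `payload` is a list of (op, name) pairs, where op is "+" (add) or
    "-" (remove). Returns (ok, rejected_names).
    """
    rejected = [
        name
        for op, name in payload
        if op not in ("+", "-") or not is_under(content_name, name)
    ]
    return (not rejected, rejected)


if __name__ == "__main__":
    valid = [
        ("-", "/PublisherA/publish/file/1"),
        ("-", "/PublisherA/publish/file/2"),
        ("+", "/PublisherA/publish/file/3"),
        ("+", "/PublisherA/publish/file/4"),
    ]
    invalid = [("-", "/PublisherB/publish/file")]
    print(validate_publish("/PublisherA/publish", valid))
    print(validate_publish("/PublisherA/publish", invalid))
```

Comparing whole components rather than raw strings keeps a name such as /PublisherAB/... from passing as part of /PublisherA.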
Name Discovery
  [Figure: message exchange between Catalog and Consumer.]
  1) Catalog: listening on /<catalog-prefix>/query
  2) Consumer: query with parameters (model=cmip5 AND frequency=6hr)
  3) Catalog: query local DB; packetize results under /<catalog-prefix>/queryresults/<params>; send ACK
  4) Consumer: fetch query results (name list)
  5) Consumer: fetch desired dataset(s) or re-query
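The catalog side of steps 2 to 4 might look roughly like the sketch below: match records against the query parameters, then split the matching names into segments named under the query-results prefix. The "/catalog" prefix, the record fields, the way parameters are encoded into one name component, and the segment size are all assumptions made for illustration. The slides do not specify them.

```python
CATALOG_PREFIX = "/catalog"  # hypothetical stand-in for /<catalog-prefix>

# Hypothetical catalog records: a dataset name plus searchable fields.
RECORDS = [
    {"name": "/CMIP5/output/BCC/6hr/1998", "model": "cmip5", "frequency": "6hr"},
    {"name": "/CMIP5/output/BCC/mon/1998", "model": "cmip5", "frequency": "mon"},
    {"name": "/CMIP5/output1/VA/6hr/2016", "model": "cmip5", "frequency": "6hr"},
]


def query(records, params):
    """Return the sorted names of records matching every field=value pair (AND)."""
    return sorted(
        rec["name"]
        for rec in records
        if all(rec.get(field) == value for field, value in params.items())
    )


def params_component(params):
    """Encode query parameters as one name component (assumed encoding)."""
    return "&".join(f"{field}={value}" for field, value in sorted(params.items()))


def packetize(prefix, params, names, per_segment=2):
    """Split a name list into numbered segments under <prefix>/queryresults/<params>."""
    base = f"{prefix}/queryresults/{params_component(params)}"
    return {
        f"{base}/{seg}": names[start : start + per_segment]
        for seg, start in enumerate(range(0, len(names), per_segment))
    }


if __name__ == "__main__":
    wanted = {"model": "cmip5", "frequency": "6hr"}
    for seg_name, seg_names in packetize(CATALOG_PREFIX, wanted, query(RECORDS, wanted)).items():
        print(seg_name, seg_names)
```

Sorting the parameters before encoding them means the same query always maps to the same result name, whatever order the consumer lists its parameters in.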
Data Publication
  Catalog
    Accept publish requests: /<catalog-prefix>/publish
    Authenticate and retrieve data names from publisher
    Sync names with other catalogs
  Publisher
    Generate NDN names for datasets/services
    Inform catalog of names to add/remove
  [Figure: exchange between Catalog and Publisher: request publish; fetch published name list; validate data name against trust model; share names with other catalogs.]
Name Discovery
  Catalog
    Accept queries on /<catalog-prefix>/query
    Query local DB
    Packetize the returned names under /<catalog-prefix>/queryresults/<params>
  User
    Query catalog for names with specified components, e.g.: model=cmip5 AND frequency=6hr
    Fetch generated name list
    Fetch desired dataset(s) or re-query
  [Figure: exchange between Catalog and Consumer: query with parameters; query local DB and packetize results; ACK; fetch query results; fetch data with standard NDN.]
Name Discovery Optimization
  Avoid maintaining state between user and catalog
  Enables graceful failover
  Catalog
    Accept queries on /<catalog-prefix>/queryparams
    Query local DB
    Packetize the returned names under /<catalog-prefix>/queryparams/seg#
    In case of failure, queries get redirected to another catalog
  Consumers
    Can query any catalog instance
    Can transparently fail over to another catalog
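A rough sketch of why the stateless design allows failover: if a result segment depends only on the query parameters and the segment number, any synchronized catalog instance can serve it. The class and function names below are hypothetical, and the explicit loop over instances only stands in for the network redirecting a request to another catalog. In NDN the consumer would not pick instances itself.

```python
class Catalog:
    """A catalog instance holding a synchronized copy of the name list."""

    def __init__(self, label, names, up=True):
        self.label = label
        self.names = list(names)
        self.up = up

    def answer(self, wanted, seg, per_segment=2):
        """Return segment `seg` of the names containing all `wanted` components.

        The answer depends only on (wanted, seg), never on earlier requests.
        """
        if not self.up:
            raise ConnectionError(f"{self.label} is unreachable")
        matches = sorted(
            n for n in self.names if set(wanted) <= set(n.strip("/").split("/"))
        )
        return matches[seg * per_segment : (seg + 1) * per_segment]


def fetch(catalogs, wanted, seg):
    """Try each catalog in turn; return (label, segment) from the first that answers."""
    for catalog in catalogs:
        try:
            return catalog.label, catalog.answer(wanted, seg)
        except ConnectionError:
            continue
    raise ConnectionError("no catalog instance reachable")


if __name__ == "__main__":
    names = [
        "/CMIP5/output/BCC/6hr/1998",
        "/CMIP5/output/BCC/6hr/1999",
        "/CMIP5/output/BCC/mon/1998",
        "/CMIP5/output1/VA/6hr/2016",
    ]
    node1, node2 = Catalog("node1", names), Catalog("node2", names)
    print(fetch([node1, node2], ("CMIP5", "6hr"), 0))
    node1.up = False  # node1 fails between segments
    print(fetch([node1, node2], ("CMIP5", "6hr"), 1))
```

Because no session state lives on node1, the segment fetched from node2 after the failure continues exactly where the first segment left off.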
Simplified xrootd Under NDN
  NDN integrates discovery, failover, retrieval
  Provides a better abstraction to the applications
  [Figure: the client asks the NDN network for /my/file; the network connects to data servers A, B, and C, each running xrootd and cmsd, with /my/file shown at A and C.]
Name Discovery Challenges
  Users may need to discover content/services without knowing the full NDN name prefix structure
    NDN names are contiguous prefixes
    Users may only know a few disjoint name components (e.g. frequency=6hr)
    But cannot use wildcards for name discovery
  Example: user wants /CMIP5/output1/VA/6hr/2016
    Consumer asks for /CMIP5
    NDN returns /CMIP5/output/BCC/6hr/1998
    Consumer asks for /CMIP5/output/BCC/6hr (exclude 1998)
    ...
  May take too many requests to find desired data or service
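The cost of prefix-only discovery can be illustrated with a toy simulation that counts requests. The function `first_match` is a hypothetical stand-in for the network, returning one name under a prefix that the consumer has not excluded. The consumer keeps asking and excluding until it sees a name containing the components it knows. This simplifies the slide's example, which also narrows the prefix between requests, and the published names are made up for illustration.

```python
def components(name):
    """Split a slash-separated name into its components."""
    return tuple(part for part in name.split("/") if part)


def first_match(published, prefix, excluded):
    """Stand-in for the network: one name under `prefix` not in `excluded`, or None."""
    p = components(prefix)
    for name in sorted(published):
        if components(name)[: len(p)] == p and name not in excluded:
            return name
    return None


def discover_by_prefix(published, prefix, wanted):
    """Ask, inspect, exclude, repeat. Returns (found_name_or_None, request_count)."""
    excluded, requests = set(), 0
    while True:
        requests += 1
        name = first_match(published, prefix, excluded)
        if name is None:
            return None, requests
        if set(wanted) <= set(components(name)):
            return name, requests
        excluded.add(name)  # not what the user wants; exclude it and ask again


if __name__ == "__main__":
    published = [
        "/CMIP5/output/BCC/6hr/1998",
        "/CMIP5/output/BCC/6hr/1999",
        "/CMIP5/output/BCC/mon/1998",
        "/CMIP5/output1/VA/6hr/2016",
    ]
    print(discover_by_prefix(published, "/CMIP5", ("6hr", "2016")))
```

With only four published names the consumer already needs four requests. A catalog query on the known components would return the matching name list in a single exchange.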
NDN Support for Big Science
  NDN names separate data from hosts
    Discovery: names directly translate to network queries
    Failover: network can get verifiable data from anywhere
    Retrieval: data can be fetched from optimal source(s)
  Investigate the use of NDN as a platform for scientific data applications
    Understand data management challenges of various scientific domains
    Develop prototype applications to leverage NDN's built-in features
    Use these applications as case studies to drive NDN research aspects
Summary
  NDN improves scientific data management at scale
    Apps benefit from transparent multipath, automatic failover, etc.
    Built-in security provides publisher provenance
  Names are the common building block for content and services
    Names are flexible: can refer to static content or dynamic services
  Catalog supports efficient publication, non-contiguous name discovery
    Users can discover content and services with minimal a priori knowledge
    Catalog validates publication requests for authorization
Managing Scientific Data with NDN
  Distributed, synchronized catalog of names and services
    Common functionality: publishing, discovery, access control, etc.
    Search and retrieval UI
  Platform for further research and experimentation
  Research questions:
    Namespace construction, distributed publishing, key management, UI design, failover, etc.
    Functional services such as subsetting
    Mapping of name-based routing to tunneling services (VPN, OSCARS, MPLS)
  Science testbed
    10G testbed (courtesy of ESnet, UCAR, and CSU Research LAN)
    Nodes strategically located near scientific data (climate + HEP)
    CC-NIE NSF award
Managing Scientific Data with NDN
  Name-based Internet architecture
    Name the data, not the host
    All data digitally signed
    Unifies and pushes common functionality to the network: publishing, discovery, access control, etc.
  Data-intensive applications
    Automatic pervasive in-network caching, parallel retrieval, automatic failover and more
    Simpler alternative middleware implementation, e.g., ESGF, xrootd
  Science testbed
    10G testbed (courtesy of ESnet, UCAR, and CSU Research LAN)
    CMIP5 and HEP data
    CC-NIE NSF award