GMP Data Warehouse Data Management

Richard Hůlek, Jana Klánová, Ladislav Dušek, Jiří Jarkovský, Daniel Klimeš, Daniel Schwarz, Petr Holub, Jiří Kalina, Jakub Gregor, Jana Borůvková, Kateřina Šebková

Content

1 Introduction
2 Where to start
3 Data structures for data collection
3.1 Basic data structures
3.1.1 Site
3.1.2 Sampling attributes
3.1.3 Measurement
3.1.4 Detailed specification of data structures
3.2 Data hierarchy
3.3 Types of data
4 Data collection methods
4.1 Direct input of data into on-line forms
4.2 Data upload
4.3 Data transfer from a publicly available source
5 Flow of data collection process
5.1 Data flow and data collection process
5.2 Skip logic
6 System roles and user accounts
6.1 Organization structure
6.2 Organization hierarchy
6.3 System roles
6.4 Users and user accounts
7 Data management workflow
7.1 Data record
7.2 General data management workflow
8 Access rights and access scopes
8.1 Access rights
8.2 Access scopes
8.3 System roles and access scopes combination
9 Reporting services for users
9.1 Operational reports
9.2 Dynamic visualizations
9.3 Custom reporting services

1 Introduction

This document describes a data warehouse developed for the purposes of the Stockholm Convention's Global Monitoring Plan for Persistent Organic Pollutants, particularly for the second data collection campaign, which is to begin in 2014. The GMP Data Warehouse (hereafter referred to as GMP DWH) is a comprehensive data warehouse that includes an interactive on-line e-data capture system, data handling, and a presentation module for the Global Monitoring Plan on POPs (GMP). The module covers all 5 UN regions and enables insertion, processing, storage and presentation of all GMP data. GMP DWH is developed, maintained and hosted at Masaryk University, Brno, Czech Republic; nevertheless, it is accessible from anywhere via the Internet. A detailed description of the system architecture and ICT background is available in the document GMP Data Warehouse System Documentation and Architecture.

2 Where to start

All functions of the GMP DWH are available through a web-based user interface. The best place to start is the web portal www.pops-gmp.org. The user interface of the GMP DWH is embedded in this portal, together with all the necessary documentation, user guides and links to all modules of the data warehouse.

3 Data structures for data collection

According to the GMP Guidance document (UNEP/POPS/COP.6/INF/31, version 2013), data from four environmental matrices are collected:

- Ambient air
- Human blood
- Human milk
- Water

GMP DWH implements four individual data collection branches; each branch is designed for one environmental matrix. Data are collected, stored, managed and visualized independently in each branch; however, some data visualization tools of the GMP DWH can show data from several matrices combined (data coverage-based reports).

3.1 Basic data structures

All four data collection branches contain three common data structures.
The structures have the same foundation and significance in every branch, but they differ slightly because of the different nature of the individual matrices. The basic data structures are:

- Site
- Sampling attributes
- Measurement

3.1.1 Site

The Site represents a location of sampling. It can be a single point or an area. Each site has a unique ID and name. The position of each site is defined by geographic coordinates. In addition, there are other attributes describing, characterizing, and categorizing each site. These attributes differ for each matrix.

3.1.2 Sampling attributes

A sampling attributes item describes all samples that were aggregated annually and share common measurements. Several attributes (such as year or monitoring programme) are the same for all four matrix branches. Other attributes are specific to individual matrices (such as sampling method for air or blood source for human blood).

3.1.3 Measurement

A measurement represents aggregated concentrations of a particular chemical parameter, expressed as a set of summary statistics (minimum, maximum, mean, median, number of values) with additional description, such as the limit of quantification (LOQ) or the laboratory.

3.1.4 Detailed specification of data structures

A detailed description of the basic data structures and a complete list of their attributes are provided in the document GMP Data Warehouse System Documentation and Architecture, chapter 6 and its annexes.

3.2 Data hierarchy

Site, sampling attributes and measurement form a 3-level hierarchy, where the lower levels extend those above them (Figure 1).

Figure 1: GMP DWH data structures and data hierarchy

The site is at the top of the hierarchy; it can encompass multiple sampling attributes entries (i.e. one for each year of the sampling campaign); each sampling attributes entry can hold multiple measurements of specific chemical parameters. GMP DWH allows for storage of multiple sites, so the whole data hierarchy can be created repeatedly.
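
The three-level hierarchy can be pictured with a small sketch. The class and field names below are assumptions chosen for illustration, not the actual GMP DWH schema (which is specified in the System Documentation and Architecture document), and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Measurement:
    """Aggregated concentrations of one chemical parameter (hypothetical fields)."""
    parameter: str
    minimum: float
    maximum: float
    mean: float
    median: float
    n_values: int
    loq: Optional[float] = None
    laboratory: Optional[str] = None


@dataclass
class SamplingAttributes:
    """Middle level: one entry per site, for example one per sampling year."""
    year: int
    monitoring_programme: str
    measurements: List[Measurement] = field(default_factory=list)


@dataclass
class Site:
    """Top level: a sampling location identified by ID, name and coordinates."""
    site_id: str
    name: str
    latitude: float
    longitude: float
    sampling: List[SamplingAttributes] = field(default_factory=list)


def flatten(site: Site) -> Iterator[Dict[str, object]]:
    """Yield one flat row per measurement; site and sampling values are inherited."""
    for entry in site.sampling:
        for m in entry.measurements:
            yield {
                "site_id": site.site_id,
                "year": entry.year,
                "parameter": m.parameter,
                "median": m.median,
            }


# Invented example data: one site, one sampling year, one measurement.
site = Site("S1", "Example site", 49.2, 16.6)
entry = SamplingAttributes(2013, "Example programme")
entry.measurements.append(Measurement("PCB 28", 0.1, 0.9, 0.4, 0.3, 12))
site.sampling.append(entry)
rows = list(flatten(site))
```

The site ID and the year are stored once and reused for every measurement below them, so nothing entered at an upper level has to be entered again at a lower one.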

A multi-level hierarchy allows each data structure to extend the one at the preceding level. Shared information is recorded at the upper levels of the hierarchy, and the lower levels inherit it. This approach avoids redundancy and repeated data insertion and makes the whole process of data insertion easier.

3.3 Types of data

GMP DWH supports the insertion of both aggregated and primary data. Aggregated data must follow a predefined data format (a template is available for download in the GMP DWH or from the www.pops-gmp.org portal). Aggregation of data must comply with the methodology provided by the Guidance on the Global Monitoring Plan for Persistent Organic Pollutants. Primary data are handled by DWH managers, who validate and aggregate the data and import them into the GMP DWH for further processing.

4 Data collection methods

There are three ways to insert data into GMP DWH:

- Direct input of data into GMP DWH on-line forms (aggregated data only)
- Data upload through MS Excel spreadsheets (both aggregated and primary data)
- Transfer from a publicly available data source (both aggregated and primary data)

4.1 Direct input of data into on-line forms

GMP DWH provides a set of predefined forms in a standardized format with built-in validation services. Direct data input is possible only for aggregated data, and the relevant data provider is fully responsible for their accurate aggregation.

4.2 Data upload

Data can be uploaded directly into the GMP DWH in a single file. Data files must comply with the obligatory structure defined in the data templates. Once the data are uploaded, the DWH manager performs all steps necessary to process them, particularly validation and aggregation. Both aggregated and primary data can be uploaded. Subsequently, the imported data sets are assigned to the data provider who originally uploaded them.
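
As a rough illustration of the aggregation and validation steps mentioned above, the sketch below reduces a list of primary values to the five summary statistics of a measurement and runs a basic consistency check on an aggregated row. The function names and dictionary keys are hypothetical. The sketch deliberately does not implement the Guidance methodology (for instance, the treatment of values below the LOQ), which remains the authoritative reference.

```python
import statistics
from typing import Dict, List, Sequence


def aggregate(values: Sequence[float]) -> Dict[str, float]:
    """Reduce primary values to minimum, maximum, mean, median and count."""
    if not values:
        raise ValueError("at least one primary value is required")
    return {
        "minimum": min(values),
        "maximum": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "n_values": len(values),
    }


def check_aggregated(row: Dict[str, float]) -> List[str]:
    """Return a list of formal problems found in an aggregated row (empty if none)."""
    problems = []
    if row["n_values"] < 1:
        problems.append("n_values must be at least 1")
    if row["minimum"] > row["maximum"]:
        problems.append("minimum exceeds maximum")
    for key in ("mean", "median"):
        if not row["minimum"] <= row[key] <= row["maximum"]:
            problems.append(f"{key} lies outside [minimum, maximum]")
    return problems


summary = aggregate([1.0, 2.0, 4.0, 5.0])
```

A check of this kind only catches formal inconsistencies; it cannot tell whether the aggregation itself followed the prescribed methodology.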

4.3 Data transfer from a publicly available source

Data from publicly available sources can be transferred into the GMP DWH. This transfer is performed by the DWH manager. A user who is a member of the Regional Organization Group requests the transfer and provides a URL and a description of the data to be transferred. After the data are processed and transferred into the GMP DWH, the imported data sets are assigned to the user who requested the transfer.

5 Flow of data collection process

Each record at any hierarchy level can be handled individually. The GMP DWH offers the following functions for each matrix branch:

- Creation of a new site
- Search for an existing site
- Display of a list of sites
- Editing of an existing site
- Deletion of an existing site
- Creation of a new entry of sampling attributes
- Search for existing entries of sampling attributes
- Display of a list of entries of sampling attributes for a specific site
- Editing of an entry of sampling attributes
- Deletion of an entry of sampling attributes
- Creation of a new entry of measurement
- Display of a list of entries of measurements for specific sampling attributes
- Editing of an entry of measurement
- Deletion of an entry of measurement

5.1 Data flow and data collection process

Data flows for each matrix are shown in Figures 2, 3, 4 and 5. The flow charts show the differences in site, sampling attributes and measurement attributes between individual matrix branches, and the functions available for each data structure.

Figure 2: Flow of data collection process (air)

Figure 3: Flow of data collection process (human milk)

Figure 4: Flow of data collection process (human blood)

Figure 5: Flow of data collection process (water)

5.2 Skip logic

In some cases skip logic has been introduced. It allows only those combinations of values from multiple code lists that are relevant for a particular data flow. All supported combinations of attributes and values are stored in tables; the skip logic system activates and deactivates the relevant input fields accordingly.
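
A table-driven skip logic of this kind can be sketched as follows. The field names and code-list values are invented for illustration and are not taken from the GMP DWH code lists; the only idea borrowed from the text is that a table of supported combinations decides which input fields are active.

```python
from typing import Dict, List, Optional, Set

# Hypothetical table of supported combinations; None means "field not applicable".
SUPPORTED: List[Dict[str, Optional[str]]] = [
    {"sampling_method": "passive", "sampler_type": "PUF disk", "flow_rate_unit": None},
    {"sampling_method": "passive", "sampler_type": "XAD resin", "flow_rate_unit": None},
    {"sampling_method": "active", "sampler_type": "high volume", "flow_rate_unit": "m3/h"},
]


def allowed_values(field_name: str, selection: Dict[str, str]) -> Set[Optional[str]]:
    """Values of field_name that occur in combinations matching the current selection."""
    values: Set[Optional[str]] = set()
    for combo in SUPPORTED:
        if all(combo[k] == v for k, v in selection.items() if k != field_name):
            values.add(combo[field_name])
    return values


def is_active(field_name: str, selection: Dict[str, str]) -> bool:
    """A field is active only if at least one real (non-None) value is still possible."""
    return any(v is not None for v in allowed_values(field_name, selection))


passive_types = allowed_values("sampler_type", {"sampling_method": "passive"})
flow_active_for_passive = is_active("flow_rate_unit", {"sampling_method": "passive"})
flow_active_for_active = is_active("flow_rate_unit", {"sampling_method": "active"})
```

Keeping the combinations in a table rather than in form code means a new supported combination is added by adding a row.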

6 System roles and user accounts

The whole process of GMP data management is implemented in the GMP DWH as one comprehensive logical processing workflow.

6.1 Organization structure

This section describes the interactions of all relevant actors who work with data before they become publicly available through the GMP DWH portal. The following actors have been identified:

- GMP Global Coordination Group (GCG) members form the top-level group, with a right to view all collected and approved data and use them for the global monitoring report;
- Regional Organization Group (ROG) is responsible for organizing individual Data Providers in its region. It coordinates all data management activities, in particular approval of data to be published in the regional monitoring report;
- Data Provider is an institution responsible for the physical insertion of data into the GMP DWH;
- Data Warehouse Manager provides capacity for the formal validation of collected data and supports primary data import into the GMP DWH;
- Data Warehouse Helpdesk provides support over hotline and email;
- Secretariat of the Stockholm Convention helps to coordinate all data collection activities.

The flow chart in Figure 6 shows the interactions between the individual actors involved in GMP data collection and data handling.

Figure 6: Organization structure

6.2 Organization hierarchy

Actors involved in GMP data collection form a multi-level hierarchy. This structure is further used to define the range of access rights and the rights to browse and view inserted data. The top level consists of the individual ROGs; the next lower level consists of the Data Providers collecting data for individual ROGs. This is shown in Figure 7.

Figure 7: Organization hierarchy

6.3 System roles

A system role represents a set of user rights that authorize their owner to use certain system functions. The roles are derived from the planned workflow of the GMP data collection process. They are implemented in the GMP DWH and assigned to individual users as follows:

- GCG Member can see all approved data from all regions;
- ROG Head can approve data for his/her region;
- ROG Member can approve or reject data. ROG members work with data of institutions which provide data to or operate in the region of the particular ROG;
- DWH Manager performs data validation;
- Data Manager can insert data and check data inserted by a particular Data Provider;
- Data Manager Assistant can insert data into the GMP DWH system;
- DWH Helpdesk helps to resolve all issues that might occur and handles all user accounts;
- Stockholm Convention Secretariat Staff have access to the system to see data approved by individual ROG heads.

Figure 8: User accounts assigned to system roles

6.4 Users and user accounts

Every user authorized to work with GMP DWH is given a unique user account. The account is linked with the user's institution(s) and defined by the role(s) the user holds. Note that one user can belong to multiple institutions and have multiple roles, as illustrated in Figure 8. The DWH helpdesk is responsible for creating user accounts for all GMP DWH users.

7 Data management workflow

7.1 Data record

Because of the large amount of data in the GMP DWH, it would not be practical to handle every single record describing a chemical compound measurement separately; handling records one by one would overload everyone involved in the data management process. Therefore, the data record is the elementary unit of the data management workflow. Each data record is defined as a group of two data structures: the sampling attributes and all corresponding measurements (see Figure 9 for more details). A change of workflow state is always performed at the level of the whole data record.
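
The grouping into data records might look like the following minimal sketch. It assumes, purely for illustration, that each measurement row carries a `sampling_id` key pointing at its sampling attributes entry; the key name, the identifiers and the initial state label are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_data_records(rows: List[dict]) -> Dict[str, dict]:
    """Group measurement rows by their sampling attributes entry.

    Each resulting data record carries a single workflow state shared by
    all of its measurements.
    """
    records: Dict[str, dict] = defaultdict(
        lambda: {"state": "inserted", "measurements": []}
    )
    for row in rows:
        records[row["sampling_id"]]["measurements"].append(row["parameter"])
    return dict(records)


def set_state(record: dict, new_state: str) -> None:
    """Change the workflow state of the whole data record at once."""
    record["state"] = new_state


# Invented example: three measurements belonging to two sampling attributes entries.
rows = [
    {"sampling_id": "S1-2013", "parameter": "PCB 28"},
    {"sampling_id": "S1-2013", "parameter": "PCB 52"},
    {"sampling_id": "S1-2014", "parameter": "PCB 28"},
]
records = build_data_records(rows)
set_state(records["S1-2013"], "completed")
```

Here three measurement rows collapse into two data records, so a reviewer makes two workflow decisions rather than three.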

Figure 9: Data record

7.2 General data management workflow

All data records inserted into the GMP DWH are handled and managed according to a built-in data management workflow:

1. Data records are inserted and completed by the data manager assistant.
2. The data manager supervises data records from his/her institution (data provider).
3. The DWH manager validates data records from all data providers within each region, in cooperation with the individual data managers.
4. ROG members approve or reject data records within their region.
5. The ROG head communicates the final approval of data records to GMP DWH and can make data records within his/her region available (internally).
6. GCG members can see all approved data records and use them for the global monitoring report.

Figure 10 depicts the individual steps of the data management workflow of GMP DWH.

Figure 10: General data management workflow

The data management workflow is internally implemented as a set of states. Each data record is assigned to exactly one workflow state at any given time. Each workflow state can be set only by a limited number of system roles; for example, only a data manager can set data records as supervised, and only a ROG member can mark data records as approved. In addition, there are other states serving as transitions from one state to another. The complete list of data management workflow states is as follows:

- Inserted: data records are inserted in the GMP DWH by the data manager assistant. The work is in progress and the data records are considered pending.
- Completed: once the data manager assistant finishes insertion of data records into the system, he/she marks these data records as completed.
- Accepted for supervision: when the data manager moves data records into the accepted for supervision state, he/she reserves the data for him/herself so that nobody else can modify their contents and state.
- Rejected from supervision: the data manager returns data records to the data manager assistant by marking them as rejected from supervision. This might happen when errors are found.
- Supervised: the data manager marks data as supervised when they are correct and thereby guarantees the data provided by his/her institution (data provider).
- Accepted for validation: the DWH manager reserves data records for him/herself to be able to validate the data.
- Rejected from validation: the DWH manager returns the data record to the data manager when formal errors occur.
- Validated: once there are no formal issues in the data record, the DWH manager marks the data as validated and submits them to the ROGs.
- Accepted for approval: a ROG member reserves the data record for him/herself by marking it as accepted for approval.
- Rejected from approval: if there is a need to return the data record to lower levels of the workflow, ROG members mark it as rejected from approval.
- Rejected: if it is not suitable to include a particular data record in the GMP report, ROG members mark the data record as rejected. This is not a return-back type of state; such a data record is not given back to the DWH manager or data manager but is definitively excluded from use in a GMP monitoring report.
- Approved: data records that are suitable for publication in the GMP regional report are marked as approved by ROG members.
- Approval communicated: the ROG head thereby communicates approval of data records so that they can be further used by the GCG.
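
The states listed above lend themselves to a small state-machine sketch. The role assigned to each state follows the descriptions in this section. The forward transitions follow the order of the list, but the targets of the three return-back states are not spelled out in the text, so those entries are assumptions and are marked as such in the code. None of this is the actual GMP DWH implementation.

```python
from typing import Dict, Tuple

# Role allowed to set each state, as described in the list of workflow states.
ROLE_FOR_STATE: Dict[str, str] = {
    "inserted": "data manager assistant",
    "completed": "data manager assistant",
    "accepted for supervision": "data manager",
    "rejected from supervision": "data manager",
    "supervised": "data manager",
    "accepted for validation": "DWH manager",
    "rejected from validation": "DWH manager",
    "validated": "DWH manager",
    "accepted for approval": "ROG member",
    "rejected from approval": "ROG member",
    "rejected": "ROG member",
    "approved": "ROG member",
    "approval communicated": "ROG head",
}

# Allowed next states. Entries marked ASSUMPTION are guesses about where a
# returned record re-enters the workflow; the document does not specify them.
TRANSITIONS: Dict[str, Tuple[str, ...]] = {
    "inserted": ("completed",),
    "completed": ("accepted for supervision",),
    "accepted for supervision": ("rejected from supervision", "supervised"),
    "rejected from supervision": ("completed",),  # ASSUMPTION
    "supervised": ("accepted for validation",),
    "accepted for validation": ("rejected from validation", "validated"),
    "rejected from validation": ("accepted for supervision",),  # ASSUMPTION
    "validated": ("accepted for approval",),
    "accepted for approval": ("rejected from approval", "rejected", "approved"),
    "rejected from approval": ("accepted for validation",),  # ASSUMPTION
    "rejected": (),  # final: excluded from the report
    "approved": ("approval communicated",),
    "approval communicated": (),
}


class WorkflowError(Exception):
    """Raised when a requested state change is not permitted."""


def move(current: str, target: str, role: str) -> str:
    """Return the new state if the transition and the acting role are both allowed."""
    if target not in TRANSITIONS.get(current, ()):
        raise WorkflowError(f"cannot go from '{current}' to '{target}'")
    if ROLE_FOR_STATE[target] != role:
        raise WorkflowError(f"role '{role}' may not set state '{target}'")
    return target


state = move("inserted", "completed", "data manager assistant")
state = move(state, "accepted for supervision", "data manager")
```

Separating the transition table from the role table keeps the two rules independent: one says which moves exist, the other says who may make them.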

Figure 11: Data Management Workflow states assigned to system roles

Data management workflow state   Assigned system role     Type
Inserted                         Data Manager Assistant   Pending state
Completed                        Data Manager Assistant   Decision state
Accepted for supervision         Data Manager             Pending state
Rejected from supervision        Data Manager             Return back state
Supervised                       Data Manager             Decision state
Accepted for validation          DWH Manager              Pending state
Rejected from validation         DWH Manager              Return back state
Validated                        DWH Manager              Decision state
Accepted for approval            ROG Member               Pending state
Rejected from approval           ROG Member               Return back state
Rejected                         ROG Member               Decision state
Approved                         ROG Member               Decision state
Approval communicated            ROG Head                 Decision state

8 Access rights and access scopes

8.1 Access rights

User rights in the GMP DWH control the operations that individual users can perform on data records. Access rights are mostly applied to the data management workflow, to limit the workflow states available to a user. User rights are assigned to every user of the GMP DWH according to his/her roles.

8.2 Access scopes

Another way to distinguish between users and their roles and rights is to limit the access scope of an individual user to the data stored in the GMP DWH. The GMP DWH has built-in functions that narrow the data available to each user to those of his/her institution(s) only. The institutional hierarchy is used to define access scopes:

- Data manager assistant and data manager can work only with data records which were inserted by the data provider institution they belong to.
- DWH manager of a particular ROG can work only with data records which were inserted into the GMP DWH by data providers in the institution hierarchy pertaining to that ROG.
- ROG member and ROG head can work only with data records which were inserted into the GMP DWH by data providers in the institution hierarchy of that ROG.
- GCG member can work with any approved data records.

8.3 System roles and access scopes combination

Furthermore, the system role of an individual user limits access to data records beyond the access scope determined by the institutional hierarchy. The GMP DWH delivers to each user only those data records which are in a workflow state the user has been granted access to, or in any state which follows the states the user is allowed to work with.
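
The combination of institutional scope and workflow state can be sketched as a simple filter. The sketch covers only the ROG member role, whose first visible state (validated) is the one named in this section's own example; the institution names, the provider-to-ROG mapping and the record layout are hypothetical.

```python
from typing import Dict, List

# Workflow states in the order in which they are listed in section 7.2.
WORKFLOW_ORDER: List[str] = [
    "inserted", "completed",
    "accepted for supervision", "rejected from supervision", "supervised",
    "accepted for validation", "rejected from validation", "validated",
    "accepted for approval", "rejected from approval", "rejected",
    "approved", "approval communicated",
]

# Hypothetical institution hierarchy: data provider -> ROG it reports to.
PROVIDER_TO_ROG: Dict[str, str] = {
    "Institute A": "ROG 1",
    "Institute B": "ROG 1",
    "Institute C": "ROG 2",
}


def visible_to_rog_member(records: List[dict], rog: str) -> List[dict]:
    """Keep records from providers of the given ROG that are validated or later."""
    first_visible = WORKFLOW_ORDER.index("validated")
    return [
        r for r in records
        if PROVIDER_TO_ROG.get(r["provider"]) == rog
        and WORKFLOW_ORDER.index(r["state"]) >= first_visible
    ]


# Invented example records.
records = [
    {"id": 1, "provider": "Institute A", "state": "supervised"},
    {"id": 2, "provider": "Institute A", "state": "validated"},
    {"id": 3, "provider": "Institute B", "state": "approved"},
    {"id": 4, "provider": "Institute C", "state": "approved"},
]
visible_ids = [r["id"] for r in visible_to_rog_member(records, "ROG 1")]
```

Record 1 is filtered out by its workflow state and record 4 by the institution hierarchy, which shows the two restrictions acting independently.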

Example: a ROG member can access (see) data records which were inserted into the GMP DWH by data providers in the institution hierarchy pertaining to that ROG, provided these data records are in one of the following workflow states: validated, accepted for approval, rejected from approval, rejected, approved or approval communicated.

9 Reporting services for users

Reporting services based on access rights and access scopes are available. Built-in functions and specialized data visualization applications give users a direct and dynamic overview of the currently collected data while respecting all principles of data ownership, access rights and access scopes. All reporting services can be used to track the progress of data collection. In addition, reporting services can support the creation of the final GMP report by a particular ROG.

9.1 Operational reports

Reporting functions are directly incorporated into the GMP DWH data collection module. Data records can be searched by parametric search or browsed via overview tables. Summary rows for each data record in an overview table contain, among other things, information on validation. Operational reports can be used for tracking changes and finding incomplete entries.

9.2 Dynamic visualizations

Dynamic visualizations are connected directly to the GMP DWH data layer and provide real-time feedback on data collection progress in a more suitable form. Tools for visualization are described in detail in the document "GMP Data Warehouse System Documentation and Architecture", chapter 7, and in the document "Data analysis and reporting in GMP DWH", chapter 4.

9.3 Custom reporting services

Any additional data report can be requested via the DWH helpdesk.