A Multidimensional Design for the Genesis Data Set


Ray Hylock
Management Sciences Department
University of Iowa
Iowa City, IA, USA

ABSTRACT

Relational databases are designed to store and process moving data. A current runs through these systems, continuously inserting, updating, and deleting record after record, never ceasing. Modern database management systems such as Oracle and Microsoft SQL Server are designed to control this chaotic environment. However, the very data models that maintain the robustness of these systems hinder speed during data analysis. That is, the goal of database models such as the Entity Relationship (ER) model is to normalize the data: remove redundancy and separate data themes in order to avoid insert, update, and delete anomalies. Data warehouses, on the other hand, do not suffer from these anomalies. Therefore, the goal of the multidimensional models used by data warehouses is to denormalize the data as much as possible in order to increase the speed of analytical queries. This paper is an extension of the work done by Lu et al. The goal of [1] is to create a model to manage large volumes of clinical data for decision support and quality control at the point of patient care. One component of this is a data warehouse. This paper represents the first stage in converting the existing relational design created in [1] to a multidimensional one, suitable for the needs of data analysis. As you will see, current healthcare-related data warehouse research is geared more towards customer billing and does not allow for multi-valued attributes. In fact, in similar settings, only the top n values (in the case of [2] and [3], n = 10) were kept and the rest were discarded. In order to accurately build a model for decision support in this field, all data points need to be considered.

1 INTRODUCTION

There is a clear distinction between traditional databases (e.g.
relational), which are transactional in nature, and data warehouses. Transactional systems support online transaction processing (OLTP), which includes inserts, updates, and deletes, and support querying of small subsets of the data held within. Data warehouse systems using online analytical processing (OLAP) tools, on the other hand, are mainly intended for decision support applications and are optimized for retrieval instead of routine transaction processing [11]. OLAP generally involves highly complex queries that use one or more aggregations, sometimes called decision support queries [6]. While everyday processing of healthcare information should be done with OLTP databases, analysis of the stored data needs to be done on a platform intended for intensive, complex, and all-inclusive data computations. A data warehouse can be characterized as a subject-oriented, integrated, nonvolatile, time-variant collection of data used to support management decision making. Each characteristic is as follows [4]:

Subject-oriented: The data warehouse is organized around key subjects such as customers, students, patients, and, in our case, visits.

Integrated: All data in the data warehouse conforms to one standard. For example, in one OLTP database, male and female may be stored as "m" and "f" whereas in another, as "male" and "female". When records are added to the data warehouse, the domain (or set of values) of each attribute will be checked to make sure all values adhere to the standards imposed by the data warehouse; in this case, if the acceptable values were set to {m, f}, the second system's values would need to be converted from "male" to "m" and "female" to "f". This ensures data integrity and improves query performance.

Nonvolatile: The data is loaded and refreshed in a single process and is not updatable by end users. New data is generally uploaded in batches, and users cannot edit it; they can edit only the data stored in the operational database.

Time-variant: The same data can be stored as a time series so that you can analyze the data across time and look for trends and changes.

Since the goal of the Genesis project is to create a model to manage large volumes of clinical data for decision support and quality control at the point of patient care, this work will focus on converting the existing relational database design into a data-warehouse-supported, multidimensional one.

1.1 THE GENESIS DATA SET

The Genesis data set was generated by the Genesis Medical Center (GMC) in Davenport, Iowa. Approximately 650 registered nurses (RNs) provide acute and skilled inpatient, outpatient, and home care for this over-500-bed community hospital. Since 1983, GMC has maintained a computerized nursing information system which stores care plans in the form of North American Nursing Diagnosis Association approved codes. The original data set was given to Lu et al. in the form of two flat files, which were later converted into 10 relational entities (tables in a database) [1]. In order to understand the multidimensional design, a little background in what the codes look like and what they mean is required. First, we will cover Nursing Diagnoses, then Interventions (which is almost the same for characteristics, etiologies, and outcomes; "almost" meaning that the code group varies in interventions while it does not in the others).
Sample Nursing Diagnoses code: 01MBA

This code can be broken up into four pieces:
1. Ordering: how important this code is to the patient's treatment plan
2. Code Group: MBA
3. Unique Identifier:
4. Date (YYMMDD):

Sample Interventions code: 01MBV

This code can be broken up into five pieces:
1. Ordering:
2. Code Group: MBV (for interventions, MBE-MBZ)
3. Nursing Diagnoses Group: 039 (this code, and any other code with 039 for this value, references the single Nursing Diagnoses code in the sample above)
4. Unique Identifier:
5. Date (YYMMDD):

As you can see, Nursing Diagnoses is the parent node for these five categories (Nursing Diagnoses, Interventions, Outcomes, Characteristics, and Etiologies). For more information about the data set, see [1].
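The code layout above can be sketched as a small parser. The field widths are assumptions for illustration (a 2-character ordering, a 3-character code group, a 3-digit nursing-diagnosis group for interventions, and a 6-digit trailing date, with the remainder treated as the unique identifier); the actual Genesis code widths may differ.

```python
def parse_code(code, is_intervention=False):
    """Split a nursing code into its pieces.

    Assumed layout: 2-char ordering, 3-char code group, 3-digit
    nursing-diagnosis group (interventions only), then a unique
    identifier, ending with a 6-digit YYMMDD date.
    """
    parts = {"ordering": code[:2], "code_group": code[2:5]}
    rest = code[5:]
    if is_intervention:
        # Interventions carry the group of the Nursing Diagnoses code
        # they reference (e.g. 039).
        parts["nd_group"], rest = rest[:3], rest[3:]
    parts["unique_id"], parts["date"] = rest[:-6], rest[-6:]
    return parts
```

For a hypothetical full code such as "01MBV0390002990101", this would yield ordering "01", group "MBV", nursing-diagnosis group "039", identifier "0002", and date "990101".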

1.2 THE RELATIONAL DESIGN/DATABASE

As mentioned above, 10 entity classes were generated from the flat files. These were based on the ER diagram designed in [1]. Since then, several more tables have been added to cover Clinical Classification Software (CCS)1 codes, which group together individual ICD-9 codes (diseases). Figure 1 is the ER diagram designed for this system before the CCS codes were added. For more information about the relational design, see [1].

Figure 1: ER Diagram for the Genesis data set [1]

1 A reference of all CCS codes and the ICD-9 codes that make them up can be found at hcup-us.ahrq.gov/toolssoftware/ccs/AppendixASingleDX.txt

2 LITERATURE REVIEW

2.1 THE MULTIDIMENSIONAL MODEL

Before we start with the actual constructs and models, it is important to understand the difference between normalization and denormalization. As defined by [11], normalization seeks to separate logically related attributes into tables to minimize redundancy, and thereby avoid the update anomalies that lead to extra processing overhead to maintain consistency in the database. Anomalies are errors or inconsistencies that may result when a user attempts to update a table that contains redundant data [5]. There are three types of anomalies: insert, update, and delete. Basically, if redundancy exists, deleting or updating information may require looping through a table to delete or update all affected attributes/values. For example, say you have a table that stores individual sales along with who made the sale. Each individual transaction is associated with a single salesperson. However, each salesperson will be listed in this table multiple times, since they can sell to other customers. If you want to delete a salesperson or update a name, you will need to loop through the data and perform the desired operation on all of the rows that salesperson is a part of. This example can also be used to discuss insert anomalies. These anomalies occur when there is no place to put a new record because it must first be associated with something else (in our example, a sale). That is, since salesperson data is stored with each sale, if a salesperson does not have a sale, then they do not exist in the database. Normalization solves this problem by removing salesperson from sales and leaving in its stead a foreign key (discussed below) that references the new salesperson table. This new table will have one record for every salesperson, avoiding the previous anomalies. Denormalization is the opposite of normalization.
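The salesperson example can be sketched in SQL (here run through Python's sqlite3); all table and column names, and the data, are invented for illustration.

```python
import sqlite3

# Sketch of the normalization described above: salesperson data is moved
# out of the sales table and referenced by a foreign key. Names invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE salesperson (
        salesperson_id INTEGER PRIMARY KEY,   -- one row per salesperson
        name           TEXT NOT NULL
    );
    CREATE TABLE sales (
        sale_id        INTEGER PRIMARY KEY,
        salesperson_id INTEGER REFERENCES salesperson(salesperson_id),
        amount         REAL
    );
""")
con.execute("INSERT INTO salesperson VALUES (1, 'Ada')")
con.executemany("INSERT INTO sales VALUES (?, 1, ?)", [(1, 9.99), (2, 19.99)])

# Renaming the salesperson now touches exactly one row instead of
# every sale she ever made -- the update anomaly is gone.
con.execute("UPDATE salesperson SET name = 'Ada L.' WHERE salesperson_id = 1")
```

A salesperson with no sales can now exist (one row in salesperson, none in sales), which removes the insert anomaly as well.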
Since updates and deletes do not normally occur in a data warehouse and insertions are complete (all of the data necessary to make the information worth inserting is there), separating themes will only hinder performance. Take the previous example. By breaking the table into two, we add time during the query process because of joins. This is counterproductive in an analytic environment. So, denormalization will ultimately bring those two tables back together in order to avoid time-consuming table joins. The next section defines the constructs used in multidimensional models as well as some basic principles such as primary keys, foreign keys, and relationship definitions. The section after it covers the three main types of multidimensional models: star, snowflake, and fact constellation.

MODEL CONSTRUCTS

This portion of the paper will cover some basics of data warehouse terminology and conceptual design characteristics. As a general note, some of the definitions are recursive, meaning that in order to fully understand each topic you will need some prior knowledge of another. I've listed the terms in an order that minimizes this effect, but it still exists, so reading this section twice might help clear up any remaining questions.

Data Cubes

A data cube is a way of modeling and viewing information stored in a data warehouse. Although it is called a cube, it is not limited to just 3 dimensions; a data cube is in fact an n-dimensional object. The axes are determined by the selected group of dimensions (discussed below) and the values are measures stored in a fact table (also discussed below). An example 3-dimensional data cube is shown in Figure 2 below.

Figure 2: A sample 3-dimensional data cube

Primary and Foreign Keys

In order to differentiate between records in a table, a primary key is selected. The primary key is an attribute (or set of attributes) that uniquely identifies the tuples within a table. In a dimension, these keys are called surrogate keys. A surrogate key is a nonintelligent, system-generated key (one produced from a sequence). The reason for this is that attribute keys (known as business keys) can change over time and thus introduce potential data integrity issues [5]. If there is more than one attribute or set of attributes that can uniquely identify a record, then you select one to be the primary key and the rest are known as candidate keys. Also, to ensure that every record is identified, a primary key can never be blank (NULL) or have any component (if a composite key, see below) that is NULL. This is generally called entity integrity [11]. For example, each person has a unique Social Security number which allows the government to distinguish one person from another using a single value. If a table wants to reference another, we can simply store the primary/surrogate key from that table as an attribute. This is called a foreign key. Unlike a primary key, this attribute does not follow the entity integrity constraints, since it is not used to uniquely identify records. Instead, it is simply used to say that a particular entity in one table is related to an entity in the other. For instance, say we have two tables: employees and phone numbers. Each employee has a unique employee ID, similar to an SSN, and likewise for each phone number. Since an employee can have multiple phone numbers, the employee ID will be entered multiple times in the phone numbers table. But it is not the employee ID that makes each record unique; it is the phone number ID.
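The employee/phone-number example can be sketched via sqlite3; names and data are invented, and note that SQLite only enforces foreign keys once the pragma is switched on.

```python
import sqlite3

# Sketch of the employee / phone-number example: each phone row carries
# the employee ID as a foreign key, but is identified by its own phone_id.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # enable referential integrity
con.executescript("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE phone_number (
        phone_id    INTEGER PRIMARY KEY,   -- what makes a row unique
        employee_id INTEGER NOT NULL REFERENCES employee(employee_id),
        number      TEXT NOT NULL
    );
""")
con.execute("INSERT INTO employee VALUES (1, 'Grace')")
# One employee, two phone numbers: employee_id repeats, phone_id does not.
con.executemany("INSERT INTO phone_number VALUES (?, 1, ?)",
                [(1, '555-0100'), (2, '555-0101')])
```

With the pragma on, inserting a phone number for a nonexistent employee raises an integrity error, which is exactly the referential-integrity rule described next.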
This brings about referential integrity, which states that a foreign key in one table needs to correspond to a primary/surrogate key (or candidate key) in the table it references. That is, you cannot have information about an employee that does not exist. Both primary keys and foreign keys can be composite; that is, the key is composed of multiple attributes that together uniquely identify the tuples. However, in the case of data warehouses, only fact tables (discussed below) are allowed to have composite keys.

Dimensions

The textual descriptors of the business are described by its dimension tables (Figure 3). These give meaning to the fact table (discussed below) attributes and are also the primary source of query constraints, groupings, and report labels. For example, a system designed to keep records for store sales would have dimension tables such as store, time, and item. Dimension tables have a single-attribute primary key, have fewer than 100 attributes on average, are small in terms of number of tuples (less than 1 million) compared to fact tables, and number relatively few per design [7][8][11].

There are several dimension table subtypes. The first is the time dimension. This is a special dimension that stores time variables across different increments of measure. For example, a date can be represented as a specific day, week, month, quarter, and year. The next is called a multi-valued dimension [5][8]. This is generally not acceptable, since it models a one-to-many relationship between a single fact and a dimension, but health care data has been considered one of the allowable exceptions [8]. For example, for a single visit, a patient can have multiple nursing diagnoses. Another type is the junk dimension [8]. This dimension is used to condense simple dimensions together. For example, say we want to track the answers to a 10-question survey given to customers, each question with 3 possible values. To do this directly, we would have 10 foreign keys in our fact table (discussed below) storing the primary key for each dimension tuple containing the answers. To clean this up, we could create a junk table that stores all reported combinations of answers. In our case, we would have at worst 3^10 = 59,049 records in that junk table (as you can see, it is an exponential function, and with a limit of around 1 million records per dimension, this only works in certain cases) and those 10 foreign keys would be reduced to 1. The final type of dimension I will discuss is the minidimension. This is used if, for certain attributes in a dimension, there is a high frequency of analysis or updates, or a vast majority of those fields are NULL. The dimension is split into two (or more) pieces along those lines, the lesser being the minidimension. A foreign key is added to the fact table (discussed below) for each minidimension, not for the dimension from which it split [8]. Certain portions of a dimension can also be normalized into what are known as outriggers (Figure 4).
These are dimensions that connect to other dimensions (also known as snowflaking). Reasons for using outriggers include a different level of granularity (such as county versus an individual zip code), different updating times, and data reduction (if there are only a few values available for each field). For example, say we have a dimension table for patient information. For each patient, we store a birth date. Instead of storing the date in the patient table, we would create an outrigger table that stores the day, day of week, week, month, year, and generation for each patient. We can later use the information in the outrigger table to view patient data by, for example, month or year of birth. In general, however, outrigger tables should not be used, since they increase the complexity of the design, normalize the data further, and make browsing difficult [8].

Figure 3: Sample dimension tables
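The junk-dimension arithmetic can be checked by enumerating the answer combinations directly: the worst case is one row per distinct combination of answers, which for 10 three-valued questions is 3 to the 10th power.

```python
from itertools import product

# Worst-case size of a junk dimension for a survey of 10 questions,
# each with 3 possible answers: one row per distinct answer combination.
QUESTIONS = 10
ANSWERS_PER_QUESTION = 3

combinations = list(product(range(ANSWERS_PER_QUESTION), repeat=QUESTIONS))
worst_case_rows = len(combinations)   # grows exponentially in QUESTIONS
```

Adding one more question multiplies the worst case by 3, which is why this trick only stays under the ~1-million-row dimension limit for small surveys.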

Figure 4: Sample outrigger tables

Fact Tables

The primary table in a dimensional model is the fact table (Figure 5). The fact table stores, as its columns, dimension keys (foreign keys) that connect the dimensions to each tuple, along with other attributes. It is these very same foreign keys that compose the primary key for the fact table. If the foreign keys are still not enough to ensure uniqueness, a degenerate dimension may be added. A degenerate dimension is a foreign key to a dimension that does not exist; it is simply an attribute created to aid in ensuring uniqueness. Also stored are the measured or observed variables of interest (such as sums or counts, preferably additive facts), which represent measurements at the intersections of the dimensions [7][8][11]. Next, we will discuss several types of fact tables. The first is transaction grain. This is the most common of the three types. Here, the fact table has exactly one record for each individual transaction, usually differentiated by a timestamp. For example, an item at a supermarket is sold to a single person at exactly one time. The second type is periodic snapshot grain. This is similar to the transaction grain except that instead of an exact time, each record represents a time span; for example, sales by month. Finally, we have the accumulating snapshot. As its name suggests, the fact table accumulates attribute responses. This would be useful in an ordering/shipping environment. That is, a record is created for an order; when that order is shipped, the attribute corresponding to shipment is updated. This could also be used to track the progress of a product through a process (by hand or with RFID tags). This is the only type of fact table that is not static; that is, it is set up to be updated regularly [8]. Up until now, we have talked about fact tables that not only store dimension keys and other attributes, but measures as well.
There are special cases in which the fact table is simply needed to signify the convergence of dimensions and thus has no meaningful measures except a count across the tuples. These are called factless fact tables. There are two types. The first is the event factless fact table. This type is used to represent an event such as student registration. During a specified period of time, a student registers for a unique set of classes. This is seen as an event because a student does not continuously register for classes. If something were to happen continuously, it would be covered by our second type, the coverage factless fact table. This could be applied to church facilities. A church would be interested in knowing when a particular facility (such as the sanctuary) was reserved (e.g. for a wedding or funeral). A single room can be rented multiple times per day and hundreds of times per year [8].

Figure 5: Sample fact table

Relationships

A mapping from one fact to one dimension is known as a one-to-one relationship and is the preferred relational setup for a data warehouse. The reason is that when building your data cubes, the size of each dimension is already known (how many attributes there will be on each axis), which allows for ease of viewing (once accustomed to the data) and aggregation; the most important being aggregating the cube. Some cubes will be pre-computed to speed up the query process (I will not cover the selection process in this paper). There is a specific order in which a cube should be aggregated, and this order depends on the size of the dimensions. If the dimensions are dynamic, then pre-computing the cube becomes far more difficult, because calculating in the wrong direction can lead to an increased use of RAM by orders of magnitude. One-to-many relationships are more complex in nature, requiring an additional table between fact and dimension (see Bridges below). This increases the time necessary to form the data cube as well as the level of system understanding needed to traverse such bridges. So, in general, one-to-many relationships are frowned upon, but there are a few exceptions, such as healthcare data.

Bridges

In the case of multi-valued dimensions (a one-to-many relationship from fact to dimension), a bridge or helper table is created as an intermediary [5][8]. This bridge is used to store groupings of items associated with a particular fact or facts, as well as additional attributes. One such additional attribute is a weighting factor. Its value is a portion of 1 (for 100%) that corresponds to the weight that the particular record has in the overall group. In the case of healthcare billing, a single patient can have any number of diseases, all of which are important, but some might cost more and therefore have a larger percentage.
The bridge table has more records than the dimension table, because the dimension now only has to store unique items (or whatever it represents), whereas the bridge holds two keys, one to the fact table and one to the dimension table, plus any other attributes that identify that particular item in relation to the fact table and would otherwise increase the length of the dimension table. More details are given in Section 4.3.
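The weighting factor can be sketched as a simple invariant: within one bridge group, the weights must sum to 1. The rows below are invented illustrative data (the diagnosis codes are ordinary ICD-9 examples, not taken from the Genesis set).

```python
# Sketch of a weighting factor in a bridge table: each (group, diagnosis)
# row carries that diagnosis's share of the group, and the shares within
# one group must sum to 1. All values are invented for illustration.
bridge_rows = [
    # (visit_group_id, diagnosis_code, weight)
    (1, "250.00", 0.6),   # diabetes: the costlier diagnosis in this visit
    (1, "401.9",  0.4),   # hypertension
    (2, "486",    1.0),   # pneumonia: the sole diagnosis of its visit
]

def group_weights(rows):
    """Total weight per bridge group; each total should be 1.0."""
    totals = {}
    for group_id, _code, weight in rows:
        totals[group_id] = totals.get(group_id, 0.0) + weight
    return totals
```

Multiplying a fact's measure by the weight before aggregating keeps totals correct when one fact joins to many dimension rows.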

TYPES OF SCHEMAS

The star schema is the most widely used model for data warehousing. It consists of a central fact table surrounded by dimensions that do not have outriggers [5][7]. An example of a star schema design can be seen in Figure 6. As you can see, all four dimensions connect only to the central fact table.

Figure 6: Sample star schema

The snowflake schema represents a star design with outrigger tables (Figure 7). This is also called snowflaking, and in general it is frowned upon because it normalizes the dimensions [5][7]. This leads to more tables and thus an increase in the number of joins performed for a query, which ultimately translates into longer query processing times. In some instances, however, snowflaking a design is the best course of action. Later, when we discuss the data warehouse design for Genesis, we will see an example where normalizing tables is necessary. In this example, you can see that Department is connected to College, which holds the information for each individual college. Also, City was added to create an effect much like that of a time dimension; it is used for the purpose of building hierarchies.

Figure 7: Sample snowflake schema

The last type of multidimensional model I will discuss is the fact constellation model (Figure 8). This model combines multiple fact tables into one design [5][7]. In order for this design to make sense, the fact tables need to be connected together, and the joining element is a shared dimension. In the example below, both fact tables connect via the Student table. Also, not all dimensions have to be shared, as with Account, Date, and Semester.

Figure 8: Sample fact constellation schema

2.2 SUGGESTED DESIGNS FOR MULTI-VALUED DIMENSIONS

There is a general consensus throughout the literature [2][8][9][10] that multi-valued dimensions are undesirable. However, where there are rules, there are always exceptions; healthcare data, such as diseases, is one such exception. There are four main ways to deal with multi-valued dimensions, as described by Ralph Kimball [9]:

1. Disqualify the dimension because it is multi-valued
2. Choose one value (the "primary") and omit the other values
3. Extend the dimension list to have a fixed number of dimensions (i.e. the top N)
4. Put a helper table in between the fact table and the dimension table

For this project, option 4 will be used, as the needs of the user require all possible combinations. Herein lies the main challenge: from the readings so far, no one has attempted to join this many multi-valued dimensions to a single fact table. In fact, the majority of the research indicates that option 2 above is best (although [2] and [3] use option 3).

FACT TABLE DESIGNS

Single Unique Attribute Per Record

One option for removing the multi-valued dimensions is to incorporate all collected values into the fact table. Each tuple consists of all necessary degenerate dimensions and only one foreign key. An example is shown below in Table 1.

SEQ(DD) | Intervention Code(FK) | Characteristic Code(FK) | Outcomes(FK)
        | MBV                   |                         |
        | MBL                   |                         |
        |                       | MBB                     |
        |                       | MBB                     |
        |                       |                         | MBD
        |                       |                         | MBD

Table 1

The problem comes in the dimensionality of the cube. That is to say, tuples 1 and 2 have one dimension (Intervention Code) and one degenerate dimension (SEQ). Tuples 3 and 4 have the same SEQ degenerate dimension as tuples 1 and 2, but a different code dimension (Characteristic Code). This means that tuples 1 and 2 can be compared to one another, but not to 3 and/or 4, because they do not share any common code dimensions. Even if we created an SEQ dimension, the best we could hope for is a one-dimensional count. Although an Intervention Code might be tied to a Characteristic Code, this model does not indicate the way in which the codes are related. That is to say, in a two-dimensional cube (Figure 9) with Characteristic Code on one axis and Intervention Code on the other, we will know either the row or the column, but not both, for each data point. So basically, the dimensions become lookup tables, defeating the purpose of a data warehouse, which is the ability to walk around in the data via dimensions.

Figure 9: we have a specific Intervention Code (row). To which Characteristic Code does it correspond?

Combinations

An alternative to the approach above is to create a fact table with all possible combinations, which is frowned upon due to its exponentially increasing nature as the number of dimensions increases [7]. Going back to Table 1, you will see that for each SEQ there are three dimensions, each with two possible values, for a total of 6 records (only 1 SEQ). Below in Table 2, you will see that there are a total of 8 rows. Although this is the worst-case scenario (i.e. every possible combination), you can see the effect of adding dimensions to the fact table. If we were to add another dimension such as Etiologies with two possible values, or a second SEQ number, we would then have a total of 16 rows. So, if the number of values is constant across all dimensions (which is not the case in Table 2), then we can determine the maximum number of rows with the exponential equation: total number of rows = v^d, where v is the number of values and d is the number of dimensions. Of course, having dimensions with the exact same number of values is extremely rare, so in the general case, total number of rows = v_1 * v_2 * ... * v_d, where d is the number of dimensions and v_i is the number of possible values for dimension i. So for Table 2, we have 1*2*2*2 = 8 rows max. If we were to add three more dimensions with 3, 4, and 5 values respectively, then we would have 1*2*2*2*3*4*5 = 480 rows max. Even if only 1/3 of the combinations actually existed, we would still have 160 rows for a single SEQ. In our case, we have 137,857 visits and many dimensions with hundreds if not thousands of unique values. As you can see, the number of tuples in our fact table gets out of hand in a hurry, and this is thus not a viable solution.

SEQ(DD) | Intervention Code(FK) | Characteristic Code(FK) | Outcomes(FK)
        | MBV                   | MBB                     | MBD
        | MBV                   | MBB                     | MBD
        | MBV                   | MBB                     | MBD
        | MBV                   | MBB                     | MBD
        | MBL                   | MBB                     | MBD
        | MBL                   | MBB                     | MBD
        | MBL                   | MBB                     | MBD
        | MBL                   | MBB                     | MBD

Table 2

BRIDGE DESIGN

Designing a bridge is quite simple.
First, decide what information you want to store in the bridge. Remember that you are trying to remove the attributes of the dimension that take an otherwise unique item, like an intervention code, and force there to be multiple records. That is, we are trying to normalize the dimension by breaking off a piece into a bridge. For example, for an intervention code, you have the code itself (which is unique from all other intervention codes) and then you have the date on which it was entered and its order of importance in the patient's treatment. The last two attributes force there to be repetitions of the code itself. Therefore, we will remove date and order from the dimension table and store that information in the bridge (we are basically turning the dimension into a lookup table of intervention codes). Next, we need to add a primary key to the bridge and a foreign key that references the Interventions dimension. Then, in the fact table, replace (or add) the foreign key that references the dimension surrogate key with that of the bridge table. So, we are joining the fact table to the bridge and the bridge to the dimension (Figure 10).
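The steps above can be sketched via sqlite3. All table and column names and the data values are illustrative stand-ins, not the actual Genesis schema.

```python
import sqlite3

# Sketch of the bridge design described above: the Interventions dimension
# is reduced to a lookup table of unique codes, while the visit-specific
# entry date and ordering live in the bridge. Names and values invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE interventions_dim (
        intervention_id INTEGER PRIMARY KEY,   -- surrogate key
        code            TEXT UNIQUE NOT NULL   -- one row per unique code
    );
    CREATE TABLE interventions_bridge (
        bridge_id       INTEGER PRIMARY KEY,   -- referenced by the fact table
        intervention_id INTEGER REFERENCES interventions_dim(intervention_id),
        entry_date      TEXT,                  -- moved out of the dimension
        ordering        INTEGER                -- moved out of the dimension
    );
""")
con.execute("INSERT INTO interventions_dim VALUES (1, 'MBV')")
# The same code appears many times in the bridge, once per occurrence.
con.executemany("INSERT INTO interventions_bridge VALUES (?, 1, ?, ?)",
                [(1, '1999-01-01', 1), (2, '1999-02-15', 2)])
```

The record-count check described in the text applies directly: a count of distinct codes gives the dimension's size, while a plain row count gives the bridge's.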

Figure 10: bridge design for interventions

Just to point out: you will have more records in the bridge table than in the dimension table, since all of the unique information is stored in the bridge. To check that you have moved all of the records to where they should be, you can perform a simple SELECT count(DISTINCT <attribute to count>) FROM <table_name>, where <attribute to count> is the unique identifier for the dimension, to get the number of records that should be in the dimension table, and SELECT count(*) FROM <table_name> to get the number of rows that should be in the bridge.

3 REQUIREMENTS

As mentioned earlier, the goal of this project is to create a model to manage large volumes of clinical data for decision support and quality control at the point of patient care. Part of this requires many different data mining operations to be performed on the set. In the typical relational database setting, the data is normalized as much as possible. For transaction processing, this is the best way to go in order to avoid insert, update, and delete anomalies (Section 2.1). However, this forces a lot of table joins when executing SQL statements, which takes time and can get a bit messy when dealing with multiple tables and aliases. In a data warehouse, data are denormalized as much as possible in order to avoid the massive number of joins sometimes required. The point of a data warehouse is speed in analytical processing. Therefore, the data mining steps (which only read the data) do not require a relational setup and should perform faster in a setting devised for fast processing.

4 DATA WAREHOUSE DESIGN

This design does not incorporate all of the entities shown in the ER diagram (Figure 1). The subset consists of: Visits, Patients, Diseases, Nursing Diagnoses, Interventions, and Outcomes.
There were several dimensions added, which will be discussed in detail along with the others in the following sections.

4.1 FACT TABLE

The center of this model is the Visits fact table (Figure 11). The reason Visits was chosen as the hub for this design is quite simple: going back to the ER diagram in Figure 1, you can see that in order to traverse the diagram, virtually every entity needs to go through the Visits table. The joining attribute for all tables except Patients (Visits stores the patient ID as a foreign key) is the sequence number assigned to a particular visit (SEQ). Normally, the primary key for the fact table is the set of all foreign (dimension) keys. In this instance, however, those values do not guarantee uniqueness. Therefore, SEQ was added to ensure entity integrity; SEQ is known as a degenerate dimension (see Fact Tables above). The attributes ending in _group_id are foreign keys to bridge tables, and the ones ending in _ID connect the fact table to dimensions directly. There are four more attributes and four measures remaining (measures bolded and italicized). The attributes are uniquely assigned to each visit: services, length of stay (los), discharge date (disdate), and age. The list of measures can easily be added to if needed. As of now, the model takes into consideration the total number of patients, average age, average length of stay, and total length of stay. The values change based upon the axes of the cube and the level of granularity.

Figure 11: Fact Table

4.2 DIMENSIONS AND TIME DIMENSION

As mentioned earlier, this model will not include all of the entities shown in Figure 1. Also, some dimensions not on the ER diagram are going to be added. Figure 12 shows the dimensions used in this paper; the first attribute in each dimension is the primary key. The first dimension is Patients. The reason patient information is important is that we can define an axis of a cube by patient features (e.g. gender, age, race, and zip code), which could give new meaning to the data.
Outcomes, Interventions, and Nursing Diagnoses are all based on nursing codes. The root of these is Nursing Diagnoses, which is referenced by both Outcomes and Interventions. The data stored in Diseases are ICD-9 and CCS codes (CCS codes, as mentioned earlier, group together ICD-9 codes). One main difference between the ER diagram and this design is the use of SEQ: before, SEQ was used to bind almost every table to the Visits table.

Now, we have removed SEQ and inserted a surrogate key, since SEQ is a meaningful attribute (see Section for more information). Finally, there is the Date dimension, a special kind of dimension called a Time Dimension. This table is generated automatically by Oracle Warehouse Builder 11g and consists of 41 attributes such as day of calendar week, month, year, month of quarter, day, and day name. The table also auto-generates a row for each day starting from a specified start date.

From the ER diagram to this design, Patients did not change; all attributes are accounted for. For Outcomes, Interventions, Diseases, and Nursing Diagnoses, any visit-specific attribute, such as the order of importance for the patient (ordering) and any date values, was removed and placed in the bridge tables (discussed in the next section). Another difference is the way the codes are broken up. For Outcomes and Nursing Diagnoses, the code group identifier was removed from the original code since it is the same for all records. For Interventions, however, that value from MBE MBZ was required to ensure uniqueness. For Diseases, the CCS code was added to the table to create a hierarchy; that is, we can now view the data by individual code or in groups.

Figure 12: Dimension tables

4.3 BRIDGES

Since the Outcomes, Interventions, Diseases, and Nursing Diagnoses dimensions are all multi-valued, we need bridge tables to connect them to the fact table (Figure 13). As mentioned in Section , a bridge table consists of a primary key (the first attribute in the diagram) that is referenced by the fact table, a foreign key to the dimension (the second attribute in the diagram), and attributes that are unique to each fact (visit). This removes code redundancies from the dimensions, essentially turning them into lookup tables.

Figure 13: Bridge tables
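The bridge-table pattern can be sketched with a toy example. The table and column names follow the design above, but the codes and keys are invented; the point is that one visit can carry any number of diagnosis codes, with no "top n" cutoff:

```python
import sqlite3

# Sketch of the bridge-table pattern: a visit stores a *_group_id, the bridge
# maps that group to many dimension rows, and the dimension itself is a pure
# lookup table with no redundancy.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE nursing_diagnoses (ndcode_id INTEGER PRIMARY KEY, ndcode INTEGER);
CREATE TABLE nursing_diagnoses_group (
    nursing_diagnoses_group_id INTEGER,
    ndcode_id INTEGER REFERENCES nursing_diagnoses(ndcode_id),
    ordering INTEGER);
CREATE TABLE visits (seq INTEGER PRIMARY KEY, nursing_diagnoses_group_id INTEGER);
INSERT INTO nursing_diagnoses VALUES (1, 101), (2, 102), (3, 103);
-- group 10 bundles three diagnoses for a single visit
INSERT INTO nursing_diagnoses_group VALUES (10, 1, 1), (10, 2, 2), (10, 3, 3);
INSERT INTO visits VALUES (5000, 10);
""")

codes = cur.execute("""
    SELECT nd.ndcode
    FROM visits v
    JOIN nursing_diagnoses_group g
      ON g.nursing_diagnoses_group_id = v.nursing_diagnoses_group_id
    JOIN nursing_diagnoses nd ON nd.ndcode_id = g.ndcode_id
    WHERE v.seq = 5000
    ORDER BY g.ordering
""").fetchall()
print([c[0] for c in codes])  # all three codes survive, in order
```

The fact row holds a single foreign key, yet the join through the bridge recovers every code attached to the visit, ordered by the visit-specific ordering attribute.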

4.4 FULL DESIGN

Figure 14 shows the full multidimensional design for the Genesis data set. As you can see, Visits is the central hub. Connected to Visits are Patients, Diseases_Group, Nursing_Diagnoses_Group, Interventions_Group, Outcomes_Group, and Date. Nursing_Diagnoses_Group, Interventions_Group, and Outcomes_Group are also connected to the Date time dimension, because each code is entered at a specific date per visit. This allows the user to roll up and drill down an axis by date; that is, they can group by day, month, quarter, or year, for example. In addition, Nursing_Diagnoses is connected to Interventions and Outcomes. Back in Section 1.1, the codes were broken down into parts and it was shown that Nursing_Diagnoses is the root node. With this relationship, we can set an axis by individual Interventions or Outcomes code, or by the Nursing_Diagnoses code group they belong to.

Figure 14: Full multidimensional design

4.5 CONCEPT HIERARCHIES

Figure 15 shows the four concept hierarchies used in this project so far. On the far left is the date hierarchy. There are far too many possible combinations to list, so the one shown is the hierarchy most likely to be used continually. It says that the year number is the most abstract level at which the data can be viewed. As you move down the hierarchy (drill down), the next level of granularity is the quarter number, then the month number, and finally the lowest level (highest level of granularity) is the day. That is, the most specific data that can be retrieved by date is by day.
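Rolling up and drilling down along this date hierarchy can be sketched with plain date strings standing in for the generated Date dimension. All dates and lengths of stay here are invented:

```python
import sqlite3

# Sketch of roll-up / drill-down along the date hierarchy
# (year > quarter > month > day) on a toy fact table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE visits (seq INTEGER, admdate TEXT, los INTEGER)")
cur.executemany("INSERT INTO visits VALUES (?, ?, ?)", [
    (1, '2008-01-15', 4),
    (2, '2008-02-20', 2),
    (3, '2008-07-01', 6),
])

# Most abstract level: roll up total length of stay to the year.
by_year = cur.execute(
    "SELECT strftime('%Y', admdate), SUM(los) FROM visits GROUP BY 1").fetchall()

# Drill down one level: quarters within the year.
by_quarter = cur.execute(
    "SELECT strftime('%Y', admdate), (strftime('%m', admdate) - 1) / 3 + 1, "
    "SUM(los) FROM visits GROUP BY 1, 2 ORDER BY 2").fetchall()

print(by_year)     # one row for the whole year
print(by_quarter)  # one row per quarter that has data
```

Each step down the hierarchy simply swaps in a finer grouping expression, splitting the same total into more, smaller cells.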

To the right of the date hierarchy is the Diseases hierarchy. It consists of two levels: CCS, the coarsest level of granularity (the most abstract), and ICD-9, the finest level of granularity. The next two show how the Outcomes and Interventions dimensions group up to Nursing Diagnoses. Again, there are only two levels: the group level at the top, consisting of the codes grouped by nursing diagnosis, and the lowest level, consisting of all unique Outcomes and Interventions codes.

Figure 15: Concept hierarchies

5 FUTURE WORK

There are still many steps to perform before even the validity of the design can be verified. First, I will need to either alter the data warehouse design or build a temporary one to accommodate the Extraction, Transformation, and Loading (ETL) process. This is due to the change in primary and foreign keys between the relational and the multidimensional design. That is, as mentioned before, SEQ was the primary connector between tables in the relational design but is retained only in the Visits fact table in this design. Therefore, during the ETL process, those SEQ keys will have to be stored in order to retain the proper relationships. Once the data is loaded, the experiments can begin. Using SQL Developer (which comes with Oracle 11g), one can enter an SQL query and either predict the running time (Explain function) or follow the actual steps and record system information during the process (Autotrace function). As of now, I do not know whether SQL Developer can be used to run more advanced OLAP queries; if not, SQL*Plus can be used to run the same functions (requiring more coding). If necessary, the data warehouse can also be tested using Oracle's built-in data mining tools. All of these experiments will, of course, be compared against the current relational database system. Also, this is only one possible design.
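The ETL step described above can be sketched as a staging map from the relational SEQ values to newly generated surrogate group keys. Table and column roles follow the paper; the rows themselves are invented:

```python
# Sketch of the ETL key translation: SEQ values are not reused as warehouse
# keys, so a staging map from SEQ to generated surrogate keys is kept while
# loading, preserving the old relationships between visits and their codes.
import itertools

relational_rows = [  # (seq, ndcode) pairs as they come from the source system
    (5000, 101), (5000, 102), (5001, 101),
]

surrogate = itertools.count(1)          # surrogate-key generator
group_key_for_seq = {}                  # staging map: SEQ -> group surrogate key
bridge_rows = []                        # rows destined for the bridge table

for seq, ndcode in relational_rows:
    if seq not in group_key_for_seq:    # first code seen for this visit
        group_key_for_seq[seq] = next(surrogate)
    bridge_rows.append((group_key_for_seq[seq], ndcode))

# The fact row for each visit then stores group_key_for_seq[seq]
print(group_key_for_seq)  # {5000: 1, 5001: 2}
print(bridge_rows)        # [(1, 101), (1, 102), (2, 101)]
```

Once every source row has passed through the map, the staging structure can be discarded: the fact and bridge tables carry only surrogate keys, yet every original SEQ-based relationship survives.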
In [8], Kimball notes that bridge tables may be forgone if the connected dimension has no hierarchy; that is, instead of normalizing the dimension and thereby creating the bridge, one could leave the dimension alone in order to further increase the speed of the system. As I am not privy to all of the hierarchies that could be implemented, the bridges were added to accommodate any future additions. If, however, there turn out to be none, another data warehouse will be generated and tested.

6 CONCLUSIONS

In this paper, a multidimensional model for the Genesis data set was proposed. Previous work in the field does not take multi-valued dimensions into consideration beyond the top n values from each category. This work presented a way to model multi-valued dimensions in a healthcare environment. Although the model is specific to the Genesis data set, the ideas presented can be generalized to encompass many different types of data.

7 APPENDIX

7.1 SQL FOR TABLE CREATION

-- outrigger tables
DROP TABLE Date_Outrigger;
-- dimension tables
DROP TABLE Interventions;
DROP TABLE Nursing_Diagnoses;
DROP TABLE Diseases;
DROP TABLE Outcomes;
DROP TABLE Patients;
-- bridge tables
DROP TABLE Interventions_Group;
DROP TABLE Nursing_Diagnoses_Group;
DROP TABLE Diseases_Group;
DROP TABLE Outcomes_Group;
-- fact table
DROP TABLE Visits;

-- outrigger dimension: this table is auto-generated by Oracle 11g Warehouse Builder
CREATE TABLE Date_Outrigger (
  date_ID NUMBER NOT NULL,
  day_day_code NUMBER,
  day_of_cal_week NUMBER,
  day_of_cal_month NUMBER,
  day DATE,
  day_description VARCHAR2(2000),
  day_of_cal_year NUMBER,
  day_start_date DATE,
  day_of_cal_quarter NUMBER,
  day_end_date DATE,
  day_id NUMBER,
  day_time_span NUMBER,
  julian_date NUMBER,
  day_name VARCHAR2(25),
  calendar_month_name VARCHAR2(25),
  month_of_quarter NUMBER,
  cal_month_number NUMBER,
  calendar_month_time_span NUMBER,
  calendar_month_id NUMBER,
  calendar_month_description VARCHAR2(2000),
  calendar_month_start_date DATE,
  calendar_month_end_date DATE,
  calendar_month_cal_month_code NUMBER,
  month_of_year NUMBER,
  calendar_quarter_end_date DATE,
  calendar_quarter_start_date DATE,
  cal_quarter_number NUMBER,
  quarter_of_year NUMBER,
  calendar_quarter_id NUMBER,
  calendar_quarter_time_span NUMBER,
  calendar_quarter_name VARCHAR2(25),
  calendar_quart_cal_quarter_co NUMBER,
  calendar_quarter_description VARCHAR2(2000),
  calendar_year_start_date DATE,
  calendar_year_cal_year_code NUMBER,
  calendar_year_description VARCHAR2(2000),
  calendar_year_name VARCHAR2(25),
  calendar_year_end_date DATE,
  calendar_year_id NUMBER,
  cal_year_number NUMBER,
  calendar_year_time_span NUMBER
);

-- dimensions
CREATE TABLE Patients (
  pt_id NUMBER,
  ssn NUMBER(9,0),
  birthdate_id NUMBER,
  gender CHAR(1),
  race CHAR(1),
  zipcode NUMBER(9,0),
  marstat CHAR(1),
  relig VARCHAR2(1),
  CONSTRAINT patients_pk PRIMARY KEY (pt_id),
  CONSTRAINT patients_fk_date_outrigger FOREIGN KEY (birthdate_id)
    REFERENCES Date_Outrigger (date_ID)
);

CREATE TABLE Diseases (
  icd_id NUMBER,
  icd VARCHAR2(10),
  ccs_code VARCHAR2(10),
  CONSTRAINT diseases_pk PRIMARY KEY (icd_id)
);

CREATE TABLE Nursing_Diagnoses (
  NDCode_ID NUMBER,
  NDCode NUMBER(3,0),
  CONSTRAINT nursing_diagnoses_pk PRIMARY KEY (NDCode_ID)
);

CREATE TABLE Interventions (
  NI_ID NUMBER,
  NICode_group CHAR(3),
  NICode NUMBER(3,0),
  NIDefining_code NUMBER(2,0),
  CONSTRAINT interventions_pk PRIMARY KEY (NI_ID)
);

CREATE TABLE Outcomes (
  NO_ID NUMBER,
  NOCode NUMBER(3,0),
  NODefining_code NUMBER(2,0),
  CONSTRAINT outcomes_pk PRIMARY KEY (NO_ID)
);

-- bridge tables
CREATE TABLE Diseases_Group (
  diseases_group_id NUMBER,
  icd_id NUMBER,
  ordering NUMBER,
  CONSTRAINT diseases_group_pk PRIMARY KEY (diseases_group_id),
  CONSTRAINT diseases_group_fk_diseases FOREIGN KEY (icd_id)
    REFERENCES Diseases (icd_id)
);

CREATE TABLE Nursing_Diagnoses_Group (
  nursing_diagnoses_group_id NUMBER,
  NDCode_ID NUMBER(3,0),
  NDDate_ID NUMBER,
  ordering NUMBER,
  CONSTRAINT nursing_diagnoses_group_pk PRIMARY KEY (nursing_diagnoses_group_id),
  CONSTRAINT nursing_diagnoses_group_fk_nursing_diagnoses FOREIGN KEY (NDCode_ID)
    REFERENCES Nursing_Diagnoses (NDCode_ID),
  CONSTRAINT nursing_diagnoses_group_fk_date_outrigger FOREIGN KEY (NDDate_ID)
    REFERENCES Date_Outrigger (date_ID)
);

CREATE TABLE Interventions_Group (
  interventions_group_id NUMBER,
  NI_ID NUMBER,
  NIDate_ID NUMBER,
  ordering NUMBER,
  CONSTRAINT interventions_group_pk PRIMARY KEY (interventions_group_id),
  CONSTRAINT interventions_group_fk_interventions FOREIGN KEY (NI_ID)
    REFERENCES Interventions (NI_ID),
  CONSTRAINT interventions_group_fk_date_outrigger FOREIGN KEY (NIDate_ID)
    REFERENCES Date_Outrigger (date_ID)
);

CREATE TABLE Outcomes_Group (
  outcomes_group_id NUMBER,
  NO_ID NUMBER,
  NODate_ID NUMBER,
  ordering NUMBER,
  CONSTRAINT outcomes_group_pk PRIMARY KEY (outcomes_group_id),
  CONSTRAINT outcomes_group_fk_outcomes FOREIGN KEY (NO_ID)
    REFERENCES Outcomes (NO_ID),
  CONSTRAINT outcomes_group_fk_date_outrigger FOREIGN KEY (NODate_ID)
    REFERENCES Date_Outrigger (date_ID)
);

-- fact table
CREATE TABLE Visits (
  seq NUMBER,
  pt_id NUMBER,
  diseases_group_id NUMBER,
  nursing_diagnoses_group_id NUMBER,
  interventions_group_id NUMBER,
  outcomes_group_id NUMBER,
  service VARCHAR2(10),
  admdate_id NUMBER,
  disdate_id NUMBER,
  los NUMBER,
  disstate NUMBER(5,0),
  age NUMBER(5,2),
  CONSTRAINT visits_pk PRIMARY KEY (seq),  -- degenerate dimension
  CONSTRAINT visits_fk_patients FOREIGN KEY (pt_id)
    REFERENCES Patients (pt_id),
  CONSTRAINT visits_fk_diseases_group FOREIGN KEY (diseases_group_id)
    REFERENCES Diseases_Group (diseases_group_id),
  CONSTRAINT visits_fk_nursing_diagnoses_group FOREIGN KEY (nursing_diagnoses_group_id)
    REFERENCES Nursing_Diagnoses_Group (nursing_diagnoses_group_id),
  CONSTRAINT visits_fk_interventions_group FOREIGN KEY (interventions_group_id)
    REFERENCES Interventions_Group (interventions_group_id),
  CONSTRAINT visits_fk_outcomes_group FOREIGN KEY (outcomes_group_id)
    REFERENCES Outcomes_Group (outcomes_group_id),
  CONSTRAINT visits_fk_admdate_outrigger FOREIGN KEY (admdate_id)
    REFERENCES Date_Outrigger (date_ID),
  CONSTRAINT visits_fk_disdate_outrigger FOREIGN KEY (disdate_id)
    REFERENCES Date_Outrigger (date_ID)
);

7.2 PLSQL FOR DIMENSIONS

cwm2_olap_dimension.create_dimension(
  'Date',          -- dimension name
  'Dates',
  'Date',
  'Date',
  'Date'           -- dimension type
);

cwm2_olap_dimension.create_dimension(
  'PATIENT_DIM',   -- dimension name
  'Patient',
  'Patients',
  'Patient',
  'Patient',
  'Patient'        -- dimension type
);

cwm2_olap_dimension.create_dimension(
  'DISEASE_DIM',           -- dimension name
  'Disease',
  'Diseases',
  'Disease',
  'Disease',
  'Disease'                -- dimension type
);

cwm2_olap_dimension.create_dimension(
  'NURSING_DIAGNOSES_DIM', -- dimension name
  'Nursing_Diagnose',
  'Nursing_Diagnoses',
  'Nursing_Diagnose',
  'Nursing_Diagnose',
  'Nursing_Diagnose'       -- dimension type
);

cwm2_olap_dimension.create_dimension(
  'INTERVENTION_DIM',      -- dimension name
  'Intervention',
  'Interventions',
  'Intervention',
  'Intervention',
  'Intervention'           -- dimension type
);

cwm2_olap_dimension.create_dimension(
  'OUTCOME_DIM',           -- dimension name
  'Outcome',
  'Outcomes',
  'Outcome',
  'Outcome',
  'Outcome'                -- dimension type
);

cwm2_olap_dimension.create_dimension(
  'DISEASES_GROUP_DIM',    -- dimension name
  'Diseases_Group',
  'Diseases_Groups',
  'Diseases_Group',
  'Diseases_Group',
  'Diseases_Group'         -- dimension type
);

cwm2_olap_dimension.create_dimension(
  'NURSING_DIAGNOSES_GROUP_DIM', -- dimension name
  'Nursing_Diagnoses_Group',
  'Nursing_Diagnoses_Groups',
  'Nursing_Diagnoses_Group',
  'Nursing_Diagnoses_Group',
  'Nursing_Diagnoses_Group'      -- dimension type
);

cwm2_olap_dimension.create_dimension(
  'INTERVENTIONS_GROUP_DIM',     -- dimension name
  'Interventions_Group',
  'Interventions_Groups',
  'Interventions_Group',
  'Interventions_Group',
  'Interventions_Group'          -- dimension type
);

cwm2_olap_dimension.create_dimension(
  'OUTCOMES_GROUP_DIM',          -- dimension name
  'Outcomes_Group',
  'Outcomes_Groups',
  'Outcomes_Group',
  'Outcomes_Group',
  'Outcomes_Group'               -- dimension type
);

7.3 PLSQL FOR HIERARCHIES

cwm2_olap_hierarchy.create_hierarchy (
  -- owner of dimension to which hierarchy is assigned
  -- name of dimension to which hierarchy is assigned
  'DATE_HIER',             -- name of hierarchy
  'Date hierarchy',
  'Date hierarchy',
  'Date hierarchy',
  'UNSOLVED LEVEL-BASED'   -- solved code
);

cwm2_olap_hierarchy.create_hierarchy (
  -- owner of dimension to which hierarchy is assigned
  'DISEASE_DIM',           -- name of dimension to which hierarchy is assigned
  'DISEASE_HIER',          -- name of hierarchy
  'Disease hierarchy',
  'Disease hierarchy',
  'Disease hierarchy',
  'UNSOLVED LEVEL-BASED'   -- solved code
);

7.4 PLSQL FOR LEVELS

cwm2_olap_level.create_level (
  -- owner of dimension to which level is assigned
  -- name of dimension to which level is assigned
  'LVL_YEAR',
  'Year',
  'Years',
  'Years',
  'Years'
);

cwm2_olap_level.create_level (
  -- owner of dimension to which level is assigned
  -- name of dimension to which level is assigned
  'LVL_QUARTER',
  'Quarter',
  'Quarters',
  'Quarters',
  'Quarters'
);

cwm2_olap_level.create_level (
  -- owner of dimension to which level is assigned
  -- name of dimension to which level is assigned
  'LVL_MONTH',
  'Month',
  'Months',
  'Months',
  'Months'
);

cwm2_olap_level.create_level (
  -- owner of dimension to which level is assigned
  -- name of dimension to which level is assigned
  'LVL_WEEK',
  'Week',
  'Weeks',
  'Weeks',
  'Weeks'
);

cwm2_olap_level.create_level (
  -- owner of dimension to which level is assigned
  -- name of dimension to which level is assigned
  'LVL_DAY',
  'Day',
  'Days',
  'Days',
  'Days'
);

cwm2_olap_level.create_level (
  -- owner of dimension to which level is assigned
  'DISEASE_DIM',   -- name of dimension to which level is assigned
  'LVL_CCS',
  'CCS Group',
  'CCS Groups',
  'CCS Groups',
  'CCS Groups'
);

cwm2_olap_level.create_level (
  -- owner of dimension to which level is assigned
  'DISEASE_DIM',   -- name of dimension to which level is assigned
  'LVL_ICD',
  'ICD Number',
  'ICD Numbers',
  'ICD Numbers',
  'ICD Numbers'
);

7.5 PLSQL FOR LEVELS TO HIERARCHIES

cwm2_olap_level.add_level_to_hierarchy (
  -- owner of dimension
  -- name of dimension
  'DATE_HIER',     -- name of hierarchy
  'LVL_YEAR',
  null             -- parent level
);

cwm2_olap_level.add_level_to_hierarchy (
  -- owner of dimension
  -- name of dimension
  'DATE_HIER',     -- name of hierarchy
  'LVL_QUARTER',
  'LVL_YEAR'       -- parent level
);

cwm2_olap_level.add_level_to_hierarchy (
  -- owner of dimension
  -- name of dimension
  'DATE_HIER',     -- name of hierarchy
  'LVL_WEEK',
  'LVL_YEAR'       -- parent level
);

cwm2_olap_level.add_level_to_hierarchy (
  -- owner of dimension
  -- name of dimension
  'DATE_HIER',     -- name of hierarchy
  'LVL_MONTH',
  'LVL_QUARTER'    -- parent level
);

cwm2_olap_level.add_level_to_hierarchy (
  -- owner of dimension
  -- name of dimension
  'DATE_HIER',     -- name of hierarchy
  'LVL_DAY',
  'LVL_MONTH'      -- parent level
);

cwm2_olap_level.add_level_to_hierarchy (
  -- owner of dimension
  'DISEASE_DIM',   -- name of dimension
  'DISEASE_HIER',  -- name of hierarchy
  'LVL_CCS',
  null             -- parent level
);

cwm2_olap_level.add_level_to_hierarchy (
  -- owner of dimension
  'DISEASE_DIM',   -- name of dimension
  'DISEASE_HIER',  -- name of hierarchy
  'LVL_ICD',
  'LVL_CCS'        -- parent level
);


More information

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria

More information

Data warehousing with PostgreSQL

Data warehousing with PostgreSQL Data warehousing with PostgreSQL Gabriele Bartolini http://www.2ndquadrant.it/ European PostgreSQL Day 2009 6 November, ParisTech Telecom, Paris, France Audience

More information

The Benefits of Data Modeling in Data Warehousing

The Benefits of Data Modeling in Data Warehousing WHITE PAPER: THE BENEFITS OF DATA MODELING IN DATA WAREHOUSING The Benefits of Data Modeling in Data Warehousing NOVEMBER 2008 Table of Contents Executive Summary 1 SECTION 1 2 Introduction 2 SECTION 2

More information

SAP BO Course Details

SAP BO Course Details SAP BO Course Details By Besant Technologies Course Name Category Venue SAP BO SAP Besant Technologies No.24, Nagendra Nagar, Velachery Main Road, Address Velachery, Chennai 600 042 Landmark Opposite to

More information

Turning your Warehouse Data into Business Intelligence: Reporting Trends and Visibility Michael Armanious; Vice President Sales and Marketing Datex,

Turning your Warehouse Data into Business Intelligence: Reporting Trends and Visibility Michael Armanious; Vice President Sales and Marketing Datex, Turning your Warehouse Data into Business Intelligence: Reporting Trends and Visibility Michael Armanious; Vice President Sales and Marketing Datex, Inc. Overview Introduction What is Business Intelligence?

More information

KDOT s Spatially Enabled Data Warehouse. Paul Bayless KDOT Data Warehouse Manager and Bill Schuman GeoDecisions Project Manager

KDOT s Spatially Enabled Data Warehouse. Paul Bayless KDOT Data Warehouse Manager and Bill Schuman GeoDecisions Project Manager KDOT s Spatially Enabled Data Warehouse Paul Bayless KDOT Data Warehouse Manager and Bill Schuman GeoDecisions Project Manager Goals of the Session Describe what a data warehouse is and why it is of value

More information

Module 1: Introduction to Data Warehousing and OLAP

Module 1: Introduction to Data Warehousing and OLAP Raw Data vs. Business Information Module 1: Introduction to Data Warehousing and OLAP Capturing Raw Data Gathering data recorded in everyday operations Deriving Business Information Deriving meaningful

More information

Mastering Data Warehouse Aggregates. Solutions for Star Schema Performance

Mastering Data Warehouse Aggregates. Solutions for Star Schema Performance Brochure More information from http://www.researchandmarkets.com/reports/2248199/ Mastering Data Warehouse Aggregates. Solutions for Star Schema Performance Description: - This is the first book to provide

More information

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Chapter 6. Foundations of Business Intelligence: Databases and Information Management Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

The Design and the Implementation of an HEALTH CARE STATISTICS DATA WAREHOUSE Dr. Sreèko Natek, assistant professor, Nova Vizija, srecko@vizija.

The Design and the Implementation of an HEALTH CARE STATISTICS DATA WAREHOUSE Dr. Sreèko Natek, assistant professor, Nova Vizija, srecko@vizija. The Design and the Implementation of an HEALTH CARE STATISTICS DATA WAREHOUSE Dr. Sreèko Natek, assistant professor, Nova Vizija, srecko@vizija.si ABSTRACT Health Care Statistics on a state level is a

More information

IT2304: Database Systems 1 (DBS 1)

IT2304: Database Systems 1 (DBS 1) : Database Systems 1 (DBS 1) (Compulsory) 1. OUTLINE OF SYLLABUS Topic Minimum number of hours Introduction to DBMS 07 Relational Data Model 03 Data manipulation using Relational Algebra 06 Data manipulation

More information

Basics of Dimensional Modeling

Basics of Dimensional Modeling Basics of Dimensional Modeling Data warehouse and OLAP tools are based on a dimensional data model. A dimensional model is based on dimensions, facts, cubes, and schemas such as star and snowflake. Dimensional

More information

Data warehouse design

Data warehouse design DataBase and Data Mining Group of DataBase and Data Mining Group of DataBase and Data Mining Group of Database and data mining group, Data warehouse design DATA WAREHOUSE: DESIGN - 1 Risk factors Database

More information

Implementing a Data Warehouse with Microsoft SQL Server

Implementing a Data Warehouse with Microsoft SQL Server This course describes how to implement a data warehouse platform to support a BI solution. Students will learn how to create a data warehouse 2014, implement ETL with SQL Server Integration Services, and

More information

A Critical Review of Data Warehouse

A Critical Review of Data Warehouse Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 95-103 Research India Publications http://www.ripublication.com A Critical Review of Data Warehouse Sachin

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of

More information

Business Intelligence

Business Intelligence 8 Business Intelligence Business intelligence has become a buzzword in recent years. The database tools found under the heading of business intelligence include data warehousing, online analytical processing

More information

Dimensional Modeling for Data Warehouse

Dimensional Modeling for Data Warehouse Modeling for Data Warehouse Umashanker Sharma, Anjana Gosain GGS, Indraprastha University, Delhi Abstract Many surveys indicate that a significant percentage of DWs fail to meet business objectives or

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Wienand Omta Fabiano Dalpiaz 1 drs. ing. Wienand Omta Learning Objectives Describe how the problems of managing data resources

More information

Demystified CONTENTS Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals CHAPTER 2 Exploring Relational Database Components

Demystified CONTENTS Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals CHAPTER 2 Exploring Relational Database Components Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals 1 Properties of a Database 1 The Database Management System (DBMS) 2 Layers of Data Abstraction 3 Physical Data Independence 5 Logical

More information

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING Ramesh Babu Palepu 1, Dr K V Sambasiva Rao 2 Dept of IT, Amrita Sai Institute of Science & Technology 1 MVR College of Engineering 2 asistithod@gmail.com

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Chapter 6 Foundations of Business Intelligence: Databases and Information Management 6.1 2010 by Prentice Hall LEARNING OBJECTIVES Describe how the problems of managing data resources in a traditional

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART A: Architecture Chapter 1: Motivation and Definitions Motivation Goal: to build an operational general view on a company to support decisions in

More information

COURSE 20463C: IMPLEMENTING A DATA WAREHOUSE WITH MICROSOFT SQL SERVER

COURSE 20463C: IMPLEMENTING A DATA WAREHOUSE WITH MICROSOFT SQL SERVER Page 1 of 8 ABOUT THIS COURSE This 5 day course describes how to implement a data warehouse platform to support a BI solution. Students will learn how to create a data warehouse with Microsoft SQL Server

More information

Course 103402 MIS. Foundations of Business Intelligence

Course 103402 MIS. Foundations of Business Intelligence Oman College of Management and Technology Course 103402 MIS Topic 5 Foundations of Business Intelligence CS/MIS Department Organizing Data in a Traditional File Environment File organization concepts Database:

More information

Implementing a Data Warehouse with Microsoft SQL Server

Implementing a Data Warehouse with Microsoft SQL Server Page 1 of 7 Overview This course describes how to implement a data warehouse platform to support a BI solution. Students will learn how to create a data warehouse with Microsoft SQL 2014, implement ETL

More information

Fluency With Information Technology CSE100/IMT100

Fluency With Information Technology CSE100/IMT100 Fluency With Information Technology CSE100/IMT100 ),7 Larry Snyder & Mel Oyler, Instructors Ariel Kemp, Isaac Kunen, Gerome Miklau & Sean Squires, Teaching Assistants University of Washington, Autumn 1999

More information

INTRODUCTION TO BUSINESS INTELLIGENCE What to consider implementing a Data Warehouse and Business Intelligence

INTRODUCTION TO BUSINESS INTELLIGENCE What to consider implementing a Data Warehouse and Business Intelligence INTRODUCTION TO BUSINESS INTELLIGENCE What to consider implementing a Data Warehouse and Business Intelligence Summary: This note gives some overall high-level introduction to Business Intelligence and

More information

Extraction Transformation Loading ETL Get data out of sources and load into the DW

Extraction Transformation Loading ETL Get data out of sources and load into the DW Lection 5 ETL Definition Extraction Transformation Loading ETL Get data out of sources and load into the DW Data is extracted from OLTP database, transformed to match the DW schema and loaded into the

More information

A Design and implementation of a data warehouse for research administration universities

A Design and implementation of a data warehouse for research administration universities A Design and implementation of a data warehouse for research administration universities André Flory 1, Pierre Soupirot 2, and Anne Tchounikine 3 1 CRI : Centre de Ressources Informatiques INSA de Lyon

More information

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES MUHAMMAD KHALEEL (0912125) SZABIST KARACHI CAMPUS Abstract. Data warehouse and online analytical processing (OLAP) both are core component for decision

More information

Data Warehousing Overview

Data Warehousing Overview Data Warehousing Overview This Presentation will leave you with a good understanding of Data Warehousing technologies, from basic relational through ROLAP to MOLAP and Hybrid Analysis. However it is necessary

More information

Kimball Dimensional Modeling Techniques

Kimball Dimensional Modeling Techniques Kimball Dimensional Modeling Techniques Table of Contents Fundamental Concepts... 1 Gather Business Requirements and Data Realities... 1 Collaborative Dimensional Modeling Workshops... 1 Four-Step Dimensional

More information

SQL Server 2012 Business Intelligence Boot Camp

SQL Server 2012 Business Intelligence Boot Camp SQL Server 2012 Business Intelligence Boot Camp Length: 5 Days Technology: Microsoft SQL Server 2012 Delivery Method: Instructor-led (classroom) About this Course Data warehousing is a solution organizations

More information

Data Warehouse and Business Intelligence Testing: Challenges, Best Practices & the Solution

Data Warehouse and Business Intelligence Testing: Challenges, Best Practices & the Solution Warehouse and Business Intelligence : Challenges, Best Practices & the Solution Prepared by datagaps http://www.datagaps.com http://www.youtube.com/datagaps http://www.twitter.com/datagaps Contact contact@datagaps.com

More information

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 31 Introduction to Data Warehousing and OLAP Part 2 Hello and

More information

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc. Oracle9i Data Warehouse Review Robert F. Edwards Dulcian, Inc. Agenda Oracle9i Server OLAP Server Analytical SQL Data Mining ETL Warehouse Builder 3i Oracle 9i Server Overview 9i Server = Data Warehouse

More information

A Best Practice Guide to Designing TM1 Cubes

A Best Practice Guide to Designing TM1 Cubes White Paper A Best Practice Guide to Designing TM1 Cubes What You ll Learn in This White Paper: General principles for best practice cube design The importance of the Measures dimension Different approaches

More information

PowerDesigner WarehouseArchitect The Model for Data Warehousing Solutions. A Technical Whitepaper from Sybase, Inc.

PowerDesigner WarehouseArchitect The Model for Data Warehousing Solutions. A Technical Whitepaper from Sybase, Inc. PowerDesigner WarehouseArchitect The Model for Data Warehousing Solutions A Technical Whitepaper from Sybase, Inc. Table of Contents Section I: The Need for Data Warehouse Modeling.....................................4

More information

Trivadis White Paper. Comparison of Data Modeling Methods for a Core Data Warehouse. Dani Schnider Adriano Martino Maren Eschermann

Trivadis White Paper. Comparison of Data Modeling Methods for a Core Data Warehouse. Dani Schnider Adriano Martino Maren Eschermann Trivadis White Paper Comparison of Data Modeling Methods for a Core Data Warehouse Dani Schnider Adriano Martino Maren Eschermann June 2014 Table of Contents 1. Introduction... 3 2. Aspects of Data Warehouse

More information

ETL PROCESS IN DATA WAREHOUSE

ETL PROCESS IN DATA WAREHOUSE ETL PROCESS IN DATA WAREHOUSE OUTLINE ETL : Extraction, Transformation, Loading Capture/Extract Scrub or data cleansing Transform Load and Index ETL OVERVIEW Extraction Transformation Loading ETL ETL is

More information

Introduction to Data Warehousing. Ms Swapnil Shrivastava swapnil@konark.ncst.ernet.in

Introduction to Data Warehousing. Ms Swapnil Shrivastava swapnil@konark.ncst.ernet.in Introduction to Data Warehousing Ms Swapnil Shrivastava swapnil@konark.ncst.ernet.in Necessity is the mother of invention Why Data Warehouse? Scenario 1 ABC Pvt Ltd is a company with branches at Mumbai,

More information

DATA WAREHOUSING - OLAP

DATA WAREHOUSING - OLAP http://www.tutorialspoint.com/dwh/dwh_olap.htm DATA WAREHOUSING - OLAP Copyright tutorialspoint.com Online Analytical Processing Server OLAP is based on the multidimensional data model. It allows managers,

More information

Microsoft Data Warehouse in Depth

Microsoft Data Warehouse in Depth Microsoft Data Warehouse in Depth 1 P a g e Duration What s new Why attend Who should attend Course format and prerequisites 4 days The course materials have been refreshed to align with the second edition

More information

LEARNING SOLUTIONS website milner.com/learning email training@milner.com phone 800 875 5042

LEARNING SOLUTIONS website milner.com/learning email training@milner.com phone 800 875 5042 Course 20467A: Designing Business Intelligence Solutions with Microsoft SQL Server 2012 Length: 5 Days Published: December 21, 2012 Language(s): English Audience(s): IT Professionals Overview Level: 300

More information

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP Data Warehousing and End-User Access Tools OLAP and Data Mining Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities. Key

More information

Distance Learning and Examining Systems

Distance Learning and Examining Systems Lodz University of Technology Distance Learning and Examining Systems - Theory and Applications edited by Sławomir Wiak Konrad Szumigaj HUMAN CAPITAL - THE BEST INVESTMENT The project is part-financed

More information

Jet Data Manager 2012 User Guide

Jet Data Manager 2012 User Guide Jet Data Manager 2012 User Guide Welcome This documentation provides descriptions of the concepts and features of the Jet Data Manager and how to use with them. With the Jet Data Manager you can transform

More information

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina Data Warehousing Read chapter 13 of Riguzzi et al Sistemi Informativi Slides derived from those by Hector Garcia-Molina What is a Warehouse? Collection of diverse data subject oriented aimed at executive,

More information

The Benefits of Data Modeling in Business Intelligence

The Benefits of Data Modeling in Business Intelligence WHITE PAPER: THE BENEFITS OF DATA MODELING IN BUSINESS INTELLIGENCE The Benefits of Data Modeling in Business Intelligence DECEMBER 2008 Table of Contents Executive Summary 1 SECTION 1 2 Introduction 2

More information

Data Warehousing Concepts

Data Warehousing Concepts Data Warehousing Concepts JB Software and Consulting Inc 1333 McDermott Drive, Suite 200 Allen, TX 75013. [[[[[ DATA WAREHOUSING What is a Data Warehouse? Decision Support Systems (DSS), provides an analysis

More information

Data Warehouse design

Data Warehouse design Data Warehouse design Design of Enterprise Systems University of Pavia 11/11/2013-1- Data Warehouse design DATA MODELLING - 2- Data Modelling Important premise Data warehouses typically reside on a RDBMS

More information