Duplication Problem. Duplicating Columns

The amount of data required to perform our work today can be staggering. A fundamental question is, How should data be stored? Many people will use spreadsheets for data storage. Creating a spreadsheet is simple and fast, which is way many people use them. However, they present a serious data problem that can only be solved by a database. The fundamental difference between spreadsheets and databases is the former is meant for data analysis and the latter for data storage. To illustrate the problems of storing data in a spreadsheet, we ll take a common scenario of keeping track of event attendance. Duplication Problem One of the primary benefits of a database is that it reduces duplication. Duplicate data is generally bad and should be avoided. When data is duplicated, there is an increased chance of introducing errors into the database. Spreadsheets are not set up to reduce duplication. As such, data is often duplicated in a spreadsheet in three ways: 1. Duplicating columns. 2. Duplicating rows. 3. Duplicating information in a cell. Duplicating Columns First Last Golf 12 Golf 2013 Golf14 Kim Jones x x x Bob Miaygi x x Pat Smith x Here we have a duplication of the events. The last three columns show the events though even in this simple example there are data anomalies. The event names are not consistent. They should either be Golf 12, Golf 13, Golf 14 or Golf 2012, Golf 2013, Golf 2014 or Golf12, Golf13, Golf14. Is this a major problem? Not really. But it does illustrate that as more columns are added, it will be difficult to keep consistency in the names. Another problem with this scenario is that it would be laborious to find out how many events a person attended. More on this below. www.newleafdata.com (567) 455-3162 info@newleafdata.com 1 of 5

Duplicating Rows First Last Event Kim Jones Golf 12 Kimberly Jones Golf 13 Kim Jones Golf 14 Bob Miaygi Golf 12 Robert Miaygi Golf 14 Pat Smith Golf 14 In this example, duplicating rows means that the person is listed more than once. Since the person is listed multiple times, it may be hard to determine if it is the same person. For example, are Kim Jones and Kimberly Jones the same person? There is no way to tell from the list. Duplicating Information in a Cell First Last Event Kim Jones Golf 12 Golf 13 Golf 14 Bob Miaygi Golf 12 Golf 14 Pat Smith Golf 14 Duplicating data in a cell is never a good idea. Of the three duplication scenarios, this is the hardest to query. How many events exists? How many people when to each event? Duplicating cell data almost inevitably means manually counting the data. Adding Complexity Even if one of the above methods was used to keep track of event attendance, it is limited in its complexity. For example, perhaps the people who come have different roles. Most are guests/attendees. Some may be volunteers and a couple of others may be organizers. Perhaps there are various discounts available. How would that get noted? www.newleafdata.com (567) 455-3162 info@newleafdata.com 2 of 5

Dirty Data Dirty data refers to the type and consistency of data stored. In the first example of duplicating columns an attendee is marked with an x. But the cell could just as easily contain the amount paid, or a Yes/No. First Last Golf 12 Golf 2013 Golf14 Kim Jones $10 Not Paid Yes Bob Miaygi Paid Paid Pat Smith $10 It may seem unlikely that our small spreadsheet would suffer from dirty data. But it doesn t take much for data to become dirty, especially as the list grows. Query Conundrum One of the primary reasons for storing data is to query it. That is, ask the data questions. As mentioned above in the duplicating column section, it would be laborious to count how many events a person attended. Counting people is a fairly low-level query. That is, it is not very complex. An organization may really want know the following: How many attendees were first time attendees? How much money did the event make? How much money came from returning attendees? How much money came from first time attendees? www.newleafdata.com (567) 455-3162 info@newleafdata.com 3 of 5

A Better Alternative In a relational database management system (RDBMS), data can be stored and related to other data. This makes viewing and querying the data relatively easy. Moreover, the same data can often be viewed from multiple perspectives. In our example, a people layout could display all the events a person attended while an events layout can list all the attendees. Finding first time attendees, total number of attendees, total income, etc. is quick and easy to obtain. www.newleafdata.com (567) 455-3162 info@newleafdata.com 4 of 5

When to Use a Spreadsheet Although spreadsheets are not optimal for data storage, they are very useful tools for data analysis. In our example, let s say we want to keep track of the number of people who attended an event, how much the event cost, how much revenue the event created, and how much revenue per person the event created. As previously mentioned, these numbers are easy to obtain in a relational database. However, a user may want to use the matrix structure of a spreadsheet to analyze the data. Attendees Cost Gross Net Net per person Golf 12 Golf 13 Golf 14 35 42 45 $80 $80 $82 $350 $420 $450 $270 $340 $368 $7.71 $8.10 $8.18 Questions 1. How many spreadsheets are in your computer, file server, or shared drive? 2. Of those spreadsheets, how many are for data analysis and how many for data storage? 3. How long does it take to find (query) the data for important information? 4. Can you quickly produce a report by simply clicking a button? New Leaf Data, LLC can help you and your organization store data efficiently. Contact New Leaf Data, LLC today to find out how. www.newleafdata.com (567) 455-3162 info@newleafdata.com 5 of 5