Big Data Cleaning. Nan Tang. Qatar Computing Research Institute, Doha, Qatar
Abstract. Data cleaning is a lively subject that has played an important part in the history of data management and data analytics, and it is still undergoing rapid development. Moreover, data cleaning is considered a main challenge in the era of big data, due to the increasing volume, velocity and variety of data in many applications. This paper provides an overview of recent work on different aspects of data cleaning: error detection methods, data repairing algorithms, and a generalized data cleaning system. It also discusses our efforts on data cleaning methods from the perspective of big data, in terms of volume, velocity and variety.

1 Introduction

Real-life data is often dirty: up to 30% of an organization's data could be dirty [2]. Dirty data is costly: it costs the US economy more than $3 trillion per year [1]. These facts highlight the importance of data quality management in businesses. Data cleaning is the process of identifying and (possibly) fixing data errors. In this paper, we focus on dependency-based data cleaning techniques and our research attempts in each direction [5].

Error detection. There has been a remarkable series of work on capturing data errors as violations of integrity constraints (ICs) [3, 9, 10, 18, 20, 21, 25, 27] (see [17] for a survey). A violation is a set of data values that, when put together, violate some ICs and are thus considered wrong. However, ICs cannot tell which values in a violation are correct or wrong, and thus fall short of guiding dependable data repairing. Fixing rules [30] have recently been proposed to precisely capture which values are wrong, when enough evidence is given.

Data repairing. Data repairing algorithms have been proposed in [7, 8, 12-15, 23, 24, 26, 28, 31].
Heuristic methods have been developed in [6, 8, 13, 26], based on FDs [6, 27], FDs and INDs [8], CFDs [18], CFDs and MDs [23], and denial constraints [12]. Some works employ confidence values placed by users to guide the repairing process [8, 13, 23], or use master data [24]. Statistical inference is studied in [28] to derive missing values, and in [7] to find possible repairs. To ensure the accuracy of generated repairs, [24, 28, 31] consult users. Efficient data repairing algorithms using fixing rules have also been studied [30].

Data cleaning systems. Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning,
there is a lack of an end-to-end off-the-shelf solution to (semi-)automate the detection and repairing of violations w.r.t. a set of heterogeneous and ad hoc quality constraints. Nadeef [14, 15] was presented as an extensible, generalized and easy-to-deploy data cleaning platform. Nadeef distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows users to specify multiple types of data quality rules, which uniformly define what is wrong with the data and (possibly) how to repair it, through writing code that implements predefined classes.

Organization. We describe error detection techniques in Section 2. We discuss data repairing algorithms in Section 3. We present a commodity data cleaning system, Nadeef, in Section 4, followed by open research issues in Section 5.

2 Dependency-based Error Detection

In this section, we start by illustrating how integrity constraints work for error detection. We then introduce fixing rules.

Example 1. Consider a database D of travel records for a research institute, specified by the following schema:

Travel (name, country, capital, city, conf)

where a Travel tuple specifies a person, identified by name, who has traveled to a conference (conf), held at a city of the country with capital. One Travel instance is shown in Fig. 1. All erroneous values are shown with their correct values in parentheses. For instance, r2[capital] = Shanghai is wrong; its correct value is Beijing.

      name    country         capital              city                 conf
r1:   George  China           Beijing              Beijing              SIGMOD
r2:   Ian     China           Shanghai (Beijing)   Hongkong (Shanghai)  ICDE
r3:   Peter   China (Japan)   Tokyo                Tokyo                ICDE
r4:   Mike    Canada          Toronto (Ottawa)     Toronto              VLDB

Fig. 1. Database D: an instance of schema Travel
A variety of ICs have been used to capture errors in data, from traditional constraints (e.g., functional and inclusion dependencies [8, 10]) to their extensions (e.g., conditional functional dependencies [18]).

Example 2. Suppose that the following functional dependency (FD) is specified for the Travel table:

φ1: Travel([country] → [capital])
It states that country uniquely determines capital. One can verify that in Fig. 1, the two tuples (r1, r2) violate φ1, since they have the same country but carry different capital values; so do (r1, r3) and (r2, r3).

Example 2 shows that although ICs can detect errors (i.e., there must exist errors in detected violations), it also reveals two shortcomings of IC-based error detection: (i) ICs can neither judge which value is correct (e.g., r1[country] is correct) nor which value is wrong (e.g., r2[capital] is wrong) in detected violations; and (ii) they cannot ensure that consistent data is correct. For instance, r4 is consistent with any tuple w.r.t. φ1, but r4 cannot be considered correct.

Fixing rules. Data cleaning is not magic; it cannot guess something from nothing. What it does is make decisions based on evidence. Certain data patterns of semantically related values can provide evidence to precisely capture and rectify data errors. For example, when the values (China, Shanghai) for attributes (country, capital) appear in a tuple, it suffices to judge that the tuple is about China, and Shanghai should be Beijing, the capital of China. In contrast, the values (China, Tokyo) are not enough to decide which value is wrong.

Motivated by the observation above, fixing rules were introduced [30]. A fixing rule contains an evidence pattern, a set of negative patterns, and a fact value. Given a tuple, the evidence pattern and the negative patterns of a fixing rule are combined to precisely tell which attribute is wrong, and the fact indicates how to correct it.

      country   {capital−}              capital+
ϕ1:   China     {Shanghai, Hongkong}    Beijing
ϕ2:   Canada    {Toronto}               Ottawa

Fig. 2. Example fixing rules

Example 3. Figure 2 shows two fixing rules. The brackets mean that the corresponding cell is multivalued. For the first fixing rule ϕ1, the evidence pattern, negative patterns and fact are China, {Shanghai, Hongkong}, and Beijing, respectively.
It states that for a tuple t, if its country is China and its capital is either Shanghai or Hongkong, then its capital should be updated to Beijing. For instance, consider the database in Fig. 1. Rule ϕ1 detects that r2[capital] is wrong, since r2[country] is China but r2[capital] is Shanghai. It then updates r2[capital] to Beijing. Similarly, the second fixing rule ϕ2 states that for a tuple t, if its country is Canada but its capital is Toronto, then its capital is wrong and should be Ottawa. It detects that r4[capital] is wrong and corrects it to Ottawa. After applying ϕ1 and ϕ2, two errors, r2[capital] and r4[capital], are repaired.

Notation. Consider a schema R defined over a set of attributes, denoted by attr(R). We use A ∈ R to denote that A is an attribute in attr(R). For each attribute A ∈ R, its domain is specified in R, denoted by dom(A).
Syntax. A fixing rule ϕ defined on a schema R is formalized as

((X, tp[X]), (B, Tp[B])) → t+p[B]

where

1. X is a set of attributes in attr(R), and B is an attribute in attr(R) \ X (i.e., B is not in X);
2. tp[X] is a pattern with attributes in X, referred to as the evidence pattern on X, and for each A ∈ X, tp[A] is a constant value in dom(A);
3. Tp[B] is a finite set of constants in dom(B), referred to as the negative patterns of B; and
4. t+p[B] is a constant value in dom(B) \ Tp[B], referred to as the fact of B.

Intuitively, the evidence pattern tp[X] on X, together with the negative patterns Tp[B], imposes the condition that determines whether a tuple contains an error on B. The fact t+p[B] in turn indicates how to correct this error. Note that condition (4) enforces that the correct value (i.e., the fact) is different from the known wrong values (i.e., the negative patterns) relative to a specific evidence pattern.

We say that a tuple t of R matches a rule ϕ: ((X, tp[X]), (B, Tp[B])) → t+p[B] if (i) t[X] = tp[X] and (ii) t[B] ∈ Tp[B]. In other words, tuple t matching rule ϕ indicates that ϕ can identify errors in t.

Example 4. Consider the fixing rules in Fig. 2. They can be formally expressed as follows:

ϕ1: (([country], [China]), (capital, {Shanghai, Hongkong})) → Beijing
ϕ2: (([country], [Canada]), (capital, {Toronto})) → Ottawa

In both ϕ1 and ϕ2, X consists of country and B is capital. Here, ϕ1 states that if the country of a tuple is China and its capital value is in {Shanghai, Hongkong}, then its capital value is wrong and should be updated to Beijing. Similarly for ϕ2.

Consider D in Fig. 1. Tuple r1 does not match rule ϕ1, since r1[country] = China but r1[capital] ∉ {Shanghai, Hongkong}. As another example, tuple r2 matches rule ϕ1, since r2[country] = China and r2[capital] ∈ {Shanghai, Hongkong}. Similarly, r4 matches ϕ2.

Semantics. We next give the semantics of fixing rules.
We say that a fixing rule ϕ is applied to a tuple t, yielding t′ and denoted by t →ϕ t′, if (i) t matches ϕ, and (ii) t′ is obtained from t by the update t[B] := t+p[B]. That is, if t[X] agrees with tp[X] and t[B] appears in the set Tp[B], then we assign t+p[B] to t[B]. Intuitively, if t[X] matches tp[X] and t[B] matches some value in Tp[B], there is enough evidence to judge that t[B] is wrong, and we can use the fact t+p[B] to update it. This yields an updated tuple t′ with t′[B] = t+p[B] and t′[R \ {B}] = t[R \ {B}].

Example 5. As shown in Example 3, we can correct r2 by applying rule ϕ1. As a result, r2[capital] is changed from Shanghai to Beijing, i.e., r2 →ϕ1 r′2, where r′2[capital] = Beijing and the other attributes of r2 remain unchanged. Similarly, we have r4 →ϕ2 r′4, where the only updated attribute value is r′4[capital] = Ottawa.
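To make the match-and-apply semantics concrete, here is a minimal Python sketch; the class and function names are our own illustrative choices, not an implementation from [30]:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FixingRule:
    """A fixing rule ((X, tp[X]), (B, Tp[B])) -> t+p[B]."""
    evidence: dict       # tp[X]: attribute -> constant
    attr_b: str          # the attribute B that may be wrong
    negative: frozenset  # Tp[B]: known wrong values of B
    fact: str            # t+p[B]: the correct value of B

def matches(t: dict, phi: FixingRule) -> bool:
    # t matches phi iff t[X] = tp[X] and t[B] is one of the negative patterns
    return all(t[a] == v for a, v in phi.evidence.items()) \
        and t[phi.attr_b] in phi.negative

def apply_rule(t: dict, phi: FixingRule) -> dict:
    # If t matches phi, update t[B] to the fact; otherwise leave t unchanged.
    if matches(t, phi):
        return {**t, phi.attr_b: phi.fact}
    return t

phi1 = FixingRule({"country": "China"}, "capital",
                  frozenset({"Shanghai", "Hongkong"}), "Beijing")
phi2 = FixingRule({"country": "Canada"}, "capital",
                  frozenset({"Toronto"}), "Ottawa")

r2 = {"name": "Ian", "country": "China", "capital": "Shanghai",
      "city": "Hongkong", "conf": "ICDE"}
r2_fixed = apply_rule(r2, phi1)   # r2[capital] becomes Beijing
```

Note that `apply_rule` only updates attribute B and leaves the rest of the tuple untouched, mirroring t′[R \ {B}] = t[R \ {B}] above.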
Fundamental problems associated with fixing rules have been studied [30].

Termination. The termination problem is to determine whether a rule-based process will stop. We have verified that applying fixing rules always terminates.

Consistency. The consistency problem is to determine, given a set Σ of fixing rules defined on R, whether Σ is consistent.

Theorem 1. The consistency problem for fixing rules is in PTIME.

We prove Theorem 1 by providing, in [30], a PTIME algorithm that determines whether a set of fixing rules is consistent.

Implication. The implication problem is to decide, given a set Σ of consistent fixing rules and another fixing rule ϕ, whether Σ implies ϕ.

Theorem 2. The implication problem for fixing rules is coNP-complete. It is down to PTIME when the relation schema R is fixed.

Please refer to [30] for a proof.

Determinism. The determinism problem asks whether all terminating cleaning processes end up with the same repair. From the definition of consistency of fixing rules, it follows directly that if a set Σ of fixing rules is consistent, then for any tuple t of R, applying Σ to t will terminate, and the updated t is deterministic (i.e., a unique result).

3 Data Repairing Algorithms

In this section, we discuss several classes of data repairing solutions. We start with the most-studied problem: computing a consistent database (Section 3.1). We then discuss user-guided repairing (Section 3.2) and repairing data with precomputed confidence values (Section 3.3). We end this section by introducing data repairing with fixing rules (Section 3.4).

3.1 Heuristic Algorithms

A number of recent studies [4, 8, 12] have investigated the data cleaning problem introduced in [3]: repairing is to find another database that is consistent and minimally differs from the original database. They compute a consistent database by using different cost functions for value updates and various heuristics to guide data repairing.
For instance, consider Example 2. These methods may change r2[capital] from Shanghai to Beijing, and r3[capital] from Tokyo to Beijing, which requires two changes. One may verify that this is a repair with the minimum cost of two updates. Although these changes correct the error in r2[capital], they do not rectify r3[country]. Worse still, they mess up the correct value in r3[capital].
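The minimum-cost idea can be sketched in Python over the φ1 example. The helper names are ours, and the "repair" below is a naive per-group majority strategy standing in for the cost functions and heuristics of the cited algorithms, not any specific published method:

```python
from collections import Counter
from itertools import combinations

def fd_violations(db, lhs, rhs):
    """Pairs of tuple ids that violate the FD lhs -> rhs."""
    return [(i, j) for (i, ti), (j, tj) in combinations(db.items(), 2)
            if ti[lhs] == tj[lhs] and ti[rhs] != tj[rhs]]

def naive_min_repair(db, lhs, rhs):
    """Repair each lhs-group to its most frequent rhs value; for a single
    FD this minimizes the number of changed cells (ties broken by first
    occurrence, per Counter.most_common)."""
    groups = {}
    for t in db.values():
        groups.setdefault(t[lhs], []).append(t[rhs])
    target = {k: Counter(v).most_common(1)[0][0] for k, v in groups.items()}
    return {i: {**t, rhs: target[t[lhs]]} for i, t in db.items()}

# The country/capital projection of Fig. 1.
db = {1: {"country": "China",  "capital": "Beijing"},
      2: {"country": "China",  "capital": "Shanghai"},
      3: {"country": "China",  "capital": "Tokyo"},
      4: {"country": "Canada", "capital": "Toronto"}}

print(fd_violations(db, "country", "capital"))  # [(1, 2), (1, 3), (2, 3)]
repaired = naive_min_repair(db, "country", "capital")
# Exactly two cells change: the capitals of tuples 2 and 3.
```

As the discussion above warns, the two-cell repair also overwrites r3's correct capital while leaving r3[country] wrong: minimality alone does not guarantee correctness.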
      country   capital
s1:   China     Beijing
s2:   Canada    Ottawa
s3:   Japan     Tokyo

Fig. 3. Data Dm of schema Cap

3.2 User Guidance

It is known that heuristic-based solutions might introduce data errors [22]. In order to ensure that a repair is dependable, users are involved in the process of data repairing [22, 29, 31].

Consider a recent work [24] that uses editing rules and master data. Figure 3 shows master data Dm of schema Cap (country, capital), which is considered to be correct. An editing rule er1 defined on the two relations (Travel, Cap) is:

er1: ((country, country) → (capital, capital), tp1[country] = ())

Rule er1 states that, for any tuple r in a Travel table, if r[country] is correct and matches s[country] for some tuple s in a Cap table, we can update r[capital] with the value s[capital] from Cap. For instance, to repair r2 in Fig. 1, the users need to ensure that r2[country] is correct; r2[country] is then matched with s1[country] in the master data, so as to update r2[capital] to s1[capital]. It proceeds similarly for the other tuples.

3.3 Value Confidence

Instead of interacting with users to ensure the correctness of some values or to rectify some data, some work employs pre-computed or user-placed confidence values to guide the repairing process [8, 13, 23]. The intuition is that values with high confidence should not be changed, while values with low confidence are most probably wrong and thus should be changed. This confidence information is taken into consideration by modifying the algorithms, e.g., those in Section 3.1.

3.4 Fixing Rules

Two data repairing algorithms using the fixing rules introduced in Section 2 have been proposed. Readers can find the details of these algorithms in [30]. In this paper, we give an example of how they work.

Example 6. Consider the Travel data D in Fig. 1, rules ϕ1, ϕ2 in Fig. 2, and the following two rules.
ϕ3: (([capital, city, conf], [Tokyo, Tokyo, ICDE]), (country, {China})) → Japan
ϕ4: (([capital, conf], [Beijing, ICDE]), (city, {Hongkong})) → Shanghai
Rule ϕ3 states that, for a tuple t in relation Travel, if the conf is ICDE, held at city Tokyo in a country with capital Tokyo, but the country is China, then its country should be updated to Japan. Rule ϕ4 states that, for a tuple t in relation Travel, if the conf is ICDE, held in a country whose capital is Beijing, but the city is Hongkong, then its city should be Shanghai. This holds since ICDE was held in China only once, in 2009, in Shanghai and never in Hongkong.

Before giving a running example, we pause to introduce some indices that are important for understanding the algorithm.

Inverted lists. Each inverted list is a mapping from a key to a set Υ of fixing rules. Each key is a pair (A, a), where A is an attribute and a is a constant value. Each fixing rule ϕ in the set Υ satisfies A ∈ Xϕ and tp[A] = a. For example, an inverted list w.r.t. ϕ1 in Example 4 is:

country, China → ϕ1

Intuitively, when the country of some tuple is China, this inverted list helps to identify that ϕ1 might be applicable.

Hash counters. The algorithm uses a hash map to maintain a counter for each rule. More concretely, for each rule ϕ, the counter c(ϕ) is a nonnegative integer denoting the number of attributes on which a tuple agrees with tp[Xϕ]. For example, consider ϕ1 in Example 4 and r2 in Fig. 1. We have c(ϕ1) = 1 w.r.t. tuple r2, since both r2[country] and tp1[country] are China. As another example, consider r4 in Fig. 1: we have c(ϕ1) = 0 w.r.t. tuple r4, since r4[country] = Canada but tp1[country] = China.

(a) Inverted lists:
  (country, China)    → ϕ1
  (country, Canada)   → ϕ2
  (conf, ICDE)        → ϕ3, ϕ4
  (capital, Tokyo)    → ϕ3
  (city, Tokyo)       → ϕ3
  (capital, Beijing)  → ϕ4

(b) Running example:
  r1: Γ = {ϕ1}, c(ϕ1) = 1;  itr 1: r1 unchanged, Γ = ∅
  r2: Γ = {ϕ1}, c(ϕ1) = c(ϕ3) = c(ϕ4) = 1;  itr 1: r2[capital] := Beijing, c(ϕ4) = 2, Γ = {ϕ4};  itr 2: r2[city] := Shanghai, Γ = ∅
  r3: Γ = {ϕ3}, c(ϕ3) = 3;  itr 1: r3[country] := Japan, Γ = ∅
  r4: Γ = {ϕ2}, c(ϕ2) = 1;  itr 1: r4[capital] := Ottawa, Γ = ∅

Fig. 4. A running example
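The inverted lists and hash counters can be sketched in Python as follows. This is a simplified version that recomputes the counters in each round, whereas the algorithm in [30] maintains them incrementally; the dictionary encoding of rules is ours:

```python
def build_index(rules):
    """Inverted lists: key (A, a) -> indices of rules phi with A in X_phi and tp[A] = a."""
    inv = {}
    for k, phi in enumerate(rules):
        for a, v in phi["evidence"].items():
            inv.setdefault((a, v), []).append(k)
    return inv

def repair_tuple(t, rules):
    """Apply fixing rules to t until none matches (assumes a consistent rule set)."""
    inv = build_index(rules)
    changed = True
    while changed:
        changed = False
        count = [0] * len(rules)            # hash counters c(phi)
        for a, v in t.items():              # one pass over the tuple's cells
            for k in inv.get((a, v), []):
                count[k] += 1
        for k, phi in enumerate(rules):
            # phi matches: full evidence agreement and t[B] is a negative pattern
            if count[k] == len(phi["evidence"]) and t[phi["attr"]] in phi["neg"]:
                t = {**t, phi["attr"]: phi["fact"]}
                changed = True
                break                       # recompute counters after the update
    return t

phi1 = {"evidence": {"country": "China"}, "attr": "capital",
        "neg": {"Shanghai", "Hongkong"}, "fact": "Beijing"}
phi4 = {"evidence": {"capital": "Beijing", "conf": "ICDE"}, "attr": "city",
        "neg": {"Hongkong"}, "fact": "Shanghai"}

r2 = {"name": "Ian", "country": "China", "capital": "Shanghai",
      "city": "Hongkong", "conf": "ICDE"}
fixed = repair_tuple(r2, [phi1, phi4])
# Round 1 applies phi1 (capital -> Beijing); that raises c(phi4) to 2,
# so round 2 applies phi4 (city -> Shanghai), mirroring r2 in Fig. 4.
```

Applying one rule can enable another, as with ϕ1 and ϕ4 here, which is exactly why the counters must be refreshed after each update.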
Given the four fixing rules ϕ1-ϕ4, the corresponding inverted lists are given in Fig. 4(a). For instance, the third key (conf, ICDE) links to rules ϕ3 and ϕ4, since conf ∈ Xϕ3 (i.e., {capital, city, conf}) and tp3[conf] = ICDE; and moreover,
conf ∈ Xϕ4 (i.e., {capital, conf}) and tp4[conf] = ICDE. The other inverted lists are built similarly.

Now we show how the algorithm works over tuples r1 to r4, which is also depicted in Fig. 4; among them, r1 is clean, while r2, r3 and r4 contain errors.

r1: The algorithm initializes and finds that ϕ1 may be applicable, maintained in Γ. In the first iteration, it finds that ϕ1 cannot be applied, since r1[capital] is Beijing, which is not in the negative patterns {Shanghai, Hongkong} of ϕ1. No other rules can be applied either, so it terminates with tuple r1 unchanged. Indeed, r1 is a clean tuple.

r2: The algorithm initializes and finds that ϕ1 might be applicable. In the first iteration, rule ϕ1 is applied to r2 and updates r2[capital] to Beijing. Consequently, the algorithm uses the inverted lists to increase the counter of ϕ4 and finds that ϕ4 might be applicable. In the second iteration, rule ϕ4 is applied and updates r2[city] to Shanghai. It then terminates, since no other rules can be applied.

r3: The algorithm initializes and finds that ϕ3 might be applicable. In the first iteration, rule ϕ3 is applied and updates r3[country] to Japan. It then terminates, since there are no more applicable rules.

r4: The algorithm initializes and finds that ϕ2 might be applicable. In the first iteration, rule ϕ2 is applied and updates r4[capital] to Ottawa. It then terminates.

At this point, all four errors shown in Fig. 1 have been corrected, as highlighted in Fig. 4.

4 NADEEF: A Commodity Data Cleaning System

Despite the need for high-quality data, there is no end-to-end off-the-shelf solution to (semi-)automate error detection and correction w.r.t. a set of heterogeneous and ad hoc quality rules. In particular, there is no commodity platform, similar to general-purpose DBMSs, that can be easily customized and deployed to solve application-specific data quality problems.
Although there exist more expressive logical forms (e.g., first-order logic) that cover a large group of quality rules, e.g., CFDs, MDs or denial constraints, the main obstacle to designing an effective holistic algorithm for these rules is the lack of dynamic semantics, i.e., alternative ways to repair data errors. Most existing rules only have static semantics, i.e., they specify what data is erroneous. Emerging data quality applications pose the following challenges in building a commodity data cleaning system.

Heterogeneity: Business and dependency-theory-based quality rules are expressed in a large variety of formats and languages, from rigorous expressions (e.g., functional dependencies) to plain natural language rules enforced by code embedded in the application logic itself (as in many practical scenarios). Such diversified semantics hinders the creation of one uniform system that accepts heterogeneous quality rules and enforces them on the data within the same framework.

[Fig. 5. Architecture of Nadeef. Data owners and domain experts supply data and quality rules (ETLs, CFDs, MDs, business rules) to the Data Loader and Rule Collector; the Core compiles the rules and runs violation detection and data repairing; metadata management (auditing and lineage, indices, probabilistic models) and a data quality dashboard support the process.]

Interdependency: Data cleaning algorithms are normally designed for one specific type of rules. [23] shows that making two types of quality rules (CFDs and MDs) interact may produce higher-quality repairs than treating them independently. However, the problem of the interaction among more diversified types of rules is far from being solved. One promising way to help solve this problem is to provide unified formats that represent not only the static semantics of various rules (i.e., what is wrong), but also their dynamic semantics (i.e., alternative ways to fix the wrong data).

Deployment and extensibility: Although many algorithms and techniques have been proposed for data cleaning [8, 23, 31], it is difficult to download one of them and run it on the data at hand without tedious customization. Adding to this difficulty, users may define new types of quality rules, or may want to extend an existing system with their own implementation of cleaning solutions.

Metadata management and data custodians: Data is not born an orphan. Real customers have little trust in machines messing with the data without human consultation. Several attempts have tackled the problem of including humans in the loop (e.g., [24, 29, 31]). However, they only provide users with information in restrictive formats. In practice, users need to understand much more meta-information, e.g., summarization or samples of data errors, lineage of data changes, and possible data repairs, before they can effectively guide any data cleaning process.
Nadeef is a prototype of an extensible and easy-to-deploy cleaning system that leverages the separability of two main tasks: (1) isolating rule specification that uniformly defines what is wrong and (possibly) how to fix it; and (2) developing a core that holistically applies these routines to handle the detection and cleaning of data errors.
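The separation of rule specification from the core can be pictured with a small, hypothetical rule interface. The class and method names below are illustrative only and do not reflect Nadeef's actual API:

```python
from abc import ABC, abstractmethod
from collections import Counter

class Rule(ABC):
    """A quality rule: what is wrong (detect) and, possibly, how to fix it (repair)."""

    @abstractmethod
    def detect(self, tuples):
        """Return violations, each a list of (tuple_id, attribute) cells."""

    @abstractmethod
    def repair(self, violation, tuples):
        """Return candidate fixes as (tuple_id, attribute, new_value) triples."""

class CapitalFD(Rule):
    """The FD country -> capital of Example 2, written as user code."""

    def detect(self, tuples):
        groups = {}
        for i, t in tuples.items():
            groups.setdefault(t["country"], []).append(i)
        # A group violates the FD if it carries more than one capital value.
        return [[(i, "capital") for i in ids]
                for ids in groups.values()
                if len({tuples[i]["capital"] for i in ids}) > 1]

    def repair(self, violation, tuples):
        # Naive strategy: propose the most frequent capital in the group.
        vals = Counter(tuples[i]["capital"] for i, _ in violation)
        target = vals.most_common(1)[0][0]
        return [(i, "capital", target) for i, _ in violation
                if tuples[i]["capital"] != target]

db = {1: {"country": "China",  "capital": "Beijing"},
      2: {"country": "China",  "capital": "Shanghai"},
      3: {"country": "China",  "capital": "Tokyo"},
      4: {"country": "Canada", "capital": "Toronto"}}

rule = CapitalFD()
violations = rule.detect(db)            # one violating group: the China tuples
fixes = rule.repair(violations[0], db)  # two proposed cell changes
```

Under this separation, a core component can treat every rule uniformly: call detect on all registered rules, collect the violations, and arbitrate among the candidate fixes that each rule's repair method proposes.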
4.1 Architecture Overview

Figure 5 depicts the architecture of Nadeef. It contains three components: (1) the Rule Collector gathers user-specified quality rules; (2) the Core uses a rule compiler to compile heterogeneous rules into homogeneous constructs that allow the development of default holistic data cleaning algorithms; and (3) the metadata management and data quality dashboard modules maintain and query various metadata about data errors and their possible fixes. The dashboard allows domain experts and users to easily interact with the system.

Rule Collector. It collects user-specified data quality rules such as ETL rules, CFDs (FDs), MDs, deduplication rules, and other customized rules.

Core. The core contains three modules: rule compiler, violation detection, and data repairing. (i) Rule Compiler. This module compiles all heterogeneous rules and manages them in a unified format. (ii) Violation Detection. This module takes the data and the compiled rules as input, and computes a set of data errors. (iii) Data Repairing. This module encapsulates holistic repairing algorithms that take violations as input and compute a set of data repairs, while (by default) targeting the minimization of some pre-defined cost metric. This module may interact with domain experts through the data quality dashboard to achieve higher-quality repairs. For more details of Nadeef, please refer to [14].

4.2 Entity Resolution Extension

Entity resolution (ER), the process of identifying and eventually merging records that refer to the same real-world entities, is an important and long-standing problem. Nadeef/Er [16] is a generic and interactive entity resolution system built as an extension of Nadeef. It provides a rich programming interface for manipulating entities, which allows generic, efficient and extensible ER.
Nadeef/Er offers the following features: (1) Easy specification: users can easily define ER rules with a browser-based specification, which is then automatically transformed into various functions, treated as black boxes by Nadeef. (2) Generality and extensibility: users can customize their ER rules by refining and fine-tuning these functions to achieve both effective and efficient ER solutions. (3) Interactivity: Nadeef/Er [16] extends the existing Nadeef [15] dashboard with summarization and clustering techniques to facilitate understanding the problems faced by the ER process, as well as to allow users to influence resolution decisions.

4.3 High-Volume Data

To be scalable, Nadeef has native support for three databases: PostgreSQL, MySQL, and DerbyDB. However, to achieve high performance on high-volume data, a single machine is not enough. To this end, we have also built Nadeef on top of Spark, which is transparent to end users. In other words, users only need to implement the Nadeef programming interfaces at the logical level; Nadeef is responsible for translating and executing the user-provided functions on top of Spark.

4.4 High-Velocity Data

In order to deal with high-velocity data, we have also designed new Nadeef interfaces for the incremental processing of streaming data. By implementing these new functions, Nadeef can maximally avoid repeated comparison of existing data, and hence is able to process data at high velocity.

5 Open Issues

Data cleaning is, in general, a hard problem. Many issues remain to be addressed or improved to meet practical needs.

Tool selection. Given a database and a wide range of data cleaning tools (e.g., FD-, DC- or statistics-based methods), the first challenging question is which tool to pick for the given task.

Rule discovery. Although several discovery algorithms [11, 19] have been developed, e.g., for CFDs or DCs, the rules discovered by automatic algorithms are far from clean themselves. Hence, oftentimes, manually selecting and cleaning thousands of discovered rules is a must, yet a difficult process.

Usability. Usability has been identified as an important feature of data management, since it is challenging for humans to interact with machines. This problem is even harder for the specific topic of data cleaning: given detected errors, there is normally no evidence of which values are correct and which are wrong, even for humans. Hence, more effort should be put into the usability of data cleaning systems, so as to effectively involve users as first-class citizens.

References

1. Dirty data costs the U.S. economy $3 trillion+ per year.
2. Firms full of dirty data. /firms-full-of-dirty-data.
3. M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. TPLP.
4. L. E. Bertossi, S. Kolahi, and L. V. S. Lakshmanan. Data cleaning and query answering with matching dependencies and matching functions. In ICDT.
5. G. Beskales, G. Das, A. K. Elmagarmid, I. F. Ilyas, F. Naumann, M. Ouzzani, P. Papotti, J.-A. Quiané-Ruiz, and N. Tang. The data analytics group at the Qatar Computing Research Institute. SIGMOD Record, 41(4):33-38.
6. G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB.
7. G. Beskales, M. A. Soliman, I. F. Ilyas, and S. Ben-David. Modeling and querying possible repairs in duplicate detection. In VLDB.
8. P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD.
9. L. Bravo, W. Fan, and S. Ma. Extending dependencies with conditions. In VLDB.
10. J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Inf. Comput.
11. X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13).
12. X. Chu, P. Papotti, and I. F. Ilyas. Holistic data cleaning: Put violations into context. In ICDE.
13. G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In VLDB.
14. M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: A commodity data cleaning system. In SIGMOD.
15. A. Ebaid, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, J.-A. Quiané-Ruiz, N. Tang, and S. Yin. NADEEF: A generalized data cleaning system. PVLDB.
16. A. Elmagarmid, I. F. Ilyas, M. Ouzzani, J.-A. Quiané-Ruiz, N. Tang, and S. Yin. NADEEF/ER: Generic and interactive entity resolution. In SIGMOD.
17. W. Fan. Dependencies revisited for improving data quality. In PODS.
18. W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. TODS.
19. W. Fan, F. Geerts, J. Li, and M. Xiong. Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng., 23(5).
20. W. Fan, F. Geerts, N. Tang, and W. Yu. Inferring data currency and consistency for conflict resolution. In ICDE.
21. W. Fan, F. Geerts, and J. Wijsen. Determining the currency of data. In PODS.
22. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1).
23. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Interaction between record matching and data repairing. In SIGMOD.
24. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. VLDB J.
25. W. Fan, J. Li, N. Tang, and W. Yu. Incremental detection of inconsistencies in distributed data. In ICDE.
26. I. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. J. American Statistical Association.
27. S. Kolahi and L. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT.
28. C. Mayfield, J. Neville, and S. Prabhakar. ERACER: A database approach for statistical inference and data cleaning. In SIGMOD.
29. V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In VLDB.
30. J. Wang and N. Tang. Towards dependable data with fixing rules. In SIGMOD.
31. M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB.
Consistent Query Answering in Databases Under Cardinality-Based Repair Semantics
Consistent Query Answering in Databases Under Cardinality-Based Repair Semantics Leopoldo Bertossi Carleton University School of Computer Science Ottawa, Canada Joint work with: Andrei Lopatenko (Google,
Conditional-Functional-Dependencies Discovery
Conditional-Functional-Dependencies Discovery Raoul Medina 1 Lhouari Nourine 2 Research Report LIMOS/RR-08-13 19 décembre 2008 1 [email protected] 2 [email protected], Université Blaise Pascal LIMOS Campus
Addressing internal consistency with multidimensional conditional functional dependencies
Addressing internal consistency with multidimensional conditional functional dependencies Stefan Brüggemann OFFIS - Institute for Information Technology Escherweg 2 262 Oldenburg Germany [email protected]
FINDING, MANAGING AND ENFORCING CFDS AND ARS VIA A SEMI-AUTOMATIC LEARNING STRATEGY
STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 2, 2014 FINDING, MANAGING AND ENFORCING CFDS AND ARS VIA A SEMI-AUTOMATIC LEARNING STRATEGY KATALIN TÜNDE JÁNOSI-RANCZ Abstract. This paper describes
AN EFFICIENT EXTENSION OF CONDITIONAL FUNCTIONAL DEPENDENCIES IN DISTRIBUTED DATABASES
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 3, Mar 2014, 17-24 Impact Journals AN EFFICIENT EXTENSION OF CONDITIONAL
How To Find Out What A Key Is In A Database Engine
Database design theory, Part I Functional dependencies Introduction As we saw in the last segment, designing a good database is a non trivial matter. The E/R model gives a useful rapid prototyping tool,
Report on the Dagstuhl Seminar Data Quality on the Web
Report on the Dagstuhl Seminar Data Quality on the Web Michael Gertz M. Tamer Özsu Gunter Saake Kai-Uwe Sattler U of California at Davis, U.S.A. U of Waterloo, Canada U of Magdeburg, Germany TU Ilmenau,
DYNAMIC QUERY FORMS WITH NoSQL
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 7, Jul 2014, 157-162 Impact Journals DYNAMIC QUERY FORMS WITH
Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya
Chapter 6 Basics of Data Integration Fundamentals of Business Analytics Learning Objectives and Learning Outcomes Learning Objectives 1. Concepts of data integration 2. Needs and advantages of using data
Logic Programs for Consistently Querying Data Integration Systems
IJCAI 03 1 Acapulco, August 2003 Logic Programs for Consistently Querying Data Integration Systems Loreto Bravo Computer Science Department Pontificia Universidad Católica de Chile [email protected] Joint
Conditional Functional Dependencies for Data Cleaning
Conditional Functional Dependencies for Data Cleaning Philip Bohannon 1 Wenfei Fan 2,3 Floris Geerts 3,4 Xibei Jia 3 Anastasios Kementsietsidis 3 1 Yahoo! Research 2 Bell Laboratories 3 University of Edinburgh
Enterprise Data Quality
Enterprise Data Quality An Approach to Improve the Trust Factor of Operational Data Sivaprakasam S.R. Given the poor quality of data, Communication Service Providers (CSPs) face challenges of order fallout,
Enforcing Data Quality Rules for a Synchronized VM Log Audit Environment Using Transformation Mapping Techniques
Enforcing Data Quality Rules for a Synchronized VM Log Audit Environment Using Transformation Mapping Techniques Sean Thorpe 1, Indrajit Ray 2, and Tyrone Grandison 3 1 Faculty of Engineering and Computing,
Detailed Investigation on Strategies Developed for Effective Discovery of Matching Dependencies
Detailed Investigation on Strategies Developed for Effective Discovery of Matching Dependencies R.Santhya 1, S.Latha 2, Prof.S.Balamurugan 3, S.Charanyaa 4 Department of IT, Kalaignar Karunanidhi Institute
Oracle Real Time Decisions
A Product Review James Taylor CEO CONTENTS Introducing Decision Management Systems Oracle Real Time Decisions Product Architecture Key Features Availability Conclusion Oracle Real Time Decisions (RTD)
SQL Server Master Data Services A Point of View
SQL Server Master Data Services A Point of View SUBRAHMANYA V SENIOR CONSULTANT [email protected] Abstract Is Microsoft s Master Data Services an answer for low cost MDM solution? Will
Research Statement Immanuel Trummer www.itrummer.org
Research Statement Immanuel Trummer www.itrummer.org We are collecting data at unprecedented rates. This data contains valuable insights, but we need complex analytics to extract them. My research focuses
IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD
Journal homepage: www.mjret.in ISSN:2348-6953 IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD Deepak Ramchandara Lad 1, Soumitra S. Das 2 Computer Dept. 12 Dr. D. Y. Patil School of Engineering,(Affiliated
KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS
ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,
Viewpoint ediscovery Services
Xerox Legal Services Viewpoint ediscovery Platform Technical Brief Viewpoint ediscovery Services Viewpoint by Xerox delivers a flexible approach to ediscovery designed to help you manage your litigation,
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
Query Answering in Peer-to-Peer Data Exchange Systems
Query Answering in Peer-to-Peer Data Exchange Systems Leopoldo Bertossi and Loreto Bravo Carleton University, School of Computer Science, Ottawa, Canada {bertossi,lbravo}@scs.carleton.ca Abstract. The
Query Processing in Data Integration Systems
Query Processing in Data Integration Systems Diego Calvanese Free University of Bozen-Bolzano BIT PhD Summer School Bressanone July 3 7, 2006 D. Calvanese Data Integration BIT PhD Summer School 1 / 152
CS143 Notes: Normalization Theory
CS143 Notes: Normalization Theory Book Chapters (4th) Chapters 7.1-6, 7.8, 7.10 (5th) Chapters 7.1-6, 7.8 (6th) Chapters 8.1-6, 8.8 INTRODUCTION Main question How do we design good tables for a relational
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 Yannis Stavrakas Vassilis Plachouras IMIS / RC ATHENA Athens, Greece {yannis, vplachouras}@imis.athena-innovation.gr Abstract.
What's New in SAS Data Management
Paper SAS034-2014 What's New in SAS Data Management Nancy Rausch, SAS Institute Inc., Cary, NC; Mike Frost, SAS Institute Inc., Cary, NC, Mike Ames, SAS Institute Inc., Cary ABSTRACT The latest releases
A Review of Contemporary Data Quality Issues in Data Warehouse ETL Environment
DOI: 10.15415/jotitt.2014.22021 A Review of Contemporary Data Quality Issues in Data Warehouse ETL Environment Rupali Gill 1, Jaiteg Singh 2 1 Assistant Professor, School of Computer Sciences, 2 Associate
How to Enhance Traditional BI Architecture to Leverage Big Data
B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...
Web Database Integration
Web Database Integration Wei Liu School of Information Renmin University of China Beijing, 100872, China [email protected] Xiaofeng Meng School of Information Renmin University of China Beijing, 100872,
META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING
META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING Ramesh Babu Palepu 1, Dr K V Sambasiva Rao 2 Dept of IT, Amrita Sai Institute of Science & Technology 1 MVR College of Engineering 2 [email protected]
Load Distribution in Large Scale Network Monitoring Infrastructures
Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu
A Framework for Data Migration between Various Types of Relational Database Management Systems
A Framework for Data Migration between Various Types of Relational Database Management Systems Ahlam Mohammad Al Balushi Sultanate of Oman, International Maritime College Oman ABSTRACT Data Migration is
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &
Dynamic Querying In NoSQL System
Dynamic Querying In NoSQL System Reshma Suku 1, Kasim K. 2 Student, Dept. of Computer Science and Engineering, M. E. A Engineering College, Perinthalmanna, Kerala, India 1 Assistant Professor, Dept. of
Data Quality for BASEL II
Data Quality for BASEL II Meeting the demand for transparent, correct and repeatable data process controls Harte-Hanks Trillium Software www.trilliumsoftware.com Corporate Headquarters + 1 (978) 436-8900
Module 10. Coding and Testing. Version 2 CSE IIT, Kharagpur
Module 10 Coding and Testing Lesson 23 Code Review Specific Instructional Objectives At the end of this lesson the student would be able to: Identify the necessity of coding standards. Differentiate between
Continuous Data Cleaning
Continuous Data Cleaning Maksims Volkovs #1, Fei Chiang #2, Jaroslaw Szlichta #1, Renée J. Miller #1 #1 Dept. of Computer Science, University of Toronto {mvolkovs, szlichta, miller}@cs.toronto.edu #2 Dept.
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria
HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering
HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering Chang Liu 1 Jun Qu 1 Guilin Qi 2 Haofen Wang 1 Yong Yu 1 1 Shanghai Jiaotong University, China {liuchang,qujun51319, whfcarter,yyu}@apex.sjtu.edu.cn
Analysis of Data Cleansing Approaches regarding Dirty Data A Comparative Study
Analysis of Data Cleansing Approaches regarding Dirty Data A Comparative Study Kofi Adu-Manu Sarpong Institute of Computer Science Valley View University, Accra-Ghana P.O. Box VV 44, Oyibi-Accra ABSTRACT
Offline sorting buffers on Line
Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: [email protected] 2 IBM India Research Lab, New Delhi. email: [email protected]
II. PREVIOUS RELATED WORK
An extended rule framework for web forms: adding to metadata with custom rules to control appearance Atia M. Albhbah and Mick J. Ridley Abstract This paper proposes the use of rules that involve code to
Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation
Objectives Distributed Databases and Client/Server Architecture IT354 @ Peter Lo 2005 1 Understand the advantages and disadvantages of distributed databases Know the design issues involved in distributed
A STATISTICAL DATA FUSION TECHNIQUE IN VIRTUAL DATA INTEGRATION ENVIRONMENT
A STATISTICAL DATA FUSION TECHNIQUE IN VIRTUAL DATA INTEGRATION ENVIRONMENT Mohamed M. Hafez 1, Ali H. El-Bastawissy 1 and Osman M. Hegazy 1 1 Information Systems Dept., Faculty of Computers and Information,
Database Application Developer Tools Using Static Analysis and Dynamic Profiling
Database Application Developer Tools Using Static Analysis and Dynamic Profiling Surajit Chaudhuri, Vivek Narasayya, Manoj Syamala Microsoft Research {surajitc,viveknar,manojsy}@microsoft.com Abstract
Relational Database Design: FD s & BCNF
CS145 Lecture Notes #5 Relational Database Design: FD s & BCNF Motivation Automatic translation from E/R or ODL may not produce the best relational design possible Sometimes database designers like to
Specification and Analysis of Contracts Lecture 1 Introduction
Specification and Analysis of Contracts Lecture 1 Introduction Gerardo Schneider [email protected] http://folk.uio.no/gerardo/ Department of Informatics, University of Oslo SEFM School, Oct. 27 - Nov.
Cloud Self Service Mobile Business Intelligence MAKE INFORMED DECISIONS WITH BIG DATA ANALYTICS, CLOUD BI, & SELF SERVICE MOBILITY OPTIONS
Cloud Self Service Mobile Business Intelligence MAKE INFORMED DECISIONS WITH BIG DATA ANALYTICS, CLOUD BI, & SELF SERVICE MOBILITY OPTIONS VISUALIZE DATA, DISCOVER TRENDS, SHARE FINDINGS Analysis extracts
Business Intelligence: Recent Experiences in Canada
Business Intelligence: Recent Experiences in Canada Leopoldo Bertossi Carleton University School of Computer Science Ottawa, Canada : Faculty Fellow of the IBM Center for Advanced Studies 2 Business Intelligence
Data Quality Assessment. Approach
Approach Prepared By: Sanjay Seth Data Quality Assessment Approach-Review.doc Page 1 of 15 Introduction Data quality is crucial to the success of Business Intelligence initiatives. Unless data in source
International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 Over viewing issues of data mining with highlights of data warehousing Rushabh H. Baldaniya, Prof H.J.Baldaniya,
Towards Privacy aware Big Data analytics
Towards Privacy aware Big Data analytics Pietro Colombo, Barbara Carminati, and Elena Ferrari Department of Theoretical and Applied Sciences, University of Insubria, Via Mazzini 5, 21100 - Varese, Italy
Increase Agility and Reduce Costs with a Logical Data Warehouse. February 2014
Increase Agility and Reduce Costs with a Logical Data Warehouse February 2014 Table of Contents Summary... 3 Data Virtualization & the Logical Data Warehouse... 4 What is a Logical Data Warehouse?... 4
DataFlux Data Management Studio
DataFlux Data Management Studio DataFlux Data Management Studio provides the key for true business and IT collaboration a single interface for data management tasks. A Single Point of Control for Enterprise
How To Develop Software
Software Engineering Prof. N.L. Sarda Computer Science & Engineering Indian Institute of Technology, Bombay Lecture-4 Overview of Phases (Part - II) We studied the problem definition phase, with which
Data Integration. Maurizio Lenzerini. Universitá di Roma La Sapienza
Data Integration Maurizio Lenzerini Universitá di Roma La Sapienza DASI 06: Phd School on Data and Service Integration Bertinoro, December 11 15, 2006 M. Lenzerini Data Integration DASI 06 1 / 213 Structure
Software Engineering Reference Framework
Software Engineering Reference Framework Michel Chaudron, Jan Friso Groote, Kees van Hee, Kees Hemerik, Lou Somers, Tom Verhoeff. Department of Mathematics and Computer Science Eindhoven University of
Automatic Annotation Wrapper Generation and Mining Web Database Search Result
Automatic Annotation Wrapper Generation and Mining Web Database Search Result V.Yogam 1, K.Umamaheswari 2 1 PG student, ME Software Engineering, Anna University (BIT campus), Trichy, Tamil nadu, India
Schema Refinement and Normalization
Schema Refinement and Normalization Module 5, Lectures 3 and 4 Database Management Systems, R. Ramakrishnan 1 The Evils of Redundancy Redundancy is at the root of several problems associated with relational
Dynamic Data in terms of Data Mining Streams
International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining
CHAPTER 1 INTRODUCTION
CHAPTER 1 INTRODUCTION 1. Introduction 1.1 Data Warehouse In the 1990's as organizations of scale began to need more timely data for their business, they found that traditional information systems technology
Big Data Begets Big Database Theory
Big Data Begets Big Database Theory Dan Suciu University of Washington 1 Motivation Industry analysts describe Big Data in terms of three V s: volume, velocity, variety. The data is too big to process
Reduce and manage operating costs and improve efficiency. Support better business decisions based on availability of real-time information
Data Management Solutions Horizon Software Solution s Data Management Solutions provide organisations with confidence in control of their data as they change systems and implement new solutions. Data is
