Object Fusion in Geographic Information Systems
Catriel Beeri, Yaron Kanza, Eliyahu Safra, Yehoshua Sagiv
The Selim and Rachel Benin School of Engineering and Computer Science
The Hebrew University, Jerusalem 91904
{beeri, yarok, safra,

Abstract

Given two geographic databases, a fusion algorithm should produce all pairs of corresponding objects (i.e., objects that represent the same real-world entity). Four fusion algorithms, which only use locations of objects, are described and their performance is measured in terms of recall and precision. These algorithms are designed to work even when locations are imprecise and each database represents only some of the real-world entities. Results of extensive experimentation are presented and discussed. The tests show that the performance depends on the density of the data sources and the degree of overlap among them. All four algorithms are much better than the current state of the art (i.e., the one-sided nearest-neighbor join). One of these four algorithms is best in all cases, at a cost of a small increase in the running time compared to the other algorithms.

1 Introduction

When integrating data from two heterogeneous sources, one is faced with the task of fusing distinct objects that represent the same real-world entity. This is known as the object-fusion problem. Most of the research on this problem has considered either structured (i.e., relational) or semistructured (notably XML) data (e.g., [, ]). In both cases, objects have identifiers (e.g., keys). Object fusion is easier when global identifiers are used; that is, when objects representing the same entity are guaranteed to have the same identifier. When integrating data from heterogeneous sources, however, the object-fusion problem is much harder, due to the lack of global identifiers.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004.

When integrating geographic databases, spatial (and non-spatial) properties should be used in lieu of global identifiers. Since location is the only property that is always available for spatial objects, we investigate location-based fusion for the case of two geographic databases. We assume that each database has at most one object per real-world entity and locations are given as points. Thus, the fusion is one-to-one.

Location-based fusion may seem to be an easy task, since locations could be construed as global identifiers. This is not so, however, for several reasons. First, measurements introduce errors, and the errors in different databases are independent of each other. Second, each organization has its own approach and requirements, and hence uses different measurement techniques and may record spatial properties of entities using a different scale or a different structure. For example, one organization might represent buildings as points, while another could use polygonal shapes for the same purpose. While an estimated point location can be derived from a polygonal shape, it may not agree with a point-based location in another database. A third reason could be displacements that are caused by cartographic generalizations.
The motivation for this work was hands-on experience in integrating data sources about hotels in Tel-Aviv. Two of the sources were organizations that obtained their data by different and independent means. Moreover, one organization represented buildings as points while the other as polygons. It turned out that, in obvious cases, corresponding objects were close to each other, but did not always have identical locations. When locations were not sufficiently close, hotel names could be used to determine some of the corresponding pairs. However, there were also pairs with similar, yet not identical names; and so, some cases remained unresolved.

The current state of the art is the one-sided nearest-neighbor join [9] that fuses an object from one dataset with its closest neighbor in the other dataset.
The rationale is that even in the presence of measurement errors, different objects that represent the same entity should have close locations. However, in the one-sided nearest-neighbor join, every object from one of the two datasets is matched with some object from the other dataset, even if each dataset has objects that should not be matched with any object from the other dataset. Thus, the performance is poor when the overlap between the two datasets is small. Another source of difficulty is due to the fact that in the presence of errors, in a dense dataset, the right match is not always the nearest neighbor. Consequently, one also has to consider objects that are further away. Moreover, there are mutual influences: if an object a is matched with an object b, then both cease to be candidates for possible matches with other objects, thereby increasing the likelihood of matches among the remaining objects.

In this paper, we consider the problem of finding one-to-one correspondences among objects that have point locations and belong to two geographic databases. We present several location-based algorithms, with an increasing level of sophistication, for finding corresponding objects that should be fused. We also present the results of extensive tests that illustrate the weaknesses and strengths of these algorithms, under varying assumptions about the error bounds, the density of each spatial database and the degree of overlap between these databases.

The main contribution of our work is in showing that point locations can be effectively used to find corresponding objects. Since locations are always available for spatial objects and since a point is the simplest form of representing a location, additional information (e.g., names or polygonal locations) can only enhance the strength of our location-based algorithms.

The outline of this paper is as follows. In Section 2, we introduce the framework and formally define the problem. Section 3 describes how to measure the quality of the result of a fusion algorithm. In that section, we also discuss the factors that may influence the performance of such an algorithm. Section 4 describes the fusion algorithms that we propose. The tests and their results are discussed in Section 5. In Section 6, we consider further improvements. Section 7 describes how to choose an optimal threshold value. Related work is described in Section 8, and we conclude in Section 9.

2 Object Fusion

A geographic database stores spatial objects, or objects for short. Each object represents a single real-world geographic entity. We view a geographic database as a dataset of objects, with at most one object for each real-world entity. An object has associated spatial and non-spatial attributes. Spatial attributes describe the location, height, shape and topology of an entity. Examples of non-spatial attributes are name, address, number of rooms in a hotel, etc. We assume that locations of objects are recorded as points. This is the simplest form of representing locations. More complex forms of recording locations (e.g., polygons) can be approximated by points (e.g., by computing the center of mass). The distance between two objects is the Euclidean distance between their point locations. When two geographic databases are integrated, the main task is to identify pairs of objects, one from each dataset, that represent the same entity; the objects in each pair should be fused into a single object.
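To make the setting concrete, the following is a minimal sketch (not taken from the paper) of how datasets of point objects and the Euclidean distance between objects can be represented; the class and field names are illustrative assumptions used by the later snippets as well.

```python
from dataclasses import dataclass
from math import hypot

@dataclass(frozen=True)
class SpatialObject:
    """One object of a geographic dataset: an identifier plus a point location."""
    oid: str   # identifier used only for bookkeeping, never for fusion
    x: float   # easting in meters
    y: float   # northing in meters

def distance(a: SpatialObject, b: SpatialObject) -> float:
    """Euclidean distance between the point locations of two objects."""
    return hypot(a.x - b.x, a.y - b.y)

# Two toy datasets; a fusion algorithm receives only these point locations.
A = [SpatialObject("a1", 0.0, 0.0), SpatialObject("a2", 40.0, 10.0)]
B = [SpatialObject("b1", 3.0, 2.0), SpatialObject("b2", 40.0, 55.0)]
```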
In general, a fusion algorithm may process more than two datasets and it generates fusion sets with at most one object from each dataset. A fusion set is sound if all its objects represent the same entity (but it does not necessarily contain all the objects that represent that entity). A fusion set is complete if it contains all the objects that represent some entity (but it may also contain other objects). A fusion set is correct if it is both sound and complete.

In this paper, we consider the case of two datasets and investigate the problem of finding the correct fusion sets, under the following assumptions. First, in each dataset, distinct objects represent distinct real-world entities. This is a realistic assumption, since a database represents a real-world entity as a single object. Second, only locations of objects are used to find the fusion sets. This is a practical assumption, since spatial objects always have information about their locations. As explained earlier, location-based fusion is not easy, since locations are not accurate.

We denote the two datasets as A = {a_1, ..., a_m} and B = {b_1, ..., b_n}. Two objects a ∈ A and b ∈ B are corresponding objects if they represent the same entity. A fusion set that is generated from A and B is either a singleton (i.e., contains a single object) or has two objects, one from each dataset. A fusion set {a, b} is correct if a and b are corresponding objects. A singleton fusion set {a} is correct if a does not have a corresponding object in the other dataset.

In the absence of any global key, it is not always possible to find all the correct fusion sets. We will present novel algorithms that only use locations and yet are able to find the correct fusion sets with a high degree of success. We will discuss the factors that affect the performance of these algorithms and show results of testing them.

3 Quality of Results

3.1 Measuring the Quality

Similarly to information retrieval, we measure the quality of a fusion algorithm in terms of recall and precision. Recall is the percentage of correct fusion sets that actually appear in the result (e.g., 9% of all the correct fusion sets appear in the result).
Precision is the percentage of correct fusion sets out of all the fusion sets in the result (e.g., 8% of the sets in the result are correct). Formally, let the result of a fusion algorithm have s^r fusion sets and let s^r_c of those sets be correct. Let e denote the total number of real-world entities that are represented in at least one of the two datasets. Then the precision is s^r_c / s^r and the recall is s^r_c / e. Note that applying the above definitions requires full knowledge of whether two objects represent the same entity or not. Clearly, this knowledge was available for the datasets that we used in the tests.

A similar definition of recall and precision was used in [3]. However, they considered a somewhat different problem, where the correct fusion sets are always of size two and the fusion algorithm only produces fusion sets of size two. Our definition is more general in that it takes into account the possibility that a fusion algorithm produces correct as well as incorrect singleton fusion sets.

Recall and precision could also be defined in a different way, by counting object occurrences instead of fusion sets and entities. When there are only two datasets (as in our case), there is no substantial difference between the two definitions. However, when fusion sets may have more than two objects, counting object occurrences seems to be a more suitable approach. The details of the alternative definitions are beyond the scope of this paper.

3.2 Factors Affecting Recall and Precision

One factor that influences the recall and precision is the error interval of each dataset. The error interval is a bound on the distance between an object in the dataset and the entity it represents. The density of a dataset is the number of objects per unit of area. The choice factor is the number of objects in a circle with a radius that is equal to the error interval (note that the choice factor is the product of the density and the area of that circle). Intuitively, for a given entity, the choice factor is an estimate of the number of objects in the dataset that could possibly represent that entity. It is more difficult to achieve high recall and precision when the choice factor is large. Generally, we consider a choice factor as large if it is greater than 1; otherwise, it is small. Note that the above factors need not be uniform in the geographic area that is represented by a given dataset. In the general case, these factors may have different values for different subareas.

Suppose that the datasets A and B have m and n objects, respectively. Let c be the number of corresponding objects, i.e., the number of objects in all the correct fusion sets of size 2. Note that the number of distinct entities that are represented in the two datasets is m + n − (c/2). The overlap between A and B is c/(m + n). The overlap is a measure of the fraction of objects that have a corresponding object in the other set. One of the challenges we faced was to develop an algorithm that has high recall and precision for all degrees of overlap. Note that when the two sets represent exactly the same set of entities, the overlap is maximal, i.e., 1. Another special case is when one dataset covers the second dataset, i.e., all the entities represented in the second dataset are also represented in the first dataset. We tested the performance of the algorithms on datasets with varying degrees of overlap.
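The quality measures and data characteristics defined above are straightforward to compute once the correct fusion sets are known. The sketch below is an illustrative rendering of these definitions, not code from the paper; the function names are assumptions.

```python
from math import pi

def recall_and_precision(result_sets, correct_sets, num_entities):
    """result_sets and correct_sets are collections of frozensets of object ids;
    num_entities is e, the number of entities represented in at least one dataset."""
    correct_in_result = sum(1 for s in result_sets if s in correct_sets)
    precision = correct_in_result / len(result_sets)   # s^r_c / s^r
    recall = correct_in_result / num_entities          # s^r_c / e
    return recall, precision

def choice_factor(num_objects, area_sq_m, error_interval_m):
    """Density of the dataset times the area of a circle whose radius is the error interval."""
    density = num_objects / area_sq_m
    return density * pi * error_interval_m ** 2

def overlap(m, n, c):
    """c is the number of objects that belong to correct two-element fusion sets."""
    return c / (m + n)
```

For example, 1,000 objects in a 1,350 m x 1,350 m square with a 30 m error interval give choice_factor(1000, 1350 * 1350, 30) of about 1.55, which matches the parameters of the random datasets described in Section 5.3.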
4 Finding Fusion Sets

In this section, we present methods for computing the fusion sets of two datasets, A = {a_1, ..., a_m} and B = {b_1, ..., b_n}. The methods are based on the intuition that two corresponding objects are more likely to be close to each other than two non-corresponding objects. The recall and precision of each method depend on specific characteristics of the given datasets, e.g., the choice factors of the datasets, the degree of overlap, etc. The methods are compared in Section 5.

4.1 The One-Sided Nearest-Neighbor Join

We start by describing an existing method, the one-sided nearest-neighbor join, which is commonly used in commercial geographic-information systems [9]. Given an object a ∈ A, we say that an object b ∈ B is the nearest B-neighbor of a if b is the closest object to a among all the objects in B. The one-sided nearest-neighbor join of a dataset B with a dataset A produces all fusion sets {a, b}, such that a ∈ A and b ∈ B is the nearest B-neighbor of a. Note that every a ∈ A is in one of the fusion sets, while objects of B may appear in zero, one or more fusion sets. Thus, the one-sided nearest-neighbor join is not symmetric, i.e., the result of joining B with A is not necessarily equal to the result of joining A with B.

We modify the above definition by adding to the result the singleton set {b} for every b ∈ B that is not the nearest neighbor of some a ∈ A. We do that to boost the recall of this method; otherwise, the recall could be very low.

We say that a dataset A is covered by a dataset B if every real-world entity that is represented in A is also represented in B. Recall that the one-sided nearest-neighbor join produces singleton fusion sets just for one of the two datasets, even if neither dataset covers the other one. Thus, the result will include wrong pairs and the precision will be low. In conclusion, the one-sided nearest-neighbor join is likely to produce good approximations only when one dataset is covered by the other and the choice factors of the datasets are not large.
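As a point of reference, here is a small sketch of the (modified) one-sided nearest-neighbor join described above, reusing the SpatialObject and distance helpers from the earlier sketch; it is a naive O(m*n) illustration rather than an index-based implementation.

```python
def one_sided_nn_join(A, B):
    """One-sided nearest-neighbor join of B with A, extended with singletons for
    the objects of B that are not the nearest B-neighbor of any object of A."""
    fusion_sets = []
    matched_b = set()
    for a in A:
        b = min(B, key=lambda cand: distance(a, cand))   # nearest B-neighbor of a
        fusion_sets.append({a.oid, b.oid})
        matched_b.add(b.oid)
    # Singletons for B-objects that no A-object chose (boosts recall, as noted above).
    fusion_sets.extend({b.oid} for b in B if b.oid not in matched_b)
    return fusion_sets
```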
4.2 New Methods

In the following subsections, we present three novel methods for computing fusion sets. Each method computes a confidence value for every fusion set. The confidence value indicates the likelihood that the fusion set is correct. The final result in each method is produced by choosing the fusion sets that have a confidence value that is above a given threshold value. The threshold τ (0 ≤ τ ≤ 1) is chosen by the user. Typically, increasing the threshold value will increase the precision and lower the recall, while decreasing the threshold value will increase the recall and decrease the precision. Controlling the recall and precision by means of a threshold value is especially useful when the datasets have large choice factors.

4.3 The Mutually-Nearest Method

Two objects, a ∈ A and b ∈ B, are mutually nearest if a is the nearest A-neighbor of b and b is the nearest B-neighbor of a. The intuition behind the mutually-nearest method is that corresponding objects are likely to be mutually nearest. Note that some objects of A are not in any pair of mutually-nearest objects (and, similarly, for some objects of B). It happens when the nearest B-neighbor of a ∈ A is some b ∈ B, but the nearest A-neighbor of b is different from a. In the mutually-nearest method, a two-element fusion set is created for each pair of mutually-nearest objects. A singleton fusion set is created for each object that is not in any pair of mutually-nearest objects.

Consider an object a ∈ A and let b be the nearest B-neighbor of a. The second-nearest B-neighbor of a is the object b' ∈ B such that b' is the closest object to a among all the objects in B − {b}.

We will now define the confidence values of fusion sets. First, consider a pair of mutually-nearest objects a ∈ A and b ∈ B. Let a' be the second-nearest A-neighbor of b, and let b' be the second-nearest B-neighbor of a. The confidence of the fusion set {a, b} is defined as follows.

confidence({a, b}) = 1 − distance(a, b) / min{distance(a', b), distance(a, b')}

That is, the confidence is the complement to one of the ratio of the distance between the two objects to the minimum among the distances between each object and its second-nearest neighbor.

Now consider a fusion set with one element a ∈ A, i.e., a is not in any pair of mutually-nearest objects. Let b be the nearest B-neighbor of a and let a' be the nearest A-neighbor of b. The confidence of the fusion set {a} is defined as follows.

confidence({a}) = 1 − distance(a', b) / distance(a, b)

That is, the complement to one of the ratio of the distance between a' and b to the distance between a and b. Note that the confidence cannot be negative, since that would imply that a, rather than a', is the nearest A-neighbor of b. The confidence of a fusion set with a single element b ∈ B is defined similarly.

The above formulas are used to compute the confidence value, except when the distance upper bound β rules out the possibility that a has a corresponding object in the other set. The parameter β should be equal to the sum of all possible errors in the locations of objects. Thus, if for all b ∈ B it holds that distance(a, b) > β, then we define confidence({a}) = 1, whereas for all b ∈ B, we define confidence({a, b}) = 0.

A threshold can be used to increase the precision of the result by choosing only those fusion sets that have a confidence value above the threshold. Consequently, some objects from the given datasets may not be in the result.
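The following sketch, under the same assumptions as the earlier snippets (the helper names are mine, and the datasets are small enough for brute-force nearest-neighbor search), illustrates how the confidence values of the mutually-nearest method can be computed; thresholding the returned confidences against τ is left to the caller.

```python
def nearest_two(obj, others):
    """Return the nearest and second-nearest neighbors of obj among `others`."""
    ranked = sorted(others, key=lambda o: distance(obj, o))
    return ranked[0], ranked[1]

def mutually_nearest_fusion(A, B, beta):
    """Confidence values of the mutually-nearest method; beta is the distance upper bound."""
    nearest_b = {a.oid: min(B, key=lambda b: distance(a, b)) for a in A}
    nearest_a = {b.oid: min(A, key=lambda a: distance(a, b)) for b in B}
    confidences = {}
    paired = set()
    for a in A:
        b = nearest_b[a.oid]
        if nearest_a[b.oid].oid != a.oid:
            continue                               # a and b are not mutually nearest
        _, a2 = nearest_two(b, A)                  # second-nearest A-neighbor of b
        _, b2 = nearest_two(a, B)                  # second-nearest B-neighbor of a
        if distance(a, b) > beta:                  # too far to be corresponding objects
            conf = 0.0
        else:
            conf = 1 - distance(a, b) / min(distance(a2, b), distance(a, b2))
        confidences[frozenset({a.oid, b.oid})] = conf
        paired |= {a.oid, b.oid}
    for a in A:                                    # singleton sets for unpaired A-objects
        if a.oid in paired:
            continue
        b = nearest_b[a.oid]
        if distance(a, b) > beta:                  # no B-object within the error bound
            confidences[frozenset({a.oid})] = 1.0
        else:
            a1 = nearest_a[b.oid]                  # nearest A-neighbor of b (differs from a)
            confidences[frozenset({a.oid})] = 1 - distance(a1, b) / distance(a, b)
    # Singletons for unpaired B-objects are computed symmetrically (omitted for brevity).
    return confidences
```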
A less restrictive approach is to discard two-element fusion sets with a confidence value below the threshold, but to add their elements as singletons.

The main advantage of the mutually-nearest method over the traditional one-sided nearest-neighbor join is lower sensitivity to the degree of overlap between the two datasets. In particular, it may perform well even when neither dataset is covered by the other one. However, it does not perform well when the choice factors are large, since it only takes into account the nearest neighbor and the second-nearest neighbor, while ignoring other neighbors that might be almost as close. In situations characterized by large choice factors, it is likely that an object a will not be the nearest neighbor of its corresponding object b. If this is indeed the case, the pair a and b will not be in the result even if the distance between a and b is only slightly greater than the distance between b and its nearest neighbor.

4.4 The Probabilistic Method

The probabilistic method was devised to perform well when the choice factors are large. In such situations, it is not enough to consider only the nearest and second-nearest B-neighbors of an object a ∈ A, since there could be several objects in B that are close to a. In the probabilistic method, the confidence of a fusion set {a, b} depends on the probability that b is the object that corresponds to a, and that probability depends inversely on the distance between a and b.

Consider two datasets A = {a_1, ..., a_m} and B = {b_1, ..., b_n}. For each object a ∈ A, we will define the function P_a : B → [0, 1] that gives the probability that a chooses b ∈ B. Similarly, for each b ∈ B, we will define the probability function P_b : A → [0, 1]. Formally, the probability function P_{a_i} is defined, as shown below, in terms of the distance, the distance decay factor α > 0 and the distance upper bound β.
P_{a_i}(b_j) = distance(a_i, b_j)^(−α) / Σ_{k=1}^{n} distance(a_i, b_k)^(−α)      (1)

The probability function P_{b_j} is defined similarly. Although β is not shown explicitly in the above formula, it has the following effect. If distance(a_i, b_j) > β, then distance(a_i, b_j) is taken to be infinity, i.e., distance(a_i, b_j)^(−α) = 0. Recall that the distance upper bound should be equal to the sum of all possible errors in the locations of objects. Thus, if distance(a_i, b_j) > β, then a_i and b_j are not likely to be corresponding objects and setting distance(a_i, b_j)^(−α) to 0 is justified. Note that the denominator of Formula (1) is 0 if the distance between a_i and every object of B is greater than β. In this case, P_{a_i}(b_j) is defined to be 0.

Due to the numerator in the above formula (and the fact that α > 0), the probability that a_i chooses b_j increases when the distance between a_i and b_j decreases. Thus, a_i chooses its nearest B-neighbor with the highest probability. The parameter α determines the rate of decrease in the probability as the distance increases. We performed tests with different values for α and the best results were obtained for α = 2. Due to the denominator in the above formula, the probability that a_i chooses one of the b_j's is 1, since Σ_{j=1}^{n} P_{a_i}(b_j) = 1 (unless a_i does not choose any b_j, which happens when the distance between a_i and every b_j ∈ B is greater than β and hence P_{a_i}(b_j) = 0 for every b_j ∈ B).

The confidence of the fusion set {a_i, b_j} is defined as follows.

confidence({a_i, b_j}) = √( P_{a_i}(b_j) · P_{b_j}(a_i) )

Note that P_{a_i}(b_j) · P_{b_j}(a_i) is the probability that a_i chooses b_j and b_j chooses a_i, i.e., the probability that a_i and b_j are corresponding objects. The confidence is defined as the square root of that probability in order that it will not be too small. The confidence of a fusion set that includes a single object a_i is defined as follows.

confidence({a_i}) = 1 − Σ_{k=1}^{n} √( P_{a_i}(b_k) · P_{b_k}(a_i) )

The rationale for this formula is that the sum of confidences over all fusion sets that contain a_i should be equal to 1. Note that the singleton {a_i} is given a confidence value of 1 if the distance between a_i and every object of B is greater than the distance upper bound β. Also note that in some rare cases the above formula may give a value that is less than 0 and, in such cases, we define confidence({a_i}) = 0. The result of the probabilistic method consists of all fusion sets that have confidence values above the threshold τ.

Figure 1: Adding a new object to a dataset. (a) Objects a_1, b_1 and b_2. (b) Object a_2 is added.

When the choice factors are large, the probabilistic method performs better than either the mutually-nearest method or the one-sided nearest-neighbor join. The reason for that is that the probabilistic method assigns a confidence value to every pair of objects. Another advantage of the probabilistic method is the ability to increase the recall by lowering the threshold. Doing so, however, may cause an object to be in more than one fusion set.

In the probabilistic method, the recall is low when the overlap between the two datasets is small. In such a situation, most of the singleton fusion sets, both the correct ones and the incorrect ones, get similar confidence values. Thus, lowering the threshold would introduce into the result many incorrect singletons, while at the same time many correct ones would still remain below the threshold (unless the threshold becomes very low).
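A compact sketch of the probabilistic confidences just defined is shown below, again reusing the earlier helpers; the default values of alpha and beta are only illustrative, and the code assumes that distinct objects have distinct locations. The result is obtained by keeping the fusion sets whose confidence exceeds the threshold τ.

```python
from math import sqrt

def choose_probabilities(obj, others, alpha, beta):
    """P_obj over `others`: inverse-distance weights with decay alpha; candidates
    farther than beta get weight zero, matching Formula (1)."""
    weights = [distance(obj, o) ** -alpha if distance(obj, o) <= beta else 0.0
               for o in others]
    total = sum(weights)
    if total == 0.0:                       # no candidate within the distance upper bound
        return [0.0] * len(others)
    return [w / total for w in weights]

def probabilistic_fusion(A, B, alpha=2.0, beta=60.0):
    """Confidence values of the probabilistic method for pairs and for A-singletons."""
    P_a = {a.oid: choose_probabilities(a, B, alpha, beta) for a in A}
    P_b = {b.oid: choose_probabilities(b, A, alpha, beta) for b in B}
    confidences = {}
    for i, a in enumerate(A):
        pair_confs = []
        for j, b in enumerate(B):
            c = sqrt(P_a[a.oid][j] * P_b[b.oid][i])
            confidences[frozenset({a.oid, b.oid})] = c
            pair_confs.append(c)
        # One minus the sum of the pair confidences, clamped at zero as in the text.
        confidences[frozenset({a.oid})] = max(0.0, 1.0 - sum(pair_confs))
    # Singleton confidences for the objects of B are defined symmetrically.
    return confidences
```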
If the threshold is high, it is likely that many correct singletons would be missing from the result. In conclusion, when the overlap is small, the probabilistic method does not handle correctly objects that should be in singletons. Therefore, this method should only be used when the overlap between the datasets is large.

4.5 The Normalized-Weights Method

The normalized-weights method is a variation of the probabilistic method and it performs better when the overlap is small or medium. As in the probabilistic method, weights (i.e., confidence values) are given to each (correct or incorrect) fusion set, based on the same probability functions P_a and P_b that were defined in the previous section (see Equation 1). The initial weights are normalized by an iterative algorithm that will be described in this section. This algorithm has an effect of mutual influence between pairs of objects, as illustrated in the next example.

Example 4.1 Consider the objects a_1, b_1 and b_2 in Figure 1(a). We assume that a_1 is an object in the dataset A while b_1 and b_2 are objects in the dataset B. Let the distance between a_1 and b_1 be 1 and let the distance between a_1 and b_2 be 5. In the probabilistic method, assuming a distance decay factor of one (i.e., α = 1), the probability that b_1 will choose a_1 is 1, since a_1 is the only object of A. For a_1, the probabilities to choose b_1 and b_2 are 5/6 and 1/6, respectively.
Thus, the confidence value that is given to the fusion set {a_1, b_1} is √(P_{a_1}(b_1) · P_{b_1}(a_1)) = √(5/6).

In Figure 1(b), the object a_2 is added to the dataset A. The distance between a_2 and b_2 is 1 and the distance between a_2 and b_1 is 5. In the probabilistic method, after adding a_2, the confidence of the set {a_1, b_1} is √(P_{a_1}(b_1) · P_{b_1}(a_1)) = √(5/6 · 5/6) = 5/6 < √(5/6). Thus, the addition of a_2 has reduced the confidence of the fusion set {a_1, b_1}. This reduction is caused by the fact that after the addition of a_2, object b_1 could potentially be the corresponding object of either a_1 or a_2, whereas before the addition of a_2, only the correspondence between b_1 and a_1 was possible. But the addition of a_2 also has an effect of increasing the likelihood of a correspondence between b_1 and a_1, because b_2 is now more likely to correspond to a_2 than to a_1 and, hence, the confidence of the fusion set {a_1, b_1} should increase relatively to the confidence of the fusion set {a_1, b_2}. In summary, the addition of a_2 has two opposite effects: one is an increase in the confidence of the fusion set {a_1, b_1} and the other is a decrease of that confidence. The probabilistic method is only capable of capturing the second effect, whereas the normalized-weights method captures both.

Next, we will describe the normalized-weights method. Consider two datasets A = {a_1, ..., a_m} and B = {b_1, ..., b_n}. The probability functions P_{a_i}(b_j) and P_{b_j}(a_i) are defined as in Section 4.4. The matchings matrix M is an (m + 1) × (n + 1) matrix, such that the element in row i and column j, denoted µ_ij, is defined as follows.

µ_ij = P_{a_i}(b_j) · P_{b_j}(a_i)           if 1 ≤ i ≤ m and 1 ≤ j ≤ n
µ_ij = Π_{k=1}^{n} (1 − P_{b_k}(a_i))        if 1 ≤ i ≤ m and j = n + 1
µ_ij = Π_{k=1}^{m} (1 − P_{a_k}(b_j))        if i = m + 1 and 1 ≤ j ≤ n
µ_ij = 0                                     if i = m + 1 and j = n + 1

Note that in the first case (i.e., the top line) of the above definition, µ_ij is assigned the probability that a_i and b_j mutually choose each other. In each row i, the element in the last column (j = n + 1) gives the probability that a_i is not chosen by any b ∈ B. Similarly, in each column j, the element in the last row (i = m + 1) gives the probability that b_j is not chosen by any a ∈ A.

A row (or a column) r is normalized if s = 1, where s is the sum of all the elements of r. We can always normalize r by dividing each element by s (since s > 0). The normalization algorithm is a sequence of iterations over M, such that in each iteration, the first m rows and the first n columns are normalized one by one in some order. Note that the elements of the last column are changed when the rows are normalized, but the last column itself is not normalized; similarly for the last row.

Let M^(0) denote the matrix M, as it was defined above, and let M^(k) denote the matrix after k iterations. It was shown by Sinkhorn [14, 15] that the normalization algorithm converges and the result does not depend upon the order of normalizing rows and columns in each iteration. We terminate the normalization algorithm when the sum of each row and each column, except for the last row and the last column, is different from 1 by no more than some very small ε > 0 (in the tests, ε was set to a small fixed value). Let M^(t) denote the matrix upon termination of the normalization algorithm. The confidence of the fusion set {a_i, b_j} is the value in row i and column j of M^(t). The confidence of the fusion set {a_i} is the value in row i and column n + 1 of M^(t). The confidence of the fusion set {b_j} is the value in column j and row m + 1 of M^(t).
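The normalization loop can be sketched as follows, building the matchings matrix from the choose_probabilities helper of the previous snippet; the use of numpy, the tolerance and the iteration cap are assumptions of this illustration.

```python
import numpy as np

def normalized_weights(A, B, alpha=2.0, beta=60.0, eps=1e-3, max_iter=1000):
    """Sketch of the normalized-weights method: build the (m+1) x (n+1) matchings
    matrix and repeatedly normalize its first m rows and first n columns."""
    m, n = len(A), len(B)
    P_a = np.array([choose_probabilities(a, B, alpha, beta) for a in A])  # m x n
    P_b = np.array([choose_probabilities(b, A, alpha, beta) for b in B])  # n x m
    M = np.zeros((m + 1, n + 1))
    M[:m, :n] = P_a * P_b.T                     # probability of a mutual choice
    M[:m, n] = np.prod(1.0 - P_b.T, axis=1)     # a_i is chosen by no object of B
    M[m, :n] = np.prod(1.0 - P_a, axis=0)       # b_j is chosen by no object of A
    for _ in range(max_iter):
        for i in range(m):                      # normalize the first m rows
            M[i, :] /= M[i, :].sum()
        for j in range(n):                      # normalize the first n columns
            M[:, j] /= M[:, j].sum()
        row_err = np.abs(M[:m, :].sum(axis=1) - 1.0).max()
        col_err = np.abs(M[:, :n].sum(axis=0) - 1.0).max()
        if max(row_err, col_err) <= eps:        # all sums within eps of 1
            break
    # M[i, j] is the confidence of {a_i, b_j}; the last column and row hold singletons.
    return M
```

Section 6 refines this sketch by also scaling the last row and the last column toward n − c and m − c when an estimate of the number c of corresponding pairs is available.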
Given a threshold τ, the result of the normalized-weights method consists of all the fusion sets with confidence values above the threshold τ.

The normalized-weights method gives good results when the overlap between the datasets is medium or small. However, the results are not as good as those of the probabilistic method when the overlap is large, because the weights that are assigned to singletons (i.e., the weights in the last row and in the last column) are not normalized. Consequently, these weights remain rather large even when they should be almost zero (i.e., when the overlap is large). In Section 6, we discuss how to improve this method by normalizing all the rows and columns of M.

4.6 Tuning the Methods

The performance of the three new methods, which were introduced in the previous sections, can be tuned by choosing appropriate values for the threshold, the distance decay factor and the distance upper bound.

The threshold τ controls the precision and recall. Increasing the threshold increases the precision while decreasing the threshold increases the recall.

The distance decay factor α controls the effect of distant neighbors in either the probabilistic method or the normalized-weights method. In either method, the probability that an object a chooses some object from the other set is split among all the objects of the other set. This probability might be split among many objects, including far away objects that are really no more than background noise. By choosing α > 1, the effect of far away objects decreases exponentially. Thus, increasing α eliminates background noise. However, increasing α too much can cause these methods to act like the mutually-nearest method, i.e., pairs of objects that are not mutually nearest will be ignored.

The distance upper bound β determines which two-element fusion sets {a_i, b_j} are a priori deemed incorrect, because the distance between a_i and b_j is too far. In practice, the value of β should be an upper-bound estimate for the sum of all possible errors in the locations of objects.
5 Testing and Comparing the Methods

We tested the different methods using real-world datasets as well as randomly-generated datasets. The two real-world datasets describe hotels and other tourist attractions in Tel-Aviv. One dataset, called MUNI, was extracted from a digital map that was made by the City of Tel-Aviv. The second dataset, called MAPA, was extracted from a digital map that was made by the Mapa Corp. The randomly-generated datasets were created using a random-dataset generator that we implemented.

5.1 Results for Real-World Datasets

In the MUNI and MAPA datasets, each object has a name, although in MUNI the names are in English and in MAPA the names are in Hebrew. The correctness of the results was checked using the specified names of the objects. However, only locations were used for computing the fusion sets. The MAPA dataset has 86 objects and the MUNI dataset has 73 objects. A total of 137 real-world entities are represented in these datasets and 22 entities appear in both datasets. The mutually-nearest method was tested with a threshold τ = 0. The probabilistic and the normalized-weights methods were tested with an optimal threshold value (see Section 7). The distance upper bound β was 5 meters and the distance decay factor α was 2. There were 4 objects in MAPA and 6 objects in MUNI without any neighbor in the other dataset at a distance of β or less.

The two datasets are depicted in Figure 2. The performance results are given in Table 1. Note that the table describes both the one-sided nearest-neighbor join of MUNI with MAPA (first column) and the one-sided nearest-neighbor join of MAPA with MUNI (second column). The one-sided nearest-neighbor join performed much worse than the other methods, while the normalized-weights method had the best performance.

Figure 2: Hotels and tourist attractions in Tel-Aviv (the objects of MUNI are depicted by + and the objects of MAPA by a different marker).

5.2 Random-Dataset Generator

There are not sufficiently many real-world datasets to test our algorithms under varying degrees of density and overlap. Moreover, in real-world datasets, it is not always possible to determine accurately the correspondence between objects and real-world entities. Thus, we implemented a random-dataset generator, which is a two-step process. First, the real-world entities are randomly generated. Second, the objects in each dataset are randomly generated, independently of the objects in the other dataset.

For the first step, the user specifies the coordinates of a square area, the number of entities in that area and the minimal distance between entities. The generator randomly chooses locations for the entities, according to a uniform distribution. The generator verifies that entities are not too close to each other, i.e., the distance between entities is greater than the minimal distance specified by the user. This is done to enforce realistic constraints; for example, the distance between two buildings cannot be just a few centimeters.

For the second step, the user specifies the number of objects in the datasets and the error interval. The creation of a random object has two aspects. First, the object is randomly associated with one of the real-world entities that were created in the first step (at most one object, in each dataset, corresponds to any given entity).
Second, the location of the object is randomly chosen, according to the normal distribution, in a circle with a center that is equal to the location of the corresponding entity and a radius that is equal to the error interval. Note that the two datasets have an equal number of objects. However, in each dataset, the association of objects with entities and the locations of objects are chosen randomly and independently of the other dataset.

By changing the number of objects, it is easy to control the overlap between the two randomly generated datasets. For example, if each dataset has 500 objects while the number of entities is 500, then there is a complete overlap. However, if there are 5,000 entities, then the overlap between the two datasets is very small, with a very high probability.

5.3 Results for Random Datasets

We have experimented with many randomly-generated datasets, but in this section we only describe a few tests that demonstrate the main conclusions about the performance of each method.
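A possible reading of this two-step generator is sketched below; the exact sampling scheme (in particular how the normal perturbation is confined to the error circle), the function names and the example parameters are assumptions of this illustration.

```python
import random
from math import cos, sin, pi

def generate_entities(num_entities, side, min_dist):
    """Step 1: entity locations drawn uniformly in a side x side square, rejecting
    candidates that fall closer than min_dist to an already accepted entity."""
    entities = []
    while len(entities) < num_entities:
        p = (random.uniform(0, side), random.uniform(0, side))
        if all((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 >= min_dist ** 2 for q in entities):
            entities.append(p)
    return entities

def generate_dataset(entities, num_objects, error_interval, prefix):
    """Step 2: associate objects with distinct entities and perturb each location by
    a (truncated) normal offset that stays within the error-interval circle."""
    chosen = random.sample(range(len(entities)), num_objects)
    objects = []
    for k, idx in enumerate(chosen):
        ex, ey = entities[idx]
        r = min(abs(random.gauss(0.0, error_interval / 3.0)), error_interval)
        theta = random.uniform(0.0, 2.0 * pi)
        objects.append(SpatialObject(f"{prefix}{k}", ex + r * cos(theta), ey + r * sin(theta)))
    return objects

# For example, one pair of datasets with a medium overlap:
# entities = generate_entities(1000, 1350.0, 50.0)
# A = generate_dataset(entities, 500, 30.0, "a")
# B = generate_dataset(entities, 500, 30.0, "b")
```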
Table 1: Performance of the fusion algorithms for the datasets of hotels and tourist attractions in Tel-Aviv (result size, recall and precision of the one-sided nearest-neighbor join in each direction, the mutually-nearest method, the probabilistic method and the normalized-weights method).

Figure 3: A visual view of the random pairs of datasets, with 100, 500 and 1,000 objects. (a) 100 objects in each dataset; (b) 500 objects in each dataset; (c) 1,000 objects in each dataset; (d) an enlarged fragment of a pair of datasets.

The following parameters were given to the random-dataset generator: a square area of 1,350 × 1,350 meters, 1,000 entities, a minimal distance of 5 meters between entities, and an error interval of 30 meters for each dataset. The above parameters imply that if a dataset represents all 1,000 entities, then the choice factor is 1.55 and each entity occupies, on the average, an area of 1,822 sq m.

In the tests described below, the 1,000 real-world entities were created once and then three pairs of datasets were randomly generated (see Figure 3). The three pairs had 100, 500 and 1,000 objects in each dataset, respectively. Thus, each pair had a different degree of overlap and density (1,000 objects in each dataset means a complete overlap). To get a sense of the difficulty in finding the correct fusion sets, an enlarged fragment of the pair of datasets from Figure 3(c) is depicted in Figure 3(d), where an entity identifier is attached to each object.

The recall and precision of each method were measured for every pair of datasets. In all the methods, the distance upper bound β was 60 meters and the error interval was 30 meters. In the mutually-nearest method, the threshold τ was 0, since it gave the highest or close to the highest recall. For the datasets with 100 objects, the recall and the precision were 8 and .8, respectively; for 500 objects, they were .72 and .77, respectively; and for 1,000 objects, they were .74 and 3, respectively.
Figure 4: Recall (dotted line) and precision (solid line) as a function of the threshold value. Panels (a)-(d): the probabilistic, normalized-weights, improved normalized-weights and combined methods with 100 objects in each dataset; panels (e)-(h): the same methods with 500 objects; panels (i)-(l): the same methods with 1,000 objects.

For the probabilistic method and the normalized-weights method (as well as for two additional methods that are explained in Section 6), Figure 4 presents the recall and the precision as a function of the threshold. Tests were made with different values for the distance decay factor (i.e., α = 0.5, 1, 2, 3). We only show the results for α = 2, since this choice was the best for both methods under all circumstances.

The main conclusions from the tests are as follows. When the overlap between the datasets is small, the mutually-nearest and the normalized-weights methods give good results. The probabilistic method does not perform well in this case, because it does not find the correct singletons and instead creates incorrect pairs. For a medium overlap, the best results are given by the normalized-weights method. This is demonstrated in the case of 500 objects in each dataset. When the overlap is large, the best results are given by the probabilistic method. The normalized-weights method does not perform well in this case, because singletons get weights (i.e., confidence values) that are too large.

We will now give a detailed analysis of the graphs in Figure 4. Essentially, a method performs well if it assigns low confidence values to incorrect fusion sets and high confidence values to correct ones. Most of the graphs in Figure 4, except for 4(a), 4(b) and 4(j), exhibit this phenomenon. In the low range of threshold values, a small increase of the threshold value eliminates many incorrect fusion sets and only a few correct ones. Thus, the precision rises sharply while the recall declines slowly. In the high range of threshold values, a small increase of the threshold value eliminates many correct fusion sets and only a few incorrect ones. Thus, the precision rises slowly while the recall declines sharply. Also note that the precision and the recall change more sharply as the overlap grows. A bigger overlap means a bigger density, which causes more pairs to get confidence values that are close to each other.

Figures 4(a) and 4(b) show the performance of the probabilistic and the normalized-weights methods in the case of a small overlap.
In that case, many correct singletons get low confidence values while many incorrect pairs get high confidence values. Thus, when a low threshold value is increased, many correct singletons are eliminated and the recall declines sharply. The precision remains about the same, since many incorrect fusion sets are also eliminated. In the high range of threshold values, a small increase of the threshold value eliminates many incorrect pairs and the precision rises sharply while the recall remains about the same. In the medium range of threshold values, there is very little change in the recall and the precision, because the confidence values are either very low or very high, as a result of the low density (i.e., an object has only a few close neighbors in the other dataset).

Figure 4(j) shows the performance of the normalized-weights method in the case of a complete overlap (i.e., there are no correct singletons). In that method, the last row and the last column of the matching matrix are not normalized and, consequently, many incorrect singletons get high confidence values, regardless of the degree of overlap. Thus, the precision rises sharply only in the high range of threshold values.

6 Complete Normalization

In this section, we will show how knowledge about the degree of overlap between the datasets can improve the normalized-weights method.

6.1 The Normalization Value

Ideally, the normalized-weights method should give to singletons high confidence values when the overlap is large and low confidence values when the overlap is small. However, since the last row and the last column of the matching matrix are not normalized, the confidence values of singletons tend to remain high in any case. Recall that each entry in those row and column gives the probability that some object does not have a corresponding object in the other set. Thus, those row and column should not be normalized to 1.

Suppose that the datasets A and B have m and n objects, respectively, and c is the number of real-world entities that are represented in both datasets. Thus, m − c objects of A and n − c objects of B should be in singleton fusion sets. Ideally, an entry in either the last row or the last column should be 1 if its corresponding object is in a correct singleton, and should be 0 otherwise. Consequently, we improve the normalization algorithm by normalizing the last row to n − c and the last column to m − c. Note that if n − c = 0 (or m − c = 0), then we just set all the elements in the last row (or last column) to 0 and need not normalize this row (or column) anymore.

The improved normalized-weights method always performs better than all the other methods, as shown in Figures 4(c), 4(g) and 4(k). Note that in these three tests, the correct value of c was used.

6.2 Estimating the Overlap

The improved normalized-weights method requires a good estimate for the number c of correct pairs of corresponding objects. If the association of objects with entities in each dataset is independent of the other dataset, then an approximation of c is given by

c ≈ (m/e) · (n/e) · e = mn/e

where e is the number of real-world entities and the two datasets have m and n objects, respectively. If e is not known, then a less accurate approximation of c can be obtained by replacing e with the number of pairs that are produced by the mutually-nearest method.

We conducted a series of tests with the following parameters. The number of entities ranged from 500 to 1,000 in increments of 100. The size of the datasets ranged from 50 to 500 in increments of 50.
The minimal distance between entities was either 5, 25 or 35 meters. The conclusion from these tests is that c can be approximated by c ≈ .2p e, where p is the number of pairs that are produced by the mutually-nearest method and e is the number of entities. In particular, when p is approximately e/2, then c is also approximately e/2. The above formula suggests that p is a good estimate for c. Thus, we also tested the combined method that applies the improved normalized-weights method with p as the estimate for c. The results for the random datasets of Section 5.3 are shown in Figures 4(d), 4(h) and 4(l). These results are almost as good as the results for the improved normalized-weights method, using the correct value of c.

7 Choosing the Threshold Value

For the mutually-nearest method, a threshold value of zero will give the highest recall with a good precision. For the other methods, choosing a good threshold value is more complicated. A threshold value of 0.5 is a reasonable choice, since it usually gives results with good recall and precision, and without duplicates (i.e., each object appears in at most one fusion set). Note that if the threshold is just above 0.5, then there cannot be duplicates.

An optimal threshold gives the best combination of recall and precision. Formally, an optimal threshold can be defined as the one that gives the largest geometric average of the recall and the precision. In Figure 4, this is usually the threshold at the intersection of the graphs of the recall and the precision. We will now describe how to estimate the optimal threshold without a priori knowledge of the correct fusion sets. The idea is to minimize two types of errors.
A missing object is an object that does not appear in any fusion set. A duplicate object is an object that appears in more than one fusion set. The error count is the number of missing objects plus the number of duplicate occurrences of objects (if an object appears in k fusion sets, then it is counted k times). The threshold value should be chosen so that the error count is minimal.

Figure 5 shows the error count as a function of the threshold for two methods, using the random datasets of Section 5.3 (the graphs for the other two methods are similar). It can be seen that a threshold value that minimizes the error count is close to the optimal threshold. This was also confirmed for the real-world datasets, MUNI and MAPA. For the probabilistic method, the optimal threshold was 0.3 and the threshold that minimized the error count was different. For these two threshold values, there was only a small difference (0.796 vs. 0.784) between the geometric averages of the recall and the precision. For the normalized-weights method, the optimal threshold was 0.3 and the threshold that minimized the error count was 0.5, while the geometric averages were about the same (0.873 and 0.868). Thus, for a real pair of datasets, the threshold value can be chosen by trying several different values and choosing the one that minimizes the error count.

Figure 5: The error count (shown by a dashed line), recall and precision as a function of the threshold. Panels (a), (c), (e): the probabilistic method with 100, 500 and 1,000 objects; panels (b), (d), (f): the improved normalized-weights method with 100, 500 and 1,000 objects.

8 Related Work

Map conflation is the process of producing a new map by integrating two existing digital maps, and it has been studied extensively in the last twenty years. Map conflation starts by choosing some anchors, i.e., pairs of points that represent the same location. A triangular planar subdivision of the datasets with respect to the anchors (using Delaunay triangulation) is performed and a rubber-sheeting transformation is applied to each subdivision [4, 5, 6, 11, 12]. Map conflation cannot be applied without initially finding some anchors. Thus, our methods can be used as part of this process. Unlike map conflation, which is a complicated process, our algorithms are most suitable for the task of online integration of geographic databases using mediators (e.g., [2, 17]).

A feature-based approach to conflation, which is similar to object fusion, has been studied in [13]. Their approach is different from ours in that they find corresponding objects based on both topological similarity and textual similarity in non-spatial attributes. Once some corresponding objects have been found, they construct some graphs and use graph similarity to find more corresponding objects. In [3], topological similarity is used to find corresponding objects, while in [7, 8, 16] ontologies are used for that purpose. The problem of how to fuse objects, rather than how to find fusion sets, was studied in [10]. Thus, their work complements ours.

9 Conclusion and Future Work

The novelty of our approach is in developing efficient algorithms that find fusion sets with high recall and precision, using only locations of objects. The mutually-nearest method is an improvement of the existing one-sided nearest-neighbor join and it achieves substantially better results. We also developed a completely new approach, based on a probabilistic model. We presented several algorithms that use this model.
The improved normalized-weights method achieves the best results under all circumstances. This method combines the probabilistic model with a normalization algorithm and information about the overlap between the datasets. We showed that the mutually-nearest method provides a good estimate for the degree of overlap.
We also showed how to tune the methods and how to choose an optimal threshold value.

Several interesting problems remain for future work. One problem is to deal with situations in which an object from one set could correspond to many objects in the other set. This may happen, for example, when one dataset represents shopping malls while the other dataset represents shops. A second problem is to develop fusion algorithms for more than two datasets. This problem is similar to the previous one in that both require finding fusion sets that have several, rather than just two, objects. In principle, this problem can be solved by a sequence of steps, such that only two datasets are involved in each step (where one of those two datasets is the result of the previous step). However, it is not clear whether a long sequence can still give good recall and precision. A third problem is how to utilize locations that are given as polygons or lines, rather than just points. A fourth research direction is to combine our approach with other approaches, such as the feature-based approach of [13], topological similarity (e.g., [3]) or ontologies (e.g., [7, 8, 16]).

Acknowledgments

We thank Michael Ben-Or for pointing out to us the work of Sinkhorn [14, 15]. We also thank the anonymous referees for helpful suggestions. This research was supported in part by a grant from the Israel Ministry of Science and by the Israel Science Foundation (Grant No. 96/).

References

[1] B. Amann, C. Beeri, I. Fundulaki, and M. Scholl. Ontology-based integration of XML Web resources. In Proceedings of the First International Semantic Web Conference, Sardinia (Italy), 2002.
[2] O. Boucelma, M. Essid, and Z. Lacroix. A WFS-based mediation system for GIS interoperability. In Proceedings of the 10th ACM International Symposium on Advances in Geographic Information Systems, pages 23-28, 2002.
[3] T. Bruns and M. Egenhofer. Similarity of spatial scenes. In Proceedings of the 7th International Symposium on Spatial Data Handling, pages 3-42, Delft (Netherlands), 1996.
[4] M. A. Cobb, M. J. Chung, H. Foley, F. E. Petry, and K. B. Shaw. A rule-based approach for conflation of attributed vector data. GeoInformatica, 2(1):7-33, 1998.
[5] Y. Doytsher and S. Filin. The detection of corresponding objects in a linear-based map conflation. Surveying and Land Information Systems, 60(2):7-28, 2000.
[6] Y. Doytsher, S. Filin, and E. Ezra. Transformation of datasets in a linear-based map conflation framework. Surveying and Land Information Systems, 61(3):59-69, 2001.
[7] F. T. Fonseca and M. J. Egenhofer. Ontology-driven geographic information systems. In Proceedings of the 7th ACM International Symposium on Advances in Geographic Information Systems, pages 14-19, Kansas City (Missouri, US), 1999.
[8] F. T. Fonseca, M. J. Egenhofer, and P. Agouris. Using ontologies for integrated geographic information systems. Transactions in GIS, 6(3), 2002.
[9] M. Minami. Using ArcMap. Environmental Systems Research Institute, Inc., 2000.
[10] Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object fusion in mediator systems. In Proceedings of the 22nd International Conference on Very Large Databases, 1996.
[11] B. Rosen and A. Saalfeld. Match criteria for automatic alignment. In Proceedings of the 7th International Symposium on Computer-Assisted Cartography (Auto-Carto 7), 1985.
[12] A. Saalfeld. Conflation - automated map compilation. International Journal of Geographical Information Systems, 2(3):217-228, 1988.
[13] A. Samal, S. Seth, and K. Cueto.
A feature-based approach to conflation of geospatial sources. International Journal of Geographical Information Science, 18(1), 2004.
[14] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876-879, 1964.
[15] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402-405, 1967.
[16] H. Uitermark, P. V. Oosterom, N. Mars, and M. Molenaar. Ontology-based geographic data set integration. In Proceedings of the Workshop on Spatio-Temporal Database Management, pages 6-79, Edinburgh (Scotland), 1999.
[17] G. Wiederhold. Mediation to deal with heterogeneous data sources. In Interoperating Geographic Information Systems, 1999.
