EFFICIENT CLUSTERING OF VERY LARGE DOCUMENT COLLECTIONS


Chapter 1

Inderjit S. Dhillon, James Fan and Yuqiang Guan

Abstract

An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.

Keywords: clustering, large document collections, hash tables, spherical k-means, vector space model

1. Introduction

Large collections of documents are becoming increasingly common. The public internet currently has more than 1.5 billion web pages, while private intranets also contain an abundance of text data. A vast amount of important scientific data appears as technical abstracts and papers. Given such large document collections it is important to organize them into structured ontologies. This organization facilitates navigation and search, and at the same time provides a framework for continual maintenance as document repositories grow in size.

Manual construction of structured ontologies is one possible solution and has been adopted to organize the internet and to structure library content. However, this process has the obvious disadvantage of being too labor intensive, and is viable only in large corporations. Thus it is desirable to seek automatic methods for organizing unlabeled document collections. Given a collection of unlabeled data points, clustering refers to the problem of automatically assigning class labels to the data and has been widely studied in statistical pattern recognition and machine learning [DH73, Mit97].

A starting point for applying clustering algorithms to unstructured text data is to create a vector space model, alternatively known as a bag-of-words model [SM83]. The basic idea is (a) to extract unique content-bearing words from the set of documents, treating these words as features, and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Observe that we may regard the vector space model of a text data set as a word-by-document matrix whose rows are words and columns are document vectors. Typically, a large number of words exist in even a moderately sized set of documents, where a few thousand words or more are common. Thus for large document collections, both the row and column dimensions of the matrix are quite large. However, as we will discuss later in greater detail, this matrix is typically very sparse with almost 99% of the matrix entries being zero.

Using the vector space model, various classical clustering algorithms such as the k-means algorithm and its variants, hierarchical agglomerative clustering, and graph-theoretic methods have been explored in the text mining literature; for detailed reviews, see [Ras92, Wil88]. Recently, there has been a flurry of activity in this area, see [BGG+98, CKPT92, SS97, ZE98]. A substantial amount of this work has concentrated on clustering web search results where the document collections to be clustered are not very large.
In this paper, our main concern is in obtaining a highly efficient process for clustering very large document collections. Our main motivation is that we want to cluster collections in excess of 100,000 documents in a reasonable amount of time on a single processor. Thus our main emphasis is on high speed and scalability with modest main memory consumption. The clustering process involves reading the text documents from disk, and preprocessing them to form the vector space model before using a particular clustering algorithm. It turns out that main memory consumption and disk I/O costs in this process can be prohibitive. In order to alleviate these problems, we employ a memory-efficient multi-threaded approach to reading and preprocessing the documents. For creating the vector space model, we use efficient and scalable data structures such as local and global hash tables. Finally, we use the highly efficient and effective spherical k-means algorithm that fully exploits the sparsity of the data [DM01]. The above steps lead to a highly efficient clustering process; as a result we are able to preprocess and cluster a collection of 113,716 NSF award abstracts in 23 minutes on a Sun workstation with modest memory consumption.

This paper is organized as follows. In Section 2 we highlight the challenges in obtaining an efficient and effective process for clustering large document collections. Section 3 describes our multi-threaded algorithm for creating the vector space model. In Section 4 we present the spherical k-means algorithm which is ideally suited for clustering high-dimensional sparse text data. Section 5 presents speed and memory consumption results on some document collections; in particular we examine the results on a collection of 113,716 NSF award abstracts. Finally, Section 6 concludes with a short summary and discussion of future work.

A quick word about notation. Lower case bold letters such as $\mathbf{x}$, $\mathbf{c}$ will denote vectors while $\|\mathbf{x}\|$ will denote the 2-norm of the corresponding vector. Thus when we say that $\|\mathbf{x}\| = 1$ we mean that $\mathbf{x}^T \mathbf{x} = 1$.

2. Challenges

As mentioned above, the preprocessing phase leads to the creation of the vector space model. Given a large document collection, the following steps are involved in this phase: (a) read all input documents from disk, (b) parse into word tokens using efficient regular expression pattern matching, (c) look up words rapidly in hash tables to track the number of occurrences of each word, and finally, (d) output the vector space model. Efficient parsing using regular expression pattern matching is a well-studied problem. Existing software tools such as FLEX provide a quick and easy way to construct the required parser [Pax96].
Also, efficient hash table indexing and access is provided in public-domain software [MS96]. Thus effective solutions to steps (b) and (c) above are easy to obtain. However, I/O costs in steps (a) and (d) can be substantial, especially if the documents need to be accessed from a Network File System (NFS) [Cal99]. In such a case, the time taken by these steps can be quite large and often unpredictable depending on other traffic on the NFS. Our approach to solve this problem is to use multiple threads [NBF96] so that input documents can be read in a parallel fashion.

Another challenge is main memory consumption. After all the documents have been read and processed there is a final phase which involves a sweep and modification of the entire vector space model. One simple approach to facilitate this phase is to retain the partially formed vector space model in main memory while the rest of the documents are being processed. We implemented such an approach and on our test collection of 113,716 documents, this required nearly 500 MBytes of main memory. Extrapolating this memory requirement linearly to 1 million documents (500 MBytes x 1,000,000 / 113,716, or roughly 4.4 GBytes), we can see that this main-memory based implementation will require several gigabytes of memory. This cost is unacceptable since our goal is to preprocess and cluster a million documents on current workstations consuming less than 256 MBytes of main memory. In Section 3 we introduce such a memory efficient algorithm that allows us to preprocess a large number of documents without sacrificing much speed.

In our test collection of 113,716 documents there are a total of more than 150,000 unique words. After a pruning step, we retain about 26,000 unique words in the vector space model. Thus, the document vectors are very high-dimensional. However, typically, most documents contain only a small subset of the total number of words. Hence, the document vectors are very sparse; a sparsity of 99% is common. The major challenge is to find a clustering algorithm that can yield effective solutions for very high-dimensional data and, at the same time, exploit the sparsity of the vector space model. Since we are interested in clustering very large document collections we seek algorithms that consume time and memory that is linear in the size of the document collection.

3. Efficient Preprocessing

In this section, we describe our preprocessing algorithm for creating the vector space model. Before we describe the details we give a high level description of the task at hand.

3.1 Vector Space Model

The basic idea is to represent each document as a vector of certain weighted word frequencies.
In order to do so, the following parsing and extraction steps are needed.

1. Ignoring case, extract all unique words from the entire set of documents.
2. Eliminate non-content-bearing stopwords such as "a", "and", "the", etc. For sample lists of stopwords, see [FBY92, Chapter 7].
3. For each document, count the number of occurrences of each word.
4. Using heuristic or information-theoretic criteria, eliminate non-content-bearing high-frequency and low-frequency words [SM83].
5. After the above elimination, suppose w unique words remain. Assign a unique identifier between 1 and w to each remaining word, and a unique identifier between 1 and d to each document.

The above steps outline a simple preprocessing scheme. In addition, one may extract word phrases such as "New York", and one may reduce each word to its root or stem, thus eliminating plurals, tenses, prefixes, and suffixes [FBY92, Chapter 8].

The above preprocessing yields the number of occurrences of word j in document i, say, $f_{ji}$, and the number of documents which contain the word j, say, $d_j$. Using these counts, we can represent the i-th document as a w-dimensional vector $\mathbf{x}_i$ as follows. For $1 \le j \le w$, set the j-th component of $\mathbf{x}_i$ to be the product of three terms

\[ x_{ji} = t_{ji} \, g_j \, s_i, \]

where $t_{ji}$ is the term weighting component and depends only on $f_{ji}$, while $g_j$ is the global weighting component and depends on $d_j$, and $s_i$ is the normalization component for $\mathbf{x}_i$. Intuitively, $t_{ji}$ captures the relative importance of a word in a document, while $g_j$ captures the overall importance of a word in the entire set of documents. The objective of such weighting schemes is to enhance discrimination between various document vectors for better retrieval effectiveness [SB88].

There are many schemes for selecting the term, global, and normalization components; see [Kol97] for various possibilities. In this paper we use the popular tfn scheme, known as normalized term frequency-inverse document frequency. This scheme uses

\[ t_{ji} = f_{ji}, \qquad g_j = \log(d/d_j), \qquad s_i = \Bigl( \sum_{j=1}^{w} (t_{ji} \, g_j)^2 \Bigr)^{-1/2}. \]

Note that this normalization implies that $\|\mathbf{x}_i\| = 1$, i.e., each document vector lies on the surface of the unit sphere in $\mathbb{R}^w$. Intuitively, the effect of normalization is to retain only the proportion of words occurring in a document.
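As an illustration of the tfn weighting just described, the following is a minimal Python sketch, not the authors' implementation. The function name `tfn_vectors`, the dictionary-based sparse representation, and the 0-based word indices are assumptions made for this example (the text numbers words from 1 to w). Words that occur in every document receive $g_j = \log(d/d_j) = 0$ and are simply left out of the sparse vector.

```python
import math
from collections import Counter


def tfn_vectors(docs):
    """Build tfn-weighted sparse vectors from tokenized documents.

    docs: list of token lists (already lowercased, stopwords removed).
    Returns (vocab, vectors): vocab maps word -> 0-based index, and each
    vector is a dict {word_index: weight} with unit 2-norm (or empty).
    """
    d = len(docs)
    counts = [Counter(tokens) for tokens in docs]   # f_ji per document
    doc_freq = Counter()                            # d_j per word
    for c in counts:
        doc_freq.update(c.keys())
    vocab = {word: j for j, word in enumerate(sorted(doc_freq))}

    vectors = []
    for c in counts:
        # t_ji * g_j with t_ji = f_ji and g_j = log(d / d_j)
        raw = {vocab[w]: f * math.log(d / doc_freq[w]) for w, f in c.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        if norm > 0:
            # s_i = 1 / norm; drop explicit zeros to keep the vector sparse
            vectors.append({j: v / norm for j, v in raw.items() if v != 0.0})
        else:
            vectors.append({})
    return vocab, vectors
```

For example, with the three toy documents `[["cluster", "text", "text"], ["cluster", "matrix"], ["cluster", "sparse", "matrix"]]`, the word "cluster" appears in every document and therefore carries zero weight, while every resulting vector has unit length.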
Normalization ensures that documents dealing with the same subject matter (that is, using similar words), but differing in length, lead to similar document vectors.

3.2 Preprocessing Algorithm

The input to the preprocessing step is a data directory that contains all the documents to be processed. The documents may also be contained within subdirectories of the input directory. The output is the vector space model described in the above section, which can be represented as a highly sparse word-by-document matrix. We store this sparse matrix by using the Compressed Column Storage (CCS) format [DGL89]. In this format, we record the value of each non-zero element, along with its row and column index. The column indices represent the input documents, the row indices represent ids of distinct words present in the document collection, and the non-zero entries in the matrix represent the frequencies of words in documents.

1. Initialize global vocabulary hash table.
2. Recursively walk through the input directory to obtain the list of documents to be processed.
3. Process a document. Output each word id and its frequency of occurrence in the document to a temporary file. (This step is carried out by several threads in parallel.)
4. Remove words that are too common or too rare from the global vocabulary hash table, and reassign word ids.
5. Output the final vocabulary along with their ids and total frequencies.
6. Reprocess the temporary files to assign the correct word ids, and output the results as a sparse matrix in CCS format.

Figure 1.1. Outline of the preprocessing algorithm

Figure 1.1 gives an outline of the preprocessing algorithm. The algorithm first initializes a global hash table. To resolve whether a word has been encountered previously, a local and a global hash table are used. Both these hash tables use words as keys and store the corresponding row indices and frequencies as values. As the names suggest, the global hash table keeps track of words and their occurrences in the entire document collection while the local hash table does so for just one document. After initializing the global hash table, the algorithm recursively walks through the input directory to obtain the list of documents to be processed. The preprocessing algorithm then creates several threads of computation. The purpose of each thread is to process a set of documents independently and output its results into temporary files.
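To make the storage format concrete, here is a small Python sketch of the standard three-array CCS layout. It is an illustration under assumptions: the function names `to_ccs` and `column` are hypothetical, and the exact on-disk layout used by the authors is not specified in the text. In the standard layout the column index of each non-zero is not stored explicitly; it is implied by a column-pointer array with one entry per document plus one.

```python
def to_ccs(columns, num_rows):
    """Convert per-document sparse columns into CCS arrays.

    columns: list of dicts {row_index: value}, one dict per document.
    Returns (values, row_idx, col_ptr); the non-zeros of column j are
    values[col_ptr[j]:col_ptr[j + 1]] with rows row_idx[col_ptr[j]:col_ptr[j + 1]].
    """
    values, row_idx, col_ptr = [], [], [0]
    for col in columns:
        for r in sorted(col):
            if not 0 <= r < num_rows:
                raise ValueError("row index out of range")
            values.append(col[r])
            row_idx.append(r)
        col_ptr.append(len(values))   # marks the end of this column
    return values, row_idx, col_ptr


def column(ccs, j):
    """Recover document j as a {row_index: value} dict."""
    values, row_idx, col_ptr = ccs
    lo, hi = col_ptr[j], col_ptr[j + 1]
    return dict(zip(row_idx[lo:hi], values[lo:hi]))
```

With nz non-zeros and d documents, this layout holds nz values, nz row indices and d + 1 column pointers, so storage grows linearly in the number of non-zeros rather than in the full w x d matrix size.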
Details of this processing are given in Figure 1.2 which we discuss in the next paragraph. After all the threads have finished, the global hash table is examined, and words that are too common or too rare are removed from the global hash table. Unique word ids are assigned to the words that still remain in the global hash table. The temporary files are then reloaded, the word ids are resolved and then the final vocabulary and word-by-document matrix are output.

1. Initialize the local hash table (this hash table uses words as keys, and stores row indices and frequencies as values).
2. Get a token from the document, and convert the token to lowercase.
3. Discard the token if it is too long or too short or is a non-content stopword such as "a", "and", "the", etc.
4. If the word already exists in the local hash table, increment the frequency of that entry, otherwise insert the word into the local hash table with row index 1 and frequency 1.
5. After the whole document has been processed, set the row indices in the local hash table to the corresponding ones in the global hash table. If a word in the local hash table does not exist in the global hash table, assign a new row index to the word and add to the global hash table (note that this requires locking the global hash table to prevent simultaneous modification by another thread).
6. Output the contents of the local hash table (row indices and frequencies) to a temporary file, and discard the local hash table.

Figure 1.2. Details on the processing of a document by each thread (step 3 of Figure 1.1)

Figure 1.2 describes the various steps performed by each preprocessing thread. Two decisions warrant further explanation: the use of temporary files for storing the partial vector space model and the way in which the local and global hash tables are accessed. As mentioned in the last section, storing the partial vector space model in main memory would require a few gigabytes of main memory and is thus prohibitive for modern workstations. Hence to reduce main memory consumption we store the contents of the local hash table onto temporary files. Since this only leads to local disk access, the resulting overhead is not substantial. The global hash table is accessed and modified by all processing threads and hence is a shared resource.
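The per-document steps of Figure 1.2 can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the class `GlobalVocabulary`, the functions `process_document` and `preprocess`, the tiny `STOPWORDS` set, the token pattern and the length limits are all hypothetical choices, and the sketch returns per-document results in memory where the algorithm in the text writes them to temporary files. The point it shows is the locking pattern: each document is counted in a private local table, and the shared global table is locked only once per document.

```python
import re
import threading
from concurrent.futures import ThreadPoolExecutor

STOPWORDS = {"a", "and", "the", "of", "in", "to", "is"}  # illustrative only
TOKEN = re.compile(r"[a-z]+")                            # assumed tokenizer


class GlobalVocabulary:
    """Shared word table: word -> row index, plus total frequencies."""

    def __init__(self):
        self._lock = threading.Lock()
        self.row_index = {}
        self.total_freq = {}

    def resolve(self, local_counts):
        """Block access (step 5): one lock acquisition per document."""
        with self._lock:
            resolved = []
            for word, freq in local_counts.items():
                idx = self.row_index.setdefault(word, len(self.row_index))
                self.total_freq[word] = self.total_freq.get(word, 0) + freq
                resolved.append((idx, freq))
            return resolved


def process_document(text, vocab, min_len=2, max_len=20):
    """Steps 1-6 of Figure 1.2 for one document, without file output."""
    local = {}                                    # step 1: local hash table
    for token in TOKEN.findall(text.lower()):     # step 2: lowercase tokens
        if not (min_len <= len(token) <= max_len) or token in STOPWORDS:
            continue                              # step 3: discard token
        local[token] = local.get(token, 0) + 1    # step 4: count locally
    return sorted(vocab.resolve(local))           # steps 5-6


def preprocess(texts, num_threads=4):
    vocab = GlobalVocabulary()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(lambda t: process_document(t, vocab), texts))
    return vocab, results
```

Because row indices are handed out in whatever order threads reach the lock, the index assigned to a given word can differ between runs; only the word-to-index mapping within a run is meaningful. In CPython, threads of this kind mainly help overlap disk or network reads rather than speed up the tokenizing itself.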
In order to achieve maximum parallelism, we need to minimize the number of times the global hash table is locked and modified by each processing thread. We achieve this by using a local hash table to process the entire document first, and then making a block access to the global hash table. This access involves resolving the word ids and possible modification of the global hash table, at which time this data structure needs to be locked. See Figure 1.2 for details, especially step 5.

The major main memory consumption in our preprocessing algorithm is due to the global vocabulary table. The partial vector space model is stored in temporary files instead of main memory. Thus the overall main memory requirement is O(W), where W is the number of distinct words that appear in the document collection. It is well known that W grows slower than linearly with the size of the document collection: Heaps' Law states that the number of unique words in a text of size d grows as $O(d^{\beta})$, where $\beta$ is a positive number less than one [Hea78]. Thus the main memory consumption grows slower than linearly with the size of the document collection. The computation time for the preprocessing step is approximately linear with respect to the input data size since each word in a document is processed in O(1) amortized time. Performance results on large document collections that validate these claims are given in Section 5.

4. Efficient Clustering of High-Dimensional Text Data

Given the vector space model, the document vectors may be represented by $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_d$, where each $\mathbf{x}_i \in \mathbb{R}^w$. Recall that w stands for the number of unique words in the vector space model and d is the total number of documents. A clustering of the document collection is its partitioning into the disjoint subsets $\pi_1, \pi_2, \ldots, \pi_k$, i.e.,

\[ \bigcup_{j=1}^{k} \pi_j = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_d\} \quad \text{and} \quad \pi_j \cap \pi_\ell = \emptyset, \; j \neq \ell. \]

The most important and challenging characteristics of the vector space models that arise from text data are high dimensionality and sparsity. Typically, w is in the thousands and a sparsity of 99% is common. For purposes of efficiency, it is important that the clustering algorithm exploit the sparsity of the data while giving meaningful results at the same time.
The spherical k-means algorithm satisfies both these properties and hence is our algorithm of choice. We now briefly formalize this algorithm, highlighting its salient features. More details may be found in [DM01].

Any text clustering algorithm needs an objective notion of similarity between documents. A widely used measure of similarity is the cosine of the angle between two document vectors [FBY92, SM83]. Cosine similarity is easy to interpret and simple to compute for sparse vectors and has been used in other information retrieval applications, such as query retrieval. Based on cosine similarity, we can define the goodness or coherence of cluster $\pi_j$ as

\[ \sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i^T \mathbf{c}_j, \tag{4.1} \]

where each $\mathbf{x}_i$ is assumed to be normalized such that $\|\mathbf{x}_i\| = 1$ and $\mathbf{c}_j$ is the normalized centroid of cluster $\pi_j$,

\[ \mathbf{c}_j = \frac{\sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i}{\bigl\| \sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i \bigr\|}. \]

By the Cauchy-Schwarz inequality,

\[ \sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i^T \mathbf{z} \;\le\; \sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i^T \mathbf{c}_j, \qquad \forall \, \mathbf{z} \in \mathbb{R}^w \text{ with } \|\mathbf{z}\| = 1, \]

and thus the normalized centroid is the vector that is closest in cosine similarity (in an average sense) to all the document vectors in the cluster $\pi_j$. As a result, we also call the vectors $\mathbf{c}_j$ concept vectors. Aggregating (4.1) over all clusters, we can measure the goodness of any given partitioning $\{\pi_j\}_{j=1}^{k}$ using the following objective function:

\[ Q\bigl(\{\pi_j\}_{j=1}^{k}\bigr) = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in \pi_j} \mathbf{x}_i^T \mathbf{c}_j. \tag{4.2} \]

Intuitively, the objective function measures the combined coherence of all the k clusters. Having posed the objective function, we now present an algorithm that attempts to maximize its value.

4.1 Spherical k-means Algorithm

It is well known that finding the partitioning that maximizes (4.2) is NP-Complete [KPR98, Theorem 3.1]; also, see [GJW82]. Thus we seek a heuristic algorithm that can quickly find a good local maximum. For this purpose, we use a variant of the classical k-means algorithm [DH73] that uses the cosine similarity measure. Since both the document and concept vectors lie on the surface of a high-dimensional sphere, we call this variant the spherical k-means algorithm. This algorithm proceeds as follows.

1. Initialize clustering. Start with some initial partitioning of the document vectors, namely $\{\pi_j^{(0)}\}_{j=1}^{k}$. Let $\{\mathbf{c}_j^{(0)}\}_{j=1}^{k}$ be the concept vectors of the associated partitioning. Set the iteration count t to 0.
2. Re-assign document vectors. For each document vector $\mathbf{x}_i$, $1 \le i \le d$, do the following:
   a. Compute $\mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ for all $\ell = 1, 2, \ldots, k$.
   b. From among all $\mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ computed above, find $j^* = \arg\max_\ell \mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ (break ties arbitrarily if $\mathbf{x}_i$ has largest cosine similarity with more than one concept vector).
   This induces the new partitioning
   \[ \pi_j^{(t+1)} = \bigl\{ \mathbf{x}_i : j = \arg\max_\ell \mathbf{x}_i^T \mathbf{c}_\ell^{(t)} \bigr\}, \quad 1 \le j \le k. \]
3. Update concept vectors. Compute the concept vectors corresponding to the new partitioning:
   \[ \mathbf{s}_j = \sum_{\mathbf{x}_i \in \pi_j^{(t+1)}} \mathbf{x}_i, \qquad \mathbf{c}_j^{(t+1)} = \frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}, \quad 1 \le j \le k. \]
4. Check the stopping criterion. If the stopping criterion is satisfied, then exit. Otherwise increase t by 1 and go to step 2 above. In our implementation, the stopping criterion is a bound on the relative change in the objective function,
   \[ \Bigl| Q\bigl(\{\pi_j^{(t)}\}_{j=1}^{k}\bigr) - Q\bigl(\{\pi_j^{(t+1)}\}_{j=1}^{k}\bigr) \Bigr| \;\le\; \epsilon \, \Bigl| Q\bigl(\{\pi_j^{(t)}\}_{j=1}^{k}\bigr) \Bigr|, \]
   where $\epsilon$ is a small positive tolerance.

It can be shown that the above algorithm yields a gradient-ascent scheme. In particular, we can show that the objective function value in (4.2) does not decrease from one iteration to the next, i.e.,

\[ Q\bigl(\{\pi_j^{(t)}\}_{j=1}^{k}\bigr) \;\le\; Q\bigl(\{\pi_j^{(t+1)}\}_{j=1}^{k}\bigr). \]

Like any other gradient-ascent scheme, the spherical k-means algorithm is prone to local maxima. A careful selection of initial partitions $\{\pi_j^{(0)}\}_{j=1}^{k}$ is important. One can either (a) randomly assign each document to one of the k clusters, (b) first compute the concept vector for the entire document collection and randomly perturb this vector to get k starting concept vectors or (c) try several initial clusterings and select the best in terms of the largest objective function. We use strategy (b) in our implementation.

We now examine the time and memory complexity of the above algorithm. We assume that the number of non-zero entries in the sparse matrix is nz, and that the above algorithm iterates $\tau$ times before stopping. Using our initialization strategy, step 1 of the algorithm costs $O(nz + k w)$ operations. For each iteration, step 2a costs $O(nz \cdot k)$ operations while 2b costs $O(k d)$ comparisons. Step 3 updates the concept vectors and costs $O(nz + k w)$ operations. Thus the total time complexity for $\tau$ iterations is $O((nz \cdot k + k d + k w) \, \tau)$. Typically $nz \gg \max(d, w)$, hence it is clear that step 2a is the most computationally expensive step in the algorithm and the overall complexity of the algorithm is $O(nz \cdot k \cdot \tau)$.

The main memory consumed by the algorithm is for storing the document vectors and for old and new copies of the concept vectors. Storing the document vectors in the CCS sparse matrix form requires $4(2nz + d + 4)$ bytes while the concept vectors require $8kw$ bytes; hence the memory consumption is modest.

4.2 Similarity Estimation

The computational bottleneck in the spherical k-means algorithm is the dot product computation between all the document vectors and concept vectors (see step 2a). During the course of the algorithm it turns out that the first few iterations lead to a lot of movement of documents between clusters. However, after just a few iterations the clusters become more stable; see the solid line in the left plot of Figure 1.3. Consequently, the overall objective function value also settles down after 4-5 iterations as seen in the right plot of Figure 1.3.
When the clusters become stable, there are potential savings if we can avoid computing unnecessary dot products between document vectors and far away concept vectors. We now introduce a technique for estimating cosine similarities which allows us to avoid such dot product computations.
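For reference, the baseline iteration of Section 4.1, whose step 2a dot products are the cost that the estimate targets, can be sketched in Python as follows. This is a minimal illustration under assumptions rather than the authors' implementation: documents are unit-norm sparse dicts, concept vectors are dense lists, the caller supplies the initial partition directly (the text uses a perturbation-based initialization instead), and the names `spherical_kmeans`, `concept_vectors`, `objective` and the default tolerance `eps` are hypothetical.

```python
import math


def dot(x, c):
    """Sparse-dense dot product: x is {index: value}, c is a dense list."""
    return sum(v * c[j] for j, v in x.items())


def concept_vectors(docs, assign, k, w):
    """Normalized centroids c_j = s_j / ||s_j|| of the current partition."""
    sums = [[0.0] * w for _ in range(k)]
    for x, j in zip(docs, assign):
        for idx, v in x.items():
            sums[j][idx] += v
    for s in sums:
        norm = math.sqrt(sum(v * v for v in s))
        if norm > 0:                      # an empty cluster stays at zero
            for idx in range(w):
                s[idx] /= norm
    return sums


def objective(docs, assign, concepts):
    """Q = sum over documents of x_i^T c_j for the assigned cluster j."""
    return sum(dot(x, concepts[j]) for x, j in zip(docs, assign))


def spherical_kmeans(docs, init_assign, k, w, eps=1e-3, max_iter=100):
    assign = list(init_assign)                        # step 1
    concepts = concept_vectors(docs, assign, k, w)
    q = objective(docs, assign, concepts)
    for _ in range(max_iter):
        # Step 2: move each document to its most similar concept vector.
        assign = [max(range(k), key=lambda j: dot(x, concepts[j]))
                  for x in docs]
        # Step 3: recompute the concept vectors.
        concepts = concept_vectors(docs, assign, k, w)
        q_new = objective(docs, assign, concepts)
        # Step 4: stop when the relative change in Q is small.
        converged = abs(q_new - q) <= eps * abs(q)
        q = q_new
        if converged:
            break
    return assign, concepts, q
```

Each pass over step 2 touches every non-zero of every document once per cluster, which is the $O(nz \cdot k)$ term that dominates the running time.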

[Figure 1.3. Clusters stabilize after just a few iterations. Left plot: potential vs. actual number of document assignment changes (k=12) against iteration count (t). Right plot: objective function against iteration count (t) for several values of k, including k=12 and k=20.]

Suppose $\mathbf{c}_\ell^{(t)}$ is the concept vector of cluster $\ell$ at iteration t and $\mathbf{x}$ is a document vector. Then,

\[ \bigl| \mathbf{x}^T \mathbf{c}_\ell^{(t)} - \mathbf{x}^T \mathbf{c}_\ell^{(t+1)} \bigr| \;\le\; \|\mathbf{x}\| \, \bigl\| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t+1)} \bigr\| = \bigl\| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t+1)} \bigr\|, \]

\[ \mathbf{x}^T \mathbf{c}_\ell^{(t)} - \bigl\| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t+1)} \bigr\| \;\le\; \mathbf{x}^T \mathbf{c}_\ell^{(t+1)} \;\le\; \mathbf{x}^T \mathbf{c}_\ell^{(t)} + \bigl\| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t+1)} \bigr\|. \tag{4.3} \]

The right side of inequality (4.3) gives a similarity upper bound which can be used profitably in the (t+1)-st iteration to avoid computing $\mathbf{x}^T \mathbf{c}_\ell^{(t+1)}$. The idea is to store in memory the quantities $\| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t+1)} \|$ and upper bounds for $\mathbf{x}^T \mathbf{c}_\ell^{(t)}$. Suppose $\mathbf{x}$ belongs to the cluster j at the t-th iteration. Then, at iteration t+1, we can update in O(1) time the similarity upper bound for $\mathbf{x}^T \mathbf{c}_\ell^{(t+1)}$, $\ell \neq j$. If this similarity upper bound is smaller than $\mathbf{x}^T \mathbf{c}_j^{(t+1)}$ (which is computed exactly) there is no need to explicitly compute $\mathbf{x}^T \mathbf{c}_\ell^{(t+1)}$; otherwise the exact value needs to be computed. As the clusters become more and more stable, this similarity estimation can dramatically cut down on computation (see Figure 1.4).

In our implementation, we store the similarity upper bounds in the $d \times k$ matrix U. We obtain the new algorithm by replacing step 2 of the spherical k-means algorithm by the following step (here we assume that similarity estimation is started after $t_{min}$ iterations).

2. Re-assign document vectors. For each document vector $\mathbf{x}_i$, $1 \le i \le d$, do the following:
   a. If $t < t_{min}$, do step 2a as in Section 4.1.
      Else if $t = t_{min}$, set $U(i, \ell) = \mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ for all $\ell = 1, 2, \ldots, k$.
      Else if $t > t_{min}$, do the following steps for all $\ell = 1, 2, \ldots, k$:
      a1. Set $U(i, \ell) = U(i, \ell) + \| \mathbf{c}_\ell^{(t)} - \mathbf{c}_\ell^{(t-1)} \|$.
      a2. Compute $\mathbf{x}_i^T \mathbf{c}_j^{(t)}$ where $\mathbf{x}_i$ belongs to cluster j at iteration t-1. If $U(i, \ell) > \mathbf{x}_i^T \mathbf{c}_j^{(t)}$, compute $\mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ and set $U(i, \ell) = \mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$.
   b. From among all $\mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$ computed above, find $j^* = \arg\max_\ell \mathbf{x}_i^T \mathbf{c}_\ell^{(t)}$.

[Figure 1.4. Computational savings due to similarity estimation: number of dot products computed against iteration count (t), with similarity estimation and for the original algorithm.]

Figure 1.4 shows the considerable savings in the number of dot product computations as the iteration count increases in a typical run of the algorithm. To obtain these savings, an extra $4dk$ bytes of storage are required to store the matrix U. Additionally, an extra $O(dk)$ operations to update the matrix U are required (see step a1 above).

However, these extra costs are small compared to the resulting savings in computation time; see Section 5 for an example.

4.3 Clustering for Dimensionality Reduction

We now highlight another use of our clustering algorithm. Using the vector space model, each document may be represented as a high-dimensional vector of words. Clearly the occurrence of one word in a document is not independent of other words. Thus there is a great deal of redundancy in this collection of vectors. Dimensionality reduction is a technique that tries to represent each document as a vector with fewer dimensions that are more independent. It turns out that the above clustering algorithm also yields a fast and effective technique for dimensionality reduction.

Let $\{\mathbf{c}_j\}_{j=1}^{k}$ denote the concept vectors corresponding to a clustering of the document vectors. The concept matrix $C_k$ is defined to be a $w \times k$ matrix such that, for $1 \le j \le k$, the j-th column of the matrix is the concept vector $\mathbf{c}_j$, i.e., $C_k = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k]$. An approximation $X_k$ to the word-document matrix X may be obtained by taking the least squares projection of X onto the column space of $C_k$, i.e., the linear subspace spanned by the concept vectors. This may be expressed as $X_k = C_k Z_k$, where $Z_k$ is a $k \times d$ matrix that is to be determined by solving the following least-squares problem:

\[ Z_k = \arg\min_{Z} \| X - C_k Z \|_F^2. \]

It is well known that a closed form solution exists for this least-squares problem, namely,

\[ Z_k = (C_k^T C_k)^{-1} C_k^T X. \tag{4.4} \]

The i-th column of $Z_k$ gives the reduced k-dimensional representation of the i-th document vector. Typically the original dimensionality w is in the thousands while k is much smaller. In previous work, we have shown that empirically, the quality of the reduced dimensional representation given by (4.4) is comparable to the best possible, namely the k-truncated SVD; see [DM01] for details. Thus our clustering process followed by dimensionality reduction can be used in various applications, such as query retrieval, text classification, etc.
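The closed form (4.4) can be illustrated with a small dependency-free Python sketch. It is an illustration under assumptions, not the authors' implementation: matrices are dense lists of rows for readability (a practical version would keep X sparse), the k x k normal equations are solved by Gauss-Jordan elimination with partial pivoting, and the function names `matmul_t`, `solve` and `reduce_dimension` are hypothetical.

```python
def matmul_t(a, b):
    """Return A^T B for row-major dense matrices A (w x p) and B (w x q)."""
    p, q = len(a[0]), len(b[0])
    out = [[0.0] * q for _ in range(p)]
    for row_a, row_b in zip(a, b):
        for i, ai in enumerate(row_a):
            if ai:
                for j, bj in enumerate(row_b):
                    out[i][j] += ai * bj
    return out


def solve(m, rhs):
    """Solve M Z = RHS (M is n x n) by Gauss-Jordan elimination."""
    n = len(m)
    aug = [list(m[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("concept vectors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [[aug[i][j] / aug[i][i] for j in range(n, len(aug[i]))]
            for i in range(n)]


def reduce_dimension(c, x):
    """Z = (C^T C)^{-1} C^T X with C given as w x k and X as w x d."""
    return solve(matmul_t(c, c), matmul_t(c, x))
```

For instance, with concept matrix rows `[[1, 1], [0, 1], [0, 0]]` and a single document column `[[3], [1], [7]]`, the reduced representation is `[[2.0], [1.0]]`: the component of the document outside the span of the two concept vectors is discarded. Since k is small, the k x k solve is cheap compared with forming $C_k^T X$.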

5. Experimental Results

In this section, we give experimental results for the entire clustering process; this includes time and memory consumed by the preprocessing phase as well as by the clustering algorithm. In addition, we need to evaluate the goodness of the clustering produced. Evaluating clustering results is a tricky business. However, in situations where documents are already categorized (labeled), we can compare the clusters with the true class labels. For this comparison we use the measures of purity and entropy as defined below; see also [SGM00].

Suppose we are given c categories (true class labels) while the clustering algorithm produces k clusters. The purity of cluster $\pi_\ell$ can be defined as

\[ P(\pi_\ell) = \frac{1}{n_\ell} \max_{h} \bigl( n_\ell^{(h)} \bigr), \]

where $n_\ell = |\pi_\ell|$ and $n_\ell^{(h)}$ is the number of documents in $\pi_\ell$ that belong to class h, $h = 1, \ldots, c$. Note that each cluster may contain samples from different classes. Purity gives the ratio of the dominant class size in the cluster to the cluster size itself. A high purity value implies that the cluster is a pure subset of the dominant class. Additionally, we also use entropy as a quality measure, which is defined as follows:

\[ H(\pi_\ell) = -\frac{1}{\log c} \sum_{h=1}^{c} \frac{n_\ell^{(h)}}{n_\ell} \log\Bigl( \frac{n_\ell^{(h)}}{n_\ell} \Bigr). \]

Entropy is a more comprehensive measure than purity. It considers the distribution of classes in a cluster. Note that we have normalized entropy to take values between 0 and 1. An entropy value of 0 means the cluster consists entirely of one class, while an entropy value near 1 is bad since it implies that the cluster contains a uniform mixture of classes.

We now examine these quality measures on a sample collection. We formed a collection of 3893 documents by merging the popular MEDLINE, CISI, and CRANFIELD sets. MEDLINE consists of 1033 abstracts from medical journals, CISI consists of 1460 abstracts from information retrieval papers, while CRANFIELD consists of 1400 abstracts from aeronautical systems papers (ftp://ftp.cs.cornell.edu/pub/smart).
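The purity and entropy measures defined above can be computed per cluster with a few lines of Python. This is a minimal sketch; the function names and the choice to pass the true labels of one cluster as a list are assumptions for illustration. Classes that do not occur in the cluster contribute nothing to the entropy sum, following the usual convention that 0 log 0 = 0.

```python
import math
from collections import Counter


def purity(labels):
    """Fraction of the cluster taken up by its dominant class.

    labels: true class labels of the documents in one cluster.
    """
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def entropy(labels, num_classes):
    """Class entropy of one cluster, normalized by log(c) to lie in [0, 1].

    num_classes is c, the total number of true classes (must be >= 2).
    """
    n = len(labels)
    # (m / n) * log(n / m) equals -(m / n) * log(m / n)
    h = sum((m / n) * math.log(n / m) for m in Counter(labels).values())
    return h / math.log(num_classes)
```

A cluster holding three MEDLINE documents and one CISI document has purity 0.75; a cluster drawn from a single class has entropy 0, and a cluster split evenly over all c classes has entropy 1.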
We preprocessed this collection by proceeding as in Section 3. After removing common stopwords, the collection contained [...] unique words from which we eliminated [...] low-frequency words appearing in less than 8 documents (roughly 0.2% of the documents), and 7 high-frequency words appearing in more than 585 documents (roughly 15% of the documents). We were finally left with 4303 words using which we created 3893 document vectors using the tfn scheme. Each document vector has dimension 4303; however, on average, each document vector is about 99% sparse.

For this contrived collection, we used our clustering algorithm to form 3 clusters. The following table shows the confusion matrix for this clustering, from which we can see that the clusters can be easily identified with the three classes MEDLINE, CISI and CRANFIELD.

                   MEDLINE   CRANFIELD   CISI
patients (1023)
boundary (1388)
library (1481)

In the above table the rows denote clusters. Note that we have denoted the clusters by the most frequently occurring word in the cluster from among the words in the vector space model: patients, boundary and library. The above table says that the cluster patients has 1023 documents of which 1020 belong to the MEDLINE class, the boundary cluster has 1388 documents while the library cluster has 1481 documents. Notice that the confusion matrix is almost diagonal, which shows that the clustering algorithm nearly recovered the three classes. This fact is also reflected by the high purity and low entropy values of the three clusters shown below.

Cluster#   Purity   Entropy

Case Study on a Large Scientific Collection

To show the efficiency and effectiveness of our algorithms, we now present results on a large real-life collection of NSF award abstracts. We obtained the NSF data set by downloading abstracts of grants awarded by the National Science Foundation between Jan 1958 and August 1999. For our experiments we extracted titles and abstracts (ignoring fields such as Type, NSF Org, Date, etc.)
of awards made in the 8 NSF organizations: Directorate for Bioogica Sciences (BIO), Directorate for Computer and Information Science and Engineering (CISE), Directorate for Education and Human Resources (EHR), Directorate for Engineering (ENG), Directorate for Geosciences (GEO), Directorate for Mathematica and Physica Sciences (MPS), Office of

17 Efficient Custering of Very Large Document Coections 17 Poar Programs (OPP) and Directorate for Socia, Behaviora, and Economic Sciences (SBE). This resuts in a tota of abstracts. The number of documents per NSF organization is as foows: Cass abe ENG CSE MPS BIO GEO EHR SBE OPP # documents The most popuous cass is MPS with over awards, with ENG being second. On the other hand, OPP contains the fewest (2767) number of documents. Both the mean and median cass size is about We preprocessed this NSF coection by proceeding as in Section 3. After removing common stopwords, the coection contained unique words from which we eiminated ow-frequency and highfrequency words. We were finay eft with words using which we created document vectors using the tfn scheme. Each document vector has dimension 26424, however, on an average, each document vector contains ony about 72 nonzero components and thus is more than 99% sparse. In terms of the words contained in these documents, Tabe 1.1 shows the top ten most common words in the NSF data set. Note that we refer to the most frequenty occurring word as having rank 1, the second most frequent word as having rank 2, and so on. As is common in most document coections the majority of the words occur in very few documents. Figure 1.5 shows the distribution of a the word frequencies versus their rank on a og-og scae. This distribution approximatey fits the so-caed Zipf s aw [Zip49]. Rank Word Frequency 1 abstract research tite project study data system university program science Tabe 1.1. Top 10 most common words in the NSF data set As mentioned above, there are a tota of unique word in the NSF coection. In genera, the number of unique words W grows sower

18 Word frequency Top 10 most common words 1. abstract 2. research 3. tite 4. project 5. study 6. data 7. systems 8. university 9. program 10. science Word rank Figure 1.5. Distribution of word frequencies on a og-og scae (Zipf s Law) 1.6 x Number of distinct words Number of documents x 10 4 Figure 1.6. Vocabuary size vs. number of documents (Heap s Law)

19 Efficient Custering of Very Large Document Coections Main memory consumed (MBytes) Size of data (MBytes) Figure 1.7. Main memory consumed vs. size of data than ineary with the number of documents, i.e., W = O(d β ) where 0 < β < 1. This behaviour known as Heap s aw [Hea78] is seen in Figure Preprocessing Resuts. We now show the resuts of our custering process on this arge NSF coection. The preprocessing phase takes time that grows ineary with the size of the data set, whie the main memory consumed aso grows ineary with the number of unique words. Experiments on subsets of the NSF coection vaidate these caims, see Figures 1.7 and 1.8. Figure 1.7 shows that we consume ess than 18 MBytes of main memory to preprocess the entire NSF coection of documents. Extrapoating this number to a coection of one miion documents, we see that 160 MBytes of main memory woud be sufficient, which impies that our agorithms can carry out this arge task on a singe workstation with just 256 MBytes of main memory. From Figure 1.8 we see that the entire NSF data set is preprocessed in ess than 20 minutes. Note that this incudes disk I/O time. Our impementation is muti-threaded and hence the time taken is not very sensitive to the traffic over the Network Fie Server Custering Resuts. In this section, we examine the speed and quaity of the custering agorithm. As discussed in Sec-

20 Computation time (seconds) Size of data (MBbyes) Figure 1.8. Computation time vs. size of data tion 4.1, the spherica k-means agorithm forms k custers in O(nz k) time. Figure 1.9 gives the time taken to custer the NSF data set ( documents, words, non-zero entries) into a varied number of custers. Times for both the origina custering agorithm and its modification that uses simiarity estimation are shown (see Section 4.2). The simiarity estimation technique yieds considerabe savings in computation time when the number of custers is arge. The running time of the agorithm is aso seen to increase ineary with the number of custers k. To form 100 custers, 1400 seconds of computationa time is needed. On the other hand, to form 12 custers ony 200 seconds are needed. Thus, combined with the preprocessing time of about 1190 seconds, the entire custering process for 12 custers is seen to take ony about 23 minutes. We now present a sampe custering obtained by the spherica k-means agorithm. We custered the entire set of documents into 12 custers. In the foowing tabe, we have isted the 10 dominant words and their weights in each of the concept vectors that correspond to the 12 custers. Note that we have denoted each of the 12 custers by the dominant word, e.g., theory, chemistry, physics, etc. By examining the top 10 words it is easy to see the concept contained in each custer. For exampe, the conference custer containing 6652 documents appears to be about NSF awards that support conferences, meetings, workshops and symposiums for internationa scientists and researchers.
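To make the cost argument and the notion of dominant words concrete, the following is a minimal Python sketch of spherical k-means on sparse vectors. It is not the implementation used for the experiments reported here: the dictionary-based sparse representation, the initialization with the first k documents, the fixed iteration count, the toy documents and all function names are assumptions made for illustration. The assignment step touches only the nonzero entries of each document once per concept vector, which is where a cost proportional to nz times k per iteration comes from, assuming nz denotes the total number of nonzero entries.

```python
import math


def normalize(vec):
    """Scale a sparse vector (dict: word -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def dot(doc, concept):
    """Dot product that visits only the nonzero entries of the document."""
    return sum(w * concept.get(t, 0.0) for t, w in doc.items())


def spherical_kmeans(docs, k, iterations=10):
    """Cluster unit-length sparse documents by cosine similarity.

    Simplifications of this sketch: the first k documents seed the concept
    vectors, and the loop runs for a fixed number of iterations.
    """
    concepts = [dict(d) for d in docs[:k]]
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assignment: each document is compared with all k concept vectors.
        labels = [max(range(k), key=lambda j: dot(d, concepts[j])) for d in docs]
        # Update: sum the documents of each cluster, then renormalize.
        sums = [{} for _ in range(k)]
        for d, j in zip(docs, labels):
            for t, w in d.items():
                sums[j][t] = sums[j].get(t, 0.0) + w
        concepts = [normalize(s) if s else concepts[j] for j, s in enumerate(sums)]
    return labels, concepts


def top_words(concept, n=3):
    """Words with the largest weights in a concept vector."""
    return sorted(concept, key=concept.get, reverse=True)[:n]


# Hypothetical term-frequency vectors for four tiny documents.
raw = [
    {"protein": 3, "cell": 2},
    {"ocean": 2, "ice": 2, "climate": 1},
    {"protein": 1, "gene": 2, "cell": 1},
    {"ocean": 1, "climate": 3},
]
docs = [normalize(d) for d in raw]
labels, concepts = spherical_kmeans(docs, k=2)
print(labels)  # [0, 1, 0, 1]
print([top_words(c, 1) for c in concepts])
```

Naming each cluster by the highest-weighted word of its concept vector, as `top_words` does here, mirrors how the clusters are labeled in the text.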

Figure 1.9. Time required to cluster the NSF collection, with similarity estimation and with the original algorithm. Axes: computation time (seconds) vs. number of clusters.

We now see the extent to which our clustering algorithm is able to recover the 8 NSF classes: BIO, CSE, EHR, ENG, GEO, MPS, OPP and SBE. Table 1.3 gives the confusion matrix between the 12 clusters and 8 classes. Note that the largest number in each row of the table is boldfaced. We have ordered the clusters and classes so that the confusion matrix has large numbers near the diagonal. The fact that the numbers near the diagonal are larger than those away from the diagonal implies that many of the clusters can be identified with NSF organizations. For example, the theory and chemistry clusters can be identified with MPS, the social cluster with SBE, the protein and species clusters with BIO, and the ocean cluster with GEO.

Interestingly, some of the clusters indicate the multidisciplinary nature of some NSF organizations. The conference cluster is seen to have many awards from MPS, ENG, SBE and BIO, but no particular organization is dominant. Also, organizations such as MPS are seen to have subcategories: mathematical theory, chemistry reactions and quantum physics. The numbers in the confusion matrix also indicate the inter-relationships between various NSF organizations. For example, the materials cluster indicates closeness between MPS and ENG while the design cluster shows the closeness between ENG and CSE. The quality measures of purity and entropy for this clustering are given in Table 1.4.
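A confusion matrix of the kind discussed above can be tabulated directly from the cluster assignments and the true class labels. The following Python sketch shows one way to do so; the function names and the six example documents are hypothetical and are not data from the NSF collection.

```python
from collections import Counter


def confusion_matrix(cluster_labels, class_labels):
    """Map each cluster to a Counter of the true classes of its documents."""
    matrix = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        matrix.setdefault(cluster, Counter())[cls] += 1
    return matrix


def dominant_class(row):
    """Class with the largest count in one row of the confusion matrix."""
    return max(row, key=row.get)


# Hypothetical assignments for six documents.
clusters = ["theory", "theory", "ocean", "ocean", "ocean", "theory"]
classes = ["MPS", "MPS", "GEO", "GEO", "OPP", "ENG"]
matrix = confusion_matrix(clusters, classes)
for name, row in matrix.items():
    print(name, dict(row), "->", dominant_class(row))
```

Each row of this structure is the list of per-class counts from which the purity and entropy of the corresponding cluster are computed.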

theory (7442): theory(.339), mathematical(.295), equations(.25), geometry(.192), problems(.186), sciences(.177), differential(.168), algebraic(.165), manifolds(.126), spaces(.114)
chemistry (7420): chemistry(.448), reactions(.217), organic(.21), nmr(.194), molecules(.182), metal(.169), chemical(.151), compounds(.142), spectroscopy(.122), reaction(.12)
physics (9835): physics(.189), quantum(.166), magnetic(.166), particle(.144), electron(.135), stars(.13), solar(.129), energy(.123), laser(.115), theoretical(.112)
materials (11589): materials(.292), phase(.192), properties(.156), films(.14), polymer(.135), surface(.125), thin(.116), optical(.114), temperature(.11), liquid(.107)
design (13255): design(.243), algorithms(.216), control(.18), parallel(.175), performance(.149), problems(.144), software(.118), models(.115), computer(.112), optimization(.1)
conference (6652): conference(.47), workshop(.399), international(.23), held(.207), meeting(.181), symposium(.178), travel(.135), participants(.112), scientists(.107), researchers(.096)
mathematics (7374): mathematics(.342), teachers(.333), school(.259), education(.177), college(.151), year(.150), teacher(.132), summer(.125), schools(.12)
laboratory (11302): laboratory(.303), equipment(.241), undergraduate(.225), computer(.186), engineering(.185), courses(.175), student(.132), biology(.126), faculty(.124), projects(.124)
social (8267): social(.27), economic(.180), political(.167), policy(.143), dissertation(.13), language(.126), public(.104), cultural(.103), decision(.1), market(.087)
protein (9212): protein(.273), cell(.258), cells(.235), proteins(.22), gene(.196), genes(.179), dna(.153), expression(.138), molecular(.131), regulation(.119)
species (8822): species(.326), plant(.167), populations(.151), plants(.133), population(.125), evolutionary(.125), evolution(.101), variation(.098), ecological(.096)
ocean (12546): ocean(.232), ice(.194), climate(.153), mantle(.147), sea(.134), water(.118), seismic(.106), circulation(.097), pacific(.096), isotopic(.092)

Table 1.2. Top 10 words associated with each cluster (the number in parentheses after each word gives the word's weight in the concept vector)

Despite the above quality measures, it is sometimes best to have an informal way of examining a clustering. For this purpose, we have created a sequence of web pages to browse through the clusters, and the documents and keywords contained in them. The clustering presented above is available at [...].

Cluster (size)       MPS   ENG   CSE   EHR   SBE   OPP   BIO   GEO
theory (7442)
chemistry (7420)
physics (9835)
materials (11589)
design (13255)
conference (6652)
mathematics (7374)
laboratory (11302)
social (8267)
protein (9212)
species (8822)
ocean (12546)

Table 1.3. Confusion matrix between the 8 true classes and 12 clusters for the entire NSF collection

Cluster        Purity   Entropy
theory
chemistry
physics
materials
design
conference
mathematics
laboratory
social
protein
species
ocean

Table 1.4. Purity and entropy results for the computed clusters of the entire NSF collection

6. Conclusions and Future Work

In this paper, we have presented a time and memory efficient scheme for clustering very large document collections. We employ a multi-threaded approach to reading and preprocessing the documents to mitigate disk I/O costs, and use the effective spherical k-means algorithm to cluster the documents. Our experimental results show that we are able to preprocess and cluster 113,716 NSF award abstracts in 23 minutes on a Sun workstation with modest memory consumption. We have also demonstrated that the quality of the clustering is good. Based on the experiments we have done in this paper, we predict that we can use our scheme to cluster a million documents into 12 clusters in less than 4 hours on a Sun workstation.

In future work, we will continue our focus on improving the efficiency and scalability of our scheme. The dot product computation between all the document vectors and concept vectors is the computational bottleneck in the spherical k-means algorithm. To cut down on this computation, techniques like truncation, which project each document vector onto a small subspace of the total word space, will be investigated; see [SS97]. Meanwhile, more sophisticated handling of I/O threads will be studied in order to cut down the I/O cost, which is the bottleneck for preprocessing. Parallelizing the whole process can be one of the approaches. Hierarchical clustering will also be studied in future work to discover the inherent hierarchical structure and the correct number of clusters in the data set.

References

[BGG+98] D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Document categorization and query generation on the World Wide Web using WebACE. AI Review, 1998.
[Cal99] Brent Callaghan. NFS Illustrated. Addison-Wesley, 1999.
[CKPT92] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In ACM SIGIR, 1992.
[DGL89] I. Duff, R. Grimes, and J. Lewis. Sparse matrix test problems. ACM Trans Math Soft, pages 1-14, 1989.
[DH73] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
[DM01] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1), January 2001. Also appears as IBM Research Report RJ 10147, July 1999.
[FBY92] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, New Jersey, 1992.
[GJW82] M. R. Garey, D. S. Johnson, and H. S. Witsenhausen. The complexity of the generalized Lloyd-Max problem. IEEE Trans. Inform. Theory, 28(2), 1982.
[Hea78] H. S. Heaps. Information Retrieval - Computational and Theoretical Aspects. Academic Press, 1978.
[Kol97] T. G. Kolda. Limited-Memory Matrix Methods with Applications. PhD thesis, The Applied Mathematics Program, University of Maryland, College Park, Maryland, 1997.
[KPR98] Jon Kleinberg, C. H. Papadimitriou, and P. Raghavan. A microeconomic view of data mining. Data Mining and Knowledge Discovery, 2(4), December 1998.
[Mit97] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[MS96] D. Musser and A. Saini. STL Tutorial and Reference Guide. Addison-Wesley, 1996.
[NBF96] Bradford Nichols, Dick Buttlar, and Jackie Proulx Farrell. Pthreads Programming. O'Reilly & Associates, Inc., 981 Chestnut Street, Newton, MA 02164, USA, 1996.
[Pax96] Vern Paxson. Flex user manual, November 1996.
[Ras92] E. Rasmussen. Clustering algorithms. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, New Jersey, 1992.
[SB88] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 1988.
[SGM00] A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In Proceedings of the AAAI2000 Workshop on Artificial Intelligence for Web Search, pages 58-64, Austin, Texas, July 2000. AAAI/MIT Press.
[SM83] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
[SS97] H. Schütze and C. Silverstein. Projections for efficient document clustering. In ACM SIGIR, 1997.
[Wil88] P. Willett. Recent trends in hierarchic document clustering: a critical review. Information Processing & Management, 24(5), 1988.
[ZE98] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In ACM SIGIR, 1998.
[Zip49] G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison Wesley, Reading, MA, 1949.


More information

Bite-Size Steps to ITIL Success

Bite-Size Steps to ITIL Success 7 Bite-Size Steps to ITIL Success Pus making a Business Case for ITIL! Do you want to impement ITIL but don t know where to start? 7 Bite-Size Steps to ITIL Success can hep you to decide whether ITIL can

More information

NCH Software BroadCam Video Streaming Server

NCH Software BroadCam Video Streaming Server NCH Software BroadCam Video Streaming Server This user guide has been created for use with BroadCam Video Streaming Server Version 2.xx NCH Software Technica Support If you have difficuties using BroadCam

More information

Virtual trunk simulation

Virtual trunk simulation Virtua trunk simuation Samui Aato * Laboratory of Teecommunications Technoogy Hesinki University of Technoogy Sivia Giordano Laboratoire de Reseaux de Communication Ecoe Poytechnique Federae de Lausanne

More information

Packet Classification with Network Traffic Statistics

Packet Classification with Network Traffic Statistics Packet Cassification with Network Traffic Statistics Yaxuan Qi and Jun Li Research Institute of Information Technoogy (RIIT), Tsinghua Uniersity Beijing, China, 100084 Abstract-- Packet cassification on

More information

eg Enterprise vs. a Big 4 Monitoring Soution: Comparing Tota Cost of Ownership Restricted Rights Legend The information contained in this document is confidentia and subject to change without notice. No

More information

Chapter 3: e-business Integration Patterns

Chapter 3: e-business Integration Patterns Chapter 3: e-business Integration Patterns Page 1 of 9 Chapter 3: e-business Integration Patterns "Consistency is the ast refuge of the unimaginative." Oscar Wide In This Chapter What Are Integration Patterns?

More information

NCH Software Warp Speed PC Tune-up Software

NCH Software Warp Speed PC Tune-up Software NCH Software Warp Speed PC Tune-up Software This user guide has been created for use with Warp Speed PC Tune-up Software Version 1.xx NCH Software Technica Support If you have difficuties using Warp Speed

More information

Leadership & Management Certificate Programs

Leadership & Management Certificate Programs MANAGEMENT CONCEPTS Leadership & Management Certificate Programs Programs to deveop expertise in: Anaytics // Leadership // Professiona Skis // Supervision ENROLL TODAY! Contract oder Contract GS-02F-0010J

More information

Welcome to Colonial Voluntary Benefits. Thank you for your interest in our Universal Life with the Accelerated Death Benefit for Long Term Care Rider.

Welcome to Colonial Voluntary Benefits. Thank you for your interest in our Universal Life with the Accelerated Death Benefit for Long Term Care Rider. Heo, Wecome to Coonia Vountary Benefits. Thank you for your interest in our Universa Life with the Acceerated Death Benefit for Long Term Care Rider. For detai pease ca 877-685-2656. Pease eave your name,

More information

Best Practices for Push & Pull Using Oracle Inventory Stock Locators. Introduction to Master Data and Master Data Management (MDM): Part 1

Best Practices for Push & Pull Using Oracle Inventory Stock Locators. Introduction to Master Data and Master Data Management (MDM): Part 1 SPECIAL CONFERENCE ISSUE THE OFFICIAL PUBLICATION OF THE Orace Appications USERS GROUP spring 2012 Introduction to Master Data and Master Data Management (MDM): Part 1 Utiizing Orace Upgrade Advisor for

More information

Chapter 1 Structural Mechanics

Chapter 1 Structural Mechanics Chapter Structura echanics Introduction There are many different types of structures a around us. Each structure has a specific purpose or function. Some structures are simpe, whie others are compex; however

More information

Insertion and deletion correcting DNA barcodes based on watermarks

Insertion and deletion correcting DNA barcodes based on watermarks Kracht and Schober BMC Bioinformatics (2015) 16:50 DOI 10.1186/s12859-015-0482-7 METHODOLOGY ARTICLE Open Access Insertion and deetion correcting DNA barcodes based on watermarks David Kracht * and Steffen

More information

Design of Follow-Up Experiments for Improving Model Discrimination and Parameter Estimation

Design of Follow-Up Experiments for Improving Model Discrimination and Parameter Estimation Design of Foow-Up Experiments for Improving Mode Discrimination and Parameter Estimation Szu Hui Ng 1 Stephen E. Chick 2 Nationa University of Singapore, 10 Kent Ridge Crescent, Singapore 119260. Technoogy

More information

WHITE PAPER UndERsTAndIng THE VAlUE of VIsUAl data discovery A guide To VIsUAlIzATIons

WHITE PAPER UndERsTAndIng THE VAlUE of VIsUAl data discovery A guide To VIsUAlIzATIons Understanding the Vaue of Visua Data Discovery A Guide to Visuaizations WHITE Tabe of Contents Executive Summary... 3 Chapter 1 - Datawatch Visuaizations... 4 Chapter 2 - Snapshot Visuaizations... 5 Bar

More information

The Web Insider... The Best Tool for Building a Web Site *

The Web Insider... The Best Tool for Building a Web Site * The Web Insider... The Best Too for Buiding a Web Site * Anna Bee Leiserson ** Ms. Leiserson describes the types of Web-authoring systems that are avaiabe for buiding a site and then discusses the various

More information

Ricoh Legal. ediscovery and Document Solutions. Powerful document services provide your best defense.

Ricoh Legal. ediscovery and Document Solutions. Powerful document services provide your best defense. Ricoh Lega ediscovery and Document Soutions Powerfu document services provide your best defense. Our peope have aways been at the heart of our vaue proposition: our experience in your industry, commitment

More information

ST. MARKS CONFERENCE FACILITY MARKET ANALYSIS

ST. MARKS CONFERENCE FACILITY MARKET ANALYSIS ST. MARKS CONFERENCE FACILITY MARKET ANALYSIS Prepared by: Lambert Advisory, LLC Submitted to: St. Marks Waterfronts Forida Partnership St. Marks Conference Center Contents Executive Summary... 1 Section

More information

NCH Software Express Accounts Accounting Software

NCH Software Express Accounts Accounting Software NCH Software Express Accounts Accounting Software This user guide has been created for use with Express Accounts Accounting Software Version 5.xx NCH Software Technica Support If you have difficuties using

More information

GWPD 4 Measuring water levels by use of an electric tape

GWPD 4 Measuring water levels by use of an electric tape GWPD 4 Measuring water eves by use of an eectric tape VERSION: 2010.1 PURPOSE: To measure the depth to the water surface beow and-surface datum using the eectric tape method. Materias and Instruments 1.

More information

Spatio-Temporal Asynchronous Co-Occurrence Pattern for Big Climate Data towards Long-Lead Flood Prediction

Spatio-Temporal Asynchronous Co-Occurrence Pattern for Big Climate Data towards Long-Lead Flood Prediction Spatio-Tempora Asynchronous Co-Occurrence Pattern for Big Cimate Data towards Long-Lead Food Prediction Chung-Hsien Yu, Dong Luo, Wei Ding, Joseph Cohen, David Sma and Shafiqu Isam Department of Computer

More information

FRAME BASED TEXTURE CLASSIFICATION BY CONSIDERING VARIOUS SPATIAL NEIGHBORHOODS. Karl Skretting and John Håkon Husøy

FRAME BASED TEXTURE CLASSIFICATION BY CONSIDERING VARIOUS SPATIAL NEIGHBORHOODS. Karl Skretting and John Håkon Husøy FRAME BASED TEXTURE CLASSIFICATION BY CONSIDERING VARIOUS SPATIAL NEIGHBORHOODS Kar Skretting and John Håkon Husøy University of Stavanger, Department of Eectrica and Computer Engineering N-4036 Stavanger,

More information

THE IMPACT OF AN EXECUTIVE LEADERSHIP DEVELOPMENT PROGRAM

THE IMPACT OF AN EXECUTIVE LEADERSHIP DEVELOPMENT PROGRAM Leadership THE IMPACT OF AN EXECUTIVE LEADERSHIP DEVELOPMENT PROGRAM n Jay S. Grider, DO, PhD, Richard Lofgren, MD and Raph Weicke In this artice A eadership deveopment program at an academic medica center

More information

Vendor Performance Measurement Using Fuzzy Logic Controller

Vendor Performance Measurement Using Fuzzy Logic Controller The Journa of Mathematics and Computer Science Avaiabe onine at http://www.tjmcs.com The Journa of Mathematics and Computer Science Vo.2 No.2 (2011) 311-318 Performance Measurement Using Fuzzy Logic Controer

More information

Human Capital & Human Resources Certificate Programs

Human Capital & Human Resources Certificate Programs MANAGEMENT CONCEPTS Human Capita & Human Resources Certificate Programs Programs to deveop functiona and strategic skis in: Human Capita // Human Resources ENROLL TODAY! Contract Hoder Contract GS-02F-0010J

More information

Spherical Correlation of Visual Representations for 3D Model Retrieval

Spherical Correlation of Visual Representations for 3D Model Retrieval Noname manuscript No. (wi be inserted by the editor) Spherica Correation of Visua Representations for 3D Mode Retrieva Ameesh Makadia Kostas Daniiidis the date of receipt and acceptance shoud be inserted

More information

Leakage detection in water pipe networks using a Bayesian probabilistic framework

Leakage detection in water pipe networks using a Bayesian probabilistic framework Probabiistic Engineering Mechanics 18 (2003) 315 327 www.esevier.com/ocate/probengmech Leakage detection in water pipe networks using a Bayesian probabiistic framework Z. Pouakis, D. Vaougeorgis, C. Papadimitriou*

More information

GRADUATE RECORD EXAMINATIONS PROGRAM

GRADUATE RECORD EXAMINATIONS PROGRAM VALIDITY and the GRADUATE RECORD EXAMINATIONS PROGRAM BY WARREN W. WILLINGHAM EDUCATIONAL TESTING SERVICE, PRINCETON, NEW JERSEY Vaidity and the Graduate Record Examinations Program Vaidity and the Graduate

More information

7. Dry Lab III: Molecular Symmetry

7. Dry Lab III: Molecular Symmetry 0 7. Dry Lab III: Moecuar Symmetry Topics: 1. Motivation. Symmetry Eements and Operations. Symmetry Groups 4. Physica Impications of Symmetry 1. Motivation Finite symmetries are usefu in the study of moecues.

More information

AN APPROACH TO THE STANDARDISATION OF ACCIDENT AND INJURY REGISTRATION SYSTEMS (STAIRS) IN EUROPE

AN APPROACH TO THE STANDARDISATION OF ACCIDENT AND INJURY REGISTRATION SYSTEMS (STAIRS) IN EUROPE AN APPROACH TO THE STANDARDSATON OF ACCDENT AND NJURY REGSTRATON SYSTEMS (STARS) N EUROPE R. Ross P. Thomas Vehice Safety Research Centre Loughborough University B. Sexton Transport Research Laboratory

More information

Oligopoly in Insurance Markets

Oligopoly in Insurance Markets Oigopoy in Insurance Markets June 3, 2008 Abstract We consider an oigopoistic insurance market with individuas who differ in their degrees of accident probabiities. Insurers compete in coverage and premium.

More information

The guaranteed selection. For certainty in uncertain times

The guaranteed selection. For certainty in uncertain times The guaranteed seection For certainty in uncertain times Making the right investment choice If you can t afford to take a ot of risk with your money it can be hard to find the right investment, especiay

More information

LADDER SAFETY Table of Contents

LADDER SAFETY Table of Contents Tabe of Contents SECTION 1. TRAINING PROGRAM INTRODUCTION..................3 Training Objectives...........................................3 Rationae for Training.........................................3

More information

Ricoh Healthcare. Process Optimized. Healthcare Simplified.

Ricoh Healthcare. Process Optimized. Healthcare Simplified. Ricoh Heathcare Process Optimized. Heathcare Simpified. Rather than a destination that concudes with the eimination of a paper, the Paperess Maturity Roadmap is a continuous journey to strategicay remove

More information

Storing Shared Data on the Cloud via Security-Mediator

Storing Shared Data on the Cloud via Security-Mediator Storing Shared Data on the Coud via Security-Mediator Boyang Wang, Sherman S. M. Chow, Ming Li, and Hui Li State Key Laboratory of Integrated Service Networks, Xidian University, Xi an, China Department

More information

Sync Kit: A Persistent Client-Side Database Caching Toolkit for Data Intensive Websites

Sync Kit: A Persistent Client-Side Database Caching Toolkit for Data Intensive Websites WWW 2010 Fu Paper Apri 26-30 Raeigh NC USA Sync Kit: A Persistent Cient-Side Database Caching Tookit for Data Intensive Websites Edward Benson, Adam Marcus, David Karger, Samue Madden {eob,marcua,karger,madden}@csai.mit.edu

More information

Pricing and Revenue Sharing Strategies for Internet Service Providers

Pricing and Revenue Sharing Strategies for Internet Service Providers Pricing and Revenue Sharing Strategies for Internet Service Providers Linhai He and Jean Warand Department of Eectrica Engineering and Computer Sciences University of Caifornia at Berkeey {inhai,wr}@eecs.berkeey.edu

More information

INDUSTRIAL PROCESSING SITES COMPLIANCE WITH THE NEW REGULATORY REFORM (FIRE SAFETY) ORDER 2005

INDUSTRIAL PROCESSING SITES COMPLIANCE WITH THE NEW REGULATORY REFORM (FIRE SAFETY) ORDER 2005 INDUSTRIAL PROCESSING SITES COMPLIANCE WITH THE NEW REGULATORY REFORM (FIRE SAFETY) ORDER 2005 Steven J Manchester BRE Fire and Security E-mai: manchesters@bre.co.uk The aim of this paper is to inform

More information

STRATEGIC PLAN 2012-2016

STRATEGIC PLAN 2012-2016 STRATEGIC PLAN 2012-2016 CIT Bishopstown CIT Cork Schoo of Music CIT Crawford Coege of Art & Design Nationa Maritime Coege of Ireand Our Institute STRATEGIC PLAN 2012-2016 Cork Institute of Technoogy (CIT)

More information

Application-Aware Data Collection in Wireless Sensor Networks

Application-Aware Data Collection in Wireless Sensor Networks Appication-Aware Data Coection in Wireess Sensor Networks Xiaoin Fang *, Hong Gao *, Jianzhong Li *, and Yingshu Li +* * Schoo of Computer Science and Technoogy, Harbin Institute of Technoogy, Harbin,

More information

IEICE TRANS. INF. & SYST., VOL.E200 D, NO.1 JANUARY 2117 1

IEICE TRANS. INF. & SYST., VOL.E200 D, NO.1 JANUARY 2117 1 IEICE TRANS. INF. & SYST., VOL.E200 D, NO. JANUARY 27 PAPER Specia Issue on Information Processing Technoogy for web utiization A new simiarity measure to understand visitor behavior in a web site Juan

More information

Preschool Services Under IDEA

Preschool Services Under IDEA Preschoo Services Under IDEA W e don t usuay think of Specific Learning Disabiities in connection with chidren beow schoo age. When we think about chidren age birth to six, we think first of their earning

More information

NCH Software FlexiServer

NCH Software FlexiServer NCH Software FexiServer This user guide has been created for use with FexiServer Version 1.xx NCH Software Technica Support If you have difficuties using FexiServer pease read the appicabe topic before

More information

On Capacity Scaling in Arbitrary Wireless Networks

On Capacity Scaling in Arbitrary Wireless Networks On Capacity Scaing in Arbitrary Wireess Networks Urs Niesen, Piyush Gupta, and Devavrat Shah 1 Abstract arxiv:07112745v3 [csit] 3 Aug 2009 In recent work, Özgür, Lévêque, and Tse 2007) obtained a compete

More information

Protection Against Income Loss During the First 4 Months of Illness or Injury *

Protection Against Income Loss During the First 4 Months of Illness or Injury * Protection Against Income Loss During the First 4 Months of Iness or Injury * This note examines and describes the kinds of income protection that are avaiabe to workers during the first 6 months of iness

More information