Crowdsourcing Entity Resolution: a Short Overview and Open Issues

Xiao Chen
Otto-von-Guericke University Magdeburg
xiao.chen@ovgu.de

27th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), Magdeburg, Germany. Copyright is held by the author/owner(s).

ABSTRACT

Entity resolution (ER) is the process of identifying records that refer to the same real-world entity. Although automatic algorithms for this problem have been developed for many years, their accuracy remains far from perfect. Crowdsourcing, a technology under active investigation, solicits contributions from a crowd of people to complete tasks via crowdsourced marketplaces. One of its advantages is that it injects human reasoning into problems that are still hard for computers, which makes it suitable for ER and offers an opportunity to achieve higher accuracy. As crowdsourcing ER is still a relatively new area in data processing, this paper provides an overview and a brief classification of the current state of research in crowdsourcing ER. In addition, it identifies some open issues that will serve as a starting point for our future research.

General Terms
Theory

Keywords
Crowdsourcing, Entity Resolution, Record Linkage

1. INTRODUCTION

Entity resolution (ER) is the process of identifying records that refer to the same real-world entity. It plays a vital role not only in traditional scenarios of data cleaning and data integration, but also in web search, online product comparison, etc. Various automatic algorithms have been developed to solve ER problems. They generally fall into two classes of techniques: similarity-based and learning-based approaches. Similarity-based techniques use similarity functions: record pairs whose similarity value is above a preset threshold are considered matches. Learning-based techniques model ER as a classification problem and train classifiers to identify matching and non-matching record pairs [14]. However, the accuracy of both classes of computer-based algorithms is still far from perfect, particularly for big data without fixed types and structures.

Crowdsourcing was first introduced in 2006 [6] and has been gaining interest in recent years. Current approaches investigate how crowdsourcing can improve the accuracy of ER, because people are better at solving this problem than computers. Although research on crowdsourcing ER started only recently, there have already been several significant contributions. This paper gives an overview of the current state of research in crowdsourcing ER, classifies and compares the related work, and points out some open issues.

The rest of the paper is organized as follows. As crowdsourcing is not yet fully established as a technology in the database community, Section 2 presents background information on crowdsourcing ER. Section 3 presents the current state of research in crowdsourcing ER and classifies and compares the different approaches. Open issues are presented in Section 4. Finally, Section 5 concludes the paper.

2. CROWDSOURCING

In their early days, computers were mainly used in specific domains that required strong computational power. To this day, however, many tasks, especially those requiring complex knowledge and abstract reasoning, are not well supported by computers. Crowdsourcing derives its name from "crowd" and "sourcing", as it outsources tasks to the crowd via the Internet. Crowdsourced marketplaces, such as Amazon's Mechanical Turk (MTurk), make it convenient for companies, institutions, or individuals to recruit large numbers of people to complete tasks that are difficult for computers or cannot be performed well by computers with acceptable effort. Entity resolution is one such task to which crowdsourcing can be applied successfully. People are better at solving ER than computers because of their common sense and domain knowledge. For example, people around Magdeburg easily recognize that Otto-von-Guericke-University and Magdeburg University refer to the same university, while computer algorithms are very likely to judge them to be different universities.

The tasks on MTurk are called Human Intelligence Tasks (HITs). Figure 1 depicts a possible HIT for ER. One HIT is usually assigned to several workers to guarantee the quality of the result. Each worker is paid a small amount of money per HIT; 3-5 cents are typical on MTurk.

[Figure 1: A HIT for ER]

Current research on crowdsourcing and databases covers two aspects (see Figure 2): on the one hand, databases should be adjusted to support crowd-based data processing (see [3, 10, 11]); on the other hand, crowd-based technology can make data processing more effective and broader in scope (see [2, 5, 8, 9]). More effective data processing means that query results can be more accurate when crowdsourcing is leveraged. Broader scope means that queries traditional databases cannot handle, such as queries over incomplete data or queries with subjective operations, can be answered by using crowdsourcing.

[Figure 2: Research directions between crowdsourcing and databases]

Crowdsourcing ER belongs to the second area, i.e., crowd-based data processing. Specifically, crowdsourcing ER is an important part of join queries. Join queries allow establishing connections among data contained in different tables and comparing the values contained in them [1].
This comparison is not necessarily simple, since the same real-world fact may be expressed in different ways, values may have to be compared across different media, and some comparisons require subjective human judgment. ER is therefore a necessary step for answering join queries well.

3. OVERVIEW AND CLASSIFICATION OF CURRENT RESEARCH STATE

Figure 3 presents a classification of the main research on crowdsourcing ER. From the perspective of crowdsourcing, most of the research uses crowdsourcing only for identifying matching record pairs. Only one recent approach proposes leveraging crowdsourcing for the whole process of ER, which gives a novel and valuable view on crowdsourcing ER.

The workflows of the approaches that leverage crowdsourcing solely for identifying matching record pairs vary widely. In general, the workflow of crowdsourcing ER has been developed step by step. Initially, ER was proposed to be solved by crowdsourcing alone; by now, a much more complete workflow has formed by integrating different lines of research (see Figure 4). In addition, these approaches focus on different problems, and many novel ideas and algorithms have been developed to optimize crowdsourcing ER. Gokhale et al. presented the Corleone approach, which leverages crowdsourcing for the whole process of ER and is described as hands-off crowdsourcing [4].

Subsection 3.1 describes the research that leverages crowdsourcing only for the identification of matching record pairs. The Corleone approach, which leverages crowdsourcing for the whole process of ER, is then discussed in Subsection 3.2.

3.1 Crowdsourcing for Identifying Matching Record Pairs

Figure 4 depicts a complete hybrid workflow for ER, which combines all proposals for optimizing the workflow in crowdsourcing ER.
Instead of letting the crowd answer HITs that contain matching questions for all records, the first step in the complete workflow is to choose a suitable machine-based method to generate a candidate set that contains all record pairs with their corresponding similarities. Then pruning is performed to reduce the total number of required HITs. To reduce the number of required HITs further, transitive relations are applied during crowdsourcing, i.e., if the label of a pair can be deduced by a transitive relation, the pair does not need to be crowdsourced. For instance, given three records a, b, and c, the first type of transitive relation states that if a matches b and b matches c, then a matches c. The second type states that if a matches b and b does not match c, then a does not match c.

After all record pairs have been identified by crowdsourcing or transitive relations, a global analysis can be performed on the initial result. The global analysis was suggested by Whang et al. [16]. They apply transitive relations only as an example of a global analysis after the process of crowdsourcing. In the integrated ER workflow, which already applies transitive relations during crowdsourcing, the global analysis may be implemented by other techniques, such as correlation clustering, to further improve the result [16]. After the global analysis, the final result is obtained. The three segments in the dashed boxes are optional, i.e., some research does not include all of them.

In the following, the specific workflows of the different approaches are presented and their contributions are summarized. One important assumption needs to be pointed out: most of the research assumes that people do not make mistakes when answering HITs. Therefore, each HIT is assigned to only one worker. The problems caused by mistakes that crowds may make are addressed, for instance, in [7, 12].

[Figure 3: Classification for the research on crowdsourcing ER]

3.1.1 Crowd-based Only ER

Marcus et al. proposed to solve ER tasks based on crowdsourcing only [9]. They suggested using Qurk [10], a declarative query processing system, to implement crowd-based joins. Their workflow for ER is purely crowd-based, i.e., crowdsourcing is used to complete the whole process of ER. Correspondingly, their workflow does not contain any of the three segments in the dashed boxes. As mentioned in Section 2, although crowdsourcing can improve the accuracy of ER, it causes monetary costs and is much slower than automatic algorithms. To save money and reduce latency, Marcus et al. proposed two optimization methods, summarized here.

Batching: the basic interface of crowdsourcing ER is similar to Figure 1. It asks workers to answer only one question in each HIT and is called simple join. The question in a simple join is pair-based, i.e., it asks workers whether two records belong to the same entity. For a task that identifies the same entities across two sets of records with m and n records, respectively, m*n HITs are needed. Two optimizations are provided. The first is called naive batching. It asks workers to answer b questions in each HIT, each of which is also pair-based. In this way, the total number of HITs is reduced to (m*n)/b. The second is called smart batching. Instead of asking workers whether two records belong to the same entity, it asks workers to find all matching pairs in two record lists. If the first list contains a records selected from one data set and the second list contains b records selected from the other data set, the total number of HITs is (m*n)/(a*b).

Feature filtering optimization: for some kinds of records, certain features are useful as join predicates.
For instance, given two groups of photos of people, if each photo is labeled with the gender of the person shown, only pairs of photos with matching gender need to be asked about. The feature should be chosen properly; otherwise it causes a large amount of unnecessary labeling work and does not help to reduce the total number of HITs.

Even with the optimization methods described above, the crowd-based only workflow does not scale to ever larger data sets. Therefore, the research in the rest of this subsection develops hybrid workflows for crowdsourcing ER.

3.1.2 Hybrid Computer-Crowd Entity Resolution Without Considering Transitive Relation During the Process of Crowdsourcing

Wang et al. proposed an initial hybrid workflow [14], which contains only the segment in the red dashed box. Its pruning technique simply discards the pairs with similarities under a threshold. The work describes two types of HIT generation approaches: pair-based HIT generation and cluster-based HIT generation. These two types are similar to the naive batching and smart batching mentioned above in the optimization part of [9]. A pair-based HIT consists of k pairs of records, where k is the permitted number of record pairs in one HIT; it is limited because too many pairs may reduce the accuracy of the crowd's answers. A cluster-based HIT consists of a group of k individual records rather than pairs, but is generated from the given pairs. Wang et al. defined cluster-based HIT generation formally and proved that the problem is NP-hard. Therefore, they reduced the cluster-based HIT generation problem to the k-clique edge covering problem and then applied an approximation algorithm to it. However, this algorithm fails to generate a minimum number of HITs. Therefore, a heuristic two-tiered approach is proposed to generate as few cluster-based HITs as possible.

[Figure 4: Complete workflow of crowdsourcing ER]

In addition, the following conclusions are drawn from their own experimental results:

1. The two-tiered approach generates fewer cluster-based HITs than existing algorithms.
2. The initial hybrid workflow achieves better accuracy than machine-based methods and generates fewer HITs than the crowd-based only workflow.
3. Cluster-based HITs have lower latency than pair-based HITs.

Wang et al. showed that cluster-based HITs have lower latency than pair-based HITs while reaching the same accuracy, which provides a basis for preferring cluster-based methods in HIT design. However, the introduced approximation algorithm performs worse than a random algorithm, so it is not presented in this paper.

Whang et al. added a global analysis step to the initial workflow introduced by Wang et al., but they did not consider using transitive relations to reduce the number of HITs either [16]. The global analysis after the process of crowdsourcing uses transitive relations as an example and permits other techniques, such as correlation clustering, to improve the accuracy. Whang et al. focused their research on developing algorithms that ask the crowd the HITs with the biggest expected gain. Instead of simply discarding the pairs with similarity scores under a threshold, as done by Wang et al. [14], they develop an exhaustive algorithm, which computes the expected gain of asking the crowd about each record pair and then chooses the question with the highest estimated gain for crowdsourcing. The exhaustive estimation algorithm is #P-hard. Therefore, the proposed GCER algorithm produces an approximate result within polynomial time and includes the following optimizations of the exhaustive algorithm:

1. It only computes the expected gain of record pairs with very high or low similarities.
2. It uses a Monte-Carlo approximation in place of the exact computation of the expected gain for one record pair.
3. The results of earlier calculations are shared with later calculations instead of being recomputed.
4. Instead of resolving all records, it only resolves records that may be influenced by the current question.

Finally, because the GCER algorithm is still complex, they propose a very simple half algorithm for choosing questions for crowdsourcing. The half algorithm chooses the record pairs with a matching probability closest to 0.5 and asks the crowd whether they match.

In summary, it is a notable contribution that Whang et al. use transitive relations to remedy the accuracy loss caused by HITs that people cannot answer correctly. However, Whang et al. have not applied transitive relations during the process of crowdsourcing. Besides, although several optimizations are proposed to improve the performance of the exhaustive algorithm, the complexity of the algorithm is still very high, which makes it infeasible in practice.

3.1.3 Hybrid Computer-Crowd ER Considering Transitive Relation During the Process of Crowdsourcing

Vesdapunt et al. [13] and Wang et al. [15] considered the transitive relation in the process of crowdsourcing. Therefore, their workflows contain the segments in the red and green dashed boxes in Figure 4.
Their research defines the problem of using transitive relations in crowdsourcing formally and proposes a hybrid labeling framework that combines transitive relations and crowdsourcing. The basic idea is that if the label of a record pair can be deduced from the matching results already obtained by crowdsourcing, there is no need to crowdsource it. A record pair can be deduced if transitive relations exist. Two types of transitive relations are defined: positive transitive relations and negative transitive relations, which correspond to the two types of transitive relations described in the first paragraph of Section 3.1. A record pair (x_1, x_n) can be deduced to be a matching pair only if all pairs along the path from x_1 to x_n are matching pairs. A record pair (x_1, x_n) can be deduced to be a non-matching pair only if exactly one pair along the path from x_1 to x_n is a non-matching pair.

Because transitive relations are applied, the number of HITs is significantly affected by the labeling order. In their work, Wang et al. therefore suggest first asking the crowd about truly matching record pairs and then about truly non-matching record pairs. Because it cannot be known in advance whether two records really match, HITs are first generated for record pairs with higher matching probabilities. However, the latency of getting a question answered by crowdsourcing is long, and it is not feasible to publish a single record pair, wait for its result, and only then decide whether the next record pair can be deduced or has to be crowdsourced. To solve this problem, a parallel labeling algorithm is devised to reduce the labeling time. This algorithm identifies a set of pairs that can be crowdsourced in parallel and asks the crowd about these record pairs simultaneously. It then repeats the identification and crowdsourcing steps according to the obtained answers until all record pairs are resolved. The proposed algorithm is evaluated both in simulation and on a real crowdsourcing marketplace.
The evaluation results show that the approaches using transitive relations save more monetary cost and time than existing methods, with little loss in result quality.

However, Vesdapunt et al. proved that the algorithm proposed by Wang et al. may be Ω(n) worse than optimal, where n is the number of records in the database. In terms of this worst-case bound, the algorithm is therefore no better than any other algorithm, because Vesdapunt et al. also prove that any algorithm is at most Ω(n) worse than optimal. Therefore, Vesdapunt et al. presented their own strategies to minimize the number of HITs when transitive relations are considered during crowdsourcing. One simple strategy is to ask the crowd about all record pairs in a random order, without considering the matching probabilities of the record pairs. Although this strategy is random, it is proved to be at most o(k) worse than optimal, where k is the expected number of clusters. For both approaches a graph-clustering-based method is adopted to efficiently show the relations among possibly matching records. Since the expected number of clusters cannot exceed the number of records, the random algorithm has a better worst-case guarantee than the algorithm proposed by Wang et al. Another strategy, called Node Priority Querying, is also proved to be at most o(k) worse than optimal.

These algorithms are evaluated on several real-world data sets. The results are not stable across data sets. Overall, however, the node priority algorithm outperforms both the random algorithm and the algorithm proposed by Wang et al., and the random algorithm is superior to the algorithm proposed by Wang et al. in some cases.

In summary, both Vesdapunt et al. and Wang et al. considered transitive relations during the process of crowdsourcing, which opens a new perspective for further decreasing the number of HITs and thereby reducing latency and monetary cost. However, their proposed algorithms do not perform stably across data sets and can be optimized or redesigned.
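The deduction rule described in this subsection can be illustrated with a small sketch. It is not the implementation of [13] or [15]: the class `TransitiveLabels`, the function `label_pairs`, the `ask_crowd` callback, and the toy data are hypothetical names introduced here, and a union-find structure is just one convenient way to track positive and negative transitive relations. The sketch assumes error-free answers and labels pairs one at a time in descending order of matching probability, without the parallel labeling step.

```python
class TransitiveLabels:
    """Stores crowd answers and deduces labels via transitive relations.

    Assumes consistent (error-free) answers, as most surveyed work does.
    """

    def __init__(self, records):
        self.parent = {r: r for r in records}  # union-find forest of matches
        self.non_match = set()                 # pairs of roots known to differ

    def find(self, r):
        while self.parent[r] != r:
            self.parent[r] = self.parent[self.parent[r]]  # path halving
            r = self.parent[r]
        return r

    def deduce(self, a, b):
        """Return True/False if the label follows from earlier answers, else None."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True   # positive transitive relation
        if frozenset((ra, rb)) in self.non_match:
            return False  # negative transitive relation
        return None

    def add_answer(self, a, b, is_match):
        ra, rb = self.find(a), self.find(b)
        if not is_match:
            self.non_match.add(frozenset((ra, rb)))
        elif ra != rb:
            self.parent[rb] = ra
            # Re-key non-match facts of the absorbed cluster to the new root.
            self.non_match = {
                frozenset(ra if x == rb else x for x in pair)
                for pair in self.non_match
            }


def label_pairs(candidates, ask_crowd):
    """Label (a, b, probability) triples; return (labels, number of pairs asked)."""
    records = {r for a, b, _ in candidates for r in (a, b)}
    state = TransitiveLabels(records)
    labels, asked = {}, 0
    # Likely matches first, so that later pairs can be deduced more often.
    for a, b, _ in sorted(candidates, key=lambda c: c[2], reverse=True):
        answer = state.deduce(a, b)
        if answer is None:
            answer = ask_crowd(a, b)  # stands in for publishing a HIT
            asked += 1
            state.add_answer(a, b, answer)
        labels[(a, b)] = answer
    return labels, asked


# Hypothetical toy data: records a, b, c are one entity, d is another.
truth = {"a": 1, "b": 1, "c": 1, "d": 2}
candidates = [("a", "b", 0.9), ("b", "c", 0.8), ("a", "c", 0.7),
              ("a", "d", 0.4), ("c", "d", 0.3), ("b", "d", 0.2)]
labels, asked = label_pairs(candidates, lambda x, y: truth[x] == truth[y])
print(asked, "of", len(candidates), "pairs were crowdsourced")
```

On this toy input only three of the six candidate pairs need to be sent to the crowd; the remaining three follow from positive or negative transitive relations. Asking about the non-matching pairs first would require more questions, which is the effect of the labeling order discussed above.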
3.2 Applying Crowdsourcing for the Whole ER Process

All approaches introduced so far apply crowdsourcing only for identifying whether record pairs match. The proposed algorithms have to be implemented by developers. The need of enterprises to solve ER tasks is growing rapidly. If an enterprise has to employ a developer for each ER task, the total payment for developers across many tasks becomes large. Moreover, some private users with only a small budget cannot afford to employ developers to help them complete ER tasks by leveraging crowdsourcing. To solve this problem, Gokhale et al. proposed hands-off crowdsourcing for ER, which means that the entire ER process is completed by crowdsourcing without any developer [4]. This hands-off crowdsourcing approach for ER is called Corleone. It can generate blocking rules, train a learning-based matcher, estimate the matching accuracy, and even implement an iteration process by crowdsourcing. Each of these aspects is handled by a separate module, and a specific implementation is presented for each module.

4. OPEN ISSUES

This section presents the open issues in crowdsourcing ER that are, from the author's perspective, the most important ones to be addressed by future research.

Machine-based methods and pruning approaches: the machine-based methods used in most approaches are similarity-based, and the pruning methods simply discard the record pairs whose matching probabilities are above or below given thresholds. Such approaches cannot achieve satisfactory results in increasingly complex data environments, because record pairs with very high matching probabilities may refer to different entities, and vice versa. Therefore, machine-based methods should be extended to learning-based ones, and pruning approaches should be designed rigorously to avoid discarding record pairs that actually need further confirmation.
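To make the criticized setting concrete, the following minimal sketch shows similarity-based candidate generation with two-threshold pruning. It is an illustration only: the Jaccard token similarity, the function names `jaccard` and `prune`, the thresholds 0.3 and 0.9, and the sample records are all assumptions chosen for this example and are not taken from the surveyed papers.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of the lower-cased word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def prune(records, low, high):
    """Split all record pairs by similarity.

    Pairs with similarity >= high are accepted as matches without asking,
    pairs with similarity < low are discarded as non-matches, and only the
    pairs in between are kept as candidates for crowdsourcing.
    """
    auto_match, to_crowd = [], []
    for (i, r), (j, s) in combinations(enumerate(records), 2):
        sim = jaccard(r, s)
        if sim >= high:
            auto_match.append((i, j, sim))
        elif sim >= low:
            to_crowd.append((i, j, sim))
    return auto_match, to_crowd


# Hypothetical product records.
records = [
    "apple ipad 2 16gb white",
    "apple ipad 2 16gb white",
    "ipad two 16gb white wifi",
    "sony vaio laptop",
]
auto_match, to_crowd = prune(records, low=0.3, high=0.9)
print("accepted automatically:", [(i, j) for i, j, _ in auto_match])
print("sent to the crowd:", [(i, j) for i, j, _ in to_crowd])
```

With these thresholds the identical records 0 and 1 are accepted automatically, the pairs (0, 2) and (1, 2) go to the crowd, and every pair involving record 3 is discarded. Raising `low` above 3/7 would silently drop the pairs (0, 2) and (1, 2) as well, which is the kind of loss that a more rigorously designed pruning step would have to avoid.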
Only transitive relations applied: transitive relations, which were not considered in the early stage of crowdsourcing ER, have by now been widely adopted. Other techniques, such as correlation clustering, could also be applied to ER.

Limited comparability of results: both [13] and [15] develop strategies to minimize the number of HITs that need to be sent to the crowd. Both evaluate their own algorithms on different data sets and compare their performance with that of other existing algorithms. However, some of the evaluation results are inconsistent. The reason is that the performance of the same algorithm varies across data sets, which may lead to different conclusions. Therefore, research on which algorithm is more suitable for which kinds of data sets is necessary for the development of crowdsourcing ER. In addition, new algorithms that perform stably on different kinds of data sets can be designed.

Only initial optimization strategies: leveraging crowdsourcing for the whole process of ER, i.e., hands-off crowdsourcing, is only at the beginning of its development. The following optimization techniques can be developed. First, the current hands-off crowdsourcing for ER is based on the setting of identifying matching record pairs from two relational tables, which may be extended to other ER scenarios. Second, the current strategy for extracting samples to generate blocking rules is quite simple, and better sampling strategies should be explored.

Ontologies and indexes: once a decision is made, the knowledge injected by the crowd is largely lost. Another interesting research question is how this feedback could be gathered and described, for instance by an ontology.

5. CONCLUSIONS AND FUTURE WORK

This paper gives an overview of the current state of research in crowdsourcing ER. Most of the approaches focus on leveraging crowdsourcing to verify whether record pairs match. From the early stage of the research to more recent approaches, the workflow was optimized step by step and more aspects of the crowdsourcing process were considered. It developed from a crowd-based only workflow to a hybrid computer-crowd workflow that considers transitive relations. However, this does not mean that the research on the initial workflow is less significant. On the contrary, every approach is valuable and contributes to various research aspects of crowdsourcing ER. Most recently, a novel perspective on crowdsourcing has been proposed, which extends the object of crowdsourcing and leverages the crowd not only to verify the matching of record pairs, but also to implement the algorithms, train a learning-based matcher, estimate the matching accuracy, and even implement an iteration process.
This novel idea makes crowdsourcing more widely applicable and further reduces the cost of employing dedicated people. In Section 4, some important open issues are presented. My future work will focus on the first possible research direction, i.e., exploring techniques to be applied in crowdsourcing ER, such as correlation clustering.

6. ACKNOWLEDGMENTS

I would like to thank the China Scholarship Council for funding this research. I also express my gratitude to Eike Schallehn and Ziqiang Diao for their valuable feedback.

7. REFERENCES

[1] P. Atzeni. Database systems: concepts, languages & architectures. Bookmantraa.com.
[2] S. B. Davidson, S. Khanna, T. Milo, and S. Roy. Using the crowd for top-k and group-by queries. In Proceedings of the 16th International Conference on Database Theory. ACM.
[3] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. CrowdDB: answering queries with crowdsourcing. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM.
[4] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM.
[5] S. Guo, A. Parameswaran, and H. Garcia-Molina. So who won?: dynamic max discovery with the crowd. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM.
[6] J. Howe. The rise of crowdsourcing. Wired magazine, 14(6):1-4.
[7] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation. ACM.
[8] A. Marcus, D. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. Proceedings of the VLDB Endowment, 6(2).
[9] A. Marcus, E. Wu, D. Karger, S. Madden, and R. Miller. Human-powered sorts and joins. Proceedings of the VLDB Endowment, 5(1):13-24.
[10] A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller. Demonstration of Qurk: a query processor for human operators. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM.
[11] H. Park, H. Garcia-Molina, R. Pang, N. Polyzotis, A. Parameswaran, and J. Widom. Deco: A system for declarative crowdsourcing. Proceedings of the VLDB Endowment, 5(12).
[12] P. Venetis and H. Garcia-Molina. Quality control for comparison microtasks. In Proceedings of the First International Workshop on Crowdsourcing and Data Mining. ACM.
[13] N. Vesdapunt, K. Bellare, and N. Dalvi. Crowdsourcing algorithms for entity resolution. Proceedings of the VLDB Endowment, 7(12).
[14] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11).
[15] J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM.
[16] S. E. Whang, P. Lofgren, and H. Garcia-Molina. Question selection for crowd entity resolution. Proceedings of the VLDB Endowment, 6(6).


More information

Research on the cloud platform resource management technology for surveillance video analysis

Research on the cloud platform resource management technology for surveillance video analysis Research on the cloud platform resource management technology for surveillance video analysis Yonglong Zhuang 1*, Xiaolan Weng 2, Xianghe Wei 2 1 Modern Educational Technology Center, Huaiyin rmal University,

More information

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Outline Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain 2012 Presented By : KHALID ALKOBAYER Crowdsourcing and Crowdclustering

More information

Efficient and Effective Duplicate Detection Evaluating Multiple Data using Genetic Algorithm

Efficient and Effective Duplicate Detection Evaluating Multiple Data using Genetic Algorithm Efficient and Effective Duplicate Detection Evaluating Multiple Data using Genetic Algorithm Dr.M.Mayilvaganan, M.Saipriyanka Associate Professor, Dept. of Computer Science, PSG College of Arts and Science,

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

Dynamic Querying In NoSQL System

Dynamic Querying In NoSQL System Dynamic Querying In NoSQL System Reshma Suku 1, Kasim K. 2 Student, Dept. of Computer Science and Engineering, M. E. A Engineering College, Perinthalmanna, Kerala, India 1 Assistant Professor, Dept. of

More information

Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

More information

Automatic Annotation Wrapper Generation and Mining Web Database Search Result

Automatic Annotation Wrapper Generation and Mining Web Database Search Result Automatic Annotation Wrapper Generation and Mining Web Database Search Result V.Yogam 1, K.Umamaheswari 2 1 PG student, ME Software Engineering, Anna University (BIT campus), Trichy, Tamil nadu, India

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Linked Data and the Models

Linked Data and the Models A semantically enabled architecture for crowdsourced Linked Data management Elena Simperl Institute AIFB Karlsruhe Institute of Technology Germany elena.simperl@kit.edu Maribel Acosta Institute AIFB Karlsruhe

More information

A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster

A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster , pp.11-20 http://dx.doi.org/10.14257/ ijgdc.2014.7.2.02 A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster Kehe Wu 1, Long Chen 2, Shichao Ye 2 and Yi Li 2 1 Beijing

More information

Random forest algorithm in big data environment

Random forest algorithm in big data environment Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

More information

A Service Revenue-oriented Task Scheduling Model of Cloud Computing

A Service Revenue-oriented Task Scheduling Model of Cloud Computing Journal of Information & Computational Science 10:10 (2013) 3153 3161 July 1, 2013 Available at http://www.joics.com A Service Revenue-oriented Task Scheduling Model of Cloud Computing Jianguang Deng a,b,,

More information

Crowdfunding Support Tools: Predicting Success & Failure

Crowdfunding Support Tools: Predicting Success & Failure Crowdfunding Support Tools: Predicting Success & Failure Michael D. Greenberg Bryan Pardo mdgreenb@u.northwestern.edu pardo@northwestern.edu Karthic Hariharan karthichariharan2012@u.northwes tern.edu Elizabeth

More information

Client Perspective Based Documentation Related Over Query Outcomes from Numerous Web Databases

Client Perspective Based Documentation Related Over Query Outcomes from Numerous Web Databases Beyond Limits...Volume: 2 Issue: 2 International Journal Of Advance Innovations, Thoughts & Ideas Client Perspective Based Documentation Related Over Query Outcomes from Numerous Web Databases B. Santhosh

More information

SCHEDULING IN CLOUD COMPUTING

SCHEDULING IN CLOUD COMPUTING SCHEDULING IN CLOUD COMPUTING Lipsa Tripathy, Rasmi Ranjan Patra CSA,CPGS,OUAT,Bhubaneswar,Odisha Abstract Cloud computing is an emerging technology. It process huge amount of data so scheduling mechanism

More information

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information Eric Hsueh-Chan Lu Chi-Wei Huang Vincent S. Tseng Institute of Computer Science and Information Engineering

More information

Crowdsourcing for Big Data Analytics

Crowdsourcing for Big Data Analytics KYOTO UNIVERSITY Crowdsourcing for Big Data Analytics Hisashi Kashima (Kyoto University) Satoshi Oyama (Hokkaido University) Yukino Baba (Kyoto University) DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY

More information

Load Balancing in Distributed Data Base and Distributed Computing System

Load Balancing in Distributed Data Base and Distributed Computing System Load Balancing in Distributed Data Base and Distributed Computing System Lovely Arya Research Scholar Dravidian University KUPPAM, ANDHRA PRADESH Abstract With a distributed system, data can be located

More information

Effective Third Party Auditing in Cloud Computing

Effective Third Party Auditing in Cloud Computing 2014 28th International Conference on Advanced Information Networking and Applications Workshops Effective Third Party Auditing in Cloud Computing Mohammed Hussain and Mohamed Basel Al-Mourad Department

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Network mining for crime/fraud detection. FuturICT CrimEx January 26th, 2012 Jan Ramon

Network mining for crime/fraud detection. FuturICT CrimEx January 26th, 2012 Jan Ramon Network mining for crime/fraud detection FuturICT CrimEx January 26th, 2012 Jan Ramon Overview Administrative data and crime/fraud Data mining and related domains Data mining in large networks Opportunities

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation: CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

More information

Big Data Research in the AMPLab: BDAS and Beyond

Big Data Research in the AMPLab: BDAS and Beyond Big Data Research in the AMPLab: BDAS and Beyond Michael Franklin UC Berkeley 1 st Spark Summit December 2, 2013 UC BERKELEY AMPLab: Collaborative Big Data Research Launched: January 2011, 6 year planned

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations Binomol George, Ambily Balaram Abstract To analyze data efficiently, data mining systems are widely using datasets

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

Blog Post Extraction Using Title Finding

Blog Post Extraction Using Title Finding Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School

More information

Introducing diversity among the models of multi-label classification ensemble

Introducing diversity among the models of multi-label classification ensemble Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and

More information

Data Mining and Database Systems: Where is the Intersection?

Data Mining and Database Systems: Where is the Intersection? Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise

More information

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department

More information

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

More information

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute

More information

How To Solve The Kd Cup 2010 Challenge

How To Solve The Kd Cup 2010 Challenge A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn

More information

IntelliNet Delivers APM Service with CA Nimsoft Monitor

IntelliNet Delivers APM Service with CA Nimsoft Monitor IntelliNet Delivers APM Service with CA Nimsoft Monitor 2 IntelliNet Delivers APM Service with CA Nimsoft Monitor ca.com Email communications are vital to the productivity, collaboration and safety of

More information

Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL

Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL Jasna S MTech Student TKM College of engineering Kollam Manu J Pillai Assistant Professor

More information

Preview of Award 1320357 Annual Project Report Cover Accomplishments Products Participants/Organizations Impacts Changes/Problems

Preview of Award 1320357 Annual Project Report Cover Accomplishments Products Participants/Organizations Impacts Changes/Problems Preview of Award 1320357 Annual Project Report Cover Accomplishments Products Participants/Organizations Impacts Changes/Problems Cover Federal Agency and Organization Element to Which Report is Submitted:

More information

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis IOSR Journal of Computer Engineering (IOSRJCE) ISSN: 2278-0661, ISBN: 2278-8727 Volume 6, Issue 5 (Nov. - Dec. 2012), PP 36-41 Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

More information

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network , pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and

More information

Research of Smart Distribution Network Big Data Model

Research of Smart Distribution Network Big Data Model Research of Smart Distribution Network Big Data Model Guangyi LIU Yang YU Feng GAO Wendong ZHU China Electric Power Stanford Smart Grid Research Institute Smart Grid Research Institute Research Institute

More information

Automated Collaborative Filtering Applications for Online Recruitment Services

Automated Collaborative Filtering Applications for Online Recruitment Services Automated Collaborative Filtering Applications for Online Recruitment Services Rachael Rafter, Keith Bradley, Barry Smyth Smart Media Institute, Department of Computer Science, University College Dublin,

More information

A Novel Cloud Computing Data Fragmentation Service Design for Distributed Systems

A Novel Cloud Computing Data Fragmentation Service Design for Distributed Systems A Novel Cloud Computing Data Fragmentation Service Design for Distributed Systems Ismail Hababeh School of Computer Engineering and Information Technology, German-Jordanian University Amman, Jordan Abstract-

More information

A Prediction-Based Transcoding System for Video Conference in Cloud Computing

A Prediction-Based Transcoding System for Video Conference in Cloud Computing A Prediction-Based Transcoding System for Video Conference in Cloud Computing Yongquan Chen 1 Abstract. We design a transcoding system that can provide dynamic transcoding services for various types of

More information

III. DATA SETS. Training the Matching Model

III. DATA SETS. Training the Matching Model A Machine-Learning Approach to Discovering Company Home Pages Wojciech Gryc Oxford Internet Institute University of Oxford Oxford, UK OX1 3JS Email: wojciech.gryc@oii.ox.ac.uk Prem Melville IBM T.J. Watson

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

A COMBINED TEXT MINING METHOD TO IMPROVE DOCUMENT MANAGEMENT IN CONSTRUCTION PROJECTS

A COMBINED TEXT MINING METHOD TO IMPROVE DOCUMENT MANAGEMENT IN CONSTRUCTION PROJECTS A COMBINED TEXT MINING METHOD TO IMPROVE DOCUMENT MANAGEMENT IN CONSTRUCTION PROJECTS Caldas, Carlos H. 1 and Soibelman, L. 2 ABSTRACT Information is an important element of project delivery processes.

More information

Intelligent Heuristic Construction with Active Learning

Intelligent Heuristic Construction with Active Learning Intelligent Heuristic Construction with Active Learning William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather E H U N I V E R S I T Y T O H F G R E D I N B U Space is BIG! Hubble Ultra-Deep Field

More information

HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering

HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering Chang Liu 1 Jun Qu 1 Guilin Qi 2 Haofen Wang 1 Yong Yu 1 1 Shanghai Jiaotong University, China {liuchang,qujun51319, whfcarter,yyu}@apex.sjtu.edu.cn

More information

Some Research Challenges for Big Data Analytics of Intelligent Security

Some Research Challenges for Big Data Analytics of Intelligent Security Some Research Challenges for Big Data Analytics of Intelligent Security Yuh-Jong Hu hu at cs.nccu.edu.tw Emerging Network Technology (ENT) Lab. Department of Computer Science National Chengchi University,

More information

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data

More information

Prediction of Heart Disease Using Naïve Bayes Algorithm

Prediction of Heart Disease Using Naïve Bayes Algorithm Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,

More information

Application of Data Mining Techniques in Intrusion Detection

Application of Data Mining Techniques in Intrusion Detection Application of Data Mining Techniques in Intrusion Detection LI Min An Yang Institute of Technology leiminxuan@sohu.com Abstract: The article introduced the importance of intrusion detection, as well as

More information

Visualization methods for patent data

Visualization methods for patent data Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes

More information

Privacy Preserving Outsourcing for Frequent Itemset Mining

Privacy Preserving Outsourcing for Frequent Itemset Mining Privacy Preserving Outsourcing for Frequent Itemset Mining M. Arunadevi 1, R. Anuradha 2 PG Scholar, Department of Software Engineering, Sri Ramakrishna Engineering College, Coimbatore, India 1 Assistant

More information

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Big Data Mining Services and Knowledge Discovery Applications on Clouds Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy talia@dimes.unical.it Data Availability or Data Deluge? Some decades

More information

Tomer Shiran VP Product Management MapR Technologies. November 12, 2013

Tomer Shiran VP Product Management MapR Technologies. November 12, 2013 Predictive Analytics with Hadoop Tomer Shiran VP Product Management MapR Technologies November 12, 2013 1 Me, Us Tomer Shiran VP Product Management, MapR Technologies tshiran@maprtech.com MapR Enterprise-grade

More information

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

More information

IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS

IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS V.Sudhakar 1 and G. Draksha 2 Abstract:- Collective behavior refers to the behaviors of individuals

More information

Big Data with Rough Set Using Map- Reduce

Big Data with Rough Set Using Map- Reduce Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

An Efficient Load Balancing Technology in CDN

An Efficient Load Balancing Technology in CDN Issue 2, Volume 1, 2007 92 An Efficient Load Balancing Technology in CDN YUN BAI 1, BO JIA 2, JIXIANG ZHANG 3, QIANGGUO PU 1, NIKOS MASTORAKIS 4 1 College of Information and Electronic Engineering, University

More information

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,

More information

UPS battery remote monitoring system in cloud computing

UPS battery remote monitoring system in cloud computing , pp.11-15 http://dx.doi.org/10.14257/astl.2014.53.03 UPS battery remote monitoring system in cloud computing Shiwei Li, Haiying Wang, Qi Fan School of Automation, Harbin University of Science and Technology

More information

Cloud Storage-based Intelligent Document Archiving for the Management of Big Data

Cloud Storage-based Intelligent Document Archiving for the Management of Big Data Cloud Storage-based Intelligent Document Archiving for the Management of Big Data Keedong Yoo Dept. of Management Information Systems Dankook University Cheonan, Republic of Korea Abstract : The cloud

More information

MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

More information

Turker-Assisted Paraphrasing for English-Arabic Machine Translation

Turker-Assisted Paraphrasing for English-Arabic Machine Translation Turker-Assisted Paraphrasing for English-Arabic Machine Translation Michael Denkowski and Hassan Al-Haj and Alon Lavie Language Technologies Institute School of Computer Science Carnegie Mellon University

More information

Exploitation of Server Log Files of User Behavior in Order to Inform Administrator

Exploitation of Server Log Files of User Behavior in Order to Inform Administrator Exploitation of Server Log Files of User Behavior in Order to Inform Administrator Hamed Jelodar Computer Department, Islamic Azad University, Science and Research Branch, Bushehr, Iran ABSTRACT All requests

More information

How To Find Local Affinity Patterns In Big Data

How To Find Local Affinity Patterns In Big Data Detection of local affinity patterns in big data Andrea Marinoni, Paolo Gamba Department of Electronics, University of Pavia, Italy Abstract Mining information in Big Data requires to design a new class

More information

A Symptom Extraction and Classification Method for Self-Management

A Symptom Extraction and Classification Method for Self-Management LANOMS 2005-4th Latin American Network Operations and Management Symposium 201 A Symptom Extraction and Classification Method for Self-Management Marcelo Perazolo Autonomic Computing Architecture IBM Corporation

More information

Effective User Navigation in Dynamic Website

Effective User Navigation in Dynamic Website Effective User Navigation in Dynamic Website Ms.S.Nithya Assistant Professor, Department of Information Technology Christ College of Engineering and Technology Puducherry, India Ms.K.Durga,Ms.A.Preeti,Ms.V.Saranya

More information

Evaluation of Different Task Scheduling Policies in Multi-Core Systems with Reconfigurable Hardware

Evaluation of Different Task Scheduling Policies in Multi-Core Systems with Reconfigurable Hardware Evaluation of Different Task Scheduling Policies in Multi-Core Systems with Reconfigurable Hardware Mahyar Shahsavari, Zaid Al-Ars, Koen Bertels,1, Computer Engineering Group, Software & Computer Technology

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan Handwritten Signature Verification ECE 533 Project Report by Ashish Dhawan Aditi R. Ganesan Contents 1. Abstract 3. 2. Introduction 4. 3. Approach 6. 4. Pre-processing 8. 5. Feature Extraction 9. 6. Verification

More information

What s next for the Berkeley Data Analytics Stack?

What s next for the Berkeley Data Analytics Stack? What s next for the Berkeley Data Analytics Stack? Michael Franklin June 30th 2014 Spark Summit San Francisco UC BERKELEY AMPLab: Collaborative Big Data Research 60+ Students, Postdocs, Faculty and Staff

More information

Term extraction for user profiling: evaluation by the user

Term extraction for user profiling: evaluation by the user Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,

More information

Machine Learning for Naive Bayesian Spam Filter Tokenization

Machine Learning for Naive Bayesian Spam Filter Tokenization Machine Learning for Naive Bayesian Spam Filter Tokenization Michael Bevilacqua-Linn December 20, 2003 Abstract Background Traditional client level spam filters rely on rule based heuristics. While these

More information

Verifying Business Processes Extracted from E-Commerce Systems Using Dynamic Analysis

Verifying Business Processes Extracted from E-Commerce Systems Using Dynamic Analysis Verifying Business Processes Extracted from E-Commerce Systems Using Dynamic Analysis Derek Foo 1, Jin Guo 2 and Ying Zou 1 Department of Electrical and Computer Engineering 1 School of Computing 2 Queen

More information

Email Spam Detection Using Customized SimHash Function

Email Spam Detection Using Customized SimHash Function International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

More information

IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD

IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD Journal homepage: www.mjret.in ISSN:2348-6953 IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD Deepak Ramchandara Lad 1, Soumitra S. Das 2 Computer Dept. 12 Dr. D. Y. Patil School of Engineering,(Affiliated

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Challenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015

Challenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015 Challenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015 Dr. Daisy Zhe Wang Director of Data Science Research Lab University of Florida, CISE

More information

GEO-VISUALIZATION SUPPORT FOR MULTIDIMENSIONAL CLUSTERING

GEO-VISUALIZATION SUPPORT FOR MULTIDIMENSIONAL CLUSTERING Geoinformatics 2004 Proc. 12th Int. Conf. on Geoinformatics Geospatial Information Research: Bridging the Pacific and Atlantic University of Gävle, Sweden, 7-9 June 2004 GEO-VISUALIZATION SUPPORT FOR MULTIDIMENSIONAL

More information