SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

Transcription

1 SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute of Technology, Thovalai. * Assistant Professor, Department of Electronics and Communication Engineering, CSI Institute of Technology, Thovalai. ABSTRACT Many problems in natural language processing, data mining, information retrieval, and bioinformatics can be formalized as string transformation, which is a task as follows. Given an input string the system generates the k most likely output strings corresponding to the input string. This project proposes a novel and probabilistic approach to string transformation which is both accurate and efficient. The approach includes the use of a log linear model, a method for training the model, and an algorithm for generating the top k candidates, whether there is or is not a predefined dictionary. The log linear model is defined as a conditional probability distribution of an output string and a rule set for the transformation conditioned on an input string. The learning method employs maximum likelihood estimation for parameter estimation. The string generation algorithm based on pruning is guaranteed to generate the optimal top k candidates. The proposed method is applied to correction of spelling errors in queries as well as reformulation of queries in web search. Experimental results on large scale data show that the proposed approach is very accurate and efficient improving upon existing methods in terms of accuracy and efficiency in different settings. I INTRODUCTION Many problems in natural language processing, data mining, information retrieval, and bioinformatics can be formalized as string transformation, which is a task as follows. The main objective of the project is to develop an efficient string transformation technique while performing data mining in various research domains or can be used for search engine optimization. This project proposes a novel and probabilistic approach called Trivial String Reformulation Technique to string transformation which is both accurate and efficient. The main objective of this technique is to develop a sufficient framework that can learn the strings and can generate strings with the leaning process and finally can transform the string to give an accurate and effective data mining results. 1

2 II OVERVIEW OF THE PROJECT This project addresses string transformation, which is an essential problem, in many applications. In natural language processing, pronunciation generation, spelling error correction, word transliteration, and word stemming can all be formalized as string transformation. String transformation can also be used in query reformulation and query suggestion in search. In data mining, string transformation can be employed in the mining of synonyms and database record matching. As many of the above are online applications, the transformation must be conducted not only accurately but also efficiently. String transformation can be defined in the following way. Given an input string and a set of operators, we are able to transform the input string to the k most likely output strings by applying a number of operators. Here the strings can be strings of words, characters, or any type of tokens. Each operator is a transformation rule that defines the replacement of a substring with another substring. The likelihood of transformation can represent similarity, relevance, and association between two strings in a specific application. Although certain progress has been made, further investigation of the task is still necessary, particularly from the viewpoint of enhancing both accuracy and efficiency, which is precisely the goal of this work String transformation can be conducted at two different settings, depending on whether or not a dictionary is used. When a dictionary is used, the output strings must exist in the given dictionary, while the size of the dictionary can be very large. Without loss of generality, we specifically study correction of spelling errors in queries as well as reformulation of queries in web search in this paper. However, efficiency is not an important factor taken into consideration in these methods. In contrast, our work in this paper aims to learn a model for string transformation which can achieve both high accuracy and efficiency. There are three fundamental problems with string transformation: How to define a model which can achieve both high accuracy and efficiency, How to accurately and efficiently train the model from training instances, How to efficiently generate the top k output strings given the input string, with or without using a dictionary. A PROBLEM STATEMENT String transformation has many applications in data mining, natural language processing, information retrieval, and bioinformatics. String transformation has been studied in different specific tasks such as database record matching, spelling error correction, query reformulation and synonym mining. The major difference between our work and the existing work is that we focus on enhancement of both accuracy and efficiency of string transformation. The types of the transformation rules are predefined such as stemming, prefix, suffix and acronym. Okazaki incorporated rules into an L1- regularized logistic regression model and utilized the model for string transformation. 2

3 Dreyer also proposed a log linear model for string transformation, with features representing latent alignments between the input and output strings. Finite-state transducers are employed to generate the candidates. Efficiency is not their main consideration since it is used for offline application. Our model is different from Dreyer model in several points. Particularly our model is designed for both accurate and efficient string transformation, with transformation rules as features and non-positive values as feature weights. Okazaki model is largely different from the model proposed in this project although both are discriminative models. The approximation string transformation search, which is the problem of identifying strings in a given dictionary that are similar to an input string. In approximate string search, it is usually assumed that the model (similarity or distance) is fixed and the objective is to efficiently find all the strings in the dictionary. Most existing methods attempt to find all the candidates within a fixed range and employ. There are also methods for finding the top k candidates by using n-grams. Efficiency is the major focus for these methods and the similarity functions in them are predefined. In contrast, our work in this paper aims to learn and utilize a similarity function which can achieve both high accuracy and efficiency. B PROPOSED SYSTEM This project proposes a novel and probabilistic approach called Trivial String Reformulation Technique (TSRT) to string transformation which is both accurate and efficient. (1) a log-linear (discriminative) model for string transformation, (2) an effective and accurate algorithm for model learning, and (3) an efficient algorithm for string generation. The log linear model is defined as a conditional probability distribution of an output string and a rule set for the transformation given an input string. The learning method is based on maximum likelihood estimation. Thus, the model is trained toward the objective of generating strings with the largest likelihood given input strings. The generation algorithm efficiently performs the top k candidate s generation using top k pruning. It is guaranteed to find the best k candidates without enumerating all the possibilities. A tree is employed to index transformation rules in the model. When a dictionary is used in the transformation a tree is used to efficiently retrieve the strings in the dictionary. We empirically evaluated our method in spelling error correction of queries and reformulation of queries in web search. The experimental results on the two problems demonstrate that our method consistently and significantly performs better than the baseline methods of generative model and logistic regression model in terms of accuracy and efficiency. We have also applied our method to the Microsoft Speller Challenge and found that our method can achieve a performance comparable to those of the best performing systems in the challenge. In this section, we introduce how to efficiently generate the top k output strings. We 3

4 employ top k pruning, which can guarantee to find the optimal k output strings. We also exploit two special data structures to facilitate efficient generation. We index the rules with an tree. When a dictionary is utilized, we index the dictionary with a tree. Spelling Error Correction Spelling error correction normally consists of candidate generation and candidate selection.. For single-word candidate generation, a rule based approach is commonly used. The use of edit distance is a typical approach, which exploits operations of character deletion, insertion and substitution. Some methods generate candidates within a fixed range of edit distance or different ranges for strings with different lengths. Other methods learn weighted edit distance to enhance the representation power. Edit distance does not take context information into consideration. To address the challenge, some researchers proposed using a large number of substitution rules in this project, we propose using a discriminative model. Both approaches have pros and cons, and normally a discriminative model can work better than a generative model because it is trained for enhancing accuracy. Since users behavior of misspelling and correction can be frequently observed in web search log data, it has been proposed to mine spelling-error and correction pairs by using search log data. The mined pairs can be directly used in spelling error correction. Methods of selecting spelling and correction pairs with a maximum entropy model and similarity functions have been developed. Only high frequency pairs can be found from log data, however. In this project, we work on candidate generation at the character level, which can be applied to spelling error correction for both high and low frequency words. Query Reformulation Query reformulation involves rewriting the original query with its similar queries and enhancing the effectiveness of search. Most existing methods manage to mine transformation rules from pairs of queries in the search logs. One represents an original query and the other represents a similar query. First identifies phrase-based transformation rules from query pairs, and then segments the input query into phrases, and generates a number of candidates based on substitutions of each phrase using the rules. The weights of the transformation rules are calculated based on log likelihood ratio. A query dictionary is used in this case. The existing methods mainly focused on how to extract useful patterns and rank the candidates with the patterns, while the models for candidate generation are simple. In this project, we work on query reformulation as an example of string transformation and we employ a more sophisticated model. Rule Index The rule index stores all the rules and their weights using an Aho-Corasick tree which can make the references of rules very efficient. The AC tree is a trie with failure links, on which the Aho-Corasick string matching algorithm can be executed. The Aho-Corasick 4

5 algorithm is a well-known dictionary-matching algorithm which can quickly locate the elements of a finite set of strings (here strings are the αs in the rules α β) within an input string. The time complexity of the algorithm is of linear order in the length of input string plus the number of matched entries. We construct an AC tree using all the α s in the rules. Each leaf node corresponds to an α and the corresponding βs are stored in an associated list in decreasing order of rule weights λ. One may want to further improve the efficiency by using a tree rather than a ranking list to store the β s associated with the same α. However the improvement would not be significant because the number of β s associated with each α is usually small. In string generation, given an input string, we first retrieve all the applicable rules and their weights from the AC tree in time complexity of input string length plus number of matched entries. Top k Pruning The string generation problem amounts to that of finding the top k output strings given the input string. We assume that the input string s i consists of tokens ^t 1 t 2 tn $.In the figure, for example, c11 is the token transformed from t1, and c 1 is the token transformed from t4t5t6. In this case, the top k candidates correspond with the top k paths in the lattice. Note that the length of the candidate string can be shorter or longer than the input string. We employ the top k pruning techniques to efficiently conduct the string generation task. We use triple (pos; string; score) to denote each path generated so far, corresponding to the position, the content and the score of the path. Qpathis a priority queue storing paths, and it is initialized with path (1, ^,0). Stopkis a set storing the best k candidates and their scores (string; score) found so far and it is empty at the beginning. The algorithm expands the path by following the path from its current position. After one path is processed, another path is popped up from the priority queue with heuristics. The algorithm uses the top k pruning strategy to eliminate unlikely paths and thus improve efficiency. If the score of a path is smaller than the minimum score of the top k list Stop k, then the path will be discarded and not be used further. This pruning strategy works, because the weights of rules are all non positive and applying additional rules cannot generate a candidate with higher probability. Therefore, it is not difficult to prove that the best k candidates in terms of the scores can be guaranteed to be found. In this project we use heuristics, which are intuitively reasonable and work well in practice. III SYSTEM ARCHITECTURE 5

6 Fig 3.1 System Architecture IV DATABASE DESIGN Database design is recognized as a standard of MIS and is available for virtually every computer system. In database environment, common data are available which several authorized user can use. General theme behind a database is to integrate all the information. A database is an organized mechanism that has the capability of storing information through which a user can retrieve stored information in an effective and efficient manner. The data is the purpose of any database and must be protected. The database design is a two level process. In the first step, user requirements are gathered together and a database is designed which will meet these requirements as clearly as possible. This step is called Information Level Design and it is taken independent of any individual DBMS. In the second step, this Information level design is transferred into a design for the specific DBMS that will be used to implement the system in question. This step is called Physical Level Design, concerned with the characteristics of the specific DBMS that will be used. A database design runs parallel with the system design. The organization of the data in the database is aimed to achieve the following two major objectives. Data Integrity Data independence To structure the data so that there is no repetition of data, this helps in saving. A database is an integrated collection of data and provides a centralized access to the data. Usually centralized data managing software is called RDBMS. The main significant difference between RDBMS and other type of DBMS is the separation of data as seen by the program and data as stored on direct access to storage device. This is the difference between logical and physical data. While designing a database, several objectives must be considered. They include the redundancy of Easy to leaving and use Data independence More information at low cost Accuracy and integrity While creating the tables, above objectives were considered to avoid redundancy of data. V CONCLUSION The proposed model called Trivial String Reformulation Technique (TSRT) to 6

7 string transformation which is both accurate and efficient introduced in this system has successfully solves several issues of string transformation in data mining. The approach includes the use of a log linear model, a method for training the model, and an algorithm for generating the top k candidates whether there is or is not a predefined dictionary. Two specific applications are addressed with this method are spelling error correction of queries and query reformulation in web search. Experimental results on very large data sets show that the proposed method improves upon the baselines in terms of accuracy and efficiency. REFERENCES 1. Li, Y. Zhang, M. Zhu, and M. Zhou, (2006) Exploring distributional similarity based models for query spelling correction, in Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ser. ACL 06. Morristown, NJ, USA: Association for Computational Linguistics, pp R. Golding and D. Roth,( 1999) A winnow-based approach to context-sensitive spelling correction, Mach. Learn., vol. 34, pp , February. 3. Guo, G. Xu, H. Li, and X. Cheng,( 2008) A unified and discriminative model for query refinement, in Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR 08. New York, NY, USA: ACM, pp Behm, S. Ji, C. Li, and J. Lu, (2009) Space-constrained gram-based indexing for efficient approximate string search, in Proceedings of the 2009 IEEE International Conference on Data Engineering, ser. ICDE 09. Washington, DC, USA: IEEE Computer Society, p Brill and R. C. Moore,( 2000) An improved error model for noisy channel spelling correction, in Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ser. ACL 00. Morristown, NJ, USA: Association for Computational Linguistics, pp Okazaki, Y. Tsuruoka, S. Ananiadou, and J. Tsujii,( 2008) A discriminative candidate generator for string transformations, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP 08. Morristown, NJ, USA: Association for Computational Linguistic, pp Dreyer, J. R. Smith, and J. Eisner,( 2008) Latent-variable modeling of string transductions with finite-state methods, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP 08. Stroudsburg, PA, USA: Association for Computational Linguistics, pp

8 8. Arasu, S. Chaudhuri, and R. Kaushik,( 2009) Learning string transformations from examples, Proc. VLDB Endow., vol. 2, pp Tejada, C. A. Knoblock, and S. Minton,( 2002) Learning domainindependent string transformation weights for high accuracy object identification, in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD 02. New York, NY, USA: ACM, pp Hadjieleftheriou and C. Li,( 2009) Efficient approximate search on string collections, Proc. VLDB Endow., vol. 2, pp Li, B. Wang, and X. Yang,( 2007) Vgram: improving performance of approximate queries on string collections using variable-length grams, in Proceedings of the 33rd international conference on Very large data bases, ser. VLDB 07. VLDB Endowment, pp Yang, B. Wang, and C. Li, (2008) Costbased variable-length-gram selection for string collections to support approximate queries efficiently, in Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ser. SIGMOD 08. New York, NY, USA: ACM, pp Li, J. Lu, and Y. Lu,( 2008) Efficient merging and filtering algorithms for approximate string searches, in Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ser. ICDE 08. Washington, DC, USA: IEEE Computer Society, pp Ji, G. Li, C. Li, and J. Feng,( 2009) Efficient interactive fuzzy keyword search, in Proceedings of the 18th international conference on World wide web, ser. WWW 09. New York, NY, USA: ACM,p Vernica and C. Li,( 2009) Efficient top-k algorithms for fuzzy search in string collections, in Proceedings of the First International Workshop on Keyword Search on Structured Data, ser. KEYS 09. New York, NY, USA: ACM, pp Yang, J. Yu, and M. Kitsuregawa,( 2010) Fast algorithms for top-k approximate string matching, in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, ser. AAAI 10, pp Whitelaw, B. Hutchinson, G. Y. Chung, and G. Ellis,( 2009) Using the web for language independent spellchecking and autocorrection, in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP 09. Morristown, NJ, USA: Association for Computational Linguistics, pp S. Ristad and P. N. Yianilos, (2000) Learning string-edit distance, IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp