Enhancing the Relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects

Mohammad Farahmand, Abu Bakar MD Sultan, Masrah Azrifah Azmi Murad, Fatimah Sidi
me@shahroozfarahmand.com
Computer Science/Information System, University of Putra (UPM), Kuala Lumpur, 43400, Malaysia

Abstract- Search engine ranking is one of the most important areas that researchers are involved with. To increase the traffic on websites, Search Engine Optimization (SEO) provides many options [1]; however, this process is costly and time consuming. This paper describes work on an initial model for handling some of the SEO factors in order to increase the Term Frequency (TF). The proposed model shows evidence of better utilization of the aforementioned parts. In addition, the model provides users with the words and their values, based on lexical and semantic approaches, for composing a new title, keywords, or description, so as to increase the frequency of the keywords used in the Meta tags and the title. The results show a significant enhancement of TF under the proposed model, TF being one of the most important factors in search engine ranking algorithms [2].

Keywords: Relativity; Title tag; keyword distribution; Meta tags; SEO; Term Frequency.

1. INTRODUCTION

Ranking is the most important operation of search engines on the web, since searching for specific terms through a search engine requires appropriate ranking to achieve good results. In this context, SEO is the process of improving the volume and quality of traffic to a website from search engines via natural search results. Achieving a high rank in search engines depends on more than 200 parameters [3]. Site owners and web designers are able to customize and improve their ranking if they manage all these parameters and use them in the right position and status. Among these parameters, there is a logical and obvious relationship between the Title tag, the Keywords and Description Meta tags (TKD), and the content of the website. Relativity between the Title tag, the Keywords and Description Meta tags (the Title tag in particular), and the body of a web page is essential. When a search engine spider analyzes a web page, it determines keyword relevancy based on an algorithm, a fairly large and complex formula that calculates how web pages are classified [2]. Thus, the better the TKD terms are distributed in the body, the higher the ranking will be; this relationship and distribution lead to a higher position in the Search Engine Result Page (SERP).
2. BACKGROUND

In the field of SEO, many studies have already been conducted and many theories have been developed. Today, designers and site owners have realized what they want: a good ranking in the search results page. Therefore, many SEO specialists have designed and developed different models to achieve a satisfactory result. A meticulous and comprehensive work on the extraction of keyphrases from HTML pages was done in [4]. The author presented a new keyphrase extractor for web pages that requires no training; however, the work was based on the assumption that most well-written web pages suggest keyphrases through their internal structure. It is very fast and flexible, and its results are state of the art in keyword extraction. Another noteworthy work evaluated some of the factors involved in search engine ranking algorithms. The chosen factors were based on physical capacity concepts, such as the number of bytes of the original document and the average term length. Their experiments did not engage the major factors that users can actually manipulate [5], so the method is not very practical. There is also a model for generating keywords for search engine advertisements based on the semantic similarity between terms [6]. To find the frequency of a term in a document, Ramos used TF.IDF (Term Frequency - Inverse Document Frequency) to determine the relevance of words to document queries [7]. TF.IDF is one of the most popular weighting measures, and much research has been done based on it.

Challenge with TF.IDF: The TF.IDF weight is a measure often used in information retrieval and text mining. Given a collection of documents $\{d_1, \ldots, d_n\}$ and a query of $m$ words $\{w_1, \ldots, w_m\}$, TF.IDF measures the relevance of each document to the given query. The TF.IDF formulas are presented as follows:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \qquad (1)$$

where $n_{i,j}$ is the number of occurrences of the considered term $t_i$ in document $d_j$, and the denominator is the number of occurrences of all terms in document $d_j$. The inverse document frequency is a measure of the general importance of the term, obtained by dividing the number of all documents by the number of documents containing the term and then taking the logarithm of that quotient:

$$\mathrm{IDF}_i = \log \frac{|D|}{|\{d_j : t_i \in d_j\}|} \qquad (2)$$

where $|D|$ is the total number of documents in the corpus and $|\{d_j : t_i \in d_j\}|$ is the number of documents in which the term $t_i$ appears (that is, $n_{i,j} \neq 0$). Then:

$$\mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \cdot \mathrm{IDF}_i \qquad (3)$$

For example, consider a document containing 100 words in which the word "hat" appears 3 times. Following the formulas above, the term frequency (TF) for "hat" equals 0.03 (3/100). Now, assume that there are 10 million documents and "hat" appears in one thousand of them. Then, the inverse document frequency is calculated as ln(10,000,000/1,000) ≈ 9.21. The TF.IDF score is the product of these quantities: 0.03 × 9.21 ≈ 0.28. In effect, by using TF.IDF, a weight is calculated for each document $d_i$ with respect to a word $w_i$, indicating how important $d_i$ is in relation to $w_i$. Thus, TF.IDF is a weighting method for documents.
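As a concrete illustration of equations (1)-(3), the following minimal C# sketch reproduces the worked example above. It is an illustration only, not part of the paper's implementation; the method names Tf and Idf are hypothetical.

```csharp
using System;

class TfIdfExample
{
    // Eq. (1): term frequency = occurrences of the term / total terms in the document.
    static double Tf(int termCount, int totalTerms) => (double)termCount / totalTerms;

    // Eq. (2): inverse document frequency = ln(total documents / documents containing the term).
    static double Idf(int totalDocs, int docsWithTerm) => Math.Log((double)totalDocs / docsWithTerm);

    static void Main()
    {
        // Worked example from the text: "hat" appears 3 times in a 100-word document
        // and occurs in 1,000 of 10,000,000 documents.
        double tf = Tf(3, 100);             // 0.03
        double idf = Idf(10000000, 1000);   // ln(10,000) ≈ 9.21
        Console.WriteLine($"TF = {tf:F2}, IDF = {idf:F2}, TF.IDF = {tf * idf:F2}"); // TF.IDF ≈ 0.28
    }
}
```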
3. THE PROPOSED MODEL

Proposed Model Application: Initially, the application was developed on the .NET Framework 3.0 with C#. After importing an HTML file into the program or entering a URL (depending on Internet connection availability), the application loads the data from the file or website and informs the user with an alert that the file is ready to process. Furthermore, the sentence recognition method makes use of '.', '!' and '?' to recognize sentences; that is, the model assumes the web designer or website owner has used correct punctuation. In addition, the total number of words is calculated after removing the stop words, in order to increase the worthiness of the results: if the stop words are not removed, the numerical results are very small and unreliable. Moreover, the model shows its suggestions in a grid, indicating whether each word is good for the title or should be changed or manipulated in the content.

4. METHODOLOGY

This model uses a One-Group Pretest/Post-test design for the experimental procedure; that is, the results are shown as a comparison between before and after using the proposed model. This comparison is performed using one of the most famous and important measures, Term Frequency (TF), and the statistics are computed on a generated dataset. The major steps of the proposed model are as follows. The first step is creating the dataset, which fulfills the requirements of our model (standard tags, Title and Meta tags, etc.). The second step is data pre-processing; in this step, stop words are removed to bring more accuracy to the results. The next step is character analysis (calculating TF); in this step, the Term Frequency of each term in the body and in the TKD is calculated. Keyword analysis and generation is the fourth step, in which the keywords are analyzed in order to be suggested to the users. The last step is recalculating the TF to obtain a tangible result of the proposed model.

One of the most important weight measures is Term Frequency (TF), the count or repetition of a word in the document [2]. Thus, if the frequency of a word in a document is increased (in a proper and legal way, without spamming), the importance of that word goes up. This technique is exactly repetition of the word; consequently, it has no semantic component and covers only the lexical side. To increase the relativity in the semantic aspect, the WordNet semantic dictionary has been exploited [8]. This repository helps to find the synonyms of the words of the Title tag and the Keywords and Description Meta tags. The proposed model finds the words in the content based on these synonyms and suggests them to the users in order to gain higher relativity between the content and the Meta tags. Using this solution, the relativity increases in both the semantic and the lexical aspect. Moreover, the TF factor increases as well.

5. RESULTS

The TF before and after applying the proposed model was compared. In Fig. 1, the TF of the keywords predefined in the document by the website owner or developer is compared with the TF after customizing the keywords in the TKD according to the model's suggestions. In fact, one of the duties of the model is to find the keywords that are repeated in the content more often than the keywords predefined in the TKD; another duty is to find the synonyms of the keywords using the WordNet repository. For the calculation of TF, the frequency of the keywords is calculated one by one. Then the total count of the keywords is divided by the total number of content words (after removing the stop words), as illustrated in the sketch below.
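The following is a minimal C# sketch of this pre-processing and TF calculation, assuming a small illustrative stop-word list; the names (Preprocess, TfCalculation) and the sample text are hypothetical and do not come from the paper's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class TfCalculation
{
    // Small illustrative stop-word list; the real model would use a complete one.
    static readonly HashSet<string> StopWords = new HashSet<string>
        { "the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "on" };

    // Pre-processing: tokenize the body text and remove stop words.
    static List<string> Preprocess(string body) =>
        Regex.Split(body.ToLowerInvariant(), @"\W+")
             .Where(w => w.Length > 0 && !StopWords.Contains(w))
             .ToList();

    static void Main()
    {
        string body = "Search engine optimization improves the ranking of a website. " +
                      "A higher ranking brings more traffic to the website.";
        string[] keywords = { "ranking", "website", "traffic" };

        var words = Preprocess(body);

        // Frequency of each keyword, calculated one by one.
        var counts = keywords.ToDictionary(k => k, k => words.Count(w => w == k));
        foreach (var kv in counts)
            Console.WriteLine($"count({kv.Key}) = {kv.Value}");

        // Total keyword count divided by total content words (after stop-word removal).
        double tf = (double)counts.Values.Sum() / words.Count;
        Console.WriteLine($"TF = {tf:F4}");
    }
}
```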
In addition, the diversity of the TF is calculated as the difference between the updated TF and the original TF; it shows the improvement of the total term frequency after applying the model to the dataset. Fig. 1 shows that the updated TF has a better result compared with the original TF, and Fig. 2 shows the percentage of TF diversity before and after applying the model.

[Fig. 1. TF comparison (before and after applying the proposed model): updated TF vs. original TF across the 100 documents of the dataset.]

[Fig. 2. The percentage of TF diversity across the 100 documents; most documents show positive diversity, with a few negative values.]
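As a sketch of how these quantities can be computed: the per-document diversity is taken here as the relative change between updated and original TF expressed as a percentage (an assumption, since the paper does not give the exact formula), and the overall improvement follows equation (4) below. The TF values are hypothetical.

```csharp
using System;
using System.Linq;

class TfDiversity
{
    static void Main()
    {
        // Hypothetical per-document TF values before and after applying the model.
        double[] original = { 0.070, 0.085, 0.091 };
        double[] updated  = { 0.110, 0.128, 0.140 };

        // Per-document TF diversity, assumed to be the relative change in percent.
        for (int i = 0; i < original.Length; i++)
        {
            double diversity = (updated[i] - original[i]) / original[i] * 100;
            Console.WriteLine($"Doc {i + 1}: diversity = {diversity:F2}%");
        }

        // Overall improvement, as in equation (4):
        // (average updated TF / average original TF - 1) * 100.
        double improvement = (updated.Average() / original.Average() - 1) * 100;
        Console.WriteLine($"Model improvement = {improvement:F2}%");
    }
}
```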
[Fig. 3. TF comparison: average TF before applying the model (0.0816) vs. average updated TF (0.1287).]

Fig. 3 shows the enhancement achieved by the proposed model on one of the most popular keyword weighting measures. The average TF before applying the model was 0.0816 on the dataset. After applying the model to the dataset, the average TF was 0.1287. This means the improvement of the TF measure is approximately 57.72%, computed as:

$$\text{Model Improvement\%} = \left( \frac{\text{average of updated TF}}{\text{average of original TF}} - 1 \right) \times 100 \qquad (4)$$

6. CONCLUSION

The results show a significant enhancement of TF under the proposed model; applying the model to the dataset yields an improvement of 57.72%. However, several improvements are still in progress. One ongoing task is creating a dataset of low-quality HTML files and comparing the results of the model between the two datasets in order to assess the efficiency and accuracy of the model. Other measures and other datasets can be considered in future work.

7. REFERENCES

[1]. W3C, World Wide Web Consortium. Available from: http://www.w3.org/.
[2]. Thurow, S., Search Engine Visibility. New Riders, Indianapolis, 2008.
[3]. Evans, M.P., Analysing Google rankings through search engine optimization data. Internet Research, 2007, 17(1): p. 21-37.
[4]. Humphreys, J.B.K., PhraseRate: An HTML Keyphrase Extractor. 2002.
[5]. Bifet, A., et al., An Analysis of Factors Used in Search Engine Ranking, in Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005, p. 207-305.
[6]. Abhishek, V. and K. Hosanagar, Keyword generation for search engine advertising using semantic similarity between terms. ACM, Minneapolis, MN, USA, 2007, p. 89-94.
[7]. Ramos, J., Using TF-IDF to Determine Word Relevance in Document Queries. Department of Computer Science, Rutgers University, 2001.
[8]. Miller, G.A., WordNet: a lexical database for English. Communications of the ACM, 1995, 38(11): p. 39-41.