PROCEEDINGS OF THE 10TH ANNUAL INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND EDUCATION IN COMPUTER SCIENCE 2014

IEEE Sponsor
With financial support of the Central Strategic Development Fund of NBU
July 2014, Albena, Bulgaria
Chairmen: Ivan Landjev (Bulgaria), Rumen Stainov (Germany) and Lou Chitkushev (USA)
General Secretaries: Petya Assenova (Bulgaria), Vijay Kanabar (USA)

CSECS 2014, pp
The 10th Annual International Conference on Computer Science and Education in Computer Science, July , Albena, Bulgaria

SOME IMPROVEMENTS OF THE OPEN TEXT SUMMARIZER ALGORITHM USING HEURISTICS

Filip ANDONOV, Velina SLAVOVA
NBU, Computer Science Department

Abstract: A number of heuristics to improve the method used by the Open Text Summarizer library are proposed.

Keywords: automatic summary generation

ACM Classification Keywords: Natural language processing, Text analysis

Introduction

Open Text Summarizer is an implementation of a grammar-agnostic method for creating a summary of a text. Although the method is very simple, the idea behind it is powerful enough to let it compete, in terms of quality of results, with much more complicated methods. Because it is independent of the language of the text, there is room for improvement by adding further heuristics without compromising (much) its language independence.

State of the art

Nowadays the Internet provides a vast ocean of unstructured information in text form. Harvesting this data and analyzing it for a given purpose is generally approached in two ways: data mining and data structuring (semantic technologies). The second approach is the orthodox one. Unfortunately, most of the data is still in unstructured form, so for practical reasons the data mining approach is preferable for now. The reasons for processing all this data range from marketing and business intelligence, through research, to intelligence and military purposes.

Automatic generation of summaries is not new [Luhn, 1958]. Many tools implementing different methods exist. Popular areas of research on this topic are latent semantic analysis [Olmos et al, 2009], clustering [Amini et al, 2005], evolutionary algorithms [Alguliev and Aliguliyev, 2009] and hidden Markov models [Conroy and Oleary, 2001]. These methods are applied to tasks such as web search, document mining and opinion mining, all of which basically make the netizen's life easier when dealing with large texts that contain little information of importance to a given person. It is easily observable that, in order to get results, researchers use sophisticated methods and scientific instruments that are based on analytical tools and are grammar-specific. Nevertheless, the performance of these instruments is not perfect.

In recent years the heavy use of web content created by users on the one hand, and the decline of printed media on the other, have made opinion recognition an attractive topic. It turned out that the quick spreading of opinions in social media could topple governments and spark revolutions. The problem is that the texts on the Internet are too many and too long. We assume that a text summary will give concentrated information about the polarity of the expressed opinion.

The approach

The aim is to create a simple tool based on the general regularities observed in language expression. These are not necessarily studied and described, as they are not the subject of grammar or other branches of linguistics, but they are observable, which means statistically detectable. For example, in discourse, when one needs to express an opinion, he/she uses more frequently the concepts and the features that are in the focus of what is meant to be expressed (in words). This led to the idea of building the tool around word-form scores. We think that abstraction-based summarization is hard enough to be more of a scientific gymnastics exercise than a practical solution, so we focus on extraction-based summarization.

Figure 1. Basic scheme (boxes: Text; Concepts and features frequency; Center of the saying; Text filtering; Summary)

There are two main steps: detection of the focus of the saying, and creation of the summary by means of generating regular language expressions.

There are two main approaches to analyzing texts. The first (abstraction-based) tries to analyze the text and to rephrase it in a concise way. This is what humans do when writing a summary. The other (extraction-based) tries to extract key sentences from the text in some way and to combine them into a structured (but shorter) text. One famous algorithm that uses this approach is TextRank. Because we use an extraction-based method, the text of the summary is not generated but filtered from the original text.

The detection of the focus

The main heuristic here is the one used in the Open Text Summarizer (OTS). It says, basically, that the most frequent words not included in the stop-word list are the keywords of the text, and that sentences are scored by the number of keyword occurrences in them. The word forms which express the focus, however, belong to different parts of speech, so we suggest detecting them by means of a dictionary. Unfortunately, the dictionary approach is not perfect: in English and in many other languages different parts of speech can share the same word form, for example "walk" as a noun and "walk" as a verb. Still, the goal here is not perfect detection, because the algorithm we are trying to improve does not use information about parts of speech at all. After this step the different sub-forms are stemmed, because for frequency counts it is important to have the basic form.

The next step is to put to use the information about the part of speech of each word by applying different weights. [Nicholls and Song, 2009] have shown that nouns are the center of conceptualization, so we give them higher weights, lower ones for verbs and even lower ones for adjectives. We had to use heuristics in order to fit the weights better.

Table 1

Part of speech    Weight
Verb              0.5
Noun              1.0
Adjective         0.2
Unknown           1.0

We proceed by counting the number of occurrences of all the words in the text (as OTS dictates), but with the weights applied. Instead of using the number of occurrences directly as a measure of a word's importance, we use another heuristic, similar to the idea of measuring entropy. The OTS heuristic is that the more frequently a word is used in the text, the more important it is. On the other hand, every summarization algorithm uses some form of stop-word list. The idea is that some words that are very common in all texts ("the" in English, for example) do not contribute any meaning to the text's topic, so we remove them so that they do not contaminate the top positions in the frequency list. This idea can be stretched further: if we have a large language corpus, we can determine which words are common to all texts, so even if they are common in our text they hold no discriminative power.
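The weighted counting step described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the weights come from Table 1, while the stop-word list, the part-of-speech dictionary and the function name `weighted_counts` are made up for the example, and stemming is left out for brevity.

```python
from collections import Counter

# Weights taken from Table 1.
POS_WEIGHTS = {"verb": 0.5, "noun": 1.0, "adjective": 0.2, "unknown": 1.0}

# Hypothetical toy resources; a real system would load a full stop-word
# list and a part-of-speech dictionary for the language of the text.
STOP_WORDS = {"the", "a", "is", "and", "of"}
POS_DICTIONARY = {"cat": "noun", "dog": "noun", "chases": "verb", "big": "adjective"}


def weighted_counts(words):
    """Count non-stop words, adding the part-of-speech weight per occurrence."""
    counts = Counter()
    for word in words:
        token = word.lower()
        if token in STOP_WORDS:
            continue
        # Words missing from the dictionary fall back to the "unknown" weight.
        pos = POS_DICTIONARY.get(token, "unknown")
        counts[token] += POS_WEIGHTS[pos]
    return counts


if __name__ == "__main__":
    text = "The big dog chases the cat and the cat".split()
    for word, weight in weighted_counts(text).most_common():
        print(word, weight)
```

Looking words up in a dictionary rather than running a tagger keeps the step cheap and close to language independent, at the cost of ambiguous forms (such as "walk") receiving whichever part of speech the dictionary happens to list.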

Table 2

                                  Words frequent in our text    Words not frequent in our text
Words frequent in all texts       Not important                 Not important
Words not frequent in all texts   Important                     Not important

Thus we use the language corpus to determine the frequency of a word in all texts and then use the formula below to assign a score to it:

Word_score = (number of occurrences in the text / maximal number of occurrences in the text) / (number of occurrences in the global word list (all texts) / maximal number of occurrences in all texts)

Text generation

In order to avoid the need for grammatical knowledge and the creation of Chomsky's trees, the entities we work with are whole sentences. To every sentence in the text we assign a score, calculated by summing the scores of all the words it consists of. Thus the more words a sentence contains that occur frequently in the text, and the higher their frequency in the text, the higher the score of the sentence.

Now we have a score for every sentence in the text. The original OTS algorithm simply takes the sentences with the highest scores and puts them in the summary. However, a lurking problem of this naïve approach is that some sentences are connected, as they contain references to things in previous sentences. To minimize the problem with such severed co-reference chains, we use a list of words known to act as conductors of co-reference. Here we also use two additional heuristics. The first is that the most important conductors are located at the beginning of a sentence. The second is that, because the internal concept buffer of a person is limited, the further into the sentence the link word is, the less likely it is that this word refers to a previous sentence and not to a concept in the current one.

Personal pronouns used as link words, such as I, he, she, it, we, you, etc., bring a score of 7 if they are the first word in the sentence, 6 if they are the second word, and so on. Other pronouns and link words, such as this, that, these, such, there, but, etc., are scored the same way as the personal pronouns, but the score is halved. We apply all these rules to assign a second score to each sentence: a co-reference score.

The final stage of our method is to choose the sentences that either have the highest score or are immediately followed by a sentence with a co-reference score of 7 or more.

Conclusion

We focused our efforts on improving a simple approach based on general rules without compromising its core idea of being grammar-agnostic. We do this by using additional linguistic information while avoiding the need for full sentence structure analysis. The quality of the results we observed in the preliminary tests was satisfactory, and we plan a large-scale experiment with language specialists. The main advantages of the method are its relatively high speed and its low demand for computational resources.
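The scoring and selection rules from the "Text generation" section can be sketched as below. This is a hedged illustration under several assumptions that the text does not settle: all function names are hypothetical, the word score is read as the text-relative frequency divided by the corpus-relative frequency, the co-reference score of a sentence is taken as the best score of any link word in it (the paper does not say whether scores are combined), and a predecessor is added only for sentences that were themselves selected.

```python
# Link-word lists copied from the examples in the text; real lists would be longer.
PERSONAL_PRONOUNS = {"i", "he", "she", "it", "we", "you"}
OTHER_LINK_WORDS = {"this", "that", "these", "such", "there", "but"}


def word_score(count, max_count, corpus_count, max_corpus_count):
    """Relative frequency in the text divided by relative frequency in the corpus.

    Assumes corpus_count > 0, for example after add-one smoothing.
    """
    return (count / max_count) / (corpus_count / max_corpus_count)


def sentence_score(sentence_words, word_scores):
    """Sum of the scores of the words in the sentence (unknown words score 0)."""
    return sum(word_scores.get(w.lower(), 0.0) for w in sentence_words)


def coreference_score(sentence_words):
    """7 for a personal pronoun in first position, 6 in second, and so on.

    Other link words get half of that. Taking the maximum over the
    sentence is an assumption of this sketch.
    """
    best = 0.0
    for position, word in enumerate(sentence_words[:7]):
        token = word.lower()
        base = 7 - position
        if token in PERSONAL_PRONOUNS:
            best = max(best, float(base))
        elif token in OTHER_LINK_WORDS:
            best = max(best, base / 2)
    return best


def select_sentences(sentences, scores, top_k):
    """Indices of the top_k sentences by score, plus the predecessor of any
    selected sentence whose co-reference score is 7 or more."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:top_k])
    for i in sorted(chosen):  # sorted() copies, so adding to the set is safe
        if i > 0 and coreference_score(sentences[i]) >= 7:
            chosen.add(i - 1)
    return sorted(chosen)  # keep the original sentence order


if __name__ == "__main__":
    sentences = [
        ["Alice", "wrote", "the", "parser"],
        ["She", "tested", "it", "thoroughly"],
        ["Lunch", "was", "late"],
    ]
    print(select_sentences(sentences, scores=[1.0, 3.0, 0.5], top_k=1))
```

In the usage example the second sentence wins on score, and because it opens with a personal pronoun its predecessor is pulled in as well, so the summary does not start with a dangling "She". With the maximum-based reading, only a personal pronoun in first position reaches the threshold of 7.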

Bibliography

[Alguliev and Aliguliyev, 2009] Rasim Alguliev, Ramiz Aliguliyev. Evolutionary Algorithm for Extractive Text Summarization. Intelligent Information Management, 1, 2009.

[Amini et al, 2005] Massih R. Amini, Nicolas Usunier, Patrick Gallinari. Advances in Information Retrieval, Springer, Berlin, Heidelberg, 2005.

[Conroy and Oleary, 2001] John M. Conroy and Dianne P. O'Leary. Text summarization via hidden Markov models. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 2001.

[Luhn, 1958] H. P. Luhn. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, Vol. 2, No. 2, 1958.

[Nicholls and Song, 2009] Chris Nicholls and Fei Song. Improving sentiment analysis with part-of-speech weighting. Machine Learning and Cybernetics, 2009 International Conference on, Vol. 3. IEEE, 2009.

[Olmos et al, 2009] Ricardo Olmos, José A. León, Guillermo Jorge-Botana and Inmaculada Escudero. New algorithms assessing short summaries in expository texts using latent semantic analysis. Behavior Research Methods, 41 (3), 2009.

Authors' Information

Filip ANDONOV, PhD, Chief Assistant Professor, Department of Computer Science, New Bulgarian University, fandonov@nbu.bg. Major Fields of Scientific Research: Multicriteria Optimization, Semantic Technologies

Velina SLAVOVA, PhD, Prof. in Computer Science, Department of Computer Science, New Bulgarian University, vslavova@nbu.bg. Major Fields of Scientific Research: AI, Cognitive Science