Multilingual text mining


Data Mining VI 89

F. Neri
Research & Development Department, SYNTHEMA S.r.l., Italy

Abstract

The availability of a huge amount of textual data from a bewildering variety of sources leads to the well-known paradox whereby an overload of information means no usable knowledge. In fact, up to 80% of electronic data is textual. Moreover, the most valuable information is encoded in pages that are written in various native languages but are relevant even to non-native speakers. The process of accessing all these raw data, heterogeneous in language, and transforming them into information is therefore inextricably linked to the concepts of textual analysis and synthesis, and hinges greatly on the ability to master the problems of multilingualism. Through multilingual text mining, users can get an overview of great volumes of textual data through a highly readable grid, which helps them discover meaningful similarities among documents and find all related information. This paper describes the approach used by SYNTHEMA for multilingual text mining, showing the classification results on around 600 breaking news items written in English, Italian and French.

1 Multilingual resources construction

Generally speaking, the manual construction and maintenance of multilingual language resources is expensive and requires considerable effort. Established in 1994 by computer scientists from the IBM Research Center, with the expertise and skills needed to provide effective software solutions and to carry out R&D in the Natural Language Processing area, SYNTHEMA has been involved in Machine Translation, Information Extraction and Text Mining activities since 1996, primarily in the field of Technology Watch. The growing availability of comparable and parallel corpora has pushed SYNTHEMA to develop specific methods for the semi-automatic updating of lexical resources. They are based on Natural Language Understanding and Machine Learning.
These techniques detect multilingual lexicons from such corpora by extracting all the meaningful terms or phrases that express the same meaning in comparable documents.

As a case study, let us consider a corpus of around 350 parallel breaking news items written in English, French and Italian, used as the training set for the topic of interest. English has been used as the reference language. The major problem lies in the different syntactic structures and word definitions these languages may have, so a direct phrasal alignment has often been needed. The subsequent bilingual morphological analysis (Italian vs English, French vs English) recognises as relevant terminology only those terms or phrases that exceed a threshold of significance. A specific algorithm [1] associates an Information Quotient with each detected term and ranks it by importance. The Information Quotient is calculated taking into account the term, its Part Of Speech tag, its relative and absolute frequency, and its distribution across documents. This morphological analysis detects significant Simple Word Terms (SWT) and Multi Word Terms (MWT), annotating their headwords and their relative and absolute positions.

SYNTHEMA's strategy for multilingual dictionary construction rests on the assumption that, given a specific term S and its phrasal occurrences, its translation T can be automatically detected by analysing the corresponding translated sentences. Thus, semi-automatic lexicon extraction and storage of relevant multilingual descriptors become possible (see Fig. 1). Each multilingual dictionary, specifically suited to cross-lingual mapping, is bidirectional and contains multiple coupled terms f(S, T), stored as Translation Memories. Each lemma is linked to syntax-dependent or domain-dependent translated terms, so that each entry can represent multiple senses. In addition, the multilingual dictionaries contain lemmas together with simple binary features, as well as sophisticated tree-to-tree translation models, which map whole sub-trees node by node.
For this case study, the multilingual dictionary is made of around entries.

Figure 1: Bilingual morphological and statistical analysis, translation memories.
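The assumption stated in this section, that the translation T of a term S can be found by analysing the corresponding translated sentences, can be illustrated with a small co-occurrence sketch. The paper does not give the actual algorithm, so the Dice coefficient used here is a stand-in chosen for illustration, and the function name `translation_candidates`, the `min_score` parameter and the toy sentence pairs are all hypothetical. The sketch handles single-word terms only and assumes the sentences are already aligned and tokenised.

```python
from collections import Counter


def translation_candidates(aligned_pairs, source_term, min_score=0.5):
    """Rank target-language tokens by Dice association with source_term.

    aligned_pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns (token, score) tuples, best first. Illustrative only.
    """
    src_count = 0            # aligned pairs whose source side has the term
    tgt_count = Counter()    # aligned pairs whose target side has each token
    joint = Counter()        # aligned pairs containing both

    for src_tokens, tgt_tokens in aligned_pairs:
        tgt_set = set(tgt_tokens)
        tgt_count.update(tgt_set)
        if source_term in src_tokens:
            src_count += 1
            joint.update(tgt_set)

    scored = []
    for token, both in joint.items():
        # Dice coefficient: 2 * |A and B| / (|A| + |B|)
        dice = 2 * both / (src_count + tgt_count[token])
        if dice >= min_score:
            scored.append((token, round(dice, 3)))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy English-Italian aligned sentences (invented for this example).
pairs = [
    (["the", "war", "continues"], ["la", "guerra", "continua"]),
    (["war", "and", "peace"], ["guerra", "e", "pace"]),
    (["the", "peace", "talks"], ["i", "colloqui", "di", "pace"]),
    (["the", "school", "reform"], ["la", "riforma", "della", "scuola"]),
]
print(translation_candidates(pairs, "war"))
```

On this toy corpus the top candidate for "war" is "guerra", since the two words appear in exactly the same aligned pairs. Accepted pairs could then be stored as (S, T) entries, with a human reviewer confirming them, which is what makes the process semi-automatic.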

2 Lexical analysis

The automatic Linguistic Analysis is based on Parsing, Morphological and Statistical rules. The Parsing analysis is based on a set of pre-defined rules, which specify the most relevant fields in documents and their main features. The automatic linguistic analysis of free textual fields is based on Morphological and Statistical criteria. This phase is intended to identify only the significant expressions in the raw text. It recognises as relevant terminology only those terms or phrases that comply with a set of pre-defined morphological patterns (e.g. noun+noun and noun+preposition+noun sequences) and whose frequency exceeds a threshold of significance. The detected terms and phrases are then extracted and reduced to their Part Of Speech tagged base form [2-5]. Once mapped to their language-independent entry in the multilingual dictionary, they are used as descriptors for documents [6,7].

Indexing based on terminology detection is extremely reliable for managing any type of documentation, especially technical and scientific documentation. Few of us have complete knowledge of the world and, as a consequence, the meanings we ascribe to words may differ from those ascribed by others. The same happens with lexical tools capable of syntactic parsing, which always have a limited capability for semantic interpretation and disambiguation when applied to generic corpora. In such situations, these tools cannot pick out the exact interpretation of every expression in the language. Moreover, main terminology, mostly compound nouns, helps in understanding the topic, being intrinsically linked to semantics.

Figure 2: Lexical analysis.
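The filtering rule described in this section (keep only phrases that match a pre-defined morphological pattern and whose frequency exceeds a threshold) can be sketched as follows. This is a minimal illustration and not SYNTHEMA's implementation: the one-letter tag set, the `PATTERNS` list, the `min_freq` default and the toy documents are assumptions, and the input is assumed to be already lemmatised and Part Of Speech tagged.

```python
from collections import Counter

# Hypothetical tag set: N = noun, P = preposition, V = verb, D = determiner.
# The two patterns are the ones named in the text.
PATTERNS = [("N", "N"), ("N", "P", "N")]


def extract_terms(tagged_docs, min_freq=2):
    """Return multi-word terms matching PATTERNS at least min_freq times.

    tagged_docs: list of documents, each a list of (lemma, tag) pairs.
    """
    counts = Counter()
    for doc in tagged_docs:
        tags = tuple(tag for _, tag in doc)
        for pattern in PATTERNS:
            width = len(pattern)
            # Slide a window of the pattern's width over the tag sequence.
            for start in range(len(doc) - width + 1):
                if tags[start:start + width] == pattern:
                    lemmas = tuple(lemma for lemma, _ in doc[start:start + width])
                    counts[lemmas] += 1
    # Frequency threshold: discard candidates seen fewer than min_freq times.
    return {term: n for term, n in counts.items() if n >= min_freq}


# Toy lemmatised and tagged documents (invented for this example).
docs = [
    [("law", "N"), ("on", "P"), ("insemination", "N"), ("pass", "V")],
    [("the", "D"), ("law", "N"), ("on", "P"), ("insemination", "N")],
    [("peace", "N"), ("talk", "N"), ("resume", "V")],
]
print(extract_terms(docs))
```

With the default threshold only "law on insemination" survives, because "peace talk" matches the noun+noun pattern but occurs once. A real system would go on to map each surviving term to its language-independent dictionary entry before using it as a descriptor.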

3 Clustering analysis

The classification is performed by TEMIS Online Miner Light, according to the K-means approach. It is an application developed by TEMIS jointly with SYNTHEMA. (TEMIS was established in 2000 as a Technology & Consulting Company specialized in Text Intelligence and Advanced Computational Linguistics, to develop applications related to Competitive Intelligence, Customer Relationship Management and Knowledge Management.) The application fulfils the following requirements:

Unsupervised Classification. The application dynamically discovers the thematic groups that best describe the detected documents.

Hierarchical Classification. This makes it possible to explore thematic groups in depth, subdividing them into more specific themes.

The application provides a visual summary of the analysis (see Fig. 3). A map shows the different groups as differently sized bubbles (the size depends on the number of documents the bubble contains) and the meaningful correlations among them as lines of different thickness (reflecting the level of correlation). Users can search inside topics and look at the documents populating the clusters. The results can be viewed in a simple Web browser.

Figure 3: Thematic map and Search in topics.

As an example, let us classify all the 483 documents returned by a specific query on the application database. We obtain 10 well-defined clusters, dealing with terrorism and war (cluster 1), the Palestinian crisis (cluster 2), Italian politics (clusters 3, 4 and 5), Italian schools (cluster 6), the economy (cluster 7), child kidnapping (cluster 8), illegal immigration (cluster 9) and general themes (cluster 10).
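The K-means step named at the start of this section can be sketched on documents represented by their lexical descriptors. The internals of TEMIS Online Miner Light are not described in the paper, so this is a generic, flat K-means written for illustration; the helper names `to_vector` and `kmeans`, the deterministic seeding with the first k documents and the toy descriptor lists are all assumptions. The hierarchical subdivision mentioned above is not covered.

```python
def to_vector(descriptors, vocabulary):
    """Count how often each vocabulary term occurs among a document's descriptors."""
    return [descriptors.count(term) for term in vocabulary]


def kmeans(vectors, k, iterations=20):
    """Plain K-means with squared Euclidean distance.

    Centroids are seeded with the first k vectors so the run is
    deterministic; this seeding is a simplification for the example.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Toy documents, each given as a list of descriptors (invented for this example).
doc_descriptors = [
    ["war", "war", "peace"],
    ["school", "school", "reform"],
    ["war", "peace", "peace"],
    ["school", "reform", "reform"],
]
vocabulary = sorted({term for doc in doc_descriptors for term in doc})
vectors = [to_vector(doc, vocabulary) for doc in doc_descriptors]
labels, centroids = kmeans(vectors, k=2)
print(labels)
```

Here the two war/peace documents fall into one cluster and the two school/reform documents into the other. Because the descriptors are language-independent dictionary entries, documents in different languages about the same topic would share vector dimensions and could land in the same cluster.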

Looking at the thematic network, the results are similar to what anyone would expect from reading this type of document: all the clusters regarding politics are linked together, the Israeli-Palestinian crisis is linked to the cluster concerning peace and war, and so on. When searching for insemination inside the bubble map, the system highlights all the clusters containing documents that have insemination as a lexical descriptor, allowing access to them (see Fig. 3). We obtain documents dealing with inseminazione, fecondazione, legge sulla fecondazione, sterilità, fecondazione assistita, artificial insemination, insemination intervention, etc. (see Fig. 4).

Figure 4: Documents visualization.

4 Conclusions

This paper describes a new approach to Text Mining applied to multilingual corpora and a specific case study carried out on around 600 English, French and Italian breaking news items, downloaded directly from MISNA, AGI and some French news agencies. Terminologies and Translation Memories make it possible to overcome linguistic barriers, allowing the automatic indexing and classification of documents, whatever their language. This new approach enables the search, analysis and classification of great volumes of heterogeneous documents, helping people to cut through the information labyrinth. As multilingualism is an important part of this globalised society, Multilingual Text Mining is a major step forward in keeping pace with relevant developments in a challenging and rapidly changing world.

References

[1] Cascini, G., Neri, F.: Natural Language Processing for Patents Analysis and Classification, Proceedings of ETRIA World Conference, TRIZ Future 2004, Florence, Italy. Neri, F., Raffaelli, R.: Text Mining applied to multilingual corpora, Proceedings of Knowledge Mining NEMIS 2004 Final Conference, Athens, Greece, Oct 2004, 25.
[2] Raffaelli, R.: An inverse parallel parser using multi-layered grammars, IBM Technical Disclosure Bulletin, 2Q,
[3] Raffaelli, R.: Un ambiente per lo sviluppo di grammatiche basato su un parser inverso, parallelo e seriale, IBM Italy Scientific Centers Technical Report, pp. 1-19,
[4] Marinai, E., Raffaelli, R.: The design and architecture of a lexical data base system, COLING 90, Workshop on advanced tools for Natural Language Processing, Helsinki, Finland, Aug 1990, 24.
[5] Raffaelli, R.: ABCD A Basic Computer Dictionary, Proceedings of ELS Conference on Computational Linguistics, Kolbotn, Norway, Aug 1988,
[6] Galli, G., Raffaelli, R., Saviozzi, G.: Il trattamento delle espressioni composte nel trattamento del linguaggio naturale. IBM Research Center, internal report, Pisa, Italy, pp. 1-19,
[7] Elia, A., Vietri, S.: Electronic dictionaries and linguistic analysis of Italian large corpora. JADT 2000, 5th International Conference on the Statistical Analysis of Textual Data, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, pp. 2-4, 2000.