
Opening the legal literature Portal to multilingual access

E. Francesconi, G. Peruginelli
ITTIG Institute of Legal Information Theory and Technologies, Italian National Research Council, Florence, Italy
Tel: +39 055 43999 Fax: +39 055 42272 e-mail: {francesconi, peruginelli}@ittig.cnr.it

Abstract: The issues of multilingual access to legal information are examined, and strategies of cross-language retrieval of legal information resources are illustrated as additional services of the Portal to legal literature created by ITTIG. Consideration is given to the peculiarities of legal language as a technical language closely related to the diverse legal systems. The paper describes a methodological approach planned for the Portal to provide a single point of access into disparate repositories where categories of law (e.g. trade law, constitutional law, criminal law) are the essential metadata to point to relevant material irrespective of the language used in a query. Categories of law of a specific legal system represent the means by which retrieval can be satisfactorily achieved. Strategies and techniques for translating legal queries into different target languages, where necessary disambiguating ambiguous words by a machine learning approach, are illustrated.

Keywords: legal literature, cross-language retrieval, multilingual metadata, word sense disambiguation.

1 Introduction

Legal information has peculiarities due to its multifaceted nature, methods of use and the integration requirements of different legal information sources such as statutes, case-law and legal doctrine. In particular, the information retrieval of legal literature requires high quality indexing as well as appropriate searching methods and tools in order to be effectively accessed by diverse legal user communities. The Portal of Italian legal literature developed by the Institute of Legal Information Theory and Technologies in 2002 and presented at the DC2003 Conference in Seattle has taken such requirements into account (1). The provision of a unified point of access to multiple legal doctrine resources through the exploitation of rich metadata and the development of tools for the discovery, selection and use of relevant legal material are the core functions of the Portal.

At the same time, methodologies and tools have been developed to allow the communication of legal knowledge among general users, in an attempt to solve some of the difficulties caused by the complexity of legal language. Efforts so far have been made to design the Portal's system, conduct a survey of Italian legal users and develop integrated tools for generating and capturing metadata for structured and semi-structured web documents (1). A second phase has recently been launched, where emphasis is put on two distinct requirements, both addressing the need for international access through the legal Portal. These consist in: a) opening up the system to a wider user community, including foreign patrons, who must be given the possibility to access Italian legal material in their native language; b) providing multilingual access to foreign legal resources.

These objectives are based on the belief that the development of strategies and tools that enable access to information regardless of geographic or language barriers is a key factor for the truly global sharing of legal knowledge, thus making it possible for legal research and the legal profession to progress according to the requirements of modern society. With the rapid increase in globalization, transnational issues can be expected to arise today in virtually any legal context; therefore the retrieval of foreign law material with adequate multilingual tools becomes an essential requirement for success in modern law practice and research. Furthermore, the principle of multilingualism in the domain of law not only ensures democratic transparency and the equality of citizens' rights, but also guarantees legal certainty.

A two-phase approach has been developed for the implementation of the Portal's multilingual access function. Firstly, an analysis has been carried out of cross-language retrieval approaches, their application to law and the multilinguality of metadata. Secondly, a practical approach to the retrieval of multilingual legal resources has been defined. Based on the features of the Portal's federation system, where documents coming from structured repositories and from the web are qualified using the Dublin Core metadata set in its XML version, query translation and word disambiguation techniques have been implemented.

2 Multilingual information access approaches

In order that multilingual access be guaranteed, that is, so that information can be searched, retrieved and presented effectively, without constraints due to the different languages and scripts used in documents and in metadata, both the users' native language and the multiplicity and wealth of world-wide languages have to be accommodated. What is needed is functionality such as the thorough and proper handling of characters (their presentation, arrangement, transfer), putting queries into a preferred language and script, retrieving resources irrespective of the language used in searching and indexing, and enabling world-wide communication no matter what the language. As a consequence, multilingual facilities must include both multiple-language recognition and cross-language information retrieval (CLIR) (2).

Among the basic approaches to cross-language retrieval, based on controlled vocabulary and on free (or full-text) retrieval, no optimal solution exists: this is widely demonstrated by the wealth of applications, research and studies under way in the field of CLIR (3). The use of controlled vocabulary and a multilingual thesaurus in a cross-language retrieval environment, where selected terms from each language are related to a common set of concept identifiers (4), involves labour-intensive work developing and managing such tools, especially in applications where diverse domains and material are to be managed.
Free-text searching and, connected with it, methods based on dictionaries or corpora (that is, analysing existing collections of texts from which to extract information for the application-specific translation techniques) are an alternative approach (5). Here either the query is translated, or the document when it is indexed, but limitations exist and vary according to the techniques used. Difficulties are the lack of equivalence in translation, the ambiguity which can arise from translating terms between languages when no context is provided, and the availability of parallel and comparable corpora. This is particularly true in the domain of law.

3 Legal language peculiarities

In implementing the legal Portal's services the approaches to multilingual access are considered with reference to the peculiarities of legal language, which is a strictly technical language. Like other disciplines, law has its own lexicon: it uses ad-hoc terms and attributes specific meanings to terms taken from ordinary language. It is a sort of internal code allowing communication between legal experts, where terms and technical expressions respond to economic criteria, making concepts understandable by using a restricted vocabulary (6). All this applies at a national level. At an international level the complexity and richness of diverse legal languages make understanding and exchange of concepts expressed in such languages a very difficult task. However, the main problem of legal terminology at an international level mainly regards the difference of legal concepts inherent to the diverse national legal systems. In fact, merely translating words is likely to confuse users when a concept does not exist in the target legal system (7).

A basic difference between legal systems lies in the fact that, speaking in very general terms and without taking into account mixed legal systems, the Western world has long had two dominant legal traditions: Common law, with its beginnings in England, and Civil law, rooted in continental Europe. At this stage of the project the Portal mainly deals with material drawn from these two basic systems. Both systems have taken on a variety of cultural forms.
Civil law systems have drawn their inspiration largely from their Roman law heritage and, by giving precedence to written law, have resolutely opted for a systematic codification of their general law. Common law systems are instead based on English common law concepts and legal organizational methods which assign a preeminent position to case-law, as opposed to legislation, as the ordinary means of expression of general law. (Footnote: However, no modern legal system is a pure representation of any type of system; all jurisdictions today represent mixed systems, to some extent.)

Quite a number of concepts belonging to different legal systems can hardly be transposed or even compared with one another. Unlike in technical and scientific disciplines, serious difficulties arise in translating law material due to the system-bound nature of legal terminology. Consequently, categories of law (e.g. trade law, constitutional law, criminal law) are specific within a particular legal system. Therefore, categories of law of a specific legal system represent the means by which retrieval can be satisfactorily achieved. Moreover, a legal system identifies one or more languages of a country where it is operative. At this stage of the study we consider a one-to-one mapping between a specific legal system and a given language. As there is often neither conceptual nor content similarity between the categories of law of the different legal systems, mapping between such law categories is necessary to reach proper contextualisation of the query in the diverse legal systems. An example illustrates the need for such mapping. The concepts related to property rights, such as the development of property law, land law, property questions on insolvency, intellectual property, etc., according to UK law belong to the field of property law, whereas in the Italian legal system these legal facts are regulated by private law, agricultural law and industrial law.

3.1 Equivalence between legal languages

In order that cross-language retrieval of legal material be effective, priority must be given to translating one legal language into another legal language and not into the ordinary words of the target language, as this can cause ambiguity and misunderstanding. Therefore the whole process of interaction between legal languages can be identified as finding equivalents across legal systems (8). If no acceptable equivalents can be found in the target language, subsidiary solutions must be sought, such as no translation and use of source terms, paraphrasing, or creating a neologism with explanatory notes.
The search for equivalence implies both a comparative study of the different legal systems and adequate knowledge of technical legal terminology. Purely linguistic problems are likely to be encountered due to legal false friends. Some examples are given below, highlighting the complexity of comparing different legal languages.

The term «administrative tribunals» cannot be translated into French as «tribunaux administratifs». The English word for the French tribunal is Court, and the administrative tribunals are administrative commissions which are comparable, mutatis mutandis, to the French «autorités administratives indépendantes».

The so-called «procacciatore d'affari», typical of the Italian legal system, has no equivalent in other systems. Translating it as «broker» (in English) or «pourvoyeur d'affaires» (in French) is incorrect. Unlike the Italian term, these latter terms refer to an employee of a company for which he works and from which he gets a salary and not a commission.

Despite the difficulties in establishing the equivalence of legal concepts belonging to different legal systems, a compromise has been adopted in trying to favour the integration of diverse legal cultures, while respecting each national legal system. What is needed is the identification of a common ground, namely common legal concepts and facts which, although not perfectly coinciding with those belonging to other systems, are conceptually close. It is up to legal users, once the material has been examined, to perceive the differences and peculiarities which make these resources unique. It is to be underlined that this does not necessarily lead to noise or unsuccessful searches, but allows for a first-phase search in context, useful to give evidence of the existence or non-existence of a specific concept in other legal systems.

4 Multi-language metadata approach

As discussed in (1), our Portal is aimed at integrating data coming from structured repositories and from Web documents in a single point of access. In this paper, the architecture has been extended to deal with cross-language facilities.

4.1
Structured data and Web documents in a multi-language environment

Structured data coming from different repositories are usually provided with a specific metadata scheme, as well as a specific language-dependent classification scheme. While metadata sets can be harmonized in DC, the contents of metadata, especially the content of dc:subject, cannot (Section 3), except within a legal system, which in this study we consider to correspond to a language (Section 3). Data providers are required to expose metadata, making repositories compliant with the DC metadata and ready to be harvested using the OAI-PMH protocol (1). At the end of this process the service providers are able to set up as many XML metadata repositories as original languages considered (Fig. 1).

Fig. 1 The OAI harvesting of structured metadata in different languages.

On the other hand, Web documents do not usually contain any particular metadata scheme, nor any reliable or uniform HTML meta-tags which can help the qualification of material of interest. As discussed in (1), in our architecture Web documents are selected and harvested with no prior agreement between service providers and data providers, using a focused crawler that selects documents within each legal system, corresponding to a language (Section 3). For this kind of document, an automatic metadata generator capable of applying DC metadata has been developed and tested (1). In particular, a machine learning approach based on naïve Bayes classifiers has been used for dc:subject. In our architecture the problem of extending the automatic metadata generator to deal with multi-language documents is the problem of extending the automatic classifier to a multi-language environment. Multi-language automatic document classification has not been widely studied on its own. In the literature two main kinds of classifiers for multi-language document categorization can be distinguished (9):

1) A poly-language trained classifier: one classifier trained on documents written in different languages;
2) A single-language trained classifier: one classifier trained on documents written in language A and a translation of the most important terms of language B to A, in order to classify documents of language B.

Both these approaches assume that documents of different languages share the same classification scheme. However, as discussed in Section 3, this is not the case here: we deal with documents of different legal systems, corresponding to different languages, each having a particular classification scheme, and these schemes usually cannot be harmonized with one another. Therefore, in our architecture a naïve Bayes classifier is trained (Fig. 2) and used to apply dc:subject to Web documents within each language and classification scheme.

Fig. 2 Classifiers trained to apply dc:subject to Web documents within different language domains.

At the end of this process the service provider collects as many DC-qualified HTML metadata repositories as original languages (Fig. 3).

Fig. 3 The harvesting and DC qualification of Web documents in different languages.

4.2 Indexing legal documents

After having collected data from structured repositories and from Web documents the service provider obtains two types of archives containing data in different formats (XML records and HTML documents), sharing the same DC metadata description scheme. Each type of archive is composed of sub-archives containing data in one language. At this stage an indexing procedure, able to handle XML records and HTML documents (10), can be implemented with the aim of providing a uniform view and integrated access to the data of our Portal. The number of indexes obtained with the same DC metadata scheme is the same as the number of languages considered (Fig. 4).

Fig. 4 Indexing of DC metadata from repositories using different formats and languages.

5 Access modalities

Having built the portal indexes in a multi-language environment, users may access data by these two modalities of query:

1) metadata-based document querying (MBDQ);
2) keyword-based document querying (KBDQ), combined with category (category-based document querying (CBDQ)).

Case 1) Advanced search: the user submits a query filling in the fields related to DC metadata.

Case 2) Simple search: the user submits a query, filling an unqualified text box with keywords. Moreover, in order to make the query more focused, the user may choose a legal category of the legal system associated with a language domain.

Dealing with querying and retrieval of multi-language documents essentially involves the problem of query translation. As discussed in Section 3.1, especially in the legal domain, a word in the query language can be ambiguous, having therefore different translations in a target language, each corresponding to a legal category in the target legal system (e.g. the Italian word dolo has two different translations into English, fraud and malice, belonging respectively to private law and criminal law). The right sense of an ambiguous word in the query language can be obtained only by word contextualization, giving the right sense to the context in terms of legal category. Such a legal category in the query legal system can then be mapped to the corresponding legal category in the target legal system, and therefore the right translation of the ambiguous word can be obtained.
If more than one category in the target legal system corresponds to the legal category of the query legal system, more than one translation of the ambiguous word is selected. Therefore, the knowledge of a legal category in both the modalities of querying (MBDQ and KBDQ+CBDQ) is essential in order to identify the right translation of an ambiguous word. Once the query is translated into the target languages and contextualized, documents of different languages are given back. The procedures used to obtain these results in MBDQ and in KBDQ+CBDQ modes are described respectively in Sections 5.1 and 5.2 (Fig. 5 can be used as reference).

Fig. 5 Query translations in MBDQ and KBDQ+CBDQ modalities.

5.1 Query based on metadata (MBDQ)

MBDQ represents an advanced search, the standard modality of querying a qualified index. The user first of all is required to choose a legal system, thus implicitly identifying a language for queries, and a legal category, in terms of dc:subject, thus implicitly identifying the right translations of possible ambiguous words. Then each metadata field is filled with a set of words $(w_0, w_1, \ldots, w_n)$, representing a context expressed in the query native language, that has to be translated by a thesaurus-based context translation procedure. Not every field has to be translated. In fact, DC metadata can be divided into query language-dependent and query language-independent metadata. For example, dc:title is query language-independent since the title of a document has to be queried in its native language, independently of the query language. Therefore only the contents of query language-dependent metadata have to be translated. While in a multi-language environment dc:subject is usually query language-independent (or neutral (11)), within a multi-language legal domain this is not true (Section 3). For this reason dc:subject has to be translated, by a mapping of its values from a legal system to different target ones. The content of dc:description (with its qualifiers, such as Abstract) is also query language-dependent and is a widely used access point: the information contained is often expressed using a semi-technical language; therefore the dc:description element has been held to be as important to translate as dc:subject in the Portal functions.

The contents of dc:subject and dc:description, submitted in a native language, are translated into a pivot language (English) (2). Then, from the pivot language, the query is translated again into the other languages used by the Portal. The use of a pivot language in an $N$-language environment allows the reduction of the number of bilingual thesauri from a factor $N^2$ to a factor $N$, and also solves the problem of the non-availability of some bilingual thesauri.

As discussed in Section 5, the main problem with translation is that a single word ($w_i$) or expression in the native language can have different translations in a target language, depending on the context. For example, let us assume, without loss of generality, that $w_1$ is an ambiguous single word of the context $(w_0, w_1, \ldots, w_n)$ in dc:description in the query native language $l$ (Fig. 5). According to Fig. 5, different English translations $\{y_{1,0}, y_{1,1}, \ldots, y_{1,q}\}$ can be associated with $w_1$, each one corresponding to one of as many legal categories $\{h_0, h_1, \ldots, h_q\}$. For example, the language being $l$ = Italian and $w_1$ = dolo, possible translations in English are $y_{1,0}$ = fraud, related to law category $h_0$ = private law, and $y_{1,1}$ = malice, related to law category $h_1$ = criminal law.
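The pivot-language routing and the split between translated and untranslated DC fields can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the language list, the function names and the toy thesaurus entries (contratto, contract, contrat) do not come from the Portal, and only dc:description is translated word by word here, since dc:subject is handled by category mapping rather than by a word thesaurus.

```python
from itertools import permutations

LANGUAGES = ["it", "en", "fr"]  # assumed Portal languages
PIVOT = "en"                    # English as pivot, as in the text

def direct_pairs(languages):
    """Ordered (source, target) pairs needing a thesaurus without a pivot."""
    return list(permutations(languages, 2))

def pivot_pairs(languages, pivot):
    """Ordered pairs needed when every translation passes through the pivot."""
    others = [lang for lang in languages if lang != pivot]
    return [(lang, pivot) for lang in others] + [(pivot, lang) for lang in others]

# Hypothetical toy thesauri to and from the pivot language.
TO_PIVOT = {"it": {"contratto": "contract"}, "fr": {"contrat": "contract"}}
FROM_PIVOT = {"it": {"contract": "contratto"}, "fr": {"contract": "contrat"}}

def translate_via_pivot(word, source, target):
    """Translate source -> pivot -> target; return None if an entry is missing."""
    pivot_word = word if source == PIVOT else TO_PIVOT[source].get(word)
    if pivot_word is None:
        return None
    if target == PIVOT:
        return pivot_word
    return FROM_PIVOT[target].get(pivot_word)

# Fields whose words are translated in this sketch; dc:title stays untouched.
WORD_TRANSLATED_FIELDS = {"dc:description"}

def translate_query(fields, source, target):
    """Translate only the language-dependent fields of a query."""
    translated = {}
    for name, words in fields.items():
        if name in WORD_TRANSLATED_FIELDS:
            # Keep the source word when no translation is available.
            translated[name] = [translate_via_pivot(w, source, target) or w
                                for w in words]
        else:
            translated[name] = list(words)
    return translated

query = {"dc:title": ["Il contratto"], "dc:description": ["contratto"]}
french_query = translate_query(query, "it", "fr")
```

With three languages the direct scheme needs six ordered thesauri and the pivot scheme four; the gap widens as languages are added, which is the saving the text refers to.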
The right translation can be obtained only by knowing the sense, namely the category $h$, of the context in the query native language where $w_1$ is contained. Such a context, or legal category, is required and is provided by the user using the dc:subject element. When a category $c_l$ (Fig. 5) is selected in dc:subject within a legal system, the problem arises of different classification schemes in different languages, corresponding to different legal systems (Section 3.1). The problem can be solved by using a thesaurus-based category mapping. In fact, when the category $c_l$ is submitted as a query parameter, it is mapped to the corresponding, or the closest, categories in the pivot language, and from it to the other languages considered by the Portal ($c_l \to c_{en} \to \{c_{it}, c_{fr}\}$), using category thesauri.

In accordance with Fig. 5 and without loss of generality, let us assume that only one legal category $c_{en} = h_1$ in the English legal system corresponds to the legal category $c_l$ ($c_l \to c_{en} = h_1$). Consequently, the English translation $y_{1,1}$ (Fig. 5) related to the sense $h_1$ can be selected (in our example, the English word $y_{1,1}$ = malice, related to law category $h_1$ = criminal law, is selected as the right translation of the Italian word $w_1$ = dolo). If more than one category of the target legal system can be associated with $c_l$, all the corresponding translations of the current $w_i$ are selected.

When all the words of the current context are translated in dc:description, we obtain the translation of the submitted context $(w_0, w_1, \ldots, w_n)$ from language $l$ to the Portal target languages. The value $c_l$ in dc:subject is also mapped to the corresponding categories in the target languages. Now queries in different languages are ready to be dispatched to the related domain language indexes (Fig. 6).
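The category mapping and sense selection just described can be sketched as two table lookups. This is only an illustration under stated assumptions: the dictionary layout, the function names and the Italian category labels ("diritto penale", "diritto privato", and the deliberately artificial "categoria mista") are hypothetical; only the dolo, fraud and malice senses and their English categories come from the text.

```python
# Sense-tagged thesaurus (Italian -> English): each translation is keyed by
# the English legal category it belongs to. The dolo entry follows the text.
SENSES_IT_EN = {
    "dolo": {"private law": "fraud", "criminal law": "malice"},
}

# Hypothetical category thesaurus: Italian category -> English categories.
# A source category may correspond to more than one target category.
CATEGORY_IT_EN = {
    "diritto penale": ["criminal law"],
    "diritto privato": ["private law"],
    "categoria mista": ["private law", "criminal law"],  # artificial example
}

def map_category(category):
    """Thesaurus-based category mapping into the pivot legal system."""
    return CATEGORY_IT_EN.get(category, [])

def translate_word(word, target_categories):
    """Return every translation whose sense matches a mapped category."""
    senses = SENSES_IT_EN.get(word)
    if senses is None:
        return [word]  # no entry in the toy thesaurus: keep the source term
    return [senses[c] for c in target_categories if c in senses]

def translate_context(words, category):
    """Translate a whole context (w_0 ... w_n) under one legal category."""
    targets = map_category(category)
    return {w: translate_word(w, targets) for w in words}

criminal = translate_context(["dolo"], "diritto penale")
mixed = translate_context(["dolo"], "categoria mista")
```

When the source category maps to a single target category, exactly one translation survives; when it maps to several, all matching translations are kept, mirroring the rule stated in the text.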

Fig. 6 MBDQ: results of query translation in different languages (in grey, metadata whose content is translated).

5.2 Query based on keywords and legal categories (KBDQ+CBDQ)

A query based on keywords and legal categories represents the simple search modality of querying our Portal. In this mode the user is provided only with an unqualified text box to be filled with a context $(w_0, w_1, \ldots, w_n)$ of words in a native language. The words identifying the context will be translated into the target languages of the Portal (thesaurus-based context translation). Moreover, the user may or may not provide a legal category of the query legal system. Since the category is essential for the translation of ambiguous words, if a legal category is not provided, the system attempts to infer the corresponding legal category from the query context.

If the user selects a legal category $c_l$ among the values of dc:subject in the query legal system, a procedure of thesaurus-based category mapping is executed, as described in Section 5.1, obtaining the correspondences of $c_l$ in the Portal target legal systems (Fig. 5). If the user fills only the unqualified text box, without choosing any value in dc:subject, the right sense of the query context is provided by a procedure of automatic word sense disambiguation, which assigns a legal category to a context as described in Section 5.3. The legal category thus identified in the native query language is then mapped to the related legal categories in the target legal systems (thesaurus-based category mapping). At the end of the process, the right translations of ambiguous words can be obtained, as discussed in Section 5.1 (Fig. 5), and as many different queries as target languages used by our Portal can be dispatched to the different language indexes (Fig. 7).

Fig. 7 KBDQ+CBDQ: results of query translation in different languages (in grey, metadata whose content is translated).

5.3 Automatic word sense disambiguation

The problem of assigning the right meaning to a word in context is a problem of assigning the right sense to the context itself, out of the various meanings that can be assigned to the ambiguous word.
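Assigning a legal category to a query context, as needed when the user supplies keywords only, can be sketched with a small ranking classifier. The sketch below is a generic multinomial naïve Bayes with add-one smoothing written from scratch; the smoothing choice, the class name, the category labels and the toy training documents are all assumptions for illustration and say nothing about how the Portal's own classifier is configured.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesRanker:
    """Multinomial naive Bayes that ranks categories for a bag of words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of documents
        self.vocabulary = set()

    def train(self, labelled_docs):
        """labelled_docs: iterable of (list_of_words, category) pairs."""
        for words, category in labelled_docs:
            self.doc_counts[category] += 1
            self.word_counts[category].update(words)
            self.vocabulary.update(words)

    def rank(self, context):
        """Return (category, log-score) pairs, best category first."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in context:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[category] = score
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical labelled documents, each reduced to a few Italian keywords.
ranker = NaiveBayesRanker()
ranker.train([
    (["dolo", "reato", "pena"], "diritto penale"),
    (["dolo", "contratto", "annullamento"], "diritto privato"),
])
ranking = ranker.rank(["dolo", "reato"])
best_category = ranking[0][0]
```

The top-ranked category can then be passed to the category mapping step in place of a user-selected dc:subject value.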
According to the literature, many methods have been used to solve the problem of automatic disambiguation:
- Thesaurus-based disambiguation (13);
- Disambiguation based on sense definitions (14);
- Disambiguation based on translation in a second-language dictionary (15);
- Bayesian disambiguation (16).

In our Portal word disambiguation is a problem of context categorization with respect to the legal categories considered within a legal system. Moreover (16), context categorization is the same problem as document categorization, once we view contexts as documents and word senses as categories. For these reasons in our system we use the same naïve Bayes classifiers described in Section 4.1, trained with labelled documents of different legal categories of a particular legal system and language. At the end of the training phase each category profile (in our case a vector of weighted terms relevant to it (1)) can also be considered as a context profile to be used for the disambiguation function. Given the set of categories, considered by our Portal, in a legal system, the naïve Bayes classifier used for word sense disambiguation is a ranking classifier which, for a given query context, returns the scores for the different categories. Each score represents the evidence for placing a given context in a certain category. It is important to execute automatic word disambiguation prior to translation because, as discussed, correct word translation depends on the contextualization of words in their native language.

6 Conclusions

Multilingual access to legal documents is problematic due to the legal system-bound nature of this type of information. In the legal literature Portal created by ITTIG an approach has been developed allowing cross-language retrieval of both structured and unstructured documents by exploiting the content of dc:subject in diverse legal systems. The designed functionality aims to provide a single point of access into disparate repositories where categories of law are the essential metadata to point to relevant material irrespective of the language used in a query. This is done through techniques able to translate legal queries into different target languages, where necessary disambiguating ambiguous words by a machine learning approach. Basically, the approach gives the advantage of accessing multi-language legal documents while respecting the identity and the peculiarities of different legal systems.

References

1. E. Francesconi, G. Peruginelli, Integration between structured repositories and web documents, in Proc. of the DC Conference 2003, pp. 99-107.
2. C. Peters, E. Picchi, Across languages, across cultures: issues in multilinguality and digital libraries, in D-Lib Magazine, May 1997. Retrieved July 7, 2004, from http://www.dlib.org/dlib/may97/peters/05peters.html
3. J. Mayfield, P. McNamee, Three principles to guide CLIR research, 2002. Retrieved July 7, 2004, from http://ucdata.berkeley.edu/sigir-2002/sigir2002CLIR-8-mayfield.pdf
4. C. Fluhr, Multilingual Information Retrieval, in Survey of the State of the Art in Human Language Technology, 1996. Retrieved July 7, 2004, from http://cslu.cse.ogi.edu/hltsurvey/ch8node7.html
5. D. W. Oard, Alternative approaches for Cross-Language Text Retrieval, 1997. Retrieved July 7, 2004, from http://www.ee.umd.edu/medlab/filter/sss/papers/oard/paper.html
6. C.J.P. Van Laer, The applicability of comparative concepts, in Electronic Journal of Comparative Law, vol. 2.2, August 1998.
7. R. Sacco, Droit et langue, in Rapports italiens au XV Congrès international de droit comparé, Milano, 1998.
8. J. Kerby, La traduction juridique, un cas d'espèce, in Jean Claude Gémar (Ed.), Langage du droit et traduction, Essais de jurilinguistique, Montréal 1982.
9. N. Bel, C. Koster, M. Villegas, Cross-Lingual Text Categorization, in Proc. of ECDL 2003, pp. 126-139.
10. Swish-e, Simple Web Indexing System for Humans Enhanced. Retrieved July 7, 2004, from http://swish-e.org.
11. W. Lee, S. Sugimoto, M. Nagamori, T. Sakaguchi, K. Tabata, A Subject Gateway in Multiple Languages: a Prototype Development and Lessons Learned, in Proc. of DC Conference 2003, pp. 59-66.
12. F. Sebastiani, Interactive Query Expansion with Automatically Generated Category-Specific Thesauri, in Amita G. Chin (ed.), Text Databases and Document Management: Theory and Practice, Idea Group Publishing, Hershey, US, 2001, pp. 103-117.
13. D. Yarowsky, Word sense disambiguation using statistical models of Roget's categories trained on large corpora, in COLING 14, 1992, pp. 454-460.
14. M. Lesk, Automatic sense disambiguation, in Proc. of the 1986 SIGDOC Conference, pp. 24-26.
15. I. Dagan, A. Itai, Word sense disambiguation using a second language monolingual corpus, Computational Linguistics, n. 20, 1994, pp. 563-569.
16. W. A. Gale, W. K. Church, D. Yarowsky, A method for disambiguating word sense in a large corpus, Computers and the Humanities, n. 26, 5, 1993, pp. 415-439.