KANSHIN: A Cross-lingual Concern Analysis System using Multilingual Blog Articles

Tomohiro Fukuhara
RACE, The University of Tokyo
5--5 Kashiwanoha, Kashiwa, Chiba JAPAN

Yoshiaki Arai
Tokyo Denki University
- Nishiki-cho, Kanda, Tokyo JAPAN

Hidetaka Masuda
Tokyo Denki University
- Nishiki-cho, Kanda, Tokyo JAPAN

Hiroshi Nakagawa
ITC, The University of Tokyo
7-3- Hongo, Tokyo JAPAN

Akifumi Kimura
Faculty of Engineering, The University of Tokyo
7-3- Hongo, Tokyo JAPAN

Takayuki Yoshinaka
Tokyo Denki University
- Nishiki-cho, Kanda, Tokyo JAPAN

Takehito Utsuro
University of Tsukuba
--, Tennodai, Tsukuba JAPAN

Abstract

This paper describes an architecture for cross-lingual concern analysis (CLCA) from multilingual blog articles, together with a prototype system. As people living in many regions and countries join the Web, cross-lingual information retrieval (CLIR) becomes important. We propose CLCA as a CLIR application for understanding the concerns of people across languages. We propose a layer architecture of CLCA and a prototype system called KANSHIN. The system collects Japanese, Chinese, Korean, and English blog articles and analyzes concerns across languages. Users can examine concerns from several viewpoints: temporal, spatial (geographical), and network. The system also helps users browse multilingual keywords using Wikipedia and find splog features. We give an overview of the CLCA architecture and the system.

1 Introduction

Today many people around the world use the Internet. They communicate with others by e-mail, browse news articles, and publish their thoughts and opinions on the Web. This textual information is written in various languages. According to the language observatory project, 6, languages are currently spoken in the world [10]. Although the set of machine-readable languages is limited at the moment, a great deal of information is exchanged in various languages on the Web.
We call this phenomenon language explosion, and it can be observed on the Web. In Wikipedia (footnote 1), a Web-based multilingual free encyclopedia, 53 languages (footnote 2) are used for writing articles. Another example is the blogosphere. According to Technorati's report (footnote 3), which analyzes the state of the blogosphere in April, 7, the blogosphere is divided into language spaces such as Japanese (37%), English (36%), Chinese (8%), Italian (3%), Spanish (3%), Russian (%), French (%), and so on. Although it contains many spam blogs (splogs), the blogosphere can be seen as a multilingual space.

As the language explosion proceeds, multilingual information access (MLIA) will play an important role on the Web [5]. Oard also argued for the importance of multilingual blog analysis [11]. Among the various MLIA applications that have been proposed, we propose cross-lingual concern analysis (CLCA). With CLCA, one can find various viewpoints by comparing information written in several languages. The goal of CLCA is to help people understand the concerns of people across languages by providing hot topics in each language space, clustering and summarizing topics, and translating foreign languages into the user's mother tongue (see.).

In this paper, we propose a layer architecture of CLCA and its prototype system, called KANSHIN. The system collects multilingual blog articles and analyzes them from temporal, focal, geographical (spatial), and network viewpoints. Users can find various viewpoints on a topic across languages. The system provides a cross-lingual keyword navigation tool based on the interlanguage links of Wikipedia, and a splog survey tool. By using the proposed system, users can find and compare the concerns of people across languages.

This paper is organized as follows. Section 2 reviews previous work and proposes a layer architecture of CLCA. Section 3 describes a prototype system of CLCA. Section 4 describes related work. Section 5 summarizes our arguments and describes future work.

Footnote 1: http:en.wikipedia.orgwikiwikipedia:about
Footnote 2: http:meta.wikimedia.orgwikilist of Wikipedias
Footnote 3: http:www.sifry.comalertsarchives493.html

2 Cross-lingual concern analysis

In this section, we review previous work and propose a layer architecture of CLCA.

2.1 Previous work

As previous work, we review (1) Columbia NewsBlaster and (2) the language grid project.

Columbia NewsBlaster (footnote 4) collects news articles written in several languages, translates them into English, and classifies and summarizes the translated articles [2]. The focus of that work is on NLP techniques for clustering and summarizing multilingual texts. Our focus, by contrast, is on mixing representation methods: we use textual information together with graphical representations such as geographical and network views (see Section 3).

In the language grid project, Ishida and his colleagues propose tools to support cross-lingual and cross-cultural communication [6]. The focus of that project is on human-to-human communication, both in the real world and in computer-mediated communication (CMC) using video and audio conferencing tools. Our focus, by contrast, is on the analytical aspect of concerns across languages.
We consider that mixing several representations of concerns is important for CLCA. In the next subsection, we describe a layer architecture of CLCA.

Footnote 4: http:newsblaster.cs.columbia.edu

2.2 Layer architecture of CLCA

To help people find concerns across languages, we propose a layer architecture of CLCA (footnote 5). Figure 1 shows the layer architecture, which consists of the following seven layers.

1. Fundamental layer
2. Linguistic resource layer
3. Search layer
4. NLP application layer
5. Machine translation layer
6. Integration layer
7. Trust layer

We describe each layer in turn.

The first layer is the fundamental layer, which provides character code and communication components. At the bottom of Figure 1, there are two components: (1) Unicode and (2) Internet components. These are the fundamental components of CLCA for writing texts in various languages and for exchanging texts with other people and other computers.

The second layer is the linguistic resource layer. It has two components: (1) linguistic resources and (2) multilingual textual data resources. The linguistic resources comprise thesauri, dictionaries, and NLP tools. The multilingual textual data comprise textual data on the Web.

The third layer is the search layer, which contains a cross-lingual information retrieval (CLIR) component. In this layer, people can retrieve multilingual textual data on the Web.

The fourth layer is the NLP application layer, which provides various NLP applications such as text summarization, clustering, opinion analysis, sentiment analysis, and visualization. Each of these NLP applications works with a specific language.

The fifth layer is the machine translation (MT) layer. This component translates the output of the NLP applications in the fourth layer.

The sixth layer is the integration layer, in which the outputs of the NLP applications, translated into the user's mother tongue, are integrated into a report. An intelligent user interface would be needed for this layer.

The seventh layer is the trust layer, where trust among people would be established. In this layer, one can find concerns across languages and have better conversations with people living in different areas or countries.

Footnote 5: The idea of this architecture is borrowed from the layer cake architecture of the Semantic Web (http:www.w3.orgsw).

Figure 1. Layer architecture of the cross-lingual concern analysis.

Table 1. Summary of collected data (at March 9, 8, 8: JST)

Language    # of        # of articles
Chinese     895,8       9,49,747
Japanese    3,994,473,53,87
Korean      6,389       7,69,65
English     ,76         6,4,86
Total       5,6,76      43,984,485

3 Prototype system

In this section, we describe (1) an overview of the system, (2) a cross-lingual keyword navigation tool, and (3) a splog survey tool called SplogExplorer.

3.1 Overview

KANSHIN is a system that collects and analyzes multilingual blog articles [3, 4]. Figure 2 shows the architecture of the system. The system collects and analyzes blog articles in four languages: Japanese, Chinese, Korean, and English. Table 1 summarizes the blog data collected so far. The system provides several viewpoints for finding concerns across languages:

1. Temporal view of concerns
2. Focal view of a topic
3. Geographical view
4. Network view

In the following, we describe each viewpoint.

Figure 2. Overview of the system. The system collects Japanese, Chinese, Korean, and English blog articles. A user can retrieve articles across languages, and can also find words that co-occur with a keyword and the topic words of the day. (Stages labeled in the figure: gathering RSS/Atom feeds; indexing and storing words and articles into tables; querying in the user's mother tongue through a query translator; comparative trend graph; co-occurred words; daily and monthly topics. Example query shown: "Bird flu", 鳥インフルエンザ, 禽流感, 조류독감.)

Temporal view of concerns

Users can find temporal trends of concerns across languages. Figure 3 shows daily trends of concerns over "crude" in the four languages (from October 6, 7 to January 5, 8). Two points are marked in this figure. At point A, we can see a burst of Korean articles around December, 7. This burst was caused by an oil spill accident in Korea (footnote 6). At point B, we can see another burst, this time of Japanese blog articles, around January 5, 8. This is because crude oil prices passed $ a barrel on the New York Mercantile Exchange. Thus, we can find differences in concerns over the same topic. We can also compare concerns by using co-occurring words (see next subsection).

Footnote 6: http:en.wikipedia.orgwiki7 Korea oil spill

Users can retrieve blog articles across languages by using their mother tongue. Because the system translates a keyword into the other three languages, users can retrieve articles without translating the keyword manually. The system translates a keyword by using bilingual dictionaries and Wikipedia. Furthermore, a multilingual keyword navigation tool based on the interlanguage links of Wikipedia helps users select an appropriate translation from the keyword candidates (see Section 3.2).
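The keyword translation step described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the lookup tables, the language codes, the function name `translate_keyword`, and the choice to consult the dictionary before Wikipedia are all assumptions made for the example. The sample terms are the "Bird flu" keywords that appear in the system overview figure.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (source language code, lower-cased keyword)

# Hypothetical lookup tables standing in for the bilingual dictionaries and
# for Wikipedia interlanguage links; the real resources are not specified.
BILINGUAL_DICT: Dict[Key, Dict[str, str]] = {
    ("en", "bird flu"): {"ja": "鳥インフルエンザ", "zh": "禽流感"},
}
WIKIPEDIA_LINKS: Dict[Key, Dict[str, str]] = {
    ("en", "bird flu"): {"zh": "禽流感", "ko": "조류독감"},
}
LANGUAGES = ("ja", "zh", "ko", "en")


def translate_keyword(keyword: str, source_lang: str) -> Dict[str, Optional[str]]:
    """Return one query term per language; None marks a missing translation."""
    key = (source_lang, keyword.lower())
    from_dict = BILINGUAL_DICT.get(key, {})
    from_wiki = WIKIPEDIA_LINKS.get(key, {})
    queries: Dict[str, Optional[str]] = {source_lang: keyword}
    for lang in LANGUAGES:
        if lang == source_lang:
            continue
        # Assumed order: dictionary first, Wikipedia as the fallback.
        queries[lang] = from_dict.get(lang) or from_wiki.get(lang)
    return queries


if __name__ == "__main__":
    for lang, term in translate_keyword("Bird flu", "en").items():
        print(lang, term)
```

Returning `None` for a language with no candidate leaves room for the user to pick a translation by hand, which is the role the keyword navigation tool plays in the system.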

Figure 3. Daily trends of concerns over "crude" in four languages ("Ja" stands for Japanese, "Ko" for Korean, "Zh" for Chinese, and "En" for English blogs). Point A in the Korean blogs shows concerns over an oil spill accident in Korea, and point B in the Japanese blogs shows concerns over news stories about the rise of crude oil prices in New York.

Focal view of a topic

Users can find words that co-occur with a keyword. Table 2 lists the words that co-occurred with "crude" from December 3, 7 to January 7, 8. The term "crude" co-occurs with "oil", "price", and "market" in the Japanese, Chinese, and English blogospheres. In the Korean blogosphere, by contrast, "Taean", "Accident", "Chung-Nam", and "Offshore" appear, because an oil spill occurred near Taean County, South Korea, in December 7. Thus, we can see differences in the focus on a topic among languages.

Geographical view

Providing geographical views is also important for CLCA. The system provides a geographical viewpoint of concerns. Figure 4 shows the result of a geographical analysis of a blog site. Although this function currently works only for Japanese blogs, one can instantly find the locations mentioned in blog articles. To extract location information, the system extracts address information by using a named-entity extraction tool called CaboCha [8] and the Yahoo! LocalSearch API (footnote 7). The extracted address information is plotted on the Google map (footnote 8).

Footnote 7: http:developer.yahoo.co.jpmaplocalsearchv localsearch.html (in Japanese)

Figure 4. Geographical analysis of blog articles. Each point represents a location mentioned in blog articles. Points are plotted on the Google map.

Network view

Network analysis is also important for understanding concerns on the Web [1]. Therefore, the system provides a network view of blog sites, in which users can examine the link structure among them. The system crawls blog sites by depth-first search (up to three hops) and extracts hyperlinks from blog articles. The starting URL (seed URL) is specified by the user.
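A depth-limited, depth-first crawl of the kind described for the network view might look like the sketch below. It is an illustration under stated assumptions: the function name `crawl_depth_first`, the `get_links` callback, and the example.org URLs are hypothetical, and the link graph is held in memory so that the example runs without network access. A real crawler would fetch each page and extract its hyperlinks instead.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


def crawl_depth_first(
    seed: str,
    get_links: Callable[[str], Iterable[str]],
    max_hops: int = 3,
) -> Tuple[Set[Tuple[str, str]], Dict[str, int]]:
    """Follow links depth-first from seed, going at most max_hops hops away.

    Returns the set of (source, target) link edges and the smallest hop
    count at which each discovered URL was reached.
    """
    depth: Dict[str, int] = {seed: 0}
    edges: Set[Tuple[str, str]] = set()

    def visit(url: str, hops: int) -> None:
        if hops >= max_hops:
            return  # pages at the hop limit are recorded but not expanded
        for target in get_links(url):
            edges.add((url, target))
            # Re-expand a page only when it is reached by a shorter path.
            if hops + 1 < depth.get(target, max_hops + 1):
                depth[target] = hops + 1
                visit(target, hops + 1)

    visit(seed, 0)
    return edges, depth


# Hypothetical link graph used in place of real HTTP fetches.
SEED = "http://blog.example.org/seed"
A = "http://blog.example.org/a"
B = "http://blog.example.org/b"
C = "http://blog.example.org/c"
D = "http://blog.example.org/d"
E = "http://blog.example.org/e"
LINKS = {SEED: [A, B], A: [C, SEED], C: [D], D: [E]}

if __name__ == "__main__":
    found_edges, found_depth = crawl_depth_first(SEED, lambda u: LINKS.get(u, []))
    for url, hops in sorted(found_depth.items(), key=lambda item: item[1]):
        print(hops, url)
```

Tracking the smallest hop count per URL avoids a common pitfall of depth-limited depth-first search, where a page first reached through a long path would otherwise never be expanded even though a shorter path exists.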
Figure 5 shows the result of analyzing the link structure among Korean blog sites. In this figure, similar blog sites are connected to each other.

3.2 Cross-lingual keyword navigator

The cross-lingual keyword navigator (footnote 9) is a support tool for browsing translation candidates of a keyword across languages. Figure 6 shows a network of keywords on video games. Users can find several translation candidates for the keyword and choose one of them for retrieval.

The cross-lingual keyword navigator utilizes the interlanguage links (ILL) of Wikipedia (footnote 10). Wikipedia has an interlanguage link feature that enables editors to link together two articles written in languages A and B. By analyzing ILL, we can find translation candidates. Compared with bilingual dictionaries, Wikipedia also contains new words and technical terms. The tool analyzes the ILL of Wikipedias and shows relations between keywords as a network. Users can find candidate keywords for a keyword.

Footnote 8: http:maps.google.com
Footnote 9: Available at http:arai.cdl.im.dendai.ac.jp
Footnote 10: http:en.wikipedia.orgwikihelp:interlanguage links

Table 2. Words co-occurring with "crude" in each language (the period is from December 3th, 7 through 6th January, 8).

      Japanese blog        Chinese blog             Korean blog              English blog
Rank  Term       Articles  Term           Articles  Term          Articles  Term     Articles
1     High       746       Oil            3         Taean         8         Prices
2     Price      678       Full text      9         Accident      44        Barrel   87
3     Inflating  677       Price          75        Chung-Nam     7         Price    7
4     Effects    37        Country        64        Offshore      85        New      89
5     Future     35        Future         58        Damage        8         Futures  88
6     Market     34        International  55        Taean County  59        US       8
7     Up         33        Market         5         Incident      58        Year     64

Figure 5. A network view of relations between blog sites (the seed is a Korean blog site). The system follows hyperlinks up to three hops. The largest node (blue) represents the seed URL specified by a user. The second-largest nodes (red) represent blog sites connected to the seed.

Figure 6. Screen image of the cross-lingual keyword navigator. This tool analyzes interlanguage links of Wikipedias and shows relations between keywords.

3.3 SplogExplorer

Splogs are one of the big issues in the blogosphere. Because the blogosphere contains so many spam blogs (splogs) [7], filtering them out is important in every language space. As a first step towards splog filtering, we prepared a splog analysis tool called SplogExplorer for investigating the splogosphere. SplogExplorer has the following functions: (1) checking and detecting splogs, (2) finding copy-and-paste splogs, and (3) collaborative annotation. Figure 7 shows a screen image of SplogExplorer. Users can find splog candidates by browsing parameters such as the number of articles, the number of links in an article, and so on. Table 3 lists the parameters that SplogExplorer analyzes.
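The feature variables that SplogExplorer analyzes (entries per day, characters per entry, HTML tags per entry, and within-site and external links) could be computed along the following lines. This is a rough sketch using only the Python standard library; how the actual tool measures each feature is not described, so the counting rules, the names `EntryParser`, `entry_features`, and `entries_per_day`, and the example hosts are all assumptions.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urlparse


class EntryParser(HTMLParser):
    """Counts opening tags, visible characters, and link targets in an entry."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = 0
        self.chars = 0
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags += 1  # assumption: only opening tags are counted
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_data(self, data):
        # Assumption: whitespace around each text run is not counted.
        self.chars += len(data.strip())


def entry_features(html: str, site_host: str) -> Dict[str, int]:
    """Per-entry feature values; site_host is the host name of the blog site."""
    parser = EntryParser()
    parser.feed(html)
    parser.close()
    hosts = [urlparse(href).netloc for href in parser.hrefs]
    # Relative links (empty host) are treated as within-site links.
    within = sum(1 for host in hosts if host in ("", site_host))
    return {
        "characters": parser.chars,
        "html_tags": parser.tags,
        "within_site_links": within,
        "external_links": len(hosts) - within,
    }


def entries_per_day(entry_dates: List[str]) -> float:
    """Average number of entries per day on which at least one was posted."""
    return len(entry_dates) / len(set(entry_dates)) if entry_dates else 0.0
```

Because none of these counts depends on the words in an entry, the same code applies unchanged to entries in any language. Link counts are computed per entry here; summing them over a site's entries would give site-level values.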
We chose language-independent features so that we can apply this tool to other languages. Although the current dataset used in SplogExplorer consists of Japanese blog articles, we will use other languages and compare the features of splogs across languages.

Table 3. List of feature variables of splogs

# of entries per day
# of characters in an entry
# of HTML tags in an entry
# of within-site links of a blog site
# of external links of a blog site

Figure 7. Screen image of the SplogExplorer. Users can find splogs and their feature values. (Callouts in the figure: suggestion of judgment proposed by the system; list of keywords appearing in this entry; search conditions such as keywords and # of articles; feature values of this entry.)

4 Discussion

In this section, we describe related work.

4.1 Related work

Google news (footnote 11) is one related system. Google news collects news articles written in various languages on the Web, and classifies and summarizes them. The concept of Google news differs from CLCA because Google news does not directly compare the concerns of people across languages. In this research, we proposed a CLCA system that compares the concerns of people across languages.

In the Global Autonomous Language Exploitation (GALE) project of DARPA (footnote 12), researchers are creating systems that help military personnel understand multilingual speech and text data. The direction of the GALE project is similar to that of this research, but the focus is different. Our research focuses on the analytical aspect, in which users can compare concerns across languages.

TextMap (footnote 13) and Google Trends (footnote 14) are also related systems. TextMap shows a geographical viewpoint of concerns [9], and Google Trends shows a temporal viewpoint of concerns. These systems are monolingual. We propose a cross-lingual concern analysis using multilingual blog articles.

Footnote 11: http:news.google.com
Footnote 12: http:www.arpa.miliptoprogramsgalegale.asp
Footnote 13: http:www.textmap.com
Footnote 14: http:www.google.comtrends

5 Conclusion

In this paper, we described a layer architecture of cross-lingual concern analysis (CLCA) and proposed its prototype system. As more multilingual text becomes available on the Web, CLCA becomes more important; it is one of the important issues for next-generation search. The prototype system helps users find (1) temporal trends of concerns across languages, (2) the focus of a topic based on words that co-occur with a keyword, and (3) geographical and network viewpoints of articles. The system also helps users find multilingual keyword candidates and provides a tool for exploring the splogosphere. Our future work is to incorporate machine translation systems to help users read multilingual articles.

References

[1] L. A. Adamic and N. Glance. The political blogosphere and the 4 U.S. election: divided they blog. pages 36 43, 5.
[2] D. K. Evans, J. L. Klavans, and K. R. McKeown. Columbia Newsblaster: Multilingual news summarization on the web. In D. M. Susan Dumais and S. Roukos, editors, HLT-NAACL 4: Demonstration Papers, pages 4, Boston, Massachusetts, USA, May - May 7 4. Association for Computational Linguistics.
[3] T. Fukuhara, T. Murayama, and T. Nishida. Analyzing concerns of people using weblog articles and real world temporal data. In Proceedings of WWW 5 nd Annual Workshop on the Weblogging Ecosystem: Aggregation, and Dynamics, 5. (Available at http:www.blogpulse.compapers5fukuhara.pdf, Accessed 8--6).
[4] T. Fukuhara, T. Utsuro, and H. Nakagawa. Cross-lingual concern analysis from multilingual weblog articles. In A. Nijholt, O. Stock, and T. Nishida, editors, Proceedings of the 6th Workshop on Social Intelligence Design, pages 55 64, 7. (ISSN: 574-846).
[5] F. C. Gey, N. Kando, C.-Y. Lin, and C. Peters. New directions in multilingual information access. SIGIR Forum, 4():3 39, 6.
[6] T. Ishida. An infrastructure for intercultural collaboration. In IEEE/IPSJ Symposium on Applications and the Internet (SAINT-6), pages 96, 6. (Available at http:langrid.nict.go.jpenpublication.html, Accessed 8--6).
[7] P. Kolari, A. Java, and T. Finin. Characterizing the Splogosphere. In Proceedings of the 3rd Annual Workshop on Weblogging Ecosystem: Aggregation, and Dynamics, 5th World Wide Web Conference. Computer Science and Electrical Engineering, University of Maryland, Baltimore County, May 6.
[8] T. Kudo and Y. Matsumoto. Japanese dependency analysis using cascaded chunking. In CoNLL : Proceedings of the 6th Conference on Natural Language Learning (COLING Post-Conference Workshops), pages 63 69,.
[9] L. Lloyd, D. Kechagias, and S. Skiena. Lydia: A System for Large-Scale News, volume 377 of Lecture Notes in Computer Science, pages 6 66. 5.
[10] Y. Mikami, P. Zavarsky, M. Z. A. Rozan, I. Suzuki, M. Takahashi, T. Maki, I. N. Ayob, P. Boldi, M. Santini, and S. Vigna. The language observatory project (LOP). In WWW 5: Special interest tracks and posters of the 4th international conference on World Wide Web, pages 99 99, New York, NY, USA, 5. ACM.
[11] D. W. Oard. Towards analysis tools for a multilingual blogsphere. In Proceedings of the Computational Approaches to Analyzing Weblogs, pages 76 78. AAAI Press, 7. AAAI Technical Report SS-6-3.