Language Technology based on Big Data: Current Situation and Future Perspectives


Timo Honkela, 30 October 2014
Centre for Preservation and Digitisation; Department of Modern Languages
KITES-symposium

Introductory remarks

Department of Modern Languages, Language Technology (Helsinki)
Centre for Preservation and Digitisation (Mikkeli)

Digital humanities: research within humanities with the help of computers (digital resources, computational models).
Basic motivation: One can already fly to the moon and build sophisticated industrial products. The most important open questions in the world are related to humanities and social sciences.

Changing role of computers. Machines are increasingly capable of performing pattern recognition and learning. Traditionally, ICT systems were programmed to perform their operations in a manner that made them predictable. Today's systems no longer repeat their actions in the same manner over and over; they evolve and can take contextual factors into account better than before.

Early personal experiences with rule-based natural language processing. H. Jäppinen, T. Honkela, H. Hyötyniemi & A. Lehtola (1988): A Multilevel Natural Language Processing Model. Nordic Journal of Linguistics 11:69-87.
Example query: "What is the turnover of the ten largest stock exchange companies in forestry?"
Processing levels: morphological analysis; dependency parsing; logical analysis; database query formation; result from the SQL database.

DIGITAL RESOURCES: images; texts; speeches/conversations; videos; interactive systems; numerical data; multimedia documents; computational models; computer software.

Complexity of language as an object of study and as a means of representation and communication

More than 6000 languages, many more dialects. A large number of different cultures. Billions of people. A vast number of ways to relate language, concepts and the world to each other. (Image sources: en.wikipedia.org, blogs.state.gov)

Language as a projection. Timo Honkela: Self-Organizing Map as a Means for Gaining Perspectives. Metalithicum, Einsiedeln, June 2014.

Challenge: A tension between the usability and standardization of content descriptions on the one hand and, on the other, the richness and evolution of language and its interpretation, genre and style variation, and contextuality, subjectivity and cultural dependence.

red wine, red skin, red shirt (Gärdenfors 2000)

Color naming (amateurs vs professionals)

Richness and contextuality of interpretation:
"Shall I compare thee to a summer's day"
A small elephant versus a big mouse
A beautiful scenery, painting or composition
Democracy, equality, sustainability, fairness, science, ...

Present and emerging methodological possibilities

Opportunities: Analysis of contextual data

Classical example: Learning meaning from context. Maps of words in Grimm fairy tales, learned from word contexts in text data with a self-organizing map (Honkela, Pulkki & Kohonen 1995).
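The idea of learning meaning from context can be sketched in a few lines of Python. This is a minimal illustration, not the setup of Honkela, Pulkki & Kohonen (1995): the four-sentence corpus, the one-word context window, the 3x3 map size and the training schedule are all assumptions chosen for brevity. Each word is described by the words that occur immediately to its left and right, and a small self-organizing map then places words with similar contexts on the same or nearby units.

```python
import math
import random
from collections import defaultdict

# Toy corpus (hypothetical); the original study used Grimm fairy tales.
SENTENCES = [
    "the cat eats food",
    "the dog eats food",
    "the king rules land",
    "the queen rules land",
]

def context_vectors(sentences):
    """Describe each word by unit-length counts of its left and right neighbours."""
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            if i > 0:
                counts[word]["L:" + tokens[i - 1]] += 1.0
            if i < len(tokens) - 1:
                counts[word]["R:" + tokens[i + 1]] += 1.0
    features = sorted({f for ctx in counts.values() for f in ctx})
    vectors = {}
    for word, ctx in counts.items():
        vec = [ctx.get(f, 0.0) for f in features]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors[word] = [v / norm for v in vec]
    return vectors

def cosine(a, b):
    """Cosine similarity; the inputs are already unit length."""
    return sum(x * y for x, y in zip(a, b))

def best_unit(units, vec):
    """Index of the map unit whose weight vector is closest to vec."""
    return min(range(len(units)),
               key=lambda k: sum((w - v) ** 2 for w, v in zip(units[k], vec)))

def train_som(vectors, rows=3, cols=3, epochs=200, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    units = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    data = list(vectors.values())
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs      # decays from 1 towards 0
        rate = 0.5 * frac                # learning rate
        radius = 0.5 + 1.5 * frac        # neighbourhood radius on the grid
        for vec in rng.sample(data, len(data)):
            b_row, b_col = divmod(best_unit(units, vec), cols)
            for k, unit in enumerate(units):
                row, col = divmod(k, cols)
                dist2 = (row - b_row) ** 2 + (col - b_col) ** 2
                h = math.exp(-dist2 / (2 * radius ** 2))
                for d in range(dim):
                    unit[d] += rate * h * (vec[d] - unit[d])
    return units, cols

vectors = context_vectors(SENTENCES)
units, cols = train_som(vectors)
placement = {w: divmod(best_unit(units, v), cols) for w, v in vectors.items()}
for word, position in sorted(placement.items()):
    print(word, position)
```

Because "cat" and "dog" occur in identical contexts in this toy corpus, they necessarily share a map unit, as do "king" and "queen". With real text, related words tend to end up near each other rather than exactly together.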

Independent Component Analysis of wellbeing-related words in Reddit texts (Honkela, Izzatdust, Lagus 2012)

Opportunities: Analysis and visualization of text corpora

We are facing a new situation. Systems can simulate or imitate human interpretation to some extent. Systems are actually becoming increasingly epistemologically autonomous. The software used in an analysis not only contains prebuilt assumptions but also evolves over time based on the data it has read or seen.

Map of Finnish Science (Honkela & Klami 2007). Areas shown: Chemistry; Health; Bio- and environmental sciences; Culture and society; Natural sciences and engineering.

Opportunities: Analysis of multimodal data

An example of automatic multimedia content analysis
Acknowledgements: Finnish Broadcasting Company (YLE)
Jorma Laaksonen: users.ics.aalto.fi/jorma/, scholar.google.com/citations?user=suhzeyiaaaaj&hl=en
Mikko Kurimo: users.ics.aalto.fi/mikkok/, elec.aalto.fi/en/about/careers/professors/mikko_kurimo/

Video analysis / scene classification; speaker recognition; OCR; speech recognition (speech to text)

Opportunities: Analysis of multimodal corpora

Labeling movements Förger & Honkela

WALKING JOGGING RUNNING LIMPING

Opportunities: Modeling subjectivity and contextuality of interpretation

GICA: Grounded Intersubjective Concept Analysis Honkela, Raitio, Lagus & Nieminen 2012

Analysis of health in the State of the Union addresses. Timo Honkela, Juha Raitio, Krista Lagus, Ilari T. Nieminen, Nina Honkela, and Mika Pantzar. Subjects on objects in contexts: Using GICA method to quantify epistemological subjectivity. Proc. of IJCNN 2012.

Distant ... close reading. We will have more and more methods that enable machines to help in conducting close reading.

Opportunities: Crossing language borders

Google Speech-to-speech Translation

Consider how different languages divide the conceptual space in different ways (cf. e.g. Melissa Bowerman et al.)

Opportunities: Analysis of human interpretation in the description of data

Analyzing Emotional Semantics of Abstract Art Using Low-Level Image Features. He Zhang, Eimontas Augilius, Timo Honkela, Jorma Laaksonen, Hannes Gamper and Henok Alene, Proceedings of IDA 2011.

Opportunities: Using text mining to support qualitative research

Text Mining for Qualitative Research. Nina Janasik, Timo Honkela, and Henrik Bruun. Text mining in qualitative research: Application of an unsupervised learning method. Organizational Research Methods, 12(3):436-460, 2009.

Opportunities: Sentiment analysis

Honkela, Korhonen, Lagus & Saarinen: Five-dimensional sentiment analysis of corpora, documents and words, WSOM 2014.
The five dimensions (Seligman et al.): P: Positive; E: Engagement; R: Relationships; M: Meaning; A: Achievement.
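To make the five-dimensional idea concrete, the sketch below scores a text along the P, E, R, M and A dimensions by simple lexicon lookup. It is not the method of the WSOM 2014 paper, whose details are not given on the slide: the ten-word lexicon, the function name perma_profile and the share-of-hits scoring are assumptions invented purely for illustration.

```python
from collections import Counter

DIMENSIONS = ("P", "E", "R", "M", "A")

# Hypothetical seed lexicon, invented for illustration only.
LEXICON = {
    "happy": "P", "joy": "P",
    "absorbed": "E", "focused": "E",
    "friend": "R", "family": "R",
    "purpose": "M", "meaningful": "M",
    "goal": "A", "achieved": "A",
}

def perma_profile(text):
    """Return the share of lexicon hits per dimension (all zeros if no hits)."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    hits = Counter(LEXICON[t] for t in tokens if t in LEXICON)
    total = sum(hits.values())
    return {d: (hits[d] / total if total else 0.0) for d in DIMENSIONS}

profile = perma_profile("I achieved my goal and felt happy with my family.")
print(profile)
```

The same profile function could be applied to a single word, a document or a concatenated corpus, which mirrors the three levels named in the paper title. A realistic system would need a far larger lexicon or a learned model.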

Opportunities: Interoperability without standardization?!

Emergence of a coherent lexicon in a community of interacting SOM-based agents (Lindh-Knuutila, Lagus & Honkela, SAB'06). Related to e.g. Steels and Vogt on language games. Simulating processes of language emergence and communication.

Concept Formation and Communication - General Theory
$C_i$: $N$-dimensional metric concept space of agent $i$.
$S_i$: symbol space, the vocabulary of agent $i$, consisting of discrete symbols.
$\lambda : C_i \times C_j \to \mathbb{R}$, $i \neq j$: a distance between two points in the concept spaces of different agents.
$\xi_i : S_i \to C_i$: an individual mapping function from symbols to concepts.
$\varphi_i : S_i \to D$: an individual mapping from agent $i$'s vocabulary to the signal space $D$, with an inverse mapping $\varphi_i^{-1}$ from the signal space to the symbol space.
After observing $f_1$ and completing the symbol selection process, agent 1 communicates a symbol $s^*$ to agent 2 as signal $d$. When agent 2 observes $d$, it maps it to some $s_2 \in S_2$ by using the function $\varphi_2^{-1}$. Then it maps the symbol to some point in its concept space by using $\xi_2$. If this point is close to its observation $f_2$ in the sense of $\lambda$, the communication process has succeeded.
Timo Honkela, Ville Könönen, Tiina Lindh-Knuutila, and Mari-Sanna Paukkeri. Simulating processes of concept formation and communication. Journal of Economic Methodology, 15(3):245-259, 2008.
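The communication step described above can be sketched as a small Python simulation. The sketch rests on several assumptions that are not in the slide: concept spaces are two-dimensional and share coordinates so that the cross-agent distance can be Euclidean, symbol selection picks the symbol whose concept point is nearest to the observation, and "close" means within a fixed threshold. The class and function names are hypothetical.

```python
import math

def lam(p, q):
    """Distance between points of two concept spaces (assumed Euclidean)."""
    return math.dist(p, q)

class Agent:
    def __init__(self, xi, phi):
        self.xi = xi      # xi_i: symbol -> point in the agent's concept space
        self.phi = phi    # phi_i: symbol -> signal
        self.phi_inv = {signal: symbol for symbol, signal in phi.items()}

    def select_symbol(self, observation):
        """Assumed selection rule: symbol whose concept is nearest the observation."""
        return min(self.xi, key=lambda s: lam(self.xi[s], observation))

def communicate(sender, receiver, f1, f2, threshold=0.5):
    """Return True if the receiver's interpretation lands close to its observation f2."""
    s_star = sender.select_symbol(f1)   # agent 1 selects symbol s*
    d = sender.phi[s_star]              # and emits it as signal d
    s2 = receiver.phi_inv[d]            # agent 2 decodes d with its inverse mapping
    point = receiver.xi[s2]             # and maps the symbol into its concept space
    return lam(point, f2) <= threshold

agent1 = Agent(xi={"a": (0.0, 0.0), "b": (1.0, 1.0)},
               phi={"a": "sig1", "b": "sig2"})
# agent2 uses different symbols but an aligned signal convention.
agent2 = Agent(xi={"x": (0.1, 0.0), "y": (0.9, 1.0)},
               phi={"x": "sig1", "y": "sig2"})
# agent3 has the same concepts as agent2 but the opposite signal convention.
agent3 = Agent(xi={"x": (0.1, 0.0), "y": (0.9, 1.0)},
               phi={"x": "sig2", "y": "sig1"})

f1, f2 = (0.2, 0.1), (0.15, 0.1)   # slightly different views of the same thing
aligned_ok = communicate(agent1, agent2, f1, f2)
misaligned_ok = communicate(agent1, agent3, f1, f2)
print(aligned_ok, misaligned_ok)
```

In this toy setting communication succeeds with the aligned agent and fails with the misaligned one. A language-game simulation would repeat such exchanges and let agents adjust their mappings after each failure.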

Libraries

Stakeholders around DIGITAL RESOURCES: museums; citizens; archives; artists; libraries; teachers; researchers; journalists; universities; societies; media; companies; information specialists; decision makers; municipalities; state.


Content and information professionals; users of the contents (professionals and lay people); formal metadata; language technology resources and systems; machine learning and pattern recognition systems; other forms of description; resources.

Thank you for your attention!