Language Technology based on Big Data: Current Situation and Future Perspectives Timo Honkela 30 October 2014 Centre for Preservation and Digitisation Department of Modern Languages KITES-symposium
Introductory remarks
Department of Modern Languages Language Technology HELSINKI Center for Preservation and Digitisation MIKKELI
Digital humanities: research within the humanities with the help of computers (digital resources, computational models). Basic motivation: one can already fly to the moon and build sophisticated factorial products, yet the most important open questions in the world are related to the humanities and social sciences
Changing role of computers: machines are increasingly capable of performing pattern recognition and learning. Traditionally, ICT systems were programmed to perform their operations in a manner that made them predictable. Systems now no longer repeat their actions in the same manner over and over; they evolve and can take contextual factors into account better than before
Early personal experiences with rule-based natural language processing. H. Jäppinen, T. Honkela, H. Hyötyniemi & A. Lehtola (1988): A Multilevel Natural Language Processing Model. Nordic Journal of Linguistics 11:69-87. Example query: "What is the turnover of the ten largest stock exchange companies in forestry?" Processing stages: morphological analysis, dependency parsing, logical analysis, database query formation, result from the SQL database
DIGITAL RESOURCES Images Texts Speeches/ convers. Videos Interactive systems Numerical data Multimedia documents Computational models Computer software
Complexity of language as an object of study and as a means of representation and communication
> 6000 languages, many more dialects. A large number of different cultures. Billions of people. A vast number of ways to relate language, concepts and the world to each other
Language as a projection Timo Honkela: Self-Organizing Map as a Means for Gaining Perspectives Metalithicum, Einsiedeln, June 2014
Challenge: a tension between the usability and standardization of content descriptions on the one hand and, on the other, the richness and evolution of language and its interpretation, genre and style variation, and contextuality, subjectivity and cultural dependence
red wine red skin red shirt Gärdenfors 2000
Color naming (amateurs vs professionals)
Richness and contextuality of interpretation Shall I Compare Thee To A Summer's Day A small elephant versus a big mouse A beautiful scenery, painting or composition Democracy, equality, sustainability, fairness, science,...
Present and emerging methodological possibilities
Opportunities: Analysis of contextual data
Classical example: Learning meaning from context: Maps of words in Grimm fairy tales [Figure: self-organizing map of words] Honkela, Pulkki & Kohonen 1995
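The idea of learning meaning from context can be illustrated with a minimal sketch. It is not the self-organizing map method of the cited work; it only shows the preceding step of describing each word by the words that occur around it, and then compares words by the cosine similarity of those context counts. The function names, the window size and the toy sentence are assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def context_vectors(tokens, window=1):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy text (hypothetical, not from the Grimm corpus).
tokens = "the king slept the queen slept the king ate the queen ate".split()
vectors = context_vectors(tokens)

# Words used in the same contexts get similar vectors.
print(cosine(vectors["king"], vectors["queen"]))
print(cosine(vectors["king"], vectors["slept"]))
```

In this toy text "king" and "queen" occur in identical contexts, so they come out more similar to each other than "king" is to "slept". A map of words would then be obtained by feeding such context vectors to a self-organizing map, which is not shown here.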
Independent Component Analysis of wellbeing-related words in Reddit texts (Honkela, Izzatdust, Lagus 2012)
Opportunities: Analysis and visualization of text corpora
We are facing a new situation: systems can simulate or imitate human interpretation to some extent. Systems are actually becoming increasingly epistemologically autonomous. Software used in an analysis not only contains prebuilt assumptions but also evolves over time based on the data it has read or seen
Map of Finnish Science Chemistry Health Bio- and environmental sciences Culture and society Natural sciences and engineering (Honkela & Klami 2007)
Opportunities: Analysis of multimodal data
An example of automatic multimedia content analysis Acknowledgements: Finnish Broadcasting Company (YLE) Jorma Laaksonen users.ics.aalto.fi/jorma/ scholar.google.com/citations?user=suhzeyiaaaaj&hl=en Mikko Kurimo users.ics.aalto.fi/mikkok/ elec.aalto.fi/en/about/careers/professors/mikko_kurimo/
Video analysis / scene classification Speaker recognition Speech recognition (speech to text)
Video analysis / scene classification Speaker recognition OCR Speech recognition (speech to text)
Opportunities: Analysis of multimodal corpora
Labeling movements Förger & Honkela
WALKING JOGGING RUNNING LIMPING
Opportunities: Modeling subjectivity and contextuality of interpretation
GICA: Grounded Intersubjective Concept Analysis Honkela, Raitio, Lagus & Nieminen 2012
Analysis of health in the State of the Union addresses Subjects on objects in contexts: Using GICA method to quantify epistemological subjectivity. Timo Honkela, Juha Raitio, Krista Lagus, Ilari T. Nieminen, Nina Honkela, and Mika Pantzar. Proc. of IJCNN 2012.
Distant ... close reading: we will have more and more methods that let machines help in conducting close reading
Opportunities: Crossing language borders
Google Speech-to-speech Translation
Consider how different languages divide the conceptual space in different ways (cf. e.g. Melissa Bowerman et al.)
Opportunities: Analysis of human interpretation in the description of data
Analyzing Emotional Semantics of Abstract Art Using Low-Level Image Features. He Zhang, Eimontas Augilius, Timo Honkela, Jorma Laaksonen, Hannes Gamper and Henok Alene, Proceedings of IDA 2011.
Opportunities: Using text mining to support qualitative research
Text Mining for Qualitative Research. Nina Janasik, Timo Honkela, and Henrik Bruun. Text mining in qualitative research: Application of an unsupervised learning method. Organizational Research Methods, 12(3):436-460, 2009.
Opportunities: Sentiment analysis
Honkela, Korhonen, Lagus & Saarinen: Five-dimensional sentiment analysis of corpora, documents and words, WSOM 2014 P: Positive E: Engagement R: Relationships M: Meaning A: Achievement (Seligman et al.)
Opportunities: Interoperability without standardization?!
Emergence of a coherent lexicon in a community of interacting SOM-based agents (Lindh-Knuutila, Lagus & Honkela, SAB'06) Related to e.g. Steels and Vogt on language games Simulating processes of language emergence and communication
Concept Formation and Communication - General Theory
$C_i$: $N$-dimensional metric concept space
$S_i$: symbol space, the vocabulary of an agent, consisting of discrete symbols
$\lambda: C_i \times C_j \to \mathbb{R}$, $i \neq j$: a distance between two points in the concept spaces of different agents
$\xi_i: S_i \to C_i$: an individual mapping function from symbols to concepts
$\varphi_i: S_i \to D$: an individual mapping from agent $i$'s vocabulary to the signal space $D$, with an inverse mapping $\varphi_i^{-1}$ from the signal space to the symbol space
Observing $f_1$ and after the symbol selection process, agent 1 communicates a symbol $s^*$ to agent 2 as signal $d$. When agent 2 observes $d$, it maps it to some $s_2 \in S_2$ by using the function $\varphi_2^{-1}$. Then it maps the symbol to some point in its concept space by using $\xi_2$. If this point is close to its observation $f_2$ in the sense of $\lambda$, the communication process has succeeded.
Timo Honkela, Ville Könönen, Tiina Lindh-Knuutila, and Mari-Sanna Paukkeri. Simulating processes of concept formation and communication. Journal of Economic Methodology, 15(3):245-259, 2008.
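The communication step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation from the cited paper: the distance λ is taken to be Euclidean, both agents' concept spaces are assumed to share the same coordinates, symbol selection is assumed to pick the symbol whose concept lies nearest to the observation, and "close" is assumed to mean within a fixed tolerance. The class name, the symbols and the signals are hypothetical.

```python
import math


def euclidean(a, b):
    """Assumed stand-in for the distance lambda between two concept points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Agent:
    def __init__(self, concepts, signals):
        # xi: symbol -> point in this agent's concept space
        self.xi = dict(concepts)
        # phi: symbol -> signal in the shared signal space D
        self.phi = dict(signals)
        # Inverse of phi: signal -> symbol (assumes phi is one-to-one)
        self.phi_inv = {d: s for s, d in self.phi.items()}

    def select_symbol(self, observation):
        # Assumed selection rule: nearest concept to the observation.
        return min(self.xi, key=lambda s: euclidean(self.xi[s], observation))

    def send(self, observation):
        # Map the selected symbol to a signal.
        return self.phi[self.select_symbol(observation)]

    def receive(self, signal):
        # Map signal -> own symbol -> point in own concept space.
        return self.xi[self.phi_inv[signal]]


def communicate(sender, receiver, f1, f2, tolerance):
    """Return True if the receiver's interpretation is close to its own observation."""
    d = sender.send(f1)
    point = receiver.receive(d)
    return euclidean(point, f2) <= tolerance


# Two agents with different private symbols but a shared signal vocabulary.
agent1 = Agent({"a1": (0.9, 0.1), "b1": (0.1, 0.9)}, {"a1": "red", "b1": "blue"})
agent2 = Agent({"x": (0.8, 0.2), "y": (0.2, 0.8)}, {"x": "red", "y": "blue"})

print(communicate(agent1, agent2, (1.0, 0.0), (0.9, 0.1), tolerance=0.25))  # True
print(communicate(agent1, agent2, (1.0, 0.0), (0.0, 1.0), tolerance=0.25))  # False
```

The agents never exchange concept points directly; only the signal crosses between them, so success depends on how well their individual symbol-to-concept mappings happen to align.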
Libraries
Museums Citizens Archives Artists Libraries Teachers Researchers Journalists Universities DIGITAL RESOURCES Societies Media Companies Information specialists Decision makers Municipalities State
Content and information professionals Users of the contents (professionals and lay people) Formal metadata Language technology resources and systems Machine learning and pattern recognition systems Other forms of description Resources
Thank you for your attention!