How To Create A Global Data Science System For People Of Every Language



Bringing Data Science to the Speakers of Every Language
Robert Munro, PhD
CEO, Idibon

CEO, Idibon
Global text analytics in 50+ languages
About me: technology and global development
Working with leaders in industry & social good
Industry: CTO / CIO
Energy infrastructure in Liberia and Sierra Leone
Global epidemic tracking
Crowdsourcing and natural language processing for disaster response
Other
Ph.D. in NLP from Stanford
Bicycled 20+ countries

Recommendations for language processing for social good
Look beyond English: there is inherent benefit in understanding and supporting speakers of every language.
Employ people in those languages: crowdsourced workers speak 100s of languages, and want to use them.
Embrace the variation: you can't rely on consistent spellings, but you can learn to model the diversity.

How many languages are in the connected world? How has this changed?
[Chart: # of languages by year; data labels not recoverable.]
Putting a phone in the hands of everyone on the planet is the easy part.
Understanding everyone is going to be more complicated.

Every human communication this year
Source: Ethnologue, Nationalencyklopedin

7% of our communications are digital; most are still direct spoken language.

If every online picture were worth a thousand words, it would double social media.
Every picture

Every 3 months, the world's text messages exceed the word count of every book.
Every book. Ever.
Source: Google Books

Print communication is smaller than anything shown.
Ditto any one social network.

The Twitter firehose is about the size of the dot above the "i" in "English".
Beyond the processing capacity of most organizations.
Might not be a representative sample of all human activity for your area of interest.

There are more than 6,000 other languages.
Only the top 1% are shown.

No language from the Americas made the cut.
Quechua

Email spam would be larger than every block except spoken Mandarin.
Source: Mashable

Short messages (SMS and IM) make up 2% of the world's communications.
The largest and most linguistically diverse form of written communication that has ever existed.
Number of PhDs focused on processing large volumes of short messages in low-resource languages? 1

If the Facebook "like" were a one-word language, it would be in the top 5% of languages by word count.

Your browser probably won't show Sundanese script.

Sundanese speakers outnumber the populations of New York, London, Tokyo and Moscow.
Combined.

You misread Sundanese as "Sudanese", which is a variety of Arabic.
We have a blind spot for knowing about the existence of languages.

This is the breakdown of languages that most of our data is moving towards.

An earthquake struck Haiti on January 12, 2010.
Most local services failed, but most cell-towers remained functional.

Messages start streaming in

Mission 4636
Original message: Fanm gen tranche pou fè yon pitit nan Delmas 31
Translation: Undergoing children delivery Delmas 31
Coordinates: 18.495746829274168, -72.31849193572998 (18.4957, -72.3185)
Category: Emergency
Message translated, categorized & geolocated.
Location is refined & actionable items are identified.

Global collaboration
2,000 volunteers, transferred to paid workers in Haiti

Lopital Sacre-Coeur ki nan vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la.
"Sacre-Coeur Hospital which located in this village of Okap is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital."

Local knowledge
Workers collaborating to find locations:
Dalila: I need Thomassin Apo please
Apo: Kenscoff Route: Lat: 18.495746829274168, Long: -72.31849193572998
Apo: This Area after Petion-Ville and Pelerin 5 is not on Google Map. We have no streets name
Feedback from responders:
"just got emergency SMS, child delivery, USCG are acting, and, the GPS coordinates of the location we got from someone of your team were 100% accurate!"
The ability for someone to make a real-time difference at any other place in the world:
Apo: I know this place like my pocket
Dalila: thank God u was here
Diagram labels: Dalila, Apo, Haiti responders (18.4957, -72.3185)
here = anywhere

How do we automate processing the world's data?

English
Generations of standardization in spelling and simple morphology.
Whole words suitable as features for NLP systems.
Most other languages
Relatively complex morphology.
Less (observed) standardized spellings.
More dialectal variation.

Haitian Kreyòl
No standard (wide-spread) spellings.
More or less French spellings.
More or less phonetic spellings.
Frequent words (esp. pronouns) are shortened and compounded.
Regional slang / abbreviations.

Haitian Kreyòl
mèsi, mesi, mèci, merci
Cap-Haïtien
Kapayisyen
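One way to see why whole-word features struggle with this kind of spelling variation is to compare the variants at the character level. The sketch below is only an illustration, not the method used in the talk: it assumes padded character bigrams and Jaccard overlap, and the function names `char_ngrams` and `jaccard` are hypothetical. It uses four spellings of "thank you" (mèsi, mesi, mèci, merci).

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with boundary markers."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (size of intersection over size of union)."""
    return len(a & b) / len(a | b)


# Four spellings of the same word: as whole-word features they share nothing,
# but their character n-grams overlap.
variants = ["mèsi", "mesi", "mèci", "merci"]
base = char_ngrams(variants[0])
for v in variants:
    print(v, round(jaccard(base, char_ngrams(v)), 2))
```

As whole-word features the four spellings are unrelated strings; as sets of character bigrams each shares at least the word-boundary bigrams with the others, which gives a classifier something to generalize from.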

The extent of the subword variation
>30 spellings of odwala ("patient") in Chichewa.
>50% of the variants of odwala occur only once in the data used here.
Affixes and incorporation:
kwaodwala -> kwa + odwala
ndiodwala -> ndi + odwala (official ngodwala not present)
Phonological/orthographic:
odwara -> odwala
ndiwodwala -> ndi (w) odwala

Chichewa
The word odwala ("patient") in 600 text messages in Chichewa and the English translations.

Modeling the variation gives accurate results
Raw messages:
ndimmafuna manthwala ("I currently need medicine")
ndi kufuni mantwara ("my want of medicine")
1 in 4 classification errors with raw messages.
1) Normalize spellings: ndimafuna mantwala; ndi kufuni mantwala
2) Segment: ndi-ma-fun-a man-twala; ndi-ku-fun-i man-twala
3) Identify predictors: ndi -fun man-twala ("I need medicine"); Category = "Request for aid"
Classification error is lower after this processing (the exact rate and the closing remark about scale are not legible in the source).
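The normalize-then-segment idea can be sketched with the odwala variants listed earlier. This is a minimal illustration under stated assumptions, not the system described in the talk: the prefix list, the global r -> l replacement, and the glide-dropping rule are hand-written and hypothetical, whereas a real system would learn such alternations from data.

```python
# Hypothetical, hand-written rules for illustration only.
PREFIXES = ("ndi", "kwa")  # affixes that appear in the odwala examples
VOWELS = "aeiou"


def normalize(word):
    """Lowercase and collapse the r/l spelling alternation (a crude, global rule)."""
    return word.lower().replace("r", "l")


def segment(word):
    """Split a known prefix from the rest of a normalized word."""
    word = normalize(word)
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            rest = word[len(prefix):]
            # Drop a glide "w" written between the prefix and a vowel-initial stem.
            if rest.startswith("w") and len(rest) > 1 and rest[1] in VOWELS:
                rest = rest[1:]
            return [prefix, rest]
    return [word]


for variant in ["odwara", "kwaodwala", "ndiodwala", "ndiwodwala"]:
    print(variant, "->", segment(variant))
```

After this step every variant contributes the same stem, odwala, as a feature, so a classifier no longer treats each rare spelling as an unrelated word.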

Comparison with English
[Chart: accuracy (micro-F) against percent of training data.]

Taking it to the world

The benefits of understanding everyone
Human diseases eradicated in the last [?]5 years: smallpox.
Increase in air travel in the last [?]5 years: [chart not recoverable]

Reports of strange new illnesses pre-date official records
HIV: decades ([?]5 million infected)
H5N1 (Bird Flu): weeks (>50% fatal)
H1N1 (Swine Flu): months (10% of world infected)

but the reports are in 1000s of languages
[Figure labels: share of ecological diversity and share of linguistic diversity (percentages not legible); H5N1.]

Crowdsourcing, big data, and expert analysts
Most information is in plain language: multiple skill and processing strategies required.

Digital Disease Discovery
Reports: millions per day, many languages, much noise.
Big data machine learning: extraction, filtering & prioritization.
Crowdsourcing: thousands of native-language speakers.
Analysts: several domain experts.
Global monitoring.
Safer world.
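The first two stages of such a pipeline (machine filtering and prioritization, then routing to native-language speakers) could look roughly like the sketch below. Everything in it is an assumption for illustration: the `Report` fields, the relevance `score` (taken as given from some upstream classifier), the threshold, and the function names are hypothetical and do not describe Idibon's system.

```python
from dataclasses import dataclass


@dataclass
class Report:
    text: str
    language: str   # language code of the report, e.g. "ht" for Haitian Creole
    score: float    # assumed relevance score from an upstream classifier


def prioritize(reports, threshold=0.5, top_k=3):
    """Filter out low-scoring reports and keep the top_k highest-scoring ones."""
    kept = [r for r in reports if r.score >= threshold]
    return sorted(kept, key=lambda r: r.score, reverse=True)[:top_k]


def route_by_language(reports):
    """Group reports into per-language queues for native-language reviewers."""
    queues = {}
    for r in reports:
        queues.setdefault(r.language, []).append(r)
    return queues


incoming = [
    Report("report a", "ht", 0.9),
    Report("report b", "ny", 0.2),
    Report("report c", "ht", 0.7),
    Report("report d", "su", 0.6),
    Report("report e", "su", 0.55),
]
for language, queue in route_by_language(prioritize(incoming)).items():
    print(language, [r.text for r in queue])
```

The point of the split is cost: the machine stage discards most of the noise cheaply, so that the scarcer human attention (crowd workers, then domain experts) is spent only on the reports most likely to matter.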

The impact of scalable monitoring
Found historical signals that pre-dated Google Flu Trends by [?] weeks and the CDC by 5, on CNN.
How can we filter/model media-driven amplification?
"I'm Jacqui Jeras with today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase"
January 4, 2008, CNN Weather

The impact of scalable monitoring
Tracked Ebola in Uganda 5 days before World Health Organization.
"we were able to pull in much richer data from a larger number of sources, so we knew not just how many people were infected, but what kind of transport they took when they went from their village to the hospital in the nearest main town."
Robert Munro, UN General Assembly on Big Data and Global Development.
Age, gender, village, went to hospital.
This is personally identifying & could lead to persecution.
Open data would have a negative impact.

The impact of scalable monitoring
Tracked E. coli outbreak in Germany 2 days before ECDC.
How do we motivate information processing?
Margins are small: only for-profit big data and crowdsourcing can have a sustained impact.

Idibon's current work
Hurricane Sandy
Idibon's CTO ran FEMA's Aerial Damage Assessments.
We have >1,000,000 manual tags on communications.
MIT Humanitarian Response Lab
Identifying reports about supply-line interruptions.
Research data from a combination of crowdsourcing and natural language processing.

Recommendations for language processing for social good
Look beyond English: there is inherent benefit in understanding and supporting speakers of every language.
Employ people in those languages: crowdsourced workers speak 100s of languages, and want to use them.
Embrace the variation: you can't rely on consistent spellings, but you can learn to model the diversity.

Thank you!
Robert Munro, PhD
CEO, Idibon