INSIGHT

SDL BeGlobal: Machine Translation for Multilingual Search and Text Analytics Applications

José Curto
David Schubmehl

Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015 www.idc.com

IDC OPINION

SDL, a leading CXM provider, has announced a SaaS machine translation (MT) service, SDL BeGlobal, for search and content analytics vendors based on its industrial strength translation capabilities. The service can be easily integrated with existing text analytics or search indexing products to provide a quick yet comprehensive translation of documents into English (or another language) before applying text analytics techniques such as entity extraction, sentiment analysis, or relationship detection to the document being processed. In detail:

The use of a SaaS-based machine translation service for handling multilingual content will make search and text analytics vendors more competitive in the market, offering support for a wider variety of languages while significantly lowering the overall cost of supporting multilingual capabilities.

Search and text analytics vendors should strongly consider the option of replacing third-party libraries for handling multilingual content with a SaaS-based machine translation service for those applications where Internet access is available and the use of a SaaS-based option is feasible.

The ability to translate multilingual content such as social media, Web content, and enterprise data in real time gives businesses opportunities to rapidly identify and respond to customer insights and trends, allowing these organizations to improve sales, increase ROI, and enhance the customer experience.

IN THIS INSIGHT

This IDC Insight examines the potential for using SaaS-based machine translation (MT) services as a replacement for traditional language dictionaries and recognizers.
SDL, a leading CXM provider, announced on March 11, 2013, that its machine translation solution, SDL BeGlobal, can now be seamlessly integrated with text analytics, search, or content management solutions, enhancing businesses' ability to track and monitor global customer sentiment and business trends. The use of a SaaS-based machine translation service for handling multilingual content can make search and text analytics vendors more competitive in the market, offering support for a wider variety of languages while lowering the overall cost of supporting multilingual capabilities.

Filing Information: March 2013, IDC #240357, Volume: 1
Search and Discovery Technologies: Insight

SITUATION OVERVIEW

The "globalization" of the Internet is causing major changes for vendors and users across the world. A medium that was initially dominated by English language users is fast becoming more representative of the world, though penetration rates still vary significantly by region. Europe (including Russia and Turkey) has a 63% penetration rate, according to Internet World Stats. In the EU28 (including Croatia, which will join the Union on July 1, 2013), 67% of the population is now online. Although penetration rates in Asia are only slightly more than a quarter of the population, these users now constitute 45% of the Internet population. The 275 million Internet users in North America are now only slightly more than 10% of the world Internet population.

Speakers of Asian languages now dominate Web usage, but the languages represented by Web content are still heavily biased toward English. Data from Language Connect shows that of the top 1 million Web sites online, 55% are published in English and only 4.6% in Chinese (many sites, of course, are published in multiple languages). While this remains imbalanced, it is a far cry from the days, only a few years ago, when 90% of Web content was in English.

While content in English is declining, the share of content published in other European languages is not increasing proportionately. The fastest growth is in non-European languages, though Spanish and Portuguese have gained significance because of Latin American markets. Aside from English, Spanish, and Portuguese, only five other European Union (EU) languages (German, French, Italian, Polish, and Dutch), out of 60+ languages spoken in the European Union, are published on more than 1% of the top million sites.

The language landscape in which businesses operate is complex, perhaps more so in Europe than in most regions. There are 60+ languages spoken in member states of the EU, and 23 official languages (soon to be 24).
Localizing Web sites for the range of languages that might be used in digital commerce is daunting, verging on impossible, and only very rich, very large companies can currently afford to be multilingual on a large scale (at most, and very rarely, 25-30 languages globally).

In Asia, we are seeing increasing usage of languages such as Chinese, Korean, and Japanese for regional business and commerce. In addition, an increasing percentage of the languages used to file patents and trademarks are Chinese, Korean, and Japanese. The growing usage of these languages is placing more pressure on search and content analytics vendors to provide a strong and comprehensive strategy for handling multilingual content.

Traditionally, the approach has been for search and text analytics vendors to annually license and pay royalties for dictionaries and language recognizers from third-party suppliers. The costs for these annual licenses routinely start in five figures and often cost vendors six or even seven figures on a year-to-year basis. Since the intellectual property required to handle multilingual content is significant, many vendors chose the route of using a supplier rather than spending years and employing several software engineers and linguists to develop their own multilingual capabilities.

Now SDL, with its BeGlobal product, is offering another way: use its SaaS services to machine translate multilingual content into English (or French, or German, or Spanish; it's the user's choice) and then perform text analytics or search indexing on the translated English content. This frees the vendor from supporting a myriad of
languages with its text analytics or search indexing and provides an easy way for vendors to support a wide variety of languages from Chinese and Korean to Urdu and Pashto. Currently, SDL BeGlobal supports 40+ different languages, resulting in 80+ language translation pairs. The service is based on SDL's core machine translation technology, which has hundreds of enterprise customers and has been adapted for dozens of domains, including ediscovery and litigation support. A number of existing search and text analytics vendors are already using the service, such as Ipsos MORI for predictive analytics, Raytheon BBN Technologies for broadcast monitoring and Web content monitoring, ZyLAB for ediscovery, Next IT for identification of user intent, and Expert System for semantic intelligence.

Raytheon BBN Technologies has embedded SDL BeGlobal into its Foreign Broadcast Monitoring Service. The service offers automatic translation of content for use in search, display, and even entity extraction of people, places, and things. For foreign broadcast monitoring, a video broadcast is first captured and automatically transcribed into the foreign language of the broadcast using Raytheon BBN Technologies' speech-to-text solution. The transcript is then automatically translated using SDL's BeGlobal machine translation service. This is done with only a 3-5 minute delay from the original broadcast. The solution then creates a searchable archive of the spoken content in both English and the source language through a browser-based interface. The company's Web monitoring solution captures content from numerous online sources and stores it in a local repository. The information can then be automatically translated using SDL BeGlobal machine translation and made available for search queries.

ZyLAB is also using SDL BeGlobal in its ediscovery and early case assessment platform.
Many of ZyLAB's customers are multinational companies that have internal documents and correspondence in several languages. To support an ediscovery project, the research system needs to be multilingual in nature to support a comprehensive document review prior to the start of a case. ZyLAB uses SDL BeGlobal to translate all information up front or during a review, enabling litigation support researchers to quickly uncover relevant data and route critical information for a complete translation when needed.

FUTURE OUTLOOK

The growing use of multilingual content in a wide variety of settings is forcing search and content analytics vendors to look at many options to handle it. In addition, the explosion of Asian, African, and South American languages on the Internet only adds to the complexity. Traditional methods of handling multilingual content are to either develop multilingual dictionaries and language recognizers in-house or utilize the services of a third-party library vendor to supply the necessary technology to identify, recognize, and process multilingual content. Third-party vendors usually license their software on an annual basis with an additional royalty based on software sales or usage, which can result in charges from thousands to hundreds of thousands of dollars annually. SDL BeGlobal is
offering another way to handle multilingual processing: use machine translation to translate the document or passage from its original source language (Chinese, Korean, Urdu, Pashto, etc.) to English and then use the vendor's normal English language text analytics or search functions on it instead of processing it in the native language.

The machine translation approach offers speed and flexibility as long as the application is connected to the Internet. Since SDL supports 40+ languages, it is a broader solution than most vendors could achieve with in-house development or even the use of third-party libraries. SDL is also continuously improving its machine translation service through its machine learning algorithms and the amount of content (much of it domain specific) that is being translated.

One concern is the accuracy of performing machine translation on documents before performing text analytics functions. Will the underlying text analytics be as accurate when machine translation is done first? That is currently unknown, although indications to date suggest that it is certainly good enough to do a first pass and then, if necessary, do a human translation, as ZyLAB is recommending with its use of SDL in its ediscovery suite. IDC recommends that vendors interested in machine translation services conduct their own testing of machine translation to understand the pros and cons of automatic translation with their search and text analytics processing.

The use of a SaaS-based machine translation service for handling multilingual content will make search and text analytics vendors more competitive in the market, offering support for a wider variety of languages while significantly lowering the overall cost of supporting multilingual capabilities. Since the cost of the SaaS option scales with the vendor's usage, it can be a relatively inexpensive option to start with.
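
The testing that IDC recommends above can start small. The following Python sketch is illustrative only: it assumes a vendor has a handful of foreign-language samples with hand-labeled expected entities, and it scores an English entity extractor on the machine-translated text using precision and recall. The names `evaluate`, `fake_translate`, and `gazetteer_extract` are hypothetical, and the translator and extractor are stand-in stubs, not SDL BeGlobal's API or any vendor's actual extractor.

```python
from typing import Callable, List, Set, Tuple

# (source text, source language code, entities a human expects in English)
Sample = Tuple[str, str, Set[str]]


def precision_recall(predicted: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted entities against expected ones."""
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


def evaluate(
    samples: List[Sample],
    translate: Callable[[str, str], str],
    extract_entities: Callable[[str], Set[str]],
) -> Tuple[float, float]:
    """Average precision and recall of English extraction run on translated text."""
    if not samples:
        raise ValueError("at least one sample is required")
    precisions, recalls = [], []
    for text, language, expected in samples:
        english = translate(text, language)  # machine translation step
        p, r = precision_recall(extract_entities(english), expected)
        precisions.append(p)
        recalls.append(r)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)


# Hypothetical stub standing in for a real MT service. The second entry
# deliberately lowercases a name to simulate a translation error.
_FAKE_MT = {
    "Ana visitó Lisboa.": "Ana visited Lisbon.",
    "Pedro vive en Madrid.": "Pedro lives in madrid.",
}


def fake_translate(text: str, language: str) -> str:
    return _FAKE_MT[text]


# Hypothetical stub extractor: case-sensitive lookup in a tiny name list.
_GAZETTEER = {"Ana", "Lisbon", "Pedro", "Madrid"}


def gazetteer_extract(english: str) -> Set[str]:
    tokens = english.replace(".", " ").split()
    return {token for token in tokens if token in _GAZETTEER}


SAMPLES: List[Sample] = [
    ("Ana visitó Lisboa.", "es", {"Ana", "Lisbon"}),
    ("Pedro vive en Madrid.", "es", {"Pedro", "Madrid"}),
]

if __name__ == "__main__":
    avg_precision, avg_recall = evaluate(SAMPLES, fake_translate, gazetteer_extract)
    print(f"precision={avg_precision:.2f} recall={avg_recall:.2f}")
```

In a real test, the stubs would be replaced by calls to the translation service and to the vendor's own analytics. Running the same samples through human translations gives a baseline, so that a drop in recall points to entities lost in machine translation.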

Search and text analytics vendors should strongly consider the option of replacing third-party libraries for handling multilingual content with a SaaS-based machine translation service for those applications where Internet access is available and the use of a SaaS-based option is feasible. The ability to handle 40+ languages with a single API call will certainly appeal to many vendors, depending on their requirements for accuracy and immediacy.
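
The translate-then-index approach described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SDL's interface: `TranslatedIndex`, `Document`, and `fake_translate` are hypothetical names, and the translator is a stub that a real integration would replace with a call to the translation service. Each document is translated into English once, indexed with ordinary English tokenization, and stored alongside its source text so that results can be shown in both languages.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Document:
    doc_id: str
    language: str      # source language code, e.g. "de"
    text: str          # original text, kept for display
    english: str = ""  # filled in when the document is indexed


def tokenize(text: str) -> List[str]:
    """Lowercase English tokenizer; the only analysis the index needs."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TranslatedIndex:
    """Inverted index built over English translations of the documents."""

    def __init__(self, translate: Callable[[str, str], str]) -> None:
        self._translate = translate
        self._docs: Dict[str, Document] = {}
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc: Document) -> None:
        # English documents are indexed as-is; everything else is translated.
        if doc.language == "en":
            doc.english = doc.text
        else:
            doc.english = self._translate(doc.text, doc.language)
        self._docs[doc.doc_id] = doc
        for term in tokenize(doc.english):
            self._postings[term].add(doc.doc_id)

    def search(self, query: str) -> List[Document]:
        """Return documents whose English text contains every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        matches = set.intersection(
            *(self._postings.get(term, set()) for term in terms)
        )
        return [self._docs[doc_id] for doc_id in sorted(matches)]


# Hypothetical stub standing in for a real machine translation service.
_FAKE_MT = {
    "Der Vertrag wurde gekündigt.": "The contract was terminated.",
    "Le contrat a été signé.": "The contract was signed.",
}


def fake_translate(text: str, language: str) -> str:
    return _FAKE_MT[text]


if __name__ == "__main__":
    index = TranslatedIndex(fake_translate)
    index.add(Document("d1", "de", "Der Vertrag wurde gekündigt."))
    index.add(Document("d2", "fr", "Le contrat a été signé."))
    for hit in index.search("contract terminated"):
        print(hit.doc_id, hit.language, hit.text, "|", hit.english)
```

Passing the translator in as a callable keeps the indexing code independent of any particular service, so the same index can be exercised offline with a stub and online with a real translation call.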

Copyright Notice

This IDC research document was published as part of an IDC continuous intelligence service, providing written research, analyst interactions, telebriefings, and conferences. Visit www.idc.com to learn more about IDC subscription and consulting services. To view a list of IDC offices worldwide, visit www.idc.com/offices. Please contact the IDC Hotline at 800.343.4952, ext. 7988 (or +1.508.988.7988) or sales@idc.com for information on applying the price of this document toward the purchase of an IDC service or for information on additional copies or Web rights.

Copyright 2013 IDC. Reproduction is forbidden unless authorized. All rights reserved.