Video Transcription in MediaMosa Proof of Concept Version 1.1
December 28, 2011
SURFnet/Kennisnet Innovatieprogramma
The SURFnet/Kennisnet Innovation Programme is made financially possible by the Ministry of Education, Culture and Science. This publication is covered by the Creative Commons Attribution 3.0 Unported licence. More information about this licence can be found at http://creativecommons.org/licenses/by/3.0/
Contents

Introduction
Speech recognition
Open Source Speech recognition tools
SPRAAK / CMUSphinx
MediaMosa
Assets and Mediafiles
Job processing
Proof of Concept
MediaMosa 3.0
Installation of SPRAAK
Installation of CMUSphinx
Transcription modules
Frontend
Search
SOLR
Display of search results
Subtitles
JWplayer
Flowplayer
VideoJS
Support in MediaMosa for Subtitle files
Solution in the PoC
Subtitles generation
Conclusions and Recommendations
Introduction

Open Source Automatic Speech Recognition (ASR) is a proven technology. Using ASR in MediaMosa, a digital asset management system, would be of great value for the video content delivered through MediaMosa. The Technology scout Video Transcription [1] concluded that ASR tools in MediaMosa could be used for searching content, and that they could provide a first version of subtitles to facilitate further processing. Since a usable subtitle file requires a low word error rate, the subtitles should not be used as generated. The reader of this document is also advised to read the Technology scout for more background information.

The aim of this document is to describe the Proof of Concept and the experiences gained while integrating it in MediaMosa. The second aim is to describe necessary and possible changes to MediaMosa that would facilitate the integration of Automatic Speech Recognition tools. These change proposals can be used for future development of MediaMosa. The implementation of this Proof of Concept resulted in this document and a Proof of Concept website: http://spraak.mediamosa.surfnet.nl/.

Speech recognition

In the context of this Proof of Concept, video transcription can be described as the automatic conversion of speech and images into machine-readable data. This data can be used for captioning videos (subtitles), but also for extending the video metadata with transcripts. As written in the Technology scout, there is an important difference between transcription and captioning:

Transcript: a transcript is a written text representation of spoken words, not synchronized with the spoken words of the mediafile.
Captions/Subtitles: captions or subtitles are written text representations of spoken words, synchronized with the spoken words. Captions/Subtitles are synchronized transcripts.

Open Source Speech recognition tools

The Technology scout suggested that the proof of concept should implement SPRAAK for Dutch and CMUSphinx for English speech recognition.

1. http://mediamosa.org/content/technology-scouting-project-video-transcripting-technology
SPRAAK / CMUSphinx

SPRAAK (Speech Processing, Recognition and Automatic Annotation Kit; "spraak" is also the Dutch word for "speech") is an open source speech recognition package. It is an efficient and flexible tool that combines many of the recent advancements in automatic speech recognition with a very efficient decoder in a proven HMM [2] architecture. SPRAAK serves several purposes. The first is a highly modular toolkit for research into speech recognition algorithms; it allows researchers to focus on one particular aspect of speech recognition technology without needing to worry about the details of the other components. The second is a state-of-the-art recognizer with a simple interface, so that non-specialists with minimal programming experience can use it. Next to speech recognition, the resulting software enables applications in related fields as well, for example linguistic and phonetic research, where the software can be used to segment large speech databases or to provide high-quality automatic transcriptions (http://www.spraak.org).

CMUSphinx is an open source toolkit for speech recognition, which includes a recognizer library written in C; an adjustable, modifiable recognizer written in Java; language model tools; and acoustic model training tools (http://cmusphinx.sourceforge.net).

MediaMosa

From the website: MediaMosa is software to build a full-featured, webservice-oriented media management and distribution platform. With MediaMosa you can build a state-of-the-art, scalable middleware media distribution platform, which facilitates access to, and usage of, (shared) storage capacity, metadata databases, and transcoding and streaming servers. A MediaMosa platform offers functionality for searching, playing, uploading and transcoding, as well as a fine-grained media access control system for its users. MediaMosa is based on the Representational State Transfer (REST) architecture and is designed to support content streaming applications by providing a back-end, audio and video infrastructure. The main features of a MediaMosa platform are:

- Delivery platform for audio and video (and in fact any other content)
- Streaming of any format (e.g. Flash, H.264 MPEG-4 and Windows Media)
- Transcoding based on FFmpeg
- Flexible metadata element sets
- Access management functions on media
- Enhanced still functions
- Open Source under the GPLv2 license

MediaMosa is a free and open source software package. It is based on the Drupal CMS and supports the use of several other Open Source tools such as FFmpeg (http://www.mediamosa.org/).

2. HMM: Hidden Markov Models: http://en.wikipedia.org/wiki/hidden_markov_model
Assets and Mediafiles

Assets in MediaMosa describe events or entities, using metadata. An asset has a number of mediafiles that are representations of the asset. Mediafiles are not limited to one type and can be a video, audio file, image, document, etc. A mediafile can also be a transcoded version of another mediafile. A mediafile has technical metadata, which is generated by technical analysis. The mediafile itself does not yet have user-modifiable metadata (apart from the 'tag' field).

Job processing

As described in the Technology Scout, the transcription process is carried out by a transcoding module. MediaMosa integrates tools in a modular way: tools can be added as a Drupal module and modified in different places using the Drupal hook system. A (transcode) profile is defined as a tool with a predefined set of parameters. A transcode job can be started with a mediafile as input and a transcode profile; the result is a transcoded mediafile. With this concept in mind, we implemented two different speech recognition tools: mediamosa_tool_spraak and mediamosa_tool_cmusphinx.

Proof of Concept

The Proof of Concept (PoC) was deployed on a single VM server in the SURFnet VMware cluster at SARA, with the following software:

- Operating system: Debian
- MediaMosa 3.0
- SPRAAK software
- CMUSphinx software
- Drupal 6 installation with the MediaMosa-CK modules

MediaMosa 3.0

The latest version from GitHub was taken and installed (MediaMosa 3.0.2, build 1734). Some additional changes to MediaMosa were needed in order to integrate the speech recognition tools. The general adaptations were:

- Creation of two transcoding tools: mediamosa_tool_spraak and mediamosa_tool_cmusphinx.
- The introduction of a hook_post_transcode() function. In MediaMosa a transcode job always completes with a mediafile as a result. For storing transcription data in metadata we need to hook into this process when the file is generated (and analyzed). In the hook, a file can be transformed to metadata (see the sketch below).
- Fixes for some bugs in transcode profile handling.
- Support for storing transcription metadata with assets.
- Addition of the VideoJS media player to MediaMosa.
- Addition of a metadata field on assets for storing transcription results.
- Support in play objects for subtitles and more mediafile formats.

These changes are described in more detail later.
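To illustrate the hook_post_transcode() adaptation mentioned in the list above, here is a minimal sketch of how a tool module could pick up the recognizer output file and turn it into asset metadata. The function name follows the hook pattern, but the exact signature, the metadata helper and the assumption that the time index is in seconds are ours, not MediaMosa's:

<?php
// Minimal sketch; the hook signature and the metadata helper are
// assumptions for illustration, not the actual MediaMosa API.
function mediamosa_tool_spraak_post_transcode($asset_id, $output_file) {
  $transcript = array();
  // SPRAAK output: one recognized word plus a time index per line.
  foreach (file($output_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    // Assumption: "word time" per line, time index in seconds.
    list($word, $time) = preg_split('/\s+/', trim($line), 2);
    // Prefix every word with a readable hh:mm:ss time code.
    $transcript[] = gmdate('H:i:s', (int) $time) . ' ' . $word;
  }
  // Hypothetical helper that writes the text into the 'transcription'
  // metadata field of the asset.
  _poc_asset_metadata_set($asset_id, 'transcription', implode("\n", $transcript));
}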
Installation of SPRAAK

Installation of SPRAAK was rather straightforward. The PoC installation was done on a Debian system (version: squeeze) and required some additional Debian packages (scans, sox, mplayer). SPRAAK was delivered with an NBEST setup, which was made to transcribe broadcast news shows without any tunable parameters for the end user. Since in MediaMosa we generally cannot predict what kind of voices will be uploaded, we used this generic setup. It is also a sensible setup for an educational setting where a speaker gives a lecture. The NBEST setup has two language models, one with a size of 100k and one of 400k.

Installation of CMUSphinx

CMUSphinx is available in different forms: Sphinx2, Sphinx3, Sphinx4 and PocketSphinx. The Sphinx documentation [3] describes version 3 as "a slightly slower but more accurate speech recognizer", usually used for a server implementation of Sphinx or for evaluation. Since accuracy is more important than speed in a typical MediaMosa server setup, Sphinx3 was chosen for the Proof of Concept. Installation and configuration were performed following a wiki description [4]. For general-purpose transcription, Keith Vertanen's English Gigaword Language Models are recommended; in the PoC the 64k version of the language model is used. The choice of an acoustic model was more difficult; we used the HUB4 (broadcast news) acoustic models for wideband (16 kHz) speech as a general-purpose acoustic model.

Transcription modules

In MediaMosa two modules were created for the transcription decoders: mediamosa_tool_spraak and mediamosa_tool_cmusphinx. These two tools define their own metadata field (to store the transcription data). Normally a transcoding ends with a transcoded file, which can be added to an asset. The transcription process needed some additional processing to add the contents of the file to metadata, so a new hook 'mediamosa_tool_<name>::post_transcode' was introduced (sketched above). In this step the transcription results are also transposed to a subtitle file.

Frontend

The frontend is a standard Drupal (http://www.drupal.org) installation with the MediaMosa-CK module. Drupal 6 was chosen since this version was reported to be more stable than the Drupal 7 version. A number of changes had to be made in the frontend in order to show the working of the proof of concept: support for search, support for transcription options, support for the video player, support for job status, better cron handling and better still presentation. The most important changes are described in the next chapters.

3. http://cmusphinx.sourceforge.net/wiki/versions
4. http://sphinx.subwiki.com/sphinx/index.php/hello_world_decoder_quickstart_guide
Search

The transcription tools in MediaMosa store their results in a new metadata field under the asset (the install hook of mediamosa_tool_spraak creates this field). SPRAAK generates output as a single text file, with every line containing a recognized word and a time index. In the PoC the time index is transformed to a more readable format (hh:mm:ss, truncated to whole seconds) and stored in the metadata field (transcription) under the asset.

A typical MediaMosa search REST call would be: /asset [GET]. Searching in MediaMosa uses CQL (Common Query Language); a search in the transcription metadata could look like:

/asset [GET] cql=transcript=nieuwe

(More advanced CQL can be used here.)

SOLR

MediaMosa supports searching in MySQL or in SOLR. SOLR indexing is implemented by collecting all data of an asset, including all metadata, so searching with SOLR on transcription data works out of the box; no special additions were necessary. However, some additional testing is needed: the context sets of the SOLR specification were implemented in the SOLR module of MediaMosa, and this code should be moved to the metadata modules (including the Spraak module, as it defines its own metadata group).

Display of search results

The PoC used the existing cron function of the MediaMosa Construction Kit (MediaMosa-CK) to add the transcription results to a Drupal field. This field is shown in the interface, which in normal applications is of course not useful, but was done here for demonstration purposes:

[screenshot: search results with the transcription field]

Changing the output to direct links to a specific point in the video is not too difficult: add a GET parameter to the found search link and jump to that timeframe on the play page. With VideoJS this can be done with JavaScript, for example:

myPlayer.currentTime(120); myPlayer.play();

More specialized players can show a graphical representation of the search results; for example, the Buchenwald website http://hmiapps.ewi.utwente.nl/buchenwald/ shows the results in a separate bar in the video:

[screenshot: Buchenwald player with a search result bar below the video]

With the search results described earlier, it should not be too difficult to generate such an additional search result bar in the player. The three players described later have no support (yet) for this.
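The deep links mentioned above only require the time code of a search hit. As a server-side sketch, assuming a hypothetical page layout and a 't' GET parameter (both our choice for this example, not an existing MediaMosa convention):

<?php
// Sketch: build a deep link from a transcript search hit. The URL
// layout and the 't' parameter are assumptions of this example.
function poc_transcript_hit_link($asset_id, $hit_seconds) {
  return sprintf('/content/detail/%s?t=%d', $asset_id, $hit_seconds);
}

On the play page, the value of the parameter can then be handed to VideoJS, which seeks to it with the currentTime()/play() calls shown above.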
Search in OCR applications

The chosen implementation for searching in transcriptions can be used for searching in OCR data as well. What is needed here is a 'transcode' tool that converts slides to text; the Tesseract [5] tool looks promising for this task. OCR data can be stored in a metadata field, just like the output of the transcription tools. Also, to synchronize the slides with the video, a solution similar to that of the transcription tools could be used: store slide numbers with the stored OCR data. This PoC, however, did not include a setup with the Tesseract OCR tool; a sketch of what such a tool could look like is given below.

5. http://en.wikipedia.org/wiki/tesseract_%28software%29
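As an indication only (no OCR tool was part of the PoC), a minimal sketch of how such an OCR 'transcode' tool could shell out to the Tesseract command line; the wrapper function and the slide-number bookkeeping are hypothetical:

<?php
// Hypothetical OCR step for one slide image; not part of the PoC.
// 'tesseract <image> <outputbase>' writes the recognized text to
// <outputbase>.txt.
function poc_ocr_slide($image_path, $slide_number) {
  $out = tempnam(sys_get_temp_dir(), 'ocr');
  exec('tesseract ' . escapeshellarg($image_path) . ' ' . escapeshellarg($out));
  $text = file_get_contents($out . '.txt');
  // Keep the slide number with the OCR text so the slides can later
  // be synchronized with the video.
  return array('slide' => $slide_number, 'text' => trim($text));
}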
Subtitles

Automatically generated subtitles should only be used in videos if the recognition accuracy is better than 95%. In the Proof of Concept we violated this rule for demonstration purposes only. Most well-known video players have at least some support for subtitles; subtitles can be given in the variable list, as a link, or as part of the playlist in the object code of the player. Some examples:

JWplayer

JWplayer has extensive support; with the help of plugins it can show and switch between a number of subtitles (based on language).

Flowplayer

Flowplayer has relatively good support for subtitles; see the examples on the captions documentation page: http://flowplayer.org/plugins/flash/captions.html.

VideoJS

VideoJS at the moment supports one subtitle file, but has no button to show it. However, in the same way as Flowplayer uses jQuery to modify the player object, this could be implemented here as well. Subtitles in VideoJS are limited to showing the text as given; no fancy buttons are available at the moment. However, the player is nicely integrated with jQuery and can be extended. Also, switching subtitles based on language can be implemented in the same way as described in the Flowplayer documentation.
Support in MediaMosa for Subtitle files

In MediaMosa an asset consists of several mediafiles, which are digital representations of the asset. A subtitle file is no different, so it can be stored as a mediafile. In order to distinguish subtitle files from other mediafiles, MediaMosa should introduce some kind of internal type field, which can be any of {mp4, subtitle, still, webm, ogv, ...}. These types can be used internally, for example in the object code of streaming servers. In MediaMosa an admin defines a streaming server per streamable file type:

[screenshot: streaming server definitions per file type]

And every streaming server has an object code definition, which can be modified:

[screenshot: object code definition of a streaming server]

In this example, when a play request is made for an mp4 file with request-type=object, this object code will be served. In MediaMosa only one file plus a default still is served per object code. The VideoJS player supports fallback mediafiles according to the HTML5 standard, so some changes had to be made to the object code. MediaMosa will replace {TICKET_URI} with a link to the streamable mediafile; in the PoC we had to introduce {WEBM_URI}, {OGV_URI} and {SRT_URI} with links to the webm, ogv and generated subtitle files. We also had to find a way to identify these files, and chose the 'tag' field of the mediafile; the tag is added to the mediafile in the hook_post_transcode() function (described earlier). For a final implementation, the use of the tag should be replaced by a new mediafile type field, which could also integrate the is_still flag.
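To make the object code mechanism concrete, the following sketch shows how a VideoJS-style object code using the PoC placeholders could be filled in. The template itself and the substitution function are illustrative assumptions; the real MediaMosa object codes and their parsing differ in detail:

<?php
// Illustrative only: a VideoJS-style object code using the PoC
// placeholders. The real object code is defined by the admin per
// streaming server.
$object_code = '<video class="video-js" controls preload>
  <source src="{TICKET_URI}" type="video/mp4" />
  <source src="{WEBM_URI}" type="video/webm" />
  <source src="{OGV_URI}" type="video/ogg" />
  <track kind="subtitles" src="{SRT_URI}" srclang="nl" />
</video>';

// Replace each placeholder with a (ticketed) link to the matching
// mediafile of the asset.
function fill_object_code($object_code, array $uris) {
  return strtr($object_code, $uris);
}

$html = fill_object_code($object_code, array(
  '{TICKET_URI}' => '/ticket/abc123/video.mp4',
  '{WEBM_URI}'   => '/ticket/abc123/video.webm',
  '{OGV_URI}'    => '/ticket/abc123/video.ogv',
  '{SRT_URI}'    => '/ticket/abc123/video.srt',
));

Storing such templates separately from the streaming server definitions, as proposed below, would make it easy to maintain one template per player.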
Considering the trend that object codes nowadays use several links to several mediafiles (the video, the poster image, subtitles and alternative video formats), the MediaMosa concept of one object code per streaming server definition must be improved. The object codes could be stored separately from the streaming server definitions. That would also pave the way for requesting a file with a specific player, where MediaMosa can maintain and provide the correct player definitions.

Solution in the PoC

We chose the VideoJS player, which has basic support for subtitle files in .srt format and is fully open source. VideoJS uses an HTML5 video tag, which plays on modern browsers with built-in video support; for older browsers a fallback to a Flash player (by default Flowplayer) is implemented. In the PoC we implemented VideoJS with subtitles. Installation was simple: we added the VideoJS sources to the MediaMosa tree under /mmplayer, and changed the object code for mp4 to the VideoJS object code.

Subtitles generation

SPRAAK generates a list of recognized words with time codes, one word/time code per line. A subtitle file has a different format. The most widely used format is SubRip (SRT), which is also used by VideoJS. An example:

1
00:00:20,000 --> 00:00:24,400
Altocumulus clouds occur between six thousand

2
00:00:24,600 --> 00:00:27,800
and twenty thousand feet above ground level.

At the moment there is no open source tool to convert a timed list of words to a subtitle file. A good conversion takes all sorts of language constructs into consideration. In the PoC a relatively simple solution was implemented based on word-wrap functionality, sketched below.
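As an indication of this approach, a minimal sketch of such a word-wrap based conversion. The input format (a list of word/seconds pairs) and the fixed cue length are simplifying assumptions; the actual PoC code may differ:

<?php
// Sketch of a naive word-wrap subtitle generator: group the timed
// words into cues of at most $words_per_cue words. $words is a list
// of array($word, $seconds) pairs, assuming seconds as time unit.
function poc_words_to_srt(array $words, $words_per_cue = 8) {
  $srt = '';
  $cues = array_chunk($words, $words_per_cue);
  foreach ($cues as $i => $cue) {
    $start = $cue[0][1];
    // End a cue where the next one starts, or 3 seconds after $start.
    $end = isset($cues[$i + 1]) ? $cues[$i + 1][0][1] : $start + 3;
    $text = '';
    foreach ($cue as $pair) {
      $text .= ($text === '' ? '' : ' ') . $pair[0];
    }
    $srt .= ($i + 1) . "\n"
          . poc_srt_time($start) . ' --> ' . poc_srt_time($end) . "\n"
          . $text . "\n\n";
  }
  return $srt;
}

// Format seconds as the SubRip hh:mm:ss,mmm time code.
function poc_srt_time($seconds) {
  $ms = (int) (($seconds - (int) $seconds) * 1000);
  return gmdate('H:i:s', (int) $seconds) . ',' . sprintf('%03d', $ms);
}

A real converter would break cues on sentence boundaries and pauses rather than after a fixed number of words.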
The implementation in the PoC only supports one subtitle file; the final implementation in MediaMosa should support subtitle files in several languages. Distinguishing the language of the mediafile is possible by providing a standard ISO language code in addition to the already proposed type field (instead of the tag). A list of subtitle files belonging to an asset must be obtainable with a REST call.

Conclusions and Recommendations

This PoC was an implementation of some of the findings described in the technology scout Video Transcription. The implementation resulted in this document and a PoC website: http://spraak.mediamosa.surfnet.nl/. In this PoC two transcription tools were implemented: SPRAAK for Dutch and CMUSphinx for English spoken language. The results of open source speech recognition without further training of the tools were mixed when used for subtitles, but are, in our opinion, sufficient for searching through metadata. The sample videos used in this PoC were recordings of the eight o'clock news, which produced good results. More research should be done to improve the speech recognition results for videos without studio quality.

The implementation resulted in two separate transcoding modules in MediaMosa. The concept of separate modules in MediaMosa for separate tools resulted in clean implementations of the two tools: the tools can easily be added without having to modify the MediaMosa core. Some extra changes are needed in MediaMosa to handle the specific speech results; the two most important are the new metadata fields added to an asset and the support for more tickets in the HTML object codes of video players. It is highly recommended to use the results of this PoC in the next development of MediaMosa (MediaMosa 3.5) to make it useful for the MediaMosa community.

Besides the obvious benefits to viewers with hearing disabilities, transcription and captioning also offer a number of additional benefits to a much broader community of users that should not be overlooked:

- Indexing and searching: transcription produces additional metadata, which is time-coded as well. This allows the content to become easily searchable with traditional text searches. With user-generated metadata alone, it is not possible to search within a video.
- Improved accessibility: improved accessibility will make content more useful to a broader audience. Viewers with many types of learning disabilities will benefit from the increased comprehension and retention that captioning brings. Transcription technology will make it easier to produce captions.
- Improved quality: to improve the quality of the captions, functionality should be developed to manually edit the automatically produced captions.
- Localization: adding translations to your captions, with support for multiple caption tracks, widens your potential audience massively.

The PoC website Video Transcription can be found on the MediaMosa homepage (http://www.mediamosa.org/projects), section projects.