Speech-to-Text Transcription in Support of Pervasive Computing



Jarrah Sladek, Andrew Zschorn and Ahmad Hashemi-Sakhtsari
Human Systems Integration Group, Command and Control Division
Defence Science and Technology Organisation
PO Box 1500, Adelaide 5111, South Australia
{Jarrah.Sladek, Andrew.Zschorn, Ahmad.Hashemi-Sakhtsari}@dsto.defence.gov.au

Abstract

Speech recognition technology can help transcribe discussions, interviews, meetings, and conversations. This paper describes a concept demonstrator of an automatic speech-to-text transcriber that uses speech recognition. The demonstrator is described in terms of the motivation for the product, how users operate it, and its similarities to and differences from work being carried out in other research and commercial bodies.

Keywords: meeting capture tools, continuous speech recogniser, COTS software, speaker dependence, CSCW, time stamping, utterance.

1 Introduction

Speech recognition is among a myriad of tools that can be used to create a computer system that is pervasively and unobtrusively embedded in a human environment. This paper describes a concept demonstrator called the Automatic Transcriber of Meetings (AuTM) that uses speech recognition. The main aim of the AuTM prototype is to demonstrate the creation and retrieval of transcriptions associated with collaborative efforts such as meetings and interviews. Landay et al. (1999) claim that as a variety of low-cost note-taking devices becomes pervasive, shared notes can help work groups communicate ideas and information.

Outline

This paper starts with an overview of the AuTM system, focusing on the core functions that have been developed so far. Section 3 describes work carried out by other research organisations, followed by Section 4, which outlines relevant commercial product development.

2 AuTM Overview

The overall goal of the AuTM project is to develop a concept demonstrator of a system that automatically transcribes the speech of collaborative events such as meetings or interviews. It uses a commercial off-the-shelf (COTS) speech recogniser, Dragon NaturallySpeaking (NS), to convert speech to text. NS is a Microsoft Windows-based, large-vocabulary, speaker-dependent, continuous speech recogniser. Because NS is speaker-dependent, users must spend time training it to their voices before they can use it. This also means that NS is, in its default state, unsuitable for transcription where there are multiple speakers. To overcome this problem, AuTM requires that each speaker have a separate instantiation of NS. The utterances of the separately running copies of NS are collated and coordinated over TCP/IP connections, using server and client programs.

The AuTM system allows the live development of the transcription to be visible to each speaker in a meeting. The server and client programs also provide extra functions for producing a minutes-style transcript of a meeting. The meeting facilitator, using the server program, performs the following functions: working through the agenda, annotating transcripts with highlights such as action items and motion details, and summarising the meeting.

AuTM makes audio recordings of each speaker's speech using head-worn noise-cancelling microphones. This enables recognition errors to be corrected by revising the text transcript after the session, matching the text against the corresponding audio. Once the transcript has been corrected, it can be saved as an HTML or Microsoft Word document. This transcript includes the text of each utterance, the name of the speaker who spoke it, and timestamps indicating when the speaker started and finished speaking it. The purpose of the time-stamped utterances is to record accurately who said what, and when, in a meeting transcript. The audio files can be kept on a web server with the transcript, so that they can also be downloaded and played.
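The paper does not include AuTM's source code or wire protocol, so the following is only a minimal sketch of the collation scheme described above, assuming an invented JSON-lines message format and port number. It shows a server that accepts one TCP connection per speaker's recogniser client, collects time-stamped utterances, and renders them into a minutes-style HTML transcript recording who said what and when:

    # Sketch of an AuTM-style collation server (hypothetical protocol; the
    # paper does not publish AuTM's wire format). Each per-speaker recogniser
    # client is assumed to send one JSON object per line over TCP, e.g.
    #   {"speaker": "Alice", "start": "10:02:13", "end": "10:02:19",
    #    "text": "I move that we adopt the agenda.", "audio": "alice_0001.wav"}
    import json
    import socketserver
    import threading
    from html import escape

    utterances = []              # collated transcript, shared by all clients
    lock = threading.Lock()      # protects the shared list

    class UtteranceHandler(socketserver.StreamRequestHandler):
        """One connection per speaker's recogniser wrapper."""
        def handle(self):
            for line in self.rfile:          # one JSON utterance per line
                if not line.strip():
                    continue
                record = json.loads(line)
                with lock:
                    utterances.append(record)

    def render_transcript(path="transcript.html"):
        """Write a minutes-style HTML transcript: who said what, and when.
        Each utterance links to its audio file, assumed to sit on a web
        server next to the transcript."""
        rows = sorted(utterances, key=lambda u: u["start"])
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html><body><h1>Meeting transcript</h1>\n")
            for u in rows:
                f.write(
                    f"<p>[{u['start']}-{u['end']}] <b>{escape(u['speaker'])}</b>: "
                    f"{escape(u['text'])} <a href='{u['audio']}'>audio</a></p>\n"
                )
            f.write("</body></html>\n")

    if __name__ == "__main__":
        server = socketserver.ThreadingTCPServer(("0.0.0.0", 9000), UtteranceHandler)
        try:
            server.serve_forever()           # stop with Ctrl-C when the meeting ends
        except KeyboardInterrupt:
            render_transcript()

In the real system each client wraps an NS instance; for the purposes of the sketch, any process that writes such JSON lines to the server port would do.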
3 Other Research Activities

The research related to AuTM lies in three main areas: computerised meeting room systems, multimedia meeting capture tools, and personal note-taking applications.

3.1 Computerised Meeting Room Systems

Previous work outlined by Zschorn et al. (2002) has identified at least three projects related to AuTM, which are also very similar to each other. These are being conducted at BBN Technologies (Colbath et al. 1998), the International Computer Science Institute (ICSI) at Berkeley (Janin 2001, Morgan et al. 2001), and the Interactive Systems Laboratory (ISL) at Carnegie Mellon (Waibel et al. 1998, 2001; Yu et al. 1998, 1999, 2000; Gross et al. 2000).

While these projects are all very similar to each other in their approach, they differ significantly from AuTM's. Each uses a speaker-independent speech recogniser, initially developed for a somewhat different purpose and then adapted and optimised for the meeting case, to recognise the speech of all speakers. This is the most significant difference between these projects and AuTM, which uses multiple instances of a COTS speaker-dependent speech recogniser. While developing their own recognisers gives the BBN, ICSI and ISL projects far more control, those recognition systems are less powerful than NS. For instance, NS version six has a vocabulary of approximately 250,000 words (ScanSoft 2002), while the speech recognisers used in the other projects have between 30,000 and 45,000 words (Yu et al. 2000, Colbath et al. 1998). This makes out-of-vocabulary (OOV) errors a major problem for the other projects. To help solve the OOV problem, the ISL project (Yu et al. 2000) uses the Web to obtain extra vocabulary for the speech recogniser. In the AuTM system, it would be of benefit to make each user responsible for maintaining their own vocabulary.

The BBN (Colbath et al. 1998) and ISL (Waibel et al. 1998) projects place a strong emphasis on meeting browsers: applications used to navigate text, audio and video records of a meeting. The BBN meeting browser is particularly advanced, with time-aligned automatic topic classification and a search function. In contrast, AuTM produces simple HTML transcripts. The automated approach of the BBN and ISL systems also contrasts with that of AuTM, which leaves summarising to the moderator of the meeting. We assume that AuTM's approach is more accurate, but at the expense of operator time.

Waibel et al. (2001) mention using the ISL system with speakers in remote locations. This is clearly where AuTM is heading in the near future, and a task to which it is particularly suited. Although it is not stated explicitly, the other three projects centre on a special meeting room, presumably wired with microphones and computers. In contrast, AuTM has a more mobile, distributable nature. Like AuTM, the ICSI project (Morgan et al. 2001) uses head-worn microphones, although it also makes recordings with lapel and desk microphones in the hope of introducing that less obtrusive technology in the future. ISL's system uses lapel microphones only (Yu et al. 2000). Using lapel or desk microphones with a single speaker-independent speech recogniser brings two problems: first detecting a speaker change, and then correctly identifying the new speaker. Because each speaker's speech is picked up by more than one audio channel, the audio channels must be separated. These issues do not arise with the head-worn noise-cancelling microphones used by AuTM.
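Returning to the vocabulary-maintenance idea raised above: the paper proposes no mechanism, but a minimal sketch is possible. The following sketch (hypothetical file layout; NS's actual vocabulary tools are not reproduced) scans a corrected transcript for words absent from a user's word list, so the user can review them as candidate vocabulary additions:

    # Sketch of per-user vocabulary maintenance (hypothetical file names and
    # format; the real NS vocabulary editor is a GUI tool not shown here).
    import re
    from pathlib import Path

    def load_vocabulary(path):
        """One known word per line, e.g. a dump of the user's vocabulary."""
        return {w.strip().lower()
                for w in Path(path).read_text().splitlines() if w.strip()}

    def oov_candidates(transcript_text, vocabulary):
        """Words in the corrected transcript that the recogniser's
        vocabulary lacks; in a live system these surface as words the
        recogniser misrecognised and a human later fixed."""
        words = re.findall(r"[A-Za-z']+", transcript_text.lower())
        return sorted(set(words) - vocabulary)

    if __name__ == "__main__":
        vocab = load_vocabulary("alice_vocabulary.txt")
        text = Path("transcript_corrected.txt").read_text()
        for word in oov_candidates(text, vocab):
            print(word)   # candidates for the user's next training pass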
Another related research project is the MeetingManager work at the Massachusetts Institute of Technology (MIT) (Oh et al. 2001), part of MIT's e21 and Oxygen projects. This project is focused on aiding the facilitator's work and compiling summaries of meetings. The work centres on recording short movie clips of important sections of a meeting, rather than using a speech recogniser to put the discussion into text. The facilitator is responsible for identifying which parts of the discussion need to be recorded, and can label and annotate those movie clips with text information using various methods, including speech recognition. The facilitator is also responsible for formulating detailed agenda information. At the conclusion of the meeting this information is stored with the movie clips in a database that can be queried using the Structured Query Language (SQL). So, like AuTM, the MIT project relies on a human moderator to supply intelligent highlighting and summaries of the discussion.

The National Institute of Standards and Technology (NIST) in the United States is currently working on an Automatic Meeting Transcription system, aimed at providing development and infrastructure for speech transcription in meetings. The infrastructure includes rich transcription and a corpus of audio and video from meetings collected at NIST using a variety of microphones and video cameras (NIST 2002). The goals of this project are very similar to those of the AuTM system, although the work is still in its initial stages. In particular, AuTM will benefit from the research and development of content-processing technologies such as text extraction and summarisation. The outcomes of this work could influence AuTM's future directions.

Much work has been done on Electronic Meeting Systems (EMS) (Nunamaker et al. 1991). These systems typically consist of 10-20 personal computers linked together in special meeting rooms running custom software. The user interface for the meeting tools is based on text/keyboard input. The software often organises meetings by dividing proceedings into three phases: idea generation (brainstorming), organisation (grouping), and prioritisation (voting). The meeting software needs to be configured before the meeting starts, after which participants are required to follow a predefined procedure and organisation of the group process. This leaves little room for informal interactions, or for ways of capturing them, and no connection to a publicly available interactive workspace or display is supported. EMS has been shown to improve the quality of group decisions and to reduce the time needed to reach agreement (Nunamaker et al. 1991). However, the hardware needed to run these systems makes them prohibitively expensive and impractical in many settings. Furthermore, these systems may force the focus of meetings towards document creation, redirect some of the group's attention to complex computer interfaces, or require participants to type during meetings, which can be disruptive (Landay et al. 1999). AuTM was specifically designed not to rely on custom hardware: the system uses off-the-shelf automatic speech recognisers (ASR) and ordinary PCs or laptops. Systems such as EMS are better suited to certain types of structured meetings, such as those focused on decision-making or idea generation. In contrast, AuTM imposes less meeting structure and supports a wide variety of meeting styles.

Xerox PARC has a long tradition of exploring the value of collaborative and pervasive computing tools for meeting capture (Stefik et al. 1987, Pedersen et al. 1993) and for the salvage of multimedia meeting records (Moran et al. 1997). Unlike EMS tools, these tools support unstructured, sketch-based interfaces that are often pen-based and that try to simulate or improve on the capabilities of the whiteboards found in most conference rooms. The motivation of such tools is to improve record-keeping methods without shifting the meeting focus or process. The Tivoli project (Pedersen et al. 1993) allows users to manipulate handwritten text in structured ways on a large electronic whiteboard, the Xerox LiveBoard (Elrod et al. 1992). Tivoli creates notes of meetings with correlated time-stamped audio, allowing participants to access the audio from the notes after the meeting (Moran et al. 1997); a sketch of this kind of note-to-audio index appears at the end of this subsection. The group notes of the meeting are formed from drawings on the LiveBoard and typed text notes from a laptop. Drawbacks of the Tivoli system are that its user interface is quite complicated and unintuitive.

Similar to Tivoli, the Classroom 2000 system (Abowd et al. 1996, 1998) records classroom audio, presentation slides and LiveBoard notes, and provides ways to browse through them after the class. Ink-based meeting capture tools such as Tivoli and Classroom 2000 raise serious questions about handwriting legibility, and about possible remedies such as handwriting recognition. It is clear that speech-to-text offers a more natural and easily used interface than handwritten text.

DOLPHIN is another well-known collaboration system that pioneered some of the early work on meeting capture (Streitz et al. 1994). It supports different types of meetings: face-to-face meetings with the Xerox LiveBoard, and meetings using computers connected via audio/video networks. Group members concurrently use their mouse and keyboard to interact with the software, and DOLPHIN enables participants to create both informal and formal documents. The system is rather dated now, and there appear to be no further developments relevant to AuTM.

Meeting rooms equipped with special equipment are expensive, so AuTM addresses two fundamental problems of the other systems: cost and lack of ubiquity. The AuTM system can be used wherever a laptop and a network connection exist, which in today's environment poses minimal restrictions. There are also a number of advantages in using COTS software: it is readily accessible to users, cheaper when purchased in volume, and easily integrated with other COTS products.
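Systems such as Tivoli, and the personal note-taking tools surveyed in Section 3.3, share one mechanism: each note carries a timestamp, and playback seeks the meeting recording to that time. The sketch below uses invented note and recording structures (none of these systems' file formats are published here) to show the idea:

    # Sketch of a Tivoli/Filochat-style note-to-audio index (hypothetical
    # data layout). Each note records when it was written; selecting a note
    # seeks the meeting recording to the matching offset.
    from dataclasses import dataclass

    @dataclass
    class Note:
        seconds_into_meeting: float   # when the note was written
        text: str                     # the ink or typed content

    @dataclass
    class MeetingRecord:
        audio_file: str               # one continuous recording of the meeting
        notes: list

        def audio_offset_for(self, note: Note, lead_in: float = 5.0) -> float:
            """Start playback a few seconds before the note was taken,
            since people write things down slightly after hearing them."""
            return max(0.0, note.seconds_into_meeting - lead_in)

    if __name__ == "__main__":
        meeting = MeetingRecord(
            audio_file="meeting_2002-09-02.wav",
            notes=[Note(754.2, "Action: Barry to draft test plan")],
        )
        note = meeting.notes[0]
        print(f"play {meeting.audio_file} from {meeting.audio_offset_for(note):.1f}s")

AuTM's per-utterance timestamps serve the same purpose, except that the index entries are whole recognised utterances rather than ink strokes.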
3.2 Multimedia Meeting Capture Systems

Multimedia notes and records of meetings provide many benefits over their traditional paper counterparts. A video recording of a meeting allows people to review a meeting they have attended, or to catch up on a meeting they have missed (Chiu et al. 1999).

An example of a multimedia meeting capture system is LiteMinutes. The system assumes one person is taking the notes, which are typed into an applet on a wireless laptop (Chiu et al. 2001). After the meeting, the notes are parsed, and a file with the slides and video correlated by time to each note is emailed to all participants. Users can revise the notes in their email editors and send them back to the server, thus updating the notes displayed on a common web page. Notes, slides and video can also be accessed during a meeting for instant replays.

Another multimedia meeting capture system is NoteLook, a client-server note-taking system designed to run on tablet computers (Chiu et al. 1999). Users can take ink notes, add thumbnail images, annotate video streams, place slides or images on the background of a page, and then view the resulting note files on a web page. An interesting feature is the ability to summarise notes by taking regular snapshots and capturing all slides viewed from all sources.

3.3 Personal Note Taking Systems

Since typing can interfere with a group meeting, various organisations have researched informal, personal note-taking systems. The Freestyle system allows handwritten notes and annotated documents to be shared using electronic mail (Levine et al. 1991); these documents can then be read and manually arranged on the recipient's desktop computer. There has also been research into portable, handwritten note-taking and audio recording systems, such as NotePals (Davis et al. 1999), Filochat (Whittaker et al. 1994) and Dynomite (Wilcox et al. 1997). NotePals is an ink-based, collaborative note-taking application that runs on Personal Digital Assistants (PDAs). Meeting participants write notes in their own handwriting on a PDA, and these notes are shared with other participants by later synchronising with a shared notes repository. NotePals has many characteristics in common with other personal note-taking tools such as Dynomite and the Audio Notebook (Stifelman 1996), both of which rely on ink-based notes without handwriting recognition. As mentioned earlier, typed notes can be assumed to be more legible than handwritten ink-based notes. Filochat and Dynomite are both pen-based note-taking systems that run on tablet computers. These systems also record the audio track of a meeting and automatically create an index into the audio from the electronic ink. Thus NotePals, Filochat and Dynomite all provide tools for synchronising audio recordings with personal notes, in much the same way as AuTM does. However, none of these personal note-taking tools integrates speech-to-text for transcription of the notes.

4 Other Commercial Activities

Dictaphone Corporation (2002) has produced a number of products based on the principle of distributed speech recognition with multi-user input stations (Kuhnen et al. 1999). In particular, ExSpeech (see Figure 1.1) is a central dictation system comprising a central server computer and a number of voice input stations connected to it. Speaker recognition capabilities are provided at each voice input station, and any authorised user may use any station. The product features networked computers with either handheld microphones or headsets attached, which constitute the dictation stations. The e-mail system of the computer network is used to transport voice files from the dictation stations to the transcription stations. The e-mail system may also be used to forward dictation files to a central recorder, from which a transcription system can play them back. The idea of a number of distributed clients communicating with a central server is very similar to the AuTM prototype, although it is unclear how the audio signals are conveyed over the network. Unlike the Dictaphone system outlined, in which the Enterprise Express server handles network traffic, AuTM does not incorporate document review workstations (Kuhnen et al. 1999) to carry out error correction duties. AuTM is considered far more portable in terms of both its error correction strategy and the hardware components needed for operation.

IBM Corporation has invested heavily in the areas of pervasive computing and distributed speech recognition. The IBM WebSphere Voice Server for Transcription (IBM 2002) is a specialised speech recognition product aimed at software developers and service providers. According to IBM (2002), the product provides transcription (or deferred recognition) functions that can be integrated into a workflow or document management application. IBM (2002) also states that multiple users can dictate audio text from various locations and devices, such as microphones, handheld recorders and telephones. The initial market segments IBM has targeted are transcription services for the medical and legal communities. The work being conducted by IBM has far more in common with Dictaphone Corporation's products than with AuTM.

Figure 1.1: Example of ExSpeech (adapted from Dictaphone Corporation, 2002). The original diagram shows voice input stations feeding raw audio and text files to the ExSpeech recognition engine, with document review stations using the audio to correct recognition errors in the transcribed and edited text file.
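For illustration only, transporting a recorded dictation file to a transcription station over e-mail, as the ExSpeech description suggests, could look like the following sketch (all addresses, host names and file names are invented):

    # Sketch of moving a dictation audio file by e-mail (hypothetical
    # addresses, SMTP host and file names; not Dictaphone's actual design).
    import smtplib
    from email.message import EmailMessage
    from pathlib import Path

    def mail_dictation(wav_path, sender, recipient, smtp_host="mail.example.com"):
        msg = EmailMessage()
        msg["Subject"] = f"Dictation for transcription: {Path(wav_path).name}"
        msg["From"] = sender
        msg["To"] = recipient
        msg.set_content("Dictation file attached; please transcribe.")
        msg.add_attachment(
            Path(wav_path).read_bytes(),   # attach the recorded audio
            maintype="audio",
            subtype="wav",
            filename=Path(wav_path).name,
        )
        with smtplib.SMTP(smtp_host) as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        mail_dictation("dictation_0042.wav",
                       "voice.station.3@example.com",
                       "transcription.pool@example.com")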

Dictaphone and IBM both describe their products being used in a ubiquitous manner similar to the AuTM system; however, their applications are not specifically designed for group meetings. Furthermore, AuTM provides the ability to annotate the record, with functionality such as client/server messaging and the creation of motion and action items.

5 Conclusions

There are a number of research projects and commercial products that share the same goals as AuTM. The four research projects (BBN, MIT, ICSI, ISL) are quite similar, but they have far more in common with each other than any of them has with AuTM. All of them use a single, in-house, speaker-independent speech recogniser, rather than AuTM's multiple instances of a COTS speaker-dependent speech recogniser. AuTM implements a continuous speech recogniser in a manner that increases the capacity of the overall system by offering less specialised operational components. AuTM is intended to be mobile and not to rely on specialist hardware, while each of the four projects appears to have been used in specially set-up meeting rooms. The four projects are older than AuTM, and some have extra, useful features such as automatic summarisation and purpose-built meeting browsers. In practice, all four projects try to summarise a meeting coherently and make it accessible afterwards, using a timeline to display relevant information and marking important points of the meeting. AuTM and the ISL, BBN and ICSI projects share some characteristics with the MIT project, but what sets the MIT project apart is that it selectively stores movie clips of meetings rather than using speech-to-text to generate transcripts. The work being carried out at Dictaphone Corporation and IBM uses technology similar to the AuTM prototype, though it does not focus on the meeting room application. Most of the other systems outlined in this paper proactively capture some form of media from a conference room or classroom, including audio, video, presented media and participant notes, and then automatically compile a summary. A scan of the literature concludes that the most distinctive features of the AuTM software are the live development of the transcription over a computer network, and the time stamping of each speaker's utterances to ensure an accurate record of who said what, and when, in the meeting transcript.

6 Acknowledgments

We would like to thank Jason Littlefield, Barry Dwyer and Steve Graham for their contribution to the continual development of the AuTM system.

7 References

Abowd, G. D., Atkeson, C. G., Brotherton, J., Enqvist, T., Gulley, P. and Lemon, J. (1998): Investigating the Capture, Integration and Access Problem of Ubiquitous Computing in an Educational Setting. Proceedings of CHI '98: ACM Conference on Human Factors in Computing Systems, 440-447, ACM Press.

Abowd, G. D., Atkeson, C. G., Feinstein, A., Hmelo, C., Kooper, R., Long, S., Sawhney, N. and Tan, M. (1996): Teaching and Learning as Multimedia Authoring: The Classroom 2000 Project. Proceedings of Multimedia '96, 187-198, ACM Press.

Chiu, P., Kapuskar, A., Reitmeier, S. and Wilcox, L. (1999): NoteLook: Taking Notes in Meetings with Digital Video and Ink. Proceedings of ACM Multimedia '99, 149-158, ACM Press.

Chiu, P., Boreczky, J., Girgensohn, A. and Kimber, D. (2001): LiteMinutes: An Internet-Based System for Multimedia Meeting Minutes. Proceedings of World Wide Web 2001, 140-149, ACM Press.
Colbath, S. and Kubala, F. (1998): Rough'n'Ready: A Meeting Recorder and Browser. Research note, Perceptual User Interfaces Conference, San Francisco, CA, November 1998.

Davis, R. C., Landay, J. A., Chen, V., Lee, R. B., Lin, J., Morrey, C. B., Schleimer, B., Price, M. N. and Schilit, B. N. (1999): NotePals: Lightweight Note Sharing by the Group, for the Group. Proceedings of CHI '99: ACM Conference on Human Factors in Computing Systems, 338-345, ACM Press.

Dictaphone Corporation (2002): http://www.dictaphone.com/products/enterpriseexpress/exspeech/. Accessed 2 Sept. 2002.

Elrod, S., Bruce, R., Goldberg, D., Halasz, F., Janssen, W., Lee, D., McCall, K., Pedersen, K., Pier, K., Tang, J. and Welch, B. (1992): Liveboard: A Large Interactive Display Supporting Group Meetings, Presentations and Remote Collaboration. Proceedings of Human Factors in Computing Systems, ACM CHI '92, 599-607, ACM Press.

Gross, R., Bett, M., Yu, H., Zhu, X., Pan, Y., Yang, J. and Waibel, A. (2000): Towards a Multimodal Meeting Record. Proceedings of the 2000 IEEE International Conference on Multimedia and Expo, vol. 3, 1593-1596.

IBM Corporation (2002): Introducing the IBM WebSphere Voice Server for Transcription. An IBM White Paper. http://publib.boulder.ibm.com/voice/pdfs/white_papers/WSVSforTranscription.pdf. Accessed 2 Sept. 2002.

Janin, A. (2001): Meeting Recorder. AVIOS, San Jose.

Kuhnen, R., Larossa-Greene, C. and Howes, S. L. (1999): Distributed speech recognition system with multi-user input stations. US Patent 6,308,158.

Landay, J. A. and Davis, R. C. (1999): Making Sharing Pervasive: Ubiquitous Computing for Shared Note Taking. IBM Systems Journal, 38(4): 531-550, IBM Corp.

Levine, S. R. and Ehrlich, S. F. (1991): The Freestyle System: A Design Perspective. Human-Machine Interactive Systems, 3-21, Plenum Publishers.

Moran, T. P., Palen, L., Harrison, S., Chiu, P., Kimber, D., Minneman, S., Van Melle, W. and Zellweger, P. (1997): "I'll Get That Off the Audio": A Case Study of Salvaging Multimedia Meeting Records. Proceedings of CHI '97, 202-209, ACM Press.

Morgan, N., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Janin, A., Pfau, T., Shriberg, E. and Stolcke, A. (2001): The Meeting Project at ICSI. Proceedings of the Human Language Technology Conference, in press.

NIST (2002): NIST Automatic Meeting Transcription Project. http://www.nist.gov/speech/test_beds/mr_proj/. Accessed 3 Sept. 2002.

Nunamaker, J. F., Dennis, A. R., Valacich, J. S., Vogel, D. R. and George, J. F. (1991): Electronic Meeting Systems to Support Group Work. Communications of the ACM, 34(7): 40-61, ACM Press.

Oh, A., Tuchinda, R. and Wu, L. (2001): MeetingManager: A Collaborative Tool in the Intelligent Room. Proceedings of the Student Oxygen Workshop 2000, Cambridge, MA, USA.

Pedersen, E. R., McCall, K., Moran, T. P. and Halasz, F. G. (1993): Tivoli: An Electronic Whiteboard for Informal Workgroup Meetings. Proceedings of ACM INTERCHI '93 Conference on Human Factors in Computing Systems, 391-398, Addison-Wesley Longman Publishing Co.

ScanSoft (2002): Dragon NaturallySpeaking Professional Solutions Data Sheet. ftp://ftp.scansoft.com/pub/doc/naturallyspeaking/DNS6ProfDatasheet.pdf. Accessed 30 August 2002.

Stefik, M., Bobrow, D. G., Foster, G., Lanning, S. and Tatar, D. (1987): WYSIWIS Revised: Early Experiences with Multiuser Interfaces. ACM Transactions on Office Information Systems, 5(2): 147-167, ACM Press.

Stifelman, L. J. (1996): Augmenting Real-World Objects: A Paper-Based Audio Notebook. Proceedings of Human Factors in Computing Systems, CHI '96, 199-200, ACM Press.

Streitz, N. A., Geissler, J., Haake, J. M. and Hol, J. (1994): DOLPHIN: Integrated Meeting Support Across Local and Desktop Environments and LiveBoards. Proceedings of ACM CSCW '94 Conference on Computer-Supported Cooperative Work, 345-358, ACM Press.

Waibel, A., Bett, M., Finke, M. and Stiefelhagen, R. (1998): Meeting Browser: Tracking and Summarizing Meetings. Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, 281-286, Lansdowne, VA.

Waibel, A., Bett, M., Metze, F., Ries, K., Schaaf, T., Schultz, T., Soltau, H., Yu, H. and Zechner, K. (2001): Advances in Automatic Meeting Record Creation and Access. Proceedings of the International Conference on Acoustics, Speech and Signal Processing 2001.

Whittaker, S., Hyland, P. and Wiley, M. (1994): Filochat: Handwritten Notes Provide Access to Recorded Conversations. Proceedings of ACM CHI '94 Conference on Human Factors in Computing Systems, 271-277, ACM Press.

Wilcox, L. D., Schilit, B. N. and Sawhney, N. (1997): Dynomite: A Dynamically Organized Ink and Audio Notebook. Proceedings of Human Factors in Computing Systems, CHI '97, 186-193, ACM Press.

Yu, H., Clark, C., Malkin, R. and Waibel, A. (1998): Experiments in Automatic Meeting Transcription Using JRTk. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, 921-924.

Yu, H., Finke, M. and Waibel, A. (1999): Progress in Automatic Meeting Transcription. Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech '99), Budapest, vol. 2, 695-698.

Yu, H., Tomokiyo, T., Wang, Z. and Waibel, A. (2000): New Developments in Automatic Meeting Transcription. Proceedings of ICSLP 2000, Beijing, China, October 2000.

Zschorn, A., Littlefield, J., Broughton, M., Dwyer, B. and Hashemi-Sakhtsari, A. (2002): Automatic Transcriber for Computer Supported Collaborative Work. DSTO Technical Report, unpublished.

Copyright 2003, Australian Computer Society, Inc. This paper appeared at the Inaugural Asia Pacific Forum on Pervasive Computing, Adelaide. Conferences in Research and Practice in Information Technology, Vol. 25. B. Ainsley, Ed. Reproduction for academic, not-for-profit purposes is permitted provided this text is included.