Traveling in Cyberspace: Voice-Recognition Systems

Philip Craiger and R. Jason Weiss
University of Nebraska at Omaha

Computer technology has changed the work environment drastically in the last few years. If you were part of the workforce in the 1960s and 1970s, you might remember your organization’s computer (if it had one) as a large mainframe isolated in a large, dark, temperature-controlled room in the bowels of a building. Specialized technicians programmed "the monster" by feeding it a set of punched cards. Minutes or hours later, your printout (stacks of green-and-white-striped paper) would be available for pickup. Today, almost every office worker with a desk has a computer sitting on top of it; we use computers to set appointments, write letters, send and receive e-mail, generate spreadsheets and multimedia presentations, gather information, and so on. Truly, the Age of the Ubiquitous Computer is upon us.

In this and following editions of TIP we will explore technological advances that are not yet part of mainstream business applications but undoubtedly will be in the future. This edition examines voice-recognition systems (VRS, also known as speech recognition systems). We begin with a basic description of VRS and its promise. Next, we discuss some issues surrounding the implementation of VRS. Finally, we relate our experience with VRS for dictation and navigation.

Voice-Recognition Systems

Remember "HAL," the computer system from the movie 2001: A Space Odyssey? Although fictitious, HAL was the epitome of an artificially intelligent machine. HAL was an advanced computer system with the ability to recognize and interpret human speech, and to generate intelligent responses (speech) in return. The year 2001 is just around the corner, but a computer with HAL’s independence is still many, many years away. However, dramatic advances in VRS in the last few years have given us software with the ability to distinguish and interpret human speech and take appropriate action. For a review and a more comprehensive explanation of existing VRS, see the PC Magazine reference at the end of this article.

The intelligence displayed by HAL is based on natural language processing, or NLP. NLP is a complex and sophisticated area of artificial intelligence wherein computers are developed with the ability to "understand" speech and respond to commands in a manner similar to human understanding of speech. VRS systems are actually a simple application of NLP. They understand spoken words based on a limited set of predefined patterns of speech. For example, a user might open a file in a word processor by speaking the command "word processor, open, file, research." In contrast, NLP systems can understand speech based on the context of the spoken words and speech patterns, analogous to the way humans understand speech. As such, NLP is considered one of the most difficult research problems in artificial intelligence. The rest of this article will focus on VRS because there are commercial systems and even "freeware" systems that are available, and we are assuming that the reader may find these systems much more useful than NLP systems.
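The idea of a limited set of predefined speech patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it presumes the recognizer has already turned the audio into words, and the pattern table, the action names, and the `parse_command` function are hypothetical rather than taken from any real product.

```python
# Hypothetical command patterns: each is a fixed sequence of words, with
# "<name>" marking a slot the user fills in (for example, a file name).
PATTERNS = {
    ("word", "processor", "open", "file", "<name>"): "open_file",
    ("word", "processor", "print", "file", "<name>"): "print_file",
    ("word", "processor", "close"): "close_app",
}


def parse_command(utterance):
    """Return (action, slot_value) for a recognized utterance, else None."""
    # Assume the recognizer already produced text; drop commas and case.
    words = tuple(utterance.lower().replace(",", " ").split())
    for pattern, action in PATTERNS.items():
        if len(pattern) != len(words):
            continue
        slot = None
        for expected, spoken in zip(pattern, words):
            if expected == "<name>":
                slot = spoken          # any single word fills the slot
            elif expected != spoken:
                break                  # mismatch: try the next pattern
        else:
            return action, slot
    return None                        # outside the predefined patterns


print(parse_command("word processor, open, file, research"))
print(parse_command("please open my research file"))
```

The second call fails even though a person would understand it at once, which is the gap between pattern matching of this kind and genuine NLP.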

In its simplest form, a VRS is software that lets a user speak into a microphone attached to the computer and translates the resulting sound waves into an electronic format, typically text. In theory, a user need never touch the keyboard or mouse: all written text and associated commands (e.g., selecting from menus, clicking buttons, and so on) are effected via speech. The ability to generate text and issue commands via the spoken word clearly provides great flexibility for the user. After all, who can’t talk faster than they type? This flexibility is particularly important for users with disabilities such as lost limb function or severe arthritis, which render keyboard or mouse use difficult or impossible. VRS therefore frees workers with disabilities to use their skills in tasks heretofore unavailable to them.

Of course, there are drawbacks as well. Security is one: many companies go to great lengths to ensure that their computer systems’ security isn’t violated, yet something as simple as the password for a protected account would be open to "theft" if a user spoke a logon name and password aloud in an open environment.

 

Concerns and Hurdles

So why isn’t VRS software more prevalent? Cost is not an issue, as VRS packages retail for around the price of most office software suites. Rather, the main hurdle for VRS has been its lack of accuracy in interpreting and representing the spoken word. No VRS guarantees 100% accuracy. A recent PC Magazine Labs test indicated that the best accuracy attained in a test of four VRS packages was only 91% (PC Magazine, 1998). This is understandable given that speech comprehension is a highly complex skill. While humans, in general, have little difficulty with it, developing a computer program to do the same is quite another thing. Speaker gender, dialect, accent, and background noise are only a few of the factors that may affect the accuracy of a VRS. Table 1 below illustrates some of the interpretation errors found by PC Magazine Labs.

 

Table 1. Examples of Interpretation Problems in VRS

 

What the speaker said | How the software interpreted it
draw a cube | draw on Cuba
travel to California | troubled California
Adobe PhotoShop 5.0 | Adobe PhotoShop five porno
Cupertino, California | Have signed the Liberty Bell
with warmest regards | with more research guards
e-mail | she-mail
Personally, I want to let you know how much I enjoyed your book. | Personally, I want to let you know how much I enjoyed your brother
Antivirus | NT fires
By December 31, you’ll receive a $50,000 bonus | By December 21, you’ll receive $8 by defective gas and bonus
to our finest restaurant | postwar France restaurant
less-than-fashionable summer attire | Last and vegetable are attacked.
We have now sold 1 million copies of your book, How to Win Friends | We is now sold 1 million copies of your block not to win France
Leaves have turned orange and brown | The esoteric are urging Brown
Congratulations! We have sold 1 million copies of your book | Congratulations. We have sold 1 million copies of your butt
amiable behavior | animal behavior
Personally, I want to let you know how much I enjoyed your book, especially the following | Personally, I want to let you know how much I have been drinking, especially the following
cold training and mop up | college training in Moscow
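One way to put a number on errors like those in Table 1 is to count the word-level insertions, deletions, and substitutions separating what was said from what was transcribed. We do not know how PC Magazine Labs scored its tests, so the sketch below is an assumption about how such a figure might be computed, not a description of their method; the function names are ours.

```python
def word_errors(said, heard):
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn `heard` into `said` (word-level edit distance)."""
    ref = said.lower().split()
    hyp = heard.lower().split()
    # At the start of step i, prev[j] is the distance between the first
    # i-1 spoken words and the first j transcribed words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(said, heard):
    """One minus the error rate per spoken word, floored at zero."""
    spoken = len(said.split())
    return max(0.0, 1 - word_errors(said, heard) / spoken)


pairs = [
    ("draw a cube", "draw on Cuba"),
    ("amiable behavior", "animal behavior"),
    ("with warmest regards", "with more research guards"),
]
for said, heard in pairs:
    print(f"{said!r}: {word_errors(said, heard)} errors, "
          f"accuracy {word_accuracy(said, heard):.0%}")
```

Scored this way, a single wrong word in a two-word phrase halves the accuracy, which suggests why short phrases make for such striking examples.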

 

Compounding the problem, a VRS has to consult a database of known words and select the correct one from competing alternatives ("to," "too," "two," etc.). The selection process requires the software to examine the context in which the word was used. This typically causes delays of a second or two that are confusing to the user. Imagine that 2 seconds passed between the time you pressed a key and the time the associated character appeared on the screen. Such delays are not only confusing but also annoying, and they are likely to lead to errors.
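A toy version of choosing among sound-alike words by context might look like the following. The word-pair counts are invented for illustration, and `pick_word` is a hypothetical helper; commercial systems rely on far larger statistical models whose details we do not know.

```python
# Words the recognizer cannot tell apart by sound alone.
HOMOPHONES = {"to": ["to", "too", "two"]}

# Hypothetical counts of how often one word directly follows another.
# A real system would estimate these from large amounts of text.
PAIR_COUNTS = {
    ("want", "to"): 50, ("to", "go"): 40,
    ("me", "too"): 30,
    ("two", "copies"): 25, ("sold", "two"): 10,
}


def pick_word(previous, sounds_like, following):
    """Choose the candidate spelling that best fits its neighbors."""
    candidates = HOMOPHONES.get(sounds_like, [sounds_like])

    def score(word):
        # Add up the evidence from the word before and the word after.
        return (PAIR_COUNTS.get((previous, word), 0)
                + PAIR_COUNTS.get((word, following), 0))

    return max(candidates, key=score)


print(pick_word("want", "to", "go"))
print(pick_word("sold", "to", "copies"))
print(pick_word("me", "to", "."))
```

Because the score uses the following word, a chooser like this cannot settle on a spelling until the speaker has moved on, which is one plausible source of the lag users notice.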

Vocabulary Size

The size of a VRS’s vocabulary affects processing requirements and accuracy. Navigation employs a small, constrained vocabulary, while dictation requires a very large dictionary. Both must be capable of adding new words, which is particularly important in fields with idiosyncratic vocabularies (like most computer science disciplines). Most general commercial VRS packages have a "base" vocabulary of 30,000 to 64,000 words, with a capacity as high as 250,000 words. A number of VRS packages are also available for particular professions that employ nonstandard terminology, such as law and medicine.

Jason and I have different levels of experience with VRS. The Macintosh comes with a very simple VRS that allows users to issue commands to be carried out by the computer, including opening applications, printing, and so on. I (PC) used the system for a short time and found that it was faster to do things by hand. Jason has more extensive experience with VRS, so I’ll turn the narrative over to him.

Watson, Come Here…

My main experience with VRS was with IBM’s VoiceType technology, included with OS/2 Warp Version 4. VoiceType came along at a time when my wrists began to hurt from keyboarding, and thus promised welcome relief from the pain. Moreover, like everyone else, I can talk a whole lot faster than I type; I thought I’d be able to dictate papers and e-mail so quickly that I’d have to look for ways to spend all of my free time. The image of space-age computing—telling my computer what to do instead of directing it through obscure manipulations of the keyboard and mouse—was very alluring. Reality, of course, was somewhat different from what appeared on the back of the cereal box, and for the foreseeable future I plan on using the keyboard and trackpad exclusively. For the record, here are my experiences with one of the earlier software-based VRS systems.

VoiceType recognized discrete speech for dictation and continuous speech for navigation. In other words, when dictating, one adopted a fairly robotic form of speech, pausing briefly between words. This is harder to do than you’d think, although I’d heard from folks who reached rates of 90 wpm with practice (using much faster computers than my own). Punctuation had to be dictated as well. In contrast, navigation required continuous speech, meaning that one had to say "startnetscapenavigator" in order to begin web browsing. When I had finished talking to the computer for a while, the VoiceType system could be set to "sleep" mode, in which it ignored all speech except for the command to resume listening.
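The "sleep" behavior amounts to a two-state switch, which the following sketch models. The command phrases and the `Listener` class are hypothetical; the actual VoiceType wording and internals may well have differed.

```python
class Listener:
    """Toy model of a recognizer that can be put to sleep and woken up."""

    # Hypothetical command phrases; the real VoiceType wording may differ.
    SLEEP_COMMAND = "go to sleep"
    WAKE_COMMAND = "wake up"

    def __init__(self):
        self.asleep = False

    def hear(self, phrase):
        """Return the phrase to act on, or None if it should be ignored."""
        phrase = phrase.lower().strip()
        if self.asleep:
            # While asleep, only the wake command gets through.
            if phrase == self.WAKE_COMMAND:
                self.asleep = False
            return None
        if phrase == self.SLEEP_COMMAND:
            self.asleep = True
            return None
        return phrase


listener = Listener()
heard = ["start netscape navigator", "go to sleep",
         "print this page", "wake up", "scroll down"]
acted_on = [p for p in heard if listener.hear(p) is not None]
print(acted_on)
```

Only the first and last phrases are acted on; everything spoken between the sleep and wake commands is ignored.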

Dictation and navigation were trained separately. Training helps the VRS understand a user’s particular speech patterns, and thus greatly improves accuracy. Training the dictation function required a sentence-by-sentence reading of a short story by Mark Twain. When the system did not recognize a word, I had to go back and repeat the entire sentence. Sometimes the system appeared to refuse to recognize a word, requiring many frustrating repetitions. How many ways are there to pronounce "of"? The navigation system was more straightforward to train, demanding different combinations of navigation commands and numbers, such as "Move right border right twenty." All told, it took a few hours to train the dictation and navigation systems, and a few more hours for the computer to process the training material. During the processing period, the software advised that I not use the computer for any other purpose, effectively locking me out of my computer. The time savings due to more efficient operations with VRS were obviously not to be reaped on that first day.

Back in my freshman year, a writing teacher instructed us to "write the way you talk," advice that has benefited both my writing and my speech. However, I discovered that there is a big difference between writing conversationally and dictating. Even at the best of times, writing is not a purely linear activity. Through at least one round of editing, one rearranges sentences, discards excess wording, and excises passive voice. My writing was not nearly so efficient when I dictated it. E-mail was a little easier to dictate, since I was less concerned with concise wording, but it still didn’t feel as natural as a telephone conversation, nor were the results particularly pleasing to review. While the original sentences came out mostly the same (if a little long), the cumbersome process of editing through voice commands dragged the operation down. For example, one can effect changes in mid-word when typing. It’s nothing to simply stop, backspace over a few letters, and type out a more appropriate choice. During dictation, one has to complete the word, stop dictating, tell the system to delete the previous word, and resume dictation. This is alarmingly disruptive to one’s train of thought, although I suspect it would not be so given sufficient experience with the system.

The dictation system grew fairly accurate once I got some uncommon words and names into its dictionary. The main problem in discrete speech dictation is in learning to dictate effectively: I always slurred phrases like "forget it" into one word, and I never remembered to punctuate. A second problem was that the software, attempting to choose between sound-alike words using context, showed a parade of dancing sentences until the most likely suspect stood still. Very often it was the sentence I had spoken, but it was distracting to watch the computer go through this process. In contrast, typing invariably transfers the exact information from the keyboard onto the screen (unless you’ve spilled a Coke on your keyboard, in which case all bets are off). The final problem with dictation has nothing to do with the software so much as the environment in which one uses it. VRS requires a certain measure of privacy unlikely for those of us who have office-mates or who prefer to keep our office doors open. One’s voice is audible at great distances in a quiet office, and one never likes the thought of eavesdroppers on even the most simple exchange.

Navigation worked a lot better than dictation, given that the system had a much smaller universe of words to recognize. It was fairly easy to start up an application, resize windows, and scroll through documents. Unfortunately, while a wide variety of navigation commands was available, both common and idiosyncratic to particular applications, I never managed to learn them all, and often found myself reverting to the mouse for expediency. One clever feature worth special note is the way navigation was integrated into web browsing so that one could verbally select links on a web page. While hands-free web browsing was a smart feature, I doubt it would stand up to the profusion of links offered by today’s web portals like Yahoo and Excite or the graphical image maps prevalent on so many pages.

All told, my experience with VoiceType was an exciting indication of things to come. I performed almost all my common tasks through voice command and navigation, and it was a fairly easy process. It was not, however, transparent. While the dictation system accurately recognized much of my speech, it did not accommodate the way I work. A better system would distinguish simple punctuation and would seamlessly integrate dictation and editing, instead of separating them into entirely different functions. Best of all, such a system would recognize continuous speech. Indeed, all of these features are available in more current dictation systems such as IBM’s ViaVoice and Dragon Systems’ NaturallySpeaking. However, as PC Magazine notes in its review, VRS is still not quite usable as simply another input device. It’s much closer, though, and I’ll be sure to give it another chance if my wrists ever start hurting again…

Next Edition of TICS

In our next edition we will continue our discussion of cutting-edge technology: things on the horizon that may soon break into the mainstream. Our next topic is one of our favorites, virtual reality (VR). VR and its counterpart, "immersive reality," are 3-dimensional models of objects and environments that are viewable via either a computer monitor or specialized equipment such as head-mounted displays and data gloves. VR is becoming popular as a means of representing many game scenarios and is used in high-tech training in various fields, including the military and medicine. See you then…

References

Speech recognition: Finding its voice. PC Magazine, Oct 20, 1998. http://www.zdnet.com/pcmag/features/speech98/index.htm

 


TIP

Vol. 36/No. 3  January, 1999

