Traveling in Cyberspace: Voice-Recognition Systems
Philip Craiger and R. Jason Weiss
University of Nebraska at Omaha
Computer technology has changed the work environment drastically in the
last few years. If you were part of the workforce in the 1960s and 1970s, you might
remember your organization's computer (if it had one) as a large mainframe isolated
in a large, dark, temperature-controlled room located in the bowels of a building.
Specialized technicians programmed "the monster" by feeding it a set of punched
cards. Minutes or hours later, your printout (stacks of green-and-white striped
paper) would be available for pickup. Today, almost every office worker with a desk
has a computer sitting on top of it; we use computers to set appointments, write letters,
send and receive e-mail, generate spreadsheets and multimedia presentations, gather
information, and so on. Truly, the Age of the Ubiquitous Computer is upon us.
In this and following editions of TIP we will explore
technological advances that are not yet part of mainstream business applications, but
will undoubtedly be so in the future. This edition examines voice-recognition systems
(VRS, also known as speech recognition systems). We begin with a basic description
of VRS and its promise. Next, we discuss some issues surrounding the implementation of
VRS. Finally, we relate our experience with a VRS system for dictation and navigation.
Voice-Recognition Systems
Remember "HAL," the computer system from the movie 2001: A
Space Odyssey? Although fictitious, HAL was the epitome of an artificially intelligent
machine. HAL was an advanced computer system with the ability to recognize and interpret
human speech, and to generate intelligent responses (speech) in return. The year 2001 is
just around the corner, but a computer with HAL's independence is still many, many
years away. However, dramatic advances in VRS in the last few years have given us software
with the ability to distinguish and interpret human speech and take appropriate action.
For a review and a more comprehensive explanation of existing VRS, see the PC Magazine
reference at the end of this article.
The intelligence displayed by HAL is based on natural language
processing, or NLP. NLP is a complex and sophisticated area of artificial intelligence
wherein computers are developed with the ability to "understand" speech and
respond to commands in a manner similar to human understanding of speech. VRS systems are
actually a simple application of NLP. They understand spoken words based on a limited set
of predefined patterns of speech. For example, a user might open a file in a word
processor by speaking the command "word processor, open, file, research." In
contrast, NLP systems can understand speech based on the context of the spoken words and
speech patterns, analogous to the way humans understand speech. As such, NLP is considered
one of the most difficult research problems in artificial intelligence. The rest of this
article will focus on VRS because there are commercial systems and even
"freeware" systems that are available, and we are assuming that the reader may
find these systems much more useful than NLP systems.
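
To make the idea of "predefined patterns of speech" concrete, here is a minimal sketch in Python. It assumes the recognizer has already turned the audio into comma-separated units, as in the example command above. The pattern table, the action names, and the `parse_command` function are all hypothetical, invented for illustration rather than taken from any real VRS product.

```python
# Hypothetical command grammar: each pattern is a tuple of fixed spoken units.
# A unit written as "<name>" is a slot that accepts whatever the user says.
PATTERNS = {
    ("word processor", "open", "file", "<name>"): "open_file",
    ("word processor", "print", "file", "<name>"): "print_file",
    ("word processor", "close"): "close_app",
}


def parse_command(utterance):
    """Return (action, slots) if the utterance fits a known pattern, else None."""
    # Split the utterance into spoken units, ignoring case and a trailing period.
    words = [w.strip().lower() for w in utterance.rstrip(".").split(",")]
    for pattern, action in PATTERNS.items():
        if len(pattern) != len(words):
            continue
        slots = {}
        for expected, spoken in zip(pattern, words):
            if expected.startswith("<"):
                slots[expected.strip("<>")] = spoken  # fill the slot
            elif expected != spoken:
                break  # a fixed unit did not match; try the next pattern
        else:
            return action, slots
    return None


print(parse_command("word processor, open, file, research"))
print(parse_command("please open my research file"))
```

The first call matches the "open file" pattern and captures "research" as the file name. The second returns None, because free-form phrasing falls outside the fixed patterns. That gap is what separates this kind of system from NLP.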
In its simplest form, a VRS system is software that lets a user speak into a
microphone attached to the computer and translates the resulting sound waves into an
electronic format, typically text. In theory, a user need never touch the keyboard or
mouse: All written text and associated commands (e.g., selecting from menus, clicking
buttons, and so on) are effected via speech. The ability to generate text and issue
commands via the spoken word clearly provides great flexibility for the user. After all,
who can't talk faster than they type? This flexibility is particularly important for
users with disabilities such as lost limb function or severe arthritis, which render
keyboard or mouse use difficult or impossible. VRS therefore frees workers with
disabilities to use their skills in tasks heretofore unavailable to them.
Of course, there are drawbacks as well. Security is one issue. Many
companies go to great lengths to ensure that the security of their computer systems
isn't violated. Even something as simple as a password for a protected account would
be open to "theft" if a user spoke a logon name and password aloud in an open
environment.
Concerns and Hurdles
So why isn't VRS software more prevalent? Cost is not an issue, as
VRS packages retail around the price of most office software suites. Rather, the main
hurdle for VRS systems has been their lack of accuracy at interpreting and representing the
spoken word. No VRS system guarantees 100% accuracy. A recent PC Magazine Labs test
indicated that the best accuracy attained in a test of four VRS packages was only 91% (PC
Magazine, 1998). This is understandable given that speech comprehension is
a highly complex skill. While humans, in general, have little difficulty with it,
it's quite another thing to develop a computer program to do the same. Speaker gender,
dialect, accent, and background noise are only a few of the factors
that may affect the accuracy of a VRS system. Table 1 below illustrates some of the
interpretation errors found by PC Magazine Labs.
Table 1. Examples of Interpretation Problems in VRS

| What the speaker said | How the software interpreted it |
|---|---|
| draw a cube | draw on Cuba |
| travel to California | troubled California |
| Adobe PhotoShop 5.0 | Adobe PhotoShop five porno |
| Cupertino, California | Have signed the Liberty Bell |
| with warmest regards | with more research guards |
| e-mail | she-mail |
| Personally, I want to let you know how much I enjoyed your book. | Personally, I want to let you know how much I enjoyed your brother |
| Antivirus | NT fires |
| By December 31, you'll receive a $50,000 bonus | By December 21, you'll receive $8 by defective gas and bonus |
| to our finest restaurant | postwar France restaurant |
| less-than-fashionable summer attire | Last and vegetable are attacked. |
| We have now sold 1 million copies of your book, How to Win Friends | We is now sold 1 million copies of your block not to win France |
| Leaves have turned orange and brown | The esoteric are urging Brown |
| Congratulations! We have sold 1 million copies of your book | Congratulations. We have sold 1 million copies of your butt |
| amiable behavior | animal behavior |
| Personally, I want to let you know how much I enjoyed your book, especially the following | Personally, I want to let you know how much I have been drinking, especially the following |
| cold training and mop up | college training in Moscow |
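
The source does not say how PC Magazine Labs scored accuracy. One common way to put a number on errors like those in Table 1 is to count the fewest word substitutions, insertions, and deletions needed to turn the software's output into what the speaker said. The Python sketch below does that under this assumption. The function names are hypothetical, and the result should not be read as a reconstruction of the published 91% figure.

```python
def word_errors(reference, hypothesis):
    """Fewest word substitutions, insertions, and deletions between two phrases."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word was dropped
                            curr[j - 1] + 1,      # extra word was inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Share of reference words recognized correctly, floored at zero."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference phrase must contain at least one word")
    return max(0.0, 1 - word_errors(reference, hypothesis) / n)


print(word_accuracy("amiable behavior", "animal behavior"))  # 0.5
```

By this measure "amiable behavior" heard as "animal behavior" scores 50%, since one of two words is wrong. "With warmest regards" heard as "with more research guards" scores 0%, since it takes three edits to repair a three-word phrase.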
Increasing the complexity of the problem is the fact that a VRS has to
access a database of known words, and select the correct one from competing alternatives
("to," "too," "two," etc.). The selection process requires
the software to examine the context in which the word was used. This typically causes
delays of a second or two that are confusing to the user. Imagine that 2 seconds
elapsed between the time you pressed a key and the time the associated character
appeared on the screen. Such delays are not only confusing but also annoying,
irritating, and likely to lead to errors.
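
As a rough illustration of context-based selection, the Python sketch below chooses among "to," "too," and "two" by checking how often each candidate appears next to its neighboring words. The counts are made up for the example, and `choose_word` is a hypothetical name. Commercial systems presumably use far larger statistical models, so treat this as a sketch of the idea only.

```python
# Hypothetical word-pair counts; a real system would estimate these
# from a large body of text.
BIGRAM_COUNTS = {
    ("want", "to"): 50, ("to", "go"): 40,
    ("me", "too"): 30, ("too", "much"): 25,
    ("sold", "two"): 15, ("two", "copies"): 20,
}

CANDIDATES = ["to", "too", "two"]  # sound-alike alternatives


def choose_word(previous, candidates, following):
    """Pick the candidate seen most often beside its two neighbors."""
    def score(word):
        return (BIGRAM_COUNTS.get((previous, word), 0)
                + BIGRAM_COUNTS.get((word, following), 0))
    return max(candidates, key=score)


print(choose_word("want", CANDIDATES, "go"))      # to
print(choose_word("sold", CANDIDATES, "copies"))  # two
```

Note that the score uses the word that follows the ambiguous one. A system that works this way cannot settle on a spelling until the speaker has moved on, which is one plausible source of the delay described above.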
Vocabulary Size
The size of a VRS system's vocabulary affects processing
requirements and accuracy. Navigation employs a small, constrained vocabulary, while
dictation systems require very large dictionaries. Both must be capable of adding new
words, which is particularly important in fields with idiosyncratic vocabularies (like
most Computer Science disciplines). Most general commercial VRS systems have a
"base" vocabulary of 30,000 to 64,000 words, with a capacity as high as 250,000
words. A number of VRS systems are also available for particular professions that employ
nonstandard terminology, such as law and medicine.
Jason and I have different levels of experience with VRS. The Macintosh
comes with a very simple VRS that allows users to issue commands to be carried out by the
computer, including opening applications, printing, and so on. I (PC) used the system for
a short time and found that it was faster to do things by hand. Jason has more extensive
experience with VRS, so I'll turn the narrative over to him.
Watson, Come Here
My main experience with VRS was with IBM's VoiceType technology,
included with OS/2 Warp Version 4. VoiceType came along at a time when my wrists began to
hurt from keyboarding, and thus promised welcome relief from the pain. Moreover, like
everyone else, I can talk a whole lot faster than I type; I thought I'd be able to
dictate papers and e-mail so quickly that I'd have to look for ways to spend all of
my free time. The image of space-age computing (telling my computer what to do instead
of directing it through obscure manipulations of the keyboard and mouse) was very
alluring. Reality, of course, was somewhat different from what appeared on the back of the
cereal box, and for the foreseeable future I plan on using the keyboard and trackpad
exclusively. For the record, here are my experiences with one of the earlier
software-based VRS systems.
VoiceType recognized discrete speech for dictation and continuous
speech for navigation. In other words, when dictating, one adopted a fairly robotic form
of speech, pausing briefly between words. This is harder to do than you'd think,
although I'd heard from folks who reached rates of 90 wpm with practice (using much
faster computers than my own). In addition, punctuation had to be dictated as well. In
contrast, navigation required continuous speech, meaning that one had to say
"startnetscapenavigator" in order to begin web browsing. If I determined that I had finished talking to the
computer for a while, the VoiceType system could be set for "sleep" mode, in
which it ignored all speech except for the command to resume listening.
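
The sleep mode described here amounts to a two-state switch. The Python sketch below models it with a toy class. The class name and the two command phrases are hypothetical, since the actual VoiceType commands are not given here.

```python
class Listener:
    """Toy model of a recognizer with a sleep mode."""

    SLEEP = "go to sleep"  # hypothetical phrase that starts sleep mode
    WAKE = "wake up"       # hypothetical phrase that resumes listening

    def __init__(self):
        self.asleep = False
        self.handled = []  # phrases the recognizer acted on

    def hear(self, phrase):
        phrase = phrase.strip().lower()
        if self.asleep:
            # While asleep, only the wake command has any effect.
            if phrase == self.WAKE:
                self.asleep = False
            return
        if phrase == self.SLEEP:
            self.asleep = True
        else:
            self.handled.append(phrase)


listener = Listener()
for spoken in ["open file", "go to sleep", "delete everything", "wake up", "print"]:
    listener.hear(spoken)
print(listener.handled)  # ['open file', 'print']
```

Everything said between the sleep and wake commands is discarded, which lets a user take a phone call without dictating it into a document.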
Dictation and navigation were trained separately. Training helps
the VRS system understand a user's particular speech patterns, and thus greatly
improves accuracy. Training the dictation function required a sentence-by-sentence reading
of a short story by Mark Twain. When the system did not recognize a word, I had to go back
and repeat the entire sentence. Sometimes the system appeared to refuse to
recognize a word, requiring many frustrating repetitions. How many ways are there to
pronounce "of"? The navigation system was more straightforward to train,
demanding different combinations of navigation commands and numbers, such as "Move
right border right twenty." All told, it took a few hours to train the dictation and
navigation systems, and a few more hours for the computer to process the training
material. During the processing period, the software advised that I not use the computer
for any other purpose, effectively locking me out of my computer. The time savings due to
more efficient operations with VRS were obviously not to be reaped on that first day.
Back in my freshman year, a writing teacher instructed us to
"write the way you talk," advice that has benefited both my writing and my
speech. However, I discovered that there is a big difference between writing
conversationally and dictating. Even at the best of times, writing is not a purely linear
activity. Through at least one round of editing, one rearranges sentences, discards excess
wording, and excises passive voice. My writing was not nearly so efficient when I dictated
it. E-mail was a little easier to dictate, since I was less concerned with concise
wording, but it still didn't feel as natural as a telephone conversation, nor were
the results particularly pleasing to review. While the original sentences came out mostly
the same (if a little long), the cumbersome process of editing through voice commands
dragged the operation down. For example, one can effect changes in mid-word when typing.
It's nothing to simply stop, backspace over a few letters, and type out a more
appropriate choice. During dictation, one has to complete the word, stop dictating, tell
the system to delete the previous word, and resume dictation. This is alarmingly
disruptive to one's train of thought, although I suspect it would not be so given
sufficient experience with the system.
The dictation system grew fairly accurate once I got some uncommon
words and names into its dictionary. The main problem in discrete speech dictation is in
learning to dictate effectively: I always slurred phrases like "forget it" into
one word, and I never remembered to punctuate. A second problem was that the software,
attempting to choose between sound-alike words using context, showed a parade of dancing
sentences until the most likely suspect stood still. Very often it was the sentence I had
spoken, but it was distracting to watch the computer go through this process. In contrast,
typing invariably transfers the exact information from the keyboard onto the screen
(unless you've spilled a Coke on your keyboard, in which case all bets are off). The
final problem with dictation has nothing to do with the software so much as the
environment in which one uses it. VRS requires a certain measure of privacy unlikely for
those of us who have office-mates or who prefer to keep our office doors open. One's
voice is audible at great distances in a quiet office, and one never likes the thought of
eavesdroppers on even the simplest exchange.
Navigation worked a lot better than dictation, given that the system
had a much smaller universe of words to recognize. It was fairly easy to start up an
application, resize windows, and scroll through documents. Unfortunately, while a wide
variety of navigation commands were available, both common and idiosyncratic to particular
applications, I never managed to learn them all, and often found myself reverting
to the mouse for expediency. One clever feature worth special note is the way
navigation was integrated into web browsing so that one could verbally select links on a
web page. While hands-free web browsing was a smart feature, I doubt it would stand up to
the profusion of links offered by today's web portals like Yahoo and Excite
or the graphical image maps prevalent on so many pages.
All told, my experience with VoiceType was an exciting indication of
things to come. I performed almost all my common tasks through voice command and
navigation, and it was a fairly easy process. It was not, however, transparent. While the
dictation system accurately recognized much of my speech, it did not accommodate the way I
work. A better system would distinguish simple punctuation and would seamlessly integrate
dictation and editing, instead of separating them into entirely different functions. Best
of all, such a system would recognize continuous speech. Indeed, all of these features are
available in more current dictation systems such as IBM's ViaVoice and Dragon
Systems' NaturallySpeaking. However, as PC Magazine notes in its review, VRS
is still not quite usable as simply another input device. It's much closer, though,
and I'll be sure to give it another chance if my wrists ever start hurting
again.
Next Edition of TICS
In our next edition we will continue our discussion of cutting-edge
technology: Things on the horizon that may soon break into the mainstream. Our next topic
is one of our favorites, virtual reality (VR). VR and its counterpart "immersive
reality" are 3-dimensional models of objects and environments that are viewable via
either a computer monitor, or specialized equipment, such as head-mounted displays, data
gloves, and the like. VR is becoming popular as a means of representing many game
scenarios, and is used in high-tech training in various fields, including the military
and medicine. See you then.
References
Speech recognition: Finding its voice. PC Magazine, Oct 20, 1998. http://www.zdnet.com/pcmag/features/speech98/index.htm
TIP, Vol. 36, No. 3, January 1999