Speech Analytics: April 2008

Sunday, April 27, 2008

What is English

A few recent news items reminded me again that we constantly refer to a language by its name (English, Spanish, etc.) but forget that it is constantly evolving and that it depends on context and location.

SpinVox announced their VoxGeist. After analyzing more than 50M voicemails (wow: assuming 20 seconds on average per voicemail, this is more than 270 thousand hours of transcribed audio, which is the largest database of transcribed voicemails), SpinVox can analyze the dynamics of spoken voicemail language much better than other companies. Although voicemails are different from conversations, according to SpinVox there is a clear penetration of slang words which are not part of a standard dictionary (or, in speech reco jargon, not part of the standard language model).
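The estimate in parentheses above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 50M figure is the one quoted above, and the 20-second average is my assumption, not a SpinVox number.

```python
# Back-of-the-envelope check of the audio volume estimate.
VOICEMAILS = 50_000_000   # "more than 50M voicemails", as quoted above
AVG_SECONDS = 20          # assumed average voicemail length

total_seconds = VOICEMAILS * AVG_SECONDS
total_hours = total_seconds / 3600

print(f"{total_hours:,.0f} hours")  # 277,778 hours
```

Changing `AVG_SECONDS` shows how sensitive the estimate is: at 15 seconds per voicemail the total drops to roughly 208 thousand hours.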

The Voxgeist list of Most Referenced Slang Words and Phrases with Translations in the UK:

-- Diesel = a muscular man/woman

-- Hot Mess = person is a disaster

-- Spun = crazed

-- Busted/Hurt = extremely unattractive

-- Ice/Bling = diamonds

-- Shady = untrustworthy

-- Woot/Stoked = excited

-- One = I'm out; leaving

-- Peace out = goodbye

-- Props = respect

-- Newbie = beginner, new kid on the block

-- Blasted = to get in serious trouble.

-- Scheisty = secretive amongst others

-- Bluetool = someone who always wears a Bluetooth earpiece, even when they're not on the phone

-- Baller = wealthy

-- Crunk = crazy drunk

-- Celebutante = famous just for being wealthy

-- Schlumpadinka = women who let their style go

-- Multislacking = having two or more non-work related web pages open on your work computer at one time.

-- Hola = greetings; hello

While the Canadian regional dictionary currently includes words such as:

1. Allophone – Someone whose first language is neither English nor French.

2. Ble d’Inde – corn on the cob

3. Bourassa – Robert Bourassa, a former premier of Quebec

4. Cabot – Famous explorer of North America

5. Coulee – means to flow

6. Keener: A brown-noser whose excessive keen-ness for the unpleasant task at hand makes the rest of us look bad.

7. Loonie: the nickname for the Canadian $1 coin

8. Mickey: A mickey is one of those curved, flat, 13-ounce bottles of booze that winos carry.

9. Nanaimo - A type of chocolate bar, originally produced in the city of Nanaimo, British Columbia. It consists of a crumb-based layer, topped by a layer of light custard or vanilla butter icing, which is covered in soft chocolate.

10. Nunavut – newest Canadian territory, next to the Northwest Territories

11. Oolochan – small ocean fish

12. Poutine: Poutine is a cholesterol-rich Canadian "delicacy" consisting of French fries covered in cheese curds and gravy. When prepared badly, it congeals in your guts like concrete.

13. Toonie: the nickname for the Canadian $2 coin

14. Tourtiere – French meat pie

15. Sniggler: Someone who takes the parking spot you wanted, or is generally annoying

Apart from being great PR for SpinVox, this demonstrates how much human intelligence is required to understand messages and how experience is key. The question is how automatic systems can learn such slang. Usually, language models are created by scanning enormous amounts of text. But how can systems learn if the texts do not include these new words? It will happen either by transcribing a lot of data or by waiting until spoken slang penetrates the written language. In fact, a recent study on Writing, Technology and Teens gives a clear indication that not only slang but also written abbreviations (from the SMS world) are penetrating the written language. For example, 38% of teenagers say they have used text shortcuts such as “LOL” (which stands for “laugh out loud”) in school work. Even more interesting is that written abbreviations find their way back into the spoken language: "I can speak very well, but there are also times that I have been laughing and actually said LOL. So it all depends on how much you text and who you are around at the time. – 11/12th Grade Girl, Pacific Northwest City."
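To make the language-model problem concrete, here is a toy Python sketch of how slang ends up "out of vocabulary". It is not how SpinVox or any real recognizer works: the training sentences, the test utterance and the function names `build_vocabulary` and `out_of_vocabulary` are all made up for illustration, and a real language model would be built from vastly more text.

```python
import re
from collections import Counter


def build_vocabulary(texts):
    """Count every word seen in the training texts (a toy stand-in for a language model)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def out_of_vocabulary(utterance, vocabulary):
    """Return the words of the utterance that never appeared in the training texts."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return [w for w in words if w not in vocabulary]


# Hypothetical "written language" corpus: no slang in it.
training_texts = [
    "please call me back when you get this message",
    "i am leaving the office now and will be home soon",
    "i was so tired that i went out early",
]

vocab = build_vocabulary(training_texts)
print(out_of_vocabulary("I am so stoked, call me back, peace out", vocab))
# ['stoked', 'peace']
```

A system trained only on such texts has no way to recognize "stoked" or "peace out" until those words show up in its training data, which is exactly the gap that a large archive of transcribed voicemails could fill.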

The web is currently the best source for learning text statistics. However, no such source exists for spoken language, whether dialogs or monologues.
It would be great to see an initiative in which people contribute their transcribed voicemail archives (maybe sending them via Twitter) so that technologies around the world can enhance the capabilities of automatic speech recognition systems.

Thursday, April 3, 2008

Speech based search is aiming high

The momentum in speech recognition is evident in the new relationship between Yahoo and Vlingo.
Yahoo is also leading a $20 million round in Vlingo.

The promise is speech-based search from your cellular handset. We are all aware that voice commands on cellular phones are not taking off. The same is true of computers, where the fact that you can work quietly with a full-size QWERTY keyboard prevented the proliferation of speech-based activity. I do not know when you last tried a speech-based search solution, so let me remind you of the setup process and some of the user experience:

To refresh my own memory, I just installed the rather new Tazti engine on my laptop. As part of the setup I received some guidance which immediately put me on the defensive when thinking about cellular search:

"Speak in a quieter environment"
"Make sure your microphone is positioned correctly"
"Speak more clearly and do not rush"
"Obtain a higher quality microphone"

I guess they forgot the comment about my horrible accent (or they just didn't want to offend me...).

Of course, I aborted my attempts to use speech recognition: I type much faster, and when interacting with a large screen and many options it is simpler just to type and click.

So Vlingo/Yahoo trust that, unlike the case of speech vs. a full-size QWERTY keyboard, people will use speech on cellular handsets because three-key typing is very cumbersome (at least for people older than 17) and the mini QWERTY keyboards are not very usable either. Will people really use it? Given today's state-of-the-art systems, I do not believe that it will be widely used. However, this is why Vlingo just raised $20M: to make it real and overcome this enormous technology challenge.

Wednesday, April 2, 2008

Nuance jumps on the voicemail-to-text bandwagon

Just two weeks after SpinVox's $100M round (see the previous post on this blog), the speech analytics giant Nuance announced a voicemail-to-SMS/email service. Although the announcement refers to Nuance's speech technology intellectual property (more than 200 patents, and trust me, they are using them), Nuance is relying on "over 3,000 Nuance transcriptionists, hosted in a Nuance-owned facility".

I must say that I appreciate Nuance's honesty in admitting that humans are required in the cycle (unlike other companies providing similar services). However, is it a scalable operation, and can it maintain the margins required of a publicly traded US technology company?

Let's play with some numbers:

Another player in this market is SimulScribe, which charges $10 per 40 voicemails (per month). Assuming 20 seconds on average per voicemail, you earn $10 for 800 seconds, which is $0.75 per minute. This translates to $45 for transcribing 1 hour of speech.
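If you want to play with these numbers yourself, here is the same calculation as a small Python sketch. The price and voicemail count are the SimulScribe figures above; the 20-second average is my assumption, and the function name `revenue_per_hour` is just for illustration.

```python
def revenue_per_hour(price_usd, voicemails, avg_seconds):
    """Return (revenue per minute, revenue per hour) of transcribed speech."""
    total_minutes = voicemails * avg_seconds / 60
    per_minute = price_usd / total_minutes
    return per_minute, per_minute * 60


# $10 for 40 voicemails, assuming 20 seconds per voicemail on average.
per_minute, per_hour = revenue_per_hour(10.0, 40, 20)
print(f"${per_minute:.2f} per minute, ${per_hour:.2f} per hour")
# $0.75 per minute, $45.00 per hour
```

Longer voicemails reduce the revenue per hour of audio: at 30 seconds on average, the same plan yields $30 per transcribed hour.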

Assuming real-time transcription by a human, there seems to be a very high margin in such a service (especially if you can leverage low-paid offshore employees).
This could also be a perfect fit for Amazon’s Mechanical Turk service, where you can get people to perform simple tasks. Well, just one issue turns this amazing business opportunity into a problem: privacy. Is it allowed to transfer your voicemail to another continent? In many cases it is prohibited by regulations. Will people be willing to have their private voicemails transferred to random people who transcribe them from home through the Mechanical Turk service? That's for you to answer. So instead of spreading the work around the globe, you need to build secured facilities to host the transcribers and the data, and this is where the costs start to accumulate. Add a 24/7 service that cannot be staffed around the globe and you get something that can be profitable but is risky.
So here comes technology to the rescue: by leveraging voice-to-text technology, some of the data can be transcribed automatically. But how can you leverage it? Even assuming a very optimistic accuracy estimate of 85% (which is far from the truth for compressed telephony), someone must review the results and fix the embarrassing errors. I believe that most errors are OK, but some (which I call embarrassing) may cause subscribers to walk away because they are embarrassed by the messages (or may cause a lot of embarrassment to the service provider). So until the systems' accuracy is XXX%, humans will still be required. So what is the XXX required accuracy? I will leave this to others to comment.

Regards, Ofer