Speech Analytics: 2008

Wednesday, October 1, 2008

Nuance buys speech recognition business from Philips

Another step in speech recognition consolidation: Nuance Communications buys Philips speech recognition unit

More than five years after acquiring the Philips Speech Processing business units, Nuance has completed its acquisition of the Philips speech recognition business. While the earlier acquisition covered telephony and voice control, this one is mainly about the healthcare speech dictation market. It seems to be driven more by business (the EU market) than by technology. Will the future of SpeechMagic be the same as that of SpeechPearl? Nuance has been very aggressive, and I assume efficient, with its acquisitions, making sure that multiple engines do not coexist for the long term.

Nuance Communications buys speech recognition unit

Wednesday October 1, 8:40 am ET
Nuance Communications pays $96.1 million for speech recognition unit in Europe

BURLINGTON, Mass. (AP) -- Nuance Communications Inc., which makes speech recognition software, said Wednesday it bought Royal Philips Electronics' speech recognition unit for about 66 million euros, or US$96.1 million.

Nuance bought the unit in order to expand its presence in the European health care market.

Under the deal, Netherlands-based Royal Philips received a payment of 21.7 million euros on Sept. 26 and will receive 44.3 million euros in a deferred payment on Sept. 21, 2009. Nuance said the deal will add between $36 million and $39 million in revenue in 2009, and now expects revenue to reach $410 million. The buyout will add up to 1-cent per share in profit.

The addition of the speech recognition unit, Nuance said, enhances the company's ability to provide documentation and communication services to health care organizations throughout Europe. It said about $2 billion is spent annually in Europe for health care companies to manually process clinical information.

ScanSoft Completes Acquisition of Philips Speech Processing Business Units

Leading Telephony and Voice Control Technologies and Applications Expand ScanSoft's Presence in Telephony, Automotive and Embedded Markets

PEABODY, Mass., 3rd February, 2003 - ScanSoft, Inc. (Nasdaq: SSFT), a leading provider of imaging, speech and language solutions, today announced that it has closed the acquisition of the Speech Processing Telephony and Voice Control business units and related intellectual property from Royal Philips Electronics (NYSE: PHG, AEX: PHI).

"ScanSoft's acquisition of the Telephony and Voice Control business units from Philips Speech Processing further enhances our market share in key markets and gives the company additional competitive momentum in our target markets," said Paul Ricci, ScanSoft's chairman and CEO. "With a broader set of technologies and an enhanced distribution channel, ScanSoft is well positioned to capitalise on growth opportunities in the telephony, automotive and embedded markets. In addition, we expect the relationship we have forged with Philips to contribute to the development of new speech technologies in the future."

The businesses acquired by ScanSoft include:

Telephony - The Philips Speech Processing Telephony business allows enterprise customers, telephony vendors and carriers to speech-enable a range of services, including directory assistance, interactive voice response and voice portal applications. Philips automatic speech recognition (ASR) engine, SpeechPearl®, supports more than 40 languages and can process a vocabulary of more than one million words, making it the solution of choice for telephony applications that target global and broad regional markets. Philips has also leveraged its expertise in telephony ASR to develop VoiceRequest™, an enterprise auto-attendant solution, and automated directory assistance implementations that have been deployed by Telia in Sweden, the Japan Multimedia Service and Telefónica de Argentina, among others.

Voice Control - The Philips Speech Processing Voice Control business is addressing growing consumer demand for speech-enabled automotive, mobile and consumer electronics products. Philips' SpeechWave™ and VoCon® small-footprint speech recognition engines are ideal for embedded applications including voice-control of climate and entertainment features within cars. These solutions are also used within navigation systems and to enable automated voice dialing within mobile phones, including those from Philips.

The consideration for the transaction comprises a $27.5 million three-year, zero-interest convertible subordinated debenture, convertible at any time into common shares of ScanSoft at $6.00 per share; 4.1 million euros in cash, of which 3.1 million euros were paid at closing and 1 million euros are payable by December 31, 2003; and a 5 million euro 5% interest note due December 31, 2003. The cash payable is subject to adjustment in accordance with the provisions of the agreement as amended.

Tuesday, September 2, 2008

iPhone speech recognition - status

The iPhone mania is pushing many vendors to offer speech recognition on this platform. As there are many offerings, I enclose a summary of the current status. I am not certain it covers all iPhone speech offerings, and I will be glad to receive comments about additional speech recognition solutions for the iPhone.

There are many posts on the net about speech recognition on the iPhone, some about command and control and some about performing free-speech web search. Many people are disappointed that there is no speech recognition for the iPhone, while others are discussing working or semi-working applications, and a few names are mentioned:

AT&T
Nuance
VoiceSignal
VoiceDialer
VoiceDial
VoiceThis
Fonix

AT&T recently unveiled a research project, the Watson Speech Mashups Architecture: a new software framework that casts AT&T’s WATSON speech recognition as a web service to economically bring speech processing technologies to the larger web and mobile developer community. This new capability provides network-hosted speech technologies for multimedia devices with broadband access (iPhone, BlackBerry®, IPTV set-top box, SmartPhones, etc.) without the need to install, configure, and manage speech recognition software and equipment. This enables easy and rapid development of new speech and multimodal mobile services as well as new web-based services. The software implementation is based on well-established web programming models, such as SOA, REST, AJAX, JavaScript and JSON.

AT&T provides a video of the YellowPages.com application with speech recognition on an iPhone.

Are Nuance and VoiceSignal promoting separate solutions, or maybe the same one? VoiceSignal is part of Nuance but still presented its own iPhone speech recognition solution. I am confused, and I bet many others inside Nuance are confused too.

I wrote in the past about the Nuance iPhone offering. What I managed to understand from the various announcements is that Nuance named its solution OVS for open voice search (in some cases it is referred to as open vsearch). It is also known as the Mobile Solution Suite. Surprisingly on the VoiceSignal website, there is a demo video of vSearch which is a speech recognition based search application for iPhone. The application logo displayed on the iPhone in this video is the VoiceSignal logo (unlike the logo on the Nuance vsearch demo which is of course the Nuance logo).
If you are still not confused, take a look at the Nuance announcement from June 10th, 2008 and compare it to the VoiceSignal announcement from August 24th, 2007:

Nuance Unveils Voice Search Prototype On iPhone

APPLE WWDC, SAN FRANCISCO. June 10, 2008 — Nuance Communications, Inc. (NASDAQ: NUAN), a leading provider of speech solutions, today unveiled a ground-breaking prototype for voice search capabilities shown on the Apple iPhone (NASDAQ: AAPL).

The newly designed application introduces a new, more compelling consumer and search experience. Through Nuance speech recognition servers, mobile consumers — with no training required — can simply speak requests into their phone like “Find the Apple store in Boston, Massachusetts,” “Score of the Boston Celtics game,” or “Play Hannah Montana Best of Both Worlds” to quickly and accurately search the mobile web or in the future dictate an IM, SMS or e-mail message. The prototype, code-named “OVS” for open voice search, will allow mobile operators to offer simple ‘say anything’ search capabilities and is search engine agnostic, able to link to any search engine of an operator’s choosing. A video demonstration of the new application can be found at www.nuance.com/mobilesuite.

VoiceSignal Voice Enables iPhone in Proof of Concept Development

WOBURN, Mass., August 24, 2007– VoiceSignal Technologies, Inc., a leading supplier of speech recognition solutions, today announced that VoiceSignal engineers have ported several of VoiceSignal’s applications to the iPhone. These initial proof-of-concept applications include VSearch (mobile local search by voice) and VTunes (voice enabled music player).

The video demonstrations can be found either on the VoiceSignal website (www.voicesignal.com) or on the following YouTube links:

VTunes: http://www.youtube.com/watch?v=zne4rwCCmAc
VSearch: http://youtube.com/watch?v=ayrCCw5xWug

I am surprised by Nuance's marketing performance on this issue. I hope that someone from Nuance will wake up and clear this up, and maybe add some comments to help us understand their offering.

SpeechCloud's VoiceDialer (free) was the first iPhone application to try to offer speech dialing on the iPhone. VoiceDialer takes advantage of the iPhone's always-on internet connection to record your voice and send it to SpeechCloud's servers, which perform the actual recognition, similar to the AT&T and Nuance approach. Once the name is recognized, the application pulls up the contact and lets you select which number to dial. Criticism of the application focuses on the amount of manual interaction (tapping on buttons) needed to actually dial a number, and on the slow response time caused by transferring data across wireless networks.

VoiceDial (by Makayama, $14.99 on the Apple App Store) avoids actual speech recognition and instead performs audio comparison. VoiceDial requires you to record your own voice for each contact; the recording is later used to match your voice command. If you are willing to pay the $15 and willing to record yourself saying your contacts, MercuryNews claims the product "works as advertised" and "had no problems recognizing the contact I wanted to call, even when it was similar to other names I'd recorded."
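
To illustrate what "audio comparison" against a user-recorded template can look like, here is a minimal Python sketch using dynamic time warping (DTW), a classic technique for matching two utterances spoken at different speeds. This is only an assumption for illustration: I do not know which method VoiceDial actually uses, and the function names and the toy "energy contour" data are hypothetical stand-ins for real audio features.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # reuse b's frame for another frame of a
                cost[i][j - 1],      # reuse a's frame for another frame of b
                cost[i - 1][j - 1],  # advance in both sequences
            )
    return cost[n][m]


def best_contact(utterance, templates):
    """Return the contact whose recorded template is closest to the utterance."""
    return min(templates, key=lambda name: dtw_distance(utterance, templates[name]))


# Toy "energy contours" standing in for real audio features (hypothetical data).
templates = {
    "Alice": [0.1, 0.8, 0.9, 0.2, 0.1],
    "Bob": [0.7, 0.1, 0.1, 0.6, 0.7],
}
# The "Alice" template spoken more slowly: every frame is repeated once.
spoken = [0.1, 0.1, 0.8, 0.8, 0.9, 0.9, 0.2, 0.2, 0.1, 0.1]
print(best_contact(spoken, templates))
```

Because DTW may stretch either sequence in time, the slowed-down utterance aligns perfectly with the "Alice" template. The price of this approach is the one noted above: every contact needs its own recording, since nothing generalizes across speakers or words.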

HRL Technologies' VoiceThis Dialer ($9.99) is an application that actually tries to perform speech recognition on the iPhone itself, so no wireless connection is required. VoiceThis Dialer promises completely hands-free operation, with the ability to dial contacts and even quit the application by voice.

Fonix Speech is currently developing iSpeak, which includes a run-time engine that sits on the phone allowing users to interact with the personal contents of their Apple iPhone™. Unlike other voice applets that enable voice search of the Internet by sending commands over the airwaves, this client-side application gives users the power of voice interaction with their personal content and eliminates network latency. Fonix iSpeak™ connects the user by just saying the phone number or by saying the name of a person in the contacts database. Additionally, users will be able to navigate their music libraries and launch a song or playlist simply by saying the name of the artist, song, or playlist.

To summarize, we are facing a proliferation of speech recognition applications on the iPhone.
Two keys to the evolution of speech recognition on the iPhone are the 3G capability (which provides a fast channel to server-side computing) and the openness of the platform, both released recently. As these two criteria are fulfilled, we should expect quick growth in the availability of speech recognition applications for the iPhone.

Tuesday, August 19, 2008

IBM is behind Vlingo's technology - will Nuance sue IBM?

Recently Nuance decided to play aggressively against Vlingo. Filing a lawsuit against Vlingo seems reasonable for Nuance, which has a long history of using the lawsuit weapon against small companies (usually ahead of acquisition attempts). However, the recent Business Week article, IBM's Speech Recognition, exposes the fact that Vlingo is using IBM technology as the basis for its solution.

I wonder if Nuance will continue the legal action given this information. IBM's speech recognition patent portfolio is one of the broadest in the industry, and in general no one should mess with IBM about patents.

Wednesday, July 16, 2008

Google Video Search via Speech Recognition

Finally a hint on the expected Google move to the speech recognition arena.

Google announced on the Official Google Blog the availability of a new video search capability based on speech recognition.

It was released as a gadget you can embed on your iGoogle homepage and is a good preview of things to come.
The gadget only searches videos uploaded to YouTube's Politicians channels, which include videos from Senator Obama's and Senator McCain's campaigns, as well as those from dozens of other candidates and politicians. It usually takes less than a few hours for a video to appear in the index after it has been published on YouTube.

So apart from congratulations to the Google team, who are exposed to the public for the first time, how do they compare to other speech recognition engines aimed at broadcast-quality audio? The Google team refers to their precision: "While some of the transcript snippets you see may not be 100% accurate, we hope that you'll find the product useful for most purposes." While I do not understand what the purposes are for searching only within the YouTube political channels, people should be aware of much more mature solutions developed in past years, from the pioneering work of BBN and IBM to existing online solutions like EveryZing, TVEyes, Blinkx, snipp.tv by NSC and more. Based on the perceived quality, the Google team has a long way to go to reach the first league and to be able to analyze data that is not of broadcast quality. The good news for users is that YouTube data is easier to process than telephone calls, on which speech recognition is widely performed today at contact centers by companies like Verint, Nice, Autonomy, Utopy, Nexidia, CallMiner and other players.

Friday, July 11, 2008

SuperHuman Speech Recognition

Last week the Speech Technologies Group at the IBM Haifa Research Lab (HRL) coordinated a full-day seminar on Speech Technologies. The seminar was a great success with more than 100 participants.

The Keynote presentation at the IBM Speech Technologies Seminar 2008 was: "Superhuman Speech Recognition: Technology Challenges and Market Adoption" by Dr. David Nahamoo, IBM Fellow, Speech CTO and Business Strategist, IBM Watson Research Center. You can view the presentation below. More presentations will be posted soon.

SuperHuman Speech Recognition Jul 2 2008

Saturday, July 5, 2008

11 Indian languages available from Nuance

Nuance just extended their Indian language support. In addition to Hindi and Indian English, they now also support Marathi, Malayalam, Tamil, Kannada, Telugu, Bengali, Gujarati, Oriya, and Punjabi. I wonder what automatic language identification results would look like when trying to discriminate between these languages.

Nuance Communications Launches 9 New Indian Languages for Speech Recognition

By VARIndia Correspondent

Nuance Communications has released 9 new Indian languages for speech recognition in the contact centre...

Tuesday, June 17, 2008

Nuance-Vlingo: If you are not sued, you do not exi(s)t

Nuance is not only actively pushing speech-based search to the iPhone, competing with the Vlingo/Yahoo offering. Apparently, Nuance has also just filed a lawsuit against Vlingo for infringing one of its 1000 patents. Nuance and its predecessor (ScanSoft) have a long history of lawsuits which were in sync with their M&A and business strategy. In some cases, when competing for a large account or negotiating a good M&A price, Nuance used the lawsuit mechanism to get a better deal.

Just a few examples from the past:

ScanSoft, ART could settle lawsuit with acquisition

"Peabody-based ScanSoft Inc. may settle a lawsuit by acquiring defendant ART Advanced Recognition Technologies Inc., according to an Israeli newspaper report.

The deal could amount to tens of millions of dollars, according to Globes Online, which attributed the report to a Hebrew newspaper called Yediot Ahronot."

ScanSoft files suit against Voice Signal

"Voice Signal Technologies Inc., a Woburn start-up that sells speech-recognition systems used in wireless phones, has been hit with a patent-infringement lawsuit by ScanSoft Inc., after Voice Signal refused to accept what executives yesterday called a lowball takeover bid from ScanSoft."

There is also the famous TellMe case.

I think it is great feedback for Vlingo, whose management includes ex-Nuance employees. If you are not sued in this industry, you do not exi(s)t.

Sunday, June 15, 2008

iPhone speech recognition

The iPhone is attracting many developers wishing to add its next cool application. Nuance recently introduced vsearch, a voice search application similar to Vlingo's recent application on the BlackBerry.

The official video from Nuance

I believe the following video better demonstrates the hands-free aspect of search.

Is it a gimmick or will people actually use it? What happens when there are speech recognition mistakes? Are there such mistakes? I much prefer the unofficial video demonstrating the Yahoo/Vlingo voice search.

If someone has statistics on voice-based search, I will be glad to receive them.

Tuesday, May 27, 2008

Speech analytics in the contact center - what's driving adoption rates?

Ok, so many analysts have been talking about the growth rate of speech analytics continuing to accelerate through 2008 and beyond. While I will concede this is a reasonable prediction for this technology, the reality is that, while more and more companies are budgeting for, evaluating and even purchasing these technologies, the adoption of these solutions into core customer business practices, and more importantly the quantifiable business benefits delivered tell a story of fuzzy results and an ever-denser fog through which to see potential measurable results.

Over a series of posts on this topic, it's my goal to offer insight from experience working with over 100 clients, my view of effective and not-so-effective approaches, and hopefully some practical remedies that can be applied to your business today.

I’m not going to spend time here revisiting the history of the world of speech analysis or take some dive into the weeds on the technology. If the title of this article resonated with you, I assume you’ve done your homework and have probably struggled with some of these same issues during a recent project. Or, you’ve postponed such a project because cracking the code on this technology has been too elusive to provide a level of confidence in moving forward. No. My mostly benevolent, and maybe a tiny bit self-serving, objective here is to share my experiences and opinions I’ve refined over the past five years working with these solutions.

If you want more details on the technology, there are plenty of sources out in cyberspace for all the facts and figures, bits and bytes you want, if you’re into that sort of thing. Do your research and then pick this back up and read through it before you make your next move.

Some of the headings under which the adoption rate of these solutions fall include:

-Value realized by early adopters
-Once bitten, twice shy
-Overwhelmed quality management functions
-The hype cycle - oversold rudimentary capabilities
-The “Toy in the Happy Meal® Syndrome”
-Fuzzy ROI
-Best Practices
-Managing organizational change

As this series progresses, we'll tackle each of these, individually and, as they potentially influence each other, in combination. I hope this series of posts stimulates others to contribute their experiences. I look forward to our journey.

Friday, May 23, 2008

Speech Technologies Seminar 2008

The Speech Technologies Group at the IBM Haifa Research Lab (HRL) invites speech professionals to a full-day seminar on Speech Technologies, to be held on Wednesday, July 2, 2008.

This full-day seminar provides a forum for the research and development communities from both academia and industry to share their work, exchange ideas, and discuss issues, problems, and work-in-progress, as well as future research directions and trends. The seminar agenda will be posted at a later date. It will include frontal presentations and a poster session.

The seminar will take place at the HRL site on the Haifa University campus, in the auditorium (room L100). Lunch and light refreshments will be served. Participation is free.

See http://www.haifa.il.ibm.com/Workshops/speech2008/index.shtml for details.

**Program**
09:00 Registration
09:30 Opening Remarks - Oded Cohn, Director, IBM Haifa Research Lab
09:45 Challenges of Speech Solutions in Call Centers - Nava Shaked, Manager, CRM & Call Center, IBM Israel
10:15 Actionable Intelligence via Speech Analytics - Ofer Shochet, Senior VP, VERINT
10:45 Discriminative Keyword Spotting - Joseph Keshet, IDIAP
11:15 Break
11:30 Recent Advances in Speech Dereverberation - Emanuel Habets, Bar-Ilan University & Technion
12:00 On Improving the Quality of Small Footprint Concatenated Text-to-Speech Synthesis Systems - David Malah, Head of Signal Processing Lab, Technion
12:30 Keynote. Superhuman Speech Recognition: Technology Challenges and Market Adoption - David Nahamoo, Speech CTO and Business Strategist, IBM Watson Research Center
13:30 Lunch
14:30 Using Speech Processing Technologies in Audio Search Applications - Ido Itzhaki, Director, Business Development, NSC
15:00 Intra-class Variability Modeling for Speech Processing - Hagai Aronowitz, IBM Haifa Research Lab
15:30 Retrieving Spoken Information by Combining Multiple Speech Transcription Methods - Jonathan Mamou, IBM Haifa Research Lab
16:00 Poster Session & Refreshments

Wednesday, May 21, 2008

Voicemail to SMS pricing is going down.

In a recent post, I demonstrated that the voicemail-to-text business may be extremely profitable.
Recently, SpinVox published new pricing offered to Cincinnati Bell Wireless customers. An unlimited number of voice-to-text conversions now costs just $5.99 per month. As for most people unlimited means ~1.2 voicemails per day, this new pricing implies a nice profit for SpinVox and at the same time moves into the comfort zone of SMBs. SpinVox is smart enough to post a study about users' habits with a focus on telcos' revenues:

Carriers are reporting a 33% uplift in Voicemail deposits, as the calling party knows their message will be seen in minutes.
87% of people return a SpinVox message, which is driving a 10% uplift in voice and 15% in text.
This is all equating to a 110% uplift in the voice message revenue line.

Based on these parameters, no wonder the telcos are interested in this service and will be willing to fund some of it.
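
For readers who want to play with the pricing numbers, here is a small Python sketch of what $5.99 per month means per voicemail and per hour of audio. The fee and the ~1.2 voicemails per day come from the post; the 30-day month and the 20-second average voicemail are my assumptions (the latter borrowed from other posts on this blog), and the function names are just illustrative.

```python
MONTHLY_FEE_USD = 5.99       # Cincinnati Bell Wireless price quoted in the post
VOICEMAILS_PER_DAY = 1.2     # the post's estimate of typical "unlimited" usage
DAYS_PER_MONTH = 30          # assumption
AVG_VOICEMAIL_SECONDS = 20   # assumption, borrowed from other posts on this blog


def revenue_per_voicemail(fee, per_day, days):
    """Subscription revenue divided by the number of voicemails in a month."""
    return fee / (per_day * days)


def revenue_per_audio_hour(fee, per_day, days, seconds_each):
    """Subscription revenue per hour of voicemail audio actually transcribed."""
    audio_seconds = per_day * days * seconds_each
    return fee / audio_seconds * 3600


per_voicemail = revenue_per_voicemail(MONTHLY_FEE_USD, VOICEMAILS_PER_DAY, DAYS_PER_MONTH)
per_hour = revenue_per_audio_hour(
    MONTHLY_FEE_USD, VOICEMAILS_PER_DAY, DAYS_PER_MONTH, AVG_VOICEMAIL_SECONDS
)
print(f"${per_voicemail:.2f} per voicemail, ${per_hour:.2f} per hour of audio")
```

Under these assumptions a subscriber leaves about 36 voicemails a month, which works out to roughly 17 cents per voicemail and about $30 per hour of audio.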

Sunday, April 27, 2008

What is English

A few recent news items reminded me again that we constantly refer to a language by its name (English, Spanish, etc.) but forget that it is constantly evolving and that it depends on context and location.

SpinVox announced their VoxGeist. After analyzing more than 50M voicemails (wow: assuming 20 seconds on average per voicemail, this is more than 270 thousand hours of transcribed audio, which is the largest database of transcribed voicemails), SpinVox can analyze spoken voicemail language dynamics much better than other companies. Although voicemails are different from conversations, according to SpinVox there is clear penetration of slang words which are not part of a standard dictionary (or, in speech reco jargon, not part of the standard language model).
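
The back-of-the-envelope figure is easy to check. A minimal Python sketch, using the 50M voicemails from the announcement and my assumed 20-second average (the function name is illustrative):

```python
def total_audio_hours(num_voicemails, avg_seconds):
    """Total audio duration in hours for a pile of voicemails."""
    return num_voicemails * avg_seconds / 3600


hours = total_audio_hours(50_000_000, 20)
years = hours / (24 * 365)  # continuous playback, ignoring leap years
print(f"{hours:,.0f} hours, or about {years:.1f} years of non-stop audio")
```

So "more than 270 thousand hours" holds: one billion seconds is a little under 278 thousand hours, or over three decades of continuous listening.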

The Voxgeist list of Most Referenced Slang Words and Phrases with Translations in the UK:

-- Diesel = a muscular man/woman

-- Hot Mess = person is a disaster

-- Spun = crazed

-- Busted/Hurt = extremely unattractive

-- Ice/Bling = diamonds

-- Shady = untrustworthy

-- Woot/Stoked = excited

-- One = I'm out; leaving

-- Peace out = goodbye

-- Props = respect

-- Newbie = beginner, new kid on the block

-- Blasted = to get in serious trouble.

-- Scheisty = secretive amongst others

-- Bluetool = someone who always wears a Bluetooth earpiece, even when they're not on the phone

-- Baller = wealthy

-- Crunk = crazy drunk

-- Celebutante = famous just for being wealthy

-- Schlumpadinka = women who let their style go

-- Multislacking = having two or more non-work related web pages open on your work computer at one time.

-- Hola = greetings; hello

While the Canadian regional dictionary currently includes words such as:

1. Allophone –Someone whose first language is neither English nor French.

2. Ble d’Inde –corn on the cob

3. Bourassa – Robert Bourassa a former premier of Quebec

4. Cabot – Famous explorer of North America

5. Coulee – means to flow

6. Keener: A brown-noser whose excessive keen-ness for the unpleasant task at hand makes the rest of us look bad.

7. Loonie: the nickname for the Canadian $1 coin

8. Mickey: A mickey is one of those curved, flat, 13-ounce bottles of booze that winos carry.

9. Nanaimo - A type of chocolate bar, originally produced in the city of Nanaimo, British Columbia. It consists of a crumb-based layer, topped by a layer of light custard or vanilla butter icing, which is covered in soft chocolate.

10. Nunavut – newest Canadian territory , next to the North West Territories

11. Oolochan – small ocean fish

12. Poutine: Poutine is a cholesterol-rich Canadian "delicacy" consisting of French fries covered in cheese curds and gravy. When prepared badly, it congeals in your guts like concrete.

13. Toonie: the nickname for the Canadian $2 coin

14. Tourtiere – French meat pie

15. Sniggler: Someone who takes the parking spot you wanted, or is generally annoying

Apart from being great PR for SpinVox, it is a demonstration of the human intelligence required for understanding messages and of how experience is key. The question is how automatic systems can learn such slang. Usually, language models are created by scanning enormous amounts of text. However, how can systems learn if the texts do not include these new words? It will be either by transcribing a lot of data or by waiting until spoken slang penetrates the written language. Actually, a recent study about Writing, Technology and Teens gives a clear indication of how not only slang but also written abbreviations (from the SMS world) are penetrating the written language. For example, 38% of teenagers say they have used text shortcuts in school work such as “LOL” (which stands for “laugh out loud”). Even more interesting is that written abbreviations find their way back into the spoken language: "I can speak very well, but there are also times that I have been laughing and actually said LOL. So it all depends on how much you text and who you are around at the time. – 11/12th Grade Girl, Pacific Northwest City."
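
To make the "not part of the language model" problem concrete, here is a toy Python sketch that builds a vocabulary from written text and then lists the words of a spoken message that fall outside it. The tiny corpus, the sample voicemail and the function names are all made up for illustration; a real language model is built from vastly more text and tracks word sequences, not just a word list.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(corpus):
    """Collect every word seen in the written training texts."""
    vocab = set()
    for document in corpus:
        vocab.update(tokenize(document))
    return vocab


def out_of_vocabulary(transcript, vocab):
    """Words in the transcript never seen in training, in order of appearance."""
    missing = []
    for word in tokenize(transcript):
        if word not in vocab and word not in missing:
            missing.append(word)
    return missing


# Hypothetical written corpus and spoken voicemail, for illustration only.
written_corpus = [
    "please call me back when you are free",
    "i was so excited about the party last night",
]
voicemail = "I was so stoked about the party, call me back, peace out"

vocab = build_vocabulary(written_corpus)
print(out_of_vocabulary(voicemail, vocab))
```

A recognizer restricted to this vocabulary simply cannot output "stoked"; it will substitute the closest-sounding known words, which is exactly where human transcribers with experience have the edge.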

The web is currently the best source for learning text statistics. However, no such source exists for spoken language, whether dialogs or monologues.
It would be great to see an initiative in which people contribute their transcribed voicemail archives (maybe sending them via Twitter) so that technologists around the world can enhance the capabilities of automatic speech recognition systems.

Thursday, April 3, 2008

Speech based search is aiming high

The momentum in speech recognition is evident with the new relations between Yahoo and Vlingo.
Yahoo also is leading a $20 million round in Vlingo.

The promise is speech-based search from your cellular handset. We are all aware that voice commands on cellular phones are not taking off. This is similar to computers, where the fact that you can do things quietly using a full-size QWERTY keyboard prevented the proliferation of speech-based activity. I do not know when you last tried a speech-based search solution, so let me remind you of the setup process and some of the user experience:

To remind myself, I just installed the rather new Tazti engine on my laptop. As part of the setup I received some guidance which immediately put me in a defensive mode when thinking about cellular search:

"Speak in a quieter environment"
"Make sure your microphone is positioned correctly"
"Speak more clearly and do not rush"
"Obtain a higher quality microphone"

I guess they forgot the comments about my horrible accent (or they just didn't want to offend me...).

Of course, I aborted my attempts to use the speech recognition, as I type much faster, and when interacting with a large screen and many options it is simpler just to type and click.

So Vlingo/Yahoo trust that, unlike the case of speech vs. a full-size QWERTY keyboard, people will use speech on cellular handsets, as three-key typing is very cumbersome (at least to people older than 17) and the mini QWERTY keyboards are not very usable either. Will people really use it? Given today's state-of-the-art systems, I do not believe that it will be widely used. However, this is why Vlingo just raised $20M: to make it real and overcome this enormous technology challenge.

Wednesday, April 2, 2008

Nuance jumps on the voicemail to text wagon

Just two weeks after the $100M round of SpinVox (see previous post on this blog), the speech analytics giant Nuance announced a voicemail to SMS/email service. Although the announcement refers to Nuance's speech technology intellectual property (more than 200 patents, and trust me, they are using them), Nuance is relying on "over 3,000 Nuance transcriptionists, hosted in a Nuance-owned facility".

I must say that I appreciate Nuance's honesty admitting that humans are required in the cycle (unlike other companies providing similar services). However, is it a scalable operation and can it maintain the margins required from a US technology public company?

Let's play with some numbers:

Another player in this market is SimulScribe, which charges $10 per 40 voicemails (per month). Assuming 20 seconds on average per voicemail, you earn $10 for 800 seconds, which is $0.75 per minute. This translates to $45 for transcribing 1 hour of speech.
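
The same numbers as a small Python sketch, so you can plug in your own assumptions. The $10 per 40 voicemails is SimulScribe's price as quoted above and the 20-second average is my assumption; the transcriber wage in the margin example is a purely hypothetical placeholder, not a real figure.

```python
def price_per_minute(fee_usd, voicemails, avg_seconds):
    """Revenue per minute of voicemail audio."""
    total_minutes = voicemails * avg_seconds / 60
    return fee_usd / total_minutes


def margin_per_audio_hour(revenue_per_hour, wage_per_hour, work_hours_per_audio_hour=1.0):
    """Revenue minus labor cost for one hour of audio.

    work_hours_per_audio_hour=1.0 means real-time transcription.
    """
    return revenue_per_hour - wage_per_hour * work_hours_per_audio_hour


per_minute = price_per_minute(10, 40, 20)   # 40 voicemails x 20 s = 800 s
per_hour = per_minute * 60
print(f"${per_minute:.2f} per minute, ${per_hour:.2f} per hour of speech")

HYPOTHETICAL_WAGE_USD = 15.0  # placeholder wage, for illustration only
print(f"margin: ${margin_per_audio_hour(per_hour, HYPOTHETICAL_WAGE_USD):.2f} per audio hour")
```

The margin obviously swings with the wage and with how many working hours one hour of audio really takes, which is why both are left as parameters.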

Assuming real-time transcription by a human, there seems to be a very high margin in such a service (especially if you can leverage low-paid offshore employees).
This can also be a perfect fit for Amazon’s Mechanical Turk service, where you can get people to perform simple tasks. Well, just one issue turns this amazing business opportunity into a problem: privacy. Is it allowed to transfer your voicemail to another continent? In many cases it is prohibited by regulations. Will people be willing to have their private voicemails transferred to random people using the Mechanical Turk service to do it from home? That's for you to answer. So instead of doing it across the globe, you need to build secured facilities to host the transcribers and data, and this is where the costs start to accumulate. Add a 24/7 service which cannot be spread around the globe and you get something which can be profitable but is risky.
So here comes technology to the rescue: by leveraging voice-to-text technology, some of the data can be transcribed automatically. But how can you leverage it? Even assuming a very optimistic 85% accuracy (which is far from the truth for compressed telephony), someone must review the results and fix the embarrassing errors. I believe that most errors are OK, but some (which I call embarrassing) may cause subscribers to walk away because they are embarrassed by the messages (or may cause a lot of embarrassment to the service provider). So until the systems' accuracy is XXX%, humans will still be required. So what is the required accuracy XXX? I will leave this to others to comment.
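
To get a feel for what 85% means per message, here is a rough Python sketch. It rests on strong simplifying assumptions that are mine, not measured facts: a 20-second voicemail holds about 50 words, word errors are independent of each other, and every error counts the same (whereas only the embarrassing ones really matter).

```python
WORDS_PER_VOICEMAIL = 50   # assumption: ~20 s at roughly 2.5 words per second


def expected_word_errors(words, accuracy):
    """Average number of wrong words in a message of the given length."""
    return words * (1 - accuracy)


def prob_at_least_one_error(words, accuracy):
    """Chance a message contains any error, if word errors were independent."""
    return 1 - accuracy ** words


def accuracy_needed(words, clean_fraction):
    """Word accuracy at which `clean_fraction` of messages come out error-free."""
    return clean_fraction ** (1 / words)


print(f"expected errors per message: {expected_word_errors(WORDS_PER_VOICEMAIL, 0.85):.1f}")
print(f"P(at least one error):       {prob_at_least_one_error(WORDS_PER_VOICEMAIL, 0.85):.4f}")
print(f"accuracy for 90% clean:      {accuracy_needed(WORDS_PER_VOICEMAIL, 0.90):.4f}")
```

Under these assumptions, at 85% accuracy practically every message contains several errors, and getting 9 out of 10 messages completely clean would take word accuracy well above 99%. That is one way to start putting a number on XXX, although the real answer depends on which errors subscribers actually care about.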

Regards, Ofer

Monday, March 24, 2008

Macintosh speech recognition

It seems like ages since the Macintosh days. It is great to find this speech recognition ad, which is still appealing to us. After more than 10 years, there has not been much progress in such applications.

Are there any signs of a similar application that really works?

Saturday, March 22, 2008

$100M to SpinVox

Congratulations to SpinVox on their $100M round. This announcement, together with a very effective marketing campaign, brought SpinVox into the news on many channels. Many people asked me about this company: is the technology finally mature enough to handle speaker-independent speech recognition in noisy environments and over highly compressed channels? People are especially interested in speech analytics once it is mentioned together with the magic words Facebook or Twitter.

Various companies are reviving the speech recognition market with new applications, and in some cases they leave laymen confused about whether it is a new technology or a new application. It reminds me of one of my favorite Dilbert cartoons, which I am attaching.

In recent posts at TechCrunch, GigaOM and many others, SpinVox is mentioned together with other companies active in the voicemail speech-to-text market: GotVoice, SimulScribe, Jott, Yap and Vlingo.

In a future post, I will focus on the technological differences between these companies (as far as one can understand them from their marketing material). For now, I will just comment that a $100M round is not standard for a technology company. For a service company, however, it is. And that is actually the key to understanding the SpinVox operation. Speaker-independent transcription is not 100% accurate; it is not even 90%. When noisy channels and compressed telephony are taken into account, the numbers drop much further. When providing a voicemail transcription service, humans are required in the process for verification. Verification requires listening to the message, which can be done at most 1.3 times faster than real time. This is very similar to the time it takes a trained person to transcribe the call, so the technology cannot help much. So what gap does the technology fill? Mainly, it ensures that the transcribers do not need the extensive training that is usually required, which matters for low-wage, high-turnover operations.
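To put a rough number on "cannot help much", here is a small sketch of the human time needed per hour of audio. It assumes real-time transcription by a trained person and the 1.3x listening speed mentioned above, and it optimistically assumes that fixing errors during verification takes no extra time. The names are illustrative only.

```python
# Human time per hour of audio: transcribing from scratch vs. verifying
# an automatic transcript. Assumptions (not measurements): transcription
# runs at real time, verification at 1.3x playback, corrections are free.

def human_minutes(audio_minutes: float, playback_speedup: float) -> float:
    """Minutes a person spends listening at the given playback speed."""
    return audio_minutes / playback_speedup

transcribe_minutes = human_minutes(60, 1.0)  # trained human, real time
verify_minutes = human_minutes(60, 1.3)      # listening 1.3x faster
saving = 1 - verify_minutes / transcribe_minutes

print(f"verify: {verify_minutes:.1f} min, saving: {saving:.0%}")
# prints "verify: 46.2 min, saving: 23%"
```

Even in this best case the saving in human time is under a quarter, and any time spent correcting errors eats into it.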

Everyone involved in speech analytics believes that this voice thing is gonna be big. Yet we should be realistic about what can be achieved and when.

Tuesday, March 18, 2008

SpinVox Targets Cambridge for Speech Recognition Skills

Good news for SpinVox - Do you think that voicemail to text will eventually work without human intervention?

Dr. Tony Robinson – world-renowned academic and entrepreneur - joins to build new centre of expertise

LONDON & ATLANTA---SpinVox, the founder and global leader in Voice-to-Screen messaging, has announced that Dr. Tony Robinson has joined the company as director of its Advanced Speech Group (ASG).

Robinson previously provided SpinVox with expert advice on its speech technology strategy along with Phil Woodland, a Professor of Information Engineering at Cambridge University’s Machine Intelligence Laboratory and coordinator of the Speech Research Group. Woodland retains his role as a consultant to SpinVox.

SpinVox will relocate some of its existing team of Automated Speech Recognition (ASR) experts from its world headquarters in Marlow, Bucks to its new ASG centre in Cambridge.

Robinson’s remit will be to further build a world-class team from the ASR expertise that is concentrated in the Cambridge area. Under his leadership, the SpinVox ASG will further develop the Voice Message Conversion System that is at the heart of SpinVox services.

Robinson is an established, internationally-known academic, originally working at the University of Cambridge, who has successfully made the transition to entrepreneurship. He was formerly founder and owner of Cantab Research Ltd, and CTO of Zentian Ltd, both of which are involved in high accuracy real-time speech recognition in challenging environments.

“Tony is one of only a handful of people who have a complete academic and commercial understanding of all aspects of speech recognition,” says Christina Domecq, SpinVox co-founder and CEO. “We’re delighted that Tony is joining us to help take us to the next stage of our growth worldwide.”

Cambridge and the HTK

Cambridge’s global pre-eminence in speech recognition - which has resulted in a cluster of hundreds of voice specialists based at the University and in the specialist companies housed in the science parks and campuses that surround the institution - is based, in part, on the development at the University of the Hidden Markov Model Toolkit (HTK), the worldwide standard software for building speech recognition systems.

HTK, which was acquired by Microsoft in November 1999 as part of its acquisition of Entropic Inc., is now available free of charge to developers. Cambridge University is responsible for maintaining and developing the HTK.

SpinVox VMCS

“HTK Model technology is one of the foundations of SpinVox VMCS,” adds Daniel Doulton, SpinVox co-founder and chief strategy officer. “Indeed it is at the heart of modern speech recognition systems. At a time when everyone, from industry giants such as Cisco, IBM and Microsoft to specialists such as Nuance, are focussing on voice, Cambridge is recognised as the centre of the speech recognition universe and that’s why SpinVox is setting up there.”

“SpinVox is the Google of speech – it has successfully cornered the market for voice conversion services and the accumulated resources it has assembled represents a huge opportunity for ambitious speech developers and researchers to build their careers,” emphasises Robinson. “The company’s VMCS voice message conversion system is the most advanced of its kind and I believe we have seen only the beginning of its huge potential.”

VMCS works by combining state-of-the-art speech technologies with human intelligence and learning. A fully automated system, it ‘knows what it doesn't know’ and is able to call for assistance when required. VMCS is continually evolving and currently converts messages in English, French, Spanish and German.

Cambridge Connectionist Speech Group

After completing his PhD, and following his subsequent appointment as SERC Advanced Research Fellow, Robinson built the Cambridge Connectionist Speech Group in Cambridge University. In the 1990s the group participated in projects including the TREC Spoken Document retrieval tracks and the DARPA speech recognition evaluations before, in 1995, Robinson was appointed a lecturer in the Department of Engineering and simultaneously founded SoftSound.

In its first five years, SoftSound achieved the first deployment of automatic subtitle generation - on BBC’s ‘Eastenders’. From 1997 to 2000 SoftSound was a key partner in the EU-funded THISL project which created the first audio indexing and retrieval system based on large vocabulary speech recognition.

In May 2000, Autonomy invested in SoftSound which provided access to worldwide markets and resulted in rapid expansion. Central to SoftSound's success is a patented algorithm for speech recognition which allows faster operation with less memory usage.

The author of over 100 academic papers and holder of three patents, Robinson has competed in marathons in London, New York, Paris, Amsterdam, Inverness, Snowdonia and in the Cambridge area. He is also a keen fell runner, having conquered Snowdon and other Welsh and Scottish peaks, and raced across the North Yorks Moors.

To find out more about SpinVox go to www.spinvox.com

About SpinVox

SpinVox® brought together the two most popular methods of communication – voice and text – and created a new category of messaging called Voice-to-Screen™. Its award-winning service is now making everyday communication simpler and more powerful, creating new recurring revenues for wireless, landline, cable and VOIP carriers as well as service providers and web partners. SpinVox has already launched its service with Alltel, Cincinnati Bell, Rogers Wireless, Sasktel, Telstra, Telus, Vodafone Spain, Vodacom South Africa and Six Apart and announced a deal with Skype. As a managed service provider any network or service can rapidly and cost-effectively implement SpinVox.

At the heart of SpinVox is its Voice Message Conversion System™ (VMCS), which works by combining state-of-the-art speech technologies with a live-learning language process. VMCS is being rolled-out across four continents in four languages - English, French, Spanish and German.

Monday, March 17, 2008

Speechless

Speech recognition without speech.

By picking up nerve signals, Audeo understands 150 words and phrases.

Can it be used for real applications?

Hawthorne Videoactive Report

Welcome

Speech analytics is growing constantly, enabling various applications and at the same time pushing the basic technology forward. This blog is an attempt to create interaction between speech analytics professionals around the globe and to provide an open platform for promoting new products, technologies, conferences and ideas. Any member of the speech analytics group on LinkedIn can add posts to this blog.

It will take some time to bootstrap this blog and bring it to the attention of the professional gang. Be patient and promote your views and needs.