Just two weeks after SpinVox's $100M round (see the previous post on this blog), the speech analytics giant Nuance announced a voicemail-to-SMS/email service. Although it refers to its speech technology intellectual property (more than 200 patents - and trust me, they are using it), Nuance is relying on "over 3,000 Nuance transcriptionists, hosted in a Nuance-owned facility".
I must say that I appreciate Nuance's honesty in admitting that humans are required in the loop (unlike other companies providing similar services). However, is it a scalable operation, and can it maintain the margins required of a US public technology company?
Let's play with some numbers:
Another player in this market is SimulScribe, which charges $10 per 40 voicemails (per month). Assuming an average of 20 seconds per voicemail, you earn $10 for 800 seconds, which is $0.75 per minute. This translates to $45 for transcribing 1 hour of speech.
Assuming real-time transcription by a human (one hour of work per hour of speech), there seems to be a very high margin in such a service (especially if you can leverage low-paid offshore employees).
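For anyone who wants to play with the numbers themselves, here is a minimal sketch of the arithmetic above. The function name and parameters are mine, chosen for illustration; the 20-second average is the same assumption made in the text, not a published figure.

```python
def revenue_rates(price_usd, voicemails, avg_seconds):
    """Return (revenue per minute, revenue per hour) of transcribed speech."""
    total_seconds = voicemails * avg_seconds   # 40 * 20 = 800 seconds
    per_minute = price_usd * 60 / total_seconds
    per_hour = price_usd * 3600 / total_seconds
    return per_minute, per_hour


# $10 for 40 voicemails, assuming 20 seconds per voicemail on average
per_min, per_hour = revenue_rates(10, 40, 20)
print(f"${per_min:.2f} per minute, ${per_hour:.2f} per hour")
```

Changing the assumed average length moves the result quickly: at 30 seconds per voicemail the same plan yields $30 per hour of speech.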
This could also be a perfect fit for Amazon's Mechanical Turk service, where you can get people to perform simple tasks. Well, just one issue turns this amazing business opportunity into a problem - privacy. Is it allowed to transfer your voicemail to another continent? In many cases regulations prohibit it. Will people be willing to have their private voicemails sent to random people who transcribe them from home through Mechanical Turk? That's for you to answer. So instead of spreading the work across the globe, you need to build secured facilities to host the transcribers and the data, and this is where the costs start to accumulate. Add a 24/7 service that cannot be staffed across time zones around the globe and you get something which can be profitable but is risky.
So here comes technology to the rescue - by leveraging voice-to-text technology, some of the data can be transcribed automatically. But how can you leverage it? Even assuming a very optimistic 85% accuracy (which is far from the truth for compressed telephony audio), someone must review the results and fix the embarrassing errors. I believe that most errors are OK, but some (which I call embarrassing) may drive subscribers away because they are embarrassed by the messages (or may cause a lot of embarrassment to the service provider). So until the system's accuracy is XXX%, humans will still be required. So what is the required XXX accuracy? I will leave this to others to comment.
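To get a feel for what an accuracy figure means per message, here is a rough sketch. It rests on assumptions that are mine and not from any vendor: the accuracy is per word, errors are independent of each other, and a 20-second voicemail holds about 50 words (roughly 150 words per minute). Real recognition errors tend to cluster, so treat this only as a back-of-the-envelope view.

```python
WORDS_PER_MESSAGE = 50  # assumption: ~20 seconds of speech at ~150 words/minute


def expected_errors(word_accuracy, words=WORDS_PER_MESSAGE):
    """Average number of wrong words in one message."""
    return (1 - word_accuracy) * words


def prob_error_free(word_accuracy, words=WORDS_PER_MESSAGE):
    """Chance that a whole message has no wrong word (independent errors assumed)."""
    return word_accuracy ** words


for accuracy in (0.85, 0.95, 0.99):
    print(
        f"accuracy {accuracy:.0%}: "
        f"{expected_errors(accuracy):.1f} expected word errors, "
        f"{prob_error_free(accuracy):.2%} of messages error-free"
    )
```

Under these assumptions, 85% accuracy means about 7 to 8 wrong words in a typical message, so nearly every message would need a human to look at it.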