Sunday, April 27, 2008

What is English

A few recent news items remind me again that we constantly refer to a language by its name (English, Spanish, etc.) but forget that it is constantly evolving and that it depends on context and location.

SpinVox announced their VoxGeist. After analyzing more than 50M voicemails (wow: assuming 20 seconds on average per voicemail, this is more than 270 thousand hours of transcribed audio, which is the largest database of transcribed voicemails), SpinVox can analyze the dynamics of spoken voicemail language much better than other companies. Although voicemails are different from conversations, according to SpinVox there is a clear penetration of slang words that are not part of a standard dictionary (or, in speech reco jargon, not part of the standard language model).
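The estimate in the parentheses above is easy to check. The sketch below uses only the two figures from this post; the 20-second average is my own guess, not a number published by SpinVox, and the function name is just for illustration.

```python
# Back-of-the-envelope check of the audio volume quoted above.
# VOICEMAILS is the figure from the announcement; the 20-second
# average length is an assumption made in this post, not a measured value.
VOICEMAILS = 50_000_000
AVG_SECONDS_PER_VOICEMAIL = 20


def total_audio_hours(count: int, avg_seconds: float) -> float:
    """Convert a number of recordings and an average length into hours."""
    return count * avg_seconds / 3600


hours = total_audio_hours(VOICEMAILS, AVG_SECONDS_PER_VOICEMAIL)
print(f"{hours:,.0f} hours")  # prints "277,778 hours"
```

So "more than 270 thousand hours" holds as long as the average voicemail really is about 20 seconds long; a shorter average scales the total down proportionally.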

The VoxGeist list of Most Referenced Slang Words and Phrases with Translations in the UK:

-- Diesel = a muscular man/woman

-- Hot Mess = person is a disaster

-- Spun = crazed

-- Busted/Hurt = extremely unattractive

-- Ice/Bling = diamonds

-- Shady = untrustworthy

-- Woot/Stoked = excited

-- One = I'm out; leaving

-- Peace out = goodbye

-- Props = respect

-- Newbie = beginner, new kid on the block

-- Blasted = to get in serious trouble.

-- Scheisty = secretive amongst others

-- Bluetool = someone who always wears a Bluetooth earpiece, even when they're not on the phone

-- Baller = wealthy

-- Crunk = crazy drunk

-- Celebutante = famous just for being wealthy

-- Schlumpadinka = a woman who lets her style go

-- Multislacking = having two or more non-work related web pages open on your work computer at one time.

-- Hola = greetings; hello

The Canadian regional dictionary, meanwhile, currently includes words such as:

1. Allophone: Someone whose first language is neither English nor French.

2. Blé d'Inde: corn on the cob

3. Bourassa: Robert Bourassa, a former premier of Quebec

4. Cabot: Famous explorer of North America

5. Coulee: means to flow

6. Keener: A brown-noser whose excessive keen-ness for the unpleasant task at hand makes the rest of us look bad.

7. Loonie: the nickname for the Canadian $1 coin

8. Mickey: A mickey is one of those curved, flat, 13-ounce bottles of booze that winos carry.

9. Nanaimo: A type of chocolate bar, originally produced in the city of Nanaimo, British Columbia. It consists of a crumb-based layer, topped by a layer of light custard or vanilla butter icing, which is covered in soft chocolate.

10. Nunavut: newest Canadian territory, next to the Northwest Territories

11. Oolochan: small ocean fish

12. Poutine: Poutine is a cholesterol-rich Canadian "delicacy" consisting of French fries covered in cheese curds and gravy. When prepared badly, it congeals in your guts like concrete.

13. Toonie: the nickname for the Canadian $2 coin

14. Tourtière: French meat pie

15. Sniggler: Someone who takes the parking spot you wanted, or is generally annoying

Apart from being great PR for SpinVox, this demonstrates how much human intelligence is required to understand messages, and how experience is key. The question is how automatic systems can learn such slang. Language models have usually been built by scanning enormous amounts of text. But how can systems learn if the texts do not include these new words? It will happen either by transcribing a lot of speech data or by waiting until spoken slang penetrates the written language. In fact, a recent study on Writing, Technology and Teens clearly indicates that not only slang but also written abbreviations (from the SMS world) are penetrating the written language. For example, 38% of teenagers say they have used text shortcuts such as “LOL” (which stands for “laugh out loud”) in school work. Even more interesting is that written abbreviations find their way back into the spoken language: "I can speak very well, but there are also times that I have been laughing and actually said LOL. So it all depends on how much you text and who you are around at the time." – 11/12th Grade Girl, Pacific Northwest City.
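To make the language-model gap above concrete, here is a minimal sketch of how one might spot slang that a text-trained vocabulary does not cover: count the words in transcribed messages that are missing from the vocabulary. Everything in it is hypothetical (the tiny vocabulary, the three transcripts, and the function names); it is not how SpinVox or any particular recognizer works.

```python
import re
from collections import Counter
from typing import Iterable, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def oov_counts(transcripts: Iterable[str], vocabulary: Set[str]) -> Counter:
    """Count out-of-vocabulary (OOV) words across all transcripts."""
    counts: Counter = Counter()
    for transcript in transcripts:
        for token in tokenize(transcript):
            if token not in vocabulary:
                counts[token] += 1
    return counts


# Hypothetical vocabulary standing in for a language model built from written text.
vocabulary = {
    "i", "am", "so", "about", "the", "game", "that", "guy",
    "is", "totally", "call", "me", "back",
}

# Hypothetical transcribed voicemails.
transcripts = [
    "I am so stoked about the game",
    "That guy is totally shady",
    "So stoked call me back",
]

oov = oov_counts(transcripts, vocabulary)
print(oov.most_common())  # prints [('stoked', 2), ('shady', 1)]
```

Words that keep showing up in such a list are candidates for adding to the language model, which is why a large archive of transcribed speech matters: the more transcripts, the sooner new slang becomes visible in the counts.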

The web is currently the best source for learning text statistics. However, no such source exists for spoken language, whether dialogs or monologues.
It would be great to see an initiative in which people contribute their transcribed voicemail archives (maybe sending them via Twitter) so that technologists around the world can enhance the capabilities of automatic speech recognition systems.
