Is the Babel Fish era of artificial speech intelligence finally here?



The world is more connected than ever before – we can FaceTime family and friends across the world in an instant, or jump on a plane and be on a different continent in a matter of hours. What has remained more challenging, however, is communicating in a language we don’t speak. But is this all about to change?

The idea of artificial speech translation has been around for a long time, thanks to Douglas Adams, who created a creature known as the Babel fish in The Hitchhiker’s Guide to the Galaxy:

"The Babel fish is small, yellow, leech-like - and probably the oddest thing in the universe. It feeds on brain wave energy, absorbing all unconscious frequencies and then excreting telepathically a matrix formed from the conscious frequencies and nerve signals picked up from the speech centres of the brain, the practical upshot of which is that if you stick one in your ear, you can instantly understand anything said to you in any form of language: the speech you hear decodes the brain wave matrix.”

The Babel fish came to represent one of those devices that technology enthusiasts dream of long before they become practically realisable, like portable voice communicators and TVs flat enough to hang on walls: a thing that ought to exist one day.

The technology was first proposed in 1978 – the same year The Hitchhiker’s Guide to the Galaxy first appeared – by a professor of computer science at Carnegie Mellon University in Pittsburgh, who presented the idea at the Massachusetts Institute of Technology. But it wasn’t until 1991 that he launched the very first speech translation system, with a 500-word vocabulary. The system ran on large workstations: it would listen to the words being spoken, convert them to text and then translate them into the desired language, a process that took several minutes.

Since then we’ve come a long way, with devices that now resemble the Babel fish thanks to a wave of advances in artificial translation and voice recognition.

Now, when a translation engine listens to words being spoken, it attempts to identify both the language it is hearing and what is being said. The sound waveforms are analysed to work out which parts correspond to which words, and a translation is built up as the speaker talks. The engine then converts what it thinks it has heard into what it believes to be natural speech in the destination language – and all of this happens in just a couple of seconds.
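To make that pipeline a little more concrete, the sketch below outlines the three stages – recognise, translate, speak – in Python. The helper functions transcribe, translate_text and synthesise_speech are hypothetical stand-ins for real speech-recognition, machine-translation and text-to-speech services, not any particular vendor’s API.

```python
# A minimal sketch of a speech-to-speech translation pipeline.
# The three helpers are hypothetical placeholders for real speech-recognition,
# machine-translation and text-to-speech services.

def transcribe(audio_waveform: bytes) -> tuple[str, str]:
    """Detect the source language and convert the speech to text."""
    raise NotImplementedError("plug in a speech-recognition service here")

def translate_text(text: str, source_lang: str, target_lang: str) -> str:
    """Translate the recognised text into the destination language."""
    raise NotImplementedError("plug in a machine-translation service here")

def synthesise_speech(text: str, lang: str) -> bytes:
    """Render the translated text as spoken audio."""
    raise NotImplementedError("plug in a text-to-speech service here")

def speech_to_speech(audio_waveform: bytes, target_lang: str) -> bytes:
    # 1. Listen: identify the language and work out what is being said.
    text, source_lang = transcribe(audio_waveform)
    # 2. Translate: convert the recognised text into the target language.
    translated = translate_text(text, source_lang, target_lang)
    # 3. Speak: synthesise the translation as audio in the target language.
    return synthesise_speech(translated, lang=target_lang)
```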

To achieve this, a combination of different machine intelligence technologies is used: pattern matching software to identify sounds, neural networks and deep learning to identify “long-term dependencies” and predict what is being said, and encoders to process all this information. The task is supported by databases of common words, meanings and information learned from previous analyses of millions of documents.
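To give a flavour of the neural network part, the toy model below sketches the encoder-decoder idea in PyTorch: an encoder compresses the source sentence into hidden states, and a decoder predicts the translation one word at a time, feeding each guess back in. The vocabulary sizes, GRU layers and greedy decoding loop are illustrative simplifications, not a description of any production translation engine.

```python
# A toy encoder-decoder translation model in PyTorch (illustrative only).
import torch
import torch.nn as nn

class TinyTranslator(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        # The encoder compresses the source sentence into hidden states;
        # the recurrent layers are what capture long-term dependencies.
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, hidden = self.encoder(self.src_embed(src_ids))
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), hidden)
        return self.out(dec_out)  # scores over the target vocabulary

    @torch.no_grad()
    def greedy_translate(self, src_ids, bos_id=1, eos_id=2, max_len=20):
        """Predict target words one at a time, feeding each guess back in."""
        _, hidden = self.encoder(self.src_embed(src_ids))
        token = torch.tensor([[bos_id]])
        output = []
        for _ in range(max_len):
            dec_out, hidden = self.decoder(self.tgt_embed(token), hidden)
            token = self.out(dec_out).argmax(-1)
            if token.item() == eos_id:
                break
            output.append(token.item())
        return output

model = TinyTranslator()
fake_source = torch.randint(0, 1000, (1, 6))  # a six-token "sentence"
print(model.greedy_translate(fake_source))    # untrained, so the output is noise
```

A real system would train a far larger model of this general shape (today usually a Transformer rather than a GRU) on millions of sentence pairs – the role played by the databases of previously analysed documents mentioned above.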

This complex interplay of technologies already achieves accuracy of around 85%, with translation taking between two and five seconds. The hope is that as AI evolves, both fluency and speed will improve.

Google and Skype have been quick to explore the potential of AI in this area. Google has incorporated a translation feature into its Pixel earbuds, using Google Translate, which can also deliver voice translation via its smartphone app. In January 2019, Google also introduced interpreter mode for its home devices. If users say, “Hey Google, be my French interpreter”, the device will activate spoken translation and, on smart displays, text translation as well. Google suggests hotel check-in as a possible application – perhaps the most obvious example of a practical alternative to speaking travellers’ English, whether as a native or an additional language.

While artificial speech intelligence has come a long way, it still has some distance to go. In future, it will need to work offline in situations where internet access isn’t available. It will also need to cope with physical challenges such as background noise.

However, there is a chance that some important limitations will never be overcome. Language is closely tied to human emotion: subtleties of tone and voice change both the impact and the meaning of what is said, and, as things stand today, language cannot be separated from culture. There is also the challenge that, as humans, we are taught to be socially aware: over many years we acquire manners and learn how to address people appropriately.

Etiquette-sensitive artificial translators, if built, could relieve people of the need to be aware of differing cultural norms. They would make interaction easier while arguably reducing genuine understanding between cultures. At the same time, they might help to preserve local customs. But it remains to be seen whether the technology can reach this level of complexity.

But while limitations remain, as machine learning and networking technologies improve, it seems likely that organisations will begin to make more use of these real-time language translation tools, which may help them to unlock fresh opportunities and build new revenue channels.

Look out for our next blog in which we’ll explore the different use cases for AI-driven translation technology.

 

Posted by Helen Thomas