In the age of the internet, people are being drawn closer and closer: you can Snapchat your friend from Turkey, video call your parents on their fancy vacation, or send a quick text to your old pen pal (now your new keyboard pal) in Japan.
But as the world is drawn closer together, our attention spans are becoming more and more commodified. We spend hours scrolling through Instagram, while spending less time engaging with each other directly.
Ironically, artificial intelligence is now changing that.
In March of 2021, Google unveiled its Live Caption feature for the Chrome browser. Live Caption uses machine learning to create closed captions in real time for any video or audio clip, giving deaf and hard-of-hearing individuals greater access to internet content.
In the past, and still today, closed captions were either prepared in advance for a video or typed by a stenographer in near real time for broadcast on television. However, in places where captioning isn’t the “norm,” such as on apps like Instagram or TikTok, captions are almost impossible to find. Live Caption changes this: with a few taps on the screen, any user can have instantaneous, accurate captions that broaden the reach of audio and video.
Google’s Live Caption is an application of natural language processing, or NLP. NLP is a branch of artificial intelligence that uses algorithms to facilitate an “interaction” of sorts between people and machines. NLP systems help translate human language into a form machines can work with, and often vice versa.
To understand the history of NLP, we have to go back to one of the most ingenious scientists of the modern era: Alan Turing. In 1950, Turing published “Computing Machinery and Intelligence,” which discussed the notion of sentient, thinking computers. He argued that there were no convincing arguments against the idea that machines could think like humans, and proposed the “imitation game,” now known as the Turing Test. The test offers a way to judge whether an artificial intelligence can think on its own: if it can fool a human into believing it is human with a certain probability, it can be thought of as intelligent.
From 1964 to 1966, German-American computer scientist Joseph Weizenbaum wrote an NLP program known as ELIZA. ELIZA used pattern-matching techniques to create a conversation. For example, in the DOCTOR script, if a patient told the computer, “my head hurts,” it would respond with a phrase similar to, “why does your head hurt?” ELIZA is now considered to be one of the earliest chatbots, and one of the first to fool a human in a limited type of Turing Test.
The 1980s were a major turning point in the development of NLP. Before then, NLP systems like ELIZA formed conversations by relying on a complex set of hand-written rules. The AI couldn’t “think” for itself; rather, it used “canned” responses to fit the context. When the human said something it didn’t have a response for, it would give a “non-directional” response, something like, “Tell me more about [a topic from earlier in the conversation].”
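To make the rule-based approach concrete, here is a minimal Python sketch of an ELIZA-style responder. The two rules and the fallback wording are invented for illustration and are not taken from Weizenbaum's original DOCTOR script; the function name respond is likewise hypothetical.

```python
import re

# Hypothetical rules for illustration; NOT Weizenbaum's original DOCTOR script.
# Each rule pairs a pattern with a reply template that reuses the captured words.
RULES = [
    (re.compile(r"my (.+) hurts", re.IGNORECASE), "Why does your {0} hurt?"),
    (re.compile(r"i feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
]


def respond(message, last_topic=None):
    """Return (reply, topic): the first matching rule's reply, else a canned fallback."""
    text = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            topic = match.group(1)
            return template.format(topic), topic
    # No rule matched: give a "non-directional" reply that reuses an earlier topic.
    if last_topic:
        return f"Tell me more about your {last_topic}.", last_topic
    return "Please go on.", last_topic


reply, topic = respond("My head hurts.")
print(reply)  # Why does your head hurt?
reply, topic = respond("The weather is nice.", topic)
print(reply)  # Tell me more about your head.
```

Nothing here understands the sentence. The program only copies the patient's own words into a template, which is why such systems break down as soon as the input falls outside their rules.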
By the late 1980s, NLP research had shifted toward statistical models, which let systems process language based on probabilities learned from data rather than hand-written rules.
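As a rough illustration of what "based on probability" can mean, the Python sketch below builds a bigram model: it counts which word follows which in a tiny corpus and turns the counts into probabilities. The corpus and function names are made up for this example, and real statistical systems of the era were far more elaborate.

```python
from collections import Counter, defaultdict

# Toy corpus invented for illustration only.
CORPUS = [
    "my head hurts",
    "my head is fine",
    "my head hurts today",
]


def train_bigrams(sentences):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, word):
    """Convert the follow-up counts for `word` into probabilities."""
    total = sum(counts[word].values())
    if total == 0:
        return {}
    return {nxt: n / total for nxt, n in counts[word].items()}


counts = train_bigrams(CORPUS)
probs = next_word_probs(counts, "head")
print(max(probs, key=probs.get))  # hurts
```

Instead of a rule written by a person, the most likely continuation falls out of the data: "hurts" follows "head" in two of the three example sentences.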
Modern speech-processing systems combine a few common tasks, such as speech recognition, sound recognition, language identification, and diarization, which distinguishes between speakers. Google’s Live Caption system uses three deep learning models to form the captions: a recurrent neural network (RNN) for speech recognition, a text-based RNN to predict punctuation, and a convolutional neural network (CNN) to classify sound events. The outputs of these three models are combined to form the caption track, complete with applause captions and music captions.
When speech is detected in an audio or video clip, the Automatic Speech Recognition (ASR) RNN is turned on, allowing the device to start transcribing the words into text. When the speech stops, for example when music is playing instead, the ASR stops running to conserve battery, and the [music] label appears in the caption.
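The gating idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Google's implementation: the frame labels stand in for the sound-event classifier's output, transcribe stands in for the ASR model, and the function name caption_stream is hypothetical.

```python
def caption_stream(frames, transcribe):
    """Yield caption pieces, calling `transcribe` only on speech frames.

    `frames` is an iterable of (label, audio) pairs, where `label` stands in
    for the sound-event classifier's output ("speech", "music", ...).
    `transcribe` stands in for the ASR model.
    """
    previous = None
    for label, audio in frames:
        if label == "speech":
            yield transcribe(audio)
        elif label != previous:
            # Non-speech event: skip the ASR and show the label once per run.
            yield f"[{label}]"
        previous = label


# Fake data: the "audio" is already text so the example runs without a model.
frames = [
    ("speech", "hello there"),
    ("music", None),
    ("music", None),
    ("speech", "welcome back"),
]
captions = list(caption_stream(frames, transcribe=lambda audio: audio))
print(captions)  # ['hello there', '[music]', 'welcome back']
```

The expensive transcription step never runs on the music frames, which is the source of the battery savings described above.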
As the speech is transcribed into a caption, punctuation is predicted on the most recently completed sentence. The punctuation keeps being adjusted until new ASR results no longer change that sentence’s text, at which point the punctuated result is kept.
Right now, Live Caption can only create captions for English speech, but it’s constantly being improved and will someday expand to other languages. Early versions of Spanish, German, and Portuguese captioning are currently available on Google Meet.
Accessibility-centered NLP isn’t limited to creating captions. Another Google project, Project Euphonia, is using NLP to help individuals with atypical speech or speech impediments be better understood by speech recognition software. Project Euphonia collects 300 to 1,500 recorded phrases from volunteers with a speech impediment. These audio samples can then be “fed” to speech recognition models to train them on a variety of speech impairments. Additionally, the program creates simplified voice systems that can use facial tracking or simple sounds to signal different actions, like turning on a light or playing a certain song.
One of Google’s newest ASR tools seeks to change the way we interact with others around us, broadening the scope of where, and with whom, we can communicate. Google’s Interpreter Mode uses ASR to identify what you are saying and produces a translation in another language, enabling a conversation between people who don’t share a language and knocking down language barriers. Similar instant-translate tech has also been used by SayHi, which allows users to control how quickly or slowly the translation is spoken.
There are still a few issues with ASR systems. Machines sometimes have difficulty understanding individuals with strong accents or dialects, a problem often called the AI accent gap. Right now, this is being tackled on a case-by-case basis: scientists tend to use a “single accent” model, in which different algorithms are designed for different dialects or accents. For example, some companies have been experimenting with using separate ASR systems for recognizing Mexican Spanish versus the Spanish spoken in Spain.
Ultimately, many of these ASR systems reflect a degree of implicit bias. In the United States, African-American Vernacular English (AAVE) is a widely spoken dialect of English, most commonly spoken by African-Americans. However, multiple studies have found significant racial disparities in the average word error rate across different ASR systems, with one study finding the average word error rate for Black speakers to be almost twice that of white speakers in ASR programs from Amazon, Apple, Google, IBM, and Microsoft.
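Word error rate is the standard yardstick in comparisons like these: the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the correct one, divided by the number of words in the correct transcript. The Python sketch below shows one common way to compute it; the function name and example sentences are made up for illustration and are not drawn from the studies mentioned above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: a reference word is missing
                curr[j - 1] + 1,     # insertion: an extra word was transcribed
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five gives an error rate of 0.2.
print(word_error_rate("turn on the kitchen light", "turn on a kitchen light"))  # 0.2
```

A system with a word error rate of 0.2 gets roughly one word in five wrong, so a rate twice as high for one group of speakers is the difference between a usable transcript and a frustrating one.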
Going forward, creating more diverse training data for AI that includes regional accents, dialects, and slang can help reduce disparities in the accuracy of ASR across races and ethnicities.
Technology has incredible potential to bring people together, but when people are left out, whether as a result of disabilities, race, ethnicity, or otherwise, it can be a divisive and isolating force. Thanks to natural language processing, we’re starting to fill in these gaps between people to build a more accessible future.