Audiobooks, podcasts and voice assistants: how voice is revolutionizing technology

If we had said to a man of the early 19th century, “one day you will be able to turn on a light bulb with your voice”, he would probably have thought us crazy. In fact, although it is humanity's most primal instrument of communication, the voice long played a role that was far from central to technology, at least until a few years ago. Then came the boom: podcasts, voice assistants, dedicated social networks, the ability to send audio messages, and artificial intelligences capable of recognizing and transcribing spoken words. But how did this vocal revolution come about? And where will it take us?

We could place the year zero of the vocal revolution, much as certain religious traditions do, in 2011: the year Siri was born (on the iPhone 4S). Before that, the voice was used very rarely on the web and in what we would learn to call the IoT (Internet of Things). The vocal medium was mostly relegated to digital radio, music and the accompaniment to video content: in short, elements already present in everyday life outside the network. Above all, it was a passive medium. It was essentially the machine that communicated with us, or at most allowed us to communicate with other users. There was therefore no man-machine dialogue.

Voice and technology: historical notes

Surprisingly, the first use of the voice as a tool to communicate with a machine dates back to 1911, with Radio Rex. It was essentially a toy consisting of a mechanical dog and a kennel. By calling the dog by name (Rex, in fact), it was possible to make him come out of the kennel briefly. After the Second World War, several researchers, especially at IBM, began to study voice technology, though without notable results in terms of implementation. Perhaps only the Picturephone, in 1962, is worth noting: the first device for video calls in the world. However, it was too cumbersome and expensive to succeed (15 minutes of “videocall” cost about 80 dollars).

Video calls, which in any case kept the voice tied to visual content and remained a form of communication between human beings only, reached their peak with the arrival of the new millennium, with apps such as MSN Messenger and Skype. The date is also significant for the evolution of the web as a medium. With the first internet (web 1.0), the network was made up mostly of static pages meant for reading. The 2000s then paved the way (with web 2.0) for the era of hyperconnectivity. And so we come to the famous year zero: 2011.

In just 11 years, the voice has assumed, albeit gradually, a predominant role in the use of the network and of technology. The primitive and limited functionality of the first Siri (at most it could check the weather and start calls) became the basis for the new voice learning technology that would make the fortune of the first Amazon devices. Indeed, the first physical voice assistant arrived in 2014: Echo. Thus Alexa was born, which today lives in the homes of millions of users around the world.

If Siri opened the doors to a new era of virtual assistants in 2011, two years later WhatsApp reinvented the concept of messaging with the introduction of instant voice messages.

The importance of the voice as a means of expression

But why do human beings prefer to use their voice? The simplest answer would be “for convenience and immediacy”. That answer is correct, but in reality there is much more behind it. As any good actor can teach us, the voice is a unique tool for conveying emotions and messages through nuances, inflections and intentions.

Of course, with an emoticon we can emphasize a mood, but it is only with the voice that we can really empathize. At the root of the success of voice messaging is precisely this primal need to convey emotions, rather than stories and anecdotes: the container that enhances the content. Being able to create a sort of empathy with the machine, albeit virtual and fictitious, reassures users and makes them feel understood.

The Voice Revolution: Podcasts, Clubhouse, Twitter Spaces and Audiobooks

Setting aside the man-machine relationship and focusing on the voice as a tool of the web for conveying messages, we can see how the last few years have been the years of the voice. Technology that is available always and everywhere, within reach of your pocket, has favored the creation of new formats, such as podcasts: an advanced and smart version of slower and more complex radio programs. And then there are the new artistic forms. Think of the possibility of listening to a book read and interpreted by an actor, with an expressive intensity that conveys exactly what the author wanted to communicate: audiobooks.

At the height of the 2020 lockdown, when what we missed most was empathy and closeness with our peers, there was a boom in video app downloads. And that was not all: an app began to appear on iPhones all over the world, Clubhouse, the first social network entirely dedicated to voice discussions. No text, no photos and no video. Only and exclusively the voice.

Clubhouse’s staggering popularity waned within a year or so, but the platform had opened the door, as Siri did in 2011, to a new world. Indeed, in 2021 Twitter launched Spaces: voice chat rooms that mirror the Clubhouse concept. And it is ironic that this was done by the same platform that has always favored short texts, with its famous limit of 280 characters.

The voice as a form of accessibility: speech recognition

As often happens in the tech field, however, a new technology becomes the basis for the next one. Advanced artificial intelligences (AI) capable of understanding human messages have revolutionized many aspects of daily life, including accessibility.

Speech recognition technologies are now able to recognize a voice message and transcribe it. This is a tool that can break down many communication barriers, especially for deaf people. In this sense, the range of possibilities in terms of accessibility extends to the whole of a perpetually connected daily life: from messaging to video games, by way of literary content (just think, as above, of audiobooks).

Recently the University of Illinois (UIUC) started a collaboration with tech giants such as Amazon, Apple, Google, Meta and Microsoft. The partnership is based on a project called Speech Accessibility, which aims to improve speech recognition for users with disabilities. Speech Accessibility is aimed at people with Lou Gehrig’s disease (ALS), Parkinson’s, cerebral palsy, Down’s syndrome and other conditions that could limit users’ communication skills.

The present and the future of voice in technology

Currently, speech recognition technologies allow various operations that simplify users’ lives. The range of possibilities encompasses both productivity and leisure. It is now possible to change radio stations in your car without taking your hands off the wheel, or search for studios without having to type in keywords manually. And then there are smart homes: a single word is enough, for example, to operate vacuum cleaners, house lights, dishwashers and the entire connected ecosystem. But what are the possibilities for the future?

The recent news about Google Assistant confirms that tech trends will continue to move in an increasingly vocal direction. Voice learning AI may soon take hold in the film industry as well (which is actually already happening with deepfakes).

It will then be interesting to see whether and how machines will be able to understand the nuances of the voice. An advanced technology of this kind might be able to detect the user’s moods, and even provide psychological support. Robots could then come to empathize with humans.

Last but not least, there is the metaverse: an essential player when it comes to the future. The possibilities are almost endless for those who are able to seize them. Or rather, for those who are able to make their voices heard.