The Future of AI Speech
In this episode of The Future Of, host Jeff Dance is joined by Dr. Yossi Keshet, Chief Scientist at aiOla, to discuss the transformative role of AI in speech recognition and synthesis. They explore recent advancements, the impact of cultural nuances in language, potential applications in healthcare, and ethical concerns surrounding AI speech technology.
Jeff Dance: In this episode of The Future Of, I’m joined by Dr. Yossi Keshet, Chief Scientist at aiOla, a leader in capturing spoken data to create workflows for blue-collar industries. Yossi is also an award-winning scholar with over a hundred research papers on automatic speech recognition and speech synthesis. He has a background as an associate professor of electrical and computer engineering and director of the Speech, Language, and Deep Learning lab at the Technion, the Israel Institute of Technology. We’re here today to explore the future of AI speech with Dr. Yossi. Yossi, you said that your research interests are driven by a passion for understanding and quantifying speech. Can you share with us more about your journey and how you landed in your profession?
Yossi: So it happened like many others in my field. My first job was processing speech at a company called Varient, which was involved in creating digital voicemails, like the ones we used to have on our phones. I don’t think we use them anymore, but it was a great technology in the ’90s. I wrote the speech compression for them, and I really liked that. It was the first time a computer program was interacting with real-life audio.
Jeff Dance: Fascinating! We’re excited to go deeper. Tell us a little bit more about what you do for fun.
Yossi: For fun, I read books and spend time with my kids, who do a lot of sports. We travel a lot to different countries. I enjoy music as well; I used to listen to classical and opera, and now I listen to hip hop, which is amazing.
Jeff Dance: That’s fun to hear! I’ve had a similar journey myself. In the past, we have talked about The Future of Conversational AI and Voice AI. Our topic today is a little different. What do we mean when we say we’re going to talk about the future of AI speech? Tell us more about what that means.
Yossi: We’re focusing mainly on two applications: text-to-speech and speech-to-text. Speech-to-text is also known as automatic speech recognition, or simply speech recognition. The other application, text-to-speech, is speech synthesis, which now sounds very natural and can mimic voices.
Jeff Dance: Thank you. What are some of the most recent developments in AI speech that have led to this performance today?
Yossi: Let me clarify. ChatGPT from OpenAI is unbelievable for text processing. Another model from OpenAI called Whisper is also remarkable, but it’s important to note that OpenAI didn’t invent the underlying architecture; they trained an existing design on a huge amount of data. It’s based on a well-known model called the Transformer, which is partly the same architecture used for ChatGPT.
Jeff Dance: Whisper?
Yossi: Yes, it has two components: one that processes the input speech and one that produces the text. It was trained on roughly 680,000 hours of transcribed speech, that is, speech paired with its transcriptions. That scale of training had never been done before; we used to train models on just thousands of hours. The amount of data behind it has resulted in performance that is unbelievable. We’ve done comparisons where we tested American speakers, referred to as L1 speakers, along with L2 speakers, like Korean and Japanese individuals speaking American English, under various noise conditions. We compared the performance of human listeners against Whisper’s, and Whisper slightly outperformed the human listeners. That’s remarkable. This refers to read speech, not spontaneous or conversational speech, but it’s almost there.
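For readers who want to experiment with the model Yossi mentions, here is a minimal sketch using the open-source openai-whisper package; the audio file name is a placeholder.

```python
# Minimal sketch: transcribing an audio file with the open-source openai-whisper
# package (pip install openai-whisper). The file name below is a placeholder.
import whisper

# Load one of the released checkpoints; larger models are more accurate but slower.
model = whisper.load_model("small")

# Transcribe a (hypothetical) recording; Whisper detects the language
# automatically unless one is specified.
result = model.transcribe("meeting_recording.wav")
print(result["text"])
```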
Jeff Dance: That’s amazing! So basically, our conference call notes that are automatically transcribed are going to get better and better. As we communicate with people around the world, even if we miss some words due to accents, these models can aid us and provide accurate meeting notes.
Yossi: I have to tell you something that reminds me of a talk given by Dr. David Nahamoo of IBM’s speech group in 2012. He, along with Michael Picheny, discussed performance that had around 20% error, meaning two incorrect words in a ten-word sentence. Dr. Nahamoo predicted that ten years later we would achieve superhuman performance, and at that time I disagreed completely. I thought it was impossible, but here we are, seeing significant advancements.
Jeff Dance: That’s incredible! If we look at AI and human-level performance, back around 2017, before Transformers took hold, projections put human-level performance in the 2040s, 2050s, or even 2060s across various criteria. However, in recent years, all those projections have been pulled forward, not just in speech but across many domains. We’re measuring against human performance benchmarks, which is our standard, right?
For you, it must feel like a culmination of your career to see things accelerate so rapidly in the last few years. Amazing, right?
Yossi: Yes, it’s an amazing time to do research. We waited so long from 2017 to now to see this progress. OpenAI is expected to announce even better models shortly that outperform the current versions. There are still gaps we need to address, though. For example, at aiOla, we’re aiming to transcribe speech for specific domains. If you’re a medical doctor, you use specialized terminology during surgeries that requires accurate transcription, and models like Whisper struggle with that jargon. The model needs to be fine-tuned to cater to specific fields, such as aviation or manufacturing, where unique jargon exists.
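As a rough illustration of the jargon problem, one lightweight workaround with the open-source openai-whisper package is to bias the decoder with an initial prompt containing domain terms. This is prompt biasing, not the fine-tuning on in-domain transcribed audio that Yossi describes, and the terms and file name below are invented placeholders.

```python
# Sketch: nudging a general Whisper model toward domain vocabulary by seeding
# the decoder with an initial prompt. The aviation terms and file name are
# illustrative placeholders, not a real checklist.
import whisper

model = whisper.load_model("small")

aviation_terms = "pitot tube, aileron, empennage, NOTAM, squawk code, V1, VR"
result = model.transcribe(
    "preflight_check.wav",
    initial_prompt=f"Pre-flight inspection report. Terms used: {aviation_terms}.",
)
print(result["text"])
```

For genuinely specialized fields, the heavier path is fine-tuning the model on in-domain transcribed audio, which is closer to what Yossi describes.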
Jeff Dance: Right, every country, industry, and profession has its own jargon, and general models might not capture that uniqueness.
Yossi: Exactly. In aviation, for example, when inspectors do checks before and after flights, they can’t carry pens or anything else that might be left on the plane, which could ground it. They often have to rely on memory and verbal communication to ensure nothing is overlooked. In fields like manufacturing and nanotechnology, professionals are often fully covered in masks and gloves, which limits their ability to write things down. And the reporting in these environments frequently involves complex jargon, which poses a challenge.
Jeff Dance: So, there are plenty of opportunities to create specialized solutions. We have some amazing superhuman models now, but as we think about diverse use cases for text-to-speech and speech-to-text, there’s still a lot of room for growth.
Yossi: Absolutely. There’s still a gap in automatic speech recognition under heavy noise, reverberation, and echo. We also aim to recover voices for people who are nonverbal or have difficulty speaking, and to help them analyze their own speech.
To give you another example from my research: there are children around the ages of four to six who have pronunciation difficulties. They may replace the “R” sound with “L” or “Y.” Typically, they would see a speech therapist, who would tell them they’re mispronouncing words. Often, it’s just a matter of mapping the sounds differently in their brains. When children realize this, it leads to successful speech therapy.
In our research, we generate a model of the child’s voice with the pronunciation issues, as well as a version without them. This allows children to hear their own voice pronouncing the correct and incorrect “R,” giving them perfect feedback. We’re also developing avatars so they can visualize correct mouth and teeth placement based on sound. This technology can bridge speech therapy gaps in locations that lack such resources, like in Africa or India.
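To make the idea concrete, here is a rough sketch of the "hear your own voice saying it correctly" feedback loop using an off-the-shelf voice-cloning model (Coqui TTS’s XTTS v2). This is not the lab’s actual system: the audio paths are placeholders, and a real therapy tool would model the child’s specific substitution, such as /r/ becoming /l/, far more carefully.

```python
# Sketch of correct-pronunciation feedback in the child's own (cloned) voice,
# using the Coqui TTS package and its XTTS v2 model. Paths are placeholders.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the child's voice from a short reference clip and synthesize the target
# word with correct pronunciation, so the feedback sounds like "them".
tts.tts_to_file(
    text="rabbit",
    speaker_wav="child_reference_clip.wav",
    language="en",
    file_path="rabbit_correct_in_childs_voice.wav",
)
```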
Jeff Dance: Right, right, in areas where they don’t have the funds or resources for specialized services. How many kids go to speech therapy? Do you know what the statistic is in the US or in other developed countries?
Yossi: In the US, it’s somewhere between 10% and 30%. That’s significant. Additionally, I think the average waiting time for speech therapy in the US is around seven weeks.
Jeff Dance: So, 10% to 30% of kids are going to speech therapy. That’s huge! And it makes sense, because there’s such a huge demand. Having these tools that help kids hear their own language and solve their own problems with technology sounds really transformational in that particular industry. That’s pretty exciting. Can you tell us more about who the biggest players in this space are? Why do you think large companies are investing here? Is it a fundamental component of technology?
Yossi: They’re either fascinated by this modality of speech, like me, or they find it essential. For example, Amazon drives their Echo and Alexa devices through speech, and they have a fantastic team. Meta has a Facebook AI Research (FAIR) group focused on speech recognition, speech synthesis, and music synthesis. Google has a great team as well; they drive their devices with voice.
Jeff Dance: It’s crazy. It seems like everything’s becoming a vector database these days. What was novel not long ago is now almost ubiquitous for software developers. We work with neural networks over these vector databases. Our CTO even helped write a paper that’s been cited around 40 times, discussing how to vectorize information, which is fundamental to advancements in AI. It’s astonishing how quickly this technology has proliferated across many industries. We even have entire conferences focused on generative AI in robotics, yet here we are, discussing how it’s changing your industry, as well as nearly every other industry.
Yossi: I have to say, it’s amazing that this shift is happening in speech. It used to occur only in machine learning, vision, and NLP (natural language processing), but speech never saw this level of open collaboration. For years, we never released source code or databases for free; it was very commercial. Only recently—perhaps in the last five years—have we started releasing code and papers written in a way that allows for reproducible results. That openness is vital for humanity to advance.
Jeff Dance: Right. What are some of the bigger problems that AI speech can solve?
Yossi: Looking at it broadly, language changes approximately every 100 years. You might find it hard to communicate with your grandfather due to new words, pronunciation changes, or speaking too fast. This is a significant challenge because AI has been trained primarily on American English data from the 1980s, perhaps even the 1990s. We also lack sufficient data from other languages, and even when we have that data, language encapsulates the culture of a country or group of people. It’s not just about the language; it’s also the nuances of how it is spoken. For some people, one language can sound violent while another may sound too soft. This cultural aspect is crucial to consider in speech synthesis and automatic speech recognition. We tend to focus solely on data collection, but we’ll eventually face far more difficult scenarios without a unified understanding of language technology.
Jeff Dance: Right.
Yossi: We also have challenges with noise and accents, but I believe the larger challenges lie in how different cultures and ways of speaking express themselves. For example, Japanese has a different rhythm, while Italian has its own nuances, as does American English.
Jeff Dance: That makes sense. Let’s transition to the future a bit. You mentioned that 10 years ago, you didn’t think we would achieve human-level performance. Now that we have these fundamental building blocks accelerating everything, where do you think things will go? What are your perspectives on the future?
Yossi: I can only estimate. We need better algorithms because we’ve exhausted our current data sources.
Jeff Dance: That’s okay.
Yossi: In speech, we’re not there yet. We need much more spoken data, especially in varied conditions—data that isn’t just read speech or standard. We need more conversational speech data, and that would be interesting because it could provide a huge boost to our transcription capabilities.
Furthermore, we could explore using speech to detect various human conditions. Research has shown that we can detect social anxiety or different types of schizophrenia through speech processing. For example, schizophrenia can lead to motor dysfunction, and we may be able to detect it through speech patterns. I believe we can also identify the onset of Parkinson’s disease and other motor-related diseases before they become evident. We could potentially utilize household devices like Amazon Alexa to detect childhood asthma attacks by analyzing the child’s speech environment.
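The general recipe behind such speech-based screening can be sketched in a few lines: extract acoustic features from labeled clips and fit a classifier. Everything below (file names, labels) is illustrative, and real clinical work requires validated datasets and far richer features.

```python
# Toy sketch of speech-based screening: summarize each clip with acoustic
# features, then train a simple classifier on labeled examples.
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression

def clip_features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Represent each clip by the mean and variability of its MFCCs.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labeled clips: 1 = condition present, 0 = control.
clips = [("clip_001.wav", 1), ("clip_002.wav", 0), ("clip_003.wav", 1)]
X = np.stack([clip_features(path) for path, _ in clips])
y = np.array([label for _, label in clips])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X)[:, 1])  # screening scores for the training clips
```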
Jeff Dance: Nice! So, as we think about the future, more data will significantly improve our accuracy in various fields.
Yossi: Yes, particularly for diseases or conditions that have motor evidence. Depression, for instance, is much more complex. Severe depression may cause a person to speak slowly, while depression without any motor manifestation may not show detectable signs in speech. Speech patterns can sometimes reflect unconscious behavior. Parameters such as the duration of stop consonants can indicate underlying conditions.
For example, if your parents speak Brazilian Portuguese but you live in America and speak fluent American English, various studies suggest that your stop consonants will shift in duration based on the phonetics of each language. Research has shown that the unique characteristics of each language affect this speech evidence, and it’s only the beginning of such analyses.
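As a small illustration of measuring a parameter like stop-consonant duration, the sketch below summarizes phone durations from a hypothetical forced-alignment export with columns phone, start, and end (in seconds).

```python
# Illustrative sketch: summarizing stop-consonant durations from a phone-level
# alignment. The CSV format is hypothetical (e.g., exported from a forced
# aligner) with columns: phone, start, end.
import pandas as pd

STOPS = {"p", "b", "t", "d", "k", "g"}

alignment = pd.read_csv("phone_alignment.csv")  # placeholder path
alignment["duration"] = alignment["end"] - alignment["start"]

stops = alignment[alignment["phone"].str.lower().isin(STOPS)]
print(stops.groupby("phone")["duration"].describe()[["count", "mean", "std"]])
```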
Jeff Dance: So we have this great foundation to keep narrowing our focus and getting more accurate around these specifics. What concerns do you have as we look to the future?
Yossi: One major concern is deep fakes or deep voice fakes—training systems to mimic someone’s voice can lead us into a realm where facts may become indistinguishable from fiction. Currently, synthetic voices aren’t quite as convincing as the originals, but as technology advances, someone could create an impressive synthetic voice given enough data about the target person. This presents significant ethical dilemmas, as at some point we may not be able to detect these fakes automatically.
Another broader concern revolves around AI regulation. Currently, I’m writing a book to redefine what the regulations for AI should look like. My assertion is that we need a moral infrastructure within AI systems. AI will be connected to the real world and may even develop its own algorithms to improve itself, which raises huge concerns.
I believe we need a moral framework that goes beyond traditional laws because laws, as defined by Isaac Asimov in I, Robot, cannot be adhered to by intelligent AI systems. We must develop an infrastructure of laws that can humanize AI.
Humans have developed mechanisms, in line with Freud’s framework of the id (basic desires), ego (social communication), and superego (internalized moral constraints), that enable us to coexist within a community. We must consider embedding a similar ethical structure in AI systems, ensuring they uphold societal values and constraints through their operating principles.
This structure could also address liability issues. For example, in the event of an accident involving an autonomous vehicle, the question of fault is complex. While it might be blamed on the manufacturer, responsibility often lies with the vehicle’s owner. An effective moral infrastructure could provide clarity on ownership responsibility, creating a more socially accountable system.
Jeff Dance: Thank you. Those are profound insights. This is a time of deep reflection as advancements unfold rapidly, prompting us to rethink technology. As I’ve mentioned before, technology has a life of its own, so as we design the future, we must consider how to facilitate positive outcomes.
I see AI as inherently artificial; it cannot surpass its programming. However, the key questions are about what programming and controls are built in. Are we, for example, telling a drone that it cannot shoot anyone? What safeguards do we incorporate? I believe we must always retain some level of human oversight.
I agree with Microsoft’s approach with Copilot, which suggests that while AI can support us, we still maintain control. The mechanisms we’re discussing are essential to create and integrate responsibly.
Do you have any additional thoughts about the future of speech AI, its direction, and the need for intentionality in its design?
Yossi: Yes, we must consider the diverse communities and various cultures globally. We don’t want to end up with a single homogenized culture in AI. Speech is a reflection of culture, and it’s vital we keep that in focus. Integrating this element into our work is just as important as ensuring regulation and security for AI systems.
Jeff Dance: That is indeed crucial. If AI reflects the data it is trained on, but that data only captures a subset of information, we risk standardizing our experiences at the expense of cultural diversity and language evolution. This poses interesting questions about what we might gain or lose as AI evolves—how it will shape our future and change society so rapidly. Thank you for sharing those insights. What advancements in AI are you most excited about?
Yossi: I’m particularly excited about medical applications—identifying medical problems and understanding the brain better. In my lab, we’ve started experiments comparing the internal processes of speech recognizers with brain function. We use fMRI to identify where speech processing occurs within the brain, from auditory reception to syntactic and semantic processing. We’re correlating these findings with the performance and mapping of models like Whisper and others. I believe this could enhance our understanding of the brain’s operations.
There’s been notable work by Charlotte Caucheteux in France, who compared brain activity with large language models (LLMs) and identified correlations between different layers of LLMs and brain functions. This research is still in its early stages, but it could be a gateway to understanding the brain through machines, or vice versa.
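The encoding-model approach Yossi describes can be illustrated, in very simplified form, by regressing brain responses onto a speech model’s hidden states. The sketch below uses Whisper’s encoder via Hugging Face transformers; the stimulus recording and the fMRI file ("voxels.npy"), assumed to be resampled to one row per encoder frame, are entirely hypothetical.

```python
# Simplified encoding-model sketch: predict (hypothetical) fMRI responses from
# Whisper encoder states with a linear model. Not the lab's actual pipeline.
import numpy as np
import torch
import librosa
from transformers import WhisperFeatureExtractor, WhisperModel
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

# Encode a (hypothetical) stimulus recording into frame-level hidden states.
audio, sr = librosa.load("stimulus.wav", sr=16000)
inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    frames = model.encoder(inputs.input_features).last_hidden_state[0].numpy()

# Hypothetical fMRI responses, already resampled to one row per encoder frame.
voxels = np.load("voxels.npy")  # shape: (n_frames, n_voxels)

X_tr, X_te, y_tr, y_te = train_test_split(frames, voxels, test_size=0.2)
encoding_model = Ridge(alpha=10.0).fit(X_tr, y_tr)
print("held-out R^2:", encoding_model.score(X_te, y_te))
```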
Jeff Dance: That’s very deep. What’s been the most rewarding aspect of being deeply involved in this space?
Yossi: The fact that it works! A computer program can have a tangible influence on the physical world. At first, the connection was limited to speech, microphones, and loudspeakers, but now it’s incredible how much capability exists.
Jeff Dance: And that capability fits in our pockets now, right? We carry computing power that transforms daily activities. That’s extraordinary. To wrap up this topic on AI speech, if someone’s listening and wants to delve deeper, who are the key figures, institutions, or publications to explore for more information?
Yossi: First, I must mention that there aren’t many good books because by the time they’re published, the field has moved on. Institutions like Johns Hopkins University, Carnegie Mellon University, and the University of Southern California are excellent sources. Meta, Google, and OpenAI are also at the forefront of research. Cambridge University in the UK is notable, as are several European institutions like the University of Edinburgh.
Jeff Dance: Understood. These places are where the focus is on speech AI, alongside the big tech companies.
Yossi: Yes, absolutely.
Jeff Dance: Thank you! Do you have any parting words for our listeners that you would like to share?
Yossi: Yes, be cautious about what you believe; not all information is factual. We must be diligent about verifying sources, leaning toward a more scientific approach. We should also embrace the humanities. We need philosophers and literary figures to explore the philosophy of AI and its moral implications.
Jeff Dance: Thank you! It’s been a pleasure to have you. We appreciate your research, your leadership in speech recognition, synthesis, and AI, as well as your insights about the future. I’m looking forward to following your work, especially as you write your book and explore the broader concerns of AI and how we reshape and design the future with intention.
Yossi: Thanks so much for having me. It’s been great!
Jeff Dance: Thank you!