The Future Of Voice AI

Conversational or voice AI is an artificial intelligence tech that enables people to interact with various advanced chatbots as they would do with other humans.

The technology itself is relatively new, but the progress in its adoption and implementation has been significant, similar to advances in fields such as generative AI. Conversational AI is rapidly improving and is shifting towards B2B, as there are many business opportunities.

To understand exactly how conversational AI can be applied and where it is heading in the future, in this episode of “The Future Of” Jeff is joined by Kane Simms, Founder of VUX World, Roger Kibbe, Senior Developer Evangelist at Samsung Research America, with additional insights from Kim Aspeling, Director of Creative Production at A Million Ads.

Kane Simms: There’s technologies out there that can detect your emotions. Sentiment analysis has existed for quite a long time. Voice biometrics has existed for quite a long time. Speech recognition has existed for quite a long time. It’s just that what’s happened in the last 10 years is that the accuracy level of these technologies has got to a point where they’re actually ready for production.

Jeff Dance: Welcome to The Future Of, a podcast by Fresh Consulting, where we discuss and learn about the future of different industries, markets, and technology verticals. Together we’ll chat with leaders and experts in the field and discuss how we can shape the future human experience. I’m your host, Jeff Dance. 


Welcome. It’s a pleasure to have you both with me on this episode, focused on the future of voice AI. Excited to have two serious voice leaders and evangelists who think about and work on the future. Let’s start with some introductions. Kane, if I can start with you, would you care to tell the listeners a little bit about yourself?

Kane: Indeed. Thank you for having me. It’s a pleasure to be back in the saddle alongside Roger as well. It’s been a little while. My name’s Kane Simms. I’m the founder of a consultancy called VUX World. We help organizations plan and execute customer experience strategies with a focus on artificial intelligence and namely conversational AI and voice AI which is what we’re going to talk about today.

Jeff: Awesome. I noticed you were named a top 10 voice AI influencer by Voicebot and a top 20 voice AI influencer by SoundHound. That was pretty cool.

Kane: Yes, very cool.

Jeff: It’s not been a very long time in this space, but it’s changing so fast. It’s nice to have some experts in the room that have spent probably their 10,000 hours if that’s what defines us as experts, but thank you for being here with us. Roger, tell us a little bit more about yourself.

Roger Kibbe: Well, first, Kane, nice to see you again. I’m with a legend in voice in the room, Kane Simms. I’ve been working with voice and conversational AI since 2018. I work for Samsung as a developer evangelist. What I do is talk to companies, designers, and developers about conversational AI and how they can use it to drive their business goals forward, or if it’s not a business, they’re creating a game or something fun or something useful. It’s really exploring where voice is the best user interface, and oftentimes where it is not a good user interface, and building the appropriate tooling and capabilities for the job.

Jeff: Nice. Thanks for being with us, Roger. I noticed that you are also an ambassador of the Open Voice Network working on a voice registry system, like a DNS for voice applications, and also the CEO of Voice Craft, a voice app company. Can you tell us a little bit more about those two things as well?

Roger: Yes. Open Voice Network is an industry consortium, volunteers from all walks of the industry getting together to talk about standards. If you look at the platforms, it’s pretty fractured right now. The whole idea there is, wow, we would all benefit if we talked about some standardization. It’s a working group. I’ve been focusing on that registry system part of it, but there’s a lot of things around privacy, around interoperability, et cetera.

It’s an exciting place to be. We call ourselves ambassadors for it. I like that term. Then I have a small––it’s a little on ice right now, but I built several voice applications for the different voice assistants through Voice Craft, my company, primarily things that are fun and games. It’s something at the end of the day, kick up your legs, go talk to your voice assistant and have some fun.

Jeff: It’s great to hear that you’ve had experience at developer level as a leader in one of the world’s biggest companies but also at the standard level. I think that multitude of experience is really helpful, pertinent as an expert here.

Kane: You know someone is really passionate about it when they finish work and they start building voice apps in their spare time.

Jeff: Exactly. Nice.

Roger: There you go.

Jeff: Kane, for VUX, do you guys actually offer services, does your organization offer services? Tell me a little bit more about what you guys do.

Kane: It’s a strategic consultancy. One of the biggest challenges that companies face is trying to figure out and demystify this whole landscape. If we’re going to automate some of our customer experience, how exactly do we go about doing that, how do we select the right technology, how do we put together the right teams, what do we do as far as where do we start with use cases, how do we design a good conversation, how do we implement that, what’s the best practice for improving it over time?

There’s a whole range of things that businesses struggle with and that’s what we are there to do is to help them do that. We can do things like roadmap planning, use case assessments and validation, conversation design training, a whole bunch of stuff around what it takes to, one, plan properly how to put together a right strategy, and then, two, how to go about executing that as well. I’ve been doing it now for around four and a half years in total. It’s still one of those areas where there’s just big, big gaps in knowledge and experience at the enterprise level. Companies like ours are there to try and help fill those gaps where we can.

Jeff: Awesome. I love that you have that deep expertise and that you’re not only putting on events and leading conversations. When you’re actually there in the trenches, it makes a big difference, when you actually do the work.

Kane: Definitely. We didn’t really mention that. What I haven’t mentioned is the media side, which is what most people know us for, which is the podcast, the newsletters, the articles that we write. Every single day we publish something somewhere to try and help educate and help guide people in the right direction as far as how do you do this, why should you do this, what are the challenges, how do you overcome them, who should you speak to about certain, very specific challenges, and those kinds of things. It’s almost like we’re running two companies at the minute, which is mad.

Jeff: Complicated, but awesome. It would be good to give the listeners a little bit more insight into the marketplace, the industry, and how things have evolved in the last few years given how much has changed. Can you guys give us your overview essentially of where we are today?

Roger: First of all, conversational AI and voice just keep on improving at this really dramatic rate. What I think has happened a lot in the last couple of years is there’s been a little bit of a shift in the industry focus from the consumer to the B2B side. I think this was inevitable.

There’s still strong consumer––and those are going along and doing pretty well. I think what you’re really seeing, and this is pretty exciting, is on the B2B side is businesses and companies realizing, “Hey, you know what, if I put a conversational AI voice interface in front of this, this is going to be better for my workers or my customers,” etcetera.

What I really see is innovation on that side. You call up the call center now and you’re like, “Oh, no, I’m in the dreaded IVR.” All of a sudden it works really well. What is that? That’s really conversational AI coming into place instead of those strict rule-bound old-school IVRs.

Now it’s actually understanding what you want to do quickly, and you’re not mashing the zero key to get a person because that actually works. To my mind, the biggest trend I’m seeing is definitely innovation in the B2B space. I think that’s good. There’s a saying in Silicon Valley: “B2C is where it’s sexy and B2B is where the money is.”

It’s good to be focusing on the B2B side because I think there’ll be a lot of innovation. As people get used to interacting with companies via voice, they’ll also be more comfortable interacting at home with their voice assistant. It’s a virtuous cycle.

Kane: What you are saying there, Roger, if you look back at 2018, 2019, there were stats around saying that 33% of UK households had a smart speaker, 37% of US households had a smart speaker. This is a few years ago going into the pandemic. As Roger alluded to there, what these things have done is they’ve gotten people used to and comfortable talking to stuff.

When you do call your bank, and all of a sudden an AI assistant answers the phone, there’s a bit more tolerance there. Around about the same time as when people were buying smart speakers, and the adoption of smart speakers at one point was faster than the smartphone, if you can believe that, Facebook released the API for third-party developers to build conversational chatbots in Messenger.

A bunch of companies started building chatbots and stuff like that around about 2016, 2017. Then as the adoption of voice assistants began, those companies started really learning how to build chatbots properly over the course of 2019 to 2022, and the mature organizations, the HSBCs, the Verizons, the Citibanks, those kinds of organizations, have now scaled what they’ve done and put it across multiple different channels.

It’s text, it’s a voice in the IVR, it’s on the website. At the moment it’s forecast––I think it’s Juniper that forecasted that in the US alone spend on conversational AI technology by the end of next year is going to be $148 billion, I believe it is. It’s really, really ramping up. The progress over the last five years, not just on the technology side, but on the adoption and implementation of this technology has been pretty significant.

Jeff: Clearly it’s big. The majority of us have access to this now in our pockets, in our homes. Roger, you talked about this moving more quickly than the current state in the business field. Talk to us a little bit more about that; some of the bigger industries that are embracing this and some of the movement that you see.

Roger: You may not know it now, but when you talk to an IVR, especially if it starts working well, you’re talking to a conversational AI system, whether that’s your bank or your airline, etcetera. One thing I think is really interesting, and it makes a lot of sense, is I think McDonald’s bought a voice company. They’re talking about their drive-throughs. Well, that’s a really good use case for voice.

It’s a limited set of things that you can order. It fits and should work pretty well, and I’m excited about that. There’s this whole concept, I’ll call it the deskless worker. This is people who don’t sit at the desk, they’re out in the field, they’re salespeople, they’re repair people, anyone out there. What do they have to do? They often go out in the field and they have technology they have to work with, but they’re busy with their hands and they can’t do it. If there’s a voice interface in front of that, that’s a big win.

Or we’ve all had it, you’re talking to someone, and they have to put their phone down, look at the computer screen, and do something. That’s not really natural. As we start building voice interfaces in front of this technology, you get to be maybe a little more natural. Maybe with voice you can ask it in the middle of a conversation for some information, and then continue the conversation without that laptop, touch-my-keyboard moment, which breaks the flow there.

Kane: It’s interesting when you alluded to Google and why they might be interested in this stuff. If you look across businesses, 90% of data within organizations is unstructured data, as in data that can’t be accessed via an API, can’t be understood logically, can’t be searched, can’t be analyzed. Google’s job is to gather unstructured data in the form of text, searches, and stuff like that. You’re right, it’s no wonder that they’re interested in this stuff, but when you think about––we talked about voice AI, but really what we’re alluding to behind the scenes is natural language understanding.

That’s the real key component here because that’s the thing that takes meaning from a set of words. On the business side, industries you asked about who’s using it, insurance, financial services, banking, things like that, retail, hospitality, travel, healthcare, or industry sectors, government, where you’ve got a lot of customer demand, you’ve got a lot of pressure on the business operations, very difficult to recruit staff, very difficult to keep people in jobs, and all that kind of stuff, increasing demand, especially.

If you look at some retailers, some retailers have had phenomenal sales over COVID, some healthcare institutions have had incredible pressure put on them over COVID. The influx of customer contact has risen tremendously and AI is one of the ways in which businesses have been trying to manage that. There’s some really good use cases at the very large side. If you look at the likes of Deutsche Telekom or Verizon, they’re using chatbots on their website to encourage customers to self-service, so they don’t need to call the contact center, making better use of their website.

If you look at a company like Bank of America, world-renowned for its voice assistant, its mobile app, incredibly helpful, about 10 million interactions a month, I think it is, that they handle. Then you look at Verizon and they’re doing agent assist use cases where the people in the call center, they’re having a conversation with a customer, but they’ve got an AI assistant on the back end that’s actually helping them through that conversation, suggesting what the customer is after, helping them process transactions quicker, which reduces the average handle time of the call, which saves the company money.

There’s a whole range of areas where this stuff is being used. Then you can look at things like emotional AI and sentiment analysis and conversational intelligence, which essentially can monitor conversations like this or monitor calls in a call center, summarize calls, extract intent, and do all kinds of business analysis. It’s absolutely incredible. Voice biometrics for authentication saves banks millions and millions of pounds per year just to identify who somebody is.

The breadth of this technology is quite wide and the application of it is inherent everywhere. Voice is a very good input. A lot of what Roger was saying there around field workers and stuff like that and the task that they’re being given is a terrible task. They work with their hands out in the field yet they have to do their work on a tablet that’s this big with a set of gloves on and thumbs that they can’t type with anyway. It’s like, it’s a really, really good input device.

Although we’re seeing a lot of application, and we’re seeing some real business benefit and customer benefit from it, we’re really just scratching the surface because there’s not––If you take all the companies in the world, you’re probably looking at 10% of them who are really utilizing this stuff now. That might even be a stretch.

Jeff: I think it was 2015 when Amazon had their Echo device come out but then every other major tech company seemed to come out with one shortly thereafter. Any thoughts on why that moved so fast?

Kane: Interesting question. One is that the go-to-market for those devices was very effective. Google would give them away with Spotify subscriptions and Amazon dropped the price to the point where they were actually losing money on every Echo that they sold in 2017, 2018 because they were just rushing them into houses. The aim of those devices strategically from Amazon, Google, etcetera’s perspective was to get the devices into households and establish a footprint because it’s the controller of your smart home, it’s the controller of your music, it’s now becoming the controller of your daily tasks and those kinds of things.

Really, it was a case of a real, real big effort on behalf of both companies to get adoption and get these devices into homes. I think that a lot of the marketing and a lot of the price reductions were certainly a huge factor. At the same time, and I’m sure Roger has got a lot of thoughts on this stuff as well, it was the first device ever which existed solely to speak to.

There’s been no device ever in the history of technology where the only point of the thing is to speak to it. It doesn’t even look that great. The first one was just like a circular hockey puck thing, and all you could do is speak to it, two buttons on top. There was a novelty factor to it as well, which is that this is really cool and really new, and then there was an accessibility factor as well. A lot of older people were using it and stuff like that. It was a really good player for kids, for games, and homework, and that kind of stuff.

It was a myriad of a whole bunch of different things ranging from the novelty factor to the go-to-market deployment strategies from those companies. Roger, I’m sure you’ve got some other thoughts as well.

Roger: I completely agree with you. I think it’s instructive to look at what’s the core business model of these companies. Amazon wants to sell me more things. Any retailer knows that the more you are top of mind, the more likely the consumer is to buy from you. I may be just checking the weather with Alexa or playing some music, but I’m interfacing with an Amazon product which sometimes throws a little ad or something in there, to my chagrin sometimes, but that’s helping them be top of funnel, top of mind, and sell more things.

Google is an advertising company. Let’s face it. They want to be in there, but they want to collect some data to really in the end advertise to you better and understand what you’re doing better. Both of them had a really strong incentive to get into the home, as you said. I agree with you. I think it was the novelty of it. I also think it’s really easy––I’m a technologist. I live in Silicon Valley, around Silicon Valley. It’s really easy to think, “Oh, everybody understands technology super well.” It’s just simply not true.

I can see it in my in-laws who are always struggling with, “How do I do this?” All of a sudden, you had this device, and it was so simple, and it was technology, and I didn’t have to learn how to communicate with the device in theory, at least. This doesn’t always work perfectly. I could just talk to that device naturally and have something happen. That really removes a bunch of the friction from the technology side.

I think a lot of people saw that and said, “Wow, this is really cool. I don’t have to learn how to do all these things to make something happen. I just ask it just like I’d ask another person to do something and something happens.” I think that was a huge amount of the initial appeal was the simplicity and a lack of needing to understand the technology.

Jeff: Did either of you guys watch the Knight Rider show?

Roger: Yes.

Kane: Definitely.

Jeff: David Hasselhoff, Michael Knight, KITT. The Knight Industries Two Thousand. Seems like we’re getting closer, right?

Roger: Yes.

Jeff: Any thoughts on the current state of the automobile industry and their integration?

Roger: It’s such the obvious place for a voice interface, the very best place. Your hands are busy, you need eyes on the road, safety there. I don’t know about post-pandemic, but there is some––I believe it’s one––is it a trillion hours? I hope I’m not an order of magnitude off that in the US people spent commuting. It’s probably a little less now, but that’s a lot of hours. There are people behind the wheel. That’s just the US. If globally, it’s N times more than that.

What do we typically do in the car? It’s passive. I listen to music. I’m a big fan of listening to podcasts there, but voice––one, it’s fabulous for controlling listening to music or listening to podcasts there. I have the little Spotify car thingy. I can say, “Hey, Spotify.” Talk to Spotify and it does what I want it to do when I’m on the road. I think it lets you think about, “Hey, not only can I do these passive things, but I can play games.”

There’s a company, Drivetime FM, that’s all about––these are apps, but they’re voice apps. You can play games in the car and have some fun there. I can even do some business and get probably simple things done, because this is a voice interface, which has things that you can and can’t do. It’s such an obvious place to me where a voice interface is the very best way of interfacing with your technology. Despite everything that’s been done in the car, I think 10 years from now, we’re going to be like, “Boy, we were living in the Stone Ages back then versus what we could do in the car with voice.”

Kane: Definitely.

Jeff: We’ll see people driving old cars by themselves and be like, “Oh, look at that cowboy over there. Look what he’s doing.”

Jeff: Yes. Let’s shift and talk a little bit more about the future. Kane, you talked about some of the advancements of voice AI. Tell us more about some of the sophisticated things that are trending that’ll make it even better.

Kane: To be honest, the technology itself is pretty good. It’s the reason why Amazon Alexa, Google Assistant, Bixby, it’s the reason why those things even existed here, is because the technology is actually pretty good. There’s companies out there like UOB who can pull out a voice from a noisy environment, like on a train or something like that, and you can process and understand what that says.

There’s technologies out there that can detect your emotions. Sentiment analysis has existed for quite a long time. Voice biometrics has existed for quite a long time. Speech recognition has existed for quite a long time. It’s just that what’s happened in the last 10 years is that the accuracy level of these technologies has gotten to a point where they’re actually ready for production.

Most of what’s hindering that future that you mentioned, where everything is voice-enabled and we don’t have to look at our screens anymore, is just simply the adoption from businesses, and customers can’t adopt it unless businesses are using it in some way, shape or form. I’ve got loads of smart speakers, but for the use cases that I just mentioned, like reading my latest articles that I’ve saved to Pocket, I can’t do that because Pocket hasn’t built the skills in those places for me to do that.

I can’t go into my bank’s app and just ask it to transfer me $5,000 from my savings account because that capability doesn’t exist in my bank’s app. It’s not even a technology problem, I don’t think. It’s more a case of businesses adopting these technologies in the right places, building customer confidence, and iteratively rolling out. There are definitely, though, things that need to be improved on the technology. Things like accents have traditionally been a problem, although they’re getting better.

If you get into the minutiae of it, there are studies from 2019 and 2020 showing that certain speech recognition systems recognize white American males better than they do Black American females, for argument’s sake. There are definitely some adjustments that need to happen as far as accessibility and equality are concerned. Mostly I think it’s a case of actually just using the technology and putting it to good use.

I don’t actually think that vision you’re talking about, which is that everything’s voice enabled and screens disappear, I don’t actually think that that is where we’re heading because voice is good for certain things. Very good as a fast input, very good for data capture, very good for those things you were talking about there which is emotional recognition, and all that kind of stuff, but it can be bad for some use cases.

A lot of businesses now are actually using omnichannel experiences where you might call the call center and begin talking about something there, but then they might send you a text and you might need to send them a picture or send them a video of yourself so they can ID you or whatever it might be. I actually think we’re moving to a world where voice, although it was on its own channel and billed to be the future, which I do think it still is, is also going to infiltrate all of our existing digital channels to the point where it’s another modality in those channels as well.

Jeff: It’s a principal modality, but it’s not a single modality essentially. Hey, I’m saying this, but I’m also seeing this on the screen as a confirmation or to show me some options so that I may select that next thing.

Kane: Exactly. It might default to voice-only on your headphones, but then it might default to multimodal on your mobile, and then what starts out as a voice-only interaction or a multimodal interaction might turn into the other depending on the use case. A voice conversation might turn into a text conversation, a text conversation might turn into a voice conversation, and I think that the whole concept of devices and modalities, we’re going to see fuse to the point where it becomes about the best option for that particular thing that you’re trying to get done based on where you are and what devices you’ve got access to.

Jeff: Sure. We might get rid of some screens in the process because it’s just become more natural, but we also might just be pairing and integrating more to make it natural too.

Kane: Exactly.

Jeff: If we jump ahead 10 to 20 years, any thoughts on what that could look like?

Kane: Over the long term, things tend to happen a bit quicker than you might think, but in the short term, it feels as though progress is quite slow. If you think about where we were five years ago, 2018 or so, Amazon Alexa and the smart speaker movement was just getting started really. There wasn’t that many skills in the skill store, there wasn’t that many devices in people’s homes, and so in five years’ time, as in now, from five years ago, the devices are everywhere.

It’s a household name. Everybody knows what it is and it’s become established, essentially. The stuff around everything being voice-enabled was really hyped up in 2019. You look at CES, and there’s voice-enabled toilets, and there’s voice-enabled this and voice-enabled that. Where we’ve settled to, and I think COVID has played a big part in getting us to this point, which is that all the frivolities and all of the superfluous stuff, which was just put voice anywhere it can go because it can go anywhere.

You just need a chip and a mic and an internet connection. It’s really about putting voice where it deserves to be and where it should be. I think the next five years are going to be figuring that out. There may be coffee machines that actually don’t have voice control within them. There may be toilets that certainly don’t because how lazy do you have to be?

Jeff: They aren’t listening. The toilets are not listening.

Kane: You’ve still got some degree of privacy. I actually think the next five years is going to be more about not voice being everywhere, but voice being in the places where it does its job best. What I would encourage anyone who’s considering exploring voice technology to do is to not get carried away with the hype and some of this stuff that we’re talking about, but be focused on where it can be applied to make sense for your business and sense for your customers. In 10 years’ time, I hope certainly that we’ve broken through that, and we are at a point where, as I mentioned, the Pockets of the world, the Evernotes of the world, the services that you use on a daily basis are accessible from any device in any modality fundamentally.

If I want to take a note on my watch or if I want to, as we’ve been talking about this use case about reading articles to me, or if I want to check my bank balance, or if I want to move an appointment or set an appointment or join a Zoom meeting or whatever it is that I need to do, wherever I am and whatever device I’ve got with me or on me, I should really be able to do it from that.

Increasingly so, voice is going to be a big part of the interface modality in those environments, because I don’t know if your listeners have used a smartwatch or something like that, but tapping on a smartwatch is a nightmare. It just is. It’s terrible. Typing on a phone to send a text message is a nightmare. Working your way through apps to get to the right place and tapping and swiping is just so long.

Those are the things that are going to start to disappear slowly but surely. Even on a computer, I dictate all my emails now. I dictate all the notes that I write. I’m using my voice for almost everything. Maybe I’m a bit of an anomaly in that respect, but I don’t think I will be in 10 years’ time. I think that we’re going to be using our voice more to get things done.

However, I don’t think we’re going to be at the point where we’ve got ambient computing everywhere, it’s going to be doing absolutely everything for us, and it’s going to be the best thing since sliced bread. I think we’re probably a little bit further away from that, but Roger, I’d love to get your thoughts on where you think it’s heading.

Roger: Well, first of all, I very much agree with your vision, Kane, and what you’re talking about. I love the fact you’re talking about getting back to the basics and what works over the next five years. Because I think we, as an industry, got our vision way ahead of where people actually wanted to use a voice assistant. I think there’s been a little retrenchment in a really good way around saying, “Okay, what is–” Think in a multimodal way, “What is the best way? Should this have a voice interface?”

I’ll share––there’s a little saying that an old boss of mine, who’s a legend, Adam Cheyer, had, and he’s like, “If it’s on-screen, then maybe the UI is the best place to deal with it. If it’s not on-screen, then consider a voice interface.” I don’t think that’s 100% true, but I think those kinds of paradigms and thinking about, okay, if I can’t see it, maybe I should voice-enable it, but if it’s already there and it’s right on the screen in front of me, maybe I voice-enable it, but it may be just as easy and more efficient to go click on it.

I think the art and science of that multimodal design will advance over the next five years. I very much agree with you, Kane, that it’s about perfecting that and really understanding where this modality makes the most sense for what we’re trying to do and makes it easier. Because we all want to get things done. I pick up a device, I want to get something done. What’s the fastest, quickest, easiest way to get it done? That may often be talking to it, tapping it, etcetera, a combination there.

In 10 years, I will say––maybe this is beyond 10, but what I’d really like voice to be is, imagine if you have a human assistant, you can go ask them to do things. Travel is my favorite. “Oh, I need to fly to New York next week.” They would know, “Oh, well, you know what? Roger flies United a lot. His miles are there. I’ll book it on United. Here’s the hotel he likes to stay at.” Boom, boom, boom. They take care of all these things for you. “Yes, it’s done. It’s booked. You’re leaving next Tuesday at 3:00 PM out of San Francisco.”

I’d like to see voice get to that level where I ask something in general and it understands enough of what my desires are, what my preferences are that it goes and really completes a fairly complicated task for me and then returns. Frankly, that’s not so much about voice. It’s a lot more about the AI behind the voice and what happens there and really having, I call it, putting the smart, the assistant, make it a capital A, assistant where it actually gets things done because right now, what we mostly do is we bark commands and that’s fine. We’re at the bark commands stage of the tech, but actually instead of barking, asking it to do something and then have an agent go off and go do all these things and then return with voice being the primary interface is where I’d like to see things in 10-ish years. We’ll see.

Kane: It works on the business side now as well. If you imagine calling a retailer to make a complaint, or you want to return an item or something like that and you call up, the retailer should know the phone number because it’s probably on your account. They should know what you’ve just purchased last because they know what you’ve purchased and when it was delivered so they should be able to preempt what it is that you’re calling about.

We know that someone bought something two days ago. We know it was delivered. There’s either an issue or they’ve got a question about it and so what the assistant should do over the phone line, the chat interface there, is come out with that and say, “Are you calling about the such and such that you just bought two days ago?” “Yes, I am.” “Okay, well, what’s wrong with it?”

All of a sudden, now you’re having a contextual conversation personalized to that customer and their needs. It’s easier to do that on the business side in theory because you’ve got control over that data. What we’re talking about here on the assistant side, on the Amazon Alexa, Google Assistant, Apple Siri side kind of thing becomes a lot harder because that data exists in lots of different places, lots of different private organizations that have it and it becomes a real challenge to do.


Jeff: In addition to the conversation we had with our guests, on today’s episode, we asked another expert to provide their insights on the future.


Kim Aspeling: Hi, I’m Kim Aspeling and I’m the Director of Creative Production at A Million Ads. What we do at A Million Ads is essentially we create data-driven audio ads. They run across all digital audio platforms, so if you’re listening to podcasts or you’re listening to music via digital radio. What does AI voice look like 10 to 20 years from now?

Well, put simply, I really think it will be an integral part of our daily lives. It’s going to be almost impossible to distinguish the difference between human and AI. Now, something that we do really need to be wary of and considerate of is we really need to create trust and protect privacy in this new voice AI world.

It’s going to become really, really important for advertisers who already struggle to gain trust. Fortunately, most ad tech companies are already doing this in the right way, but as always, it only takes one bad apple. I think we are going to see a lot more regulations in place to keep up with the pace of ever-growing AI.


Jeff: Any thoughts on this notion of, we mentioned preventative and that got me thinking about the healthcare space and health in general, thoughts about how this will play into that in the future?

Roger: Yes. Well, there’s some pretty amazing technology that can actually listen to your voice and start detecting something as an anomaly. I’m speaking slower or something a little different than I did yesterday, better go see your doctor, or even starting to detect some things. I think somebody was playing with COVID actually and listening to your voice and actually, there’s some telltale sounds about how voices change and whether you had COVID.

I’ve heard that with several other ailments there where it could be detected. I think that’s really powerful. You’re absolutely right. The notes, right? Because especially now, because of the insurance industry requirements, doctors have to document, they spend all their time typing things in and doing that, and there’s some really cool tech around that says, let’s voice-enable this so doctors can focus on what doctors should be focusing on, which is caring for the patient.

The other thing that’s really interesting is medication adherence. You go in, you have an issue, the doctor stitches you up or takes care of you, and then you gotta take one of these for the next 10 days. Well, you know that the adherence is really, really low. What happens then? That person comes back a couple of months later because whatever the underlying issue is wasn’t resolved with that medicine.

This idea of using a voice assistant for adherence to remind people, or almost be a nag so to speak, or just be something for older people, as you start losing a little bit of your memory, and so, “Did I take my pills?” or recording it that way.

Jeff: Did you take your medication?

Roger: Exactly. It’s funny when I got into voice and I talked to my doctor, this is what he wanted. He was like, “Oh, I really want something that tells people to take their medicine or reminds them or helps them record that they did that,” because he’s like, “That’s one of my biggest problems,” is people if they took their medication, they’d be okay, but they forget, they get busy, they throw it in that kitchen cabinet after using it a couple of times and then, unfortunately, they’re back in a couple of months with an ailment that could have been resolved. I’m super bullish on voice in healthcare.

Kane: Yes, there’s lots of different areas as we’ve got into. I think there’s definitely some of the operational side. There’s a reason why Microsoft acquired Nuance for $20 billion because Nuance are knee-deep in operational integrations within all of the major healthcare systems in the US. It’s definitely making doctors’ lives easier. There’s been some research on how to make surgeons’ lives easier as well. If you think about your operating table, you’ve got your hands hopefully full of doing something productive.

You can’t really do much else while you’ve been sanitized and all that kind of stuff. There’s some interesting use cases there. Also on the customer side as well, if your listeners are interested in checking out something called Woebot which is a mental health chatbot that essentially you can have more or less a free-flowing, fairly open conversation with and it will help guide you and help you with your mental health. It will help you deal with stress, anxiety, and a whole bunch of other things.

Talking about saving lives, there’s instances where maybe early intervention from something like Woebot potentially can save lives. There’s been some interesting use cases in India where a chatbot was created on Facebook Messenger to help women that are suffering with domestic abuse, and they found that the usage of that peaked at 2:00 AM in the morning, and they specifically chose chat and not voice because they don’t want to be talking to something in a house with an abusive husband, and that helped a whole bunch of women get help and stuff like that.

There’s all kinds of different use cases internally in the operational side, externally on the consumer side, and also in that bit in between as Roger’s saying, getting your medication reminders, appointment scheduling, and booking, all of those kinds of things where the customer and the entity need to do something together. I think there’s some big opportunities for it there as well.

Roger: It’s interesting, I think it was USAA Insurance who released a bot and they thought, “It’s going to answer simple insurance questions for people.” Then they realized people were actually getting a little personal with it. Think about it, you don’t talk to your insurance company unless something usually bad is happening. What they realized is, and you were talking about mental health, Kane, and people feeling––they saw that. People were sometimes getting more personal with the bot than they do with the people that they talk to on the phone.

Some people may be more comfortable talking to a bot which isn’t judging them or they’re thinking that it may judge them, than actually talking to a person. I think the mental health issue is really interesting, or just a sympathetic ear, which, to my understanding, is what a lot of that USAA bot issue was: they made it a lot more sympathetic. It was very businesslike, and they needed to make it a lot more sympathetic because people were literally complaining, “This horrible thing happened. I got in a car accident. What do I do?”

You can think of how that’s an emotionally charged thing, and then you got to deal with, “Oh my insurance company.” Sympathy there was important. I think it’s interesting to see the human behavior when you’re communicating with a bot and how that may vary from how they would communicate with someone in person. Bright, open area of study and probably opportunity to do some pretty amazing things.


Kim: Anything that we can do to make those day-to-day tasks even easier in that hands-free environment, thinking about a busy young family, and someone cooking, being able to do things at the same time and really multitask is going to be super significant. Thinking about advertising, and obviously, what we do at A Million Ads, we already work with actionable audio companies like Say It Now.

What we essentially do is we work with them to create dynamic ads for brands where customers can buy the product simply by using a voice command. It’s really instant, and we’ve seen some amazing, amazing results off the back of it. How do we actually take it one step further and, utilizing that free-flowing conversation I mentioned earlier, be able to interact with AI in a way that isn’t just like a Buy Now message, but a useful and enjoyable experience that’s personalized to us, something that we’re actually in control of?


Jeff: Any thoughts on the future of voice connected to robotics? I’d like to say, “Alexa, do the dishes and take out the trash and do the laundry and order me my favorite pizza” all in one breath. Do you guys have any thoughts for the future there? We’re obviously spending time there but I’m interested in your perspectives as well.

Kane: Yes, it’s interesting. I don’t know where I stand on the whole home robot thing. Again, we were talking about references to old programs so we talked about Knight Rider. I’m sure there’s a few people who’ve seen the Jetsons in the past. Whether we’re heading to that level of robotics in the home, I know Amazon have launched Astro and stuff like that, which is the first edition of an in-home robot. There’s a couple of voice-enabled hoovers and stuff like that. Honestly, I don’t know where that stuff will head to be honest fundamentally, because there’s only a handful of things I could imagine it doing and I wouldn’t really want it following me around the house.

Having said that, if you would’ve told me 20 years ago I’d have been talking to a hockey puck inside of my living room, then I might have thought you were a bit mad then as well, but certainly the answer to “What do robots do?” is the exact same as when you ask what does an assistant do? What does a digital assistant do? Which is it does grunt work that people don’t like doing or they get tired of doing. A bot never has a day off, never has a sick day, doesn’t want to put annual leave in, never has to leave early to go and pick the kids up. It does stuff repetitively, consistently over time, and so I think that we’re going to see definitely a proliferation of robots in the enterprise and robots in areas in public and whatnot where that grunt work is needed. How is the voice interface going to help or hinder? I think if you look at Sophia, one of the famous humanoid robots if you like, you see a robot like that and immediately you just assume that you can talk to it.

It’s the same thing when you see a digital avatar on screen or in an app, or if you want to get into the metaverse conversation, digital avatars inherently must be able to be spoken to because that’s exactly what they are, an embodiment of a human being. If you ever have a robot in the world, wherever it is, whether it’s in a factory stacking boxes or whether it’s in a house stacking the dishwasher, if it’s got a face, if it looks like a human, it needs to have a conversational interface because there’s just an expectation there.

I also think there’s room beyond that when it comes to the practical side of it, it would be ideal to be able to ask your washing machine what’s wrong with it. Be able to ask the robot if it’s got a problem, how to debug it, how to fix it. All of that stuff that you need to Google and watch YouTube videos and figure out how to do stuff should be solved by not just the voice interface, but having the capability to serve those needs as part of the software. Yes, I think that in closing, anything that looks like a human, you should be able to speak to it, and voice is naturally the only real modality of doing that.

Roger: Yes. I don’t know. I share a little bit of Kane’s skepticism of whether robots are going to be the Jetsons robot that does everything in the home. I don’t know if we’re going to get there. I think a little simpler, your guys’ example is great. Just yesterday I was trying to figure out something with my microwave, right? It was a weird setting and you know they have 27 buttons and it’s like ba-ba-ba.

I’m like, oh my gosh, this thing so needs a voice interface. For me to just ask what it is, or why isn’t the manual in some voice-accessible way embedded in there so I can ask it, “Oh, what’s that error code there?” I think you start seeing smarts in our machines that are in our home that start approaching––they work better. A Jetsons robot? When it can do the laundry, do the dishes, I’m all for it, but I’m afraid we’re going to go through more gimmicky things on our path there and that’s okay. Sometimes the best place for technology is to do gimmicks and play with it and that’s the foundation upon which you build something that’s truly useful. It’s okay to be gimmicky.

Kane: There is a really good use case actually in the UK. They probably have something similar in the US. It’s being trialed at the moment, which is a robot that will drop shopping off at your door. It will go to a local shop. It’s only a little thing. It’s just a little car. It’s got a lid on it and you put the shopping in, you’ll order it online. The people in the shop put the shopping in, they put your address in and the bot will go on the pavement really slowly and it will navigate to your address. I can see the use for a voice interface there.

Maybe if you’re a really loyal customer, it might use your voice biometrics to open the lid and authenticate that you can actually access the shopping. Maybe though actually, it will have capabilities that you might want to ask of the shop or did you not have, or it can actually be proactive, “Oh, we didn’t have any milk, sorry,” or you can ask, “What time do you close tonight?” Or you can ask whatever it is about the ingredients or what have you.

I can see some capabilities there where you’ve got again, bots doing very specific, repetitive, tedious jobs, but as an enhanced capability, maybe there’s some voice interaction that could be useful there.

Roger: I’ve always said the opportunities for voice are kind of like a barbell, with young kids at one end, because they don’t have all these pre-built conceptions about how it’s going to work and they’re very patient. Then older people at the other, because they didn’t grow up with technology and it’s still very intimidating to them, and so the very friendly voice interface, a non-technical way of interacting with their technology, is fabulous for them. I see that it’s at the ends of that barbell where the greatest opportunities are: the youngest and the oldest can really benefit and really probably be the people who mainly use voice in ways that the rest of us aren’t yet.


Kim: I’m also really looking forward to seeing how we regulate this side of AI. For me, as I said, trust and connection is super, super important, especially within advertising. There’s a certain amount of trust that needs to exist between brands and consumers to connect, especially when we’re talking about a channel as personal and intimate as audio, thinking about when you listen to podcasts and you really trust the host that you’re listening to. With celebrity voices, for example, a logical idea would be that, yes, well-known voices could license their likeness to be used as an AI, but one of the reasons brands like to use a celebrity is because it carries that level of familiarity and trust.

I choose to listen to that podcast host because I trust them, and so if they’re recommending something to me, I want to do it because I believe in them. If we’re using a celebrity voice AI to try and convey trust without the celebrity’s direct involvement, we need to ask, are we actually crossing an ethical line here? There’s a really great passage in this book called Invisible Women by Caroline Criado Perez, and the short of it is that after the Enron court documents were leaked, all the internal company emails became public, making it the largest database of genuine human-to-human interactions.

You think, okay, that’s perfect for training voice AI programs. Unfortunately, though, Enron’s gender makeup had a heavy male bias, and so smart speakers actually carry that same bias. Similarly, speech scientists have a lot of white male speech data taken from their databases of things like TED Talks, which again skews towards that bias, unfortunately. If you’re a woman and you’d like to test this yourself, next time your smart speaker is not actually picking up your voice, try and speak in a lower, deeper voice and it’ll be able to pick you up a lot better.


Jeff: If you think about where VR is going with not just analyzing your voice and having conversational UI built into it, but also your eye movements and your emotions and some of your body language. Any thoughts on how this will play into the metaverse?

Kane: Well, in the metaverse, you don’t have a keyboard if you’re wearing a headset, and all you have is your eyes. Fundamentally, it might in time be able to recognize where your hands are without holding onto some controls and be able to recognize where your elbows and knees are and stuff like that, but we’re not there yet. In the absence of that, it really seems to me the perfect way of navigating the operating system within the metaverse is a voice user interface. Then also even within there, there’s definitely going to be digital beings that are not actually real. Digital avatars, virtual humans, and interacting with them is going to be through a conversational AI.

I think there’s definitely a whole lot of opportunities for this technology in that space. Again, I tend to be a bit cynical with certain things like in-home robotics and the metaverse, I think it’s early days. The metaverse I think might actually skip a generation. My dad’s not going to be wearing his VR headset and going to the club and playing cards, but I think the operating system level needs some navigation capability and I think voice is perfect. Then within those environments, as more use cases and scenarios get built in them, I think we’re going to see actually interaction with virtual beings being dominated by a voice interface as well.

Roger: Yes, so I’m bullish on voice in the metaverse for all the reasons Kane talked about. I think that those virtual beings are something that people are starting to pay attention to, but maybe not enough. If you have a presence in the metaverse that is up 24/7, either you have people in there staffing it 24/7, or you start having virtual bots that people can have conversations with, either to handle it when it’s not staffed or to just handle the overflow. I think it’s really interesting.

There’s a startup called Inworld and what is his name? Ilya Gelfenbeyn, I think, is the guy who founded it. This is all about voice and bots in the metaverse and that’s the vision, is that you’re going to need bots in the metaverse to make it work. He’s the guy who started up api.ai, that became Google’s Dialogflow, so pretty well established in the industry. I think there’s some interviews with him, they’re pretty interesting to understand this more but yes, voice is going to be a critical part of the metaverse.

A side little thing that I think will be fun is I go into the metaverse and I get to dress up my avatar however it looks like, why can’t I have something in a little tech that changes my voice or makes my voice customized the way I like it? That’s not a great business use case, but part of the fun of the metaverse is changing who you are, and having your alter ego. Well, a different voice could be part of an alter ego as well.

Jeff: Around ethics. There’s obviously been a lot of debate on the ethics of AI and voice assistants that listen to all. Any thoughts on that? That’s the first question. Then the next question I’m going to ask is how do we continue to design for the future knowing that we can play a role? Thought leaders like yourselves can really play a role in helping us design the future.

Roger: You’re not going to stop the advancements in tech. This is not going to happen, so attempts to do that have failed throughout history, actually. To slow things down. If that’s going to be true and the technology’s going to get better and better and better, then we need to work together to think about ethics and privacy, and what does that mean. Part of the challenge, privacy, and I could go off on a whole tangent with privacy, is everybody likes to talk about privacy and it’s a hot topic, and yet for the general consumer, they click through, they don’t pay attention. The little thing that pops up in the web about the cookies who pays attention to that? I don’t.

I know what they’re asking me there. How do you protect privacy in a way that people can understand? Because I think the way that we handle privacy right now is these complicated legal agreements, pop ups, and things that don’t make sense. How is it that I can give Kane’s company, I’m interfacing with his company via voice. I’m just making this up. I give you permission to store this information about me and that and do these things for me but not for my permission to go here.

I don’t think we’ve found a UX or a way to do this that isn’t techy and geeky and unusable there. I think there’s an opportunity, maybe voice is involved there, or AI, to allow me to set permissions in a usable human way that is certainly very different than what happens today.

Kane: I don’t even think that even in the conversational AI or voice AI space that the whole concept of ethics has really been grasped properly yet. Everyone’s rushing onto the hype and building chatbots and voice assistants and launching stuff, but they don’t really think about the reality of what they’re doing, which is that they’re capturing data from customers, speech data, textual data, and that data’s being processed in whatever cloud it is you are using, Amazon, Google, whatever.

That text is being stored and monitored and kept for who knows how long. Hardly any voice assistant today has any GDPR policies in place. I don’t know what it’s like in the US but in the UK, you can only store certain data if you absolutely have to, you have to have a reason for it and a policy for it. You can only store it for so long. It has to be deleted automatically after two years even if you are storing it for a certain reason.

There’s a whole range of things that you need to do when it comes to gathering data from people. I don’t think anyone’s really considering it. One of the things that we do regularly is to consider that ethical side of things because also depending on the use case that you are using certainly from a business, if you’ve got a voice assistant that’s there to process passports for somebody, that assistant now has autonomy and it has authority, it can say no you can’t have a passport. All of a sudden when there’s this bot, this virtual entity you’ve got no recourse. You can’t argue with it. You can’t do anything other than just accept what the machine tells you.

We’re going to hit a real ethical minefield when these assistants start doing things like that that are more important, that have an impact on people’s lives. Trying to get a parking permit, the bot says no, what do you do about that? Trying to cancel insurance, the bot says no, you’re paying for your insurance constantly and there’s a whole range of things where we have to question whether it’s ethical to allow an AI assistant to do some of these things because what’s the impact of that on people?

Roger: Amen, is all I have to say to that last comment there Kane.

Jeff: Nice. You guys chose to specialize in this area and obviously I see you both as thought leaders in this space. For others that might want to get involved, any career advice that you would share?

Kane: Well, Roger alluded to the shortage of talent in this whole industry, it requires very specific skills around conversation design, around machine learning, around data science, and a whole bunch of things, computational linguistics, and a whole range of skill sets that are needed. You don’t need to have a human-computer interaction degree to be able to do some of this stuff. There’s a lot of conversation designers that were playwrights or that were content designers and things like that. I know a lot of developers, for example, that build websites for a living who know JavaScript and so they can build a lot of these solutions themselves. What I would say, if anyone’s interested in exploring this as a career path, I can only offer the advice that I followed myself, which is, I read everything I could get my hands on about this stuff, I bought every single book that exists on how to design and develop these systems, I read every single one of them.

I’ve got notifications set up on my email, so anytime Google publishes, or Google ranks an article that matches any of these phrases, voice AI, voice assistants, conversational AI, it pings my inbox. I meet people regularly, like Roger and others. I think networking is important because you’ll unlock opportunities, you’ll learn from people with experience, and just read everything you possibly can get your hands on. You’ll find yourself that within 12 months you’ll know what you’re talking about, and you’ll know your stuff. 

Then thirdly, roll your sleeves up and get on with it. If you’re interested in design go and set up a Voiceflow account and design some stuff. If you’re interested in development and machine learning and figuring things out, then get access to a Cognigy platform or Deepgram’s APIs for speech recognition. Just roll your sleeves up and have a play around because that’s how you’re going to really learn how this stuff works, and ultimately, in time you’ll develop the skills, you’ll develop a bit of experience and a career will be sitting in front of you.

Roger: Hear, hear. I’d say add listening to Kane’s podcast.

Kane: I would say that as well. Thank you, Roger. Yes, I would.

Roger: I will do the self-promotion for you. He said read but also listen, there’s some great podcasts out there. Kane has one of the very best ones, and we’re on a podcast as well so listen to this podcast. Yes, I completely agree. I bang my fists on the table talking about conversational designers because I think it’s such an opportunity. You’re right, if you have done playwriting, or you’re an English major, or you’ve done that, you have an advantage. This is all about understanding how people communicate, and how they talk, and then going, “Okay, now I need to design something for a machine or really understand that.” That is not technology skills. This is all about, really, a combination of English and psychology and those kinds of skills, those more social sciences and humanities skills. I think there’s a grand, grand opportunity for people to jump in. We need more conversational designers, and then, just go play, right? If you’re a developer, you absolutely can go and build for Alexa or Bixby or Google and go play, but if you’re not, there’s tools like Voiceflow. They’re really GUI, drag and drop, go play. I think a lot of people when they try it, they’re like, “Oh, this will be easy.” I will tell you, the more you build a voice app and you start thinking about it, it gets harder and harder because you realize human communication is rich, there’s lots of different ways to communicate, and so it gets you to think about that. Go play with one of these tools. You can do the deep geeky developer code in Node.js, or you can use these drag-and-drop tools, but it’s just a great way to understand and think about how a voice interface will work.

Jeff: Kane, tell us more about your podcast as well for our listeners.

Kane: Cool. It’s called VUX World. VUX World. It’s on all the podcast players, vux.world is the website. You can find it there as well. Very similar to this really, we find industry thought leaders, practitioners, business leaders, who have utilized conversational AI and natural language processing technologies for the benefit of their customer experience, their business processes, and their customers. We pick their brains about how they do what they do so that the people tuning in can do what they do better, which is build and deploy better conversational assistants, higher quality conversational AI, and all that kind of stuff. We’ve had Roger on in the past, we’ve had people from Samsung, Google, Microsoft. We’ve had big businesses like Comcast, Mercedes, the whole kind of nine yards. If you’re interested in how conversational AI can be used to improve customer experience and improve business processes, then definitely check out VUX World. Cheers.

Jeff: Thank you. Thank you for all the insights and wisdom, as serious evangelists and leaders in the voice AI space. Loved hearing your thoughts and just great to have you on the show. Thank you.

Kane: Thanks for having us. My pleasure.

Roger: Thank you.

Kane: Nice to see you again, Roger.

Roger: Appreciate it.


Jeff: The Future Of podcast is brought to you by Fresh Consulting. To find out more about how we pair design and technology together to shape the future visit us at freshconsulting.com. Make sure to search for The Future Of in Apple Podcasts, Spotify, Google Podcasts or anywhere else podcasts are found. Make sure to click Subscribe so you don’t miss any of our future episodes, and on behalf of our team here at Fresh, thank you for listening.