Computational Audiology Network (CAN)

Automated Speech Recognition (ASR) for the deaf

April 24, 2022, Season 1, Episode 2
Jan-Willem Wasmann / Dimitri Kanevsky / Jessica Monaghan / Nicky Chong-White

Automated Speech Recognition (ASR) for the deaf, and communication on equal terms regardless of hearing status.

Episode 2 with Dimitri Kanevsky, Jessica Monaghan, and Nicky Chong-White. Moderator: Jan-Willem Wasmann.

You are witnessing a recording of an interview that was prepared as an experiment using an automated speech recognition (speech-to-text) system. One of the participants, Dimitri Kanevsky, is deaf and needs to read the transcript of what is said in order to follow the discussion; the other participants have normal hearing. We all need time to read the transcript and confirm that we understand each other properly. We are using Google Meet and Google Relate, a prototype system that is not yet publicly released and is trained on Dimitri's speech. In addition, we are in different time zones (16 hours apart), have not met in person before, and English is not the first language for all of us. Of course, we hope the internet connection will not fail us. There will be a video recording (YouTube) and an audio-only recording. The video recording includes the transcript of what Dimitri says.

In order to read the transcript on Dimitri's screen, please watch the audiovisual version on YouTube:
https://youtu.be/7bvFCo3VXlU

Jessica Monaghan works as a research scientist at the National Acoustic Laboratories (NAL, Sydney) with a special interest in machine learning applications in audiology. She studied physics in Cambridge (UK) and received a Ph.D. in Nottingham (UK). She worked as a research fellow in Southampton and at Macquarie University in Sydney. Her work focuses on speech reception and how to improve it in cases of hearing loss. Recently she studied the effect of face masks on speech recognition.

Nicky Chong-White is a research engineer at the National Acoustic Laboratories (NAL, Sydney). She studied electrical engineering at the University of Auckland (NZ) and received a Ph.D. in speech signal processing at the University of Wollongong (AU). She has worked as a DSP engineer with several research organisations, including the Motorola Australian Research Centre and AT&T Labs. Nicky holds 10 patents. She is the lead developer behind NALscribe, a live captioning app designed especially for clinical settings that helps people with hearing difficulties understand conversations more easily. She has a passion for mobile application development and for creating innovative digital solutions that enrich the lives of people with hearing loss.

Dimitri Kanevsky is a researcher at Google. He lost his hearing in early childhood. He studied mathematics and received a Ph.D. at Moscow State University. Subsequently, Dimitri worked at various research centers, including the Max Planck Institute in Bonn (Germany) and the Institute for Advanced Study in Princeton (USA), before joining IBM in 1986 and Google in 2014. He has worked for over 25 years on developing and improving speech recognition for people with profound hearing loss, work that led to Live Transcribe and Relate. Dimitri has also worked on other technologies to improve accessibility. In 2012 he was honored at the White House as a Champion of Change for his efforts to advance access to science, technology, engineering, and math (STEM) for people with disabilities. Dimitri currently holds over 295 patents.

Quotes from the interview

Dimitri: 'There is no data like more data.' (Mercer)

Jessica: 'Blindness cuts us off from things, but deafness cuts us off from people.' (Helen Keller)

Nicky: 'Inclusion Inspires Innovation.'

Jan-Willem: 'Be careful about reading health books. You may die of a misprint.'  (Mark Twain)

Further reading and exploring

https://blog.google/outreach-initiatives/accessibility/impaired-speech-recognition/


Transcript


Welcome everybody. I'm really excited to record this second episode of"The Computational Audiology Network" podcast and I'm really happy with today's guests. Jessica Monaghan, she's working for NAL at Sydney. Good morning, Jessica. Good morning. And Nicky Chong-White, she's also working for NAL in Sydney. Good morning, Nicky. Morning. And thank you, Jessica and Nicky, for spending your weekend in this podcast or sacrificing it and Dimitri Kanevsky from Google, you're stationed in New York so good afternoon, Dimitri. Good afternoon. Good to see you all and before we further start this interview, I would like to explain everybody at home the system we are using. So I prepared a short statement, disclaimer. So you are witnessing a recording of an interview that was prepared as an experiment using automated speech recognition. So that's a system that translates speech to text, all live. One of the participants Dimitri Kanevsky is deaf and he reads the transcripts. And he needs this to follow the discussion. The other participants are normal hearing. And we all need to take time to read the transcripts and confirm that we understand each other. So that's for us something new to take into account. I'm used to listening to somebody and reading then sometimes the transcript, but not talking myself and reading the transcript. So we'll see how that will work out. We are using Google Meet and Google Relate. It's a prototype system, not yet publicly released and it's been specifically trained on Dimitri's speech. And in addition, we are in different time zones, Jessica and Nicky are 10 hours ahead of us. Dimitri is six hours lagging. So we are 16 hours apart and we haven't met in person before. So yeah, that's can be sometimes a-a barrier and English is not well my my first language nor Dimitri's. So that might be a challenge for the speech recognition system as well. So let's hope that technology will not fail us. And there will be a video recording and audio-only recording and edited video recording will also include the transcript of what is said by Dimitri. Yeah, and I guess for people at home, yeah, the final recording may look different than from how it's experienced live. I was really glad that we were able to practice this a little bit and yeah, I would like to continue with introducing the first guest Jessica Monaghan. Yeah, Jessica, I think we met two years ago if I remember well? You remember the VCCA? yes. I remember I gave the first talk yes, indeed and I-I remember I was a little nervous and I think your video clip it didn't start up right away. But you kept your cool. And I thought, then I guess it will bring good luck today as well. So now you work as a research scientist at the NAL in Sydney and with a special interest in machine learning in audiology. You studied Physics in Cambridge in the U.K. and received a Ph.D. in Nottingham. And then you continued working as a research fellow in Southampton And your work is focused on speech recognition and how to improve this in case of hearing loss. And you shared that you recently have studied the effect of face mask On speech recognition. Jessica, could you explain us your initial interest for ASR or Automated Speech Recognition? Thank you. So I started researching using ASR as a master's student in Roy Paterson's lab in Cambridge. And that was also my first experience of research and my first introduction to using machine learning. So that was something that I found really fascinating. 
And there I was working on a project trying to improve the robustness of speech recognition to different talkers by using an human auditory model as a front-end to try and give it the same robustness to different speakers. When a human hears someone talk, they don't need they hear the same thing no matter what what someone's saying and despite the different acoustics of the situation. But yeah, at the time, Automatic Speech Recognition had to be trained a bit more for individual speakers. So that was really interesting. And I worked in other areas for my Ph.D. and [inaudible] they still looking at machine learning, but I always retained this interest in Automatic Speech Recognition. And then when I started working at NAL, that was in 2020, so it's just at the start of the pandemic. And so we were seeing the impact of face masks and barriers on communication, particularly in clinics. And so we we done this research looking at how face masks impacted speech and how we could apply a particular gain to try and improve understanding for hearing aids users. And with the ubiquity of ASR by that point and, having it on on different devices, then it was clear step to us to that that could be used to aid communication. So I was really excited to work on NALscribe with Nicky. And she'll probably talk more about that next. Yes. I'm sure that Nicky will further explain NALscribe. And I just wondered with this effect of the face mask, is it then more acoustics, like filtering or does it have effect on your articulation? So apparently there isn't apparently there isn't much effect on your articulation. It really is just the filtering effect of the mask. Okay,'cause I experience sometimes that if you wear a mask and then your chin is push yeah, pulling more or less your mask from your nose and then you're maybe not articulating that well, but you didn't see that effect. No. In fact, for instance surgical face masks don't have much effect on the acoustics, even though they're constricting your face in the same way as as other masks. Okay. But yeah, it seems to be just the acoustic filter. So it is primarily a gain or some compensation that you could then build into your system. Yes, that's right. So you could apply it as an additional gain for hearing aids so that we are setting that they could change to a mask mode when they needed that. Ah, cool. So you've applied it to different devices both the NALscribe then as in in hearing aid prescriptions. So yeah, we haven't applied that to NALscribe, but we did find that it works quite well with masks nevertheless. We we did some tests on that. Ah, okay. Yeah. I-I-I thought that it was also applied to NALscribe, but it's then in the hearing aid devices and the rehabilitation that you've applied it. Yes, that's right. We considered applying it to NALscribe, but since since you're able to be quite close to the the microphone, then we, it was the calculations were done assuming they the talker would be at some distance from the speaker. So yeah, we found it wasn't really necessary. Okay good to know. And I guess good moment to get over to Nicky. Nicky, you led development of app and previously you studied electrical engineering at the University of Auckland in New Zealand, and you received a Ph.D. in Speech Signal Processing at the University of Wollongong in Australia. You recognize at least the name, and then you worked as a DSP engineer with several research organizations, including Motorola, Australian Research Center, and AT&T Labs. 
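A note on the face-mask compensation discussed in this exchange: since a mask acts essentially as an acoustic filter that attenuates the higher speech frequencies, the compensation Jessica describes amounts to adding gain back in that region (a "mask mode"). The sketch below is a minimal illustration of that idea, not NAL's implementation; the boost amount, knee frequency, and shelf shape are assumptions chosen only for the example.

```python
import numpy as np

def mask_compensation_gain(freqs_hz, boost_db=6.0, knee_hz=2000.0):
    """Illustrative gain curve: boost frequencies above roughly 2 kHz, where
    face masks tend to attenuate speech (all values are assumptions)."""
    gain_db = boost_db / (1.0 + np.exp(-(freqs_hz - knee_hz) / 500.0))  # smooth shelf
    return 10.0 ** (gain_db / 20.0)

def apply_mask_mode(signal, sample_rate):
    """Apply the compensation gain in the frequency domain (whole-signal FFT,
    fine for a short utterance; a device would use a filterbank or STFT)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return np.fft.irfft(spectrum * mask_compensation_gain(freqs), n=len(signal))

# Example: compensate a synthetic one-second test signal sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
test = 0.5 * np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 4000 * t)
compensated = apply_mask_mode(test, sr)
print(compensated.shape)
```

A real hearing-aid implementation would apply such a gain per channel in its filterbank and derive the values from measured mask attenuation rather than a fixed shelf.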
I, oh, see that you hold 10 patents. And yeah, you were the lead developer of NALscribe, a live captioning app to help people with hearing difficulties. Could you explain us, hey, why you started developing this this app or where your interest in speech recognition initiated. Yeah, thanks for the introduction. I'm impressed that you got Wollongong correct. There you go. And so someone must have programmed that and I'm sure. Yes, I did my Ph.D. in Speech Signal Processing at Wollongong University. And that was probably my first sort of introduction to digital signal processing techniques to analyze speech speeches and find efficient sort of parameter representations of speech. And so even though during my Ph.D. I was focused on speech analysis and coding and a little bit of synthesis. It's all the same methods that are used in speech recognition. So it was quite a strong foundation. And then after that, when I worked at AT&T Labs and I think one of my first meetings there was a presentation by a group who had just recently they'd recently released this new intelligent voice response system, which was called behind the scenes we called it"How may I help you." And that was when an AT&T customer could just ring up on the phone and instead of being met with this automated sort of robotic system that said,"Press one, for accounts. Press two for products and services." You just had a voice, another automated voice that said,"How may I help you?" And the person the customer could just speak naturally and say, I don't know,"I'd like to pay my bill." And that was quite mind-blowing at the time. This was late '90s, early 2000s. And I think was the first sort of system like that. And then I think that really inspired me to delve more into speech recognition and natural language understanding. Yeah, fast forward, what, 20 years? We're working at NAL and basically[pause] With the the pandemic and we saw now opportunities where we could, Revisit, What we've done previously in speech recognition. And really now not so much develop it further ourselves, but how do we apply that speech recognition technology and package it in an easy way for people to access. And yeah, that was when we were doing research on a lot of user research on the problems they were finding with communication with masks, especially that was when we thought there's there's a real opportunity here to to produce something that can really help people, and and make a difference because we were discovering people had really strong emotions. Like negative emotions when they were trying to communicate, it was frustration and embarrassment, and anxiety. People who didn't want to go out, like they were staying home or avoiding those social interactions because all those communication difficulties. So that was the motivation behind NALscribe. Wow, yeah, and also impressive how much has happened then in, well, 20 years in improvement in this system. So if I understand well, it is been more focused on the last part in the design, how to make people use the technology or that you, yeah, translate it into benefits for persons in using it? Yeah, that's right. So we've had, amazing peop researchers like Dimitri or, the people at Google and Apple and Microsoft who have done, all the hard yards and collected all this data that we can we now have these more sophisticated training methods. We're not just using our acoustic models and having with little speech corpus to work with. 
It's just millions and more than millions of hours of speech in real situations, from YouTube, from phone calls, from everything. Our focus At NAL is how do we turn that into something that that can help people? Thank you, Nicky. And I think we cannot wait any longer with listening to Dimitri who has done quite some work, I guess, that was important preparation for another later work done by Nicky and Jessica. Dimitri, you work as a Researcher at Google. And you lost your hearing in early childhoods. I understood you studied Mathematics in Moscow and also received a Ph.D. there. And then you started working at various research centers, including the Max Planck Institute In Bonn, Germany, and also the Institute for Advanced Studies in Princeton, the USA. And then you joined IBM in 1986. And I think you've been working for more than 25 years in speech recognition, if I'm correct,[pause] Dimitri? And then you yeah, joined Google somewhere in the last five years, I think, I didn't know the exact date. I saw that you had developed Google Live Transcribe. Google Relate, the systems we are now using today, but you also worked on other technologies to improve accessibility. And in 2012, Dimitri was honored at the White House as a champion of change for his efforts to advance access to science, technology, engineering, and math for people with disabilities. And Dimitri currently holds over 295 patents. So I-I hope this was captured well and well, Dimitri, I'm really honored to have you here in the in the show. And I-I-I wondered when if your motivation to work on speech recognition, when you decided to study mathematics, did you already have that ambition then to work on in this field? I had no intention to do speech recognition when I did math. After my dream was to work forever in Mathematics. But then after I receive Ph.D. at the time it was Soviet Union. My family and I decided to immigrate to Israel and I lip read pretty well in Russian but I realized I not be able to lipread so well in Hebrew and in English. So I knew the haptics helped me to lipread better. So while I was waiting for permission from Soviet Union to immigrate to Israel, it was about 10 months. I Learned electric engineering and developed haptic wearable devices that have several channels. One channel just for low-band audio. Does amplification but other transformed high frequencies to low frequencies so I could understand frequencies in Hebrew. You have a lot, Shalom, Shabbat,[inaudible] So yeah, but language is not I took this device to Israel, it had small speech recognition technology and I got some grant from Israel government and developed a startup. And this device had a lot of impact. It was first variable, haptic device but I continue to do mathematics. But when I immigrate, no when I went to America and worked at the Institute for Advanced Studies, it was very difficult because there were no transcription services at all in America. And I decided that I should work in speech recognition and make temporal break in mathematics. And develop communication means for me and because I developed this haptic speech technology. I based speech recognition talking to me, they did not take just some abstract mathematician, they took person who could do practical application. So I thought in five years, we developed speech recognition technology that will solve all my problems, but other people problems. Five years passed. No. 
And for next five years, next five years, that lasted for twenty-five years at IBM where we developed good algorithm that improve significantly but still not enough to be used for communication. And in 2014, I moved to Google. In there finally we achieve very good speech recognition accuracy our team. And I move to California from New York to develop practical application. This is my story, and now I went back to mathematics. I start again do mathematics and solve 50 years old mathematical problem that my adviser gave me 50 years ago and finally had time to figure it and finish it. Wow, great, so problem was a little bit underestimated, you needed much more time to to solve the problem of speech recognition and we are all having benefits of this today. So glad to hear you have more time for your other passion of mathematics and but is still is there a mathematics and the tools maybe that you developed there and for instance machine learning or the data analysis that is needed for the speech recognition? First, I get to work as mathematician in speech recognition. I develop new optimization algorithm. I don't know if you heard about Baum-Welch algorithm for Hidden Markov Model. At that time, it worked only for polynomials, for polynomials function. For maximum likelihood and nobody knew how to extend this to different kind of objective function, like maximum mutual information and my contribution was, I discovered algorithm that could extend Baum-Welch. It extended efficient quasi-line algorithm to different type of objective function that is allowed significant improved speech recognition accuracy. But now I try to apply my abstract mathematic. Algebraic geometry number theorem in machine learning and trying to develop new kind of machine learning. Then based on more abstract mathematics. Wow, I-I'm impressed with that. I-I must admit I'm not able to fully grasp and and appreciate it in how you have have been able to do this. And so I expect and also that you've been working probably in a-a in a team with many different specialists for developing, for instance, Google Live Transcribe? Absolutely correct. Google Live Transcribe became possible because I get very remarkable coworker, my friend, Chet who with compassionate the difficulty that I had at the time, I used only manual transcription services from stenography. And I told Chet, the speech recognition already good enough. You have all these speech recognition in Google document, Google Docs. But you need to click on the microphone each time that you want to speak and it immediately stop if you do pause, I couldn't use this for conversation. So on weekend, imported this system into android and gave me first prototype. And this how we started to polished it, tested for users, added many languages. Live Transcribe was born. We got a few more very talented people like Sagar and others who was project manager. We got team, big team of software developers that implemented this. We got also sound notification so you could detect if your dog barking, baby crying and you continue to add more and more wonderful features. For example, now we are adding offline speech recognition. Now, Live Transcribe has offline speech recognition. Before Live Transcribe required data or Wi-Fi connection. Now, you can use Live Transcribe, it beginning to be going to public soon. You can use this in elevator where you're losing connection. You can use this in India, in Africa, where there are no good network connection. 
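Background on the algorithm Dimitri mentions: classical Baum-Welch re-estimates hidden Markov model parameters under a maximum-likelihood objective, and the extended Baum-Welch family generalizes the update to discriminative objectives such as maximum mutual information (MMI). A commonly quoted form of the update for a discrete emission probability is sketched below; the notation follows the published extended Baum-Welch literature and is included as background, not as a formula taken from the interview.

```latex
% Extended Baum-Welch style re-estimation of a discrete HMM emission
% probability b_j(k) under a discriminative objective such as MMI.
% c^num_{jk} and c^den_{jk} are state-occupancy counts accumulated from the
% reference (numerator) and competing-hypothesis (denominator) lattices,
% and D > 0 is a damping constant chosen large enough to keep all updated
% probabilities positive.
\[
  \hat{b}_j(k) =
  \frac{c^{\mathrm{num}}_{jk} - c^{\mathrm{den}}_{jk} + D\, b_j(k)}
       {\sum_{k'} \bigl( c^{\mathrm{num}}_{jk'} - c^{\mathrm{den}}_{jk'} + D\, b_j(k') \bigr)}
\]
```

Larger values of D make the step more conservative; as D grows, the updated parameters stay closer to the current ones.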
Wow, that's a really, I think important development, also the [pause] robustness of this system have with low connectivity. And and Jessica, I remember, you also had a question for Dimitri that maybe fits in nicely, I think now in his explanation so far? So I was wondering if in your experience of using ASR and developing it, whether, so did you just see a-a gradual improvement or was there a particular step change that that you can recall in in the accuracy? Yes, two factors.[Inaudible] the significant improvement of accuracy. First, neural network of course. Computers became powerful enough to process faster neural network. And second factor was that you got unlimited data for training from YouTube. This was actually my for basic work at Google because YouTube has manually uploaded caption but manual uploaded caption has a lot of errors. You could not use directly to trans speech recognition. So you put filter that with high probability detection, which audio segment had good manual transcription and you got so much data that in one day, we suddenly improved speech recognition by many, many, percent. I remember before that you could spend several years. And we are very happy to improve speech recognition accuracy by quarter percent. Suddenly every day, five percent improvement, absolutely more five percent improvement. It was exciting in our team. Wow, wonderful and Jessica, do you have follow-up questions or or Nicky, is there something you would like to add or or ask to Dimitri? Yeah, my main questions to Dimitri was, yeah, what what are the next barriers to overcome like that you see? What's gonna get you that, well, I don't know if we can get an extra 5% improvement, but is there something that you see is is holding back the accuracy to where it is today? Or can we, have we have are we close to the limit? Or is is there still a lot of room for improvement? I think that next barrier, there's a few people who have non-standard speech like me. On YouTube you do not find many people speaking like me, so you could not use YouTube data to create model for me. And if you have people who have ALS, now, our speech recognition works for them too but we need record specifically. So relate specifically, get data from people with non-standard speech. They record and they get for them speech recognition. But also this model is local, what you see now is not on network. It is on my device. And actually can understand you also very well. Do you want to try? Let me turn connection. Do you want to speak? And you will see it will start to transcribe you. Please, speak. Okay I'd say Nicky, the honor is yours to have a try? Also, it was now it's time to speak. The the next thing coming more personalized speech recognition for for individuals and training. When I'm looking now and looking at your transcription and and the transcription that comes with Google Meet, I can I can definitely see that your transcription is better. So yeah, it does show a lot of promise for that individual training and, the benefits that you can get. Yeah, I guess yeah, the next thing is how how can people do that without spending a whole year to train like you have and and many hours. Can that be done more efficiently maybe by I'm just, thinking here certain sounds and, targeting that training material to make it, yeah, to improve accuracy? So definitely very interesting times ahead. And and Dimitri, if But so you saw my local speech recognition, not only understand me, it understands other people. 
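Dimitri's description of mining YouTube amounts to a confidence filter: keep an audio segment for training only when its manually uploaded caption is likely to be correct, for example because it agrees closely with what an existing recognizer already produces. The sketch below illustrates that selection step with a plain word-error-rate check; the `recognize` callable and the 20% threshold are placeholders, and this is not Google's actual pipeline.

```python
def word_error_rate(ref_words, hyp_words):
    """Levenshtein distance over words, normalised by reference length."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(1, len(ref_words))

def select_training_segments(segments, recognize, max_wer=0.2):
    """Keep (audio, caption) pairs whose uploaded caption agrees with the
    current recognizer's output; `recognize` is a placeholder for any
    existing ASR system that returns a text hypothesis."""
    selected = []
    for audio, caption in segments:
        hyp = recognize(audio)
        if word_error_rate(caption.lower().split(), hyp.lower().split()) <= max_wer:
            selected.append((audio, caption))
    return selected
```

Segments that pass the filter can then be added to the training set, which is how an existing model can bootstrap itself on very large amounts of weakly labelled audio.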
This is fantastic and answered your question. I think solution comes when we get enough clusters of similar speeches. This is what we do for people who have ALS. You got a lot of speech from different people with ALS. So new person comes that person does not need to train too much. So this is eventually solution for all accents. Accommodate a lot of classes. Yeah, actually I had two But I will stop transcribing with my local feature, so that it does not distract. I had two follow-up questions. One is, beforehand we thought, well, will the system have troubles with our accents? Like Jessica, Nicky, and I all speak differently, but looks like there have been already quite some clusters of British or New Zealand's speech or had a Dutch accent speech. So my question would be more like what role could clinics play here?'Cause I think that clinics can help in collecting or motivating groups of patients with similar disease or similar symptoms, similar atypical speech and that could help then in collecting data. What do you what do three of you think of this? And and maybe Nicky, you have already working also with different clinics in in Australia? What could could there be an opportunity? Sorry, my Air Pods just went flat there so I might I'm just relying on the captions. That's good. I understand what you just said. Yeah, we, because we release NALscribe and we've been doing some more clinical testing. So we are looking at how it has been performing, In Australian clinics, in U.S. clinics and now coming up in in the Netherlands. We do have different variants of English that the user can select. I'm not quite sure, yeah, how how different they are when you select Australian English versus British English, How similar are these models? But I think was the question more in terms of atypical speech? We haven't actually delved quite into that yet, but it would be interesting. Yes, I-I think it's also looking over it's about finding useful clusters'cause for instance, in the U.K. also we have from city to city, the accent is different and you could debate whether that is U.K. English or the same as also in the in the Netherlands than in some regions. Yeah, the accent is quite it's a di dialect and so then just the label Dutch doesn't capture it all. So how would, what would be then a good strategy to to know when when you have a-a valid cluster or something or an [pause] yeah, I'm wondering who's best to address that question in atypical speech. So Nicky, you didn't yet look into this problem? No, and we have been mainly looking at the use case where it is a normal hearing speaker speaking to a hearing-impaired person in that clinical setting to improve that communication. Yeah, definitely to look at more of that two-way communication is something that would be would be very interesting. But in the yeah, in the development of of NALscribe, we were really looking at the hearing impaired person as as the listener. Okay, yeah, and then that brings me to the follow-up question. Then for broader applications of this technology, and also what you now bring up the barriers of communication between people irregardless of hearing status. So yeah, Nicky, it's actually good example now that you are also now relying on the transcription. So we have now two people listening and two people reading. 
And and so Nicky, maybe you experience some new barriers or and what do you think could be the the next steps to relieve this and maybe good if I ask you Jessica if you would comment on how to further, yeah, develop this with the opportunities that are, well, already are described before. Okay, so I think the the next advance would be really to take advantage of the technology that we already have. So there are all these situations where particularly at the moment where there are in fact, physical barriers to communication. If if someone with a hearing difficulty goes to their DP clinic and they can't be understood at reception because they're not using a technology that's widely available to everyone I think that's, it's excluding people unnecessarily. So I think there are a lot of situations where this could be beneficial already. And particularly in that situation is quite quite a good one. If you have a business that has a-a tablet and a microphone, then even if you're in a noisy waiting room then they're able to take advantage of that good signal-to-noise ratio and typically get very good captions. So I think that's that's, yeah, using existing technology. And in terms of future technology, I think is very exciting is the emergence of augmented reality systems. So AR glasses. So if you had captions on an AR glass AR glasses, so you could see the captions when you when you look at a particular person or different captions for different people and maybe additional information About sounds that are going on around you. I think that would that would be a is going to be a really wonderful new use of technology. Yeah, I fully agree. I think that could really help in this augmented reality and it also brings me to another question maybe to you, Nicky, of, if people start reading from a tablet, how does it affect, for instance, their ability to read lips and and also include the facial expression? Yes, we've had feedback and recommendations that we can put forward based on the experience of just observing how people are interacting and using that's this technology, In a clinic situation. And definitely, people with hearing loss and even people without hearing loss like to read lips and we've that's just become more apparent when everyone's wearing face masks. And even as a normal hearer I, struggle to understand because it just taken away this facial lip cue, which I didn't really even know I was using. But now that I don't have it, I-I have I-I find I just need to concentrate harder on understanding. So yeah, we do encourage when it's used for the tablet to be placed close to the person's face so that they can don't have to do a full head turn to turn between the tablet and the person speaking. We're encouraging people to pause more after this, while in between, Sentences so that the person who's reading the caption and has time to basically catch up. We found a lot of people saying they like the captions to confirm what they've heard. So a lot of people may be listening and be able to, recognize all the words, but maybe it takes a little bit longer for them to process and understand and really get a good enough understanding to be able to engage fully in the conversation. Yeah, definitely positioning of the screen is really important. 
Of course, when we all do have augmented reality glasses that may make things a lot easier, but until that becomes more available and then affordable for the average person, certainly there are definitely things you can do with the existing technology to make it easier and hopefully we can pass more of those acceptability barriers. So is it acceptable for use? Is it a usable thing? If we can't get past that, then it doesn't matter how good the technology is people will just, Come up with other strategies. So but then it sounds to me that you had actually a lucky situation that because people were wearing a face mask, they lost lip-reading anyway, and the step to using a tablet then for reading text was smaller. And it could be that as soon as people get familiar with this technology, then in the future, yeah, we'll find maybe hybrid ways of how to use this technology. And in your approach in developing, I read you followed a more holistic approach in in your design. Is there something in particular that help you here in better finding the the needs and how to maybe change your design in this process? Yeah, we followed what's known as a design thinking process. So really starting from the customer and the user point of view, what are their problems and needs? And as you've just, mentioned, the mask really heightened that need. Generally, the people with a disability are your first adopters of such a technology. With masks, it basically gave everyone a disability and when we really make that need so much stronger, yeah, it does reduce that barrier. It makes it more likely that that technology can be more mainstream and picked up by more people. Yeah, we went through a[inaudible] process of understanding the need, coming up with a prototype, and putting something out there. Our first version of NalScribe was really basic. It really had very little features, and as we would talk to our users, we found things like they really wanted the privacy. They wanted that offline mode for a clinical setting. They wanted the screen automatically cleared when they used it in a reception counter use case, where you don't want the next person in line to see what the previous conversation that was had between two people. So all these little things would, yeah, we we took note of, and we could incorporate that into our app. Nice. And and I think, Dimitri, I-I saw you wrote this paper about different use cases and some of those use cases are also covered by by Nicky, but I also saw the example of, for instance, students listening to a professor in a lesson. And when you're briefly interrupted by something else, a message on your phone, and then you want to get into the story again, you can read the transcript, and that way it could be, of course, also come in handy for people without a hearing loss. And, Dimitri, did you for getting to these use cases, what kind of strategy did you follow? Was it based on what you experienced yourself or did you also ask other people with hearing loss? I hope my question was transcribed clearly. We indeed considered various strategies, and sometimes we don't want to listen to transcripts to meeting all time. You indeed want to do something else, maybe you're bored a little bit with meeting, and we got interesting message in your phone. So you can look into phone, then you can miss something that where glasses were very useful. I don't know if you saw we already published paper about using transcription in glasses. I don't know if you saw this. 
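Two of the NALscribe requirements Nicky lists above, offline recognition for privacy and a screen that clears automatically between clients at a reception counter, can be pictured as a simple captioning-session loop. The sketch below is hypothetical: `OnDeviceRecognizer` stands in for any local speech-to-text engine, and the idle timeout is an arbitrary choice; this is not NALscribe's code.

```python
import time

class OnDeviceRecognizer:
    """Stand-in for an offline, on-device recognizer (hypothetical): a real
    app would wrap a local speech-to-text engine here so that no audio or
    text ever leaves the device."""
    def next_phrase(self):
        # A real implementation would block on the microphone and return the
        # next recognized phrase, or None when nothing has been said.
        return None

def run_captioning_session(recognizer, idle_clear_seconds=30.0):
    """Show live captions, then wipe them once the conversation goes quiet,
    so the next client at the counter cannot read the previous one."""
    captions = []
    last_speech = time.monotonic()
    while True:
        phrase = recognizer.next_phrase()
        if phrase:
            captions.append(phrase)
            print(phrase)                          # display the live caption
            last_speech = time.monotonic()
        else:
            time.sleep(0.1)
            if time.monotonic() - last_speech > idle_clear_seconds:
                captions.clear()                   # privacy: discard the transcript
                print("\n" * 40)                   # push old captions off screen
                break

# Demo with the stub recognizer and a short timeout.
run_captioning_session(OnDeviceRecognizer(), idle_clear_seconds=1.0)
```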
I didn't see it yet, but I see Jessica nodding. Yes. So how was that experience with the glasses? And it shows how nice if you have glasses when you interact, it shows video where I have dinner with a lot of people. So I don't need to put on[inaudible] , mobil phone and to look on transcription. I'm enjoying. I'm looking at everybody. I can eat and follow transcripts. So back to your example. I can do something else, but transcription running in glasses, and I continue to follow transcription of what is being spoken. But we also have in live transcribe, it vibrates if somebody calls me. So if I'm using live transcribe and also integrated with my haptic device, if somebody called me, Dimitri, it start to vibrate. I know people are calling me. So this all users cases scenarios we can see there. You're absolutely right. Glasses change human interaction completely. You can easily follow presentation as well you read captioning in this place and lecture point to something. And you do not know what he pointed at in glass, you see what was pointed in slides and in your transcription. We describe all these user cases. And and, Dimitri, are you using all these technologies now in in daily life? So if you have dinner with friends, you are using already these glasses? Sometimes I'm using these glasses. And what are reasons for you then not to do it? Now now I do not need. You are convenient. Not really. I'm not planning to start to watch my phone mainly... Yeah. Yeah, Okay. While you're talking. We we have your full attention To you. Okay, cool. Let's see, for this this for this round of other potential applications. I think there's, of course, most important the the user the end-user, the person with hearing loss, but also the clinician could be a-a-a user of this technology or play a-a role. What role do you see for clinicians in either promoting this technology better for prospective users or to help in validating the technology or improving the technology? What are your thoughts? And let's say if I don't give a turn how this will work now in the system, so feel free to answer. Take the initiative. I was just gonna talk a bit about how I thought it was particularly useful for medical situations. So we know that people with hearing difficulties are disadvantaged, so they have worse health outcomes and higher rates of rehospitalization, low compliance with medication, and all of these factors are magnified if they if they have said that they have poor communication with their physician. So I think in that situation it's really really valuable for the clinician that they know that the information the important information that they they're trying to get across is actually it has been understood. And also, this possibility of then having a transcript that they can the patient can take home or they can share with a-a carer. And, yeah, hopefully, that should improve the situation. Yeah, improve outcomes a lot. But in terms of actually validating it in the clinic, I think, if we if we have so we did we did tests in the clinics and we did some questionnaires. And I think the the value there is really that it it demonstrates to the clinicians that this is a worthwhile thing to do, that there's actual benefits because of course they, they do have a lot of skills, particularly audiologists at communicating with hearing-impaired people. So they may not feel that they actually need any any help with that. That's what they do every day. 
And so it gives them some confidence that actually clients do find a benefit and it's it's valuable. And also, that helps them to justify, To the clinic owners, but if there's some expense or some some time that's needed to to get these systems set up that it's worthwhile, that it's providing benefit and increasing satisfaction. Thank you, Jessica. And also, interesting point you raise of clinicians actually being maybe a-a barrier sometimes if they feel that they they don't need a tech'cause they're already taking into accounts in their communication. Nicky, did you find this in in your pilots that clinicians were not open to this technology? I think in our in our pilot Our clinicians were very encouraged to try it out. I think it was maybe more of a barrier from the the client point of view, who thought depending on the degree of hearing loss, if someone doesn't have a severe hearing loss, they would say,"No, I don't need it, I'm okay." But, through our own experience, we thought even just as an assistive thing you're not relying on the captions, but just having it there can be helpful. So we were trying to encourage more and more people to use it, or at least just experience it, the, for a little bit just to see how how they thought of it because you don't really know, It's with any technology, you don't really know what it does until you actually try it. And I think also with a-a good way that clinicians can help to show this technology to especially our older clients who aren't so tech-savvy and haven't experienced, discovered this on their phones for themselves. We had some clients say, "Wow, this is this is amazing." And we're like, there are other apps that have been around, At least a few years or so that that have been doing this, it's not, groundbreaking new novel right now. Yeah, giving that introduction to to more people can the clinicians can help us in that way. And then just one other point I wanted to make is we know that an improved client clinician relationship leads to better hearing outcomes in the clinic. And definitely introducing live captions and being able to make that client feel more valued, more included, we understand your difficulties, and this is what we are doing to take steps to, Improve that for you. I think that can really add to the rapport and the relationship building, Not just in the clinic, but in in personal situations as well. Yeah, that's that's another good advantage. And for clinicians now maybe listening to this podcast, do you recommend oh, there's a kind of minimum hearing loss or a type of persons that you say you should definitely recommend them to use these apps? We certainly recommend it for, more severe hearing losses or complex cases, but it it can help over a very wide range. So I wouldn't say don't offer it to people with more mild hearing losses, cause then it it also works well in appointments where there's a partner, a significant other there. It has a lot of, yeah, wide-ranging benefits. So definitely the people with severe hearing loss were more excited about it and and we could definitely tell that they they gained, Most benefit, but I think there is benefit there for everyone. And and do you think it could also work the other way around that people experience benefits of these apps, that then they are also more open for other assistive technologies? Yes, certainly. 
And even, as I'm talking here and we are reading our captions, it does encourage me to speak more clearly and more slowly and enunciate better to make sure that I'm, Understood. So I think it does yeah, it's also helping me, even though I'm not relying I actually do have audio through the speakers now. Okay. I'm not relying on just the captions to understand. Yeah, it is helping training my voice better and, as you've mentioned, we all have different accents and, little nuances in the way we speak. Yeah, it it it helps that part of the speaker's communication, not just the listener. Yeah. Yeah, you're correct. It gives feedback both to the listener as the talker. And both can learn from it. Expect it it will also improve my English, for instance, by just seeing when there are regular errors. So that's interesting that this feedback can be used for for training, or I can also imagine that for some people it could help to just focus you have one channel of information instead of many different modes and that it could help with people with attention deficits. Jessica, I see you raise your hand. I was I was just going to make a comment about that. It reminded me there's a campaign in the U.K. at the moment, that parents should switch on subtitles on their televisions because research has shown that it helps children to learn to who are learning to read and improve their reading ability. So I thought that was an interesting Yes, that's Yeah, use of the technology. A nice example of something probably not foreseen when developing the technology. And I guess that if there is a widespread use of both the speech recognition systems, but also, for instance, these earbuds I see many young people wearing it and it reduces the stigma also of using hearing aids, for instance, because any everybody has something in their ears. And if everybody gets used to close caption, for instance, now in the Netherlands would look really funny if a person would get close caption because that's only done for people with dialects. And but that so then it would be maybe good if everybody on television would receive close caption also for having those people who need this or are complaining that they cannot follow interviews. And it brings me maybe to another question that one of the main complaints that my patients tell me is that they want to better understand their grandchildren. And a big problem is that the children are moving around all the time, but also, they have voices that are probably less familiar to the systems'cause there's not so much recordings on YouTube of three-year-old or five four or five-year-old children. So Dimitri do you think there are solutions for this?'Cause I would put it high on the priorities for future developments. I do have patents that I received at IBM for speech recognition that learn to recognize babies while baby are crying. Do they have stomach? Or these [inaudible] or something and suggestion to create this data came expert who spent a lot of time with babies. They can interpret when baby are crying. So they could teach speech recognition system. The parents who have their first baby they could rely on this system. Wow. That's that's wonderful. But I remember reading in news that this was developed. Maybe it was developed. I don't know. I also had similar system to recognize dog barks. If you found the system was developed, it would be nice to see. So if I understand well then the system was warning and it could either be a dog barking or a baby crying or? 
It could explain why dog is barking.[Inaudible] dog hungry. Or somebody trying to enter the house. Yeah, so for safety and and sound aware spacial awareness a really important feature. Jessica, I see you have another comment or question. So yeah, so that sounds really I would like a to have had a baby interpreter, I can tell you. But I-I saw there was a paper from Google or maybe from DeepMind about speech recognition with from children. They because they have this YouTube kids app, they actually have a big database of children's speech from them trying to interact with this tablet. And they did try applying a-a-a system trained with with children's speech to try and improve the recognition and then got a-a small increase in accuracy. So I thought that was really interesting that they had had this database. So that's a voice control by toddlers that's collected and and this way they can better command their grandparents in the future. That's right. Good idea. But one objection when we are discussing developing such systems this will prevent toddlers to improve their pronunciation. Yeah. Now they're trying to speak better so parents understand them. But if everybody understands them no matter what they mumble, it will be a big problem for their development. Nicky, I see you want to respond? I see. Yes, sorry. Yeah, I-I just had a thought when you said that I-I think an an application of automated speech recognition could be in speech therapy and training children who have speech deficits or trouble pronouncing certain sounds to help them develop their speech more. So yeah, as as an application to get have a, Automated recognition give the child feedback. And I know even in, my own circumstances we have a Google Home and my son will ask it a question in the morning and Google says very politely,''Sorry, I didn't understand that.'' And so he'll repeat and change the way he's speaking to speak more clearly so that Google understands. And I think it's a good way to do it because as a parent when I was trying to do therapy with my child, You can you try not to, A little bit impatient, but, an automated system and computer, Has all the time in the world and can be quite engaging for a child to interact with. So yeah, another application for speech recognition. Yes, and so there's machine learning, but this is more about the the human learning. And then I wonder How do you get the right direction of the learning? If the machines would adapt too fast to the speech of the children, then they never need to develop anymore'cause they're understood. So that machines somehow needs to encourage or motivate the children to improve their speech while in the same way, yeah. When communicating is important with their relatives, the system should allow it. Any thoughts on -on this, Dimitri? How we could both train humans and machines? Yes. These patents that I wrote at IBM addressed this point. It suggested incremental improvement so it's too far from baby to speak normal speech recognition and understand, but if speech recognition see that small changes for babies needed to speak correctly, it pretends that it does not understand. This way baby is improving for small things Yeah. But still they understood for very difficult thing that they can improve right away. Wow. 
Now that you mentioned this same technology is also I think what we need in fitting cochlear implants,'cause there if you make a too big a step-in change of the patterns that people hear, they have difficulties in adjusting to it and in improving. While when making smaller steps or the the right steps, not too small and not too big, then they're improving better. So yeah that's, of course, completely away from the speech recognition but it's I think the same principle of training or or somehow, right? Providing personalized care or personalized medicine in adapting the treatment to the proper doses. Looking time-wise, I-I don't know how we are we are running out of time, and I think it's really nice that we have touched other topics already and it looks like we can somehow lose loosen up the structure a little a bit that it's more spontaneous than we thought before. But I-I think it's a good point to wrap up and and, yeah, I want to thank you again for this nice conversation. But also, that, yeah, I feel, I learned a lot about this topic, but also on how to use this system. So maybe everybody, if you'd like, could share his or her experience in how you thought this interview and this technology went. So Jessica, what how do you think, or what did you experience? Oh, I don't hear you now. Apart from that slight technical hitch. Yeah, I find the captions really accurate for me. Considering I normally, I-I have to select British English or put on my best Australian accent for it to to understand me. And luckily, I haven't had to do that because I have to live here. That's been really great, and I'm just amazed by the technology with Dimitri's captions, that's such a such a great experience. It's really yeah, I thought it might be not that conducive to having an easy conversation, but it's, yeah, it's been really smooth. So I'm very impressed, Dimitri. Thank you for your kind words. And I really was enjoying talking to you, hearing such fresh point of views. I do agree with you that actually speech system that focuses on some accents like British accent, Indian accent may not work well as general speech recognition. Because for the general recognition, it had many hundred thousands of speech recorded. But maybe you do not have so many hours for accent. And usually YouTube has all kind of accents. We do have in live transcribe also special accents Indian, British, Australian. But I found the general speech recognition works for all accents better. Really nice. Dimitri, and I think it also answers for me already the question about these smaller languages. So it depends on probably on how active a community is on on YouTube for how easy it'll become and the more data, of course, the better,'cause one of the other questions for me was that for instance, I think Bengali is one of the ten major languages, but it's poorly supported digitally. So that could have to do maybe lack of recordings of this language. Exactly. When we started to train for many language, I could see that all other languages, Europe had ten times less recorded videos than English. So for a long time English was the most accurate channel for speech recognition. And are there ways to circumvent this or to further improve it that you'd need less data for these smaller languages? It was in the past. But now we are developing a lot of smart ways that even for languages that we lost all speech we can recreate how people speak. Because some developments for very rare languages to develop speech recognition for them Cool. Wow. 
For technology is developing fast. And we've started to less and less rely on the amount of data but trying to do it smarter. Nice. And I'm I'm thinking we should really have another session on further developing these ideas. Nicky, how was your experience in this session and your temporarily hearing loss? Yes, it's been a really enjoyable discussion. Yeah, thanks for setting this up, and thanks to Dimitri for, yeah, showcasing your relate app as well. It's been, yeah, amazing to see. The the improvements you can have by training on, Additional training on your own voice. It yeah, makes me more excited about what we can do for the future. But yes, generally it's it's been, yeah, a really good discussion and and bringing up, challenges that we're still facing and sharing feedbacks on what we've done already. Yeah, I would, yeah, I'm sure we could talk for much longer it's been, yeah, very good. Then, thank you all for participating in this interview, and, yeah, I wanted to close with the, yeah, the quote that I had prepared,''Be careful about reading health books. You may die of a misprint'' famous words by Mark Twain. And I was a little bit anxious that we maybe would say, get into misunderstanding due to miss or wrong transcriptions. But I must say that, yeah, that both the technology as you as participants went all better than than expected and I felt that we could relax more and more over this conversation. So thanks again for joining this and also for all, yeah, all the preparation, and hope to see you again soon maybe on a different event or, who knows, on a future project. I guess we have discussed discussed already a lot of potential work that could be done. Thanks, Jan-Willem. That is really great and so nice to meet you, Dimitri. Thank you. Bye. Bye. Bye. Okay,