I don’t usually talk about the work I do (in my free time, 😅) but if you wanna nerd out for a bit, here are some of MY OWN THOUGHTS about some PUBLICLY-AVAILABLE RESEARCH that…just happens to have my name on it.
The specific topic area is speech quality measurement. What does that mean? Well, you’re probably familiar with the following statements:
“[voice service provider] sux0rz/always sounds good.”
“[telephone company] never/always has good service.”
“My [smartphone device] drops calls here.”
“This building has better/worse service over here.”
You’ve said these things, I’ve said these things, we all scream for [reliable voice communication “system”].
Speech quality measurement is the practice of quantifying how you perceive the quality of sound that’s produced by a communication device/service when it enters your earhole.
Think about the last phone call or video conference you had. Was there a time when you couldn’t understand the other person? A time when it sounded like the other person was talking into a Pringles tube lined with dirty socks? How did you feel about that? How did you, a human, take all the annoyances, missed words, robot noises, muffled sentences, and background noise and convert it into an overall opinion? Turns out there’s an enormous body of research dedicated to measuring and predicting your opinion when those things happen.
The people interested in predicting your opinion can count dropped calls, measure a radio receiver’s signal-to-noise ratio (SNR), and log bandwidth usage. Some measurements have predictable impacts on what you think about [product] or even [location]. Nobody enjoys dropped calls, but I’m guessing you don’t know what SNR your phone’s radio managed for your last 5 phone calls. How low must that SNR be for you to give up on a call? In many cases SNR constrains the available network bandwidth to a certain bit rate. But do you know the bit rate threshold that causes you to hang up and try to get a better connection?
It’s true that people can learn or predict relationships between these physical/software metrics and audio quality, but linking them to human behavior or opinion is Tricky. Because each human is Different. And human behavior is not Deterministic. Remember the two D’s of humans. Er, DnD. Different and non-Deterministic, DanD. Humans are DanD, I always say. At some point, though, humans will vote with their wallet (and/or attention 😢) and stop using a service that sux0rz. If [service provider] wants to stay in business, [service provider] needs to understand if their “system” does indeed sux0rz.
Human-Based Measurement Techniques
In order to figure that out, people can, and do, ask you for your opinion on the quality of [simulated voice call]. They ask, in a strictly controlled laboratory environment: how would you rate the quality of [simulated voice call] on a scale of 1 to 5? Pose that question to dozens or hundreds of people and you can come up with a good estimate of how most people will react to [simulated voice call].
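If you’re curious what happens to all those votes, here’s a minimal sketch of the usual bookkeeping: average the ratings for a given [simulated voice call] and put a rough confidence interval around that average. The ratings, the listener count, and the normal-approximation interval below are all illustrative assumptions, not anybody’s official procedure.

```python
# Minimal sketch: summarizing listener votes for one [simulated voice call].
# Assumes ratings on the standard 1-to-5 opinion scale; the mean plus a
# normal-approximation confidence interval is a common convention, not the
# only one.
import math

ratings = [4, 3, 4, 5, 3, 4, 2, 4, 3, 4]  # hypothetical votes from 10 listeners

n = len(ratings)
mean_opinion = sum(ratings) / n
variance = sum((r - mean_opinion) ** 2 for r in ratings) / (n - 1)  # sample variance
ci_95 = 1.96 * math.sqrt(variance / n)  # rough 95% interval on the mean

print(f"mean opinion: {mean_opinion:.2f} ± {ci_95:.2f} (n={n})")
```

More listeners shrink that interval, which is exactly why labs recruit dozens or hundreds of them.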
Strictly controlled environments are expensive and so is getting you to come to them. Besides that, you make phone calls on the streets, in your car, at the clurb, and on the john. Lab tests are informative, but not representative. So some people will ask for your opinion in a “real-world” situation, like right after you call your therapist. Maybe you’ll answer, and maybe you won’t. But if they can gather enough opinions, they can overcome the lack of control and learn something about how well their “system” is performing. This approach is less expensive than a lab test, but it has lots of caveats. Your opinion can be affected by things other than the aural quality of the call, such as your therapist’s unhealthy interest in your naked math test dreams. It is possible to understand and handle these caveats, but doing so adds cost and uncertainty.
Computer-Based Measurement Techniques
Maybe by now you’re asking if we can teach computers to predict your opinion of a call. Well, yes. Yes we can. There are many ways to accomplish this; here are three:
- Analyzing Metadata: making predictions from call and “system” metadata
- Comparing to Reference: analyzing the audio put into the “system” and comparing it to the audio output from the “system”
- Output Analysis: analyzing solely the output from the “system” that goes into your earhole.
Over the years, people have had success with the first two approaches, but each has caveats. The third has been a sort of holy grail—at least until the last few years.
Using call metadata (SNR, bit rate, call length, or other parameters) is powerful, but the “system” is complicated—in some cases it comprises two different handsets, two separate radio links, two separate humans, and a network that connects them, all with their own time-varying characteristics. Metadata is not always rich enough to characterize the interactions among all these parts.
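To make the metadata idea concrete, here’s a toy sketch of what “predict an opinion from metadata” could look like. Every feature, weight, and clamp in it is invented for illustration; real metadata-driven estimators are fit to piles of human ratings and are far less naive.

```python
# Toy sketch only: an "opinion from metadata" model. The features, weights,
# and clamps are all invented for illustration, not fit to any real data.
def predict_opinion(snr_db: float, bit_rate_kbps: float, packet_loss_pct: float) -> float:
    """Map a few pieces of call metadata to a 1-to-5 opinion-style score."""
    score = 1.0
    score += 2.0 * min(snr_db, 30.0) / 30.0          # more SNR helps, up to a point
    score += 1.5 * min(bit_rate_kbps, 64.0) / 64.0   # more bit rate helps, up to a point
    score -= 0.2 * packet_loss_pct                   # packet loss hurts (toy assumption)
    return max(1.0, min(5.0, score))

print(predict_opinion(snr_db=25.0, bit_rate_kbps=24.0, packet_loss_pct=1.0))
```

The point is only the shape of the problem: a handful of numbers about the call go in, an opinion-flavored number comes out.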
Comparing to Reference
Comparing the output audio to the input audio has significant advantages over a metadata-only approach. One advantage has more to do with Claude Shannon than you’re comfortable with and another is that this approach fully captures the dynamics of the “system”. But [voice service provider] doesn’t have access to the unfettered input! They are not following you on the streets, riding shotgun in your car, clinging to your face at the clurb, or recording your poops. By the time [voice service provider] receives the “input” it has already traversed almost half of the “system.” This effectively limits use of this approach to the laboratory. Still valuable! Until recently, this has been the most accurate approach. But it’s not deployable.
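For flavor, here’s a bare-bones sketch of the compare-output-to-input idea: line up the clean audio that went into the “system” with the degraded audio that came out, and compute a crude similarity number. Real full-reference estimators do perceptual modeling, time alignment, level alignment, and a lot more; everything here (names, signals, numbers) is a placeholder.

```python
# Bare-bones sketch of a full-reference comparison: a signal-to-distortion
# ratio between the audio that went in and the audio that came out. Real
# full-reference estimators are perceptual and far more sophisticated.
import numpy as np

def naive_full_reference_score(reference: np.ndarray, degraded: np.ndarray) -> float:
    """Signal-to-distortion ratio in dB, assuming time-aligned signals."""
    n = min(len(reference), len(degraded))
    ref, deg = reference[:n], degraded[:n]
    distortion = ref - deg
    return 10.0 * np.log10(np.sum(ref ** 2) / (np.sum(distortion ** 2) + 1e-12))

# Toy usage: a sine wave "input" and a noisier "output".
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
received = clean + 0.05 * rng.standard_normal(clean.shape)
print(f"{naive_full_reference_score(clean, received):.1f} dB")
```

Notice what it needs: both signals. That requirement is exactly what keeps this approach in the lab.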
Output Analysis
This method is essentially how humans form opinions of distorted audio. You have an idea about what it sounds like when Bob from work is standing next to you and droning on about his exploits during college: it sounds good, despite the content. You also know what it sounds like when he’s calling you from a crowded underground grotto: bad. But how do you tell a computer how human speech should or shouldn’t sound? It’s haaaaaarrrrrrd!
If we could do that though, this approach could theoretically be deployed anywhere in the “system”. The ability to know, for example, that your opinion of the audio would be high right up until it arrives at the cell tower down the road is a great debugging tool. It could help [service provider] pinpoint problem areas in the “system.”
So is it possible to convey that information to a computer? Well, a colleague and I have been working on this for a while, and we came up with one method.
This particular method is…pretty, pretty good. It’s been tested on tons of data: our dataset includes 13 languages and 1,230 talkers, and contains more than 330 hours of speech after augmentation. Our model performs well: predictions correlate strongly with the truth data (r > 0.91), and root-mean-squared error (RMSE) is around 9% of the target value scale. Our model contains a relatively small number of parameters. Depending on how you train it, the model can predict quality or intelligibility. Or some other thing! As long as you have the data. Not gonna lie—it’s one of the best, if not the best, out there right now.
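If “waveform goes in, opinion comes out” sounds abstract, here’s a toy no-reference estimator in the same spirit. To be clear, this is not the published architecture; the layer sizes, segment length, and sample rate are placeholders. The only point is that nothing but the output audio goes in.

```python
# Toy no-reference estimator, NOT the published architecture: a tiny 1-D
# convolutional network that maps a raw waveform segment straight to one
# score. Layer sizes, segment length, and sample rate are placeholders.
import torch
import torch.nn as nn

class TinyWaveformScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=32, stride=8), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),     # collapse the time axis
        )
        self.head = nn.Linear(32, 1)     # one number out: the predicted score

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples) of raw audio; no reference signal needed
        x = self.features(waveform).squeeze(-1)
        return self.head(x)

model = TinyWaveformScorer()
segment = torch.randn(1, 1, 48000)  # stand-in for 3 s of 16 kHz audio
print(model(segment).item())
```

Train something like this against enough human ratings (or intelligibility scores, or some other target) and it learns to map raw audio to that target directly, no reference signal required.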
This method doesn’t understand anything about the words present in a given audio signal, or who is saying them, or what language they’re speaking. The very properties that make this method robust to widely varied input also mean it doesn’t undermine your privacy.
All that said, I’m proud that the reference implementation is available for free. It includes four pre-trained models. Anybody can use it—the rising tide lifts your mom or whatever. The work was published at ICASSP 2020, and as of this writing, it’s still possible to register and watch the talk.
Anyway, just a little about what I, MYSELF, ONE HUMAN PERSON, REPRESENTING HIS OWN SELF AND DEFINITELY NOT ANY GOVERNMENT ENTITIES IN THIS POST have been thinking IN EXTREMELY VAGUE TERMS about lately.