In a study accepted to the 2020 International Conference on Machine Learning last week, researchers at the Chalmers University of Technology and the RISE Research Institutes of Sweden propose a privacy-preserving technique that learns to obfuscate attributes like gender in speech data. They use a model that’s trained to filter sensitive information out of recordings and then generate new, private information that is independent of what was filtered out, ensuring sensitive information remains hidden without sacrificing realism and utility.
Maintaining privacy without giving up voice assistants altogether is a challenging task, given that state-of-the-art AI techniques have been used to infer attributes like intention, gender, emotional state, and identity from timbre, pitch, and speaker style. Recent reporting revealed that accidental voice assistant activations exposed workers to private conversations; the risk is such that law firms including Mishcon de Reya have advised staff to mute smart speakers when they talk about client matters at home. Google Assistant, Siri, Cortana, and other major voice recognition platforms allow the deletion of recorded data, but this requires some effort on users’ part, and in some cases a substantial amount.
The researchers’ solution, called PCMelGAN, employs a generative adversarial network (GAN), a two-part AI model consisting of a generator that creates samples and a discriminator that attempts to differentiate between the generated samples and real-world samples. It maps speech recordings to mel spectrograms, or representations of the spectrum of frequencies of the audio signal as it varies over time, and passes them through a filter that removes sensitive information and a generator that adds synthetic information in its place. PCMelGAN then inverts the mel spectrogram output into audio in the form of a raw waveform.
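To make the mel spectrogram step concrete, here is a minimal, dependency-free Python sketch of how such a representation can be computed from a raw waveform. It is a generic illustration and not PCMelGAN’s actual preprocessing: the function names, the HTK-style mel formula, and the parameter values (`n_fft`, `hop`, `n_mels`) are all assumptions chosen for readability.

```python
import math

def hz_to_mel(hz):
    # HTK-style mel formula; one common convention, assumed here.
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        spectrum.append(re * re + im * im)
    return spectrum

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    mel_max = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        # Weight rises linearly from lo to mid, then falls from mid to hi.
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank

def mel_spectrogram(signal, sample_rate, n_fft=256, hop=128, n_mels=20):
    """Return frames[t][m]: energy in mel band m at time step t."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        power = power_spectrum(signal[start:start + n_fft])
        frames.append([sum(w * p for w, p in zip(row, power)) for row in bank])
    return frames
```

Calling `mel_spectrogram(samples, 8000)` on a list of audio samples returns one row of 20 band energies per time step. A production system would typically use an FFT library and log-scaled energies instead of the naive transform shown here.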
In experiments, the researchers trained PCMelGAN on 10,000 samples from the open source AudioMNIST data set, which comprises 30,000 audio recordings of the digits zero through nine spoken in the English language. They measured privacy over five runs by checking whether a classifier could predict a speaker’s original gender with better than 50% accuracy, both from the spectrograms and from the raw audio.
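The privacy check described above boils down to comparing an adversarial classifier’s accuracy against chance. The sketch below shows one way to express that comparison in Python. It is an illustration only: the function names and the `margin` threshold are hypothetical, and the article does not specify the researchers’ exact evaluation protocol.

```python
from statistics import mean

CHANCE = 0.5  # accuracy of random guessing on a balanced binary attribute

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def mean_accuracy(runs):
    """Average accuracy over several runs, each a (predicted, actual) pair."""
    return mean(accuracy(p, a) for p, a in runs)

def leaks_attribute(runs, margin=0.05):
    """True if the classifier beats chance by more than `margin`.

    `margin` is a hypothetical tolerance for noise, not a value from the study.
    """
    return mean_accuracy(runs) > CHANCE + margin
```

If the classifier’s mean accuracy on the transformed recordings stays near 0.5, the sensitive attribute is effectively hidden from that classifier; an accuracy well above 0.5 means the attribute is still leaking.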
[Audio sample: a recording of someone saying “four,” followed by PCMelGAN’s output]
[Audio sample: a recording of someone saying “six,” followed by PCMelGAN’s output]
According to the researchers, the results show PCMelGAN makes it empirically difficult for adversaries to, for example, infer the gender of the speaker while retaining qualities including intonation and content. “The proposed method can successfully obfuscate sensitive attributes in speech data and generates realistic speech independent of the sensitive input attribute. Our results for censoring the gender attribute on the AudioMNIST dataset, demonstrate that the method can maintain a high level of utility,” they wrote. “As more data is collected in various settings across organizations, companies, and countries, there has been an increase in the demand of user privacy.”