During the Cold War, the great technological risks were nuclear, chemical, and biological warfare. Since the emergence of deepfake technology, we can add another concern to the list of ways our own inventions could turn against us: today, people are losing control of their likeness and voice.
As we've seen in previous blogs, video deepfakes mean you can't trust everything you see digitally. Now audio deepfakes mean you can no longer trust your ears, either. Is that really your child on the phone asking for a password? Was that really your CEO who left you a voice message telling you to transfer millions of dollars to a bank account?
Let’s say you are a bank manager and you receive a phone call from a company director requesting a transfer of $35 million for an acquisition. You duly make the transfer believing everything to be legitimate, only to realize later that the voice on the call was not the director’s at all, but a synthetic voice created by criminals using artificial intelligence.
This case is not from a Hollywood movie: it happened in the UAE in early 2020. The scammers used deepfake technology to impersonate the director and swindle the bank manager out of a massive amount of money.
Most of us have already seen a video deepfake, in which deep learning algorithms replace one person’s likeness with another’s or create people who do not exist. Video deepfakes are now hyper-realistic, and audio is next. An audio deepfake is a ‘cloned’ voice, made using artificial intelligence, that can be indistinguishable from the real person’s voice. With a cloned voice, one can generate synthetic audio that makes the target say things he or she never said.
With the ever-increasing accessibility of deepfake technology, convincing deepfake audio can be produced even with open-source software. Until recently, cloning a voice required hours of recorded speech to build a dataset that could serve as training input for a new voice model. That is now a thing of the past: a GitHub project offers a Real-Time Voice Cloning toolbox that lets anyone clone a voice from as little as five seconds of sample audio.
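To make that claim concrete, and to help defenders understand what they are up against, here is a minimal sketch of how that toolbox’s three-stage pipeline (speaker encoder, text-to-spectrogram synthesizer, vocoder) is typically driven. It follows the repository’s own demo script from memory; module paths, checkpoint locations, and exact function names vary between versions, so treat it as illustrative rather than definitive.

```python
from pathlib import Path

# Module layout as in the Real-Time Voice Cloning repository (may differ per version).
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Checkpoint paths are placeholders; they depend on which pretrained models you downloaded.
encoder.load_model(Path("saved_models/encoder.pt"))
synthesizer = Synthesizer(Path("saved_models/synthesizer.pt"))
vocoder.load_model(Path("saved_models/vocoder.pt"))

# 1. Embed a few seconds of reference speech into a fixed-size speaker vector.
reference = encoder.preprocess_wav("reference_5_seconds.wav")
embedding = encoder.embed_utterance(reference)

# 2. Synthesize a mel spectrogram of arbitrary text in the cloned voice.
specs = synthesizer.synthesize_spectrograms(["A sentence the speaker never said."], [embedding])

# 3. Turn the spectrogram back into an audible waveform.
generated_wav = vocoder.infer_waveform(specs[0])
```

The point is not the exact API but the workflow: a short reference clip is compressed into a speaker embedding, and from that moment on any text can be rendered in that voice.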
Deepfakes leverage the power of machine learning and artificial intelligence to manipulate or generate audio content with the intent to mislead. The quality of deepfake technology is improving every day, and to the human ear cloned voices are becoming indistinguishable from real ones. Meanwhile, as all kinds of services are digitized, government agencies and financial institutions increasingly use voice biometrics for identity verification, funds transfers, and online banking, among other things.
For a criminal, cloning someone’s voice and abusing it in one of these ways is a straightforward process:
1. Gather voice samples of the person to impersonate.
2. Use (open-source) algorithms to clone the voice.
3. Choose the best opportunity to exploit.
4. Use the fake voice to ask a victim to act, for example to approve a transfer.
The scariest thing is how easy it has become to get hold of someone’s voice. Since the COVID-19 pandemic, we have relied heavily on telecommunication tools such as Zoom and Microsoft Teams, on top of social media and phone calls, which have been around far longer. These sources provide far more than the five-second sample a criminal needs, and criminals have proven creative when it comes to putting deepfakes to malicious use.
Today, businesses and governmental institutions are at risk of deepfake attacks, and the potential impact is substantial. Companies and governments should therefore draw up policies, set up internal processes, and educate their people, so that measures are in place before the attacks arrive.
DuckDuckGoose is investing in AI to detect, flag, and warn about audio deepfakes. We do this in two ways. First, we check the audio stream for irregularities. These “artefacts” of the voice cloning process can be subtle, but by training the system on both synthesized and authentic audio fragments, a well-trained model learns to tell them apart.
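As a rough illustration of this idea (not our actual model, just a generic sketch under common assumptions): a clip is converted into a log-mel spectrogram, where vocoder artefacts tend to show up, and a small neural network scores how likely the clip is to be synthetic.

```python
import torch
import torch.nn as nn
import torchaudio

# Log-mel spectrogram front-end: subtle cloning artefacts (over-smoothed
# high frequencies, unnatural transitions) become visible in this view.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB()

def features(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (channels, samples) at 16 kHz -> (channels, n_mels, frames)
    return to_db(mel(waveform))

# A deliberately small CNN that maps a spectrogram to one logit:
# "how synthetic does this clip sound?"
classifier = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 1),
)

def detect(waveform: torch.Tensor) -> float:
    """Return an estimated probability that the clip is a voice clone."""
    with torch.no_grad():
        spec = features(waveform).unsqueeze(0)  # add batch dimension
        return torch.sigmoid(classifier(spec)).item()
```

In practice the clip would be resampled to the expected rate first, and the network trained on labelled pairs of authentic and synthesized speech with a binary cross-entropy loss; production systems are considerably larger and trained on far more varied data.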
Secondly, we check whether the audio track is consistent with the video track. By looking for mismatches between what is heard and what is seen, such as lip movements that do not line up with the speech, we can spot whether the audio actually belongs to the video.
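A toy version of that idea, under the assumption that a separate face-landmark tool already provides a per-frame “mouth openness” signal, is to correlate the loudness of the speech with the visible lip motion; a low correlation is a hint that the audio has been swapped. Again, this is only a sketch, not our production detector.

```python
import numpy as np
import librosa

def av_consistency(audio_path: str, mouth_openness: np.ndarray, video_fps: float) -> float:
    """Correlate speech loudness with per-frame mouth openness from the video.

    `mouth_openness` is assumed to come from a face-landmark tool (one value
    per video frame). Dubbed or cloned audio tends to drift out of step with
    the visible lip motion, pushing this correlation down.
    """
    y, sr = librosa.load(audio_path, sr=16000)
    # RMS energy per audio frame, then resampled onto the video frame timeline.
    rms = librosa.feature.rms(y=y, hop_length=512)[0]
    t_audio = np.arange(len(rms)) * 512 / sr
    t_video = np.arange(len(mouth_openness)) / video_fps
    rms_on_video = np.interp(t_video, t_audio, rms)
    # Close to 1.0: audio and lips move together; near 0: likely mismatched.
    return float(np.corrcoef(rms_on_video, mouth_openness)[0, 1])
```

Want to know more? Click here to contact us!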