Deepfake Vishing: AI Voice Cloning Powers Hard-to-Detect Scams
Fraudulent calls employing artificial intelligence to clone familiar voices have become a pervasive threat. Increasingly, victims report receiving calls that sound exactly like a grandchild, CEO, or long-time colleague, often conveying an urgent crisis that demands immediate action—be it wiring money, divulging sensitive login credentials, or navigating to a malicious website. This sophisticated form of voice phishing, or “vishing,” leverages the power of deepfake technology to exploit trust and urgency.
Security researchers and government agencies have issued warnings about this escalating threat for several years. In 2023, the Cybersecurity and Infrastructure Security Agency (CISA) noted an “exponential” increase in deepfake and other synthetic media threats. More recently, Google’s Mandiant security division reported that these attacks are executed with “uncanny precision,” crafting far more realistic and convincing phishing schemes than ever before.
Security firm Group-IB recently detailed the fundamental stages of a deepfake vishing attack, highlighting how easily the technique can be replicated at scale and how difficult it is to detect and defend against.
The process typically begins with the collection of voice samples from the person to be impersonated. Samples as brief as three seconds, sourced from public videos, online meetings, or previous voice calls, can suffice. The samples are then fed into AI-based speech synthesis engines such as Google’s Tacotron 2, Microsoft’s VALL-E, or commercial services like ElevenLabs and Resemble AI. These engines give attackers a text-to-speech interface: they type whatever words they choose, and the output is spoken in the target’s voice, complete with that person’s tone and conversational tics. While most of these services prohibit the malicious use of deepfakes, a Consumer Reports investigation in March found that their safeguards can often be bypassed with minimal effort.
An optional but common next step is to spoof the phone number of the person or organization being impersonated, a technique that has been used for decades to add credibility. The attackers then initiate the scam call.
In some instances, the cloned voice delivers a pre-scripted message. More sophisticated attacks generate the faked speech in real time, using voice masking or transformation software. Real-time interactions are significantly more convincing because they let the attacker respond to questions or skepticism from the recipient, making the deception much harder to spot. Real-time impersonation is not yet common in deepfake vishing, but Group-IB expects it to become far more prevalent in the near future as processing speed and model efficiency improve.
In either scenario, the attacker uses the fabricated voice to give the recipient a compelling reason to act immediately: a grandchild needing bail money, a CEO demanding an urgent wire transfer for an overdue expense, or an IT professional instructing an employee to reset a password after a supposed data breach. The goal is to collect cash, credentials, or other assets, and once the requested action is taken, it is often irreversible.
The effectiveness of these attacks was underscored in a red team exercise Mandiant conducted to test defenses and train personnel. The red teamers gathered publicly available voice samples of an executive at the targeted organization and used other publicly accessible information to identify employees who reported to that executive. To make the call more credible, they leveraged a real-world outage of a VPN service as the urgent pretext. During the simulated attack, the victim, trusting the familiar voice, clicked past security prompts from both Microsoft Edge and Windows Defender SmartScreen and unknowingly downloaded and executed a pre-prepared malicious payload on their workstation. Mandiant concluded that the successful detonation of the payload “showcase[d] the alarming ease with which AI voice spoofing can facilitate the breach of an organization.”
Fortunately, simple precautions can significantly reduce the risk of falling victim to such scams. One effective strategy is for both parties to agree in advance on a randomly chosen secret word or phrase that the caller must provide before the recipient complies with any request. Another critical step is to end the call and independently call the person back on a known, verified number. Ideally, both precautions should be followed. These safeguards rely on the recipient staying calm and alert, however, which is a significant challenge when confronted with what appears to be a legitimate and urgent crisis, and harder still when the recipient is tired, stressed, or otherwise not at their best. For these reasons, vishing attacks, whether enhanced by AI or not, are likely to remain a persistent threat.