MIT develops new open-source AI text classifier evaluation tool
As large language models increasingly permeate our daily lives, the imperative to rigorously test and ensure their reliability has never been greater. Whether discerning if a movie review is a glowing endorsement or a scathing critique, categorizing a news story as business or technology, or monitoring an online chatbot to prevent it from dispensing unauthorized financial advice or medical misinformation, these automated evaluations are now predominantly handled by sophisticated algorithms known as text classifiers. The critical question, however, remains: how can we truly ascertain the accuracy of these classifications?
A team at MIT’s Laboratory for Information and Decision Systems (LIDS) has recently unveiled an innovative approach designed not only to measure the efficacy of these classifiers but also to provide a clear pathway for enhancing their precision. The new evaluation and remediation software, developed by principal research scientist Kalyan Veeramachaneni, along with his students Lei Xu and Sarah Alnegheimish and two other collaborators, is being made freely available for download, offering a significant contribution to the broader AI community.
Traditionally, testing classification systems involves creating “synthetic examples”: sentences crafted to resemble those already classified. For instance, researchers might take a sentence previously labeled as a positive review and subtly alter a word or two, aiming to trick the classifier into reading it as negative even though the core meaning is unchanged. Similarly, a sentence flagged as misinformation might be lightly tweaked so that it is misclassified as accurate. These deceptive examples, known as adversarial examples, expose vulnerabilities in the classifiers. Various methods have been tried for uncovering these weaknesses, but existing techniques miss many of the adversarial examples they ought to catch.
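To make the idea concrete, here is a toy sketch of a single-word perturbation test. The keyword-counting classifier and the hand-picked synonym list are invented for illustration and are not the MIT team’s pipeline; the point is only that a meaning-preserving one-word swap can flip a brittle classifier’s label.

```python
# Toy sketch: probe a brittle keyword classifier with single-word synonym swaps.
# Everything here (the classifier, the synonym list) is illustrative only.

def classify(sentence: str) -> str:
    """Stand-in classifier: counts a few positive and negative cue words.
    It does not know synonyms, which is exactly what a single-word swap exploits."""
    positive = {"great", "wonderful"}
    negative = {"dull", "boring"}
    words = set(sentence.lower().split())
    score = len(words & positive) - len(words & negative)
    return "positive" if score >= 0 else "negative"

# Rough, meaning-preserving single-word substitutions (hand-picked for the sketch).
SYNONYMS = {"wonderful": ["delightful", "charming"], "dull": ["tedious"]}

def single_word_variants(sentence: str):
    """Yield sentences that differ from the original by exactly one word."""
    words = sentence.split()
    for i, word in enumerate(words):
        for alt in SYNONYMS.get(word.lower(), []):
            yield " ".join(words[:i] + [alt] + words[i + 1:])

original = "a wonderful film despite a dull subplot"
base_label = classify(original)  # "positive"
for variant in single_word_variants(original):
    if classify(variant) != base_label:
        print("adversarial example found:", variant)
```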
The demand for such evaluation tools is growing, particularly as companies increasingly deploy chatbots for diverse purposes, striving to ensure their responses are appropriate and safe. A bank, for example, might use a chatbot for routine customer inquiries, such as checking account balances, but must rigorously ensure it never inadvertently provides financial advice, which could expose the institution to liability. As Veeramachaneni explains, “Before showing the chatbot’s response to the end user, they want to use the text classifier to detect whether it’s giving financial advice or not.” This necessitates robust testing of the classifier itself.
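As a concrete, heavily simplified picture of that screening step, the sketch below gates a chatbot’s draft reply behind a classifier before it reaches the user. Both chatbot_reply and is_financial_advice are hypothetical placeholders; in practice the latter would be a trained text classifier rather than a keyword check.

```python
# Minimal sketch of gating chatbot output behind a text classifier.
# `chatbot_reply` and `is_financial_advice` are hypothetical stand-ins.

def chatbot_reply(user_message: str) -> str:
    """Placeholder for a real chatbot backend."""
    return "Your current balance is $1,240.56."

def is_financial_advice(text: str) -> bool:
    """Placeholder for a trained classifier; here just a keyword check."""
    advice_cues = ("you should invest", "buy", "sell", "recommend")
    return any(cue in text.lower() for cue in advice_cues)

def safe_reply(user_message: str) -> str:
    """Run the classifier on the draft response before the user ever sees it."""
    draft = chatbot_reply(user_message)
    if is_financial_advice(draft):
        return "I can help with account questions, but I can't give financial advice."
    return draft

print(safe_reply("What's my balance?"))  # passes through, since it isn't advice
```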
The MIT team’s method leverages the very technology it aims to improve: large language models (LLMs). When a candidate adversarial sentence is generated, a slightly modified version of the original, a second LLM is used to check whether the two sentences still mean the same thing. If the LLM judges them semantically equivalent, yet the classifier assigns them different labels, then, as Veeramachaneni notes, “that is a sentence that is adversarial — it can fool the classifier.” Intriguingly, the researchers found that most of these successful adversarial attacks involved changing just a single word, a subtlety that often went unnoticed by those using LLMs to generate the alternative sentences.
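That acceptance test is simple to state in code. The sketch below assumes two callables that are not named in the article: classifier_label, the classifier under evaluation, and llm_same_meaning, an LLM-based judge of semantic equivalence; both names are hypothetical, and any classifier and any LLM judge could be plugged in.

```python
from typing import Callable

def is_adversarial(
    original: str,
    candidate: str,
    classifier_label: Callable[[str], str],
    llm_same_meaning: Callable[[str, str], bool],
) -> bool:
    """A candidate counts as adversarial only if the LLM judges the two
    sentences semantically equivalent while the classifier labels them
    differently."""
    same_meaning = llm_same_meaning(original, candidate)
    labels_differ = classifier_label(original) != classifier_label(candidate)
    return same_meaning and labels_differ
```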
Through extensive analysis of thousands of examples, again utilizing LLMs, the team found that certain specific words wielded disproportionate influence in altering classifications. This crucial insight allows for a much more targeted approach to testing a classifier’s accuracy, focusing on a small subset of words that consistently make the most significant difference. Lei Xu, a recent LIDS graduate whose doctoral thesis contributed significantly to this analysis, “used a lot of interesting estimation techniques to figure out what are the most powerful words that can change the overall classification, that can fool the classifier,” Veeramachaneni elaborated. This approach dramatically streamlines the computational burden of generating adversarial examples.
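The article does not spell out the estimation techniques Xu used, so the sketch below is only a crude, brute-force proxy for the underlying idea: score candidate words by how often substituting them into a batch of sentences flips the classifier’s label, then rank them. Any classifier callable, such as the toy one in the first sketch, could be passed in.

```python
from collections import Counter

def rank_influential_words(sentences, candidate_words, classify):
    """Rank candidate words by how many label flips they cause when
    substituted, one position at a time, into each sentence."""
    flips = Counter()
    for sentence in sentences:
        base = classify(sentence)
        words = sentence.split()
        for i in range(len(words)):
            for cand in candidate_words:
                variant = " ".join(words[:i] + [cand] + words[i + 1:])
                if classify(variant) != base:
                    flips[cand] += 1
    return [word for word, _ in flips.most_common()]
```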
Building on this, the system further employs LLMs to identify words closely related to these “powerful” terms, creating a comprehensive ranking based on their influence on classification outcomes. Once identified, these adversarial sentences can then be used to retrain the classifier, significantly enhancing its robustness against such errors.
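The retraining step amounts to standard adversarial data augmentation: each adversarial sentence keeps the label of the sentence it was derived from and is added to the training set. The scikit-learn pipeline below is a generic stand-in for whatever classifier is being hardened, not the team’s released code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def retrain_with_adversarial(train_texts, train_labels, adversarial_pairs):
    """adversarial_pairs: list of (adversarial_sentence, original_label) tuples."""
    texts = list(train_texts) + [sentence for sentence, _ in adversarial_pairs]
    labels = list(train_labels) + [label for _, label in adversarial_pairs]
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model
```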
The implications of making classifiers more accurate extend far beyond simple categorization of news articles or movie reviews. Increasingly, these systems are deployed in high-stakes environments where misclassification can have severe consequences. This includes preventing the inadvertent release of sensitive medical, financial, or security information, guiding critical research in fields like biomedicine, or identifying and blocking hate speech and misinformation.
As a direct result of this research, the team has introduced a new metric, dubbed “p,” which quantifies a classifier’s resilience against single-word attacks. Recognizing the critical importance of mitigating such misclassifications, the research team has made their tools openly accessible. The package comprises two key components: SP-Attack, which generates adversarial sentences to test classifiers across various applications, and SP-Defense, designed to improve classifier robustness by using these adversarial examples for model retraining.
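The formal definition of p is not given in the article, so the following is only one illustrative reading of a single-word robustness score: the fraction of test sentences whose label cannot be flipped by any of the meaning-preserving single-word variants considered, with higher values meaning more resilience. It reuses the helper names from the first sketch and should not be taken as the published metric.

```python
def single_word_resilience(sentences, classify, single_word_variants) -> float:
    """Fraction of sentences whose label survives every single-word variant."""
    unflipped = 0
    for sentence in sentences:
        base = classify(sentence)
        if all(classify(v) == base for v in single_word_variants(sentence)):
            unflipped += 1
    return unflipped / len(sentences) if sentences else 1.0
```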
In some tests, where competing methods allowed adversarial attacks to achieve a 66 percent success rate, the MIT team’s system nearly halved this, cutting the attack success rate to 33.7 percent. While other applications showed a more modest 2 percent improvement, even such seemingly small gains are immensely significant when considering the billions of interactions these systems handle daily, where even a slight percentage can impact millions of transactions. The team’s findings were published on July 7 in the journal Expert Systems, in a paper authored by Xu, Veeramachaneni, and Alnegheimish of LIDS, alongside Laure Berti-Equille at IRD in Marseille, France, and Alfredo Cuesta-Infante at the Universidad Rey Juan Carlos, in Spain.