Humans better content cops than AI, but 40x costlier

The Register

When it comes to policing online content for brand safety, a recent study reveals a stark trade-off: human moderators are significantly more accurate than artificial intelligence, but they come at a staggering cost, nearly 40 times that of the most efficient machine learning solutions. This dilemma is particularly acute for marketers striving to prevent their advertisements from appearing alongside problematic material, a practice crucial for protecting a brand’s reputation.

The findings stem from research conducted by experts associated with Zefr, an AI brand protection firm, detailed in their preprint paper, “AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety.” This study, accepted for presentation at the Computer Vision in Advertising and Marketing (CVAM) workshop at the 2025 International Conference on Computer Vision, meticulously analyzed the cost and effectiveness of multimodal large language models (MLLMs) in ensuring brand safety.

Brand safety, as defined by the researchers, is the critical process of preventing inappropriate content from becoming associated with a brand, thereby safeguarding its public image. This differs from consumer-facing content moderation on social media platforms, which often deals with broader policy violations and user-generated content. For advertisers, brand safety means aligning ad placements with specific preferences, avoiding categories ranging from violent or adult-themed material to controversial political discourse. Typically, these efforts combine human oversight with machine learning analysis of imagery, audio, and text. The Zefr study aimed to assess how well cutting-edge MLLMs could perform this complex task and at what financial outlay.

The researchers evaluated six prominent AI models—GPT-4o, GPT-4o-mini, Gemini-1.5-Flash, Gemini-2.0-Flash, Gemini-2.0-Flash-Lite, and Llama-3.2-11B-Vision—pitting their performance against human reviewers. The assessment used a diverse dataset of 1,500 videos, equally divided into categories such as Drugs, Alcohol, and Tobacco; Death, Injury, and Military Conflict; and Kid’s Content. Performance was measured using standard machine learning metrics: precision (the proportion of flagged videos that genuinely violated a category), recall (the proportion of actual violations that were caught), and F1 score (the harmonic mean of the two).
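
For readers who want the arithmetic behind those metrics, the standard definitions are sketched below. The counts and function names are purely illustrative and are not taken from the paper or from Zefr's evaluation code.

```python
# Standard definitions of the metrics cited in the study (illustrative only).
# tp/fp/fn are counts of true positives, false positives, and false negatives
# for a given brand-safety category.

def precision(tp: int, fp: int) -> float:
    # Of everything flagged as a violation, how much really was one?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Of all actual violations, how many were caught?
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p: float, r: float) -> float:
    # Harmonic mean of precision and recall; penalizes imbalance between the two.
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical moderator: catches 95 of 100 true violations (5 missed)
# and wrongly flags 5 clean videos.
p, r = precision(95, 5), recall(95, 5)
print(round(p, 2), round(r, 2), round(f1_score(p, r), 2))  # 0.95 0.95 0.95
```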

The results unequivocally demonstrated human superiority. Human moderators achieved an impressive F1 score of 0.98, indicating near-perfect accuracy with minimal false positives or negatives. In contrast, even the best-performing MLLMs, primarily the Gemini models, topped out at an F1 score of 0.91. Interestingly, the study noted that the more compact versions of these AI models did not suffer a significant drop in performance compared to their larger counterparts.

While MLLMs proved effective in automating content moderation, their limitations became apparent, particularly in nuanced or context-heavy situations. The models frequently faltered due to incorrect associations, a lack of contextual understanding, and language barriers. For instance, a video discussing caffeine addiction in Japanese was erroneously flagged as a drug-related violation by all AI models, a misclassification attributed to flawed associations with the term “addiction” and a general struggle with non-English content.

The financial implications of these performance differences are profound. While human moderation delivered superior accuracy, it came at a price of $974 for the evaluated task. In stark contrast, the most cost-efficient AI model, GPT-4o-mini, completed the same task for a mere $25, followed closely by Gemini-1.5-Flash and Gemini-2.0-Flash-Lite at $28 each. Even the more expensive AI models like GPT-4o ($419) and Llama-3.2-11B-Vision ($459) were significantly cheaper than their human counterparts.
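
A quick bit of arithmetic shows where the headline multiple comes from. The sketch below simply divides the quoted totals; the per-video figures assume each price covers the same 1,500-video dataset, which is our own division rather than a number reported in the paper.

```python
# Back-of-the-envelope arithmetic on the quoted totals (assumption: each figure
# covers the same 1,500-video evaluation).
videos = 1500
human_total = 974
model_totals = {
    "GPT-4o-mini": 25,
    "Gemini-1.5-Flash": 28,
    "Gemini-2.0-Flash-Lite": 28,
    "GPT-4o": 419,
    "Llama-3.2-11B-Vision": 459,
}

print(f"Human review: ${human_total} total, ~${human_total / videos:.2f} per video")
for name, total in model_totals.items():
    # GPT-4o-mini works out to roughly 39x cheaper, which is where the
    # "nearly 40 times" figure comes from.
    print(f"{name}: ${total} total, ~${total / videos:.3f} per video, "
          f"{human_total / total:.0f}x cheaper than human review")
```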

The study’s authors concluded that while compact MLLMs offer a considerably more affordable alternative without a substantial drop in accuracy, human reviewers maintain a clear edge, especially when dealing with complex or subtle classifications. Jon Morra, Zefr’s Chief AI Officer, summarized the findings, stating that while multimodal large language models can handle brand safety video moderation across various media types with surprising accuracy and lower costs, they still fall short in nuanced cases. He emphasized that a hybrid approach, combining human expertise with AI efficiency, represents the most effective and economical path forward for content moderation in the evolving brand safety landscape.