AI Model Personalities Impact Code Quality & Security
Generative AI models designed to assist with coding exhibit distinct “personalities” that profoundly influence the quality, security, and maintainability of the code they produce. While these models share fundamental strengths and weaknesses, the nuances of their individual coding styles lead to varied outcomes, including the generation of problematic “code smells”: patterns in code that signal deeper structural and maintainability problems.
Understanding these unique characteristics is crucial for software developers leveraging large language models (LLMs) for assistance, according to Sonar, a code quality firm. Tariq Shaukat, CEO of Sonar, emphasized the need to look beyond raw performance to grasp the full spectrum of a model’s capabilities. He stated that recognizing each model’s strengths and propensity for errors is essential for safe and secure deployment.
Sonar recently put five prominent LLMs to the test: Anthropic’s Claude Sonnet 4 and Claude 3.7 Sonnet; OpenAI’s GPT-4o; Meta’s Llama 3.2 90B; and the open-source OpenCoder-8B. The evaluation comprised 4,442 Java programming tasks spanning a range of challenges designed to assess coding proficiency. The findings, published in a report titled “The coding personalities of leading LLMs,” largely align with broader industry sentiment: generative AI models are valuable tools when paired with rigorous human oversight and review.
The tested LLMs demonstrated varying levels of competence across standard benchmarks. For instance, on the HumanEval benchmark, Claude Sonnet 4 achieved an impressive 95.57 percent pass rate, while Llama 3.2 90B lagged at 61.64 percent. Claude’s strong performance suggests a high capability for generating valid, executable code. Furthermore, models like Claude 3.7 Sonnet (72.46 percent correct solutions) and GPT-4o (69.67 percent) showed technical prowess in tasks requiring the application of algorithms and data structures, and they proved capable of transferring concepts across different programming languages.
Despite these strengths, the report highlighted a critical shared flaw: a pervasive lack of security awareness. Every evaluated LLM produced vulnerabilities at a disturbingly high rate, frequently at the highest possible severity. On Sonar’s scale of “Blocker,” “Critical,” “Major,” and “Minor,” the vulnerabilities generated were predominantly “Blocker” level, meaning they could cause an application to crash. For Llama 3.2 90B, over 70 percent of its vulnerabilities were rated “Blocker”; the figure was 62.5 percent for GPT-4o and nearly 60 percent for Claude Sonnet 4.
The most common flaws included path traversal and injection vulnerabilities—accounting for 34.04 percent of issues in Claude Sonnet 4’s generated code—followed by hard-coded credentials, cryptography misconfigurations, and XML external entity injection. This struggle with injection flaws stems from the models’ inability to effectively trace the flow of untrusted data to sensitive parts of the code, a complex non-local analysis beyond their typical processing scope. Additionally, they often generate hard-coded secrets because such flaws are present in their training data. Lacking a comprehensive understanding of software engineering norms, these models also frequently neglected to close file streams, leading to resource leaks, and exhibited a bias towards messy, complex, and hard-to-maintain code, characterized by numerous “code smells.”
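To make those flaw classes concrete, here is a minimal Java sketch (Java being the language of the benchmark tasks). It is illustrative only and not drawn from Sonar’s report: the directory, file names, and method names are hypothetical. It pairs a path traversal bug and a leaked file stream, two of the patterns described above, with safer equivalents.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo of two flaw classes named above: path traversal
// and an unclosed file stream. Not taken from Sonar's report.
public class FlawedPatternsDemo {

    private static final Path BASE_DIR = Path.of("/var/app/uploads");

    // FLAWED: resolves untrusted input directly against the base directory.
    // A name like "../../../etc/passwd" escapes BASE_DIR entirely.
    static byte[] readUserFileUnsafe(String userSuppliedName) throws IOException {
        Path target = BASE_DIR.resolve(userSuppliedName);
        return Files.readAllBytes(target); // no containment check
    }

    // SAFER: normalize the resolved path and confirm it is still
    // inside BASE_DIR before reading anything.
    static byte[] readUserFileSafe(String userSuppliedName) throws IOException {
        Path target = BASE_DIR.resolve(userSuppliedName).normalize();
        if (!target.startsWith(BASE_DIR)) {
            throw new SecurityException("path escapes base directory");
        }
        return Files.readAllBytes(target);
    }

    // FLAWED: if readAllBytes() throws, close() is never reached and the
    // file handle leaks -- the unclosed-stream issue described above.
    static byte[] readLeaky(Path file) throws IOException {
        InputStream in = Files.newInputStream(file);
        byte[] data = in.readAllBytes();
        in.close(); // skipped on exception
        return data;
    }

    // SAFER: try-with-resources closes the stream on every exit path.
    static byte[] readSafely(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return in.readAllBytes();
        }
    }
}
```

The safe variants are short precisely because the fixes are well-known idioms: a normalize-and-contain check for untrusted paths, and try-with-resources for anything that holds an OS handle. These are the software engineering norms the report says the models often fail to apply.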
Sonar’s analysis revealed distinct “personalities” for each model:
Claude Sonnet 4 earned the moniker “the senior architect” due to its exceptional skill, passing 77.04 percent of benchmark tests. Its output is typically verbose and highly complex, mirroring an experienced engineer’s tendency to implement sophisticated safeguards, error handling, and advanced features.
GPT-4o was dubbed “the efficient generalist,” described as a reliable, middle-of-the-road developer. While it generally avoids the most severe bugs, it is prone to making control-flow mistakes.
OpenCoder-8B emerged as “the rapid prototyper.” It produces the most concise code, generating the fewest lines, but also exhibits the highest issue density, with 32.45 issues per thousand lines of code.
Llama 3.2 90B was branded “the unfulfilled promise,” marked by a mediocre benchmark pass rate of 61.47 percent and an alarmingly high 70.73 percent of “Blocker” severity vulnerabilities.
Finally, Claude 3.7 Sonnet was named “the balanced predecessor.” It boasts a capable benchmark pass rate of 72.46 percent and a high comment density of 16.4 percent. Despite these strengths, it also produces a significant proportion of “Blocker” severity vulnerabilities. Interestingly, although Claude Sonnet 4 is the newer model and performs better on general benchmarks, the vulnerabilities it creates are almost twice as likely to be “Blocker” severity as those of its predecessor.
Given these inherent challenges, Sonar’s report concludes that robust governance and thorough code analysis are not merely advisable but imperative for verifying AI-generated code.