Refining FGVC: Lessons from Building a Real-time Car Classifier

For the past year, researchers at Multitel have delved deep into the complexities of fine-grained visual classification (FGVC). Their primary objective: to engineer a robust car classifier capable of identifying specific car models and years, not just broad makes, and, crucially, one that runs in real time on resource-constrained edge devices alongside other AI models. This ambitious undertaking required blending academic rigor with the practical demands of real-world deployment.

The challenge of FGVC is multifaceted. Unlike general image classification, which might distinguish a car from a cat, FGVC demands discerning subtle visual differences between highly similar objects, such as separating various BMW models or even specific production years of the same model. The task is inherently difficult for several reasons. Inter-class variation is often minimal, so the visual cues separating categories can be extremely subtle. At the same time, intra-class variation is large: instances of the same category can look vastly different under changes in lighting, perspective, or background clutter, which can easily overwhelm those subtle distinctions. Finally, real-world datasets frequently exhibit long-tailed distributions, in which a few common categories have abundant examples while many rare categories are represented by only a handful of images, making it hard for models to learn all classes equally well.

In addressing this problem, the Multitel team initially reviewed the extensive academic literature on FGVC. Years of research have yielded a plethora of increasingly complex architectures and pipelines. Early approaches often involved multi-stage models, where one subnetwork would localize discriminative parts of an object before a second classified it. Other methods explored custom loss functions, high-order feature interactions, or hierarchical label dependencies. While many of the most recent state-of-the-art solutions, particularly those based on transformer architectures, achieved impressive benchmark accuracies—some even exceeding 97% on datasets like Stanford Cars—they often lacked discussion about inference time or deployment constraints. For Multitel’s real-time edge device application, such models were deemed impractical.

Instead of pursuing the most complex or specialized solutions, Multitel adopted a counter-intuitive strategy: could a known, efficient general-purpose model, if trained optimally, achieve performance comparable to heavier, more specialized architectures? This line of inquiry was inspired by research suggesting that many new AI architectures are unfairly compared against older baselines trained with outdated procedures. The premise was that a well-established model like ResNet-50, when given the benefit of modern training advancements, could “strike back” with surprisingly strong results, even on challenging FGVC benchmarks.

With this philosophy, the team set out to construct a powerful, reusable training procedure that could deliver high performance on FGVC tasks without relying on architecture-specific modifications. The core idea was to start with an efficient backbone like ResNet-50 and focus entirely on refining the training pipeline, ensuring the recipe could be broadly applied to other architectures with minimal adjustments. They meticulously collected and compounded best practices from several influential papers, including "Bag of Tricks for Image Classification with Convolutional Neural Networks," "Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network," and Wightman et al.'s "ResNet Strikes Back."

To validate their evolving training pipeline, the researchers utilized the Stanford Cars dataset, a widely accepted FGVC benchmark featuring 196 car categories and over 16,000 images, all cropped to bounding boxes to simulate a downstream classification scenario. Their initial baseline, using a ResNet-50 model pre-trained on ImageNet and trained for 600 epochs with Nesterov Accelerated Gradient optimization, a learning rate of 0.01, and a batch size of 32, achieved an accuracy of 88.22%.
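For concreteness, the baseline can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not Multitel's actual code: the momentum value (0.9), the input preprocessing, and the data loader details are assumptions, since the article specifies only the pretrained backbone, optimizer, learning rate, batch size, and epoch count.

```python
import torch
import torchvision

NUM_CLASSES = 196  # Stanford Cars categories

# ImageNet-pretrained ResNet-50 with its classification head replaced.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
model.fc = torch.nn.Linear(model.fc.in_features, NUM_CLASSES)

# Nesterov Accelerated Gradient is plain SGD with nesterov=True.
# Momentum of 0.9 is an assumption; the article only gives lr=0.01.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, nesterov=True)
criterion = torch.nn.CrossEntropyLoss()

def train_one_epoch(model, loader, optimizer, criterion, device="cuda"):
    """One epoch of standard supervised training; `loader` is assumed to
    yield bounding-box-cropped Stanford Cars images in batches of 32."""
    model.to(device)
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```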

The team then systematically layered improvements onto this baseline (the sketches after this list show how the main ingredients translate into code):

- **Large-batch training with warmup.** Increasing the batch size to 128 and the learning rate to 0.1, combined with a linear learning rate warmup, immediately boosted accuracy to 89.21%.
- **TrivialAugment.** This remarkably simple yet effective parameter-free augmentation technique, which randomly samples and applies augmentations, propelled accuracy to 92.66% on its own.
- **Cosine learning rate decay.** Switching to a cosine schedule nudged accuracy to 93.22%.
- **Label smoothing.** Softening the ground-truth labels reduces model overconfidence; it improved regularization and also permitted a higher initial learning rate (0.4), culminating in a robust 94.5%.
- **Random Erasing.** Randomly obscuring parts of images added further regularization, boosting accuracy to 94.93%.
- **Exponential Moving Average (EMA).** EMA consistently improved stability and generalization in isolated tests, but showed no further incremental gain once integrated into the full, already optimized pipeline. Given its overall benefits and low overhead, it was retained in the final recipe for its general applicability.
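For illustration, the warmup and cosine decay can be combined with PyTorch's built-in schedulers roughly as follows. This is a minimal sketch, not the team's exact code: the five-epoch warmup length is an assumption, and `model`, `train_loader`, `criterion`, and `train_one_epoch` are carried over from the baseline sketch above.

```python
import torch

EPOCHS, WARMUP_EPOCHS = 600, 5  # warmup length is an assumption

# lr=0.4 is the final value reported once label smoothing was added;
# the large-batch experiments started from lr=0.1.
optimizer = torch.optim.SGD(model.parameters(), lr=0.4,
                            momentum=0.9, nesterov=True)

# Linear warmup from 1% of the base lr, then cosine decay over the rest.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=WARMUP_EPOCHS)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=EPOCHS - WARMUP_EPOCHS)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[WARMUP_EPOCHS])

for epoch in range(EPOCHS):
    train_one_epoch(model, train_loader, optimizer, criterion)
    scheduler.step()  # stepped once per epoch
```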
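The remaining ingredients are similarly compact to express. The sketch below, again under stated assumptions (the input resolution, normalization statistics, erasing probability, smoothing factor, and EMA decay are not given in the article), shows TrivialAugment and Random Erasing in a torchvision transform, label smoothing as a loss argument, and weight EMA via `torch.optim.swa_utils` (PyTorch ≥ 2.0).

```python
import torch
import torchvision.transforms as T

# TrivialAugment and Random Erasing in the training transform.
train_transform = T.Compose([
    T.Resize((224, 224)),
    T.TrivialAugmentWide(),   # parameter-free: samples one augmentation per image
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.25),  # occludes a random rectangle; operates on tensors
])

# Label smoothing is a single argument on the loss (0.1 is an assumed value).
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# Weight EMA; the 0.999 decay is an assumed value. Evaluate ema_model at test time.
ema_model = torch.optim.swa_utils.AveragedModel(
    model, multi_avg_fn=torch.optim.swa_utils.get_ema_multi_avg_fn(0.999))
ema_model.update_parameters(model)  # call after each optimizer step
```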

The team also explored other common optimization techniques that ultimately did not help on this task. Weight decay consistently degraded performance, and advanced augmentation methods such as CutMix and MixUp also proved detrimental. Although AutoAugment delivered strong results, TrivialAugment was preferred for its better performance and parameter-free nature, which simplified tuning. Among the optimizers and learning rate schedulers tested, Nesterov Accelerated Gradient and cosine annealing consistently delivered the best results.

In conclusion, by systematically applying and compounding modern training best practices to a standard ResNet-50 architecture, Multitel achieved strong performance on the Stanford Cars dataset, pushing accuracy to nearly 95%. This demonstrates that careful tuning of established techniques can significantly enhance a general-purpose model’s capabilities in fine-grained classification. However, it is crucial to acknowledge the limitations of such benchmarks. The Stanford Cars dataset is nearly class-balanced, features high-quality, mostly frontal images, and lacks significant occlusion or real-world noise. It does not fully address challenges like long-tailed distributions, domain shift, or the recognition of unseen classes, which are pervasive in practical applications. While this research provides a robust baseline and proof of concept, building a truly production-ready system capable of handling the inherent complexities of real-world data remains a continuous endeavor.