Synthetic VQA Data Generation Using BLIP and PaliGemma Models

PyImageSearch

In the realm of artificial intelligence, particularly for tasks like Visual Question Answering (VQA), the demand for high-quality, large-scale datasets often clashes with the prohibitive costs and time associated with manual annotation. Synthetic data generation, leveraging advanced Vision-Language Models (VLMs), presents a compelling solution. This first installment of a two-part series details a foundational step in building such a dataset using a “VLM-as-Judge” methodology. Here, we demonstrate the initial generation of raw VQA annotations by two prominent open-source VLMs: Salesforce’s BLIP and Google’s PaliGemma.

Our process began by acquiring a substantial collection of images to serve as the foundation for our synthetic dataset. We extracted 21,435 images from the validation split of a compact subset of the full VQAv2 dataset. These images provided the visual context for the subsequent question-answering tasks.
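Reproducing this step is straightforward with the datasets library. The sketch below assumes the subset lives on the Hugging Face Hub under the identifier shown and that its validation split exposes an image column; both should be adjusted to the subset you actually use.

```python
import os

from datasets import load_dataset

# Load the validation split of a VQAv2 subset from the Hugging Face Hub.
# The dataset identifier below is an assumption; swap in the subset you use.
ds = load_dataset("merve/vqav2-small", split="validation")

# Write each image to disk so both VLMs later read the exact same files.
os.makedirs("vqa_images", exist_ok=True)
for idx, example in enumerate(ds):
    image = example["image"].convert("RGB")  # PIL.Image object stored in the row
    image.save(os.path.join("vqa_images", f"image_{idx:05d}.jpg"))
```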

With the images prepared, the next phase involved running inference using Salesforce’s Bootstrapping Language-Image Pre-training (BLIP) model. A set of four generic questions was posed for each image: “What is happening in this image?”, “How many people are present in the image?”, “What objects do you see?”, and “What is the main subject of the image?”. The BLIP model, configured as a visual-question-answering pipeline and optimized for GPU execution when available, processed each image-question pair, generating a single, top-ranked answer. The responses for all 21,435 images were systematically collected and saved into a JSON file, a process that, despite leveraging an A100 GPU, required approximately 2.5 hours to complete.
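A condensed sketch of that inference loop is shown below. The checkpoint name, file paths, and output filename are illustrative assumptions rather than the exact values used in this project.

```python
import glob
import json

import torch
from transformers import pipeline

# BLIP exposed through the generic visual-question-answering pipeline,
# placed on the GPU when one is available.
vqa_pipe = pipeline(
    "visual-question-answering",
    model="Salesforce/blip-vqa-base",  # assumed checkpoint
    device=0 if torch.cuda.is_available() else -1,
)

questions = [
    "What is happening in this image?",
    "How many people are present in the image?",
    "What objects do you see?",
    "What is the main subject of the image?",
]

results = {}
for image_path in sorted(glob.glob("vqa_images/*.jpg")):
    answers = {}
    for question in questions:
        # top_k=1 keeps only the highest-scoring answer per question
        output = vqa_pipe(image=image_path, question=question, top_k=1)
        answers[question] = output[0]["answer"]
    results[image_path] = answers

# Persist the nested {image: {question: answer}} structure to JSON.
with open("blip_vqa_annotations.json", "w") as f:
    json.dump(results, f, indent=2)
```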

To ensure accessibility and ease of use for the broader research community, these raw BLIP-generated annotations were then converted into the standardized Hugging Face Dataset format. This involved transforming the nested JSON structure into a flat list of examples, each comprising an image, its corresponding question, and the model-generated answer. Crucially, the dataset schema was explicitly defined to correctly load image data rather than just file paths, alongside string values for questions and answers. The resulting dataset was subsequently pushed to the Hugging Face Hub, making it publicly available for further research and development.
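The conversion can be sketched as follows; the JSON filename matches the previous snippet, and the Hub repository name is a placeholder that assumes you are already authenticated via huggingface-cli login.

```python
import json

from datasets import Dataset, Features, Image, Value

# Flatten the nested JSON into one example per (image, question, answer) triple.
with open("blip_vqa_annotations.json") as f:
    raw = json.load(f)

records = {"image": [], "question": [], "answer": []}
for image_path, qa_pairs in raw.items():
    for question, answer in qa_pairs.items():
        records["image"].append(image_path)
        records["question"].append(question)
        records["answer"].append(answer)

# The Image() feature tells datasets to decode pixel data on access
# instead of storing bare file-path strings.
features = Features(
    {"image": Image(), "question": Value("string"), "answer": Value("string")}
)
dataset = Dataset.from_dict(records, features=features)

# Placeholder repository name; replace with your own namespace.
dataset.push_to_hub("your-username/blip-synthetic-vqa")
```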

Following the BLIP annotations, a second, independent set of responses was generated using Google’s PaliGemma model, specifically the paligemma2-3b-mix-224 variant. The same 21,435 images were processed with the identical set of four questions, each wrapped in PaliGemma’s preferred prompt format: “Question: <question>\nAnswer:”. This conditional generation model received the image and prompt, then produced an answer, which was subsequently cleaned to remove any redundant prompt text. This extensive inference run, also performed on an A100 GPU, took approximately 4 hours, noticeably longer than the BLIP run, and yielded a separate JSON file containing PaliGemma’s complete set of synthetic VQA annotations.
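A rough sketch of that loop, following the usage pattern from the PaliGemma model card, is shown below. The generation settings, file paths, and output filename are assumptions.

```python
import glob
import json

import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

questions = [
    "What is happening in this image?",
    "How many people are present in the image?",
    "What objects do you see?",
    "What is the main subject of the image?",
]

results = {}
for image_path in sorted(glob.glob("vqa_images/*.jpg")):
    image = Image.open(image_path).convert("RGB")
    answers = {}
    for question in questions:
        prompt = f"Question: {question}\nAnswer:"
        inputs = (
            processor(text=prompt, images=image, return_tensors="pt")
            .to(torch.bfloat16)
            .to(model.device)
        )
        input_len = inputs["input_ids"].shape[-1]
        with torch.inference_mode():
            generation = model.generate(**inputs, max_new_tokens=30, do_sample=False)
        # Keep only the newly generated tokens, dropping the echoed prompt.
        answers[question] = processor.decode(
            generation[0][input_len:], skip_special_tokens=True
        ).strip()
    results[image_path] = answers

with open("paligemma_vqa_annotations.json", "w") as f:
    json.dump(results, f, indent=2)
```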

Mirroring the process for the BLIP outputs, the PaliGemma annotations were likewise transformed into the Hugging Face Dataset format. This involved loading the JSON data, restructuring it into individual examples, and applying a cleaning step to ensure the answers were free of extraneous formatting or repeated prompt elements. With the schema correctly defined to handle images and text fields, this second synthetic dataset was also uploaded to the Hugging Face Hub, providing a complementary set of VQA annotations derived from a different state-of-the-art VLM.
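Because the structure mirrors the BLIP conversion, only the cleaning helper is new here. The exact cleaning rule and the repository name below are assumptions about what that step looks like.

```python
import json

from datasets import Dataset, Features, Image, Value

def clean_answer(text: str) -> str:
    # Drop any echoed "Question: ... Answer:" prefix and stray newlines,
    # keeping only the model's answer text (assumed cleaning rule).
    return text.split("Answer:")[-1].replace("\n", " ").strip()

with open("paligemma_vqa_annotations.json") as f:
    raw = json.load(f)

records = {"image": [], "question": [], "answer": []}
for image_path, qa_pairs in raw.items():
    for question, answer in qa_pairs.items():
        records["image"].append(image_path)
        records["question"].append(question)
        records["answer"].append(clean_answer(answer))

features = Features(
    {"image": Image(), "question": Value("string"), "answer": Value("string")}
)
Dataset.from_dict(records, features=features).push_to_hub(
    "your-username/paligemma-synthetic-vqa"  # placeholder repository name
)
```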

This initial phase successfully established two distinct synthetic Visual Question Answering datasets, each populated with model-generated answers for over 21,000 images, derived from Salesforce BLIP and Google PaliGemma respectively. These datasets represent a significant step towards scalable VQA research, mitigating the need for costly manual annotation. The stage is now set for the second part of this series, where a third VLM will assume the role of a “judge,” evaluating and curating these two sets of annotations to produce a final, high-quality synthetic VQA dataset through automated comparison and selection.