Build a Video Summarizer with Qwen2.5-Omni 3B & Gradio
A new application demonstrates the capabilities of Qwen2.5-Omni 3B, an advanced end-to-end multimodal AI model, by creating a simple yet insightful video summarizer. Developed with the Hugging Face Transformers library for model integration and Gradio for the user interface, this project highlights how powerful AI models can be deployed on consumer-grade hardware for practical applications.
Qwen2.5-Omni is distinguished by its ability to process diverse inputs, including text, images, videos, and audio, and generate both text and natural speech outputs. Leveraging the 3-billion-parameter version of this model, the video summarizer is designed to take a user-uploaded video, process it in segments, and generate a comprehensive summary.
Technical Approach and Implementation
The core of the summarizer’s functionality lies in its efficient handling of Qwen2.5-Omni 3B. To enable the model to run on systems with limited VRAM, such as a 10GB RTX 3080 GPU, several optimizations are employed. These include 4-bit quantization, which reduces the memory footprint of the model’s weights, and the integration of Flash Attention 2, a technique that accelerates attention mechanisms and conserves GPU memory.
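As a rough sketch of what that loading step can look like (the class names Qwen2_5OmniForConditionalGeneration and Qwen2_5OmniProcessor follow the Transformers integration of Qwen2.5-Omni and may differ across library versions; this is an illustration, not the project's exact code):

```python
import torch
from transformers import (
    BitsAndBytesConfig,
    Qwen2_5OmniForConditionalGeneration,
    Qwen2_5OmniProcessor,
)

MODEL_ID = "Qwen/Qwen2.5-Omni-3B"

# 4-bit NF4 quantization keeps the 3B weights within a ~10GB VRAM budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # faster attention, lower peak memory
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
```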
Given that processing entire videos at once can be highly GPU-intensive, the application adopts a video chunking strategy. Input videos are broken down into smaller, manageable segments using OpenCV. Each temporary video chunk is then fed to the Qwen model. The summarization process unfolds in two main stages:
Chunk Analysis: The model analyzes individual video chunks, guided by a specific system prompt (SYSTEM_PROMPT_ANALYTICS), to generate a textual description for each segment. These individual analyses are accumulated.
Final Summary Generation: Once all chunks are processed, the accumulated analyses are concatenated. This combined text forms a new input for Qwen, this time guided by a second system prompt (SYSTEM_PROMPT_SUMMARY) that directs the model to generate an in-depth, overall summary of the entire video. For a smoother user experience, the final summary is streamed token by token to the user interface.
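A condensed sketch of this chunk-then-summarize flow is shown below, assuming OpenCV for splitting and two hypothetical helpers (describe_chunk and generate_summary) that wrap the model calls; the prompt strings are placeholders, not the project's actual prompts.

```python
import os
import tempfile
import cv2

# Placeholder prompt texts, not the project's actual prompts.
SYSTEM_PROMPT_ANALYTICS = "Describe what happens in this video segment."
SYSTEM_PROMPT_SUMMARY = "Write an in-depth summary of the full video from these segment notes."

def split_video(path, chunk_seconds=5):
    """Split a video into temporary chunk files of roughly chunk_seconds each."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    frames_per_chunk = int(fps * chunk_seconds)
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")

    chunks, writer, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % frames_per_chunk == 0:
            if writer:
                writer.release()
            out_path = os.path.join(tempfile.gettempdir(), f"chunk_{len(chunks)}.mp4")
            writer = cv2.VideoWriter(out_path, fourcc, fps, (width, height))
            chunks.append(out_path)
        writer.write(frame)
        frame_idx += 1
    if writer:
        writer.release()
    cap.release()
    return chunks

def summarize_video(path, chunk_seconds=5):
    # Stage 1: describe each chunk, accumulating the analyses.
    analyses = []
    for chunk_path in split_video(path, chunk_seconds):
        analyses.append(describe_chunk(chunk_path, SYSTEM_PROMPT_ANALYTICS))  # hypothetical helper
        os.remove(chunk_path)  # clean up temporary chunk files as we go
    # Stage 2: feed the concatenated analyses back to the model for the final summary.
    combined = "\n".join(analyses)
    return generate_summary(combined, SYSTEM_PROMPT_SUMMARY)  # hypothetical helper
```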
The user interface, built with Gradio, provides a straightforward experience. Users can upload a video and specify a chunk duration. The UI offers real-time feedback, displaying the progress of chunk processing and the accumulating log of individual segment analyses. Error handling and temporary file cleanup are robustly implemented to ensure stability and efficient resource management.
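A minimal Gradio layout along these lines might look like the following sketch; summarize_video_streaming is a hypothetical generator that yields the growing analysis log and the partial final summary, and the project's actual components may differ.

```python
import gradio as gr

def run(video_path, chunk_seconds):
    # summarize_video_streaming is a hypothetical generator yielding
    # (chunk_analysis_log, partial_final_summary) tuples as work progresses.
    for log_text, partial_summary in summarize_video_streaming(video_path, chunk_seconds):
        yield log_text, partial_summary

with gr.Blocks(title="Qwen2.5-Omni 3B Video Summarizer") as demo:
    video_in = gr.Video(label="Upload a video")
    chunk_len = gr.Slider(2, 30, value=5, step=1, label="Chunk duration (seconds)")
    analyses_log = gr.Textbox(label="Per-chunk analyses", lines=10)
    summary_out = gr.Textbox(label="Final summary", lines=8)
    gr.Button("Summarize").click(
        run, inputs=[video_in, chunk_len], outputs=[analyses_log, summary_out]
    )

demo.launch()
```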
Experimental Results and Observations
The video summarizer was tested with various video types, revealing both the model’s strengths and current limitations.
- Traffic Intersection Video (Short): When tested with a short video depicting a traffic intersection, divided into four 5-second chunks, the model generated a final summary that was notably accurate. This demonstrates its capability to effectively summarize concise, clear visual information.
- Indoor Retail Scene Video (Long): A more challenging test involved a 30-minute indoor retail scene. Initially, the model performed well, generating correct summaries for the first few chunks. However, it soon began to hallucinate, mistakenly identifying scenes as being from the “Minecraft video game.” While some subsequent chunks were correctly described, the prevalence of these errors led to a final summary that was partly inaccurate. This highlights a challenge with longer inputs, where the model’s contextual understanding can degrade or lead to confabulations.
- Out of Memory (OOM) Considerations: A critical observation during experiments was the potential for Out of Memory (OOM) errors, particularly when generating the final summary for very long videos (e.g., exceeding 2 minutes, resulting in 100-170 chunks). The sheer volume of accumulated chunk summaries fed into the final summary generator can exceed GPU memory limits, even with chunking.
- Snowy Forest Video (Simple): Surprisingly, a seemingly simple video of two people walking in a snowy forest yielded mostly incorrect results. The model hallucinated, describing “corrupted pixels” and only briefly mentioning the snowy forest. The exact cause of this misinterpretation is unclear but suggests that model performance can vary unpredictably even with straightforward inputs. The developer noted that running the model at higher precision (FP16/BF16), rather than with 4-bit quantization, might yield different results, though this was not tested.
Future Enhancements
The current video summarizer serves as a foundational step. Future improvements could transform it into a more comprehensive open-source video analytics platform, akin to commercial solutions like Azure Vision Studio. Potential enhancements include:
Advanced Search: Allowing users to find specific scenarios or incidents within a video using natural language queries.
Timestamp Integration: Adding timestamps to pinpoint where specific events or incidents occur in the video.
Speech Capabilities: Utilizing Qwen2.5-Omni’s full multimodal spectrum to incorporate speech synthesis for generated summaries.
Audio Track Analysis: Integrating analysis of video audio tracks to create richer, more in-depth summaries.
Addressing Model Misunderstandings: Further research into why the model occasionally misinterprets frames or hallucinates is crucial for improving accuracy.
While Gradio provides a rapid prototyping environment, a more advanced video analytics platform would likely necessitate a full-fledged, custom user interface to accommodate its expanded features and complexity.
In conclusion, this project successfully demonstrates building a video summarizer using Qwen2.5-Omni 3B, showcasing its potential for practical applications. The experiments provided valuable insights into the model’s performance, highlighting its strengths in summarizing clear, concise content while also identifying challenges related to hallucination, handling very long videos, and occasional unpredictable misinterpretations. These observations pave the way for future research and development in multimodal video understanding.