V-RAG: Video-based Retrieval Augmented Generation for Lifelike Conversational AI
I designed a multimodal video-based Retrieval-Augmented Generation (RAG) pipeline to address a key challenge in traditional systems: the loss of context and visual detail during repetitive captioning stages. By integrating tools like VideoDB, LlamaIndex, Whisper, and Video-LLaVA into a microservices architecture, we created a system capable of delivering context-rich, visually aware responses efficiently. A key result: semantic retrieval over 20-second video clips in roughly 3 seconds using open-source LLMs.
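The microservices composition described above can be sketched as below. This is a minimal illustration, not the project's actual code: every class, function, and field name here is an assumed placeholder standing in for the real VideoDB, Whisper, Video-LLaVA, and LlamaIndex services.

```python
from dataclasses import dataclass, field

@dataclass
class VideoSegment:
    """A clip plus the multimodal metadata each service contributes."""
    clip_id: str
    transcript: str = ""       # filled by the speech-to-text service (Whisper's role)
    visual_caption: str = ""   # filled by the video-language service (Video-LLaVA's role)
    keyframes: list = field(default_factory=list)

def transcribe(seg: VideoSegment) -> VideoSegment:
    # Placeholder for the Whisper microservice.
    seg.transcript = f"transcript of {seg.clip_id}"
    return seg

def describe_visuals(seg: VideoSegment) -> VideoSegment:
    # Placeholder for the Video-LLaVA microservice.
    seg.visual_caption = f"visual description of {seg.clip_id}"
    return seg

def index_segment(seg: VideoSegment) -> dict:
    """Fuse the text channels into one retrievable document (LlamaIndex-style node)."""
    return {"id": seg.clip_id, "text": f"{seg.transcript}\n{seg.visual_caption}"}

# Each stage is an independent service; the ingest pipeline is their composition.
PIPELINE = [transcribe, describe_visuals]

def ingest(seg: VideoSegment) -> dict:
    for stage in PIPELINE:
        seg = stage(seg)
    return index_segment(seg)

doc = ingest(VideoSegment("clip-001"))
```

Because each stage only reads and writes the shared segment record, services can be swapped (e.g. a different captioner) without touching the rest of the pipeline.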
When designing the pipeline, I wanted to solve a problem with conventional multimodal RAG systems: they often lose vital visual details. To address that, my pipeline preserves and enhances context by enabling two-way, reversible video-to-text and text-to-video transformations. The workflow processes input videos to generate and query metadata, retrieves and resamples keyframes, and feeds this information into a language model to produce enriched responses. By storing visuals, sounds, and contextual details in a way that mimics human memory encoding, the pipeline delivers more nuanced and adaptive interactions.
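The text-to-video direction of that workflow can be sketched as follows: metadata entries keep their source timestamps, so a text match can be mapped back to the clip and keyframes resampled around that moment. This is a toy sketch under assumed names; the real system would use embedding-based retrieval rather than keyword matching, and the index entries shown are invented examples.

```python
# Toy index: text snippets mapped back to their source timestamps,
# which is what makes the text-to-video direction reversible.
index = [
    {"t": 4.0, "text": "speaker introduces the demo setup"},
    {"t": 12.5, "text": "close-up of the whiteboard diagram"},
]

def retrieve(query: str, entries: list) -> list:
    """Naive keyword retrieval; a production system would use vector search."""
    words = query.lower().split()
    return [e for e in entries if any(w in e["text"].lower() for w in words)]

def resample_keyframes(t: float, fps: float = 2.0, window: float = 2.0) -> list:
    """Return evenly spaced keyframe timestamps centered on a matched moment."""
    n = int(window * fps)
    start = t - window / 2
    return [round(start + i / fps, 2) for i in range(n + 1)]

hits = retrieve("whiteboard", index)
frames = resample_keyframes(hits[0]["t"])
# The frames bracket the matched timestamp, so the language model receives
# the surrounding visual context rather than an isolated caption.
```

Resampling a symmetric window around the match, instead of taking a single frame, is what lets the downstream model reason over motion and context around the retrieved moment.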
This approach is especially impactful in healthcare, where lossless keyframe resampling ensures that no critical visual detail is lost, making it valuable for diagnostics and training. Reuniting with my team from the Cantcer project, we focused exclusively on open-source tools, showcasing how far open-source AI has come. Ultimately, this pipeline represents a step forward in creating lifelike, context-aware AI systems for healthcare, conversational assistants, and beyond.