Creating an In-Video Search System
Processing Video Data: From Extraction to Embedding
In-video search is the ability to search for specific content within a video. This can include searching for particular spoken words, objects shown, or a description of a scene.
With recent advances in transformers, in-video search has become more accurate and fairly simple to build. Most transformers do not have a joint embedding space for multiple modalities, but a few models do: Meta's ImageBind provides a joint embedding space across text, image, audio, depth, thermal, and IMU data, and OpenAI's CLIP provides a joint embedding space between text and image. We can use these models to create a relatively quick and accurate in-video search service.
What data does a video consist of?
A typical video contains the following data:
Frames: A video is basically a series of frames. Some of these frames are complete frames, called i-frames, while others are partial frames, known as p-frames and b-frames, which contain only the changes relative to other frames.
Audio: A video may also have audio channels. These channels can contain different types of audio tracks, such as dialogue, music, and sound effects.
Subtitles/Transcript: Many videos include subtitle or closed-caption tracks, or a transcript can be derived from the audio. This data is important for querying dialogue in the video.
We will process each of these independently, using different techniques.
Processing Frames
Frames in a video are typically of three types:
i-frame: These are complete, self-contained frames that do not rely on other frames for decoding.
p-frame: These frames rely on previous frames for decoding. They are compressed by only storing the differences between the current frame and the previous frame.
b-frame: These frames rely on both previous and future frames for decoding. They are compressed by storing the differences between the current frame and the previous frame, as well as the differences between the current frame and the next frame.
For our use case we will rely on i-frames, since they serve as the reference point for the other frame types. We will create vector representations of the i-frames and store them in our vector DB with metadata such as the timestamp of each frame, using the following steps (a sketch of the full pipeline follows this list):
Extract i-frames: We can use ffmpeg to extract the i-frames of a video.
Discard similar i-frames: Often, two i-frames can be very similar to each other. Indexing these similar frames in our vector DB would waste resources. We will discard similar frames by calculating the Structural Similarity Index (SSIM) between two frames. SSIM measures the similarity between two images using luminance, contrast, and structure. For more information, use this link.
Generate Embeddings: Once we have discarded the similar i-frames, we will pass the remaining images to the ImageBind model to generate the associated embeddings and store them in our vector DB.
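Below is a minimal sketch of these three steps, assuming ffmpeg is available on the PATH and Meta's reference imagebind package is installed (import paths can differ between versions). The file names, the SSIM threshold, and the vector DB call are illustrative placeholders, not part of any specific library:

```python
import glob
import subprocess

import cv2
import torch
from skimage.metrics import structural_similarity as ssim
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

VIDEO = "input.mp4"      # hypothetical input file
SSIM_THRESHOLD = 0.85    # assumed cutoff: frames more similar than this are discarded

# 1. Extract only the i-frames (pict_type == I) as images.
subprocess.run([
    "ffmpeg", "-i", VIDEO,
    "-vf", "select='eq(pict_type,I)'",
    "-vsync", "vfr",
    "iframe_%04d.png",
], check=True)

# 2. Discard i-frames that are too similar to the previously kept frame.
kept, last_gray = [], None
for path in sorted(glob.glob("iframe_*.png")):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    if last_gray is None or ssim(last_gray, gray) < SSIM_THRESHOLD:
        kept.append(path)
        last_gray = gray

# 3. Embed the remaining i-frames with ImageBind.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)
inputs = {ModalityType.VISION: data.load_and_transform_vision_data(kept, device)}
with torch.no_grad():
    frame_embeddings = model(inputs)[ModalityType.VISION]

# 4. Store each embedding with its frame timestamp (hypothetical vector DB client).
# The timestamps themselves can be read from ffmpeg's showinfo filter or ffprobe.
# vector_db.upsert(ids=kept, vectors=frame_embeddings.cpu().numpy(), metadata=timestamps)
```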
Processing Audio
Processing audio is optional and often does not significantly enhance results, as audio typically doesn't provide unique information beyond what is already available in video frames or transcripts. However, if your use case benefits from audio data, this method can be utilized.
To process audio, the first step is to divide it into smaller clips. Instead of clipping the audio at constant intervals, which might not yield the best results, we will smartly segment the audio so that each chunk retains complete contextual information.
Before that, we need to get familiar with two terms:
- Spectral Contrast: The audio spectrum is divided into different bands based on pitch. Imagine dividing the music into high-pitch, middle-pitch, and low-pitch parts. In each band, we look for the loudest parts (peaks) and the quietest parts (valleys), and measure how big the difference is between them. This difference is the "contrast". We then take the mean of the spectral contrast across the bands.
- RMS Energy: RMS (Root Mean Square) energy is a measure of the power or loudness of the audio signal. It gives us an idea of how much energy the audio signal has, which correlates with its perceived loudness. It is calculated using the following formula:
$$\text{RMS} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2}$$
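As a quick worked example of the formula (the sample values are made up for illustration):

```python
import numpy as np

frame = np.array([0.1, -0.3, 0.25, -0.05])  # hypothetical audio samples for one frame
rms = np.sqrt(np.mean(frame ** 2))          # square, average, then take the square root
print(rms)                                  # ~0.203
```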
We will process each frame (a short, continuous segment of the audio signal), calculating and normalising its spectral contrast and RMS energy. By comparing these features with those of the previous frame, we determine whether the frame belongs to the same clip or starts a new one. A new clip is created when the following condition is met:
$$(SC_2 - SC_1) + (RMS_2 - RMS_1) > \text{Threshold}$$
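Here is a sketch of this segmentation loop using librosa. The hop length (librosa's default), the min-max normalisation, the threshold value, and the assumption that the audio track has already been extracted to a WAV file are all choices you would tune for your own data:

```python
import librosa
import numpy as np

AUDIO = "audio.wav"   # hypothetical audio track extracted from the video
THRESHOLD = 0.3       # assumed boundary threshold; tune per use case

y, sr = librosa.load(AUDIO, sr=None)

# Frame-wise features: mean spectral contrast across bands, and RMS energy.
contrast = librosa.feature.spectral_contrast(y=y, sr=sr).mean(axis=0)
rms = librosa.feature.rms(y=y)[0]

n = min(len(contrast), len(rms))
contrast, rms = contrast[:n], rms[:n]

# Normalise both features to [0, 1] so they can be compared on the same scale.
def normalise(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

contrast, rms = normalise(contrast), normalise(rms)

# Walk the frames; start a new clip when the combined change exceeds the threshold.
frame_times = librosa.frames_to_time(np.arange(n), sr=sr)
boundaries = [0.0]
for i in range(1, n):
    if (contrast[i] - contrast[i - 1]) + (rms[i] - rms[i - 1]) > THRESHOLD:
        boundaries.append(frame_times[i])
boundaries.append(frame_times[-1])

clips = list(zip(boundaries[:-1], boundaries[1:]))  # (start, end) times in seconds
```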
Once we have the clips, we will create the embeddings associated with them using ImageBind and store them in our vector database, along with metadata such as the timestamp of each clip.
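Embedding the clips mirrors the frame step, this time through ImageBind's audio modality. The clip file names and the vector DB call are placeholders, and each clip is assumed to have been exported to its own audio file first:

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

clip_paths = ["clip_000.wav", "clip_001.wav"]  # hypothetical exported clips

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

inputs = {ModalityType.AUDIO: data.load_and_transform_audio_data(clip_paths, device)}
with torch.no_grad():
    audio_embeddings = model(inputs)[ModalityType.AUDIO]

# Store each embedding with the clip's start/end timestamps (hypothetical vector DB client).
# vector_db.upsert(ids=clip_paths, vectors=audio_embeddings.cpu().numpy(), metadata=clip_times)
```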
Processing Text Data
Video data may include a subtitle file, or a transcript can be generated from the video. Working with text data is typically simpler. To preserve context and minimize information loss, we'll segment the text data with some overlap. Then, we'll generate the embeddings using ImageBind and store them in our vector DB, marking each with its corresponding timestamp.
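A sketch of overlapping chunking and embedding, assuming the transcript is already available as a list of timed segments (for example, parsed from an SRT file). The segment contents, chunk sizes, and vector DB call are illustrative:

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

# Hypothetical transcript segments: (start_seconds, end_seconds, text).
segments = [
    (0.0, 4.2, "Welcome back to the channel."),
    (4.2, 9.8, "Today we are building an in-video search system."),
    (9.8, 15.1, "First we extract the i-frames with ffmpeg."),
]

CHUNK_SIZE, OVERLAP = 2, 1  # segments per chunk, and how many segments overlap

chunks = []
for i in range(0, len(segments), CHUNK_SIZE - OVERLAP):
    window = segments[i:i + CHUNK_SIZE]
    chunks.append({
        "text": " ".join(s[2] for s in window),
        "start": window[0][0],
        "end": window[-1][1],
    })

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)
inputs = {ModalityType.TEXT: data.load_and_transform_text([c["text"] for c in chunks], device)}
with torch.no_grad():
    text_embeddings = model(inputs)[ModalityType.TEXT]

# Store each chunk embedding with its start/end timestamps (hypothetical vector DB client).
# vector_db.upsert(vectors=text_embeddings.cpu().numpy(), metadata=chunks)
```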
Querying For Video Clip
The process of querying a video clip begins with a prompt provided by the user. Upon receiving the query, we follow these steps:
We generate the embedding for the prompt and query our vector database to find the closest match. This could be a text, frame, or audio embedding.
If the match is text-based, we calculate the cosine similarity with the next and previous text embeddings. We continue to iterate in both directions until the cosine similarity falls below a certain threshold. This threshold can be calculated dynamically (I created an npm package for this purpose; you can check it out here). The start time of the first chunk becomes the start time of the video clip, while the end time of the last chunk becomes the end time.
If the match is audio-based, the process is more straightforward: we simply use the start time of the audio clip as the start of the video clip, and its end time as the end.
If the match is a frame, then the timestamp of the previous frame in the vector DB is the start time of the video clip, and the timestamp of the next frame is the end time.
Once we have the start and end times for the video clips, we will generate the clip from the original video and provide the output.
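The sketch below puts the query flow together. The nearest-neighbour search is shown as a plain cosine-similarity scan over an in-memory index rather than a real vector DB query, the index layout and similarity threshold are assumptions, and the final cut uses ffmpeg with stream copy (so cuts snap to keyframes):

```python
import subprocess

import numpy as np
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_prompt(model, device, prompt):
    # Embed the user's prompt with ImageBind's text modality.
    inputs = {ModalityType.TEXT: data.load_and_transform_text([prompt], device)}
    with torch.no_grad():
        return model(inputs)[ModalityType.TEXT][0].cpu().numpy()

def find_clip(query_vec, index, text_threshold=0.6):
    """index: {'text': [...], 'audio': [...], 'frame': [...]}, each a time-ordered
    list of {'vector': np.ndarray, 'start': float, 'end': float} (assumed layout)."""
    # 1. Closest match across all stored embeddings, regardless of modality.
    best_mod, best_i, best_sim = None, None, -1.0
    for mod, entries in index.items():
        for i, e in enumerate(entries):
            sim = cosine(query_vec, e["vector"])
            if sim > best_sim:
                best_mod, best_i, best_sim = mod, i, sim
    entries, hit = index[best_mod], index[best_mod][best_i]

    if best_mod == "text":
        # 2a. Expand to neighbouring text chunks while they stay similar to the match.
        lo = hi = best_i
        while lo > 0 and cosine(hit["vector"], entries[lo - 1]["vector"]) > text_threshold:
            lo -= 1
        while hi < len(entries) - 1 and cosine(hit["vector"], entries[hi + 1]["vector"]) > text_threshold:
            hi += 1
        return entries[lo]["start"], entries[hi]["end"]

    if best_mod == "audio":
        # 2b. An audio clip already carries its own start/end times.
        return hit["start"], hit["end"]

    # 2c. Frame match: span from the previous stored frame to the next stored frame.
    start = entries[best_i - 1]["start"] if best_i > 0 else hit["start"]
    end = entries[best_i + 1]["start"] if best_i < len(entries) - 1 else hit["start"]
    return start, end

def cut_clip(video, start, end, out="result.mp4"):
    # 3. Cut the answer clip from the original video (stream copy, no re-encode).
    subprocess.run(["ffmpeg", "-ss", str(start), "-i", video,
                    "-t", str(end - start), "-c", "copy", out], check=True)

# Usage sketch:
# model = imagebind_model.imagebind_huge(pretrained=True).eval().to("cpu")
# q = embed_prompt(model, "cpu", "person riding a bicycle")
# start, end = find_clip(q, built_index)
# cut_clip("input.mp4", start, end)
```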
In conclusion, with the help of advanced transformer models like ImageBind, creating a reliable and efficient in-video search service is more attainable than ever. By processing and embedding different data modalities from videos - frames, audio and text - we can create a comprehensive search system that caters to a wide array of user queries. Remember, the process highlighted in this article is a guide - the thresholds, processing techniques, and models can be tweaked and optimized as per your specific use case. Happy coding!