Online video platforms contain a massive amount of content, with video lengths ranging from a few minutes to over an hour. However, studies show most viewers prefer videos to be concise, with ideal lengths under 20 minutes and a sweet spot of 3-6 minutes. Despite this preference for brevity, users often struggle to efficiently extract key information from lengthy videos. The long form content challenges people's short attention spans - research indicates our average attention span when consuming digital media has shrunk to just 47 seconds today. Viewers frequently rewind and re-watch segments to try to pick out important details, which increases cognitive load. There is a clear need for solutions that can distill key content from lengthy videos into short, information-dense summaries tailored to users' limited time and attention. Viewers need to grasp key details from videos without spending excessive time watching full videos. Enabling easy access to interact and extract crucial details on demand would greatly enhance the knowledge consumption process when dealing with expansive video content. Tackling these challenges will allow online video platforms to better align with users' preferences for concise and efficient learning.
The proposed ‘IntelliWatch’ application integrates video and audio processing capabilities, natural language processing techniques, and an interactive user interface. The application pre-processes input videos by extracting key frames and generating captions for them using a video-frame captioning module inspired by the BLIP model [6]. These frame-level captions are stored as "Context-1". In parallel, the application samples and transcribes the video's audio into text transcripts using OpenAI's Whisper model [7]. The audio transcripts are stored as "Context-2", linked to timestamp information. When a user selects a video, the application retrieves the corresponding contextual information from the database for that video. This contextual data is then fed into a large language model (LLM) to facilitate interactive summarization and question answering. The user can have a natural conversation with the LLM through the application's interface to obtain video insights. Through the application's interactive user interface, users can engage with the fine-tuned large language model to ask questions related to the video content. To reduce cognitive load for users, the interface provides options to review previous conversations with the system. The core AI techniques powering the video summarization and question answering capabilities are video-frame captioning, audio transcription, and key-frame extraction. By leveraging state-of-the-art large language models, the system can effectively analyze long temporal context from the video and generate abstractive summaries and meaningful insights. This allows for an interactive conversation with the video content rather than passive watching. The proposed IntelliWatch application enables generating concise summaries by extracting key frames, captioning them, and transcribing audio to build a rich contextual database for each video. This context is processed by the large language model to produce succinct summarizations. The conversational interface allows users to efficiently interact with the video content by asking questions and accessing insights from the video context, without needing to watch the full video. Project Live Demo Project Demo on YouTube
- Make sure you have Python > 3.9
- Install the dependecies using
pip install -r requirements.txt
- Once the dependecies are insstalled, run
streamlit run .\complete_IntelliWatch_demo_V10.py
- The full end-end application is present at IntelliWatch 2.0, the above,
streamlit run .\complete_IntelliWatch_demo_V10.py
code is just a reference for other users to build further extensions.
- visit IntelliWatch 2.0
- sign up if you are a new user
- once signed up, please login into the application, if wrong password is entered please rfresh the page
- once logged into the application, a video repository page greets you, please select a video which you like
- now in the next page, you can a see a brief summary of the video, chat window, and 3 keywords with corresposing prompts.
- these keyowrds and prmpts act as a guidance for you to ask questions, you can ask other questions as well
- in the chat window, ask any question, our applcation responds with an answer, and along with the asnwer you will be provided with a list of timestamps related to the asnwer, click on them to automatically watch the specific parts of the video.
-
streamlit
: This is the main framework used for building the web app. It provides functions to create interactive widgets and to manage the app's state. -
langchain
library components:FAISS
: Utilized for efficient similarity search and retrieval of vectors.ChatOpenAI
: Interface for the OpenAI chat model, likely used for generating responses.OpenAIEmbeddings
: To convert text into embeddings using OpenAI's model.RecursiveCharacterTextSplitter
: To split long texts into smaller chunks suitable for processing.RetrievalQA
,ConversationalRetrievalChain
: Components for building a retrieval-based QA system where the chatbot retrieves information to answer questions.
-
os
: To interact with the operating system, for example, setting an environment variable for the OpenAI API key. -
csv
: To read and write CSV files, which in this context are used to store chat histories and video/audio metadata. -
pandas
: For handling data in a tabular form, here used to load video data. -
streamlit_chat
: To add chat functionality to the Streamlit app.
The code includes several functions:
load_audio_video_context()
: This function reads CSV files containing video frame captions and audio transcriptions, then concatenates them into a single text, preparing it as context for the chatbot.load_chathistory()
: Reads a CSV file containing a chat history and returns it as a list.populate_chat_history()
: Appends chat history from a CSV file into the Streamlit session state.load_data()
: Loads video data from a CSV file into a pandas DataFrame.display_main_page()
: Renders the main page of the app, which includes video selection.display_video_page()
: Shows selected video details and provides an interface for chat interaction.
The @st.cache_data
decorator likely caches the loaded data to improve performance, and st.set_page_config(layout="wide")
sets the layout of the Streamlit page to wide mode. The environment variable for the OPENAI_API_KEY
is set, which is necessary for authentication with OpenAI's API.
The app's state management is done through Streamlit's session state (st.session_state
), holding the generated responses, past inputs, and messages. This state persists across reruns of the app, allowing for a continuous chat experience.
The app provides an interface for users to interact with a video by asking questions, which are then answered by the chatbot using the loaded context. It is designed to handle video playback, display video metadata, and manage the interaction between the user and the system.
Pseudo Code of the main functionality:
INITIALIZE empty list chat_history
SET temp_str to empty string
IMPORT streamlit as st
IMPORT necessary modules from langchain (FAISS, ChatOpenAI, OpenAIEmbeddings, RecursiveCharacterTextSplitter, RetrievalQA, ConversationalRetrievalChain)
IMPORT os, csv, pandas as pd
IMPORT message from streamlit_chat
DEFINE function load_audio_video_context(input_video_csv, input_audio_csv)
INITIALIZE detected_text with instructions for the bot
READ video CSV file and APPEND frame-wise captions to detected_text
READ audio CSV file and APPEND speech-to-text transcriptions to detected_text
RETURN detected_text
DEFINE function load_chathistory(chathistory_csv)
READ chathistory_csv and RETURN chat history list
DEFINE function populate_chat_history(inp_csv_file)
READ inp_csv_file and APPEND chat history to session state
SET Streamlit page configuration
SET environment variable for OpenAI API key
CREATE text_splitter instance with specified chunk size and overlap
INITIALIZE Streamlit session state variables for 'generated', 'past', and 'messages' if not present
DEFINE function load_data()
READ CSV and RETURN as pandas DataFrame
DEFINE function display_main_page()
DISPLAY title and subheader in Streamlit
ALLOW selection of video category
FILTER videos based on selected category
DISPLAY video thumbnails and names, and handle video selection
DEFINE function display_video_page()
HANDLE navigation back to main page
LOAD and DISPLAY selected video details
SET up chat functionality with input and response containers
IF user submits input and page is not main:
LOAD chat history from CSV
GENERATE response using conversational retrieval chain
APPEND messages to session state
SHOW video timestamps in the asnwer
WRITE new chat history to CSV
IF selected page in session state is 'main'
CALL display_main_page()
ELSE
CALL display_video_page()