diff --git a/MultimodalQnA/README.md b/MultimodalQnA/README.md
index 95626aa78..08de5686a 100644
--- a/MultimodalQnA/README.md
+++ b/MultimodalQnA/README.md
@@ -2,7 +2,7 @@
Suppose you possess a set of videos and wish to perform question-answering to extract insights from them. Answering your questions typically requires comprehension of visual cues within the videos, knowledge derived from the audio content, or often a mix of both. The MultimodalQnA framework offers an optimal solution for this purpose.
-`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the video ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
+`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
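+
+For illustration, once the example is deployed, a single request to the MegaService gateway performs the whole retrieve-and-answer flow. The snippet below is a minimal sketch: it assumes the gateway accepts the `messages` payload used by other OPEA MegaServices, and `BACKEND_SERVICE_ENDPOINT` (port 8888) is configured in the deployment sections referenced later in this guide.
+
+```bash
+curl http://${host_ip}:8888/v1/multimodalqna \
+    -H "Content-Type: application/json" \
+    -d '{"messages": "What items are mentioned in these videos?"}'
+```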
The MultimodalQnA architecture is shown below:
@@ -100,10 +100,12 @@ In the below, we provide a table that describes for each microservice component
By default, the embedding and LVM models are set to the values listed below:
-| Service | Model |
-| -------------------- | ------------------------------------------- |
-| embedding-multimodal | BridgeTower/bridgetower-large-itm-mlm-gaudi |
-| LVM | llava-hf/llava-v1.6-vicuna-13b-hf |
+| Service | HW | Model |
+| -------------------- | ----- | ----------------------------------------- |
+| embedding-multimodal | Xeon | BridgeTower/bridgetower-large-itm-mlm-itc |
+| LVM | Xeon | llava-hf/llava-1.5-7b-hf |
+| embedding-multimodal | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc |
+| LVM | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf |
You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf` and `llava-hf/llava-1.5-13b-hf`, as needed, as shown below.
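
For example, the LVM model can be overridden by exporting `LVM_MODEL_ID` before starting the services (this variable is consumed by the deployments' `set_env.sh` and compose files, described in the sections below):

```bash
export LVM_MODEL_ID="llava-hf/llava-1.5-13b-hf"
```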
diff --git a/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md b/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md
index 9b3a3edaa..d0a1c7d27 100644
--- a/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md
+++ b/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md
@@ -84,16 +84,18 @@ export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
+export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
```
Note: Please replace `host_ip` with your external IP address; do not use localhost.
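+
+One way to set `host_ip` (a sketch mirroring what the test scripts in this repository do) is:
+
+```bash
+export host_ip=$(hostname -I | awk '{print $1}')
+```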
@@ -274,54 +276,76 @@ curl http://${host_ip}:9399/v1/lvm \
6. dataprep-multimodal-redis
-Download a sample video
+Download a sample video, image, and audio file and create a caption
```bash
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
+
+export image_fn="apple.png"
+wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
+
+export caption_fn="apple.txt"
+echo "This is an apple." > ${caption_fn}
+
+export audio_fn="AudioSample.wav"
+wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
```
-Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4.
+Test the dataprep microservice by generating a transcript. This command updates the knowledge base by uploading a local .mp4 video and a .wav audio file.
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
- -X POST -F "files=@./${video_fn}"
+ -X POST \
+ -F "files=@./${video_fn}" \
+ -F "files=@./${audio_fn}"
```
-Also, test dataprep microservice with generating caption using lvm microservice
+Also, test the dataprep microservice by generating an image caption using the lvm microservice:
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
- -X POST -F "files=@./${video_fn}"
+ -X POST -F "files=@./${image_fn}"
+```
+
+Now, test the microservice by posting a custom caption along with an image:
+
+```bash
+curl --silent --write-out "HTTPSTATUS:%{http_code}" \
+ ${DATAPREP_INGEST_SERVICE_ENDPOINT} \
+ -H 'Content-Type: multipart/form-data' \
+ -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
```
-Also, you are able to get the list of all videos that you uploaded:
+You can also get the list of all files that you uploaded:
```bash
curl -X POST \
-H "Content-Type: application/json" \
- ${DATAPREP_GET_VIDEO_ENDPOINT}
+ ${DATAPREP_GET_FILE_ENDPOINT}
```
-Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
+Then you will get a response like the Python-style list below. Notice that the name of each uploaded file, e.g. `videoname.mp4`, becomes `videoname_uuid.mp4`, where `uuid` is a unique ID for each uploaded file. The same file uploaded twice will have different `uuid`s.
```bash
[
"WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
- "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4"
+ "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
+ "apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
+  "AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
]
```
-To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS.
+To delete all uploaded files along with the data indexed with `$INDEX_NAME` in Redis:
```bash
curl -X POST \
-H "Content-Type: application/json" \
- ${DATAPREP_DELETE_VIDEO_ENDPOINT}
+ ${DATAPREP_DELETE_FILE_ENDPOINT}
```
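+
+To confirm the deletion, you can call the get-files endpoint again; it should now return an empty list (assuming nothing else has been ingested in the meantime):
+
+```bash
+curl -X POST \
+    -H "Content-Type: application/json" \
+    ${DATAPREP_GET_FILE_ENDPOINT}
+```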
7. MegaService
diff --git a/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml b/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml
index d9bf3bce9..eece99da8 100644
--- a/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml
+++ b/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml
@@ -36,6 +36,7 @@ services:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
PORT: ${EMBEDDER_PORT}
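+    # Launch the BridgeTower embedding server on CPU with the configured model (overrides the image's default entrypoint)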
+ entrypoint: ["python", "bridgetower_server.py", "--device", "cpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
restart: unless-stopped
embedding-multimodal:
image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}
@@ -76,6 +77,7 @@ services:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
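+    # Launch the LLaVA server on CPU with the configured model (overrides the image's default entrypoint)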
+ entrypoint: ["python", "llava_server.py", "--device", "cpu", "--model_name_or_path", $LVM_MODEL_ID]
restart: unless-stopped
lvm-llava-svc:
image: ${REGISTRY:-opea}/lvm-llava-svc:${TAG:-latest}
@@ -125,6 +127,7 @@ services:
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
+ - DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
- DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
- DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
ipc: host
diff --git a/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh b/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh
index ca5e650ff..d8824fb0b 100755
--- a/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh
+++ b/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh
@@ -15,13 +15,15 @@ export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
+export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
diff --git a/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md b/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md
index 6517b100c..6d6ca88ff 100644
--- a/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md
+++ b/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md
@@ -40,10 +40,11 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
```
Note: Please replace `host_ip` with your external IP address; do not use localhost.
@@ -224,56 +225,76 @@ curl http://${host_ip}:9399/v1/lvm \
6. Multimodal Dataprep Microservice
-Download a sample video
+Download a sample video, image, and audio file and create a caption
```bash
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
-```
-Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4.
+export image_fn="apple.png"
+wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
+
+export caption_fn="apple.txt"
+echo "This is an apple." > ${caption_fn}
+
+export audio_fn="AudioSample.wav"
+wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
+```
-Test dataprep microservice with generating transcript using whisper model
+Test the dataprep microservice by generating a transcript. This command updates the knowledge base by uploading a local .mp4 video and a .wav audio file.
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
- -X POST -F "files=@./${video_fn}"
+ -X POST \
+ -F "files=@./${video_fn}" \
+ -F "files=@./${audio_fn}"
```
-Also, test dataprep microservice with generating caption using lvm-tgi
+Also, test the dataprep microservice by generating an image caption using the lvm-tgi microservice:
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
- -X POST -F "files=@./${video_fn}"
+ -X POST -F "files=@./${image_fn}"
+```
+
+Now, test the microservice by posting a custom caption along with an image:
+
+```bash
+curl --silent --write-out "HTTPSTATUS:%{http_code}" \
+ ${DATAPREP_INGEST_SERVICE_ENDPOINT} \
+ -H 'Content-Type: multipart/form-data' \
+ -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
```
-Also, you are able to get the list of all videos that you uploaded:
+You can also get the list of all files that you uploaded:
```bash
curl -X POST \
-H "Content-Type: application/json" \
- ${DATAPREP_GET_VIDEO_ENDPOINT}
+ ${DATAPREP_GET_FILE_ENDPOINT}
```
-Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
+Then you will get a response like the Python-style list below. Notice that the name of each uploaded file, e.g. `videoname.mp4`, becomes `videoname_uuid.mp4`, where `uuid` is a unique ID for each uploaded file. The same file uploaded twice will have different `uuid`s.
```bash
[
"WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
- "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4"
+ "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
+ "apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
+  "AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
]
```
-To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS.
+To delete all uploaded files along with the data indexed with `$INDEX_NAME` in Redis:
```bash
curl -X POST \
-H "Content-Type: application/json" \
- ${DATAPREP_DELETE_VIDEO_ENDPOINT}
+ ${DATAPREP_DELETE_FILE_ENDPOINT}
```
7. MegaService
diff --git a/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml b/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml
index d7ac74084..e66aea1f0 100644
--- a/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml
+++ b/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml
@@ -36,6 +36,7 @@ services:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
PORT: ${EMBEDDER_PORT}
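+    # Launch the BridgeTower embedding server on Gaudi (HPU) with the configured model (overrides the image's default entrypoint)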
+ entrypoint: ["python", "bridgetower_server.py", "--device", "hpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
restart: unless-stopped
embedding-multimodal:
image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}
@@ -139,6 +140,7 @@ services:
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
+ - DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
- DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
- DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
ipc: host
diff --git a/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh b/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh
index 211a1a696..b5be052e1 100755
--- a/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh
+++ b/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh
@@ -22,7 +22,8 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
diff --git a/MultimodalQnA/tests/test_compose_on_gaudi.sh b/MultimodalQnA/tests/test_compose_on_gaudi.sh
index dd7af39fb..6af5e9e62 100644
--- a/MultimodalQnA/tests/test_compose_on_gaudi.sh
+++ b/MultimodalQnA/tests/test_compose_on_gaudi.sh
@@ -14,12 +14,15 @@ WORKPATH=$(dirname "$PWD")
LOG_PATH="$WORKPATH/tests"
ip_address=$(hostname -I | awk '{print $1}')
+export image_fn="apple.png"
export video_fn="WeAreGoingOnBullrun.mp4"
+export caption_fn="apple.txt"
function build_docker_images() {
cd $WORKPATH/docker_image_build
- git clone https://github.com/opea-project/GenAIComps.git && cd GenAIComps && git checkout "${opea_branch:-"main"}" && cd ../
-
+ # TODO: Revert to the clone of opea-project "main" after the merge of https://github.com/opea-project/GenAIComps/pull/852
+ # git clone https://github.com/opea-project/GenAIComps.git && cd GenAIComps && git checkout "${opea_branch:-"main"}" && cd ../
+ git clone --single-branch --branch="melanie/mm-rag-enhanced" https://github.com/mhbuehler/GenAIComps.git
echo "Build all the images with --no-cache, check docker_image_build.log for details..."
service_list="multimodalqna multimodalqna-ui embedding-multimodal-bridgetower embedding-multimodal retriever-multimodal-redis lvm-tgi dataprep-multimodal-redis"
docker compose -f build.yaml build ${service_list} --no-cache > ${LOG_PATH}/docker_image_build.log
@@ -47,10 +50,11 @@ function setup_env() {
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+ export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
- export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
- export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+ export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+ export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
}
function start_services() {
@@ -63,12 +67,15 @@ function start_services() {
function prepare_data() {
cd $LOG_PATH
- echo "Downloading video"
+ echo "Downloading image and video"
+ wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
+ echo "Writing caption file"
+ echo "This is an apple." > ${caption_fn}
sleep 30s
-
}
+
function validate_service() {
local URL="$1"
local EXPECTED_RESULT="$2"
@@ -76,9 +83,15 @@ function validate_service() {
local DOCKER_NAME="$4"
local INPUT_DATA="$5"
- if [[ $SERVICE_NAME == *"dataprep-multimodal-redis"* ]]; then
+ if [[ $SERVICE_NAME == *"dataprep-multimodal-redis-transcript"* ]]; then
cd $LOG_PATH
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${video_fn}" -H 'Content-Type: multipart/form-data' "$URL")
+ elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-caption"* ]]; then
+ cd $LOG_PATH
+ HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -H 'Content-Type: multipart/form-data' "$URL")
+ elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-ingest"* ]]; then
+ cd $LOG_PATH
+        HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}" -H 'Content-Type: multipart/form-data' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_get"* ]]; then
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H 'Content-Type: application/json' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_del"* ]]; then
@@ -147,27 +160,34 @@ function validate_microservices() {
sleep 1m # retrieval can't curl as expected, try to wait for more time
# test data prep
- echo "Data Prep with Generating Transcript"
+ echo "Data Prep with Generating Transcript for Video"
validate_service \
"${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
- "dataprep-multimodal-redis" \
+ "dataprep-multimodal-redis-transcript" \
"dataprep-multimodal-redis"
- echo "Data Prep with Generating Transcript"
+ echo "Data Prep with Image & Caption Ingestion"
validate_service \
- "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
+ "${DATAPREP_INGEST_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
- "dataprep-multimodal-redis" \
+ "dataprep-multimodal-redis-ingest" \
"dataprep-multimodal-redis"
- echo "Validating get file"
+ echo "Validating get file returns mp4"
validate_service \
- "${DATAPREP_GET_VIDEO_ENDPOINT}" \
+ "${DATAPREP_GET_FILE_ENDPOINT}" \
'.mp4' \
"dataprep_get" \
"dataprep-multimodal-redis"
+ echo "Validating get file returns png"
+ validate_service \
+ "${DATAPREP_GET_FILE_ENDPOINT}" \
+ '.png' \
+ "dataprep_get" \
+ "dataprep-multimodal-redis"
+
sleep 1m
# multimodal retrieval microservice
@@ -200,6 +220,14 @@ function validate_microservices() {
"lvm-tgi" \
'{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
+ # data prep requiring lvm
+ echo "Data Prep with Generating Caption for Image"
+ validate_service \
+ "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
+ "Data preparation succeeded" \
+ "dataprep-multimodal-redis-caption" \
+ "dataprep-multimodal-redis"
+
sleep 1m
}
@@ -224,14 +252,22 @@ function validate_megaservice() {
}
function validate_delete {
- echo "Validate data prep delete videos"
+ echo "Validate data prep delete files"
validate_service \
- "${DATAPREP_DELETE_VIDEO_ENDPOINT}" \
+ "${DATAPREP_DELETE_FILE_ENDPOINT}" \
'{"status":true}' \
"dataprep_del" \
"dataprep-multimodal-redis"
}
+function delete_data() {
+ cd $LOG_PATH
+ echo "Deleting image, video, and caption"
+ rm -rf ${image_fn}
+ rm -rf ${video_fn}
+ rm -rf ${caption_fn}
+}
+
function stop_docker() {
cd $WORKPATH/docker_compose/intel/hpu/gaudi
docker compose -f compose.yaml stop && docker compose -f compose.yaml rm -f
@@ -256,6 +292,7 @@ function main() {
validate_delete
echo "==== delete validated ===="
+ delete_data
stop_docker
echo y | docker system prune
diff --git a/MultimodalQnA/tests/test_compose_on_xeon.sh b/MultimodalQnA/tests/test_compose_on_xeon.sh
index 46042c600..4b49501ca 100644
--- a/MultimodalQnA/tests/test_compose_on_xeon.sh
+++ b/MultimodalQnA/tests/test_compose_on_xeon.sh
@@ -14,11 +14,15 @@ WORKPATH=$(dirname "$PWD")
LOG_PATH="$WORKPATH/tests"
ip_address=$(hostname -I | awk '{print $1}')
+export image_fn="apple.png"
export video_fn="WeAreGoingOnBullrun.mp4"
+export caption_fn="apple.txt"
function build_docker_images() {
cd $WORKPATH/docker_image_build
- git clone https://github.com/opea-project/GenAIComps.git && cd GenAIComps && git checkout "${opea_branch:-"main"}" && cd ../
+ # TODO: Revert to the clone of opea-project "main" after the merge of https://github.com/opea-project/GenAIComps/pull/852
+ # git clone https://github.com/opea-project/GenAIComps.git && cd GenAIComps && git checkout "${opea_branch:-"main"}" && cd ../
+ git clone --single-branch --branch="melanie/mm-rag-enhanced" https://github.com/mhbuehler/GenAIComps.git
echo "Build all the images with --no-cache, check docker_image_build.log for details..."
service_list="multimodalqna multimodalqna-ui embedding-multimodal-bridgetower embedding-multimodal retriever-multimodal-redis lvm-llava lvm-llava-svc dataprep-multimodal-redis"
@@ -37,6 +41,7 @@ function setup_env() {
export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
+ export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
@@ -44,10 +49,11 @@ function setup_env() {
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+ export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
- export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
- export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+ export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+ export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
}
function start_services() {
@@ -61,12 +67,14 @@ function start_services() {
function prepare_data() {
cd $LOG_PATH
- echo "Downloading video"
+ echo "Downloading image and video"
+ wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
-
+ echo "Writing caption file"
+ echo "This is an apple." > ${caption_fn}
sleep 1m
-
}
+
function validate_service() {
local URL="$1"
local EXPECTED_RESULT="$2"
@@ -74,9 +82,15 @@ function validate_service() {
local DOCKER_NAME="$4"
local INPUT_DATA="$5"
- if [[ $SERVICE_NAME == *"dataprep-multimodal-redis"* ]]; then
+ if [[ $SERVICE_NAME == *"dataprep-multimodal-redis-transcript"* ]]; then
cd $LOG_PATH
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${video_fn}" -H 'Content-Type: multipart/form-data' "$URL")
+ elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-caption"* ]]; then
+ cd $LOG_PATH
+ HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -H 'Content-Type: multipart/form-data' "$URL")
+ elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-ingest"* ]]; then
+ cd $LOG_PATH
+        HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}" -H 'Content-Type: multipart/form-data' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_get"* ]]; then
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H 'Content-Type: application/json' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_del"* ]]; then
@@ -145,27 +159,34 @@ function validate_microservices() {
sleep 1m # retrieval can't curl as expected, try to wait for more time
# test data prep
- echo "Data Prep with Generating Transcript"
+ echo "Data Prep with Generating Transcript for Video"
validate_service \
"${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
- "dataprep-multimodal-redis" \
+ "dataprep-multimodal-redis-transcript" \
"dataprep-multimodal-redis"
- # echo "Data Prep with Generating Caption"
- # validate_service \
- # "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
- # "Data preparation succeeded" \
- # "dataprep-multimodal-redis" \
- # "dataprep-multimodal-redis"
+ echo "Data Prep with Image & Caption Ingestion"
+ validate_service \
+ "${DATAPREP_INGEST_SERVICE_ENDPOINT}" \
+ "Data preparation succeeded" \
+ "dataprep-multimodal-redis-ingest" \
+ "dataprep-multimodal-redis"
- echo "Validating get file"
+ echo "Validating get file returns mp4"
validate_service \
- "${DATAPREP_GET_VIDEO_ENDPOINT}" \
+ "${DATAPREP_GET_FILE_ENDPOINT}" \
'.mp4' \
"dataprep_get" \
"dataprep-multimodal-redis"
+ echo "Validating get file returns png"
+ validate_service \
+ "${DATAPREP_GET_FILE_ENDPOINT}" \
+ '.png' \
+ "dataprep_get" \
+ "dataprep-multimodal-redis"
+
sleep 1m
# multimodal retrieval microservice
@@ -178,7 +199,24 @@ function validate_microservices() {
"retriever-multimodal-redis" \
"{\"text\":\"test\",\"embedding\":${your_embedding}}"
- sleep 10s
+ echo "Wait for lvm-llava service to be ready"
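+    # Poll the container logs until Uvicorn reports that it is serving, waiting up to max_retries * 30s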
+ max_retries=10
+ for i in $(seq $max_retries)
+ do
+ lvm_logs=$(docker logs lvm-llava 2>&1 | grep "Uvicorn running on http://0.0.0.0")
+ if [[ "$lvm_logs" != *"Uvicorn running on http://0.0.0.0"* ]]; then
+ echo "The lvm-llava service is not ready yet, sleeping 30s..."
+ sleep 30s
+ else
+ echo "lvm-llava service is ready"
+ break
+ fi
+ done
+
+    if [[ $i -ge $max_retries ]]; then
+ echo "WARNING: Max retries reached when waiting for the lvm-llava service to be ready"
+ docker logs lvm-llava >> ${LOG_PATH}/lvm_llava_file.log
+ fi
# llava server
echo "Evaluating lvm-llava"
@@ -198,6 +236,14 @@ function validate_microservices() {
"lvm-llava-svc" \
'{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
+ # data prep requiring lvm
+ echo "Data Prep with Generating Caption for Image"
+ validate_service \
+ "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
+ "Data preparation succeeded" \
+ "dataprep-multimodal-redis-caption" \
+ "dataprep-multimodal-redis"
+
sleep 3m
}
@@ -222,14 +268,22 @@ function validate_megaservice() {
}
function validate_delete {
- echo "Validate data prep delete videos"
+ echo "Validate data prep delete files"
validate_service \
- "${DATAPREP_DELETE_VIDEO_ENDPOINT}" \
+ "${DATAPREP_DELETE_FILE_ENDPOINT}" \
'{"status":true}' \
"dataprep_del" \
"dataprep-multimodal-redis"
}
+function delete_data() {
+ cd $LOG_PATH
+ echo "Deleting image, video, and caption"
+ rm -rf ${image_fn}
+ rm -rf ${video_fn}
+ rm -rf ${caption_fn}
+}
+
function stop_docker() {
cd $WORKPATH/docker_compose/intel/cpu/xeon
docker compose -f compose.yaml stop && docker compose -f compose.yaml rm -f
@@ -254,6 +308,7 @@ function main() {
validate_delete
echo "==== delete validated ===="
+ delete_data
stop_docker
echo y | docker system prune
diff --git a/MultimodalQnA/ui/gradio/conversation.py b/MultimodalQnA/ui/gradio/conversation.py
index 9f1a2827b..3057e9879 100644
--- a/MultimodalQnA/ui/gradio/conversation.py
+++ b/MultimodalQnA/ui/gradio/conversation.py
@@ -30,6 +30,7 @@ class Conversation:
base64_frame: str = None
skip_next: bool = False
split_video: str = None
+ image: str = None
def _template_caption(self):
out = ""
@@ -59,6 +60,8 @@ def get_prompt(self):
else:
base64_frame = get_b64_frame_from_timestamp(self.video_file, self.time_of_frame_ms)
self.base64_frame = base64_frame
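+            # Fall back to an empty string so the image_url payload is never None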
+ if base64_frame is None:
+ base64_frame = ""
content.append({"type": "image_url", "image_url": {"url": base64_frame}})
else:
content = message
@@ -137,6 +140,7 @@ def dict(self):
"caption": self.caption,
"base64_frame": self.base64_frame,
"split_video": self.split_video,
+ "image": self.image,
}
@@ -152,4 +156,5 @@ def dict(self):
time_of_frame_ms=None,
base64_frame=None,
split_video=None,
+ image=None,
)
diff --git a/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py b/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py
index 3eba01a71..ec6a033ca 100644
--- a/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py
+++ b/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py
@@ -13,7 +13,7 @@
from conversation import multimodalqna_conv
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
-from utils import build_logger, moderation_msg, server_error_msg, split_video
+from utils import build_logger, make_temp_image, moderation_msg, server_error_msg, split_video
logger = build_logger("gradio_web_server", "gradio_web_server.log")
@@ -47,22 +47,24 @@ def clear_history(state, request: gr.Request):
logger.info(f"clear_history. ip: {request.client.host}")
if state.split_video and os.path.exists(state.split_video):
os.remove(state.split_video)
+ if state.image and os.path.exists(state.image):
+ os.remove(state.image)
state = multimodalqna_conv.copy()
- return (state, state.to_gradio_chatbot(), "", None) + (disable_btn,) * 1
+ return (state, state.to_gradio_chatbot(), None, None, None) + (disable_btn,) * 1
def add_text(state, text, request: gr.Request):
logger.info(f"add_text. ip: {request.client.host}. len: {len(text)}")
if len(text) <= 0:
state.skip_next = True
- return (state, state.to_gradio_chatbot(), "", None) + (no_change_btn,) * 1
+ return (state, state.to_gradio_chatbot(), None) + (no_change_btn,) * 1
text = text[:2000] # Hard cut-off
state.append_message(state.roles[0], text)
state.append_message(state.roles[1], None)
state.skip_next = False
- return (state, state.to_gradio_chatbot(), "") + (disable_btn,) * 1
+ return (state, state.to_gradio_chatbot(), None) + (disable_btn,) * 1
def http_bot(state, request: gr.Request):
@@ -73,7 +75,7 @@ def http_bot(state, request: gr.Request):
if state.skip_next:
# This generate call is skipped due to invalid inputs
path_to_sub_videos = state.get_path_to_subvideos()
- yield (state, state.to_gradio_chatbot(), path_to_sub_videos) + (no_change_btn,) * 1
+ yield (state, state.to_gradio_chatbot(), path_to_sub_videos, None) + (no_change_btn,) * 1
return
if len(state.messages) == state.offset + 2:
@@ -97,7 +99,7 @@ def http_bot(state, request: gr.Request):
logger.info(f"==== url request ====\n{gateway_addr}")
state.messages[-1][-1] = "▌"
- yield (state, state.to_gradio_chatbot(), state.split_video) + (disable_btn,) * 1
+ yield (state, state.to_gradio_chatbot(), state.split_video, state.image) + (disable_btn,) * 1
try:
response = requests.post(
@@ -108,6 +110,7 @@ def http_bot(state, request: gr.Request):
)
print(response.status_code)
print(response.json())
+
if response.status_code == 200:
response = response.json()
choice = response["choices"][-1]
@@ -123,44 +126,61 @@ def http_bot(state, request: gr.Request):
video_file = metadata["source_video"]
state.video_file = os.path.join(static_dir, metadata["source_video"])
state.time_of_frame_ms = metadata["time_of_frame_ms"]
- try:
- splited_video_path = split_video(
- state.video_file, state.time_of_frame_ms, tmp_dir, f"{state.time_of_frame_ms}__{video_file}"
- )
- except:
- print(f"video {state.video_file} does not exist in UI host!")
- splited_video_path = None
- state.split_video = splited_video_path
+ file_ext = os.path.splitext(state.video_file)[-1]
+ if file_ext == ".mp4":
+ try:
+ splited_video_path = split_video(
+ state.video_file, state.time_of_frame_ms, tmp_dir, f"{state.time_of_frame_ms}__{video_file}"
+ )
+                except Exception:
+ print(f"video {state.video_file} does not exist in UI host!")
+ splited_video_path = None
+ state.split_video = splited_video_path
+ elif file_ext in [".jpg", ".jpeg", ".png", ".gif"]:
+ try:
+ output_image_path = make_temp_image(state.video_file, file_ext)
+                except Exception:
+ print(f"image {state.video_file} does not exist in UI host!")
+ output_image_path = None
+ state.image = output_image_path
+
else:
raise requests.exceptions.RequestException
except requests.exceptions.RequestException as e:
state.messages[-1][-1] = server_error_msg
- yield (state, state.to_gradio_chatbot(), None) + (enable_btn,)
+ yield (state, state.to_gradio_chatbot(), None, None) + (enable_btn,)
return
state.messages[-1][-1] = message
- yield (state, state.to_gradio_chatbot(), state.split_video) + (enable_btn,) * 1
+ yield (
+ state,
+ state.to_gradio_chatbot(),
+ gr.Video(state.split_video, visible=state.split_video is not None),
+ gr.Image(state.image, visible=state.image is not None),
+ ) + (enable_btn,) * 1
logger.info(f"{state.messages[-1][-1]}")
return
-def ingest_video_gen_transcript(filepath, request: gr.Request):
- yield (gr.Textbox(visible=True, value="Please wait for ingesting your uploaded video into database..."))
+def ingest_gen_transcript(filepath, filetype, request: gr.Request):
+ yield (
+ gr.Textbox(visible=True, value=f"Please wait while your uploaded {filetype} is ingested into the database...")
+ )
verified_filepath = os.path.normpath(filepath)
if not verified_filepath.startswith(tmp_upload_folder):
- print("Found malicious video file name!")
+ print(f"Found malicious {filetype} file name!")
yield (
gr.Textbox(
visible=True,
- value="Your uploaded video's file name has special characters that are not allowed. Please consider update the video file name!",
+            value=f"Your uploaded {filetype}'s file name has special characters that are not allowed (depending on the OS, some examples are \\, /, :, and *). Please consider changing the file name.",
)
)
return
basename = os.path.basename(verified_filepath)
dest = os.path.join(static_dir, basename)
shutil.copy(verified_filepath, dest)
- print("Done copy uploaded file to static folder!")
+ print("Done copying uploaded file to static folder.")
headers = {
# 'Content-Type': 'multipart/form-data'
}
@@ -172,17 +192,17 @@ def ingest_video_gen_transcript(filepath, request: gr.Request):
if response.status_code == 200:
response = response.json()
print(response)
- yield (gr.Textbox(visible=True, value="Video ingestion is done. Saving your uploaded video..."))
+ yield (gr.Textbox(visible=True, value=f"The {filetype} ingestion is done. Saving your uploaded {filetype}..."))
time.sleep(2)
fn_no_ext = Path(dest).stem
- if "video_id_maps" in response and fn_no_ext in response["video_id_maps"]:
- new_dst = os.path.join(static_dir, response["video_id_maps"][fn_no_ext])
- print(response["video_id_maps"][fn_no_ext])
+ if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
+ new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
+ print(response["file_id_maps"][fn_no_ext])
os.rename(dest, new_dst)
yield (
gr.Textbox(
visible=True,
- value="Congratulation! Your upload is done!\nClick the X button on the top right of the video upload box to upload another video.",
+ value=f"Congratulations, your upload is done!\nClick the X button on the top right of the {filetype} upload box to upload another {filetype}.",
)
)
return
@@ -190,51 +210,53 @@ def ingest_video_gen_transcript(filepath, request: gr.Request):
yield (
gr.Textbox(
visible=True,
- value="Something wrong!\nPlease click the X button on the top right of the video upload boxreupload your video!",
+            value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the {filetype} upload box to reupload your {filetype}.",
)
)
time.sleep(2)
return
-def ingest_video_gen_caption(filepath, request: gr.Request):
- yield (gr.Textbox(visible=True, value="Please wait for ingesting your uploaded video into database..."))
+def ingest_gen_caption(filepath, filetype, request: gr.Request):
+ yield (
+ gr.Textbox(visible=True, value=f"Please wait while your uploaded {filetype} is ingested into the database...")
+ )
verified_filepath = os.path.normpath(filepath)
if not verified_filepath.startswith(tmp_upload_folder):
- print("Found malicious video file name!")
+ print(f"Found malicious {filetype} file name!")
yield (
gr.Textbox(
visible=True,
- value="Your uploaded video's file name has special characters that are not allowed. Please consider update the video file name!",
+            value=f"Your uploaded {filetype}'s file name has special characters that are not allowed (depending on the OS, some examples are \\, /, :, and *). Please consider changing the file name.",
)
)
return
basename = os.path.basename(verified_filepath)
dest = os.path.join(static_dir, basename)
shutil.copy(verified_filepath, dest)
- print("Done copy uploaded file to static folder!")
+ print("Done copying uploaded file to static folder.")
headers = {
# 'Content-Type': 'multipart/form-data'
}
files = {
"files": open(dest, "rb"),
}
- response = requests.post(dataprep_gen_captiono_addr, headers=headers, files=files)
+ response = requests.post(dataprep_gen_caption_addr, headers=headers, files=files)
print(response.status_code)
if response.status_code == 200:
response = response.json()
print(response)
- yield (gr.Textbox(visible=True, value="Video ingestion is done. Saving your uploaded video..."))
+ yield (gr.Textbox(visible=True, value=f"The {filetype} ingestion is done. Saving your uploaded {filetype}..."))
time.sleep(2)
fn_no_ext = Path(dest).stem
- if "video_id_maps" in response and fn_no_ext in response["video_id_maps"]:
- new_dst = os.path.join(static_dir, response["video_id_maps"][fn_no_ext])
- print(response["video_id_maps"][fn_no_ext])
+ if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
+ new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
+ print(response["file_id_maps"][fn_no_ext])
os.rename(dest, new_dst)
yield (
gr.Textbox(
visible=True,
- value="Congratulation! Your upload is done!\nClick the X button on the top right of the video upload box to upload another video.",
+ value=f"Congratulations, your upload is done!\nClick the X button on the top right of the {filetype} upload box to upload another {filetype}.",
)
)
return
@@ -242,48 +264,181 @@ def ingest_video_gen_caption(filepath, request: gr.Request):
yield (
gr.Textbox(
visible=True,
- value="Something wrong!\nPlease click the X button on the top right of the video upload boxreupload your video!",
+            value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the {filetype} upload box to reupload your {filetype}.",
)
)
time.sleep(2)
return
-def clear_uploaded_video(request: gr.Request):
+def ingest_with_text(filepath, text, request: gr.Request):
+    yield (gr.Textbox(visible=True, value="Please wait while your uploaded image is ingested into the database..."))
+ verified_filepath = os.path.normpath(filepath)
+ if not verified_filepath.startswith(tmp_upload_folder):
+ print("Found malicious image file name!")
+ yield (
+ gr.Textbox(
+ visible=True,
+                value="Your uploaded image's file name has special characters that are not allowed (depending on the OS, some examples are \\, /, :, and *). Please consider changing the file name.",
+ )
+ )
+ return
+ basename = os.path.basename(verified_filepath)
+ dest = os.path.join(static_dir, basename)
+ shutil.copy(verified_filepath, dest)
+ text_basename = "{}.txt".format(os.path.splitext(basename)[0])
+ text_dest = os.path.join(static_dir, text_basename)
+ with open(text_dest, "w") as file:
+ file.write(text)
+    print("Done copying uploaded files to static folder.")
+ headers = {
+ # 'Content-Type': 'multipart/form-data'
+ }
+ files = [("files", (basename, open(dest, "rb"))), ("files", (text_basename, open(text_dest, "rb")))]
+ try:
+ response = requests.post(dataprep_ingest_addr, headers=headers, files=files)
+ finally:
+ os.remove(text_dest)
+ print(response.status_code)
+ if response.status_code == 200:
+ response = response.json()
+ print(response)
+ yield (gr.Textbox(visible=True, value="Image ingestion is done. Saving your uploaded image..."))
+ time.sleep(2)
+ fn_no_ext = Path(dest).stem
+ if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
+ new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
+ print(response["file_id_maps"][fn_no_ext])
+ os.rename(dest, new_dst)
+ yield (
+ gr.Textbox(
+ visible=True,
+                value="Congratulations, your upload is done!\nClick the X button on the top right of the image upload box to upload another image.",
+ )
+ )
+ return
+ else:
+ yield (
+ gr.Textbox(
+ visible=True,
+ value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the image upload box to reupload your image!",
+ )
+ )
+ time.sleep(2)
+ return
+
+
+def hide_text(request: gr.Request):
return gr.Textbox(visible=False)
-with gr.Blocks() as upload_gen_trans:
- gr.Markdown("# Ingest Your Own Video - Utilizing Generated Transcripts")
- gr.Markdown(
- "Please use this interface to ingest your own video if the video has meaningful audio (e.g., announcements, discussions, etc...)"
- )
+def clear_text(request: gr.Request):
+ return None
+
+
+with gr.Blocks() as upload_video:
+ gr.Markdown("# Ingest Your Own Video Using Generated Transcripts or Captions")
+ gr.Markdown("Use this interface to ingest your own video and generate transcripts or captions for it")
+
+ def select_upload_type(choice, request: gr.Request):
+ if choice == "transcript":
+ return gr.Video(sources="upload", visible=True), gr.Video(sources="upload", visible=False)
+ else:
+ return gr.Video(sources="upload", visible=False), gr.Video(sources="upload", visible=True)
+
with gr.Row():
with gr.Column(scale=6):
- video_upload = gr.Video(sources="upload", height=512, width=512, elem_id="video_upload")
+ video_upload_trans = gr.Video(sources="upload", elem_id="video_upload_trans", visible=True)
+ video_upload_cap = gr.Video(sources="upload", elem_id="video_upload_cap", visible=False)
with gr.Column(scale=3):
+ text_options_radio = gr.Radio(
+ [
+ ("Generate transcript (video contains voice)", "transcript"),
+ ("Generate captions (video does not contain voice)", "caption"),
+ ],
+ label="Text Options",
+ info="How should text be ingested?",
+ value="transcript",
+ )
text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
- video_upload.upload(ingest_video_gen_transcript, [video_upload], [text_upload_result])
- video_upload.clear(clear_uploaded_video, [], [text_upload_result])
+ video_upload_trans.upload(
+ ingest_gen_transcript, [video_upload_trans, gr.Textbox(value="video", visible=False)], [text_upload_result]
+ )
+ video_upload_trans.clear(hide_text, [], [text_upload_result])
+ video_upload_cap.upload(
+ ingest_gen_caption, [video_upload_cap, gr.Textbox(value="video", visible=False)], [text_upload_result]
+ )
+ video_upload_cap.clear(hide_text, [], [text_upload_result])
+ text_options_radio.change(select_upload_type, [text_options_radio], [video_upload_trans, video_upload_cap])
-with gr.Blocks() as upload_gen_captions:
- gr.Markdown("# Ingest Your Own Video - Utilizing Generated Captions")
- gr.Markdown(
- "Please use this interface to ingest your own video if the video has meaningless audio (e.g., background musics, etc...)"
- )
+with gr.Blocks() as upload_image:
+ gr.Markdown("# Ingest Your Own Image Using Generated or Custom Captions/Labels")
+    gr.Markdown("Use this interface to ingest your own image and generate a caption for it, or provide your own custom caption or label")
+
+ def select_upload_type(choice, request: gr.Request):
+ if choice == "gen_caption":
+ return gr.Image(sources="upload", visible=True), gr.Image(sources="upload", visible=False)
+ else:
+ return gr.Image(sources="upload", visible=False), gr.Image(sources="upload", visible=True)
+
+ with gr.Row():
+ with gr.Column(scale=6):
+ image_upload_cap = gr.Image(type="filepath", sources="upload", elem_id="image_upload_cap", visible=True)
+            image_upload_text = gr.Image(type="filepath", sources="upload", elem_id="image_upload_text", visible=False)
+ with gr.Column(scale=3):
+ text_options_radio = gr.Radio(
+ [("Generate caption", "gen_caption"), ("Custom caption or label", "custom_caption")],
+ label="Text Options",
+ info="How should text be ingested?",
+ value="gen_caption",
+ )
+ custom_caption = gr.Textbox(visible=True, interactive=True, label="Custom Caption or Label")
+ text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
+ image_upload_cap.upload(
+ ingest_gen_caption, [image_upload_cap, gr.Textbox(value="image", visible=False)], [text_upload_result]
+ )
+ image_upload_cap.clear(hide_text, [], [text_upload_result])
+ image_upload_text.upload(ingest_with_text, [image_upload_text, custom_caption], [text_upload_result]).then(
+ clear_text, [], [custom_caption]
+ )
+ image_upload_text.clear(hide_text, [], [text_upload_result])
+ text_options_radio.change(select_upload_type, [text_options_radio], [image_upload_cap, image_upload_text])
+
+with gr.Blocks() as upload_audio:
+ gr.Markdown("# Ingest Your Own Audio Using Generated Transcripts")
+ gr.Markdown("Use this interface to ingest your own audio file and generate a transcript for it")
with gr.Row():
with gr.Column(scale=6):
- video_upload_cap = gr.Video(sources="upload", height=512, width=512, elem_id="video_upload_cap")
+ audio_upload = gr.Audio(type="filepath")
+ with gr.Column(scale=3):
+ text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
+ audio_upload.upload(
+ ingest_gen_transcript, [audio_upload, gr.Textbox(value="audio", visible=False)], [text_upload_result]
+ )
+ audio_upload.stop_recording(
+ ingest_gen_transcript, [audio_upload, gr.Textbox(value="audio", visible=False)], [text_upload_result]
+ )
+ audio_upload.clear(hide_text, [], [text_upload_result])
+
+with gr.Blocks() as upload_pdf:
+ gr.Markdown("# Ingest Your Own PDF")
+ gr.Markdown("Use this interface to ingest your own PDF file with text, tables, images, and graphs")
+ with gr.Row():
+ with gr.Column(scale=6):
+            pdf_upload = gr.File()
with gr.Column(scale=3):
text_upload_result_cap = gr.Textbox(visible=False, interactive=False, label="Upload Status")
- video_upload_cap.upload(ingest_video_gen_transcript, [video_upload_cap], [text_upload_result_cap])
- video_upload_cap.clear(clear_uploaded_video, [], [text_upload_result_cap])
+    pdf_upload.upload(
+        ingest_gen_caption, [pdf_upload, gr.Textbox(value="PDF", visible=False)], [text_upload_result_cap]
+    )
+    pdf_upload.clear(hide_text, [], [text_upload_result_cap])
with gr.Blocks() as qna:
state = gr.State(multimodalqna_conv.copy())
with gr.Row():
with gr.Column(scale=4):
- video = gr.Video(height=512, width=512, elem_id="video")
+ video = gr.Video(height=512, width=512, elem_id="video", visible=True, label="Media")
+ image = gr.Image(height=512, width=512, elem_id="image", visible=False, label="Media")
with gr.Column(scale=7):
chatbot = gr.Chatbot(elem_id="chatbot", label="MultimodalQnA Chatbot", height=390)
with gr.Row():
@@ -293,7 +448,8 @@ def clear_uploaded_video(request: gr.Request):
# show_label=False,
# container=False,
label="Query",
- info="Enter your query here!",
+ info="Enter a text query below",
+ # submit_btn=False,
)
with gr.Column(scale=1, min_width=100):
with gr.Row():
@@ -306,7 +462,7 @@ def clear_uploaded_video(request: gr.Request):
[
state,
],
- [state, chatbot, textbox, video, clear_btn],
+ [state, chatbot, textbox, video, image, clear_btn],
)
submit_btn.click(
@@ -318,17 +474,19 @@ def clear_uploaded_video(request: gr.Request):
[
state,
],
- [state, chatbot, video, clear_btn],
+ [state, chatbot, video, image, clear_btn],
)
with gr.Blocks(css=css) as demo:
gr.Markdown("# MultimodalQnA")
with gr.Tabs():
- with gr.TabItem("MultimodalQnA With Your Videos"):
+ with gr.TabItem("MultimodalQnA"):
qna.render()
- with gr.TabItem("Upload Your Own Videos"):
- upload_gen_trans.render()
- with gr.TabItem("Upload Your Own Videos"):
- upload_gen_captions.render()
+ with gr.TabItem("Upload Video"):
+ upload_video.render()
+ with gr.TabItem("Upload Image"):
+ upload_image.render()
+ with gr.TabItem("Upload Audio"):
+ upload_audio.render()
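+        # Assumed intended: render the PDF upload interface defined above, which is otherwise unreachable
+        with gr.TabItem("Upload PDF"):
+            upload_pdf.render()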
demo.queue()
app = gr.mount_gradio_app(app, demo, path="/")
@@ -343,6 +501,9 @@ def clear_uploaded_video(request: gr.Request):
parser.add_argument("--share", action="store_true")
backend_service_endpoint = os.getenv("BACKEND_SERVICE_ENDPOINT", "http://localhost:8888/v1/multimodalqna")
+ dataprep_ingest_endpoint = os.getenv(
+ "DATAPREP_INGEST_SERVICE_ENDPOINT", "http://localhost:6007/v1/ingest_with_text"
+ )
dataprep_gen_transcript_endpoint = os.getenv(
"DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT", "http://localhost:6007/v1/generate_transcripts"
)
@@ -353,9 +514,11 @@ def clear_uploaded_video(request: gr.Request):
logger.info(f"args: {args}")
global gateway_addr
gateway_addr = backend_service_endpoint
+ global dataprep_ingest_addr
+ dataprep_ingest_addr = dataprep_ingest_endpoint
global dataprep_gen_transcript_addr
dataprep_gen_transcript_addr = dataprep_gen_transcript_endpoint
- global dataprep_gen_captiono_addr
- dataprep_gen_captiono_addr = dataprep_gen_caption_endpoint
+ global dataprep_gen_caption_addr
+ dataprep_gen_caption_addr = dataprep_gen_caption_endpoint
uvicorn.run(app, host=args.host, port=args.port)
diff --git a/MultimodalQnA/ui/gradio/utils.py b/MultimodalQnA/ui/gradio/utils.py
index f6e1027eb..7a730a7ed 100644
--- a/MultimodalQnA/ui/gradio/utils.py
+++ b/MultimodalQnA/ui/gradio/utils.py
@@ -5,6 +5,7 @@
import logging
import logging.handlers
import os
+import shutil
import sys
from pathlib import Path
@@ -118,6 +119,18 @@ def maintain_aspect_ratio_resize(image, width=None, height=None, inter=cv2.INTER
return cv2.resize(image, dim, interpolation=inter)
+def make_temp_image(
+ image_name,
+ file_ext,
+ output_image_path: str = "./public/images",
+ output_image_name: str = "image_tmp",
+):
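+    """Copy an image into the public images folder under a temporary name and return the new path."""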
+ Path(output_image_path).mkdir(parents=True, exist_ok=True)
+    output_image = os.path.join(output_image_path, "{}{}".format(output_image_name, file_ext))
+ shutil.copy(image_name, output_image)
+ return output_image
+
+
# function to split video at a timestamp
def split_video(
video_path,
diff --git a/README.md b/README.md
index 87581d3dd..a34166387 100644
--- a/README.md
+++ b/README.md
@@ -37,18 +37,19 @@ Deployment are based on released docker images by default, check [docker image l
#### Deploy Examples
-| Use Case | Docker Compose<br>Deployment on Xeon | Docker Compose<br>Deployment on Gaudi | Kubernetes with Manifests | Kubernetes with Helm Charts | Kubernetes with GMC |
-| ----------------- | ------------------------------------------------------------------------------ | -------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------ |
-| ChatQnA | [Xeon Instructions](ChatQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](ChatQnA/docker_compose/intel/hpu/gaudi/README.md) | [ChatQnA with Manifests](ChatQnA/kubernetes/intel/README.md) | [ChatQnA with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna/README.md) | [ChatQnA with GMC](ChatQnA/kubernetes/intel/README_gmc.md) |
-| CodeGen | [Xeon Instructions](CodeGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeGen/docker_compose/intel/hpu/gaudi/README.md) | [CodeGen with Manifests](CodeGen/kubernetes/intel/README.md) | [CodeGen with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codegen/README.md) | [CodeGen with GMC](CodeGen/kubernetes/intel/README_gmc.md) |
-| CodeTrans | [Xeon Instructions](CodeTrans/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeTrans/docker_compose/intel/hpu/gaudi/README.md) | [CodeTrans with Manifests](CodeTrans/kubernetes/intel/README.md) | [CodeTrans with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codetrans/README.md) | [CodeTrans with GMC](CodeTrans/kubernetes/intel/README_gmc.md) |
-| DocSum | [Xeon Instructions](DocSum/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](DocSum/docker_compose/intel/hpu/gaudi/README.md) | [DocSum with Manifests](DocSum/kubernetes/intel/README.md) | [DocSum with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/docsum/README.md) | [DocSum with GMC](DocSum/kubernetes/intel/README_gmc.md) |
-| SearchQnA | [Xeon Instructions](SearchQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](SearchQnA/docker_compose/intel/hpu/gaudi/README.md) | Not Supported | Not Supported | [SearchQnA with GMC](SearchQnA/kubernetes/intel/README_gmc.md) |
-| FaqGen | [Xeon Instructions](FaqGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](FaqGen/docker_compose/intel/hpu/gaudi/README.md) | [FaqGen with Manifests](FaqGen/kubernetes/intel/README.md) | Not Supported | [FaqGen with GMC](FaqGen/kubernetes/intel/README_gmc.md) |
-| Translation | [Xeon Instructions](Translation/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](Translation/docker_compose/intel/hpu/gaudi/README.md) | [Translation with Manifests](Translation/kubernetes/intel/README.md) | Not Supported | [Translation with GMC](Translation/kubernetes/intel/README_gmc.md) |
-| AudioQnA | [Xeon Instructions](AudioQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](AudioQnA/docker_compose/intel/hpu/gaudi/README.md) | [AudioQnA with Manifests](AudioQnA/kubernetes/intel/README.md) | Not Supported | [AudioQnA with GMC](AudioQnA/kubernetes/intel/README_gmc.md) |
-| VisualQnA | [Xeon Instructions](VisualQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](VisualQnA/docker_compose/intel/hpu/gaudi/README.md) | [VisualQnA with Manifests](VisualQnA/kubernetes/intel/README.md) | Not Supported | [VisualQnA with GMC](VisualQnA/kubernetes/intel/README_gmc.md) |
-| ProductivitySuite | [Xeon Instructions](ProductivitySuite/docker_compose/intel/cpu/xeon/README.md) | Not Supported | [ProductivitySuite with Manifests](ProductivitySuite/kubernetes/intel/README.md) | Not Supported | Not Supported |
+| Use Case | Docker Compose<br/>Deployment on Xeon | Docker Compose<br/>Deployment on Gaudi | Kubernetes with Manifests | Kubernetes with Helm Charts | Kubernetes with GMC |
+| ----------------- | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------ |
+| ChatQnA | [Xeon Instructions](ChatQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](ChatQnA/docker_compose/intel/hpu/gaudi/README.md) | [ChatQnA with Manifests](ChatQnA/kubernetes/intel/README.md) | [ChatQnA with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna/README.md) | [ChatQnA with GMC](ChatQnA/kubernetes/intel/README_gmc.md) |
+| CodeGen | [Xeon Instructions](CodeGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeGen/docker_compose/intel/hpu/gaudi/README.md) | [CodeGen with Manifests](CodeGen/kubernetes/intel/README.md) | [CodeGen with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codegen/README.md) | [CodeGen with GMC](CodeGen/kubernetes/intel/README_gmc.md) |
+| CodeTrans | [Xeon Instructions](CodeTrans/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeTrans/docker_compose/intel/hpu/gaudi/README.md) | [CodeTrans with Manifests](CodeTrans/kubernetes/intel/README.md) | [CodeTrans with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codetrans/README.md) | [CodeTrans with GMC](CodeTrans/kubernetes/intel/README_gmc.md) |
+| DocSum | [Xeon Instructions](DocSum/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](DocSum/docker_compose/intel/hpu/gaudi/README.md) | [DocSum with Manifests](DocSum/kubernetes/intel/README.md) | [DocSum with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/docsum/README.md) | [DocSum with GMC](DocSum/kubernetes/intel/README_gmc.md) |
+| SearchQnA | [Xeon Instructions](SearchQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](SearchQnA/docker_compose/intel/hpu/gaudi/README.md) | Not Supported | Not Supported | [SearchQnA with GMC](SearchQnA/kubernetes/intel/README_gmc.md) |
+| FaqGen | [Xeon Instructions](FaqGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](FaqGen/docker_compose/intel/hpu/gaudi/README.md) | [FaqGen with Manifests](FaqGen/kubernetes/intel/README.md) | Not Supported | [FaqGen with GMC](FaqGen/kubernetes/intel/README_gmc.md) |
+| Translation | [Xeon Instructions](Translation/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](Translation/docker_compose/intel/hpu/gaudi/README.md) | [Translation with Manifests](Translation/kubernetes/intel/README.md) | Not Supported | [Translation with GMC](Translation/kubernetes/intel/README_gmc.md) |
+| AudioQnA | [Xeon Instructions](AudioQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](AudioQnA/docker_compose/intel/hpu/gaudi/README.md) | [AudioQnA with Manifests](AudioQnA/kubernetes/intel/README.md) | Not Supported | [AudioQnA with GMC](AudioQnA/kubernetes/intel/README_gmc.md) |
+| VisualQnA | [Xeon Instructions](VisualQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](VisualQnA/docker_compose/intel/hpu/gaudi/README.md) | [VisualQnA with Manifests](VisualQnA/kubernetes/intel/README.md) | Not Supported | [VisualQnA with GMC](VisualQnA/kubernetes/intel/README_gmc.md) |
+| MultimodalQnA | [Xeon Instructions](MultimodalQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md) | Not Supported | Not Supported | Not Supported |
+| ProductivitySuite | [Xeon Instructions](ProductivitySuite/docker_compose/intel/cpu/xeon/README.md) | Not Supported | [ProductivitySuite with Manifests](ProductivitySuite/kubernetes/intel/README.md) | Not Supported | Not Supported |
## Supported Examples
diff --git a/docker_images_list.md b/docker_images_list.md
index d90ae08dc..056ee2fce 100644
--- a/docker_images_list.md
+++ b/docker_images_list.md
@@ -26,8 +26,8 @@ Take ChatQnA for example. ChatQnA is a chatbot application service based on the
| [opea/faqgen](https://hub.docker.com/r/opea/faqgen) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/FaqGen/Dockerfile) | The docker image served as a faqgen gateway and automatically generating comprehensive, natural sounding Frequently Asked Questions (FAQs) from documents, legal texts, customer inquiries and other sources. |
| [opea/faqgen-ui](https://hub.docker.com/r/opea/faqgen-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/FaqGen/ui/docker/Dockerfile) | The docker image serves as the docsum UI entry point for easy interaction with users, generating FAQs by pasting in question text. |
| [opea/faqgen-react-ui](https://hub.docker.com/r/opea/faqgen-react-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/FaqGen/ui/docker/Dockerfile.react) | The purpose of the docker image is to provide a user interface for Generate FAQs using React. It allows generating FAQs by uploading files or pasting text. |
-| [opea/multimodalqna](https://hub.docker.com/r/opea/multimodalqna) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/Dockerfile) | The docker image served as a multimodalqna gateway and dynamically fetches the most relevant multimodal information (frames, transcripts, and/or subtitles) from the user's video collection to solve the problem. |
-| [opea/multimodalqna-ui](https://hub.docker.com/r/opea/multimodalqna-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/ui/docker/Dockerfile) | The docker image serves as the docsum UI entry point for easy interaction with users. Answers to questions are generated from videos uploaded by users.. |
+| [opea/multimodalqna](https://hub.docker.com/r/opea/multimodalqna) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/Dockerfile) | The docker image served as a multimodalqna gateway and dynamically fetches the most relevant multimodal information (frames, transcripts, and/or subtitles) from the user's video, image, or audio collection to solve the problem. |
+| [opea/multimodalqna-ui](https://hub.docker.com/r/opea/multimodalqna-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/ui/docker/Dockerfile) | The docker image serves as the multimodalqna UI entry point for easy interaction with users. Answers to questions are generated from the videos, images, and audio files uploaded by users. |
| [opea/productivity-suite-react-ui-server](https://hub.docker.com/r/opea/productivity-suite-react-ui-server) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/ProductivitySuite/ui/docker/Dockerfile.react) | The purpose of the docker image is to provide a user interface for Productivity Suite Application using React. It allows interaction by uploading documents and inputs. |
| [opea/searchqna](https://hub.docker.com/r/opea/searchqna/tags) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/SearchQnA/Dockerfile) | The docker image served as the searchqna gateway to provide service of retrieving accurate and relevant answers to user queries from a knowledge base or dataset |
| [opea/searchqna-ui](https://hub.docker.com/r/opea/searchqna-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/SearchQnA/ui/docker/Dockerfile) | The docker image acted as the searchqna UI entry for facilitating interaction with users for question answering |
diff --git a/supported_examples.md b/supported_examples.md
index 33b02f71d..0754be3ee 100644
--- a/supported_examples.md
+++ b/supported_examples.md
@@ -186,7 +186,15 @@ FAQ Generation Application leverages the power of large language models (LLMs) t
### MultimodalQnA
-[MultimodalQnA](./MultimodalQnA/README.md) addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos.
+[MultimodalQnA](./MultimodalQnA/README.md) addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, or audio files. MultimodalQnA utilizes the BridgeTower model, a multimodal encoding transformer that merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as text, and those embeddings are then stored in a vector database. When answering a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
+
+| Service | Model | HW | Description |
+| --------- | ----------------------------------------------------------------------------------------------------------------- | ---------- | ----------------------------- |
+| Embedding | [BridgeTower/bridgetower-large-itm-mlm-itc](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-itc) | Xeon/Gaudi | Multimodal embeddings service |
+| Embedding | [BridgeTower/bridgetower-large-itm-mlm-gaudi](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi) | Gaudi | Multimodal embeddings service |
+| LVM | [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | Xeon | LVM service |
+| LVM | [llava-hf/llava-1.5-13b-hf](https://huggingface.co/llava-hf/llava-1.5-13b-hf) | Xeon | LVM service |
+| LVM | [llava-hf/llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) | Gaudi | LVM service |
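For readers unfamiliar with the embedding model listed above, here is a minimal `transformers` sketch of BridgeTower scoring an image/text pair. This is an illustration of the checkpoint only, not part of this patch; MultimodalQnA wraps the model behind its embedding microservice.

```python
import requests
from PIL import Image
from transformers import BridgeTowerForImageAndTextRetrieval, BridgeTowerProcessor

# The Xeon/Gaudi checkpoint from the table above.
model_id = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(model_id)
model = BridgeTowerForImageAndTextRetrieval.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# A higher "match" logit means the caption fits the image better.
for text in ["two cats sleeping on a couch", "a stop sign on a corner"]:
    logits = model(**processor(image, text, return_tensors="pt")).logits
    print(f"{text}: {logits[0, 1].item():.2f}")
```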
### ProductivitySuite