diff --git a/MultimodalQnA/README.md b/MultimodalQnA/README.md
index 95626aa78..08de5686a 100644
--- a/MultimodalQnA/README.md
+++ b/MultimodalQnA/README.md
@@ -2,7 +2,7 @@
Suppose you possess a set of videos and wish to perform question-answering to extract insights from them. Answering your questions typically requires comprehension of visual cues within the videos, knowledge derived from the audio content, or often a mix of both. The MultimodalQnA framework offers an optimal solution for this purpose.
-`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the video ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
+`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
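+
+For illustration, once the example is deployed, a single request to the MegaService gateway performs the whole retrieve-and-answer flow. The snippet below is a minimal sketch: it assumes the gateway accepts the `messages` payload used by other OPEA MegaServices, and `BACKEND_SERVICE_ENDPOINT` (port 8888) is configured in the deployment sections referenced later in this guide.
+
+```bash
+curl http://${host_ip}:8888/v1/multimodalqna \
+    -H "Content-Type: application/json" \
+    -d '{"messages": "What items are mentioned in these videos?"}'
+```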
The MultimodalQnA architecture is shown below:
@@ -100,10 +100,12 @@ In the below, we provide a table that describes for each microservice component
By default, the embedding and LVM models are set to the values listed below:
-| Service | Model |
-| -------------------- | ------------------------------------------- |
-| embedding-multimodal | BridgeTower/bridgetower-large-itm-mlm-gaudi |
-| LVM | llava-hf/llava-v1.6-vicuna-13b-hf |
+| Service | HW | Model |
+| -------------------- | ----- | ----------------------------------------- |
+| embedding-multimodal | Xeon | BridgeTower/bridgetower-large-itm-mlm-itc |
+| LVM | Xeon | llava-hf/llava-1.5-7b-hf |
+| embedding-multimodal | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc |
+| LVM | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf |
You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf` and `llava-hf/llava-1.5-13b-hf`, as needed, as shown below.
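
For example, the LVM model can be overridden by exporting `LVM_MODEL_ID` before starting the services (this variable is consumed by the deployments' `set_env.sh` and compose files, described in the sections below):

```bash
export LVM_MODEL_ID="llava-hf/llava-1.5-13b-hf"
```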
diff --git a/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md b/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md
index 9b3a3edaa..d0a1c7d27 100644
--- a/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md
+++ b/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md
@@ -84,16 +84,18 @@ export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
+export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
```
Note: Please replace `host_ip` with your external IP address; do not use localhost.
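+
+One way to set `host_ip` (a sketch mirroring what the test scripts in this repository do) is:
+
+```bash
+export host_ip=$(hostname -I | awk '{print $1}')
+```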
@@ -274,54 +276,76 @@ curl http://${host_ip}:9399/v1/lvm \
6. dataprep-multimodal-redis
-Download a sample video
+Download a sample video, image, and audio file and create a caption
```bash
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
+
+export image_fn="apple.png"
+wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
+
+export caption_fn="apple.txt"
+echo "This is an apple." > ${caption_fn}
+
+export audio_fn="AudioSample.wav"
+wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
```
-Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4.
+Test the dataprep microservice by generating a transcript. This command updates the knowledge base by uploading a local .mp4 video and a .wav audio file.
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
- -X POST -F "files=@./${video_fn}"
+ -X POST \
+ -F "files=@./${video_fn}" \
+ -F "files=@./${audio_fn}"
```
-Also, test dataprep microservice with generating caption using lvm microservice
+Also, test the dataprep microservice by generating an image caption using the lvm microservice:
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
- -X POST -F "files=@./${video_fn}"
+ -X POST -F "files=@./${image_fn}"
+```
+
+Now, test the microservice by posting a custom caption along with an image:
+
+```bash
+curl --silent --write-out "HTTPSTATUS:%{http_code}" \
+ ${DATAPREP_INGEST_SERVICE_ENDPOINT} \
+ -H 'Content-Type: multipart/form-data' \
+ -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
```
-Also, you are able to get the list of all videos that you uploaded:
+You can also get the list of all files that you uploaded:
```bash
curl -X POST \
-H "Content-Type: application/json" \
- ${DATAPREP_GET_VIDEO_ENDPOINT}
+ ${DATAPREP_GET_FILE_ENDPOINT}
```
-Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
+Then you will get a response like the Python-style list below. Notice that the name of each uploaded file, e.g. `videoname.mp4`, becomes `videoname_uuid.mp4`, where `uuid` is a unique ID for each uploaded file. The same file uploaded twice will have different `uuid`s.
```bash
[
"WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
- "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4"
+ "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
+ "apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
+  "AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
]
```
-To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS.
+To delete all uploaded files along with the data indexed with `$INDEX_NAME` in Redis:
```bash
curl -X POST \
-H "Content-Type: application/json" \
- ${DATAPREP_DELETE_VIDEO_ENDPOINT}
+ ${DATAPREP_DELETE_FILE_ENDPOINT}
```
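+
+To confirm the deletion, you can call the get-files endpoint again; it should now return an empty list (assuming nothing else has been ingested in the meantime):
+
+```bash
+curl -X POST \
+    -H "Content-Type: application/json" \
+    ${DATAPREP_GET_FILE_ENDPOINT}
+```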
7. MegaService
diff --git a/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml b/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml
index d9bf3bce9..eece99da8 100644
--- a/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml
+++ b/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml
@@ -36,6 +36,7 @@ services:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
PORT: ${EMBEDDER_PORT}
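+    # Launch the BridgeTower embedding server on CPU with the configured model (overrides the image's default entrypoint)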
+ entrypoint: ["python", "bridgetower_server.py", "--device", "cpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
restart: unless-stopped
embedding-multimodal:
image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}
@@ -76,6 +77,7 @@ services:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
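+    # Launch the LLaVA server on CPU with the configured model (overrides the image's default entrypoint)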
+ entrypoint: ["python", "llava_server.py", "--device", "cpu", "--model_name_or_path", $LVM_MODEL_ID]
restart: unless-stopped
lvm-llava-svc:
image: ${REGISTRY:-opea}/lvm-llava-svc:${TAG:-latest}
@@ -125,6 +127,7 @@ services:
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
+ - DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
- DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
- DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
ipc: host
diff --git a/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh b/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh
index ca5e650ff..d8824fb0b 100755
--- a/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh
+++ b/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh
@@ -15,13 +15,15 @@ export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
+export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
diff --git a/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md b/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md
index 6517b100c..6d6ca88ff 100644
--- a/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md
+++ b/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md
@@ -40,10 +40,11 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
```
Note: Please replace `host_ip` with your external IP address; do not use localhost.
@@ -224,56 +225,76 @@ curl http://${host_ip}:9399/v1/lvm \
6. Multimodal Dataprep Microservice
-Download a sample video
+Download a sample video, image, and audio file and create a caption
```bash
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
-```
-Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4.
+export image_fn="apple.png"
+wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
+
+export caption_fn="apple.txt"
+echo "This is an apple." > ${caption_fn}
+
+export audio_fn="AudioSample.wav"
+wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
+```
-Test dataprep microservice with generating transcript using whisper model
+Test the dataprep microservice by generating a transcript. This command updates the knowledge base by uploading a local .mp4 video and a .wav audio file.
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
- -X POST -F "files=@./${video_fn}"
+ -X POST \
+ -F "files=@./${video_fn}" \
+ -F "files=@./${audio_fn}"
```
-Also, test dataprep microservice with generating caption using lvm-tgi
+Also, test the dataprep microservice by generating an image caption using the lvm-tgi microservice:
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
- -X POST -F "files=@./${video_fn}"
+ -X POST -F "files=@./${image_fn}"
+```
+
+Now, test the microservice by posting a custom caption along with an image:
+
+```bash
+curl --silent --write-out "HTTPSTATUS:%{http_code}" \
+ ${DATAPREP_INGEST_SERVICE_ENDPOINT} \
+ -H 'Content-Type: multipart/form-data' \
+ -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
```
-Also, you are able to get the list of all videos that you uploaded:
+You can also get the list of all files that you uploaded:
```bash
curl -X POST \
-H "Content-Type: application/json" \
- ${DATAPREP_GET_VIDEO_ENDPOINT}
+ ${DATAPREP_GET_FILE_ENDPOINT}
```
-Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
+Then you will get a response like the Python-style list below. Notice that the name of each uploaded file, e.g. `videoname.mp4`, becomes `videoname_uuid.mp4`, where `uuid` is a unique ID for each uploaded file. The same file uploaded twice will have different `uuid`s.
```bash
[
"WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
- "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4"
+ "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
+ "apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
+  "AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
]
```
-To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS.
+To delete all uploaded files along with the data indexed with `$INDEX_NAME` in Redis:
```bash
curl -X POST \
-H "Content-Type: application/json" \
- ${DATAPREP_DELETE_VIDEO_ENDPOINT}
+ ${DATAPREP_DELETE_FILE_ENDPOINT}
```
7. MegaService
diff --git a/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml b/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml
index d7ac74084..e66aea1f0 100644
--- a/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml
+++ b/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml
@@ -36,6 +36,7 @@ services:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
PORT: ${EMBEDDER_PORT}
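+    # Launch the BridgeTower embedding server on Gaudi (HPU) with the configured model (overrides the image's default entrypoint)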
+ entrypoint: ["python", "bridgetower_server.py", "--device", "hpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
restart: unless-stopped
embedding-multimodal:
image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}
@@ -139,6 +140,7 @@ services:
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
+ - DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
- DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
- DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
ipc: host
diff --git a/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh b/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh
index 211a1a696..b5be052e1 100755
--- a/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh
+++ b/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh
@@ -22,7 +22,8 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
diff --git a/MultimodalQnA/tests/test_compose_on_gaudi.sh b/MultimodalQnA/tests/test_compose_on_gaudi.sh
index dd7af39fb..6af5e9e62 100644
--- a/MultimodalQnA/tests/test_compose_on_gaudi.sh
+++ b/MultimodalQnA/tests/test_compose_on_gaudi.sh
@@ -14,12 +14,15 @@ WORKPATH=$(dirname "$PWD")
LOG_PATH="$WORKPATH/tests"
ip_address=$(hostname -I | awk '{print $1}')
+export image_fn="apple.png"
export video_fn="WeAreGoingOnBullrun.mp4"
+export caption_fn="apple.txt"
function build_docker_images() {
cd $WORKPATH/docker_image_build
- git clone https://github.com/opea-project/GenAIComps.git && cd GenAIComps && git checkout "${opea_branch:-"main"}" && cd ../
-
+ # TODO: Revert to the clone of opea-project "main" after the merge of https://github.com/opea-project/GenAIComps/pull/852
+ # git clone https://github.com/opea-project/GenAIComps.git && cd GenAIComps && git checkout "${opea_branch:-"main"}" && cd ../
+ git clone --single-branch --branch="melanie/mm-rag-enhanced" https://github.com/mhbuehler/GenAIComps.git
echo "Build all the images with --no-cache, check docker_image_build.log for details..."
service_list="multimodalqna multimodalqna-ui embedding-multimodal-bridgetower embedding-multimodal retriever-multimodal-redis lvm-tgi dataprep-multimodal-redis"
docker compose -f build.yaml build ${service_list} --no-cache > ${LOG_PATH}/docker_image_build.log
@@ -47,10 +50,11 @@ function setup_env() {
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+ export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
- export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
- export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+ export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+ export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
}
function start_services() {
@@ -63,12 +67,15 @@ function start_services() {
function prepare_data() {
cd $LOG_PATH
- echo "Downloading video"
+ echo "Downloading image and video"
+ wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
+ echo "Writing caption file"
+ echo "This is an apple." > ${caption_fn}
sleep 30s
-
}
+
function validate_service() {
local URL="$1"
local EXPECTED_RESULT="$2"
@@ -76,9 +83,15 @@ function validate_service() {
local DOCKER_NAME="$4"
local INPUT_DATA="$5"
- if [[ $SERVICE_NAME == *"dataprep-multimodal-redis"* ]]; then
+ if [[ $SERVICE_NAME == *"dataprep-multimodal-redis-transcript"* ]]; then
cd $LOG_PATH
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${video_fn}" -H 'Content-Type: multipart/form-data' "$URL")
+ elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-caption"* ]]; then
+ cd $LOG_PATH
+ HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -H 'Content-Type: multipart/form-data' "$URL")
+ elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-ingest"* ]]; then
+ cd $LOG_PATH
+        HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}" -H 'Content-Type: multipart/form-data' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_get"* ]]; then
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H 'Content-Type: application/json' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_del"* ]]; then
@@ -147,27 +160,34 @@ function validate_microservices() {
sleep 1m # retrieval can't curl as expected, try to wait for more time
# test data prep
- echo "Data Prep with Generating Transcript"
+ echo "Data Prep with Generating Transcript for Video"
validate_service \
"${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
- "dataprep-multimodal-redis" \
+ "dataprep-multimodal-redis-transcript" \
"dataprep-multimodal-redis"
- echo "Data Prep with Generating Transcript"
+ echo "Data Prep with Image & Caption Ingestion"
validate_service \
- "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
+ "${DATAPREP_INGEST_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
- "dataprep-multimodal-redis" \
+ "dataprep-multimodal-redis-ingest" \
"dataprep-multimodal-redis"
- echo "Validating get file"
+ echo "Validating get file returns mp4"
validate_service \
- "${DATAPREP_GET_VIDEO_ENDPOINT}" \
+ "${DATAPREP_GET_FILE_ENDPOINT}" \
'.mp4' \
"dataprep_get" \
"dataprep-multimodal-redis"
+ echo "Validating get file returns png"
+ validate_service \
+ "${DATAPREP_GET_FILE_ENDPOINT}" \
+ '.png' \
+ "dataprep_get" \
+ "dataprep-multimodal-redis"
+
sleep 1m
# multimodal retrieval microservice
@@ -200,6 +220,14 @@ function validate_microservices() {
"lvm-tgi" \
'{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
+ # data prep requiring lvm
+ echo "Data Prep with Generating Caption for Image"
+ validate_service \
+ "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
+ "Data preparation succeeded" \
+ "dataprep-multimodal-redis-caption" \
+ "dataprep-multimodal-redis"
+
sleep 1m
}
@@ -224,14 +252,22 @@ function validate_megaservice() {
}
function validate_delete {
- echo "Validate data prep delete videos"
+ echo "Validate data prep delete files"
validate_service \
- "${DATAPREP_DELETE_VIDEO_ENDPOINT}" \
+ "${DATAPREP_DELETE_FILE_ENDPOINT}" \
'{"status":true}' \
"dataprep_del" \
"dataprep-multimodal-redis"
}
+function delete_data() {
+ cd $LOG_PATH
+ echo "Deleting image, video, and caption"
+ rm -rf ${image_fn}
+ rm -rf ${video_fn}
+ rm -rf ${caption_fn}
+}
+
function stop_docker() {
cd $WORKPATH/docker_compose/intel/hpu/gaudi
docker compose -f compose.yaml stop && docker compose -f compose.yaml rm -f
@@ -256,6 +292,7 @@ function main() {
validate_delete
echo "==== delete validated ===="
+ delete_data
stop_docker
echo y | docker system prune
diff --git a/MultimodalQnA/tests/test_compose_on_xeon.sh b/MultimodalQnA/tests/test_compose_on_xeon.sh
index 46042c600..4b49501ca 100644
--- a/MultimodalQnA/tests/test_compose_on_xeon.sh
+++ b/MultimodalQnA/tests/test_compose_on_xeon.sh
@@ -14,11 +14,15 @@ WORKPATH=$(dirname "$PWD")
LOG_PATH="$WORKPATH/tests"
ip_address=$(hostname -I | awk '{print $1}')
+export image_fn="apple.png"
export video_fn="WeAreGoingOnBullrun.mp4"
+export caption_fn="apple.txt"
function build_docker_images() {
cd $WORKPATH/docker_image_build
- git clone https://github.com/opea-project/GenAIComps.git && cd GenAIComps && git checkout "${opea_branch:-"main"}" && cd ../
+ # TODO: Revert to the clone of opea-project "main" after the merge of https://github.com/opea-project/GenAIComps/pull/852
+ # git clone https://github.com/opea-project/GenAIComps.git && cd GenAIComps && git checkout "${opea_branch:-"main"}" && cd ../
+ git clone --single-branch --branch="melanie/mm-rag-enhanced" https://github.com/mhbuehler/GenAIComps.git
echo "Build all the images with --no-cache, check docker_image_build.log for details..."
service_list="multimodalqna multimodalqna-ui embedding-multimodal-bridgetower embedding-multimodal retriever-multimodal-redis lvm-llava lvm-llava-svc dataprep-multimodal-redis"
@@ -37,6 +41,7 @@ function setup_env() {
export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
+ export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
@@ -44,10 +49,11 @@ function setup_env() {
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+ export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
- export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
- export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+ export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+ export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
}
function start_services() {
@@ -61,12 +67,14 @@ function start_services() {
function prepare_data() {
cd $LOG_PATH
- echo "Downloading video"
+ echo "Downloading image and video"
+ wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
-
+ echo "Writing caption file"
+ echo "This is an apple." > ${caption_fn}
sleep 1m
-
}
+
function validate_service() {
local URL="$1"
local EXPECTED_RESULT="$2"
@@ -74,9 +82,15 @@ function validate_service() {
local DOCKER_NAME="$4"
local INPUT_DATA="$5"
- if [[ $SERVICE_NAME == *"dataprep-multimodal-redis"* ]]; then
+ if [[ $SERVICE_NAME == *"dataprep-multimodal-redis-transcript"* ]]; then
cd $LOG_PATH
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${video_fn}" -H 'Content-Type: multipart/form-data' "$URL")
+ elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-caption"* ]]; then
+ cd $LOG_PATH
+ HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -H 'Content-Type: multipart/form-data' "$URL")
+ elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-ingest"* ]]; then
+ cd $LOG_PATH
+        HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}" -H 'Content-Type: multipart/form-data' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_get"* ]]; then
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H 'Content-Type: application/json' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_del"* ]]; then
@@ -145,27 +159,34 @@ function validate_microservices() {
sleep 1m # retrieval can't curl as expected, try to wait for more time
# test data prep
- echo "Data Prep with Generating Transcript"
+ echo "Data Prep with Generating Transcript for Video"
validate_service \
"${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
- "dataprep-multimodal-redis" \
+ "dataprep-multimodal-redis-transcript" \
"dataprep-multimodal-redis"
- # echo "Data Prep with Generating Caption"
- # validate_service \
- # "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
- # "Data preparation succeeded" \
- # "dataprep-multimodal-redis" \
- # "dataprep-multimodal-redis"
+ echo "Data Prep with Image & Caption Ingestion"
+ validate_service \
+ "${DATAPREP_INGEST_SERVICE_ENDPOINT}" \
+ "Data preparation succeeded" \
+ "dataprep-multimodal-redis-ingest" \
+ "dataprep-multimodal-redis"
- echo "Validating get file"
+ echo "Validating get file returns mp4"
validate_service \
- "${DATAPREP_GET_VIDEO_ENDPOINT}" \
+ "${DATAPREP_GET_FILE_ENDPOINT}" \
'.mp4' \
"dataprep_get" \
"dataprep-multimodal-redis"
+ echo "Validating get file returns png"
+ validate_service \
+ "${DATAPREP_GET_FILE_ENDPOINT}" \
+ '.png' \
+ "dataprep_get" \
+ "dataprep-multimodal-redis"
+
sleep 1m
# multimodal retrieval microservice
@@ -178,7 +199,24 @@ function validate_microservices() {
"retriever-multimodal-redis" \
"{\"text\":\"test\",\"embedding\":${your_embedding}}"
- sleep 10s
+ echo "Wait for lvm-llava service to be ready"
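+    # Poll the container logs until Uvicorn reports that it is serving, waiting up to max_retries * 30s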
+ max_retries=10
+ for i in $(seq $max_retries)
+ do
+ lvm_logs=$(docker logs lvm-llava 2>&1 | grep "Uvicorn running on http://0.0.0.0")
+ if [[ "$lvm_logs" != *"Uvicorn running on http://0.0.0.0"* ]]; then
+ echo "The lvm-llava service is not ready yet, sleeping 30s..."
+ sleep 30s
+ else
+ echo "lvm-llava service is ready"
+ break
+ fi
+ done
+
+    if [[ $i -ge $max_retries ]]; then
+ echo "WARNING: Max retries reached when waiting for the lvm-llava service to be ready"
+ docker logs lvm-llava >> ${LOG_PATH}/lvm_llava_file.log
+ fi
# llava server
echo "Evaluating lvm-llava"
@@ -198,6 +236,14 @@ function validate_microservices() {
"lvm-llava-svc" \
'{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
+ # data prep requiring lvm
+ echo "Data Prep with Generating Caption for Image"
+ validate_service \
+ "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
+ "Data preparation succeeded" \
+ "dataprep-multimodal-redis-caption" \
+ "dataprep-multimodal-redis"
+
sleep 3m
}
@@ -222,14 +268,22 @@ function validate_megaservice() {
}
function validate_delete {
- echo "Validate data prep delete videos"
+ echo "Validate data prep delete files"
validate_service \
- "${DATAPREP_DELETE_VIDEO_ENDPOINT}" \
+ "${DATAPREP_DELETE_FILE_ENDPOINT}" \
'{"status":true}' \
"dataprep_del" \
"dataprep-multimodal-redis"
}
+function delete_data() {
+ cd $LOG_PATH
+ echo "Deleting image, video, and caption"
+ rm -rf ${image_fn}
+ rm -rf ${video_fn}
+ rm -rf ${caption_fn}
+}
+
function stop_docker() {
cd $WORKPATH/docker_compose/intel/cpu/xeon
docker compose -f compose.yaml stop && docker compose -f compose.yaml rm -f
@@ -254,6 +308,7 @@ function main() {
validate_delete
echo "==== delete validated ===="
+ delete_data
stop_docker
echo y | docker system prune
diff --git a/MultimodalQnA/ui/gradio/conversation.py b/MultimodalQnA/ui/gradio/conversation.py
index 9f1a2827b..3057e9879 100644
--- a/MultimodalQnA/ui/gradio/conversation.py
+++ b/MultimodalQnA/ui/gradio/conversation.py
@@ -30,6 +30,7 @@ class Conversation:
base64_frame: str = None
skip_next: bool = False
split_video: str = None
+ image: str = None
def _template_caption(self):
out = ""
@@ -59,6 +60,8 @@ def get_prompt(self):
else:
base64_frame = get_b64_frame_from_timestamp(self.video_file, self.time_of_frame_ms)
self.base64_frame = base64_frame
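+            # Fall back to an empty string so the image_url payload is never None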
+ if base64_frame is None:
+ base64_frame = ""
content.append({"type": "image_url", "image_url": {"url": base64_frame}})
else:
content = message
@@ -137,6 +140,7 @@ def dict(self):
"caption": self.caption,
"base64_frame": self.base64_frame,
"split_video": self.split_video,
+ "image": self.image,
}
@@ -152,4 +156,5 @@ def dict(self):
time_of_frame_ms=None,
base64_frame=None,
split_video=None,
+ image=None,
)
diff --git a/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py b/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py
index 3eba01a71..ec6a033ca 100644
--- a/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py
+++ b/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py
@@ -13,7 +13,7 @@
from conversation import multimodalqna_conv
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
-from utils import build_logger, moderation_msg, server_error_msg, split_video
+from utils import build_logger, make_temp_image, moderation_msg, server_error_msg, split_video
logger = build_logger("gradio_web_server", "gradio_web_server.log")
@@ -47,22 +47,24 @@ def clear_history(state, request: gr.Request):
logger.info(f"clear_history. ip: {request.client.host}")
if state.split_video and os.path.exists(state.split_video):
os.remove(state.split_video)
+ if state.image and os.path.exists(state.image):
+ os.remove(state.image)
state = multimodalqna_conv.copy()
- return (state, state.to_gradio_chatbot(), "", None) + (disable_btn,) * 1
+ return (state, state.to_gradio_chatbot(), None, None, None) + (disable_btn,) * 1
def add_text(state, text, request: gr.Request):
logger.info(f"add_text. ip: {request.client.host}. len: {len(text)}")
if len(text) <= 0:
state.skip_next = True
- return (state, state.to_gradio_chatbot(), "", None) + (no_change_btn,) * 1
+ return (state, state.to_gradio_chatbot(), None) + (no_change_btn,) * 1
text = text[:2000] # Hard cut-off
state.append_message(state.roles[0], text)
state.append_message(state.roles[1], None)
state.skip_next = False
- return (state, state.to_gradio_chatbot(), "") + (disable_btn,) * 1
+ return (state, state.to_gradio_chatbot(), None) + (disable_btn,) * 1
def http_bot(state, request: gr.Request):
@@ -73,7 +75,7 @@ def http_bot(state, request: gr.Request):
if state.skip_next:
# This generate call is skipped due to invalid inputs
path_to_sub_videos = state.get_path_to_subvideos()
- yield (state, state.to_gradio_chatbot(), path_to_sub_videos) + (no_change_btn,) * 1
+ yield (state, state.to_gradio_chatbot(), path_to_sub_videos, None) + (no_change_btn,) * 1
return
if len(state.messages) == state.offset + 2:
@@ -97,7 +99,7 @@ def http_bot(state, request: gr.Request):
logger.info(f"==== url request ====\n{gateway_addr}")
state.messages[-1][-1] = "▌"
- yield (state, state.to_gradio_chatbot(), state.split_video) + (disable_btn,) * 1
+ yield (state, state.to_gradio_chatbot(), state.split_video, state.image) + (disable_btn,) * 1
try:
response = requests.post(
@@ -108,6 +110,7 @@ def http_bot(state, request: gr.Request):
)
print(response.status_code)
print(response.json())
+
if response.status_code == 200:
response = response.json()
choice = response["choices"][-1]
@@ -123,44 +126,61 @@ def http_bot(state, request: gr.Request):
video_file = metadata["source_video"]
state.video_file = os.path.join(static_dir, metadata["source_video"])
state.time_of_frame_ms = metadata["time_of_frame_ms"]
- try:
- splited_video_path = split_video(
- state.video_file, state.time_of_frame_ms, tmp_dir, f"{state.time_of_frame_ms}__{video_file}"
- )
- except:
- print(f"video {state.video_file} does not exist in UI host!")
- splited_video_path = None
- state.split_video = splited_video_path
+ file_ext = os.path.splitext(state.video_file)[-1]
+ if file_ext == ".mp4":
+ try:
+ splited_video_path = split_video(
+ state.video_file, state.time_of_frame_ms, tmp_dir, f"{state.time_of_frame_ms}__{video_file}"
+ )
+                except Exception:
+ print(f"video {state.video_file} does not exist in UI host!")
+ splited_video_path = None
+ state.split_video = splited_video_path
+ elif file_ext in [".jpg", ".jpeg", ".png", ".gif"]:
+ try:
+ output_image_path = make_temp_image(state.video_file, file_ext)
+                except Exception:
+ print(f"image {state.video_file} does not exist in UI host!")
+ output_image_path = None
+ state.image = output_image_path
+
else:
raise requests.exceptions.RequestException
except requests.exceptions.RequestException as e:
state.messages[-1][-1] = server_error_msg
- yield (state, state.to_gradio_chatbot(), None) + (enable_btn,)
+ yield (state, state.to_gradio_chatbot(), None, None) + (enable_btn,)
return
state.messages[-1][-1] = message
- yield (state, state.to_gradio_chatbot(), state.split_video) + (enable_btn,) * 1
+ yield (
+ state,
+ state.to_gradio_chatbot(),
+ gr.Video(state.split_video, visible=state.split_video is not None),
+ gr.Image(state.image, visible=state.image is not None),
+ ) + (enable_btn,) * 1
logger.info(f"{state.messages[-1][-1]}")
return
-def ingest_video_gen_transcript(filepath, request: gr.Request):
- yield (gr.Textbox(visible=True, value="Please wait for ingesting your uploaded video into database..."))
+def ingest_gen_transcript(filepath, filetype, request: gr.Request):
+ yield (
+ gr.Textbox(visible=True, value=f"Please wait while your uploaded {filetype} is ingested into the database...")
+ )
verified_filepath = os.path.normpath(filepath)
if not verified_filepath.startswith(tmp_upload_folder):
- print("Found malicious video file name!")
+ print(f"Found malicious {filetype} file name!")
yield (
gr.Textbox(
visible=True,
- value="Your uploaded video's file name has special characters that are not allowed. Please consider update the video file name!",
+            value=f"Your uploaded {filetype}'s file name has special characters that are not allowed (depending on the OS, some examples are \\, /, :, and *). Please consider changing the file name.",
)
)
return
basename = os.path.basename(verified_filepath)
dest = os.path.join(static_dir, basename)
shutil.copy(verified_filepath, dest)
- print("Done copy uploaded file to static folder!")
+ print("Done copying uploaded file to static folder.")
headers = {
# 'Content-Type': 'multipart/form-data'
}
@@ -172,17 +192,17 @@ def ingest_video_gen_transcript(filepath, request: gr.Request):
if response.status_code == 200:
response = response.json()
print(response)
- yield (gr.Textbox(visible=True, value="Video ingestion is done. Saving your uploaded video..."))
+ yield (gr.Textbox(visible=True, value=f"The {filetype} ingestion is done. Saving your uploaded {filetype}..."))
time.sleep(2)
fn_no_ext = Path(dest).stem
- if "video_id_maps" in response and fn_no_ext in response["video_id_maps"]:
- new_dst = os.path.join(static_dir, response["video_id_maps"][fn_no_ext])
- print(response["video_id_maps"][fn_no_ext])
+ if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
+ new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
+ print(response["file_id_maps"][fn_no_ext])
os.rename(dest, new_dst)
yield (
gr.Textbox(
visible=True,
- value="Congratulation! Your upload is done!\nClick the X button on the top right of the video upload box to upload another video.",
+ value=f"Congratulations, your upload is done!\nClick the X button on the top right of the {filetype} upload box to upload another {filetype}.",
)
)
return
@@ -190,51 +210,53 @@ def ingest_video_gen_transcript(filepath, request: gr.Request):
yield (
gr.Textbox(
visible=True,
- value="Something wrong!\nPlease click the X button on the top right of the video upload boxreupload your video!",
+            value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the {filetype} upload box to reupload your {filetype}.",
)
)
time.sleep(2)
return
-def ingest_video_gen_caption(filepath, request: gr.Request):
- yield (gr.Textbox(visible=True, value="Please wait for ingesting your uploaded video into database..."))
+def ingest_gen_caption(filepath, filetype, request: gr.Request):
+ yield (
+ gr.Textbox(visible=True, value=f"Please wait while your uploaded {filetype} is ingested into the database...")
+ )
verified_filepath = os.path.normpath(filepath)
if not verified_filepath.startswith(tmp_upload_folder):
- print("Found malicious video file name!")
+ print(f"Found malicious {filetype} file name!")
yield (
gr.Textbox(
visible=True,
- value="Your uploaded video's file name has special characters that are not allowed. Please consider update the video file name!",
+            value=f"Your uploaded {filetype}'s file name has special characters that are not allowed (depending on the OS, some examples are \\, /, :, and *). Please consider changing the file name.",
)
)
return
basename = os.path.basename(verified_filepath)
dest = os.path.join(static_dir, basename)
shutil.copy(verified_filepath, dest)
- print("Done copy uploaded file to static folder!")
+ print("Done copying uploaded file to static folder.")
headers = {
# 'Content-Type': 'multipart/form-data'
}
files = {
"files": open(dest, "rb"),
}
- response = requests.post(dataprep_gen_captiono_addr, headers=headers, files=files)
+ response = requests.post(dataprep_gen_caption_addr, headers=headers, files=files)
print(response.status_code)
if response.status_code == 200:
response = response.json()
print(response)
- yield (gr.Textbox(visible=True, value="Video ingestion is done. Saving your uploaded video..."))
+ yield (gr.Textbox(visible=True, value=f"The {filetype} ingestion is done. Saving your uploaded {filetype}..."))
time.sleep(2)
fn_no_ext = Path(dest).stem
- if "video_id_maps" in response and fn_no_ext in response["video_id_maps"]:
- new_dst = os.path.join(static_dir, response["video_id_maps"][fn_no_ext])
- print(response["video_id_maps"][fn_no_ext])
+ if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
+ new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
+ print(response["file_id_maps"][fn_no_ext])
os.rename(dest, new_dst)
yield (
gr.Textbox(
visible=True,
- value="Congratulation! Your upload is done!\nClick the X button on the top right of the video upload box to upload another video.",
+ value=f"Congratulations, your upload is done!\nClick the X button on the top right of the {filetype} upload box to upload another {filetype}.",
)
)
return
@@ -242,48 +264,181 @@ def ingest_video_gen_caption(filepath, request: gr.Request):
yield (
gr.Textbox(
visible=True,
- value="Something wrong!\nPlease click the X button on the top right of the video upload boxreupload your video!",
+            value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the {filetype} upload box to reupload your {filetype}.",
)
)
time.sleep(2)
return
-def clear_uploaded_video(request: gr.Request):
+def ingest_with_text(filepath, text, request: gr.Request):
+    yield (gr.Textbox(visible=True, value="Please wait while your uploaded image is ingested into the database..."))
+ verified_filepath = os.path.normpath(filepath)
+ if not verified_filepath.startswith(tmp_upload_folder):
+ print("Found malicious image file name!")
+ yield (
+ gr.Textbox(
+ visible=True,
+                value="Your uploaded image's file name has special characters that are not allowed (depending on the OS, some examples are \\, /, :, and *). Please consider changing the file name.",
+ )
+ )
+ return
+ basename = os.path.basename(verified_filepath)
+ dest = os.path.join(static_dir, basename)
+ shutil.copy(verified_filepath, dest)
+ text_basename = "{}.txt".format(os.path.splitext(basename)[0])
+ text_dest = os.path.join(static_dir, text_basename)
+ with open(text_dest, "w") as file:
+ file.write(text)
+    print("Done copying uploaded files to static folder.")
+ headers = {
+ # 'Content-Type': 'multipart/form-data'
+ }
+ files = [("files", (basename, open(dest, "rb"))), ("files", (text_basename, open(text_dest, "rb")))]
+ try:
+ response = requests.post(dataprep_ingest_addr, headers=headers, files=files)
+ finally:
+ os.remove(text_dest)
+ print(response.status_code)
+ if response.status_code == 200:
+ response = response.json()
+ print(response)
+ yield (gr.Textbox(visible=True, value="Image ingestion is done. Saving your uploaded image..."))
+ time.sleep(2)
+ fn_no_ext = Path(dest).stem
+ if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
+ new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
+ print(response["file_id_maps"][fn_no_ext])
+ os.rename(dest, new_dst)
+ yield (
+ gr.Textbox(
+ visible=True,
+                value="Congratulations, your upload is done!\nClick the X button on the top right of the image upload box to upload another image.",
+ )
+ )
+ return
+ else:
+ yield (
+ gr.Textbox(
+ visible=True,
+ value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the image upload box to reupload your image!",
+ )
+ )
+ time.sleep(2)
+ return
+
+
+def hide_text(request: gr.Request):
return gr.Textbox(visible=False)
-with gr.Blocks() as upload_gen_trans:
- gr.Markdown("# Ingest Your Own Video - Utilizing Generated Transcripts")
- gr.Markdown(
- "Please use this interface to ingest your own video if the video has meaningful audio (e.g., announcements, discussions, etc...)"
- )
+def clear_text(request: gr.Request):
+ return None
+
+
+with gr.Blocks() as upload_video:
+ gr.Markdown("# Ingest Your Own Video Using Generated Transcripts or Captions")
+ gr.Markdown("Use this interface to ingest your own video and generate transcripts or captions for it")
+
+ def select_upload_type(choice, request: gr.Request):
+ if choice == "transcript":
+ return gr.Video(sources="upload", visible=True), gr.Video(sources="upload", visible=False)
+ else:
+ return gr.Video(sources="upload", visible=False), gr.Video(sources="upload", visible=True)
+
with gr.Row():
with gr.Column(scale=6):
- video_upload = gr.Video(sources="upload", height=512, width=512, elem_id="video_upload")
+ video_upload_trans = gr.Video(sources="upload", elem_id="video_upload_trans", visible=True)
+ video_upload_cap = gr.Video(sources="upload", elem_id="video_upload_cap", visible=False)
with gr.Column(scale=3):
+ text_options_radio = gr.Radio(
+ [
+ ("Generate transcript (video contains voice)", "transcript"),
+ ("Generate captions (video does not contain voice)", "caption"),
+ ],
+ label="Text Options",
+ info="How should text be ingested?",
+ value="transcript",
+ )
text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
- video_upload.upload(ingest_video_gen_transcript, [video_upload], [text_upload_result])
- video_upload.clear(clear_uploaded_video, [], [text_upload_result])
+ video_upload_trans.upload(
+ ingest_gen_transcript, [video_upload_trans, gr.Textbox(value="video", visible=False)], [text_upload_result]
+ )
+ video_upload_trans.clear(hide_text, [], [text_upload_result])
+ video_upload_cap.upload(
+ ingest_gen_caption, [video_upload_cap, gr.Textbox(value="video", visible=False)], [text_upload_result]
+ )
+ video_upload_cap.clear(hide_text, [], [text_upload_result])
+ text_options_radio.change(select_upload_type, [text_options_radio], [video_upload_trans, video_upload_cap])
-with gr.Blocks() as upload_gen_captions:
- gr.Markdown("# Ingest Your Own Video - Utilizing Generated Captions")
- gr.Markdown(
- "Please use this interface to ingest your own video if the video has meaningless audio (e.g., background musics, etc...)"
- )
+with gr.Blocks() as upload_image:
+ gr.Markdown("# Ingest Your Own Image Using Generated or Custom Captions/Labels")
+    gr.Markdown("Use this interface to ingest your own image and generate a caption for it, or provide your own custom caption or label")
+
+ def select_upload_type(choice, request: gr.Request):
+ if choice == "gen_caption":
+ return gr.Image(sources="upload", visible=True), gr.Image(sources="upload", visible=False)
+ else:
+ return gr.Image(sources="upload", visible=False), gr.Image(sources="upload", visible=True)
+
+ with gr.Row():
+ with gr.Column(scale=6):
+ image_upload_cap = gr.Image(type="filepath", sources="upload", elem_id="image_upload_cap", visible=True)
+            image_upload_text = gr.Image(type="filepath", sources="upload", elem_id="image_upload_text", visible=False)
+ with gr.Column(scale=3):
+ text_options_radio = gr.Radio(
+ [("Generate caption", "gen_caption"), ("Custom caption or label", "custom_caption")],
+ label="Text Options",
+ info="How should text be ingested?",
+ value="gen_caption",
+ )
+ custom_caption = gr.Textbox(visible=True, interactive=True, label="Custom Caption or Label")
+ text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
+ image_upload_cap.upload(
+ ingest_gen_caption, [image_upload_cap, gr.Textbox(value="image", visible=False)], [text_upload_result]
+ )
+ image_upload_cap.clear(hide_text, [], [text_upload_result])
+ image_upload_text.upload(ingest_with_text, [image_upload_text, custom_caption], [text_upload_result]).then(
+ clear_text, [], [custom_caption]
+ )
+ image_upload_text.clear(hide_text, [], [text_upload_result])
+ text_options_radio.change(select_upload_type, [text_options_radio], [image_upload_cap, image_upload_text])
+
+with gr.Blocks() as upload_audio:
+ gr.Markdown("# Ingest Your Own Audio Using Generated Transcripts")
+ gr.Markdown("Use this interface to ingest your own audio file and generate a transcript for it")
with gr.Row():
with gr.Column(scale=6):
- video_upload_cap = gr.Video(sources="upload", height=512, width=512, elem_id="video_upload_cap")
+ audio_upload = gr.Audio(type="filepath")
+ with gr.Column(scale=3):
+ text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
+ audio_upload.upload(
+ ingest_gen_transcript, [audio_upload, gr.Textbox(value="audio", visible=False)], [text_upload_result]
+ )
+ audio_upload.stop_recording(
+ ingest_gen_transcript, [audio_upload, gr.Textbox(value="audio", visible=False)], [text_upload_result]
+ )
+ audio_upload.clear(hide_text, [], [text_upload_result])
+
+with gr.Blocks() as upload_pdf:
+ gr.Markdown("# Ingest Your Own PDF")
+ gr.Markdown("Use this interface to ingest your own PDF file with text, tables, images, and graphs")
+ with gr.Row():
+ with gr.Column(scale=6):
+            pdf_upload = gr.File()
with gr.Column(scale=3):
text_upload_result_cap = gr.Textbox(visible=False, interactive=False, label="Upload Status")
- video_upload_cap.upload(ingest_video_gen_transcript, [video_upload_cap], [text_upload_result_cap])
- video_upload_cap.clear(clear_uploaded_video, [], [text_upload_result_cap])
+    pdf_upload.upload(
+        ingest_gen_caption, [pdf_upload, gr.Textbox(value="PDF", visible=False)], [text_upload_result_cap]
+    )
+    pdf_upload.clear(hide_text, [], [text_upload_result_cap])
with gr.Blocks() as qna:
state = gr.State(multimodalqna_conv.copy())
with gr.Row():
with gr.Column(scale=4):
- video = gr.Video(height=512, width=512, elem_id="video")
+ video = gr.Video(height=512, width=512, elem_id="video", visible=True, label="Media")
+ image = gr.Image(height=512, width=512, elem_id="image", visible=False, label="Media")
with gr.Column(scale=7):
chatbot = gr.Chatbot(elem_id="chatbot", label="MultimodalQnA Chatbot", height=390)
with gr.Row():
@@ -293,7 +448,8 @@ def clear_uploaded_video(request: gr.Request):
# show_label=False,
# container=False,
label="Query",
- info="Enter your query here!",
+ info="Enter a text query below",
+ # submit_btn=False,
)
with gr.Column(scale=1, min_width=100):
with gr.Row():
@@ -306,7 +462,7 @@ def clear_uploaded_video(request: gr.Request):
[
state,
],
- [state, chatbot, textbox, video, clear_btn],
+ [state, chatbot, textbox, video, image, clear_btn],
)
submit_btn.click(
@@ -318,17 +474,19 @@ def clear_uploaded_video(request: gr.Request):
[
state,
],
- [state, chatbot, video, clear_btn],
+ [state, chatbot, video, image, clear_btn],
)
with gr.Blocks(css=css) as demo:
gr.Markdown("# MultimodalQnA")
with gr.Tabs():
- with gr.TabItem("MultimodalQnA With Your Videos"):
+ with gr.TabItem("MultimodalQnA"):
qna.render()
- with gr.TabItem("Upload Your Own Videos"):
- upload_gen_trans.render()
- with gr.TabItem("Upload Your Own Videos"):
- upload_gen_captions.render()
+ with gr.TabItem("Upload Video"):
+ upload_video.render()
+ with gr.TabItem("Upload Image"):
+ upload_image.render()
+ with gr.TabItem("Upload Audio"):
+ upload_audio.render()
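+        # Assumed intended: render the PDF upload interface defined above, which is otherwise unreachable
+        with gr.TabItem("Upload PDF"):
+            upload_pdf.render()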
demo.queue()
app = gr.mount_gradio_app(app, demo, path="/")
@@ -343,6 +501,9 @@ def clear_uploaded_video(request: gr.Request):
parser.add_argument("--share", action="store_true")
backend_service_endpoint = os.getenv("BACKEND_SERVICE_ENDPOINT", "http://localhost:8888/v1/multimodalqna")
+ dataprep_ingest_endpoint = os.getenv(
+ "DATAPREP_INGEST_SERVICE_ENDPOINT", "http://localhost:6007/v1/ingest_with_text"
+ )
dataprep_gen_transcript_endpoint = os.getenv(
"DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT", "http://localhost:6007/v1/generate_transcripts"
)
@@ -353,9 +514,11 @@ def clear_uploaded_video(request: gr.Request):
logger.info(f"args: {args}")
global gateway_addr
gateway_addr = backend_service_endpoint
+ global dataprep_ingest_addr
+ dataprep_ingest_addr = dataprep_ingest_endpoint
global dataprep_gen_transcript_addr
dataprep_gen_transcript_addr = dataprep_gen_transcript_endpoint
- global dataprep_gen_captiono_addr
- dataprep_gen_captiono_addr = dataprep_gen_caption_endpoint
+ global dataprep_gen_caption_addr
+ dataprep_gen_caption_addr = dataprep_gen_caption_endpoint
uvicorn.run(app, host=args.host, port=args.port)
diff --git a/MultimodalQnA/ui/gradio/utils.py b/MultimodalQnA/ui/gradio/utils.py
index f6e1027eb..7a730a7ed 100644
--- a/MultimodalQnA/ui/gradio/utils.py
+++ b/MultimodalQnA/ui/gradio/utils.py
@@ -5,6 +5,7 @@
import logging
import logging.handlers
import os
+import shutil
import sys
from pathlib import Path
@@ -118,6 +119,18 @@ def maintain_aspect_ratio_resize(image, width=None, height=None, inter=cv2.INTER
return cv2.resize(image, dim, interpolation=inter)
+def make_temp_image(
+ image_name,
+ file_ext,
+ output_image_path: str = "./public/images",
+ output_image_name: str = "image_tmp",
+):
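+    """Copy an image into the public images folder under a temporary name and return the new path."""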
+ Path(output_image_path).mkdir(parents=True, exist_ok=True)
+    output_image = os.path.join(output_image_path, "{}{}".format(output_image_name, file_ext))
+ shutil.copy(image_name, output_image)
+ return output_image
+
+
# function to split video at a timestamp
def split_video(
video_path,
diff --git a/README.md b/README.md
index 87581d3dd..a34166387 100644
--- a/README.md
+++ b/README.md
@@ -37,18 +37,19 @@ Deployment are based on released docker images by default, check [docker image l
#### Deploy Examples
-| Use Case | Docker Compose<br>Deployment on Xeon | Docker Compose<br>Deployment on Gaudi | Kubernetes with Manifests | Kubernetes with Helm Charts | Kubernetes with GMC |
-| ----------------- | ------------------------------------------------------------------------------ | -------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------ |
-| ChatQnA | [Xeon Instructions](ChatQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](ChatQnA/docker_compose/intel/hpu/gaudi/README.md) | [ChatQnA with Manifests](ChatQnA/kubernetes/intel/README.md) | [ChatQnA with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna/README.md) | [ChatQnA with GMC](ChatQnA/kubernetes/intel/README_gmc.md) |
-| CodeGen | [Xeon Instructions](CodeGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeGen/docker_compose/intel/hpu/gaudi/README.md) | [CodeGen with Manifests](CodeGen/kubernetes/intel/README.md) | [CodeGen with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codegen/README.md) | [CodeGen with GMC](CodeGen/kubernetes/intel/README_gmc.md) |
-| CodeTrans | [Xeon Instructions](CodeTrans/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeTrans/docker_compose/intel/hpu/gaudi/README.md) | [CodeTrans with Manifests](CodeTrans/kubernetes/intel/README.md) | [CodeTrans with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codetrans/README.md) | [CodeTrans with GMC](CodeTrans/kubernetes/intel/README_gmc.md) |
-| DocSum | [Xeon Instructions](DocSum/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](DocSum/docker_compose/intel/hpu/gaudi/README.md) | [DocSum with Manifests](DocSum/kubernetes/intel/README.md) | [DocSum with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/docsum/README.md) | [DocSum with GMC](DocSum/kubernetes/intel/README_gmc.md) |
-| SearchQnA | [Xeon Instructions](SearchQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](SearchQnA/docker_compose/intel/hpu/gaudi/README.md) | Not Supported | Not Supported | [SearchQnA with GMC](SearchQnA/kubernetes/intel/README_gmc.md) |
-| FaqGen | [Xeon Instructions](FaqGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](FaqGen/docker_compose/intel/hpu/gaudi/README.md) | [FaqGen with Manifests](FaqGen/kubernetes/intel/README.md) | Not Supported | [FaqGen with GMC](FaqGen/kubernetes/intel/README_gmc.md) |
-| Translation | [Xeon Instructions](Translation/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](Translation/docker_compose/intel/hpu/gaudi/README.md) | [Translation with Manifests](Translation/kubernetes/intel/README.md) | Not Supported | [Translation with GMC](Translation/kubernetes/intel/README_gmc.md) |
-| AudioQnA | [Xeon Instructions](AudioQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](AudioQnA/docker_compose/intel/hpu/gaudi/README.md) | [AudioQnA with Manifests](AudioQnA/kubernetes/intel/README.md) | Not Supported | [AudioQnA with GMC](AudioQnA/kubernetes/intel/README_gmc.md) |
-| VisualQnA | [Xeon Instructions](VisualQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](VisualQnA/docker_compose/intel/hpu/gaudi/README.md) | [VisualQnA with Manifests](VisualQnA/kubernetes/intel/README.md) | Not Supported | [VisualQnA with GMC](VisualQnA/kubernetes/intel/README_gmc.md) |
-| ProductivitySuite | [Xeon Instructions](ProductivitySuite/docker_compose/intel/cpu/xeon/README.md) | Not Supported | [ProductivitySuite with Manifests](ProductivitySuite/kubernetes/intel/README.md) | Not Supported | Not Supported |
+| Use Case | Docker Compose<br/>Deployment on Xeon | Docker Compose<br/>Deployment on Gaudi | Kubernetes with Manifests | Kubernetes with Helm Charts | Kubernetes with GMC |
+| ----------------- | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------ |
+| ChatQnA | [Xeon Instructions](ChatQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](ChatQnA/docker_compose/intel/hpu/gaudi/README.md) | [ChatQnA with Manifests](ChatQnA/kubernetes/intel/README.md) | [ChatQnA with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna/README.md) | [ChatQnA with GMC](ChatQnA/kubernetes/intel/README_gmc.md) |
+| CodeGen | [Xeon Instructions](CodeGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeGen/docker_compose/intel/hpu/gaudi/README.md) | [CodeGen with Manifests](CodeGen/kubernetes/intel/README.md) | [CodeGen with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codegen/README.md) | [CodeGen with GMC](CodeGen/kubernetes/intel/README_gmc.md) |
+| CodeTrans | [Xeon Instructions](CodeTrans/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeTrans/docker_compose/intel/hpu/gaudi/README.md) | [CodeTrans with Manifests](CodeTrans/kubernetes/intel/README.md) | [CodeTrans with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codetrans/README.md) | [CodeTrans with GMC](CodeTrans/kubernetes/intel/README_gmc.md) |
+| DocSum | [Xeon Instructions](DocSum/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](DocSum/docker_compose/intel/hpu/gaudi/README.md) | [DocSum with Manifests](DocSum/kubernetes/intel/README.md) | [DocSum with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/docsum/README.md) | [DocSum with GMC](DocSum/kubernetes/intel/README_gmc.md) |
+| SearchQnA | [Xeon Instructions](SearchQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](SearchQnA/docker_compose/intel/hpu/gaudi/README.md) | Not Supported | Not Supported | [SearchQnA with GMC](SearchQnA/kubernetes/intel/README_gmc.md) |
+| FaqGen | [Xeon Instructions](FaqGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](FaqGen/docker_compose/intel/hpu/gaudi/README.md) | [FaqGen with Manifests](FaqGen/kubernetes/intel/README.md) | Not Supported | [FaqGen with GMC](FaqGen/kubernetes/intel/README_gmc.md) |
+| Translation | [Xeon Instructions](Translation/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](Translation/docker_compose/intel/hpu/gaudi/README.md) | [Translation with Manifests](Translation/kubernetes/intel/README.md) | Not Supported | [Translation with GMC](Translation/kubernetes/intel/README_gmc.md) |
+| AudioQnA | [Xeon Instructions](AudioQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](AudioQnA/docker_compose/intel/hpu/gaudi/README.md) | [AudioQnA with Manifests](AudioQnA/kubernetes/intel/README.md) | Not Supported | [AudioQnA with GMC](AudioQnA/kubernetes/intel/README_gmc.md) |
+| VisualQnA | [Xeon Instructions](VisualQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](VisualQnA/docker_compose/intel/hpu/gaudi/README.md) | [VisualQnA with Manifests](VisualQnA/kubernetes/intel/README.md) | Not Supported | [VisualQnA with GMC](VisualQnA/kubernetes/intel/README_gmc.md) |
+| MultimodalQnA | [Xeon Instructions](MultimodalQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md) | Not Supported | Not Supported | Not Supported |
+| ProductivitySuite | [Xeon Instructions](ProductivitySuite/docker_compose/intel/cpu/xeon/README.md) | Not Supported | [ProductivitySuite with Manifests](ProductivitySuite/kubernetes/intel/README.md) | Not Supported | Not Supported |
## Supported Examples
diff --git a/docker_images_list.md b/docker_images_list.md
index d90ae08dc..056ee2fce 100644
--- a/docker_images_list.md
+++ b/docker_images_list.md
@@ -26,8 +26,8 @@ Take ChatQnA for example. ChatQnA is a chatbot application service based on the
| [opea/faqgen](https://hub.docker.com/r/opea/faqgen) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/FaqGen/Dockerfile) | The docker image served as a faqgen gateway and automatically generating comprehensive, natural sounding Frequently Asked Questions (FAQs) from documents, legal texts, customer inquiries and other sources. |
| [opea/faqgen-ui](https://hub.docker.com/r/opea/faqgen-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/FaqGen/ui/docker/Dockerfile) | The docker image serves as the docsum UI entry point for easy interaction with users, generating FAQs by pasting in question text. |
| [opea/faqgen-react-ui](https://hub.docker.com/r/opea/faqgen-react-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/FaqGen/ui/docker/Dockerfile.react) | The purpose of the docker image is to provide a user interface for Generate FAQs using React. It allows generating FAQs by uploading files or pasting text. |
-| [opea/multimodalqna](https://hub.docker.com/r/opea/multimodalqna) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/Dockerfile) | The docker image served as a multimodalqna gateway and dynamically fetches the most relevant multimodal information (frames, transcripts, and/or subtitles) from the user's video collection to solve the problem. |
-| [opea/multimodalqna-ui](https://hub.docker.com/r/opea/multimodalqna-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/ui/docker/Dockerfile) | The docker image serves as the docsum UI entry point for easy interaction with users. Answers to questions are generated from videos uploaded by users.. |
+| [opea/multimodalqna](https://hub.docker.com/r/opea/multimodalqna) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/Dockerfile) | The docker image served as a multimodalqna gateway and dynamically fetches the most relevant multimodal information (frames, transcripts, and/or subtitles) from the user's video, image, or audio collection to solve the problem. |
+| [opea/multimodalqna-ui](https://hub.docker.com/r/opea/multimodalqna-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/ui/docker/Dockerfile) | The docker image serves as the multimodalqna UI entry point for easy interaction with users. Answers to questions are generated from the videos, images, and audio files uploaded by users. |
| [opea/productivity-suite-react-ui-server](https://hub.docker.com/r/opea/productivity-suite-react-ui-server) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/ProductivitySuite/ui/docker/Dockerfile.react) | The purpose of the docker image is to provide a user interface for Productivity Suite Application using React. It allows interaction by uploading documents and inputs. |
| [opea/searchqna](https://hub.docker.com/r/opea/searchqna/tags) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/SearchQnA/Dockerfile) | The docker image served as the searchqna gateway to provide service of retrieving accurate and relevant answers to user queries from a knowledge base or dataset |
| [opea/searchqna-ui](https://hub.docker.com/r/opea/searchqna-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/SearchQnA/ui/docker/Dockerfile) | The docker image acted as the searchqna UI entry for facilitating interaction with users for question answering |
diff --git a/supported_examples.md b/supported_examples.md
index 33b02f71d..0754be3ee 100644
--- a/supported_examples.md
+++ b/supported_examples.md
@@ -186,7 +186,15 @@ FAQ Generation Application leverages the power of large language models (LLMs) t
### MultimodalQnA
-[MultimodalQnA](./MultimodalQnA/README.md) addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos.
+[MultimodalQnA](./MultimodalQnA/README.md) addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, or audio files. MultimodalQnA utilizes the BridgeTower model, a multimodal encoding transformer that merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as text, and those embeddings are then stored in a vector database. When answering a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
+
+| Service | Model | HW | Description |
+| --------- | ----------------------------------------------------------------------------------------------------------------- | ---------- | ----------------------------- |
+| Embedding | [BridgeTower/bridgetower-large-itm-mlm-itc](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-itc) | Xeon/Gaudi | Multimodal embeddings service |
+| Embedding | [BridgeTower/bridgetower-large-itm-mlm-gaudi](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi) | Gaudi | Multimodal embeddings service |
+| LVM | [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | Xeon | LVM service |
+| LVM | [llava-hf/llava-1.5-13b-hf](https://huggingface.co/llava-hf/llava-1.5-13b-hf) | Xeon | LVM service |
+| LVM | [llava-hf/llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) | Gaudi | LVM service |
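For readers unfamiliar with the embedding model listed above, here is a minimal `transformers` sketch of BridgeTower scoring an image/text pair. This is an illustration of the checkpoint only, not part of this patch; MultimodalQnA wraps the model behind its embedding microservice.

```python
import requests
from PIL import Image
from transformers import BridgeTowerForImageAndTextRetrieval, BridgeTowerProcessor

# The Xeon/Gaudi checkpoint from the table above.
model_id = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(model_id)
model = BridgeTowerForImageAndTextRetrieval.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# A higher "match" logit means the caption fits the image better.
for text in ["two cats sleeping on a couch", "a stop sign on a corner"]:
    logits = model(**processor(image, text, return_tensors="pt")).logits
    print(f"{text}: {logits[0, 1].item():.2f}")
```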
### ProductivitySuite