[Feature] Multimodal agents demo #320

chenllliang · 2023-10-19T15:49:35Z

add MultiModalPrompt class and an example

Description

This PR introduces a new class, MultiModalPrompt, for passing information between multimodal agents. The class encapsulates a text prompt together with additional multimodal data, so that both can be formatted and handed to a model as a single object.

The updated source files are camel/prompts/multimodal.py and camel/prompts/__init__.py.

An example is added in examples/multimodal/formating_example.py.

Key Features:

Multimodal Data Handling: The class can handle both text prompts (from the TextPrompt class) and multimodal information.
Flexible Modality Support: The class comes with a predefined list of modalities (MODALITIES), and it can validate the provided modalities against this list.
Dynamic Formatting: The format method allows the formatting of both text prompts and multimodal information in tandem. It can also distinguish between keyword arguments meant for the text prompt and those intended for multimodal information.
Customizable Model Input Conversion: With the to_model_format method, the prompt can be converted into a model-understandable format. By default, it uses the default_to_model_format method, but custom methods can also be provided.
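To make the features above concrete, here is a minimal sketch of how such a class could behave. It is not the code in camel/prompts/multimodal.py: the class name MultiModalPromptSketch, the contents of MODALITIES, the use of a plain str in place of TextPrompt, and the shape of the returned dict are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical modality list; the real MODALITIES in the PR may differ.
MODALITIES = ["IMAGE", "AUDIO", "VIDEO"]


def default_to_model_format(text: str,
                            multimodal_info: Dict[str, Any]) -> Dict[str, Any]:
    """Assumed default: bundle the text and the multimodal payload in a dict."""
    return {"prompt": text, "multimodal_info": multimodal_info}


class MultiModalPromptSketch:
    """Illustrative stand-in for MultiModalPrompt (plain str, not TextPrompt)."""

    def __init__(self, text_prompt: str,
                 multimodal_info: Dict[str, Any]) -> None:
        # Validate the provided modalities against the predefined list.
        unknown = [m for m in multimodal_info if m not in MODALITIES]
        if unknown:
            raise ValueError(f"Unsupported modalities: {unknown}")
        self.text_prompt = text_prompt
        self.multimodal_info = dict(multimodal_info)

    def format(self, **kwargs: Any) -> "MultiModalPromptSketch":
        # Keyword arguments named after a modality update the multimodal
        # info; all other keyword arguments fill the text template.
        mm_kwargs = {k: v for k, v in kwargs.items() if k in MODALITIES}
        text_kwargs = {k: v for k, v in kwargs.items() if k not in MODALITIES}
        return MultiModalPromptSketch(
            self.text_prompt.format(**text_kwargs),
            {**self.multimodal_info, **mm_kwargs},
        )

    def to_model_format(
        self,
        method: Optional[Callable[[str, Dict[str, Any]], Any]] = None,
    ) -> Any:
        # Fall back to the default conversion when no custom method is given.
        method = method or default_to_model_format
        return method(self.text_prompt, self.multimodal_info)


vqa = MultiModalPromptSketch("Question: {question} Answer:", {"IMAGE": None})
filled = vqa.format(question="What animal is in the picture?",
                    IMAGE="camel.jpg")
print(filled.to_model_format())
```

Returning a new object from format keeps the original template reusable, which is convenient when the same question template is applied to several images.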

Code Changes:

Added MultiModalPrompt class with methods for initializing, formatting, and converting to a model-understandable format.
Included a helper function, default_to_model_format, which serves as the default method to format multimodal prompts for models.

Example Description for Pull Request

MultiModalPrompt Example Demonstrations

The attached example, examples/multimodal/formating_example.py, demonstrates the capabilities and practical use cases of the newly added MultiModalPrompt class in several multimodal scenarios.

Single Image VQA (Visual Question-Answering) Prompt:
- We start by initializing a prompt for a basic VQA scenario that involves a single image.
- This demonstration uses the default model input format.
- Two different questions are formatted with two separate images: "camel.jpg" and "llama.jpg".
- Model-understandable formats for both prompts are printed out.
Multi-Image Question with Custom Model Input Format:
- A more complex scenario is demonstrated where we have multiple images corresponding to a single question.
- A custom input format, multi_image_input_format, is implemented which labels images in the prompt with numbers. This indexing format is inspired by MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning.
  - Special tokens <Image{i}> are introduced in the textual prompt to indicate image positions.
  - [Image{i}] acts as the visual placeholder for the i-th image in the prompt.
- The text prompt is dynamically updated based on the number of images provided in the multimodal information.
- Finally, the complete formatted prompt and the associated images are printed out.
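The multi-image case can be sketched roughly as follows. This is not the multi_image_input_format from the example file: the function name multi_image_input_format_sketch, its signature, and the exact layout of the `<Image{i}>` tokens and `[Image{i}]` placeholders are assumptions based only on the description above.

```python
from typing import Dict, List, Tuple


def multi_image_input_format_sketch(
        question: str, images: List[str]) -> Tuple[str, Dict[str, str]]:
    """Number each image and prepend its token/placeholder pair to the text.

    Returns the updated text prompt and a mapping from each visual
    placeholder to the image it stands for (assumed output shape).
    """
    header_parts = []
    placeholder_to_image = {}
    for i, image in enumerate(images):
        placeholder = f"[Image{i}]"  # visual placeholder for the i-th image
        header_parts.append(f"<Image{i}>: {placeholder}")  # token in the text
        placeholder_to_image[placeholder] = image
    header = " ".join(header_parts)
    # The text prompt grows with the number of images provided.
    prompt = f"{header}\n{question}" if header else question
    return prompt, placeholder_to_image


prompt, mapping = multi_image_input_format_sketch(
    "Which image shows a camel?", ["camel.jpg", "llama.jpg"])
print(prompt)
print(mapping)
```

A function of this shape could be passed as the custom method to to_model_format, provided the two agree on the arguments they exchange.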

This example serves as a practical guide to:

Show how the MultiModalPrompt can be seamlessly integrated with existing prompts.
Handle various complexities like multiple images.
Customize the input format for different multimodal models.

The described example not only showcases the ease of use and flexibility of the MultiModalPrompt class but also demonstrates its applicability across various real-world scenarios, emphasizing its potential utility for developers and researchers in the multimodal domain.

Future Work

I plan to extend the functionality with MultiModalPromptDict.
Once the prompt class is settled, I hope to add some demos with multiple LLM and VLLM agents.

Please review the changes and provide feedback.

Motivation and Context

Why is this change required? What problem does it solve?
close #317 Feature Request

I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds core functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (update in the documentation)
Example (update in the folder of example)

Implemented Tasks

Subtask 1
Subtask 2
Subtask 3

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

I have read the CONTRIBUTION guide. (required)
My change requires a change to the documentation.
I have updated the tests accordingly. (required for a bug fix or a new feature)
I have updated the documentation accordingly.

ZIYU-DEEP · 2023-10-23T05:19:26Z

The high-level class design looks good to me! Left one minor comment in the code.

chenllliang · 2023-10-23T14:28:54Z

> The high-level class design looks good to me! Left one minor comment in the code.

Hi, thanks for your review, but I can't see your comment in the code. Could you post it again?
@ZIYU-DEEP

chenllliang · 2023-10-24T12:33:26Z

TODO:

Write document for multimodal function @chenllliang
Give an example using the multimodal prompt class @ZIYU-DEEP

ZIYU-DEEP · 2023-10-25T09:32:55Z

> TODO:
>
> * Write document for multimodal function @chenllliang
> * Give an example using the multimodal prompt class @ZIYU-DEEP

Just to leave a note here: I was planning to combine this with the Hugging Face agent, but I encountered an SSLError with or without firewall restrictions (cf. #268, #17611). You may want to check whether you have the same issue, @chenllliang. As an alternative, I plan to make a minimal example with the interpreter.

chenllliang · 2023-10-31T15:37:17Z

I have updated the documentation of the multimodal prompt class. (I think it could be merged.)

chenllliang · 2023-11-14T16:07:27Z

Currently developing a multimodal role-playing demo.

chenllliang · 2024-02-04T16:08:21Z

I have designed a pipeline for a possible application of multimodal agents' collaboration. It is called "Scientific Graph Painter", and it generates Python code that draws a figure taken from a scientific paper.

It has 3 roles and possible models:

Draft: GPT4V, generates the draft code for drawing the graph
Critic: GPT4V, compares the draft with the target graph and gives revision suggestions
Polisher: GPT4, revises the code for drawing the graph following the suggestions

The pipeline graph is shown below. Sorry, I am too occupied with other things at the moment (Feb. 2024). Anyone interested in the topic is welcome to implement or discuss it.

I think something needs to be done first: adding image information to agents' messages. I have changed the PR title from "add multimodal prompt class" to "Multimodal agents demo".
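A rough sketch of how the three roles described above might be chained. Nothing here comes from the PR: the function paint_graph, the ModelFn callable type, the render helper, the revision loop with max_rounds, and the "OK" stop signal are hypothetical, and the model calls are left as plain callables so that any GPT4V or GPT4 client could be plugged in.

```python
from typing import Callable, List

# Hypothetical model interface: (text prompt, list of image paths) -> text.
ModelFn = Callable[[str, List[str]], str]


def paint_graph(target_image: str,
                drafter: ModelFn,
                critic: ModelFn,
                polisher: ModelFn,
                render: Callable[[str], str],
                max_rounds: int = 2) -> str:
    """Draft plotting code, then alternate critique and revision."""
    # Draft: a vision model writes the first version of the code.
    code = drafter("Write Python code that reproduces this figure.",
                   [target_image])
    for _ in range(max_rounds):
        # Render the current code to an image (hypothetical helper).
        draft_image = render(code)
        # Critic: a vision model compares the draft with the target.
        suggestions = critic(
            "Compare the draft (first image) with the target (second image) "
            "and suggest revisions.",
            [draft_image, target_image],
        )
        if suggestions.strip().upper() == "OK":  # assumed stop signal
            break
        # Polisher: a text-only model revises the code, so no images are sent.
        code = polisher(
            f"Revise the code following these suggestions.\n"
            f"Suggestions: {suggestions}\nCode:\n{code}",
            [],
        )
    return code


# Stub callables standing in for real models, for demonstration only.
final_code = paint_graph(
    "target.png",
    drafter=lambda text, images: "plot_v1",
    critic=lambda text, images: "OK" if "v2" in images[0] else "use a log scale",
    polisher=lambda text, images: "plot_v2",
    render=lambda code: f"{code}.png",
)
print(final_code)
```

The critic receives two images per call, which is why agents' messages would need to carry image information before a pipeline like this can run.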

chenllliang added 7 commits October 19, 2023 23:31

[Feature] Add MultiModalPrompt class

8136684

add MultiModalPrompt class and an example

Update multimodal.py

07410b8

Merge branch 'master' into master

f80de30

formating with isort

6432d88

Merge branch 'master' of https://github.com/chenllliang/camel

526148a

pep8 update

49be121

update pep8

bee5730

chenllliang requested a review from lightaime October 21, 2023 07:25

ZIYU-DEEP assigned ZIYU-DEEP and unassigned ZIYU-DEEP Oct 21, 2023

ZIYU-DEEP self-requested a review October 21, 2023 12:30

chenllliang self-assigned this Oct 21, 2023

ZIYU-DEEP self-assigned this Oct 24, 2023

ZIYU-DEEP added multi-modality New Feature labels Oct 24, 2023

add document for multimodalprompt class

71e2e62

chenllliang changed the title ~~[Feature] Add MultiModalPrompt class~~ [Feature] Multimodal agents demo Feb 4, 2024

zechengz mentioned this pull request Mar 11, 2024

[Roadmap] Multimodal Agent Roadmap #454

Closed

2 tasks
