Project for the Natural Language Processing course:
Text-driven Shape Generation with Denoising Diffusion Models
Python 3.8 is required.
Create and activate a virtual environment:
python -m venv env
source env/bin/activate
Install all the dependencies:
pip install -r requirements.txt
The architecture of this project is inspired by PVD, a method for unconditional point cloud generation. The following figure, taken from PVD, summarizes the diffusion process, which generates a point cloud from random noise.
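As a rough illustration of the forward half of the diffusion process, the sketch below progressively replaces a point cloud with Gaussian noise using the standard DDPM closed form. The linear schedule, the 1000 steps, the use of NumPy, and the names `make_alpha_bars` and `q_sample` are assumptions made for this example; they are not taken from this repository.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) and return it together with the noise that was added."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
x0 = rng.standard_normal((2048, 3))              # toy point cloud with 2048 points
x_early, _ = q_sample(x0, 10, alpha_bars, rng)   # still close to the clean shape
x_late, _ = q_sample(x0, 999, alpha_bars, rng)   # almost pure Gaussian noise
```

Generation runs this process in reverse: it starts from random noise and removes the noise estimated by the network one step at a time.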
The architecture of the trainable network, which learns to estimate the noise applied to its input, is shown in the figure below. The model receives as input the noisy point cloud, its corresponding time step t, and the text describing the point cloud. The text prompt is processed by the encoder of the T5 large language model, which computes a text embedding. This text embedding is provided as input to the PVConv layers, so that the noise prediction is conditioned on the text.
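The noise-estimation objective can be sketched as follows. This is a minimal example that assumes the usual DDPM mean squared error between the true and the predicted noise. `dummy_noise_predictor` is a hypothetical stand-in for the real network, and the tensor shapes and schedule values are illustrative assumptions, not values read from this repository.

```python
import numpy as np

def dummy_noise_predictor(x_t, t, text_embed):
    """Hypothetical stand-in for the trainable network: always predicts zero noise."""
    return np.zeros_like(x_t)

def noise_prediction_loss(model, x0, text_embed, alpha_bars, rng):
    """Mean squared error between true and predicted noise at a random time step."""
    t = int(rng.integers(0, len(alpha_bars)))
    noise = rng.standard_normal(x0.shape)
    # Noisy input given to the network, built from the clean shape and the noise.
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    predicted = model(x_t, t, text_embed)
    return float(np.mean((predicted - noise) ** 2))

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.standard_normal((2048, 3))            # toy point cloud
text_embed = rng.standard_normal((16, 512))    # assumed (tokens, channels) shape
loss = noise_prediction_loss(dummy_noise_predictor, x0, text_embed, alpha_bars, rng)
```

With the zero predictor the loss stays close to 1, the variance of the noise; a trained network should drive it well below that.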
To generate shapes directly from text, two text-conditioning schemes have been implemented and evaluated:
- Concatenation of text features with point cloud features
- Cross-attention between text and point cloud features
The text-conditioning methods are implemented inside the PVConv layers of PVD, as shown in the figures below.
The table below summarizes the architecture of PVConv (3 layers) in the concatenation text-conditioning scheme.
PVConv x 3 |
---|
Input: (x, t, text_embed) |
Concat(x, t, text_embed) |
3x3x3 Conv, GroupNorm, Swish |
Dropout |
3x3x3 Conv, GroupNorm, Swish |
SelfAttention |
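The `Concat(x, t, text_embed)` step can be pictured with the simplified sketch below, which appends the same time and text vectors to the features of every point. It assumes per-point features rather than the voxel grid used by PVConv, and it assumes the text embedding is mean-pooled over tokens; the function name and all channel sizes are hypothetical.

```python
import numpy as np

def concat_condition(point_feats, time_embed, text_embed):
    """Append the same time and text vectors to every point's feature vector.

    point_feats: (num_points, c_point)
    time_embed:  (c_time,)
    text_embed:  (num_tokens, c_text), mean-pooled to one vector (an assumption)
    """
    num_points = point_feats.shape[0]
    pooled_text = text_embed.mean(axis=0)
    time_tiled = np.broadcast_to(time_embed, (num_points, time_embed.shape[0]))
    text_tiled = np.broadcast_to(pooled_text, (num_points, pooled_text.shape[0]))
    return np.concatenate([point_feats, time_tiled, text_tiled], axis=1)

rng = np.random.default_rng(0)
fused = concat_condition(
    rng.standard_normal((2048, 64)),    # point features
    rng.standard_normal(64),            # time-step embedding
    rng.standard_normal((16, 512)),     # text embedding, 16 tokens
)
print(fused.shape)  # (2048, 640)
```

In this form every point receives an identical copy of the text vector, so the text can only influence the output through the convolutions that follow.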
The table below summarizes the architecture of PVConv (3 layers) in the cross-attention text-conditioning scheme.
PVConv x 3 |
---|
Input: (x, t, text_embed) |
Concat(x, t) |
3x3x3 Conv, GroupNorm, Swish |
Dropout |
3x3x3 Conv, GroupNorm, Swish |
SelfAttention |
CrossAttention(text_embed) |
FeedForward |
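The `CrossAttention(text_embed)` row can be sketched as single-head attention in which point features act as queries and text tokens act as keys and values. This is only an illustration of the general technique: the single head, the residual connection, the random projection matrices, and all sizes are assumptions, and the self-attention and feed-forward steps of the table are left out.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def cross_attention(point_feats, text_embed, w_q, w_k, w_v):
    """Single-head cross-attention: points are queries, text tokens are keys/values."""
    queries = point_feats @ w_q               # (num_points, d)
    keys = text_embed @ w_k                   # (num_tokens, d)
    values = text_embed @ w_v                 # (num_tokens, d)
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)        # each row sums to 1 over the tokens
    # Residual connection (assumed): text information is added to the point features.
    return point_feats + weights @ values, weights

rng = np.random.default_rng(0)
c_point, c_text = 64, 512                     # hypothetical channel sizes
point_feats = rng.standard_normal((2048, c_point))
text_embed = rng.standard_normal((16, c_text))
w_q = rng.standard_normal((c_point, c_point)) * 0.1
w_k = rng.standard_normal((c_text, c_point)) * 0.1
w_v = rng.standard_normal((c_text, c_point)) * 0.1
out, weights = cross_attention(point_feats, text_embed, w_q, w_k, w_v)
```

Unlike concatenation, each point here computes its own weighting over the text tokens, so different regions of the shape can attend to different words of the prompt.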
This model has been trained on Text2Shape, the only existing dataset with paired 3D shapes and textual descriptions. The dataset is limited to the chair and table categories of ShapeNet. Text2Shape provides a total of 75k shape-text pairs, covering 15032 distinct 3D shapes.
To train the model with the concatenation scheme:
python train.py --half_resolution --use_concat
To train the model with the cross-attention scheme:
python train.py --half_resolution
When testing the trained models, we compute the metrics reported in PVD:
- MMD-CD: Minimum Matching Distance (using Chamfer Distance)
- COV-CD: Coverage (using Chamfer Distance)
- JSD: Jensen-Shannon Divergence
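To make the three metrics concrete, the sketch below implements them in NumPy on toy data. It follows the common definitions (mean-reduced squared Chamfer distance, occupancy histograms on a coarse voxel grid) and is not the evaluation code used by `test.py`; the function names, the grid resolution, and the assumption that shapes lie in [-1, 1]^3 are all illustrative.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a (n, 3) and b (m, 3)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (n, m)
    return sq_dists.min(axis=1).mean() + sq_dists.min(axis=0).mean()

def mmd_and_cov(generated, reference):
    """MMD-CD and COV-CD for two lists of point clouds."""
    dists = np.array([[chamfer_distance(g, r) for r in reference] for g in generated])
    mmd = dists.min(axis=0).mean()               # best match for each reference shape
    matched = np.unique(dists.argmin(axis=1))    # references chosen by some sample
    return float(mmd), len(matched) / len(reference)

def occupancy_histogram(clouds, resolution=8):
    """Normalised point count per voxel, for clouds assumed to lie in [-1, 1]^3."""
    points = np.concatenate(clouds, axis=0)
    idx = np.clip(((points + 1.0) / 2.0 * resolution).astype(int), 0, resolution - 1)
    flat = np.ravel_multi_index(idx.T, (resolution,) * 3)
    counts = np.bincount(flat, minlength=resolution ** 3).astype(float)
    return counts / counts.sum()

def jensen_shannon_divergence(p, q):
    """JSD in bits between two discrete distributions (0 = identical, 1 = disjoint)."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

rng = np.random.default_rng(0)
# Four well-separated toy "shapes" as the reference set.
reference = [rng.uniform(-0.1, 0.1, (64, 3)) + c for c in (-0.6, -0.2, 0.2, 0.6)]
collapsed = [reference[0].copy() for _ in range(4)]   # a generator with no variety

mmd_same, cov_same = mmd_and_cov([r.copy() for r in reference], reference)
mmd_coll, cov_coll = mmd_and_cov(collapsed, reference)
jsd_coll = jensen_shannon_divergence(
    occupancy_histogram(collapsed), occupancy_histogram(reference)
)
```

In this toy setup a generator that always returns the same shape covers only one of the four reference shapes, which mirrors how a low COV-CD signals a lack of variety in the generated set.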
To evaluate a trained model with the concatenation scheme:
python test.py --half_resolution --use_concat --model path/to/your/model.pth --eval_dir path/to/output/directory
To evaluate a trained model with the cross-attention scheme:
python test.py --half_resolution --model path/to/your/model.pth --eval_dir path/to/output/directory
method | MMD-CD ↓ | COV-CD ↑ | JSD ↓ |
---|---|---|---|
concatenation | 0.0941 | 2.50 % | 0.965 |
cross-attention | 0.00154 | 50.91 % | 0.0283 |
The table below shows some qualitative results of the text-driven generation process.
text | cross-attention | concatenation |
---|---|---|
"a sofa chair" | | |
"a round table" | | |
"a long rectangular table" | | |
The quantitative and qualitative results show that the text-conditioning scheme based on cross-attention is much more effective than concatenation: it reaches a COV-CD of 50.91 % against 2.50 %, with lower MMD-CD and JSD. Cross-attention generates 3D shapes that are coherent with the input text, since they contain the main geometric and appearance details it specifies. The concatenation scheme, by contrast, fails to inject the features described in the text into the 3D generation process, resulting in 3D shapes that are neither varied nor related to the input textual description.