Code for the ACL 2021 main conference long paper UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning.
Existing pre-training methods focus on either single-modal or multi-modal tasks, and cannot effectively adapt to the other.
They can only utilize single-modal data (i.e., text or image) or limited multi-modal data (i.e., image-text pairs).
In this work, we propose a UNIfied-MOdal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks.
Large-scale free text corpora and image collections are used to improve visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align textual and visual information in a unified semantic space over a corpus of image-text pairs augmented with related images and texts.
With the help of rich non-paired single-modal data, our model is able to learn more generalizable representations, by allowing textual knowledge and visual knowledge to enhance each other in the unified semantic space.
The experimental results show that UNIMO greatly improves the performance of several single-modal and multi-modal downstream tasks.
Results on multi-modal understanding and generation tasks:
Results on single-modal understanding and generation tasks:
- [] Add VQA tasks
python 3.7.4
paddlepaddle-gpu==1.8.4.post107
pyrouge==0.1.3
regex==2020.7.14
UNIMO adopts large-scale text corpora, image collections and image-text aligned datasets as the pre-training data.
We provide UNIMO pre-trained models below:
UNIMO base (lowercased | 12 layers)
UNIMO-mnli base (lowercased | 12 layers)
UNIMO large (lowercased | 24 layers)
UNIMO-mnli large (lowercased | 24 layers)
MODEL_SIZE=base # base | mnli_base | large | mnli_large
cd /path/to/model_files
wget --no-check-certificate -q https://unimo.bj.bcebos.com/model/unimo_${MODEL_SIZE}_en.tar.gz
tar -zxf unimo_${MODEL_SIZE}_en.tar.gz
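The download commands above follow a fixed URL pattern, so the allowed values of `MODEL_SIZE` can be checked before anything is fetched. The sketch below is only an illustration: the function name `unimo_model_url` is hypothetical and not part of the UNIMO repository, while the four size names and the URL pattern are copied from the commands above. It prints URLs and does not download anything.

```shell
# Hypothetical helper (not part of the UNIMO repo): print the download URL
# for a pre-trained checkpoint, rejecting sizes that are not listed above.
unimo_model_url() {
    case "$1" in
        base|mnli_base|large|mnli_large)
            printf 'https://unimo.bj.bcebos.com/model/unimo_%s_en.tar.gz\n' "$1"
            ;;
        *)
            echo "unknown MODEL_SIZE: $1" >&2
            return 1
            ;;
    esac
}

# Example: list the URLs of all four released checkpoints.
for size in base mnli_base large mnli_large; do
    unimo_model_url "$size"
done
```

The printed URL can then be passed to `wget` and `tar` exactly as in the commands above.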
Our fine-tuning experiments are carried out on V100 GPUs. The start commands and basic settings of all downstream tasks are listed below:
Task Type | Dataset | Pre-trained Models | Start Command | V100 GPU Cards | Running Time |
Text Understanding | SST-2 | UNIMO base | sh ./script/classification/SST-2/run.sh | 8 | 9h |
Text Understanding | SST-2 | UNIMO large | sh ./script/classification/SST-2_large/run.sh | 8 | 14h |
Text Understanding | CoLA | UNIMO base | sh ./script/classification/CoLA/run.sh | 4 | 2h |
Text Understanding | CoLA | UNIMO large | sh ./script/classification/CoLA_large/run.sh | 4 | 4h |
Text Understanding | MNLI-AX | UNIMO base | sh ./script/classification/MNLI-AX/run.sh | 8 | 1d20h |
Text Understanding | MNLI-AX | UNIMO large | sh ./script/classification/MNLI-AX_large/run.sh | 8 | 2d13h |
Text Understanding | STS-B | UNIMO-mnli base | sh ./script/regression/STS-B/run.sh | 8 | 2h |
Text Understanding | STS-B | UNIMO-mnli large | sh ./script/regression/STS-B_large/run.sh | 8 | 4h |
Text Generation | CNN/DailyMail | UNIMO base | sh ./script/seq2seq/cnndm/run.sh | 4 | 1d8h |
Text Generation | CNN/DailyMail | UNIMO large | sh ./script/seq2seq/cnndm_large/run.sh | 4 | 3d18h |
Text Generation | Gigaword | UNIMO base | sh ./script/seq2seq/gigaword/run.sh | 4 | 1d3h |
Text Generation | Gigaword | UNIMO large | sh ./script/seq2seq/gigaword_large/run.sh | 4 | 2d3h |
Text Generation | CoQA | UNIMO base | sh ./script/seq2seq/coqa/run.sh | 4 | 7h |
Text Generation | CoQA | UNIMO large | sh ./script/seq2seq/coqa_large/run.sh | 4 | 22h |
Text Generation | Squad_QG | UNIMO base | sh ./script/seq2seq/squad_qg/run.sh | 4 | 4h |
Text Generation | Squad_QG | UNIMO large | sh ./script/seq2seq/squad_qg_large/run.sh | 4 | 8h |
Multi-Modal Understanding | Flickr30k | UNIMO base | sh ./script/retrieval/Flickr30k/run.sh | 16 | 3d |
Multi-Modal Understanding | Flickr30k | UNIMO large | sh ./script/retrieval/Flickr30k_large/run.sh | 16 | 3d |
Multi-Modal Understanding | SNLI-VE | UNIMO base | sh ./script/visual_entailment/SNLI-VE/run.sh | 16 | 16h |
Multi-Modal Understanding | SNLI-VE | UNIMO large | sh ./script/visual_entailment/SNLI-VE_large/run.sh | 16 | 2d |
Multi-Modal Understanding | VQA | UNIMO base | - | - | - |
Multi-Modal Understanding | VQA | UNIMO large | - | - | - |
Multi-Modal Generation | COCO Caption | UNIMO base | sh ./script/img2txt/coco/run.sh | 16 | 3d |
Multi-Modal Generation | COCO Caption | UNIMO large | sh ./script/img2txt/coco_large/run.sh | 16 | 4d |
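To compare the cost of the tasks in the table above, the card count and running time can be combined into GPU-hours. The sketch below assumes that the "Running Time" column is wall-clock time on the listed number of cards, which the table does not state explicitly. The function names `runtime_hours` and `gpu_hours` are hypothetical and not part of the UNIMO repository.

```shell
# Hypothetical helper: convert a "Running Time" cell such as 1d20h, 9h or 3d
# into a whole number of hours.
runtime_hours() {
    t=$1
    days=0
    hours=0
    case "$t" in
        *d*) days=${t%%d*}; t=${t#*d} ;;   # split off the day part, if any
    esac
    case "$t" in
        *h) hours=${t%h} ;;                # remaining part is the hour count
    esac
    echo $(( days * 24 + hours ))
}

# Hypothetical helper: GPU-hours = number of cards x wall-clock hours
# (assumes the running time is wall-clock time on all listed cards).
gpu_hours() {
    echo $(( $1 * $(runtime_hours "$2") ))
}

gpu_hours 8 1d20h   # MNLI-AX, UNIMO base: prints 352
gpu_hours 16 3d     # Flickr30k, UNIMO base: prints 1152
```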
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/SST-2.tar.gz
tar -zxf SST-2.tar.gz
For base model:
bash ./script/classification/SST-2/run.sh
For large model:
bash ./script/classification/SST-2_large/run.sh
Model | Acc |
UNIMO-base | 95.1 |
UNIMO-large | 96.8 |
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/MNLI-AX.tar.gz
tar -zxf MNLI-AX.tar.gz
For base model:
bash ./script/classification/MNLI-AX/run.sh
For large model:
bash ./script/classification/MNLI-AX_large/run.sh
Model | Acc-(m/mm) |
UNIMO-base | 86.8/86.7 |
UNIMO-large | 89.8/89.5 |
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/STS-B.tar.gz
tar -zxf STS-B.tar.gz
For base model:
bash ./script/regression/STS-B/run.sh
For large model:
bash ./script/regression/STS-B_large/run.sh
Model | Pearson correlation |
UNIMO-base | 91.0 |
UNIMO-large | 92.6 |
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/CoLA.tar.gz
tar -zxf CoLA.tar.gz
For base model:
bash ./script/classification/CoLA/run.sh
For large model:
bash ./script/classification/CoLA_large/run.sh
Model | Matthews correlation |
UNIMO-base | 65.4 |
UNIMO-large | 68.5 |
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/cnndm.tar.gz
tar -zxf cnndm.tar.gz
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/cnndm.tar.gz
tar -zxf cnndm.tar.gz
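The generation tasks in this README need two archives, the dataset and an evaluation script, while the other tasks need only the dataset. A minimal sketch of that rule follows; the function name `unimo_task_urls` is hypothetical and not part of the UNIMO repository, and the URL patterns and task names are copied from the download commands in this README. It only prints URLs.

```shell
# Hypothetical helper: print the dataset URL for a task and, for tasks that
# ship a separate evaluation script in this README, that URL as well.
unimo_task_urls() {
    task=$1
    echo "https://unimo.bj.bcebos.com/data/${task}.tar.gz"
    case "$task" in
        cnndm|gigaword|squad_qg|coqa|coco)
            echo "https://unimo.bj.bcebos.com/eval_script/${task}.tar.gz"
            ;;
    esac
}

unimo_task_urls cnndm   # prints the dataset URL, then the evaluation-script URL
```

As in the commands above, the dataset archive belongs under the data directory and the evaluation script under `src/eval/tasks`.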
For base model:
bash ./script/seq2seq/cnndm/run.sh
For large model:
bash ./script/seq2seq/cnndm_large/run.sh
Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
UNIMO-base | 42.42 | 20.12 | 39.61 |
UNIMO-large | 43.51 | 20.65 | 40.63 |
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/gigaword.tar.gz
tar -zxf gigaword.tar.gz
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/gigaword.tar.gz
tar -zxf gigaword.tar.gz
For base model:
bash ./script/seq2seq/gigaword/run.sh
For large model:
bash ./script/seq2seq/gigaword_large/run.sh
Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
UNIMO-base | 38.80 | 19.99 | 36.27 |
UNIMO-large | 39.71 | 20.37 | 36.88 |
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/squad_qg.tar.gz
tar -zxf squad_qg.tar.gz
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/squad_qg.tar.gz
tar -zxf squad_qg.tar.gz
For base model:
bash ./script/seq2seq/squad_qg/run.sh
For large model:
bash ./script/seq2seq/squad_qg_large/run.sh
Model | BLEU4 | METEOR | ROUGE-L |
UNIMO-base | 22.78 | 25.24 | 51.34 |
UNIMO-large | 24.59 | 26.39 | 52.47 |
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/coqa.tar.gz
tar -zxf coqa.tar.gz
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/coqa.tar.gz
tar -zxf coqa.tar.gz
For base model:
bash ./script/seq2seq/coqa/run.sh
For large model:
bash ./script/seq2seq/coqa_large/run.sh
Model | Acc |
UNIMO-base | 80.2 |
UNIMO-large | 84.9 |
Note: Visual features are extracted by bottom-up-attention
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/Flickr30k.tar.gz # occupies about 37G disk space
tar -zxf Flickr30k.tar.gz
For base model:
bash ./script/retrieval/Flickr30k/run.sh
For large model:
bash ./script/retrieval/Flickr30k_large/run.sh
Results of Image Retrieval task on Flickr30k dataset
Model | R@1 | R@5 | R@10 |
UNIMO-base | 74.66 | 93.40 | 96.08 |
UNIMO-large | 78.04 | 94.24 | 97.12 |
Results of Text Retrieval task on Flickr30k dataset
Model | R@1 | R@5 | R@10 |
UNIMO-base | 89.70 | 98.40 | 99.10 |
UNIMO-large | 89.40 | 98.90 | 99.80 |
Note: Visual features are extracted by bottom-up-attention
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/SNLI-VE.tar.gz
tar -zxf SNLI-VE.tar.gz
For base model:
bash ./script/visual_entailment/SNLI-VE/run.sh
For large model:
bash ./script/visual_entailment/SNLI-VE_large/run.sh
Results of Visual Entailment task on SNLI-VE dataset
Model | dev | test |
UNIMO-base | 80.00 | 79.10 |
UNIMO-large | 81.11 | 80.63 |
Note: Visual features are extracted by bottom-up-attention
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/coco.tar.gz
tar -zxf coco.tar.gz
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/coco.tar.gz
tar -zxf coco.tar.gz
For base model:
bash ./script/img2txt/coco/run.sh
For large model:
bash ./script/img2txt/coco_large/run.sh
Model | BLEU4 | CIDEr |
UNIMO-base | 38.8 | 124.4 |
UNIMO-large | 39.6 | 127.7 |
If you find our paper and code useful, please cite the following paper:
@article{li2020unimo,
title={UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning},
author={Li, Wei and Gao, Can and Niu, Guocheng and Xiao, Xinyan and Liu, Hao and Liu, Jiachen and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15409},
year={2020}
}
For help or issues using UNIMO, please submit a GitHub issue.
For personal communication related to UNIMO, please contact Wei Li (liwei85@baidu.com), Guocheng Niu (niuguocheng@baidu.com), or Can Gao (gaocan01@baidu.com).