UNIMO

Code for the ACL 2021 main conference long paper "UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning".

Abstract

Existing pre-training methods focus on either single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or image) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a UNIfied-MOdal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections are utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs augmented with related images and texts. With the help of rich non-paired single-modal data, our model is able to learn more generalizable representations by allowing textual knowledge and visual knowledge to enhance each other in the unified semantic space. The experimental results show that UNIMO greatly improves the performance of several single-modal and multi-modal downstream tasks.

Performance

Results on multi-modal understanding and generation tasks:

Results on single-modal understanding and generation tasks:

TODOs

[] Add VQA tasks

Dependencies

python 3.7.4
paddlepaddle-gpu==1.8.4.post107
pyrouge==0.1.3
regex==2020.7.14

Pre-trained Models

UNIMO is pre-trained on large-scale text corpora, image collections, and image-text aligned datasets. We provide the following pre-trained models:

UNIMO base (lowercased | 12 layers)

UNIMO-mnli base (lowercased | 12 layers)

UNIMO large (lowercased | 24 layers)

UNIMO-mnli large (lowercased | 24 layers)

MODEL_SIZE=base # base | mnli_base | large | mnli_large
cd /path/to/model_files
wget --no-check-certificate -q https://unimo.bj.bcebos.com/model/unimo_${MODEL_SIZE}_en.tar.gz
tar -zxf unimo_${MODEL_SIZE}_en.tar.gz
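The four archive names follow one pattern, and because `wget -q` is silent, a typo in `MODEL_SIZE` tends to surface only as a later `tar` error. A minimal sketch of a guard is shown below; `unimo_model_url` is a hypothetical helper name, not part of the repository, and it only prints the URL rather than downloading anything:

```shell
# Hypothetical helper: build the download URL for a UNIMO checkpoint.
# Only the four sizes listed in this README are accepted.
unimo_model_url() {
    case "$1" in
        base|mnli_base|large|mnli_large)
            printf 'https://unimo.bj.bcebos.com/model/unimo_%s_en.tar.gz\n' "$1"
            ;;
        *)
            echo "unknown MODEL_SIZE: $1 (expected base | mnli_base | large | mnli_large)" >&2
            return 1
            ;;
    esac
}

# Print the URL so it can be inspected before it is passed to wget.
unimo_model_url mnli_base
```

The printed URL can then be handed to the `wget` command above in place of the hand-built string.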

Experiments

Our fine-tuning experiments were carried out on V100 GPUs. The start commands and basic settings for all downstream tasks are listed below:

Task Type	Dataset	Pre-trained Models	Start Command	V100 GPU Cards	Running Time
Text Understanding	SST-2	UNIMO base	sh ./script/classification/SST-2/run.sh	8	9h
	SST-2	UNIMO large	sh ./script/classification/SST-2_large/run.sh	8	14h
	CoLA	UNIMO base	sh ./script/classification/CoLA/run.sh	4	2h
	CoLA	UNIMO large	sh ./script/classification/CoLA_large/run.sh	4	4h
	MNLI-AX	UNIMO base	sh ./script/classification/MNLI-AX/run.sh	8	1d20h
	MNLI-AX	UNIMO large	sh ./script/classification/MNLI-AX_large/run.sh	8	2d13h
	STS-B	UNIMO-mnli base	sh ./script/regression/STS-B/run.sh	8	2h
	STS-B	UNIMO-mnli large	sh ./script/regression/STS-B_large/run.sh	8	4h
Text Generation	CNN/DailyMail	UNIMO base	sh ./script/seq2seq/cnndm/run.sh	4	1d8h
	CNN/DailyMail	UNIMO large	sh ./script/seq2seq/cnndm_large/run.sh	4	3d18h
	Gigaword	UNIMO base	sh ./script/seq2seq/gigaword/run.sh	4	1d3h
	Gigaword	UNIMO large	sh ./script/seq2seq/gigaword_large/run.sh	4	2d3h
	CoQA	UNIMO base	sh ./script/seq2seq/coqa/run.sh	4	7h
	CoQA	UNIMO large	sh ./script/seq2seq/coqa_large/run.sh	4	22h
	Squad_QG	UNIMO base	sh ./script/seq2seq/squad_qg/run.sh	4	4h
	Squad_QG	UNIMO large	sh ./script/seq2seq/squad_qg_large/run.sh	4	8h
Multi-Modal Understanding	Flickr30k	UNIMO base	sh ./script/retrieval/Flickr30k/run.sh	16	3d
	Flickr30k	UNIMO large	sh ./script/retrieval/Flickr30k_large/run.sh	16	3d
	SNLI-VE	UNIMO base	sh ./script/visual_entailment/SNLI-VE/run.sh	16	16h
	SNLI-VE	UNIMO large	sh ./script/visual_entailment/SNLI-VE_large/run.sh	16	2d
	VQA	UNIMO base	-	-	-
	VQA	UNIMO large	-	-	-
Multi-Modal Generation	COCO Caption	UNIMO base	sh ./script/img2txt/coco/run.sh	16	3d
	COCO Caption	UNIMO large	sh ./script/img2txt/coco_large/run.sh	16	4d
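To compare the cost of the runs in the table, the card count and running time can be combined into GPU-hours. The sketch below is illustrative only: `to_hours` and `gpu_hours` are hypothetical helper names, and the duration format (`<N>d<M>h`, `<N>d`, or `<M>h`) is assumed from the table entries.

```shell
# Convert a running time such as "1d20h", "9h" or "3d" into hours.
to_hours() (
    t=$1
    days=0
    hours=0
    case "$t" in
        *d*) days=${t%%d*}; t=${t#*d} ;;   # split off the day part, keep the rest
    esac
    case "$t" in
        *h) hours=${t%h} ;;                # strip the trailing "h"
    esac
    echo $(( $days * 24 + $hours ))
)

# GPU-hours = number of cards x running time in hours.
gpu_hours() (
    echo $(( $1 * $(to_hours "$2") ))
)

# MNLI-AX with UNIMO base: 8 cards for 1d20h (44 hours), which prints 352.
gpu_hours 8 1d20h
```

By the same arithmetic, the 16-card, 3-day Flickr30k runs come to 1152 GPU-hours each.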

Text Understanding Tasks

(1) Sentiment Classification

Download SST-2 dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/SST-2.tar.gz
tar -zxf SST-2.tar.gz
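Every dataset in this README is fetched with the same two commands, and the archive passed to `tar` has to match the file that `wget` saved. As a sketch, a hypothetical helper (`fetch_cmds`, not part of the repository) can derive the archive name from the URL and print both commands for review:

```shell
# Hypothetical helper: print the download-and-extract commands for one archive.
# The archive name is taken from the URL, so the two commands always agree.
fetch_cmds() (
    url=$1
    archive=${url##*/}   # keep everything after the last "/"
    printf 'wget --no-check-certificate -q %s\n' "$url"
    printf 'tar -zxf %s\n' "$archive"
)

fetch_cmds https://unimo.bj.bcebos.com/data/SST-2.tar.gz
```

The helper only prints; to execute the commands, its output could be piped to `sh` from inside `/path/to/data`.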

Run the following command to train and evaluate on the SST-2 dataset:

For base model:

bash ./script/classification/SST-2/run.sh

For large model:

bash ./script/classification/SST-2_large/run.sh
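The base and large commands differ only by a `_large` suffix on the task directory, and the start commands listed under Experiments follow the same layout for every task. A small sketch of that naming rule, using a hypothetical helper name `run_script_path`:

```shell
# Hypothetical helper: map (task type, task, model size) to the run script path,
# following the ./script/<type>/<task>[_large]/run.sh layout used in this README.
run_script_path() (
    kind=$1
    task=$2
    size=$3
    case "$size" in
        base)  suffix="" ;;
        large) suffix="_large" ;;
        *) echo "size must be base or large" >&2; exit 1 ;;
    esac
    echo "./script/$kind/$task$suffix/run.sh"
)

run_script_path classification SST-2 large
```

The printed path is the argument that the `bash` commands above receive.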

Evaluation Results:

Model	Acc
UNIMO-base	95.1
UNIMO-large	96.8

(2) Natural Language Inference

Download MNLI-AX dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/MNLI-AX.tar.gz
tar -zxf MNLI-AX.tar.gz

Run the following command to train and evaluate on the MNLI-AX dataset:

For base model:

bash ./script/classification/MNLI-AX/run.sh

For large model:

bash ./script/classification/MNLI-AX_large/run.sh

Evaluation Results:

Model	Acc-(m/mm)
UNIMO-base	86.8/86.7
UNIMO-large	89.8/89.5

(3) Similarity Tasks

Download STS-B dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/STS-B.tar.gz
tar -zxf STS-B.tar.gz

Run the following command to train and evaluate on the STS-B dataset:

For base model:

bash ./script/regression/STS-B/run.sh

For large model:

bash ./script/regression/STS-B_large/run.sh

Evaluation Results:

Model	Pearson correlation
UNIMO-base	91.0
UNIMO-large	92.6

(4) Linguistic Acceptability Judgments

Download CoLA dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/CoLA.tar.gz
tar -zxf CoLA.tar.gz

Run the following command to train and evaluate on the CoLA dataset:

For base model:

bash ./script/classification/CoLA/run.sh

For large model:

bash ./script/classification/CoLA_large/run.sh

Evaluation Results:

Model	Matthews correlation
UNIMO-base	65.4
UNIMO-large	68.5

Text Generation Tasks

(1) Document Summarization

Download CNN/DailyMail dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/cnndm.tar.gz
tar -zxf cnndm.tar.gz

Download evaluation script:

cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/cnndm.tar.gz
tar -zxf cnndm.tar.gz
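The dataset and its evaluation script are published under the same archive name (`cnndm.tar.gz`) in two different URL directories, which is why they are unpacked in separate working directories. As a sketch, assuming a hypothetical helper name `task_urls`, both URLs can be derived from the task name; the same layout appears for `gigaword`, `squad_qg`, `coqa`, and `coco` in this README:

```shell
# Hypothetical helper: print the dataset URL and the evaluation-script URL
# for a task whose archives share one file name.
task_urls() (
    base=https://unimo.bj.bcebos.com
    echo "$base/data/$1.tar.gz"          # unpack under /path/to/data
    echo "$base/eval_script/$1.tar.gz"   # unpack under src/eval/tasks
)

task_urls cnndm
```

Downloading both archives into a single directory would make the second file collide with the first, so the `cd` steps above should not be skipped.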

Run the following command to train and evaluate on the CNN/DailyMail dataset:

For base model:

bash ./script/seq2seq/cnndm/run.sh

For large model:

bash ./script/seq2seq/cnndm_large/run.sh

Evaluation Results:

Model	ROUGE-1	ROUGE-2	ROUGE-L
UNIMO-base	42.42	20.12	39.61
UNIMO-large	43.51	20.65	40.63

(2) Sentence Compression

Download Gigaword dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/gigaword.tar.gz
tar -zxf gigaword.tar.gz

Download evaluation script:

cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/gigaword.tar.gz
tar -zxf gigaword.tar.gz

Run the following command to train and evaluate on the Gigaword dataset:

For base model:

bash ./script/seq2seq/gigaword/run.sh

For large model:

bash ./script/seq2seq/gigaword_large/run.sh

Evaluation Results:

Model	ROUGE-1	ROUGE-2	ROUGE-L
UNIMO-base	38.80	19.99	36.27
UNIMO-large	39.71	20.37	36.88

(3) Question Generation

Download SQuAD dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/squad_qg.tar.gz
tar -zxf squad_qg.tar.gz

Download evaluation script:

cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/squad_qg.tar.gz
tar -zxf squad_qg.tar.gz

Run the following command to train and evaluate on the SQuAD dataset:

For base model:

bash ./script/seq2seq/squad_qg/run.sh

For large model:

bash ./script/seq2seq/squad_qg_large/run.sh

Evaluation Results:

Model	BLEU-4	METEOR	ROUGE-L
UNIMO-base	22.78	25.24	51.34
UNIMO-large	24.59	26.39	52.47

(4) Conversational Question Answering

Download CoQA dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/coqa.tar.gz
tar -zxf coqa.tar.gz

Download evaluation script:

cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/coqa.tar.gz
tar -zxf coqa.tar.gz

Run the following command to train and evaluate on the CoQA dataset:

For base model:

bash ./script/seq2seq/coqa/run.sh

For large model:

bash ./script/seq2seq/coqa_large/run.sh

Evaluation Results:

Model	Acc
UNIMO-base	80.2
UNIMO-large	84.9

Multi-Modal Understanding Tasks

(1) Image-Text Retrieval

Download Flickr30k dataset:

Note: Visual features are extracted by bottom-up-attention

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/Flickr30k.tar.gz # occupies about 37G disk space
tar -zxf Flickr30k.tar.gz
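Because the Flickr30k data occupies about 37G, it may be worth checking free space before starting the download. The sketch below is one way to do that with POSIX `df`; `has_free_gb` is a hypothetical helper name, and the 37 threshold simply reuses the figure quoted above (the README does not say whether it covers the archive, the extracted files, or both):

```shell
# Succeed when the filesystem holding directory $1 has at least $2 GiB available.
has_free_gb() (
    dir=$1
    need_gb=$2
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # on the second line, column 4 is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $need_gb * 1024 * 1024 )) ]
)

if has_free_gb . 37; then
    echo "enough space for Flickr30k"
else
    echo "less than 37G free in $(pwd)"
fi
```

Run it from `/path/to/data`, or pass that directory as the first argument.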

Run the following command to train and evaluate on the Flickr30k dataset:

For base model:

bash ./script/retrieval/Flickr30k/run.sh

For large model:

bash ./script/retrieval/Flickr30k_large/run.sh

Evaluation Results:

Results of Image Retrieval task on Flickr30k dataset

Model	R@1	R@5	R@10
UNIMO-base	74.66	93.40	96.08
UNIMO-large	78.04	94.24	97.12

Results of Text Retrieval task on Flickr30k dataset

Model	R@1	R@5	R@10
UNIMO-base	89.70	98.40	99.10
UNIMO-large	89.40	98.90	99.80

(2) Visual Entailment

Download SNLI-VE dataset:

Note: Visual features are extracted by bottom-up-attention

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/SNLI-VE.tar.gz
tar -zxf SNLI-VE.tar.gz

Run the following command to train and evaluate on the SNLI-VE dataset:

For base model:

bash ./script/visual_entailment/SNLI-VE/run.sh

For large model:

bash ./script/visual_entailment/SNLI-VE_large/run.sh

Evaluation Results:

Results of Visual Entailment task on SNLI-VE dataset

Model	dev	test
UNIMO-base	80.00	79.10
UNIMO-large	81.11	80.63

Multi-Modal Generation Tasks

(1) Image Caption Generation

Download COCO Caption dataset:

Note: Visual features are extracted by bottom-up-attention

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/coco.tar.gz
tar -zxf coco.tar.gz

Download evaluation script:

cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/coco.tar.gz
tar -zxf coco.tar.gz

Run the following command to train and evaluate on the COCO Caption dataset:

For base model:

bash ./script/img2txt/coco/run.sh

For large model:

bash ./script/img2txt/coco_large/run.sh

Evaluation Results:

Model	BLEU-4	CIDEr
UNIMO-base	38.8	124.4
UNIMO-large	39.6	127.7

Citation

If you find our paper and code useful, please cite the following paper:

@article{li2020unimo,
  title={UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning},
  author={Li, Wei and Gao, Can and Niu, Guocheng and Xiao, Xinyan and Liu, Hao and Liu, Jiachen and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2012.15409},
  year={2020}
}

Contact information

For help or issues using UNIMO, please submit a GitHub issue.

For personal communication related to UNIMO, please contact Wei Li (liwei85@baidu.com), Guocheng Niu (niuguocheng@baidu.com), or Can Gao (gaocan01@baidu.com).
