TinyBERT

TinyBERT is 7.5x smaller and 9.4x faster at inference than BERT-base, while achieving competitive performance on natural language understanding tasks. It performs a novel Transformer distillation at both the pre-training and the task-specific learning stages. The overview of TinyBERT learning is illustrated as follows:

[Figure: the two-stage TinyBERT learning framework, i.e. general distillation followed by task-specific distillation]

For more details about the techniques of TinyBERT, refer to our paper:

TinyBERT: Distilling BERT for Natural Language Understanding

Release Notes

  • 2019/11/26: first version released.
  • 2021/7/27: Chinese General_TinyBERT added.

Installation

Run the command below to install the environment (using Python 3):

pip install -r requirements.txt

General Distillation

In general distillation, we use the original BERT-base without fine-tuning as the teacher and a large-scale text corpus as the learning data. By performing Transformer distillation on text from the general domain, we obtain a general TinyBERT that provides a good initialization for task-specific distillation.

General distillation has two steps: (1) generate the JSON-format corpus; (2) run the Transformer distillation.

Step 1: use pregenerate_training_data.py to produce the JSON-format corpus.

 
# ${BERT_BASE_DIR} includes the BERT-base teacher model.

python pregenerate_training_data.py --train_corpus ${CORPUS_RAW} \
                  --bert_model ${BERT_BASE_DIR} \
                  --reduce_memory --do_lower_case \
                  --epochs_to_generate 3 \
                  --output_dir ${CORPUS_JSON_DIR}

Step 2: use general_distill.py to run the general distillation

# ${STUDENT_CONFIG_DIR} includes the config file of the student model.

python general_distill.py --pregenerated_data ${CORPUS_JSON} \
                          --teacher_model ${BERT_BASE} \
                          --student_model ${STUDENT_CONFIG_DIR} \
                          --reduce_memory --do_lower_case \
                          --train_batch_size 256 \
                          --output_dir ${GENERAL_TINYBERT_DIR}
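For intuition, the Transformer distillation performed here trains the student to mimic the teacher's attention score matrices and (linearly projected) hidden states at a set of mapped layers. The PyTorch-style sketch below is only an illustration of that per-layer objective under assumed tensor names and shapes, not the exact implementation in general_distill.py:

import torch.nn.functional as F

def transformer_distill_loss(student_atts, teacher_atts,
                             student_reps, teacher_reps, proj):
    """Illustrative objective (an assumption, not the repository's code).

    student_atts / teacher_atts: lists of attention score tensors
        [batch, heads, seq, seq], already aligned by a layer mapping.
    student_reps / teacher_reps: lists of hidden states
        [batch, seq, d_student] and [batch, seq, d_teacher].
    proj: a learnable torch.nn.Linear(d_student, d_teacher).
    """
    # MSE between student and teacher attention matrices, summed over layers
    att_loss = sum(F.mse_loss(s, t) for s, t in zip(student_atts, teacher_atts))
    # MSE between projected student hidden states and teacher hidden states
    rep_loss = sum(F.mse_loss(proj(s), t) for s, t in zip(student_reps, teacher_reps))
    return att_loss + rep_loss

In the paper, student layers are mapped uniformly onto teacher layers, and the embedding layer is distilled with the same MSE form.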

We also provide the general TinyBERT models below, so users can skip the general distillation step.

=================1st version to reproduce our results in the paper ===========================

General_TinyBERT(4layer-312dim)

General_TinyBERT(6layer-768dim)

=================2nd version (2019/11/18) trained with more (book+wiki) and no [MASK] corpus =======

General_TinyBERT_v2(4layer-312dim)

General_TinyBERT_v2(6layer-768dim)

=================Chinese version trained with WIKI and NEWS corpus =======

General_TinyBERT_zh(4layer-312dim)

General_TinyBERT_zh(6layer-768dim)

Data Augmentation

Data augmentation aims to expand the task-specific training set. By learning from more task-related examples, the generalization capability of the student model can be further improved. We combine the pre-trained language model BERT with GloVe embeddings to perform word-level replacement for data augmentation.
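As a rough, simplified sketch of this word-level replacement (GloVe path only; the function names, replacement probability and candidate count are illustrative assumptions, not the exact logic of data_augmentation.py):

import random
import numpy as np

def glove_candidates(word, glove_embs, topk=5):
    """Nearest neighbours of `word` by cosine similarity.
    glove_embs: dict mapping word -> 1-D numpy vector (hypothetical input format)."""
    if word not in glove_embs:
        return []
    v = glove_embs[word]
    sims = {w: float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-8))
            for w, u in glove_embs.items() if w != word}
    return [w for w, _ in sorted(sims.items(), key=lambda kv: -kv[1])[:topk]]

def augment_sentence(tokens, glove_embs, replace_prob=0.4):
    """Swap each token for a sampled candidate with probability replace_prob.
    In the actual approach, single-piece words are replaced using BERT's
    masked-LM predictions instead of GloVe neighbours."""
    out = []
    for tok in tokens:
        cands = glove_candidates(tok, glove_embs)
        if cands and random.random() < replace_prob:
            out.append(random.choice(cands))
        else:
            out.append(tok)
    return out

Each training sentence is augmented several times in this way to build train_aug.tsv.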

Use data_augmentation.py to run data augmentation; the augmented dataset train_aug.tsv is automatically saved into the corresponding ${GLUE_DIR}/${TASK_NAME} directory.


python data_augmentation.py --pretrained_bert_model ${BERT_BASE_DIR} \
                            --glove_embs ${GLOVE_EMB} \
                            --glue_dir ${GLUE_DIR} \
                            --task_name ${TASK_NAME}

Before running data augmentation on GLUE tasks, you should download the GLUE data by running this script and unpack it to some directory GLUE_DIR. TASK_NAME can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, or RTE.

Task-specific Distillation

In task-specific distillation, we re-perform the proposed Transformer distillation to further improve TinyBERT by focusing on learning task-specific knowledge.

Task-specific distillation includes two steps: (1) intermediate layer distillation; (2) prediction layer distillation.
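Step (1) reuses the attention/hidden-state matching sketched in the general distillation section above, while step (2) trains the student on the teacher's output logits. A minimal sketch of such a soft cross-entropy objective (an assumed illustration, not the exact code in task_distill.py):

import torch.nn.functional as F

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between the teacher's soft targets and the student's
    predictions; both tensors are [batch, num_labels]. The temperature and
    the exact reduction are illustrative assumptions."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()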

Step 1: use task_distill.py to run the intermediate layer distillation.


# ${FT_BERT_BASE_DIR} contains the fine-tuned BERT-base model.

python task_distill.py --teacher_model ${FT_BERT_BASE_DIR} \
                       --student_model ${GENERAL_TINYBERT_DIR} \
                       --data_dir ${TASK_DIR} \
                       --task_name ${TASK_NAME} \
                       --output_dir ${TMP_TINYBERT_DIR} \
                       --max_seq_length 128 \
                       --train_batch_size 32 \
                       --num_train_epochs 10 \
                       --aug_train \
                       --do_lower_case

Step 2: use task_distill.py to run the prediction layer distillation.


python task_distill.py --pred_distill \
                       --teacher_model ${FT_BERT_BASE_DIR} \
                       --student_model ${TMP_TINYBERT_DIR} \
                       --data_dir ${TASK_DIR} \
                       --task_name ${TASK_NAME} \
                       --output_dir ${TINYBERT_DIR} \
                       --aug_train \
                       --do_lower_case \
                       --learning_rate 3e-5 \
                       --num_train_epochs 3 \
                       --eval_step 100 \
                       --max_seq_length 128 \
                       --train_batch_size 32

We also provide the distilled TinyBERT models (both 4layer-312dim and 6layer-768dim) for all GLUE tasks for evaluation. Every task has its own folder in which the corresponding model is saved.

TinyBERT(4layer-312dim)

TinyBERT(6layer-768dim)

Evaluation

The task_distill.py script also provides evaluation; run it with the following command:

# ${TINYBERT_DIR} includes the config file, student model and vocab file.

python task_distill.py --do_eval \
                       --student_model ${TINYBERT_DIR} \
                       --data_dir ${TASK_DIR} \
                       --task_name ${TASK_NAME} \
                       --output_dir ${OUTPUT_DIR} \
                       --do_lower_case \
                       --eval_batch_size 32 \
                       --max_seq_length 128

To Dos

  • Evaluate TinyBERT on Chinese tasks.
  • Tiny*: use NEZHA or ALBERT as the teacher in TinyBERT learning.
  • Release better general TinyBERTs.