Code synthesis has received increasing attention recently, driven by powerful machine learning techniques. To facilitate research on code synthesis, we propose a new evaluation metric, CodeBLEU, which considers not only the surface-level match used by the original BLEU, but also grammatical correctness and logical correctness, by leveraging the abstract syntax tree (AST) and the data-flow structure.
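Concretely, CodeBLEU blends four component scores (n-gram match, weighted n-gram match, syntactic AST match, semantic data-flow match) into a single number with a weighted sum. Below is a minimal sketch of that combination, assuming equal component weights and pre-computed component scores; the function and argument names are illustrative, not the reference implementation.

```python
def code_bleu(ngram, weighted_ngram, ast_match, dataflow_match,
              alpha=0.25, beta=0.25, gamma=0.25, delta=0.25):
    """Blend the four component scores (each assumed to lie in [0, 1])
    into one CodeBLEU score. Equal default weights are an assumption
    for illustration; in practice they can be tuned per task."""
    return (alpha * ngram
            + beta * weighted_ngram
            + gamma * ast_match
            + delta * dataflow_match)

# A candidate with near-perfect surface overlap but a broken AST and
# missing data-flow edges is pulled down by the two structural terms.
print(code_bleu(1.00, 0.95, 0.40, 0.30))  # 0.6625
```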
(1) The current metrics (BLEU, accuracy) for code synthesis tasks, such as code translation, are not suitable enough.
- BLEU: considers only n-gram matches and ignores code structure.
- Accuracy (exact match): too strict for code evaluation.
(2) Programming language vs. natural language
- A limited set of keywords vs. millions of words.
- Tree structure vs. sequential structure.
- Unique instructions vs. ambiguous semantics.
An ideal evaluation metric should consider both grammatical correctness and logical correctness.
We propose a weighted n-gram match and a syntactic AST match to measure grammatical correctness, and introduce a semantic data-flow match to measure logical correctness.
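As a toy sketch of what the semantic data-flow match compares: treat variables as nodes, record "value comes from" edges, and score the overlap between the reference and candidate edge sets. The snippet below is a simplification over straight-line Python assignments; real implementations rely on language-specific parsers.

```python
import ast

def dataflow_edges(code):
    """Collect simplified data-flow edges (source_var -> assigned_var)
    from straight-line assignments."""
    edges = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            sources = [n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)]
            edges.update((s, t) for s in sources for t in targets)
    return edges

ref = "c = a + b\nd = c * a"
hyp = "c = a + b\nd = c * b"   # one data-flow edge differs from the reference

matched = dataflow_edges(ref) & dataflow_edges(hyp)
print(len(matched) / len(dataflow_edges(ref)))  # 0.75 of the reference edges match
```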
Here we give two toy examples to show the qualitative advantages of CodeBLEU over the traditional BLEU score.
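The following is a simplified illustration of the kind of contrast such examples make (the candidate snippets are our own hypothetical strings, not the original toy examples): a variable rename keeps the AST and data flow intact but hurts n-gram overlap, while a candidate with high surface overlap can be syntactically broken.

```python
import ast

reference  = "def add(a, b):\n    return a + b"

# Candidate 1: variables renamed, so exact n-gram overlap drops,
# but the tree structure and data flow match the reference exactly.
candidate1 = "def add(x, y):\n    return x + y"

# Candidate 2: high surface n-gram overlap, but syntactically broken.
candidate2 = "def add(a, b):\n    return a +"

for name, code in [("candidate1", candidate1), ("candidate2", candidate2)]:
    try:
        ast.parse(code)
        print(name, "parses; the AST and data-flow matches can reward it")
    except SyntaxError:
        print(name, "does not parse; the syntactic match penalizes it")
```

Token-level BLEU alone would rank candidate2 close to the reference, because it counts only surface n-grams.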
We conduct experiments on three code generation tasks (text-to-code, code translation, and code refinement) to evaluate the effectiveness of CodeBLEU. The evaluation procedure is as follows:
(1) Translate or generate code with the different systems.
(2) Calculate the CodeBLEU scores of the code synthesis results.
(3) Carry out a human evaluation of the code synthesis results.
(4) Calculate the correlation coefficient between the CodeBLEU scores and the human evaluation scores (a sketch follows this list).
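A minimal sketch of step (4), assuming the per-system metric scores and mean human ratings are collected into parallel lists; the numbers below are the CodeBLEU and human-score columns of the text-to-code results table.

```python
from scipy.stats import pearsonr

# Per-system scores for the text-to-code task (from the results table).
codebleu_scores = [18.04, 21.71, 24.95, 30.96]
human_scores    = [1.88, 1.99, 2.56, 3.13]

r, p_value = pearsonr(codebleu_scores, human_scores)
print(f"Pearson r = {r:.3f}")  # ~0.978 for these four points
```

The exact reported value may differ slightly depending on how the scores are aggregated before the correlation is computed.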
(1) Results of all baselines on the three tasks, evaluated by BLEU, accuracy (exact match), CodeBLEU, and human evaluation scores.
- Text-to-code
System | BLEU | ACC | CodeBLEU | Human score |
---|---|---|---|---|
Sys1 | 12.02 | 3.05 | 18.04 | 1.88 |
Sys2 | 16.82 | 10.50 | 21.71 | 1.99 |
Sys3 | 21.18 | 17.35 | 24.95 | 2.56 |
Sys4 | 26.45 | 20.10 | 30.96 | 3.13 |
- Code Translation
System | BLEU | ACC | CodeBLEU | Human score |
---|---|---|---|---|
Sys1 | 44.53 | 13.20 | 45.71 | 3.25 |
Sys2 | 54.84 | 31.75 | 61.14 | 3.77 |
Sys3 | 80.18 | 60.20 | 82.74 | 4.04 |
Sys4 | 81.44 | 63.50 | 84.75 | 4.25 |
- Code Refinement
System | BLEU | ACC | CodeBLEU | Human score |
---|---|---|---|---|
Sys1 | 90.35 | 3.00 | 80.81 | 1.38 |
Sys2 | 91.40 | 7.01 | 82.16 | 1.54 |
Sys3 | 92.80 | 17.60 | 83.85 | 2.02 |
(2) Comparison of the Pearson correlation coefficients between the human evaluation scores and three different metrics; the gains in parentheses are relative to the BLEU row.
Metric | Text-to-code | Code Translation | Code Refinement |
---|---|---|---|
BLEU | 0.967 | 0.940 | 0.923 |
ACC | 0.912 | 0.968 | 0.999 |
CodeBLEU | 0.977 (+1.0%) | 0.970 (+3.0%) | 0.979 (+5.6%) |
(3) Results of the Pearson correlation between the human evaluation scores and the scores given by each individual component.
Components | Text-to-code | Code Translation | Code Refinement |
---|---|---|---|
n-gram | 0.967 | 0.940 | 0.923 |
weighted n-gram | 0.960 | 0.934 | 0.985 |
syntactic AST | 0.985 | 0.977 | 0.967 |
semantic data-flow | 0.978 | 0.974 | 0.983 |
CodeBLEU | 0.977 | 0.970 | 0.979 |