LaTeX_OCR_PRO

数学公式识别，增强：中文公式、手写公式

Seq2Seq + Attention + Beam Search。结构如下：

1. 搭建环境
2. 开始训练
3. 可视化
4. 部署
5. 评价
6. 更多细节
- 模型实现细节
- 解决方案
7. 致谢
8. 相关项目
9. 引用

1. 搭建环境

python3.5 + tensorflow1.12.2
[可选] latex (latex 转 pdf)
[可选] ghostscript (图片处理)
[可选] magick (pdf 转 png)

如果你想直接训练，不想自己构建数据集：

[可选] 新开一个虚拟环境

virtualenv env35 --python=python3.5
source env35/bin/activate

安装依赖

pip install -r requirements.txt     // cpu 版
pip install -r requirements-gpu.txt // gpu 版

下载数据集
```
git submodule init
git submodule update
```
如果 git 速度太慢，您也可以手动下载数据集，放到 data 目录下。数据集仓库在 https://github.com/LinXueyuanStdio/Data-for-LaTeX_OCR 数据仓库同时托管到 huggingface (linxy/LaTeX_OCR)，欢迎使用！

如果你想自己构建数据集，然后再训练：

Linux

一键安装

make install-linux

或

安装本项目依赖

virtualenv env35 --python=python3.5
source env35/bin/activate
pip install -r requirements.txt

安装 latex (latex 转 pdf)

sudo apt-get install texlive-latex-base
sudo apt-get install texlive-latex-extra

安装 ghostscript

sudo apt-get update
sudo apt-get install ghostscript
sudo apt-get install libgs-dev

安装magick (pdf 转 png)

wget http://www.imagemagick.org/download/ImageMagick.tar.gz
tar -xvf ImageMagick.tar.gz
cd ImageMagick-7.*; \
./configure --with-gslib=yes; \
make; \
sudo make install; \
sudo ldconfig /usr/local/lib
rm ImageMagick.tar.gz
rm -r ImageMagick-7.*

Mac

一键安装

make install-mac

或

安装本项目依赖

sudo pip install -r requirements.txt

LaTeX

我们需要 pdflatex，可以傻瓜式一键安装：http://www.tug.org/mactex/mactex-download.html

安装magick (pdf 转 png)

wget http://www.imagemagick.org/download/ImageMagick.tar.gz
tar -xvf ImageMagick.tar.gz
cd ImageMagick-7.*; \
./configure --with-gslib=yes; \
make;\
sudo make install; \
rm ImageMagick.tar.gz
rm -r ImageMagick-7.*

2. 开始训练

生成小数据集、训练、评价

提供了样本量为 100 的小数据集，方便测试。只需 2 分钟就可以根据 ./data/small.formulas/ 下的公式生成用于训练的图片。

注意：样本量很小，是无法有效训练模型的。这个小数据集仅用于确认代码有没有 bug。如果用于预测，那结果极差，因为数据不够。

一步训练

make small

或

生成数据集

用 LaTeX 公式生成图片，同时保存公式-图片映射文件，生成字典 只用运行一次

# 默认
python build.py
# 或者
python build.py --data=configs/data_small.json --vocab=configs/vocab_small.json

训练

# 默认
python train.py
# 或者
python train.py --data=configs/data_small.json --vocab=configs/vocab_small.json --training=configs/training_small.json --model=configs/model.json --output=results/small/

评价预测的公式

# 默认
python evaluate_txt.py
# 或者
python evaluate_txt.py --results=results/small/

评价数学公式图片

# 默认
python evaluate_img.py
# 或者
python evaluate_img.py --results=results/small/

生成完整数据集、训练、评价

根据公式生成 70,000+ 数学公式图片需要 2-3 个小时

一步训练

make full

或

生成数据集

用 LaTeX 公式生成图片，同时保存公式-图片映射文件，生成字典 只用运行一次
```
python build.py --data=configs/data.json --vocab=configs/vocab.json
```

训练

python train.py --data=configs/data.json --vocab=configs/vocab.json --training=configs/training.json --model=configs/model.json --output=results/full/

评价预测的公式

python evaluate_txt.py --results=results/full/

评价数学公式图片

python evaluate_img.py --results=results/full/

3. 可视化

可视化训练过程

用 tensorboard 可视化训练过程

小数据集

cd results/small
tensorboard --logdir ./

完整数据集

cd results/full
tensorboard --logdir ./

可视化预测过程

打开 visualize_attention.ipynb，一步步观察模型是如何预测 LaTeX 公式的。

或者运行

# 默认
python visualize_attention.py
# 或者
python visualize_attention.py --image=data/images_test/6.png --vocab=configs/vocab.json --model=configs/model.json --output=results/full/

可在 --output 下生成预测过程的注意力图。

4. 部署

部署为 Django 应用

安装部署需要的环境
```
pip install django
```
开启服务
```
python manage.py runserver 0.0.0.0:8010
```

开启图片服务

cd data/images_train
python -m SimpleHTTPServer 8020

使用方法在输入框里依次输入 0.png, 1.png 等等，即可看到结果

5. 评价

指标	训练分数	测试分数
perplexity	1.12	1.13
EditDistance	94.16	93.36
BLEU-4	91.03	90.47
ExactMatchScore	49.30	46.22

perplexity 是越接近1越好，其余3个指标是越大越好。

其中 EditDistance 和 BLEU-4 已达到业内先进水平

将 perplexity 训练到 1.03 左右，ExactMatchScore 还可以再升，应该可以到 70 以上。

机器不太好，训练太费时间了。

6. 更多细节

模型实现细节

包括数据获取、数据处理、模型架构、训练细节
解决方案

包括 “如何可视化 Attention 层”、“在 win10 用 GPU 加速训练” 等等

7. 致谢

十分感谢 Harvard 和 Guillaume Genthial 、Kelvin Xu 等人提供巨人的肩膀。

论文：

8. 相关项目

LaTeX_OCR 的 PyTorch 版: https://github.com/qs956/Latex_OCR_Pytorch by @qs956

9. 引用

BibTeX

@misc{lin2024latex_ocr_pro,
  title={LaTeX_OCR_PRO},
  author={Xueyuan Lin},
  year={2024},
  publisher={GitHub},
  howpublished={\url{https://github.com/LinXueyuanStdio/LaTeX_OCR_PRO}},
}

Name	Name	Last commit message	Last commit date
Latest commit LinXueyuanStdio save Jun 11, 2024 57c8724 · Jun 11, 2024 History 99 Commits
configs	configs	Fix configs	Nov 4, 2019
data @ d8dd211	data @ d8dd211	move 'data' folder as submodule	Oct 2, 2019
data_preprocess	data_preprocess	update README	Aug 25, 2019
django_app	django_app	Update run_model.py	Jul 25, 2019
doc	doc	update README	Aug 23, 2019
latex	latex	latex ast to mathml	Feb 6, 2020
model	model	Fix a bug in image.py	Mar 4, 2024
templates	templates	django app	Jul 25, 2019
work	work	mv data	Jul 31, 2019
.gitignore	.gitignore	Update .gitignore	Aug 9, 2019
.gitmodules	.gitmodules	move 'data' folder as submodule	Oct 2, 2019
GANdemo.ipynb	GANdemo.ipynb	GAN implement	Aug 9, 2019
LICENSE	LICENSE	Initial commit	Jul 20, 2019
README.md	README.md	save	Jun 11, 2024
Untitled.ipynb	Untitled.ipynb	show config	Aug 22, 2019
build.py	build.py	init	Jul 20, 2019
evaluate_img.py	evaluate_img.py	init	Jul 20, 2019
evaluate_txt.py	evaluate_txt.py	init	Jul 20, 2019
makefile	makefile	Update makefile	Aug 9, 2019
manage.py	manage.py	django app	Jul 25, 2019
predict.py	predict.py	init	Jul 20, 2019
requirements-gpu.txt	requirements-gpu.txt	Update requirements-gpu.txt to tensorflow1.15.5	Mar 4, 2024
requirements.txt	requirements.txt	Bump django from 2.2.8 to 2.2.24	Oct 18, 2021
train.ipynb	train.ipynb	Create train.ipynb	Aug 23, 2019
train.py	train.py	huggingface (linxy/LaTeX_OCR)	Jun 11, 2024
train_2.py	train_2.py	add encoder 2	Aug 20, 2019
train_GAN.py	train_GAN.py	remove torch	Aug 20, 2019
visualize_attention.ipynb	visualize_attention.ipynb	init	Jul 20, 2019
visualize_attention.py	visualize_attention.py	Update visualize_attention.py	Aug 23, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LaTeX_OCR_PRO

1. 搭建环境

如果你想直接训练，不想自己构建数据集：

如果你想自己构建数据集，然后再训练：

2. 开始训练

3. 可视化

4. 部署

5. 评价

6. 更多细节

7. 致谢

8. 相关项目

9. 引用

About

Releases

Packages

Contributors 5

Languages

License

LinXueyuanStdio/LaTeX_OCR_PRO

Folders and files

Latest commit

History

Repository files navigation

LaTeX_OCR_PRO

1. 搭建环境

如果你想直接训练，不想自己构建数据集：

如果你想自己构建数据集，然后再训练：

2. 开始训练

3. 可视化

4. 部署

5. 评价

6. 更多细节

7. 致谢

8. 相关项目

9. 引用

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages