PyTorch implementations of BERT-based Spelling Error Correction Models.

Last update: Dec 30, 2022

Overview

BertBasedCorrectionModels

BERT-based text correction models, implemented in PyTorch.

Data preparation

Download the SIGHAN datasets from http://nlp.ee.ncu.edu.tw/resource/csc.html
Extract the archive and copy every .sgml file in it to the datasets/csc/ directory
Copy SIGHAN15_CSC_TestInput.txt and SIGHAN15_CSC_TestTruth.txt to the datasets/csc/ directory
Download https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml to the datasets/csc directory

Make sure the following files are present in datasets/csc:

train.sgml
B1_training.sgml
C1_training.sgml  
SIGHAN15_CSC_A2_Training.sgml  
SIGHAN15_CSC_B2_Training.sgml  
SIGHAN15_CSC_TestInput.txt
SIGHAN15_CSC_TestTruth.txt
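
The file list above can be checked before training. The following is a minimal sketch, not part of the project; the helper name `missing_files` is made up for illustration, and the default path assumes the script is run from the project root.

```python
from pathlib import Path

# File names copied from the list above.
REQUIRED_FILES = [
    "train.sgml",
    "B1_training.sgml",
    "C1_training.sgml",
    "SIGHAN15_CSC_A2_Training.sgml",
    "SIGHAN15_CSC_B2_Training.sgml",
    "SIGHAN15_CSC_TestInput.txt",
    "SIGHAN15_CSC_TestTruth.txt",
]


def missing_files(data_dir="datasets/csc"):
    """Return the required file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing from datasets/csc:", ", ".join(missing))
    else:
        print("All required files are present.")
```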

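Environment setup

Use an existing Python environment or (recommended) create a new one with conda create -n <env_name> python=3.7
Clone this project and change into its root directory
Install the dependencies with pip install -r requirements.txt
If you hit an error saying the GLIBC version is too old, install an older version of openCC (for example 1.1.0) instead; upgrading GLIBC is error-prone and not recommended
Add this directory to PYTHONPATH in the current terminal: export PYTHONPATH=.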
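
When the project code is imported from a notebook or another script rather than from a terminal, a roughly equivalent effect to export PYTHONPATH=. can be had inside Python. This is only a sketch for the current process; `add_project_root` is a hypothetical helper, not something the project provides.

```python
import os
import sys


def add_project_root(root="."):
    """Put the project root at the front of sys.path for this process."""
    root = os.path.abspath(root)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


if __name__ == "__main__":
    # Run from the project root, before importing any project modules.
    print("Added to sys.path:", add_project_root("."))
```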

Training

Run the following command to train a model; the data is preprocessed automatically on the first run.

python tools/train_csc.py --config_file csc/train_SoftMaskedBert.yml

Different configuration files train different models. The following configuration files are currently supported:

train_bert4csc.yml
train_macbert4csc.yml
train_SoftMaskedBert.yml

Adjust the parameters in the configuration file as needed.
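
To train all three models in turn, the command above can be generated once per configuration file. The sketch below only builds and prints the commands; it assumes the other two files sit under the same csc/ prefix as the SoftMaskedBert example, which this README does not state explicitly.

```python
# Config file names from the list above; the "csc/" prefix follows the
# example command in this section and is an assumption for the other two.
CONFIGS = [
    "train_bert4csc.yml",
    "train_macbert4csc.yml",
    "train_SoftMaskedBert.yml",
]


def train_command(config_name):
    """Build the argument list for one training run."""
    return ["python", "tools/train_csc.py", "--config_file", "csc/" + config_name]


if __name__ == "__main__":
    for name in CONFIGS:
        # Printed only; pass the list to subprocess.run(..., check=True) to execute it.
        print(" ".join(train_command(name)))
```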

Results

SoftMaskedBert

component	sentence level acc	p	r	f
Detection	0.5045	0.8252	0.8416	0.8333
Correction	0.8055	0.9395	0.8748	0.9060

BERT-based models

char level

MODEL	p	r	f
BERT4CSC	0.9269	0.8651	0.8949
MACBERT4CSC	0.9380	0.8736	0.9047

sentence level

model	acc	p	r	f
BERT4CSC	0.7990	0.8482	0.7214	0.7797
MACBERT4CSC	0.8027	0.8525	0.7251	0.7836
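
Assuming the p, r and f columns denote precision, recall and F1 (the README does not define them), the f values in the tables above can be reproduced from p and r. A small sketch that recomputes them; small differences in the last digit are expected because p and r are themselves rounded.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (label, p, r, reported f) copied from the tables above.
ROWS = [
    ("SoftMaskedBert detection", 0.8252, 0.8416, 0.8333),
    ("SoftMaskedBert correction", 0.9395, 0.8748, 0.9060),
    ("BERT4CSC char level", 0.9269, 0.8651, 0.8949),
    ("MACBERT4CSC char level", 0.9380, 0.8736, 0.9047),
    ("BERT4CSC sentence level", 0.8482, 0.7214, 0.7797),
    ("MACBERT4CSC sentence level", 0.8525, 0.7251, 0.7836),
]

for label, p, r, reported in ROWS:
    print(f"{label}: computed {f1(p, r):.4f}, reported {reported:.4f}")
```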

Inference

Method 1: use the inference script

python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --texts "我今天很高心"
# or pass the path to a text file with one sentence per line
python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --text_file /ml/data/text.txt

where /ml/data/text.txt contains:

我今天很高心
你这个辣鸡模型只能做错别字纠正
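
The one-sentence-per-line format can be read into a list of strings as sketched below. Skipping blank lines is an assumption made here; how inference.py itself treats them is not shown in this README, and `read_texts` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def read_texts(path):
    """Read one sentence per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    # Demo with a temporary file so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "text.txt"
        sample.write_text("我今天很高心\n你这个辣鸡模型只能做错别字纠正\n", encoding="utf-8")
        print(read_texts(sample))
```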

Method 2: call the model directly

texts = ['今天我很高心', '测试', '继续测试']
model.predict(texts)
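
The return format of model.predict is not shown here. Assuming it yields one corrected string per input, with the same length as the input, the changed characters can be listed as in the sketch below. The corrected string is hard-coded for illustration and is not actual model output; `diff_chars` is a hypothetical helper.

```python
def diff_chars(source, corrected):
    """List (index, wrong_char, corrected_char) for each changed position.

    Assumes a same-length, character-for-character correction.
    """
    if len(source) != len(corrected):
        raise ValueError("source and corrected text must have the same length")
    return [(i, s, c) for i, (s, c) in enumerate(zip(source, corrected)) if s != c]


# Hard-coded example pair, not model output.
print(diff_chars("今天我很高心", "今天我很高兴"))  # [(5, '心', '兴')]
```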

Method 3: export the BERT weights and load them with transformers or pycorrector

Export the BERT weights with convert_to_pure_state_dict.py
For the remaining steps, see https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/README.md
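
What convert_to_pure_state_dict.py does internally is not described in this README. One common way to turn a wrapped checkpoint into a plain BERT state dict is to keep only the keys under the wrapper's BERT attribute and drop that prefix. The sketch below illustrates the idea on a toy dictionary; the "bert." prefix and the key names are hypothetical, and real values would be tensors.

```python
def strip_prefix(state_dict, prefix):
    """Keep only entries whose key starts with prefix, with the prefix removed."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Toy stand-in for a checkpoint's state dict (hypothetical key names).
wrapped = {
    "bert.embeddings.word_embeddings.weight": "tensor-0",
    "bert.encoder.layer.0.output.dense.bias": "tensor-1",
    "detector.weight": "tensor-2",
}

for key in sorted(strip_prefix(wrapped, "bert.")):
    print(key)
```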

Citation

If you use this project in your research, please cite it as follows:

@article{cai2020pre,
  title={BERT Based Correction Models},
  author={Cai, Heng and Chen, Dian},
  journal={GitHub. Note: https://github.com/gitabtion/BertBasedCorrectionModels},
  year={2020}
}

License

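The source code is licensed under the Apache License 2.0 and may be used commercially free of charge. Please include a link to this project and the license in your product documentation. This project is protected by copyright law; infringement will be pursued.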

Owner

Heng Cai
