Chinese question generator, using the Delta Reading Comprehension Dataset (DRCD)

Last update: Oct 22, 2021

Overview

Transformer QG on DRCD

The model input is built as follows: we integrate the context C and the answer A into a new context C', marking the answer span with [HL] tokens, in the following form.
C' = [c1, c2, ..., [HL], a1, ..., a|A|, [HL], ..., c|C|]
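At the string level, the format above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `highlight_answer` and the use of a character offset for the answer position are assumptions.

```python
HL_TOKEN = "[HL]"  # highlight marker, as in the C' form above


def highlight_answer(context: str, answer: str, answer_start: int) -> str:
    """Wrap the answer span inside the context with [HL] markers."""
    answer_end = answer_start + len(answer)
    # Guard against an offset that does not point at the answer text.
    if context[answer_start:answer_end] != answer:
        raise ValueError("answer does not match the context at answer_start")
    return (
        context[:answer_start]
        + HL_TOKEN + answer + HL_TOKEN
        + context[answer_end:]
    )


context = "伊隆·里夫·馬斯克是一名企業家和商業大亨"
answer = "伊隆·里夫·馬斯克"
highlighted = highlight_answer(context, answer, context.index(answer))
print(highlighted)
```

The resulting string has the same shape as the `context` value sent to the API server in the request example.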

Proposed by Ying-Hong Chan & Yao-Chung Fan (2019), A Recurrent BERT-based Model for Question Generation.

We also have an English QG project: Transformer-QG-on-SQuAD

Features

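A complete pipeline, from fine-tuning to model evaluation
Support for many state-of-the-art language models
Built-in Flask, so the model can quickly be served as an API server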

DRCD dataset

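The Delta Reading Comprehension Dataset (DRCD) is a general-domain Traditional Chinese machine reading comprehension dataset. It contains 10,014 paragraphs compiled from 2,108 Wikipedia articles, with more than 30,000 questions annotated on those paragraphs.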

Available models

BART (based on uer/bart-base-chinese-cluecorpussmall)

Use in Transformers

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  
tokenizer = AutoTokenizer.from_pretrained("p208p2002/bart-drcd-qg-hl")

model = AutoModelForSeq2SeqLM.from_pretrained("p208p2002/bart-drcd-qg-hl")
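Once the model is loaded, a question could be generated roughly as follows. This is a hedged sketch, not the repository's inference code: the helper names `generate_question` and `clean_decoded`, the beam size, and the maximum length are assumptions, and the space-stripping step assumes the tokenizer is BERT-style and decodes Chinese characters separated by spaces. Calling `generate_question` downloads the model from the Hugging Face Hub.

```python
def clean_decoded(text: str) -> str:
    """Remove the spaces a BERT-style tokenizer puts between decoded characters."""
    return text.replace(" ", "")


def generate_question(context_with_hl: str,
                      model_name: str = "p208p2002/bart-drcd-qg-hl",
                      max_length: int = 64,   # assumed value
                      num_beams: int = 3) -> str:  # assumed value
    """Generate one question for a context whose answer is wrapped in [HL]."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    encoded = tokenizer(context_with_hl, return_tensors="pt")
    # Pass only the tensors generate() needs; a BERT-style tokenizer may also
    # return token_type_ids, which a BART model does not accept.
    output_ids = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_length=max_length,
        num_beams=num_beams,
    )
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return clean_decoded(decoded)


# Example call (downloads the model on first use):
# print(generate_question("[HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨"))
```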

Experiments

Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L
BART-HLSQG	34.25	27.70	22.43	18.13	23.58	36.88

Environment requirements

Development was done entirely on Ubuntu.

If you don't have PyTorch 1.6+, please install or update it first:

https://pytorch.org/get-started/locally/

Install packages: pip install -r requirements.txt
Set up the scorer: python setup_scorer.py
Download the dataset: python init_dataset.py

Training

Seq2Seq LM

usage: train_seq2seq_lm.py [-h]
                           [--base_model {bert-base-chinese,uer/bart-base-chinese-cluecorpussmall,p208p2002/bart-drcd-qg-hl}]
                           [-d {drcd}] [--batch_size BATCH_SIZE]
                           [--epoch EPOCH] [--lr LR] [--dev DEV] [--server]
                           [--run_test] [-fc FROM_CHECKPOINT]

optional arguments:
  -h, --help            show this help message and exit
  --base_model {bert-base-chinese,uer/bart-base-chinese-cluecorpussmall,p208p2002/bart-drcd-qg-hl}
  -d {drcd}, --dataset {drcd}
  --batch_size BATCH_SIZE
  --epoch EPOCH
  --lr LR
  --dev DEV
  --server
  --run_test
  -fc FROM_CHECKPOINT, --from_checkpoint FROM_CHECKPOINT
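To make the flags concrete, the sketch below assembles one training invocation from Python. Only the flag names and the `--base_model` choices come from the usage text above; the helper `build_train_command` and the hyperparameter values are hypothetical and chosen purely for illustration.

```python
import shlex
import sys

# Choices copied from the --base_model help text above.
BASE_MODELS = (
    "bert-base-chinese",
    "uer/bart-base-chinese-cluecorpussmall",
    "p208p2002/bart-drcd-qg-hl",
)


def build_train_command(base_model, batch_size, epoch, lr, run_test=False):
    """Return the argv list for one train_seq2seq_lm.py run."""
    if base_model not in BASE_MODELS:
        raise ValueError(f"unsupported base model: {base_model}")
    cmd = [
        sys.executable, "train_seq2seq_lm.py",
        "--base_model", base_model,
        "--dataset", "drcd",
        "--batch_size", str(batch_size),
        "--epoch", str(epoch),
        "--lr", str(lr),
    ]
    if run_test:
        cmd.append("--run_test")
    return cmd


# Hypothetical hyperparameters, for illustration only.
cmd = build_train_command("uer/bart-base-chinese-cluecorpussmall",
                          batch_size=8, epoch=10, lr=5e-5, run_test=True)
print(shlex.join(cmd[1:]))  # the arguments as they would be typed after "python"
# To launch the run: subprocess.run(cmd, check=True)
```

Validating the model name before launching catches a typo immediately instead of after the script has started.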

Run as API server

From pre-trained (recommended)

python train_seq2seq_lm.py --server --base_model p208p2002/bart-drcd-qg-hl

From your own checkpoint

python train_xxx_lm.py --server --base_model YOUR_BASE_MODEL --from_checkpoint FROM_CHECKPOINT

Request example

curl --location --request POST 'http://127.0.0.1:5000/' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'context=[HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨'

{"predict": "哪一個人是一名企業家和商業大亨?"}
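The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_request` and `ask_server` are assumptions, and it presumes the server replies with a JSON object containing a `predict` field, as in the sample response.

```python
import json
import urllib.parse
import urllib.request

API_URL = "http://127.0.0.1:5000/"  # address used in the curl example


def build_request(context_with_hl: str, url: str = API_URL) -> urllib.request.Request:
    """Build a form-encoded POST request carrying the highlighted context."""
    body = urllib.parse.urlencode({"context": context_with_hl}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def ask_server(context_with_hl: str, url: str = API_URL) -> str:
    """Send the request and return the "predict" field of the JSON reply."""
    with urllib.request.urlopen(build_request(context_with_hl, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))["predict"]


request = build_request("[HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨")
# With the server running:
# print(ask_server("[HL]伊隆·里夫·馬斯克[HL]是一名企業家和商業大亨"))
```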


Owner

Philip
