Code for evaluating Japanese pretrained models provided by NTT Ltd.

Overview

japanese-dialog-transformers

日本語の説明文はこちら

This repository provides the information necessary to evaluate the Japanese Transformer Encoder-decoder dialogue model provided by NTT on fairseq.


Table of contents.
Update log
Notice for using the codes
Model download
Quick start
LICENSE

Update log

  • 2021/09/17 Published dialogue models (fairseq version japanese-dialog-transformer-1.6B) and evaluation codes.

Notice for using the codes

The dialogue models provided are for evaluation and verification of model performance. Before downloading these models, please read the LICENSE and CAUTION documents. You can download and use these models only if you agree to the following three points.

  1. LICENSE
  2. To be used only for the purpose of evaluation and verification of this model, and not for the purpose of providing dialogue service itself.
  3. Take all possible care and measures to prevent damage caused by the generated text, and take responsibility for the text you generate, whether appropriate or inappropriate.

BibTeX

When publishing results using this model, please cite the following paper.

@misc{sugiyama2021empirical,
      title={Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems}, 
      author={Hiroaki Sugiyama and Masahiro Mizukami and Tsunehiro Arimoto and Hiromi Narimatsu and Yuya Chiba and Hideharu Nakajima and Toyomi Meguro},
      year={2021},
      eprint={2109.05217},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Model download


Quick start

The models published on this page can be used for utterance generation and additional fine-tuning using the scripts included in fairseq.

Install dependent libraries

The verification environment is as follows.

  • Python 3.8.10 on miniconda
  • CUDA 11.1/10.2
  • Pytorch 1.8.2 (For the installation commands, be sure to check the official page. We recommend using pip.)
  • fairseq 1.0.0(validated commit ID: 8adff65ab30dd5f3a3589315bbc1fafad52943e7)
  • sentencepiece 0.19.6

When installing fairseq, please check the official page and install the latest version. Normal pip install will only install the older version 0.10.2. If you want to run finetune with your own data, you need to install the standalone version of sentencepiece.

fairseq-interactive

Since fairseq-interactive does not have any way to keep the context, it generates responses based on the input sentences only, which is different from the setting that uses the context in Finetune and the paper experiment, so it is easy to generate inappropriate utterances.

In the following command, a small value (10) is used for beam and nbest (number of output candidates) to make the results easier to read. In actual use, it would be better to set the number to 20 or more for better results.

fairseq-interactive data/sample/bin/ \
 --path checkpoints/persona50k-flat_1.6B_33avog1i_4.16.pt\
 --beam 10 \
 --seed 0 \
 --min-len 10 \
 --source-lang src \
 --target-lang dst \
 --tokenizer space \
 --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
--no-repeat-ngram-size 3 \
--nbest 10 \
--sampling \
--sampling-topp 0.9 \
--temperature 1.0 

dialog.py

The system utilizes a context of about four utterances, which is equivalent to the settings used in the Finetune and paper experiments.

python scripts/dialog.py data/sample/bin/ \
 --path checkpoints/dials5_1e-4_1li20zh5_tw5.143_step85.pt \
 --beam 80 \
 --min-len 10 \
 --source-lang src \
 --target-lang dst \
 --tokenizer space \
 --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
 --no-repeat-ngram-size 3 \
 --nbest 80 \
 --sampling \
 --sampling-topp 0.9 \
 --temperature 1.0 \
 --show-nbest 5

Perplexity calculation on a specific data set

Computes the perplexity (ppl) on a particular dataset. The lower the ppl, the better the model can represent the interaction on that dataset.

fairseq-validate $DATA_PATH \
 --path $MODEL_PATH \
 --task translation \
 --source-lang src \
 --target-lang dst \
 --batch-size 2 \ 
 --ddp-backend no_c10d \
 --valid-subset test \ 
 --skip-invalid-size-inputs-valid-test 

Finetuning with Persona-chat and EmpatheticDialogues

By finetuning the Pretrained model with PersonaChat or EmpatheticDialogues, you can create a model that is almost identical to the finetuned model provided.

If you have your own dialogue data, you can place the data in the same format in data/*/raw and perform Finetune on that data. Please note, however, that we do not allow the release or distribution of Finetune models under the LISENCE. You can release your own data and let a third party run Finetune from this model.

Downloading and converting datasets

Convert data from Excel to a simple input statement (src) and output statement (dst) format, where the same row in src and dst is the corresponding input/output pair. 50000 rows are split and output as a train.

python scripts/extract_ed.py japanese_empathetic_dialogues.xlsx data/empdial/raw/

License

LISENCE

Owner
NTT Communication Science Laboratories
NTT Communication Science Laboratories
Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

1.1k Dec 27, 2022
File-based TF-IDF: Calculates keywords in a document, using a word corpus.

File-based TF-IDF Calculates keywords in a document, using a word corpus. Why? Because I found myself with hundreds of plain text files, with no way t

Jakob Lindskog 1 Feb 11, 2022
An ultra fast tiny model for lane detection, using onnx_parser, TensorRTAPI, torch2trt to accelerate. our model support for int8, dynamic input and profiling. (Nvidia-Alibaba-TensoRT-hackathon2021)

Ultra_Fast_Lane_Detection_TensorRT An ultra fast tiny model for lane detection, using onnx_parser, TensorRTAPI to accelerate. our model support for in

steven.yan 121 Dec 27, 2022
Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations

Expediting Vision Transformers via Token Reorganizations This repository contain

Youwei Liang 101 Dec 26, 2022
Dust model dichotomous performance analysis

Dust-model-dichotomous-performance-analysis Using a collated dataset of 90,000 dust point source observations from 9 drylands studies from around the

1 Dec 17, 2021
Transcribing audio files using Hugging Face's implementation of Wav2Vec2 + "chain-linking" NLP tasks to combine speech-to-text with downstream tasks like translation and summarisation.

PART 2: CHAIN LINKING AUDIO-TO-TEXT NLP TASKS 2A: TRANSCRIBE-TRANSLATE-SENTIMENT-ANALYSIS In notebook3.0, I demo a simple workflow to: transcribe a lo

Chua Chin Hon 30 Jul 13, 2022
KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정한 코드입니다.

KoBERTopic 모델 소개 KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정했습니다. 기존 BERTopic : https://github.com/MaartenGr/BERTopic/tree/05a6790b21009d

Won Joon Yoo 26 Jan 03, 2023
Search msDS-AllowedToActOnBehalfOfOtherIdentity

前言 现在进行RBCD的攻击手段主要是搜索mS-DS-CreatorSID,如果机器的创建者是我们可控的话,那就可以修改对应机器的msDS-AllowedToActOnBehalfOfOtherIdentity,利用工具SharpAllowedToAct-Modify 那我们索性也试试搜索所有计算机

Jumbo 26 Dec 05, 2022
Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Japanese-LUW-Tokenizer Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫 Basic Usage from transformers import RemBertToken

Koichi Yasuoka 3 Dec 22, 2021
Nateve compiler developed with python.

Adam Adam is a Nateve Programming Language compiler developed using Python. Nateve Nateve is a new general domain programming language open source ins

Nateve 7 Jan 15, 2022
Converts python code into c++ by using OpenAI CODEX.

🦾 codex_py2cpp 🤖 OpenAI Codex Python to C++ Code Generator Your Python Code is too slow? 🐌 You want to speed it up but forgot how to code in C++? ⌨

Alexander 423 Jan 01, 2023
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
Mapping a variable-length sentence to a fixed-length vector using BERT model

Are you looking for X-as-service? Try the Cloud-Native Neural Search Framework for Any Kind of Data bert-as-service Using BERT model as a sentence enc

Han Xiao 11.1k Jan 01, 2023
A collection of models for image - text generation in ACM MM 2021.

Bi-directional Image and Text Generation UMT-BITG (image & text generator) Unifying Multimodal Transformer for Bi-directional Image and Text Generatio

Multimedia Research 63 Oct 30, 2022
Voice Assistant inspired by Google Assistant, Cortana, Alexa, Siri, ...

author: @shival_gupta VoiceAI This program is an example of a simple virtual assitant It will listen to you and do accordingly It will begin with wish

Shival Gupta 1 Jan 06, 2022
BERT Attention Analysis

BERT Attention Analysis This repository contains code for What Does BERT Look At? An Analysis of BERT's Attention. It includes code for getting attent

Kevin Clark 401 Dec 11, 2022
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation This is the official PyTorch implementation

Salesforce 564 Jan 08, 2023
基于pytorch_rnn的古诗词生成

pytorch_peot_rnn 基于pytorch_rnn的古诗词生成 说明 config.py里面含有训练、测试、预测的参数,更改后运行: python main.py 预测结果 if config.do_predict: result = trainer.generate('丽日照残春')

西西嘛呦 3 May 26, 2022
Easy-to-use CPM for Chinese text generation

CPM 项目描述 CPM(Chinese Pretrained Models)模型是北京智源人工智能研究院和清华大学发布的中文大规模预训练模型。官方发布了三种规模的模型,参数量分别为109M、334M、2.6B,用户需申请与通过审核,方可下载。 由于原项目需要考虑大模型的训练和使用,需要安装较为复杂

382 Jan 07, 2023
A large-scale (194k), Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

MedMCQA MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering A large-scale, Multiple-Choice Question Answe

MedMCQA 24 Nov 30, 2022