For Tok-k passages that have passed through the Bi-Encoder Retrival, ReRank is performed using CrossEncoder.

Overview

Cross-Encoder-with-Bi-Encoder

For Tok-k passages that have passed through the Bi-Encoder Retrival, ReRank is performed using CrossEncoder.

Data

Data used by "Open-Domain Question Answering Competition" hosted by Aistages, and copyrights can be used under CC-BY-2.0.

+- data
|   +- train_dataset
    |   +- train
        |   +- dataset.arrow
        |   +- dataset_info.json
        |   +- indices.arrow
        |   +- state.json
    |   +- validataion
        |   +- dataset.arrow
        |   +- dataset_info.json
        |   +- indices.arrow
        |   +- state.json
    |   +- dataset_dict.json
|   +- test_dataset
    |   +- validation
        |   +- dataset.arrow
        |   +- dataset_info.json
        |   +- indices.arrow
        |   +- state.json
    |   +- dataset_dict.json
|   +- wikipedia_documents.json
  • Wikipedia data can be uploaded to the folder location above and used.
!git clone https://github.com/jjonhwa/Cross-Encoder-with-Bi-Encoder.git # git clone
% cd ./Cross-Encoder-with-Bi-Encoder/_data                              # change directory (input your own path)

!gdown --id 1O-kxt4DupOibNhkwmg3luTLt07faRgvO # wiki data upload        # download wikipedia data

Setup

Dependencies

  • datasets==1.5.0
  • transformers==4.5.0
  • tqdm==4.41.1
  • pandas==1.1.4
  • CUDA==11.0

Install Requirements

bash install_requirements.sh

Hardware

  • GPU : Tesla V100 (32GB)

Checkpoint

  • You can check the code in the Colab environment using Demo.
  • It does not work in Colab Basic.

What can we do to improve the performance of Retriever?

1. Explore the data set production process.

  • Sparse Embedding may be better in tasks for viewing Passage and creating a question (if there is an annotation bias), such as SQuAD.
  • In most other data, documents can be extracted with higher accuracy if Dense Passage Retreat is used.

2. Sparse Embedding & Dense Embedding

  • Most of the content was knowledge obtained by referring to Paper, and based on this, it led to improvement in Retriever performance.
  • Prior to the application of DPR, in the case of 'KLUE MRC database' in which datasets were configured in the same manner as SQuAD, it would be better to utilize techniques such as Sparse embedding technique BM25 compared to DPR.
  • Actually, until ReRank Strategy was applied, the highest performance was achieved with elastic search based on BM25.
  • When only biencoder was used, Retrieval accuracy was far below elastic search in the 'KLUE MRC competition'
  • Retrieval Accuracy in our Data
Top-5 Top-50 Top-100
Elastic Search 0.852 0.945 0.962
DPR Bi-Encoder - 0.775 0.85

3. ReRank Strategy with CrossEncoder (In-Batch_Negative Samples)

  • Our purpose is to bring high performance from KLUE MRC competition to End-to-End from Retrieval to Reader. From this, the ReRank strategy using Cross Encoder was used.
  • In addition, when implementing Cross Encoder, the key point is to extract a negative sample within Batch and use it to calculate loss.
  • After extracting the Retrival Passage of the Top-500 using the Bi-Encoder, only a small number of Passages are finally extracted by returning to the Cross Encoder.
  • Retrieval Accuracy in our Data
Top-5 Top-50 Top-100
Elastic Search 0.852 0.945 0.962
DPR without CrossEncoder - 0.775 0.85
DPR with CrossEncoder 0.825 0.95 -

4. Ensemble

  • In this process, the contents of CrossEncoder were mainly written, and the contents of Ensemble were omitted.
  • An experiment was conducted assuming that performance improvement would be achieved from different types of Retrival combinations by conducting Ensemble using Sparse Embedding and Dense Embedding.
  • Top-100 was selected using Elastic Search and Top-100 was selected using DPR and Cross Encoder, and the final output score was calculated by combining them 1 to 1 and normalizing them.
  • When the final Reader model was tested, when Top-5 was input, the performance was the best, so the experiment was conducted after limiting the number of passages to be returned to five.
  • Actually, the performance has improved significantly, and the retrival accuracy is as follows.
Top-5 Top-50 Top-100
Elastic Search 0.852 0.945 0.962
DPR with CrossEncoder 0.825 0.95 -
Ensemble 0.9082 - -

Train CrossEncoder & BiEncoder

  • Learn crossencoder and biencoder and store them.
  • Modify only the data path to match your data. (find "your_dataset_path")
python train.py --encoder 'cross' --output_directory './save_directory/'

or

python train.py --encoder 'bi' --output_directory './save_directory/'

Run ReRank

  • It precedes creating an encoder using crossencoder and biencoder. (Before Run ReRank, you have to run 'train.py' to make)
  • Modify only the data path to match your data. (find "your_dataset_path")
python rerank.py --input_directory './save_directory/'

Run Retriever Demo

  • Top 500 Passages are Retrieved from about 60000 data using Biencoder, and Top 5 is finally retrieved using CrossEncoder.
  • Passage Embedding about wiki data, Cross Encoder and Bi-Encoder can be downloaded and utilized
  • Open In Colab
SkyPort console user terminal written in python

SkyPort terminal implemented as a console script written in Python Description Sky Port is an universal bus between user software and compute resource

Sky Workflows 1 Oct 23, 2022
Irrigation Component V4 providing support for a custom card

Irrigation Component V4 This release sees the delivery of a custom card https://github.com/petergridge/irrigation_card to render the program options s

12 Oct 28, 2022
Basit bir sunucu - istemci örneği

basitSunucuistemci Aşağıdaki adresteki uygulamadaki process kapanmama sorununun çözülmesi ile oluşturulmuş yeni depo https://github.com/pricheal/pytho

Ali Orhun Akkirman 10 Dec 27, 2022
Meower a social media platform written in Scratch 3.0 and Python

Meower Meower is a social media platform written in Scratch 3.0 and Python, ported to HTML for self-hosting. Try Beta 4.6 Changelog for 4.6 Start impl

Meower Media Co. 23 Dec 02, 2022
A simple language for new programmers and a toy language ;)

Yell An extremely simple, yet powerful language for new programmers, as well as a toy language ;) Explore the docs » Report Bug · Request Feature Yell

Yell 4 Dec 28, 2021
Labspy06 With Python

Labspy06 Profil Nama : Nafal mumtaz fuadi Nim : 312110457 Kelas : T1.21.A.2 Latihan 1 Ubahlah kode dibawah ini menjadi fungsi menggunakan lambda impor

Mas Nafal 1 Dec 12, 2021
Tethered downgrade 64-bit iDevices vulnerable to checkm8

ra1nstorm Tethered downgrade 64-bit iDevices vulnerable to checkm8 Since the purpose of this tool is to tethered downgrade a device, after restoring p

mini_exploit 65 Nov 08, 2022
Minecraft Multi-Server Pinger Discord Embed

Minecraft Network Pinger Minecraft Multi-Server Pinger Discord Embed What does this bot do? It sends an embed and uses mcsrvstat API and checks if the

YungHub 2 Jan 05, 2022
Stop python warnings, no matter what!

SHUTUP - Stop python warnings, no matter what! Sometimes you just can't mute python warnings. Use this library to solve this. Installation pip install

80 Jan 04, 2023
Automated Changelog/release note generation

Quickly generate changelogs and release notes by analysing your git history. A tool written in python, but works on any language.

Documatic 95 Jan 03, 2023
STAC in Jupyter Notebooks

stac-nb STAC in Jupyter Notebooks Install pip install stac-nb Usage To use stac-nb in a project, start Jupyter Lab (jupyter lab), create a new noteboo

Darren Wiens 32 Oct 04, 2022
Online-update est un programme python permettant de mettre a jour des dossier et de fichier depuis une adresse web.

Démarrage rapide Online-update est un programme python permettant de mettre a jour des dossier et de fichier depuis une adresse web. Mode préconfiguré

pf4 2 Nov 26, 2021
For my Philips Airpurifier AC3259/10

Philips-Airpurifier For my Philips Airpurifier AC3259/10 I will try to keep this code

AcidSleeper 7 Feb 26, 2022
Film-dosimetry - Film dosimetry for DUVS

film-dosimetry Film dosimetry for DUVS Hi David and Joe, here we go this is a te

Christine L Kuryla 3 Jan 20, 2022
The Ultimate Widevine Content Ripper (KEY Extract + Download + Decrypt) is REBORN

NARROWVINE-REBORN ** UPDATE 21.12.01 ** As expected Google patched its ChromeCDM Whitebox exploit by Satsuoni with a force-update on the ChromeCDM. Th

Vank0n 104 Dec 07, 2022
Tutor plugin for integration of Open edX with a Richie course catalog

Richie plugin for Tutor This is a plugin to integrate Richie, the learning portal CMS, with Open edX. The integration takes the form of a Tutor plugin

Overhang.IO 2 Sep 08, 2022
全局指针统一处理嵌套与非嵌套NER

GlobalPointer 全局指针统一处理嵌套与非嵌套NER。 介绍 博客:https://kexue.fm/archives/8373 效果 人民日报NER 验证集F1 测试集F1 训练速度 预测速度 CRF 96.39% 95.46% 1x 1x GlobalPointer (w/o RoPE

苏剑林(Jianlin Su) 183 Jan 06, 2023
A telegram bot which programed to countdown.

countdown-vi this is a telegram bot which programed to countdown. usage well, first you should specify a exact interval. there is 5 column, very first

Arya Shabane 3 Feb 15, 2022
Herramienta para pentesting web.

iTell 🕴 ¡Tool con herramientas para pentesting web! Metodos ❣ DDoS Attacks Recon Active Recon (Vulns) Extras (Bypass CF, FTP && SSH Bruter) Respons

1 Jul 28, 2022
Multi View Stereo on Internet Images

Evaluating MVS in a CPC Scenario This repository contains the set of artficats used for the ENGN8601/8602 research project. The thesis emphasizes on t

Namas Bhandari 1 Nov 10, 2021