The dataset and source code for our paper: "Did You Ask a Good Question? A Cross-Domain Question IntentionClassification Benchmark for Text-to-SQL"

Overview

TriageSQL

The dataset and source code for our paper: "Did You Ask a Good Question? A Cross-Domain Question Intention Classification Benchmark for Text-to-SQL"

Dataset Download

Due to the size limitation, please download the dataset from Google Drive.

Citations

If you want to use TriageSQL in your work, please cite as follows:

@article{zhang2020did,
  title={Did You Ask a Good Question? A Cross-Domain Question Intention Classification Benchmark for Text-to-SQL},
  author={Zhang, Yusen and Dong, Xiangyu and Chang, Shuaichen and Yu, Tao and Shi, Peng and Zhang, Rui},
  journal={arXiv preprint arXiv:2010.12634},
  year={2020}
}

Dataset

In each json file of the dataset, one can find a field called type, which includes 5 different values, including small talk, answerable, ambiguous, lack data, and unanswerable by sql, corresponding to 5 different types described in our paper. Here is the summary of our dataset and the corresponding experiment results:

Type Trainset Devset Testset Type Alias Reported F1
small talk 31160 7790 500 Improper 0.88
ambiguous 48592 9564 500 Ambiguous 0.43
lack data 90375 19566 500 ExtKnow 0.56
unanswerable by sql 124225 26330 500 Non-SQL 0.90
answerable 139884 32892 500 Answerable 0.53
overall 434236 194037 2500 TriageSQL 0.66

The folder src contains all the source files used to construct the proposed TriageSQL. In addition, some part of files contains more details about the dataset, such as databaseid which is the id of the schema in the original dataset, e.g. "flight_2" in CoSQL, while question_datasetid indicates the original dataset name of the questions, e.g. "quac". Some of the samples do not contain these fields because they are either human-annotated or edited.

Model

We also include the source code for RoBERTa baseline in our project in /model. It is a multi-classifer with 5 classes where '0' represents answerable, '1'-'4' represent distinct types of unanswerable questions. Given the dataset from Google Drive, you may need to conduct some preprocessing to obtain train/dev/test set. You can directly download from here or make your own dataset using the following instructions:

Constructing input file for the RoBERTa model

The same as /testset/test.json, our input file is a json list with shape (num_of_question, 3) containing 3 lists: query, schema, and label.

  • query: containing strings of questions
  • schema: contianing strings of schema for each question, i.e., "table_name.column_name1 | table_name.column_name2 | ... " for multi-table questions, and column_name1 | column_name2 for single-table questions.
  • labels of questions, see config.label_dict for the mapping, leave arbitary value if testing is not needed or true labels are not given.

when preprocessing, please use lower case for all data, and remove the meaningless table names as well, such as T10023-1242. Also, we sample 10k from each type to form the large input dataset

Running

After adjusting the parameters in config.py, one can simply run python train.py or python eval.py to train or evaluate the model.

Explanation of other files

  • config.py: hyper parameters
  • train.py: training and evaluation of the model
  • utils.py: loading the dataset and tokenization
  • model.py: the RoBERTa classification model we used
  • test.json: sample of test input
Owner
Yusen Zhang
Yusen Zhang
PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)

Vision Transformer for Fast and Efficient Scene Text Recognition (ICDAR 2021) ViTSTR is a simple single-stage model that uses a pre-trained Vision Tra

Rowel Atienza 198 Dec 27, 2022
Code for Reciprocal Adversarial Learning for Brain Tumor Segmentation: A Solution to BraTS Challenge 2021 Segmentation Task

BRATS 2021 Solution For Segmentation Task This repo contains the supported pytorch code and configuration files to reproduce 3D medical image segmenta

Himashi Amanda Peiris 6 Sep 15, 2022
Towards Debiasing NLU Models from Unknown Biases

Towards Debiasing NLU Models from Unknown Biases Abstract: NLU models often exploit biased features to achieve high dataset-specific performance witho

Ubiquitous Knowledge Processing Lab 22 Jun 14, 2022
Pytorch Implementation of rpautrat/SuperPoint

SuperPoint-Pytorch (A Pure Pytorch Implementation) SuperPoint: Self-Supervised Interest Point Detection and Description Thanks This work is based on:

76 Dec 27, 2022
A study project using the AA-RMVSNet to reconstruct buildings from multiple images

3d-building-reconstruction This is part of a study project using the AA-RMVSNet to reconstruct buildings from multiple images. Introduction It is exci

17 Oct 17, 2022
Pytorch implementation of DeePSiM

Pytorch implementation of DeePSiM

1 Nov 05, 2021
Dataset used in "PlantDoc: A Dataset for Visual Plant Disease Detection" accepted in CODS-COMAD 2020

PlantDoc: A Dataset for Visual Plant Disease Detection This repository contains the Cropped-PlantDoc dataset used for benchmarking classification mode

Pratik Kayal 109 Dec 29, 2022
The code uses SegFormer for Semantic Segmentation on Drone Dataset.

SegFormer_Segmentation The code uses SegFormer for Semantic Segmentation on Drone Dataset. The details for the SegFormer can be obtained from the foll

Dr. Sander Ali Khowaja 1 May 08, 2022
Code for CVPR 2021 paper: Anchor-Free Person Search

Introduction This is the implementationn for Anchor-Free Person Search in CVPR2021 License This project is released under the Apache 2.0 license. Inst

158 Jan 04, 2023
RoadMap and preparation material for Machine Learning and Data Science - From beginner to expert.

ML-and-DataScience-preparation This repository has the goal to create a learning and preparation roadMap for Machine Learning Engineers and Data Scien

33 Dec 29, 2022
SHIFT15M: multiobjective large-scale fashion dataset with distributional shifts

[arXiv] The main motivation of the SHIFT15M project is to provide a dataset that contains natural dataset shifts collected from a web service IQON, wh

ZOZO, Inc. 138 Nov 24, 2022
Official implementation of the ICLR 2021 paper

You Only Need Adversarial Supervision for Semantic Image Synthesis Official PyTorch implementation of the ICLR 2021 paper "You Only Need Adversarial S

Bosch Research 272 Dec 28, 2022
a generic C++ library for image analysis

VIGRA Computer Vision Library Copyright 1998-2013 by Ullrich Koethe This file is part of the VIGRA computer vision library. You may use,

Ullrich Koethe 378 Dec 30, 2022
Torchreid: Deep learning person re-identification in PyTorch.

Torchreid Torchreid is a library for deep-learning person re-identification, written in PyTorch. It features: multi-GPU training support both image- a

Kaiyang 3.7k Jan 05, 2023
A PyTorch implementation of NeRF (Neural Radiance Fields) that reproduces the results.

NeRF-pytorch NeRF (Neural Radiance Fields) is a method that achieves state-of-the-art results for synthesizing novel views of complex scenes. Here are

Yen-Chen Lin 3.2k Jan 08, 2023
Everything's Talkin': Pareidolia Face Reenactment (CVPR2021)

Everything's Talkin': Pareidolia Face Reenactment (CVPR2021) Linsen Song, Wayne Wu, Chaoyou Fu, Chen Qian, Chen Change Loy, and Ran He [Paper], [Video

71 Dec 21, 2022
Simple cross-platform application for DaVinci surgical video frame annotation

About DaVid is a simple cross-platform GUI for annotating robotic and endoscopic surgical actions for use in deep-learning research. Features Simple a

Cyril Zakka 4 Oct 09, 2021
Vehicles Counting using YOLOv4 + DeepSORT + Flask + Ngrok

A project for counting vehicles using YOLOv4 + DeepSORT + Flask + Ngrok

Duong Tran Thanh 37 Dec 16, 2022
Repository for the NeurIPS 2021 paper: "Exploiting Domain-Specific Features to Enhance Domain Generalization".

meta-Domain Specific-Domain Invariant (mDSDI) Source code implementation for the paper: Manh-Ha Bui, Toan Tran, Anh Tuan Tran, Dinh Phung. "Exploiting

VinAI Research 12 Nov 25, 2022