To encode a text longer than 512 tokens (for example, 800), set max_pos to 800 during both preprocessing and training.

Last update: Nov 02, 2022

LongScientificFormer

BERT accepts at most 512 tokens by default. To encode a longer text, for example 800 tokens, set max_pos to 800 during both preprocessing and training so that the two stages agree.
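BERT ships with 512 learned position embeddings, so a larger max_pos needs a larger position table. One common approach in BERT-based summarizers is to keep the 512 pretrained rows and fill the extra positions with copies of the last pretrained row. The sketch below illustrates that idea on plain Python lists. `extend_position_table` is a hypothetical helper, and whether this repository initializes the extra rows this way is an assumption.

```python
def extend_position_table(table, max_pos):
    """Return a position table with max_pos rows.

    table: list of embedding rows (each a list of floats), e.g. BERT's 512 rows.
    Extra rows are initialised as copies of the last pretrained row
    (an assumed initialisation, not confirmed by this README).
    """
    if max_pos <= len(table):
        # Nothing to extend; keep only the rows that are needed.
        return [list(row) for row in table[:max_pos]]
    extra = [list(table[-1]) for _ in range(max_pos - len(table))]
    return [list(row) for row in table] + extra


# Toy example: 512 "pretrained" rows of width 4, extended to 800 positions.
pretrained = [[float(i)] * 4 for i in range(512)]
extended = extend_position_table(pretrained, 800)
print(len(extended))  # 800
```

The extra rows are only a starting point; they are expected to be fine-tuned during training, which is why the same max_pos has to be used when the data is preprocessed and when the model is trained.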

Some code is borrowed from ONMT (https://github.com/OpenNMT/OpenNMT-py).

Data Preparation

Option 1: download the processed data

Pre-processed data

Put all files into the raw_data directory.

Step 2. Download Stanford CoreNLP

We need Stanford CoreNLP to tokenize the data. Download it and unzip it, then add the following line to your bash_profile:

export CLASSPATH=/path/to/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2.jar

Replace /path/to/ with the directory where you saved the stanford-corenlp-4.2.2 directory.
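A wrong CLASSPATH usually only shows up later, when tokenization fails. The following minimal sketch checks the variable before running the pipeline. `missing_classpath_entries` is a hypothetical helper, not part of this repository, and it does not expand wildcard entries such as `/path/*`.

```python
import os


def missing_classpath_entries(classpath):
    """Return the CLASSPATH entries that do not exist on disk."""
    entries = [e for e in classpath.split(os.pathsep) if e]
    return [e for e in entries if not os.path.exists(e)]


if __name__ == "__main__":
    classpath = os.environ.get("CLASSPATH", "")
    if not classpath:
        print("CLASSPATH is not set")
    else:
        for entry in missing_classpath_entries(classpath):
            print("missing:", entry)
```

Run it in a new shell after editing bash_profile; no output means every entry on the CLASSPATH exists.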

Step 3. Extract sections from GROBID XML files

python preprocess.py -mode extract_pdf_sections -log_file ../logs/extract_section.log

Step 4. Extract text from TIKA XML files

python preprocess.py -mode get_text_clean_tika -log_file ../logs/extract_tika_text.log

Step 5. Tokenize texts from papers and slides using Stanford CoreNLP

python preprocess.py -mode tokenize -save_path ../temp -log_file ../logs/tokenize_by_corenlp.log

Step 6. Extract source, section, and target from tokenized files

python preprocess.py -mode clean_paper_jsons -save_path ../json_data/ -n_cpus 10 -log_file ../logs/build_json.log

Step 7. Generate BERT `.pt` files from source, sections and targets

python preprocess.py -mode format_to_bert -raw_path ../json_data/ -save_path ../bert_data -lower -n_cpus 40 -log_file ../logs/build_bert_files.log
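Steps 3 to 7 must run in order, because each step consumes the output of the previous one. The sketch below chains them from a single script. The arguments are copied from the commands above; the helper names (`build_commands`, `run_all`) are hypothetical, and it is an assumption that the script is started from the directory that contains preprocess.py.

```python
import subprocess

# Arguments copied from Steps 3-7, in the order they must run.
PREPROCESS_STEPS = [
    ["-mode", "extract_pdf_sections", "-log_file", "../logs/extract_section.log"],
    ["-mode", "get_text_clean_tika", "-log_file", "../logs/extract_tika_text.log"],
    ["-mode", "tokenize", "-save_path", "../temp",
     "-log_file", "../logs/tokenize_by_corenlp.log"],
    ["-mode", "clean_paper_jsons", "-save_path", "../json_data/",
     "-n_cpus", "10", "-log_file", "../logs/build_json.log"],
    ["-mode", "format_to_bert", "-raw_path", "../json_data/",
     "-save_path", "../bert_data", "-lower", "-n_cpus", "40",
     "-log_file", "../logs/build_bert_files.log"],
]


def build_commands(python="python"):
    """Return the full command line for every preprocessing step."""
    return [[python, "preprocess.py", *args] for args in PREPROCESS_STEPS]


def run_all(dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is a cheap way to review paths and CPU counts before starting a long run.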

Model Training

First run: the first time you train, use a single GPU so the code can download the BERT model. Pass -visible_gpus -1; once the download has finished, you can kill the process and rerun the code with multiple GPUs.

Train

python train.py -ext_dropout 0.1 -lr 2e-3 -visible_gpus 1,2,3 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2 -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000
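The -lr and -warmup_steps values interact. Assuming the trainer uses a Noam-style schedule, as BERT summarization code derived from OpenNMT commonly does (this README does not confirm it), the learning rate rises linearly during warmup and then decays with the inverse square root of the step. The function name `noam_lr` is hypothetical.

```python
def noam_lr(step, base_lr=2e-3, warmup_steps=10000):
    """Learning rate at a given step under an assumed Noam-style schedule."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear warmup (step * warmup^-1.5) until both terms meet at
    # step == warmup_steps, then inverse-square-root decay (step^-0.5).
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1000, 10000, 40000, 100000):
    print(step, noam_lr(step))
```

Under this assumption the value passed to -lr is a scale factor rather than the rate actually applied: with the settings above, the peak rate is reached at step 10000 and equals 2e-3 / sqrt(10000) = 2e-5.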

To continue training from a checkpoint

python train.py -ext_dropout 0.1 -lr 2e-3 -train_from ../models/model_step_99000.pt -visible_gpus 1,2,3 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2 -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000
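With -save_checkpoint_steps 1000 the models directory fills up quickly, so it helps to locate the newest checkpoint automatically. This is a minimal sketch that assumes checkpoints follow the model_step_<N>.pt naming seen in the command above; `latest_checkpoint` is a hypothetical helper.

```python
import os
import re

# Matches names such as model_step_99000.pt and captures the step number.
CHECKPOINT_RE = re.compile(r"^model_step_(\d+)\.pt$")


def latest_checkpoint(model_dir):
    """Return the path of the checkpoint with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(model_dir):
        match = CHECKPOINT_RE.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(model_dir, name)
    return best_path
```

The returned path can be passed to -train_from. Steps are compared as integers, so model_step_99000.pt correctly ranks above model_step_100.pt, which a plain alphabetical sort of the file names would not guarantee.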

Test

python train.py -mode test -test_batch_size 1 -bert_data_path ../bert_data -log_file ../logs/ext_bert_test -test_from ../models/model_step_99000.pt -model_path ../models -sep_optim true -use_interval true -visible_gpus 1,2,3 -alpha 0.95 -result_path ../results/ext

Owner

Athar Sefid
