This repository contains the scripts for downloading and validating the documents

Last update: Jun 07, 2022

HC4: HLTCOE CLIR Common-Crawl Collection

This repository contains the scripts for downloading and validating the documents. Document ids, topics, and qrel files are in resources/hc4/.

Required packages for the scripts are recorded in requirements.txt.

Topics and Qrels

Topics are stored in jsonl format in resources/hc4. The language(s) a topic is annotated for are recorded in the language_with_qrels field. We provide the English topic title and description for all topics, and human translations for the languages in which the topic has qrels. We also provide machine translations of them in all three languages for all topics. Narratives (field narratives) are all in English, with one entry for each language that has qrels. Each topic also has an English report (field report) that is designed to record the searcher's prior knowledge.
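As a rough illustration of working with the topic files, the sketch below reads a jsonl file and keeps the topics annotated for one language. Only the language_with_qrels field name comes from the description above; the function names are hypothetical, and the code assumes the field holds a list of language codes such as "zho".

```python
import json
from pathlib import Path


def load_topics(path):
    """Read one JSON object per non-empty line from a jsonl file."""
    topics = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                topics.append(json.loads(line))
    return topics


def topics_with_qrels(topics, lang):
    """Keep topics whose language_with_qrels field mentions `lang`.

    Assumption: the field holds a list of language codes such as "zho".
    """
    return [t for t in topics if lang in t.get("language_with_qrels", [])]
```

Calling topics_with_qrels(load_topics(path), "zho") on a topic file under resources/hc4 would then return only the topics that have Chinese qrels, provided the assumption about the field holds.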

Qrels are stored in the classic TREC format in resources/hc4/{lang}.
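For readers unfamiliar with the TREC qrels layout, each line has four whitespace-separated columns: topic id, an iteration column that is usually ignored, document id, and relevance label. The parser below is a minimal sketch under that assumption; parse_qrels is a hypothetical helper, not part of this repository.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines into {topic_id: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 4:
            raise ValueError(f"expected 4 columns, got {len(parts)}: {line!r}")
        topic_id, _iteration, doc_id, relevance = parts
        qrels[topic_id][doc_id] = int(relevance)
    return dict(qrels)
```

Because the function accepts any iterable of strings, an open file handle can be passed directly.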

Download Documents

To download the documents from Common Crawl, use the following command. If you plan to use HC4 with ir_datasets, specify ~/.ir_datasets/hc4 as the storage directory or make a soft link to the directory where you wish to store the documents. The document ids and hashes are stored in resources/hc4/{lang}/ids*.jsonl.gz. Russian document ids are split across 8 files.

python download_documents.py --storage ./data/ \
                             --zho ./resources/hc4/zho/ids.jsonl.gz \
                             --fas ./resources/hc4/fas/ids.jsonl.gz \
                             --rus ./resources/hc4/rus/ids.*.jsonl.gz \
                             --jobs 4 \
                             --check_hash

If you wish to download the documents for only one language, specify only the id file for that language. We encourage using the --check_hash flag to verify that the downloaded documents match the documents we intend to use in the collection. The full description of the arguments is available by running the script with the --help flag.
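For the ir_datasets setup mentioned above, one way to create the soft link is sketched here. link_storage is a hypothetical helper; it assumes ~/.ir_datasets/hc4 does not exist yet and that the operating system permits symbolic links.

```python
from pathlib import Path


def link_storage(storage_dir, link_path=Path.home() / ".ir_datasets" / "hc4"):
    """Create a symbolic link at `link_path` pointing to `storage_dir`."""
    storage = Path(storage_dir).resolve()
    link = Path(link_path)
    # Make sure the parent directory (e.g. ~/.ir_datasets) exists.
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"{link} already exists; remove it first")
    link.symlink_to(storage, target_is_directory=True)
    return link
```

With the download command above, link_storage("./data") would make the ./data directory visible at ~/.ir_datasets/hc4.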

Validate

After the documents are downloaded, run validate_hc4_documents.py to verify that all documents were downloaded for each language.

python validate_hc4_documents.py --hc4_file ./data/zho/hc4_docs.jsonl \
                                 --id_file ./resources/hc4/zho/ids.jsonl.gz \
                                 --qrels ./resources/hc4/zho/*.qrels.v1-0.txt
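To give a feel for what such a completeness check involves, the sketch below compares the ids listed in an id file against the ids present in a downloaded jsonl file. It is not the logic of validate_hc4_documents.py. The helper names are hypothetical, and the "id" key is an assumption about the record layout that should be checked against the actual files.

```python
import gzip
import json


def read_ids(path, key="id"):
    """Collect the `key` value of every JSON line in a .jsonl or .jsonl.gz file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    ids = set()
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                # Assumption: each record stores its document id under `key`.
                ids.add(json.loads(line)[key])
    return ids


def missing_ids(id_file, docs_file, key="id"):
    """Return ids listed in `id_file` that do not appear in `docs_file`."""
    return read_ids(id_file, key) - read_ids(docs_file, key)
```

An empty result from missing_ids means that every listed id was found in the downloaded file.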

Reference

If you use this collection, please kindly cite our dataset paper with the following bibtex entry.

@inproceedings{hc4,
	author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang},
	title = {{HC4}: A New Suite of Test Collections for Ad Hoc {CLIR}},
	booktitle = {Proceedings of the 44th European Conference on Information Retrieval (ECIR)},
	year = {2022}
}

Owner

JHU Human Language Technology Center of Excellence
