TREC-COVID

Top spot on the TREC-COVID Challenge (May 2020, Round 2).

The twin goals of the challenge are to evaluate search algorithms and systems for helping scientists, clinicians, policy makers, and others manage the existing and rapidly growing corpus of scientific literature related to COVID-19, and to discover methods that will assist with managing scientific information in future global biomedical crises.

>> Reproduce Our Submission >> About the COVID-19 Dataset >> Our Paper

Data Statistics

Data can be downloaded from Datasets. Each round after Round 1 adds only 5 new queries to the previous round's query set.

| Datasets | Queries | Valid Documents |
|----------|---------|-----------------|
| Round1   | 30      | 51.1K           |
| Round2   | 35      | 59.9K           |
| Round3   | 40      | 128.5K          |
| Round4   | 45      | 157.8K          |
| Round5   | 50      | 191.2K          |

Tasks

TREC-COVID. Domain: biomedical papers.

Models

We use the SciBERT base model for the TREC-COVID experiments. The ReInfoSelect and MetaAdaptRank frameworks are used to select training data that adapts better to the target domain, which improves the ranking models' performance.
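
For orientation, the sketch below shows how a SciBERT cross-encoder reranker of this kind is typically set up with HuggingFace Transformers; it is a minimal illustration, and the repository's own model classes may differ in detail (e.g. the classification head here is freshly initialized, not trained).

# Minimal sketch of a SciBERT cross-encoder reranker (illustrative only).
# Scores a single query-document pair with a binary relevance head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)

query = "what is the origin of COVID-19"
doc = "The novel coronavirus SARS-CoV-2 likely has a zoonotic origin ..."

# Truncate roughly as in the training flags (-max_query_len 32, -max_doc_len 256).
inputs = tokenizer(query, doc, truncation=True, max_length=288, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
score = torch.softmax(logits, dim=-1)[0, 1].item()  # probability of "relevant"
print(score)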

Results

Round

Method

Pre-trained Model

NDCG@20

P@20

5

MetaAdaptRank

PudMedBERT

0.7904

0.9400
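
If you want to score a run file yourself, NDCG@20 and P@20 can be computed from a TREC-format run and the matching round's qrels, for example with pytrec_eval. This evaluation snippet is our own convenience, not part of the repository; substitute the qrels file for the round you are evaluating.

# Compute NDCG@20 and P@20 for a TREC run file against qrels (illustrative).
import pytrec_eval

def read_qrels(path):
    qrels = {}
    for line in open(path):
        qid, _, docid, rel = line.split()
        qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def read_run(path):
    run = {}
    for line in open(path):
        qid, _, docid, _, score, _ = line.split()
        run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = read_qrels("qrels-cord19-round1.txt")   # use the qrels for your round
run = read_run("./results/scibert.trec")
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut", "P"})
results = evaluator.evaluate(run)
ndcg20 = sum(r["ndcg_cut_20"] for r in results.values()) / len(results)
p20 = sum(r["P_20"] for r in results.values()) / len(results)
print(f"NDCG@20={ndcg20:.4f}  P@20={p20:.4f}")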

Training

Get the training data from Google Drive.

Preprocess round1 data.

python ./data/preprocess.py \
  -input_trec anserini.covid-r1.fusion1.txt \
  -input_qrels qrels-cord19-round1.txt \
  -input_queries questions_cord19-rnd1.txt \
  -input_docs cord19_0501_titabs.jsonl \
  -output ./data/dev_trec-covid-round1.jsonl
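
A quick way to sanity-check the preprocessed file is to peek at the first record. The exact field names depend on preprocess.py's output, so treat this as a generic JSONL check rather than a specification of the format.

# Generic JSONL sanity check (field names depend on preprocess.py's output).
import json

with open("./data/dev_trec-covid-round1.jsonl") as f:
    first = json.loads(next(f))
    n = 1 + sum(1 for _ in f)

print(f"{n} records")
print("fields in first record:", sorted(first.keys()))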

Train.

CUDA_VISIBLE_DEVICES=0 \
python train.py \
        -task classification \
        -model bert \
        -train ./data/seanmed.train.320K-pairs.jsonl \
        -max_input 1280000 \
        -save ./checkpoints/scibert.bin \
        -dev ./data/dev_trec-covid-round1.jsonl \
        -qrels qrels-cord19-round1.txt \
        -vocab allenai/scibert_scivocab_uncased \
        -pretrain allenai/scibert_scivocab_uncased \
        -res ./results/scibert.trec \
        -metric ndcg_cut_10 \
        -max_query_len 32 \
        -max_doc_len 256 \
        -epoch 5 \
        -batch_size 16 \
        -lr 2e-5 \
        -n_warmup_steps 4000 \
        -eval_every 200
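
The -lr 2e-5 and -n_warmup_steps 4000 flags suggest the usual warm-up-then-decay schedule for BERT-style fine-tuning. The toy sketch below shows that schedule shape with HuggingFace's scheduler; this is an assumption about the setup, not the training script's exact optimizer code.

# Illustrative optimizer/scheduler shape for -lr 2e-5, -n_warmup_steps 4000.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(10, 2)              # stand-in for the ranking model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=4000, num_training_steps=100_000)

for step in range(10):                      # toy loop: LR ramps up during warmup
    loss = model(torch.randn(16, 10)).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()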

For ReInfoSelect training:

CUDA_VISIBLE_DEVICES=0 \
python train.py \
        -task classification \
        -model bert \
        -reinfoselect \
        -reset \
        -train ./data/seanmed.train.320K-pairs.jsonl \
        -max_input 1280000 \
        -save ./checkpoints/scibert_rl.bin \
        -dev ./data/dev_trec-covid-round1.jsonl \
        -qrels qrels-cord19-round1.txt \
        -vocab allenai/scibert_scivocab_uncased \
        -pretrain allenai/scibert_scivocab_uncased \
        -checkpoint ./checkpoints/scibert.bin \
        -res ./results/scibert_rl.trec \
        -metric ndcg_cut_10 \
        -max_query_len 32 \
        -max_doc_len 256 \
        -epoch 5 \
        -batch_size 8 \
        -lr 2e-5 \
        -tau 1 \
        -n_warmup_steps 5000 \
        -eval_every 1
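
For intuition, ReInfoSelect-style training treats data selection as a small reinforcement-learning problem: a selector network decides which training pairs to keep, the ranker is updated on the kept pairs, and the change in dev performance serves as the reward (hence the -tau temperature and the frequent evaluation implied by -eval_every 1). The toy sketch below illustrates only the REINFORCE update; it is not the repository's implementation.

# Toy illustration of REINFORCE-style data selection (not the repo's code).
import torch

torch.manual_seed(0)
selector = torch.nn.Linear(8, 1)            # scores each candidate training pair
sel_opt = torch.optim.Adam(selector.parameters(), lr=1e-3)
tau = 1.0                                   # temperature, cf. the -tau flag

for step in range(5):
    feats = torch.randn(16, 8)              # features of 16 candidate pairs
    probs = torch.sigmoid(selector(feats).squeeze(-1) / tau)
    keep = torch.bernoulli(probs)           # sample which pairs to train on
    log_prob = (keep * probs.clamp_min(1e-8).log()
                + (1 - keep) * (1 - probs).clamp_min(1e-8).log()).sum()

    # ... train the ranker on the kept pairs, then evaluate on the dev set ...
    reward = torch.randn(()).item()         # stand-in for the change in dev NDCG

    loss = -reward * log_prob               # reinforce choices that helped
    sel_opt.zero_grad()
    loss.backward()
    sel_opt.step()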

Inference

Get the trained checkpoint from checkpoints.

Get the data from Google Drive: round1 and round2.

Filter documents already judged in round1 out of the round2 retrieval results.

python data/filter.py \
  -input_qrels qrels-cord19-round1.txt \
  -input_trec anserini.covid-r2.fusion2.txt \
  -output_topk 50 \
  -output_trec anserini.covid-r2.fusion2-filtered.txt
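
Roughly, this step drops every (query, document) pair that already has a round1 judgment and keeps the top 50 remaining candidates per query. The sketch below captures that logic; filter.py itself may differ in details such as how ranks are renumbered.

# Sketch of the filtering logic: remove round1-judged pairs from a TREC run
# and keep the top-k remaining candidates per query (illustrative only).
from collections import defaultdict

def filter_run(qrels_path, run_path, out_path, topk=50):
    judged = set()
    for line in open(qrels_path):
        qid, _, docid, _ = line.split()
        judged.add((qid, docid))

    kept = defaultdict(int)
    with open(out_path, "w") as out:
        for line in open(run_path):     # run lines assumed sorted by rank per query
            qid, _, docid, _, score, tag = line.split()
            if (qid, docid) in judged or kept[qid] >= topk:
                continue
            kept[qid] += 1
            out.write(f"{qid} Q0 {docid} {kept[qid]} {score} {tag}\n")

filter_run("qrels-cord19-round1.txt",
           "anserini.covid-r2.fusion2.txt",
           "anserini.covid-r2.fusion2-filtered.txt")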

Preprocess round2 data.

python ./data/preprocess.py \
  -input_trec anserini.covid-r2.fusion2-filtered.txt \
  -input_queries questions_cord19-rnd2.txt \
  -input_docs cord19_0501_titabs.jsonl \
  -output ./data/test_trec-covid-round2.jsonl

Reproduce the SciBERT results.

CUDA_VISIBLE_DEVICES=0 \
python inference.py \
        -task classification \
        -model bert \
        -max_input 1280000 \
        -test ./data/test_trec-covid-round2.jsonl \
        -vocab allenai/scibert_scivocab_uncased \
        -pretrain allenai/scibert_scivocab_uncased \
        -checkpoint ./checkpoints/scibert.bin \
        -res ./results/scibert.trec \
        -mode cls \
        -max_query_len 32 \
        -max_doc_len 256 \
        -batch_size 32

Reproduce the ReInfoSelect SciBERT results.

CUDA_VISIBLE_DEVICES=0 \
python inference.py \
        -task classification \
        -model bert \
        -max_input 1280000 \
        -test ./data/test_trec-covid-round2.jsonl \
        -vocab allenai/scibert_scivocab_uncased \
        -pretrain allenai/scibert_scivocab_uncased \
        -checkpoint ./checkpoints/reinfoselect_scibert.bin \
        -res ./results/reinfoselect_scibert.trec \
        -mode pooling \
        -max_query_len 32 \
        -max_doc_len 256 \
        -batch_size 32
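
Both commands write standard TREC-format run files to -res. The snippet below reads one back and prints the top-ranked document per query, which is a quick way to eyeball the output before submission; it is our convenience snippet, not part of the repository.

# Read a TREC-format run file and print the top-ranked document per query.
top = {}
for line in open("./results/scibert.trec"):
    qid, _, docid, rank, score, _ = line.split()
    if qid not in top or int(rank) < top[qid][0]:
        top[qid] = (int(rank), docid, float(score))

for qid in sorted(top):
    rank, docid, score = top[qid]
    print(f"query {qid}: {docid} (score {score:.4f})")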