MS MARCO Passage Ranking¶
Given a query q and a the 1000 most relevant passages P = p1, p2, p3,… p1000, as retrieved by BM25 a successful system is expected to rerank the most relevant passage as high as possible. For this task not all 1000 relevant items have a human labeled relevant passage. Evaluation will be done using MRR.
Models¶
We use BERT base model, RoBERTa large model, and ELECTRA base and large models for experiments.
Training¶
Train.
CUDA_VISIBLE_DEVICES=0 \
python train.py \
-task ranking \
-model bert \
-train queries=./data/queries.train.small.tsv,docs=./data/collection.tsv,qrels=./data/qrels.train.tsv,trec=./data/trids_bm25_marco-10.tsv \
-max_input 12800000 \
-save ./checkpoints/bert.bin \
-dev queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,qrels=./data/qrels.dev.small.tsv,trec=./data/run.msmarco-passage.dev.small.100.trec \
-qrels ./data/qrels.dev.small.tsv \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-res ./results/bert.trec \
-metric mrr_cut_10 \
-max_query_len 32 \
-max_doc_len 221 \
-epoch 3 \
-batch_size 16 \
-lr 3e-6 \
-n_warmup_steps 160000 \
-eval_every 10000
To train electra-large and roberta-large
First convert training data to jsonl version vis data_process.py
import json
import codecs
def main():
f_train_tsv = codecs.open('./data/triples.train.small.tsv','r')
f_train_jsonl = codecs.open('./data/train.jsonl', 'w')
cnt = 0
for line in f_train_tsv:
s = line.strip().split('\t')
f_train_jsonl.write(json.dumps({'query':s[0],'doc_pos':s[1],'doc_neg':s[2]}) + '\n')
cnt += 1
if cnt > 3000000:
break
f_train_jsonl.close()
f_train_tsv.close()
print(cnt)
if __name__ == "__main__":
main()
python3 data_process.py
Train electra-large
CUDA_VISIBLE_DEVICES=0 \
python train.py\
-task ranking \
-model bert \
-train ./data/train.jsonl \
-max_input 3000000 \
-save ./checkpoints/electra_large.bin \
-dev queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,qrels=./data/qrels.dev.small.tsv,trec=./data/run.msmarco-passage.dev.small.100.trec \
-qrels ./data/qrels.dev.small.tsv \
-vocab google/electra-large-discriminator \
-pretrain google/electra-large-discriminator \
-res ./results/electra_large.trec \
-metric mrr_cut_10 \
-max_query_len 32 \
-max_doc_len 256 \
-epoch 1 \
-batch_size 2 \
-lr 5e-6 \
-eval_every 10000
Train roberta-large
CUDA_VISIBLE_DEVICES=0 \
python train.py \
-task ranking \
-model roberta \
-train ./data/train.jsonl \
-max_input 3000000 \
-save ./checkpoints/roberta_large.bin \
-dev queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,qrels=./data/qrels.dev.small.tsv,trec=./data/run.msmarco-passage.dev.small.100.trec \
-qrels ./data/qrels.dev.small.tsv \
-vocab roberta-large \
-pretrain roberta-large \
-res ./results/roberta_large.trec \
-metric mrr_cut_10 \
-max_query_len 32 \
-max_doc_len 256 \
-epoch 1 \
-batch_size 1 \
-lr 5e-7 \
-eval_every 20000
Since the whole dev dataset is too large, we only evaluate on top100 when training, and inference on whole dataset.
Inference¶
Get data and checkpoint from Google Drive
Get checkpoints of electra-large and roberta-large from electra-large roberta-large
Get MS MARCO collection.
wget https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz -P ./data
tar -zxvf ./data/collection.tar.gz -C ./data/
Reproduce bert-base, MRR@10(dev): 0.3494.
CUDA_VISIBLE_DEVICES=0 \
python inference.py \
-task ranking \
-model bert \
-max_input 12800000 \
-test queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,trec=./data/run.msmarco-passage.dev.small.trec \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-checkpoint ./checkpoints/bert-base.bin \
-res ./results/bert-base_msmarco-dev.trec \
-max_query_len 32 \
-max_doc_len 221 \
-batch_size 256
Reproduce electra-base, MRR@10(dev): 0.3518.
CUDA_VISIBLE_DEVICES=0 \
python inference.py \
-task ranking \
-model bert \
-max_input 12800000 \
-test queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,trec=./data/run.msmarco-passage.dev.small.trec \
-vocab google/electra-base-discriminator \
-pretrain google/electra-base-discriminator \
-checkpoint ./checkpoints/electra-base.bin \
-res ./results/electra-base_msmarco-dev.trec \
-max_query_len 32 \
-max_doc_len 221 \
-batch_size 256
Reproduce electra-large, MRR@10(dev): 0.388
CUDA_VISIBLE_DEVICES=0 \
python inference.py \
-task ranking \
-model bert \
-max_input 12800000 \
-test queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,trec=./data/run.msmarco-passage.dev.small.trec \
-vocab google/electra-large-discriminator \
-pretrain google/electra-large-discriminator \
-checkpoint ./checkpoints/electra_large.bin \
-res ./results/electra-large_msmarco-dev.trec \
-max_query_len 32 \
-max_doc_len 221 \
-batch_size 256
Reproduce roberta-large, MRR@10(dev): 0.386
CUDA_VISIBLE_DEVICES=0 \
python inference.py \
-task ranking \
-model roberta \
-max_input 12800000 \
-test queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,trec=./data/run.msmarco-passage.dev.small.trec \
-vocab roberta-large \
-pretrain roberta-large \
-checkpoint ./checkpoints/roberta_large.bin \
-res ./results/roberta-large_msmarco-dev.trec \
-max_query_len 32 \
-max_doc_len 221 \
-batch_size 256
The checkpoints of roberta-large and electra-large are trained on MS-MARCO training data
wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P ./data
tar -zxvf ./data/triples.train.small.tar.gz -C ./data/
For eval dataset inference, just change the trec file to ./data/run.msmarco-passage.eval.small.trec. The top1000 trec files for dev and eval queries are generated following anserini.
Results¶
Results of the runs we submitted.
Retriever |
Reranker |
Coor-Ascent |
dev |
eval |
---|---|---|---|---|
BM25 |
BERT Base |
- |
0.349 |
0.345 |
BM25 |
ELECTRA Base |
- |
0.352 |
0.344 |
BM25 |
RoBERTa Large |
- |
0.386 |
0.375 |
BM25 |
ELECTRA Large |
+ |
0.388 |
0.376 |