MS MARCO Passage Ranking

Given a query q and the 1,000 candidate passages P = {p1, p2, ..., p1000} retrieved by BM25, a successful system reranks the relevant passages as high as possible. Note that not every query has a human-labeled relevant passage among its 1,000 candidates. Evaluation is done using MRR@10.
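Concretely, MRR@10 is the mean over queries of the reciprocal rank of the first relevant passage within the top 10 reranked results, or 0 if none appears. A minimal sketch (function names are illustrative, not from this repo):

```python
def mrr_at_10(ranked_pids, relevant_pids):
    # Reciprocal rank of the first relevant passage in the top 10, else 0.
    for rank, pid in enumerate(ranked_pids[:10], start=1):
        if pid in relevant_pids:
            return 1.0 / rank
    return 0.0

def mean_mrr_at_10(run, qrels):
    # run: qid -> ranked list of pids; qrels: qid -> set of relevant pids.
    return sum(mrr_at_10(pids, qrels.get(qid, set()))
               for qid, pids in run.items()) / len(run)
```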

Tasks

* MS MARCO Passage Ranking. Domain: Web Pages.

Models

We run experiments with BERT (base), RoBERTa (large), and ELECTRA (base and large).

Training

Train bert-base:

CUDA_VISIBLE_DEVICES=0 \
python train.py \
        -task ranking \
        -model bert \
        -train queries=./data/queries.train.small.tsv,docs=./data/collection.tsv,qrels=./data/qrels.train.tsv,trec=./data/trids_bm25_marco-10.tsv \
        -max_input 12800000 \
        -save ./checkpoints/bert.bin \
        -dev queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,qrels=./data/qrels.dev.small.tsv,trec=./data/run.msmarco-passage.dev.small.100.trec \
        -qrels ./data/qrels.dev.small.tsv \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -res ./results/bert.trec \
        -metric mrr_cut_10 \
        -max_query_len 32 \
        -max_doc_len 221 \
        -epoch 3 \
        -batch_size 16 \
        -lr 3e-6 \
        -n_warmup_steps 160000 \
        -eval_every 10000

To train electra-large and roberta-large:

First convert the training data to JSONL via data_process.py:

import json

def main():
    # Convert the first 3M BM25 triples (query \t positive \t negative)
    # into JSONL records for training.
    cnt = 0
    with open('./data/triples.train.small.tsv', 'r', encoding='utf-8') as f_train_tsv, \
         open('./data/train.jsonl', 'w', encoding='utf-8') as f_train_jsonl:
        for line in f_train_tsv:
            if cnt >= 3000000:
                break
            s = line.rstrip('\n').split('\t')
            f_train_jsonl.write(json.dumps({'query': s[0], 'doc_pos': s[1], 'doc_neg': s[2]}) + '\n')
            cnt += 1
    print(cnt)

if __name__ == "__main__":
    main()
Then run it:

python3 data_process.py

Train electra-large

CUDA_VISIBLE_DEVICES=0 \
python train.py \
        -task ranking \
        -model bert \
        -train ./data/train.jsonl \
        -max_input 3000000 \
        -save ./checkpoints/electra_large.bin \
        -dev queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,qrels=./data/qrels.dev.small.tsv,trec=./data/run.msmarco-passage.dev.small.100.trec \
        -qrels ./data/qrels.dev.small.tsv \
        -vocab google/electra-large-discriminator \
        -pretrain google/electra-large-discriminator \
        -res ./results/electra_large.trec \
        -metric mrr_cut_10 \
        -max_query_len 32 \
        -max_doc_len 256 \
        -epoch 1 \
        -batch_size 2 \
        -lr 5e-6 \
        -eval_every 10000

Train roberta-large

CUDA_VISIBLE_DEVICES=0 \
python train.py \
        -task ranking \
        -model roberta \
        -train ./data/train.jsonl \
        -max_input 3000000 \
        -save ./checkpoints/roberta_large.bin \
        -dev queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,qrels=./data/qrels.dev.small.tsv,trec=./data/run.msmarco-passage.dev.small.100.trec \
        -qrels ./data/qrels.dev.small.tsv \
        -vocab roberta-large \
        -pretrain roberta-large \
        -res ./results/roberta_large.trec \
        -metric mrr_cut_10 \
        -max_query_len 32 \
        -max_doc_len 256 \
        -epoch 1 \
        -batch_size 1 \
        -lr 5e-7 \
        -eval_every 20000

Since reranking the full candidate set is expensive, we evaluate only on the top-100 candidates per query during training, and rerank the full candidate set at inference time.
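The top-100 dev run used during training can be produced by truncating the full TREC run. A sketch assuming the standard TREC run format (qid Q0 pid rank score tag) with lines grouped and rank-ordered per query, as Anserini emits; the helper name is illustrative:

```python
from collections import defaultdict

def truncate_trec_run(in_path, out_path, depth=100):
    # Keep only the first `depth` candidates per query.
    seen = defaultdict(int)
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            qid = line.split()[0]
            if seen[qid] < depth:
                fout.write(line)
                seen[qid] += 1
```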

Inference

Get data and checkpoints from Google Drive.

Download the trained checkpoints from the electra-large and roberta-large links.

Download the MS MARCO passage collection:

wget https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz -P ./data
tar -zxvf ./data/collection.tar.gz -C ./data/
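The extracted collection.tsv stores one passage per line as a tab-separated (pid, passage) pair. A quick sanity check (helper name is illustrative):

```python
def peek_collection(path, n=3):
    # Return the first n (pid, passage) pairs from a TSV collection file.
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if len(rows) >= n:
                break
            pid, passage = line.rstrip("\n").split("\t", 1)
            rows.append((pid, passage))
    return rows
```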

Reproduce bert-base, MRR@10(dev): 0.3494.

CUDA_VISIBLE_DEVICES=0 \
python inference.py \
        -task ranking \
        -model bert \
        -max_input 12800000 \
        -test queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,trec=./data/run.msmarco-passage.dev.small.trec \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -checkpoint ./checkpoints/bert-base.bin \
        -res ./results/bert-base_msmarco-dev.trec \
        -max_query_len 32 \
        -max_doc_len 221 \
        -batch_size 256

Reproduce electra-base, MRR@10(dev): 0.3518.

CUDA_VISIBLE_DEVICES=0 \
python inference.py \
        -task ranking \
        -model bert \
        -max_input 12800000 \
        -test queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,trec=./data/run.msmarco-passage.dev.small.trec \
        -vocab google/electra-base-discriminator \
        -pretrain google/electra-base-discriminator \
        -checkpoint ./checkpoints/electra-base.bin \
        -res ./results/electra-base_msmarco-dev.trec \
        -max_query_len 32 \
        -max_doc_len 221 \
        -batch_size 256

Reproduce electra-large, MRR@10(dev): 0.388

CUDA_VISIBLE_DEVICES=0 \
python inference.py \
        -task ranking \
        -model bert \
        -max_input 12800000 \
        -test queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,trec=./data/run.msmarco-passage.dev.small.trec \
        -vocab google/electra-large-discriminator \
        -pretrain google/electra-large-discriminator \
        -checkpoint ./checkpoints/electra_large.bin \
        -res ./results/electra-large_msmarco-dev.trec \
        -max_query_len 32 \
        -max_doc_len 221 \
        -batch_size 256

Reproduce roberta-large, MRR@10(dev): 0.386

CUDA_VISIBLE_DEVICES=0 \
python inference.py \
        -task ranking \
        -model roberta \
        -max_input 12800000 \
        -test queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,trec=./data/run.msmarco-passage.dev.small.trec \
        -vocab roberta-large \
        -pretrain roberta-large \
        -checkpoint ./checkpoints/roberta_large.bin \
        -res ./results/roberta-large_msmarco-dev.trec \
        -max_query_len 32 \
        -max_doc_len 221 \
        -batch_size 256

The checkpoints of roberta-large and electra-large were trained on the MS MARCO training triples:

wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P ./data
tar -zxvf ./data/triples.train.small.tar.gz -C ./data/

For inference on the eval set, just change the trec file to ./data/run.msmarco-passage.eval.small.trec. The top-1000 TREC files for the dev and eval queries are generated following Anserini.

Results

Results of the runs we submitted.

| Retriever | Reranker | Coor-Ascent | dev | eval |
|-----------|----------|-------------|-----|------|
| BM25 | BERT Base | - | 0.349 | 0.345 |
| BM25 | ELECTRA Base | - | 0.352 | 0.344 |
| BM25 | RoBERTa Large | - | 0.386 | 0.375 |
| BM25 | ELECTRA Large | + | 0.388 | 0.376 |