MS MARCO Document Ranking

MS MARCO (Microsoft MAchine Reading COmprehension) is a large-scale dataset. The current version contains 1,010,916 unique real queries, generated by sampling and anonymizing Bing usage logs. The corpus for the document ranking task has 3.2 million documents, and the training set has 367,013 queries.

Tasks

* MSMARCO. Domain: Web Pages.

* MSMARCO Document Ranking. Domain: Web Pages.

Datasets & Checkpoints

For BERT FirstP, we concatenate the title and content of each document with a ‘[SEP]’ token. For BERT MaxP, we use only the content of each document. To reproduce our runs, the official document file needs to be preprocessed into the format: doc_id doc (a sketch of this preprocessing is shown below).
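A minimal preprocessing sketch, assuming tab-separated input and output files and the msmarco-docs.tsv column layout listed in the table below; the repository’s own preprocessing may handle edge cases differently.

# Minimal sketch: build msmarco-docs-firstp.tsv (doc_id \t title [SEP] body) and
# msmarco-docs-maxp.tsv (doc_id \t body) from the official msmarco-docs.tsv
# (docid, url, title, body). Paths and error handling are illustrative only.
with open("data/msmarco-docs.tsv", encoding="utf-8") as fin, \
     open("data/msmarco-docs-firstp.tsv", "w", encoding="utf-8") as f_firstp, \
     open("data/msmarco-docs-maxp.tsv", "w", encoding="utf-8") as f_maxp:
    for line in fin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 4:
            continue  # skip malformed lines
        docid, _url, title, body = parts[:4]
        f_firstp.write(f"{docid}\t{title} [SEP] {body}\n")
        f_maxp.write(f"{docid}\t{body}\n")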

| Type | File | Records | Format | Description |
|------|------|---------|--------|-------------|
| Corpus | msmarco-docs.tsv | 3,213,835 | tsv: docid, url, title, body | Document Collections |
| Train | msmarco-doctrain-queries.tsv | 367,013 | tsv: qid, query | Training Queries |
| Train | msmarco-doctrain-qrels.tsv | 384,597 | TREC qrels | Training Query-Doc Relevance Labels |
| Train | Training-Data-FirstP | 7,340,240 | tsv: qid, docid, label | ANCE FirstP training data |
| Train | Training-Data-MaxP | 7,340,240 | tsv: qid, docid, label | ANCE MaxP training data |
| Dev | msmarco-docdev-queries.tsv | 5,193 | tsv: qid, query | Dev Queries |
| Dev | msmarco-docdev-qrels.tsv | 5,478 | TREC qrels | Dev Query-Doc Relevance Labels |
| Dev | ANCE-FirstP-dev-top100 | 519,300 | TREC submission | ANCE FirstP dev top100 |
| Dev | ANCE-MaxP-dev-top100 | 519,300 | TREC submission | ANCE MaxP dev top100 |
| Test | docleaderboard-queries.tsv | 5,793 | tsv: qid, query | Test Queries |
| Test | ANCE-FirstP-eval-top100 | 579,300 | TREC submission | ANCE FirstP eval top100 |
| Test | ANCE-MaxP-eval-top100 | 579,300 | TREC submission | ANCE MaxP eval top100 |
| Model | BERT-Base-ANCE-FirstP | - | - | BERT Base ANCE FirstP checkpoint |
| Model | BERT-Base-ANCE-MaxP | - | - | BERT Base ANCE MaxP checkpoint |
| Model | F-MaxP | - | - | BERT Base ANCE MaxP Coor-Ascent weights |

Models

We use ANCE FirstP and MaxP as retrieval models and BERT FirstP and MaxP as reranking models. The FirstP and MaxP settings are the same as in the paper: FirstP scores a document using only its first passage, while MaxP splits the document into passages, scores each passage, and takes the maximum passage score as the document score (see the sketch below).
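As a rough illustration of the two settings (not the repository’s actual implementation), document scoring could look like the following, where score_passage stands in for a BERT relevance model and the passage length is illustrative:

from typing import Callable, List

def split_into_passages(doc_tokens: List[str], passage_len: int = 445) -> List[List[str]]:
    # Split a tokenized document into fixed-length passages (illustrative only).
    return [doc_tokens[i:i + passage_len] for i in range(0, len(doc_tokens), passage_len)] or [[]]

def firstp_score(query: str, doc_tokens: List[str],
                 score_passage: Callable[[str, List[str]], float]) -> float:
    # FirstP: score only the first passage of the document.
    return score_passage(query, split_into_passages(doc_tokens)[0])

def maxp_score(query: str, doc_tokens: List[str],
               score_passage: Callable[[str, List[str]], float]) -> float:
    # MaxP: score every passage and keep the maximum passage score.
    return max(score_passage(query, p) for p in split_into_passages(doc_tokens))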

Training

You can also finetune BERT yourself instead of using our checkpoints.

* BERT FirstP

We provide our training data (qid docid label): Training-Data-FirstP. For each training query, 10 negative documents are randomly sampled from the ANCE FirstP top-100 documents (a sampling sketch is shown below). Since the dev set is too large to evaluate every 10,000 steps, we only evaluate the top-100 documents of the first 50 dev queries: msmarco-doc_dev_firstp-50.jsonl.
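A minimal sketch of this negative sampling, assuming a standard TREC run file with the ANCE FirstP top-100 documents per training query; the input/output file names, seed, and label convention are assumptions, and the released Training-Data-FirstP may differ in such details.

import random
from collections import defaultdict

random.seed(13)  # arbitrary seed, for illustration only

# Positive documents per query from the official qrels (TREC qrels: qid 0 docid rel).
positives = defaultdict(set)
with open("data/msmarco-doctrain-qrels.tsv") as f:
    for line in f:
        qid, _, docid, rel = line.split()
        if int(rel) > 0:
            positives[qid].add(docid)

# Candidates per query from the ANCE FirstP top-100 run (qid Q0 docid rank score tag).
candidates = defaultdict(list)
with open("data/ance_firstp_train_top100.trec") as f:  # hypothetical file name
    for line in f:
        qid, _, docid, _rank, _score, _tag = line.split()
        candidates[qid].append(docid)

# Write qid \t docid \t label, with label 1 for positives and 0 for 10 sampled negatives.
with open("data/training-data-firstp.tsv", "w") as out:  # hypothetical output name
    for qid, pos in positives.items():
        for docid in pos:
            out.write(f"{qid}\t{docid}\t1\n")
        negs = [d for d in candidates[qid] if d not in pos]
        for docid in random.sample(negs, min(10, len(negs))):
            out.write(f"{qid}\t{docid}\t0\n")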

CUDA_VISIBLE_DEVICES=0 \
python train.py \
        -task classification \
        -model bert \
        -train queries=./data/msmarco-doctrain-queries.tsv,docs=./data/msmarco-docs-firstp.tsv,qrels=./data/msmarco-doctrain-qrels.tsv,trec=./data/bids_marco-doc_ance-firstp-10.tsv \
        -max_input 12800000 \
        -save ./checkpoints/bert-base-firstp.bin \
        -dev ./data/msmarco-doc_dev_firstp-50.jsonl \
        -qrels ./data/msmarco-docdev-qrels.tsv \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -res ./results/bert.trec \
        -metric mrr_cut_100 \
        -max_query_len 64 \
        -max_doc_len 445 \
        -epoch 1 \
        -batch_size 4 \
        -lr 3e-6 \
        -n_warmup_steps 100000 \
        -eval_every 10000

After finetuning BERT, we choose the best checkpoint on the dev set to generate BERT features.

CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
        -task classification \
        -model bert \
        -max_input 12800000 \
        -dev ./data/msmarco-doc_dev_firstp.jsonl \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -checkpoint ./checkpoints/bert-base-firstp.bin \
        -res ./features/bert-base_ance_dev_firstp_features \
        -max_query_len 64 \
        -max_doc_len 445 \
        -batch_size 256

Then, we run Coor-Ascent on these features with RankLib to learn the weight of each feature.

java -jar LeToR/RankLib-2.1-patched.jar -train features/bert-base_ance_dev_firstp_features -ranker 4 -metric2t RR@100 -save checkpoints/f_firstp.ca
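For reference, RankLib consumes training features in the standard LETOR/SVMlight text format; the lines below are purely illustrative (qids, docids, and feature values are made up, and the exact features written by gen_feature.py may differ):

1 qid:100 1:0.9134 2:0.3145 # D0001
0 qid:100 1:0.1283 2:0.2811 # D0002
0 qid:100 1:0.0457 2:0.1009 # D0003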

Finally, we can generate the features of the eval set and compute the ranking scores using the learned feature weights, in the same way as in the inference section.

* BERT MaxP

We provide our training data (qid docid label): Training-Data-MaxP. For each training query, 10 negative documents are randomly sampled from the ANCE MaxP top-100 documents. Since the dev set is too large to evaluate every 10,000 steps, we only evaluate the top-100 documents of the first 50 dev queries: msmarco-doc_dev_maxp-50.jsonl.

Train.

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python train.py \
        -task classification \
        -model bert \
        -train queries=./data/msmarco-doctrain-queries.tsv,docs=./data/msmarco-docs-maxp.tsv,qrels=./data/msmarco-doctrain-qrels.tsv,trec=./data/bids_marco-doc_ance-maxp-10.tsv \
        -max_input 12800000 \
        -save ./checkpoints/bert-base-maxp.bin \
        -dev ./data/msmarco-doc_dev_maxp-50.jsonl \
        -qrels ./data/msmarco-docdev-qrels.tsv \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -res ./results/bert.trec \
        -metric mrr_cut_100 \
        -max_query_len 64 \
        -max_doc_len 445 \
        -maxp \
        -epoch 1 \
        -batch_size 8 \
        -lr 2e-5 \
        -n_warmup_steps 50000 \
        -eval_every 10000

After finetuning BERT, we choose the best checkpoint on the dev set to generate BERT features.

CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
        -task classification \
        -model bert \
        -max_input 12800000 \
        -dev ./data/msmarco-doc_dev_maxp.jsonl \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -checkpoint ./checkpoints/bert-base-maxp.bin \
        -res ./features/bert-base_ance_dev_maxp_features \
        -max_query_len 64 \
        -max_doc_len 445 \
        -maxp \
        -batch_size 64

Then, we run Coor-Ascent on these features with RankLib to learn the weight of each feature.

java -jar LeToR/RankLib-2.1-patched.jar -train features/bert-base_ance_dev_maxp_features -ranker 4 -metric2t RR@100 -save checkpoints/f_maxp.ca

Finally, we can generate the features of the eval set and compute the ranking scores using the learned feature weights, in the same way as in the inference section.

Inference

* BERT FirstP

We provide the ANCE FirstP top-100 documents for the dev and docleaderboard queries on Aliyun in standard TREC format; they can be downloaded from the links in the table above.

Preprocess the dev and eval sets. msmarco-docs-firstp.tsv is the preprocessed document file, where each line is doc_id title [SEP] content:

python data/preprocess.py -input_trec data/ANCE_FirstP_dev.trec -input_qrels data/msmarco-docdev-qrels.tsv -input_queries data/msmarco-docdev-queries.tsv -input_docs data/msmarco-docs-firstp.tsv -output data/msmarco-doc_dev_firstp.jsonl
python data/preprocess.py -input_trec data/ANCE_FirstP_eval.trec -input_queries data/docleaderboard-queries.tsv -input_docs data/msmarco-docs-firstp.tsv -output data/msmarco-doc_eval_firstp.jsonl

The checkpoint of BERT Base FirstP is available at BERT-Base-ANCE-FirstP. Now you can reproduce ANCE FirstP + BERT Base FirstP, which achieves MRR@100 (dev): 0.4079.

CUDA_VISIBLE_DEVICES=0 \
python inference.py \
        -task classification \
        -model bert \
        -max_input 12800000 \
        -test ./data/msmarco-doc_dev_firstp.jsonl \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -checkpoint ./checkpoints/bert-base_ance_firstp.bin \
        -res ./results/bert-base_ance_dev_firstp.trec \
        -max_query_len 64 \
        -max_doc_len 445 \
        -batch_size 256
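To check the dev MRR@100 of this run, here is a minimal sketch that reads the TREC-format run and the official dev qrels (standard tools such as trec_eval or the official MS MARCO evaluation script can be used instead); it assumes the run file lists documents in rank order:

from collections import defaultdict

# Relevant documents per query from the dev qrels (TREC qrels: qid 0 docid rel).
relevant = defaultdict(set)
with open("data/msmarco-docdev-qrels.tsv") as f:
    for line in f:
        qid, _, docid, rel = line.split()
        if int(rel) > 0:
            relevant[qid].add(docid)

# Ranked documents per query from the TREC run (qid Q0 docid rank score tag).
run = defaultdict(list)
with open("results/bert-base_ance_dev_firstp.trec") as f:
    for line in f:
        qid, _, docid, _rank, _score, _tag = line.split()
        run[qid].append(docid)

# MRR@100: reciprocal rank of the first relevant document within the top 100.
rr_sum = 0.0
for qid, docs in run.items():
    for rank, docid in enumerate(docs[:100], start=1):
        if docid in relevant[qid]:
            rr_sum += 1.0 / rank
            break
print(f"MRR@100 (dev): {rr_sum / len(run):.4f}")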

* BERT MaxP

The ANCE MaxP top-100 documents for the dev and docleaderboard queries are also provided.

Preprocess the dev and eval sets. msmarco-docs-maxp.tsv is the preprocessed document file, where each line is doc_id content:

python data/preprocess.py -input_trec data/ANCE_MaxP_dev.trec -input_qrels data/msmarco-docdev-qrels.tsv -input_queries data/msmarco-docdev-queries.tsv -input_docs data/msmarco-docs-maxp.tsv -output data/msmarco-doc_dev_maxp.jsonl
python data/preprocess.py -input_trec data/ANCE_MaxP_eval.trec -input_queries data/docleaderboard-queries.tsv -input_docs data/msmarco-docs-maxp.tsv -output data/msmarco-doc_eval_maxp.jsonl

The checkpoint of BERT Base MaxP is available at BERT-Base-ANCE-MaxP. Now you can reproduce ANCE MaxP + BERT Base MaxP, which achieves MRR@100 (dev): 0.4094.

CUDA_VISIBLE_DEVICES=0 \
python inference.py \
        -task classification \
        -model bert \
        -max_input 12800000 \
        -test ./data/msmarco-doc_dev_maxp.jsonl \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -checkpoint ./checkpoints/bert-base_ance_maxp.bin \
        -res ./results/bert-base_ance_dev_maxp.trec \
        -max_query_len 64 \
        -max_doc_len 445 \
        -maxp \
        -batch_size 64

We also provide the weights of the BERT Base MaxP features learned by Coor-Ascent: F-MaxP. First, generate the BERT Base MaxP features of the eval set.

CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
        -task classification \
        -model bert \
        -max_input 12800000 \
        -dev ./data/msmarco-doc_eval_maxp.jsonl \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -checkpoint ./checkpoints/bert-base_ance_maxp.bin \
        -res ./features/bert-base_ance_eval_maxp_features \
        -max_query_len 64 \
        -max_doc_len 445 \
        -maxp \
        -batch_size 64

Then, we compute the ranking scores using the learned weights.

java -jar LeToR/RankLib-2.1-patched.jar -load checkpoints/f_maxp.ca -rank features/bert-base_ance_eval_maxp_features -score f0.score
python LeToR/gen_trec.py -dev data/msmarco-doc_eval_maxp.jsonl -res results/bert-base_ance_eval_maxp_ca.trec -k -1

Dense Retriever Inference

We provide dense retriever code (optimized ANCE) in OpenMatch for MS MARCO Document and MSMARCO-like datasets.

For such a dataset, several files must be provided:

* Documents
  * [document collection.tsv]: each line is [passage id]\t[text]\n for the document texts. [passage id]s are in the format “Dxxxxx”, where “xxxxx” are integers.
* Files for each query set (training, dev, eval, etc.)
  * [custom queries.tsv]: each line is [query id]\t[text]\n. [query id]s are also integers.
  * [custom qrels.tsv]: each line is [query id] 0 [passage id] 1\n. This file is optional, because we may not have relevance labels for the test-set queries.
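For illustration only, these files might look like the following (the ids and text are made up, and the “# …” lines are labels rather than file contents):

# document collection.tsv
D100001	Roman aqueducts carried water across long distances using gravity alone ...
D100002	To change a bicycle tire, first release the brake and remove the wheel ...

# custom queries.tsv
42	how do roman aqueducts work

# custom qrels.tsv
42 0 D100001 1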

Go to the retriever folder for the following operations: cd ./retrievers/DANCE/.

Pre-processing:

python data/custom_data.py \
--data_dir [raw tsv data folder] \
--out_data_dir [processed data folder] \
--model_type rdot_nll \
--model_name_or_path roberta-base \
--max_seq_length 512 \
--data_type 0 \
--doc_collection_tsv [doc text path] \
--save_prefix [query saving name] \
--query_collection_tsv [query text path] \
--qrel_tsv [optional qrel tsv]

You can specify a PyTorch checkpoint and use it to infer the embeddings of the documents or queries.

python -u -m torch.distributed.launch \
--nproc_per_node=[num GPU] --master_port=57135 ./drivers/run_ann_emb_inference.py \
--model_type rdot_nll \
--inference_type query --save_prefix [prefix of the query preprocessed file. eg., train] \
--split_ann_search --gpu_index \
--init_model_dir [checkpoint folder] \
--data_dir [processed data folder] \
--training_dir [task folder] \
--output_dir [task folder]/ann_data/ \
--cache_dir [task folder]/ann_data/cache/ \
--max_query_length 64 --max_seq_length 512 --per_gpu_eval_batch_size 256

With --inference_type query --save_prefix [prefix of the preprocessed query file, e.g., train], you can infer embeddings for different query sets. With --inference_type document and --save_prefix removed, you can infer the document embeddings.

Next, you can use the following command to produce TREC-format retrieval results for different query sets. Note that the embedding files are matched by emb_file_pattern = os.path.join(emb_dir,f'{emb_prefix}{checkpoint_postfix}__emb_p__data_obj_*.pb'); check how your embeddings are saved and specify checkpoint_postfix so the program can load them. A quick sanity check is sketched below, before the command.
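A small sanity check, assuming the pattern above; the directory, prefix, and checkpoint_postfix values are hypothetical placeholders:

import glob
import os

# Hypothetical values; use your own --output_dir and checkpoint name.
emb_dir = "[task folder]/ann_data/"
emb_prefix = "train_"                      # query prefix used with --save_prefix, or "" for documents
checkpoint_postfix = "checkpoint-150000"   # must match how the embeddings were saved

emb_file_pattern = os.path.join(emb_dir, f"{emb_prefix}{checkpoint_postfix}__emb_p__data_obj_*.pb")
matches = sorted(glob.glob(emb_file_pattern))
print(f"{len(matches)} embedding file(s) match {emb_file_pattern}")
for path in matches:
    print(" ", path)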

python ./evaluation/retrieval.py \
--trec_path [output trec path] \
--emb_dir [folder with the dumped query and passage/document embeddings, i.e., the same as --output_dir above] \
--checkpoint_postfix [checkpoint custom name] \
--processed_data_dir [processed data folder] \
--queryset_prefix [query saving name] \
--gpu_index True --topN 100 --data_type 0

Now you can calculate different metrics and play with the trec files for further reranking experiments with OpenMatch.

Results

Results of the runs we submitted (MRR@100 on the dev and eval sets).

| Retriever | Reranker | Coor-Ascent | dev | eval |
|-----------|----------|-------------|-----|------|
| ANCE FirstP | - | - | 0.373 | 0.334 |
| ANCE MaxP | - | - | 0.383 | 0.342 |
| ANCE FirstP+BM25 | BERT Base FirstP | - | 0.407 | - |
| ANCE FirstP+BM25 | BERT Base FirstP | + | 0.431 | 0.380 |
| ANCE MaxP | BERT Base MaxP | - | 0.409 | - |
| ANCE MaxP | BERT Base MaxP | + | 0.432 | 0.391 |