MS MARCO Document Ranking

MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset. The current release contains 1,010,916 unique real queries, generated by sampling and anonymizing Bing usage logs. The document ranking task has a corpus of 3.2 million documents and a training set of 367,013 queries.

Tasks

* MSMARCO. Domain: Web Pages.

* MSMARCO Document Ranking. Domain: Web Pages.

Datasets & Checkpoints

For BERT FirstP, we concatenate the title and content of each document with a ‘[SEP]’ token. For BERT MaxP, we use only the content of each document. To reproduce our runs, preprocess the official document file into the format: doc_id doc.
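The conversion into the doc_id doc format can be sketched as follows. This is a minimal illustration, not the script used for our runs: the function name convert_docs, the tab separator in the output, and the choice to skip rows that do not have exactly four columns are all assumptions. The four input columns (docid, url, title, body) follow the msmarco-docs.tsv format listed in the table below.

```python
import io


def convert_docs(src, dst, use_title):
    """Read `docid<TAB>url<TAB>title<TAB>body` lines from `src` and write
    `doc_id<TAB>doc` lines to `dst`. Returns the number of rows written.

    use_title=True  -> FirstP style: "title [SEP] body"
    use_title=False -> MaxP style:   "body" only
    The output separator (a tab) is an assumption of this sketch.
    """
    written = 0
    for line in src:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            # Assumption: malformed rows are skipped rather than repaired.
            continue
        docid, _url, title, body = parts
        text = f"{title} [SEP] {body}" if use_title else body
        dst.write(f"{docid}\t{text}\n")
        written += 1
    return written


# In-memory demonstration; real use would pass open file handles instead.
_demo_src = io.StringIO("D1\thttp://example.org\tA title\tSome body text\n")
_demo_dst = io.StringIO()
convert_docs(_demo_src, _demo_dst, use_title=True)
```

With real files, the same function could be called once with use_title=True to produce msmarco-docs-firstp.tsv and once with use_title=False to produce msmarco-docs-maxp.tsv.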

Type	File	Records	Format	Description
Corpus	msmarco-docs.tsv	3,213,835	tsv: docid, url, title, body	Document Collections
Train	msmarco-doctrain-queries.tsv	367,013	tsv: qid, query	Training Queries
Train	msmarco-doctrain-qrels.tsv	384,597	TREC qrels	Training Query-Doc Relevance Labels
Train	Training-Data-FirstP	7,340,240	tsv: qid, docid, label	ANCE FirstP training data
Train	Training-Data-MaxP	7,340,240	tsv: qid, docid, label	ANCE MaxP training data
Dev	msmarco-docdev-queries.tsv	5,193	tsv: qid, query	Dev Queries
Dev	msmarco-docdev-qrels.tsv	5,478	TREC qrels	Dev Query-Doc Relevance Labels
Dev	ANCE-FirstP-dev-top100	519,300	TREC submission	ANCE FirstP dev top100
Dev	ANCE-MaxP-dev-top100	519,300	TREC submission	ANCE MaxP dev top100
Test	docleaderboard-queries.tsv	5,793	tsv: qid, query	Test Queries
Test	ANCE-FirstP-eval-top100	579,300	TREC submission	ANCE FirstP eval top100
Test	ANCE-MaxP-eval-top100	579,300	TREC submission	ANCE MaxP eval top100
Model	BERT-Base-ANCE-FirstP	-	-	BERT Base ANCE FirstP checkpoint
Model	BERT-Base-ANCE-MaxP	-	-	BERT Base ANCE MaxP checkpoint
Model	F-MaxP	-	-	BERT Base ANCE MaxP Coor-Ascent weights

Models

We use ANCE FirstP and MaxP as retrieval models, and BERT FirstP and MaxP as reranking models. The FirstP and MaxP settings are the same as in the paper.
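As a rough illustration of the difference between the two settings: FirstP scores only the first passage of a document, while MaxP scores every passage and keeps the maximum. The sketch below assumes non-overlapping passages of 445 tokens (the -max_doc_len value used in the commands on this page) and uses a toy term-overlap scorer as a stand-in for BERT; the function names and the splitting strategy are hypothetical and may differ from the actual implementation.

```python
from typing import Callable, List

Scorer = Callable[[List[str], List[str]], float]


def split_passages(doc_tokens: List[str], passage_len: int) -> List[List[str]]:
    """Split a document into consecutive, non-overlapping passages.
    An empty document yields a single empty passage."""
    passages = [doc_tokens[i:i + passage_len]
                for i in range(0, len(doc_tokens), passage_len)]
    return passages or [[]]


def firstp_score(query: List[str], doc_tokens: List[str],
                 score_fn: Scorer, passage_len: int = 445) -> float:
    """FirstP: score only the first passage of the document."""
    return score_fn(query, split_passages(doc_tokens, passage_len)[0])


def maxp_score(query: List[str], doc_tokens: List[str],
               score_fn: Scorer, passage_len: int = 445) -> float:
    """MaxP: score every passage and keep the best one."""
    return max(score_fn(query, p)
               for p in split_passages(doc_tokens, passage_len))


def overlap_score(query: List[str], passage: List[str]) -> float:
    """Toy scorer (NOT BERT): number of passage tokens found in the query."""
    query_terms = set(query)
    return float(sum(tok in query_terms for tok in passage))
```

A document whose only relevant text appears after the first passage gets a low FirstP score but can still receive a high MaxP score.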

Training

You can also finetune BERT yourself instead of using our checkpoints.

* BERT FirstP

We provide our training data (qid did label): Training-Data-FirstP. 10 negative documents are randomly sampled for each training query from ANCE FirstP top-100 documents. Since the dev dataset is too large to evaluate every 10000 steps, we only evaluate the top-100 documents of the first 50 dev queries: msmarco-doc_dev_firstp-50.jsonl.

CUDA_VISIBLE_DEVICES=0 \
python train.py \
        -task classification \
        -model bert \
        -train queries=./data/msmarco-doctrain-queries.tsv,docs=./data/msmarco-docs-firstp.tsv,qrels=./data/msmarco-doctrain-qrels.tsv,trec=./data/bids_marco-doc_ance-firstp-10.tsv \
        -max_input 12800000 \
        -save ./checkpoints/bert-base-firstp.bin \
        -dev ./data/msmarco-doc_dev_firstp-50.jsonl \
        -qrels ./data/msmarco-docdev-qrels.tsv \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -res ./results/bert.trec \
        -metric mrr_cut_100 \
        -max_query_len 64 \
        -max_doc_len 445 \
        -epoch 1 \
        -batch_size 4 \
        -lr 3e-6 \
        -n_warmup_steps 100000 \
        -eval_every 10000
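The negative sampling described above can be sketched as follows. This is one plausible reading of the procedure, not the code that produced Training-Data-FirstP: the function name, the fixed seed, the exclusion of judged-relevant documents from the negative pool, and the row layout (each positive listed once) are assumptions, and the released file may be organized differently.

```python
import random
from typing import Dict, List, Set, Tuple


def sample_training_rows(top100: Dict[str, List[str]],
                         qrels: Dict[str, Set[str]],
                         n_neg: int = 10,
                         seed: int = 13) -> List[Tuple[str, str, int]]:
    """Build (qid, docid, label) rows: label 1 for judged-relevant documents,
    label 0 for `n_neg` documents sampled at random from the retrieved
    top-100 list of the same query."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rows: List[Tuple[str, str, int]] = []
    for qid, ranked_docs in top100.items():
        positives = qrels.get(qid, set())
        # Assumption: relevant documents are never used as negatives.
        candidates = [d for d in ranked_docs if d not in positives]
        negatives = rng.sample(candidates, min(n_neg, len(candidates)))
        rows.extend((qid, d, 1) for d in sorted(positives))
        rows.extend((qid, d, 0) for d in negatives)
    return rows
```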

After BERT finetuning, we choose the checkpoint that performs best on the dev dataset and use it to generate BERT features.

CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
        -task classification \
        -model bert \
        -max_input 12800000 \
        -dev ./data/msmarco-doc_dev_firstp.jsonl \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -checkpoint ./checkpoints/bert-base-firstp.bin \
        -res ./features/bert-base_ance_dev_firstp_features \
        -max_query_len 64 \
        -max_doc_len 445 \
        -batch_size 256

Then, we run Coor-Ascent on these features using RankLib to learn the weight of each feature.

java -jar LeToR/RankLib-2.1-patched.jar -train features/bert-base_ance_dev_firstp_features -ranker 4 -metric2t RR@100 -save checkpoints/f_firstp.ca

Finally, we generate the features of the eval dataset and compute the ranking scores using the learned feature weights, following the same steps as in the Inference section.
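Coordinate Ascent in RankLib learns a linear model, so the final ranking score is a weighted sum of the feature values. The sketch below parses one line in the RankLib (LETOR/SVMlight-style) feature format and applies a set of weights. The feature ids, feature values, and weights in the example are made up for illustration; the actual features written by gen_feature.py and the weights stored in the .ca file are not reproduced here.

```python
from typing import Dict, Tuple


def parse_feature_line(line: str) -> Tuple[str, str, Dict[int, float]]:
    """Parse `<label> qid:<qid> <fid>:<value> ... # <comment>`.
    Returns (qid, comment, {feature_id: value})."""
    body, _, comment = line.partition("#")
    fields = body.split()
    qid = fields[1].split(":", 1)[1]
    features = {}
    for field in fields[2:]:
        fid, value = field.split(":", 1)
        features[int(fid)] = float(value)
    return qid, comment.strip(), features


def linear_score(features: Dict[int, float], weights: Dict[int, float]) -> float:
    """Weighted sum of features; features without a weight contribute 0."""
    return sum(weights.get(fid, 0.0) * value for fid, value in features.items())


# Hypothetical example: two features and two weights.
_qid, _doc, _feats = parse_feature_line("1 qid:42 1:0.5 2:2.0 # D7")
_score = linear_score(_feats, {1: 0.4, 2: 0.1})
```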

* BERT MaxP

We provide our training data (qid did label): Training-Data-MaxP. 10 negative documents are randomly sampled for each training query from ANCE MaxP top-100 documents. Since the dev dataset is too large to evaluate every 10000 steps, we only evaluate the top-100 documents of the first 50 dev queries: msmarco-doc_dev_maxp-50.jsonl.

Train.

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python train.py \
        -task classification \
        -model bert \
        -train queries=./data/msmarco-doctrain-queries.tsv,docs=./data/msmarco-docs-maxp.tsv,qrels=./data/msmarco-doctrain-qrels.tsv,trec=./data/bids_marco-doc_ance-maxp-10.tsv \
        -max_input 12800000 \
        -save ./checkpoints/bert-base-maxp.bin \
        -dev ./data/msmarco-doc_dev_maxp-50.jsonl \
        -qrels ./data/msmarco-docdev-qrels.tsv \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -res ./results/bert.trec \
        -metric mrr_cut_100 \
        -max_query_len 64 \
        -max_doc_len 445 \
        -maxp \
        -epoch 1 \
        -batch_size 8 \
        -lr 2e-5 \
        -n_warmup_steps 50000 \
        -eval_every 10000

After BERT finetuning, we choose the checkpoint that performs best on the dev dataset and use it to generate BERT features.

CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
        -task classification \
        -model bert \
        -max_input 12800000 \
        -dev ./data/msmarco-doc_dev_maxp.jsonl \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -checkpoint ./checkpoints/bert-base-maxp.bin \
        -res ./features/bert-base_ance_dev_maxp_features \
        -max_query_len 64 \
        -max_doc_len 445 \
        -maxp \
        -batch_size 64

Then, we run Coor-Ascent on these features using RankLib to learn the weight of each feature.

java -jar LeToR/RankLib-2.1-patched.jar -train features/bert-base_ance_dev_maxp_features -ranker 4 -metric2t RR@100 -save checkpoints/f_maxp.ca

Finally, we generate the features of the eval dataset and compute the ranking scores using the learned feature weights, following the same steps as in the Inference section.

Inference

* BERT FirstP

We provide the ANCE FirstP top-100 documents for the dev and docleaderboard queries in standard TREC format, hosted on Aliyun (see ANCE-FirstP-dev-top100 and ANCE-FirstP-eval-top100 in the Datasets & Checkpoints table).

Preprocess the dev and eval datasets. msmarco-docs-firstp.tsv is the preprocessed document file, in which each line is doc_id title [SEP] content:

python data/preprocess.py -input_trec data/ANCE_FirstP_dev.trec -input_qrels data/msmarco-docdev-qrels.tsv -input_queries data/msmarco-docdev-queries.tsv -input_docs data/msmarco-docs-firstp.tsv -output data/msmarco-doc_dev_firstp.jsonl
python data/preprocess.py -input_trec data/ANCE_FirstP_eval.trec -input_queries data/docleaderboard-queries.tsv -input_docs data/msmarco-docs-firstp.tsv -output data/msmarco-doc_eval_firstp.jsonl

The checkpoint of BERT Base FirstP is available at BERT-Base-ANCE-FirstP. Now you can reproduce ANCE FirstP + BERT Base FirstP, MRR@100(dev): 0.4079.

CUDA_VISIBLE_DEVICES=0 \
python inference.py \
        -task classification \
        -model bert \
        -max_input 12800000 \
        -test ./data/msmarco-doc_dev_firstp.jsonl \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -checkpoint ./checkpoints/bert-base_ance_firstp.bin \
        -res ./results/bert-base_ance_dev_firstp.trec \
        -max_query_len 64 \
        -max_doc_len 445 \
        -batch_size 256
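The reported dev metric is MRR@100. As a minimal sketch of how it can be computed from a TREC-format run (qid Q0 docid rank score tag) and a set of relevance labels: the function names are hypothetical, documents are re-sorted by score rather than trusting the rank column, and the average is taken over all queries that have relevance labels. Official evaluation scripts may differ in these details.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def read_trec_run(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse `qid Q0 docid rank score tag` lines into
    {qid: [docid, ...]} ordered by descending score."""
    scored = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, _rank, score, _tag = fields
        scored[qid].append((float(score), docid))
    return {qid: [docid for _s, docid in sorted(pairs, key=lambda p: -p[0])]
            for qid, pairs in scored.items()}


def mrr_at_k(run: Dict[str, List[str]],
             qrels: Dict[str, Set[str]],
             k: int = 100) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.
    Queries with no relevant document in the top k contribute 0."""
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)
```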

* BERT MaxP

ANCE MaxP top-100 documents of dev and docleaderboard queries are also provided.

Preprocess the dev and eval datasets. msmarco-docs-maxp.tsv is the preprocessed document file, in which each line is doc_id content:

python data/preprocess.py -input_trec data/ANCE_MaxP_dev.trec -input_qrels data/msmarco-docdev-qrels.tsv -input_queries data/msmarco-docdev-queries.tsv -input_docs data/msmarco-docs-maxp.tsv -output data/msmarco-doc_dev_maxp.jsonl
python data/preprocess.py -input_trec data/ANCE_MaxP_eval.trec -input_queries data/docleaderboard-queries.tsv -input_docs data/msmarco-docs-maxp.tsv -output data/msmarco-doc_eval_maxp.jsonl

The checkpoint of BERT Base MaxP is available at BERT-Base-ANCE-MaxP. Now you can reproduce ANCE MaxP + BERT Base MaxP, MRR@100(dev): 0.4094.

CUDA_VISIBLE_DEVICES=0 \
python inference.py \
        -task classification \
        -model bert \
        -max_input 12800000 \
        -test ./data/msmarco-doc_dev_maxp.jsonl \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -checkpoint ./checkpoints/bert-base_ance_maxp.bin \
        -res ./results/bert-base_ance_dev_maxp.trec \
        -max_query_len 64 \
        -max_doc_len 445 \
        -maxp \
        -batch_size 64

We also provide the weights of the BERT Base MaxP features learned by Coor-Ascent: F-MaxP. First, generate the BERT Base MaxP features of the eval dataset.

CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
        -task classification \
        -model bert \
        -max_input 12800000 \
        -dev ./data/msmarco-doc_eval_maxp.jsonl \
        -vocab bert-base-uncased \
        -pretrain bert-base-uncased \
        -checkpoint ./checkpoints/bert-base_ance_maxp.bin \
        -res ./features/bert-base_ance_eval_maxp_features \
        -max_query_len 64 \
        -max_doc_len 445 \
        -maxp \
        -batch_size 64

Then, we compute the ranking score using the weights.

java -jar LeToR/RankLib-2.1-patched.jar -load checkpoints/f_maxp.ca -rank features/bert-base_ance_eval_maxp_features -score f0.score
python LeToR/gen_trec.py -dev data/msmarco-doc_eval_maxp.jsonl -res results/bert-base_ance_eval_maxp_ca.trec -k -1

Dense Retriever Inference

We provide dense retriever code (optimized ANCE) in OpenMatch for MSMARCO Document and MSMARCO-like datasets.

For such a dataset, the following files must be provided:

* Documents
  * [document collection.tsv]: each line is [passage id]\t[text]\n and holds the text of one document. Each [passage id] has the form “Dxxxxx”, where “xxxxx” is an integer.

* For each query set (training, dev, eval, etc.)
  * [custom queries.tsv]: each line is [query id]\t[text]\n. Each [query id] is an integer.
  * [custom qrels.tsv]: each line is [query id] 0 [passage id] 1\n. This file is optional, because relevance labels may not be available for test-set queries.
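Before running the pre-processing step, it can help to check that the files follow the line formats above. The sketch below is a hypothetical validator, not part of OpenMatch: the function names are made up, and it assumes ASCII digits for the ids and whitespace-separated qrels fields.

```python
import re

DOC_ID = re.compile(r"D[0-9]+")   # passage ids such as "D12345"
INT_ID = re.compile(r"[0-9]+")    # query ids are plain integers


def check_collection_line(line: str) -> bool:
    """True if the line is `[passage id]<TAB>[text]` with a "Dxxxxx" id."""
    pid, sep, text = line.rstrip("\n").partition("\t")
    return sep == "\t" and DOC_ID.fullmatch(pid) is not None and text != ""


def check_query_line(line: str) -> bool:
    """True if the line is `[query id]<TAB>[text]` with an integer id."""
    qid, sep, text = line.rstrip("\n").partition("\t")
    return sep == "\t" and INT_ID.fullmatch(qid) is not None and text != ""


def check_qrel_line(line: str) -> bool:
    """True if the line is `[query id] 0 [passage id] 1`."""
    parts = line.split()
    return (len(parts) == 4
            and INT_ID.fullmatch(parts[0]) is not None
            and parts[1] == "0"
            and DOC_ID.fullmatch(parts[2]) is not None
            and parts[3] == "1")
```

Running each check over the corresponding file and reporting the first failing line number is usually enough to catch id-format mistakes early.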

Go to the retriever folder for the following steps: cd ./retrievers/DANCE/.

Pre-processing:

python data/custom_data.py \
--data_dir [raw tsv data folder] \
--out_data_dir [processed data folder] \
--model_type rdot_nll \
--model_name_or_path roberta-base \
--max_seq_length 512 \
--data_type 0 \
--doc_collection_tsv [doc text path] \
--save_prefix [query saving name] \
--query_collection_tsv [query text path] \
--qrel_tsv [optional qrel tsv]

You can specify a PyTorch checkpoint and use it to compute the embeddings of the documents or queries.

python -u -m torch.distributed.launch \
--nproc_per_node=[num GPU] --master_port=57135 ./drivers/run_ann_emb_inference.py \
--model_type rdot_nll \
--inference_type query --save_prefix [prefix of the query preprocessed file. eg., train] \
--split_ann_search --gpu_index \
--init_model_dir [checkpoint folder] \
--data_dir [processed data folder] \
--training_dir [task folder] \
--output_dir [task folder]/ann_data/ \
--cache_dir [task folder]/ann_data/cache/ \
--max_query_length 64 --max_seq_length 512 --per_gpu_eval_batch_size 256

Passing --inference_type query together with --save_prefix [prefix of the preprocessed query file, e.g., train] computes the embeddings of one query set; change the prefix to process a different query set. Passing --inference_type document and removing --save_prefix computes the document embeddings instead.

Next, you can use the following command to produce TREC-format retrieval results for different query sets. Note that the embedding files are matched by emb_file_pattern = os.path.join(emb_dir,f'{emb_prefix}{checkpoint_postfix}__emb_p__data_obj_*.pb'). Check how your embeddings were saved and set checkpoint_postfix accordingly so that the program can load them.

python ./evaluation/retrieval.py \
--trec_path [output trec path] \
--emb_dir [folder holding the dumped query and passage/document embeddings, same as --output_dir above] \
--checkpoint_postfix [checkpoint custom name] \
--processed_data_dir [processed data folder] \
--queryset_prefix [query saving name] \
--gpu_index True --topN 100 --data_type 0
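To check which embedding files the pattern mentioned above would pick up for a given checkpoint_postfix, the matching can be reproduced on a list of file names. This is a small sketch: the helper name is hypothetical, the prefix "passage_" and the postfix "ckpt" used in the example are made-up values, and it only mirrors the file-name pattern rather than the loading logic of retrieval.py.

```python
import fnmatch
from typing import List


def matching_embedding_files(filenames: List[str],
                             emb_prefix: str,
                             checkpoint_postfix: str) -> List[str]:
    """Return the file names that match
    `{emb_prefix}{checkpoint_postfix}__emb_p__data_obj_*.pb`, sorted."""
    pattern = f"{emb_prefix}{checkpoint_postfix}__emb_p__data_obj_*.pb"
    return sorted(fnmatch.filter(filenames, pattern))


# Hypothetical directory listing; with real data, use os.listdir(emb_dir).
_names = [
    "passage_ckpt__emb_p__data_obj_0.pb",
    "passage_ckpt__emb_p__data_obj_1.pb",
    "passage_other__emb_p__data_obj_0.pb",
]
_found = matching_embedding_files(_names, "passage_", "ckpt")
```

If the returned list is empty, the checkpoint_postfix most likely does not match the names under which the embeddings were saved.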

Now you can calculate different metrics and play with the trec files for further reranking experiments with OpenMatch.
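Conceptually, the retrieval step ranks documents by the similarity between query and document embeddings and writes the top N per query in TREC format. The sketch below illustrates this with a brute-force search in pure Python, assuming inner-product similarity; it is not the OpenMatch implementation, which works on the dumped embedding files and can use a GPU index. The function names and the run tag are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_n(query_embs: Dict[str, Sequence[float]],
                   doc_embs: Dict[str, Sequence[float]],
                   top_n: int = 100,
                   run_tag: str = "dense") -> List[str]:
    """Score every document against every query (brute force) and return
    TREC-format lines: `qid Q0 docid rank score tag`."""
    lines = []
    for qid, q_vec in query_embs.items():
        scored = ((dot(q_vec, d_vec), docid) for docid, d_vec in doc_embs.items())
        best = heapq.nlargest(top_n, scored)  # highest scores first
        for rank, (score, docid) in enumerate(best, start=1):
            lines.append(f"{qid} Q0 {docid} {rank} {score:.6f} {run_tag}")
    return lines
```

Brute-force scoring is only practical for small collections; for a corpus of millions of documents an approximate or GPU-backed nearest-neighbor index is the usual choice.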

Results

Results of the runs we submitted, reported as MRR@100 on the dev and eval sets.

Retriever	Reranker	Coor-Ascent	dev	eval
ANCE FirstP	-	-	0.373	0.334
ANCE MaxP	-	-	0.383	0.342
ANCE FirstP+BM25	BERT Base FirstP	-	0.407	-
ANCE FirstP+BM25	BERT Base FirstP	+	0.431	0.380
ANCE MaxP	BERT Base MaxP	-	0.409	-
ANCE MaxP	BERT Base MaxP	+	0.432	0.391