MS MARCO Document Ranking¶
MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset. The current version has 1,010,916 unique real queries that were generated by sampling and anonymizing Bing usage logs. The corpus of the document ranking task has 3.2 million documents, and the training set has 367,013 queries.
Datasets & Checkpoints¶
For BERT FirstP, we concatenate the title and content of each document with a ‘[SEP]’. For BERT MaxP, we use only the content of each document. To reproduce our runs, preprocess the official document file into the format: doc_id doc.
Type | File | Records | Format | Description |
---|---|---|---|---|
Corpus | | 3,213,835 | tsv: docid, url, title, body | Document Collections |
Train | | 367,013 | tsv: qid, query | Training Queries |
Train | | 384,597 | TREC qrels | Training Query-Doc Relevance Labels |
Train | | 7,340,240 | tsv: qid, docid, label | ANCE FirstP training data |
Train | | 7,340,240 | tsv: qid, docid, label | ANCE MaxP training data |
Dev | | 5,193 | tsv: qid, query | Dev Queries |
Dev | | 5,478 | TREC qrels | Dev Query-Doc Relevance Labels |
Dev | | 519,300 | TREC submission | ANCE FirstP dev top100 |
Dev | | 519,300 | TREC submission | ANCE MaxP dev top100 |
Test | | 5,793 | tsv: qid, query | Test Queries |
Test | | 579,300 | TREC submission | ANCE FirstP eval top100 |
Test | | 579,300 | TREC submission | ANCE MaxP eval top100 |
Model | | - | - | BERT Base ANCE FirstP checkpoint |
Model | | - | - | BERT Base ANCE MaxP checkpoint |
Model | | - | - | BERT Base ANCE MaxP Coor-Ascent weights |
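The document preprocessing described at the top of this section could be sketched roughly as follows. This is a minimal sketch, not the script we used: the function names `convert_line` and `convert_file` are hypothetical, and we assume that doc_id and doc are separated by a tab and that ‘[SEP]’ is surrounded by single spaces.

```python
def convert_line(line):
    """Turn one official row (docid, url, title, body) into both formats.

    Returns (firstp_line, maxp_line), or None for a malformed row.
    The tab between doc_id and doc is an assumption of this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return None
    docid, _url, title, body = fields
    firstp_line = f"{docid}\t{title} [SEP] {body}"  # FirstP: title [SEP] content
    maxp_line = f"{docid}\t{body}"                  # MaxP: content only
    return firstp_line, maxp_line


def convert_file(src_path, firstp_path, maxp_path):
    """Stream the official document file and write both preprocessed files."""
    with open(src_path, encoding="utf-8") as src, \
         open(firstp_path, "w", encoding="utf-8") as firstp_out, \
         open(maxp_path, "w", encoding="utf-8") as maxp_out:
        for line in src:
            converted = convert_line(line)
            if converted is None:
                continue  # skip rows that do not have exactly four columns
            firstp_out.write(converted[0] + "\n")
            maxp_out.write(converted[1] + "\n")
```

Streaming line by line keeps memory use flat, which matters for a corpus of 3.2 million documents.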
Models¶
We use ANCE FirstP and MaxP as retrieval models, and BERT FirstP and MaxP as reranking models. The FirstP and MaxP settings are the same as in the paper.
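As these settings are commonly described, FirstP scores only the first passage of a document, while MaxP scores every passage and keeps the maximum. The sketch below illustrates that difference only. The non-overlapping split, the function names, and the toy `score_fn` are assumptions for illustration and are not taken from our training code.

```python
def split_passages(tokens, passage_len):
    """Cut a token list into consecutive, non-overlapping passages."""
    return [tokens[i:i + passage_len] for i in range(0, len(tokens), passage_len)]


def firstp_score(tokens, passage_len, score_fn):
    """FirstP: score only the first passage of the document."""
    passages = split_passages(tokens, passage_len)
    return score_fn(passages[0]) if passages else float("-inf")


def maxp_score(tokens, passage_len, score_fn):
    """MaxP: score every passage and keep the best one."""
    passages = split_passages(tokens, passage_len)
    return max((score_fn(p) for p in passages), default=float("-inf"))


def toy_score(passage):
    """Stand-in for a reranker: count occurrences of one query term."""
    return passage.count("bert")
```

In practice `score_fn` would be the reranker's relevance score for a (query, passage) pair.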
Training¶
You can also finetune BERT yourself instead of using our checkpoints.
* BERT FirstP
We provide our training data (qid did label): Training-Data-FirstP. For each training query, 10 negative documents are randomly sampled from the ANCE FirstP top-100 documents. Since the dev dataset is too large to evaluate in full every 10000 steps, we only evaluate the top-100 documents of the first 50 dev queries: msmarco-doc_dev_firstp-50.jsonl.
CUDA_VISIBLE_DEVICES=0 \
python train.py \
-task classification \
-model bert \
-train queries=./data/msmarco-doctrain-queries.tsv,docs=./data/msmarco-docs-firstp.tsv,qrels=./data/msmarco-doctrain-qrels.tsv,trec=./data/bids_marco-doc_ance-firstp-10.tsv \
-max_input 12800000 \
-save ./checkpoints/bert-base-firstp.bin \
-dev ./data/msmarco-doc_dev_firstp-50.jsonl \
-qrels ./data/msmarco-docdev-qrels.tsv \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-res ./results/bert.trec \
-metric mrr_cut_100 \
-max_query_len 64 \
-max_doc_len 445 \
-epoch 1 \
-batch_size 4 \
-lr 3e-6 \
-n_warmup_steps 100000 \
-eval_every 10000
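The negative sampling used to build the training data above might look roughly like this. It is a sketch under stated assumptions: the function name, the fixed seed, labeling positives as 1 and negatives as 0, and excluding judged-relevant documents from the negative pool are all our illustrative choices, not a description of the released file.

```python
import random


def sample_training_rows(top100, positives, n_neg=10, seed=0):
    """Build (qid, did, label) rows from retrieved candidates.

    top100:    dict mapping qid -> list of retrieved doc ids
    positives: dict mapping qid -> set of relevant doc ids
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rows = []
    for qid, docids in top100.items():
        relevant = positives.get(qid, set())
        for did in sorted(relevant):
            rows.append((qid, did, 1))
        # Negatives come only from retrieved documents that are not relevant.
        candidates = [d for d in docids if d not in relevant]
        for did in rng.sample(candidates, min(n_neg, len(candidates))):
            rows.append((qid, did, 0))
    return rows
```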
After finetuning BERT, we choose the best checkpoint on the dev dataset to generate BERT features.
CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
-task classification \
-model bert \
-max_input 12800000 \
-dev ./data/msmarco-doc_dev_firstp.jsonl \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-checkpoint ./checkpoints/bert-base-firstp.bin \
-res ./features/bert-base_ance_dev_firstp_features \
-max_query_len 64 \
-max_doc_len 445 \
-batch_size 256
Then, we run Coor-Ascent on these features using RankLib to learn the weight of each feature.
java -jar LeToR/RankLib-2.1-patched.jar -train features/bert-base_ance_dev_firstp_features -ranker 4 -metric2t RR@100 -save checkpoints/f_firstp.ca
Finally, we generate the features of the eval dataset and compute the ranking scores using the feature weights, in the same way as in the Inference section.
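Coordinate Ascent in RankLib learns a linear model, so the final score is a weighted sum of the features. The sketch below shows that computation on a line in the LETOR-style format that RankLib reads (`label qid:<id> <fid>:<value> ... # comment`). The function names and the example weights are hypothetical, and the sketch does not parse RankLib's saved model file.

```python
def parse_feature_line(line):
    """Parse 'label qid:<id> 1:<v> 2:<v> ... # comment' into its parts."""
    body, _, comment = line.partition("#")
    parts = body.split()
    label = float(parts[0])
    qid = parts[1].split(":", 1)[1]
    features = {}
    for item in parts[2:]:
        fid, value = item.split(":", 1)
        features[int(fid)] = float(value)
    return label, qid, features, comment.strip()


def linear_score(features, weights):
    """Weighted sum of features; features without a weight contribute 0."""
    return sum(weights.get(fid, 0.0) * value for fid, value in features.items())
```

For example, with made-up weights `{1: 0.2, 2: 0.4}` the line `1 qid:42 1:0.5 2:2.0 # D100` is scored as 0.2 * 0.5 + 0.4 * 2.0.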
* BERT MaxP
We provide our training data (qid did label): Training-Data-MaxP. For each training query, 10 negative documents are randomly sampled from the ANCE MaxP top-100 documents. Since the dev dataset is too large to evaluate in full every 10000 steps, we only evaluate the top-100 documents of the first 50 dev queries: msmarco-doc_dev_maxp-50.jsonl.
Train.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python train.py \
-task classification \
-model bert \
-train queries=./data/msmarco-doctrain-queries.tsv,docs=./data/msmarco-docs-maxp.tsv,qrels=./data/msmarco-doctrain-qrels.tsv,trec=./data/bids_marco-doc_ance-maxp-10.tsv \
-max_input 12800000 \
-save ./checkpoints/bert-base-maxp.bin \
-dev ./data/msmarco-doc_dev_maxp-50.jsonl \
-qrels ./data/msmarco-docdev-qrels.tsv \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-res ./results/bert.trec \
-metric mrr_cut_100 \
-max_query_len 64 \
-max_doc_len 445 \
-maxp \
-epoch 1 \
-batch_size 8 \
-lr 2e-5 \
-n_warmup_steps 50000 \
-eval_every 10000
After finetuning BERT, we choose the best checkpoint on the dev dataset to generate BERT features.
CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
-task classification \
-model bert \
-max_input 12800000 \
-dev ./data/msmarco-doc_dev_maxp.jsonl \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-checkpoint ./checkpoints/bert-base-maxp.bin \
-res ./features/bert-base_ance_dev_maxp_features \
-max_query_len 64 \
-max_doc_len 445 \
-maxp \
-batch_size 64
Then, we run Coor-Ascent on these features using RankLib to learn the weight of each feature.
java -jar LeToR/RankLib-2.1-patched.jar -train features/bert-base_ance_dev_maxp_features -ranker 4 -metric2t RR@100 -save checkpoints/f_maxp.ca
Finally, we generate the features of the eval dataset and compute the ranking scores using the feature weights, in the same way as in the Inference section.
Inference¶
* BERT FirstP
We provide the ANCE FirstP top-100 documents of the dev and docleaderboard queries on aliyun in standard TREC format. You can download these files.
Preprocess the dev and eval datasets. msmarco-docs-firstp.tsv is the preprocessed document file, in which each line is doc_id title [SEP] content:
python data/preprocess.py -input_trec data/ANCE_FirstP_dev.trec -input_qrels data/msmarco-docdev-qrels.tsv -input_queries data/msmarco-docdev-queries.tsv -input_docs data/msmarco-docs-firstp.tsv -output data/msmarco-doc_dev_firstp.jsonl
python data/preprocess.py -input_trec data/ANCE_FirstP_eval.trec -input_queries data/docleaderboard-queries.tsv -input_docs data/msmarco-docs-firstp.tsv -output data/msmarco-doc_eval_firstp.jsonl
The checkpoint of BERT Base FirstP is available at BERT-Base-ANCE-FirstP. Now you can reproduce ANCE FirstP + BERT Base FirstP, MRR@100(dev): 0.4079.
CUDA_VISIBLE_DEVICES=0 \
python inference.py \
-task classification \
-model bert \
-max_input 12800000 \
-test ./data/msmarco-doc_dev_firstp.jsonl \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-checkpoint ./checkpoints/bert-base_ance_firstp.bin \
-res ./results/bert-base_ance_dev_firstp.trec \
-max_query_len 64 \
-max_doc_len 445 \
-batch_size 256
* BERT MaxP
ANCE MaxP top-100 documents of dev and docleaderboard queries are also provided.
Preprocess the dev and eval datasets. msmarco-docs-maxp.tsv is the preprocessed document file, in which each line is doc_id content. Pass the ANCE MaxP top-100 files to -input_trec in place of the FirstP file names shown:
python data/preprocess.py -input_trec data/ANCE_FirstP_dev.trec -input_qrels data/msmarco-docdev-qrels.tsv -input_queries data/msmarco-docdev-queries.tsv -input_docs data/msmarco-docs-maxp.tsv -output data/msmarco-doc_dev_maxp.jsonl
python data/preprocess.py -input_trec data/ANCE_FirstP_eval.trec -input_queries data/docleaderboard-queries.tsv -input_docs data/msmarco-docs-maxp.tsv -output data/msmarco-doc_eval_maxp.jsonl
The checkpoint of BERT Base MaxP is available at BERT-Base-ANCE-MaxP. Now you can reproduce ANCE MaxP + BERT Base MaxP, MRR@100(dev): 0.4094.
CUDA_VISIBLE_DEVICES=0 \
python inference.py \
-task classification \
-model bert \
-max_input 12800000 \
-test ./data/msmarco-doc_dev_maxp.jsonl \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-checkpoint ./checkpoints/bert-base_ance_maxp.bin \
-res ./results/bert-base_ance_dev_maxp.trec \
-max_query_len 64 \
-max_doc_len 445 \
-maxp \
-batch_size 64
We also provide the weights of BERT Base MaxP features learned by Coor-Ascent: F-MaxP. First, generate the BERT Base MaxP features of the eval dataset.
CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
-task classification \
-model bert \
-max_input 12800000 \
-dev ./data/msmarco-doc_eval_maxp.jsonl \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-checkpoint ./checkpoints/bert-base_ance_maxp.bin \
-res ./features/bert-base_ance_eval_maxp_features \
-max_query_len 64 \
-max_doc_len 445 \
-maxp \
-batch_size 64
Then, we compute the ranking score using the weights.
java -jar LeToR/RankLib-2.1-patched.jar -load checkpoints/f_maxp.ca -rank features/bert-base_ance_eval_maxp_features -score f0.score
python LeToR/gen_trec.py -dev data/msmarco-doc_eval_maxp.jsonl -res results/bert-base_ance_eval_maxp_ca.trec -k -1
Dense Retriever Inference¶
We provide dense retriever code (optimized ANCE) in OpenMatch for MSMARCO Document and MSMARCO-like datasets.
In such a dataset, several files must be provided:
- documents
  - [document collection.tsv]: each line contains [passage id]\t[text]\n for the document texts. [passage id]s are in the format “Dxxxxx”, where “xxxxx” are integers.
- files that need to be provided for each query set (training, dev, eval, etc.)
  - [custom queries.tsv]: each line contains [query id]\t[text]\n. [query id]s are also integers.
  - [custom qrels.tsv]: each line contains [query id] 0 [passage id] 1\n. This file is optional because we may not have answers for the test set queries.
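Before running the pre-processing, it can help to check that custom files follow the formats listed above. The checks below are a minimal sketch; the function names are hypothetical, and requiring non-empty text is our assumption rather than a documented rule of the pipeline.

```python
import re

# Passage ids look like "D" followed by digits, e.g. D12345.
DOC_ID = re.compile(r"^D\d+$")


def check_doc_line(line):
    """[passage id]\\t[text] with an id of the form Dxxxxx."""
    parts = line.rstrip("\n").split("\t", 1)
    return len(parts) == 2 and bool(DOC_ID.match(parts[0])) and parts[1] != ""


def check_query_line(line):
    """[query id]\\t[text] with an integer query id."""
    parts = line.rstrip("\n").split("\t", 1)
    return len(parts) == 2 and parts[0].isdigit() and parts[1] != ""


def check_qrel_line(line):
    """[query id] 0 [passage id] 1, separated by whitespace."""
    parts = line.split()
    return (len(parts) == 4 and parts[0].isdigit() and parts[1] == "0"
            and bool(DOC_ID.match(parts[2])) and parts[3] == "1")
```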
Change to the retriever folder for the following steps:
cd ./retrievers/DANCE/
Pre-processing:
python data/custom_data.py \
--data_dir [raw tsv data folder] \
--out_data_dir [processed data folder] \
--model_type rdot_nll \
--model_name_or_path roberta-base \
--max_seq_length 512 \
--data_type 0 \
--doc_collection_tsv [doc text path] \
--save_prefix [query saving name] \
--query_collection_tsv [query text path] \
--qrel_tsv [optional qrel tsv]
You can specify a PyTorch checkpoint and use it to infer the embeddings of the documents or queries.
python -u -m torch.distributed.launch \
--nproc_per_node=[num GPU] --master_port=57135 ./drivers/run_ann_emb_inference.py \
--model_type rdot_nll \
--inference_type query --save_prefix [prefix of the query preprocessed file. eg., train] \
--split_ann_search --gpu_index \
--init_model_dir [checkpoint folder] \
--data_dir [processed data folder] \
--training_dir [task folder] \
--output_dir [task folder]/ann_data/ \
--cache_dir [task folder]/ann_data/cache/ \
--max_query_length 64 --max_seq_length 512 --per_gpu_eval_batch_size 256
With the parameters --inference_type query --save_prefix [prefix of the query preprocessed file. eg., train], you can run inference on different sets of queries. With --inference_type document and without --save_prefix, you can infer the document embeddings.
Next, you can use the following command to produce TREC-format retrieval results for different query sets. Note that the embedding files are matched by
emb_file_pattern = os.path.join(emb_dir,f'{emb_prefix}{checkpoint_postfix}__emb_p__data_obj_*.pb')
so check how your embeddings are saved and specify the checkpoint_postfix that lets the program load them.
python ./evaluation/retrieval.py \
--trec_path [output trec path] \
--emb_dir [folder with the dumped query and passage/document embeddings, same as --output_dir above] \
--checkpoint_postfix [checkpoint custom name] \
--processed_data_dir [processed data folder] \
--queryset_prefix [query saving name] \
--gpu_index True --topN 100 --data_type 0
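To see which saved files a given checkpoint_postfix would pick up, the file pattern mentioned before the command can be tried against a directory listing. This is a sketch only: `select_embedding_files` is a hypothetical helper, the example prefix and postfix values are made up, and matching with `fnmatch` is our stand-in for however the retrieval script expands the pattern.

```python
import fnmatch


def select_embedding_files(filenames, emb_prefix, checkpoint_postfix):
    """Return the file names that match the embedding file pattern."""
    pattern = f"{emb_prefix}{checkpoint_postfix}__emb_p__data_obj_*.pb"
    return sorted(name for name in filenames
                  if fnmatch.fnmatchcase(name, pattern))


# Made-up file names, used only to illustrate the matching.
example_names = [
    "passage_ckpt__emb_p__data_obj_0.pb",
    "passage_ckpt__emb_p__data_obj_1.pb",
    "passage_other__emb_p__data_obj_0.pb",
    "passage_ckpt__embid_p__data_obj_0.pb",
]
matched = select_embedding_files(example_names, "passage_", "ckpt")
```

Passing `os.listdir(emb_dir)` as `filenames` gives a quick check that the postfix you plan to use actually matches your saved embeddings.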
Now you can calculate different metrics and play with the trec files for further reranking experiments with OpenMatch.
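As one example of such a metric, MRR@100 can be computed from a TREC run file and TREC qrels along the following lines. This is a minimal sketch with hypothetical function names. It averages over the queries that have relevance labels, which is an assumption and may differ in detail from the official evaluation script.

```python
def read_trec_run(lines):
    """Parse 'qid Q0 docid rank score tag' lines into qid -> ranked doc ids."""
    scored = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, _rank, score, _tag = parts
        scored.setdefault(qid, []).append((float(score), docid))
    # Rank by descending score; ties keep their order in the file.
    return {qid: [d for _, d in sorted(docs, key=lambda x: -x[0])]
            for qid, docs in scored.items()}


def read_qrels(lines):
    """Parse 'qid 0 docid label' lines into qid -> set of relevant doc ids."""
    qrels = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        qid, _iteration, docid, label = parts
        if int(label) > 0:
            qrels.setdefault(qid, set()).add(docid)
    return qrels


def mrr_at_k(run, qrels, k=100):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels) if qrels else 0.0
```

Both readers accept any iterable of lines, so an open file handle can be passed directly.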
Results¶
Results of the runs we submitted.
Retriever | Reranker | Coor-Ascent | dev | eval |
---|---|---|---|---|
ANCE FirstP | - | - | 0.373 | 0.334 |
ANCE MaxP | - | - | 0.383 | 0.342 |
ANCE FirstP+BM25 | BERT Base FirstP | - | 0.407 | - |
ANCE FirstP+BM25 | BERT Base FirstP | + | 0.431 | 0.380 |
ANCE MaxP | BERT Base MaxP | - | 0.409 | - |
ANCE MaxP | BERT Base MaxP | + | 0.432 | 0.391 |