OpenMatch Guidelines

Train

For training BERT-like models:

```bash
sh train_bert.sh
```

For training edrm, cknrm, knrm or tk:

```bash
sh train.sh
```

Options

```
-task             'ranking': pairwise training, 'classification': query-document classification.
-model            'bert', 'tk', 'edrm', 'cknrm' or 'knrm'.
-reinfoselect     whether to use ReInfoSelect.
-reset            whether to reset the model, used in the ReInfoSelect setting.
-train            path to the training dataset.
-max_input        maximum number of input instances.
-save             path for saving model checkpoints.
-dev              path to the dev dataset.
-qrels            path to the qrels file.
-vocab            path to GloVe or a customized vocab.
-ent_vocab        path to the entity vocab, for edrm.
-pretrain         path to the pretrained BERT model.
-checkpoint       path to a checkpoint.
-res              path for saving results.
-metric           the evaluation metric, e.g. ndcg_cut_10.
-mode             use 'cls' or 'pooling' as the BERT representation.
-n_kernels        number of kernels, for tk, edrm, cknrm or knrm.
-max_query_len    max length of query tokens.
-max_doc_len      max length of document tokens.
-maxp             use BERT MaxP: split the document into passages and score it by the best one.
-epoch            number of training epochs.
-batch_size       batch size.
-lr               learning rate.
-n_warmup_steps   number of warmup steps.
-eval_every       evaluate on the dev data every this many steps, e.g. 1000.
```
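
For orientation, train.sh presumably just forwards these options to a Python entry point. A minimal sketch for a knrm run, assuming a train.py script; the entry-point name and all paths below are illustrative, not fixed by this guide:

```bash
# Sketch only: the entry-point name and every path is an assumption.
CUDA_VISIBLE_DEVICES=0 python train.py \
    -task ranking \
    -model knrm \
    -train ./data/train.jsonl \
    -dev ./data/dev.jsonl \
    -qrels ./data/qrels \
    -vocab ./data/glove.6B.300d.txt \
    -res ./results/knrm.trec \
    -save ./checkpoints/knrm.bin \
    -metric ndcg_cut_10 \
    -n_kernels 21 \
    -max_query_len 10 \
    -max_doc_len 150 \
    -epoch 1 \
    -batch_size 32 \
    -lr 1e-3 \
    -n_warmup_steps 1000 \
    -eval_every 1000
```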

Inference

For BERT-like model inference:

```bash
sh inference_bert.sh
```

For edrm, cknrm, knrm or tk inference:

```bash
sh inference.sh
```

Options

```
-task             'ranking': pairwise, 'classification': query-document classification.
-model            'bert', 'tk', 'edrm', 'cknrm' or 'knrm'.
-max_input        maximum number of input instances.
-test             path to the test dataset.
-vocab            path to GloVe or a customized vocab.
-ent_vocab        path to the entity vocab, for edrm.
-pretrain         path to the pretrained BERT model.
-checkpoint       path to the checkpoint to load.
-res              path for saving results.
-n_kernels        number of kernels, for tk, edrm, cknrm or knrm.
-max_query_len    max length of query tokens.
-max_doc_len      max length of document tokens.
-batch_size       batch size.
```
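
Likewise for inference; a hedged sketch, again assuming an inference.py entry point and purely illustrative paths:

```bash
# Sketch only: entry-point name and paths are assumptions.
CUDA_VISIBLE_DEVICES=0 python inference.py \
    -task ranking \
    -model knrm \
    -max_input 1280000 \
    -test ./data/test.jsonl \
    -vocab ./data/glove.6B.300d.txt \
    -checkpoint ./checkpoints/knrm.bin \
    -res ./results/knrm.trec \
    -n_kernels 21 \
    -max_query_len 10 \
    -max_doc_len 150 \
    -batch_size 32
```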

Data Format

Ranking Task

For bert, tk, cknrm or knrm:

| file  | format |
| ----- | ------ |
| train | `{"query": str, "doc_pos": str, "doc_neg": str}` |
| dev   | `{"query": str, "doc": str, "label": int, "query_id": str, "doc_id": str, "retrieval_score": float}` |
| test  | `{"query": str, "doc": str, "query_id": str, "doc_id": str, "retrieval_score": float}` |
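
For concreteness, single lines of the train and dev files might look as follows; the queries and documents are invented purely for illustration:

```bash
# Hypothetical JSONL lines in the ranking format; all content is invented.
cat > train.jsonl <<'EOF'
{"query": "what causes rainbows", "doc_pos": "A rainbow appears when sunlight is refracted by water droplets.", "doc_neg": "The city council met on Tuesday to discuss the budget."}
EOF
cat > dev.jsonl <<'EOF'
{"query": "what causes rainbows", "doc": "A rainbow appears when sunlight is refracted by water droplets.", "label": 1, "query_id": "q1", "doc_id": "d7", "retrieval_score": 12.3}
EOF
```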

For edrm, each line additionally contains entity fields; the leading + below marks fields added on top of the formats above:

| file  | format |
| ----- | ------ |
| train | +`{"query_ent": list, "doc_pos_ent": list, "doc_neg_ent": list, "query_des": list, "doc_pos_des": list, "doc_neg_des": list}` |
| dev   | +`{"query_ent": list, "doc_ent": list, "query_des": list, "doc_des": list}` |
| test  | +`{"query_ent": list, "doc_ent": list, "query_des": list, "doc_des": list}` |

query_ent and doc_ent are lists of entities relevant to the query or document, and query_des and doc_des are lists of the corresponding entity descriptions.
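
A hypothetical edrm dev line, combining the base fields with the entity fields (the entities and descriptions are invented):

```bash
# Hypothetical edrm dev line; entity annotations are invented.
cat > dev_edrm.jsonl <<'EOF'
{"query": "what causes rainbows", "doc": "A rainbow appears when sunlight is refracted by water droplets.", "label": 1, "query_id": "q1", "doc_id": "d7", "retrieval_score": 12.3, "query_ent": ["rainbow"], "doc_ent": ["rainbow", "refraction"], "query_des": ["an optical phenomenon caused by refraction of light in water droplets"], "doc_des": ["an optical phenomenon caused by refraction of light in water droplets", "the change in direction of a wave passing between media"]}
EOF
```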

Classification Task

Only the train file format differs from the ranking task.

For bert, tk, cknrm or knrm:

| file  | format |
| ----- | ------ |
| train | `{"query": str, "doc": str, "label": int}` |
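
For example, a hypothetical classification train line:

```bash
# Hypothetical classification train line; content is invented.
echo '{"query": "what causes rainbows", "doc": "A rainbow appears when sunlight is refracted by water droplets.", "label": 1}' > train_cls.jsonl
```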

For edrm:

| file  | format |
| ----- | ------ |
| train | +`{"query_ent": list, "doc_ent": list, "query_des": list, "doc_des": list}` |

Others

Alternatively, the dev and test files can be set as:

```bash
-dev queries={path to queries},docs={path to docs},qrels={path to qrels},trec={path to trec}
-test queries={path to queries},docs={path to docs},trec={path to trec}
```

| file    | format |
| ------- | ------ |
| queries | `{"query_id": str, "query": str}` |
| docs    | `{"doc_id": str, "doc": str}` |
| qrels   | `query_id iteration doc_id label` |
| trec    | `query_id Q0 doc_id rank score run-tag` |
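
Both qrels and trec are whitespace-separated plain-text files following the standard TREC conventions; hypothetical lines:

```bash
# Hypothetical qrels and trec lines in the standard TREC layouts.
echo "q1 0 d7 1"              >> qrels      # query_id iteration doc_id label
echo "q1 Q0 d7 1 12.3 my-run" >> run.trec   # query_id Q0 doc_id rank score run-tag
```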

For edrm, the queries and docs are a little different:

| file    | format |
| ------- | ------ |
| queries | +`{"query_ent": list, "query_des": list}` |
| docs    | +`{"doc_ent": list, "doc_des": list}` |

Other BERT-like models, e.g. ELECTRA or SciBERT, are also available; you only need to change the paths to the vocab and the pretrained model.
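
For instance, to switch to SciBERT you would change only the vocab and pretrain flags. The Hugging Face identifier below is an assumption (it stands in for a local path if the scripts load models via transformers' from_pretrained):

```bash
# Sketch: swapping SciBERT into the BERT training setup.
# The hub identifier is an assumption; a local path also works.
CUDA_VISIBLE_DEVICES=0 python train.py \
    -task ranking \
    -model bert \
    -train ./data/train.jsonl \
    -dev ./data/dev.jsonl \
    -qrels ./data/qrels \
    -vocab allenai/scibert_scivocab_uncased \
    -pretrain allenai/scibert_scivocab_uncased \
    -res ./results/scibert.trec \
    -save ./checkpoints/scibert.bin \
    -metric ndcg_cut_10 \
    -max_query_len 32 \
    -max_doc_len 256 \
    -epoch 1 \
    -batch_size 8 \
    -lr 2e-5 \
    -eval_every 1000
```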

You can also train BERT with a masked language modeling (MLM) objective using train_bertmlm.py. The train file format is as follows:

| file  | format |
| ----- | ------ |
| train | `{"doc": str}` |
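
A hypothetical MLM train line:

```bash
# Hypothetical MLM training line; the text is invented.
echo '{"doc": "A rainbow appears when sunlight is refracted by water droplets."}' > train_mlm.jsonl
```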

If you want to concatenate the neural features with retrieval scores (e.g. SDM or BM25) and run coordinate ascent (coor-ascent), generate a feature file with gen_feature.py and run:

```bash
sh coor_ascent.sh
```
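
Coordinate-ascent implementations such as RankLib read features in an SVMlight-style layout; assuming gen_feature.py emits that layout (an assumption worth checking against the script), a hypothetical two-feature file (neural score, BM25) could look like:

```bash
# Hypothetical RankLib-style feature lines: label qid:<id> <fid>:<value> ... # doc_id
cat > features.txt <<'EOF'
1 qid:1 1:0.83 2:12.3 # d7
0 qid:1 1:0.21 2:7.9 # d9
EOF
```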