From cfdb103bd8e9d993a97dfde2433292368b7b8c73 Mon Sep 17 00:00:00 2001
From: Xiao
Date: Sun, 11 Feb 2024 12:26:49 +0000
Subject: [PATCH] Update README.md

---
 README.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 2518f12..3794565 100644
--- a/README.md
+++ b/README.md
@@ -215,11 +215,6 @@ print(model.compute_score(sentence_pairs,
 
 We compare BGE-M3 with some popular methods, including BM25, openAI embedding, etc.
 
-We utilized Pyserini to implement BM25, and the test results can be reproduced by this [script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#bm25-baseline).
-To make the BM25 and BGE-M3 more comparable, in the experiment,
-BM25 used the same tokenizer as BGE-M3 (i.e., the tokenizer of XLM-Roberta).
-Using the same vocabulary can also ensure that both approaches have the same retrieval latency.
-
 - Multilingual (Miracl dataset)
 
@@ -242,6 +237,12 @@ Using the same vocabulary can also ensure that both approaches have the same ret
 
 - NarritiveQA:
 ![avatar](./imgs/nqa.jpg)
 
+- BM25
+
+We used Pyserini to implement BM25; the test results can be reproduced with this [script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#bm25-baseline).
+
+![avatar](./imgs/bm25.jpg)
+
 ## Training
 - Self-knowledge Distillation: combining multiple outputs from different
@@ -259,7 +260,7 @@ Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details.
 
 ## Acknowledgement
 
 Thanks the authors of open-sourced datasets, including Miracl, MKQA, NarritiveQA, etc.
-Thanks the open-sourced libraries like [Tevatron](https://github.com/texttron/tevatron), [pyserial](https://github.com/pyserial/pyserial).
+Thanks to the open-source libraries [Tevatron](https://github.com/texttron/tevatron) and [Pyserini](https://github.com/castorini/pyserini).
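For reference, here is a minimal sketch of the kind of BM25 retrieval Pyserini supports, since the patch points readers to it. It assumes Pyserini's prebuilt `miracl-v1.0-en` index; the query and `k` are illustrative, and the benchmark's exact settings, including tokenization, are in the linked MLDR script.

```python
# Minimal BM25 retrieval with Pyserini (illustrative, not the benchmark pipeline).
from pyserini.search.lucene import LuceneSearcher

# Download (on first use) and open a prebuilt BM25 index for MIRACL English.
searcher = LuceneSearcher.from_prebuilt_index('miracl-v1.0-en')
searcher.set_language('en')  # select the matching Lucene analyzer

# Retrieve the top-10 passages for a query and print docid and BM25 score.
hits = searcher.search('What is dense retrieval?', k=10)
for rank, hit in enumerate(hits, start=1):
    print(f'{rank:2d}  {hit.docid:24}  {hit.score:.4f}')
```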