Update README.md

This commit is contained in:
Xiao 2024-02-11 12:26:49 +00:00 committed by system
parent aa47896ffa
commit cfdb103bd8

@@ -215,11 +215,6 @@ print(model.compute_score(sentence_pairs,
We compare BGE-M3 with some popular methods, including BM25, OpenAI embeddings, etc.
We utilized Pyserini to implement BM25, and the test results can be reproduced by this [script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#bm25-baseline).
To make BM25 and BGE-M3 more comparable, in our experiments
BM25 used the same tokenizer as BGE-M3 (i.e., the tokenizer of XLM-RoBERTa).
Using the same vocabulary also ensures that both approaches have the same retrieval latency.
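The setup above can be sketched in code. This is a minimal, self-contained BM25 scorer (not the Pyserini implementation used for the reported results); the `tokenize` function is a whitespace stand-in for the shared XLM-RoBERTa tokenizer, which in practice you would load via Hugging Face `AutoTokenizer`:

```python
import math
from collections import Counter

# Stand-in tokenizer: the experiments share the XLM-RoBERTa tokenizer
# between BM25 and BGE-M3; whitespace splitting keeps this sketch
# self-contained and dependency-free.
def tokenize(text):
    return text.lower().split()

def bm25_scores(query, corpus, k1=1.2, b=0.75):
    """Score every document in `corpus` against `query` with BM25."""
    docs = [tokenize(d) for d in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)  # average doc length
    n = len(docs)
    df = Counter()                                 # document frequency
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                            # term frequency
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

Because both systems tokenize the query and documents with the same vocabulary, the lexical-matching step costs the same in either pipeline, which is what makes the latency comparison fair.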
- Multilingual (Miracl dataset)
@@ -242,6 +237,12 @@ Using the same vocabulary can also ensure that both approaches have the same ret
- NarrativeQA:
![avatar](./imgs/nqa.jpg)
- BM25
We utilized Pyserini to implement BM25, and the test results can be reproduced by this [script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#bm25-baseline).
![avatar](./imgs/bm25.jpg)
## Training
- Self-knowledge Distillation: combining multiple outputs from different
@@ -259,7 +260,7 @@ Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details.
## Acknowledgement
Thanks to the authors of the open-source datasets, including MIRACL, MKQA, NarrativeQA, etc.
-Thanks to the open-source libraries like [Tevatron](https://github.com/texttron/tevatron), [pyserial](https://github.com/pyserial/pyserial).
+Thanks to the open-source libraries like [Tevatron](https://github.com/texttron/tevatron), [Pyserini](https://github.com/castorini/pyserini).