Update README.md

Xiao 2024-02-06 08:47:26 +00:00 committed by system
parent f11a3f18ef
commit 5cedd82596

@@ -27,7 +27,7 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
## News:
- 2/6/2024: We release [MLDR](https://huggingface.co/datasets/Shitao/MLDR), a long document retrieval dataset covering 13 languages, along with its [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR) (a loading sketch follows this list).
- 2/1/2024: **Thanks to Vespa for the excellent tool.** You can easily use the multiple modes of BGE-M3 by following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) (see the second sketch below).
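As a quick, hedged illustration of using the new dataset: MLDR can be pulled from the Hugging Face Hub with the `datasets` library. The language config `"en"`, the `"test"` split, and the record layout are assumptions for this sketch; check the dataset card for the actual configs and fields.

```python
# Minimal sketch: load one language subset of MLDR from the Hugging Face Hub.
# The config name "en" and the split "test" are assumptions for illustration,
# not a confirmed schema; see the dataset card for the real layout.
from datasets import load_dataset

mldr = load_dataset("Shitao/MLDR", "en", split="test")
print(mldr[0])  # inspect one record (query / long-document fields)
```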
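For the "multiple modes" mentioned above (dense, sparse/lexical, and multi-vector ColBERT outputs), here is a minimal sketch using the `BGEM3FlagModel` interface from this repo; the flag and output-key names below are taken on that assumption rather than from this section.

```python
# Minimal sketch of BGE-M3's three output modes via FlagEmbedding.
# Assumes the BGEM3FlagModel interface from this repo; the flags and
# output keys below are assumptions for illustration.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
out = model.encode(
    ["What is BGE M3?"],
    return_dense=True,         # dense sentence embedding
    return_sparse=True,        # lexical (token-weight) representation
    return_colbert_vecs=True,  # multi-vector (ColBERT-style) representation
)
print(out["dense_vecs"].shape)
```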
@@ -243,22 +243,29 @@ The small-batch strategy is simple but effective, and can also be used to fine-tune
- MCLS: A simple method to improve performance on long text without fine-tuning (a sketch follows below).
If you do not have enough resources to fine-tune the model on long text, this method is useful.
Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details.
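A minimal sketch of the MCLS idea, assuming a plain `transformers` encoder and an illustrative chunk size of 256 tokens (this is not the official implementation; see the report above for the exact recipe): insert a CLS token at the start of every fixed-size chunk of a long input, then average the hidden states at those CLS positions to form one embedding.

```python
# Hedged sketch of MCLS (multi-CLS) pooling for long text, no fine-tuning.
# The chunk size of 256 is an assumption for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")
model.eval()

def mcls_embed(text: str, chunk: int = 256) -> torch.Tensor:
    # Tokenize without special tokens so we control CLS placement ourselves.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    with_cls = []
    for i in range(0, len(ids), chunk):
        with_cls.append(cls_id)            # one CLS per chunk
        with_cls.extend(ids[i:i + chunk])
    with_cls.append(sep_id)
    input_ids = torch.tensor([with_cls])
    with torch.no_grad():
        hidden = model(input_ids=input_ids).last_hidden_state[0]
    cls_pos = [i for i, t in enumerate(with_cls) if t == cls_id]
    # Average the CLS hidden states to get one embedding for the long text.
    emb = hidden[cls_pos].mean(dim=0)
    return torch.nn.functional.normalize(emb, dim=-1)
```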
**The fine-tuning code and datasets will be open-sourced in the near future.**
## Acknowledgement
Thanks to the authors of the open-sourced datasets, including MIRACL, MKQA, NarrativeQA, etc.
Thanks to the open-sourced libraries like [Tevatron](https://github.com/texttron/tevatron) and [pyserini](https://github.com/castorini/pyserini).
## Citation
If you find this repository useful, please consider giving it a star :star: and a citation:
```
@misc{bge-m3,
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
year={2024},
eprint={2402.03216},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```