Update README.md

Xiao 2024-02-06 08:18:30 +00:00 committed by system
parent 8017dbb41d
commit f11a3f18ef

@@ -9,7 +9,8 @@ license: mit
For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding
# BGE-M3 ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval (see the encoding sketch after this list).
- Multi-Linguality: It can support more than 100 working languages.
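
As a quick illustration of the Multi-Functionality point above, here is a minimal sketch of encoding with all three modes through the [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding) package; the keyword arguments follow the usage documented for this model, but treat the exact names as version-dependent.

```python
# Minimal sketch: one encode() call returning all three representations.
# Requires: pip install -U FlagEmbedding (API as documented for bge-m3;
# argument names may vary across versions).
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # fp16 for faster encoding

output = model.encode(
    ["What is BGE M3?"],
    return_dense=True,         # 1024-dim dense vector
    return_sparse=True,        # per-token lexical weights (sparse)
    return_colbert_vecs=True,  # per-token multi-vectors (ColBERT-style)
)
print(output['dense_vecs'].shape)  # e.g. (1, 1024)
print(output['lexical_weights'])   # token -> weight mapping
```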
@@ -26,12 +27,14 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
## News:
- 2/6/2024: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR), a long document retrieval dataset covering 13 languages.
- 2/1/2024: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb)
## Specs
- Model
| Model Name | Dimension | Sequence Length | Introduction |
|:----:|:---:|:---:|:---:|
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised |
@@ -48,7 +51,6 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
| [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset, covering 13 languages |
## FAQ
**1. Introduction for different retrieval methods**
@@ -57,7 +59,6 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
- Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832); see the scoring sketch below.
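
To make the three methods concrete, here is a small sketch that scores one query/passage pair under each mode. `compute_lexical_matching_score` and `colbert_score` are taken from the bge-m3 usage docs; treat them as assumptions if your FlagEmbedding version differs.

```python
# Sketch: one query/passage pair scored under dense, sparse, and multi-vector
# modes (helper names follow the bge-m3 usage docs; may vary by version).
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

def encode_all(texts):
    # Return dense, sparse, and ColBERT representations in one pass.
    return model.encode(texts, return_dense=True,
                        return_sparse=True, return_colbert_vecs=True)

q = encode_all(["What is BGE M3?"])
d = encode_all(["BGE M3 is a multilingual, multi-functional embedding model."])

dense_score = q['dense_vecs'][0] @ d['dense_vecs'][0]      # inner product of dense vecs
sparse_score = model.compute_lexical_matching_score(       # overlap of token weights
    q['lexical_weights'][0], d['lexical_weights'][0])
colbert_score = model.colbert_score(q['colbert_vecs'][0],  # MaxSim over token vectors
                                    d['colbert_vecs'][0])
print(dense_score, sparse_score, colbert_score)
```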
**2. Comparison with BGE-v1.5 and other monolingual models**
BGE-M3 is a multilingual model, and its ability in monolingual embedding retrieval may not surpass models specifically designed for single languages.
@@ -77,6 +78,11 @@ For sparse retrieval methods, most open-source libraries currently do not suppor
Contributions from the community are welcome.
In our experiments, we use [Pyserini](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse) and Faiss to do hybrid retrieval.
**Now you can try the hybrid mode of BGE-M3 in [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb). Thanks @jobergum.**
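
Until your retrieval stack supports all modes natively, a simple client-side option is to re-rank candidates with a weighted sum of the three scores via `compute_score` (shown later in this README); the weights below are illustrative, not tuned values.

```python
# Sketch: hybrid (weighted-sum) scoring with compute_score; the
# weights_for_different_modes order (dense, sparse, multi-vector) and the
# 0.4/0.2/0.4 weights are illustrative assumptions, not tuned values.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
sentence_pairs = [["What is BGE M3?",
                   "BGE M3 is a multilingual, multi-functional embedding model."]]
scores = model.compute_score(
    sentence_pairs,
    weights_for_different_modes=[0.4, 0.2, 0.4],
)
print(scores)  # per-mode scores plus weighted combinations
```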
**4. How to fine-tune bge-M3 model?**
You can follow the common practice in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
@@ -218,10 +224,10 @@ print(model.compute_score(sentence_pairs,
- Long Document Retrieval
- MLDR:
![avatar](./imgs/long.jpg)
Please note that [MLDR](https://huggingface.co/datasets/Shitao/MLDR) is a document retrieval dataset we constructed via LLM,
covering 13 languages and including test, validation, and training sets.
We utilized the training set from MLDR to enhance the model's long document retrieval capabilities. We utilized the training set from MLDR to enhance the model's long document retrieval capabilities.
Therefore, comparing baselines with `Dense w.o.long` (fine-tuning without the long document dataset) is more equitable.
Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets.
We believe that this data will be helpful for the open-source community in training document retrieval models.
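
For convenience, here is a sketch of loading MLDR with the Hugging Face `datasets` library; the language configuration name (`en`) and the record fields are assumptions based on the dataset card.

```python
# Sketch: load one language split of MLDR; the config name "en" and the
# record fields are assumptions from the dataset card.
from datasets import load_dataset

mldr = load_dataset("Shitao/MLDR", "en", split="test")
print(len(mldr), mldr[0].keys())
```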