Compare commits

...

10 Commits

Author SHA1 Message Date
Xiao
5617a9f61b
Update MIRACL evaluation results of BGE-M3 (#67)
- Update MIRACL evaluation results of BGE-M3 (be9f7f99731dba86ab44550821489908ef3b4baa)


Co-authored-by: Jianlv Chen <hanhainebula@users.noreply.huggingface.co>
2024-07-03 14:50:10 +00:00
Xiao
e1f456861c
Update MIRACL evaluation results (#68)
- Update MIRACL evaluation results (0c6f0d0ea8f284b9070c3ffaa50677440943f984)


Co-authored-by: Jianlv Chen <hanhainebula@users.noreply.huggingface.co>
2024-07-02 15:13:57 +00:00
Xiao
babcf60cae
Adding ONNX file of this model (#44)
- Adding ONNX file of this model (6a3fd5fa10d7c4e4fabeace29e36b2bfa76d45d5)


Co-authored-by: Yash <yashvardhan7@users.noreply.huggingface.co>
2024-04-13 13:01:41 +00:00
Xiao
c400e3d69d
Delete model.safetensors 2024-04-02 14:01:45 +00:00
Xiao
7b44caba5c
Delete bge-m3-with-deocder 2024-04-02 13:03:35 +00:00
Xiao
80087bf0b7
Upload bge-m3-with-deocder/pytorch_model.bin with huggingface_hub 2024-04-02 12:53:42 +00:00
Xiao
50f9396f75
Update tokenizer_config.json 2024-03-26 02:54:48 +00:00
Xiao
5a212480c9
Update README.md 2024-03-20 14:21:50 +00:00
Xiao
cab5a405d2
Update README.md 2024-03-19 13:32:16 +00:00
Xiao
ecbfce391c
Upload others.webp 2024-03-08 04:46:52 +00:00
15 changed files with 183 additions and 74 deletions

1
.gitattributes vendored

@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text *.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text *tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text tokenizer.json filter=lfs diff=lfs merge=lfs -text
onnx/model.onnx_data filter=lfs diff=lfs merge=lfs -text

@ -17,7 +17,31 @@ In this project, we introduce BGE-M3, which is distinguished for its versatility
- Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. - Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
**Some suggestions for retrieval pipeline in RAG**
We recommend to use the following pipeline: hybrid retrieval + re-ranking.
- Hybrid retrieval leverages the strengths of various methods, offering higher accuracy and stronger generalization capabilities.
A classic example: using both embedding retrieval and the BM25 algorithm.
Now, you can try to use BGE-M3, which supports both embedding and sparse retrieval.
This allows you to obtain token weights (similar to the BM25) without any additional cost when generate dense embeddings.
To use hybrid retrieval, you can refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb
) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
- As cross-encoder models, re-ranker demonstrates higher accuracy than bi-encoder embedding model.
Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker), [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker)) after retrieval can further filter the selected text.
## News: ## News:
- 2024/7/1: **We update the MIRACL evaluation results of BGE-M3**. To reproduce the new results, you can refer to: [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr). We have also updated our [paper](https://arxiv.org/pdf/2402.03216) on arXiv.
<details>
<summary> Details </summary>
The previous test results were lower because we mistakenly removed the passages that have the same id as the query from the search results. After correcting this mistake, the overall performance of BGE-M3 on MIRACL is higher than the previous results, but the experimental conclusion remains unchanged. The other results are not affected by this mistake. To reproduce the previous lower results, you need to add the `--remove-query` parameter when using `pyserini.search.faiss` or `pyserini.search.lucene` to search the passages.
</details>
- 2024/3/20: **Thanks Milvus team!** Now you can use hybrid retrieval of bge-m3 in Milvus: [pymilvus/examples
/hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
- 2024/3/8: **Thanks for the [experimental results](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) from @[Yannael](https://huggingface.co/Yannael). In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as OpenAI.** - 2024/3/8: **Thanks for the [experimental results](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) from @[Yannael](https://huggingface.co/Yannael). In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as OpenAI.**
- 2024/3/2: Release unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data) - 2024/3/2: Release unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data)
- 2024/2/6: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR). - 2024/2/6: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
@ -54,47 +78,25 @@ In this project, we introduce BGE-M3, which is distinguished for its versatility
- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text. e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720) - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text. e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
- Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832). - Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).
**2. Comparison with BGE-v1.5 and other monolingual models**
BGE-M3 is a multilingual model, and its ability in monolingual embedding retrieval may not surpass models specifically designed for single languages. **2. How to use BGE-M3 in other projects?**
However, we still recommend trying BGE-M3 because of its versatility (support for multiple languages and long texts).
Moreover, it can simultaneously generate multiple representations, and using them together can enhance accuracy and generalization,
unlike most existing models that can only perform dense retrieval.
In the open-source community, there are many excellent models (e.g., jina-embedding, colbert, e5, etc),
and users can choose a model that suits their specific needs based on practical considerations,
such as whether to require multilingual or cross-language support, and whether to process long texts.
**3. How to use BGE-M3 in other projects?**
For embedding retrieval, you can employ the BGE-M3 model using the same approach as BGE. For embedding retrieval, you can employ the BGE-M3 model using the same approach as BGE.
The only difference is that the BGE-M3 model no longer requires adding instructions to the queries. The only difference is that the BGE-M3 model no longer requires adding instructions to the queries.
For sparse retrieval methods, most open-source libraries currently do not support direct utilization of the BGE-M3 model.
Contributions from the community are welcome. For hybrid retrieval, you can use [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb
) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
In our experiments, we use [Pyserini](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse) and Faiss to do hybrid retrieval. **3. How to fine-tune bge-M3 model?**
**Now you can ou can try the hybrid mode of BGE-M3 in [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb
). Thanks @jobergum.**
**4. How to fine-tune bge-M3 model?**
You can follow the common in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) You can follow the common in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
to fine-tune the dense embedding. to fine-tune the dense embedding.
If you want to fine-tune all embedding function of m3, you can refer to the [unified_fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) If you want to fine-tune all embedding function of m3 (dense, sparse and colbert), you can refer to the [unified_fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune)
**5. Some suggestions for retrieval pipeline in RAG**
We recommend to use following pipeline: hybrid retrieval + re-ranking.
- Hybrid retrieval leverages the strengths of various methods, offering higher accuracy and stronger generalization capabilities.
A classic example: using both embedding retrieval and the BM25 algorithm.
Now, you can try to use BGE-M3, which supports both embedding and sparse retrieval.
This allows you to obtain token weights (similar to the BM25) without any additional cost when generate dense embeddings.
- As cross-encoder models, re-ranker demonstrates higher accuracy than bi-encoder embedding model.
Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker), [cohere-reranker](https://txt.cohere.com/rerank/)) after retrieval can further filter the selected text.
@ -218,6 +220,8 @@ print(model.compute_score(sentence_pairs,
## Evaluation ## Evaluation
We provide the evaluation script for [MKQA](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MKQA) and [MLDR](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR)
### Benchmarks from the open-source community ### Benchmarks from the open-source community
![avatar](./imgs/others.webp) ![avatar](./imgs/others.webp)
The BGE-M3 model emerged as the top performer on this benchmark (OAI is short for OpenAI). The BGE-M3 model emerged as the top performer on this benchmark (OAI is short for OpenAI).
@ -266,7 +270,10 @@ The small-batch strategy is simple but effective, which also can used to fine-tu
- MCLS: A simple method to improve the performance on long text without fine-tuning. - MCLS: A simple method to improve the performance on long text without fine-tuning.
If you have no enough resource to fine-tuning model with long text, the method is useful. If you have no enough resource to fine-tuning model with long text, the method is useful.
Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details. Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details.

Binary file not shown.

Before

Width:  |  Height:  |  Size: 67 KiB

After

Width:  |  Height:  |  Size: 129 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 437 KiB

After

Width:  |  Height:  |  Size: 563 KiB

BIN
imgs/others.webp Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 20 KiB

BIN
model.safetensors (Stored with Git LFS)

Binary file not shown.

BIN
onnx/Constant_7_attr__value Normal file

Binary file not shown.

28
onnx/config.json Normal file

@ -0,0 +1,28 @@
{
"_name_or_path": "BAAI/bge-m3",
"architectures": [
"XLMRobertaModel"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 8194,
"model_type": "xlm-roberta",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"output_past": true,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.37.2",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 250002
}

BIN
onnx/model.onnx (Stored with Git LFS) Normal file

Binary file not shown.

BIN
onnx/model.onnx_data (Stored with Git LFS) Normal file

Binary file not shown.

BIN
onnx/sentencepiece.bpe.model (Stored with Git LFS) Normal file

Binary file not shown.

@ -0,0 +1,51 @@
{
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"cls_token": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"mask_token": {
"content": "<mask>",
"lstrip": true,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<pad>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"sep_token": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

BIN
onnx/tokenizer.json (Stored with Git LFS) Normal file

Binary file not shown.

@ -0,0 +1,55 @@
{
"added_tokens_decoder": {
"0": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<pad>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"250001": {
"content": "<mask>",
"lstrip": true,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"bos_token": "<s>",
"clean_up_tokenization_spaces": true,
"cls_token": "<s>",
"eos_token": "</s>",
"mask_token": "<mask>",
"model_max_length": 8192,
"pad_token": "<pad>",
"sep_token": "</s>",
"sp_model_kwargs": {},
"tokenizer_class": "XLMRobertaTokenizer",
"unk_token": "<unk>"
}

@ -1,46 +1,4 @@
{ {
"added_tokens_decoder": {
"0": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<pad>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"250001": {
"content": "<mask>",
"lstrip": true,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"bos_token": "<s>", "bos_token": "<s>",
"clean_up_tokenization_spaces": true, "clean_up_tokenization_spaces": true,
"cls_token": "<s>", "cls_token": "<s>",