From e306d6093313a23e6a2de4b96a84b16e4b1529cf Mon Sep 17 00:00:00 2001
From: Xiao <Shitao@users.noreply.huggingface.co>
Date: Mon, 18 Mar 2024 06:36:49 +0000
Subject: [PATCH] Upload README.md with huggingface_hub

---
 README.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/README.md b/README.md
index e5f0e70..c39437a 100644
--- a/README.md
+++ b/README.md
@@ -245,6 +245,20 @@ with torch.no_grad():
 
 ## Fine-tune
 
+### Data Format
+
+Training data should be a JSON Lines (`.jsonl`) file, where each line is a JSON object like this:
+
+```
+{"query": str, "pos": List[str], "neg": List[str], "prompt": str}
+```
+
+`query` is the query text, `pos` is a list of positive texts, `neg` is a list of negative texts, and `prompt` describes the relationship between the query and the texts. If a query has no negative texts, you can randomly sample some from the entire corpus to use as negatives.
+
+See [toy_finetune_data.jsonl](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker/toy_finetune_data.jsonl) for a toy data file.
+
+### Train
+
 You can fine-tune the reranker with the following code:
 
 **For llm-based reranker**