From e306d6093313a23e6a2de4b96a84b16e4b1529cf Mon Sep 17 00:00:00 2001
From: Xiao
Date: Mon, 18 Mar 2024 06:36:49 +0000
Subject: [PATCH] Upload README.md with huggingface_hub

---
 README.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/README.md b/README.md
index e5f0e70..c39437a 100644
--- a/README.md
+++ b/README.md
@@ -245,6 +245,20 @@ with torch.no_grad():
 
 ## Fine-tune
 
+### Data Format
+
+Training data should be a JSON Lines file, where each line is a dict like this:
+
+```
+{"query": str, "pos": List[str], "neg": List[str], "prompt": str}
+```
+
+`query` is the query, `pos` is a list of positive texts, `neg` is a list of negative texts, and `prompt` describes the relationship between the query and the texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus as negatives.
+
+See [toy_finetune_data.jsonl](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker/toy_finetune_data.jsonl) for a toy data file.
+
+### Train
+
 You can fine-tune the reranker with the following code:
 
 **For llm-based reranker**