diff --git a/README.md b/README.md
index e5f0e70..c39437a 100644
--- a/README.md
+++ b/README.md
@@ -245,6 +245,20 @@ with torch.no_grad():
 
 ## Fine-tune
 
+### Data Format
+
+Training data should be a JSON Lines file, where each line is a JSON object like this:
+
+```
+{"query": str, "pos": List[str], "neg": List[str], "prompt": str}
+```
+
+`query` is the query text, `pos` is a list of positive texts, `neg` is a list of negative texts, and `prompt` describes the relationship between the query and the texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus to use as negatives.
+
+See [toy_finetune_data.jsonl](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker/toy_finetune_data.jsonl) for a toy data file.
+
+### Train
+
 You can fine-tune the reranker with the following code:
 
 **For llm-based reranker**
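
The data format added in this patch could be produced with a small helper like the one below. This is a minimal sketch, not code from the repository: the function names (`validate_record`, `fill_negatives`, `to_jsonl`), the sample corpus, and the prompt wording are all illustrative assumptions. It checks that each record has the four documented fields, fills an empty `neg` list by random sampling from a corpus (skipping the record's own positives), and serializes one JSON object per line.

```python
import json
import random
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any]) -> None:
    """Raise ValueError unless `record` has the four documented fields with the right types."""
    for key in ("query", "prompt"):
        if not isinstance(record.get(key), str):
            raise ValueError(f"'{key}' must be a string")
    for key in ("pos", "neg"):
        value = record.get(key)
        if not isinstance(value, list) or not all(isinstance(t, str) for t in value):
            raise ValueError(f"'{key}' must be a list of strings")


def fill_negatives(
    record: Dict[str, Any], corpus: List[str], num_neg: int, rng: random.Random
) -> Dict[str, Any]:
    """Return a copy of `record`; if `neg` is empty, sample negatives from `corpus`."""
    filled = dict(record)
    if not filled.get("neg"):
        positives = set(filled["pos"])
        # Never use one of the record's own positives as a negative.
        candidates = [text for text in corpus if text not in positives]
        filled["neg"] = rng.sample(candidates, min(num_neg, len(candidates)))
    return filled


def to_jsonl(records: List[Dict[str, Any]]) -> str:
    """Validate each record and serialize it as one JSON object per line."""
    lines = []
    for record in records:
        validate_record(record)
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Hypothetical corpus and record, for illustration only.
corpus = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "Python is a programming language.",
    "Mount Everest is the tallest mountain on Earth.",
]
record = {
    "query": "What is the capital of France?",
    "pos": ["Paris is the capital of France."],
    "neg": [],
    "prompt": "Given a query and a passage, decide whether the passage answers the query.",
}

rng = random.Random(0)  # fixed seed so the sampled negatives are reproducible
example = fill_negatives(record, corpus, num_neg=2, rng=rng)
jsonl_text = to_jsonl([example])
print(jsonl_text, end="")
```

Uniform random sampling is the simplest way to obtain negatives, as the text suggests; the resulting string can be written to a `.jsonl` file and compared against the toy data file linked in the patch.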