From 03b084fb760689db60d0b77c100a0ed1b0224e23 Mon Sep 17 00:00:00 2001
From: CodeSage
Date: Sat, 28 Dec 2024 22:01:10 +0000
Subject: [PATCH] Update README.md

---
 README.md | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 5f28ae6..c505486 100644
--- a/README.md
+++ b/README.md
@@ -11,6 +11,9 @@ language:
 
 ## CodeSage-Large-v2
 
+### [Blogpost]
+Please check out our [blogpost](https://code-representation-learning.github.io/codesage-v2.html) for more details.
+
 ### Model description
 CodeSage is a family of open code embedding models with an encoder architecture that supports a wide range of source code understanding tasks. It was initially introduced in the paper:
 
@@ -55,9 +58,11 @@ For this V2 model, we enhanced semantic search performance by improving the qual
 ### Training Data
 This pretrained checkpoint is the same as those used by our V1 model ([codesage/codesage-small](https://huggingface.co/codesage/codesage-small), which is trained on [The Stack](https://huggingface.co/datasets/bigcode/the-stack-dedup) data. The constative learning data are extracted from [The Stack V2](https://huggingface.co/datasets/bigcode/the-stack-v2). Same as our V1 model, we supported nine languages as follows: c, c-sharp, go, java, javascript, typescript, php, python, ruby.
 
-### How to use
-This checkpoint consists of an encoder (130M model), which can be used to extract code embeddings of 1024 dimension. It can be easily loaded using the AutoModel functionality and employs the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
+### How to Use
+This checkpoint consists of an encoder (1.3B model), which can be used to extract code embeddings of 2048 dimension.
+1. Accessing CodeSage via HuggingFace: it can be easily loaded using the AutoModel functionality and employs the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
 
+
 ```
 from transformers import AutoModel, AutoTokenizer
 
@@ -74,6 +79,12 @@ inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", ret
 embedding = model(inputs)[0]
 ```
 
+2. Accessing CodeSage via SentenceTransformer
+```
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("codesage/codesage-large-v2", trust_remote_code=True)
+```
+
 ### BibTeX entry and citation info
 ```
 @inproceedings{