Compare commits

10 Commits

Author | SHA1 | Message | Date
Josephine Parquet | f499ead74c | Update README.md | 2024-07-10 11:59:18 +00:00
Josephine Parquet | 4b27fa32bf | Rename LICENSE to LICENSE.md | 2024-07-10 11:58:16 +00:00
Jonathan Tow | 8879812ccc | tmpfix(tokenizer_config): force GPT2TokenizerFast | 2024-06-05 19:45:00 +00:00
Hassan Zayour | 78f86b80f0 | Update README.md | 2024-04-12 08:21:42 +00:00
Jonathan Tow | 3a8f08d8ab | Update README.md | 2024-04-08 20:18:37 +00:00
Jonathan Tow | 16f78806b1 | Update README.md | 2024-04-08 19:30:37 +00:00
Jonathan Tow | 33923f685b | fix(README): correct tokenizer name | 2024-03-21 22:01:19 +00:00
Jonathan Tow | db5a120c4d | fix(README): remove trust_remote_code requirement from tokenizer snippet | 2024-03-01 07:35:30 +00:00
Jonathan Tow | a7a1fb8a83 | update(tokenizer): convert to GPT2Tokenizer (#7) - update(tokenizer): convert to `GPT2Tokenizer` (85625532dc8753c206eecc8a76323783a7b64744) | 2024-03-01 07:31:02 +00:00
Jonathan Tow | a2eb1af48d | revert(config): use float16 default torch dtype | 2024-02-21 20:41:28 +00:00
11 changed files with 401076 additions and 100608 deletions

42
LICENSE

@ -1,42 +0,0 @@
STABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE AGREEMENT
Dated: December 06, 2023
By using or distributing any portion or element of the Models, Software, Software Products or Derivative Works, you agree to be bound by this Agreement.
"Agreement" means this Stable Non-Commercial Research Community License Agreement.
“AUP” means the Stability AI Acceptable Use Policy available at https://stability.ai/use-policy, as may be updated from time to time.
"Derivative Work(s)” means (a) any derivative work of the Software Products as recognized by U.S. copyright laws and (b) any modifications to a Model, and any other model created which is based on or derived from the Model or the Model's output. For clarity, Derivative Works do not include the output of any Model.
“Documentation” means any specifications, manuals, documentation, and other written information provided by Stability AI related to the Software.
"Licensee" or "you" means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity's behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.
“Model(s)" means, collectively, Stability AI's proprietary models and algorithms, including machine-learning models, trained model weights and other elements of the foregoing, made available under this Agreement.
“Non-Commercial Uses” means exercising any of the rights granted herein for the purpose of research or non-commercial purposes. Non-Commercial Uses does not include any production use of the Software Products or any Derivative Works.
"Stability AI" or "we" means Stability AI Ltd. and its affiliates.
"Software" means Stability AI's proprietary software made available under this Agreement.
“Software Products” means the Models, Software and Documentation, individually or in any combination.
1. License Rights and Redistribution.
a. Subject to your compliance with this Agreement, the AUP (which is hereby incorporated herein by reference), and the Documentation, Stability AI grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty free and limited license under Stability AI's intellectual property or other rights owned or controlled by Stability AI embodied in the Software Products to use, reproduce, distribute, and create Derivative Works of, the Software Products, in each case for Non-Commercial Uses only.
b. You may not use the Software Products or Derivative Works to enable third parties to use the Software Products or Derivative Works as part of your hosted service or via your APIs, whether you are adding substantial additional functionality thereto or not. Merely distributing the Software Products or Derivative Works for download online without offering any related service (ex. by distributing the Models on HuggingFace) is not a violation of this subsection. If you wish to use the Software Products or any Derivative Works for commercial or production use or you wish to make the Software Products or any Derivative Works available to third parties via your hosted service or your APIs, contact Stability AI at https://stability.ai/contact.
c. If you distribute or make the Software Products, or any Derivative Works thereof, available to a third party, the Software Products, Derivative Works, or any portion thereof, respectively, will remain subject to this Agreement and you must (i) provide a copy of this Agreement to such third party, and (ii) retain the following attribution notice within a "Notice" text file distributed as a part of such copies: "This Stability AI Model is licensed under the Stability AI Non-Commercial Research Community License, Copyright (c) Stability AI Ltd. All Rights Reserved.” If you create a Derivative Work of a Software Product, you may add your own attribution notices to the Notice file included with the Software Product, provided that you clearly indicate which attributions apply to the Software Product and you must state in the NOTICE file that you changed the Software Product and how it was modified.
2. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE SOFTWARE PRODUCTS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE SOFTWARE PRODUCTS, DERIVATIVE WORKS OR ANY OUTPUT OR RESULTS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE SOFTWARE PRODUCTS, DERIVATIVE WORKS AND ANY OUTPUT AND RESULTS.
3. Limitation of Liability. IN NO EVENT WILL STABILITY AI OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF STABILITY AI OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
4. Intellectual Property.
a. No trademark licenses are granted under this Agreement, and in connection with the Software Products or Derivative Works, neither Stability AI nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the Software Products or Derivative Works.
b. Subject to Stability AI's ownership of the Software Products and Derivative Works made by or for Stability AI, with respect to any Derivative Works that are made by you, as between you and Stability AI, you are and will be the owner of such Derivative Works
c. If you institute litigation or other proceedings against Stability AI (including a cross-claim or counterclaim in a lawsuit) alleging that the Software Products, Derivative Works or associated outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Stability AI from and against any claim by any third party arising out of or related to your use or distribution of the Software Products or Derivative Works in violation of this Agreement.
5. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Software Products and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Stability AI may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of any Software Products or Derivative Works. Sections 2-4 shall survive the termination of this Agreement.
6. Governing Law. This Agreement will be governed by and construed in accordance with the laws of the United States and the State of California without regard to choice of law principles.

58
LICENSE.md Normal file

@ -0,0 +1,58 @@
STABILITY AI COMMUNITY LICENSE AGREEMENT
Last Updated: July 5, 2024
1. INTRODUCTION
This Agreement applies to any individual person or entity (“You”, “Your” or “Licensee”) that uses or distributes any portion or element of the Stability AI Materials or Derivative Works thereof for any Research & Non-Commercial or Commercial purpose. Capitalized terms not otherwise defined herein are defined in Section V below.
This Agreement is intended to allow research, non-commercial, and limited commercial uses of the Models free of charge. In order to ensure that certain limited commercial uses of the Models continue to be allowed, this Agreement preserves free access to the Models for people or organizations generating annual revenue of less than US $1,000,000 (or local currency equivalent).
By clicking “I Accept” or by using or distributing or using any portion or element of the Stability Materials or Derivative Works, You agree that You have read, understood and are bound by the terms of this Agreement. If You are acting on behalf of a company, organization or other entity, then “You” includes you and that entity, and You agree that You: (i) are an authorized representative of such entity with the authority to bind such entity to this Agreement, and (ii) You agree to the terms of this Agreement on that entity's behalf.
2. RESEARCH & NON-COMMERCIAL USE LICENSE
Subject to the terms of this Agreement, Stability AI grants You a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable and royalty-free limited license under Stability AI's intellectual property or other rights owned by Stability AI embodied in the Stability AI Materials to use, reproduce, distribute, and create Derivative Works of, and make modifications to, the Stability AI Materials for any Research or Non-Commercial Purpose. “Research Purpose” means academic or scientific advancement, and in each case, is not primarily intended for commercial advantage or monetary compensation to You or others. “Non-Commercial Purpose” means any purpose other than a Research Purpose that is not primarily intended for commercial advantage or monetary compensation to You or others, such as personal use (i.e., hobbyist) or evaluation and testing.
3. COMMERCIAL USE LICENSE
Subject to the terms of this Agreement (including the remainder of this Section III), Stability AI grants You a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable and royalty-free limited license under Stability AI's intellectual property or other rights owned by Stability AI embodied in the Stability AI Materials to use, reproduce, distribute, and create Derivative Works of, and make modifications to, the Stability AI Materials for any Commercial Purpose. “Commercial Purpose” means any purpose other than a Research Purpose or Non-Commercial Purpose that is primarily intended for commercial advantage or monetary compensation to You or others, including but not limited to, (i) creating, modifying, or distributing Your product or service, including via a hosted service or application programming interface, and (ii) for Your business's or organization's internal operations.
If You are using or distributing the Stability AI Materials for a Commercial Purpose, You must register with Stability AI at (https://stability.ai/community-license). If at any time You or Your Affiliate(s), either individually or in aggregate, generate more than USD $1,000,000 in annual revenue (or the equivalent thereof in Your local currency), regardless of whether that revenue is generated directly or indirectly from the Stability AI Materials or Derivative Works, any licenses granted to You under this Agreement shall terminate as of such date. You must request a license from Stability AI at (https://stability.ai/enterprise) , which Stability AI may grant to You in its sole discretion. If you receive Stability AI Materials, or any Derivative Works thereof, from a Licensee as part of an integrated end user product, then Section III of this Agreement will not apply to you.
4. GENERAL TERMS
Your Research, Non-Commercial, and Commercial License(s) under this Agreement are subject to the following terms.
a. Distribution & Attribution. If You distribute or make available the Stability AI Materials or a Derivative Work to a third party, or a product or service that uses any portion of them, You shall: (i) provide a copy of this Agreement to that third party, (ii) retain the following attribution notice within a "Notice" text file distributed as a part of such copies: "This Stability AI Model is licensed under the Stability AI Community License, Copyright © Stability AI Ltd. All Rights Reserved”, and (iii) prominently display “Powered by Stability AI” on a related website, user interface, blogpost, about page, or product documentation. If You create a Derivative Work, You may add your own attribution notice(s) to the “Notice” text file included with that Derivative Work, provided that You clearly indicate which attributions apply to the Stability AI Materials and state in the “Notice” text file that You changed the Stability AI Materials and how it was modified.
b. Use Restrictions. Your use of the Stability AI Materials and Derivative Works, including any output or results of the Stability AI Materials or Derivative Works, must comply with applicable laws and regulations (including Trade Control Laws and equivalent regulations) and adhere to the Documentation and Stability AI's AUP, which is hereby incorporated by reference. Furthermore, You will not use the Stability AI Materials or Derivative Works, or any output or results of the Stability AI Materials or Derivative Works, to create or improve any foundational generative AI model (excluding the Models or Derivative Works).
c. Intellectual Property.
(i) Trademark License. No trademark licenses are granted under this Agreement, and in connection with the Stability AI Materials or Derivative Works, You may not use any name or mark owned by or associated with Stability AI or any of its Affiliates, except as required under Section IV(a) herein.
(ii) Ownership of Derivative Works. As between You and Stability AI, You are the owner of Derivative Works You create, subject to Stability AI's ownership of the Stability AI Materials and any Derivative Works made by or for Stability AI.
(iii) Ownership of Outputs. As between You and Stability AI, You own any outputs generated from the Models or Derivative Works to the extent permitted by applicable law.
(iv) Disputes. If You or Your Affiliate(s) institute litigation or other proceedings against Stability AI (including a cross-claim or counterclaim in a lawsuit) alleging that the Stability AI Materials, Derivative Works or associated outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by You, then any licenses granted to You under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Stability AI from and against any claim by any third party arising out of or related to Your use or distribution of the Stability AI Materials or Derivative Works in violation of this Agreement.
(v) Feedback. From time to time, You may provide Stability AI with verbal and/or written suggestions, comments or other feedback related to Stability AI's existing or prospective technology, products or services (collectively, “Feedback”). You are not obligated to provide Stability AI with Feedback, but to the extent that You do, You hereby grant Stability AI a perpetual, irrevocable, royalty-free, fully-paid, sub-licensable, transferable, non-exclusive, worldwide right and license to exploit the Feedback in any manner without restriction. Your Feedback is provided “AS IS” and You make no warranties whatsoever about any Feedback.
d. Disclaimer Of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE STABILITY AI MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OR LAWFULNESS OF USING OR REDISTRIBUTING THE STABILITY AI MATERIALS, DERIVATIVE WORKS OR ANY OUTPUT OR RESULTS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE STABILITY AI MATERIALS, DERIVATIVE WORKS AND ANY OUTPUT AND RESULTS.
e. Limitation Of Liability. IN NO EVENT WILL STABILITY AI OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF STABILITY AI OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
f. Term And Termination. The term of this Agreement will commence upon Your acceptance of this Agreement or access to the Stability AI Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Stability AI may terminate this Agreement if You are in breach of any term or condition of this Agreement. Upon termination of this Agreement, You shall delete and cease use of any Stability AI Materials or Derivative Works. Section IV(d), (e), and (g) shall survive the termination of this Agreement.
g. Governing Law. This Agreement will be governed by and constructed in accordance with the laws of the United States and the State of California without regard to choice of law principles, and the UN Convention on Contracts for International Sale of Goods does not apply to this Agreement.
5. DEFINITIONS
“Affiliate(s)” means any entity that directly or indirectly controls, is controlled by, or is under common control with the subject entity; for purposes of this definition, “control” means direct or indirect ownership or control of more than 50% of the voting interests of the subject entity.
"Agreement" means this Stability AI Community License Agreement.
“AUP” means the Stability AI Acceptable Use Policy available at (https://stability.ai/use-policy), as may be updated from time to time.
"Derivative Work(s)” means (a) any derivative work of the Stability AI Materials as recognized by U.S. copyright laws and (b) any modifications to a Model, and any other model created which is based on or derived from the Model or the Model's output, including “fine tune” and “low-rank adaptation” models derived from a Model or a Model's output, but do not include the output of any Model.
“Documentation” means any specifications, manuals, documentation, and other written information provided by Stability AI related to the Software or Models.
“Model(s)" means, collectively, Stability AI's proprietary models and algorithms, including machine-learning models, trained model weights and other elements of the foregoing listed on Stability's Core Models Webpage available at (https://stability.ai/core-models), as may be updated from time to time.
"Stability AI" or "we" means Stability AI Ltd. and its Affiliates.
"Software" means Stability AI's proprietary software made available under this Agreement now or in the future.
“Stability AI Materials” means, collectively, Stability's proprietary Models, Software and Documentation (and any portion or combination thereof) made available under this Agreement.
“Trade Control Laws” means any applicable U.S. and non-U.S. export control and trade sanctions laws and regulations.

@ -20,6 +20,8 @@ tags:
---
# `Stable LM 2 1.6B`
+Please note: For commercial use, please refer to https://stability.ai/license
## Model Description
`Stable LM 2 1.6B` is a 1.6 billion parameter decoder-only language model pre-trained on 2 trillion tokens of diverse multilingual and code datasets for two epochs.
@ -30,7 +32,7 @@ Get started generating text with `Stable LM 2 1.6B` by using the following code
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-2-1_6b", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-2-1_6b")
model = AutoModelForCausalLM.from_pretrained(
"stabilityai/stablelm-2-1_6b",
torch_dtype="auto",
@ -54,7 +56,7 @@ print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-2-1_6b", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-2-1_6b")
model = AutoModelForCausalLM.from_pretrained(
"stabilityai/stablelm-2-1_6b",
torch_dtype="auto",
@ -82,7 +84,8 @@ print(tokenizer.decode(tokens[0], skip_special_tokens=True))
* **Language(s)**: English
* **Paper**: [Stable LM 2 1.6B Technical Report](https://drive.google.com/file/d/1JYJHszhS8EFChTbNAf8xmqhKjogWRrQF/view?usp=sharing)
* **Library**: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
-* **License**: [Stability AI Non-Commercial Research Community License](https://huggingface.co/stabilityai/stablelm-2-1_6b/blob/main/LICENSE). If you'd like to use this model for commercial products or purposes, please contact us [here](https://stability.ai/membership) to learn more.
+* **License**: [Stability AI Community License](https://huggingface.co/stabilityai/stablelm-2-1_6b/blob/main/LICENSE.md).
+* **Commercial License**: to use this model commercially, please refer to https://stability.ai/license
* **Contact**: For questions and comments about the model, please email `lm@stability.ai`
### Model Architecture
@ -108,7 +111,7 @@ The dataset is comprised of a filtered mixture of open-source large-scale datase
### Training Procedure
-The model is pre-trained on the aforementioned datasets in `bfloat16` precision, optimized with AdamW, and trained using the NeoX tokenizer with a vocabulary size of 100,352. We outline the complete hyperparameters choices in the project's [GitHub repository - config*](https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-2-1.6b.yml). The final checkpoint of pre-training, before cooldown, is provided in the `global_step420000` [branch](https://huggingface.co/stabilityai/stablelm-2-1_6b/blob/global_step420000/README.md).
+The model is pre-trained on the aforementioned datasets in `bfloat16` precision, optimized with AdamW, and trained using the Arcade100k tokenizer with a vocabulary size of 100,352. We outline the complete hyperparameters choices in the project's [GitHub repository - config*](https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-2-1_6b.yml). The final checkpoint of pre-training, before cooldown, is provided in the `global_step420000` [branch](https://huggingface.co/stabilityai/stablelm-2-1_6b/blob/global_step420000/README.md).
### Training Infrastructure
@ -120,7 +123,7 @@ The model is pre-trained on the aforementioned datasets in `bfloat16` precision,
### Intended Use
-The model is intended to be used as a foundational base model for application-specific fine-tuning. Developers must evaluate and fine-tune the model for safe performance in downstream applications.
+The model is intended to be used as a foundational base model for application-specific fine-tuning. Developers must evaluate and fine-tune the model for safe performance in downstream applications. For commercial use, please refer to https://stability.ai/membership.
### Limitations and Bias
@ -129,9 +132,10 @@ As a base model, this model may exhibit unreliable, unsafe, or other undesirable
## How to Cite
```bibtex
-@misc{StableLM-2-1.6B,
-url={[https://huggingface.co/stabilityai/stablelm-2-1.6b](https://huggingface.co/stabilityai/stablelm-2-1.6b)},
-title={Stable LM 2 1.6B},
-author={Stability AI Language Team}
+@article{bellagente2024stable,
+title={Stable LM 2 1.6 B Technical Report},
+author={Bellagente, Marco and Tow, Jonathan and Mahan, Dakota and Phung, Duy and Zhuravinskyi, Maksym and Adithyan, Reshinth and Baicoianu, James and Brooks, Ben and Cooper, Nathan and Datta, Ashish and others},
+journal={arXiv preprint arXiv:2402.17834},
+year={2024}
}
```

File diff suppressed because it is too large

@ -17,7 +17,7 @@
"partial_rotary_factor": 0.25,
"rope_theta": 10000,
"tie_word_embeddings": false,
-"torch_dtype": "bfloat16",
+"torch_dtype": "float16",
"transformers_version": "4.38.0",
"use_cache": true,
"use_qkv_bias": true,
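
The hunk above switches the default `torch_dtype` from `bfloat16` to `float16`. The two 16-bit formats split their bits differently, which changes the representable range and the precision. The following is a minimal sketch of that trade-off, assuming the standard bit layouts (float16: 5 exponent bits and 10 mantissa bits; bfloat16: 8 exponent bits and 7 mantissa bits). The helper names `max_finite` and `epsilon` are hypothetical and not part of this repository or of PyTorch.

```python
def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent pattern is reserved for inf/NaN, hence "- 2".
    max_exp = (2 ** exp_bits - 2) - bias
    return (2 - 2.0 ** -man_bits) * 2.0 ** max_exp


def epsilon(man_bits: int) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -man_bits


# (exponent bits, mantissa bits); assumed standard layouts, see lead-in.
FLOAT16 = (5, 10)
BFLOAT16 = (8, 7)

for name, (e, m) in {"float16": FLOAT16, "bfloat16": BFLOAT16}.items():
    print(f"{name}: max finite = {max_finite(e, m):.6g}, epsilon = {epsilon(m)}")
```

Under these assumptions float16 has the finer precision but a much smaller range than bfloat16, so values that fit in a bfloat16 checkpoint may overflow when cast to float16. If that matters for a given workload, passing an explicit `torch_dtype` to `from_pretrained` instead of `"auto"` is one way to override the default recorded here.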

100001
merges.txt Normal file

File diff suppressed because it is too large

40
special_tokens_map.json Normal file

@ -0,0 +1,40 @@
{
"additional_special_tokens": [
"<|reg_extra|>",
"<|endoftext|>",
"<|fim_prefix|>",
"<|fim_middle|>",
"<|fim_suffix|>",
"<|fim_pad|>",
"<gh_stars>",
"<filename>",
"<issue_start>",
"<issue_comment>",
"<issue_closed>",
"<jupyter_start>",
"<jupyter_text>",
"<jupyter_code>",
"<jupyter_output>",
"<empty_output>",
"<commit_before>",
"<commit_msg>",
"<commit_after>",
"<reponame>",
"<|endofprompt|>",
"<|im_start|>",
"<|im_end|>",
"<|pause|>",
"<|reg0|>",
"<|reg1|>",
"<|reg2|>",
"<|reg3|>",
"<|reg4|>",
"<|reg5|>",
"<|reg6|>",
"<|reg7|>",
"<|extra0|>"
],
"bos_token": "<|endoftext|>",
"eos_token": "<|endoftext|>",
"unk_token": "<|endoftext|>"
}
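
The 32 entries after `<|reg_extra|>` in the list above appear in the same order as `SPECIAL_TOKENS_NAMES` in the removed `tokenization_arcade100k.py` (shown later in this diff), which numbered them consecutively from `START_ID = 100257`. A minimal sketch of that numbering follows. The constants are copied from the removed file; whether the converted GPT2-style tokenizer keeps the same ids is an assumption that cannot be checked here, because the `vocab.json` and `tokenizer.json` diffs are suppressed.

```python
# Constants copied from the removed tokenizer file; nothing is read from the repo.
START_ID = 100257

SPECIAL_TOKENS_NAMES = (
    ["<|endoftext|>"]
    + ["<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>", "<|fim_pad|>"]
    + [
        "<gh_stars>", "<filename>", "<issue_start>", "<issue_comment>",
        "<issue_closed>", "<jupyter_start>", "<jupyter_text>", "<jupyter_code>",
        "<jupyter_output>", "<empty_output>", "<commit_before>", "<commit_msg>",
        "<commit_after>", "<reponame>",
    ]
    + ["<|endofprompt|>"]
    + ["<|im_start|>", "<|im_end|>"]
    + ["<|pause|>"]
    + [f"<|reg{i}|>" for i in range(8)]
    + ["<|extra0|>"]
)

# Same comprehension as in the removed file: ids are START_ID, START_ID + 1, ...
SPECIAL_TOKENS = {name: START_ID + i for i, name in enumerate(SPECIAL_TOKENS_NAMES)}

first_id = SPECIAL_TOKENS["<|endoftext|>"]
last_id = SPECIAL_TOKENS["<|extra0|>"]
print(len(SPECIAL_TOKENS), first_id, last_id)
```

Because `tiktoken` reports `n_vocab` as the largest token id plus one, this layout gives `n_vocab = last_id + 1`. If the mergeable ranks occupy ids 0 to 100255 (an assumption, since the `.tiktoken` file is not shown), id 100256 is left unused, which would account for the `+ 1` in the removed tokenizer's vocabulary-size assertion.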

@ -1,292 +0,0 @@
# coding=utf-8
# Copyright (c) 2023 Alibaba Cloud & Stability AI.
#
# Tongyi Qianwen LICENSE AGREEMENT:
# https://github.com/QwenLM/Qwen/blob/5aa84bdfd3237b37f01bc88cd49b3279b9a71d0b/Tongyi%20Qianwen%20LICENSE%20AGREEMENT
"""Tokenization classes for Arcade100k."""

import base64
import os
import unicodedata
from typing import Collection, Dict, List, Set, Tuple, Union

import tiktoken
from transformers.utils import logging
from transformers import PreTrainedTokenizer, AddedToken

logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "arcade100k.tiktoken"}
NAME = "arcade100k"


def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]:
    with open(tiktoken_bpe_file, "rb") as f:
        contents = f.read()
    return {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in contents.splitlines() if line)
    }


ENDOFTEXT = "<|endoftext|>"
FIM = [
    "<|fim_prefix|>",
    "<|fim_middle|>",
    "<|fim_suffix|>",
    "<|fim_pad|>",
]
# `StarCoder` Tokens
CODE = [
    "<gh_stars>",
    "<filename>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<empty_output>",
    "<commit_before>",
    "<commit_msg>",
    "<commit_after>",
    "<reponame>",
]
CHAT = [
    "<|im_start|>",  # Chat: Input message start
    "<|im_end|>",  # Chat: Input message end
]
PAUSE = "<|pause|>"  # Think before you speak (https://arxiv.org/abs/2310.02226)
REGISTERS = [
    f"<|reg{i}|>" for i in range(0, 8)
]  # Register 0 sink token (https://arxiv.org/abs/2309.17453)
ENDOFPROMPT = "<|endofprompt|>"
SPECIAL_TOKENS_NAMES = (
    [ENDOFTEXT]
    + FIM
    + CODE
    + [ENDOFPROMPT]
    + CHAT
    + [PAUSE]
    + REGISTERS
    + ["<|extra0|>"]
)
START_ID = 100257
SPECIAL_TOKENS = {t: START_ID + i for i, t in enumerate(SPECIAL_TOKENS_NAMES)}


def _arcade100k(vocab_file: str):
    mergeable_ranks = _load_tiktoken_bpe(vocab_file)

    return {
        "name": NAME,
        "pat_str": r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": SPECIAL_TOKENS,
    }


class Arcade100kTokenizer(PreTrainedTokenizer):
    """
    Construct a Arcade100k tokenizer backed by `tiktoken`.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        errors (`str`, *optional*, defaults to `"replace"`):
            How to handle errors in decoding UTF-8 byte sequences.
            WARNING: the default behaviour of this function is lossy, since decoded bytes are not
            guaranteed to be valid UTF-8. You can control this behaviour using the `errors` parameter,
            for instance, setting `errors=strict`.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file: str,
        errors: str = "replace",
        **kwargs,
    ):
        super().__init__(errors=errors, **kwargs)
        self.errors = errors

        self._tiktoken_config = _arcade100k(vocab_file)
        self.tokenizer = tiktoken.Encoding(**self._tiktoken_config)

        # TODO: Remove this assertion
        assert (
            len(self.tokenizer._mergeable_ranks)
            + len(self.tokenizer._special_tokens)
            + 1
            == self.tokenizer.n_vocab
        ), f"{len(self.tokenizer._mergeable_ranks) + len(self.tokenizer._special_tokens)} != {self.tokenizer.n_vocab} in encoding"

        self.decoder = {i: n for n, i in self.tokenizer._mergeable_ranks.items()}
        self.decoder.update({i: n for n, i in self.tokenizer._special_tokens.items()})
        # Provide default `eos_token` and `pad_token`
        if self.eos_token is None:
            self.eos_token = self.decoder[self.tokenizer.eot_token]
        if self.pad_token is None:
            self.pad_token = self.decoder[self.tokenizer.pad_token]

        # Expose for convenience
        self.mergeable_ranks = self.tokenizer._mergeable_ranks
        self.special_tokens = self.tokenizer._special_tokens

    def __len__(self):
        return self.tokenizer.n_vocab

    def __getstate__(self):
        # Required for `pickle` support
        state = self.__dict__.copy()
        del state["tokenizer"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.tokenizer = tiktoken.Encoding(**self._tiktoken_config)

    @property
    def vocab_size(self):
        return self.tokenizer.n_vocab

    def get_vocab(self) -> Dict[bytes, int]:
        return self.tokenizer._mergeable_ranks

    def convert_tokens_to_ids(
        self, tokens: Union[bytes, str, List[Union[bytes, str]]]
    ) -> List[int]:
        ids = []
        if isinstance(tokens, (str, bytes)):
            if tokens in self.tokenizer._special_tokens:
                return self.tokenizer._special_tokens[tokens]
            else:
                return self.tokenizer._mergeable_ranks.get(tokens)
        for token in tokens:
            if token in self.tokenizer._special_tokens:
                ids.append(self.tokenizer._special_tokens[token])
            else:
                ids.append(self.tokenizer._mergeable_ranks.get(token))
        return ids

    def _add_tokens(
        self,
        new_tokens: Union[List[str], List[AddedToken]],
        special_tokens: bool = False,
    ) -> int:
        if not special_tokens and new_tokens:
            raise ValueError("Adding regular tokens is not supported")
        for token in new_tokens:
            surface_form = token.content if isinstance(token, AddedToken) else token
            if surface_form not in SPECIAL_TOKENS:
                raise ValueError("Adding unknown special tokens is not supported")
        return 0

    def save_vocabulary(self, save_directory: str, **kwargs) -> Tuple[str]:
        """
        Save only the vocabulary of the tokenizer (vocabulary).

        Returns:
            `Tuple(str)`: Paths to the files saved.
        """
        file_path = os.path.join(save_directory, "arcade100k.tiktoken")
        with open(file_path, "w", encoding="utf8") as w:
            for k, v in self.tokenizer._mergeable_ranks.items():
                line = base64.b64encode(k).decode("utf8") + " " + str(v) + "\n"
                w.write(line)
        return (file_path,)

    def tokenize(
        self,
        text: str,
        allowed_special: Union[Set, str] = "all",
        disallowed_special: Union[Collection, str] = (),
        **kwargs,
    ) -> List[Union[bytes, str]]:
        """
        Converts a string in a sequence of tokens.

        Args:
            text (`str`):
                The sequence to be encoded.
            allowed_special (`Literal["all"]` or `set`):
                The surface forms of the tokens to be encoded as special tokens in regular texts.
                Default to "all".
            disallowed_special (`Literal["all"]` or `Collection`):
                The surface forms of the tokens that should not be in regular texts and trigger errors.
                Default to an empty tuple.
            kwargs (additional keyword arguments, *optional*):
                Will be passed to the underlying model specific encode method.

        Returns:
            `List[bytes|str]`: The list of tokens.
        """
        tokens = []
        text = unicodedata.normalize("NFC", text)

        # this implementation takes a detour: text -> token id -> token surface forms
        for t in self.tokenizer.encode(
            text, allowed_special=allowed_special, disallowed_special=disallowed_special
        ):
            tokens.append(self.decoder[t])
        return tokens

    def convert_tokens_to_string(self, tokens: List[Union[bytes, str]]) -> str:
        """
        Converts a sequence of tokens in a single string.
        """
        text = ""
        temp = b""
        for t in tokens:
            if isinstance(t, str):
                if temp:
                    text += temp.decode("utf-8", errors=self.errors)
                    temp = b""
                text += t
            elif isinstance(t, bytes):
                temp += t
            else:
                raise TypeError("token should only be of type types or str")
        if temp:
            text += temp.decode("utf-8", errors=self.errors)
        return text

    def _convert_id_to_token(self, index: int) -> Union[bytes, str]:
        """Converts an id to a token, special tokens included"""
        if index in self.decoder:
            return self.decoder[index]
        raise ValueError("unknown ids")

    def _convert_token_to_id(self, token: Union[bytes, str]) -> int:
        """Converts a token to an id using the vocab, special tokens included"""
        if token in self.tokenizer._special_tokens:
            return self.tokenizer._special_tokens[token]
        if token in self.tokenizer._mergeable_ranks:
            return self.tokenizer._mergeable_ranks[token]
        raise ValueError("unknown token")

    def _tokenize(self, text: str, **kwargs):
        """
        Converts a string in a sequence of tokens (string), using the tokenizer. Split in words for word-based
        vocabulary or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces).
        Do NOT take care of added tokens.
        """
        raise NotImplementedError

    def _decode(
        self,
        token_ids: Union[int, List[int]],
        skip_special_tokens: bool = False,
        errors: str = None,
        **kwargs,
    ) -> str:
        if isinstance(token_ids, int):
            token_ids = [token_ids]
        if skip_special_tokens:
            token_ids = [i for i in token_ids if i < self.tokenizer.eot_token]
        return self.tokenizer.decode(token_ids)
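
The `convert_tokens_to_string` method in the removed file buffers consecutive `bytes` tokens and decodes them together, presumably because a byte-level BPE can split one multi-byte UTF-8 character across several tokens. A minimal standalone sketch of that buffering follows. The function name `join_tokens` and the sample tokens are hypothetical; the logic mirrors the removed method but is not taken from any library.

```python
from typing import List, Union


def join_tokens(tokens: List[Union[bytes, str]], errors: str = "replace") -> str:
    """Hypothetical standalone version of the byte-buffering join."""
    text = ""
    pending = b""  # raw bytes waiting to be decoded as one unit
    for tok in tokens:
        if isinstance(tok, str):
            # A str token (for example a special token) ends the current byte run.
            if pending:
                text += pending.decode("utf-8", errors=errors)
                pending = b""
            text += tok
        elif isinstance(tok, bytes):
            pending += tok
        else:
            raise TypeError("token must be bytes or str")
    if pending:
        text += pending.decode("utf-8", errors=errors)
    return text


# U+4F60 is encoded in UTF-8 as the bytes E4 BD A0; here it is split over two tokens.
tokens = [b"Hi ", b"\xe4\xbd", b"\xa0", "<|endoftext|>"]

buffered = join_tokens(tokens)

# Decoding each byte token on its own breaks the split character.
naive = "".join(
    t if isinstance(t, str) else t.decode("utf-8", errors="replace") for t in tokens
)

print(buffered)
print("\ufffd" in naive)
```

Decoding token by token yields replacement characters for the split character, while the buffered join recovers it. With `errors="strict"` the naive variant would raise `UnicodeDecodeError` instead.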

200632
tokenizer.json Normal file

File diff suppressed because it is too large

@ -1,11 +1,43 @@
{
-"tokenizer_class": "Arcade100kTokenizer",
-"auto_map": {
-"AutoTokenizer": [
-"tokenization_arcade100k.Arcade100kTokenizer",
-null
-]
-},
+"add_prefix_space": false,
+"additional_special_tokens": [
+"<|reg_extra|>",
+"<|endoftext|>",
+"<|fim_prefix|>",
+"<|fim_middle|>",
+"<|fim_suffix|>",
+"<|fim_pad|>",
+"<gh_stars>",
+"<filename>",
+"<issue_start>",
+"<issue_comment>",
+"<issue_closed>",
+"<jupyter_start>",
+"<jupyter_text>",
+"<jupyter_code>",
+"<jupyter_output>",
+"<empty_output>",
+"<commit_before>",
+"<commit_msg>",
+"<commit_after>",
+"<reponame>",
+"<|endofprompt|>",
+"<|im_start|>",
+"<|im_end|>",
+"<|pause|>",
+"<|reg0|>",
+"<|reg1|>",
+"<|reg2|>",
+"<|reg3|>",
+"<|reg4|>",
+"<|reg5|>",
+"<|reg6|>",
+"<|reg7|>",
+"<|extra0|>"
+],
+"bos_token": "<|endoftext|>",
+"clean_up_tokenization_spaces": true,
"eos_token": "<|endoftext|>",
-"pad_token": "<|endoftext|>"
+"tokenizer_class": "GPT2TokenizerFast",
+"unk_token": "<|endoftext|>"
}

100291
vocab.json Normal file

File diff suppressed because it is too large Load Diff