fix(README): clarify usage of bias terms
parent 1909ae19b3
commit 64c3d6a37d
@@ -96,7 +96,7 @@ The model is a decoder-only transformer similar to the LLaMA ([Touvron et al., 2
* **Position Embeddings**: Rotary Position Embeddings ([Su et al., 2021](https://arxiv.org/abs/2104.09864)) applied to the first 25% of head embedding dimensions for improved throughput following [Black et al. (2022)](https://arxiv.org/pdf/2204.06745.pdf).
* **Normalization**: LayerNorm ([Ba et al., 2016](https://arxiv.org/abs/1607.06450)) with learned bias terms as opposed to RMSNorm ([Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467)).
-* **Biases**: We remove all bias terms from the model except for attention Q,K,V projections ([Bai et al., 2023](https://arxiv.org/abs/2309.16609)).
+* **Biases**: We remove all bias terms from the feed-forward networks and multi-head self-attention layers, except for the biases of the query, key, and value projections ([Bai et al., 2023](https://arxiv.org/abs/2309.16609)).
* **Tokenizer**: We use Arcade100k, a BPE tokenizer extended from OpenAI's [`tiktoken.cl100k_base`](https://github.com/openai/tiktoken). We split digits into individual tokens following findings by [Liu & Low (2023)](https://arxiv.org/abs/2305.14201).
## Training
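To make the clarified bias layout concrete, here is a minimal PyTorch sketch of an attention block matching the bullets above: LayerNorm keeps its learned bias, only the query/key/value projections keep linear biases, and rotary embeddings cover the first 25% of each head's dimensions. This is a sketch under stated assumptions, not the repository's actual code; the class, argument, and helper names (`SelfAttention`, `rotary_pct`, `rotate_half`) and the precomputed `cos`/`sin` rotary tables are all illustrative, and the bias-free feed-forward block is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Standard rotate-half helper used by rotary position embeddings.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


class SelfAttention(nn.Module):
    """Illustrative attention block; names and shapes are assumptions, not the repo's code."""

    def __init__(self, hidden_size: int, num_heads: int, rotary_pct: float = 0.25):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Rotary embeddings cover only the first 25% of each head's dimensions.
        self.rotary_dim = int(self.head_dim * rotary_pct)
        # LayerNorm keeps its learned bias (the PyTorch default), unlike RMSNorm.
        self.norm = nn.LayerNorm(hidden_size)
        # Q, K, V projections are the only linear layers that keep a bias term ...
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=True)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=True)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=True)
        # ... while the output projection (and the feed-forward block, not shown) drop theirs.
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
        # cos/sin: precomputed rotary tables of shape (seq_len, rotary_dim).
        b, t, _ = x.shape
        h = self.norm(x)
        q, k, v = (
            proj(h).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
            for proj in (self.q_proj, self.k_proj, self.v_proj)
        )
        # Rotate only the leading rotary_dim channels; the rest pass through unrotated.
        q_rot, q_pass = q[..., : self.rotary_dim], q[..., self.rotary_dim :]
        k_rot, k_pass = k[..., : self.rotary_dim], k[..., self.rotary_dim :]
        q = torch.cat((q_rot * cos + rotate_half(q_rot) * sin, q_pass), dim=-1)
        k = torch.cat((k_rot * cos + rotate_half(k_rot) * sin, k_pass), dim=-1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return x + self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```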
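The digit-splitting note in the tokenizer bullet can also be illustrated. The sketch below contrasts OpenAI's `cl100k_base` (which groups runs of up to three digits into one token) with a single-digit pre-split; the regex is only an illustration of the idea, not Arcade100k's actual pre-tokenization pattern.

```python
import re

import tiktoken

# cl100k_base groups runs of up to three digits, so long numbers tokenize
# unevenly (e.g. "123456" -> "123", "456").
enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([t]) for t in enc.encode("123456")])

# Splitting every digit into its own token, as the README describes for
# Arcade100k, can be expressed as a pre-tokenization rule. This regex is a
# simplified illustration, not the tokenizer's real pattern.
print(re.findall(r"\d|[^\d\s]+|\s+", "x = 123456"))
# -> ['x', ' ', '=', ' ', '1', '2', '3', '4', '5', '6']
```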