diff --git a/README.md b/README.md
index c5bd1ea..63520b4 100644
--- a/README.md
+++ b/README.md
@@ -96,7 +96,7 @@ The model is a decoder-only transformer similar to the LLaMA ([Touvron et al., 2
 * **Position Embeddings**: Rotary Position Embeddings ([Su et al., 2021](https://arxiv.org/abs/2104.09864)) applied to the first 25% of head embedding dimensions for improved throughput following [Black et al. (2022)](https://arxiv.org/pdf/2204.06745.pdf).
 * **Normalization**: LayerNorm ([Ba et al., 2016](https://arxiv.org/abs/1607.06450)) with learned bias terms as opposed to RMSNorm ([Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467)).
-* **Biases**: We remove all bias terms from the model except for attention Q,K,V projections ([Bai et al., 2023](https://arxiv.org/abs/2309.16609)).
+* **Biases**: We remove all bias terms from the feed-forward networks and multi-head self-attention layers, except for the biases of the query, key, and value projections ([Bai et al., 2023](https://arxiv.org/abs/2309.16609)).
 * **Tokenizer**: We use Arcade100k, a BPE tokenizer extended from OpenAI's [`tiktoken.cl100k_base`](https://github.com/openai/tiktoken). We split digits into individual tokens following findings by [Liu & Low (2023)](https://arxiv.org/abs/2305.14201).
 
 ## Training
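For readers of this diff, the sketch below illustrates the bias layout the revised "Biases" bullet describes: biases are retained only on the query, key, and value projections, while the attention output projection and the feed-forward layers carry none. This is not code from the repository; the module and parameter names (`AttentionProjections`, `FeedForward`, `d_model`, `d_ff`) are illustrative assumptions.

```python
import torch.nn as nn

# Minimal sketch, assuming a standard decoder block layout. Only the bias
# placement is meant to match the README bullet; everything else is generic.

class AttentionProjections(nn.Module):
    """Projection layers of multi-head self-attention (forward pass omitted)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=True)   # bias kept
        self.k_proj = nn.Linear(d_model, d_model, bias=True)   # bias kept
        self.v_proj = nn.Linear(d_model, d_model, bias=True)   # bias kept
        self.o_proj = nn.Linear(d_model, d_model, bias=False)  # bias removed


class FeedForward(nn.Module):
    """Feed-forward network with all bias terms removed."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)    # bias removed
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)  # bias removed


if __name__ == "__main__":
    # Quick check: only q/k/v projections contribute bias parameters.
    attn, ffn = AttentionProjections(1024), FeedForward(1024, 4096)
    bias_params = [n for m in (attn, ffn)
                   for n, _ in m.named_parameters() if n.endswith("bias")]
    print(bias_params)  # ['q_proj.bias', 'k_proj.bias', 'v_proj.bias']
```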