Update model card (#25)

- Update model card (8124ca2a858fd1c8ceea2fc56f2ff3f9e76a30c3) Co-authored-by: Pedro Cuenca <pcuenq@users.noreply.huggingface.co>
Update README.md (#11)
2024-10-24 15:07:51 +00:00 · 2024-09-25 17:53:59 +00:00 · 2024-09-25 15:42:33 +00:00 · 2024-09-25 13:11:46 +00:00 · 2024-09-25 08:00:24 +00:00 · 2024-09-25 06:36:42 +00:00
5 changed files with 275 additions and 90 deletions
--- a/README.md
+++ b/README.md
@@ -16,48 +16,58 @@ tags:
 - pytorch
 - llama
 - llama-3
+license: llama3.2
 extra_gated_prompt: >-
  ### LLAMA 3.2 COMMUNITY LICENSE AGREEMENT

+
  Llama 3.2 Version Release Date: September 25, 2024

+  
  “Agreement” means the terms and conditions for use, reproduction, distribution 
  and modification of the Llama Materials set forth herein.

+  
  “Documentation” means the specifications, manuals and documentation accompanying Llama 3.2
  distributed by Meta at https://llama.meta.com/doc/overview.

+  
  “Licensee” or “you” means you, or your employer or any other person or entity (if you are 
  entering into this Agreement on such person or entity’s behalf), of the age required under
  applicable laws, rules or regulations to provide legal consent and that has legal authority
  to bind your employer or such other person or entity if you are entering in this Agreement
  on their behalf.

+  
  “Llama 3.2” means the foundational large language models and software and algorithms, including
  machine-learning model code, trained model weights, inference-enabling code, training-enabling code,
  fine-tuning enabling code and other elements of the foregoing distributed by Meta at 
  https://www.llama.com/llama-downloads.

+  
  “Llama Materials” means, collectively, Meta’s proprietary Llama 3.2 and Documentation (and 
  any portion thereof) made available under this Agreement.

+  
  “Meta” or “we” means Meta Platforms Ireland Limited (if you are located in or, 
  if you are an entity, your principal place of business is in the EEA or Switzerland) 
  and Meta Platforms, Inc. (if you are located outside of the EEA or Switzerland). 

+
  By clicking “I Accept” below or by using or distributing any portion or element of the Llama Materials,
  you agree to be bound by this Agreement.

+  
  1. License Rights and Redistribution.
  
-      a. Grant of Rights. You are granted a non-exclusive, worldwide, 
+  a. Grant of Rights. You are granted a non-exclusive, worldwide, 
  non-transferable and royalty-free limited license under Meta’s intellectual property or other rights 
  owned by Meta embodied in the Llama Materials to use, reproduce, distribute, copy, create derivative works 
  of, and make modifications to the Llama Materials.  

-      b. Redistribution and Use.  
+  b. Redistribution and Use.  

-          i. If you distribute or make available the Llama Materials (or any derivative works thereof), 
+  i. If you distribute or make available the Llama Materials (or any derivative works thereof), 
  or a product or service (including another AI model) that contains any of them, you shall (A) provide
  a copy of this Agreement with any such Llama Materials; and (B) prominently display “Built with Llama”
  on a related website, user interface, blogpost, about page, or product documentation. If you use the
@@ -65,15 +75,15 @@ extra_gated_prompt: >-
  otherwise improve an AI model, which is distributed or made available, you shall also include “Llama”
  at the beginning of any such AI model name.

-          ii. If you receive Llama Materials, or any derivative works thereof, from a Licensee as part
+  ii. If you receive Llama Materials, or any derivative works thereof, from a Licensee as part
  of an integrated end user product, then Section 2 of this Agreement will not apply to you. 

-          iii. You must retain in all copies of the Llama Materials that you distribute the 
+  iii. You must retain in all copies of the Llama Materials that you distribute the 
  following attribution notice within a “Notice” text file distributed as a part of such copies: 
  “Llama 3.2 is licensed under the Llama 3.2 Community License, Copyright © Meta Platforms,
  Inc. All Rights Reserved.”

-          iv. Your use of the Llama Materials must comply with applicable laws and regulations
+  iv. Your use of the Llama Materials must comply with applicable laws and regulations
  (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for
  the Llama Materials (available at https://www.llama.com/llama3_2/use-policy), which is hereby 
  incorporated by reference into this Agreement.
@@ -98,7 +108,7 @@ extra_gated_prompt: >-
  
  5. Intellectual Property.
  
-      a. No trademark licenses are granted under this Agreement, and in connection with the Llama Materials, 
+  a. No trademark licenses are granted under this Agreement, and in connection with the Llama Materials, 
  neither Meta nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, 
  except as required for reasonable and customary use in describing and redistributing the Llama Materials or as 
  set forth in this Section 5(a). Meta hereby grants you a license to use “Llama” (the “Mark”) solely as required 
@@ -106,16 +116,16 @@ extra_gated_prompt: >-
  at https://about.meta.com/brand/resources/meta/company-brand/). All goodwill arising out of your use of the Mark 
  will inure to the benefit of Meta.
  
-      b. Subject to Meta’s ownership of Llama Materials and derivatives made by or for Meta, with respect to any
-      derivative works and modifications of the Llama Materials that are made by you, as between you and Meta,
-      you are and will be the owner of such derivative works and modifications.
+  b. Subject to Meta’s ownership of Llama Materials and derivatives made by or for Meta, with respect to any
+  derivative works and modifications of the Llama Materials that are made by you, as between you and Meta,
+  you are and will be the owner of such derivative works and modifications.

-      c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or
-      counterclaim in a lawsuit) alleging that the Llama Materials or Llama 3.2 outputs or results, or any portion
-      of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable
-      by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or
-      claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third
-      party arising out of or related to your use or distribution of the Llama Materials.
+  c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or
+  counterclaim in a lawsuit) alleging that the Llama Materials or Llama 3.2 outputs or results, or any portion
+  of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable
+  by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or
+  claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third
+  party arising out of or related to your use or distribution of the Llama Materials.
  
  6. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access
  to the Llama Materials and will continue in full force and effect until terminated in accordance with the terms
@@ -171,13 +181,19 @@ extra_gated_prompt: >-
  4. Fail to appropriately disclose to end users any known dangers of your AI system
  5. Interact with third party tools, models, or software designed to generate unlawful content or engage in unlawful or harmful conduct and/or represent that the outputs of such tools, models, or software are associated with Meta or Llama 3.2

+
  With respect to any multimodal models included in Llama 3.2, the rights granted under Section 1(a) of the Llama 3.2 Community License Agreement are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union. This restriction does not apply to end users of a product or service that incorporates any such multimodal models.

+
  Please report any violation of this Policy, software “bug,” or other problems that could lead to a violation of this Policy through one of the following means:

+
  * Reporting issues with the model: [https://github.com/meta-llama/llama-models/issues](https://l.workplace.com/l.php?u=https%3A%2F%2Fgithub.com%2Fmeta-llama%2Fllama-models%2Fissues&h=AT0qV8W9BFT6NwihiOHRuKYQM_UnkzN_NmHMy91OT55gkLpgi4kQupHUl0ssR4dQsIQ8n3tfd0vtkobvsEvt1l4Ic6GXI2EeuHV8N08OG2WnbAmm0FL4ObkazC6G_256vN0lN9DsykCvCqGZ)
+  
  * Reporting risky content generated by the model: [developers.facebook.com/llama_output_feedback](http://developers.facebook.com/llama_output_feedback)
+  
  * Reporting bugs and security concerns: [facebook.com/whitehat/info](http://facebook.com/whitehat/info)
+  
  * Reporting violations of the Acceptable Use Policy or unlicensed uses of Llama 3.2: LlamaUseReport@meta.com
 extra_gated_fields:
  First Name: text
@@ -205,7 +221,7 @@ extra_gated_button_content: Submit

 ## Model Information

-The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks.
+The Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks.

 **Model Developer:** Meta

@@ -215,6 +231,8 @@ The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
 | Llama 3.2 (text only)  | A new mix of publicly available online data. | 1B (1.23B) | Multilingual Text | Multilingual Text and code  | 128k | Yes | Yes | Up to 9T tokens | December 2023 |
 |  |  | 3B (3.21B) | Multilingual Text | Multilingual Text and code  |  |  |  |  |  |
+| Llama 3.2 Quantized (text only)  | A new mix of publicly available online data. | 1B (1.23B) | Multilingual Text | Multilingual Text and code  | 8k | Yes | Yes | Up to 9T tokens | December 2023 |
+|  |  | 3B (3.21B) | Multilingual Text | Multilingual Text and code |  |  |  |  |  |

 **Supported Languages:** English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.

@@ -226,29 +244,77 @@ The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a

 **License:** Use of Llama 3.2 is governed by the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).

-**Feedback:** Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model [README](https://github.com/meta-llama/llama-models/tree/main/models/llama3_2). For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go [here](https://github.com/meta-llama/llama-recipes). 
+**Feedback:** Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go [here](https://github.com/meta-llama/llama-recipes).

 ## Intended Use

-**Intended Use Cases:** Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. 
+**Intended Use Cases:** Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.

 **Out of Scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.

+## How to use
+
+This repository contains two versions of Llama-3.2-1B-Instruct, for use with transformers and with the original `llama` codebase.
+
+### Use with transformers
+
+With `transformers >= 4.43.0`, you can run conversational inference using the Transformers `pipeline` abstraction or by using the Auto classes with the `generate()` function.
+
+Make sure to update your transformers installation via `pip install --upgrade transformers`.
+
+```python
+import torch
+from transformers import pipeline
+
+model_id = "meta-llama/Llama-3.2-1B-Instruct"
+pipe = pipeline(
+    "text-generation",
+    model=model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+messages = [
+    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+    {"role": "user", "content": "Who are you?"},
+]
+outputs = pipe(
+    messages,
+    max_new_tokens=256,
+)
+print(outputs[0]["generated_text"][-1])
+```
+
+Note: You can also find detailed recipes on how to use the model locally, with `torch.compile()`, assisted generation, quantization, and more at [`huggingface-llama-recipes`](https://github.com/huggingface/huggingface-llama-recipes).
+
+### Use with `llama`
+
+Please follow the instructions in the [repository](https://github.com/meta-llama/llama).
+
+To download the original checkpoints, see the example command below, which uses `huggingface-cli`:
+
+```
+huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --include "original/*" --local-dir Llama-3.2-1B-Instruct
+```
+
 ## Hardware and Software

-**Training Factors:** We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on production infrastructure.
+**Training Factors:** We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.

 **Training Energy Use:** Training utilized a cumulative of **916k** GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.

-## 
-
 **Training Greenhouse Gas Emissions:** Estimated total location-based greenhouse gas emissions were **240** tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy; therefore, the total market-based greenhouse gas emissions for training were 0 tons CO2eq.

 |  | Training Time (GPU hours) | Logit Generation Time (GPU Hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
 | :---- | :---: | ----- | :---: | :---: | :---: |
 | Llama 3.2 1B | 370k | \- | 700 | 107 | 0 |
 | Llama 3.2 3B | 460k | \- | 700 | 133 | 0 |
-| Total | 830k |         86k |  | 240 | 0 |
+| Llama 3.2 1B SpinQuant | 1.7 | 0 | 700 | *Negligible*\*\* | 0 |
+| Llama 3.2 3B SpinQuant | 2.4 | 0 | 700 | *Negligible*\*\* | 0 |
+| Llama 3.2 1B QLoRA | 1.3k | 0 | 700 | 0.381 | 0 |
+| Llama 3.2 3B QLoRA | 1.6k | 0 | 700 | 0.461 | 0 |
+| Total | 833k |         86k |  | 240 | 0 |
+
+\*\* The location-based CO2e emissions of Llama 3.2 1B SpinQuant and Llama 3.2 3B SpinQuant are less than 0.001 metric tonnes each. This is due to the minimal training GPU hours that are required.
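
Reviewer note: the totals in the updated table can be cross-checked with a few lines of arithmetic. The sketch below copies the per-row figures from the table, treats the SpinQuant emissions as 0.0 (the footnote says each is below 0.001 tonnes), and adds a rough GPU-only energy figure at full TDP. That last number is an assumption for illustration only, since the card's own methodology also applies a power usage efficiency adjustment that is not reproduced here.

```python
# (training GPU hours, location-based emissions in tons CO2eq), copied from the table.
# SpinQuant emissions are listed as negligible (< 0.001 t each), so 0.0 is used here.
ROWS = {
    "Llama 3.2 1B": (370_000, 107.0),
    "Llama 3.2 3B": (460_000, 133.0),
    "Llama 3.2 1B SpinQuant": (1.7, 0.0),
    "Llama 3.2 3B SpinQuant": (2.4, 0.0),
    "Llama 3.2 1B QLoRA": (1_300, 0.381),
    "Llama 3.2 3B QLoRA": (1_600, 0.461),
}
LOGIT_GENERATION_GPU_HOURS = 86_000  # "Logit Generation Time" in the Total row
TDP_WATTS = 700                      # H100-80GB TDP stated in the card


def summarize(rows, logit_hours, tdp_watts):
    """Return (training hours, emissions, total hours, rough energy in MWh)."""
    training_hours = sum(hours for hours, _ in rows.values())
    emissions = sum(tons for _, tons in rows.values())
    total_hours = training_hours + logit_hours
    # Assumption: every GPU hour draws full TDP and no PUE factor is applied.
    energy_mwh = total_hours * tdp_watts / 1_000_000
    return training_hours, emissions, total_hours, energy_mwh


training_hours, emissions, total_hours, energy_mwh = summarize(
    ROWS, LOGIT_GENERATION_GPU_HOURS, TDP_WATTS
)
print(f"training GPU hours: {training_hours:,.1f}")
print(f"location-based emissions (t CO2eq): {emissions:.3f}")
print(f"training + logit generation GPU hours: {total_hours:,.1f}")
print(f"rough GPU energy at TDP (MWh): {energy_mwh:.1f}")
```

The training hours sum to roughly 833k, which matches the new Total row. The 916k figure in the prose equals the previous total (830k + 86k), so it appears not to include the quantization rows.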

 The methodology used to determine training energy use and greenhouse gas emissions can be found [here](https://arxiv.org/pdf/2204.05149). Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.

@@ -258,6 +324,24 @@ The methodology used to determine training energy use and greenhouse gas emissio

 **Data Freshness:** The pretraining data has a cutoff of December 2023\.

+## Quantization
+
+### Quantization Scheme
+
+We designed the current quantization scheme with [PyTorch’s ExecuTorch](https://github.com/pytorch/executorch) inference framework and the Arm CPU backend in mind, taking into account metrics including model quality, prefill/decoding speed, and memory footprint. Our quantization scheme involves three parts:
+- All linear layers in all transformer blocks are quantized to a 4-bit groupwise scheme (with a group size of 32) for weights and 8-bit per-token dynamic quantization for activations.
+- The classification layer is quantized to 8-bit per-channel for weights and 8-bit per-token dynamic quantization for activations.
+- Similar to the classification layer, 8-bit per-channel quantization is used for the embedding layer.
+
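
Reviewer note: to make the scheme above concrete, here is a minimal pure-Python sketch of 4-bit groupwise weight quantization (group size 32) and 8-bit per-token dynamic activation quantization. It is not the ExecuTorch implementation. The function names are hypothetical, and symmetric rounding with one floating-point scale per group or token is a simplifying assumption that the model card does not spell out.

```python
def quantize_groupwise(weights, group_size=32, bits=4):
    """Symmetric groupwise quantization: one scale per `group_size` weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer level and clamp to the 4-bit range.
        quantized.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return quantized, scales


def dequantize_groupwise(quantized, scales, group_size=32):
    """Map integer levels back to floats using each group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


def quantize_per_token(activations, bits=8):
    """Dynamic quantization: the scale is computed at run time from one
    token's activation vector rather than calibrated ahead of time."""
    qmax = 2 ** (bits - 1) - 1   # 127 for 8-bit
    max_abs = max(abs(a) for a in activations)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(a / scale))) for a in activations], scale


# Toy data: 64 weights form two groups of 32, each with its own scale.
weights = [(i - 32) / 32 for i in range(64)]
q_weights, w_scales = quantize_groupwise(weights)
restored = dequantize_groupwise(q_weights, w_scales)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

token = [0.5, -1.25, 3.0, 0.0]
q_token, a_scale = quantize_per_token(token)
print(len(w_scales), max_error, q_token, a_scale)
```

Smaller groups give each scale fewer outliers to cover, at the cost of storing more scales, which is the trade-off a group size of 32 balances.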
+
+### Quantization-Aware Training and LoRA
+
+The models trained with quantization-aware training (QAT) and low-rank adaptation (LoRA) went through only post-training stages, using the same data as the full-precision models. To initialize QAT, we use BF16 Llama 3.2 model checkpoints obtained after supervised fine-tuning (SFT) and perform an additional full round of SFT training with QAT. We then freeze the backbone of the QAT model and perform another round of SFT with LoRA adaptors applied to all layers within the transformer block. Meanwhile, the LoRA adaptors' weights and activations are maintained in BF16. Because our approach is similar to the QLoRA of Dettmers et al. (2023) (i.e., quantization followed by LoRA adaptors), we refer to this method as QLoRA. Finally, we fine-tune the resulting model (both backbone and LoRA adaptors) using direct preference optimization (DPO).
+
+### SpinQuant
+
+[SpinQuant](https://arxiv.org/abs/2405.16406) was applied, together with generative post-training quantization (GPTQ). For the SpinQuant rotation matrix fine-tuning, we optimized for 100 iterations, using 800 samples with sequence-length 2048 from the WikiText 2 dataset. For GPTQ, we used 128 samples from the same dataset with the same sequence-length.
+
 ## Benchmarks \- English Text

 In this section, we report the results for Llama 3.2 models on standard automatic benchmarks. For all these evaluations, we used our internal evaluations library.
@@ -276,35 +360,64 @@ In this section, we report the results for Llama 3.2 models on standard automati

 ### Instruction Tuned Models

-| Capability |  | Benchmark | \# Shots | Metric | Llama 3.2 1B | Llama 3.2 3B | Llama 3.1 8B |
-| :---: | ----- | :---: | :---: | :---: | :---: | :---: | :---: |
-| General |  | MMLU | 5 | macro\_avg/acc | 49.3 | 63.4 | 69.4 |
-| Re-writing |  | Open-rewrite eval | 0 | micro\_avg/rougeL | 41.6 | 40.1 | 40.9 |
-| Summarization |  | TLDR9+ (test) | 1 | rougeL | 16.8 | 19.0 | 17.2 |
-| Instruction following |  | IFEval | 0 | Avg(Prompt/Instruction acc Loose/Strict) | 59.5 | 77.4 | 80.4 |
-| Math |  | GSM8K (CoT) | 8 | em\_maj1@1 | 44.4 | 77.7 | 84.5 |
-|  |  | MATH (CoT) | 0 | final\_em | 30.6 | 48.0 | 51.9 |
-| Reasoning |  | ARC-C | 0 | acc | 59.4 | 78.6 | 83.4 |
-|  |  | GPQA | 0 | acc | 27.2 | 32.8 | 32.8 |
-|  |  | Hellaswag | 0 | acc | 41.2 | 69.8 | 78.7 |
-| Tool Use |  | BFCL V2 | 0 | acc | 25.7 | 67.0 | 67.1 |
-|  |  | Nexus | 0 | macro\_avg/acc | 13.5 | 34.3 | 38.5 |
-| Long Context |  | InfiniteBench/En.QA | 0 | longbook\_qa/f1 | 20.3 | 19.8 | 27.3 |
-|  |  | InfiniteBench/En.MC | 0 | longbook\_choice/acc | 38.0 | 63.3 | 72.2 |
-|  |  | NIH/Multi-needle | 0 | recall | 75.0 | 84.7 | 98.8 |
-| Multilingual |  | MGSM (CoT) | 0 | em | 24.5 | 58.2 | 68.9 |
+| Capability |  | Benchmark | \# Shots | Metric | Llama 3.2 1B bf16 | Llama 3.2 1B Vanilla PTQ\*\* | Llama 3.2 1B Spin Quant | Llama 3.2 1B QLoRA | Llama 3.2 3B bf16 | Llama 3.2 3B Vanilla PTQ\*\* | Llama 3.2 3B Spin Quant | Llama 3.2 3B QLoRA | Llama 3.1 8B |
+| :---: | ----- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| General |  | MMLU | 5 | macro\_avg/acc | 49.3 | 43.3 | 47.3 | 49.0 | 63.4 | 60.5 | 62 | 62.4 | 69.4 |
+| Re-writing |  | Open-rewrite eval | 0 | micro\_avg/rougeL | 41.6 | 39.2 | 40.9 | 41.2 | 40.1 | 40.3 | 40.8 | 40.7 | 40.9 |
+| Summarization |  | TLDR9+ (test) | 1 | rougeL | 16.8 | 14.9 | 16.7 | 16.8 | 19.0 | 19.1 | 19.2 | 19.1 | 17.2 |
+| Instruction following |  | IFEval | 0 | Avg(Prompt/Instruction acc Loose/Strict) | 59.5 | 51.5 | 58.4 | 55.6 | 77.4 | 73.9 | 73.5 | 75.9 | 80.4 |
+| Math |  | GSM8K (CoT) | 8 | em\_maj1@1 | 44.4 | 33.1 | 40.6 | 46.5 | 77.7 | 72.9 | 75.7 | 77.9 | 84.5 |
+|  |  | MATH (CoT) | 0 | final\_em | 30.6 | 20.5 | 25.3 | 31.0 | 48.0 | 44.2 | 45.3 | 49.2 | 51.9 |
+| Reasoning |  | ARC-C | 0 | acc | 59.4 | 54.3 | 57 | 60.7 | 78.6 | 75.6 | 77.6 | 77.6 | 83.4 |
+|  |  | GPQA | 0 | acc | 27.2 | 25.9 | 26.3 | 25.9 | 32.8 | 32.8 | 31.7 | 33.9 | 32.8 |
+|  |  | Hellaswag | 0 | acc | 41.2 | 38.1 | 41.3 | 41.5 | 69.8 | 66.3 | 68 | 66.3 | 78.7 |
+| Tool Use |  | BFCL V2 | 0 | acc | 25.7 | 14.3 | 15.9 | 23.7 | 67.0 | 53.4 | 60.1 | 63.5 | 67.1 |
+|  |  | Nexus | 0 | macro\_avg/acc | 13.5 | 5.2 | 9.6 | 12.5 | 34.3 | 32.4 | 31.5 | 30.1 | 38.5 |
+| Long Context |  | InfiniteBench/En.QA | 0 | longbook\_qa/f1 | 20.3 | N/A | N/A | N/A | 19.8 | N/A | N/A | N/A | 27.3 |
+|  |  | InfiniteBench/En.MC | 0 | longbook\_choice/acc | 38.0 | N/A | N/A | N/A | 63.3 | N/A | N/A | N/A | 72.2 |
+|  |  | NIH/Multi-needle | 0 | recall | 75.0 | N/A | N/A | N/A | 84.7 | N/A | N/A | N/A | 98.8 |
+| Multilingual |  | MGSM (CoT) | 0 | em | 24.5 | 13.7 | 18.2 | 24.4 | 58.2 | 48.9 | 54.3 | 56.8 | 68.9 |
+
+\*\*for comparison purposes only. Model not released.

 ### Multilingual Benchmarks

-| Category | Benchmark | Language | Llama 3.2 1B | Llama 3.2 3B | Llama 3.1 8B |
-| :---: | :---: | :---: | :---: | :---: | :---: |
-| General | MMLU (5-shot, macro\_avg/acc) | Portuguese | 39.82 | 54.48 | 62.12 |
-|  |  | Spanish | 41.52 | 55.09 | 62.45 |
-|  |  | Italian | 39.79 | 53.77 | 61.63 |
-|  |  | German | 39.20 | 53.29 | 60.59 |
-|  |  | French | 40.47 | 54.59 | 62.34 |
-|  |  | Hindi | 33.51 | 43.31 | 50.88 |
-|  |  | Thai | 34.67 | 44.54 | 50.32 |
+| Category | Benchmark | Language | Llama 3.2 1B | Llama 3.2 1B Vanilla PTQ\*\* | Llama 3.2 1B Spin Quant | Llama 3.2 1B QLoRA | Llama 3.2 3B | Llama 3.2 3B Vanilla PTQ\*\* | Llama 3.2 3B Spin Quant | Llama 3.2 3B QLoRA | Llama 3.1 8B |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| General | MMLU (5-shot, macro_avg/acc) | Portuguese | 39.8 | 34.9 | 38.9 | 40.2 | 54.5 | 50.9 | 53.3 | 53.4 | 62.1 |
+| | | Spanish | 41.5 | 36.0 | 39.8 | 41.8 | 55.1 | 51.9 | 53.6 | 53.6 | 62.5 |
+| | | Italian | 39.8 | 34.9 | 38.1 | 40.6 | 53.8 | 49.9 | 52.1 | 51.7 | 61.6 |
+| | | German | 39.2 | 34.9 | 37.5 | 39.6 | 53.3 | 50.0 | 52.2 | 51.3 | 60.6 |
+| | | French | 40.5 | 34.8 | 39.2 | 40.8 | 54.6 | 51.2 | 53.3 | 53.3 | 62.3 |
+| | | Hindi | 33.5 | 30.0 | 32.1 | 34.0 | 43.3 | 40.4 | 42.0 | 42.1 | 50.9 |
+| | | Thai | 34.7 | 31.2 | 32.4 | 34.9 | 44.5 | 41.3 | 44.0 | 42.2 | 50.3 |
+
+\*\*for comparison purposes only. Model not released.
+
+## Inference time
+
+In the table below, we compare the performance metrics of different quantization methods (SpinQuant and QAT \+ LoRA) with the BF16 baseline. The evaluation was done using the [ExecuTorch](https://github.com/pytorch/executorch) framework as the inference engine, with the Arm CPU as the backend, on an Android OnePlus 12 device.
+
+| Category | Decode (tokens/sec)  | Time-to-first-token (sec) | Prefill (tokens/sec) | Model size (PTE file size in MB) | Memory size (RSS in MB) |
+| :---- | ----- | ----- | ----- | ----- | ----- |
+| 1B BF16 (baseline) | 19.2 | 1.0 | 60.3 | 2358 | 3,185 |
+| 1B SpinQuant | 50.2 (2.6x) | 0.3 (-76.9%) | 260.5 (4.3x) | 1083 (-54.1%) | 1,921 (-39.7%) |
+| 1B QLoRA | 45.8 (2.4x) | 0.3 (-76.0%) | 252.0 (4.2x) | 1127 (-52.2%) | 2,255 (-29.2%) |
+| 3B BF16 (baseline) | 7.6 | 3.0 | 21.2 | 6129 | 7,419 |
+| 3B SpinQuant | 19.7 (2.6x) | 0.7 (-76.4%) | 89.7 (4.2x) | 2435 (-60.3%) | 3,726 (-49.8%) |
+| 3B QLoRA | 18.5 (2.4x) | 0.7 (-76.1%) | 88.8 (4.2x) | 2529 (-58.7%) | 4,060 (-45.3%) |
+
+(\*) The performance measurement is done using an adb binary-based approach.
+(\*\*) It is measured on an Android OnePlus 12 device.
+(\*\*\*) Time-to-first-token (TTFT) is measured with a prompt length of 64.
+
+*Footnote:*
+
+- *Decode (tokens/second) measures how quickly the model keeps generating tokens. Higher is better.*
+- *Time-to-first-token (TTFT) measures how quickly the model generates the first token for a given prompt. Lower is better.*
+- *Prefill (tokens/second) measures how quickly the prompt is processed; it is inversely related to TTFT. Higher is better.*
+- *Model size is the size of the PTE file, a binary file format for ExecuTorch.*
+- *RSS size is the memory usage in resident set size (RSS).*
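
Reviewer note: the relative figures in parentheses in the table can be recomputed from the absolute values. The sketch below does this for the 1B SpinQuant row against the 1B BF16 baseline. The helper names are hypothetical. The TTFT percentages are left out on purpose: the displayed TTFT values are rounded to one decimal, so the table's figures were presumably derived from unrounded measurements.

```python
# Absolute values copied from the 1B rows of the table.
BASELINE_1B = {"decode": 19.2, "prefill": 60.3, "size_mb": 2358, "rss_mb": 3185}
SPINQUANT_1B = {"decode": 50.2, "prefill": 260.5, "size_mb": 1083, "rss_mb": 1921}


def speedup(new, base):
    """Throughput ratio, reported in the table as e.g. '2.6x'."""
    return new / base


def percent_change(new, base):
    """Signed relative change in percent, reported as e.g. '-54.1%'."""
    return (new - base) / base * 100


decode_speedup = round(speedup(SPINQUANT_1B["decode"], BASELINE_1B["decode"]), 1)
prefill_speedup = round(speedup(SPINQUANT_1B["prefill"], BASELINE_1B["prefill"]), 1)
size_change = round(percent_change(SPINQUANT_1B["size_mb"], BASELINE_1B["size_mb"]), 1)
rss_change = round(percent_change(SPINQUANT_1B["rss_mb"], BASELINE_1B["rss_mb"]), 1)

print(f"decode: {decode_speedup}x, prefill: {prefill_speedup}x")
print(f"model size: {size_change}%, RSS: {rss_change}%")
```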

 ## Responsibility & Safety

@@ -350,7 +463,8 @@ In addition to our safety work above, we took extra care on measuring and/or mit

 **2\. Child Safety:** Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences.

-**3\. Cyber Attacks:** Our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed. Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention.
+**3\. Cyber Attacks:** For Llama 3.1 405B, our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed.
+Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention. Because Llama 3.2’s 1B and 3B models are smaller and less capable than Llama 3.1 405B, we broadly believe that the testing conducted for the 405B model also applies to Llama 3.2 models.

 ### Community

--- a/config.json
+++ b/config.json
@@ -24,7 +24,7 @@
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
-    "factor": 8.0,
+    "factor": 32.0,
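
Reviewer note: to illustrate what changing `factor` from 8.0 to 32.0 does, here is a sketch of the frequency-dependent RoPE scaling rule published with the Llama 3.1 reference code, using the `rope_scaling` values in this hunk. The function name, the `rope_theta` of 500000 and the head dimension of 64 are assumptions for illustration and do not appear in this diff. Whether a given runtime follows exactly this rule is also an assumption.

```python
import math


def llama3_scaled_inv_freq(inv_freq, factor=32.0, low_freq_factor=1.0,
                           high_freq_factor=4.0,
                           original_max_position_embeddings=8192):
    """Scale rotary inverse frequencies: short wavelengths are kept, long
    wavelengths are divided by `factor`, and the band between is interpolated."""
    low_freq_wavelen = original_max_position_embeddings / low_freq_factor
    high_freq_wavelen = original_max_position_embeddings / high_freq_factor
    scaled = []
    for freq in inv_freq:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            scaled.append(freq)            # high-frequency band: unchanged
        elif wavelen > low_freq_wavelen:
            scaled.append(freq / factor)   # low-frequency band: fully scaled
        else:
            # Smooth transition between the scaled and unscaled regimes.
            smooth = (original_max_position_embeddings / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return scaled


# Hypothetical base frequencies: rope_theta and head_dim are assumed values.
rope_theta, head_dim = 500000.0, 64
inv_freq = [1.0 / (rope_theta ** (2 * i / head_dim)) for i in range(head_dim // 2)]

scaled_8 = llama3_scaled_inv_freq(inv_freq, factor=8.0)
scaled_32 = llama3_scaled_inv_freq(inv_freq, factor=32.0)
print(scaled_8[-1] / scaled_32[-1])  # ratio for the lowest frequency
```

Only the long-wavelength components are affected by the new factor, so the highest-frequency components that encode local position stay identical under both settings.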
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -1,4 +1,16 @@
 {
-  "bos_token": "<|begin_of_text|>",
-  "eos_token": "<|eot_id|>"
+  "bos_token": {
+    "content": "<|begin_of_text|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|eot_id|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
 }
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -2329,10 +2329,69 @@
    ]
  },
  "post_processor": {
-    "type": "ByteLevel",
-    "add_prefix_space": true,
-    "trim_offsets": false,
-    "use_regex": true
+    "type": "Sequence",
+    "processors": [
+      {
+        "type": "ByteLevel",
+        "add_prefix_space": true,
+        "trim_offsets": false,
+        "use_regex": true
+      },
+      {
+        "type": "TemplateProcessing",
+        "single": [
+          {
+            "SpecialToken": {
+              "id": "<|begin_of_text|>",
+              "type_id": 0
+            }
+          },
+          {
+            "Sequence": {
+              "id": "A",
+              "type_id": 0
+            }
+          }
+        ],
+        "pair": [
+          {
+            "SpecialToken": {
+              "id": "<|begin_of_text|>",
+              "type_id": 0
+            }
+          },
+          {
+            "Sequence": {
+              "id": "A",
+              "type_id": 0
+            }
+          },
+          {
+            "SpecialToken": {
+              "id": "<|begin_of_text|>",
+              "type_id": 1
+            }
+          },
+          {
+            "Sequence": {
+              "id": "B",
+              "type_id": 1
+            }
+          }
+        ],
+        "special_tokens": {
+          "<|begin_of_text|>": {
+            "id": "<|begin_of_text|>",
+            "ids": [
+              128000
+            ],
+            "tokens": [
+              "<|begin_of_text|>"
+            ]
+          }
+        }
+      }
+    ]
  },
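
Reviewer note: the new `TemplateProcessing` step prepends `<|begin_of_text|>` (id 128000) to each sequence, with type id 0 for the first sequence and 1 for the second. The sketch below mimics the `single` and `pair` templates on plain lists of token ids. It is an illustration only and does not call the `tokenizers` library. The function name and the sample ids are hypothetical.

```python
BOS_TOKEN = "<|begin_of_text|>"
BOS_ID = 128000  # from the "special_tokens" entry in this hunk


def apply_template(ids_a, ids_b=None):
    """Return (input_ids, type_ids) following the single/pair templates."""
    input_ids = [BOS_ID] + list(ids_a)
    type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # The pair template repeats the BOS token before sequence B (type_id 1).
        second = [BOS_ID] + list(ids_b)
        input_ids += second
        type_ids += [1] * len(second)
    return input_ids, type_ids


single = apply_template([9906, 1917])            # hypothetical token ids
pair = apply_template([9906, 1917], [4438, 527])
print(single)
print(pair)
```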
  "decoder": {
    "type": "ByteLevel",
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -2050,7 +2050,7 @@
    }
  },
  "bos_token": "<|begin_of_text|>",
-  "chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",
+  "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n    {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n    {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n    {%- if strftime_now is defined %}\n        {%- set date_string = strftime_now(\"%d %b %Y\") %}\n    {%- else %}\n        {%- set date_string = \"26 Jul 2024\" %}\n    {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n    {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n    {%- set system_message = messages[0]['content']|trim %}\n    {%- set messages = messages[1:] %}\n{%- else %}\n    {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if tools is not none %}\n    {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n    {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n    {#- Extract the first user message so we can plug it in here #}\n    {%- if messages | length != 0 %}\n        {%- set first_user_message = messages[0]['content']|trim %}\n        {%- set messages = messages[1:] %}\n    {%- else %}\n        {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n    {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n    {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n    {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n    {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n    {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n        {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n    {%- elif 'tool_calls' in message %}\n        {%- if not message.tool_calls|length == 1 %}\n            {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n        {%- endif %}\n        {%- set tool_call = message.tool_calls[0].function %}\n        {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n        {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n        {{- '\"parameters\": ' }}\n        {{- tool_call.arguments | tojson }}\n        {{- \"}\" }}\n        {{- \"<|eot_id|>\" }}\n    {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n        {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n        {%- if message.content is mapping or message.content is iterable %}\n            {{- message.content | tojson }}\n        {%- else %}\n            {{- message.content }}\n        {%- endif %}\n        {{- \"<|eot_id|>\" }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|eot_id|>",
  "model_input_names": [
Author	SHA1	Message	Date
Varun Vontimitta	9213176726	Update model card (#25) - Update model card (8124ca2a858fd1c8ceea2fc56f2ff3f9e76a30c3) Co-authored-by: Pedro Cuenca <pcuenq@users.noreply.huggingface.co>	2024-10-24 15:07:51 +00:00
Pedro Cuenca	e9f8effbab	Update README.md (#11) - Update README.md (f678605081c436120a4c56aaa69d2f1e2ce093c9)	2024-09-25 17:53:59 +00:00
Pedro Cuenca	0907763fe3	Update chat template (#10) - Update chat template (41f045afe680103c97600a98f2f9021f095d8983)	2024-09-25 15:42:33 +00:00
Omar Sanseviero	bb0afa26a8	Fixes gate (#9) - Fixes gate (bcedc546f90c68cbe7a98dbdb0f502f2d5069322)	2024-09-25 13:11:46 +00:00
Omar Sanseviero	b0445fe70d	Upload folder using huggingface_hub (#8) - Upload folder using huggingface_hub (270281d0d0cc91dc6fa781a4a968e5c34c468345) Co-authored-by: Vaibhav Srivastav <reach-vb@users.noreply.huggingface.co>	2024-09-25 08:00:24 +00:00
Omar Sanseviero	efe927efa7	Update README (#4) - Update README (fc9004aec2c581e0163b71ab8ed59c52a0207b1f) Co-authored-by: Pedro Cuenca <pcuenq@users.noreply.huggingface.co>	2024-09-25 06:36:42 +00:00
Pedro Cuenca	6325870065	Update README.md (#3) - Update README.md (7d207e4971d450a04237f3ae7644bb4eb01306bb) Co-authored-by: Vaibhav Srivastav <reach-vb@users.noreply.huggingface.co>	2024-09-25 00:36:29 +00:00
Omar Sanseviero	123455c8ba	Update README.md	2024-09-24 18:51:29 +00:00
Omar Sanseviero	123b9d11c7	Update README.md	2024-09-24 18:44:14 +00:00
Kai Wu	c4219cc9e6	Update config.json (#2) - Update config.json (77454b7f1dcbf58a47cbdc056a3800c8acc4940e) Co-authored-by: Sanyam Bhutani <Sanyam@users.noreply.huggingface.co>	2024-09-24 00:22:38 +00:00
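
A note on the `tokenizer.json` change in the diff above: the new `TemplateProcessing` step places `<|begin_of_text|>` (id 128000) before sequence A, and in the pair case again before sequence B with `type_id` 1. The sketch below emulates only that layout in plain Python, as an illustration. It does not call the `tokenizers` library, `apply_template` is a hypothetical helper name, and the input token ids are made-up placeholders.

```python
from typing import List, Optional, Tuple

# Id of <|begin_of_text|>, taken from the "special_tokens" entry in the diff.
BOS_ID = 128000


def apply_template(
    a_ids: List[int], b_ids: Optional[List[int]] = None
) -> Tuple[List[int], List[int]]:
    """Emulate the "single" and "pair" templates.

    Returns (input_ids, type_ids). Hypothetical helper for illustration only.
    """
    # "single": <|begin_of_text|> followed by sequence A, all with type_id 0.
    ids = [BOS_ID] + list(a_ids)
    type_ids = [0] * len(ids)
    if b_ids is not None:
        # "pair": a second <|begin_of_text|> and sequence B, all with type_id 1.
        ids += [BOS_ID] + list(b_ids)
        type_ids += [1] * (1 + len(b_ids))
    return ids, type_ids


# Placeholder token ids, not real encodings of any text.
single = apply_template([11, 22])
pair = apply_template([11], [22, 33])
```

With these placeholder inputs, `single` carries one leading BOS id and `pair` carries two, one per sequence, which matches the `single` and `pair` arrays in the diff.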