---
license: apache-2.0
datasets:
- pjura/mahjong_board_states
language:
- zh
- en
base_model:
- unsloth/Qwen3-4B-Instruct-2507
tags:
- riichi-mahjong
- game-ai
- qwen
- qwen3
- mahjong
- discard-recommendation
- gguf
pipeline_tag: text-generation
---

# Qwen3-4B-Instruct-2507-mahjong-alpha

[中文](./README.md)

`Qwen3-4B-Instruct-2507-mahjong-alpha` is a Riichi Mahjong domain model fine-tuned from `unsloth/Qwen3-4B-Instruct-2507` with QLoRA.

It is designed for 4-player Riichi Mahjong discard recommendation: given the round information, hand tiles, calls, visible tiles, tile efficiency, and defense signals, the model outputs the single best discard for the current state.

The current version is intended mainly for tool integration: the output is the tile name alone, with no explanation.

## Model Features

- **Task**: 4-player Riichi Mahjong discard recommendation
- **Base model**: `unsloth/Qwen3-4B-Instruct-2507`
- **Fine-tuning**: `QLoRA`
- **Training framework**: `Unsloth`
- **Release format**: `GGUF (F16)`
- **Inference**: `llama.cpp`
- **Maintainer**: `TTDXQ`

## Scope

This model targets 4-player Riichi Mahjong without red dora. The current version focuses solely on discard recommendation; it does not provide full-game planning, yaku/score analysis, or detailed offense-defense explanations.

## Limitations

- Discard recommendation only
- No full-game planning
- No yaku analysis, point calculation, or detailed strategic explanation
- Not validated for competitive or real-match performance
- For research and learning purposes only

## Prohibited Uses

This model must not be used for:

- cheating
- game automation or plug-ins
- account boosting or ghost-playing
- real-money gambling assistance

## Input and Output

### Input Format

The model input is a structured natural-language game-state description in Chinese, matching the training data. Example:

```text
[情景分析]
- 牌局: 东一局,你是庄家 (第1巡,牌墙余69张)。
- 状态: 当前排名 1/4 (与一位差 0)。
- 宝牌: 5万
- 各玩家分数: 你有 25分, 下家: 25分, 对家: 25分, 上家: 25分。
- 你的手牌: 1万 5万 7万 3筒 5筒 6筒 8筒 8筒 3索 5索 8索 南 白 发
- 牌效: 5 向听,进张 82 张。
- 防御:
  最安全牌放铳率:11.3%
  平均放铳率:18.5%
  最危险牌放铳率:25.9%
场上已见牌信息
各玩家副露信息:本家副露:无, 下家副露:无, 对家副露:无, 上家副露:无
各玩家牌河信息:本家:无, 下家:无, 对家:无, 上家:无

[任务]
根据当前情景,选择一张最应该打出的手牌。
```

### Output Format

The output is strictly the tile name alone, with no prefix such as "discard" and no explanation. Example:

```text
白
```

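Because the model emits the tile name alone, a caller can check the reply against the current hand before acting on it. A minimal sketch — the `validate_discard` helper is hypothetical, not part of this repo:

```python
# Hypothetical post-processing helper (not part of the model release):
# verify that the model's reply is exactly one tile that is actually in the hand.
def validate_discard(reply: str, hand: list[str]) -> str:
    """Return the chosen tile, or raise if the reply is not a tile in the hand."""
    tile = reply.strip()
    if tile not in hand:
        raise ValueError(f"model returned {tile!r}, which is not in the hand")
    return tile

hand = "1万 5万 7万 3筒 5筒 6筒 8筒 8筒 3索 5索 8索 南 白 发".split()
print(validate_discard("白\n", hand))  # -> 白
```

This guards against occasional malformed generations (extra text, a tile not in hand) before the recommendation is used by a tool.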
## Usage

### llama.cpp Inference

```bash
llama-server -m Qwen3-4B-Instruct-2507-mahjong-alpha.gguf -c 2048
```

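Once `llama-server` is running it also exposes an OpenAI-compatible HTTP API (by default on port 8080), so the model can be queried without loading it in Python. A minimal sketch, assuming a server started as above; the request body is illustrative and borrows the deterministic sampling settings from the evaluation section:

```python
import json
import urllib.request

# Sketch: query a locally running llama-server (default http://localhost:8080).
# /v1/chat/completions is llama.cpp's OpenAI-compatible chat endpoint.
payload = {
    "messages": [{"role": "user", "content": "[情景分析]\n..."}],  # full game state goes here
    "temperature": 0.1,
    "top_p": 0.1,
    "max_tokens": 10,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```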
### Python Inference Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TTDXQ/Qwen3-4B-Instruct-2507-mahjong-alpha"
)
tokenizer = AutoTokenizer.from_pretrained(
    "TTDXQ/Qwen3-4B-Instruct-2507-mahjong-alpha"
)

# Prepare input
input_text = """[情景分析]
- 牌局: 东一局,你是庄家 (第1巡,牌墙余69张)。
- 状态: 当前排名 1/4 (与一位差 0)。
..."""

# Inference
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
# Decode only the newly generated tokens, not the echoed prompt
result = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(result)  # Output: 白
```

## Dataset

The training data uses the 2018 subset of `pjura/mahjong_board_states`. This dataset originates from Tenhou.net game records; each record contains 511 data points covering game basics, dora indicators, player hand tiles, calls, discard piles, and the discard decision.

### Data Processing

The raw records were converted into human-readable natural-language descriptions, with calculated turn numbers, the actual dora, and a simplified risk assessment. Sample distribution by turn:

- Turns 1-3: 15%
- Turns 4-6: 20%
- Turns 7-12: 35%

A total of `192000` samples were used, with no general-instruction data or self-built data mixed in.

- Train: `192000`
- Validation: `2000`
- Test: sampled as needed from the 2019 data
- Train / validation / test sets are fully non-overlapping

### Dataset Citation

```bibtex
@dataset{mahjong_board_states,
  title  = {MahJong Board States Dataset},
  author = {Patrick Jura},
  year   = {2024},
  url    = {https://huggingface.co/datasets/pjura/mahjong_board_states}
}
```

## Training Details

### Model Configuration

- Base Model: `unsloth/Qwen3-4B-Instruct-2507`
- Training Precision: `4-bit` (quantized base for QLoRA)
- Fine-tuning Method: `QLoRA`
- Framework: `Unsloth`
- Max Sequence Length: `2048`

### LoRA Parameters

- Rank: `128`
- Alpha: `256`
- Target Modules: all

### Training Hyperparameters

- Learning Rate: `1e-4`
- LR Scheduler: `cosine`
- Batch Size: `64` (per-device batch `2` × gradient accumulation `32`)
- Training Steps: `3000`
- Warmup Steps: `300`
- Random Seed: `3407`
- Load Best Checkpoint: yes

### Training Time

- Total Duration: ~16.44 hours

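The hyperparameters above are mutually consistent: assuming a single training device, the effective batch is the per-device batch times the gradient-accumulation steps, and 3000 steps at that batch size cover the 192000 training samples exactly once. A quick check:

```python
# Consistency check of the training budget from the numbers above
# (assumes a single device, so effective batch = per-device batch x grad accum).
per_device_batch = 2
grad_accum = 32
steps = 3000
train_samples = 192_000

effective_batch = per_device_batch * grad_accum
print(effective_batch)          # -> 64, the stated batch size
print(effective_batch * steps)  # -> 192000, i.e. exactly one epoch
```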
## Evaluation Results

### Comparison with Dataset Actions

Inference parameters: temperature = 0.1, top_p = 0.1

**Metrics explanation**:

- Score: maximum 500 points (1 point per correct sample, 0 for an incorrect one)
- Full-match rate: fraction of samples where all 3 test runs matched the dataset action
- Zero-score rate: fraction of samples where all 3 test runs disagreed with the dataset action

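One plausible reading of these metrics (my assumption: 500 samples, each evaluated 3 times, with the per-sample score being the fraction of correct runs, which would explain the fractional scores in the tables below) can be sketched as:

```python
# Sketch of the metric aggregation under ASSUMED semantics:
# each sample is run 3 times; per-sample score = correct runs / 3.
def aggregate(results):
    """results: one 3-tuple of booleans per sample (was each run correct?)."""
    n = len(results)
    score = sum(sum(runs) / len(runs) for runs in results)
    full_match = sum(all(runs) for runs in results) / n
    zero_score = sum(not any(runs) for runs in results) / n
    return score, full_match, zero_score

demo = [(True, True, True), (True, False, False), (False, False, False)]
score, full, zero = aggregate(demo)
print(round(score, 2), full, zero)
```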
#### Tile-Efficiency Test

| Model | Method | Score | Full-match Rate | Zero-score Rate |
|-------|--------|-------|-----------------|-----------------|
| Qwen3-4B | Prompt Engineering | 50.21 | 6.60% | 86.13% |
| Qwen3-4B | Fine-tuned | 229.66 | 45.87% | 53.93% |
| DeepSeek-V3.2 | Prompt Engineering | 181.66 | 21.40% | 46.33% |

#### Defense Test

| Model | Method | Score | Full-match Rate | Zero-score Rate |
|-------|--------|-------|-----------------|-----------------|
| Qwen3-4B | Prompt Engineering | 53.55 | 6.17% | 84.43% |
| Qwen3-4B | Fine-tuned | 239.89 | 47.93% | 52.00% |
| DeepSeek-V3.2 | Prompt Engineering | 172.00 | 16.00% | 46.80% |

#### Comprehensive Test

| Model | Method | Score | Full-match Rate | Zero-score Rate |
|-------|--------|-------|-----------------|-----------------|
| Qwen3-4B | Prompt Engineering | 53.44 | 0.60% | 84.40% |
| Qwen3-4B | Fine-tuned | 233.33 | 46.53% | 53.20% |
| DeepSeek-V3.2 | Prompt Engineering | 179.44 | 18.07% | 44.93% |

### Comparison with Mortal

Inference parameters: temperature = 0.6, top_p = 0.95

#### Test 1: All Turn Data

- Samples: 3000
- Top-1 Accuracy: **50.73%**
- Top-3 Accuracy: **83.37%**

#### Test 2: Excluding Early Turns

- Valid Samples: 3000
- Top-1 Accuracy: **48.70%**
- Top-3 Accuracy: **79.20%**

> Note: Mortal is one of the strongest open-source Riichi Mahjong AIs currently available.

## Repository Links

- GitHub: https://github.com/ttdxq/Qwen3-4B-Instruct-2507-mahjong-alpha
- Hugging Face: https://huggingface.co/TTDXQ/Qwen3-4B-Instruct-2507-mahjong-alpha

## License

This model is released under the Apache License 2.0.

The training data comes from `pjura/mahjong_board_states`, which is licensed under `CC BY 4.0`; preserve the required attribution and citation when using it.

## Acknowledgements

Thanks to the following open-source resources:

- `unsloth/Qwen3-4B-Instruct-2507`
- `pjura/mahjong_board_states`
- `Mortal`