Ko-PIQA: Korean Physical Commonsense Reasoning Dataset

📖 Dataset Overview

Ko-PIQA is a Korean physical commonsense reasoning dataset. It complements English-centric benchmarks such as PIQA and adds culturally grounded physical reasoning questions.

  • Total items: 441
  • Culturally-grounded items: 87 (19.7%)
    (e.g., kimchi storage, hanbok care, ondol heating)
  • Format: PIQA-style binary choice (solution0 / solution1)
  • Goal: Evaluate Korean LLM physical reasoning capabilities
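
The binary-choice format listed above can be turned into a model prompt in many ways. The following is a minimal sketch, not the authors' evaluation setup: the prompt template and the helper names `format_item` and `parse_choice` are assumptions made for illustration.

```python
from typing import Dict, Optional


def format_item(item: Dict[str, str]) -> str:
    """Render one item as a two-option question (hypothetical template)."""
    return (
        f"Question: {item['prompt']}\n"
        f"0: {item['solution0']}\n"
        f"1: {item['solution1']}\n"
        "Answer with 0 or 1:"
    )


def parse_choice(reply: str) -> Optional[int]:
    """Return the first 0/1 digit found in a model reply, or None if absent."""
    for ch in reply:
        if ch in "01":
            return int(ch)
    return None
```

Taking the first digit is a deliberately crude parsing rule; a stricter evaluation would likely compare option likelihoods or constrain the output format.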

📊 Data Fields

| Field | Type | Description |
|---|---|---|
| prompt | string | The goal or question |
| solution0 | string | Candidate answer A |
| solution1 | string | Candidate answer B |
| label | int | Correct answer index (0 or 1) |
| cultural | int/null | 1 if culturally-grounded, otherwise null |
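
The field constraints above can be checked mechanically. This is a minimal sketch: `validate_item` is a hypothetical helper, and it assumes a missing `cultural` value arrives as Python `None` (some loaders may surface it as NaN instead).

```python
from typing import Any, Dict

REQUIRED_STRING_FIELDS = ("prompt", "solution0", "solution1")


def validate_item(item: Dict[str, Any]) -> None:
    """Raise ValueError if `item` does not match the documented fields."""
    for name in REQUIRED_STRING_FIELDS:
        value = item.get(name)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name!r} must be a non-empty string")
    # label is the index of the correct solution
    if item.get("label") not in (0, 1):
        raise ValueError("'label' must be 0 or 1")
    # cultural is 1 for culturally-grounded items, otherwise null (assumed None here)
    if item.get("cultural") not in (1, None):
        raise ValueError("'cultural' must be 1 or null")
```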

🔎 Source & Filtering Pipeline

  • Source: 3.01M Korean Q&A pairs from Naver Knowledge iN (collected until May 2025)
  • Step 1: Filtered PIQA-style questions using Qwen3-4B, Qwen3-32B, and HCX-14B
    → 11,553 candidates
  • Step 2: Sampled 600 general and 158 cultural questions
  • Step 3: Refined and generated distractors using GPT-4o
  • Step 4: Two native Korean speakers validated and filtered questions → 471 items
  • Step 5: Deduplicated using KoSentenceBERT (cosine similarity > 0.85) → final 441 items
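
Step 5 can be illustrated with a greedy similarity filter. This is a sketch under stated assumptions: the embeddings are taken to be precomputed by a sentence encoder such as KoSentenceBERT, the toy vectors below are made up, and keeping the first item of each near-duplicate group is an assumption, since the pipeline description does not say which duplicate was retained.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(embeddings: Sequence[Sequence[float]], threshold: float = 0.85) -> List[int]:
    """Return indices to keep: an item is dropped if its similarity to any
    already-kept item exceeds `threshold`."""
    kept: List[int] = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D vectors standing in for sentence embeddings (illustrative only)
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(deduplicate(toy))  # the second vector is a near-duplicate of the first
```

The pairwise loop is quadratic in the number of items, which is unproblematic at the scale of a few hundred questions.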

💡 Example

{
  "prompt": "김치찌개를 끓일 때 묵은지의 신맛을 중화시키면서도 깊은 맛을 내려면?",
  "solution0": "설탕을 한 스푼 넣고 물을 부은 후 중불에서 5분간 끓인다.",
  "solution1": "설탕을 한 스푼 넣고 중불에서 5분간 먼저 볶은 후 물을 붓는다.",
  "label": 1,
  "cultural": 1
}

English gloss of the example: the prompt asks, "When making kimchi stew, how can you neutralize the sourness of aged kimchi (mukeunji) while still bringing out a deep flavor?" solution0 reads "Add a spoonful of sugar, pour in water, then boil over medium heat for 5 minutes." solution1 reads "Add a spoonful of sugar, stir-fry over medium heat for 5 minutes first, then pour in water."

💻 Usage

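# Requires the Hugging Face `datasets` library
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
ds = load_dataset("HAERAE-HUB/Ko-PIQA")
# Print the first item of the 'train' split
print(ds['train'][0])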
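
Once items are loaded, predictions can be scored overall and separately on the culturally-grounded subset. The sketch below is illustrative rather than the authors' evaluation code: `accuracy_by_subset` is a hypothetical helper, and it treats any `cultural` value other than 1 as a general item.

```python
from typing import Any, Dict, Iterable, List, Optional


def accuracy_by_subset(
    items: Iterable[Dict[str, Any]], predictions: Iterable[int]
) -> Dict[str, Optional[float]]:
    """Accuracy over all items, cultural items, and general items.
    A subset with no items maps to None."""
    hits: Dict[str, List[int]] = {"all": [], "cultural": [], "general": []}
    for item, pred in zip(items, predictions):
        correct = int(pred == item["label"])
        hits["all"].append(correct)
        subset = "cultural" if item.get("cultural") == 1 else "general"
        hits[subset].append(correct)
    return {name: (sum(h) / len(h) if h else None) for name, h in hits.items()}
```

With the dataset loaded as above, `accuracy_by_subset(ds['train'], preds)` would apply, where `preds` is a list of 0/1 choices produced by whatever model is being evaluated.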

📌 Citation

@misc{choi2025kopiqakoreanphysicalcommonsense,
      title={Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context}, 
      author={Dasol Choi and Jungwhan Kim and Guijin Son},
      year={2025},
      eprint={2509.11303},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.11303}, 
}