Fengshuo Bai4,6, Zilong Zheng5, Yipeng Kang5, Yaodong Yang1
Institute for Artificial Intelligence, Peking University1 USTC2 WHU3 SJTU4 BIGAI5 Zhongguancun Academy6
This repository is the official implementation of PoliCon, a benchmark for evaluating large language models on political consensus tasks under different objectives (seat apportionment, Rawlsian fairness, utilitarianism) and voting rules (simple majority, 2/3 majority, veto power).
- 2026.1.26: PoliCon is accepted to ICLR 2026!
- Overview
- Installation
- Project Structure
- Configuration
- Datasets
- Running Evaluations
- Results
- Citation
PoliCon evaluates LLMs in a simulated parliament setting: given multi-party stances on policy topics, the model must draft resolutions that satisfy specified consensus objectives and voting thresholds. The benchmark supports:
- Task types: `seat_apportionment`, `rawlsianism`, `utilitarianism`
- Voting rules: `simple_majority`, `2_3_majority`, `veto_power`
- Party counts: 2, 4, or 6 political groups
- Topics: 19 PoliCon policy areas (see Datasets)
Both local models (e.g., Qwen, Llama) and API-based models (e.g., GPT-4o, Gemini) are supported.
- Clone the repository

  ```bash
  git clone https://github.com/bigai-nlco/PoliCon.git
  cd PoliCon
  ```

- Create and activate a conda environment

  ```bash
  conda create -y --name policon python=3.10
  conda activate policon
  export PYTHONPATH=$(pwd)
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- API keys (for OpenAI-compatible and API-based models)

  Edit `openai_keys.py` and set your keys and base URL:

  ```python
  OPENAI_API_KEY = "<your-api-key>"
  OPENAI_BASE_URL = "<your-api-base-url>"  # e.g. OpenAI or compatible endpoint
  ```
```
PoliCon/
├── config.py                 # Argument definitions and defaults
├── openai_keys.py            # API keys (user-configured)
├── runner/
│   ├── task_runner.py        # Main evaluation entry point
│   └── task_prompts.py       # System and task prompts
├── evals/                    # Voting logic and GPT-based scoring
├── utils/                    # Model loading, API calls, I/O
├── datas/
│   ├── task_datas/           # Evaluation data (by party_num and topic)
│   ├── topic_datas/          # Raw topic data
│   ├── task_infos.py         # Topic list and party metadata
│   └── building_tasks.py     # Regenerate task_datas from topic_datas
├── scripts/
│   └── run_all_tasks.sh      # Batch run over full config grid
└── results/                  # Outputs (created at runtime)
    ├── policon_logs/         # Per-run detailed logs
    └── policon_rsts/         # Aggregated scores per (model, topic, setting)
```
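As the comment on `datas/building_tasks.py` indicates, the evaluation files under `datas/task_datas/` are derived from the raw files in `datas/topic_datas/`. If you need to regenerate them, the script can likely be run directly from the repository root; the exact invocation below is an assumption, so check the script for any required arguments:

```bash
# Regenerate task_datas from topic_datas (assumed invocation; the script
# may accept or require additional arguments).
python datas/building_tasks.py
```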
All evaluation parameters are defined in config.py and can be passed as CLI arguments to task_runner.py.
| Argument | Type | Default | Description |
|---|---|---|---|
| `--env_name` | str | `PoliCon` | Environment name |
| `--task_setting` | str | `seat_apportionment` | Task type: `seat_apportionment`, `rawlsianism`, `utilitarianism` |
| `--voting_threshold_setting` | str | `simple_majority` | Voting rule: `simple_majority`, `2_3_majority`, `veto_power` |
| `--party_num` | int | 2 | Number of parties: 2, 4, 6 |
| `--eval_topic` | str | `gender equality` | Topic name (see `datas/task_infos.py`) |
| `--model_name_or_path` | str | (see config) | Model path (local) or API model name |
| `--device` | str | `cuda:0` | GPU device |
| `--multi_gpu` | flag | False | Use multiple GPUs for local models |
| `--max_new_tokens` | int | 512 | Max generation length |
| `--temperature` | float | 0.7 | Sampling temperature |
| `--top_p` | float | 0.95 | Nucleus sampling threshold |
| `--seed` | int | 42 | Random seed |
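For example, a single evaluation run using the arguments above might look like the following (a minimal sketch with illustrative values; `--model_name_or_path` can be either a local checkpoint path or an API model name):

```bash
python runner/task_runner.py \
    --task_setting rawlsianism \
    --voting_threshold_setting 2_3_majority \
    --party_num 4 \
    --eval_topic "gender equality" \
    --model_name_or_path gpt-4o \
    --max_new_tokens 512 \
    --temperature 0.7 \
    --top_p 0.95 \
    --seed 42
```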
The benchmark provides two types of data, `task_datas` and `topic_datas`, placed under the `datas` folder with the following structure:
```
datas
├── building_tasks.py
├── task_datas
│   ├── 2
│   │   ├── agriculture.json
│   │   ├── budget.json
│   │   ├── budgetary control.json
│   │   ├── civil liberties, justice & home affairs.json
│   │   ├── constitutional and inter-institutional affairs.json
│   │   ├── culture & education.json
│   │   ├── development.json
│   │   ├── economic & monetary affairs.json
│   │   ├── employment & social affairs.json
│   │   ├── environment & public health.json
│   │   ├── fisheries.json
│   │   ├── foreign & security policy.json
│   │   ├── gender equality.json
│   │   ├── industry, research & energy.json
│   │   ├── internal market & consumer protection.json
│   │   ├── international trade.json
│   │   ├── legal affairs.json
│   │   ├── regional development.json
│   │   └── transport & tourism.json
│   ├── 4
│   │   ├── agriculture.json
│   │   ├── budget.json
│   │   ├── budgetary control.json
│   │   ├── civil liberties, justice & home affairs.json
│   │   ├── constitutional and inter-institutional affairs.json
│   │   ├── culture & education.json
│   │   ├── development.json
│   │   ├── economic & monetary affairs.json
│   │   ├── employment & social affairs.json
│   │   ├── environment & public health.json
│   │   ├── fisheries.json
│   │   ├── foreign & security policy.json
│   │   ├── gender equality.json
│   │   ├── industry, research & energy.json
│   │   ├── internal market & consumer protection.json
│   │   ├── international trade.json
│   │   ├── legal affairs.json
│   │   ├── regional development.json
│   │   └── transport & tourism.json
│   └── 6
│       ├── agriculture.json
│       ├── budget.json
│       ├── budgetary control.json
│       ├── civil liberties, justice & home affairs.json
│       ├── constitutional and inter-institutional affairs.json
│       ├── culture & education.json
│       ├── development.json
│       ├── economic & monetary affairs.json
│       ├── employment & social affairs.json
│       ├── environment & public health.json
│       ├── fisheries.json
│       ├── foreign & security policy.json
│       ├── gender equality.json
│       ├── industry, research & energy.json
│       ├── internal market & consumer protection.json
│       ├── international trade.json
│       ├── legal affairs.json
│       ├── regional development.json
│       └── transport & tourism.json
├── task_infos.py
└── topic_datas
    ├── agriculture.json
    ├── budget.json
    ├── budgetary control.json
    ├── civil liberties, justice & home affairs.json
    ├── constitutional and inter-institutional affairs.json
    ├── culture & education.json
    ├── development.json
    ├── economic & monetary affairs.json
    ├── employment & social affairs.json
    ├── environment & public health.json
    ├── fisheries.json
    ├── foreign & security policy.json
    ├── foreign and security policy.json
    ├── gender equality.json
    ├── industry, research & energy.json
    ├── internal market & consumer protection.json
    ├── international trade.json
    ├── legal affairs.json
    ├── regional development.json
    └── transport & tourism.json
```

`scripts/run_all_tasks.sh` runs the full sweep: every combination of models × topics × party counts × task types × voting rules. It prints a progress bar and skips combinations whose results already exist.
Before running: set `model_paths` (and optionally other arrays) in `run_all_tasks.sh` to match your setup:

- Local models: use absolute paths to your model dirs (e.g. `/.cache/Qwen/Qwen2.5-32B-Instruct`).
- API models: use the API model name (e.g. `gpt-4o`, `gemini-2.5-flash-thinking`, `deepseek-v3.1`).
Default arrays in the script:
- `task_settings`: `utilitarianism`, `rawlsianism`, `seat_apportionment`
- `voting_threshold_settings`: `simple_majority`, `2_3_majority`, `veto_power`
- `party_nums`: `2`, `4`, `6`
- `eval_topics`: all 19 topics from the benchmark
- `model_paths`: edit to your models (see script)
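Conceptually, the sweep iterates over every combination of these arrays. A simplified sketch of what the script does (the actual script also prints a progress bar and skips combinations whose results already exist):

```bash
# Conceptual sketch of the sweep in scripts/run_all_tasks.sh, using the
# array names documented above; not the actual script.
for model in "${model_paths[@]}"; do
  for task in "${task_settings[@]}"; do
    for rule in "${voting_threshold_settings[@]}"; do
      for n in "${party_nums[@]}"; do
        for topic in "${eval_topics[@]}"; do
          python runner/task_runner.py \
            --model_name_or_path "$model" \
            --task_setting "$task" \
            --voting_threshold_setting "$rule" \
            --party_num "$n" \
            --eval_topic "$topic" \
            --multi_gpu
        done
      done
    done
  done
done
```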
Single-GPU users: remove the `--multi_gpu` flag from the `python runner/task_runner.py` call inside the script.
Run the full sweep with:

```bash
bash scripts/run_all_tasks.sh
```

The results show that Gemini-2.5 performs best, achieving the top result on 60% of the tasks. DeepSeek-V3.1 and GPT-4o follow, each attaining top performance on 33% of the tasks. Comparing the other evaluated LLMs, we identify the following trends: (1) thinking models such as Gemini-2.5 and DeepSeek-V3.1 generally outperform non-thinking models such as GPT-4o and Llama-3.3-70B; (2) commercial models typically outperform non-commercial models; (3) based on the four open-source models with known parameter sizes, performance generally correlates positively with model size.
If you find PoliCon useful in your research, please cite our paper:
```bibtex
@inproceedings{zhang2026policon,
    title={PoliCon: Evaluating {LLM}s on Achieving Diverse Political Consensus Objectives},
    author={Zhaowei Zhang and Xiaobo Wang and Minghua Yi and Mengmeng Wang and Fengshuo Bai and Zilong Zheng and Yipeng Kang and Yaodong Yang},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=MHlwNs9k1Y}
}
```