Fengshuo Bai4,6, Zilong Zheng5, Yipeng Kang5, Yaodong Yang1
Institute for Artificial Intelligence, Peking University1 USTC2 WHU3 SJTU4 BIGAI5 Zhongguancun Academy6
This repository is the official implementation of PoliCon, a benchmark for evaluating large language models on political consensus tasks under different objectives (seat apportionment, Rawlsian fairness, utilitarianism) and voting rules (simple majority, 2/3 majority, veto power).
- 2026.1.26: PoliCon is accepted to ICLR 2026!
- Overview
- Installation
- Project Structure
- Configuration
- Datasets
- Running Evaluations
- Results
- Citation
PoliCon evaluates LLMs in a simulated parliament setting: given multi-party stances on policy topics, the model must draft resolutions that satisfy specified consensus objectives and voting thresholds. The benchmark supports:
- Task types: `seat_apportionment`, `rawlsianism`, `utilitarianism`
- Voting rules: `simple_majority`, `2_3_majority`, `veto_power`
- Party counts: 2, 4, or 6 political groups
- Topics: 19 PoliCon policy areas (see Datasets)
Both local models (e.g., Qwen, Llama) and API-based models (e.g., GPT-4o, Gemini) are supported.
- Clone the repository

  ```bash
  git clone https://github.com/bigai-nlco/PoliCon.git
  cd PoliCon
  ```

- Create and activate a conda environment

  ```bash
  conda create -y --name policon python=3.10
  conda activate policon
  export PYTHONPATH=$(pwd)
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- API keys (for OpenAI-compatible and API-based models)

  Edit `openai_keys.py` and set your keys and base URL:

  ```python
  OPENAI_API_KEY = "<your-api-key>"
  OPENAI_BASE_URL = "<your-api-base-url>"  # e.g. OpenAI or compatible endpoint
  ```
```
PoliCon/
├── config.py                 # Argument definitions and defaults
├── openai_keys.py            # API keys (user-configured)
├── runner/
│   ├── task_runner.py        # Main evaluation entry point
│   └── task_prompts.py       # System and task prompts
├── evals/                    # Voting logic and GPT-based scoring
├── utils/                    # Model loading, API calls, I/O
├── datas/
│   ├── task_datas/           # Evaluation data (by party_num and topic)
│   ├── topic_datas/          # Raw topic data
│   ├── task_infos.py         # Topic list and party metadata
│   └── building_tasks.py     # Regenerate task_datas from topic_datas
├── scripts/
│   └── run_all_tasks.sh      # Batch run over full config grid
└── results/                  # Outputs (created at runtime)
    ├── policon_logs/         # Per-run detailed logs
    └── policon_rsts/         # Aggregated scores per (model, topic, setting)
```
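As the comment on `datas/building_tasks.py` indicates, the evaluation files under `datas/task_datas/` are derived from the raw files in `datas/topic_datas/`. If you need to regenerate them, the script can likely be run directly from the repository root; the exact invocation below is an assumption, so check the script for any required arguments:

```bash
# Regenerate task_datas from topic_datas (assumed invocation; the script
# may accept or require additional arguments).
python datas/building_tasks.py
```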
All evaluation parameters are defined in config.py and can be passed as CLI arguments to task_runner.py.
| Argument | Type | Default | Description |
|---|---|---|---|
| `--env_name` | str | `PoliCon` | Environment name |
| `--task_setting` | str | `seat_apportionment` | Task type: `seat_apportionment`, `rawlsianism`, `utilitarianism` |
| `--voting_threshold_setting` | str | `simple_majority` | Voting rule: `simple_majority`, `2_3_majority`, `veto_power` |
| `--party_num` | int | 2 | Number of parties: 2, 4, 6 |
| `--eval_topic` | str | `gender equality` | Topic name (see `datas/task_infos.py`) |
| `--model_name_or_path` | str | (see config) | Model path (local) or API model name |
| `--device` | str | `cuda:0` | GPU device |
| `--multi_gpu` | flag | False | Use multiple GPUs for local models |
| `--max_new_tokens` | int | 512 | Max generation length |
| `--temperature` | float | 0.7 | Sampling temperature |
| `--top_p` | float | 0.95 | Nucleus sampling threshold |
| `--seed` | int | 42 | Random seed |
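For example, a single evaluation run using the arguments above might look like the following (a minimal sketch with illustrative values; `--model_name_or_path` can be either a local checkpoint path or an API model name):

```bash
python runner/task_runner.py \
    --task_setting rawlsianism \
    --voting_threshold_setting 2_3_majority \
    --party_num 4 \
    --eval_topic "gender equality" \
    --model_name_or_path gpt-4o \
    --max_new_tokens 512 \
    --temperature 0.7 \
    --top_p 0.95 \
    --seed 42
```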
The benchmark provides two types of data, `task_datas` and `topic_datas`, placed under the `datas` folder with the following structure:
```
datas
├── building_tasks.py
├── task_datas
│   ├── 2
│   │   ├── agriculture.json
│   │   ├── budget.json
│   │   ├── budgetary control.json
│   │   ├── civil liberties, justice & home affairs.json
│   │   ├── constitutional and inter-institutional affairs.json
│   │   ├── culture & education.json
│   │   ├── development.json
│   │   ├── economic & monetary affairs.json
│   │   ├── employment & social affairs.json
│   │   ├── environment & public health.json
│   │   ├── fisheries.json
│   │   ├── foreign & security policy.json
│   │   ├── gender equality.json
│   │   ├── industry, research & energy.json
│   │   ├── internal market & consumer protection.json
│   │   ├── international trade.json
│   │   ├── legal affairs.json
│   │   ├── regional development.json
│   │   └── transport & tourism.json
│   ├── 4
│   │   ├── agriculture.json
│   │   ├── budget.json
│   │   ├── budgetary control.json
│   │   ├── civil liberties, justice & home affairs.json
│   │   ├── constitutional and inter-institutional affairs.json
│   │   ├── culture & education.json
│   │   ├── development.json
│   │   ├── economic & monetary affairs.json
│   │   ├── employment & social affairs.json
│   │   ├── environment & public health.json
│   │   ├── fisheries.json
│   │   ├── foreign & security policy.json
│   │   ├── gender equality.json
│   │   ├── industry, research & energy.json
│   │   ├── internal market & consumer protection.json
│   │   ├── international trade.json
│   │   ├── legal affairs.json
│   │   ├── regional development.json
│   │   └── transport & tourism.json
│   └── 6
│       ├── agriculture.json
│       ├── budget.json
│       ├── budgetary control.json
│       ├── civil liberties, justice & home affairs.json
│       ├── constitutional and inter-institutional affairs.json
│       ├── culture & education.json
│       ├── development.json
│       ├── economic & monetary affairs.json
│       ├── employment & social affairs.json
│       ├── environment & public health.json
│       ├── fisheries.json
│       ├── foreign & security policy.json
│       ├── gender equality.json
│       ├── industry, research & energy.json
│       ├── internal market & consumer protection.json
│       ├── international trade.json
│       ├── legal affairs.json
│       ├── regional development.json
│       └── transport & tourism.json
├── task_infos.py
└── topic_datas
    ├── agriculture.json
    ├── budget.json
    ├── budgetary control.json
    ├── civil liberties, justice & home affairs.json
    ├── constitutional and inter-institutional affairs.json
    ├── culture & education.json
    ├── development.json
    ├── economic & monetary affairs.json
    ├── employment & social affairs.json
    ├── environment & public health.json
    ├── fisheries.json
    ├── foreign & security policy.json
    ├── foreign and security policy.json
    ├── gender equality.json
    ├── industry, research & energy.json
    ├── internal market & consumer protection.json
    ├── international trade.json
    ├── legal affairs.json
    ├── regional development.json
    └── transport & tourism.json
```

`scripts/run_all_tasks.sh` runs the full sweep: every combination of models × topics × party counts × task types × voting rules. It prints a progress bar and skips combinations whose results already exist.
Before running: set `model_paths` (and optionally other arrays) in `run_all_tasks.sh` to match your setup:

- Local models: use absolute paths to your model dirs (e.g. `/.cache/Qwen/Qwen2.5-32B-Instruct`).
- API models: use the API model name (e.g. `gpt-4o`, `gemini-2.5-flash-thinking`, `deepseek-v3.1`).
Default arrays in the script:
- `task_settings`: `utilitarianism`, `rawlsianism`, `seat_apportionment`
- `voting_threshold_settings`: `simple_majority`, `2_3_majority`, `veto_power`
- `party_nums`: `2`, `4`, `6`
- `eval_topics`: all 19 topics from the benchmark
- `model_paths`: edit to your models (see script)
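Conceptually, the sweep iterates over every combination of these arrays. A simplified sketch of what the script does (the actual script also prints a progress bar and skips combinations whose results already exist):

```bash
# Conceptual sketch of the sweep in scripts/run_all_tasks.sh, using the
# array names documented above; not the actual script.
for model in "${model_paths[@]}"; do
  for task in "${task_settings[@]}"; do
    for rule in "${voting_threshold_settings[@]}"; do
      for n in "${party_nums[@]}"; do
        for topic in "${eval_topics[@]}"; do
          python runner/task_runner.py \
            --model_name_or_path "$model" \
            --task_setting "$task" \
            --voting_threshold_setting "$rule" \
            --party_num "$n" \
            --eval_topic "$topic" \
            --multi_gpu
        done
      done
    done
  done
done
```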
Single-GPU users: remove the `--multi_gpu` flag from the `python runner/task_runner.py` call inside the script.
Run the full sweep with:

```bash
bash scripts/run_all_tasks.sh
```

The results show that Gemini-2.5 performs best, achieving the top result on 60% of the tasks. DeepSeek-V3.1 and GPT-4o follow, each attaining top performance on 33% of the tasks. Comparing the other evaluated LLMs, we identify the following trends: (1) thinking models such as Gemini-2.5 and DeepSeek-V3.1 generally outperform non-thinking models such as GPT-4o and Llama-3.3-70B; (2) commercial models typically outperform non-commercial models; (3) based on the four open-source models with known parameter sizes, performance generally correlates positively with model size.
If you find PoliCon useful in your research, please cite our paper:
```bibtex
@inproceedings{zhang2026policon,
    title={PoliCon: Evaluating {LLM}s on Achieving Diverse Political Consensus Objectives},
    author={Zhaowei Zhang and Xiaobo Wang and Minghua Yi and Mengmeng Wang and Fengshuo Bai and Zilong Zheng and Yipeng Kang and Yaodong Yang},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=MHlwNs9k1Y}
}
```