PoliCon: Evaluating LLMs on Achieving Diverse Political Consensus Objectives

License: MIT

Zhaowei Zhang¹, Xiaobo Wang²,⁵, Minghua Yi³, Mengmeng Wang⁵,
Fengshuo Bai⁴,⁶, Zilong Zheng⁵, Yipeng Kang⁵, Yaodong Yang¹

¹Institute for Artificial Intelligence, Peking University  ²USTC  ³WHU  ⁴SJTU  ⁵BIGAI  ⁶Zhongguancun Academy

[Figure: PoliCon Overview]

This repository is the official implementation of PoliCon, a benchmark for evaluating large language models on political consensus tasks under different objectives (seat apportionment, Rawlsian fairness, utilitarianism) and voting rules (simple majority, 2/3 majority, veto power).


🔥 News

  • 2026.1.26 PoliCon has been accepted to ICLR 2026!


📋 Overview

PoliCon evaluates LLMs in a simulated parliament setting: given multi-party stances on policy topics, the model must draft resolutions that satisfy specified consensus objectives and voting thresholds. The benchmark supports:

  • Task types: seat_apportionment, rawlsianism, utilitarianism
  • Voting rules: simple_majority, 2_3_majority, veto_power
  • Party counts: 2, 4, or 6 political groups
  • Topics: 19 PoliCon policy areas (see Datasets)

Both local models (e.g., Qwen, Llama) and API-based models (e.g., GPT-4o, Gemini) are supported.
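To make the objectives concrete, the sketch below aggregates per-party satisfaction scores using the textbook readings of each objective: utilitarianism averages the scores, Rawlsian fairness takes the score of the worst-off party, and seat apportionment is read here as a seat-weighted average. This is an assumption for illustration only: the benchmark's actual scoring lives in evals/ and may differ, and the function names, party labels, scores, and seat counts are all hypothetical.

```python
# Illustrative aggregation of per-party scores under three objectives.
# NOTE: these are textbook definitions, not the repository's scoring code.

def utilitarian(scores):
    """Average satisfaction across all parties."""
    return sum(scores.values()) / len(scores)


def rawlsian(scores):
    """Satisfaction of the worst-off party (maximin criterion)."""
    return min(scores.values())


def seat_weighted(scores, seats):
    """Average satisfaction weighted by each party's seat count (assumed reading)."""
    total_seats = sum(seats.values())
    return sum(scores[party] * seats[party] for party in scores) / total_seats


# Hypothetical four-party parliament; all numbers are made up for the example.
scores = {"A": 8.0, "B": 6.0, "C": 4.0, "D": 2.0}
seats = {"A": 40, "B": 30, "C": 20, "D": 10}

print(utilitarian(scores), rawlsian(scores), seat_weighted(scores, seats))
```

The same resolution can rank differently under each objective: a draft that pleases the largest party raises the seat-weighted value while leaving the Rawlsian value unchanged.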


📦 Installation

  1. Clone the repository

    git clone https://github.com/bigai-nlco/PoliCon.git
    cd PoliCon
  2. Create and activate a conda environment

    conda create -y --name policon python=3.10
    conda activate policon
    export PYTHONPATH=$(pwd)
  3. Install dependencies

    pip install -r requirements.txt
  4. API keys (for OpenAI-compatible and API-based models)

    Edit openai_keys.py and set your keys and base URL:

    OPENAI_API_KEY = "<your-api-key>"
    OPENAI_BASE_URL = "<your-api-base-url>"  # e.g. OpenAI or compatible endpoint
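If you prefer not to store secrets in a tracked file, one option is to have openai_keys.py read the values from environment variables. This is a minimal sketch, assuming the rest of the code only imports the two names OPENAI_API_KEY and OPENAI_BASE_URL from this module; the helper load_keys and the environment variable names are illustrative and not defined by the repository.

```python
# openai_keys.py (hypothetical variant that reads from the environment)
import os


def load_keys(env=os.environ):
    """Return (api_key, base_url), falling back to the placeholder strings
    so that a missing variable is easy to spot."""
    api_key = env.get("OPENAI_API_KEY", "<your-api-key>")
    base_url = env.get("OPENAI_BASE_URL", "<your-api-base-url>")
    return api_key, base_url


# Module-level names, matching the two variables the README asks you to set.
OPENAI_API_KEY, OPENAI_BASE_URL = load_keys()
```

With this variant you would export both variables in your shell before launching an evaluation.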

📁 Project Structure

PoliCon/
├── config.py              # Argument definitions and defaults
├── openai_keys.py         # API keys (user-configured)
├── runner/
│   ├── task_runner.py     # Main evaluation entry point
│   └── task_prompts.py    # System and task prompts
├── evals/                 # Voting logic and GPT-based scoring
├── utils/                 # Model loading, API calls, I/O
├── datas/
│   ├── task_datas/        # Evaluation data (by party_num and topic)
│   ├── topic_datas/       # Raw topic data
│   ├── task_infos.py      # Topic list and party metadata
│   └── building_tasks.py  # Regenerate task_datas from topic_datas
├── scripts/
│   └── run_all_tasks.sh   # Batch run over full config grid
└── results/               # Outputs (created at runtime)
    ├── policon_logs/      # Per-run detailed logs
    └── policon_rsts/      # Aggregated scores per (model, topic, setting)

⚙️ Configuration

All evaluation parameters are defined in config.py and can be passed as CLI arguments to task_runner.py.

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --env_name | str | PoliCon | Environment name |
| --task_setting | str | seat_apportionment | Task type: seat_apportionment, rawlsianism, utilitarianism |
| --voting_threshold_setting | str | simple_majority | Voting rule: simple_majority, 2_3_majority, veto_power |
| --party_num | int | 2 | Number of parties: 2, 4, 6 |
| --eval_topic | str | gender equality | Topic name (see datas/task_infos.py) |
| --model_name_or_path | str | (see config) | Model path (local) or API model name |
| --device | str | cuda:0 | GPU device |
| --multi_gpu | flag | False | Use multiple GPUs for local models |
| --max_new_tokens | int | 512 | Max generation length |
| --temperature | float | 0.7 | Sampling temperature |
| --top_p | float | 0.95 | Nucleus sampling threshold |
| --seed | int | 42 | Random seed |
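For orientation, the arguments above could be declared with argparse roughly as follows. This is a sketch reconstructed from the table, not a copy of config.py: the build_parser name, the use of choices to restrict values, and the None default for --model_name_or_path (the table only says "see config") are assumptions.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the argument table; config.py may differ."""
    parser = argparse.ArgumentParser(description="PoliCon evaluation (illustrative)")
    parser.add_argument("--env_name", type=str, default="PoliCon")
    parser.add_argument("--task_setting", type=str, default="seat_apportionment",
                        choices=["seat_apportionment", "rawlsianism", "utilitarianism"])
    parser.add_argument("--voting_threshold_setting", type=str, default="simple_majority",
                        choices=["simple_majority", "2_3_majority", "veto_power"])
    parser.add_argument("--party_num", type=int, default=2, choices=[2, 4, 6])
    parser.add_argument("--eval_topic", type=str, default="gender equality")
    # Placeholder default: the real default is defined in config.py.
    parser.add_argument("--model_name_or_path", type=str, default=None)
    parser.add_argument("--device", type=str, default="cuda:0")
    parser.add_argument("--multi_gpu", action="store_true")
    parser.add_argument("--max_new_tokens", type=int, default=512)
    parser.add_argument("--temperature", type=float, default=0.7)
    parser.add_argument("--top_p", type=float, default=0.95)
    parser.add_argument("--seed", type=int, default=42)
    return parser


# Override two arguments and keep the documented defaults for the rest.
args = build_parser().parse_args(["--party_num", "4", "--eval_topic", "fisheries"])
print(args.task_setting, args.party_num, args.eval_topic)
```

Because --multi_gpu is a flag, it is False unless it appears on the command line, which matches the table's default.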

📊 Datasets

The benchmark provides two types of datasets, task_datas and topic_datas, both stored in the datas folder with the following structure:

datas
├── building_tasks.py
├── task_datas
│   ├── 2
│   │   ├── agriculture.json
│   │   ├── budget.json
│   │   ├── budgetary control.json
│   │   ├── civil liberties, justice & home affairs.json
│   │   ├── constitutional and inter-institutional affairs.json
│   │   ├── culture & education.json
│   │   ├── development.json
│   │   ├── economic & monetary affairs.json
│   │   ├── employment & social affairs.json
│   │   ├── environment & public health.json
│   │   ├── fisheries.json
│   │   ├── foreign & security policy.json
│   │   ├── gender equality.json
│   │   ├── industry, research & energy.json
│   │   ├── internal market & consumer protection.json
│   │   ├── international trade.json
│   │   ├── legal affairs.json
│   │   ├── regional development.json
│   │   └── transport & tourism.json
│   ├── 4
│   │   ├── agriculture.json
│   │   ├── budget.json
│   │   ├── budgetary control.json
│   │   ├── civil liberties, justice & home affairs.json
│   │   ├── constitutional and inter-institutional affairs.json
│   │   ├── culture & education.json
│   │   ├── development.json
│   │   ├── economic & monetary affairs.json
│   │   ├── employment & social affairs.json
│   │   ├── environment & public health.json
│   │   ├── fisheries.json
│   │   ├── foreign & security policy.json
│   │   ├── gender equality.json
│   │   ├── industry, research & energy.json
│   │   ├── internal market & consumer protection.json
│   │   ├── international trade.json
│   │   ├── legal affairs.json
│   │   ├── regional development.json
│   │   └── transport & tourism.json
│   └── 6
│       ├── agriculture.json
│       ├── budget.json
│       ├── budgetary control.json
│       ├── civil liberties, justice & home affairs.json
│       ├── constitutional and inter-institutional affairs.json
│       ├── culture & education.json
│       ├── development.json
│       ├── economic & monetary affairs.json
│       ├── employment & social affairs.json
│       ├── environment & public health.json
│       ├── fisheries.json
│       ├── foreign & security policy.json
│       ├── gender equality.json
│       ├── industry, research & energy.json
│       ├── internal market & consumer protection.json
│       ├── international trade.json
│       ├── legal affairs.json
│       ├── regional development.json
│       └── transport & tourism.json
├── task_infos.py
└── topic_datas
    ├── agriculture.json
    ├── budget.json
    ├── budgetary control.json
    ├── civil liberties, justice & home affairs.json
    ├── constitutional and inter-institutional affairs.json
    ├── culture & education.json
    ├── development.json
    ├── economic & monetary affairs.json
    ├── employment & social affairs.json
    ├── environment & public health.json
    ├── fisheries.json
    ├── foreign & security policy.json
    ├── foreign and security policy.json
    ├── gender equality.json
    ├── industry, research & energy.json
    ├── internal market & consumer protection.json
    ├── international trade.json
    ├── legal affairs.json
    ├── regional development.json
    └── transport & tourism.json

🚀 Running Evaluations

scripts/run_all_tasks.sh runs the full sweep: every combination of models × topics × party counts × task types × voting rules. It prints a progress bar and skips combinations whose results already exist.

Before running: set model_paths (and optionally other arrays) in run_all_tasks.sh to match your setup:

  • Local models: use absolute paths to your model dirs (e.g. /.cache/Qwen/Qwen2.5-32B-Instruct).
  • API models: use the API model name (e.g. gpt-4o, gemini-2.5-flash-thinking, deepseek-v3.1).

Default arrays in the script:

  • task_settings: utilitarianism, rawlsianism, seat_apportionment
  • voting_threshold_settings: simple_majority, 2_3_majority, veto_power
  • party_nums: 2, 4, 6
  • eval_topics: all 19 topics from the benchmark
  • model_paths: edit to your models (see script)
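The size of the sweep follows directly from these arrays. The sketch below enumerates the grid in Python; the array values come from the lists above and the topic names from the dataset file names, while the sweep function and the iteration order are assumptions (the shell script may nest its loops differently).

```python
from itertools import product

TASK_SETTINGS = ["utilitarianism", "rawlsianism", "seat_apportionment"]
VOTING_THRESHOLD_SETTINGS = ["simple_majority", "2_3_majority", "veto_power"]
PARTY_NUMS = [2, 4, 6]
EVAL_TOPICS = [
    "agriculture", "budget", "budgetary control",
    "civil liberties, justice & home affairs",
    "constitutional and inter-institutional affairs",
    "culture & education", "development", "economic & monetary affairs",
    "employment & social affairs", "environment & public health",
    "fisheries", "foreign & security policy", "gender equality",
    "industry, research & energy", "internal market & consumer protection",
    "international trade", "legal affairs", "regional development",
    "transport & tourism",
]


def sweep(model_paths):
    """Return every (model, topic, party_num, task, voting_rule) combination."""
    return list(product(model_paths, EVAL_TOPICS, PARTY_NUMS,
                        TASK_SETTINGS, VOTING_THRESHOLD_SETTINGS))


runs = sweep(["gpt-4o"])
print(len(runs))  # 19 topics * 3 party counts * 3 tasks * 3 rules = 513 per model
```

Each additional entry in model_paths adds another 513 runs, which is why skipping completed combinations matters for long sweeps.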

Single-GPU users: remove the --multi_gpu flag from the python runner/task_runner.py call inside the script.

bash scripts/run_all_tasks.sh
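To run a single configuration instead of the full sweep, the runner can be called directly with the arguments from the Configuration section. The sketch below only assembles the command line; build_command is a hypothetical helper, and it assumes the command is launched from the repository root with PYTHONPATH set as in the installation steps.

```python
import shlex


def build_command(model, topic, party_num, task_setting, voting_rule, multi_gpu=False):
    """Assemble one task_runner.py invocation as an argument list."""
    cmd = [
        "python", "runner/task_runner.py",
        "--model_name_or_path", model,
        "--eval_topic", topic,
        "--party_num", str(party_num),
        "--task_setting", task_setting,
        "--voting_threshold_setting", voting_rule,
    ]
    if multi_gpu:
        # Only add the flag when several GPUs should be used.
        cmd.append("--multi_gpu")
    return cmd


cmd = build_command("gpt-4o", "gender equality", 2, "rawlsianism", "veto_power")
# shlex.join quotes the topic, which contains a space.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would start the evaluation; keeping the topic as a single list element avoids shell-quoting problems with names that contain spaces or ampersands.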

📈 Results

[Figure: PoliCon Overview]

The results show that Gemini-2.5 performs best, achieving the top result on 60% of the tasks. Deepseek-V3.1 and GPT-4o follow, each attaining top performance on 33% of the tasks. Comparing the other evaluated LLMs, we identify the following trends: (1) Thinking models such as Gemini-2.5 and Deepseek-V3.1 generally outperform non-thinking models such as GPT-4o and Llama-3.3-70B. (2) Commercial models typically outperform non-commercial models. (3) Among the four open-source models with known parameter counts, performance is generally positively correlated with model size.


📄 Citation

If you find PoliCon useful in your research, please cite our paper:

@inproceedings{zhang2026policon,
    title={PoliCon: Evaluating {LLM}s on Achieving Diverse Political Consensus Objectives},
    author={Zhaowei Zhang and Xiaobo Wang and Minghua Yi and Mengmeng Wang and Fengshuo Bai and Zilong Zheng and Yipeng Kang and Yaodong Yang},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=MHlwNs9k1Y}
}
