GitHub - Alab-NII/complex_ques_decomposition

This is the GitHub repository for the paper "Identifying Where Large Language Models Struggle in Answering Complex Questions".

Reproduction of Results

download data
download outputs

Table 1: Automatic and human scores (green) in the decomposition stage

python3 eval_decompose_automatic.py

python3 eval_decompose_human.py

Table 2: LLM-as-a-Judge accuracy (based on Llama 3.3 70B) in the sub-problem-solving stage

python3 eval_sub_problem_get_scores.py

Table 3: LLM-as-a-Judge accuracy (Llama 3.3 70B) for full-QA performance using zero-shot-CoT

python3 eval_full_get_scores.py

Running process

Stage 1: Decomposition

python3 run_s1_decompose.py

Stage 2: Subproblem Solving

python3 run_s2_ans_sub.py

Stage 2: Evaluation

python3 eval_sub_problem_llm.py

Full-QA

python3 run_full_qa.py

Full-QA: Evaluation

python3 eval_full_qa_llm.py
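
The steps above can also be chained from a single driver. The following is only a minimal sketch, not part of the repository: it assumes the scripts are invoked from the directory that contains them and need no command-line arguments (the commands above show none), and that the order listed in this section is the intended run order. The names `PIPELINE`, `build_commands`, `run_pipeline`, and the `--run` flag are hypothetical and introduced here for illustration.

```python
import subprocess
import sys

# Script names are copied from this README; the order mirrors the
# "Running process" section. Assumption: each script runs without arguments.
PIPELINE = [
    ("Stage 1: Decomposition", "run_s1_decompose.py"),
    ("Stage 2: Subproblem Solving", "run_s2_ans_sub.py"),
    ("Stage 2: Evaluation", "eval_sub_problem_llm.py"),
    ("Full-QA", "run_full_qa.py"),
    ("Full-QA: Evaluation", "eval_full_qa_llm.py"),
]


def build_commands(python=sys.executable):
    """Return one argv list per step, in the order listed above."""
    return [[python, script] for _, script in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for (label, _), cmd in zip(PIPELINE, build_commands()):
        print(f"[{label}] {' '.join(cmd)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical flag: pass --run to execute instead of only printing.
    run_pipeline(dry_run="--run" not in sys.argv)
```

By default the driver only prints the commands, which makes it easy to check the order before starting any model calls.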
