This is the GitHub repo for the paper: [Identifying Where Large Language Models Struggle in Answering Complex Questions]
- `python3 eval_decompose_automatic.py`
- `python3 eval_decompose_human.py`
- `python3 eval_sub_problem_get_scores.py`
- `python3 eval_full_get_scores.py`
- `python3 run_s1_decompose.py`
- `python3 run_s2_ans_sub.py`
- `python3 eval_sub_problem_llm.py`
- `python3 run_full_qa.py`
- `python3 eval_full_qa_llm.py`