This repository contains the dataset and code for the paper "Benchmarking Query-conditioned Natural Language Inference" (Canby et al., 2025).
In sentence-level NLI, a label ℓ indicates the semantic relationship between a premise sentence s_p and a hypothesis sentence s_h. Document-level NLI conditions ℓ on a premise document d_p and a hypothesis document d_h. Query-conditioned NLI (QC-NLI) additionally conditions each label ℓ_i on a query q_i, which indicates the aspect of the two documents on which the semantic relationship should be judged.
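As a concrete illustration of the query-conditioned setting, the snippet below builds a hypothetical QC-NLI example. The field names (`premise_doc`, `hypothesis_doc`, `query`, `label`) are chosen for exposition only and are not the dataset's actual schema:

```python
# Hypothetical QC-NLI example; field names are illustrative, not the real schema.
example = {
    "premise_doc": "The phone ships with a 5,000 mAh battery. The display is 6.1 inches.",
    "hypothesis_doc": "The device has a large 6.7-inch screen and all-day battery life.",
    "query": "What is the screen size?",
    "label": "contradiction",  # judged only on the aspect the query picks out
}

# A different query over the same document pair can yield a different label.
example_2 = dict(example, query="How long does the battery last?", label="entailment")

print(example["label"], example_2["label"])
```

The point of the two examples is that the document pair is identical; only the query changes, and with it the label.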
## Prerequisites

- Python 3.8+
- Required API keys (OpenAI, Google AI)
## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/amazon-science/Query-Conditioned-NLI.git
   cd Query-Conditioned-NLI
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up API keys:

   ```bash
   export OPENAI_API_KEY="your-openai-key"
   export GOOGLE_API_KEY="your-google-key"
   ```

## Dataset

The QC-NLI dataset is located in the `data/` folder and includes adaptations from four existing datasets:
| Dataset | Task | Size | Label Set |
|---|---|---|---|
| SNLI (Bowman et al., 2015) | Image descriptions | 4,452 | entailment, not_entailment |
| RobustQA (Han et al., 2023) | Inconsistent document detection | 2,578 | contradiction, not_contradiction |
| RAGTruth (Niu et al., 2024) | Hallucination detection | 829 | entailment, not_entailment |
| FactScore (Min et al., 2023) | Fact verification | 13,796 | entailment, not_entailment |
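The per-source counts in the table can be tallied directly; this snippet simply restates the figures above:

```python
# Per-source example counts, taken from the dataset table above.
QC_NLI_SIZES = {
    "snli": 4452,
    "robustqa": 2578,
    "ragtruth": 829,
    "factscore": 13796,
}

total = sum(QC_NLI_SIZES.values())
print(f"QC-NLI total examples: {total:,}")  # 21,655 across the four sources
```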
## Performing the task

Use `src/perform_task.py` to evaluate models on QC-NLI data:

```bash
python src/perform_task.py \
    --dataset robustqa \
    --prompt-type zero \
    --do-merge True \
    --use-query True \
    --start-num 0 \
    --model gpro
```

Parameters:

- `--dataset`: Dataset to use. Options: `snli`, `ragtruth`, `robustqa`, `factscore_chatgpt`, `factscore_instructgpt`, `factscore_perplexityai`
- `--prompt-type`: Prompting strategy. Options:
  - `zero`: Zero-shot prompting
  - `few`: Few-shot prompting
  - `qanli`: QA+NLI (question answering followed by NLI)
- `--do-merge`: Merge `neutral` and `contradiction` into `not_entailment` (set to `True` for the experiments in the paper)
- `--use-query`: Include the query during inference (`True`/`False`)
- `--start-num`: Starting index in the dataset (typically `0`)
- `--model`: Model to use. Options:
  - `gpt`: GPT-4o
  - `gpt3`: GPT-3.5-turbo-0125
  - `gpt4`: GPT-4-0613
  - `gflash`: Gemini 1.5 Flash
  - `gpro`: Gemini 1.5 Pro
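To sweep several configurations, the invocation above can be assembled programmatically. A minimal sketch: the flag names follow the parameter list above, but the helper itself is not part of the repository, and the `subprocess` call is left commented out so the sketch runs without API keys:

```python
import subprocess  # used only if you uncomment the run() call below

MODELS = ["gpt", "gpt3", "gpt4", "gflash", "gpro"]

def build_command(dataset, model, prompt_type="zero", use_query=True, start_num=0):
    """Assemble a perform_task.py invocation from the documented flags."""
    return [
        "python", "src/perform_task.py",
        "--dataset", dataset,
        "--prompt-type", prompt_type,
        "--do-merge", "True",          # paper experiments merge labels
        "--use-query", str(use_query),
        "--start-num", str(start_num),
        "--model", model,
    ]

cmd = build_command("robustqa", "gpro")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually launch the run
```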
## Generating QC-NLI data

Use `src/perform_generations.py` to convert existing datasets into QC-NLI format:

```bash
python src/perform_generations.py \
    --dataset snli \
    --partition train \
    --start-num 0 \
    --model gpt
```

Parameters:

- `--dataset`: Source dataset. Options: `snli`, `ragtruth`, `robustqa`, `factscore`
- `--partition`: Data partition to convert (valid partitions depend on the dataset):
  - SNLI: `train`, `val`, `test`
  - RobustQA: `all`
  - RAGTruth: `train`, `test`
  - FactScore: `chatgpt`, `instructgpt`, `perplexityai`
- `--start-num`: Starting index in the dataset (typically `0`)
- `--model`: Model for generation (same options as above)
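Because valid partitions vary by dataset, a small lookup table can catch typos before a generation run is launched. This helper is a sketch based solely on the options listed above; it is not provided by the repository:

```python
# Valid --partition values per source dataset, as documented above.
VALID_PARTITIONS = {
    "snli": {"train", "val", "test"},
    "robustqa": {"all"},
    "ragtruth": {"train", "test"},
    "factscore": {"chatgpt", "instructgpt", "perplexityai"},
}

def check_partition(dataset, partition):
    """Raise early if the partition is not valid for the chosen dataset."""
    valid = VALID_PARTITIONS.get(dataset)
    if valid is None:
        raise ValueError(f"Unknown dataset: {dataset!r}")
    if partition not in valid:
        raise ValueError(
            f"{partition!r} is not valid for {dataset}; choose from {sorted(valid)}"
        )
    return True

print(check_partition("snli", "val"))  # True
```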
## Adapting a new dataset

To adapt a new dataset to QC-NLI format:

1. Create a class extending `ExampleGenerator` in `src/generator.py`.
2. Implement the required methods:
   - `read_data(self)`: Load your dataset
   - `generate(self, idx)`: Convert the `idx`-th data example to QC-NLI format

Example structure:

```python
class YourDatasetGenerator(ExampleGenerator):
    def __init__(self, **kwargs):
        self.dname = 'your-dataset-name'
        super().__init__(**kwargs)

    def read_data(self):
        # Load your dataset
        pass

    def generate(self, idx):
        # Convert the idx-th example to QC-NLI format
        pass
```

## Citation

Coming soon!
## License

This library is licensed under the CC-BY-4.0 License.

## Contributing

See CONTRIBUTING for more information. For questions or issues, please contact marc.canby@gmail.com or open an issue on GitHub.