
Add tests #26

Merged
ethan-tonic merged 20 commits into main from add-tests
Apr 7, 2025
Conversation

@ethan-tonic
Contributor

No description provided.

Contributor

@gandersteele gandersteele left a comment

Looks good, but we need coverage of synthesis mode. See above comments. Once the env vars are added to secrets, we'll merge.

Comment thread tests/sample.env
Comment thread tests/utils/dataset_utils.py Outdated
Comment on lines +15 to +31
def check_dataset_str(original_text: str, dataset_str: str):
    # Extract all redacted portions using regex pattern for [ENTITY_TYPE_*]
    redaction_pattern = r"\[([A-Z_]+)(?:_[a-zA-Z0-9]+)?\]"
    redactions = re.findall(redaction_pattern, dataset_str)

    # Replace all redactions with empty string to get the non-redacted text
    non_redacted_text = re.sub(redaction_pattern, "", dataset_str)

    # Check if the non-redacted portions exist in the original text
    for segment in non_redacted_text.split():
        if segment.strip():  # Skip empty segments
            assert segment in original_text, (
                f"Non-redacted segment '{segment}' not found in original text"
            )

    # Ensure we found at least one redaction
    assert len(redactions) > 0, "No redactions found in the dataset string"
Contributor

This is good, but note that it doesn't apply in synthesis mode. I'd suggest a similar method that:
1. asserts len(spans) > 0
2. asserts that original_text[span['start']:span['end']] == span['text']
3. asserts that dataset_str[span['new_start']:span['new_end']] == span['new_text']
This is a slightly different test than yours, so it can be done in addition to it, but the main point is that it exercises synthesis mode as well. We could add additional checks that, in synthesis mode, the replacement text doesn't contain the standard redaction pattern.
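
For reference, the suggestion above could be sketched roughly as follows. This is only an illustration, not code from the PR: the function name `check_spans` and the `synthesis` flag are hypothetical, and it assumes each span is a dict with the keys named in the comment (`start`, `end`, `text`, `new_start`, `new_end`, `new_text`). The sample strings and offsets are made up for the example.

```python
import re

# Same pattern as check_dataset_str in the diff under review.
REDACTION_PATTERN = r"\[([A-Z_]+)(?:_[a-zA-Z0-9]+)?\]"


def check_spans(original_text, dataset_str, spans, synthesis=False):
    """Hypothetical helper: validate span offsets against both texts."""
    # 1. At least one entity must have been detected.
    assert len(spans) > 0, "No spans found"

    for span in spans:
        # 2. The original offsets must point at the detected text.
        assert original_text[span["start"]:span["end"]] == span["text"]
        # 3. The new offsets must point at the replacement text.
        assert dataset_str[span["new_start"]:span["new_end"]] == span["new_text"]
        # Optional extra check: a synthesized replacement should not
        # look like a standard redaction token.
        if synthesis:
            assert not re.search(REDACTION_PATTERN, span["new_text"]), (
                f"Replacement '{span['new_text']}' looks like a redaction token"
            )


# Illustrative data only; not taken from a real run.
original = "My name is John."
synthesized = "My name is Alan."
spans = [
    {"start": 11, "end": 15, "text": "John",
     "new_start": 11, "new_end": 15, "new_text": "Alan"},
]
check_spans(original, synthesized, spans, synthesis=True)
```

Because the assertions index into both strings by offset, the same helper would work for redacted and synthesized output; only the last check depends on the mode.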

Contributor Author

This check is only used for tests that don't test with synthesis anyhow. The other tests have their own checks for this stuff.

@ethan-tonic ethan-tonic requested a review from gandersteele April 5, 2025 03:30
Contributor

@gandersteele gandersteele left a comment

LGTM

@ethan-tonic ethan-tonic merged commit 4cf99cc into main Apr 7, 2025
2 checks passed