Conversation
gandersteele
left a comment
Looks good, but we need coverage of synthesis mode. See the comments above. Once the env vars are added to secrets, we'll merge.
    def check_dataset_str(original_text: str, dataset_str: str):
        # Extract all redacted portions using regex pattern for [ENTITY_TYPE_*]
        redaction_pattern = r"\[([A-Z_]+)(?:_[a-zA-Z0-9]+)?\]"
        redactions = re.findall(redaction_pattern, dataset_str)

        # Replace all redactions with empty string to get the non-redacted text
        non_redacted_text = re.sub(redaction_pattern, "", dataset_str)

        # Check if the non-redacted portions exist in the original text
        for segment in non_redacted_text.split():
            if segment.strip():  # Skip empty segments
                assert segment in original_text, (
                    f"Non-redacted segment '{segment}' not found in original text"
                )

        # Ensure we found at least one redaction
        assert len(redactions) > 0, "No redactions found in the dataset string"
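For illustration, this is roughly how the pattern in the helper above splits a redacted string into entity labels and leftover text. The sample sentence and the function name `split_redactions` are made up for this sketch and are not part of the PR.

```python
import re

# Same pattern as in check_dataset_str: an upper-case label with an
# optional "_<suffix>" counter, wrapped in square brackets.
REDACTION_PATTERN = r"\[([A-Z_]+)(?:_[a-zA-Z0-9]+)?\]"


def split_redactions(dataset_str: str):
    """Return (entity labels, remaining whitespace-separated segments)."""
    # findall returns only the captured label because the pattern has one group.
    labels = re.findall(REDACTION_PATTERN, dataset_str)
    # Removing the placeholders leaves the text that should still match the original.
    remaining = re.sub(REDACTION_PATTERN, "", dataset_str).split()
    return labels, remaining


# Hypothetical sample input, not taken from the test suite.
labels, remaining = split_redactions("[NAME_GIVEN_1] lives in [LOCATION_CITY_2].")
print(labels)
print(remaining)
```

Note that `str.split()` with no arguments already drops empty segments, so the `segment.strip()` guard in the helper is a harmless safety net rather than a required step.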
This is good, but note that it doesn't apply in synthesis mode. I'd suggest a similar method that:
1. asserts len(spans) > 0
2. asserts that original_text[span['start']:span['end']] == span['text']
3. asserts that dataset_str[span['new_start']:span['new_end']] == span['new_text']
This is a slightly different test than yours, so it can be done in addition to it, but the main point is that it exercises synthesis mode as well. We could add additional checks that, in synthesis mode, the replacement text doesn't contain the standard redaction pattern.
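A minimal sketch of the span-based check suggested above might look like the following. It assumes each span is a dict with exactly the keys named in the comment (`start`, `end`, `text`, `new_start`, `new_end`, `new_text`); the function name `check_spans`, the `synthesis` flag, and the sample data are hypothetical and not taken from the PR.

```python
import re

# Pattern copied from check_dataset_str; used here only to detect
# placeholders that should not appear in synthesized replacements.
REDACTION_PATTERN = r"\[([A-Z_]+)(?:_[a-zA-Z0-9]+)?\]"


def check_spans(original_text: str, dataset_str: str, spans: list, synthesis: bool = False):
    # 1. At least one span must have been detected.
    assert len(spans) > 0, "No spans found"

    for span in spans:
        # 2. The original offsets must point at the original entity text.
        assert original_text[span["start"]:span["end"]] == span["text"], (
            f"Original offsets do not match span text '{span['text']}'"
        )
        # 3. The new offsets must point at the replacement text.
        assert dataset_str[span["new_start"]:span["new_end"]] == span["new_text"], (
            f"New offsets do not match replacement text '{span['new_text']}'"
        )
        # Optional extra check: synthesized text should not look like a placeholder.
        if synthesis:
            assert re.search(REDACTION_PATTERN, span["new_text"]) is None, (
                f"Replacement '{span['new_text']}' looks like a redaction placeholder"
            )


# Hypothetical synthesis-mode example: "John" was replaced with "Mark".
check_spans(
    "Call John today",
    "Call Mark today",
    [{"start": 5, "end": 9, "text": "John",
      "new_start": 5, "new_end": 9, "new_text": "Mark"}],
    synthesis=True,
)
```

Because the check relies only on offsets, the same function would cover both redaction and synthesis output; the `synthesis` flag only adds the placeholder check.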
This check is only used for tests that don't test with synthesis anyhow. The other tests have their own checks for this stuff.
…n for improved flexibility in dataset validation
No description provided.