Skip to content

Simplify: fix correctness bugs, dead code, and efficiency issues#14

Open
divij-sinha wants to merge 2 commits intomainfrom
simplify/code-cleanup
Open

Simplify: fix correctness bugs, dead code, and efficiency issues#14
divij-sinha wants to merge 2 commits intomainfrom
simplify/code-cleanup

Conversation

@divij-sinha
Copy link
Copy Markdown
Collaborator

@divij-sinha divij-sinha commented Mar 24, 2026

Summary

  • Correctness bug (main.py): sorted() was called on the concatenation of two schema name strings (sorting individual characters), causing incorrect deduplication of cross-schema link pairs — fixed to sorted([a, b])
  • Silent bug (utils.py): typo "probablistic" meant the interactive config builder always set the wrong key, so probabilistic matching was never enabled via interactive setup
  • Efficiency (cleaning_functions.py): SearchEngine() (uszipcode DB connection) was instantiated on every cache miss; moved to a module-level singleton
  • Efficiency (tfidf_utils.py): replaced Python for loop in get_matches_df with numpy vectorized indexing; removed dead top parameter
  • Simplification (main.py): extracted _load_bad_list() helper to replace two identical try/except CSV loading blocks; built schemas_by_name dict to replace repeated O(n) list-comprehension lookups; removed redundant else branches
  • Quality (utils.py): replaced bare input() calls with Prompt.ask for consistency with the Rich abstraction used throughout
  • Dead code removed: unreachable return, df = df no-op, predict_org if/else where both branches returned 0, and a redundant schema lookup
  • Minor fixes: None-check order in clean_address, merged two sequential record.items() loops

Test plan

  • uv run pytest — all 32 tests pass
  • Verify interactive config builder now correctly sets probabilistic: True when user answers yes
  • Verify cross-schema link deduplication works correctly with multi-schema configs

🤖 Generated with Claude Code

divij-sinha and others added 2 commits March 24, 2026 11:58
- Fix typo "probablistic" -> "probabilistic" in create_config (was silently
  disabling probabilistic matching for all interactive config users)
- Fix sorted() bug: was sorting characters of concatenated schema names
  instead of sorting the two names as a list
- Add SearchEngine module-level singleton to avoid re-instantiating the
  uszipcode DB connection on every cache miss
- Vectorize get_matches_df with numpy fancy indexing; remove dead `top` param
- Extract _load_bad_list() helper to replace two identical try/except blocks
- Build schemas_by_name dict to replace O(n) list-comprehension lookups
- Replace bare input() calls with Prompt.ask for consistency
- Fix None-check order in clean_address state validation
- Merge two sequential record.items() loops into one in clean_address
- Remove dead code: unreachable return, no-op df=df, redundant else branches,
  predict_org if/else where both branches returned 0
- Update test fixture to reflect corrected probabilistic key spelling

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant