Skip to content

Fix thread safety, resource leaks, and error handling across core and external modules #1928

Closed
dpol1 wants to merge 1 commit into
mainfrom
thread-safety-resource-leaks
Closed

Fix thread safety, resource leaks, and error handling across core and external modules #1928
dpol1 wants to merge 1 commit into
mainfrom
thread-safety-resource-leaks

Conversation

@dpol1
Copy link
Copy Markdown
Member

@dpol1 dpol1 commented Jun 4, 2026

Summary

This PR addresses a set of code quality, correctness, and robustness issues found across the core and external modules.

Thread Safety

  • Replace non-thread-safe SimpleDateFormat with DateTimeFormatter in FileResponse and DateUtils.parseDate() in CookieConverter
  • Replace shared mutable Matcher in RefreshTag with a Pattern constant and per-call Matcher instances

Resource Leaks

  • Use try-with-resources for FileInputStream in FileResponse and InputStream in RegexURLFilterBase
  • Add cleanup() to DebugParseFilter to close the output stream
  • Ensure Playwright tracing is always stopped via a finally block in HttpProtocol, even when an exception is thrown

Error Handling & Robustness

  • Wrap Long.parseLong() / Integer.parseInt() in FetcherBolt with try-catch to handle invalid metadata values gracefully
  • Add null check for classpath resource stream in RegexURLFilterBase
  • Fix duplicate setExpiryDate() call and add null guard in CookieConverter
  • Fix misleading error message in CloudSearchUtils ("must be score" → "must NOT be score")

Logging & Code Cleanup

  • Replace all e.printStackTrace() calls with proper SLF4J LOG.error()
  • Replace URLEncoder.encode(url, "UTF-8") (unnecessary checked exception) with the StandardCharsets.UTF_8 overload
  • Replace manual MessageDigest boilerplate with DigestUtils.sha512Hex()
  • Use actual document charset in JsRenderingDetector instead of hardcoded UTF-8

No behavior changes intended

All changes are backward-compatible. The crawl logic and output remain the same.

 - Replace SimpleDateFormat with DateTimeFormatter/DateUtils (thread safety)
- Replace shared Matcher in RefreshTag with per-call instance (thread safety)
- Use try-with-resources for streams in FileResponse and RegexURLFilterBase
- Add cleanup() to DebugParseFilter to close output stream
- Stop Playwright tracing in finally block to prevent resource leak
- Handle NumberFormatException in FetcherBolt for metadata delay/thread values
- Add null check for missing classpath resource in RegexURLFilterBase
- Fix duplicate setExpiryDate() call in CookieConverter
- Fix error message typo in CloudSearchUtils ("must be" -> "must NOT be")
- Replace e.printStackTrace() with SLF4J LOG.error() throughout
- Use StandardCharsets overload of URLEncoder.encode() in S3 bolts
- Replace manual MessageDigest with DigestUtils.sha512Hex()
- Use detected charset in JsRenderingDetector instead of hardcoded UTF-8
@dpol1 dpol1 requested review from jnioche, mvolikas, rzo1 and sigee June 4, 2026 08:19
@jnioche
Copy link
Copy Markdown
Contributor

jnioche commented Jun 4, 2026

Thanks @dpol1. Having a PR affecting loads of different files and covering such different aspects does not make it easy to review. Instead of a large dump, having one PR per type of issue (e.g. Resource Leaks) means that changes can be reviewed quicker, merged in isolation and also reviewed in parallel.
Any chance you could break it down a bit?

@dpol1
Copy link
Copy Markdown
Member Author

dpol1 commented Jun 4, 2026

Sure, sorry about that — got a bit carried away. Will split it into three PRs: thread-safety, resource leaks, and fault tolerance/cleanup.

@dpol1 dpol1 closed this Jun 4, 2026
@dpol1 dpol1 deleted the thread-safety-resource-leaks branch June 4, 2026 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants