Is there an existing issue for this?
What happened?
Search crawler runs for hours / days and never completes.
Uses excessive CPU.
Steps to reproduce?
Reproduced on a site whose content contains invalid HTML.
Current Behavior
No response
Expected Behavior
No response
Relevant log output
Anything else?
On my site, I had previously used the Google search integration and didn’t use Lucene indexing until we recently updated the forums module to use DNN Search/Lucene rather than SQL full-text search.
Despite my earlier suspicion that this issue was related to user indexing, it isn’t. Rather, it is triggered by content containing markup that is incorrect or incomplete: during indexing, a regex is run to remove tags from the content while keeping the attributes. Unfortunately, that regex isn’t executed with a timeout, and it misbehaves when the content is not valid HTML.
This is where things get stuck in the call stack:
In that method, this regex never returns if the content is not valid HTML:
In my case, someone had posted some hacked-up WSDL (gack!) into a forum post; in other words, it partially looks like HTML, but it’s perfectly valid “text” content.
This causes the regex to never complete due to catastrophic backtracking.
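For anyone unfamiliar with the failure mode, here is a minimal, self-contained sketch (not DNN code; the textbook `^(a+)+$` pattern stands in for `HtmlTagsWithAttrs`) showing how a pathological input hangs a backtracking regex, and how the .NET `matchTimeout` overload turns that runaway match into a catchable exception:

```csharp
using System;
using System.Text.RegularExpressions;

class BacktrackDemo
{
    static void Main()
    {
        // 30 'a's followed by a character that forces the overall match to fail,
        // so the engine explores an exponential number of ways to split the a's.
        string pathological = new string('a', 30) + "!";

        // Without the TimeSpan argument, IsMatch here could run for hours.
        var guarded = new Regex("^(a+)+$", RegexOptions.None,
                                TimeSpan.FromMilliseconds(100));
        try
        {
            guarded.IsMatch(pathological);
            Console.WriteLine("completed normally");
        }
        catch (RegexMatchTimeoutException)
        {
            // This is what the search crawler should do: give up on this one
            // content item and keep indexing, instead of spinning forever.
            Console.WriteLine("regex timed out");
        }
    }
}
```

This prints `regex timed out`, which is the behavior the crawler needs: bounded time per item rather than an unbounded hang.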
When I look at the pattern in RegexBuddy:
What should be done to fix it:
- Add a timeout when declaring the regexes.
- Use the cached regex API in DNN.
- Add a try/catch in the method to handle any timeout exceptions.
- Fix the regex itself by adding an atomic group, so the pattern no longer goes into an endless loop.
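Putting those pieces together, a minimal sketch of the fix could look like this. Assumptions: `SearchContentCleaner`, `StripTagsKeepAttrs`, the replacement string, the `attr` group name, and the 2-second timeout are all illustrative, not the actual DNN source; in DNN the instance would come from the cached regex API (`RegexUtils.GetCachedRegex`, if I have the name right) rather than a static field:

```csharp
using System;
using System.Text.RegularExpressions;

static class SearchContentCleaner
{
    // Atomic group (?>...) around the attribute repetition: once the engine
    // has consumed a run of attributes it may not backtrack into them, so
    // invalid HTML fails fast instead of exploding combinatorially.
    private const string HtmlTagsWithAttrs =
        @"<[a-z_:][\w:.-]*(?>(\s+(?<attr>\w+\s*?=\s*?[""'].*?[""']))+)\s*/?>";

    private static readonly Regex TagsWithAttrsRegex = new Regex(
        HtmlTagsWithAttrs,
        RegexOptions.IgnoreCase | RegexOptions.Compiled,
        TimeSpan.FromSeconds(2)); // defense in depth: hard cap per match

    public static string StripTagsKeepAttrs(string content)
    {
        try
        {
            // Replace each tag with its attribute text so attribute values
            // still get indexed (illustrative replacement, not DNN's exact one).
            return TagsWithAttrsRegex.Replace(content, " ${attr} ");
        }
        catch (RegexMatchTimeoutException)
        {
            // Worst case: index the raw content rather than hanging the crawler.
            return content;
        }
    }
}
```

The atomic group fixes the root cause; the timeout and try/catch remain as a safety net for any other pathological content.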
From:

```csharp
private const string HtmlTagsWithAttrs = @"<[a-z_:][\w:.-]*(\s+(?<attr>\w+\s*?=\s*?[""'].*?[""']))+\s*/?>";
```

to:

```csharp
private const string HtmlTagsWithAttrs = @"<[a-z_:][\w:.-]*(?>(\s+(?<attr>\w+\s*?=\s*?[""'].*?[""']))+)\s*/?>";
```

(The named capture is shown here as `attr`; the exact group name in the source may differ.)
This causes the regex to give up quickly on invalid HTML, but it still correctly matches valid HTML (the point of the regex is to remove tags but keep the attributes).
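To illustrate both halves of that claim, here is a hedged sketch; `FixedPattern` is my reconstruction of the corrected regex (group name `attr` assumed), run against a well-formed tag and a WSDL-ish fragment like the one that triggered the hang:

```csharp
using System;
using System.Text.RegularExpressions;

class AtomicGroupDemo
{
    // Reconstruction of the corrected pattern, with the atomic group (?>...).
    private const string FixedPattern =
        @"<[a-z_:][\w:.-]*(?>(\s+(?<attr>\w+\s*?=\s*?[""'].*?[""']))+)\s*/?>";

    static void Main()
    {
        var regex = new Regex(FixedPattern, RegexOptions.IgnoreCase,
                              TimeSpan.FromSeconds(2));

        // Valid HTML tag with an attribute: still matches.
        Console.WriteLine(regex.IsMatch(@"<a href=""https://example.com"">")); // True

        // Fragment that only partially looks like HTML (no closing '>'):
        // no match, and the atomic group means the engine gives up immediately
        // instead of backtracking through the attribute run.
        Console.WriteLine(regex.IsMatch(@"<xs:element name=""foo"" <wsdl:part")); // False
    }
}
```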
Affected Versions
10.3.0 (latest release)
What browsers are you seeing the problem on?
No response
Code of Conduct