[Bug]: Site Search Crawler never completes and uses excessive CPU resources #7191

@johnhenley

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

The search crawler runs for hours or even days and never completes.
It also uses excessive CPU while it runs.

Steps to reproduce?

Reproduced on a site with content containing invalid HTML.

Current Behavior

No response

Expected Behavior

No response

Relevant log output

Anything else?

On my site, I had previously used Google search integration and didn’t use the Lucene indexing until we recently updated the forums module to use DNN search/Lucene rather than SQL fulltext.

Contrary to what I previously thought, this issue is not related to user indexing. It is caused by content containing markup that is incorrect or incomplete. During indexing, a regex is run that removes tags from the content but keeps their attributes. Unfortunately, that regex is executed without a timeout and does not work correctly when the content is not valid HTML.

This is where things get stuck in the call stack:

Image

In that method, this regex never returns if the content is not valid HTML:

Image

In my case, someone had posted some hacked-up WSDL (gack!) into a forum post. In other words, it partially looks like HTML, but it is perfectly valid “text” content:

Image

This causes the regex to never complete due to catastrophic backtracking.
This is what it looks like in RegexBuddy:

Image

What should be done to fix it:

Image
  1. Add a timeout when declaring the regexes.
  2. Use the cached regex API in DNN.
  3. Add a try/catch in the method to handle any timeout exceptions.
  4. Fix the regex by adding atomic groups. This fixes the regex itself, so it won’t go into an endless loop.
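
Steps 1 and 3 above could look roughly like the following. This is a minimal sketch in plain .NET, not the DNN code: the class name `SafeTagStripper`, the method `StripTags`, the 2-second timeout, and the simplified pattern are all assumptions for illustration, and it does not use the DNN cached regex API from step 2.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class SafeTagStripper
{
    // Hypothetical, simplified pattern for illustration only. It has the same
    // nested-quantifier shape as the problem regex, so it is NOT safe on its own;
    // the timeout below is the safety net.
    private const string TagsWithAttrs =
        @"<[a-z_:][\w:.-]*(\s+(?<attr>\w+\s*=\s*[""'].*?[""']))+\s*/?>";

    // Step 1: declare the regex with a match timeout (2 seconds is an assumed value).
    private static readonly Regex TagsWithAttrsRegex = new Regex(
        TagsWithAttrs,
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled,
        TimeSpan.FromSeconds(2));

    // Removes tags that carry attributes but keeps the attribute text.
    public static string StripTags(string content)
    {
        if (string.IsNullOrEmpty(content))
        {
            return content;
        }

        try
        {
            return TagsWithAttrsRegex.Replace(content, match =>
            {
                var kept = new StringBuilder(" ");
                foreach (Capture attr in match.Groups["attr"].Captures)
                {
                    kept.Append(attr.Value).Append(' ');
                }

                return kept.ToString();
            });
        }
        catch (RegexMatchTimeoutException)
        {
            // Step 3: do not let one bad document stall the whole crawl.
            // Returning the content unchanged is one possible fallback.
            return content;
        }
    }
}
```

With this in place a pathological document costs at most the timeout per match attempt instead of blocking the indexer. A real fix would probably also log the timeout so the offending content can be found.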

From
private const string HtmlTagsWithAttrs = "<[a-z_:][\w:.-](\s+(?\w+\s?=\s*?["'].?["']))+\s/?>";
to
private const string HtmlTagsWithAttrs = “<[a-z_:][\w:.-](?>(?:\s+(?\w+\s*?=\s*?["'].?["'])))?\s*/))%3f/s*/)?>”;

This causes the regex to ignore invalid HTML:

Image

But it still correctly matches valid HTML (the point of the regex is to remove tags but keep the attributes):

Image
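
To illustrate why an atomic group helps, here is a small self-contained sketch. The two patterns are textbook examples, not the DNN regex, and the names `AtomicGroupDemo` and `TryIsMatch` are made up for this example. An atomic group `(?>...)` tells the engine not to backtrack into the group once it has matched, which removes the many alternative ways of splitting the input that nested quantifiers otherwise allow.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicGroupDemo
{
    // Assumed timeout so the demo can never hang.
    private static readonly TimeSpan Timeout = TimeSpan.FromMilliseconds(500);

    // Nested quantifiers: when the match fails, a backtracking engine may retry
    // many ways of dividing the run of 'a's between the inner and outer loop.
    public static readonly Regex Backtracking =
        new Regex(@"^(a+)+b$", RegexOptions.None, Timeout);

    // Same pattern with an atomic inner group: once "a+" has consumed the run,
    // the engine does not go back and try shorter splits.
    public static readonly Regex Atomic =
        new Regex(@"^(?>a+)+b$", RegexOptions.None, Timeout);

    // Returns true/false for a completed match attempt, or null on timeout.
    public static bool? TryIsMatch(Regex regex, string input)
    {
        try
        {
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }
}
```

Calling `AtomicGroupDemo.TryIsMatch(AtomicGroupDemo.Atomic, new string('a', 40))` fails fast because there is nothing left to retry. Whether the non-atomic version actually times out on the same input depends on the .NET runtime version, since newer regex engines apply optimizations of their own.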

Affected Versions

10.3.0 (latest release)

What browsers are you seeing the problem on?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
