[Bug]: Site Search Crawler never completes and uses excessive CPU resources

### Is there an existing issue for this?

- [x] I have searched the existing issues

### What happened?

The search crawler runs for hours or days and never completes.
It uses excessive CPU the whole time.

### Steps to reproduce?

Reproduced on a site whose content contains invalid HTML.

### Current Behavior

_No response_

### Expected Behavior

_No response_

### Relevant log output

```shell

```

### Anything else?

On my site, I had previously used Google search integration and didn’t use Lucene indexing until we recently updated the forums module to use DNN search/Lucene rather than SQL full-text search.
 
Despite my earlier assumption that this issue was related to user indexing, it isn’t. The cause is content containing markup that is incorrect or incomplete. During indexing, a regex is run to remove tags from the content while keeping their attributes. Unfortunately, that regex isn’t executed with a timeout and doesn’t work correctly when the content is not valid HTML.
 
This is where things get stuck in the call stack:
 

 
<img width="1774" height="513" alt="Image" src="https://github.com/user-attachments/assets/e94c2d24-c404-4936-9293-cf30325edd6d" />
 
In that method, this regex never returns if the content is not valid HTML:
 
<img width="1070" height="283" alt="Image" src="https://github.com/user-attachments/assets/aa5a820a-1c97-4e07-bb76-5c528937f273" />


In my case, someone had posted some hacked-up WSDL (gack!) into a forum post; in other words, content that partially looks like HTML but is perfectly valid “text” content:
 
<img width="1430" height="499" alt="Image" src="https://github.com/user-attachments/assets/1d225f9f-adf9-42ec-9935-65372f447276" />

 
This causes the regex to never complete due to catastrophic backtracking.
Here is what it looks like in RegexBuddy:
 
 
<img width="1243" height="732" alt="Image" src="https://github.com/user-attachments/assets/3b2645c2-bb2c-4ed1-9fb7-ee9702656b15" />
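
The backtracking can be reproduced outside DNN with a small sketch. Only the pattern comes from the current code. The input is a made-up stand-in for the forum post (a tag-like fragment with many attributes and no closing `>`), the `RegexOptions` value is an assumption, and `BacktrackingDemo`, `BuildUnclosedTag`, and `TimesOut` are hypothetical names. A match timeout is passed so the call gives up instead of spinning.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BacktrackingDemo
{
    // Pattern copied from the issue (the current HtmlTagsWithAttrs constant).
    public const string HtmlTagsWithAttrs =
        "<[a-z_:][\\w:.-]*(\\s+(?<attr>\\w+\\s*?=\\s*?[\"'].*?[\"']))+\\s*/?>";

    // Hypothetical input: "<a x="1" x="1" ..." with no closing '>'.
    public static string BuildUnclosedTag(int attributeCount)
    {
        return "<a" + string.Concat(Enumerable.Repeat(" x=\"1\"", attributeCount));
    }

    // Returns true if the match attempt was abandoned because the timeout elapsed.
    public static bool TimesOut(string pattern, string input, TimeSpan timeout)
    {
        // RegexOptions.IgnoreCase is an assumption, not taken from the DNN source.
        var regex = new Regex(pattern, RegexOptions.IgnoreCase, timeout);
        try
        {
            regex.IsMatch(input);
            return false;
        }
        catch (RegexMatchTimeoutException)
        {
            return true;
        }
    }
}
```

With around 40 attributes, `BacktrackingDemo.TimesOut(BacktrackingDemo.HtmlTagsWithAttrs, BacktrackingDemo.BuildUnclosedTag(40), TimeSpan.FromSeconds(1))` should report a timeout: the lazy `.*?` in each attribute can also swallow the attributes that follow it, and the engine tries every such split before it gives up on the missing `>`.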

What should be done to fix it:
 
 
<img width="1100" height="372" alt="Image" src="https://github.com/user-attachments/assets/ec20e8b4-a491-4086-b5fc-a811fb203f04" />

 
1. Add a timeout when declaring the regexes.
2. Use the cached regex API in DNN.
3. Add a try/catch in the method to handle any timeout exceptions.
4. Fix the regex by adding atomic groups. This fixes the regex itself, so it no longer goes into an endless loop.
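
A minimal sketch of steps 1 and 3, using only the plain .NET `Regex` API. `SafeRegex`, `Create`, `ReplaceOrKeep`, and the two-second timeout are hypothetical choices for illustration, not the actual DNN code. DNN's cached regex helper (step 2) is left out so the sketch has no DNN dependency.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SafeRegex
{
    // Hypothetical default; the right value depends on the indexer's workload.
    private static readonly TimeSpan MatchTimeout = TimeSpan.FromSeconds(2);

    // Step 1: build the regex with a timeout, so a runaway match throws
    // RegexMatchTimeoutException instead of running unbounded.
    public static Regex Create(string pattern, RegexOptions options = RegexOptions.None)
    {
        return new Regex(pattern, options, MatchTimeout);
    }

    // Step 3: catch the timeout and return the input unchanged,
    // so one bad document cannot stall the whole crawl.
    public static string ReplaceOrKeep(Regex regex, string input, string replacement)
    {
        try
        {
            return regex.Replace(input, replacement);
        }
        catch (RegexMatchTimeoutException)
        {
            // A real implementation would probably log the offending content here.
            return input;
        }
    }
}
```

Returning the input unchanged on timeout is one possible fallback; skipping the document or indexing it with a simpler tag stripper would be equally reasonable.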
 
From

        private const string HtmlTagsWithAttrs = "<[a-z_:][\\w:.-]*(\\s+(?<attr>\\w+\\s*?=\\s*?[\"'].*?[\"']))+\\s*/?>";

to

        private const string HtmlTagsWithAttrs = "<[a-z_:][\\w:.-]*(?>(?:\\s+(?<attr>\\w+\\s*?=\\s*?[\"'].*?[\"']))*)?\\s*/?>";
 
This causes the regex to ignore invalid HTML:
 
<img width="880" height="539" alt="Image" src="https://github.com/user-attachments/assets/a5e01046-5c0b-410f-a45d-fbfe91f87c4d" />

 
But it still correctly matches valid HTML (the point of the regex is to remove tags but keep the attributes):
 
 
<img width="950" height="556" alt="Image" src="https://github.com/user-attachments/assets/836cddf9-9608-4e6b-b7c1-0c7555b007d0" />
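
The same two cases can be checked in code. This is a sketch, assuming the proposed pattern above; `FixedPatternDemo` and `CaptureAttributes` are hypothetical names, and the `RegexOptions` value and timeout are assumptions rather than the DNN settings.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FixedPatternDemo
{
    // Proposed pattern from the issue: the attribute loop sits inside an atomic group.
    public const string HtmlTagsWithAttrs =
        "<[a-z_:][\\w:.-]*(?>(?:\\s+(?<attr>\\w+\\s*?=\\s*?[\"'].*?[\"']))*)?\\s*/?>";

    // Options and timeout are assumptions made for this sketch.
    private static readonly Regex TagRegex =
        new Regex(HtmlTagsWithAttrs, RegexOptions.IgnoreCase, TimeSpan.FromSeconds(2));

    // Returns the attribute text captured from the first matching tag,
    // or an empty array when no tag matches.
    public static string[] CaptureAttributes(string input)
    {
        Match match = TagRegex.Match(input);
        if (!match.Success)
        {
            return new string[0];
        }

        return match.Groups["attr"].Captures
            .Cast<Capture>()
            .Select(c => c.Value)
            .ToArray();
    }
}
```

For a well-formed tag such as `<a href="x.html" title='T'>`, the method returns the two attribute strings. For a tag-like fragment with many attributes and no closing `>`, it returns an empty array without hitting the timeout, because the engine cannot backtrack into the atomic group once the attribute loop has finished.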

 

### Affected Versions

10.3.0 (latest release)

### What browsers are you seeing the problem on?

_No response_

### Code of Conduct

- [x] I agree to follow this project's Code of Conduct
