Commit 5aff3ac

AI Search: Add CSS content selectors docs (#29699)
* [AI Search] Add CSS content selectors documentation
* [AI Search] add changelog
1 parent f5abc94 commit 5aff3ac

2 files changed: +151 -0 lines changed

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
---
title: Website Source CSS content selectors for precise content extraction in AI Search
description: Control which parts of crawled pages are indexed using CSS selectors.
products:
  - ai-search
date: 2026-04-08
---

[AI Search](/ai-search/) now supports [CSS content selectors](/ai-search/configuration/data-source/website/#content-selectors) for website data sources. You can define which parts of a crawled page are extracted and indexed by pairing CSS selectors with URL glob patterns.

Content selectors let you index only the relevant content of a page while ignoring navigation, sidebars, footers, and other boilerplate. When a page URL matches a glob pattern, only elements matching the corresponding CSS selector are extracted and converted to Markdown for indexing.

Configure content selectors via the dashboard or API:

```bash
# Send an instance definition with one content selector entry for blog pages.
curl "https://api.cloudflare.com/client/v4/accounts/{account_id}/ai-search/instances" \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "my-ai-search",
    "source": "https://example.com",
    "type": "web-crawler",
    "source_params": {
      "web_crawler": {
        "parse_options": {
          "content_selector": [
            {
              "path": "**/blog/**",
              "selector": "article .post-body"
            }
          ]
        }
      }
    }
  }'
```

Selectors are evaluated in order, and the first matching pattern wins. You can define up to 10 content selector entries per instance.

For configuration details and examples, refer to the [content selectors documentation](/ai-search/configuration/data-source/website/#content-selectors).

src/content/docs/ai-search/configuration/data-source/website.mdx

Lines changed: 111 additions & 0 deletions
@@ -47,6 +47,117 @@ For example, to index only blog posts while excluding drafts:

Refer to [Path filtering](/ai-search/configuration/path-filtering/) for pattern syntax, filtering behavior, and more examples.

## Content selectors

Content selectors let you control which parts of a crawled page are indexed. Each entry pairs a URL glob pattern with a CSS selector. When a page URL matches a glob pattern, only the elements matching the corresponding CSS selector, along with their descendants, are extracted and converted to Markdown for indexing.

The list is ordered and the **first matching path wins**. If a page URL matches multiple glob patterns, only the selector from the first match is applied. Order your entries from most specific to least specific.

### Default behavior

Without content selectors, AI Search applies a default processing pipeline that removes elements such as `<header>`, `<footer>`, and `<head>` before converting the remaining content to Markdown. For more details, refer to [How HTML is processed](/workers-ai/features/markdown-conversion/how-it-works/#html).

### Configure content selectors in the dashboard

<Steps>

1. Go to the [AI Search](https://dash.cloudflare.com/?to=/:account/ai/ai-search) page in the Cloudflare dashboard.

   <DashButton url="/?to=/:account/ai/ai-search" />

2. Select your AI Search instance, or select **Create** to create a new one with a **Website** data source.
3. Under the data source settings, locate the **Content selectors** section.
4. Select **Add selector**.
5. In the **Path** field, enter a glob pattern to match page URLs. For example, `**/blog/**`.
6. In the **Selector** field, enter a CSS selector to extract content from matching pages. For example, `article .post-body`.
7. To add more entries, select **Add selector** again. Entries are evaluated in order from top to bottom.

</Steps>

### Configure content selectors via the API

Content selectors are configured in the `source_params.web_crawler.parse_options.content_selector` field when creating or updating an AI Search instance. The field accepts an array of objects, each with a `path` and a `selector` property.

```bash
# Send an instance definition with two content selector entries.
# Entries are evaluated in array order.
curl "https://api.cloudflare.com/client/v4/accounts/{account_id}/ai-search/instances" \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "my-ai-search",
    "source": "https://example.com",
    "type": "web-crawler",
    "source_params": {
      "web_crawler": {
        "parse_options": {
          "content_selector": [
            {
              "path": "**/blog/**",
              "selector": "article .post-body"
            },
            {
              "path": "**/docs/**",
              "selector": "main .content"
            }
          ]
        }
      }
    }
  }'
```

| Field      | Type   | Description |
| ---------- | ------ | ----------- |
| `path`     | string | Glob pattern to match against the full page URL. Uses the same glob syntax as [path filtering](/ai-search/configuration/path-filtering/): `*` matches within a segment, `**` crosses directories. Maximum 200 characters. |
| `selector` | string | CSS selector to extract content from pages matching the path pattern. Supports standard CSS selectors including element, class, ID, and attribute selectors. Maximum 200 characters. |

### Examples

#### Extract main content from blog pages

To index only the article body on blog pages and ignore navigation, sidebars, and footers:

| Path         | Selector             |
| ------------ | -------------------- |
| `**/blog/**` | `article .post-body` |

#### Target documentation content

To index the main content area of a documentation site:

| Path         | Selector        |
| ------------ | --------------- |
| `**/docs/**` | `main .content` |

#### Different selectors for different sections

You can define multiple entries to apply different selectors to different parts of your site. The first matching path wins, so place more specific patterns first:

| Path                  | Selector             |
| --------------------- | -------------------- |
| `**/blog/releases/**` | `.release-notes`     |
| `**/blog/**`          | `article .post-body` |
| `**/docs/**`          | `main .content`      |

In this example, a page at `https://example.com/blog/releases/v2` matches the first pattern and uses the `.release-notes` selector. A page at `https://example.com/blog/my-post` skips the first pattern and matches the second.
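
The first-match-wins rule can be sketched in a few lines of POSIX shell. This is a minimal illustration, not AI Search's implementation: the `resolve_selector` function and the `path => selector` line format of `ENTRIES` are hypothetical, and shell `case` patterns only approximate the documented glob syntax, because `*` in a `case` pattern also matches `/`.

```bash
# Ordered entries, most specific first (values taken from the table above).
# The "path => selector" line format is an assumption made for this sketch.
ENTRIES='**/blog/releases/** => .release-notes
**/blog/** => article .post-body
**/docs/** => main .content'

# resolve_selector URL
# Prints the selector of the first entry whose pattern matches URL.
# Returns 1 when no pattern matches.
resolve_selector() {
  url=$1
  while IFS= read -r entry; do
    pattern=${entry%% => *}   # text before " => "
    selector=${entry#* => }   # text after " => "
    # An unquoted variable used as a case pattern is matched as a glob.
    case $url in
      $pattern)
        printf '%s\n' "$selector"
        return 0
        ;;
    esac
  done <<EOF
$ENTRIES
EOF
  return 1
}

resolve_selector "https://example.com/blog/releases/v2"
resolve_selector "https://example.com/blog/my-post"
```

Swapping the first two entries would make `**/blog/**` match release pages before `**/blog/releases/**` is ever tried, which is why more specific patterns belong at the top of the list.
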

:::caution
If a CSS selector does not match any elements on a page, the resulting Markdown is empty and AI Search marks the item as errored. Verify that your selectors match the expected elements before applying them to a broad set of pages.
:::

### Interaction with other features

- **Path filtering**: [Path filtering](/ai-search/configuration/path-filtering/) takes priority over content selectors. Pages excluded by path filters are never crawled, so content selectors do not apply to them.
- **Browser Rendering**: Content selectors apply to the HTML that AI Search receives. For sites that render content with JavaScript, turn on [Browser Rendering](#rendering-mode) so that selectors can target the fully rendered DOM.
- **Automatic re-indexing**: Updating content selectors triggers a new [sync job](/ai-search/configuration/indexing/) immediately, so changes are applied to all indexed pages.

### Limits

| Limit                            | Value          |
| -------------------------------- | -------------- |
| Maximum content selector entries | 10             |
| Maximum path pattern length      | 200 characters |
| Maximum selector length          | 200 characters |
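
A local pre-check against these limits might look like the following sketch. The `check_limits` function and its `path => selector` input format are hypothetical and not part of AI Search. The sketch also assumes the limits count characters the way the shell's `${#var}` does, which some shells measure in bytes.

```bash
# check_limits: read "path => selector" lines on stdin (a hypothetical format)
# and fail if any limit from the table above is exceeded.
check_limits() {
  count=0
  while IFS= read -r entry; do
    [ -n "$entry" ] || continue   # skip blank lines
    count=$((count + 1))
    path=${entry%% => *}          # text before " => "
    selector=${entry#* => }       # text after " => "
    if [ "${#path}" -gt 200 ]; then
      echo "entry $count: path is longer than 200 characters" >&2
      return 1
    fi
    if [ "${#selector}" -gt 200 ]; then
      echo "entry $count: selector is longer than 200 characters" >&2
      return 1
    fi
  done
  if [ "$count" -gt 10 ]; then
    echo "found $count entries; the maximum is 10" >&2
    return 1
  fi
  return 0
}

check_limits <<'EOF' && echo "within limits"
**/blog/** => article .post-body
**/docs/** => main .content
EOF
```

Running a check like this before sending a request only catches the documented size limits. It does not verify that a selector matches anything on the target pages.
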
## Best practices for robots.txt and sitemap

Configure your `robots.txt` and sitemap to help AI Search crawl your site efficiently.
