A Python-based tool for crawling and processing TradingView's Pine Script V6 documentation, built on the Crawl4Ai framework. The tool extracts, cleans, and organizes the documentation, making it easier to reference and analyze. Crawl4Ai supplies the core web crawling, data extraction, and asynchronous processing.
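As background, fetching a single page with Crawl4Ai looks roughly like this. It is a minimal sketch: the URL is illustrative, and the real crawler layers batching, schema-based extraction, and retry handling on top:

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def fetch_page(url: str) -> str:
    # Crawl4Ai drives a headless browser and returns the page as markdown.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return str(result.markdown)

if __name__ == "__main__":
    # Illustrative URL; the real crawler discovers pages from the docs index.
    text = asyncio.run(fetch_page("https://www.tradingview.com/pine-script-docs/"))
    print(text[:500])
```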
- Automatically extracts documentation from TradingView's Pine Script V6 website using Crawl4Ai
- Efficiently handles navigation through documentation pages
- Supports batch processing with rate limiting (see the sketch after this list)
- Maintains a structured extraction schema for consistent results
- Saves individual pages into an `unprocessed/` folder and also writes a combined raw file
- Cleans and formats documentation content
- Preserves PineScript code blocks with proper syntax highlighting
- Extracts and formats function documentation
- Removes unnecessary navigation elements and formatting
- Processes content into a clean, readable markdown format
- Creates individual markdown files for each documentation page (raw output)
- Raw (unprocessed) markdown files are saved to `pinescript_docs/unprocessed/`
- A combined raw file `all_docs_{timestamp}.md` is written to `pinescript_docs/`
- The processor reads from `pinescript_docs/unprocessed/` and writes enhanced files to `pinescript_docs/processed/`
- The processor also writes a combined processed file `processed_all_docs.md` into the repository root (same directory as the scripts)
- Tracks failed URLs and crawling statistics
- Preserves original source URLs and timestamps
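To illustrate the batch-and-throttle idea from the feature list, here is a minimal sketch built on Crawl4Ai's `AsyncWebCrawler`. The batch size and delay are illustrative values, not the tool's actual configuration:

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def crawl_in_batches(urls, batch_size=5, delay_seconds=2.0):
    """Crawl URLs in fixed-size batches, pausing between batches."""
    results = []
    async with AsyncWebCrawler() as crawler:
        for i in range(0, len(urls), batch_size):
            batch = urls[i : i + batch_size]
            # Fetch one batch concurrently.
            pages = await asyncio.gather(
                *(crawler.arun(url=u) for u in batch), return_exceptions=True
            )
            results.extend(pages)
            # Rate limiting: pause before starting the next batch.
            if i + batch_size < len(urls):
                await asyncio.sleep(delay_seconds)
    return results
```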
- Clone the repository:

  ```bash
  git clone git@github.com:samuelymh/pinescript-docs-scraper.git
  cd pinescript-docs-scraper
  ```

- Create and activate a virtual environment:

  On macOS/Linux:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

  On Windows PowerShell:

  ```powershell
  python -m venv .venv
  .\.venv\Scripts\Activate.ps1
  ```

- Install required dependencies:

  ```bash
  python -m pip install -r requirements.txt
  ```
- Crawling Documentation:

  Run the crawler:

  ```bash
  python 1_scrap_docs.py
  ```

  This script collects documentation URLs, downloads the content, saves raw markdown files to `pinescript_docs/unprocessed/`, and writes a combined raw `all_docs_{timestamp}.md` to `pinescript_docs/`.
- Processing Documentation:

  To clean and organize the crawled content, run:

  ```bash
  python 2_process_docs.py
  ```

  This script reads raw markdown files from `pinescript_docs/unprocessed/`, extracts code examples and function documentation, and writes processed versions to `pinescript_docs/processed/`. It also writes a combined `processed_all_docs.md` next to the scripts (repository root) for easy access.
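  To illustrate the extraction step, fenced Pine Script blocks can be pulled out of raw markdown with a regular expression. This is a sketch of the general technique, not the processor's exact implementation:

  ```python
  import re

  # Matches fenced blocks tagged as Pine Script, e.g. "```pinescript ... ```".
  PINE_BLOCK_RE = re.compile(r"```pine(?:script)?\n(.*?)```", re.DOTALL)

  def extract_pine_blocks(markdown_text: str) -> list[str]:
      """Return the body of every Pine Script code block in a markdown page."""
      return [block.strip() for block in PINE_BLOCK_RE.findall(markdown_text)]
  ```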
- Run both (crawl then process) using the orchestrator:

  A convenience script, `3_scrap_and_process.py`, runs the crawler and then the processor in sequence, or either step on its own. It loads the two scripts and invokes their main behaviors, so you don't need to run them separately.

  ```bash
  # Run crawl then process
  python 3_scrap_and_process.py

  # Only crawl (no processing)
  python 3_scrap_and_process.py --crawl-only

  # Only process (useful if you've already crawled)
  python 3_scrap_and_process.py --process-only

  # Reduce console output
  python 3_scrap_and_process.py --no-verbose
  ```
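  Conceptually, the orchestrator boils down to something like the following sketch. The `run_crawler`/`run_processor` stubs below are placeholders for the two scripts' main behaviors, not the script's actual internals:

  ```python
  import argparse

  def run_crawler(verbose: bool) -> None:
      ...  # placeholder for 1_scrap_docs.py's main behavior

  def run_processor(verbose: bool) -> None:
      ...  # placeholder for 2_process_docs.py's main behavior

  def main() -> None:
      parser = argparse.ArgumentParser(description="Crawl then process the docs.")
      parser.add_argument("--crawl-only", action="store_true")
      parser.add_argument("--process-only", action="store_true")
      parser.add_argument("--no-verbose", action="store_true")
      args = parser.parse_args()

      if not args.process_only:
          run_crawler(verbose=not args.no_verbose)
      if not args.crawl_only:
          run_processor(verbose=not args.no_verbose)

  if __name__ == "__main__":
      main()
  ```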
- Preview Cleanup (dry-run):

  Before deleting files, run the cleanup script in dry-run mode to see which files would be removed:

  ```bash
  # Preview files that would be removed from 'unprocessed'
  python scripts/cleanup_processed.py --target unprocessed --dry-run

  # Preview files that would be removed from 'processed'
  python scripts/cleanup_processed.py --target processed --dry-run

  # Preview both directories
  python scripts/cleanup_processed.py --target both --dry-run
  ```
- Apply Cleanup (delete old files):

  When you're ready to remove the old timestamped files (the script keeps the most recent file per page), run:

  ```bash
  # Remove old files from 'unprocessed'
  python scripts/cleanup_processed.py --target unprocessed

  # Remove old files from 'processed' (keeps newest per page)
  python scripts/cleanup_processed.py --target processed

  # Remove from both
  python scripts/cleanup_processed.py --target both
  ```

  Note: Use `--dry-run` first to confirm the exact files that will be deleted.
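  The keep-newest rule amounts to grouping files by page name and deleting everything except the latest timestamp. A rough sketch, assuming the trailing `_{timestamp}` segment is zero-padded so plain string sorting orders files chronologically (not the script's exact code):

  ```python
  from collections import defaultdict
  from pathlib import Path

  def stale_files(directory: Path) -> list[Path]:
      """Group timestamped .md files by page name; list all but the newest."""
      groups: dict[str, list[Path]] = defaultdict(list)
      for path in directory.glob("*.md"):
          # Assumes everything before the final underscore names the page.
          groups[path.stem.rpartition("_")[0]].append(path)
      stale: list[Path] = []
      for paths in groups.values():
          paths.sort(key=lambda p: p.stem)  # zero-padded timestamps sort as strings
          stale.extend(paths[:-1])          # keep only the newest file per page
      return stale

  # Dry-run style usage: print candidates instead of deleting them.
  for path in stale_files(Path("pinescript_docs/unprocessed")):
      print("would remove:", path)
  ```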
```text
pinescript_docs/
├── all_docs_{timestamp}.md        # Combined raw documentation (from crawler)
├── unprocessed/                   # Raw markdown files produced by the crawler
│   └── {index}_{page_name}_{timestamp}.md
├── failed_urls_{timestamp}.txt    # Failed crawl attempts
└── processed/                     # Enhanced content produced by the processor
    └── processed_{page_name}_{timestamp}.md

processed_all_docs.md              # Combined processed file (written to repository root)
```
The full deployment and operational instructions are maintained in `docs/DEPLOYMENT.md` (operator-focused). For local development and quick commands, see `server/STARTUP.md` (developer-focused).

- Quick start (dev): see `server/STARTUP.md`
- Production / deployment: see `docs/DEPLOYMENT.md`
In short: apply SQL migrations in `migrations/`, set required secrets (see `docs/DEPLOYMENT.md`), build the Docker image (optional), and either run the server locally with uvicorn for development or use the Docker image / Gunicorn for production-like runs. Use the ingest CLI (`server/run_ingest.py`) for full reindexes to avoid worker timeouts.
The crawler and processor can be customized through their respective class initializations (see the sketch after this list):

- `PineScriptDocsCrawler`: configures crawling behavior, batch size, and extraction schema.
- `PineScriptDocsProcessor`: customizes content processing and output formatting.
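For example, customization happens at construction time, along these lines. Both the import paths and the keyword arguments here are illustrative assumptions; check the class definitions in the scripts for the real signatures:

```python
# Hypothetical import paths -- the numeric script filenames (1_scrap_docs.py,
# 2_process_docs.py) aren't valid module names, so adjust to your setup.
from scrap_docs import PineScriptDocsCrawler
from process_docs import PineScriptDocsProcessor

# Keyword arguments below are illustrative assumptions, not real signatures.
crawler = PineScriptDocsCrawler(batch_size=10, rate_limit_seconds=2.0)
processor = PineScriptDocsProcessor(output_dir="pinescript_docs/processed")
```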
This project is open source and available under the MIT License.
- Failed URLs are logged with error messages (see the sketch below).
- Batch processing ensures resilience to temporary failures.
- Rate limiting helps avoid server overload.
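A per-URL try/except with a failure log captures the idea behind all three points. This is a sketch rather than the crawler's actual code; the `success`/`error_message` attributes follow Crawl4Ai's result object, and the output file name mirrors the `failed_urls_{timestamp}.txt` pattern above:

```python
import asyncio
from datetime import datetime
from pathlib import Path

from crawl4ai import AsyncWebCrawler

async def crawl_with_failure_log(urls: list[str], out_dir: Path) -> None:
    failed: list[str] = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            try:
                result = await crawler.arun(url=url)
                if not result.success:
                    failed.append(f"{url}\t{result.error_message}")
            except Exception as exc:  # one bad page shouldn't stop the run
                failed.append(f"{url}\t{exc}")
    if failed:
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        (out_dir / f"failed_urls_{stamp}.txt").write_text("\n".join(failed))
```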