A Python-based tool for crawling and processing TradingView's Pine Script V6 documentation, built on the Crawl4Ai framework. The tool extracts, cleans, and organizes the documentation, making it easier to reference and analyze. Crawl4Ai supplies the core web crawling, data extraction, and asynchronous processing.
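As background, fetching a single page with Crawl4Ai looks roughly like this. It is a minimal sketch: the URL is illustrative, and the real crawler layers batching, schema-based extraction, and retry handling on top:

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def fetch_page(url: str) -> str:
    # Crawl4Ai drives a headless browser and returns the page as markdown.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return str(result.markdown)

if __name__ == "__main__":
    # Illustrative URL; the real crawler discovers pages from the docs index.
    text = asyncio.run(fetch_page("https://www.tradingview.com/pine-script-docs/"))
    print(text[:500])
```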
- Automatically extracts documentation from TradingView's Pine Script V6 website using Crawl4Ai
- Efficiently handles navigation through documentation pages
- Supports batch processing with rate limiting (see the sketch after this list)
- Maintains a structured extraction schema for consistent results
- Saves individual pages into an `unprocessed/` folder and also writes a combined raw file
- Cleans and formats documentation content
- Preserves PineScript code blocks with proper syntax highlighting
- Extracts and formats function documentation
- Removes unnecessary navigation elements and formatting
- Processes content into a clean, readable markdown format
- Creates individual markdown files for each documentation page (raw output)
- Raw (unprocessed) markdown files are saved to `pinescript_docs/unprocessed/`
- A combined raw file `all_docs_{timestamp}.md` is written to `pinescript_docs/`
- The processor reads from `pinescript_docs/unprocessed/` and writes enhanced files to `pinescript_docs/processed/`
- The processor also writes a combined processed file `processed_all_docs.md` into the repository root (same directory as the scripts)
- Tracks failed URLs and crawling statistics
- Preserves original source URLs and timestamps
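To illustrate the batch-and-throttle idea from the feature list, here is a minimal sketch built on Crawl4Ai's `AsyncWebCrawler`. The batch size and delay are illustrative values, not the tool's actual configuration:

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def crawl_in_batches(urls, batch_size=5, delay_seconds=2.0):
    """Crawl URLs in fixed-size batches, pausing between batches."""
    results = []
    async with AsyncWebCrawler() as crawler:
        for i in range(0, len(urls), batch_size):
            batch = urls[i : i + batch_size]
            # Fetch one batch concurrently.
            pages = await asyncio.gather(
                *(crawler.arun(url=u) for u in batch), return_exceptions=True
            )
            results.extend(pages)
            # Rate limiting: pause before starting the next batch.
            if i + batch_size < len(urls):
                await asyncio.sleep(delay_seconds)
    return results
```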
- Clone the repository:

  ```bash
  git clone git@github.com:samuelymh/pinescript-docs-scraper.git
  cd pinescript-docs-scraper
  ```

- Create and activate a virtual environment:

  On macOS/Linux:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

  On Windows PowerShell:

  ```powershell
  python -m venv .venv
  .\.venv\Scripts\Activate.ps1
  ```

- Install required dependencies:

  ```bash
  python -m pip install -r requirements.txt
  ```
- Crawling Documentation:

  Run the crawler:

  ```bash
  python 1_scrap_docs.py
  ```

  This script collects documentation URLs, downloads the content, saves raw markdown files to `pinescript_docs/unprocessed/`, and writes a combined raw `all_docs_{timestamp}.md` to `pinescript_docs/`.
- Processing Documentation:

  To clean and organize the crawled content, run:

  ```bash
  python 2_process_docs.py
  ```

  This script reads raw markdown files from `pinescript_docs/unprocessed/`, extracts code examples and function documentation, and writes processed versions to `pinescript_docs/processed/`. It also writes a combined `processed_all_docs.md` next to the scripts (repository root) for easy access.
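  To illustrate the extraction step, fenced Pine Script blocks can be pulled out of raw markdown with a regular expression. This is a sketch of the general technique, not the processor's exact implementation:

  ```python
  import re

  # Matches fenced blocks tagged as Pine Script, e.g. "```pinescript ... ```".
  PINE_BLOCK_RE = re.compile(r"```pine(?:script)?\n(.*?)```", re.DOTALL)

  def extract_pine_blocks(markdown_text: str) -> list[str]:
      """Return the body of every Pine Script code block in a markdown page."""
      return [block.strip() for block in PINE_BLOCK_RE.findall(markdown_text)]
  ```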
- Run both (crawl then process) using the orchestrator:

  A convenience script, `3_scrap_and_process.py`, runs the crawler and then the processor in sequence, or either step on its own. It loads the two scripts and invokes their main behaviors, so you don't need to run them separately.

  ```bash
  # Run crawl then process
  python 3_scrap_and_process.py

  # Only crawl (no processing)
  python 3_scrap_and_process.py --crawl-only

  # Only process (useful if you've already crawled)
  python 3_scrap_and_process.py --process-only

  # Reduce console output
  python 3_scrap_and_process.py --no-verbose
  ```
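  Conceptually, the orchestrator boils down to something like the following sketch. The `run_crawler`/`run_processor` stubs below are placeholders for the two scripts' main behaviors, not the script's actual internals:

  ```python
  import argparse

  def run_crawler(verbose: bool) -> None:
      ...  # placeholder for 1_scrap_docs.py's main behavior

  def run_processor(verbose: bool) -> None:
      ...  # placeholder for 2_process_docs.py's main behavior

  def main() -> None:
      parser = argparse.ArgumentParser(description="Crawl then process the docs.")
      parser.add_argument("--crawl-only", action="store_true")
      parser.add_argument("--process-only", action="store_true")
      parser.add_argument("--no-verbose", action="store_true")
      args = parser.parse_args()

      if not args.process_only:
          run_crawler(verbose=not args.no_verbose)
      if not args.crawl_only:
          run_processor(verbose=not args.no_verbose)

  if __name__ == "__main__":
      main()
  ```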
- Preview Cleanup (dry-run):

  Before deleting files, run the cleanup script in dry-run mode to see which files would be removed:

  ```bash
  # Preview files that would be removed from 'unprocessed'
  python scripts/cleanup_processed.py --target unprocessed --dry-run

  # Preview files that would be removed from 'processed'
  python scripts/cleanup_processed.py --target processed --dry-run

  # Preview both directories
  python scripts/cleanup_processed.py --target both --dry-run
  ```
- Apply Cleanup (delete old files):

  When you're ready to remove the old timestamped files (the script keeps the most recent file per page), run:

  ```bash
  # Remove old files from 'unprocessed'
  python scripts/cleanup_processed.py --target unprocessed

  # Remove old files from 'processed' (keeps newest per page)
  python scripts/cleanup_processed.py --target processed

  # Remove from both
  python scripts/cleanup_processed.py --target both
  ```

  Note: Use `--dry-run` first to confirm the exact files that will be deleted.
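  The keep-newest rule amounts to grouping files by page name and deleting everything except the latest timestamp. A rough sketch, assuming the trailing `_{timestamp}` segment is zero-padded so plain string sorting orders files chronologically (not the script's exact code):

  ```python
  from collections import defaultdict
  from pathlib import Path

  def stale_files(directory: Path) -> list[Path]:
      """Group timestamped .md files by page name; list all but the newest."""
      groups: dict[str, list[Path]] = defaultdict(list)
      for path in directory.glob("*.md"):
          # Assumes everything before the final underscore names the page.
          groups[path.stem.rpartition("_")[0]].append(path)
      stale: list[Path] = []
      for paths in groups.values():
          paths.sort(key=lambda p: p.stem)  # zero-padded timestamps sort as strings
          stale.extend(paths[:-1])          # keep only the newest file per page
      return stale

  # Dry-run style usage: print candidates instead of deleting them.
  for path in stale_files(Path("pinescript_docs/unprocessed")):
      print("would remove:", path)
  ```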
```text
pinescript_docs/
├── all_docs_{timestamp}.md        # Combined raw documentation (from crawler)
├── unprocessed/                   # Raw markdown files produced by the crawler
│   └── {index}_{page_name}_{timestamp}.md
├── failed_urls_{timestamp}.txt    # Failed crawl attempts
└── processed/                     # Enhanced content produced by the processor
    └── processed_{page_name}_{timestamp}.md

processed_all_docs.md              # Combined processed file (written to repository root)
```
The full deployment and operational instructions are maintained in `docs/DEPLOYMENT.md` (operator-focused). For local development and quick commands, see `server/STARTUP.md` (developer-focused).

- Quick start (dev): see `server/STARTUP.md`
- Production / deployment: see `docs/DEPLOYMENT.md`
In short: apply SQL migrations in `migrations/`, set required secrets (see `docs/DEPLOYMENT.md`), build the Docker image (optional), and either run the server locally with uvicorn for development or use the Docker image / Gunicorn for production-like runs. Use the ingest CLI (`server/run_ingest.py`) for full reindexes to avoid worker timeouts.
The crawler and processor can be customized through their respective class initializations (see the sketch after this list):

- `PineScriptDocsCrawler`: configures crawling behavior, batch size, and extraction schema.
- `PineScriptDocsProcessor`: customizes content processing and output formatting.
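For example, customization happens at construction time, along these lines. Both the import paths and the keyword arguments here are illustrative assumptions; check the class definitions in the scripts for the real signatures:

```python
# Hypothetical import paths -- the numeric script filenames (1_scrap_docs.py,
# 2_process_docs.py) aren't valid module names, so adjust to your setup.
from scrap_docs import PineScriptDocsCrawler
from process_docs import PineScriptDocsProcessor

# Keyword arguments below are illustrative assumptions, not real signatures.
crawler = PineScriptDocsCrawler(batch_size=10, rate_limit_seconds=2.0)
processor = PineScriptDocsProcessor(output_dir="pinescript_docs/processed")
```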
This project is open source and available under the MIT License.
- Failed URLs are logged with error messages (see the sketch below).
- Batch processing ensures resilience to temporary failures.
- Rate limiting helps avoid server overload.
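A per-URL try/except with a failure log captures the idea behind all three points. This is a sketch rather than the crawler's actual code; the `success`/`error_message` attributes follow Crawl4Ai's result object, and the output file name mirrors the `failed_urls_{timestamp}.txt` pattern above:

```python
import asyncio
from datetime import datetime
from pathlib import Path

from crawl4ai import AsyncWebCrawler

async def crawl_with_failure_log(urls: list[str], out_dir: Path) -> None:
    failed: list[str] = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            try:
                result = await crawler.arun(url=url)
                if not result.success:
                    failed.append(f"{url}\t{result.error_message}")
            except Exception as exc:  # one bad page shouldn't stop the run
                failed.append(f"{url}\t{exc}")
    if failed:
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        (out_dir / f"failed_urls_{stamp}.txt").write_text("\n".join(failed))
```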