A Helm chart for deploying ClearlyDefined - a service that helps open source projects be more successful through clearly defined project data.
This Helm chart deploys the following components:
- ClearlyDefined Service - Main API service (port 4000)
- Crawler - Data harvesting crawler (port 5000)
- MongoDB - Database for definitions and curations (port 27017)
- Redis - Queue and caching service (port 6379)
- MongoDB Seed Job - Initial database seeding (optional, disabled by default)
Prerequisites:
- Kubernetes 1.19+
- Helm 3.0+
- PV provisioner support in the underlying infrastructure (for persistent storage)
- GitHub Personal Access Token with minimal permissions
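A quick preflight check for the tooling prerequisites (a sketch; adjust for your cluster and context):

```bash
# Client/server versions (need Kubernetes 1.19+, Helm 3.0+)
kubectl version
helm version

# A StorageClass must exist for PV provisioning
kubectl get storageclass
```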
Clone the repository:

```bash
git clone <repository-url>
cd clearlydefined-helm
```

Create a secret containing your GitHub tokens before installing the chart:
```bash
kubectl create secret generic clearlydefined-secrets \
  --namespace <your-namespace> \
  --from-literal=CURATION_GITHUB_TOKEN="ghp_your_token_here" \
  --from-literal=CRAWLER_GITHUB_TOKEN="ghp_your_token_here" \
  --from-literal=WEBHOOK_GITHUB_SECRET="any-random-string-here" \
  --from-literal=WEBHOOK_CRAWLER_SECRET="any-random-string-here" \
  --from-literal=CRAWLER_SERVICE_AUTH_TOKEN="some-shared-token" \
  --from-literal=CRAWLER_API_AUTH_TOKEN="some-shared-token"
```

See the Secrets section below for the full list of supported keys.
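To confirm the secret was created with the expected keys (`describe` lists key names and sizes without revealing values):

```bash
kubectl describe secret clearlydefined-secrets --namespace <your-namespace>
```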
Create a file named my-values.yaml:
```yaml
# Reference your secret (must match the name created above)
existingSecret: "clearlydefined-secrets"

# Configure your own curated data repository
config:
  curation:
    github:
      owner: "your-github-username"
      repo: "your-curated-data-repo"
      branch: "main"

# Traefik ingress (choose one option)
# Option 1: Traefik with Let's Encrypt
useTraefikLe: true
leHost: "clearlydefined.example.com"
projectProtocol: "https"

# Option 2: Traefik behind load balancer (uncomment instead of above)
# traefikBehindLb: true
# leHost: "clearlydefined.example.com"
```

Then install the chart:

```bash
helm install clearlydefined . -f my-values.yaml
```

The chart uses the following images by default:
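A quick way to verify the release (assuming the release name `clearlydefined` used above):

```bash
# Watch the pods come up
kubectl get pods -l app.kubernetes.io/instance=clearlydefined -w

# Review release status and rendered notes
helm status clearlydefined
```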
| Component | Image |
|---|---|
| Service | registry.relizahub.com/83d27192-7d06-4d01-a5d6-de4839926da2-public/clearlydefined-service:latest |
| Crawler | registry.relizahub.com/83d27192-7d06-4d01-a5d6-de4839926da2-public/clearlydefined-crawler:latest |
| MongoDB | mongo:5.0.6 |
| Redis | redis:latest |
| Parameter | Description | Default |
|---|---|---|
| `service.enabled` | Enable the ClearlyDefined service | `true` |
| `service.replicaCount` | Number of service replicas | `1` |
| `crawler.enabled` | Enable the crawler | `true` |
| `crawler.replicaCount` | Number of crawler replicas | `1` |
| `mongodb.enabled` | Enable MongoDB | `true` |
| `mongodb.persistence.enabled` | Enable MongoDB persistence | `true` |
| `mongodb.persistence.size` | MongoDB PVC size | `10Gi` |
| `redis.enabled` | Enable Redis | `true` |
| `redis.persistence.enabled` | Enable Redis persistence | `true` |
| `redis.persistence.size` | Redis PVC size | `5Gi` |
| `harvestedData.persistence.enabled` | Enable harvested data persistence | `true` |
| `harvestedData.persistence.size` | Harvested data PVC size | `20Gi` |
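Any parameter in the table can also be overridden on the command line instead of in `my-values.yaml`, for example:

```bash
helm install clearlydefined . -f my-values.yaml \
  --set service.replicaCount=2 \
  --set mongodb.persistence.size=20Gi
```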
For complete environment variable documentation, see the ClearlyDefined Installation Guide.
This chart does not manage secrets directly. Instead, it references an existing Kubernetes secret specified by `existingSecret` in your values (default: `clearlydefined-secrets`).
Create the secret before installing the chart:
```bash
kubectl create secret generic clearlydefined-secrets \
  --namespace <your-namespace> \
  --from-literal=CURATION_GITHUB_TOKEN="ghp_your_token_here" \
  --from-literal=CRAWLER_GITHUB_TOKEN="ghp_your_token_here" \
  --from-literal=WEBHOOK_GITHUB_SECRET="any-random-string-here" \
  --from-literal=WEBHOOK_CRAWLER_SECRET="any-random-string-here" \
  --from-literal=CRAWLER_SERVICE_AUTH_TOKEN="some-shared-token" \
  --from-literal=CRAWLER_API_AUTH_TOKEN="some-shared-token"
```

Required keys:
- `CURATION_GITHUB_TOKEN` - GitHub PAT with read/write access to your curated data repo
  - Get token from: https://github.com/settings/tokens
  - Needs Contents (read/write) and Pull Requests (read/write) permissions on your curated data repo
  - Reference: ClearlyDefined Docs - Setting up environmental variables
- `CRAWLER_GITHUB_TOKEN` - GitHub PAT with read access for crawling repositories
  - Can use the same token as `CURATION_GITHUB_TOKEN`
  - Read-only access is sufficient
  - Reference: ClearlyDefined Docs - Setting up environmental variables
- `WEBHOOK_GITHUB_SECRET` - Shared secret for GitHub webhook payload verification
  - Can be any arbitrary string if not using webhooks yet
  - Must match the "Secret" field in your GitHub webhook settings when configured
- `WEBHOOK_CRAWLER_SECRET` - Shared secret for crawler webhook payload verification
  - Can be any arbitrary string if not using webhooks yet
- `CRAWLER_SERVICE_AUTH_TOKEN` - Token the crawler uses to validate incoming requests
  - Must match `CRAWLER_API_AUTH_TOKEN`
  - Can be any arbitrary string, but both values must be identical
- `CRAWLER_API_AUTH_TOKEN` - Token the service sends to the crawler for authentication
  - Must match `CRAWLER_SERVICE_AUTH_TOKEN`
  - Can be any arbitrary string, but both values must be identical
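Since the webhook and crawler tokens can be arbitrary strings, one way to generate them is with `openssl`, using a shell variable to guarantee the two crawler tokens stay identical (a sketch):

```bash
# Random value for the GitHub webhook secret
WEBHOOK_SECRET="$(openssl rand -hex 32)"
# One shared value for both crawler auth tokens (they must match)
CRAWLER_TOKEN="$(openssl rand -hex 32)"

kubectl create secret generic clearlydefined-secrets \
  --namespace <your-namespace> \
  --from-literal=CURATION_GITHUB_TOKEN="ghp_your_token_here" \
  --from-literal=CRAWLER_GITHUB_TOKEN="ghp_your_token_here" \
  --from-literal=WEBHOOK_GITHUB_SECRET="$WEBHOOK_SECRET" \
  --from-literal=WEBHOOK_CRAWLER_SECRET="$(openssl rand -hex 32)" \
  --from-literal=CRAWLER_SERVICE_AUTH_TOKEN="$CRAWLER_TOKEN" \
  --from-literal=CRAWLER_API_AUTH_TOKEN="$CRAWLER_TOKEN"
```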
Optional keys:
- `GITLAB_TOKEN` - GitLab token
  - Can be a random string if not working with the GitLab API
  - Only needed if harvesting from GitLab repositories
  - Reference: ClearlyDefined Docs
- `CRAWLER_WEBHOOK_TOKEN` - Webhook authentication token
  - Used to secure GitHub webhook endpoints
  - Reference: ClearlyDefined Docs - GitHub curation setup
- `CRAWLER_AZBLOB_CONNECTION_STRING` - Azure Blob Storage connection string
  - Only needed if using Azure Blob Storage for harvest data (production deployments)
  - Format: `DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...`
  - TODO: Find specific documentation reference for Azure Blob Storage configuration
- `CRAWLER_INSIGHTS_CONNECTION_STRING` - Application Insights connection string
  - Only needed for Azure Application Insights monitoring
  - Format: `InstrumentationKey=...`
  - TODO: Find specific documentation reference for Application Insights configuration
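If you need to add an optional key later, one approach is to regenerate the secret client-side with the full set of keys and apply it (a sketch; include every key you need, since this rewrites the secret's data):

```bash
kubectl create secret generic clearlydefined-secrets \
  --namespace <your-namespace> \
  --from-literal=CURATION_GITHUB_TOKEN="ghp_your_token_here" \
  --from-literal=GITLAB_TOKEN="your_gitlab_token_here" \
  --dry-run=client -o yaml | kubectl apply -f -
# --dry-run=client renders the Secret manifest locally; kubectl apply updates it in place
```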
All non-sensitive configuration values are in `values.yaml` under the `config` section:
Curation Settings:
- `config.curation.github.branch` - Branch for curated data (default: `"master"`)
- `config.curation.github.owner` - GitHub owner/org for the curated data repo (default: `"clearlydefined"`)
- `config.curation.github.repo` - Repository name for curated data (default: `"curated-data-dev"`)
- Important: Change these to point to your own curated data repository
- `config.curation.provider` - Curation provider type (default: `"github"`)
- `config.curation.store.provider` - Storage backend for curations (default: `"mongo"`)
Database Settings:
- `config.curation.store.connectionString` - MongoDB connection string for curations
- `config.definition.store.connectionString` - MongoDB connection string for definitions
- Database names and collection names are configurable
Storage Settings:
- `config.harvest.store.provider` - Harvest storage provider (default: `"file"`)
  - Options: `"file"`, `"azblob"`
- `config.fileStore.location` - Path for file-based harvest storage (default: `"/tmp/harvested_data"`)
Crawler Settings:
- `config.crawler.apiUrl` - Internal URL for the crawler service
- `config.crawler.name` - Crawler instance name
- `config.crawler.queueProvider` - Queue provider (default: `"memory"`)
  - Options: `"memory"`, `"redis"`, `"amqp"`
  - TODO: Find documentation for queue provider options
- `config.crawler.storeProvider` - Crawler storage provider
- `config.crawler.host` - External crawler host (optional)
- `config.crawler.webhookUrl` - Webhook callback URL (optional)
- `config.crawler.queuePrefix` - Queue name prefix (optional)
- `config.crawler.azblob.containerName` - Azure Blob container name (optional)
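Pulling the defaults above together, a minimal `config` block in `my-values.yaml` might look like the sketch below (the nesting is inferred from the dotted key paths listed above; swap in `"redis"` for the queue provider if you want the bundled Redis to back the crawler queue):

```yaml
config:
  curation:
    provider: "github"
    store:
      provider: "mongo"
  harvest:
    store:
      provider: "file"
  fileStore:
    location: "/tmp/harvested_data"
  crawler:
    queueProvider: "memory"
```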
For detailed information about each variable, see the ClearlyDefined Installation Guide referenced above.
To reach the services locally, use port-forwarding:

```bash
# Access the Service API
kubectl port-forward svc/clearlydefined-service 4000:4000

# Access the Crawler API
kubectl port-forward svc/clearlydefined-crawler 5000:5000
```

This chart uses Traefik IngressRoute resources for ingress. You have two options:
For automatic HTTPS with Let's Encrypt certificate resolver:
```yaml
useTraefikLe: true
leHost: "clearlydefined.example.com"
projectProtocol: "https" # Enables HTTP to HTTPS redirect
traefik_crd_api_version: "traefik.containo.us/v1alpha1"
```

This creates:
- HTTP IngressRoute (port 80) with redirect to HTTPS
- HTTPS IngressRoute (port 443) with Let's Encrypt TLS
- Middleware for HTTP to HTTPS redirection
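To confirm the resources were created (assuming the `traefik.containo.us` CRD API version configured above):

```bash
kubectl get ingressroutes.traefik.containo.us
kubectl get middlewares.traefik.containo.us
```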
For Traefik behind a load balancer that handles TLS termination:
```yaml
traefikBehindLb: true
leHost: "clearlydefined.example.com"
traefik_crd_api_version: "traefik.containo.us/v1alpha1"
```

This creates:
- HTTP IngressRoute (port 80) only
- No TLS configuration (handled by load balancer)
Note: Ensure Traefik is installed in your cluster with the appropriate CRDs and cert resolver configured.
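Once DNS points your host at Traefik, a quick smoke test (using the example hostname from above; with Option 1 and `projectProtocol: "https"`, expect the HTTP request to redirect to HTTPS):

```bash
curl -sSI http://clearlydefined.example.com/ | head -n 3
curl -sSI https://clearlydefined.example.com/ | head -n 3
```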
To enable Node.js debugging for the service or crawler:
```yaml
service:
  debug:
    enabled: true
    port: 9230

crawler:
  debug:
    enabled: true
    port: 9229
```

Then port-forward the debug port:

```bash
kubectl port-forward svc/clearlydefined-service 9230:9230
kubectl port-forward svc/clearlydefined-crawler 9229:9229
```

To tail component logs:

```bash
# Service logs
kubectl logs -l app.kubernetes.io/component=service -f

# Crawler logs
kubectl logs -l app.kubernetes.io/component=crawler -f

# MongoDB logs
kubectl logs -l app.kubernetes.io/component=mongodb -f
```

The chart creates three PersistentVolumeClaims:
- MongoDB Data - Stores database data
- Redis Data - Stores Redis data
- Harvested Data - Shared volume for harvested data between service and crawler
The harvested data volume uses `ReadWriteOnce` access mode by default. This means:
- The volume can only be mounted by pods on a single node
- Both the service and crawler pods must be scheduled on the same Kubernetes node
- This is suitable for single-node clusters or when using node affinity rules
The chart automatically configures pod affinity rules to ensure both pods are scheduled on the same node when using ReadWriteOnce. However, be aware:
- If one pod is running and the node becomes unavailable, both pods will need to be rescheduled
- For multi-node production clusters, consider using a storage class that supports `ReadWriteMany` (NFS, Azure Files, EFS, etc.)
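To verify that the service and crawler pods landed on the same node, check the NODE column in the wide pod listing:

```bash
kubectl get pods -l app.kubernetes.io/instance=clearlydefined -o wide
```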
To use `ReadWriteMany` (requires a compatible storage class):

```yaml
harvestedData:
  persistence:
    accessMode: ReadWriteMany
    storageClass: "nfs" # or azure-file, efs-sc, etc.
```

To use a specific storage class:
```yaml
global:
  storageClass: "fast-ssd"
```

Or per component:
```yaml
mongodb:
  persistence:
    storageClass: "standard"

harvestedData:
  persistence:
    storageClass: "standard"
```

To upgrade the release:

```bash
helm upgrade clearlydefined . -f my-values.yaml
```

To uninstall:

```bash
helm uninstall clearlydefined
```

Note: PersistentVolumeClaims are not automatically deleted. To delete them:
```bash
kubectl delete pvc -l app.kubernetes.io/instance=clearlydefined
```

To override the MongoDB image tag (for example, on machines whose CPUs lack AVX support, which MongoDB 5.0+ requires):

```yaml
images:
  mongodb:
    tag: "4.4.28" # For Mac computers without AVX support
```

The MongoDB seed job is disabled by default because it's only needed for development/testing with sample data.
When to enable:
- You want to test ClearlyDefined with sample data
- You're setting up a development environment
- You're not using your own curated data repository
When to keep disabled (default):
- You're using your own GitHub repository for curated data
- You're deploying to production
- You want to start with an empty database
```yaml
mongoSeed:
  enabled: true # Enable for development/testing with sample data
```

Note about the seed image:
- The seed image (`clearlydefined/docker_dev_env_experiment_clearlydefined_mongo_seed`) is from the ClearlyDefined docker_dev_env_experiment repository
- It populates MongoDB collections with sample data for testing
- See the Container Documentation for more information
- If you need to build your own seed image, follow the instructions in the GitHub repository above
To disable persistence entirely (data will not survive pod restarts):

```yaml
mongodb:
  persistence:
    enabled: false

redis:
  persistence:
    enabled: false

harvestedData:
  persistence:
    enabled: false
```

To set resource requests and limits for the service and crawler:

```yaml
service:
  resources:
    requests:
      cpu: 1000m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi

crawler:
  resources:
    requests:
      cpu: 1000m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi
```

To store harvested data in Azure Blob Storage instead of the shared volume:

```yaml
config:
  harvest:
    store:
      provider: "azblob"
  crawler:
    storeProvider: "azblob"
    azblob:
      containerName: "clearlydefined-harvested"

secrets:
  crawlerAzblobConnectionString: "DefaultEndpointsProtocol=https;AccountName=..."
```

If the MongoDB seed job fails, check the job logs:
```bash
kubectl logs -l app.kubernetes.io/component=mongo-seed
```

The seed job runs as a Helm hook and will retry on failure.
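After fixing the underlying issue, one way to re-trigger the hook is to upgrade the release with seeding enabled (a sketch, reusing the release name and values file from the install step):

```bash
helm upgrade clearlydefined . -f my-values.yaml --set mongoSeed.enabled=true
```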
If using `ReadWriteMany` access mode, ensure your storage class supports it (e.g., NFS, Azure Files, EFS).
For single-node clusters or development, you can use `ReadWriteOnce`:
```yaml
harvestedData:
  persistence:
    accessMode: ReadWriteOnce
```

Verify MongoDB is running:
```bash
kubectl get pods -l app.kubernetes.io/component=mongodb
```

Check the connection string in the ConfigMap:

```bash
kubectl get configmap clearlydefined-config -o yaml
```

For more information about ClearlyDefined:

- Website: https://clearlydefined.io
- Documentation: https://docs.clearlydefined.io
- Source: https://github.com/clearlydefined
Rate limiting is controlled via the following environment variables on the service (not currently implemented in this chart):
| Env Var | Purpose | Default |
|---|---|---|
| `RATE_LIMIT_WINDOW` | General API window in seconds | 1 |
| `RATE_LIMIT_MAX` | Max requests per window | 0 (disabled) |
| `BATCH_RATE_LIMIT_WINDOW` | Batch API window in seconds | 1 |
| `BATCH_RATE_LIMIT_MAX` | Max batch requests per window | 0 (disabled) |

Key detail: when the max is 0, rate limiting is disabled; the limiter is configured with `skip: () => true`.
So by default, rate limiting is off. To enable it, set e.g.:
```bash
RATE_LIMIT_WINDOW=300
RATE_LIMIT_MAX=1000
BATCH_RATE_LIMIT_WINDOW=300
BATCH_RATE_LIMIT_MAX=250
```

This would allow 1000 requests per 5 minutes globally, and 250 batch POSTs per 5 minutes.
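If you wire these into the chart yourself, one hypothetical shape is an extra-env list on the service (the `service.extraEnv` key is an assumption for illustration, not an existing chart value):

```yaml
service:
  extraEnv:
    # Hypothetical values key; the chart does not currently expose this
    - name: RATE_LIMIT_WINDOW
      value: "300"
    - name: RATE_LIMIT_MAX
      value: "1000"
    - name: BATCH_RATE_LIMIT_WINDOW
      value: "300"
    - name: BATCH_RATE_LIMIT_MAX
      value: "250"
```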
Licensed under MIT.