andrii-kravets-x/k8s-python-gpu-scheduler


Custom k8s scheduler in Python

1. Task explained in task.md

2. Resulting logs shown here

3. Steps

  1. Spin up an Ubuntu VM in Proxmox (QEMU, cloud-init, etc.). VM size tested with kind: 4 CPUs, 8 GB RAM, 30 GB disk.
    Note: you can also install everything locally on your laptop (not tested).

    Info on software versions:

    > lsb_release -a
    No LSB modules are available.
    Distributor ID:	Ubuntu
    Description:	Ubuntu 25.04
    Release:	25.04
    Codename:	plucky
    
    > docker --version 
    Docker version 28.3.3, build 980b856
    
    > kind version 
    kind v0.29.0 go1.24.2 linux/amd64
    
    > kubectl version 
    Client Version: v1.33.3
    Kustomize Version: v5.6.0
    Server Version: v1.33.1
  2. Install Docker, Kind, kubectl

    curl -fsSL https://get.docker.com -o get-docker.sh
    sudo sh get-docker.sh
    
    sudo usermod -aG docker ubuntu
    # logout / login
    
    curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.29.0/kind-linux-amd64
    chmod +x ./kind
    sudo mv ./kind /usr/local/bin/kind
    
    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
    chmod +x kubectl
    sudo mv ./kubectl /usr/local/bin/kubectl
    
    # Create/delete cluster to test that it works
    kind create cluster 
    kind delete cluster   
  3. Copy repo to VM

    rsync -avz ./ ubuntu@your_vm_ip:/home/ubuntu/k8s-scheduler
    # or use git clone
  4. Start Kind cluster with 5 nodes:

    # inside repo_dir
    kind create cluster --config kind-config.yaml
    
    kubectl cluster-info --context kind-kind
    
    # check status, before doing anything else
    kubectl get nodes
    kubectl get pods -owide -A
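
    The contents of kind-config.yaml are not shown in this README. As a rough sketch only, a 5-node kind config could be generated like this, assuming a split of 1 control-plane node and 4 workers (the split is an assumption, and `render_kind_config` is a hypothetical helper, not part of the repo):

```python
def render_kind_config(workers: int = 4) -> str:
    """Return kind cluster config text: one control-plane node plus `workers` workers.

    Assumption: the repo's real kind-config.yaml may use a different layout.
    """
    if workers < 0:
        raise ValueError("workers must be non-negative")
    lines = [
        "kind: Cluster",
        "apiVersion: kind.x-k8s.io/v1alpha4",
        "nodes:",
        "  - role: control-plane",
    ]
    # One list entry per worker node.
    lines += ["  - role: worker"] * workers
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the config so it can be redirected into a file.
    print(render_kind_config(), end="")
```

    The output can be redirected to a file and passed to `kind create cluster --config`.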
  5. Build Docker images:

    docker build -t gpu-scheduler:latest ./gpu-scheduler
    docker build -t gpu-scheduler-check:latest ./gpu-scheduler-check
  6. Load images into Kind nodes:

    kind load docker-image gpu-scheduler:latest
    kind load docker-image gpu-scheduler-check:latest
  7. Apply manifests:

    # manifests/rbac.yaml is currently broken (see Notes 4.2), so it is skipped for now:
    # kubectl apply -f manifests/rbac.yaml
    # Temporary workaround: use a k8sadmin service account bound to cluster-admin
    kubectl create serviceaccount k8sadmin -n kube-system
    kubectl create clusterrolebinding k8sadmin-binding --clusterrole=cluster-admin --serviceaccount=kube-system:k8sadmin
    
    kubectl apply -f manifests/gpu-scheduler.yaml
    kubectl apply -f manifests/gpu-scheduler-check.yaml
  8. Verify deployment:

    # run each command in its own terminal window
    # or in separate tmux panes
    kubectl get pods -owide -w -A
    kubectl events -w -A
    kubectl logs -f deployments/gpu-scheduler -n kube-system
    kubectl logs statefulsets/gpu-scheduler-check --all-pods=true -f --timestamps

4. Notes

4.1. If you get "Failed to create inotify object: Too many open files"

  • link
    sudo sysctl fs.inotify.max_user_instances=8192
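
    To see whether the limit actually needs raising, the current value can be read from procfs first. A minimal sketch, assuming a Linux host; `read_inotify_limit` and `needs_raise` are hypothetical helper names, and 8192 is simply the value used in the sysctl command above:

```python
from pathlib import Path

# Standard procfs location for the fs.inotify.max_user_instances sysctl on Linux.
DEFAULT_PATH = "/proc/sys/fs/inotify/max_user_instances"


def read_inotify_limit(path: str = DEFAULT_PATH) -> int:
    """Read the current inotify instance limit as an integer."""
    return int(Path(path).read_text().strip())


def needs_raise(current: int, wanted: int = 8192) -> bool:
    """Return True when the current limit is below the wanted value."""
    return current < wanted
```

    For example, `needs_raise(read_inotify_limit())` returns True when the sysctl command above would increase the limit.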

4.2. TODO: Debug RBAC, see spoiler

Spoiler
kubectl auth can-i --list  --as=system:serviceaccount:default:gpu-scheduler
Resources                                       Non-Resource URLs                      Resource Names   Verbs
events                                          []                                     []               [create update patch]
pods/binding                                    []                                     []               [create]
selfsubjectreviews.authentication.k8s.io        []                                     []               [create]
selfsubjectaccessreviews.authorization.k8s.io   []                                     []               [create]
selfsubjectrulesreviews.authorization.k8s.io    []                                     []               [create]
pods                                            []                                     []               [get list watch update patch]
nodes                                           []                                     []               [get list watch]
                                                [/.well-known/openid-configuration/]   []               [get]
                                                [/.well-known/openid-configuration]    []               [get]
                                                [/api/*]                               []               [get]
                                                [/api]                                 []               [get]
                                                [/apis/*]                              []               [get]
                                                [/apis]                                []               [get]
                                                [/healthz]                             []               [get]
                                                [/healthz]                             []               [get]
                                                [/livez]                               []               [get]
                                                [/livez]                               []               [get]
                                                [/openapi/*]                           []               [get]
                                                [/openapi]                             []               [get]
                                                [/openid/v1/jwks/]                     []               [get]
                                                [/openid/v1/jwks]                      []               [get]
                                                [/readyz]                              []               [get]
                                                [/readyz]                              []               [get]
                                                [/version/]                            []               [get]
                                                [/version/]                            []               [get]
                                                [/version]                             []               [get]
                                                [/version]                             []               [get]
pods/status                                     []                                     []               [update]

kubectl describe clusterrole gpu-scheduler
Name:         gpu-scheduler
Labels:       <none>
Annotations:  <none>
PolicyRule:
  Resources     Non-Resource URLs  Resource Names  Verbs
  ---------     -----------------  --------------  -----
  events        []                 []              [create update patch]
  pods/binding  []                 []              [create]
  pods          []                 []              [get list watch update patch]
  nodes         []                 []              [get list watch]
  pods/status   []                 []              [update]


kubectl auth can-i bind pods --as=system:serviceaccount:default:gpu-scheduler
no
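
One thing worth checking while debugging: the ClusterRole above grants `create` on `pods/binding`, whereas the last query asks about the verb `bind` on `pods`, so a `no` there is consistent with the role as listed and may not be the real failure. The sketch below builds a `kubectl auth can-i` query that targets the subresource instead. It is a debugging aid under assumptions, not a confirmed fix: the helper names are hypothetical, and the service account namespace (`default` here, as in the queries above) may need to match wherever the scheduler actually runs.

```python
import subprocess
from typing import List, Optional


def can_i_command(verb: str, resource: str, service_account: str,
                  namespace: str, subresource: Optional[str] = None) -> List[str]:
    """Build a `kubectl auth can-i` command that impersonates a service account."""
    cmd = ["kubectl", "auth", "can-i", verb, resource]
    if subresource:
        # --subresource selects e.g. pods/binding rather than a pod named "binding".
        cmd.append(f"--subresource={subresource}")
    cmd.append(f"--as=system:serviceaccount:{namespace}:{service_account}")
    return cmd


def can_i(cmd: List[str]) -> bool:
    """Run the command; kubectl prints "yes" or "no" on stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout.strip() == "yes"


if __name__ == "__main__":
    # Assumption: the service account is gpu-scheduler in the default namespace.
    query = can_i_command("create", "pods", "gpu-scheduler", "default",
                          subresource="binding")
    print(" ".join(query))
```

Running the printed command against the cluster (or calling `can_i(query)`) shows whether the binding permission itself is in place for that account.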

5. Materials used
