High-Performance Deep Learning with SLURM

Overview

This repository provides a guide to deep learning with PyTorch, along with best practices for running workloads on an HPC cluster using SLURM. It includes:

  • Deep Learning Basics: A Jupyter notebook covering foundational concepts.
  • SLURM Job Scheduling: Guides and scripts for distributed training.
  • Module Management: Best practices for handling dependencies on HPC clusters.

Repository Structure

/01_introduction/
 ├── 01_SLURM.md                            # SLURM job scheduling guide
 ├── 02_Modules.md                          # Guide to managing modules
 ├── 03_introduction_to_DeepLearning.ipynb  # Jupyter notebook on DL basics
 ├── 04_slurm_cheatbook.pdf                 # SLURM command reference
 └── README.md                              # Project documentation


Contents

🔹 Deep Learning Topics Covered

  • Understanding Tensors in PyTorch
  • Forward & Backward Propagation
  • Loss Functions & Optimization
  • Leveraging NVIDIA Tensor Cores from PyTorch
  • Building a Simple Neural Network
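
As a rough illustration of the forward pass, loss, backward pass, and optimization step listed above, here is a minimal sketch in plain Python. It is not taken from the notebook: the toy data, the `train` and `mse` helpers, and the learning rate are all assumptions chosen for illustration. The gradients are written out by hand so the mechanics are visible; in PyTorch, autograd would compute them for you.

```python
# Illustrative sketch only: fit y = w * x + b by hand-written gradient descent.
# The data, function names, and hyperparameters are assumptions, not repo code.

def mse(w, b, xs, ys):
    """Mean squared error of the linear model on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, lr=0.05, epochs=2000):
    """Run full-batch gradient descent and return the fitted (w, b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Forward pass: compute predictions with the current parameters.
        preds = [w * x + b for x in xs]
        # Backward pass: derivatives of the MSE loss with respect to w and b.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Optimization step: move against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical example values).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = train(xs, ys)
print(f"w={w:.4f}, b={b:.4f}, loss={mse(w, b, xs, ys):.2e}")
```

The same loop structure (forward, loss, backward, step) carries over to PyTorch, where tensors replace the lists and an optimizer replaces the manual updates.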

🔹 SLURM & HPC Topics Covered

  • Managing Job Queues & Partitions
  • Writing & Submitting SLURM Jobs
  • Monitoring & Debugging Jobs
  • Using SLURM for Distributed Training
  • Managing Dependencies with Modules
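
To give a feel for what writing a SLURM job involves, the sketch below assembles a batch script as a string using commonly used `#SBATCH` directives, `module load`, and `srun`. It is an assumption-laden illustration rather than a script from this repository: the `render_sbatch` helper, the partition name `gpu`, the module name `cuda`, and `train.py` are all hypothetical, and the partitions and modules actually available depend on your cluster.

```python
# Illustrative sketch only: build the text of a SLURM batch script.
# render_sbatch, the partition "gpu", the module "cuda", and train.py
# are hypothetical names; check your cluster's documentation for real ones.

def render_sbatch(job_name, command, nodes=1, gpus_per_node=1,
                  time_limit="01:00:00", partition=None, modules=()):
    """Return a batch script that runs `command` once per requested GPU."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # One task per GPU is a common layout for distributed training.
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        # %x expands to the job name and %j to the job ID.
        "#SBATCH --output=%x_%j.out",
    ]
    if partition:
        lines.append(f"#SBATCH --partition={partition}")
    lines.append("")
    # Load software dependencies before launching the workload.
    lines.extend(f"module load {name}" for name in modules)
    # srun starts the command on every allocated task.
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch("train", "python train.py", nodes=2,
                        gpus_per_node=4, partition="gpu", modules=["cuda"]))
```

Saving the output to a file and passing it to `sbatch` would submit the job; `squeue` can then be used to monitor it.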

Prerequisites

To use this repository effectively, you should have:

  • Basic Python knowledge
  • Familiarity with PyTorch

Additional Resources

  • 📚 PyTorch Documentation
  • 📚 SLURM Official Guide
  • 📚 Deep Learning Book by Ian Goodfellow