This repository provides a guide to deep learning with PyTorch, along with best practices for running workloads on an HPC cluster using SLURM. It includes:
- Deep Learning Basics: Jupyter notebooks covering foundational concepts.
- SLURM Job Scheduling: Guides and scripts for distributed training.
- Module Management: Best practices for handling dependencies on HPC clusters.
/01_introduction/
├── 01_SLURM.md # SLURM job scheduling guide
├── 02_Modules.md # Guide on managing modules
├── 03_introduction_to_DeepLearning.ipynb # Jupyter Notebook on DL basics
├── 04_slurm_cheatbook.pdf # SLURM command reference
└── README.md # Project documentation
The deep learning notebook covers:
- Understanding Tensors in PyTorch
- Forward & Backward Propagation
- Loss Functions & Optimization
- Leveraging NVIDIA Tensor Cores from PyTorch
- Building a Simple Neural Network
The SLURM and module guides cover:
- Managing Job Queues & Partitions
- Writing & Submitting SLURM Jobs
- Monitoring & Debugging Jobs
- Using SLURM for Distributed Training
- Managing Dependencies with Modules
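
As a rough illustration of how the SLURM and module topics above fit together, the following minimal sketch renders a batch script from a few job settings. It is not taken from the guides in this repository: `SlurmJob`, `render_sbatch`, the partition name `gpu`, the module names `cuda` and `python`, and the `train.py` command are all hypothetical placeholders. The `#SBATCH` directives, `module purge`, `module load`, and `srun` are standard, but partition and module names differ on every cluster.

```python
from dataclasses import dataclass, field


@dataclass
class SlurmJob:
    """Hypothetical container for the settings of one training job."""
    job_name: str
    partition: str                      # partition (queue) name; cluster-specific
    nodes: int = 1
    gpus_per_node: int = 1
    time_limit: str = "01:00:00"        # wall-clock limit as HH:MM:SS
    modules: list = field(default_factory=list)  # names passed to `module load`
    command: str = "python train.py"    # placeholder training command


def render_sbatch(job: SlurmJob) -> str:
    """Return the text of a batch script that launches one task per GPU."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.job_name}",
        f"#SBATCH --partition={job.partition}",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --ntasks-per-node={job.gpus_per_node}",
        f"#SBATCH --gres=gpu:{job.gpus_per_node}",
        f"#SBATCH --time={job.time_limit}",
        "#SBATCH --output=%x_%j.out",   # %x = job name, %j = job ID
        "",
        "module purge",                 # start from a clean module environment
    ]
    lines += [f"module load {name}" for name in job.modules]
    lines += ["", f"srun {job.command}"]
    return "\n".join(lines) + "\n"


# Assumed example values; replace them with what your cluster provides.
job = SlurmJob(job_name="mnist", partition="gpu", nodes=2,
               gpus_per_node=4, modules=["cuda", "python"])
print(render_sbatch(job))
```

The printed text can be saved to a file such as `job.sh`, submitted with `sbatch job.sh`, and monitored with `squeue -u $USER`. Generating the script from one small data structure keeps the node count, GPU count, and module list in a single place, which can help when the same training run is resubmitted at different scales.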
To use this repository effectively, you should have:
- Basic knowledge of Python
- Familiarity with PyTorch
- 📚 PyTorch Documentation
- 📚 SLURM Official Guide
- 📚 Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville