# Supplementary Material for Lectures
[![](https://dcbadge.vercel.app/api/server/gpumode?style=flat)](https://discord.gg/gpumode)

[YouTube Channel](https://www.youtube.com/@GPUMODE)

The PMPP Book: [Programming Massively Parallel Processors: A Hands-on Approach](https://a.co/d/2S2fVzt) (Amazon link)


## Lecture 1: Profiling and Integrating CUDA kernels in PyTorch
- Speaker: [Mark Saroufim](https://twitter.com/marksaroufim)
- Notebook and slides in [lecture_001](./lecture_001/) folder

## Lecture 2: Recap Ch. 1-3 from the PMPP book
- Speaker: [Andreas Koepf](https://twitter.com/neurosp1ke)
- Slides: The PowerPoint file is at [lecture_002/cuda_mode_lecture2.pptx](./lecture_002/cuda_mode_lecture2.pptx) in this repository. It is also available as a [Google Slides presentation](https://docs.google.com/presentation/d/1deqvEHdqEC4LHUpStO6z3TT77Dt84fNAvTIAxBJgDck/edit#slide=id.g2b1444253e5_1_75).

## Lecture 3: Getting Started With CUDA
- Speaker: [Jeremy Howard](https://twitter.com/jeremyphoward)
- Notebook: See the [lecture_003](./lecture_003/) folder, or run the [Colab version](https://colab.research.google.com/drive/180uk6frvMBeT4tywhhYXmz3PJaCIA_uk?usp=sharing)

## Lecture 4: Intro to Compute and Memory Architecture
- Speaker: [Thomas Viehmann](https://lernapparat.de/)
- Notebook and slides in the [lecture_004](./lecture_004/) folder.

## Lecture 5: Going Further with CUDA for Python Programmers
- Speaker: [Jeremy Howard](https://twitter.com/jeremyphoward)
- Notebook in the [lecture_005](./lecture_005/) folder.

## Lecture 6: Optimizing PyTorch Optimizers
- Speaker: [Jane Xu](https://github.com/janeyx99)
- [Slides](https://docs.google.com/presentation/d/13WLCuxXzwu5JRZo0tAfW0hbKHQMvFw4O/edit#slide=id.p1)

## Lecture 7: Advanced Quantization
- Speaker: [Charles Hernandez](https://github.com/HDCharles)
- [Slides](https://www.dropbox.com/scl/fi/hzfx1l267m8gwyhcjvfk4/Quantization-Cuda-vs-Triton.pdf?rlkey=s4j64ivi2kpp2l0uq8xjdwbab&dl=0)

## Lecture 8: CUDA Performance Checklist
- Speaker: [Mark Saroufim](https://github.com/msaroufim)
- Code in the [lecture_008](./lecture_008/) folder
- [Slides](https://docs.google.com/presentation/d/1cvVpf3ChFFiY4Kf25S4e4sPY6Y5uRUO-X-A4nJ7IhFE/edit?usp=sharing)

## Lecture 9: Reductions
- Speaker: [Mark Saroufim](https://github.com/msaroufim)
- Code in the [lecture_009](./lecture_009/) folder
- [Slides](https://docs.google.com/presentation/d/1s8lRU8xuDn-R05p1aSP6P7T5kk9VYnDOCyN5bWKeg3U/edit?usp=drive_link)

## Lecture 10: Build a Prod Ready CUDA Library
- Speaker: [Oscar Amoros Huguet](https://github.com/morousg)
- [Slides](https://drive.google.com/drive/folders/158V8BzGj-IkdXXDAdHPNwUzDLNmr971_?usp=drive_link)

## Lecture 11: Sparsity
- Speaker: [Jesse Cai](https://github.com/jcaip)
- [Slides](./lecture_011/sparsity.pptx)

## Lecture 12: Flash Attention
- Speaker: [Thomas Viehmann](https://lernapparat.de/)
- Code in the [lecture_012](./lecture_012/) folder

## Lecture 13: Ring Attention
- Speaker: [Andreas Koepf](https://twitter.com/neurosp1ke)
- [Slides](./lecture_013/ring_attention.pptx)

## Lecture 14: Practitioner's Guide to Triton
- Date: 2024-04-13, Speaker: [Umer Adil](https://twitter.com/UmerHAdil)
- [Notebook](./lecture_014/A_Practitioners_Guide_to_Triton.ipynb)

## Lecture 15: CUTLASS
- Speaker: [Eric Auld](https://github.com/ericauld)

## Lecture 16: Hands-On Profiling
- Speaker: [Taylor Robie](https://www.linkedin.com/in/taylor-robie/)

## Bonus Lecture: CUDA C++ llm.cpp
- Speakers: Jake Hemstad & Georgii Evtushenko
- [Slides](https://drive.google.com/drive/folders/1T-t0d_u0Xu8w_-1E5kAwmXNfF72x-HTA)

## Lecture 17: GPU Collective Communication (NCCL)
- Speaker: [Dan Johnson](https://physbam.stanford.edu/~dansj/)
- Code in the [lecture_017](./lecture_017/) folder

## Lecture 18: Fused Kernels
- Speaker: [Kapil Sharma](https://www.kapilsharma.dev/)
- Code in the [lecture_018](./lecture_018/) folder

## Lecture 19: Data Processing on GPUs
- Speaker: [Devavret Makkar](https://github.com/devavret)

## Lecture 20: Scan Algorithm
- Speaker: [Izzat El Hajj](https://ielhajj.github.io/)
- [Slides](https://docs.google.com/presentation/d/1MEMsE5LKi6ush_60hlYu3-cz4DUCFzSL/edit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)

## Lecture 21: Scan Algorithm Part 2
- Speaker: [Izzat El Hajj](https://ielhajj.github.io/)
- [Slides](https://docs.google.com/presentation/d/1MEMsE5LKi6ush_60hlYu3-cz4DUCFzSL/edit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)

## Lecture 22: Hacker's Guide to Speculative Decoding in vLLM
- Speaker: [Cade Daniel](https://x.com/cdnamz)
- [Slides](https://docs.google.com/presentation/d/1p1xE-EbSAnXpTSiSI0gmy_wdwxN5XaULO3AnCWWoRe4/edit#slide=id.p)

## Lecture 23: Tensor Cores
- Speakers: Vijay Thakkar & Pradeep Ramani
- [Slides](https://drive.google.com/file/d/18sthk6IUOKbdtFphpm_jZNXoJenbWR8m/view)

## Lecture 24: Scan at the Speed of Light
- Speakers: Jake Hemstad & Georgii Evtushenko

## Lecture 25: Speaking Composable Kernel
- Speaker: Haocong Wang
- [Slides](./lecture_025/AMD_ROCm_Speaking_Composable_Kernel_July_20_2024.pdf)

## Lecture 26: SYCL MODE (Intel GPU)
- Speaker: Patric Zhao
- [Slides](https://docs.google.com/presentation/d/1SW4XKomAJhhJSH5-jpZI9Qlwp7TEunbV/edit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)

## Lecture 27: gpu.cpp
- Speaker: [Austin Huang](https://x.com/austinvhuang)
- [Slides](https://gpucpp-presentation.answer.ai/)

## Lecture 28: Liger Kernel
- Speaker: [Byron Hsu](https://x.com/hsu_byron)
- [Slides](https://docs.google.com/presentation/d/1CGTV-uKw9crrBo13q1jAzAFCFzlpZFjeL4bnK67pTd8/edit?usp=sharing)
- Hands-on notebooks:
  1. [RMSNorm: Verifying Correctness and Performance](https://colab.research.google.com/drive/1CQYhul7MVG5F0gmqTBbx1O1HgolPgF0M?usp=sharing)
  2. [FusedLinearCrossEntropy: Verifying Memory Reduction](https://colab.research.google.com/drive/1Z2QtvaIiLm5MWOs7X6ZPS1MN3hcIJFbj?usp=sharing)
  3. [Convergence Comparison: Triton Kernel Patched vs. Original Model Layer-by-Layer](https://colab.research.google.com/drive/1e52FH0BcE739GZaVp-3_Dv7mc4jF1aif?usp=sharing)
  4. [Contiguity is the hidden killer](https://colab.research.google.com/drive/1llnAdo0hc9FpxYRRnjih0l066NCp7Ylu?usp=sharing)
  5. [Address int32 overflow](https://colab.research.google.com/drive/1WgaU_cmaxVzx8PcdKB5P9yHB6_WyGd4T?usp=sharing)

## Lecture 29: Triton Internals
- Speaker: [Kapil Sharma](https://www.kapilsharma.dev/)
- Code/presentation in the [lecture_029](./lecture_029/) folder

## Lecture 30: Quantized training
- Speaker: [Thien Tran](https://github.com/gau-nernst)
- Code/presentation in the [lecture_030](./lecture_030/) folder

## Lecture 31: Beginners Guide to Metal Kernels
- Speaker: [Nikita Shulga](https://github.com/malfet)
- Code/presentation in the [lecture_031](./lecture_031/) folder

## Lecture 32: Unsloth - LLM Systems Engineering
- Speaker: [Daniel Han](https://x.com/danielhanchen)
- [Slides](https://docs.google.com/presentation/d/1BvgbDwvOY6Uy6jMuNXrmrz_6Km_CBW0f2espqeQaWfc/edit?usp=sharing)

## Lecture 33: BitBLAS
- Speaker: [Wang Lei](https://github.com/LeiWang1999)
- Code/presentation in the [lecture_033](./lecture_033/) folder

## Lecture 34: Low Bit Triton Kernels
- Speaker: [Hicham Badri](https://github.com/mobicham)
- [Slides](https://docs.google.com/presentation/d/1R9B6RLOlAblyVVFPk9FtAq6MXR1ufj1NaT0bjjib7Vc/edit)

## Lecture 35: SGLang Performance Optimization
- Speaker: [Yineng Zhang](https://linkedin.com/in/zhyncs)
- [Slides](https://github.com/zhyncs/lectures/blob/main/lecture_035/SGLang-Performance-Optimization-YinengZhang.pdf)

## Lecture 36: CUTLASS and Flash Attention 3
- Speaker: [Jay Shah](https://research.colfax-intl.com/blog/)
- [Slides](lecture_036/)

## Lecture 37: Introduction to SASS & GPU Microarchitecture
- Speaker: [Arun Demeure](https://github.com/ademeure)
- [Slides](lecture_037/)

## Lecture 38: Low Bit Kernels for ARM CPU
- Speaker: [Scott Roy](https://github.com/metascroy)
- [Slides](lecture_038/)

## Lecture 39: TorchTitan
- Speakers: Mark Saroufim and Tianyu Liu

## Lecture 40: Flash Infer
- Speaker: [Zihao Ye](https://homes.cs.washington.edu/~zhye/)

## Lecture 41: CUDA Docs for Humans
- Speaker: [Charles Frye](https://x.com/charles_irl/status/1867306225706447023)
- [Slides](https://docs.google.com/presentation/d/15lTG6aqf72Hyk5_lqH7iSrc8aP1ElEYxCxch-tD37PE/edit#slide=id.g326210b960f_0_42)
 
## Lecture 42: Mosaic GPU
- Speaker: [Adam Paszke](https://x.com/apaszke)

## Lecture 43:
- Speaker: Erik Schultheis
- [Slides](lecture_042)

## Lecture 57: CuTE
- Speaker: Cris Cecka
- [Slides](lecture_057)

## Lecture 67: NCCL & NVSHMEM
- Speaker: Jeff Hammond
- [Slides](https://drive.google.com/file/d/1T8uHhFIeVa_g1oYb_O4d2Ltb8YQly1zK/view?usp=sharing)
- [Code](https://github.com/ParRes/Kernels/tree/main/Cxx11)

## Lecture 69: Quartet 4 bit training
- Speakers: Roberto Castro and Andrei Panferov
- Code: [Quartet](https://github.com/IST-DASLab/Quartet) and [qutlass](https://github.com/isT-DASLab/qutlass)
- [Paper](https://arxiv.org/abs/2505.14669)

## Lecture 70: Fault tolerant communication collectives
- Speaker: mike64_t
- [Slides](https://docs.google.com/presentation/d/1MKB51lhNOsV-Y_hscSaJk7wZskzxft2pFJQZKyvcMyo/edit?usp=sharing)

## Lecture 71: [ScaleML Series] FlexOlmo: Open Language Models for Flexible Data Use
- Speaker: [Sewon Min](https://www.sewonmin.com)
- [Slides](lecture_071)

## Lecture 72: [ScaleML Series] Efficient & Effective Long-Context Modeling for Large Language Models
- Speaker: [Guangxuan Xiao](https://guangxuanx.com)
- [Slides](lecture_072)

## Lecture 74: [ScaleML Series] Positional Encodings and PaTH Attention
- Speaker: [Songlin Yang](https://sustcsonglin.github.io)
- [Slides](lecture_074)

## Lecture 75: [ScaleML Series] GPU Programming Fundamentals + ThunderKittens
- Speaker 1: William Brandon
  - [Slides 1](https://docs.google.com/presentation/d/1ypi4IjEF36PUZGOJSaFxjNzk7BpO61TicdTBBf77oqc/)
- Speaker 2: [Simran Arora](https://arorasimran.com)
  - [Slides 2](lecture_075)

## Lecture 78: Iris: Multi-GPU Programming in Triton
- Speakers: Muhammad Awad, Muhammad Osama & Brandon Potter
- [Slides](lecture_078)

## Lecture 79: Mirage (MPK): Compiling LLMs into Mega Kernels
- Speakers: Mengdi Wu, Xinhao Cheng
- [Slides](lecture_079)

## Lecture 84: Numerics and AI
- Speaker: Paulius Micikevicius
- [Slides](lecture_084)

## Lecture 86: Introduction to CuTeDSL (for NVIDIA competition)
- Speaker: Vicki Wang
- [Slides](lecture_086)

## Lecture 103: Fundamentals of CuTe Layout Algebra and Category-theoretic Interpretation
- Speakers: Jack Carlisle and Jay Shah
- [Slides](lecture_103)

## Lecture 104: Gluon: Tile-Based GPU Programming with Low-Level Control
- Speakers: Peter Bell, Mario Lezcano, Keren Zhou
- [Slides and notes](lecture_104)

## Lecture 106: HF kernels
- [Slides](https://docs.google.com/presentation/d/1RibAIrOJv0BcAx2QjNYHDZCrMfGYifTggtKT6uwv7CY/edit)