[ 2024.03.02 / 14 min read ]
Engineering

Deep Neural Networks (DNNs) at Scale

Engineering for Depth

Scaling a network from 5 layers to 500 isn't just a matter of adding more code. It introduces serious engineering challenges such as vanishing gradients and models too large to fit on any single chip.
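
To see why gradients vanish, consider a hedged back-of-the-envelope sketch: backpropagation multiplies one Jacobian factor per layer, and if each factor's magnitude is around 0.25 (the illustrative value used below, typical of a saturated sigmoid), the training signal reaching the first layer shrinks geometrically with depth.

```python
def gradient_magnitude(depth, per_layer_factor=0.25):
    """Product of per-layer gradient factors across `depth` layers.

    `per_layer_factor` is an illustrative assumption, not a measured
    value; the point is the geometric decay with depth.
    """
    return per_layer_factor ** depth

print(gradient_magnitude(5))    # 5 layers: still a usable signal (~1e-3)
print(gradient_magnitude(500))  # 500 layers: ~1e-301, hopelessly small
```

At 500 layers the gradient is hundreds of orders of magnitude too small to drive learning, which is exactly the regime residual connections were designed to escape.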

1. Residual Connections (ResNet)

To train very deep networks, we use "skip connections" that add a block's input directly to its output, giving the gradient an identity path around each layer. This ensures that the training signal can reach the very first layers of a deep network without fading into mathematical noise.
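
A minimal sketch of the idea, using a toy `layer_fn` on plain Python lists (the names and the shrinking "layer" are illustrative assumptions, not any particular framework's API):

```python
def residual_block(x, layer_fn):
    """Return layer_fn(x) + x elementwise: the identity term gives
    gradients a path that bypasses layer_fn entirely."""
    return [f + xi for f, xi in zip(layer_fn(x), x)]

def tiny_layer(x):
    # Toy layer that scales activations down, simulating a layer
    # that would otherwise shrink the signal at every step.
    return [0.1 * xi for xi in x]

out = residual_block([1.0, 2.0], tiny_layer)
print(out)  # ~[1.1, 2.2]: the input survives even though the
            # layer's own contribution is small
```

Because the block computes `x + F(x)`, a layer that learns nothing useful degrades gracefully to the identity instead of destroying the signal.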

2. Model Parallelism

The largest modern DNNs are too big to fit in a single GPU's memory. Engineering teams must use sharding (splitting the model across hundreds of chips) and pipeline parallelism, carefully scheduling communication so that weights and activations move between devices without leaving hardware idle.
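
The sharding step can be sketched in a few lines. This is a toy model where "devices" are just Python lists of layer functions; a real pipeline scheduler (GPipe-style) would also overlap micro-batches across stages, which is omitted here:

```python
def shard_layers(layers, num_devices):
    """Split the layer list into contiguous per-device stages."""
    per_stage = (len(layers) + num_devices - 1) // num_devices
    return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]

def pipeline_forward(stages, x):
    """Run activations through each stage in order, as if each stage
    lived on its own chip and handed activations to the next."""
    for stage in stages:
        for layer in stage:
            x = layer(x)
    return x

layers = [lambda v, i=i: v + i for i in range(8)]  # 8 toy layers
stages = shard_layers(layers, num_devices=4)       # 2 layers per device
print(len(stages), pipeline_forward(stages, 0))    # 4 28
```

The hard engineering problem in practice isn't the split itself but keeping every stage busy: without micro-batching, device N sits idle while devices 1..N-1 compute.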

INFRASTRUCTURE SNAPSHOT: Training a state-of-the-art DNN today can require a cluster of 10,000+ H100 GPUs and a dedicated power substation.

3. Quantization and Pruning

Once a massive DNN is trained, we "prune" it by removing weights and neurons that contribute little to the output. Combined with 8-bit quantization (storing weights as small integers instead of 32-bit floats), this allows us to run deep models on consumer hardware with minimal loss in accuracy.
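
A hedged sketch of both steps on a toy weight list. The threshold and the single symmetric scale factor are simplifying assumptions; production toolchains use per-channel scales and calibration data:

```python
def prune(weights, threshold=0.05):
    """Magnitude pruning: zero weights below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def quantize_int8(weights):
    """Map floats to int8 via one symmetric scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.9, -0.01, 0.5, 0.002, -0.7]
pruned = prune(w)                 # small weights become exactly 0.0
q, scale = quantize_int8(pruned)  # int8 codes plus one float scale
restored = dequantize(q, scale)
print(pruned)    # [0.9, 0.0, 0.5, 0.0, -0.7]
print(restored)  # close to the pruned values, in a quarter the bits
```

Zeroed weights compress extremely well and can be skipped at inference time, while the int8 codes cut memory bandwidth roughly 4x versus float32, which is what makes on-device deployment feasible.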