Deep Neural Networks (DNN) at Scale
Engineering for Depth
Scaling a network from 5 layers to 500 isn't just a matter of adding more layers to the model definition. It introduces serious engineering challenges such as vanishing gradients and the degradation problem, where deeper networks train to worse accuracy than their shallower counterparts.
1. Residual Connections (ResNet)
To train very deep networks, we add "skip connections" that let each block learn a residual on top of an identity path: y = F(x) + x. Because the identity path passes gradients through unchanged, the training signal can reach the very first layers of a deep network without shrinking toward zero.
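A minimal sketch of the idea in NumPy (the two-layer branch and weight shapes are illustrative, not from any particular architecture): the block computes a transformation F(x) and adds the original input back, so when the branch weights are near zero the block behaves like the identity.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x): the identity path lets gradients bypass the branch."""
    h = relu(x @ w1)          # transformation branch F(x)
    return relu(h @ w2 + x)   # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
w1 = rng.standard_normal((4, 4)) * 0.01  # near-zero init: block starts
w2 = rng.standard_normal((4, 4)) * 0.01  # close to the identity map
y = residual_block(x, w1, w2)
```

With weights initialized near zero, F(x) is tiny and the block approximately passes x through, which is exactly why stacking hundreds of such blocks stays trainable.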
2. Model Parallelism
The largest modern DNNs no longer fit in a single GPU's memory. Engineering teams must shard the model across many chips and use pipeline parallelism, carefully scheduling the exchange of activations and gradients so that devices stay busy rather than idling while they wait for each other.
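A toy sketch of the partitioning idea, with plain NumPy arrays standing in for devices (the layer count, shapes, and micro-batch split are all illustrative assumptions): the model's layers are divided into two shards, and the batch is cut into micro-batches so that, in a real GPipe-style pipeline, one shard can start on the next micro-batch while the other finishes the previous one.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 4-layer model; each shard would live on a separate device.
layers = [rng.standard_normal((8, 8)) * 0.1 for _ in range(4)]
shard_0, shard_1 = layers[:2], layers[2:]

def forward_shard(x, shard):
    """Run a batch through the layers assigned to one device."""
    for w in shard:
        x = np.maximum(x @ w, 0.0)  # linear layer + ReLU
    return x

batch = rng.standard_normal((16, 8))
# Micro-batches are the unit of pipelining: device 1 can process
# micro-batch 0 while device 0 is already working on micro-batch 1.
micro_batches = np.split(batch, 4)
outputs = [forward_shard(forward_shard(mb, shard_0), shard_1)
           for mb in micro_batches]
result = np.concatenate(outputs)
```

Because the forward pass is independent per example, the micro-batched result matches running the whole batch through the unsplit model; the engineering difficulty lies in scheduling and in moving activations between devices, which this single-process sketch omits.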
3. Quantization and Pruning
Once a massive DNN is trained, we "prune" it by zeroing out weights or neurons that contribute little to the output. Combined with 8-bit quantization, which stores weights as small integers instead of 32-bit floats, this allows us to run deep models on consumer hardware with minimal loss in accuracy.
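The two compression steps can be sketched as follows, using simple magnitude pruning and symmetric linear quantization (one common recipe among several; the sparsity level and weight matrix are illustrative):

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

def quantize_int8(w):
    """Symmetric linear quantization: map [-max|w|, max|w|] onto int8."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
w = rng.standard_normal((64, 64)).astype(np.float32)  # trained weights (stand-in)
w_sparse = prune_by_magnitude(w, sparsity=0.5)        # half the weights removed
q, scale = quantize_int8(w_sparse)                    # 1 byte per weight
w_restored = dequantize(q, scale)                     # what inference sees
```

Storing `q` instead of `w` cuts memory 4x (int8 vs float32), and the sparse entries compress further; the round-trip error per weight is bounded by half the quantization step, which is why accuracy typically degrades only slightly.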