We'll describe training of very deep networks with mixed-precision float ("float16") using Volta Tensor Core. Float16 has two major potential benefits: high training speed and reduced memory footprint. But float16 has smaller numerical range than regular single precision float, which can result in overflow or underflow ("vanishing gradient") during training. We'll describe simple rescaling mechanism which solves these potential issues. With this rescaling algorithm, we successfully used mixed precision training for such networks as Alexnet, GoogLeNet, Inception_v3, and Resnets without any loss in accuracy.
Hall: Hall I
Track: General Deep Learning