Memory is one of the biggest challenges in deep neural networks (DNNs) today. Researchers struggle with the limited memory bandwidth of the DRAM devices that today's systems must use to store the huge numbers of weights and activations in DNNs. DRAM capacity appears to be a limitation too. But these challenges are not quite as they seem.
Computer architectures have developed with processor chips specialised for serial processing and DRAMs optimised for high-density storage. The interface between these two devices is a major bottleneck that introduces latency and bandwidth limitations and adds considerable overhead in power consumption.
Although we do not yet have a complete understanding of human brains and how they work, it is generally understood that there is no large, separate memory store. The long- and short-term memory function in human brains is thought to be embedded in the neuron/synapse structure. Even a simple organism such as the C. elegans worm, with a neural structure made up of just over 300 neurons, has some basic memory function of this sort.
Building memory into conventional processors is one way of getting around the memory bottleneck, opening up huge memory bandwidth at much lower power consumption. However, on-chip memory is expensive in silicon area, and it would not be possible to integrate the large amounts of memory currently attached to the CPUs and GPUs used to train and deploy DNNs.
So it's useful to look at how memory is used today in CPU- and GPU-powered deep learning systems, and to ask why they need such large attached memory stores when our brains appear to work well without them.
Memory in neural networks is required to store input data, weight parameters and activations as an input propagates through the network. In training, activations from a forward pass must be retained until they can be used to calculate the error gradients in the backwards pass. As an example, the 50-layer ResNet network has ~26 million weight parameters and computes ~16 million activations in the forward pass. If you use a 32-bit floating-point value to store each weight and activation, this gives a total storage requirement of roughly 168 MB. By using lower-precision values to store these weights and activations we could halve, or even quarter, this storage requirement.
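As a quick sanity check on that figure, the arithmetic is easy to reproduce in a few lines of Python (the counts below are the approximate values quoted above, not exact layer-by-layer numbers):

```python
# Rough storage estimate for ResNet-50, one training sample,
# using the approximate counts quoted above.
weights = 26e6         # ~26 million weight parameters
activations = 16e6     # ~16 million activations in the forward pass
bytes_per_value = 4    # 32-bit floating point

total_bytes = (weights + activations) * bytes_per_value
print(f"{total_bytes / 1e6:.0f} MB")   # ~168 MB
```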
A greater memory challenge arises from GPUs' reliance on data being laid out as dense vectors so they can fill very wide single instruction multiple data (SIMD) compute engines, which they use to achieve high compute density. CPUs use similar wide vector units to deliver high-performance arithmetic. In GPUs the vector paths are typically 1024 bits wide, so GPUs using 32-bit floating-point data typically group the training data into mini-batches of 32 samples to create full 1024-bit-wide data vectors. This mini-batch approach to synthesising vector parallelism multiplies the number of stored activations by a factor of 32, growing the local storage requirement to over 2 GB.
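The 2 GB figure follows from the same kind of back-of-the-envelope calculation; the vector width and batch size below are the typical values given above:

```python
# Scaling activation storage by the mini-batch size needed to
# fill a 1024-bit SIMD datapath with 32-bit values.
vector_bits = 1024
value_bits = 32
mini_batch = vector_bits // value_bits            # 32 samples per mini-batch

activations = 16e6                                # per-sample activations (ResNet-50)
batched_bytes = activations * mini_batch * (value_bits // 8)
print(f"{batched_bytes / 1e9:.2f} GB")            # just over 2 GB of activation storage
```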
GPUs and other machines designed for matrix algebra also suffer another memory multiplier on either the weights or the activations of a neural network. GPUs cannot directly execute the small convolutions used in deep neural networks efficiently. So a transformation called 'lowering' is used to convert those convolutions into matrix-matrix multiplications (GEMMs), which GPUs can execute efficiently. Lowering cures the execution inefficiency, but at the cost of multiplying either the activation storage or the weight storage by the number of elements in the convolution mask, typically a factor of 9 (for 3x3 convolution masks). Finally, additional memory is also required to store the input data, temporary values and the program's instructions. Measuring the memory use of ResNet-50 training with a mini-batch of 32 on a typical high-performance GPU shows that it needs over 7.5 GB of local DRAM.
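To make the cost of lowering concrete, here is a minimal im2col-style sketch in NumPy. It is a simplified single-image, unit-stride, same-padding version for illustration, not how any particular GPU library implements lowering, but it shows the factor-of-nine growth for a 3x3 mask:

```python
import numpy as np

def lower_to_matrix(x, k=3):
    """Lower a [channels, height, width] activation tensor for a k x k
    convolution: every output pixel gets its own copy of the k*k input
    patch it reads, so the stored data grows by a factor of k*k."""
    c, h, w = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    cols = np.empty((c * k * k, h * w), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for di in range(k):
            for dj in range(k):
                cols[row] = xp[ci, di:di + h, dj:dj + w].ravel()
                row += 1
    return cols  # a GEMM with a [out_channels, c*k*k] weight matrix completes the convolution

x = np.random.rand(64, 56, 56).astype(np.float32)   # one mid-network activation tensor
cols = lower_to_matrix(x)
print(cols.nbytes / x.nbytes)   # 9.0 -- the activation-memory multiplier
```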
You might think that by using lower-precision compute you could reduce this large memory requirement, but that is not the case for a SIMD machine like a GPU. If you switch to half-precision data values for weights and activations, with a mini-batch of 32, you would only fill half of the SIMD vector width, wasting half of the available compute. To compensate, when you switch from full precision to half precision on a GPU, you also need to double the mini-batch size to induce enough data parallelism to use all the available compute. So switching to lower-precision weights and activations on a GPU still requires over 7.5 GB of local DRAM storage.
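A short sketch of why the activation storage does not shrink (again using the approximate ResNet-50 activation count and the typical 1024-bit vector width assumed above):

```python
# Halving the value size doubles the mini-batch needed to fill the
# SIMD datapath, so the activation storage product is unchanged.
vector_bits = 1024
activations = 16e6                              # per-sample activations (ResNet-50)

for value_bits in (32, 16):
    mini_batch = vector_bits // value_bits      # samples needed to fill the vector
    gb = activations * mini_batch * (value_bits // 8) / 1e9
    print(f"{value_bits}-bit values, mini-batch {mini_batch}: {gb:.2f} GB")
# Both cases come out at ~2 GB of activation storage.
```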
With so much state to store, it is not possible to keep the data on the GPU processor. In fact, many high-performance GPU processors have only 1 KB of memory per processor core that can be read fast enough to saturate the floating-point datapath. This means that at each layer of the DNN, you need to save the state to external DRAM, load up the next layer of the network and then reload the data onto the device. As a result, the already bandwidth- and latency-constrained off-chip memory interface suffers the additional burden of constantly reloading weights as well as saving and retrieving activations. This significantly slows down training and considerably increases power consumption.
Although large mini-batches improve computational efficiency by providing parallelism, research shows that they lead to networks which generalise more poorly and take longer to train. Besides, machine learning model graphs already expose enormous parallelism, so it shouldn't be necessary to synthesise more. True graph machines such as Graphcore's IPU don't need large mini-batches for efficient execution, and they can execute convolutions without the memory bloat of lowering to GEMMs. So IPUs have a very much smaller memory footprint than GPUs, small enough to fit on the processing chip even for large networks. The efficiency and performance gains from doing this are huge.
Decades of work on compilers for sequential programming languages have produced several techniques that can reduce memory further. First, operations such as activation functions can be performed 'in place', allowing the input data to be overwritten directly by the output. In this way the memory is reused. Secondly, memory can be reused by analysing the data dependencies between operations in a network and allocating the same memory to operations that do not use it concurrently.
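The first technique is easy to picture with a minimal NumPy sketch (NumPy here is only a stand-in for whatever framework or compiler actually performs the in-place update):

```python
import numpy as np

x = np.random.randn(1024, 1024).astype(np.float32)   # a 4 MB activation tensor

# Out-of-place ReLU: allocates a second 4 MB buffer for the result.
y = np.maximum(x, 0)

# In-place ReLU: the output overwrites its own input, so this layer
# needs no additional activation storage.
np.maximum(x, 0, out=x)
```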
This second approach is particularly effective when the entire neural network can be analysed at compile time to create a fixed allocation of memory, since the runtime overheads of memory management reduce to almost zero. The combination of these techniques has been shown to reduce memory in neural networks by a factor of two to three. Applied to a parallel program, these optimisations are analogous to the dataflow analysis a compiler performs on a sequential program to reuse registers and stack memory, which are far more efficient than calling dynamic memory allocation routines.
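A toy sketch of the second technique, assuming the compiler knows each tensor's live range: a greedy interval-based allocator over a hypothetical list of intermediate tensors, not any particular framework's allocator.

```python
def assign_buffers(tensors):
    """tensors: list of (name, first_use_step, last_use_step).
    Tensors whose live ranges do not overlap can share one buffer."""
    buffers = []        # buffers[i] = step after which buffer i is free
    assignment = {}
    for name, start, end in sorted(tensors, key=lambda t: t[1]):
        for i, free_from in enumerate(buffers):
            if free_from < start:          # previous occupant is already dead
                buffers[i] = end
                assignment[name] = i
                break
        else:
            buffers.append(end)            # no reusable buffer: allocate a new one
            assignment[name] = len(buffers) - 1
    return assignment

# Four intermediate tensors from a small chain of layers (hypothetical live ranges):
acts = [("act0", 0, 1), ("act1", 1, 2), ("act2", 2, 3), ("act3", 3, 4)]
print(assign_buffers(acts))   # two buffers cover all four tensors
```

Because the live ranges are known at compile time, the whole assignment is computed once, and no allocation work is left for the runtime.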
Another approach is to trade reduced memory for an increase in computation. Often the compute resources are under-used, so extra computation does not necessarily increase runtime, and the memory saved can be well worth the additional work. A simple technique in this vein is to discard values that are relatively cheap to compute, such as the outputs of activation functions, and re-compute them when necessary. Substantial reductions can be achieved by discarding the retained activations of sets of consecutive layers and re-computing them when they are required during the backwards pass, starting from the closest set of remaining activations.
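One concrete way to try this kind of re-computation is PyTorch's checkpointing utility; the following is a minimal sketch with arbitrary layer sizes and segment count, not the MXNet implementation discussed below:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of layers (sizes chosen arbitrarily, for illustration only).
layers = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                         for _ in range(50)])

x = torch.randn(32, 1024, requires_grad=True)

# Split the stack into 5 segments: only the activations at segment
# boundaries are kept during the forward pass; everything inside a
# segment is discarded and re-computed during the backwards pass.
y = checkpoint_sequential(layers, 5, x)
y.sum().backward()
```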
Re-computing activations over sets of layers has been demonstrated by the MXNet team to deliver a factor-of-four memory reduction for a ResNet-50 network and, more importantly, results in memory use that scales sub-linearly with the number of layers.
A similar memory-reuse approach has been developed by researchers at Google DeepMind with recurrent neural networks (RNNs). RNNs are a special type of DNN that allow cycles in their structure to encode behaviour over sequences of inputs. For RNNs, re-computation has been shown to reduce memory by a factor of 20 for sequences of length 1000 with only a 30% performance overhead.
A third significant approach has recently been demonstrated by the Baidu Deep Speech team. They applied various memory-saving techniques to obtain a factor-of-16 reduction in memory for activations, enabling them to train networks with 100 layers where previously, with the same amount of memory, they could only train networks with nine layers.
Combining memory and processing resources in a single device has huge potential to increase the performance and efficiency of DNNs as well as other forms of machine learning systems. It is possible to trade off memory and compute resources to achieve a different balance of capability and performance in a system that is generally useful across all problem sets.
Neural networks and the knowledge models in other machine learning techniques can be thought of as mathematical graphs. These graphs expose huge amounts of parallelism. A parallel processor designed to exploit graph parallelism does not need to rely on mini-batches to achieve high compute utilisation and can therefore significantly reduce the amount of local storage required.
The state-of-the-art results surveyed here show efficient use of memory through reuse and through trading increased computation for reduced memory use. These techniques can deliver dramatic improvements in the performance of neural networks. Today's GPUs and CPUs have very limited on-chip memory, just a few MB in aggregate. New processor architectures designed specifically for machine learning, like Graphcore's IPU, strike a much better balance between memory and compute on chip, delivering dramatic improvements in performance and efficiency over today's CPUs and GPUs.