Make Your Model Faster with PyTorch Lightning
PyTorch Lightning Introduction
PyTorch Lightning is a lightweight, open-source machine learning library built on top of PyTorch. It organizes PyTorch code and adds capabilities that make it easier to train and deploy advanced models.
As deep learning grew in complexity and scale, existing software and hardware tooling became insufficient. PyTorch Lightning, originally developed by William Falcon and now maintained by Lightning AI, was created to keep up with emerging technology and give users a better experience while building deep learning models. PyTorch itself was created in a period when AI research was primarily focused on network topologies, and it was used to build a large number of complex models for research and production. However, as approaches such as Generative Adversarial Networks (GAN) and Bidirectional Encoder Representations from Transformers (BERT) emerged, in which several models or components interact with one another, the adoption of new tooling became unavoidable.
PyTorch Lightning encourages users to concentrate on science and research rather than on how to deploy the complicated models they are creating. Models are sometimes simplified just so that they work on a company's systems. PyTorch Lightning, on the other hand, separates the model code from the hardware configuration, so users can debug a model that would normally train on 512 or 1024 GPUs on their laptop using CPUs, without changing any code.
4 Ways To Speed Up Your Training With PyTorch Lightning
This section provides four different ways to improve the performance of your models during training and inference.
- Mixed Precision
- Multi-GPU Training
- Parallel Data Loading
- Early Stopping
We describe each technique, including how it works and how to implement it.
Mixed Precision
Numerical formats with lower precision than 32-bit floating-point, as well as formats with higher precision such as 64-bit floating-point, each have their own advantages.
Lower precision, such as 16-bit floating-point, uses less memory, making it easier to train and deploy massive neural networks. It also speeds up data transfers by using less memory bandwidth, and batched math operations run considerably faster on GPUs with Tensor Cores.
For particularly sensitive use-cases, higher precision, such as 64-bit floating-point, can be used.
By default, PyTorch, like most deep learning frameworks, uses 32-bit floating-point (FP32) arithmetic. Many deep learning models, however, do not need FP32 everywhere to reach full accuracy. Mixed precision training blends FP32 with lower-bit floating-point formats (such as FP16) to reduce memory footprint and improve speed during model training and evaluation. It achieves significant computational speedup by performing most operations in half precision while keeping the critical sections of the network in single precision, so that as little information as possible is lost. Since the introduction of Tensor Cores in the Volta and Turing GPU architectures, switching to mixed precision has resulted in significant training speedups.
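The trade-off between the two formats can be seen without a GPU. The following is a minimal sketch, not how PyTorch or Lightning implement mixed precision: it uses Python's standard struct module to round values to IEEE 754 half and single precision, and the helper names (round_to, to_fp16, to_fp32) are made up for illustration. It shows that FP16 halves the storage per value, but that a small weight update is lost entirely when it is accumulated in FP16, which is the kind of critical step that benefits from staying in FP32.

```python
import struct


def round_to(fmt, value):
    """Round a Python float to the given IEEE 754 format and back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def to_fp16(value):
    return round_to("<e", value)  # "e" is IEEE 754 half precision


def to_fp32(value):
    return round_to("<f", value)  # "f" is IEEE 754 single precision


# Storage cost per value: FP16 needs half the bytes of FP32.
FP16_BYTES = struct.calcsize("<e")
FP32_BYTES = struct.calcsize("<f")

# A weight close to 1.0 receives a small update (illustrative numbers).
weight = 1.0
update = 1e-4

# Near 1.0, adjacent FP16 values are about 0.001 apart, so the update vanishes.
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))
# FP32 has enough precision to keep the update.
fp32_result = to_fp32(to_fp32(weight) + to_fp32(update))

print(FP16_BYTES, FP32_BYTES)
print(fp16_result, fp32_result)
```

In this sketch the FP16 weight never moves away from 1.0, no matter how many such updates are applied, while the FP32 weight does.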
When the mixed-precision flag is set in PyTorch Lightning, the framework uses half precision wherever it is supported and keeps single precision everywhere else. With minor code changes, we obtained a 1.5x to 2x speedup in our model training times.
    from pytorch_lightning import Trainer

    trainer = Trainer(precision=16)
Multi-GPU Training
GPUs have significantly sped up training and inference compared to CPUs. What's better than a GPU? Multiple GPUs!
In PyTorch, there are a few paradigms for training models with multiple GPUs. 'DataParallel' and 'DistributedDataParallel' are two of the most common. We went with 'DistributedDataParallel' because it is more scalable. The tradeoffs between the two strategies are worth looking into before choosing one.
Adapting a training pipeline to multiple GPUs is not easy in PyTorch (or other frameworks). You have to think about things like distributed data loading and the syncing of weights, gradients, and metrics.
With PyTorch Lightning, we were able to train our PyTorch models on several GPUs without modifying our model code!
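    from pytorch_lightning import Trainer

    # gpus=-1 uses every available GPU; pass an integer to use that many instead.
    # Lightning 2.0 removed `gpus`; there, use Trainer(accelerator="gpu", devices=-1).
    trainer = Trainer(gpus=-1)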
How Does Distributed Training Work?
It's vital to remember that in a distributed setting, the optimization process is identical to that in a single-device setting; that is, we still minimize the same cost function using the same model and optimizer.
The main difference is that the gradient computation is distributed across multiple devices and performed in parallel. This works because the gradient operator is linear: for a loss that is averaged over samples, computing the gradient on equally sized shards of the batch and then averaging the results gives the same gradient as computing it for the entire batch at once.
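The averaging argument can be checked with a few lines of plain Python. This is a toy sketch under stated assumptions: a one-parameter linear model, a mean squared error loss, and two equally sized shards standing in for two devices. The function mse_grad and the data are invented for illustration; in real distributed training the averaging happens across processes rather than in a list.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy batch and current weight (illustrative values).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Single-device setting: one gradient over the whole batch.
full_grad = mse_grad(w, xs, ys)

# Distributed setting: each "device" sees an equally sized shard of the batch.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
local_grads = [mse_grad(w, shard_x, shard_y) for shard_x, shard_y in shards]

# Averaging the local gradients recovers the full-batch gradient.
averaged_grad = sum(local_grads) / len(local_grads)

print(full_grad, averaged_grad)
```

If the shards had different sizes, a plain average would weight samples unevenly, which is why the equal-shard assumption matters here.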
Parallel Data Loading
The data loading and augmentation processes are frequently found to be bottlenecks in the training pipeline.
Data loading and augmentation are embarrassingly parallel, so they can be sped up by loading data in parallel using multiple CPU processes. That way, expensive GPU resources are not left waiting on the CPU during training and inference.
We did the following to load data as quickly as possible for training deep learning models:
- Set DataLoader's num_workers argument to the number of available CPUs.
- When working with GPUs, set the pin_memory parameter in DataLoader to True.
This places the data in page-locked memory, which allows data to be transferred to the GPU more quickly.
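The two settings above can be combined as follows. This is a minimal sketch rather than the exact configuration we used: the helper make_loader and the toy dataset are made up for illustration, one worker per CPU core is an assumption that usually needs tuning, and pin_memory is only enabled when a CUDA device is present.

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size=32):
    """Build a DataLoader that loads batches in parallel worker processes."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        # One worker process per CPU core (an assumption; tune for your machine).
        num_workers=os.cpu_count() or 1,
        # Page-locked host memory only helps when batches are copied to a GPU.
        pin_memory=torch.cuda.is_available(),
    )


# Hypothetical toy dataset: 256 samples with 8 features and a binary label.
toy_dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))
loader = make_loader(toy_dataset)
```

A loader built this way can be returned from a LightningModule's train_dataloader method or passed to the Trainer's fit call. On platforms that start worker processes with spawn, such as Windows and macOS, iterate over the loader inside an `if __name__ == "__main__":` guard.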
Early Stopping
Lightning's EarlyStopping callback lets the Trainer halt training as soon as a monitored metric stops improving. It's ideal for hyperparameter searches and grid runs since it reduces the amount of time spent on parameter sets that produce poor convergence or overfitting.
    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import EarlyStopping

    es = EarlyStopping(
        monitor="val_loss",
        stopping_threshold=1e-4,
        divergence_threshold=6.0,
    )
    trainer = Trainer(callbacks=[es])
As the code above shows, a couple of parameters control when early stopping is triggered.
- stopping_threshold - When the monitored quantity reaches this threshold, training is terminated immediately. It's helpful when we know that going beyond a particular optimal value isn't going to help us any further.
- divergence_threshold - When the monitored quantity becomes worse than this threshold, training is terminated. When the metric gets this bad, we believe it is better to stop early and try again with different initial conditions.
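The stopping rules can be summarized in a short, framework-free sketch. This is not Lightning's implementation: the function should_stop is hypothetical, it assumes a metric where lower is better (such as val_loss), and the patience and min_delta handling is a simplified reading of the usual "no improvement for N checks" rule.

```python
def should_stop(history, patience=3, min_delta=0.0,
                stopping_threshold=None, divergence_threshold=None):
    """Return (stop, reason) for a metric history where lower is better."""
    current = history[-1]

    # Good enough: the metric has reached the target value.
    if stopping_threshold is not None and current < stopping_threshold:
        return True, "reached stopping_threshold"

    # Hopeless: the metric is worse than the divergence limit.
    if divergence_threshold is not None and current > divergence_threshold:
        return True, "exceeded divergence_threshold"

    # Otherwise count the checks since the last real improvement.
    best = float("inf")
    checks_without_improvement = 0
    for value in history:
        if value < best - min_delta:
            best = value
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1

    if checks_without_improvement >= patience:
        return True, "no improvement within patience"
    return False, "keep training"


print(should_stop([0.9, 0.5, 0.00005], stopping_threshold=1e-4))
print(should_stop([0.9, 7.2], divergence_threshold=6.0))
print(should_stop([0.9, 0.8, 0.85, 0.81, 0.8]))
print(should_stop([0.9, 0.8, 0.7]))
```

The first three calls report that training should stop, each for a different reason, while the last one keeps training because the metric is still improving.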
By combining all of these features in PyTorch Lightning, we were able to build a training pipeline that was 5x-10x faster and more memory efficient. This dramatically shortened our experimentation cycles and let us explore previously unexplored avenues! It also allowed us to try out more model topologies and hyperparameters and to pursue riskier research ideas!