Improving Application Performance with NVMe Storage - Part 2
Local versus Shared Storage for Artificial Intelligence (AI) and Machine Learning (ML)
April 30, 2019

Zivan Ori
E8 Storage

Using local SSDs inside of the GPU node delivers fast access to data during training, but introduces challenges that impact the overall solution in terms of scalability, data access and data protection.

Start with Part 1: The Rise of AI and ML Driving Parallel Computing Requirements

Normally, GPU nodes don't have much room for SSDs, which limits the opportunity to train very deep neural networks that need more data. For example, one well-respected vendor's standard solution is limited to 7.5TB of internal storage and can only scale to 30TB. In contrast, there are generally available NVMe solutions that scale from 100TB to 1PB of shared NVMe storage at the performance of local NVMe SSDs, providing the opportunity to significantly increase the depth of training for neural networks.

A number of today's GPU-based servers have the power to perform entire processing operations on their own. However, some workloads require more than a single GPU node, either to speed up operations by processing across multiple GPUs or to process a machine learning model too large to fit into a single GPU. If the clustered GPU nodes all need access to the same dataset for their machine learning training, the data has to be copied to each GPU node, leading to capacity limitations and inefficient storage utilization. Alternatively, if the dataset is split among the nodes in the GPU cluster, then each portion is stored only locally and cannot be shared between the nodes, and there is no redundancy scheme (RAID / replication) to protect the data.
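The storage cost of copying the dataset to every node can be put into numbers. The following is a minimal sketch, not taken from any vendor tool: the function names and the 5TB / 4-node figures are illustrative assumptions, and RAID or replication overhead is ignored.

```python
def replicated_capacity_tb(dataset_tb: float, nodes: int) -> float:
    """Raw capacity consumed when every node holds a full local copy."""
    return dataset_tb * nodes


def storage_efficiency(dataset_tb: float, nodes: int) -> float:
    """Fraction of consumed capacity that holds unique data."""
    return dataset_tb / replicated_capacity_tb(dataset_tb, nodes)


# Hypothetical example: a 5TB dataset used by a 4-node GPU cluster.
dataset_tb = 5.0
nodes = 4
print("Local copies consume (TB):", replicated_capacity_tb(dataset_tb, nodes))
print("Storage efficiency:", storage_efficiency(dataset_tb, nodes))
print("Shared volume consumes (TB):", dataset_tb)
```

Under these assumptions, efficiency falls as 1/N with the number of nodes, whereas a single shared volume holds one copy regardless of cluster size.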

Because local SSDs may not have the capacity to store the full dataset for machine learning or deep learning, some installations instead use local SSDs as a cache in front of a slower storage array to accelerate access to the working dataset. This creates performance bottlenecks, because the volume of data that must be moved delays the availability of cached data on the SSDs. As datasets grow, local SSD caching becomes ineffective for feeding the GPU training models at the required speeds.
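One way to see why caching loses effectiveness as datasets grow is a simple blended-throughput model. This is a rough sketch under stated assumptions: reads are spread uniformly over the dataset (so the hit ratio is simply cache size divided by dataset size), and the 3 GB/s local SSD and 0.5 GB/s array figures are hypothetical, not measurements from the article.

```python
def hit_ratio(cache_tb: float, dataset_tb: float) -> float:
    """Fraction of reads served from the local SSD cache (uniform access assumed)."""
    return min(1.0, cache_tb / dataset_tb)


def effective_gb_per_s(hit: float, ssd_gb_per_s: float, array_gb_per_s: float) -> float:
    """Blended read throughput: time per GB is the weighted sum of both tiers."""
    seconds_per_gb = hit / ssd_gb_per_s + (1.0 - hit) / array_gb_per_s
    return 1.0 / seconds_per_gb


# Hypothetical figures: 7.5TB of local cache, growing dataset sizes.
CACHE_TB = 7.5
SSD_GB_PER_S = 3.0
ARRAY_GB_PER_S = 0.5

for dataset_tb in (5.0, 15.0, 30.0):
    h = hit_ratio(CACHE_TB, dataset_tb)
    rate = effective_gb_per_s(h, SSD_GB_PER_S, ARRAY_GB_PER_S)
    print(f"dataset={dataset_tb}TB hit_ratio={h:.2f} effective={rate:.2f} GB/s")
```

In this model the slower tier dominates quickly: once the dataset outgrows the cache, effective throughput moves toward the speed of the backing array rather than that of the SSDs.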

Shared NVMe storage can solve the performance challenge for GPU clusters by giving shared read / write data access to all nodes in the cluster at the performance of local SSDs. The need to cache or replicate datasets to all nodes in the GPU cluster is eliminated, improving the overall storage efficiency of the cluster. With some solutions offering support for up to 1PB of RAID-protected, shared NVMe data, the GPU cluster can tackle massive deep learning training for improved results. For clustered applications, this type of solution is well suited to global filesystems such as IBM Spectrum Scale, Lustre, Ceph and others.

Use Case Scenario Example: Deep Learning Datasets

One vendor provides the hardware infrastructure that its customers use to test a variety of applications. With simple connectivity via Ethernet (or InfiniBand), shared NVMe storage provides more capacity for deep learning datasets, which would allow the vendor to expand the use cases that it offers to its customers.

Moving to Shared NVMe-oF Storage

Having now discussed the performance of NVMe inside of GPU nodes, let's explore the performance impact of moving to shared NVMe-oF storage. For this discussion, we will use an example in which testing focuses on the performance of a single node using shared NVMe storage relative to the local SSD inside the GPU node.

Reasonable benchmark parameters and test objectives could be:

1. RDMA Performance: Whether RDMA-based (remote direct memory access) connectivity at the core of the storage architecture could enable low latency and high data throughput.

2. Network Performance: How large quantities of data would affect the network, and whether the network would become a bottleneck during data transfers.

3. CPU Consumption: How much CPU power is used during large data transfers over the RDMA-enabled NICs.

4. In general, whether RDMA technology could be a key component of an AI / ML computing cluster.
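The network question in the list above can be framed as a back-of-envelope check before any benchmark is run. The sketch below is only illustrative: the function names, the 1,000 images per second rate, and the 110KB average image size are assumptions, not figures from the testing described here. The 50GbE and 100GbE link speeds are the ones named in this article.

```python
def required_gbps(images_per_sec: float, avg_image_bytes: float) -> float:
    """Network bandwidth (gigabits per second) needed to feed the training job."""
    return images_per_sec * avg_image_bytes * 8 / 1e9


def link_utilization(needed_gbps: float, link_gbps: float) -> float:
    """Fraction of the link consumed; values near or above 1.0 signal a bottleneck."""
    return needed_gbps / link_gbps


# Hypothetical workload: 1,000 images/s at roughly 110KB per image.
needed = required_gbps(images_per_sec=1000, avg_image_bytes=110_000)

for link in (50, 100):
    share = link_utilization(needed, link)
    print(f"{link}GbE link: {needed:.2f} Gb/s needed, {share:.1%} of the link")
```

A calculation like this only bounds the payload traffic. Protocol overhead, latency, and CPU consumption still have to be measured on the real system.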

I have in fact been privy to similar benchmarks. For side-by-side testing, a TensorFlow benchmark with two different models was used: ResNet-50, a 50-layer residual neural network, as well as VGG-19, a 19-layer convolutional neural network that was trained on more than a million images from the ImageNet database. Both models were read-intensive, as the neural network ingests massive amounts of data during both the training and processing phases of the benchmark. A single GPU node was used for all testing to maintain a common compute platform for all of the test runs. The storage appliance was connected to the node via the NVMe-oF protocol over 50GbE / 100GbE ports for the shared NVMe storage testing. For the final results, all of the tests used a common configuration of training batch size and quantity. During initial testing, different batch sizes were tested (32, 64, 128), but ultimately the testing was performed using the recommended settings.
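A side-by-side comparison of this kind amounts to a small test matrix. The sketch below enumerates one; the `Run` class and the labels `local_nvme` and `shared_nvmeof` are hypothetical names chosen for illustration, not identifiers from the actual benchmark harness. Only the two models and the three batch sizes come from the description above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Run:
    """One benchmark run: a model, a storage target, and a batch size."""
    model: str
    storage: str
    batch_size: int


MODELS = ("resnet50", "vgg19")
STORAGE_TARGETS = ("local_nvme", "shared_nvmeof")  # hypothetical labels
BATCH_SIZES = (32, 64, 128)


def build_matrix() -> list:
    """Every combination of model, storage target, and batch size."""
    return [Run(m, s, b) for m, s, b in product(MODELS, STORAGE_TARGETS, BATCH_SIZES)]


matrix = build_matrix()
print(len(matrix), "runs")
for run in matrix:
    print(run)
```

Holding the compute node constant and varying only the storage target within each model and batch size is what makes the local and shared results directly comparable.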

Benchmark Results

In both image throughput and overall training time, the appliance exceeded the performance of the local NVMe SSD inside the GPU node by a couple of percentage points. This highlights one of the performance advantages of shared NVMe storage: spreading volumes across all drives in the array gains the throughput of multiple SSDs, which compensates for any latency impact of moving to external storage. In other words, the improved image throughput means that more images can be processed in an hour / day / week when using shared NVMe storage than with local SSDs. Although the difference is just a few percentage points, this advantage will scale up as more GPU nodes are added to the compute cluster.
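To make "a few percentage points" concrete, the gain can be converted into additional images processed over a period of time. The following is a minimal sketch with hypothetical inputs: the 1,000 images per second baseline, the 2% gain, and the assumption that the gain scales linearly with node count are all illustrative, not benchmark results.

```python
SECONDS_PER_HOUR = 3600


def extra_images(baseline_ips: float, gain_pct: float, seconds: float, nodes: int = 1) -> float:
    """Additional images processed over a period, assuming linear scaling across nodes."""
    return baseline_ips * gain_pct * seconds / 100 * nodes


# Hypothetical figures: 1,000 images/s per node and a 2% throughput gain.
per_hour = extra_images(1000.0, 2.0, SECONDS_PER_HOUR)
per_day = extra_images(1000.0, 2.0, SECONDS_PER_HOUR * 24)
per_hour_8_nodes = extra_images(1000.0, 2.0, SECONDS_PER_HOUR, nodes=8)

print("Extra images per hour, one node:", per_hour)
print("Extra images per day, one node:", per_day)
print("Extra images per hour, eight nodes:", per_hour_8_nodes)
```

Real clusters rarely scale perfectly linearly, so a figure like this is best read as an upper bound on the benefit.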

In addition, the training time with shared NVMe storage was shorter than with local SSDs, again highlighting the advantage of being able to bring the performance of multiple NVMe SSDs to bear in a shared volume. Combined with the scalability of shared NVMe storage, this enables customers not only to speed up training, but also to leverage datasets of 100TB or more to enable deep learning for improved results.

Read Part 3: Benefits of NVMe Storage for AI/ML

Zivan Ori is CEO and Co-Founder of E8 Storage