Should You Send a CPU to Do a GPU’s Job?

January 29, 2020

Published in Electronic Design

At its Data-Centric Innovation Day in April 2019, Intel unveiled the 2nd Generation Intel Xeon Scalable processors (formerly codenamed Cascade Lake). The lineup divides into Platinum, Gold, Silver, and Bronze tiers, topped by the Platinum 9200 series, also known as Advanced Performance (AP). The flagship 9282 packs 56 cores per processor in a multichip module (two dies in one package, doubling both the core count and the memory channels). Measuring 76.0 × 72.5 mm, it is Intel’s largest package to date. Aimed at density, high-performance computing, and advanced analytics, the 9200 series is not sold as a standalone part; it is available only in pre-integrated systems from OEMs, who buy from Intel and customize the designs.

One of the new features of the second-generation Xeon processors is Intel Deep Learning Boost (Intel DL Boost), also known as the Vector Neural Network Instructions (VNNI). VNNI fuses three instructions into one, making better use of computational resources and cache while reducing the likelihood of bandwidth bottlenecks. VNNI also enables INT8 deep-learning inference, which boosts performance with “little loss of accuracy”: 8-bit inference yields a theoretical peak compute gain of 4X over 32-bit floating-point (FP32) operations.
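The arithmetic behind the fused instruction and the 4X figure can be sketched in plain Python. This is a simplified model of what the hardware does, not Intel’s implementation; the lane counts assume a 512-bit AVX-512 register:

```python
def int8_dot_product(activations, weights):
    """Sketch of the arithmetic VNNI fuses: multiply unsigned 8-bit
    activations by signed 8-bit weights and accumulate the products
    into a wide integer -- three operations collapsed into a single
    hardware instruction."""
    assert all(0 <= a <= 255 for a in activations)   # u8 range
    assert all(-128 <= w <= 127 for w in weights)    # s8 range
    return sum(a * w for a, w in zip(activations, weights))

# Packing explains the theoretical 4X peak-compute gain: a 512-bit
# register holds 64 INT8 lanes versus 16 FP32 lanes per operation.
int8_lanes = 512 // 8    # 64
fp32_lanes = 512 // 32   # 16
peak_gain = int8_lanes // fp32_lanes  # 4
```

The “little loss of accuracy” caveat arises because the 8-bit inputs are quantized approximations of the original FP32 weights and activations; the wide accumulator only prevents overflow, not quantization error.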

Fast-forward to May 13, 2019, when Intel announced that its new high-end CPU outperforms Nvidia’s GPU on ResNet-50, a popular convolutional neural network (CNN) for computer vision. Quoting Intel, “Today, we have achieved leadership performance of 7878 images per second on ResNet-50 with our latest generation of Intel Xeon Scalable processors, outperforming 7844 images per second on Nvidia Tesla V100, the best GPU performance as published by Nvidia on its website including T4.”  

Employing the Xeon Platinum 9282, Intel achieved 7878 images/s by partitioning the system into 28 virtual instances of four CPU cores each, using a batch size of 11 (Table 1). The ResNet-50 code was optimized with Intel Optimized Caffe, an open-source deep-learning framework. Intel recently added four general optimizations for the new INT8 inference: activation memory optimization, weight sharing, convolution algorithm tuning, and first convolution transformation.
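The core-partitioning arithmetic behind those 28 instances can be checked directly. This sketch assumes a two-socket 9282 system, which the totals imply; the per-instance rate is illustrative, not a figure Intel reported:

```python
# Partitioning sketch: a two-socket Platinum 9282 system split into
# 4-core virtual instances, each running its own inference stream.
CORES_PER_SOCKET = 56     # Platinum 9282 core count
SOCKETS = 2               # assumption: a two-socket test system
CORES_PER_INSTANCE = 4    # per Intel's described configuration

total_cores = CORES_PER_SOCKET * SOCKETS          # 112
instances = total_cores // CORES_PER_INSTANCE     # 28 virtual instances

# Aggregate throughput is the sum of the per-instance rates; dividing
# the reported 7878 images/s evenly gives a rough per-instance average.
per_instance_rate = 7878 / instances              # ~281 images/s
```

Running many small instances rather than one large one is a common CPU inference tactic: each instance keeps its working set within its cores’ caches and avoids cross-socket synchronization.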

Nvidia wasted no time in replying to Intel’s performance claims, stating, “It’s not every day that one of the world’s leading tech companies highlights the benefits of your products. Intel did just that last week, comparing the inference performance of two of their most expensive CPUs to Nvidia GPUs.” Nvidia’s detailed reply was a two-pronged response centering on power efficiency and performance per processor.

Read the full article.

Tammy Carter

Senior Product Manager

Tammy Carter is the Senior Product Manager for GPGPUs and software products, including OpenHPEC, at Curtiss-Wright Defense Solutions. In addition to an M.S. in Computer Science, she has over 20 years of experience designing, developing, and integrating real-time embedded systems in the defense, communications, and medical arenas.
