Should You Send a CPU to Do a GPU’s Job?

Published in Electronic Design

At its Data-Centric Innovation Day in April 2019, Intel unveiled its 2nd Generation Intel Xeon Scalable processors (code-named Cascade Lake). The parts divide across the Platinum, Gold, Silver, and Bronze lines. At the top sits the Platinum 9200 series, also known as Advanced Performance (AP). The flagship 9282 packs 56 cores per processor in a multichip module: two dies in one package, doubling both the core count and the memory channels. Measuring 76.0 × 72.5 mm, it's Intel's largest package to date. Aimed at compute density, high-performance computing, and advanced analytics, the 9200 series is sold only through OEMs, which buy the processors from Intel and integrate them into their own systems.

One of the new features of the second-generation Xeon processors is Intel Deep Learning Boost (Intel DL Boost), built around the Vector Neural Network Instructions (VNNI). VNNI combines three instructions into a single instruction, making better use of computational resources and cache while reducing the likelihood of bandwidth bottlenecks. VNNI also enables INT8 deep-learning inference, which boosts performance with "little loss of accuracy." The 8-bit inference yields a theoretical peak compute gain of 4X over 32-bit floating-point (FP32) operations, since a vector register holds four times as many INT8 values as FP32 values.
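To make the instruction fusion concrete, here's a minimal sketch (not Intel's library code) of the INT8 dot-product kernel at the heart of CNN inference, written with compiler intrinsics. The legacy AVX-512 path needs three instructions per step (VPMADDUBSW, VPMADDWD, VPADDD); VNNI fuses them into one (VPDPBUSD). It assumes a VNNI-capable CPU such as Cascade Lake and a compiler flag like gcc's -mavx512bw -mavx512vnni.

```c
#include <immintrin.h>
#include <stdio.h>
#include <string.h>

/* Legacy path, three instructions: vpmaddubsw + vpmaddwd + vpaddd */
static __m512i dot_legacy(__m512i acc, __m512i u8_vals, __m512i s8_wts) {
    __m512i ones = _mm512_set1_epi16(1);
    __m512i t = _mm512_maddubs_epi16(u8_vals, s8_wts); /* u8*s8 -> s16 pairs (saturating) */
    t = _mm512_madd_epi16(t, ones);                    /* s16 pairs -> s32 */
    return _mm512_add_epi32(acc, t);                   /* accumulate */
}

/* VNNI path, one instruction: vpdpbusd (s32 accumulate, no s16 saturation) */
static __m512i dot_vnni(__m512i acc, __m512i u8_vals, __m512i s8_wts) {
    return _mm512_dpbusd_epi32(acc, u8_vals, s8_wts);
}

int main(void) {
    unsigned char a[64];                /* 64 INT8 lanes per 512-bit register */
    signed char   b[64];                /* vs. 16 FP32 lanes: the 4X headroom */
    for (int i = 0; i < 64; i++) { a[i] = (unsigned char)i; b[i] = (signed char)(i - 32); }

    __m512i va = _mm512_loadu_si512(a), vb = _mm512_loadu_si512(b);
    __m512i zero = _mm512_setzero_si512();
    int r1[16], r2[16];
    _mm512_storeu_si512(r1, dot_legacy(zero, va, vb));
    _mm512_storeu_si512(r2, dot_vnni(zero, va, vb));
    printf("results match: %s\n", memcmp(r1, r2, sizeof r1) ? "no" : "yes");
    return 0;
}
```

With small inputs like these, the two paths produce identical results; for large values, the legacy path can saturate in its 16-bit intermediate, which is one accuracy edge of the fused instruction.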

Fast-forward to May 13, 2019, when Intel announced that its new high-end CPU outperforms Nvidia’s GPU on ResNet-50, a popular convolutional neural network (CNN) for computer vision. Quoting Intel, “Today, we have achieved leadership performance of 7878 images per second on ResNet-50 with our latest generation of Intel Xeon Scalable processors, outperforming 7844 images per second on Nvidia Tesla V100, the best GPU performance as published by Nvidia on its website including T4.”  

Employing the Xeon Platinum 9282, Intel achieved 7878 images/s by creating 28 virtual instances, each running on four CPU cores with a batch size of 11 (Table 1). The ResNet-50 code was optimized with Intel Optimized Caffe, an open-source deep-learning framework, to which Intel recently added four general optimizations for the new INT8 inference: activation memory optimization, weight sharing, convolution algorithm tuning, and first-convolution transformation.
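The multi-instance layout can be pictured with a minimal Linux sketch, assuming a dual-9282 system (2 × 56 = 112 cores): fork 28 worker processes and pin each to four consecutive logical cores. This is an illustrative assumption, not Intel's benchmark harness; run_inference() is a hypothetical placeholder for the Caffe worker.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define N_INSTANCES    28   /* virtual instances in Intel's reported setup */
#define CORES_PER_INST  4   /* CPU cores dedicated to each instance        */

/* Hypothetical placeholder for one ResNet-50 worker (batch size 11). */
static void run_inference(int instance_id) {
    printf("instance %d running on %d pinned cores\n", instance_id, CORES_PER_INST);
}

int main(void) {
    for (int i = 0; i < N_INSTANCES; i++) {
        if (fork() == 0) {                    /* child: pin, then run  */
            cpu_set_t set;
            CPU_ZERO(&set);
            for (int c = 0; c < CORES_PER_INST; c++)
                CPU_SET(i * CORES_PER_INST + c, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                exit(1);
            }
            run_inference(i);
            exit(0);
        }
    }
    while (wait(NULL) > 0)                    /* reap all 28 workers   */
        ;
    return 0;
}
```

Pinning each instance to its own cores keeps the workers from contending for cache and lets small batches keep every core busy, which is how many independent four-core instances can add up to the aggregate 7878 images/s figure.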

Nvidia wasted no time in replying to Intel's performance claims, releasing the statement, "It's not every day that one of the world's leading tech companies highlights the benefits of your products. Intel did just that last week, comparing the inference performance of two of their most expensive CPUs to Nvidia GPUs." Nvidia's detailed reply was a two-pronged response centering on power efficiency and performance per processor.
