
Performance Results

Here we provide performance results for the Bow Pod platforms: our submission to the OGB Large-Scale Challenge, our MLPerf Training v2.0 submission, and results from our own benchmarking activities across a wider range of models for both training and inference.


The Open Graph Benchmark (OGB) was established in 2020 with the aim of objectively measuring the performance of different graph models and compute systems.

OGB's Large-Scale Challenge (OGB-LSC) began in 2021 to accelerate the development and adoption of machine learning on large-scale graphs.

Today, the OGB-LSC consists of three challenges, based on different datasets:

  • PCQM4Mv2:  Predicting a quantum property of molecular graphs
  • WikiKG90Mv2:  Predicting missing facts in a knowledge graph based on Wikipedia
  • MAG240M:  Automatically labelling subject areas on papers submitted to ArXiv

 

For the latest challenge, OGB-LSC 2022, Graphcore submitted to PCQM4Mv2 in partnership with researchers from Valence and Mila. Graphcore also submitted to WikiKG90Mv2 with our own team.

 

Among a strong field of participants including Microsoft, Tencent, NVIDIA and many other world-leading companies & research institutions, OGB-LSC 2022 submissions running on Graphcore IPUs secured:

  • First place in the PCQM4Mv2 challenge for predicting quantum properties of molecular graphs
  • First place in the WikiKG90Mv2 challenge for knowledge graphs

These results are summarised below, with further details about OGB-LSC competition results available at https://ogb.stanford.edu/neurips2022/results/ 

| Dataset Challenge | Model | Platform | SDK Version | Framework | Throughput (items/sec) | Time to Train | Evaluation Metric | OGB-LSC 2022 Ranking |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PCQM4Mv2 | GPS++ | Bow Pod16 | SDK 3.0 | TensorFlow2 | 17,800 | 23h 43m | 0.0719 MAE | First |
| WikiKG90Mv2 | TransE (256) | Bow Pod16 | SDK 3.0 | Poplar | 1,260,000 | 11h 50m | 0.2562 MRR | First |

Evaluation Metric uses ensembled results on the test-challenge dataset | PCQM4Mv2 throughput measured in graphs/sec | WikiKG90Mv2 throughput measured in triples/sec
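For reference, the two evaluation metrics above are straightforward to compute. A minimal sketch in plain Python (illustrative helper functions, not the official OGB evaluators):

```python
def mean_absolute_error(preds, targets):
    # MAE: average absolute difference between prediction and target
    # (used for PCQM4Mv2, where the target is a molecular property).
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def mean_reciprocal_rank(ranks):
    # MRR: average of 1/rank of the correct answer (used for WikiKG90Mv2,
    # where rank is the position of the true entity among ranked candidates).
    return sum(1.0 / r for r in ranks) / len(ranks)

mae = mean_absolute_error([1.2, 0.8], [1.0, 1.0])  # ≈ 0.2
mrr = mean_reciprocal_rank([1, 2, 4])              # ≈ 0.583
```

Lower is better for MAE; higher is better for MRR.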

Bow Platform - Training

Here we provide Training performance results for the Bow Pod platforms. Throughput in this context is defined as the number of input data points (sequences, images, or rows) processed by the model per second. 
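This definition can be made concrete with a small timing sketch; the `train_step` callable here is a stand-in for a real optimiser update, not Graphcore code:

```python
import time

def measure_throughput(train_step, batch_size, num_steps=100):
    """Return throughput in items/sec: (steps * batch size) / wall-clock time."""
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step()  # one model update on one batch of inputs
    elapsed = time.perf_counter() - start
    return num_steps * batch_size / elapsed

# Example with a stand-in workload in place of a real training step.
throughput = measure_throughput(lambda: sum(range(1000)), batch_size=256)
```

In practice throughput is measured over steady-state steps, after compilation and warm-up are excluded.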

 

The results below detail the throughput obtained for each of the referenced models in the specified configuration.

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT Large | Ph1 Pre-Training (SL128) - Packed | Bow Pod16 | SDK 3.1 | PopART | Wikipedia | 54,784 | 16.16 | 6,044 |
| BERT Large | Ph1 Pre-Training (SL128) | Bow Pod16 | SDK 3.3 | TensorFlow2 | Wikipedia | 65,280 | 16.16 | 4,527 |
| BERT Large | Ph1 Pre-Training (SL128) - Packed | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 56,064 | 16.16 | 5,600 |
| BERT Large | Ph1 Pre-Training (SL128) - Packed | Bow Pod64 | SDK 3.1 | PopART | Wikipedia | 54,784 | 16.16 | 22,759 |
| BERT Large | Ph1 Pre-Training (SL128) | Bow Pod64 | SDK 3.0 | TensorFlow2 | Wikipedia | 66,560 | 16.16 | 18,199 |
| BERT Large | Ph1 Pre-Training (SL128) - Packed | Bow Pod64 | SDK 3.2 | PyTorch | Wikipedia | 56,064 | 16.16 | 18,442 |
| BERT Large | Ph2 Pre-Training (SL512) - Packed | Bow Pod16 | SDK 3.1 | PopART | Wikipedia | 9,600 | 16.16 | 2,126 |
| BERT Large | Ph2 Pre-Training (SL512) - Packed | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 1,973 |
| BERT Large | Ph2 Pre-Training (SL512) - Packed | Bow Pod64 | SDK 3.1 | PopART | Wikipedia | 9,600 | 16.16 | 7,789 |
| BERT Large | Ph2 Pre-Training (SL512) - Packed | Bow Pod64 | SDK 3.2 | PyTorch | Wikipedia | 8,192 | 16.16 | 6,543 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | Bow Pod16 | SDK 3.1 | PopART | SQuAD | 256 | 16.16 | 1,183 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | Bow Pod16 | SDK 3.3 | PyTorch | SQuAD | 256 | 16.16 | 1,009 |
| BERT Base | Ph1 Pre-Training (SL128) | Bow Pod16 | SDK 3.1 | PopART | Wikipedia | 65,536 | 16.16 | 16,391 |
| BERT Base | Ph1 Pre-Training (SL128) | Bow Pod16 | SDK 3.3 | TensorFlow2 | Wikipedia | 65,280 | 16.16 | 15,173 |
| BERT Base | Ph1 Pre-Training (SL128) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 65,536 | 16.16 | 15,911 |
| BERT Base | Ph2 Pre-Training (SL512) | Bow Pod16 | SDK 3.1 | PopART | Wikipedia | 16,384 | 16.16 | 3,656 |
| BERT Base | Ph2 Pre-Training (SL384) | Bow Pod16 | SDK 3.3 | TensorFlow2 | Wikipedia | 16,320 | 16.16 | 4,387 |
| BERT Base | Ph2 Pre-Training (SL512) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 16,384 | 16.16 | 3,527 |
| Group BERT Base | Ph1 Pre-Training (SL128) | Bow Pod16 | SDK 3.0 | TensorFlow1 | Wikipedia | 65,520 | 16.16 | 7,187 |
| Group BERT Base | Ph2 Pre-Training (SL384) | Bow Pod16 | SDK 3.0 | TensorFlow1 | Wikipedia | 32,800 | 16.16 | 2,288 |
| Group BERT Base | Ph1 Pre-Training (SL128) | Bow Pod64 | SDK 3.0 | TensorFlow1 | Wikipedia | 64,800 | 16.16 | 26,425 |
| Group BERT Base | Ph2 Pre-Training (SL384) | Bow Pod64 | SDK 3.0 | TensorFlow1 | Wikipedia | 32,640 | 16.16 | 7,572 |
| BERT Base - HuggingFace | Fine-Tuning (SL384 - SQuAD) | Bow Pod16 | SDK 3.0 | TensorFlow2 | SQuAD | 320 | 16.16 | 1,014 |
| GPT2 | GPT2-Large (SL512) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 414 |
| GPT2 | GPT2-Large (SL512) | Bow Pod64 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 1,590 |
| GPT2 | GPT2-Large (SL1024) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 178 |
| GPT2 | GPT2-Medium (SL1024) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 337 |
| GPT2 | GPT2-Medium (SL1024) | Bow Pod64 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 1,320 |
| GPT2 | GPT2-Small (SL1024) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 1,065 |
| GPT2 | GPT2-Small (SL1024) | Bow Pod64 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 4,094 |
| Conformer-Medium | WeNet-Conformer-Medium | Bow Pod16 | SDK 3.3 | PyTorch | AiShell1 | 288 | 16.16 | 1,167 |
| RNN-T | Transformer Transducer | Bow Pod16 | SDK 3.1 | PopART | Generated | 32 | 32.32 | 1,442 |
| DeepVoice3 | | Bow-2000 | SDK 3.1 | PopART | VCTK Corpus | 128 | 32.32 | 9,653 |
| FastSpeech2 | | Bow Pod16 | SDK 3.1 | TensorFlow2 | LJ Speech | 64 | 16.16 | 1,653 |
| FastPitch | frames/s | Bow Pod16 | SDK 3.3 | PyTorch | Generated | 128 | 32.32 | 1,489,341 |
| TGN | Temporal Graph Network | 1x Bow IPU | SDK 3.3 | PyTorch Geometric | | | | 31,617 |
| Cluster-GCN | | Bow-2000 | SDK 3.2 | TensorFlow2 | PPI | | 16.16 | 684,439 |
| Cluster-GCN | | Bow-2000 | SDK 3.2 | TensorFlow2 | ArXiv | | 16.16 | 3,521,863 |
| Cluster-GCN | | Bow-2000 | SDK 3.2 | TensorFlow2 | Reddit | | 16.16 | 1,959,184 |
| Cluster-GCN | | Bow-2000 | SDK 3.2 | TensorFlow2 | Products | | 16.16 | 3,412,474 |
| Cluster-GCN | | Bow-2000 | SDK 3.2 | TensorFlow2 | ogbn-mag | | 16.16 | 2,586,673 |
| MPNN-GIN | Message Passing Graph Isomorphism Network | Bow-2000 | SDK 3.3 | TensorFlow2 | Generated | 1,024 | 16.16 | 473,608 |
| ResNet-50 v1.5 | | Bow Pod16 | SDK 3.0 | TensorFlow1 | ImageNet2012 | 3,520 | 16.16 | 44,059 |
| ResNet-50 v1.5 | | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 38,036 |
| ResNet-50 v1.5 | | Bow Pod64 | SDK 3.0 | TensorFlow1 | ImageNet2012 | 5,120 | 16.16 | 153,205 |
| ResNet-50 v1.5 | | Bow Pod64 | SDK 3.2 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 109,232 |
| EfficientNet-B4 | G16-EfficientNet | Bow Pod16 | SDK 3.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 9,000 |
| EfficientNet-B4 | G16-EfficientNet | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet2012 | 1,024 | 16.32 | 8,223 |
| EfficientNet-B4 | G16-EfficientNet | Bow Pod64 | SDK 3.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 34,140 |
| ResNeXt101 | | Bow Pod16 | SDK 3.0 | TensorFlow1 | ImageNet2012 | 768 | 16.16 | 12,277 |
| ViT | Pre-Training | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 65,536 | 16.16 | 7,608 |
| ViT | Pre-Training | Bow Pod64 | SDK 3.3 | PyTorch | ImageNet1k | 65,536 | 16.16 | 26,185 |
| ViT | Fine-Tuning | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 2,040 | 16.16 | 8,148 |
| DINO | Vision Transformer | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 3,200 | 16.16 | 696 |
| DINO | Vision Transformer | Bow Pod64 | SDK 3.2 | PyTorch | ImageNet1k | 3,200 | 16.16 | 3,437 |
| Swin-Base (224) | Vision Transformer - Pre-Training | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 512 | 32.32 | 1,442 |
| Swin-Tiny (224) | Vision Transformer - Pre-Training | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 1,024 | 32.32 | 3,687 |
| Swin-Large (224) | Vision Transformer - Fine-Tuning | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 8,196 | 16.16 | 3,283 |
| UNet (Medical) | | Bow-2000 | SDK 3.3 | TensorFlow2 | EM segmentation | 24 | 16.16 | 152 |
| Mini DALL-E | | Bow Pod16 | SDK 3.3 | PyTorch | COCO 2017 | 6,144 | 16.16 | 1,843 |
| Mini DALL-E | | Bow Pod64 | SDK 3.3 | PyTorch | COCO 2017 | 24,576 | 16.16 | 6,787 |
| MAE | Masked Autoencoder for visual representation learning | Bow Pod16 | SDK 3.1 | PyTorch | ImageNet | 4,128 | 16.16 | 7,111 |
| Frozen In Time | Multimodal - Pre-Training (1 frame) | Bow Pod8 | SDK 3.3 | PyTorch | webvid | 240 | 16.16 | 447 |
| CLIP | Multimodal (language/vision) | Bow Pod8 | SDK 3.3 | PyTorch | c3m | 795 | 16.16 | 2,498 |

Bow Platform - Inference

Model inference in this context refers to running a trained model on input data to infer an output. Inference performance in production setups is typically measured by two metrics: throughput (as defined previously) and latency, defined here as the time taken for the model to produce an output for a given input.
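Both metrics can be captured in one timed loop; a minimal sketch, where the `infer_batch` stub stands in for a real compiled model:

```python
import time

def measure_inference(infer_batch, batch_size, num_batches=100):
    """Return (throughput in items/sec, mean per-batch latency in ms)."""
    latencies = []
    for _ in range(num_batches):
        start = time.perf_counter()
        infer_batch()  # run the model on one batch of inputs
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    throughput = num_batches * batch_size / total
    mean_latency_ms = 1000 * total / num_batches
    return throughput, mean_latency_ms
```

The trade-off is visible in the results: larger batch sizes generally raise throughput but also raise per-batch latency.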

 

Below we provide throughput and latency results at a given batch size for the Bow platforms.

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) | Latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT2 | GPT2-Small | Bow Pod16 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 1,361 | 5.51 |
| GPT2 | GPT2-Medium | Bow Pod16 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 2 | 16.16 | 337 | 11.75 |
| GPT2 | GPT2-Large | Bow Pod16 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 2 | 16.16 | 97 | 20.57 |
| BERT-Large | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 4 | 16.16 | 2,908 | 1.36 |
| BERT-Large | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 8 | 16.16 | 4,096 | 1.94 |
| BERT-Large | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 12 | 16.16 | 4,655 | 2.56 |
| BERT-Large | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 16 | 16.16 | 5,292 | 3.01 |
| BERT-Base | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 4 | 16.16 | 6,508 | 0.6 |
| BERT-Base | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 320 | 16.16 | 28,069 | 11.41 |
| ResNet-50v1.5 | lowest latency config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 9,297 | 0.58 |
| ResNet-50v1.5 | higher throughput config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 12,282 | 1.66 |
| ResNet-50v1.5 | | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 256 | 16.16 | 46,174 | 26.16 |
| EfficientNet-B0 | lowest latency config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 10,866 | 0.39 |
| EfficientNet-B0 | higher throughput config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 14,789 | 1.29 |
| EfficientNet-B0 | | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 192 | 16.16 | 46,107 | 19.97 |
| EfficientNet-B4 | lowest latency config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 4,419 | 1.26 |
| EfficientNet-B4 | higher throughput config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 5,646 | 3.53 |
| EfficientNet-B4 | | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 48 | 16.16 | 15,458 | 14.35 |
| Yolo v4 | image 896, bps 5, max det 200 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 924 | 6.49 |
| Yolo v4 | image 896, bps 10, max det 300 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 988 | 6.76 |
| Yolo v4 | image 640, bps 5, max det 200 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,854 | 6.61 |
| Yolo v4 | image 640, bps 10, max det 300 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,948 | 6.92 |
| Yolo v4 | image 512, bps 5, max det 200 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 2,477 | 4.81 |
| Yolo v4 | image 512, bps 10, max det 300 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 2,663 | 4.93 |
| Yolo v4 | image 416, bps 5, max det 200 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 3,202 | 3.7 |
| Yolo v4 | image 416, bps 10, max det 100 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 4,268 | 6.28 |
| EfficientDet-D0 | | Bow-2000 | SDK 3.3 | TF2 w/Keras | Synthetic (host-generated) | 16 | 16.16 | 5,171 | 0.77 |
| EfficientDet-D1 | | Bow-2000 | SDK 3.3 | TF2 w/Keras | Synthetic (host-generated) | 12 | 16.16 | 2,875 | 1.4 |
| EfficientDet-D2 | | Bow-2000 | SDK 3.3 | TF2 w/Keras | Synthetic (host-generated) | 8 | 16.16 | 1,869 | 2.14 |
| EfficientDet-D3 | | Bow-2000 | SDK 3.3 | TF2 w/Keras | Synthetic (host-generated) | 4 | 16.16 | 925 | 4.33 |
| EfficientDet-D4 | | Bow-2000 | SDK 3.3 | TF2 w/Keras | Synthetic (host-generated) | 4 | 16.16 | 664 | 6.03 |
| Unet (Medical) | | Bow-2000 | SDK 3.3 | TensorFlow2 | Synthetic (host-generated) | 4 | 16.16 | 1,920 | |
| Unet (Medical) | | Bow-2000 | SDK 3.3 | TensorFlow2 | Synthetic (host-generated) | 8 | 16.16 | 2,081 | |
| FastSpeech2 | | Bow-2000 | SDK 3.1 | TensorFlow2 | Synthetic (host-generated) | 4 | 16.16 | 2,610 | 1.53 |
| FastSpeech2 | | Bow-2000 | SDK 3.1 | TensorFlow2 | Synthetic (host-generated) | 16 | 16.16 | 4,354 | 0.92 |
| FastSpeech2 | | Bow-2000 | SDK 3.1 | TensorFlow2 | Synthetic (host-generated) | 32 | 16.16 | 4,949 | 0.81 |
| FastSpeech2 | | Bow-2000 | SDK 3.1 | TensorFlow2 | Synthetic (host-generated) | 60 | 16.16 | 5,201 | 0.77 |

MLPerf v2.0 Training Performance

For our MLPerf Training v2.0 submissions, we chose the popular application benchmark categories of Image Classification (ResNet-50) and Natural Language Processing (BERT), plus a new Open-division entry in the Speech Transcription category (RNN-T).

 

There are two divisions for submissions. The Closed division requires submitters to use exactly the same model and optimizer implementation, including defined hyperparameter state and training epochs. The Open division fosters and supports innovation by allowing different model implementations, tuned more closely to different processor capabilities or, as in this case, aligned more closely with customer requirements.

MLPerf v2.0 Training Results | MLPerf ID: 2.0-2045, 2.0-2049, 2.0-2051, 2.0-2053

MLPerf v2.0 Training Results | MLPerf ID: 2.0-2047, 2.0-2050, 2.0-2052, 2.0-2054

| Division | Model | MLPerf Quality Target | Platform | SDK Version | Framework | MLPerf ID | Dataset | Precision | Time to Train (mins) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Closed | ResNet50 v1.5 | 75.90% classification | Bow Pod16 | SDK 2.5.1 | TensorFlow | 2.0-2047 | ImageNet2012 | 16.16 | 19.64 |
| Closed | ResNet50 v1.5 | 75.90% classification | Bow Pod64 | SDK 2.5.1 | TensorFlow | 2.0-2050 | ImageNet2012 | 16.16 | 6.30 |
| Closed | ResNet50 v1.5 | 75.90% classification | Bow Pod128 | SDK 2.5.1 | TensorFlow | 2.0-2052 | ImageNet2012 | 16.16 | 4.19 |
| Closed | ResNet50 v1.5 | 75.90% classification | Bow Pod256 | SDK 2.5.1 | TensorFlow | 2.0-2054 | ImageNet2012 | 16.16 | 2.67 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod16 | SDK 2.5.1 | PopART | 2.0-2045 | Wikipedia | 16.16 | 20.66 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod16 | SDK 2.5.1 | PaddlePaddle | 2.0-2046 | Wikipedia | 16.16 | 20.75 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod64 | SDK 2.5.1 | PopART | 2.0-2049 | Wikipedia | 16.16 | 6.70 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod64 | SDK 2.5.1 | PaddlePaddle | 2.0-2048 | Wikipedia | 16.16 | 6.77 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod128 | SDK 2.5.1 | PopART | 2.0-2051 | Wikipedia | 16.16 | 4.42 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod256 | SDK 2.5.1 | PopART | 2.0-2053 | Wikipedia | 16.16 | 3.19 |
| Open | RNN-T | - | Bow Pod64 | SDK 2.5.1 | PopART | 2.0-2125 | Customer dataset | 16.16 | 109.36 |

The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved.
Unauthorized use strictly prohibited. See www.mlperf.org for more information.

IPU-POD Classic - Training

Training a machine learning model involves running the algorithm over an input dataset (training data) until the model converges - meaning that it has learned to produce the desired output to a specified accuracy. Throughput in this context is defined as the number of input data points (sequences, images, or rows) processed by the model per second. Throughput is often used as a measure of hardware performance as it is directly related to the time for the model to train to a specified accuracy.
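Because throughput relates directly to training time, a rough lower bound on time-to-train can be estimated from it. A small sketch with illustrative numbers (not measured results):

```python
def estimated_train_time_hours(dataset_size, epochs, throughput):
    """Estimate wall-clock training time from steady-state throughput.

    dataset_size: items per epoch; throughput: items/sec.
    Ignores evaluation, checkpointing and data-loading stalls, so this
    is a lower bound rather than a measured time-to-train.
    """
    return dataset_size * epochs / throughput / 3600

# e.g. an ImageNet-scale run: ~1.28M images, 90 epochs, 100,000 images/sec
hours = estimated_train_time_hours(1_281_167, 90, 100_000)
```

The converged accuracy still has to be verified separately; higher throughput only translates into faster time-to-train if convergence behaviour is unchanged.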


The results provided below detail the throughput obtained for each of the referenced models in the specified configuration.

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 3,738 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 65,600 | 16.16 | 3,704 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 3,582 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 14,189 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 66,560 | 16.16 | 13,917 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 12,251 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 24,424 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 66,560 | 16.16 | 24,900 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 22,402 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 1,063 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,400 | 16.16 | 1,025 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 1,012 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 4,003 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,640 | 16.16 | 3,938 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 3,611 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 7,127 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,640 | 16.16 | 7,292 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 6,500 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.4.0 | PopART | SQuAD | 256 | 16.16 | 884 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.4.0 | PyTorch | SQuAD | 256 | 16.16 | 744 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 11,991 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 65,280 | 16.16 | 11,647 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | Wikipedia | 65,280 | 16.16 | 11,035 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 11,184 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 3,545 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,320 | 16.16 | 3,288 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | Wikipedia | 16,320 | 16.16 | 3,155 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 3,334 |
| BERT Base - HuggingFace | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | SQuAD | 320 | 16.16 | 375 |
| GPT2 | GPT2-medium (SL128) | IPU-POD16 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 2,540 |
| GPT2 | GPT2-medium (SL128) | IPU-POD64 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 9,870 |
| GPT2 | GPT2-medium (SL128) | IPU-POD128 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 18,842 |
| GPT2 | GPT2-medium (SL128) | IPU-POD256 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 31,025 |
| ResNet-50 v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 1,920 | 16.16 | 7,864 |
| ResNet-50 v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 7,303 |
| ResNet-50 v1.5 | | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 1,920 | 16.16 | 30,690 |
| ResNet-50 v1.5 | | IPU-POD16 | SDK 2.4.0 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 25,534 |
| ResNet-50 v1.5 | | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 2,560 | 16.16 | 108,566 |
| ResNet-50 v1.5 | | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 5,120 | 16.16 | 205,006 |
| ResNet-50 v1.5 | | IPU-POD256 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 10,240 | 16.16 | 365,040 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 768 | 16.16 | 2,514 |
| ResNeXt101 | | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 768 | 16.16 | 9,023 |
| EfficientNet-B4 | G16-EfficientNet | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 800 | 16.16 | 1,618 |
| EfficientNet-B4 | G16-EfficientNet | IPU-M2000 | SDK 2.4.0 | PyTorch | ImageNet2012 | 1,024 | 16.32 | 1,400 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 6,379 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD16 | SDK 2.4.0 | PyTorch | ImageNet2012 | 1,024 | 16.32 | 4,311 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 24,946 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 48,015 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD256 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 87,968 |
| ViT | Vision Transformer | IPU-POD16 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 6,535 |
| ViT | Vision Transformer | IPU-POD64 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 25,080 |
| ViT | Vision Transformer | IPU-POD128 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 46,320 |
| ViT | Vision Transformer | IPU-POD256 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 68,800 |
| UNet (Medical) | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | EM segmentation | 24 | 16.16 | 139 |
| Mini DALL-E | | IPU-M2000 | SDK 2.4.0 | PyTorch | COCO 2017 | 1,536 | 16.16 | 319 |
| Mini DALL-E | | IPU-POD16 | SDK 2.4.0 | PyTorch | COCO 2017 | 6,144 | 16.16 | 815 |
| DeepVoice3 | | IPU-M2000 | SDK 2.4.0 | PopART | VCTK Corpus | 128 | 32.32 | 8,496 |
| FastSpeech2 | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | LJ Speech | 32 | 16.16 | 406 |
| FastSpeech2 | | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | LJ Speech | 64 | 16.16 | 1,141 |
| Conformer | Conformer-Small | IPU-M2000 | SDK 2.4.0 | PyTorch | AiShell1 | 96 | 16.16 | 1,030 |
| Conformer | Conformer-Small | IPU-POD16 | SDK 2.4.0 | PyTorch | AiShell1 | 96 | 16.16 | 3,395 |
| TGN | Temporal Graph Network | GC200 IPU | SDK 2.4.0 | TensorFlow1 | JODIE Wikipedia | 200 | 16.32 | 190,472 |

IPU-POD Classic - Time to Result

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Time To Result (secs) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MCMC TFP | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Proprietary | | 32.32 | 49 |

IPU-POD Classic - Inference

Model inference in this context refers to running a model on input data to infer output. Inference performance in production setups is typically measured on two metrics: throughput (as defined previously) and latency, which is defined as the time taken to execute an inference. 

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) | Latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Large | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 4 | 16.16 | 2,071 | 1.92 |
| BERT-Large | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 8 | 16.16 | 2,911 | 2.73 |
| BERT-Large | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 12 | 16.16 | 3,303 | 3.62 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 4 | 16.16 | 4,580 | 0.86 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 8 | 16.16 | 7,069 | 1.11 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 16 | 16.16 | 9,687 | 1.65 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 32 | 16.16 | 12,584 | 2.53 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 64 | 16.16 | 15,346 | 4.16 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 128 | 16.16 | 17,972 | 7.11 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 256 | 16.16 | 19,484 | 13.11 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 320 | 16.16 | 20,803 | 15.36 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 7,152 | 1.66 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 10,515 | 2.27 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 16,207 | 2.95 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 32 | 16.16 | 22,544 | 4.24 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 64 | 16.16 | 28,762 | 6.66 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 128 | 16.16 | 35,155 | 10.91 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 256 | 16.16 | 40,085 | 19.14 |
| ResNet-50v1.5 | lowest latency config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 7,397 | 0.52 |
| ResNet-50v1.5 | higher throughput config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 9,404 | 2.04 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 14,321 | 2.69 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 20,927 | 3.7 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 36,193 | 8.62 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 128 | 16.16 | 43,472 | 14.38 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 256 | 16.16 | 49,816 | 25.13 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 360 | 16.16 | 50,883 | 30.68 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 4,483 | 2.66 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 6,435 | 3.71 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 9,705 | 4.93 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 32 | 16.16 | 13,693 | 6.99 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 64 | 16.16 | 17,176 | 11.16 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 3,395 | 1.14 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 4,840 | 1.62 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 6,483 | 2.43 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 11,320 | 27.83 |
| EfficientNet-B0 | lowest latency config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 8,686 | 0.44 |
| EfficientNet-B0 | higher throughput config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 10,907 | 1.69 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 50,510 | 3.05 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 71,839 | 4.26 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 128 | 16.16 | 86,986 | 6.77 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 144 | 16.16 | 69,852 | 9.15 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 196 | 16.16 | 61,714 | 13.38 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 8,289 | 1.43 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 13,056 | 1.82 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 22,217 | 2.15 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 32 | 16.16 | 34,448 | 2.77 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 64 | 16.16 | 43,351 | 4.41 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 128 | 16.16 | 53,256 | 7.19 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 160 | 16.16 | 55,169 | 8.68 |
| EfficientNet-B4 | lowest latency config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 3,539 | 1.09 |
| EfficientNet-B4 | higher throughput config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 4,081 | 1.85 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 8,299 | 3.5 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 24 | 16.16 | 9,874 | 4.37 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 10,753 | 5.3 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 40 | 16.16 | 11,578 | 6.22 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 3,718 | 3.21 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 5,514 | 4.34 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 7,959 | 6.01 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 20 | 16.16 | 8,958 | 6.68 |
| EfficientNet-B7 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 1,407 | 8.52 |
| EfficientNet-B7 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 1,869 | 12.82 |
| Yolo v4 | image 896, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 690 | 9.4 |
| Yolo v4 | image 896, bps 10, max det 300 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 722 | 9.74 |
| Yolo v4 | image 640, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,306 | 10.03 |
| Yolo v4 | image 640, bps 10, max det 300 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,364 | 10.39 |
| Yolo v4 | image 512, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,772 | 7.25 |
| Yolo v4 | image 512, bps 10, max det 300 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,915 | 7.31 |
| Yolo v4 | image 416, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 2,195 | 5.88 |
| Yolo v4 | image 416, bps 10, max det 100 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 2,994 | 9.42 |
| Unet (Medical) | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | Synthetic (host-generated) | 4 | 16.16 | 1,144 | |
| Unet (Medical) | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | Synthetic (host-generated) | 8 | 16.16 | 1,190 | |

Precision Terminology: X.Y is defined as follows: X is the precision for storing the activations & gradients, and Y is the precision for storing the weights. When training with 16.16 precision we may still use FP32 for other variables (such as norms or momentum), and include stochastic rounding.
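The motivation for keeping some variables in FP32 is numerical: small updates added to an FP16-stored value can round away entirely. A minimal stdlib-only sketch of the effect (illustrative only, not IPU code; stochastic rounding, mentioned above, is one mitigation):

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE-754 half precision (binary16).
    return struct.unpack('e', struct.pack('e', x))[0]

# A weight of 1.0 receiving 1000 tiny updates of 1e-4 (true sum: 1.1).
w16 = to_fp16(1.0)  # weight stored at FP16 resolution
w32 = 1.0           # stand-in for an FP32 master weight
for _ in range(1000):
    w16 = to_fp16(w16 + 1e-4)  # update is below FP16 resolution: rounds away
    w32 = w32 + 1e-4           # update accumulates at full precision
# w16 is still 1.0; w32 is ≈ 1.1
```

This is why mixed-precision training schemes commonly keep an FP32 master copy of the weights (or use stochastic rounding) even when compute runs in FP16.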

Benchmarks were generated using our examples on the Graphcore GitHub.

This page was last updated on Tuesday, July 4, 2023