
Performance Results

Here we provide performance results for the Bow Pod platforms: our submission to the OGB Large-Scale Challenge, our MLPerf Training v2.0 submission, and results from our own benchmarking activities across a wider range of models for both training and inference.


The Open Graph Benchmark (OGB) was established in 2020 with the aim of objectively measuring the performance of different graph models and compute systems.

OGB's Large-Scale Challenge (OGB-LSC) began in 2021 to accelerate the development and adoption of machine learning on large-scale graphs.

Today, the OGB-LSC consists of three challenges, based on different datasets:

  • PCQM4Mv2:  Predicting a quantum property of molecular graphs
  • WikiKG90Mv2:  Predicting missing facts in a knowledge graph based on Wikipedia
  • MAG240M:  Automatically labelling subject areas on papers submitted to ArXiv

 

For the latest challenge, OGB-LSC 2022, Graphcore submitted to PCQM4Mv2 in partnership with researchers from Valence and Mila. Graphcore also submitted to WikiKG90Mv2 with our own team.

 

Among a strong field of participants including Microsoft, Tencent, NVIDIA and many other world-leading companies & research institutions, OGB-LSC 2022 submissions running on Graphcore IPUs secured:

  • First place in the PCQM4Mv2 challenge for predicting quantum properties of molecular graphs
  • First place in the WikiKG90Mv2 challenge for knowledge graphs

These results are summarised below, with further details about OGB-LSC competition results available at https://ogb.stanford.edu/neurips2022/results/ 

| Dataset Challenge | Model | Platform | SDK Version | Framework | Throughput (items/sec) | Time to Train | Evaluation Metric | OGB-LSC 2022 Ranking |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PCQM4Mv2 | GPS++ | Bow Pod16 | SDK 3.0 | TensorFlow2 | 17,800 | 23h 43m | 0.0719 MAE | First |
| WikiKG90Mv2 | TransE (256) | Bow Pod16 | SDK 3.0 | Poplar | 1,260,000 | 11h 50m | 0.2562 MRR | First |

Evaluation Metric uses ensembled results on the test-challenge dataset | PCQM4Mv2 throughput measured in graphs/sec | WikiKG90Mv2 throughput measured in triples/sec
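For reference, the two evaluation metrics above are straightforward to compute. A minimal sketch in plain Python (illustrative helper functions, not the official OGB evaluators):

```python
def mean_absolute_error(preds, targets):
    # MAE: average absolute difference between prediction and target
    # (used for PCQM4Mv2, where the target is a molecular property).
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def mean_reciprocal_rank(ranks):
    # MRR: average of 1/rank of the correct answer (used for WikiKG90Mv2,
    # where rank is the position of the true entity among ranked candidates).
    return sum(1.0 / r for r in ranks) / len(ranks)

mae = mean_absolute_error([1.2, 0.8], [1.0, 1.0])  # ≈ 0.2
mrr = mean_reciprocal_rank([1, 2, 4])              # ≈ 0.583
```

Lower is better for MAE; higher is better for MRR.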

Bow Platform - Training

Here we provide Training performance results for the Bow Pod platforms. Throughput in this context is defined as the number of input data points (sequences, images, or rows) processed by the model per second. 
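This definition can be made concrete with a small timing sketch; the `train_step` callable here is a stand-in for a real optimiser update, not Graphcore code:

```python
import time

def measure_throughput(train_step, batch_size, num_steps=100):
    """Return throughput in items/sec: (steps * batch size) / wall-clock time."""
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step()  # one model update on one batch of inputs
    elapsed = time.perf_counter() - start
    return num_steps * batch_size / elapsed

# Example with a stand-in workload in place of a real training step.
throughput = measure_throughput(lambda: sum(range(1000)), batch_size=256)
```

In practice throughput is measured over steady-state steps, after compilation and warm-up are excluded.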

 

The results below detail the throughput obtained for each of the referenced models in the specified configuration.

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT Large | Ph1 Pre-Training (SL128) - Packed | Bow Pod16 | SDK 3.1 | PopART | Wikipedia | 54,784 | 16.16 | 6,044 |
| BERT Large | Ph1 Pre-Training (SL128) | Bow Pod16 | SDK 3.3 | TensorFlow2 | Wikipedia | 65,280 | 16.16 | 4,527 |
| BERT Large | Ph1 Pre-Training (SL128) - Packed | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 56,064 | 16.16 | 5,600 |
| BERT Large | Ph1 Pre-Training (SL128) - Packed | Bow Pod64 | SDK 3.1 | PopART | Wikipedia | 54,784 | 16.16 | 22,759 |
| BERT Large | Ph1 Pre-Training (SL128) | Bow Pod64 | SDK 3.0 | TensorFlow2 | Wikipedia | 66,560 | 16.16 | 18,199 |
| BERT Large | Ph1 Pre-Training (SL128) - Packed | Bow Pod64 | SDK 3.2 | PyTorch | Wikipedia | 56,064 | 16.16 | 18,442 |
| BERT Large | Ph2 Pre-Training (SL512) - Packed | Bow Pod16 | SDK 3.1 | PopART | Wikipedia | 9,600 | 16.16 | 2,126 |
| BERT Large | Ph2 Pre-Training (SL512) - Packed | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 1,973 |
| BERT Large | Ph2 Pre-Training (SL512) - Packed | Bow Pod64 | SDK 3.1 | PopART | Wikipedia | 9,600 | 16.16 | 7,789 |
| BERT Large | Ph2 Pre-Training (SL512) - Packed | Bow Pod64 | SDK 3.2 | PyTorch | Wikipedia | 8,192 | 16.16 | 6,543 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | Bow Pod16 | SDK 3.1 | PopART | SQuAD | 256 | 16.16 | 1,183 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | Bow Pod16 | SDK 3.3 | PyTorch | SQuAD | 256 | 16.16 | 1,009 |
| BERT Base | Ph1 Pre-Training (SL128) | Bow Pod16 | SDK 3.1 | PopART | Wikipedia | 65,536 | 16.16 | 16,391 |
| BERT Base | Ph1 Pre-Training (SL128) | Bow Pod16 | SDK 3.3 | TensorFlow2 | Wikipedia | 65,280 | 16.16 | 15,173 |
| BERT Base | Ph1 Pre-Training (SL128) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 65,536 | 16.16 | 15,911 |
| BERT Base | Ph2 Pre-Training (SL512) | Bow Pod16 | SDK 3.1 | PopART | Wikipedia | 16,384 | 16.16 | 3,656 |
| BERT Base | Ph2 Pre-Training (SL384) | Bow Pod16 | SDK 3.3 | TensorFlow2 | Wikipedia | 16,320 | 16.16 | 4,387 |
| BERT Base | Ph2 Pre-Training (SL512) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 16,384 | 16.16 | 3,527 |
| Group BERT Base | Ph1 Pre-Training (SL128) | Bow Pod16 | SDK 3.0 | TensorFlow1 | Wikipedia | 65,520 | 16.16 | 7,187 |
| Group BERT Base | Ph2 Pre-Training (SL384) | Bow Pod16 | SDK 3.0 | TensorFlow1 | Wikipedia | 32,800 | 16.16 | 2,288 |
| Group BERT Base | Ph1 Pre-Training (SL128) | Bow Pod64 | SDK 3.0 | TensorFlow1 | Wikipedia | 64,800 | 16.16 | 26,425 |
| Group BERT Base | Ph2 Pre-Training (SL384) | Bow Pod64 | SDK 3.0 | TensorFlow1 | Wikipedia | 32,640 | 16.16 | 7,572 |
| BERT Base - HuggingFace | Fine-Tuning (SL384 - SQuAD) | Bow Pod16 | SDK 3.0 | TensorFlow2 | SQuAD | 320 | 16.16 | 1,014 |
| GPT2 | GPT2-Large (SL512) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 414 |
| GPT2 | GPT2-Large (SL512) | Bow Pod64 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 1,590 |
| GPT2 | GPT2-Large (SL1024) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 178 |
| GPT2 | GPT2-Medium (SL1024) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 337 |
| GPT2 | GPT2-Medium (SL1024) | Bow Pod64 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 1,320 |
| GPT2 | GPT2-Small (SL1024) | Bow Pod16 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 1,065 |
| GPT2 | GPT2-Small (SL1024) | Bow Pod64 | SDK 3.3 | PyTorch | Wikipedia | 8,192 | 16.16 | 4,094 |
| Conformer-Medium | WeNet-Conformer-Medium | Bow Pod16 | SDK 3.3 | PyTorch | AiShell1 | 288 | 16.16 | 1,167 |
| RNN-T | Transformer Transducer | Bow Pod16 | SDK 3.1 | PopART | Generated | 32 | 32.32 | 1,442 |
| DeepVoice3 | | Bow-2000 | SDK 3.1 | PopART | VCTK Corpus | 128 | 32.32 | 9,653 |
| FastSpeech2 | | Bow Pod16 | SDK 3.1 | TensorFlow2 | LJ Speech | 64 | 16.16 | 1,653 |
| FastPitch | frames/s | Bow Pod16 | SDK 3.3 | PyTorch | Generated | 128 | 32.32 | 1,489,341 |
| TGN | Temporal Graph Network | 1x Bow IPU | SDK 3.3 | PyTorch Geometric | | | | 31,617 |
| Cluster-GCN | | Bow-2000 | SDK 3.2 | TensorFlow2 | PPI | | 16.16 | 684,439 |
| Cluster-GCN | | Bow-2000 | SDK 3.2 | TensorFlow2 | ArXiv | | 16.16 | 3,521,863 |
| Cluster-GCN | | Bow-2000 | SDK 3.2 | TensorFlow2 | Reddit | | 16.16 | 1,959,184 |
| Cluster-GCN | | Bow-2000 | SDK 3.2 | TensorFlow2 | Products | | 16.16 | 3,412,474 |
| Cluster-GCN | | Bow-2000 | SDK 3.2 | TensorFlow2 | ogbn-mag | | 16.16 | 2,586,673 |
| MPNN-GIN | Message Passing Graph Isomorphism Network | Bow-2000 | SDK 3.3 | TensorFlow2 | Generated | 1,024 | 16.16 | 473,608 |
| ResNet-50 v1.5 | | Bow Pod16 | SDK 3.0 | TensorFlow1 | ImageNet2012 | 3,520 | 16.16 | 44,059 |
| ResNet-50 v1.5 | | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 38,036 |
| ResNet-50 v1.5 | | Bow Pod64 | SDK 3.0 | TensorFlow1 | ImageNet2012 | 5,120 | 16.16 | 153,205 |
| ResNet-50 v1.5 | | Bow Pod64 | SDK 3.2 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 109,232 |
| EfficientNet-B4 | G16-EfficientNet | Bow Pod16 | SDK 3.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 9,000 |
| EfficientNet-B4 | G16-EfficientNet | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet2012 | 1,024 | 16.32 | 8,223 |
| EfficientNet-B4 | G16-EfficientNet | Bow Pod64 | SDK 3.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 34,140 |
| ResNeXt101 | | Bow Pod16 | SDK 3.0 | TensorFlow1 | ImageNet2012 | 768 | 16.16 | 12,277 |
| ViT | Pre-Training | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 65,536 | 16.16 | 7,608 |
| ViT | Pre-Training | Bow Pod64 | SDK 3.3 | PyTorch | ImageNet1k | 65,536 | 16.16 | 26,185 |
| ViT | Fine-Tuning | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 2,040 | 16.16 | 8,148 |
| DINO | Vision Transformer | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 3,200 | 16.16 | 696 |
| DINO | Vision Transformer | Bow Pod64 | SDK 3.2 | PyTorch | ImageNet1k | 3,200 | 16.16 | 3,437 |
| Swin-Base (224) | Vision Transformer - Pre-Training | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 512 | 32.32 | 1,442 |
| Swin-Tiny (224) | Vision Transformer - Pre-Training | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 1,024 | 32.32 | 3,687 |
| Swin-Large (224) | Vision Transformer - Fine-Tuning | Bow Pod16 | SDK 3.3 | PyTorch | ImageNet1k | 8,196 | 16.16 | 3,283 |
| UNet (Medical) | | Bow-2000 | SDK 3.3 | TensorFlow2 | EM segmentation | 24 | 16.16 | 152 |
| Mini DALL-E | | Bow Pod16 | SDK 3.3 | PyTorch | COCO 2017 | 6,144 | 16.16 | 1,843 |
| Mini DALL-E | | Bow Pod64 | SDK 3.3 | PyTorch | COCO 2017 | 24,576 | 16.16 | 6,787 |
| MAE | Masked Autoencoder for visual representation learning | Bow Pod16 | SDK 3.1 | PyTorch | ImageNet | 4,128 | 16.16 | 7,111 |
| Frozen In Time | Multimodal - Pre-Training (1 frame) | Bow Pod8 | SDK 3.3 | PyTorch | webvid | 240 | 16.16 | 447 |
| CLIP | Multimodal (language/vision) | Bow Pod8 | SDK 3.3 | PyTorch | c3m | 795 | 16.16 | 2,498 |

Bow Platform - Inference

Model inference in this context refers to running a trained model on input data to infer an output. Inference performance in production setups is typically measured by two metrics: throughput (as defined previously) and latency, defined here as the time taken for the model to produce an output for a given input.
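Both metrics can be captured in one timed loop; a minimal sketch, where the `infer_batch` stub stands in for a real compiled model:

```python
import time

def measure_inference(infer_batch, batch_size, num_batches=100):
    """Return (throughput in items/sec, mean per-batch latency in ms)."""
    latencies = []
    for _ in range(num_batches):
        start = time.perf_counter()
        infer_batch()  # run the model on one batch of inputs
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    throughput = num_batches * batch_size / total
    mean_latency_ms = 1000 * total / num_batches
    return throughput, mean_latency_ms
```

The trade-off is visible in the results: larger batch sizes generally raise throughput but also raise per-batch latency.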

 

Below we provide throughput and latency results at a given batch size for the Bow platforms.

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) | Latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT2 | GPT2-Small | Bow Pod16 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 1,361 | 5.51 |
| GPT2 | GPT2-Medium | Bow Pod16 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 2 | 16.16 | 337 | 11.75 |
| GPT2 | GPT2-Large | Bow Pod16 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 2 | 16.16 | 97 | 20.57 |
| BERT-Large | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 4 | 16.16 | 2,908 | 1.36 |
| BERT-Large | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 8 | 16.16 | 4,096 | 1.94 |
| BERT-Large | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 12 | 16.16 | 4,655 | 2.56 |
| BERT-Large | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 16 | 16.16 | 5,292 | 3.01 |
| BERT-Base | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 4 | 16.16 | 6,508 | 0.6 |
| BERT-Base | SL128 | Bow-2000 | SDK 3.1 | PopART | SQuAD | 320 | 16.16 | 28,069 | 11.41 |
| ResNet-50v1.5 | lowest latency config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 9,297 | 0.58 |
| ResNet-50v1.5 | higher throughput config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 12,282 | 1.66 |
| ResNet-50v1.5 | | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 256 | 16.16 | 46,174 | 26.16 |
| EfficientNet-B0 | lowest latency config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 10,866 | 0.39 |
| EfficientNet-B0 | higher throughput config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 14,789 | 1.29 |
| EfficientNet-B0 | | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 192 | 16.16 | 46,107 | 19.97 |
| EfficientNet-B4 | lowest latency config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 4,419 | 1.26 |
| EfficientNet-B4 | higher throughput config | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 5,646 | 3.53 |
| EfficientNet-B4 | | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 48 | 16.16 | 15,458 | 14.35 |
| Yolo v4 | image 896, bps 5, max det 200 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 924 | 6.49 |
| Yolo v4 | image 896, bps 10, max det 300 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 988 | 6.76 |
| Yolo v4 | image 640, bps 5, max det 200 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,854 | 6.61 |
| Yolo v4 | image 640, bps 10, max det 300 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,948 | 6.92 |
| Yolo v4 | image 512, bps 5, max det 200 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 2,477 | 4.81 |
| Yolo v4 | image 512, bps 10, max det 300 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 2,663 | 4.93 |
| Yolo v4 | image 416, bps 5, max det 200 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 3,202 | 3.7 |
| Yolo v4 | image 416, bps 10, max det 100 | Bow-2000 | SDK 3.3 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 4,268 | 6.28 |
| EfficientDet-D0 | | Bow-2000 | SDK 3.3 | TF2 w/Keras | Synthetic (host-generated) | 16 | 16.16 | 5,171 | 0.77 |
| EfficientDet-D1 | | Bow-2000 | SDK 3.3 | TF2 w/Keras | Synthetic (host-generated) | 12 | 16.16 | 2,875 | 1.4 |
| EfficientDet-D2 | | Bow-2000 | SDK 3.3 | TF2 w/Keras | Synthetic (host-generated) | 8 | 16.16 | 1,869 | 2.14 |
| EfficientDet-D3 | | Bow-2000 | SDK 3.3 | TF2 w/Keras | Synthetic (host-generated) | 4 | 16.16 | 925 | 4.33 |
| EfficientDet-D4 | | Bow-2000 | SDK 3.3 | TF2 w/Keras | Synthetic (host-generated) | 4 | 16.16 | 664 | 6.03 |
| Unet (Medical) | | Bow-2000 | SDK 3.3 | TensorFlow2 | Synthetic (host-generated) | 4 | 16.16 | 1,920 | |
| Unet (Medical) | | Bow-2000 | SDK 3.3 | TensorFlow2 | Synthetic (host-generated) | 8 | 16.16 | 2,081 | |
| FastSpeech2 | | Bow-2000 | SDK 3.1 | TensorFlow2 | Synthetic (host-generated) | 4 | 16.16 | 2,610 | 1.53 |
| FastSpeech2 | | Bow-2000 | SDK 3.1 | TensorFlow2 | Synthetic (host-generated) | 16 | 16.16 | 4,354 | 0.92 |
| FastSpeech2 | | Bow-2000 | SDK 3.1 | TensorFlow2 | Synthetic (host-generated) | 32 | 16.16 | 4,949 | 0.81 |
| FastSpeech2 | | Bow-2000 | SDK 3.1 | TensorFlow2 | Synthetic (host-generated) | 60 | 16.16 | 5,201 | 0.77 |

MLPerf v2.0 Training Performance

For our MLPerf Training v2.0 submissions, we chose the popular application benchmark categories of Image Classification (ResNet-50) and Natural Language Processing (BERT), plus a new Open-division entry in the Speech Transcription category (RNN-T).

 

There are two divisions for submissions. The Closed division requires submitters to use exactly the same model and optimizer implementation, including defined hyperparameter state and training epochs. The Open division fosters and supports innovation by allowing different model implementations, tuned more closely to different processor capabilities or, as in this case, aligned more closely with customer requirements.

MLPerf v2.0 Training Results | MLPerf ID: 2.0-2045, 2.0-2049, 2.0-2051, 2.0-2053

MLPerf v2.0 Training Results | MLPerf ID: 2.0-2047, 2.0-2050, 2.0-2052, 2.0-2054

| Division | Model | MLPerf Quality Target | Platform | SDK Version | Framework | MLPerf ID | Dataset | Precision | Time to Train (mins) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Closed | ResNet50 v1.5 | 75.90% classification | Bow Pod16 | SDK 2.5.1 | TensorFlow | 2.0-2047 | ImageNet2012 | 16.16 | 19.64 |
| Closed | ResNet50 v1.5 | 75.90% classification | Bow Pod64 | SDK 2.5.1 | TensorFlow | 2.0-2050 | ImageNet2012 | 16.16 | 6.30 |
| Closed | ResNet50 v1.5 | 75.90% classification | Bow Pod128 | SDK 2.5.1 | TensorFlow | 2.0-2052 | ImageNet2012 | 16.16 | 4.19 |
| Closed | ResNet50 v1.5 | 75.90% classification | Bow Pod256 | SDK 2.5.1 | TensorFlow | 2.0-2054 | ImageNet2012 | 16.16 | 2.67 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod16 | SDK 2.5.1 | PopART | 2.0-2045 | Wikipedia | 16.16 | 20.66 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod16 | SDK 2.5.1 | PaddlePaddle | 2.0-2046 | Wikipedia | 16.16 | 20.75 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod64 | SDK 2.5.1 | PopART | 2.0-2049 | Wikipedia | 16.16 | 6.70 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod64 | SDK 2.5.1 | PaddlePaddle | 2.0-2048 | Wikipedia | 16.16 | 6.77 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod128 | SDK 2.5.1 | PopART | 2.0-2051 | Wikipedia | 16.16 | 4.42 |
| Closed | BERT | 0.72 Mask-LM accuracy | Bow Pod256 | SDK 2.5.1 | PopART | 2.0-2053 | Wikipedia | 16.16 | 3.19 |
| Open | RNN-T | - | Bow Pod64 | SDK 2.5.1 | PopART | 2.0-2125 | Customer dataset | 16.16 | 109.36 |

The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved.
Unauthorized use strictly prohibited. See www.mlperf.org for more information.

IPU-POD Classic - Training

Training a machine learning model involves running the algorithm over an input dataset (training data) until the model converges - meaning that it has learned to produce the desired output to a specified accuracy. Throughput in this context is defined as the number of input data points (sequences, images, or rows) processed by the model per second. Throughput is often used as a measure of hardware performance as it is directly related to the time for the model to train to a specified accuracy.
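Because throughput relates directly to training time, a rough lower bound on time-to-train can be estimated from it. A small sketch with illustrative numbers (not measured results):

```python
def estimated_train_time_hours(dataset_size, epochs, throughput):
    """Estimate wall-clock training time from steady-state throughput.

    dataset_size: items per epoch; throughput: items/sec.
    Ignores evaluation, checkpointing and data-loading stalls, so this
    is a lower bound rather than a measured time-to-train.
    """
    return dataset_size * epochs / throughput / 3600

# e.g. an ImageNet-scale run: ~1.28M images, 90 epochs, 100,000 images/sec
hours = estimated_train_time_hours(1_281_167, 90, 100_000)
```

The converged accuracy still has to be verified separately; higher throughput only translates into faster time-to-train if convergence behaviour is unchanged.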


The results provided below detail the throughput obtained for each of the referenced models in the specified configuration.

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 3,738 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 65,600 | 16.16 | 3,704 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 3,582 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 14,189 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 66,560 | 16.16 | 13,917 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD64 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 12,251 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 24,424 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 66,560 | 16.16 | 24,900 |
| BERT Large | Ph1 Pre-Training (SL128) | IPU-POD128 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 22,402 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 1,063 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,400 | 16.16 | 1,025 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 1,012 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 4,003 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,640 | 16.16 | 3,938 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD64 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 3,611 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 7,127 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,640 | 16.16 | 7,292 |
| BERT Large | Ph2 Pre-Training (SL384) | IPU-POD128 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 6,500 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.4.0 | PopART | SQuAD | 256 | 16.16 | 884 |
| BERT Large | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.4.0 | PyTorch | SQuAD | 256 | 16.16 | 744 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 65,536 | 16.16 | 11,991 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 65,280 | 16.16 | 11,647 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | Wikipedia | 65,280 | 16.16 | 11,035 |
| BERT Base | Ph1 Pre-Training (SL128) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 11,184 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PopART | Wikipedia | 16,384 | 16.16 | 3,545 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | Wikipedia | 16,320 | 16.16 | 3,288 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | Wikipedia | 16,320 | 16.16 | 3,155 |
| BERT Base | Ph2 Pre-Training (SL384) | IPU-POD16 | SDK 2.4.0 | PyTorch | Wikipedia | 16,384 | 16.16 | 3,334 |
| BERT Base - HuggingFace | Fine-Tuning (SL384 - SQuAD) | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | SQuAD | 320 | 16.16 | 375 |
| GPT2 | GPT2-medium (SL128) | IPU-POD16 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 2,540 |
| GPT2 | GPT2-medium (SL128) | IPU-POD64 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 9,870 |
| GPT2 | GPT2-medium (SL128) | IPU-POD128 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 18,842 |
| GPT2 | GPT2-medium (SL128) | IPU-POD256 | SDK 2.3.0 | PyTorch | Wikipedia | 65,536 | 16.16 | 31,025 |
| ResNet-50 v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 1,920 | 16.16 | 7,864 |
| ResNet-50 v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 7,303 |
| ResNet-50 v1.5 | | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 1,920 | 16.16 | 30,690 |
| ResNet-50 v1.5 | | IPU-POD16 | SDK 2.4.0 | PyTorch | ImageNet2012 | 16,384 | 16.16 | 25,534 |
| ResNet-50 v1.5 | | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 2,560 | 16.16 | 108,566 |
| ResNet-50 v1.5 | | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 5,120 | 16.16 | 205,006 |
| ResNet-50 v1.5 | | IPU-POD256 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 10,240 | 16.16 | 365,040 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 768 | 16.16 | 2,514 |
| ResNeXt101 | | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 768 | 16.16 | 9,023 |
| EfficientNet-B4 | G16-EfficientNet | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 800 | 16.16 | 1,618 |
| EfficientNet-B4 | G16-EfficientNet | IPU-M2000 | SDK 2.4.0 | PyTorch | ImageNet2012 | 1,024 | 16.32 | 1,400 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD16 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 6,379 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD16 | SDK 2.4.0 | PyTorch | ImageNet2012 | 1,024 | 16.32 | 4,311 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD64 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 24,946 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD128 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 48,015 |
| EfficientNet-B4 | G16-EfficientNet | IPU-POD256 | SDK 2.4.0 | TensorFlow1 | ImageNet2012 | 6,144 | 16.16 | 87,968 |
| ViT | Vision Transformer | IPU-POD16 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 6,535 |
| ViT | Vision Transformer | IPU-POD64 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 25,080 |
| ViT | Vision Transformer | IPU-POD128 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 46,320 |
| ViT | Vision Transformer | IPU-POD256 | SDK 2.3.0 | PyTorch | ImageNet1k | 65,536 | 16.16 | 68,800 |
| UNet (Medical) | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | EM segmentation | 24 | 16.16 | 139 |
| Mini DALL-E | | IPU-M2000 | SDK 2.4.0 | PyTorch | COCO 2017 | 1,536 | 16.16 | 319 |
| Mini DALL-E | | IPU-POD16 | SDK 2.4.0 | PyTorch | COCO 2017 | 6,144 | 16.16 | 815 |
| DeepVoice3 | | IPU-M2000 | SDK 2.4.0 | PopART | VCTK Corpus | 128 | 32.32 | 8,496 |
| FastSpeech2 | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | LJ Speech | 32 | 16.16 | 406 |
| FastSpeech2 | | IPU-POD16 | SDK 2.4.0 | TensorFlow2 | LJ Speech | 64 | 16.16 | 1,141 |
| Conformer | Conformer-Small | IPU-M2000 | SDK 2.4.0 | PyTorch | AiShell1 | 96 | 16.16 | 1,030 |
| Conformer | Conformer-Small | IPU-POD16 | SDK 2.4.0 | PyTorch | AiShell1 | 96 | 16.16 | 3,395 |
| TGN | Temporal Graph Network | GC200 IPU | SDK 2.4.0 | TensorFlow1 | JODIE Wikipedia | 200 | 16.32 | 190,472 |

IPU-POD Classic - Time to Result

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Time To Result (secs) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MCMC TFP | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Proprietary | | 32.32 | 49 |

IPU-POD Classic - Inference

Model inference in this context refers to running a model on input data to infer output. Inference performance in production setups is typically measured on two metrics: throughput (as defined previously) and latency, which is defined as the time taken to execute an inference. 

| Model | Variant | Platform | SDK Version | Framework | Dataset | Batch Size | Precision | Throughput (items/sec) | Latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Large | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 4 | 16.16 | 2,071 | 1.92 |
| BERT-Large | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 8 | 16.16 | 2,911 | 2.73 |
| BERT-Large | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 12 | 16.16 | 3,303 | 3.62 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 4 | 16.16 | 4,580 | 0.86 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 8 | 16.16 | 7,069 | 1.11 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 16 | 16.16 | 9,687 | 1.65 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 32 | 16.16 | 12,584 | 2.53 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 64 | 16.16 | 15,346 | 4.16 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 128 | 16.16 | 17,972 | 7.11 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 256 | 16.16 | 19,484 | 13.11 |
| BERT-Base | SL128 | IPU-M2000 | SDK 2.4.0 | PopART | SQuAD | 320 | 16.16 | 20,803 | 15.36 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 7,152 | 1.66 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 10,515 | 2.27 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 16,207 | 2.95 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 32 | 16.16 | 22,544 | 4.24 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 64 | 16.16 | 28,762 | 6.66 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 128 | 16.16 | 35,155 | 10.91 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 256 | 16.16 | 40,085 | 19.14 |
| ResNet-50v1.5 | lowest latency config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 7,397 | 0.52 |
| ResNet-50v1.5 | higher throughput config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 9,404 | 2.04 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 14,321 | 2.69 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 20,927 | 3.7 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 36,193 | 8.62 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 128 | 16.16 | 43,472 | 14.38 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 256 | 16.16 | 49,816 | 25.13 |
| ResNet-50v1.5 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 360 | 16.16 | 50,883 | 30.68 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 4,483 | 2.66 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 6,435 | 3.71 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 9,705 | 4.93 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 32 | 16.16 | 13,693 | 6.99 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 64 | 16.16 | 17,176 | 11.16 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 3,395 | 1.14 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 4,840 | 1.62 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 6,483 | 2.43 |
| ResNeXt101 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 11,320 | 27.83 |
| EfficientNet-B0 | lowest latency config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 8,686 | 0.44 |
| EfficientNet-B0 | higher throughput config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 10,907 | 1.69 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 50,510 | 3.05 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 64 | 16.16 | 71,839 | 4.26 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 128 | 16.16 | 86,986 | 6.77 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 144 | 16.16 | 69,852 | 9.15 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 196 | 16.16 | 61,714 | 13.38 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 8,289 | 1.43 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 13,056 | 1.82 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 22,217 | 2.15 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 32 | 16.16 | 34,448 | 2.77 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 64 | 16.16 | 43,351 | 4.41 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 128 | 16.16 | 53,256 | 7.19 |
| EfficientNet-B0 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 160 | 16.16 | 55,169 | 8.68 |
| EfficientNet-B4 | lowest latency config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 3,539 | 1.09 |
| EfficientNet-B4 | higher throughput config | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 4,081 | 1.85 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 16 | 16.16 | 8,299 | 3.5 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 24 | 16.16 | 9,874 | 4.37 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 32 | 16.16 | 10,753 | 5.3 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 40 | 16.16 | 11,578 | 6.22 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 3,718 | 3.21 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 5,514 | 4.34 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 16 | 16.16 | 7,959 | 6.01 |
| EfficientNet-B4 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 20 | 16.16 | 8,958 | 6.68 |
| EfficientNet-B7 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 4 | 16.16 | 1,407 | 8.52 |
| EfficientNet-B7 | | IPU-M2000 | SDK 2.4.0 | TensorFlow1 | Synthetic (host-generated) | 8 | 16.16 | 1,869 | 12.82 |
| Yolo v4 | image 896, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 690 | 9.4 |
| Yolo v4 | image 896, bps 10, max det 300 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 722 | 9.74 |
| Yolo v4 | image 640, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,306 | 10.03 |
| Yolo v4 | image 640, bps 10, max det 300 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,364 | 10.39 |
| Yolo v4 | image 512, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,772 | 7.25 |
| Yolo v4 | image 512, bps 10, max det 300 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 8 | 16.16 | 1,915 | 7.31 |
| Yolo v4 | image 416, bps 5, max det 200 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 2,195 | 5.88 |
| Yolo v4 | image 416, bps 10, max det 100 | IPU-M2000 | SDK 2.4.0 | PyTorch | Synthetic (host-generated) | 4 | 16.16 | 2,994 | 9.42 |
| Unet (Medical) | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | Synthetic (host-generated) | 4 | 16.16 | 1,144 | |
| Unet (Medical) | | IPU-M2000 | SDK 2.4.0 | TensorFlow2 | Synthetic (host-generated) | 8 | 16.16 | 1,190 | |

Precision Terminology: X.Y is defined as follows: X is the precision for storing the activations & gradients, and Y is the precision for storing the weights. When training with 16.16 precision we may still use FP32 for other variables (such as norms or momentum), and include stochastic rounding.
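The motivation for keeping some variables in FP32 is numerical: small updates added to an FP16-stored value can round away entirely. A minimal stdlib-only sketch of the effect (illustrative only, not IPU code; stochastic rounding, mentioned above, is one mitigation):

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE-754 half precision (binary16).
    return struct.unpack('e', struct.pack('e', x))[0]

# A weight of 1.0 receiving 1000 tiny updates of 1e-4 (true sum: 1.1).
w16 = to_fp16(1.0)  # weight stored at FP16 resolution
w32 = 1.0           # stand-in for an FP32 master weight
for _ in range(1000):
    w16 = to_fp16(w16 + 1e-4)  # update is below FP16 resolution: rounds away
    w32 = w32 + 1e-4           # update accumulates at full precision
# w16 is still 1.0; w32 is ≈ 1.1
```

This is why mixed-precision training schemes commonly keep an FP32 master copy of the weights (or use stochastic rounding) even when compute runs in FP16.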

Benchmarks were generated using our examples on the Graphcore GitHub.

This page was last updated on Tuesday, July 4, 2023