

shaping tomorrow with you

# DLU<sup>™</sup>: Deep Learning Unit

Copyright 2018 FUJITSU LIMITED





# DLU: Processor Designed for Deep Learning





from the K computer

#### Features

- Architecture designed for deep learning
- Low-power consumption design
- Goal: 10x performance / watt compared to competitors
- Scalable design with Tofu interconnect technology
  - Ability to handle large-scale neural networks


# What's the New Architecture for the DLU? Domain specific, Optimal precision, and Massively parallel.

| Conventional Architecture | The New Architecture |
|---|---|
| **General Use**: complicated O-O-O cores w/ cache memory | **1. Domain Specific**: domain specific cores w/ large register file |
| **High Precision**: double/single precision FP | **2. Optimal Precision**: Deep Learning Integer |
| **Sequential + Parallel**: multiple strong cores | **3. Massively Parallel**: many cores w/ on-chip network |


# DLU Architecture



- ISA: Newly developed for deep learning
- Micro-Architecture
  - Simple pipeline to remove HW complexity
  - On-chip network to share data between DPUs
- Utilizes Fujitsu's HPC experience, such as high density FMAs and high speed interconnect
- Maximizes performance / watt



#### Fujitsu's interconnect technology: large-scale DLU interconnect through off-chip network

# Heterogeneous Cores and Large Register File

Combining a few large cores (Master) with many small execution cores (DPU) delivers more performance at lower power consumption than a conventional homogeneous structure.



# DPE & Large RF (Register File)

- Each DPU consists of 16 DPEs connected by an on-chip network
- Each DPE includes a large RF and wide SIMD execution units to realize an efficient deep learning engine
  - Unlike a cache, the RF is fully software controllable, so software can extract the full hardware potential
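As a rough illustration of what "software-controllable RF" means in practice, the toy model below splits a dot product across the 16 DPEs of one DPU, with every operand placed in a register file by an explicit load instead of a cache lookup. Only the figure of 16 DPEs comes from the slide; the class `DPE`, the function `dpu_dot`, the register names, and the lane count `SIMD_LANES` are hypothetical and chosen for illustration only.

```python
NUM_DPES = 16    # from the slide: one DPU holds 16 DPEs
SIMD_LANES = 8   # hypothetical lane count, not stated in the slides


class DPE:
    """Toy DPE: a software-managed register file plus a SIMD multiply-accumulate."""

    def __init__(self):
        # Register name -> list of values. Nothing enters the RF
        # unless software loads it explicitly (no tags, no replacement policy).
        self.rf = {}

    def load(self, name, values):
        self.rf[name] = list(values)

    def simd_mac(self, a, b):
        x, y = self.rf[a], self.rf[b]
        acc = 0
        # Process SIMD_LANES element pairs per step, as a wide SIMD unit would.
        for start in range(0, len(x), SIMD_LANES):
            xs = x[start:start + SIMD_LANES]
            ys = y[start:start + SIMD_LANES]
            acc += sum(p * q for p, q in zip(xs, ys))
        return acc


def dpu_dot(x, w):
    """Split a dot product into contiguous slices, one slice per DPE."""
    assert len(x) == len(w)
    chunk = -(-len(x) // NUM_DPES)  # ceiling division
    partials = []
    for i in range(NUM_DPES):
        dpe = DPE()
        lo, hi = i * chunk, (i + 1) * chunk
        dpe.load("x", x[lo:hi])   # explicit, software-scheduled data placement
        dpe.load("w", w[lo:hi])
        partials.append(dpe.simd_mac("x", "w"))
    return sum(partials)          # stand-in for a reduction over the on-chip network
```

Because the data placement is decided by the program, the hardware needs no cache tags or replacement logic; the cost is that software must schedule every load itself.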



# Domain specific architecture - Why cache memory was removed -

#### General Processors

Complex hardware to achieve high performance for any application with various data access patterns

E.g.

- Large cache memory with cache tags and LRU replacement unit
- Hardware prefetch engine

#### DLU (Domain Specific)

Simple hardware focusing on simple memory access patterns

(Figure: CH-in data combined with weights to produce CH-out)

- E.g. Convolution Layer
  - CH-in data can be shared among the CH-out calculations at all DPUs
  - Memory access patterns are continuous and predictable (software controllable)
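The convolution example can be sketched as a loop nest in which each input-channel (CH-in) value is read once and reused by every output-channel (CH-out) calculation. This is a minimal sketch assuming a 1x1 kernel and plain Python lists; the function name `conv1x1_shared_input` and the load counter are hypothetical, and the inner loop over output channels merely stands in for the DPUs working in parallel.

```python
def conv1x1_shared_input(inputs, weights):
    """1x1 convolution.

    inputs:  [ch_in][pixels]
    weights: [ch_out][ch_in]
    Returns (outputs, input_loads), where outputs is [ch_out][pixels].
    """
    ch_in = len(inputs)
    ch_out = len(weights)
    pixels = len(inputs[0])
    outputs = [[0] * pixels for _ in range(ch_out)]
    input_loads = 0
    for c in range(ch_in):            # fixed, sequential order: fully predictable
        for p in range(pixels):
            x = inputs[c][p]          # one load of the CH-in value ...
            input_loads += 1
            for o in range(ch_out):   # ... shared by every CH-out calculation
                outputs[o][p] += weights[o][c] * x
    return outputs, input_loads


outputs, loads = conv1x1_shared_input(
    [[1, 2], [3, 4]],           # 2 input channels, 2 pixels each
    [[1, 0], [0, 1], [1, 1]],   # 3 output channels
)
```

The number of input loads depends only on the input size, not on the number of output channels, and the access order is known before the loop starts. That is the property that lets software schedule the data movement without cache tags or a hardware prefetcher.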

# DLINT : Deep Learning Integer



- Fujitsu's "DLINT" achieves the accuracy needed for deep learning with a data size of only 16 or 8 bits (i.e. lower power consumption than FP32)
- Training results with DLINT8/16 can be converted to the conventional 8/16-bit INT for inference.
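The slides do not describe the DLINT number format, so the sketch below is not DLINT. It shows generic fixed-point quantization with one power-of-two scale shared by a whole tensor, which is one common way that 8- or 16-bit integers can stand in for FP32 values. The function names `quantize_shared_exponent` and `dequantize` and the scale-selection rule are assumptions made for illustration.

```python
import math


def quantize_shared_exponent(values, bits):
    """Map floats to signed `bits`-wide integers sharing one power-of-two step.

    Generic fixed-point scheme, NOT the DLINT format (which the slides do not specify).
    Returns (integer codes, exponent) with value ~= code * 2**exponent.
    """
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(v) for v in values), default=0.0)
    # Smallest power-of-two step that keeps the largest magnitude in range.
    exponent = math.ceil(math.log2(peak / qmax)) if peak > 0 else 0
    step = 2.0 ** exponent
    codes = [max(-qmax - 1, min(qmax, round(v / step))) for v in values]
    return codes, exponent


def dequantize(codes, exponent):
    """Recover approximate float values from integer codes and the shared exponent."""
    step = 2.0 ** exponent
    return [c * step for c in codes]


codes8, exp8 = quantize_shared_exponent([0.5, -1.0, 0.25], bits=8)
codes16, exp16 = quantize_shared_exponent([0.5, -1.0, 0.25], bits=16)
```

In this scheme the stored codes are ordinary signed integers, so the arithmetic runs on integer units and only the shared exponent has to be tracked separately. The rounding error per value is at most half a step, and the step shrinks as the bit width grows.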



# Accuracy of Deep Learning Integer



#### DLINT has shown accuracy similar to FP32 precision



(\*) ImageNet(subset): image size=96x96, #categories=25

# DLU Roadmap



#### Multiple generations of DLUs over time, as we currently do for HPC/UNIX/Mainframe processors



- The 1<sup>st</sup> Generation
- The 2<sup>nd</sup> Generation: Embedded Host CPU
- Future
  - Neuro Computing
  - Combinatorial Optimization Architecture

\* Subject to change without notice

#### AI will be accelerated by three technologies



Three world-class advanced technologies together will contribute to the expansion of customer business:

- "K" Supercomputer & FX100 will provide simulation and pre-processing technology
- Zinrai Deep Learning & DLU will offer a high-speed learning environment
- Digital Annealer will identify optimal solutions to combinatorial problems
