

### **Technology Scaling Trends**

Exascale in 2021... and then what?





### **Specialization:**

### Nature's Way of Extracting More Performance in a Resource-Limited Environment

#### **Powerful General Purpose**



Xeon, Power

#### **Many Lighter Weight (Post-Dennard Scarcity)**

KNL, AMD, Cavium/Marvell, GPU

#### **Many Different Specialized (Post-Moore Scarcity)**

Apple, Google, Amazon, SambaNova



### **Extreme Hardware Specialization is Happening Now!**

This trend is already well underway in the broader electronics industry

Cell phones and even megadatacenters (Google TPU, Microsoft FPGAs...)

**40+ different heterogeneous accelerators** in Apple A11 (2019)

(and it will happen to HPC too... will we be ready?)

[Figure: SoC block diagram (quad ARM® Cortex™-A9 CPU platform surrounded by connectivity, multimedia, imaging, security, and power-management IP blocks) alongside a chart of accelerator counts for Apple A4 through A11, 2010 to 2017; estimates [Y. Shao 2015].]

[www.anandtech.com/show/8562/chipworks-a8]

### Large Scale Datacenters also Moving to Specialized Acceleration: The Google TPU



accelerators was designed for something else



Specialization will be necessary [...] and performance requirements



B input is read at once, and they instantly

256 accumulator RAMs.

| Model      | MHz  | Measured Watts (Idle) | Measured Watts (Busy) | TOPS/s (8b) | TOPS/s (FP) | GOPS/s/Watt (8b) | GOPS/s/Watt (FP) | GB/s | On-Chip Memory |
|------------|------|-----------------------|-----------------------|-------------|-------------|------------------|------------------|------|----------------|
| Haswell    | 2300 | 41                    | 145                   | 2.6         | 1.3         | 18               | 9                | 51   | 51 MiB         |
| NVIDIA K80 | 560  | 24                    | 98                    |             | 2.8         |                  | 29               | 160  | 8 MiB          |
| TPU        | 700  | 28                    | 40                    | 92          |             | 2,300            |                  | 34   | 28 MiB         |

Notional exascale system:

2,300 GOPS/W →? 288 GF/W (dp) → a 3.5 MW Exaflop system!
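The arithmetic behind this notional estimate can be sketched as below. The factor of 8 between an 8-bit op and a double-precision flop is an assumption inferred from the ratio 2,300 / 288 on the slide, and the function and parameter names are illustrative only.

```python
def notional_exaflop_power(gops_per_watt_8b, dp_cost_factor=8.0, target_flops=1e18):
    """Return (GF/W in double precision, system power in MW)."""
    # Assumption: one double-precision flop costs dp_cost_factor times
    # the energy of one 8-bit op (2,300 / 8 is roughly 288).
    gflops_per_watt_dp = gops_per_watt_8b / dp_cost_factor
    # Power needed to sustain target_flops at that efficiency.
    watts = target_flops / (gflops_per_watt_dp * 1e9)
    return gflops_per_watt_dp, watts / 1e6


gf_per_watt, megawatts = notional_exaflop_power(2300)
print(f"{gf_per_watt:.1f} GF/W (dp), {megawatts:.2f} MW for one exaflop")
```

Under these assumptions the result lands at roughly 3.5 MW, matching the slide; a different precision penalty would scale the answer proportionally.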



### **Amazon AWS Graviton Custom ARM SoC** (and others)

#### **AWS Graviton2 processor**

- 4x the vCPUs
- 7x CPU performance
- ~2x performance/vCPU
- ~30 Billion transistors
- 7nm







### **AWS CEO Andy Jassy:**

"AWS isn't going to wait for the tech supply chain to innovate for it and is making a statement with performance comparisons against an Intel Xeon-based instance. The EC2 team was clear that Graviton2 sends a message to vendors that they need to move faster and AWS is not going to hold back its cadence based on suppliers."

### Why does it Matter?

Why should we specialize?



### **HEP: Computing challenges for Particle Tracking**



Exponential growth of the current ATLAS Inner
Detector reconstruction time with increased



... but computing power no longer increasing at exponential rates!

New approaches must be developed to satisfy growing computing demands of the experiment



### Impacts of Moore's Law Tapering on DOE Science

### **HEP Computing Strategy**

- Successful implementation of the broad science program envisioned by P5 will require an equally broad and foresighted approach to the computing challenges
  - Meeting these challenges will require us to work together to more effectively share resources (hardware, software, and expertise) and appropriately integrate commercial computing and HPC advances
- Last year OHEP stood up an internal working group charged with:
  - Developing and maintaining an HEP Computing Resource Management Strategy, and
  - Recommending actions to implement the strategy
- Working group began by conducting an initial survey of the computing needs from each of the three physics Frontiers, and assembled this into a preliminary model
  - Energy Frontier portion alone was a large factor beyond the current computing budget
  - Large data volumes with the HL-LHC require correspondingly large amounts of computing to analyze it
  - Grid-only solution: \$850M ± 200M
  - Using the experiments' estimates of future HPC use reduces this to \$650M ± 150M



### Jim Siegrist May 2018 presentation to HEP Advisory Panel

- Computing capacity for LHC-II off by \$850M compared to original estimates
  - A major factor in the misprojection was the earlier assumption that Moore's Law would continue unabated

### Mission Need doesn't end with Exascale



**HENP:** compute requirements grow exponentially relative to luminosity

**BES Light Sources & CryoEM:** Double-exponential growth of camera data rates (100k FPS)

Cloud-Resolving Climate Models: Kilometer scale climate models still out of reach (~1 SYD in 2010, ~5 SYD in 2020)

What if we are successful in creating AI-driven (no human in the loop) experiments? What kind of data processing would be needed to keep up with that?

### **Architecture Specialization for Science**

Hardware is designed around the algorithms: we can't design effective hardware without applied math.



But what are the right specializations to include?

What is the cost model? (We know we cannot afford to spin our own chips from scratch.)

What is the right partnership/economic model for the future of HPC?

The role of government research is to understand these trade-offs.

### Post Exascale: Heterogeneous Computing Research Directions



#### **Specialization**

Purpose-built machines for big science targets.

**Example**: Google TPU. For DOE, DFT is 25% of workload.

#### **Heterogeneous Integration**

Co-integration of many heterogeneous accelerators.

**Example**: Apple Bionic chip, AWS Graviton2, Project38.

#### **Resource Disaggregation**

Photonic MCMs to enable reconfigurable nodes/systems.

**Example**: Facebook/Google.

DRAM utilization diversity alone in DOE could benefit from this.






### Algorithm-Driven Design of Programmable Hardware Accelerators

**Example: LS3DF/Density Functional Theory (DFT)** 

#### **What:** Design the hardware acceleration around the target algorithm/application

- Purpose-built acceleration
- Science-led reference algorithm design

#### **Why:** Huge opportunities to improve performance density and efficiency

- FFT hardware accelerator 50x-100x faster than GPU (using SPIRAL generator)

#### **How:** Target Density Functional Theory

1. Large fraction of the DOE workload
2. Mature code base and algorithm
3. LS3DF formulation minimizes off-chip communication and scales O(N)





### The DFT kernel for each fragment

Communication Avoiding LS3DF Formulation – Scales O(N)



$$F_{222} + F_{211} + F_{121} + F_{112} - F_{221} - F_{212} - F_{122} - F_{111}$$
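One way to see why this signed combination of fragments is a valid patching scheme is to count how often each grid cell is covered. The sketch below is a toy check, not the LS3DF code: it assumes a periodic grid of m x m x m cells, fragments of 1 or 2 cells per axis anchored at every grid corner, and a sign that is positive when an even number of axes have size 1 (which reproduces the signs in the formula above). The function name is hypothetical.

```python
import itertools

import numpy as np


def signed_coverage(m):
    """Signed count of fragments covering each cell of a periodic m^3 grid."""
    cover = np.zeros((m, m, m), dtype=int)
    for corner in itertools.product(range(m), repeat=3):
        # Fragment sizes (s1, s2, s3), each 1 or 2 cells: F_111 ... F_222.
        for size in itertools.product((1, 2), repeat=3):
            # + for F_222, F_211, F_121, F_112; - for F_221, F_212, F_122, F_111.
            sign = (-1) ** size.count(1)
            for offset in itertools.product(*(range(s) for s in size)):
                cell = tuple((c + o) % m for c, o in zip(corner, offset))
                cover[cell] += sign
    return cover


coverage = signed_coverage(4)
print(coverage.min(), coverage.max())
```

Every cell ends up with a net coverage of exactly one, so the overlapping fragment contributions cancel and the interior of the system is counted once. Each fragment only touches its own small neighborhood, which is consistent with the communication-avoiding, O(N) claim.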

#### **Von-Neumann Instruction Processors vs. Hardware Circuits**

(must redesign for static dataflow and deep flow-through pipelines)

#### Von Neumann CPU



#### Dataflow (FPGA, GraphCore etc.)



**FPGA** (Field Programmable Gate Array): The granularity of operations and wires is a single bit.

**CGRA** (Coarse Grain Reconfigurable Array): Programmability and ALUs at word granularity improve speed and density (Cerebras, GraphCore, SambaNova, LPU).

**ASIC or Chiplet** (custom circuit): Another factor of 10x on density and energy efficiency.

[Figure residue: 3D stencil update of R at time steps t=n-1, t=n, and t=n+1 with coefficient C, drawn as a dataflow circuit; the expression is truncated in the source.]

### **Algorithm Reformulated as Custom Circuit**

#### Von Neumann CPU





#### Dataflow (FPGA, GraphCore etc.)





### Preliminary Performance on CGRA H $\Psi$

#### Eigenvalue Problem



#### Von Neumann CPU or GPU



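```
#include <stdio.h>
#include <unistd.h>
int main(void) {
    int n = 0;
    while (n < 100) {
        printf("n = %d\n", n);
        usleep(200 * 1000); /* pause(200): assumed to mean 200 ms */
        if (n == 50) break;
        n++;                /* assumed increment; the original omits it */
    }
    printf("All done!\n");
    return 0;
}
```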



#### Dataflow (FPGA, GraphCore etc.)





#### Mapping onto Custom Hardware



| Platform                        | Time for Contraction | Speedup over CPU | Speedup over GPU |
|---------------------------------|----------------------|------------------|------------------|
| CPU (Haswell/Cori Phase 1) node | 1.375                | 1                |                  |
| GPU (NVIDIA 1080)               | 0.5                  | 2.75             | 1                |
| CGRA (unoptimized)              | 0.23                 | 6                | 2.2              |
| CGRA (optimized)                | 0.023                | 60               | 21.7             |

Delivered Speedups (compared to optimized code) of "custom" DFT accelerator running on CGRA
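The speedup columns follow directly from the contraction times. A minimal sketch of that arithmetic is below; the time unit is not stated in the source, and the dictionary keys and function name are illustrative.

```python
# Contraction times as reported in the results table (unit not stated).
CONTRACTION_TIME = {
    "CPU (Haswell/Cori Phase 1) node": 1.375,
    "GPU (NVIDIA 1080)": 0.5,
    "CGRA (unoptimized)": 0.23,
    "CGRA (optimized)": 0.023,
}


def speedups(times, cpu_key, gpu_key):
    """Map each platform to (speedup over CPU, speedup over GPU)."""
    return {
        name: (times[cpu_key] / t, times[gpu_key] / t)
        for name, t in times.items()
    }


table = speedups(
    CONTRACTION_TIME, "CPU (Haswell/Cori Phase 1) node", "GPU (NVIDIA 1080)"
)
for name, (over_cpu, over_gpu) in table.items():
    print(f"{name}: {over_cpu:.2f}x over CPU, {over_gpu:.2f}x over GPU")
```

The table rounds these ratios, so the optimized CGRA entry appears as 60x over the CPU and 21.7x over the GPU.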



Thom Popovici, Andrew Canning (FFTx), Zhengji Zhao (NERSC), Franz Franchetti (CMU/FFTx)

## **Heterogeneous Integration**








### Project38: HPC Improvements Through Innovative Architecture

Cross-agency architectural exploration

Project 38 (P38) is a set of vendor-agnostic architectural explorations involving DOD, the DOE Office of Science, and NNSA.

- Near-term goal: Quantify the performance value and identify the potential costs of specific architectural concepts against a limited set of applications of interest to both the DOE and DOD.
- Long-term goal: Develop an enduring capability for DOE and DOD to jointly explore architectural innovations and quantify their value.
- Stretch goal: Specification of a shared, purpose-built architecture to drive future DOE-DOD collaborations and investments (purpose-built HPC by 2025).

#### **Accomplishments**

- Released initial project report through NITRD in 2020 that identifies 8 promising architecture enhancements that can significantly improve application performance.
- Working with Arm, AMD (LBL/ANL/PNNL), and Micron (Sandia/LLNL) to assess feasibility and develop cost models
- ANL evaluating impact of diverse specializations on the programming environment & compiler technologies.

Related effort at LANL: Jason Pruett, "Tailored Computing" (whitepaper forthcoming)

[Figure: spectrum of approaches between COTS and internal design and production: Traditional DOE<sub>ECP</sub>, Aggressive Vendor, Innovative USG.]

### Recapping Key P38 Technology Explorations



- Fixed Function Accelerators & COTS IP (Extreme Heterogeneity)
  - RISC-V and ARM cores
  - Fixed function FFT (Generated by SPIRAL)



- Word Granularity Scratchpad Memory (Gather Scatter):
  - Gather-scatter within processor tile
  - more effective SIMD



- Recoding engine (Efficient programmable FSM & data reorg.)
  - Sub-word granularity and high control irregularity
  - Handles branch-heavy code (avg. 20x improvement over processor core)
  - One lane is 1/100<sup>th</sup> the size of an x86 processor core



- Hardware Message Queues (Lightweight Interprocessor Communication)
  - Gather-scatter between processor tiles
  - Async between tiles to eliminate overhead of barriers

### **Fixed Function Accelerators Design Study**

**Dark Silicon** 

- What if HPC adopted the smartphone SoC strategy: mix fixed-function accelerators with programmable cores?
- Target commonly used scientific primitives/libraries
  - BLAS (level 1,2,3)
  - FFT (FFTW or SPIRAL interface)





### FFT Example With FFTx (Franchetti, Popovici, Canning)



#### For FFT of size N

- Storage = N \* operand size
- Compute = 5/2 \* N \* log2(N) FLOPs
- Use Pseudo-2D algorithm for large FFTs

#### **Single FFT Accelerator Resource**

### **Assumptions: Spiral HW Generator**

- 1GHz @ 14nm technology node
- 2M point transform (data off-chip)
- HPC Challenge Benchmark: Single precision (Float32) complex, out-of-place
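The storage and compute formulas above can be evaluated for the stated benchmark case. This is a sketch under two assumptions that the slide does not spell out: "2M point" is taken as 2^21 points, and a single-precision complex operand is taken as 8 bytes. The function name is illustrative.

```python
import math


def fft_model(n_points, operand_bytes):
    """Storage (bytes) and compute (FLOPs) for one FFT, per the slide's model."""
    storage_bytes = n_points * operand_bytes
    flops = 2.5 * n_points * math.log2(n_points)
    return storage_bytes, flops


N = 2**21           # assumption: "2M point" means 2^21 points
OPERAND_BYTES = 8   # assumption: Float32 complex = 2 x 4 bytes

storage, flops = fft_model(N, OPERAND_BYTES)
print(f"storage = {storage / 2**20:.0f} MiB, compute = {flops / 1e6:.1f} MFLOPs")
```

At this size the transform does not fit in a 16k-point on-chip engine, which is why the data is held off-chip and the off-chip bandwidth limits in the following sections set the achievable rate.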

#### Limit: 100 GB/s off-chip memory

- 16k points on-chip engine
- Analytic model for FP limit ~1.5TFLOPs SP
- 4.5mm<sup>2</sup> area for compute @ 14nm

#### Limit: 1TB/s off-chip memory

- ~10k MADD + ~5k add -> 15k FP@1GHz
- Analytic model for FP limit ~15TFLOPs SP
- 47mm<sup>2</sup> area for compute @ 14nm



### **IP Reuse is Key**

This is the \*real\* power of the ARM ecosystem (it's not just about Arm cores or Cavium)



- Leverage commodity ecosystems
- Get commercially supported IP where there is a market to support it
- Use open-source IP where the government needs to develop technology to serve its needs
- Partner with system integrators & chip vendors for realization of systems

(A new sustainable economic model for HPC)









### **Resource Disaggregation**








### **Diverse Node Configurations for Datacenter Workloads**


#### **Training**

- 8 links: GPU
- 8 links: HBM (weights)
- 8 links: NVRAM
- 1 link: CPU (control)

#### **Data Mining**

- 6 links: HBM
- 15 links: NVRAM (capacity)
- 4 links: CPU (branchy code)

#### **Inference**

- 16 links: TOR (streaming data)
- 8 links: HBM (weights)
- 1 link: CPU

#### **Graph Analytics**

- 16 links: HBM
- 8 links: TOR
- 1 link: CPU
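The four configurations listed above can be written down as link budgets and compared. The sketch below only restates the numbers from the slide; reading them as one fixed per-node link budget that is re-steered per workload is an interpretation, and the names are illustrative.

```python
# Links per resource type for each workload, as listed on the slide.
NODE_CONFIGS = {
    "Training": {"GPU": 8, "HBM": 8, "NVRAM": 8, "CPU": 1},
    "Data Mining": {"HBM": 6, "NVRAM": 15, "CPU": 4},
    "Inference": {"TOR": 16, "HBM": 8, "CPU": 1},
    "Graph Analytics": {"HBM": 16, "TOR": 8, "CPU": 1},
}


def link_totals(configs):
    """Total number of links used by each workload configuration."""
    return {workload: sum(links.values()) for workload, links in configs.items()}


totals = link_totals(NODE_CONFIGS)
for workload, total in totals.items():
    print(f"{workload}: {total} links")
```

All four workloads come out to the same total, which fits the idea that a disaggregated node keeps its link count fixed and changes only what those links connect to.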





[Figure: node built from TOR, NVRAM, GPU, HBM, and CPU components.]

### **Memory Disaggregation**

#### Memory pressure at NERSC, 2018



Fraction of Node Memory Used (%)

About 15% of NERSC workload uses more than 75% of the available memory per node.

And ~25% uses more than 50% of available memory.

But 75% of Haswell job hours (60% of KNL) use < 25% memory

Note: memory use is an overestimate, computed as maxrss x ranks\_per\_node, which assumes memory balance across MPI ranks.
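To illustrate why maxrss x ranks\_per\_node overestimates node memory use when ranks are unbalanced, here is a small sketch. The per-rank values and the 16 GiB node size are made-up numbers for illustration, not NERSC data, and the function names are hypothetical.

```python
GIB = 2**30


def estimated_node_fraction(max_rss_bytes, ranks_per_node, node_memory_bytes):
    """Estimate used in the workload analysis: assumes every rank uses maxrss."""
    return max_rss_bytes * ranks_per_node / node_memory_bytes


def actual_node_fraction(rank_rss_bytes, node_memory_bytes):
    """True fraction when the per-rank resident set sizes are known."""
    return sum(rank_rss_bytes) / node_memory_bytes


# Hypothetical node: 4 ranks, one of which holds most of the data.
rank_rss = [1 * GIB, 1 * GIB, 1 * GIB, 4 * GIB]
node_memory = 16 * GIB

estimate = estimated_node_fraction(max(rank_rss), len(rank_rss), node_memory)
actual = actual_node_fraction(rank_rss, node_memory)
print(f"estimate = {estimate:.0%}, actual = {actual:.0%}")
```

Because the estimate is an upper bound, the reported share of jobs using little memory is, if anything, conservative.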



Brian Austin: NERSC Workload Analysis

### Disaggregated Node/Rack Architecture



Most current disaggregation solutions use interconnect bandwidth (1 – 10 GB/s), but this is significantly inferior to RAM bandwidth (100 GB/s – 1 TB/s).

### Interposers are the right point of intersection where copper pin bandwidth density could match photonics bandwidth density!

- **Good News:** Extend bandwidth density and lower power/bit
- **Bad News:** Limited (~2cm) reach
  - Cannot get outside of the package (but we need to!!!!)







- 5X the bandwidth v. GDDR5
- Up to 16GB
- One-third the footprint
- Half the energy per bit
- Managed memory stack for optimal levels of reliability, availability and serviceability





### Impedance Matching to our Packaging Technology





[Figure: package integration cross-section; labels: microbump, copper pillars, 10 Gbps, package substrate.]







#### **DWDM Using Silicon Photonics**

Ring resonators @ 10 Gigabits/sec per channel. Many channels to get bandwidth density.

Wide and Slow!

#### **Comb Laser Sources**

Single laser to efficiently generate 100s of frequencies.

Wide and Slow!

### Photonic MCM (Multi-Chip Module)












### PINE: Photonic Integrated Networked Energy Efficient Datacenters

Resource Disaggregation to custom-assemble diverse accelerators for diverse workload requirements

1. Energy-bandwidth optimized optical links
2. Embedded silicon photonics into OC-MCMs
3. Bandwidth steering for **Custom Node Connectivity**







### **ENLITENED**





Bergman, Patel, Dennison

**NVIDIA**























### **Economic Models**



### **Neil Thompson: Economics of Post-Moore Electronics**



http://neil-t.com, MIT CSAIL, MIT Sloan School

#### The Top

| Technology  | Software                                                           | Algorithms                                | Hardware architecture                              |
|-------------|--------------------------------------------------------------------|-------------------------------------------|----------------------------------------------------|
| Opportunity | Software performance engineering                                   | New algorithms                            | Hardware streamlining                              |
| Examples    | Removing software bloat<br>Tailoring software to hardware features | New problem domains<br>New machine models | Processor simplification<br>Domain specialization |

#### The Bottom

For example, semiconductor technology

**Papers**

- The Economic Impact of Moore's Law
- There's Plenty of Room at the Top: What will drive computer performance after Moore's Law?
- The Decline of Computers as a General Purpose Technology



### IP Reuse is Key: (IP is the commodity & cost driver)



#### **Neil Thompson**



### Chiplets and Wafer-Scale Integration as path for Heterogeneous Integration







CHIPS modularity targets the enabling of a wide range of custom solutions



### **Industry: Heterogeneous Integration Roadmap**

to

IoE



2019 Edition

http://eps.ieee.org/hir

HPC and Megadatacenters is the 2<sup>nd</sup> chapter



All future applications will be further transformed through the power of AI, VR, and AR.

**Data Centers** 



Everywhere





Die + Heterogeneous

System in Package (SiP)











### **Conclusion**

- In the era of the "universal computer", scale was the correct answer to deliver value to our scientific customers.
- In this post-Moore/post-exascale era, scale is no longer a viable approach to continuing to deliver value to our customers. The answer must be differentiation and targeted specialization.
- Scale demanded we focus on capital costs. The new era must increase focus on development costs to meet the demands of science.
- The "cloud" does not mitigate this outcome.



### **Project 38 -- Background**

DOD and DOE recognize the imperative to develop new mechanisms for engagement with the vendor community, particularly on architectural innovations with strategic value to USG HPC.

Project 38 (P38) is a set of vendor-agnostic architectural explorations involving DOD, the DOE Office of Science, and NNSA (these latter two organizations are referred to in this document as "DOE"). These explorations should accomplish the following:

- Near-term goal: Quantify the performance value and identify the potential costs of specific architectural concepts against a limited set of applications of interest to both the DOE and DOD.
- Long-term goal: Develop an enduring capability for DOE and DOD to jointly explore architectural innovations and quantify their value.
- **Stretch goal:** Specification of a shared, purpose-built architecture to drive future DOE-DOD collaborations and investments (purpose-built HPC by 2025).

