# **Supercomputing in Plain English** Part X: GPGPU: Number Crunching Inside Your GPU

### Henry Neeman, Director

OU Supercomputing Center for Education & Research University of Oklahoma Information Technology Tuesday April 28 2009





# This is an experiment!

#### It's the nature of these kinds of videoconferences that FAILURES ARE GUARANTEED TO HAPPEN! NO PROMISES!

- So, please bear with us. Hopefully everything will work out well enough.
- If you lose your connection, you can retry the same kind of connection, or try connecting another way.
- Remember, if all else fails, you always have the toll free phone bridge to fall back on.







This week's Access Grid (AG) venue: Cactus.

If you aren't sure whether you have AG, you probably don't.

| Tue Apr 28 | Cactus |
|------------|--------|
| Tue May 5  | Titan  |

Many thanks to John Chapman of U Arkansas for setting these up for us.





If you want to use H.323 videoconferencing – for example, Polycom – then dial

#### 69.77.7.203##12345

any time after 2:00pm. Please connect early, at least today.

For assistance, contact Andy Fleming of <u>KanREN</u>/Kan-ed (<u>afleming@kanren.net</u> or 785-230-2513).

KanREN/Kan-ed's H.323 system can handle up to 40 simultaneous H.323 connections. If you cannot connect, it may be that all 40 are already in use.

Many thanks to Andy and KanREN/Kan-ed for providing H.323 access.





We have unlimited simultaneous iLinc connections available.

- If you're already on the SiPE e-mail list, then you should already have an e-mail about iLinc. Your personal URL will always be the same.
- If you want to use iLinc, please follow the directions in the iLinc e-mail.
- For iLinc, you <u>MUST</u> use either Windows (XP strongly preferred) or MacOS X with Internet Explorer.
- To use iLinc, you'll need to download a client program to your PC. It's free, and setup should take only a few minutes.
- Many thanks to Katherine Kantardjieff of California State U Fullerton for providing the iLinc licenses.





### **QuickTime Broadcaster**

If you cannot connect via the Access Grid, H.323 or iLinc, then you can connect via QuickTime:

#### rtsp://129.15.254.141/test\_hpc09.sdp

We recommend using QuickTime Player for this, because we've tested it successfully.

We recommend upgrading to the latest version at:

http://www.apple.com/quicktime/

When you run QuickTime Player, traverse the menus

File -> Open URL

Then paste in the rstp URL into the textbox, and click OK. Many thanks to Kevin Blake of OU for setting up QuickTime Broadcaster for us.





## **Phone Bridge**

If all else fails, you can call into our toll free phone bridge:

1-866-285-7778, access code 6483137#

Please mute yourself and use the phone to listen.

Don't worry, we'll call out slide numbers as we go.

- Please use the phone bridge <u>ONLY</u> if you cannot connect any other way: the phone bridge is charged per connection per minute, so our preference is to minimize the number of connections.
- Many thanks to Amy Apon and U Arkansas for providing the toll free phone bridge.





No matter how you connect, please mute yourself, so that we cannot hear you.

- At OU, we will turn off the sound on all conferencing technologies.
- That way, we won't have problems with echo cancellation.
- Of course, that means we cannot hear questions.
- So for questions, you'll need to send some kind of text.

#### Also, if you're on iLinc: **SIT ON YOUR HANDS! Please DON'T touch ANYTHING!**





# **Questions via Text: iLinc or E-mail**

Ask questions via text, using one of the following:

- iLinc's text messaging facility;
- e-mail to <u>sipe2009@gmail.com</u>.

All questions will be read out loud and then answered out loud.





# **Thanks for helping!**

- OSCER operations staff (Brandon George, Dave Akin, Brett Zimmerman, Josh Alexander)
- OU Research Campus staff (Patrick Calhoun, Josh Maxey, Gabe Wingfield)
- Kevin Blake, OU IT (videographer)
- Katherine Kantardjieff, CSU Fullerton
- John Chapman and Amy Apon, U Arkansas
- Andy Fleming, KanREN/Kan-ed
- This material is based upon work supported by the National Science Foundation under Grant No. OCI-0636427, "CI-TEAM Demonstration: Cyberinfrastructure Education for Bioinformatics and Beyond."





# This is an experiment!

#### It's the nature of these kinds of videoconferences that FAILURES ARE GUARANTEED TO HAPPEN! NO PROMISES!

- So, please bear with us. Hopefully everything will work out well enough.
- If you lose your connection, you can retry the same kind of connection, or try connecting another way.
- Remember, if all else fails, you always have the toll free phone bridge to fall back on.





# **Supercomputing Exercises**

Want to do the "Supercomputing in Plain English" exercises?

• The first several exercises are already posted at:

http://www.oscer.ou.edu/education.php

 If you don't yet have a supercomputer account, you can get a temporary account, just for the "Supercomputing in Plain English" exercises, by sending e-mail to:

hneeman@ou.edu

Please note that this account is for doing the <u>exercises only</u>, and will be shut down at the end of the series.



# **OK Supercomputing Symposium 2009**



2003 Keynote: Peter Freeman NSF Computer & Information Science & Engineering Assistant Director

2009 Keynote: **Ed Seidel** Director **NSF** Office of Cyberinfrastructure





2004 Keynote: Sangtae Kim **NSF** Shared Cyberinfrastructure **Division** Director





2005 Keynote: Walt Brooks NASA Advanced Supercomputing **Division Director** 



2006 Keynote: Dan Atkins Head of NSF's Office of Cvberinfrastructure



2007 Keynote: Jay Boisseau Director Texas Advanced **Computing Center** U. Texas Austin



2008 Keynote: José Munoz **Deputy Office** Director/ Senior Scientific Advisor Office of Cyberinfrastructure National Science Foundation

is **OPEN**!

FREE! Wed Oct 7 2009 @ OU http://symposium2009.oscer.ou.edu/

**Registration Parallel Programming Workshop** FREE! Tue Oct 6 2009 @ OU **Sponsored by SC09 Education Program** FREE! Symposium Wed Oct 7 2009 @ OU Supercomputing in Plain English: GPGPU

Tuesday April 28 2009



# **SC09 Summer Workshops**

This coming summer, the SC09 Education Program, part of the SC09 (Supercomputing 2009) conference, is planning to hold two weeklong supercomputing-related workshops in Oklahoma, for **FREE** (except you pay your own transport):

- <u>At OSU Sun May 17 the May 23</u>: <u>FREE</u> Computational Chemistry for Chemistry Educators (2010 TENTATIVE: Computational Biology)
- <u>At OU Sun Aug 9 Sat Aug 15</u>:
   <u>FREE</u> Parallel Programming & Cluster Computing
- We'll alert everyone when the details have been ironed out and the registration webpage opens.
- Please note that you must <u>apply</u> for a seat, and acceptance <u>CANNOT</u> be guaranteed.





## **SC09 Summer Workshops**

- 1. May 17-23: Oklahoma State U: Computational Chemistry
- 2. May 25-30: Calvin Coll (MI): Intro to Computational Thinking
- 3. June 7-13: U Cal Merced: Computational Biology
- 4. June 7-13: Kean U (NJ): Parallel Progrmg & Cluster Comp
- 5. June 14-20: Widener U (PA): Computational Physics
- 6. July 5-11: Atlanta U Ctr: Intro to Computational Thinking
- 7. July 5-11: Louisiana State U: Parallel Progrmg & Cluster Comp
- 8. July 12-18: U Florida: Computational Thinking Grades 6-12
- 9. July 12-18: Ohio Supercomp Ctr: Computational Engineering
- 10. Aug 2-8: U Arkansas: Intro to Computational Thinking
- 11. Aug 9-15: U Oklahoma: Parallel Progrmg & Cluster Comp





#### Outline

- What is GPGPU?
- GPU Programming
- Digging Deeper: CUDA on NVIDIA
- CUDA Thread Hierarchy and Memory Hierarchy
- CUDA Example: Matrix-Matrix Multiply





# What is GPGPU?



#### Accelerators

No, not this ....



http://gizmodo.com/5032891/nissans-eco-gas-pedal-fights-back-to-help-you-save-gas





- In HPC, an accelerator is hardware component whose role is to speed up some aspect of the computing workload.
- In the olden days (1980s), supercomputers sometimes had <u>array processors</u>, which did vector operations on arrays, and PCs sometimes had <u>floating point accelerators</u>: little chips that did the floating point calculations in hardware rather than software.
- More recently, *Field Programmable Gate Arrays* (FPGAs) allow reprogramming deep into the hardware.





### Why Accelerators are Good

Accelerators are good because:

• they make your code run faster.





## Why Accelerators are Bad

Accelerators are bad because:

- they're expensive;
- they're hard to program;
- your code on them isn't portable to other accelerators, so the labor you invest in programming them has a very short half-life.





### The King of the Accelerators

#### The undisputed champion of accelerators is:

#### the graphics processing unit.

http://www.amd.com/us-en/assets/content\_type/DigitalMedia/46928a\_01\_ATI-FirePro\_V8700\_angled\_low\_res.gif

http://images.nvidia.com/products/quadro\_fx\_5800/Quadro\_FX5800\_low\_3qtr.png



http://www.gamecyte.com/wp-content/uploads/2009/01/ibm-sony-toshiba-cell.jpg





# Why GPU?

- Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks like image rendering.
- They became very very popular with videogamers, because they've produced better and better images, and lightning fast.
- And, prices have been extremely good, ranging from three figures at the low end to four figures at the high end.





### **GPUs are Popular**

- Chips are expensive to design (hundreds of millions of \$\$\$), expensive to build the factory for (billions of \$\$\$), but cheap to produce.
- In 2006 2007, GPUs sold at a rate of about 80 million cards per year, generating about \$20 billion per year in revenue.

http://www.xbitlabs.com/news/video/display/20080404234228\_Shipments\_of\_Discrete\_Graphi cs\_Cards\_on\_the\_Rise\_but\_Prices\_Down\_Jon\_Peddie\_Research.html

• This means that the GPU companies have been able to recoup the huge fix costs.





## **GPU Do Arithmetic**

- GPUs mostly do stuff like rendering images.
- This is done through mostly floating point arithmetic the same stuff people use supercomputing for!





# **GPU Programming**



# Hard to Program?

- In the olden days that is, until just the last few years programming GPUs meant either:
  - using a graphics standard like OpenGL (which is mostly meant for rendering), or
  - getting fairly deep into the graphics rendering pipeline.
- To use a GPU to do general purpose number crunching, you had to make your number crunching pretend to be graphics.
- This was hard. So most people didn't bother.





More recently, GPU manufacturers have worked hard to make GPUs easier to use for general purpose computing.

This is known as *General Purpose Graphics Processing Units*.





- Proprietary programming language or extensions
  - NVIDIA: CUDA (C/C++)
  - AMD/ATI: StreamSDK/Brook+ (C/C++)
- OpenCL (Open Computing Language): an industry standard for doing number crunching on GPUs.
- Portland Group Fortran and C compilers with accelerator directives.





# NVIDIA CUDA

- NVIDIA proprietary
- Formerly known as "Compute Unified Device Architecture"
- Extensions to C to allow better control of GPU capabilities
- Modest extensions but major rewriting of the code
- No Fortran version available





#### **CUDA Example Part 1**

```
| |
```

```
#include "stdafx.h"
```

```
#include <stdio.h>
#include <cuda.h>
```

```
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx<N) a[idx] = a[idx] * a[idx];
}</pre>
```

http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/





### **CUDA Example Part 2**

// main routine that executes on the host
int main(void)

```
float *a h, *a d; // Pointer to host & device arrays
const int N = 10; // Number of elements in arrays
size t size = N * sizeof(float);
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a h[i] = (float)i;
cudaMemcpy(a d, a h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block size = 4;
int n blocks = N/block size + (N%block size == 0 ? 0:1);
square array <<< n blocks, block size >>> (a d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a h, a d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a h[i]);</pre>
// Cleanup
free(a h); cudaFree(a d);
```





### **AMD/ATI Brook+**

- AMD/ATI proprietary
- Formerly known as "Close to Metal" (CTM)
- Extensions to C to allow better control of GPU capabilities
- No Fortran version available





{

#### **Brook+ Example Part 1**

float4 matmult\_kernel (int y, int x, int k,
 float4 M0[], float4 M1[])

```
float4 total = 0;
for (int c = 0; c < k / 4; c++)
{
    total += M0[y][c] * M1[x][c];
}
return total;</pre>
```

http://developer.amd.com/gpu\_assets/Stream\_Computing\_Overview.pdf





#### **Brook+ Example Part 2**

```
void matmult (float4 A[], float4 B'[], float4 C[])
ł
    for (int i = 0; i < n; i++)
    ł
        for (j = 0; j < m / 4; j+)
            launch_thread{
                C[i][j] =
                    matmult_kernel(j, i, k, A, B');}
    sync_threads{}
```





# **OpenCL**

- Open Computing Language
- Open standard developed by the Khronos Group, which is a consortium of many companies (including NVIDIA, AMD and Intel, but also lots of others)
- Initial version of OpenCL standard released in Dec 2008.
- Many companies will create their own implementations.
- Apple expects to be first to market, with an OpenCL implementation included in Mac OS X v10.6 ("Snow Leopard"), expected in 2009.





```
// create a compute context with GPU device
context = clCreateContextFromType(0, CL DEVICE TYPE GPU, NULL, NULL,
   NULL);
// create a work-queue
queue = clCreateWorkOueue(context, NULL, NULL, 0);
// allocate the buffer memory objects
memobjs[0] =
    clCreateBuffer(context,
                   CL MEM READ ONLY | CL MEM COPY HOST PTR,
                   sizeof(float)*2*num entries, srcA);
memobjs[1] =
    clCreateBuffer(context, CL_MEM_READ_WRITE,
                   sizeof(float)*2*num entries, NULL);
// create the compute program
program =
    clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL);
// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);
// create the compute kernel
kernel = clCreateKernel(program, "fft1D 1024");
```









```
// This kernel computes FFT of length 1024. The 1024 length FFT
// is decomposed into calls to a radix 16 function, another
// radix 16 function and then a radix 4 function
kernel void fft1D 1024 (
    global float2 *in, global float2 *out,
    local float *sMemx, local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get group id(0) * 1024 + tid;
    float2 data[16];
    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;
    globalLoads(data, in, 64); // coalesced global reads
```





fftRadix16Pass(data); // in-place radix-16 pass twiddleFactorMul(data, tid, 1024, 0); // local shuffle using local memory localShuffle(data, sMemx, sMemy, tid, (((tid & 15) \* 65) + (tid >> 4))); fftRadix16Pass(data); // in-place radix-16 pass twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) \* 64) + (tid & 15))); // four radix-4 function calls fftRadix4Pass(data); fftRadix4Pass(data + 4); fftRadix4Pass(data + 8); fftRadix4Pass(data + 12); // coalesced global writes globalStores(data, out, 64);



#### **Portland Group Accelerator Directives**

- Proprietary directives in Fortran and C
- Similar to OpenMP in structure
- Currently in beta release
- If the compiler doesn't understand these directives, it ignores them, so the same code can work with an accelerator or without, and with the PGI compilers or other compilers.
- In principle, this will be able to work on a variety of accelerators, but the first instance will be NVIDIA; PGI recently announced a deal with AMD/ATI.
- The directives tell the compiler what parts of the code happen in the accelerator; the rest happens in the regular hardware.





#### **PGI Accelerator Example**

!\$acc region do k = 1, n1do i = 1, n3c(i,k) = 0.0do j = 1, n2c(i,k) = c(i,k) +a(i,j) \* b(j,k) δ enddo enddo enddo !\$acc end region http://www.pgroup.com/resources/accel.htm





#### **Digging Deeper: CUDA on NVIDIA**



#### **NVIDIA Tesla**

- NVIDIA now offers a GPU platform named Tesla.
- It consists of their highest end graphics card, minus the video out connector.
- This cuts the cost of the GPU card roughly in half: Quadro FX 5800 is ~\$3000, Tesla C1060 is ~\$1500.





#### **NVIDIA Tesla C1060 Card Specs**

- 240 GPU cores
- 1.296 GHz



- Single precision floating point performance: 933 GFLOPs (3 single precision flops per clock per core)
- Double precision floating point performance: 78 GFLOPs (0.25 double precision flops per clock per core)
- Internal RAM: 4 GB
- Internal RAM speed: 102 GB/sec (compared 21-25 GB/sec for regular RAM)
- Has to be plugged into a PCIe slot (at most 8 GB/sec)





#### **NVIDIA Tesla S1070 Server Specs**

- 4 C1060 cards inside a 1U server (looks like a Sooner node)
- Available in both 1.296 GHz and 1.44 GHz
- Single Precision (SP) floating point performance:
   3732 GFLOPs (1.296 GHz) or 4147 GFLOPs (1.44 GHz)
- Double Precision (DP) floating point performance:
   311 GFLOPs (1.296 GHz) or 345 GFLOPs (1.44 GHz)
- Internal RAM: 16 GB total (4 GB per GPU card)
- Internal RAM speed: 408 GB/sec aggregate
- Has to be plugged into two PCIe slots (at most 16 GB/sec)





#### Compare x86 vs S1070

Let's compare the best dual socket x86 server today vs S1070.

|                                | Dual socket, Intel<br>2.66 hex core | NVIDIA Tesla S1070                             |  |
|--------------------------------|-------------------------------------|------------------------------------------------|--|
| Peak DP FLOPs                  | 128 GFLOPs DP                       | 345 GFLOPs DP (2.7x)                           |  |
| Peak SP FLOPS                  | 256 GFLOPs SP                       | 4147 GFLOPs SP (16.2x)                         |  |
| Peak RAM BW                    | 17 GB/sec                           | 408 GB/sec (24x)                               |  |
| Peak PCIe BW                   | N/A                                 | 16 GB/sec                                      |  |
| Needs x86 server to attach to? | No                                  | Yes                                            |  |
| Power/Heat                     | ~400 W                              | $\sim 800 \text{ W} + \sim 400 \text{ W} (3x)$ |  |
| Code portable?                 | Yes                                 | No (CUDA)                                      |  |
|                                |                                     | Yes (PGI, OpenCL)                              |  |





#### Compare x86 vs S1070

Here are some interesting measures:

|                       | Dual socket, Intel<br>2.66 hex core | NVIDIA Tesla S1070          |
|-----------------------|-------------------------------------|-----------------------------|
| DP GFLOPs/Watt        | ~0.3 GFLOPs/Watt                    | ~0.3 GFLOPs/Watt (same)     |
| SP GFLOPS/Watt        | 0.64 GFLOPs/Watt                    | ~3.5 GFLOPs (~5x)           |
| DP GFLOPs/sq ft       | ~340 GFLOPs/sq ft                   | ~460 GFLOPs/sq ft (1.3x)    |
| SP GFLOPs/sq ft       | ~680 GFLOPs/sq ft                   | ~5500 GFLOPs/sq ft (8x)     |
| Racks per PFLOP<br>DP | 244 racks/PFLOP<br>DP               | 181 racks/PFLOP (3/4)<br>DP |
| Racks per PFLOP<br>SP | 122 racks/PFLOP<br>SP               | 15 racks/PFLOP (1/8)<br>SP  |

OU's Sooner is 65 TFLOPs SP, which is <u>1 rack</u> of S1070.





#### What Are the Downsides?

- You have to rewrite your code into CUDA or OpenCL or PGI accelerator directives.
  - CUDA: Proprietary, C/C++ only
  - OpenCL: portable but cumbersome
  - PGI accelerator directives: not clear whether you can have most of the code live inside the GPUs.





#### **Programming for Performance**

The biggest single performance bottleneck on GPU cards today is the PCIe slot:

- PCIe 2.0 x16: 8 GB/sec
- 1600 MHz Front Side Bus: 25 GB/sec
- GDDR3 GPU card RAM: 102 GB/sec per card

Your goal:

- At startup, move the data from x86 server RAM into GPU RAM.
- Do almost all the work inside the GPU.
- Use the x86 server only for I/O and message passing, to minimize the amount of data moved through the PCIe slot.





#### **Does CUDA Help?**

| Example Applications            | URL                                            | Speedup     |
|---------------------------------|------------------------------------------------|-------------|
| Seismic Database                | http://www.headwave.com                        | 66x – 100x  |
| Mobile Phone Antenna Simulation | http://www.accelware.com                       | 45x         |
| Molecular Dynamics              | http://www.ks.uiuc.edu/Research/vmd            | 21x – 100x  |
| Neuron Simulation               | http://www.evolvedmachines.com                 | 100x        |
| MRIProcessing                   | http://bic-test.beckman.uiuc.edu               | 245x – 415x |
| Atmospheric Cloud Simulation    | http://www.cs.clemson.edu/~jesteel/clouds.html | 50x         |

http://www.nvidia.com/object/IO\_43499.html





### **CUDA Thread Hierarchy and Memory Hierarchy**

Some of these slides provided by Paul Gray, University of Northern Iowa



#### **CPU vs GPU Layout**



#### Source: Nvidia CUDA Programming Guide





#### **Buzzword: Kernel**

In CUDA, a *kernel* is code (typically a function) that can be run inside the GPU.

Typically, the kernel code operates in lock-step on the stream processors inside the GPU.





#### **Buzzword: Thread**

In CUDA, a *thread* is an execution of a kernel with a given index.

Each thread uses its index to access a specific subset of the elements of a target array, such that the collection of all threads cooperatively processes the entire data set.

So these are very much like threads in the OpenMP or pthreads sense – they even have shared variables and private variables.





#### **Buzzword: Block**

In CUDA, a *block* is a group of threads.

- Just like OpenMP threads, these could execute concurrently or independently, and in no particular order.





#### **Buzzword: Grid**

In CUDA, a *grid* is a group of (thread) blocks, with no synchronization at all among the blocks.





#### **NVIDIA GPU Hierarchy**

- <u>Grids</u> map to GPUs
- <u>Blocks</u> map to the MultiProcessors (MP)
  - Blocks are never split across MPs, but an MP can have multiple blocks
- <u>Threads</u> map to Stream Processors (SP)
- <u>Warps</u> are groups of (32) threads that execute simultaneously

Image Source: Nvidia CUDA Programming Guide







#### **CUDA Built-in Variables**

- **blockIdx.x, blockIdx.y, blockIdx.z** are built-in variables that returns the block ID in the x-axis, y-axis and z-axis of the block that is executing the given block of code.
- threadIdx.x, threadIdx.y, threadidx.z are built-in variables that return the thread ID in the x-axis, y-axis and z-axis of the thread that is being executed by this stream processor in this particular block.
- So, you can express your collection of blocks, and your collection of threads within a block, as a 1D array, a 2D array or a 3D array.

These can be helpful when thinking of your data as 2D or 3D.





#### \_global\_\_\_ Keyword

- In CUDA, if a function is declared with the **\_\_\_global\_\_\_** keyword, that means that it's intended to be executed inside the GPU.
- In CUDA, the term for the GPU is <u>*device*</u>, and the term for the x86 server is <u>*host*</u>.
- So, a kernel runs on a device, while the main function and so on run on the host.
- Note that a host can play host to multiple devices; for example, an S1070 server contains 4 C1060 GPU cards, and if a single host has two PCIe slots, then both of the PCIe plugs of the S1070 can be plugged into that same host.





#### **Copying Data from Host to Device**

- If data need to move from the host (where presumably the data are initially input or generated), then a copy has to exist in both places.
- Typically, what's copied are arrays, though of course you can also copy a scalar (the address of which is treated as an array of length 1).





#### **CUDA Memory Hierarchy #1**

- CUDA has a hierarchy of several kinds of memory:
- Host memory (x86 server)
- Device memory (GPU)
  - <u>Global</u>: visible to all threads in all blocks – largest, slowest
  - <u>Shared</u>: visible to all threads in a particular block – medium size, medium speed
  - *Local*: visible only to a particular thread smallest, fastest







#### **CUDA Memory Hierarchy #2**

## CUDA has a hierarchy of several kinds of memory:

- Host memory (x86 server)
- Device memory (GPU)
  - <u>Constant</u>: visible to all threads in all blocks; read only
  - <u>Texture</u>: visible to all threads in all blocks; read only







#### **CUDA Example: Matrix-Matrix Multiply**



#### **Matrix-Matrix Multiply Main Part 1**

float\* host\_B;
float\* host\_B;
float\* device\_A;
float\* device\_B;
float\* device\_C;
host\_A = (float\*) malloc(mem\_size\_A);

float\* host A;

host\_B = (float\*) malloc(mem\_size\_B); host C = (float\*) malloc(mem\_size\_C);

```
cudaMalloc((void**) &device_A, mem_size_A);
cudaMalloc((void**) &device_B, mem_size_B);
cudamalloc((void**) &device_C, mem_size_C);
```

 $\ensuremath{{\prime}}\xspace$  // Set up the initial values of A and B here.

// Henry says: I've oversimplified this a bit from
// the original example code.





#### **Matrix-Matrix Multiply Main Part 2**



## Q

#### **Matrix Matrix Multiply Kernel Part 1**

**\_global\_\_\_** void matrixMul( float\* C, float\* A, float\* B, int wA, int wB)

```
// Block index
int bx = blockIdx.x;
int by = blockIdx.v;
// Thread index
int tx = threadIdx.xi
int ty = threadIdx.y;
// Index of the first sub-matrix of A processed by the block
int aBegin = wA * BLOCK SIZE * by;
// Index of the last sub-matrix of A processed by the block
           = aBegin + wA - 1;
int aEnd
// Step size used to iterate through the sub-matrices of A
int aStep = BLOCK SIZE;
// Index of the first sub-matrix of B processed by the block
int bBegin = BLOCK SIZE * bx;
// Step size used to iterate through the sub-matrices of B
int bStep = BLOCK SIZE * wB;
// Csub is used to store the element of the block sub-matrix
// that is computed by the thread
float Csub = 0;
```



#### **Matrix Matrix Multiply Kernel Part 2**

```
// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
         a \leq a \in a \in d;
         a += aStep, b += bStep) {
    // Declaration of the shared memory array As used to
    // store the sub-matrix of A
   shared float As[BLOCK SIZE][BLOCK SIZE];
    // Declaration of the shared memory array Bs used to
    // store the sub-matrix of B
    shared float Bs[BLOCK SIZE][BLOCK SIZE];
    // Load the matrices from device memory
    // to shared memory; each thread loads
    // one element of each matrix
   AS(ty, tx) = A[a + wA * ty + tx];
   BS(ty, tx) = B[b + wB * ty + tx];
    // Synchronize to make sure the matrices are loaded
    syncthreads();
```



#### Matrix Matrix Multiply Kernel Part 3

```
// Multiply the two matrices together;
// each thread computes one element
// of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);
```

// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
\_\_\_\_syncthreads();

```
// Write the block sub-matrix to device memory;
// each thread writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;
```



}



We wouldn't really do matrix-matrix multiply this way.

- NVIDIA has developed a CUDA implementation of the BLAS libraries, which include a highly tuned matrix-matrix multiply routine.
- (We'll learn about BLAS next time.)
- There's also a CUDA FFT library, if your code needs Fast Fourier Transforms.



#### **But What If I Have a Fortran Code?**

Here are your options for Fortran:

- Rewrite part or all of your code in C or C++.
- Use the PGI accelerator directives.



## **OK Supercomputing Symposium 2009**



2003 Keynote: Peter Freeman NSF Computer & Information Science & Engineering Assistant Director

2009 Keynote: **Ed Seidel** Director **NSF** Office of Cyberinfrastructure





2004 Keynote: Sangtae Kim **NSF** Shared Cyberinfrastructure **Division** Director





2005 Keynote: Walt Brooks NASA Advanced Supercomputing **Division Director** 



2006 Keynote: Dan Atkins Head of NSF's Office of Cvberinfrastructure



2007 Keynote: Jay Boisseau Director Texas Advanced **Computing Center** U. Texas Austin



2008 Keynote: José Munoz **Deputy Office** Director/ Senior Scientific Advisor Office of Cyberinfrastructure National Science Foundation

is **OPEN**!

FREE! Wed Oct 7 2009 @ OU http://symposium2009.oscer.ou.edu/

**Registration Parallel Programming Workshop** FREE! Tue Oct 6 2009 @ OU **Sponsored by SC09 Education Program** FREE! Symposium Wed Oct 7 2009 @ OU Supercomputing in Plain English: GPGPU

Tuesday April 28 2009



#### **SC09 Summer Workshops**

This coming summer, the SC09 Education Program, part of the SC09 (Supercomputing 2009) conference, is planning to hold two weeklong supercomputing-related workshops in Oklahoma, for **FREE** (except you pay your own transport):

- <u>At OSU Sun May 17 the May 23</u>: <u>FREE</u> Computational Chemistry for Chemistry Educators (2010 TENTATIVE: Computational Biology)
- <u>At OU Sun Aug 9 Sat Aug 15</u>: <u>FREE</u> Parallel Programming & Cluster Computing
- We'll alert everyone when the details have been ironed out and the registration webpage opens.
- Please note that you must <u>apply</u> for a seat, and acceptance <u>CANNOT</u> be guaranteed.





#### **SC09 Summer Workshops**

- 1. May 17-23: Oklahoma State U: Computational Chemistry
- 2. May 25-30: Calvin Coll (MI): Intro to Computational Thinking
- June 7-13: U Cal Merced: Computational Biology
- 4. June 7-13: Kean U (NJ): Parallel, Distributed & Grid
- 5. June 14-20: Widener U (PA): Computational Physics
- 6. July 5-11: Atlanta U Ctr: Intro to Computational Thinking
- 7. July 5-11: Louisiana State U: Parallel, Distributed & Grid
- 8. July 12-18: U Florida: Computational Thinking Pre-college
- 9. July 12-18: Ohio Supercomp Ctr: Computational Engineering
- 10. Aug 2-8: U Arkansas: Intro to Computational Thinking
- 11. Aug 9-15: U Oklahoma: Parallel, Distributed & Grid





#### **To Learn More Supercomputing**

http://www.oscer.ou.edu/education.php





#### **Thanks for helping!**

- OSCER operations staff (Brandon George, Dave Akin, Brett Zimmerman, Josh Alexander)
- OU Research Campus staff (Patrick Calhoun, Josh Maxey, Gabe Wingfield)
- Kevin Blake, OU IT (videographer)
- Katherine Kantardjieff, CSU Fullerton
- John Chapman and Amy Apon, U Arkansas
- Andy Fleming, KanREN/Kan-ed
- This material is based upon work supported by the National Science Foundation under Grant No. OCI-0636427, "CI-TEAM Demonstration: Cyberinfrastructure Education for Bioinformatics and Beyond."





# Thanks for your attention!

