
Parallel Computing with GPUs
Dr Paul Richmond

http://paulrichmond.shef.ac.uk/teaching/COM4521/

Context and Hardware Trends

Supercomputing

Software and Parallel Computing

Course Outline

Context of course

[Chart: peak performance of 1 CPU core (~40 GigaFLOPS) vs a GPU with 4992 cores (8.74 TeraFLOPS), on a scale of 0 to 10 TFlops]

6 hours CPU time vs. 1 minute GPU time

Scale of Performance

[Diagram: Serial Computing (1 core), Parallel Computing (16 cores), Accelerated Computing (4992 GPU cores), Accelerated Workstation (4x 4992 GPU cores + 16 CPU cores)]

Scale of Performance: Titan Supercomputer

[Diagram: physical scale comparison with lengths of 1.8m, 28m, 650m and 2.6km]

Transistors != performance

Moore's Law: a doubling of transistors every couple of years
Not a law, actually an observation

Doesn't actually say anything about performance

Dennard Scaling

“As transistors get smaller their power density stays constant”

Power ∝ Frequency × Voltage² (see the rough sketch below)

Performance improvements for CPUs traditionally realised by increasing frequency

Decrease voltage to maintain a steady power
Only works so far: voltage cannot keep scaling down

Otherwise, increasing frequency increases power
Disastrous implications for cooling
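As a rough illustration of the relation above (a sketch with hypothetical numbers, not from the slides), the following C snippet shows why frequency increases needed matching voltage decreases to keep power flat:

/* Rough sketch with hypothetical numbers: dynamic power scales roughly with
 * frequency x voltage^2, so doubling frequency at a fixed voltage doubles
 * power, while also lowering the voltage keeps power roughly constant. */
#include <stdio.h>

static double relative_power(double freq_ghz, double voltage_v)
{
    return freq_ghz * voltage_v * voltage_v;
}

int main(void)
{
    printf("1.0 GHz @ 1.20 V -> %.2f (baseline)\n", relative_power(1.0, 1.20));
    printf("2.0 GHz @ 1.20 V -> %.2f (double the power)\n", relative_power(2.0, 1.20));
    printf("2.0 GHz @ 0.85 V -> %.2f (voltage scaled, power roughly flat)\n", relative_power(2.0, 0.85));
    return 0;
}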

Instruction Level Parallelism

Transistors used to build more complex architectures

Use pipelining to overlap instruction execution

[Diagram: five instructions progressing through a five-stage pipeline (IF, ID, EX, MEM, WB) over successive cycles, so their execution overlaps]

Instruction Level Parallelism

Dependent instructions limit how well the pipeline can be filled (see the sketch below), e.g.
add 1 to R1
copy R1 to R2

[Diagram: the copy must wait for the add to write R1, leaving wasted (stalled) cycles in the pipeline]
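To make that dependency concrete, here is a minimal C sketch (my illustration, not from the slides); the second statement cannot start until the first has produced its result, which is exactly what stalls a pipelined processor:

/* Illustration only: a read-after-write dependency like the R1/R2 example
 * above. The second statement needs the result of the first, so a pipelined
 * CPU cannot fully overlap their execution and must stall. */
int dependent(int r1)
{
    r1 = r1 + 1;   /* add 1 to R1 */
    int r2 = r1;   /* copy R1 to R2: waits on the add's result */
    return r2;
}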

Golden Era of Performance

The 90s saw great improvements to single CPU performance
1980s to 2002: 100% performance increase every 2 years

2002 to now: ~40% every 2 years

Adapting to Thrive in a New Economy of Memory Abundance, K Bresniker et al.

Why More Cores?

Use extra transistors for multi/many core parallelism
More operations per clock cycle

Power can be kept low

Processor designs can be simple – shorter pipelines (RISC)

GPUs and Many Core Designs

Take the idea of multiple cores to the extreme (many cores)

Dedicate more die space to compute
At the expense of branch prediction, out of order execution, etc.

Simple, Lower Power and Highly Parallel
Very effective for HPC applications

From GTC 2017 Keynote Talk, NVIDIA CEO Jensen Huang

Accelerators

Problem: we still require an OS, I/O and scheduling

Solution: a "Hybrid System"
The CPU provides management, and "accelerators" (or co-processors) such as GPUs provide the compute power

[Diagram: CPU with DRAM and I/O, connected over PCIe to a GPU/accelerator with GDRAM and its own I/O]

Types of Accelerator

GPUs
Emerged from 3D graphics but now specialised for HPC

Readily available in workstations

Xeon Phis
Many Integrated Cores (MIC) architecture

Based on Pentium 4 design (x86) with wide vector units

Closer to traditional multicore

Simpler programming and compilation

Context and Hardware Trends

Supercomputing

Software and Parallel Computing

Course Outline

Top Supercomputers

[Chart: Top500 performance from 1993 to 2020 on an exponential scale (100 MFlops to 1 EFlops), plotting the top supercomputer and the number 500 entry on the list; a Volta V100 (15 TFLOPS SP) is marked for comparison]

Supercomputing Observations

Exascale computing
1 Exaflop = 1M Teraflops

Estimated for 2020

Pace of change
A desktop GPU today matches the top supercomputer of 2002

A desktop with a GPU would be in Top 500 in 2008

A Teraflop of performance took 1MW in 2000

Extrapolating the trend
Current gen Top500 on every desktop in < 10 years

Supercomputing Observations

https://www.nextplatform.com/2016/11/14/closer-look-2016-top-500-supercomputer-rankings/

Green 500
Top energy efficient supercomputers

HPC Observations

Improvements at the individual compute node level are greatest
Better parallelism
Hybrid processing
3D fabrication

Communication costs are increasing

Memory per core is reducing

Throughput > Latency

http://sc16.supercomputing.org/2016/10/07/sc16-invited-talk-spotlight-dr-john-d-mccalpin-presents-memory-bandwidth-system-balance-hpc-systems/

Context and Hardware Trends

Supercomputing

Software and Parallel Computing

Course Outline

Software Challenge

How to use this hardware efficiently?

Software approaches
Parallel languages: some limited impact but not as flexible as sequential programming

Automatic parallelisation of serial code: >30 years of research hasn't solved this yet

Design software with parallelisation in mind

Amdahl’s Law

Speedup of a program is limited by the proportion that can be parallelised

[Chart: Speedup (S) against the parallel proportion of code (P, 0% to 100%); speedup stays low until P approaches 100%, then rises sharply]

Speedup S = 1 / (1 − P)    (the limit with unlimited processors)

Amdahl’s Law cont.

Addition of processing cores gives diminishing returns

Speedup S = 1 / ((1 − P) + P / N)

[Chart: Speedup (S) against number of processors (N, from 1 to 65536) for P = 25%, 50%, 90% and 95%; each curve flattens out, e.g. P = 95% saturates at a speedup of 20]

(a small worked example in C follows)
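As a quick sanity check of the formula (a minimal sketch of my own, not part of the slides), the following C program evaluates Amdahl's law for a 95% parallel program at increasing processor counts and prints the limiting speedup:

/* Minimal sketch: evaluate Amdahl's law S = 1 / ((1 - P) + P / N)
 * for a 95% parallel program on increasing processor counts. */
#include <stdio.h>

static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double p = 0.95;  /* parallel proportion of the code */
    for (int n = 1; n <= 65536; n *= 16) {
        printf("N = %6d  ->  speedup = %.2f\n", n, amdahl_speedup(p, (double)n));
    }
    printf("Limit as N grows: %.2f\n", 1.0 / (1.0 - p));
    return 0;
}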

Parallel Programming Models

Distributed Memory
Geographically distributed processors (clusters)

Information exchanged via messages

Shared Memory
Independent tasks share memory space

Asynchronous memory access

Serialisation and synchronisation to ensure correctness (see the sketch below)

No clear ownership of data

Not necessarily performance oriented
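As a sketch of the shared memory model (my own example using OpenMP, which is covered later in the course, compiled with an OpenMP flag such as -fopenmp), several threads update one shared variable and an atomic update provides the serialisation needed for correctness:

/* Sketch of the shared memory model: all threads share 'total' and must
 * synchronise (here via an atomic update) to avoid a race condition. */
#include <stdio.h>

int main(void)
{
    long total = 0;                 /* shared between all threads */

    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        #pragma omp atomic          /* serialise the update for correctness */
        total += 1;
    }

    printf("total = %ld\n", total); /* 1000000 regardless of thread count */
    return 0;
}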

Types of Parallelism

Bit-level
Parallelism over size of word, 8, 16, 32, or 64 bit.

Instruction Level (ILP)
Pipelining

Task Parallel
Program consists of many independent tasks

Tasks execute on asynchronous cores

Data Parallel
Program has many similar threads of execution

Each thread performs the same behaviour on different data
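And a matching data parallel sketch (again my own OpenMP example, not from the slides): every thread performs the same operation, each on different elements of the array.

/* Data parallel sketch: the same operation (scale by 2) is applied to
 * different data, with loop iterations divided between the threads. */
#include <stdio.h>

#define N 8

int main(void)
{
    float data[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        data[i] *= 2.0f;            /* same behaviour, different data */
    }

    for (int i = 0; i < N; i++) {
        printf("%.1f ", data[i]);
    }
    printf("\n");
    return 0;
}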

Implications of Parallel Computing

Performance improvements
Speed

Capability (i.e. scale)

Context and Hardware Trends

Supercomputing

Software and Parallel Computing

Course Outline

COM4521/6521 specifics

Designed to give insight into parallel computing
Specifically with GPU accelerators

Knowledge transfers to all many core architectures

What you will learn
How to program in C and manage memory manually

How to use OpenMP to write programs for multi-core CPUs

What a GPU is and how to program it with the CUDA language

How to think about problems in a highly parallel way

How to identify performance limitations in code and address them

Course Mailing List

A google group for the course has been set up
You have already been added if you were registered 01/02/2018

Mailing list uses:
Request help outside of lab classes

Find out if a lecture has changed

Participate in discussion on course content

https://groups.google.com/a/sheffield.ac.uk/forum/#!forum/com4521-group

Learning Resources

Course website: http://paulrichmond.shef.ac.uk/teaching/COM4521/

Recommended Reading:
Edward Kandrot, Jason Sanders, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Addison-Wesley 2010.

Brian Kernighan, Dennis Ritchie, "The C Programming Language (2nd Edition)", Prentice Hall 1988.


Timetable

2 x 1 hour lecture per week (back to back)
Monday 15:00 until 17:00, Broad Lane Lecture Theatre 11
Week 5 first half of the lecture will be in DIA-LT09 (Lecture Theatre 9)
Week 5 second half of the lecture will be MOLE quiz in DIA-206 (Compute room 4)

1 x 2 hour lab per week
Tuesday 9:00 until 11:00 Diamond DIA-206 (Compute room 4)
Week 10 first half of the lab will be an assessed MOLE quiz DIA-206 (Compute room 4)

Assignment
Released in two parts
Part 1
Released week 3
Due for hand in on Tuesday week 7 (20/03/2018) at 17:00
Feedback after Easter

Part 2
Released week 6
Due for hand in on Tuesday week 12 (15/05/2018) at 17:00

Course Assessment

2 x Multiple Choice quizzes on MOLE (10% each)
Weeks 5 and 10

An assignment (80%)
Part 1 is 30% of the assignment total

Part 2 is 70% of the assignment total

For each assignment part
Half of the marks are for the program and half for a written report

Will require understanding of why you have implemented a particular technique

Will require benchmarking, profiling and explanation to demonstrate that you understand the implications of what you have done

Lab Classes

2 hours every week
Essential in understanding the course content!

Do not expect to complete all exercises within the 2 hours

Coding help from lab demonstrators Robert Chisholm and John Charlton:
http://staffwww.dcs.shef.ac.uk/people/R.Chisholm/

http://www.dcs.shef.ac.uk/cgi-bin/makeperson?J.Charlton

Assignment and lab class help questions should be directed to the google discussion group


Feedback

After each teaching week you MUST submit the lab register/feedback form
This records your engagement in the course

Ensures that I can see what you have understood and not understood

Allows us to revisit any concepts or ideas with further examples

This only works if you are honest!

Submit this once you have finished with the lab exercises

Your feedback will be used to clarify topics which are assessed in the assignments

Lab Register Link: https://goo.gl/0r73gD

Additional feedback from assignment and MOLE quizzes


Machines Available

Diamond Compute Labs
Visual Studio 2017
NVIDIA CUDA 9.1

VAR Lab
CUDA enabled machines – same spec as Diamond high spec compute room

ShARC
University of Sheffield HPC system
You will need an account (see HPC docs website)
Select number of GPU nodes available (see gpucomputing.shef.ac.uk)
Special short job queue will be made available

Your own machine
Must have a NVIDIA GPU for CUDA exercises
Virtual machines not an option
IMPORTANT: Follow the website's guidance for installing Visual Studio

http://docs.hpc.shef.ac.uk/en/latest/hpc/getting-started.html#getting-an-account
gpucomputing.shef.ac.uk
http://paulrichmond.shef.ac.uk/teaching/COM4521/visual_studio

Summary

Parallelism is already here in a big way
From mobile to workstation to supercomputers

Parallelism in hardware
It's the only way to use the increasing number of transistors

Trend is for increasing parallelism

Supercomputers
Increased dependency on accelerators

Accelerators are greener

Software approaches
Shared and distributed memory models differ

Programs must be highly parallel to avoid diminishing returns
