Author Topic: CUDA Programming and why it hurts my head.  (Read 417 times)


Offline HTH

CUDA Programming and why it hurts my head.
« on: January 20, 2015, 08:34:54 am »
Alright, so as EVERYONE knows I got a new PC (because I'm proud of it and it's sexy and fuck you if you don't like it), and I did so mostly for CUDA programming. I have an idea for my fourth year project which will, if successful, grant me all of the marks and many top keks among my peers at the government job I will no doubt get :p (offered). Also, inb4 "stop bragging about your new graphics cards": I dropped a lot of numbers related to them, but I SWEAR it was at least 80% to try and explain concepts.

TL;DR: I need to program CUDA things for IRL things.

So while struggling through a few concepts I decided that I would make a "brief" little "Cliff's Notes" for CUDA: a few things I found difficult to find on Google (especially for my newer architecture).

First things first: CUDA cores, what do they mean? In short, nothing. Yes, nothing. It's a marketing term, and I have found it amounts to nothing more than the ALU count across the whole card. Not a bad thing, but definitely misleading. So for reference, the CUDA core count of a single GPU is equal to the ALU count of a single multiprocessor (MP) times the number of MPs on the card. For the 900 series that is 128 * 16, though the 970 is hampered in a certain way (compared to the 980) which I care not to explain; for ease of calculation we can say that each card in this computer contains 16 multiprocessors.
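If you want to sanity-check those numbers on your own card, here's a minimal sketch using the runtime API (note the 128 ALUs/MP figure is Maxwell-specific, so treat that multiplier as an assumption for other architectures):

Code:
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            // "CUDA cores" = ALUs per MP (128 on Maxwell) * number of MPs
            printf("Device %d: %s, %d MPs, ~%d \"CUDA cores\"\n",
                   dev, prop.name, prop.multiProcessorCount,
                   128 * prop.multiProcessorCount);
        }
        return 0;
    }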

Second, # of resident threads != # simultaneously executed threads.

To make this point I need to explain a bit about how CUDA works; CUDA uses a system that involves the following (mainly):

Threads (derp.)
Warps (a group of 32 threads all executing the same instruction)
Blocks (which contain 1 or more warps, depending on the block size you launch with)
Warp schedulers, which are literally what they sound like: they schedule the instructions warps will execute.
All of which are contained on multiprocessors; a quick sketch of how these map onto a kernel launch follows below.
There are also other types of processing units that sit around and can technically do other work and increase your thread count a bit, but that's not pertinent to the topic today.
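Here's that sketch, showing how those pieces show up in actual code (the kernel name is my own, nothing here is project-specific):

Code:
    #include <stdio.h>

    __global__ void whoAmI(void) {
        // Each block is carved into warps of 32 consecutive threads.
        int lane = threadIdx.x % warpSize;  // position inside the warp
        int warp = threadIdx.x / warpSize;  // warp index inside the block
        if (lane == 0)
            printf("block %d, warp %d reporting in\n", blockIdx.x, warp);
    }

    int main(void) {
        whoAmI<<<4, 128>>>();  // 4 blocks of 128 threads = 4 warps per block
        cudaDeviceSynchronize();
        return 0;
    }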

Working from the highest level of abstraction downwards, you can treat each multiprocessor as a separate entity: it has its own warp schedulers, which issue commands to the warps on that multiprocessor ONLY, and its own section of memory. Blocks (which contain 1 or more warps, as previously stated) are a little oddity and exist, from my understanding, to allow warps to work together to a certain degree: a warp which is started in a block must stay in its block, but when you sync threads the warps in that block can coordinate with each other (depending on your code, of course).
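That block-level cooperation is what shared memory plus __syncthreads() gives you. A toy sketch of my own where the two warps in one block exchange data:

Code:
    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void swapAcrossWarps(int *data) {
        __shared__ int s[64];           // shared memory: visible to every warp in the block
        s[threadIdx.x] = data[threadIdx.x];
        __syncthreads();                // wait until both warps have written their half
        // Read a value written by the *other* warp in this 64-thread block.
        data[threadIdx.x] = s[63 - threadIdx.x];
    }

    int main(void) {
        int h[64], *d;
        for (int i = 0; i < 64; ++i) h[i] = i;
        cudaMalloc(&d, sizeof(h));
        cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
        swapAcrossWarps<<<1, 64>>>(d);  // 1 block of 64 threads = 2 warps
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        printf("h[0] is now %d\n", h[0]);  // prints 63: written by the second warp
        cudaFree(d);
        return 0;
    }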

The important thing to remember here is that a warp is 32 threads doing the exact same thing, and that is incidentally a large reason that GPUs are so amazing for parallel computing.

CPUs can't do this because the amount of time a CPU would spend executing the exact same instruction across 32 cores or what have you would be so tiny that the cost and added complexity wouldn't be worth it (currently). A warp != 32 separate threads as if it were a 32-core CPU.

Now this is all well and good, but you may be wondering where my point is. Well, here it is: though (my system) may have 2048 resident threads (64 warps) per multiprocessor, all waiting for execution time, the number of concurrently executing threads is limited to the number of warp schedulers (and the number of instructions they can issue per clock cycle).

Two GTX 970 GPUs can have 2048 * 32 = 65,536 resident threads at one time (64 warps/MP * 32 threads/warp * 32 MPs), but they can only execute 8 * 32 * 32 = 8,192 threads at once (4 warp schedulers/MP * 2 instructions per scheduler * 32 threads/warp * 32 MPs).

This means (in numbers) that they can have about 65,000 threads kicking around, but only about 8,200 threads running at once. So why the amazingly large number of resident threads vs. executing ones?
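(Side note: the resident-thread limit is queryable at runtime too; a quick sketch:)

Code:
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        // Resident threads = threads each MP can keep in flight * MP count.
        printf("%s: %d threads/MP * %d MPs = %d resident threads\n",
               p.name, p.maxThreadsPerMultiProcessor, p.multiProcessorCount,
               p.maxThreadsPerMultiProcessor * p.multiProcessorCount);
        return 0;
    }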

Well, because my third point is: Delays Exist in Computing

That is a fundamental law which I care not to Google the name for, but the time it takes to fetch data from system memory, to wait on an external process, or indeed even the time it takes electrons to travel are all delays.

In order to combat this, Nvidia made the GPU architecture very fast and efficient at switching between warps (and thus threads) on the fly. If the hardware senses that a certain warp is held up for whatever reason, it simply switches to another one; I believe I read that it takes an average of 2 clock cycles to do so, but don't quote me on that. This is essentially the same idea as Intel's Hyper-Threading: in doing so they (with well written code) nearly double the apparent number of concurrently executing threads.
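The practical upshot is that you deliberately launch far more threads than can physically run, so there's always another warp ready to go when one stalls on memory. A minimal sketch (a generic SAXPY, not my project code):

Code:
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];  // memory-bound: stalls hidden by warp switching
    }

    int main(void) {
        int n = 1 << 22;  // ~4M threads requested, vs the ~8,200 that can execute at once
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));
        cudaMemset(y, 0, n * sizeof(float));
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();
        cudaFree(x);
        cudaFree(y);
        return 0;
    }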

So to sum it up in what I was hoping would be my only paragraph for this post...

A CUDA GPU is limited in resident threads by the amount of onboard memory, in actually executing threads by the warp scheduling speed, and in apparently concurrent threads by the amount of warp switching the GPU can do to most efficiently use its clock cycles. The reason a CUDA GPU could never replace a CPU is that its strength lies in the fact that it only needs to issue 8 instructions to have 256 threads execute them, a quality which wouldn't translate well into CPU usage. Not to mention the massive architecture differences, which would massively lower MP counts if anyone tried to make a GPU crossed with a CPU.

Also, CUDA cores and the AMD equivalent are a lie/marketing term/a way to compare relative strengths among similar architectures only. If you are looking at a card for math purposes, you should be looking at the MP count, the number of warp schedulers, and the number of concurrent instructions those schedulers can issue to the warps, as well as single and double precision performance. (There's a reason a GTX 970 has over 1000x the processing power of certain older cards despite only having 100x as many cores: warp scheduling has octupled since then (1 instruction/cycle vs 8).)


<ande> HTH is love, HTH is life
<TurboBorland> hth is the only person on this server I can say would successfully spitefuck peoples women

Offline Syntax990

Re: CUDA Programming and why it hurts my head.
« Reply #1 on: January 20, 2015, 11:20:34 am »
Well, because my third point is: Delays Exist in Computing

That is a fundamental law which I care not to Google the name for, but the time it takes to fetch data from system memory, to wait on an external process, or indeed even the time it takes electrons to travel are all delays.

http://en.wikipedia.org/wiki/Propagation_delay :P

Offline HTH

Re: CUDA Programming and why it hurts my head.
« Reply #2 on: January 20, 2015, 11:48:23 am »
http://en.wikipedia.org/wiki/Propagation_delay :P


That is (part of) the delays, yes, but I was referencing Amdahl's Law and the Von Neumann bottleneck; was just drawing a blank lol
« Last Edit: January 20, 2015, 11:50:17 am by HTH »