CUDA - Improving kernel performance by increasing occupancy?
Here is the Compute Visual Profiler output for my kernel on a GT 440:
- kernel details: grid size: [100 1 1], block size: [256 1 1]
- **register ratio: 0.84375 ( 27648 / 32768 ) [35 registers per thread]**
- shared memory ratio: 0.336914 ( 16560 / 49152 ) [5520 bytes per block]
- active blocks per sm: 3 (maximum active blocks per sm: 8)
- active threads per sm: 768 (maximum active threads per sm: 1536)
- potential occupancy: 0.5 ( 24 / 48 )
- **occupancy limiting factor: registers**
Please pay attention to the bullets marked in bold. The kernel execution time is 121195 us.
I reduced the number of registers per thread by moving some local variables into shared memory. The Compute Visual Profiler output became:
- kernel details: grid size: [100 1 1], block size: [256 1 1]
- **register ratio: 1 ( 32768 / 32768 ) [30 registers per thread]**
- shared memory ratio: 0.451823 ( 22208 / 49152 ) [5552 bytes per block]
- active blocks per sm: 4 (maximum active blocks per sm: 8)
- active threads per sm: 1024 (maximum active threads per sm: 1536)
- potential occupancy: 0.666667 ( 32 / 48 )
- **occupancy limiting factor: registers**
Hence, 4 blocks are now executed simultaneously on a single SM, versus 3 blocks in the previous version. However, the execution time is 115756 us, which is almost the same! Why? Aren't the blocks totally independent, being executed on different CUDA cores?
You are implicitly assuming that higher occupancy automatically translates into higher performance. That is not the case.
The NVIDIA architecture needs a certain number of active warps per SM in order to hide the instruction pipeline latency of the GPU. On a Fermi-based card, that requirement translates to a minimum occupancy of about 30%. Aiming for occupancies higher than that minimum will not necessarily result in higher throughput, because the latency bottleneck can have moved to another part of the GPU. Your entry-level GPU doesn't have a lot of memory bandwidth, and it is quite possible that 3 blocks per SM is sufficient to make your code memory-bandwidth limited, in which case increasing the number of blocks won't have any effect on performance (it might even go down because of increased memory controller contention and cache misses).

Further, you said you spilled variables to shared memory in order to reduce the register footprint of the kernel. On Fermi, shared memory has roughly 1000 GB/s of bandwidth, compared to roughly 8000 GB/s for registers (see the link below for the microbenchmarking results which demonstrate this). So you have moved variables to slower memory, which may also have a negative effect on performance, offsetting whatever benefit the higher occupancy affords.
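To make the register-vs-shared-memory point concrete, here is a hypothetical pair of kernels (not the asker's code) where the same per-thread accumulator is kept in a register in one version and manually "spilled" to shared memory in the other:

```cuda
// Accumulator lives in a register (~8000 GB/s on Fermi).
__global__ void sum_in_registers(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;                       // register-resident local
    for (int k = i; k < n; k += gridDim.x * blockDim.x)
        acc += in[k];
    if (i < n) out[i] = acc;
}

// Accumulator manually spilled to shared memory (~1000 GB/s on Fermi).
// This frees one register per thread, but every update of the running
// sum now round-trips through the slower shared memory.
__global__ void sum_in_shared(const float *in, float *out, int n)
{
    __shared__ float acc[256];              // one slot per thread, block size 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    acc[threadIdx.x] = 0.0f;
    for (int k = i; k < n; k += gridDim.x * blockDim.x)
        acc[threadIdx.x] += in[k];
    if (i < n) out[i] = acc[threadIdx.x];
}
```

The second version may well raise occupancy, but each iteration of its inner loop trades a register access for a shared-memory access, which is exactly the bandwidth trade-off described above.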
If you have not seen it, I highly recommend Vasily Volkov's presentation from GTC 2010, "Better Performance at Lower Occupancy" (PDF). It shows how exploiting instruction-level parallelism can increase GPU throughput to high levels at very, very low levels of occupancy.
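A hypothetical sketch of that instruction-level-parallelism idea (my illustration, not code from the presentation): each thread computes four independent results per loop iteration, so four multiplies can be in flight in the pipeline at once, and latency is hidden with fewer warps, i.e. at lower occupancy:

```cuda
// Each thread scales four elements per trip through the main loop.
// The four multiplies are independent of one another, so they overlap
// in the arithmetic pipeline instead of waiting on each other.
__global__ void scale_ilp4(const float *in, float *out, float a, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;    // grid-stride keeps loads coalesced

    for (; i + 3 * stride < n; i += 4 * stride) {
        float x0 = in[i]              * a;  // four independent operations:
        float x1 = in[i + stride]     * a;  // no data dependences between them
        float x2 = in[i + 2 * stride] * a;
        float x3 = in[i + 3 * stride] * a;
        out[i]              = x0;
        out[i + stride]     = x1;
        out[i + 2 * stride] = x2;
        out[i + 3 * stride] = x3;
    }
    for (; i < n; i += stride)              // remaining tail elements
        out[i] = in[i] * a;
}
```

With four-way ILP per thread, the same arithmetic throughput can be sustained with roughly a quarter of the active warps, which is the central observation of the talk.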