CUDA - Improving kernel performance by increasing occupancy?
Here is the Compute Visual Profiler output for my kernel on a GT 440:
- kernel details: grid size: [100 1 1], block size: [256 1 1]
- **register ratio: 0.84375 ( 27648 / 32768 ) [35 registers per thread]**
- shared memory ratio: 0.336914 ( 16560 / 49152 ) [5520 bytes per block]
- active blocks per sm: 3 (maximum active blocks per sm: 8)
- active threads per sm: 768 (maximum active threads per sm: 1536)
- potential occupancy: 0.5 ( 24 / 48 )
- **occupancy limiting factor: registers**
Please pay attention to the bullets marked in bold. The kernel execution time is 121195 us.
I reduced the number of registers per thread by moving some local variables into shared memory. The Compute Visual Profiler output became:
- kernel details: grid size: [100 1 1], block size: [256 1 1]
- **register ratio: 1 ( 32768 / 32768 ) [30 registers per thread]**
- shared memory ratio: 0.451823 ( 22208 / 49152 ) [5552 bytes per block]
- active blocks per sm: 4 (maximum active blocks per sm: 8)
- active threads per sm: 1024 (maximum active threads per sm: 1536)
- potential occupancy: 0.666667 ( 32 / 48 )
- **occupancy limiting factor: registers**
Hence, 4 blocks are now executed simultaneously on a single SM, versus 3 blocks in the previous version. However, the execution time is 115756 us, which is almost the same! Why? Aren't the blocks totally independent, being executed on different CUDA cores?
You are implicitly assuming that higher occupancy automatically translates into higher performance. That is not the case.
The NVIDIA architecture needs a certain number of active warps per SM in order to hide the instruction pipeline latency of the GPU. On a Fermi-based card, that requirement translates to a minimum occupancy of about 30%. Aiming for occupancies higher than that minimum will not necessarily result in higher throughput, because the latency bottleneck can have moved to another part of the GPU. Your entry-level GPU doesn't have a lot of memory bandwidth, and it is quite possible that 3 blocks per SM is sufficient to make your code memory-bandwidth limited, in which case increasing the number of blocks won't have any effect on performance (it might even go down because of increased memory controller contention and cache misses).

Further, you said you spilled variables to shared memory in order to reduce the register footprint of the kernel. On Fermi, shared memory has roughly 1000 GB/s of bandwidth, compared to roughly 8000 GB/s for registers (see the link below for the microbenchmarking results which demonstrate this). So you have moved variables to slower memory, which may also have a negative effect on performance, offsetting whatever benefit the higher occupancy affords.
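To make the register-vs-shared-memory point concrete, here is a hypothetical pair of kernels (not the asker's code) where the same per-thread accumulator is kept in a register in one version and manually "spilled" to shared memory in the other:

```cuda
// Accumulator lives in a register (~8000 GB/s on Fermi).
__global__ void sum_in_registers(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;                       // register-resident local
    for (int k = i; k < n; k += gridDim.x * blockDim.x)
        acc += in[k];
    if (i < n) out[i] = acc;
}

// Accumulator manually spilled to shared memory (~1000 GB/s on Fermi).
// This frees one register per thread, but every update of the running
// sum now round-trips through the slower shared memory.
__global__ void sum_in_shared(const float *in, float *out, int n)
{
    __shared__ float acc[256];              // one slot per thread, block size 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    acc[threadIdx.x] = 0.0f;
    for (int k = i; k < n; k += gridDim.x * blockDim.x)
        acc[threadIdx.x] += in[k];
    if (i < n) out[i] = acc[threadIdx.x];
}
```

The second version may well raise occupancy, but each iteration of its inner loop trades a register access for a shared-memory access, which is exactly the bandwidth trade-off described above.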
If you have not seen it, I highly recommend Vasily Volkov's presentation from GTC 2010, "Better Performance at Lower Occupancy" (PDF). It shows how exploiting instruction-level parallelism can increase GPU throughput to high levels at very, very low levels of occupancy.
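A hypothetical sketch of that instruction-level-parallelism idea (my illustration, not code from the presentation): each thread computes four independent results per loop iteration, so four multiplies can be in flight in the pipeline at once, and latency is hidden with fewer warps, i.e. at lower occupancy:

```cuda
// Each thread scales four elements per trip through the main loop.
// The four multiplies are independent of one another, so they overlap
// in the arithmetic pipeline instead of waiting on each other.
__global__ void scale_ilp4(const float *in, float *out, float a, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;    // grid-stride keeps loads coalesced

    for (; i + 3 * stride < n; i += 4 * stride) {
        float x0 = in[i]              * a;  // four independent operations:
        float x1 = in[i + stride]     * a;  // no data dependences between them
        float x2 = in[i + 2 * stride] * a;
        float x3 = in[i + 3 * stride] * a;
        out[i]              = x0;
        out[i + stride]     = x1;
        out[i + 2 * stride] = x2;
        out[i + 3 * stride] = x3;
    }
    for (; i < n; i += stride)              // remaining tail elements
        out[i] = in[i] * a;
}
```

With four-way ILP per thread, the same arithmetic throughput can be sustained with roughly a quarter of the active warps, which is the central observation of the talk.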