{"id":4862,"date":"2025-08-15T07:18:15","date_gmt":"2025-08-15T15:18:15","guid":{"rendered":"https:\/\/www.pnfsoftware.com\/blog\/?p=4862"},"modified":"2025-08-18T16:03:52","modified_gmt":"2025-08-19T00:03:52","slug":"reversing-nvidia-cuda-sass-code","status":"publish","type":"post","link":"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/","title":{"rendered":"Reversing Nvidia GPU&#8217;s SASS code"},"content":{"rendered":"\n<p>JEB 5.31 ships with a generic SASS disassembler and experimental decompiler for GPU code compiled for Nvidia architectures Volta to Blackwell, that is, compute capabilities <code>sm_70<\/code> to <code>sm_121<\/code>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/jeb-sass-decompilation-example2.gif\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"612\" src=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/jeb-sass-decompilation-example2-1024x612.gif\" alt=\"\" class=\"wp-image-4965\" srcset=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/jeb-sass-decompilation-example2-1024x612.gif 1024w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/jeb-sass-decompilation-example2-300x179.gif 300w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/jeb-sass-decompilation-example2-768x459.gif 768w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/jeb-sass-decompilation-example2-1536x918.gif 1536w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/jeb-sass-decompilation-example2-2048x1224.gif 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Click the above image to see the full-size animated gif of a SASS decompilation.<\/figcaption><\/figure>\n\n\n\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_83 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a07995f180f4\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a07995f180f4\"  aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#What_is_SASS_Code\" >What is SASS Code<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#SASS_Primer\" >SASS Primer<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Execution_Environment\" >Execution Environment<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Data_Spaces\" >Data Spaces<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Registers\" >Registers<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Classes_of_Instructions\" >Classes of Instructions<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Disassembling_Volta_Code\" >Disassembling Volta+ Code<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Rendering\" >Rendering<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#DisplayImplicitDescriptors\" >.DisplayImplicitDescriptors<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#DisplayRegisterNumbers\" >.DisplayRegisterNumbers<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#DisplayHiddenAttributes\" >.DisplayHiddenAttributes<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#DisplaySchedulingInfo\" >.DisplaySchedulingInfo<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Code_analysis\" >Code analysis<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Extracting_cubins\" >Extracting cubins<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Decompiling_SASS_to_pseudo-C\" >Decompiling SASS to pseudo-C<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-16\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Sample_decompilation\" >Sample decompilation<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-17\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Limitations\" >Limitations<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-18\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#IR_Conversion\" >IR Conversion<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-19\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Memory_Mapping\" >Memory Mapping<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-20\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Calling_Conventions\" >Calling Conventions<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-21\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Optimizations\" >Optimizations<\/a><\/li><\/ul><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-22\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#API_and_headless_processing\" >API and headless processing<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-23\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Conclusion\" >Conclusion<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-24\" href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/#Annex_1_Instructions_Distribution\" >Annex 1: Instructions Distribution<\/a><\/li><\/ul><\/nav><\/div>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"What_is_SASS_Code\"><\/span>What is SASS Code<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>SASS <sup class='footnote'><a href='#fn-4862-1' id='fnref-4862-1' onclick='return fdfootnote_show(4862)'>1<\/a><\/sup> is the low-level, semi-documented machine code generated when compiling high-level CUDA <sup class='footnote'><a href='#fn-4862-2' id='fnref-4862-2' onclick='return fdfootnote_show(4862)'>2<\/a><\/sup> source code (C++ or higher-level languages) with <em>nvcc<\/em> or when translating PTX <sup class='footnote'><a href='#fn-4862-3' id='fnref-4862-3' onclick='return fdfootnote_show(4862)'>3<\/a><\/sup> intermediate code with <em>ptxas<\/em>.<\/p>\n\n\n\n<p>A simplified view of the compilation steps can be seen as follows:<br><code>CUDA code (C\/C++, etc.)  =&gt;  PTX IR (~LLVM bitcode)  =&gt;  SASS (assembly)<\/code><\/p>\n\n\n\n<p>Practically, GPU code is embedded in an ELF container referred to as a <em>cubin<\/em>, for &#8220;CUDA binary&#8221;. One or more cubins are embedded in a host program to be executed on a CPU. When GPU code needs to be run, the host will retrieve the appropriate cubin and ask the GPU to load and execute it. The complete build process of some high-level example.cu file mixing general-purpose code and GPU code is as follows:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><a href=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-3.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"954\" src=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-3-1024x954.png\" alt=\"\" class=\"wp-image-5001\" style=\"width:533px;height:auto\" srcset=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-3-1024x954.png 1024w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-3-300x279.png 300w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-3-768x716.png 768w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-3.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Build processes, from CUDA to fatbin &#8211; image (c) NVidia<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"SASS_Primer\"><\/span>SASS Primer<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Readers familiar with CUDA may not know about SASS or the details of the environment in which GPU code is executed. This section is a primer that will help make the remainder of this page more readable. If you are familiar with PTX and\/or SASS, you may want to skip to the next section about disassembling code.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Execution_Environment\"><\/span>Execution Environment<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-4.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1020\" height=\"706\" src=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-4.png\" alt=\"\" class=\"wp-image-5005\" srcset=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-4.png 1020w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-4-300x208.png 300w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-4-768x532.png 768w\" sizes=\"auto, (max-width: 1020px) 100vw, 1020px\" \/><\/a><figcaption class=\"wp-element-caption\">This execution environment diagram does not represent warps, a sub-division of the threads in a CTA (32 threads per warp, executed in lockstep: same issued instruction, divergent threads wait.) Image (c) Nvidia<\/figcaption><\/figure>\n\n\n\n<p>Concisely, the basic environment for execution of a GPU <em>kernel<\/em> K can be described as:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>K is executed on a <em>streaming multiprocessor<\/em> (SM).<\/li>\n\n\n\n<li>Threads for K are organized into <em>warps<\/em>. Each warp contains 32 threads executed in lockstep at the instruction-issue level (if the PC of a thread differs because of a branching instruction, divergence is handled by masking inactive threads until a reconvergence point.)<\/li>\n\n\n\n<li>Warps are grouped into <em>Cooperative Thread Arrays<\/em> (CTAs), also called <em>thread blocks<\/em>, each containing up to 1024 threads.<\/li>\n\n\n\n<li>Starting with Hopper, CTAs can be grouped into <em>clusters<\/em>.<\/li>\n\n\n\n<li>The full set of CTAs or clusters forms the <em>compute grid<\/em> for K.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Data_Spaces\"><\/span>Data Spaces<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>A kernel&#8217;s code can access several data spaces:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Registers: per-thread, detailed in the following sub-section<\/li>\n\n\n\n<li>Local Memory: per-thread, in DRAM, accessed with LDL\/STL<\/li>\n\n\n\n<li>Shared Memory: per-CTA, on-chip, accessed via LDS\/STS<\/li>\n\n\n\n<li>Global Memory: global, in DRAM, accessed via LDG\/STG<\/li>\n\n\n\n<li>Constant Memory: in DRAM, cached, accessed via LDC\/ULDC<\/li>\n\n\n\n<li>Texture Memory: global, in DRAM, accessed via TLD\/TSD<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><a href=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-2.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1018\" height=\"620\" src=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-2.png\" alt=\"\" class=\"wp-image-4981\" style=\"width:659px;height:auto\" srcset=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-2.png 1018w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-2-300x183.png 300w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-2-768x468.png 768w\" sizes=\"auto, (max-width: 1018px) 100vw, 1018px\" \/><\/a><figcaption class=\"wp-element-caption\">&#8220;SMEM&#8221; represents the on-chip shared memory for threads of a CTA. The constant memory (in blue, noted &#8220;read-only&#8221;) is loaded from DRAM into a cache). Image (c) Nvidia<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Registers\"><\/span>Registers<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>Let&#8217;s see what registers are available to a thread:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>General Registers (<strong>Rx<\/strong>): up to 256 32-bit registers; 64-bit values are represented by two contiguous registers; R255 is a zero-register (aliased RZ)<\/li>\n\n\n\n<li>Predicate Registers (<strong>Px<\/strong>): 8 boolean flags per thread; P7 is always true (aliased PT)<\/li>\n\n\n\n<li>Special Registers (<strong>SRx<\/strong>): 256 read-only registers, containing thread\/block IDs, lane ID, clock values, performance counters, etc; most are 32-bit, some are 64-bit. <sup class='footnote'><a href='#fn-4862-4' id='fnref-4862-4' onclick='return fdfootnote_show(4862)'>4<\/a><\/sup><\/li>\n<\/ul>\n\n\n\n<p>Uniform registers were added on Turing and above (sm_75+). Their values are the same for all threads of a warp:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Uniform Registers (<strong>URx<\/strong>): 64 32-bit registers (increased to 256 registers on sm_100+); the last one in the bank is a zero-register (aliased URZ)<\/li>\n\n\n\n<li>Uniform Predicate Registers (<strong>UPx<\/strong>): 8 boolean flags; UP7 is always true (aliased UPT)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Classes_of_Instructions\"><\/span>Classes of Instructions<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>SASS instructions can be grouped into high-level classes, also corresponding to different execution pipelines. A few examples:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Integer: IMAD, IADD3, SHF, LOP3 (arbitrary 3-input bitwise operation backed by a look-up table)<\/li>\n\n\n\n<li>Floating-point: FADD, FFMA, FSET, F2F\/F2I\/I2F (conversion instructions), MUFU (multi-usage function, for sin, cos, reverse square-root, etc.)<\/li>\n\n\n\n<li>Load\/Store: LDx\/ STx for each memory space<\/li>\n\n\n\n<li>Control flow: BRA, BRX, CALL, RET, SSY, BSYNC, EXIT<\/li>\n\n\n\n<li>Uniform ops: many equivalent instructions prefixed by U will work on uniform registers, e.g. UIADD3, UIMAD, ULEA<\/li>\n<\/ul>\n\n\n\n<p>Refer to this <a href=\"https:\/\/docs.nvidia.com\/cuda\/cuda-binary-utilities\/#turing-instruction-set\">Nvidia documentation page<\/a> for a brief description of the instruction classes as well as the instructions themselves. For convenience in JEB, the description of an instruction&#8217;s opcode will also be displayed when hovering over its mnemonic.<\/p>\n\n\n\n<p>Finally, let&#8217;s note that:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>About the encoding: all Volta+ instructions are fixed size, 16-byte long.<\/li>\n\n\n\n<li>Most instructions have 1 to 4 operands; the destination operands go first, followed by the source operands. <sup class='footnote'><a href='#fn-4862-5' id='fnref-4862-5' onclick='return fdfootnote_show(4862)'>5<\/a><\/sup><\/li>\n\n\n\n<li>The opcode and operands can contain optional attributes and qualifiers that modify how the instruction behave (e.g. &#8220;.64&#8221; will specify a 64-bit operation on a pair of registers).<\/li>\n\n\n\n<li>All instructions can be predicated.<br>Example: <code>@!P0 IMAD R0, R1, R2, R3<\/code><br>means: &#8220;perform R0=R1*R2+R3 <strong>if P0 is false<\/strong>&#8220;<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Disassembling_Volta_Code\"><\/span>Disassembling Volta+ Code<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>The JEB disassembler plugin can handle SASS code embedded in <em>cubin<\/em> files. They are  ELF containers using the <code>EM_CUDA<\/code> (190) machine type. As for any JEB disassembler plugin, it can be used on standalone binary blobs as well.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-2-scaled.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"670\" src=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-2-1024x670.png\" alt=\"\" class=\"wp-image-4870\" srcset=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-2-1024x670.png 1024w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-2-300x196.png 300w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-2-768x503.png 768w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-2-1536x1005.png 1536w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-2-2048x1340.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Disassembly of a simple cubin with one kernel and no additional sub-routines<\/figcaption><\/figure>\n\n\n\n<p>The disassembler uses the type name <code>sass_visa<\/code>, to mean &#8220;SASS Volta+ ISA&#8221;.<\/p>\n\n\n\n<p>Pre-Volta (before sm_70) code is <strong>not<\/strong> supported by this plugin.<\/p>\n\n\n\n<p>Two current limitations in terms of processing ELF CUDA files:<br>&#8211; The relocations are not supported and not applied.<br>&#8211; Relocatable files (ET_REL) are not supported at the moment (only executables and .so libs are processed)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Rendering\"><\/span>Rendering<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>The disassembler offers an array of rendering options not provided by the official CUDA toolkit&#8217;s <em>cuobjdump<\/em> and <em>nvdisasm<\/em> tools. On top of the usual options common to all JEB disassembler plugins, the following additional options can be enabled (right-click, <em>Rendering Options<\/em>).<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/sass_rendering_animated.gif\"><img loading=\"lazy\" decoding=\"async\" width=\"2976\" height=\"1328\" src=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/sass_rendering_animated.gif\" alt=\"\" class=\"wp-image-4890\"\/><\/a><figcaption class=\"wp-element-caption\">Custom SASS rendering options and their effect on disassembly code<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"DisplayImplicitDescriptors\"><\/span>.<strong>DisplayImplicitDescriptors<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p>This option is enabled by default on GUI clients (its default is false for headless clients, e.g. when scripting). If enabled, the implicit descriptor used to access memory will be displayed to avoid any ambiguity. Example:<\/p>\n\n\n\n<p>Instruction bytes: <code>81 79 06 02 04 00 00 00 00 11 1E 0C 00 68 01 00<\/code><br>Standard rendering: <code><strong>LDG.E.U8 R6, [R2.64]<\/strong><\/code><br>Rendering with desc: <code><strong>LDG.E.U8 R6, <mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-medium-pink-color\">desc[UR4]<\/mark>[R2.64]<\/strong><\/code><\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"DisplayRegisterNumbers\"><\/span>.DisplayRegisterNumbers<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p>If enabled, the disassembly will use number-based register names instead of their aliases (e.g. P7 instead of PT). Examples:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>R255<\/strong>: Register #255 is always zero and aliased <strong>RZ<\/strong><\/li>\n\n\n\n<li><strong>P7<\/strong>: Predicate Register #7 is always true and aliased <strong>PT<\/strong><\/li>\n\n\n\n<li><strong>SR0<\/strong>: Special Register #0 is the lane id and aliased <strong>SR_LANEID<\/strong><\/li>\n\n\n\n<li>On architectures sm_75 to sm_90, the last (63rd) uniform register <strong>UR63<\/strong> is always zero and aliased as <strong>URZ<\/strong><\/li>\n\n\n\n<li>etc.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"DisplayHiddenAttributes\"><\/span>.DisplayHiddenAttributes<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p>If enabled, opcode attributes, operands, and operand attributes that use default value will be explicitly rendered. Examples:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Instruction bytes: <code>10 72 00 00 05 00 00 00 FF E0 FF 07 00 E4 0F 00<\/code><br>Standard rendering: <strong><code>IADD3 R0, R0, R5, RZ<\/code><\/strong><br>Full rendering: <strong><code>IADD3 R0, <mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-medium-pink-color\">PT, PT, R0.noreuse,<\/mark> R5<mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-medium-pink-color\">.noreuse<\/mark>, RZ<mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-medium-pink-color\">.noreuse<\/mark><\/code><\/strong><\/li>\n\n\n\n<li>Instruction bytes: <code>81 73 03 02 00 04 00 00 00 E9 1E 00 00 A2 0E 00<\/code><br>Standard rendering: <strong><code>LDG.E.SYS R3, [R2+0x4]<\/code><\/strong><br>Full rendering: <strong><code>LDG.E<mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-medium-pink-color\">.EN.32.WEAK<\/mark>.SYS<mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-medium-pink-color\">.noprivate PT,<\/mark> R3, [R2+0x4]<\/code><\/strong><\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"DisplaySchedulingInfo\"><\/span>.DisplaySchedulingInfo<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p>If enabled, extra scheduling information is explicitly generated and appended to the instruction. Examples:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>IMAD  R5, R5, c[0x0][0x4], R2   <strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-medium-pink-color\">&amp;req*={1} ?WAIT4_END_GROUP<\/mark><\/strong><\/code><\/li>\n\n\n\n<li><code>MOV   R3, 0x8                   <strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-medium-pink-color\">?trans1<\/mark><\/strong><\/code><\/li>\n<\/ul>\n\n\n\n<p>Note that we do not provide instruction timing information (such as hardware-enforced latency to avoid data hazards) at this point, although it is likely we will add that as a rendering option in a future update.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Code_analysis\"><\/span>Code analysis<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>The code analyzer breaks down the SASS code and rebuilds control flow. When doing so, internal <code>__device__<\/code> sub-routines that were not inlined are recovered and displayed in the code hierarchy.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-5.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"493\" src=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-5-1024x493.png\" alt=\"\" class=\"wp-image-4894\" srcset=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-5-1024x493.png 1024w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-5-300x145.png 300w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-5-768x370.png 768w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-5-1536x740.png 1536w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-5-2048x986.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Two internal methods were recovered<\/figcaption><\/figure>\n\n\n\n<p>Per usual with code units in JEB, the disassembly listing can be annotated (e.g. comments: hotkey <strong>\/<\/strong>), methods can be renamed (hotkey <strong>N<\/strong>), code can be navigated (e.g. cross-references: hotkey <strong>X<\/strong>), etc. All those actions are located in the <em>Action<\/em> and <em>Native<\/em> menus. <sup class='footnote'><a href='#fn-4862-6' id='fnref-4862-6' onclick='return fdfootnote_show(4862)'>6<\/a><\/sup><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Extracting_cubins\"><\/span>Extracting cubins<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>A secondary plugin retrieves and extract cubins from host executable files (ELF, PE, etc.). Extraction works on a best-effort basis, as the file format is not officially documented by Nvidia.<\/p>\n\n\n\n<p>In the screenshot below, we opened the <em>oceanFFT<\/em> demo program shipping with the CUDA toolkit. A fatbin was retrieved, containing 17 cubins, as can be seen in the Project Explorer panel. The fatbin&#8217;s Description fragment provides more details:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Type of fatbin code (SASS or PTX)<\/li>\n\n\n\n<li>Flags (e.g. whether the fatbin payload was compresed)<\/li>\n\n\n\n<li>The intended architecture<\/li>\n<\/ul>\n\n\n\n<p>Individual cubin units are created and can be opened to analyze and annotate the code, as was shown in the previous section.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-4-scaled.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"546\" src=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-4-1024x546.png\" alt=\"\" class=\"wp-image-4884\" srcset=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-4-1024x546.png 1024w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-4-300x160.png 300w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-4-768x409.png 768w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-4-1536x818.png 1536w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/06\/image-4-2048x1091.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">The CUDA demo file oceanFFT.exe opened in JEB. The discovered fatbin unit contains cubin ELF as sub-units.<\/figcaption><\/figure>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nfatbin (cuda_fatbin)\n\n#0: kind:SASS code, flags:AddressSize64|HostWindows, version:1.7, architecture:sm_52, data:0x368 bytes (Maxwell - pre-Volta code will not be parsed))\n#1: kind:SASS code, flags:AddressSize64|HostWindows, version:1.7, architecture:sm_50, data:0x3BD8 bytes (Maxwell - pre-Volta code will not be parsed))\n#2: kind:SASS code, flags:AddressSize64|HostWindows, version:1.7, architecture:sm_52, data:0x3BD8 bytes (Maxwell - pre-Volta code will not be parsed))\n#3: kind:SASS code, flags:AddressSize64|HostWindows, version:1.7, architecture:sm_60, data:0x3B98 bytes (Pascal - pre-Volta code will not be parsed))\n#4: kind:SASS code, flags:AddressSize64|HostWindows, version:1.7, architecture:sm_61, data:0x3B98 bytes (Pascal - pre-Volta code will not be parsed))\n#5: kind:SASS code, flags:AddressSize64|HostWindows, version:1.7, architecture:sm_70, data:0x4838 bytes (Volta)\n#6: kind:SASS code, flags:AddressSize64|HostWindows, version:1.7, architecture:sm_75, data:0x47B8 bytes (Turing)\n#7: kind:SASS code, flags:AddressSize64|HostWindows, version:1.7, architecture:sm_80, data:0x4A38 bytes (Ampere)\n#8: kind:SASS code, flags:AddressSize64|HostWindows, version:1.7, architecture:sm_86, data:0x4A38 bytes (Ampere)\n#9: kind:SASS code, flags:AddressSize64|HostWindows, version:1.7, architecture:sm_89, data:0x49B8 bytes (Ada)\n#10: kind:SASS code, flags:AddressSize64|HostWindows, version:1.7, architecture:sm_90, data:0x4ED0 bytes (Blackwell)\n#11: kind:SASS code, flags:AddressSize64|HostWindows|0x1000000, version:1.8, architecture:sm_100, data:0x6CA8 bytes (Blackwell)\n#12: kind:SASS code, flags:AddressSize64|HostWindows|0x1000000, version:1.8, architecture:sm_101, data:0x7040 bytes (Blackwell)\n#13: kind:SASS code, flags:AddressSize64|HostWindows|0x1000000, version:1.8, architecture:sm_103, data:0x6EE8 bytes (Blackwell)\n#14: kind:SASS code, flags:AddressSize64|HostWindows|0x1000000, version:1.8, architecture:sm_120, data:0x7038 bytes (Blackwell)\n#15: kind:PTX source, flags:AddressSize64|HostWindows|CompressionLZ4, version:8.8, architecture:sm_121, data:0x4337 bytes (Blackwell)\n#16: kind:SASS code, flags:AddressSize64|HostWindows|0x1000000, version:1.8, architecture:sm_121, data:0x7038 bytes (Blackwell)\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Decompiling_SASS_to_pseudo-C\"><\/span>Decompiling SASS to pseudo-C<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>This JEB release also includes an experimental\/proof-of-concept decompiler plugin for SASS code. It will generate pseudo C code with many caveats, as described in the current section.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Sample_decompilation\"><\/span>Sample decompilation<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>Have a look at the simple CUDA kernel below:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\n\/\/ matrix_mult.cu\n__global__ void matrix_mult(int* m1, int m1_nrows, int m1_ncols, int* m2, int m2_nrows, int m2_ncols, int* mr) {\n  int v = 0;\n  for(int i = 0; i &lt; m1_ncols; i++) {\n    v += m1&#x5B;threadIdx.x * m1_ncols + i] * m2&#x5B;i * m2_ncols + threadIdx.y];\n  }\n  mr&#x5B;threadIdx.x * m2_ncols + threadIdx.y] = v;\n}\n<\/pre><\/div>\n\n\n<p>For the sake of example, let&#8217;s compile this kernel with full optimizations except for loop unrolling (<code>#pragma unroll 1<\/code> on the for-loop). The decompiled code looks as follows:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"527\" src=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-1024x527.png\" alt=\"\" class=\"wp-image-4942\" srcset=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-1024x527.png 1024w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-300x154.png 300w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-768x395.png 768w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-1536x791.png 1536w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-2048x1055.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">The left part of this JEB workspace shows the SASS code for matrix_mult. The right side shows the decompiler output for this kernel. Note that the compilation process was sub-optimal, as loop unrolling was disabled.<\/figcaption><\/figure>\n\n\n\n<p>Below, the same matrix multiplication kernel compiled with full optimizations, including loop unrolling <sup class='footnote'><a href='#fn-4862-7' id='fnref-4862-7' onclick='return fdfootnote_show(4862)'>7<\/a><\/sup>, which will increase the size of code substantially, and make readability much worse:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"693\" src=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-1-1024x693.png\" alt=\"\" class=\"wp-image-4943\" srcset=\"https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-1-1024x693.png 1024w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-1-300x203.png 300w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-1-768x520.png 768w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-1-1536x1039.png 1536w, https:\/\/www.pnfsoftware.com\/blog\/wp-content\/uploads\/2025\/08\/image-1-2048x1386.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">The left part of this JEB workspace shows the SASS code for matrix_mult. The right side shows the decompiler output for this kernel. Full optimizations are on, including loop unrolling. Notice that both the SASS code as well as the decompiler output are much larger.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Limitations\"><\/span>Limitations<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>The sub sections below describe some of the decompiler plugin&#8217;s design and implementation choices, as well as list some limitations and avenues of improvement that may be considered in <em>future updates<\/em>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"IR_Conversion\"><\/span>IR Conversion<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p>Many classes of instructions are fully converted to precise IR. However, not all SASS instructions&#8217; semantics are clearly understood or have been figured out. In some cases, the decompiler won&#8217;t be able to generate precise low-level IR to represent an instruction.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Partial conversion: some instructions are only partially converted. For instance, at the time of writing, only the 4-operand version of <code>FMNMX<\/code> is converted to precise IR. The 5-operand variant is not, and will yield an &#8220;untranslated IR&#8221; statement, eventually represented as pseudo-C code such as <code>FMNMX(inputs, outputs)<\/code>.<\/li>\n\n\n\n<li>Missing conversion: some instructions are not converted at all. For instance, synchronization primitives such as <code>ELECT <\/code>or <code>VOTE <\/code>will be mapped to untranslated IR statements.<\/li>\n<\/ul>\n\n\n\n<p><span style=\"text-decoration: underline;\">Future updates:<\/span> Another limitation regards the handling of <code>BRX<\/code> branching instruction, which relies on a fixed-size jump table to perform a jump. Currently, the table (located in the cubin) is not processed, potentially yielding sub-par results regarding the actual targets of such instructions.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Memory_Mapping\"><\/span>Memory Mapping<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p>GPU kernels execute in environments that do not match how JEB&#8217;s <em>gendec<\/em> abstracts code and memory. When processing an artifact, JEB places all code and data in a unified, potentially arbitrarily-long virtual memory. By contrast, GPU codes are run in isolation, and can access several memory areas that are also isolated from each other. In order to reconcile those views, the plugin organizes bytes as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Kernel codes will be found in the first 256 Mb of the VM. Each kernel (included its private sub-routines, when they have been identified) starts at the nearest 8-byte rounded address, following the previous kernel.<\/li>\n\n\n\n<li>The constant memory, represented in SASS as <code>c[bankId][offset]<\/code> is mapped at address 0x1000&#8217;0000. Each bank has an arbitrary max size of 0x0100&#8217;0000 bytes.<\/li>\n\n\n\n<li>The shared and local memory (e.g. accessed by <code>STS\/LDS<\/code> or <code>STL\/LDL<\/code>) is mapped at 0x3000&#8217;0000.<\/li>\n\n\n\n<li>There is no explicit stack pointer.<\/li>\n<\/ul>\n\n\n\n<p><span style=\"text-decoration: underline;\">Future updates:<\/span> The values of global constants are located in specific segments of the cubin. Currently, those segments are ignored by the decompiler plugin.  Similarly, resolved global symbols to e.g. external API routines such as libc&#8217;s, are located in constant pools.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Calling_Conventions\"><\/span>Calling Conventions<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p>A kernel entry-point (<code>__global__ <\/code>methods) has a well-defined calling convention &#8212; though not well-documented. Kernels return void, and all output data is written through pointers passed as arguments. The arguments are mapped to the constant memory bank 0, at the following offsets:<br>&#8211; <code>sm_70 <\/code>to <code>sm_89<\/code>: 0x160<br>&#8211; <code>sm_90<\/code>: 0x210<br>&#8211; <code>sm_100 <\/code>to <code>sm_12x<\/code>: 0x380<\/p>\n\n\n\n<p>Currently, the plugin generates no-output\/no-input prototypes (i.e., <code>void kernel()<\/code>) for all kernels, and replaces memory accesses to mapped parameters by synthetic variables. Some simplified example: a 32-bit memory access to <code>c[0][0x160]<\/code> (resulting in <code>*(int32*)(0x10000160)<\/code> at the IR level) will be replaced by an EVar named &#8220;arg0&#8221;.<\/p>\n\n\n\n<p><span style=\"text-decoration: underline;\">Future updates:<\/span> Special calling convention may be created to allow the definition and customization of such vars at the prototype level.<\/p>\n\n\n\n<p>Kernel sub-routines (<code>__device__<\/code> methods that have not been inlined) do not have well-defined calling conventions. The compiler is free to save and use whatever registers it sees fit before invocation. The SASS decompiler plugin does not handle such methods very well at the time of writing, since it does not know which registers are written to provide the return value.<\/p>\n\n\n\n<p><span style=\"text-decoration: underline;\">Future updates:<\/span> A global pass may examine the callers of kernel sub-routines to infer which registers are saved, which are used to provide arguments, and which hold return values.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Optimizations\"><\/span>Optimizations<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p>The matrix multiplication example above highlighted the importance of loop unrolling in the compilation process to obtain better performance for GPU code. However, this compiler-level optimization produces lengthy and difficult pseudo C decompilations.<\/p>\n\n\n\n<p><span style=\"text-decoration: underline;\">Future updates:<\/span> Having a special IR optimizer that attempts to re-roll some loops will be an important step toward producing readable decompiled code for large kernels. <sup class='footnote'><a href='#fn-4862-8' id='fnref-4862-8' onclick='return fdfootnote_show(4862)'>8<\/a><\/sup><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"API_and_headless_processing\"><\/span>API and headless processing<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>The usual <a href=\"https:\/\/www.pnfsoftware.com\/jeb\/apidoc\/\">JEB APIs<\/a> can be used to access container units (<code>IELFUnit<\/code>, <code>IPECOFFUnit<\/code>), code units (<code>INativeCodeUnit<\/code>), and decompiler units (<code>INativeDecompilerUnit<\/code>).<\/p>\n\n\n\n<p>The CUDA fatbin units, of type <code>cuda_fatbin<\/code>, is represented by the newly-added interface <code>ICudaFatbinUnit<\/code>. The <code>getCubinEntries()<\/code> method provides <code>ICudaCubinEntry<\/code> objects. That interface offers an easy way to retrieve flags or architecture information about a cubin. The code itself (in the case of a SASS entry) is located in ELF units that are children (sub-units) of a fatbin unit.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>This plugin provides flexible and robust disassembly and analysis for Nvidia code generated for Volta and above classes of GPUs. It was tested on 100+ millions unique instructions making up approximately 620,000 kernels (from <code>sm_70<\/code> to <code>sm_121<\/code>) to ensure full compatibility with the outputs of cuobjdump and nvdisasm.<\/p>\n\n\n\n<p>The experimental decompiler plugin can be used to examine a higher-level representation of the underlying SASS code. At the time of writing, the output is quite rough and will require more source binary information extraction and code optimization in order to match the one of mature decompiler plugins.<\/p>\n\n\n\n<p><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-medium-pink-color\"><strong>&#x26a0;<\/strong> <strong>Improvement<\/strong><\/mark><strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-medium-pink-color\"> to this plugin will mostly depend on user feedback!<\/mark> If you find it useful, please reach out to us (<a href=\"mailto:support@pnfsoftware.com\">email<\/a>, <a href=\"https:\/\/www.pnfsoftware.com\/chat\">Slack<\/a>) and let us know your requirements, use-cases, and areas where you&#8217;d like to see improvements.<\/strong><\/p>\n\n\n\n<p>I want to extend a special thank you note to the author of <a href=\"https:\/\/github.com\/0xD0GF00D\/DocumentSASS\">DocumentSass<\/a>, which was a great help during the early weeks of this research project.<\/p>\n\n\n\n<p>Thank you &amp; until next time \ud83d\ude42 Nicolas.<\/p>\n\n\n\n<iframe loading=\"lazy\" width=\"560\" height=\"315\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/KMHaqUcTbHY?si=XtJ-a4IHwQ4ea2nb&#038;vq=hd1440\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Annex_1_Instructions_Distribution\"><\/span>Annex 1: Instructions Distribution<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>To create the SASS disassembler, we examined 62,000 kernels shipping with the CUDA toolkit as well as other commercial applications. It made up for about 1.1 million unique instructions. It is unlikely this set is representative of all commercial GPU code, but we provide those numbers for the most curious readers:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Top distributions, per-kernel<br># Example: BRA is seen in all kernels,<br># whereas BAR is seen in about every other kernel<br>BRA      100.00%<br>EXIT     98.35%<br>NOP      98.08%<br>IMAD     96.85%<br>ISETP    96.81%<br>S2R      95.75%<br>LDG      93.75%<br>IADD3    93.19%<br>STG      88.71%<br>SHF      82.59%<br>LOP3     79.81%<br>LEA      79.35%<br>BSYNC    74.03%<br>BSSY     74.03%<br>MOV      67.91%<br>SEL      59.34%<br>ULDC     58.22%<br>LDS      53.43%<br>STS      53.41%<br>BAR      51.81%<br># ...<br># full list: <a href=\"https:\/\/www.pnfsoftware.com\/other\/sass\/sass_insn_per_kernel_distribs.txt\">sass_insn_per_kernel_distribs.txt<\/a><\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\"># Most common, in descending order<br># Example: IMAD (integer multiply-and-add) is by far<br># the most common instruction encountered<br>IMAD<br>ISETP<br>IADD3<br>FFMA<br>LDG<br>BRA<br>LEA<br>NOP<br>LDS<br>LOP3<br>MOV<br>SHF<br>FMUL<br>STS<br>FADD<br>DFMA<br>PRMT<br>SEL<br># ...<br># full list: <a href=\"https:\/\/www.pnfsoftware.com\/other\/sass\/sass_insn_most_common.txt\">sass_insn_most_common.txt<\/a><\/pre>\n\n\n\n<p>&#8212;<\/p>\n\n\n<div class='footnotes' id='footnotes-4862'><div class='footnotedivider'><\/div><ol><li id='fn-4862-1'> SASS= Streaming Assembly, the hybrid RISC\/NISC\/VLIW-like proprietary instruction set used by Nvidia GPUs <span class='footnotereverse'><a href='#fnref-4862-1'>&#8617;<\/a><\/span><\/li><li id='fn-4862-2'> CUDA= Compute Unified Device Architecture, a computing platform and set of APIs to write and run code on GPUs <span class='footnotereverse'><a href='#fnref-4862-2'>&#8617;<\/a><\/span><\/li><li id='fn-4862-3'> PTX= Parallel Thread Execution, a medium-level intermediate representation of GPU code <span class='footnotereverse'><a href='#fnref-4862-3'>&#8617;<\/a><\/span><\/li><li id='fn-4862-4'> A compiled list of all special registers gathered from sm_70 to sm_121 can be consulted here: <a href=\"https:\/\/www.pnfsoftware.com\/other\/sass\/sass_special_registers.txt\">sass_special_registers.txt<\/a>. Not all registers are available on all architectures. Many can be accessed in PTX through similarly-named variables, e.g. <code>%tid.x<\/code> for <code>SR_TID.X<\/code>. <span class='footnotereverse'><a href='#fnref-4862-4'>&#8617;<\/a><\/span><\/li><li id='fn-4862-5'> SASS instructions do not use a mixed operand (src+dst) such as on the x86 architecture. <span class='footnotereverse'><a href='#fnref-4862-5'>&#8617;<\/a><\/span><\/li><li id='fn-4862-6'> JEB &#8220;actions&#8221; are not specific to the SASS plugin. New users are encouraged to read <a href=\"https:\/\/www.pnfsoftware.com\/jeb\/manual\/actions\/\">the Manual pages<\/a> if they are not familiar with JEB&#8217;s commands and workflow. <span class='footnotereverse'><a href='#fnref-4862-6'>&#8617;<\/a><\/span><\/li><li id='fn-4862-7'> ptxas does a lot of loop unrolling to maximize instruction-level parallelism and achieve better throughput. <span class='footnotereverse'><a href='#fnref-4862-7'>&#8617;<\/a><\/span><\/li><li id='fn-4862-8'> On its own, a generic re-roller is quite an endeavor and will make for a great research project that would benefit both <em>gendec<\/em> and its Dex counterpart, <em>dexdec<\/em>. <span class='footnotereverse'><a href='#fnref-4862-8'>&#8617;<\/a><\/span><\/li><\/ol><\/div>","protected":false},"excerpt":{"rendered":"<p>JEB 5.31 ships with a generic SASS disassembler and experimental decompiler for GPU code compiled for Nvidia architectures Volta to Blackwell, that is, compute capabilities sm_70 to sm_121. What is SASS Code SASS 1 is the low-level, semi-documented machine code generated when compiling high-level CUDA 2 source code (C++ or higher-level languages) with nvcc or &hellip; <a href=\"https:\/\/www.pnfsoftware.com\/blog\/reversing-nvidia-cuda-sass-code\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Reversing Nvidia GPU&#8217;s SASS code<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[32,30],"tags":[],"class_list":["post-4862","post","type-post","status-publish","format-standard","hentry","category-cuda","category-jeb5"],"_links":{"self":[{"href":"https:\/\/www.pnfsoftware.com\/blog\/wp-json\/wp\/v2\/posts\/4862","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pnfsoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pnfsoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pnfsoftware.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pnfsoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=4862"}],"version-history":[{"count":0,"href":"https:\/\/www.pnfsoftware.com\/blog\/wp-json\/wp\/v2\/posts\/4862\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.pnfsoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=4862"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pnfsoftware.com\/blog\/wp-json\/wp\/v2\/categories?post=4862"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pnfsoftware.com\/blog\/wp-json\/wp\/v2\/tags?post=4862"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}