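# Using msDebug in the CATLASS Sample Project

catlass: This project is the CANN operator template library, providing high-performance matrix multiplication and related fused operator template samples for the NPU. Project address: https://gitcode.com/cann/catlass

msDebug is a tool for debugging operator programs that run on the NPU. It gives operator developers a way to debug operators on Ascend devices, including reading device memory and registers and pausing and resuming program execution.

⚠️ Note: If you develop and debug inside a container, make sure `/dev/drv_debug` is mapped into the container (see Driver Check).

## Usage example

The following uses `00_basic_matmul` as an example to show how to debug with msDebug.

### Enable the driver's debugging feature

As described in the msDebug tool overview, either install the driver in debug mode or, with a driver installed in full mode, run `echo 1 > /proc/debug_switch` to open the debug channel. To avoid security problems, do not enable the debug channel in a production environment.

If the following error appears, the driver version is too old and the driver must be updated.

```
msdebug failed to initialize. please install HDK.
[ERROR] error code: 0x20102
terminate called after throwing an instance of 'MSDEBUG_ERROR_CODE'
```

### Build and run

Following the Quick Start, turn on the build switches `--debug --msdebug` to compile the operator sample with debug and msdebug enabled.

```bash
bash scripts/build.sh --debug --msdebug 00_basic_matmul
```

`--debug` controls the debug switch for both host-side and device-side code, while `--msdebug` controls the debug switch for device-side code. With `--debug` alone, only host debugging is enabled, and only host-side code can be debugged, using gdb or lldb.

Change to `output/bin`, the build output directory of the executable, and run the operator sample under msdebug.

```bash
cd output/bin
# executable | matrix m axis | n axis | k axis | Device ID (optional)
msdebug ./00_basic_matmul 256 512 1024 0
```

```
msdebug ./00_basic_matmul 256 512 1024 0
msdebug(MindStudio Debugger) is part of MindStudio Operator-dev Tools.
The tool provides developers with a mechanism for debugging Ascend kernels running on actual hardware.
This enables developers to debug Ascend kernels without being affected by potential changes brought by simulation and emulation environments.
(msdebug) target create "./00_basic_matmul"
Current executable set to '/home/catlass/output/bin/00_basic_matmul' (aarch64).
```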
```
(msdebug) settings set -- target.run-args  "256" "512" "1024" "0"
(msdebug)
```

### Command-line debugging

#### Setting breakpoints and running the program

Use the commands `b basic_matmul.cpp:45` and `b basic_matmul.cpp:90` to set two breakpoints, then use `breakpoint list` to view the existing breakpoints. In `00_basic_matmul.cpp`, lines 90~101 are type alias definitions rather than runtime machine code, which is why the second breakpoint resolves to line 101.

```
(msdebug) b basic_matmul.cpp:45
Breakpoint 1: where = 00_basic_matmul`Run(GemmOptions const&) + 460 at basic_matmul.cpp:45:18, address = 0x000000000016df8c
(msdebug) b basic_matmul.cpp:90
Breakpoint 2: where = 00_basic_matmul`Run(GemmOptions const&) + 2816 at basic_matmul.cpp:101:39, address = 0x000000000016e8c0
(msdebug) breakpoint list
Current breakpoints:
1: file = 'basic_matmul.cpp', line = 45, exact_match = 0, locations = 1
  1.1: where = 00_basic_matmul`Run(GemmOptions const&) + 460 at basic_matmul.cpp:45:18, address = 00_basic_matmul[0x000000000016df8c], unresolved, hit count = 0

2: file = 'basic_matmul.cpp', line = 90, exact_match = 0, locations = 1
  2.1: where = 00_basic_matmul`Run(GemmOptions const&) + 2816 at basic_matmul.cpp:101:39, address = 00_basic_matmul[0x000000000016e8c0], unresolved, hit count = 0

(msdebug)
```

Run `r` and the program starts and runs until it reaches the first breakpoint; run `c` and it continues to the next breakpoint.

Note that an operator in a multi-core program is usually dispatched to several accelerator cores that run concurrently. When one core hits a breakpoint, it notifies the other cores through an interrupt to stop immediately, so the other cores are not guaranteed to stop at that same breakpoint, and the same breakpoint may later be hit again by another core. Developers can disable or delete breakpoints to keep cores from hitting the same breakpoint over and over.

```
(msdebug) r
Process 813993 launched: '/home/catlass/output/bin/00_basic_matmul' (aarch64)
Process 813993 stopped
* thread #1, name = '00_basic_matmul', stop reason = breakpoint 1.1
    frame #0: 0x0000aaaaaac0df8c 00_basic_matmul`Run(options=0x0000ffffffffe340) at basic_matmul.cpp:45:18
   42
   43       uint32_t m = options.problemShape.m();
   44       uint32_t n = options.problemShape.n();
-> 45       uint32_t k = options.problemShape.k();
   46
   47       size_t lenA = static_cast<size_t>(m) * k;
   48       size_t lenB = static_cast<size_t>(k) * n;
(msdebug) c
Process 813993 resuming
Process 813993 stopped
* thread #1, name = '00_basic_matmul', stop reason = breakpoint 2.1
    frame #0: 0x0000aaaaaac0e8c0 00_basic_matmul`Run(options=0x0000ffffffffe340) at basic_matmul.cpp:101:39
   98       using MatmulKernel = Gemm::Kernel::BasicMatmul<BlockMmad, BlockEpilogue, BlockScheduler>;
   99
   100      using MatmulAdapter = Gemm::Device::DeviceGemm<MatmulKernel>;
-> 101      MatmulKernel::Arguments arguments{options.problemShape, deviceA, deviceB, deviceC};
   102      MatmulAdapter matmulOp;
   103      matmulOp.CanImplement(arguments);
   104      size_t sizeWorkspace = matmulOp.GetWorkspaceSize(arguments);
(msdebug) c
Process 813993 resuming
[Launch of Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel11BasicMatmulINS1_5Blo on Device 0]
Compare success.
Process 813993 exited with status = 0 (0x00000000)
(msdebug)
```

#### Inspecting variables and memory

To inspect a scalar, use the `p` command to print the current value of a variable directly, here `n`.

```
Process 813993 launched: '/home/catlass/output/bin/00_basic_matmul' (aarch64)
Process 813993 stopped
* thread #1, name = '00_basic_matmul', stop reason = breakpoint 1.1
    frame #0: 0x0000aaaaaac0df8c 00_basic_matmul`Run(options=0x0000ffffffffe340) at basic_matmul.cpp:45:18
   42
   43       uint32_t m = options.problemShape.m();
   44       uint32_t n = options.problemShape.n();
-> 45       uint32_t k = options.problemShape.k();
   46
   47       size_t lenA = static_cast<size_t>(m) * k;
   48       size_t lenB = static_cast<size_t>(k) * n;
(msdebug) p n
(uint32_t) $0 = 512
```

To inspect memory, first use the `p` command to look up information about the buffer, such as its start address and data length. The command `x -m UB -f float16[] 65536 -c 4 -s 4` then prints the values stored in that buffer; a single command prints at most 1024 bytes. The session below was captured from the `09_splitk_matmul` sample, stopped at a device-side breakpoint in `splitk_matmul.hpp`, and address 65536 is the `bufferAddr` of `outputBuffer[0]`.

```
(msdebug) c
Process 814339 resuming
Process 814339 stopped
[Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
* thread #1, name = '09_splitk_matmu', stop reason = breakpoint 2.1
    frame #0: 0x000000000000bf98 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:136:19
   133
   134      AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
   135      AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
-> 136      Ub2Gm(dst[loopIdx * tileLen], outputBuffer[bufferIndex], actualTileLen);
   137      AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(outputEventIds[bufferIndex]);
   138
   139      bufferIndex = (bufferIndex + 1) % BUFFER_NUM;
```
```
(msdebug) p outputBuffer
(AscendC::LocalTensor<__fp16>[2]) $2 = {
  [0] = {
    AscendC::BaseLocalTensor<__fp16> = {    # memory, data type
      address_ = (dataLen = 131072, bufferAddr = 65536, bufferHandle = "", logicPos = '\v')    # start address, data length
    }
    shapeInfo_ = {
      shapeDim = '\x88'
      originalShapeDim = '\xf8'
      shape = {}
      originalShape = {}
      dataFormat = ND
    }
  }
  [1] = {
    AscendC::BaseLocalTensor<__fp16> = {
      address_ = (dataLen = 49152, bufferAddr = 147456, bufferHandle = "", logicPos = '\v')
    }
    shapeInfo_ = {
      shapeDim = '\x88'
      originalShapeDim = '\xf8'
      shape = {}
      originalShape = {}
      dataFormat = ND
    }
  }
}
(msdebug) x -m UB -f float16[] 65536 -c 4 -s 4    # print 4 rows of 4 bytes each as fp16, starting at UB address 65536
0x00010000: {355.5 188.75}
0x00010004: {244.125 -364.75}
0x00010008: {-104.875 -156}
0x0001000c: {232 -100.75}
(msdebug) x -m UB -f float16[] 65536 -c 4 -s 8    # print 4 rows of 8 bytes each as fp16, starting at UB address 65536
0x00010000: {355.5 188.75 244.125 -364.75}
0x00010008: {-104.875 -156 232 -100.75}
0x00010010: {-47.4062 105.875 -322.5 -265.75}
0x00010018: {260 200.125 -139.25 -190.625}
(msdebug)
```

To step through the code line by line, run `n` to advance the program to the next line.

```
(msdebug) n
Process 814339 stopped
[Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
* thread #1, name = '09_splitk_matmu', stop reason = step over
    frame #0: 0x000000000000bfe4 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:137:73
   134      AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
   135      AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
   136      Ub2Gm(dst[loopIdx * tileLen], outputBuffer[bufferIndex], actualTileLen);
-> 137      AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(outputEventIds[bufferIndex]);
   138
   139      bufferIndex = (bufferIndex + 1) % BUFFER_NUM;
   140  }
(msdebug) n
Process 814339 stopped
[Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
* thread #1, name = '09_splitk_matmu', stop reason = step over
    frame #0: 0x000000000000c000 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:139:28
   136      Ub2Gm(dst[loopIdx * tileLen], outputBuffer[bufferIndex], actualTileLen);
   137      AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(outputEventIds[bufferIndex]);
   138
-> 139      bufferIndex = (bufferIndex + 1) % BUFFER_NUM;
   140  }
   141
   142  AscendC::WaitFlag<AscendC::HardEvent::V_MTE2>(inputEventIds[0]);
(msdebug) n
Process 814339 stopped
[Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
* thread #1, name = '09_splitk_matmu', stop reason = step over
    frame #0: 0x000000000000c014 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:96:68
   93       AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(accumulatorEventIds[1]);
   94
   95       uint32_t loops = (elementCount + tileLen - 1) / tileLen;
-> 96       for (uint32_t loopIdx = aivId; loopIdx < loops; loopIdx += aivNum) {
   97           uint32_t actualTileLen = tileLen;
   98           if (loopIdx == loops - 1) {
   99               actualTileLen = elementCount - loopIdx * tileLen;
(msdebug)
```

The `var` command prints all variables in the current stack frame.

```
(msdebug) var
(Catlass::Gemm::Kernel::ReduceAdd<Catlass::Arch::AtlasA2, float, __fp16, 8192> *__stack__) this = 0x00000000001cf838
(const AscendC::GlobalTensor<__fp16> &__stack__) dst = 0x00000000001cf930: {
  AscendC::BaseGlobalTensor<__fp16> = {
    address_ = 0x000012c0c0094000
    oriAddress_ = 0x000012c0c0094000
  }
  bufferSize_ = 1898896
  shapeInfo_ = {
    shapeDim = 'h'
    originalShapeDim = '\xf9'
    shape = {}
    originalShape = {}
    dataFormat = ND
  }
  cacheMode_ = CACHE_MODE_NORMAL
}
(const AscendC::GlobalTensor<float> &__stack__) src = 0x00000000001cf908: {
  AscendC::BaseGlobalTensor<float> = {
    address_ = 0x000012c041400000
    oriAddress_ = 0x000012c041400000
  }
  bufferSize_ = 1898904
  shapeInfo_ = {
    shapeDim = 'H'
    originalShapeDim = '\xf9'
    shape = {}
    originalShape = {}
    dataFormat = ND
  }
  cacheMode_ = CACHE_MODE_NORMAL
}
(uint64_t) elementCount = 131072
(uint32_t) splitkFactor = 2
(const uint32_t) ELE_PER_VECTOR_BLOCK = 64
(uint32_t) aivNum = 48
(uint32_t) aivId = 26
(uint64_t) taskPerAiv = 2752
(uint32_t) tileLen = 2752
(uint32_t) loops = 48
(uint32_t) loopIdx = 26
(msdebug)
```

#### Exiting the debugger

When debugging is finished, exit msdebug with the `q` command. If you force an exit, for example with `Ctrl+C`, the msdebug process does not terminate and keeps running in the background. In that case, find its PID with `ps -ef | grep msdebug` and terminate it with `kill -9 <PID>`. Multiple msdebug processes cannot be started for debugging at the same time.

```
(msdebug) q
Quitting LLDB will kill one or more processes. Do you really want to proceed: [Y/n] y
```

## Common commands

| Command | Abbreviation | Purpose | Example |
| --- | --- | --- | --- |
| `breakpoint filename:lineNo` | `b` | Add a breakpoint | `b add_custom.cpp:85`<br>`b my_function` |
| `run` | `r` | Run the program again | `r` |
| `continue` | `c` | Continue execution | `c` |
| `print` | `p` | Print a variable | `p zLocal` |
| `frame variable` | `var` | Print all variables in the current frame | `var` |
| `memory read` | `x` | Read memory.<br>`-m`: memory location; GM/UB/L0A/L0B/L0C are supported<br>`-f`: format used to interpret the bytes<br>`-s`: number of bytes printed per row<br>`-c`: number of rows printed | `x -m GM -f float16[] 1000 -c 2 -s 128` |
| `register read` | `re r` | Read register values.<br>`-a`: read all registers<br>`$REG_NAME`: read the register with the given name | `register read -a`<br>`re r $PC` |
| `thread step-over` | `next`, `n` | Move to the next executable line of code within the same call stack | `n` |
| `ascend info devices` | / | Query device information | `ascend info devices` |
| `ascend info cores` | / | Query information about the aicores the operator runs on | `ascend info cores` |
| `ascend info tasks` | / | Query information about the tasks the operator runs in | `ascend info tasks` |
| `ascend info stream` | / | Query information about the stream the operator runs in | `ascend info stream` |
| `ascend info blocks` | / | Query information about the blocks the operator runs in. The optional `-d/--details` flag shows the code at the current stop location for all blocks | `ascend info blocks` |
| `ascend aic core` | / | Switch the debugger's focus to the given cube core | `ascend aic 1` |
| `ascend aiv core` | / | Switch the debugger's focus to the given vector core | `ascend aiv 5` |
| `target modules add kernel.o` | `image add kernel.o` | Import operator debug information when the operator is launched by the PyTorch framework. Note: if this command is run after `run`, an additional `image load` command is needed for the debug information to take effect | `image add AddCustom_xxx.o` |
| `target modules load -f kernel.o -s address` | `image load -f kernel.o -s address` | Make imported debug information take effect after the program has started | `image load -f AddCustom_xxx.o -s 0` |

## Appendix

### Data formats supported by msdebug

```
Valid values are:
default
B or boolean
b or binary
y or bytes
Y or bytes with ASCII
c or character
C or printable character
F or complex float
s or c-string
d or decimal
E or enumeration
x or hex
X or uppercase hex
f or float
brain float16
o or octal
O or OSType
U or unicode16
unicode32
u or unsigned decimal
p or pointer
char[]
int8_t[]
uint8_t[]
int16_t[]
uint16_t[]
int32_t[]
uint32_t[]
int64_t[]
uint64_t[]
bfloat16[]
float16[]
float32[]
float64[]
uint128_t[]
I or complex integer
a or character array
A or address
hex float
i or instruction
v or void
u or unicode8
```

### Selecting the NPU card used for debugging

Set the environment variable `ASCEND_RT_VISIBLE_DEVICES` to the ID of the NPU card you want to use, for example:

```bash
# Restrict the current process to the Device whose Device ID is 2
export ASCEND_RT_VISIBLE_DEVICES=2
```
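To make the `-c` and `-s` options of the `x` (memory read) command more concrete, here is a minimal Python sketch that lays out raw bytes in the same row format as the `float16[]` sample output earlier in this document. It is an illustration only and not part of msdebug: the function name `format_fp16_rows` is hypothetical, little-endian byte order is an assumption, and the 1024-byte cap simply mirrors the per-command limit stated in this document.

```python
import struct
from typing import List

# Per-command limit stated in this document for the `x` command.
MAX_BYTES_PER_READ = 1024


def format_fp16_rows(raw: bytes, base_addr: int, rows: int, bytes_per_row: int) -> List[str]:
    """Format `raw` as `rows` lines of fp16 values (hypothetical helper).

    `rows` plays the role of `-c` and `bytes_per_row` the role of `-s`.
    """
    total = rows * bytes_per_row
    if total > MAX_BYTES_PER_READ:
        raise ValueError("a single read is limited to 1024 bytes")
    if bytes_per_row % 2 != 0:
        raise ValueError("each fp16 value occupies 2 bytes")
    if len(raw) < total:
        raise ValueError("not enough data for the requested rows")

    lines = []
    for row in range(rows):
        start = row * bytes_per_row
        chunk = raw[start:start + bytes_per_row]
        # "<e" = little-endian IEEE 754 half precision (byte order is assumed).
        values = struct.unpack(f"<{bytes_per_row // 2}e", chunk)
        text = " ".join(f"{v:g}" for v in values)
        lines.append(f"0x{base_addr + start:08x}: {{{text}}}")
    return lines


# Rebuild the first 16 bytes of the sample buffer from the values shown earlier.
sample = struct.pack("<8e", 355.5, 188.75, 244.125, -364.75,
                     -104.875, -156.0, 232.0, -100.75)

# Equivalent in spirit to: x -m UB -f float16[] 65536 -c 4 -s 4
for line in format_fp16_rows(sample, base_addr=65536, rows=4, bytes_per_row=4):
    print(line)
# 0x00010000: {355.5 188.75}
# 0x00010004: {244.125 -364.75}
# 0x00010008: {-104.875 -156}
# 0x0001000c: {232 -100.75}
```

Doubling `bytes_per_row` to 8 and halving `rows` to 2 covers the same 16 bytes with four values per line, which is how the `-s 8` variant of the sample output is arranged.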