Using msDebug in the CATLASS Sample Project


张建站

2026/5/9 16:16:33


catlass is CANN's operator template library. It provides template samples for high-performance matrix multiplication and related fused operators on the NPU. Project address: https://gitcode.com/cann/catlass

Authoring note: parts of this article were generated with AI assistance (AIGC) and are for reference only.

msDebug is a tool for debugging operator programs that run on the NPU. It gives operator developers a way to debug operators on Ascend devices, including reading device memory and registers and pausing and resuming program execution.

⚠️ Note: if you develop and debug inside a container, make sure /dev/drv_debug is mapped into the container (see the driver check).

Usage example

The following uses 00_basic_matmul as an example to walk through debugging with msDebug.

Enabling the driver's debug feature

Following the msDebug tool overview, either install the driver in debug mode or, with a driver installed in full mode, run echo 1 > /proc/debug_switch to open the debug channel. To avoid security problems, do not enable the debug channel in a production environment.

If the following error appears, the driver version is too old and the driver must be updated.

    msdebug failed to initialize. please install HDK.
    [ERROR] error code: 0x20102
    terminate called after throwing an instance of MSDEBUG_ERROR_CODE

Build and run

Following the Quick Start, turn on the build switches --debug --msdebug to compile the operator sample with debug and msdebug enabled.

    bash scripts/build.sh --debug --msdebug 00_basic_matmul

--debug governs the debug switch for both host-side and device-side code, and --msdebug governs the debug switch for device-side code. If only --debug is added, only host debugging is enabled, and only host-side code can be debugged with gdb/lldb.

Switch to output/bin, the build directory of the executable, and run the operator sample under msdebug.

    cd output/bin
    # executable name | matrix m axis | n axis | k axis | Device ID (optional)
    msdebug ./00_basic_matmul 256 512 1024 0

    msdebug ./00_basic_matmul 256 512 1024 0
    msdebug(MindStudio Debugger) is part of MindStudio Operator-dev Tools.
    The tool provides developers with a mechanism for debugging Ascend kernels running on actual hardware.
    This enables developers to debug Ascend kernels without being affected by potential changes brought by simulation and emulation environments.
    (msdebug) target create ./00_basic_matmul
    Current executable set to /home/catlass/output/bin/00_basic_matmul (aarch64).
    (msdebug) settings set -- target.run-args 256 512 1024 0
    (msdebug)

Command-line debugging

Setting breakpoints and running the program

Set two breakpoints with b basic_matmul.cpp:45 and b basic_matmul.cpp:90, then list the existing breakpoints with breakpoint list. In 00_basic_matmul.cpp, lines 90 to 101 are type alias definitions that generate no runtime machine code, which is why the second breakpoint resolves to line 101 in the output below.

    (msdebug) b basic_matmul.cpp:45
    Breakpoint 1: where = 00_basic_matmul`Run(GemmOptions const&) + 460 at basic_matmul.cpp:45:18, address = 0x000000000016df8c
    (msdebug) b basic_matmul.cpp:90
    Breakpoint 2: where = 00_basic_matmul`Run(GemmOptions const&) + 2816 at basic_matmul.cpp:101:39, address = 0x000000000016e8c0
    (msdebug) breakpoint list
    Current breakpoints:
    1: file = 'basic_matmul.cpp', line = 45, exact_match = 0, locations = 1
      1.1: where = 00_basic_matmul`Run(GemmOptions const&) + 460 at basic_matmul.cpp:45:18, address = 00_basic_matmul[0x000000000016df8c], unresolved, hit count = 0
    2: file = 'basic_matmul.cpp', line = 90, exact_match = 0, locations = 1
      2.1: where = 00_basic_matmul`Run(GemmOptions const&) + 2816 at basic_matmul.cpp:101:39, address = 00_basic_matmul[0x000000000016e8c0], unresolved, hit count = 0
    (msdebug)

Run r and the program executes until it reaches the first breakpoint; run c and it continues to the next breakpoint.

Note that an operator program is usually dispatched to several accelerator cores that run concurrently. As soon as one core hits a breakpoint, it notifies the other cores through an interrupt so that they stop immediately. The other cores are therefore not guaranteed to stop at that same breakpoint, and the same breakpoint may later be hit again by another core. Disable or delete the breakpoint to keep cores from hitting it over and over.

    (msdebug) r
    Process 813993 launched: /home/catlass/output/bin/00_basic_matmul (aarch64)
    Process 813993 stopped
    * thread #1, name = '00_basic_matmul', stop reason = breakpoint 1.1
        frame #0: 0x0000aaaaaac0df8c 00_basic_matmul`Run(options=0x0000ffffffffe340) at basic_matmul.cpp:45:18
       42
       43       uint32_t m = options.problemShape.m();
       44       uint32_t n = options.problemShape.n();
    -> 45       uint32_t k = options.problemShape.k();
       46
       47       size_t lenA = static_cast<size_t>(m) * k;
       48       size_t lenB = static_cast<size_t>(k) * n;
    (msdebug) c
    Process 813993 resuming
    Process 813993 stopped
    * thread #1, name = '00_basic_matmul', stop reason = breakpoint 2.1
        frame #0: 0x0000aaaaaac0e8c0 00_basic_matmul`Run(options=0x0000ffffffffe340) at basic_matmul.cpp:101:39
       98       using MatmulKernel = Gemm::Kernel::BasicMatmul<BlockMmad, BlockEpilogue, BlockScheduler>;
       99
       100      using MatmulAdapter = Gemm::Device::DeviceGemm<MatmulKernel>;
    -> 101      MatmulKernel::Arguments arguments{options.problemShape, deviceA, deviceB, deviceC};
       102      MatmulAdapter matmulOp;
       103      matmulOp.CanImplement(arguments);
       104      size_t sizeWorkspace = matmulOp.GetWorkspaceSize(arguments);
    (msdebug) c
    Process 813993 resuming
    [Launch of Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel11BasicMatmulINS1_5Blo on Device 0]
    Compare success.
    Process 813993 exited with status = 0 (0x00000000)
    (msdebug)

Inspecting variables and memory

To inspect a scalar, use p. For example, p n prints the current value of the variable n.

    Process 813993 launched: /home/catlass/output/bin/00_basic_matmul (aarch64)
    Process 813993 stopped
    * thread #1, name = '00_basic_matmul', stop reason = breakpoint 1.1
        frame #0: 0x0000aaaaaac0df8c 00_basic_matmul`Run(options=0x0000ffffffffe340) at basic_matmul.cpp:45:18
       42
       43       uint32_t m = options.problemShape.m();
       44       uint32_t n = options.problemShape.n();
    -> 45       uint32_t k = options.problemShape.k();
       46
       47       size_t lenA = static_cast<size_t>(m) * k;
       48       size_t lenB = static_cast<size_t>(k) * n;
    (msdebug) p n
    (uint32_t) $0 = 512

To inspect memory, first use p to print the tensor and find its address. Then x -m UB -f float16[] 65536 -c 4 -s 4 prints the values stored at that address; in the output below, 65536 is the bufferAddr of outputBuffer[0]. A single read prints at most 1024 bytes. The transcript below was captured on a split-K matmul sample (thread name 09_splitk_matmu, source file splitk_matmul.hpp), stopped inside device-side code.

    (msdebug) c
    Process 814339 resuming
    Process 814339 stopped
    [Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
    * thread #1, name = '09_splitk_matmu', stop reason = breakpoint 2.1
        frame #0: 0x000000000000bf98 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:136:19
       133
       134      AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
       135      AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
    -> 136      Ub2Gm(dst[loopIdx * tileLen], outputBuffer[bufferIndex], actualTileLen);
       137      AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(outputEventIds[bufferIndex]);
       138
       139      bufferIndex = (bufferIndex + 1) % BUFFER_NUM;
    (msdebug) p outputBuffer
    (AscendC::LocalTensor<__fp16>[2]) $2 = {
      [0] = {
        AscendC::BaseLocalTensor<__fp16> = {    # memory, data type
          address_ = (dataLen = 131072, bufferAddr = 65536, bufferHandle = "", logicPos = '\v')    # start address, data length
        }
        shapeInfo_ = {
          shapeDim = '\x88'
          originalShapeDim = '\xf8'
          shape = {}
          originalShape = {}
          dataFormat = ND
        }
      }
      [1] = {
        AscendC::BaseLocalTensor<__fp16> = {
          address_ = (dataLen = 49152, bufferAddr = 147456, bufferHandle = "", logicPos = '\v')
        }
        shapeInfo_ = {
          shapeDim = '\x88'
          originalShapeDim = '\xf8'
          shape = {}
          originalShape = {}
          dataFormat = ND
        }
      }
    }
    (msdebug) x -m UB -f float16[] 65536 -c 4 -s 4    # in UB, starting at address 65536, print 4 rows of 4 bytes each as fp16
    0x00010000: {355.5 188.75}
    0x00010004: {244.125 -364.75}
    0x00010008: {-104.875 -156}
    0x0001000c: {232 -100.75}
    (msdebug) x -m UB -f float16[] 65536 -c 4 -s 8    # in UB, starting at address 65536, print 4 rows of 8 bytes each as fp16
    0x00010000: {355.5 188.75 244.125 -364.75}
    0x00010008: {-104.875 -156 232 -100.75}
    0x00010010: {-47.4062 105.875 -322.5 -265.75}
    0x00010018: {260 200.125 -139.25 -190.625}
    (msdebug)

To step line by line, run n to advance the program to the next line.

    (msdebug) n
    Process 814339 stopped
    [Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
    * thread #1, name = '09_splitk_matmu', stop reason = step over
        frame #0: 0x000000000000bfe4 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:137:73
       134      AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
       135      AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(outputEventIds[bufferIndex]);
       136      Ub2Gm(dst[loopIdx * tileLen], outputBuffer[bufferIndex], actualTileLen);
    -> 137      AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(outputEventIds[bufferIndex]);
       138
       139      bufferIndex = (bufferIndex + 1) % BUFFER_NUM;
       140  }
    (msdebug) n
    Process 814339 stopped
    [Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
    * thread #1, name = '09_splitk_matmu', stop reason = step over
        frame #0: 0x000000000000c000 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:139:28
       136      Ub2Gm(dst[loopIdx * tileLen], outputBuffer[bufferIndex], actualTileLen);
       137      AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(outputEventIds[bufferIndex]);
       138
    -> 139      bufferIndex = (bufferIndex + 1) % BUFFER_NUM;
       140  }
       141
       142  AscendC::WaitFlag<AscendC::HardEvent::V_MTE2>(inputEventIds[0]);
    (msdebug) n
    Process 814339 stopped
    [Switching to focus on Kernel _ZN7Catlass13KernelAdapterINS_4Gemm6Kernel12SplitkMatmulINS1_5Bl, CoreId 0, Type aiv]
    * thread #1, name = '09_splitk_matmu', stop reason = step over
        frame #0: 0x000000000000c014 device_debugdata`_ZN7Catlass4Gemm6Kernel9ReduceAddINS_4Arch7AtlasA2EfDhLj8192EEclERKN7AscendC12GlobalTensorIDhEERKNS7_IfEEmj_mix_aiv(this=0x00000000001cf838, dst=0x00000000001cf930, src=0x00000000001cf908, elementCount=131072, splitkFactor=2) at splitk_matmul.hpp:96:68
       93   AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(accumulatorEventIds[1]);
       94
       95   uint32_t loops = (elementCount + tileLen - 1) / tileLen;
    -> 96   for (uint32_t loopIdx = aivId; loopIdx < loops; loopIdx += aivNum) {
       97       uint32_t actualTileLen = tileLen;
       98       if (loopIdx == loops - 1) {
       99           actualTileLen = elementCount - loopIdx * tileLen;
    (msdebug)

The var command prints every variable in the current stack frame.

    (msdebug) var
    (Catlass::Gemm::Kernel::ReduceAdd<Catlass::Arch::AtlasA2, float, __fp16, 8192> *__stack__) this = 0x00000000001cf838
    (const AscendC::GlobalTensor<__fp16> __stack__) dst = 0x00000000001cf930: {
      AscendC::BaseGlobalTensor<__fp16> = {
        address_ = 0x000012c0c0094000
        oriAddress_ = 0x000012c0c0094000
      }
      bufferSize_ = 1898896
      shapeInfo_ = {
        shapeDim = 'h'
        originalShapeDim = '\xf9'
        shape = {}
        originalShape = {}
        dataFormat = ND
      }
      cacheMode_ = CACHE_MODE_NORMAL
    }
    (const AscendC::GlobalTensor<float> __stack__) src = 0x00000000001cf908: {
      AscendC::BaseGlobalTensor<float> = {
        address_ = 0x000012c041400000
        oriAddress_ = 0x000012c041400000
      }
      bufferSize_ = 1898904
      shapeInfo_ = {
        shapeDim = 'H'
        originalShapeDim = '\xf9'
        shape = {}
        originalShape = {}
        dataFormat = ND
      }
      cacheMode_ = CACHE_MODE_NORMAL
    }
    (uint64_t) elementCount = 131072
    (uint32_t) splitkFactor = 2
    (const uint32_t) ELE_PER_VECTOR_BLOCK = 64
    (uint32_t) aivNum = 48
    (uint32_t) aivId = 26
    (uint64_t) taskPerAiv = 2752
    (uint32_t) tileLen = 2752
    (uint32_t) loops = 48
    (uint32_t) loopIdx = 26
    (msdebug)

Exiting the debugger

When debugging is finished, exit msdebug with q. If you force an exit with Ctrl+C or similar, the msdebug process does not terminate and keeps running in the background. In that case, find its pid with ps -ef | grep msdebug and terminate it with kill -9 followed by that pid. Multiple msdebug processes cannot be started for debugging at the same time.

    (msdebug) q
    Quitting LLDB will kill one or more processes. Do you really want to proceed: [Y/n] y

Common commands

| Command | Abbreviation | Purpose | Example |
| --- | --- | --- | --- |
| breakpoint filename:lineNo | b | Add a breakpoint | b add_custom.cpp:85, b my_function |
| run | r | Run the program again | r |
| continue | c | Continue execution | c |
| print | p | Print a variable | p zLocal |
| frame variable | var | Print all variables in the current frame | var |
| memory read | x | Read memory. -m selects the memory location (GM/UB/L0A/L0B/L0C are supported); -f selects the format the bytes are converted to; -s sets the number of bytes printed per row; -c sets the number of rows printed | x -m GM -f float16[] 1000 -c 2 -s 128 |
| register read | re r | Read register values. -a reads all registers; $REG_NAME reads the register with that name | register read -a, re r $PC |
| thread step-over | next, n | Move to the next executable source line in the same call stack | n |
| ascend info devices | / | Query device information | ascend info devices |
| ascend info cores | / | Query information about the aicore the operator runs on | ascend info cores |
| ascend info tasks | / | Query information about the task the operator runs in | ascend info tasks |
| ascend info stream | / | Query information about the stream the operator runs in | ascend info stream |
| ascend info blocks | / | Query information about the blocks the operator runs in; the optional -d/--details flag shows the code at the current stop point for all blocks | ascend info blocks |
| ascend aic core | / | Switch the cube core the debugger is focused on | ascend aic 1 |
| ascend aiv core | / | Switch the vector core the debugger is focused on | ascend aiv 5 |
| target modules add kernel.o | image add kernel.o | Import operator debug information when the operator is launched from the PyTorch framework. Note: if this command is run after the program has already executed run, image load must also be run for the debug information to take effect | image add AddCustom_xxx.o |
| target modules load -f kernel.o -s address | image load -f kernel.o -s address | Make imported debug information take effect after the program is running | image load -f AddCustom_xxx.o -s 0 |

Appendix

Data formats supported by msdebug

    Valid values are:
        default
        B or boolean
        b or binary
        y or bytes
        Y or bytes with ASCII
        c or character
        C or printable character
        F or complex float
        s or c-string
        d or decimal
        E or enumeration
        x or hex
        X or uppercase hex
        f or float
        brain float16
        o or octal
        O or OSType
        U or unicode16
        unicode32
        u or unsigned decimal
        p or pointer
        char[]
        int8_t[]
        uint8_t[]
        int16_t[]
        uint16_t[]
        int32_t[]
        uint32_t[]
        int64_t[]
        uint64_t[]
        bfloat16[]
        float16[]
        float32[]
        float64[]
        uint128_t[]
        I or complex integer
        a or character array
        A or address
        hex float
        i or instruction
        v or void
        u or unicode8

Specifying the NPU card used for debugging

Set the environment variable ASCEND_RT_VISIBLE_DEVICES to the ID of the NPU card to use, for example:

    # restrict the current process to the Device whose Device ID is 2
    export ASCEND_RT_VISIBLE_DEVICES=2
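As a companion to the memory read in the walkthrough (x -m UB -f float16[] 65536 -c 4 -s 8), the following Python sketch works out which row addresses a given -c/-s pair will print and how many elements each row holds. It is only an offline planning aid, not part of msdebug. The helper names plan_memory_read and format_command are hypothetical, and treating the 1024-byte limit as rows times bytes per row is an assumption based on the article's statement that one read prints at most 1024 bytes.

```python
# Offline helper for planning an msdebug `x` (memory read) command.
# Hypothetical names; nothing here talks to msdebug itself.

MAX_BYTES_PER_READ = 1024  # assumption: the article's "at most 1024 bytes per read"


def plan_memory_read(base_addr, rows, bytes_per_row, elem_size=2):
    """Return [(row_address, elements_in_row), ...] for `-c rows -s bytes_per_row`.

    elem_size defaults to 2 because float16 values occupy two bytes.
    """
    if rows <= 0 or bytes_per_row <= 0:
        raise ValueError("rows and bytes_per_row must be positive")
    if bytes_per_row % elem_size != 0:
        raise ValueError("bytes_per_row must be a multiple of the element size")
    if rows * bytes_per_row > MAX_BYTES_PER_READ:
        raise ValueError("read exceeds the assumed 1024-byte limit")
    # Each printed row starts bytes_per_row after the previous one.
    return [
        (base_addr + i * bytes_per_row, bytes_per_row // elem_size)
        for i in range(rows)
    ]


def format_command(memory, fmt, base_addr, rows, bytes_per_row):
    """Build the command string in the same shape the article uses."""
    return f"x -m {memory} -f {fmt} {base_addr} -c {rows} -s {bytes_per_row}"


if __name__ == "__main__":
    print(format_command("UB", "float16[]", 65536, 4, 8))
    for addr, count in plan_memory_read(65536, 4, 8):
        print(f"0x{addr:08x}: {count} float16 values")
```

For the article's read, the planned row addresses are 0x00010000, 0x00010008, 0x00010010 and 0x00010018 with four float16 values each, which matches the addresses and row widths in the transcript.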
