CANN/catlass TileMmad: Matrix Multiply-Add Implementation

张建站

2026/5/30 21:50:39

10 min read

TileMmad

catlass is CANN's operator template library. It provides template samples for high-performance matrix multiplication and related fused operators on NPUs. Project address: https://gitcode.com/cann/catlass

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.

Function description

TileMmad uses the AscendC::Mmad basic API to perform the matrix multiply-add C = A * B. Operand A resides in L0A, B in L0B, and C in L0C, with layouts zZ, nZ, and zN respectively. This is the non-TLA implementation.

Two calling modes are supported:

- Without Bias: standard matrix multiply-add.
- With Bias: the Bias held in BT is loaded into L0C, then the matrix multiply-add is executed.

The difference from the TLA version, TileMmadTla, is that this template operates directly on AscendC::LocalTensor and does not use the tla::Tensor wrapper.

Architecture differences

| Architecture | kDirectionAlign | disableGemv | Description |
| --- | --- | --- | --- |
| AtlasA2 (2201) | Enabled for float with a ColumnMajor/nZ L1A | - | K-direction alignment optimization |
| Ascend950 (3510) | - | false when L1A is VectorLayout, true otherwise | GEMV mode control |

Template prototype

    template <
        class ArchTag_,   // architecture tag
        class AType_,     // GmType of matrix A
        class BType_,     // GmType of matrix B
        class BiasType_   // GmType of Bias
    >
    struct TileMmad;

Call interface

Without Bias

    void operator()(
        AscendC::LocalTensor<ElementAccumulator> const &l0CTensor,  // L0C accumulation result
        AscendC::LocalTensor<ElementA> const &l0ATensor,            // L0A left matrix
        AscendC::LocalTensor<ElementB> const &l0BTensor,            // L0B right matrix
        uint32_t m,            // M dimension (after alignment)
        uint32_t n,            // N dimension (after alignment)
        uint32_t k,            // K dimension (after alignment)
        bool initC = true,     // true: overwrite, false: atomic accumulate
        uint8_t unitFlag = 0   // flag for parallel L0C→GM transfer
    );

With Bias

    void operator()(
        AscendC::LocalTensor<ElementAccumulator> const &l0CTensor,
        AscendC::LocalTensor<ElementA> const &l0ATensor,
        AscendC::LocalTensor<ElementB> const &l0BTensor,
        AscendC::LocalTensor<ElementAccumulator> const &l0BiasTensor,  // Bias data in BT
        uint32_t m,
        uint32_t n,
        uint32_t k,
        bool initC = true,     // forced to false when Bias is used (overridden internally)
        uint8_t unitFlag = 0
    );

Tail pipeline barrier

When (m / 16) * (n / 16) < 10, PipeBarrier<PIPE_M>() is inserted automatically to prevent pipeline conflicts.

Call examples

Without Bias

    #include "catlass/gemm/tile/tile_mmad.hpp"

    using namespace Catlass::Gemm;

    using AType = Gemm::GemmType<half, layout::zZ>;
    using BType = Gemm::GemmType<half, layout::nZ>;
    using BiasType = void;

    AscendC::LocalTensor<half> l0ATensor;
    AscendC::LocalTensor<half> l0BTensor;
    AscendC::LocalTensor<float> l0CTensor;

    Tile::TileMmad<Arch::AtlasA2, AType, BType, BiasType> mmadOp;
    mmadOp(l0CTensor, l0ATensor, l0BTensor, 64, 64, 32);

With Bias

    using AType = Gemm::GemmType<half, layout::zZ>;
    using BType = Gemm::GemmType<half, layout::nZ>;
    using BiasType = Gemm::GemmType<float, layout::VectorLayout>;

    AscendC::LocalTensor<float> l0BiasTensor;

    Tile::TileMmad<Arch::AtlasA2, AType, BType, BiasType> mmadOp;
    mmadOp(l0CTensor, l0ATensor, l0BTensor, l0BiasTensor, 64, 64, 32);

unitFlag parallel transfer

    bool initC = true;
    uint8_t unitFlag = 1;  // enable L0C→GM parallel transfer

    // First mmad: initialize C
    mmadOp(l0CTensor, l0ATensor, l0BTensor, 64, 64, 32, initC, unitFlag);

    // Subsequent mmads: atomic accumulate, transfer stays parallel
    initC = false;
    mmadOp(l0CTensor, l0ATensor, l0BTensor, 64, 64, 32, initC, unitFlag);
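To make the initC, Bias, and tail-barrier semantics described above easier to check, here is a minimal host-side sketch in plain C++. It is not catlass or AscendC code: the names ReferenceMmad, ReferenceMmadBias, and NeedsTailBarrier are hypothetical, the matrices use simple row-major storage instead of the zZ/nZ/zN fractal layouts, and the sketch assumes that the Bias is a length-n vector broadcast across the rows of C and that the barrier condition is a strict "less than 10" comparison.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical reference model of the multiply-add semantics (row-major, host side).
// initC == true overwrites C with A * B; initC == false accumulates into the existing C.
void ReferenceMmad(std::vector<float> &c, const std::vector<float> &a,
                   const std::vector<float> &b, uint32_t m, uint32_t n, uint32_t k,
                   bool initC = true)
{
    for (uint32_t i = 0; i < m; ++i) {
        for (uint32_t j = 0; j < n; ++j) {
            float acc = initC ? 0.0f : c[i * n + j];
            for (uint32_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

// Hypothetical model of the Bias mode: C is first filled with the Bias
// (assumed to be broadcast along M), then A * B is accumulated on top,
// which mirrors initC being forced to false.
void ReferenceMmadBias(std::vector<float> &c, const std::vector<float> &a,
                       const std::vector<float> &b, const std::vector<float> &bias,
                       uint32_t m, uint32_t n, uint32_t k)
{
    for (uint32_t i = 0; i < m; ++i) {
        for (uint32_t j = 0; j < n; ++j) {
            c[i * n + j] = bias[j];
        }
    }
    ReferenceMmad(c, a, b, m, n, k, /*initC=*/false);
}

// Tail-barrier predicate, assuming 16x16 fractals and a "< 10" threshold.
bool NeedsTailBarrier(uint32_t m, uint32_t n)
{
    return (m / 16) * (n / 16) < 10;
}
```

Under these assumptions, the 64 x 64 tile used in the call examples covers 4 * 4 = 16 fractals, so no tail barrier would be inserted, while a 32 x 32 tile (2 * 2 = 4 fractals) would trigger one.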
