# CANN Vector Stride and Slicing Constraints
[Free download link] cannbot-skills: CANNBot is a family of agents for CANN development that aims to improve development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when a vec operation needs to access part of a wider buffer, or when a narrow source (e.g. a row-max buffer) must align with a wide destination row by row.

## Goal

Decide correctly when a vec operation can run continuously over a full buffer and when it requires sliced views or explicit stride configuration.

## 1. The alignment problem

Vec operations infer `repeat` from the destination tensor and strides from each tensor's `span`/`shape`. When a wide buffer (e.g. `[M, 128]`) is paired with a narrow buffer (e.g. `[M, 8]`), the repeat counts may not align row by row.

For float (C0=8):

- `[M, 128]` → `span[1]=128` matches neither `8*C0=64` nor `C0=8` → default strides (blk=1, rep=8)
  - Each row takes **2 repeats** (128 / 64 = 2)
- `[M, 8]` → `span[1]=8=C0` → blk=0, rep=1
  - Each row takes **1 repeat** from the narrow buffer

If `sub(wide[M,128], wide[M,128], narrow[M,8])` is called directly:

- repeat = M * 128 / 64 = 2M (taken from dst)
- narrow advances 1 row per repeat → after repeat 0 (row 0, first half), narrow has moved to row 1
- row 0's second half is paired with row 1's value → **misaligned!**

## 2. Fix: slice the wide buffer to 64-column views

Slicing to `[M, 64]` creates a view where `span[1]=64=8*C0`:

- blk=1, rep=`shape[1]//C0` (e.g. `128//8=16` for a 128-wide parent)
- Each row takes **1 repeat** → aligns with the narrow buffer's rep=1

```python
# Correct: sliced views ensure 1 repeat per row
sub(ub[0:M, 0:64], ub[0:M, 0:64], max_buf)      # first half
sub(ub[0:M, 64:128], ub[0:M, 64:128], max_buf)  # second half
```

The slice syntax creates a Tensor view with updated `span` and `offset` while keeping the original `shape`. Stride auto-inference uses `span` to select the stride pattern and `shape` to compute `rep_stride`, so each repeat correctly skips the full row width.
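The row pairing described in sections 1 and 2 can be checked with a small model. This is a plain-Python sketch, not the DSL: `pairing` and its parameters are hypothetical names, and it assumes only what the text states for float (one repeat covers 64 elements, the narrow `[M, 8]` buffer advances one row per repeat, and a 64-column view uses `rep_stride = shape[1] // C0`).

```python
C0 = 8                      # float elements per block (C0=8 for float)
ELEMS_PER_REPEAT = 8 * C0   # one repeat covers 8 blocks = 64 floats


def pairing(m, dst_cols, dst_row_width, narrow_rep_stride=1):
    """Return (wide_row, narrow_row) for each repeat of a toy vec op.

    dst_cols      : columns covered per row (span[1] of the destination)
    dst_row_width : full row width of the parent buffer (shape[1])
    """
    repeats = m * dst_cols // ELEMS_PER_REPEAT
    # A 64-column span strides by the full row; anything else uses the default 8.
    dst_rep_stride = dst_row_width // C0 if dst_cols == ELEMS_PER_REPEAT else 8
    pairs = []
    for rep in range(repeats):
        wide_row = rep * dst_rep_stride * C0 // dst_row_width
        narrow_row = rep * narrow_rep_stride  # narrow is [M, 8]: one block per row
        pairs.append((wide_row, narrow_row))
    return pairs


direct = pairing(4, dst_cols=128, dst_row_width=128)
sliced = pairing(4, dst_cols=64, dst_row_width=128)
print(direct[:4])  # [(0, 0), (0, 1), (1, 2), (1, 3)]
print(sliced)      # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

In the direct call the second repeat already pairs wide row 0 with narrow row 1, and later repeats run past the end of the narrow buffer. With a 64-column view every repeat pairs matching rows, which is why the two sliced `sub` calls are needed.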
## 3. When slicing is NOT needed

Purely element-wise operations (no narrow source) can run continuously over the full buffer:

| Operation | Needs slicing? | Reason |
| --- | --- | --- |
| `muls(wide, wide, scalar)` | No | Scalar broadcasts uniformly |
| `exp(wide, wide)` | No | Same-shape in-place, no alignment issue |
| `cast(half_out, float_in)` | No | Same-shape element-wise conversion |
| `sub(wide, wide, narrow)` | Yes | Narrow source advances 1 row per repeat |
| `vmax(dst64, wide_half1, wide_half2)` | Yes | Needs column views of a wider buffer |
| `brcb(wide, narrow)` | Explicit strides | See brcb section |

Rule: if all source and destination tensors have the same `span` and are operated on element-wise, no slicing is needed. If any operand is narrower than the others, slice the wider operands so that they match the narrow operand's per-row repeat cadence.

## 4. Stride auto-inference rules

From `vecutils.infer_strides(tensor)` for float (C0=8):

| `span[1]` | Matches | `blk_stride` | `rep_stride` |
| --- | --- | --- | --- |
| 64 (= 8×C0) | Yes | 1 | `shape[1] // C0` |
| 8 (= C0) | Yes | 0 | `shape[1] // C0` |
| other | No | 1 (default) | 8 (default) |

For half (C0=16):

| `span[1]` | Matches | `blk_stride` | `rep_stride` |
| --- | --- | --- | --- |
| 128 (= 8×C0) | Yes | 1 | `shape[1] // C0` |
| 16 (= C0) | Yes | 0 | `shape[1] // C0` |
| other | No | 1 (default) | 8 (default) |

When `span[0]` is 1 and a match occurred, `rep_stride` is overridden to `0`.

`infer_repeat(tensor)` always uses: `span[0] * span[1] / (256 // dtype.size)`
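The tables above can be restated as a short function. This is a sketch written from the tables, not the code in `vecutils.py`: the `View` class and the `*_sketch` names are hypothetical, `C0` is assumed to be `32 // dtype_size` (which gives 8 for float and 16 for half), and the repeat count uses integer division.

```python
from dataclasses import dataclass


@dataclass
class View:
    shape: tuple     # parent allocation (rows, cols)
    span: tuple      # region actually covered (rows, cols)
    dtype_size: int  # bytes per element: 4 for float, 2 for half


def infer_strides_sketch(t):
    """Return (blk_stride, rep_stride) following the tables above."""
    c0 = 32 // t.dtype_size              # assumption: 8 for float, 16 for half
    if t.span[1] == 8 * c0:              # full-repeat width: 64 float / 128 half
        blk, rep = 1, t.shape[1] // c0
    elif t.span[1] == c0:                # single-block width: 8 float / 16 half
        blk, rep = 0, t.shape[1] // c0
    else:
        return 1, 8                      # no match: default strides
    if t.span[0] == 1:                   # single-row match: rep_stride forced to 0
        rep = 0
    return blk, rep


def infer_repeat_sketch(t):
    """Repeat count: covered elements divided by elements per 256-byte repeat."""
    return t.span[0] * t.span[1] // (256 // t.dtype_size)


half_view = View(shape=(64, 128), span=(64, 64), dtype_size=4)
print(infer_strides_sketch(half_view), infer_repeat_sketch(half_view))  # (1, 16) 64
```

For a 64-column float view of a `[64, 128]` buffer this yields `blk_stride=1`, `rep_stride=16` and 64 repeats, i.e. one repeat per row, whereas the full `[64, 128]` span falls through to the defaults `(1, 8)` with 128 repeats.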
## 5. Column slicing via Tensor views

DSL tensor slicing (`tensor[row_start:row_end, col_start:col_end]`) creates a view with:

- `offset` adjusted to the slice start
- `span` set to the slice extent
- `shape` inherited from the parent (full allocation width)

This means `rep_stride = shape[1] // C0` correctly accounts for the full row width, while `repeat = span[0] * span[1] // (256 // dtype_size)` covers only the sliced region.

Example for `ub_data[0:64, 64:128]`, where `ub_data` is `Tensor(float, [64, 128])`:

- `span = [64, 64]`, `shape = [64, 128]`, `offset = [0, 64]`
- blk=1, rep=`128//8=16` (skips the full 128-wide row)
- repeat = `64*64/64 = 64` (one repeat per row)

## Files to study

- `easyasc/stub_functions/vec/vecutils.py`: stride inference logic
- `easyasc/utils/Tensor.py`: slice/view creation
- `agent/example/kernels/a2/flash_attn_score.py`: practical use of sliced `sub` together with continuous `exp`/`cast`

Authorship statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.