huawei-cloud-ascend-op-mfu-calculator

Calculate MFU (Machine FLOP Utilization) for operators such as matmul, GEMM, and FlashAttention on Ascend NPU, with clear formulas and the derivation steps.

Use this skill when the user wants to: (1) calculate MFU for matrix operations, (2) analyze operator performance efficiency, (3) understand hardware utilization, or (4) optimize an operator implementation.

Trigger: the user mentions "MFU", "machine flop utilization", "operator FLOPs", "matmul performance", "GEMM efficiency", "Ascend MFU", "算子MFU", "算力利用率", "矩阵乘效率", "GEMM性能", or "FlashAttention性能".

Install

openclaw skills install huawei-cloud-ascend-op-mfu-calculator

Huawei Cloud Ascend Operator MFU Calculator

Overview

This skill calculates MFU (Machine FLOP Utilization) for operators such as matmul, GEMM, and FlashAttention on Ascend NPU, and shows the formulas and derivation steps behind each result.

Architecture: Input Validation → FLOPs Calculation → Achieved TFLOP/s → MFU Calculation → Result Analysis

Related Skills:

  • huawei-cloud-ascend-profiler-db-explorer - Profiling data analysis for operator performance data

Prerequisites

  1. Python 3.8+ installed
  2. Basic understanding of FLOPs calculation concepts

Usage Scenarios

Typical Problem Scenarios:

  • Evaluating how well an operator utilizes Ascend NPU compute power
  • Comparing performance of different operator implementations
  • Identifying optimization opportunities for matrix operations

Typical User Utterances:

  • "Calculate MFU for my GEMM operator"
  • "What's the machine FLOP utilization for FlashAttention?"
  • "Analyze my matmul operator performance efficiency"

Workflow

  1. Input Collection: Gather operator parameters (matrix dimensions, data types, execution time)
  2. FLOPs Calculation: Compute theoretical FLOPs for the operation
  3. Achieved Performance: Calculate achieved TFLOP/s from the FLOP count and the execution time
  4. MFU Calculation: Apply the formula MFU = Achieved FLOP/s / Peak FLOP/s
  5. Result Analysis: Provide interpretation and optimization suggestions
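Step 2 of the workflow can be illustrated with a minimal Python sketch. It assumes the common counting convention in which one multiply-add equals 2 FLOPs, and it counts only the two matrix multiplications in an unmasked attention forward pass (softmax and scaling are ignored). The function names are hypothetical and are not part of the skill.

```python
def matmul_flops(m: int, n: int, k: int, batch: int = 1) -> int:
    """FLOPs for C[m, n] = A[m, k] @ B[k, n], counting a multiply-add as 2 FLOPs."""
    return 2 * batch * m * n * k


def attention_forward_flops(batch: int, heads: int, seq_q: int,
                            seq_kv: int, head_dim: int) -> int:
    """Forward FLOPs for unmasked attention: Q @ K^T plus P @ V.

    Assumption: softmax, scaling, and masking are not counted.
    """
    qk = matmul_flops(seq_q, seq_kv, head_dim, batch * heads)  # scores = Q @ K^T
    pv = matmul_flops(seq_q, head_dim, seq_kv, batch * heads)  # output = P @ V
    return qk + pv


print(matmul_flops(4096, 4096, 4096))                   # 137438953472
print(attention_forward_flops(1, 32, 2048, 2048, 128))  # 68719476736
```

A causal mask roughly halves the useful attention work, so a different convention may be appropriate when comparing against published FlashAttention numbers.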

MFU Calculation Formula

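MFU = (Achieved FLOP/s / Peak FLOP/s) × 100%

Where:

  • Achieved FLOP/s = Operation FLOPs / Execution Time (in seconds)
  • Peak FLOP/s = Hardware-specific peak throughput for the data type in use (e.g., Ascend 910B: 256 TFLOP/s for FP16; confirm the figure for your exact device variant in the official specifications)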
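As a worked example of the formula above, the following Python sketch computes MFU for a 4096 × 4096 × 4096 FP16 matmul. The 1 ms execution time is an invented input for illustration, the 256 TFLOP/s peak is the example figure quoted in this document, and the function names are hypothetical.

```python
def achieved_tflops(flops: float, time_ms: float) -> float:
    """Achieved throughput in TFLOP/s from a FLOP count and a time in milliseconds."""
    if time_ms <= 0:
        raise ValueError("time_ms must be positive")
    return flops / (time_ms * 1e-3) / 1e12


def mfu_percent(flops: float, time_ms: float, peak_tflops: float) -> float:
    """MFU as a percentage: achieved TFLOP/s divided by peak TFLOP/s."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops(flops, time_ms) / peak_tflops * 100.0


flops = 2 * 4096 ** 3          # multiply-add counted as 2 FLOPs
time_ms = 1.0                  # illustrative measurement, not real data
peak_tflops = 256.0            # example peak figure from this document

print(f"{achieved_tflops(flops, time_ms):.2f} TFLOP/s")        # 137.44 TFLOP/s
print(f"{mfu_percent(flops, time_ms, peak_tflops):.2f} % MFU")  # 53.69 % MFU
```

The key unit detail is converting milliseconds to seconds before dividing, so that both sides of the ratio are rates in TFLOP/s.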

Reference Documents

| Document | Description |
| --- | --- |
| Ascend 910B Series Technical Specifications | Official Ascend 910B series product specifications |
| MFU Calculation Methodology | Detailed MFU calculation formulas and examples |
| FlashAttention Technical Paper | Original FlashAttention research paper |

Enhanced Features

Intelligent Bottleneck Diagnoser

  • AI-powered bottleneck diagnosis that analyzes profiling data to identify root causes automatically
  • Classifies bottlenecks into categories: memory-bound, compute-bound, communication-bound, or operator-fallback
  • Provides actionable optimization recommendations with priority ranking
  • Includes pattern matching for known performance anti-patterns
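One common way to separate the memory-bound and compute-bound categories is a roofline comparison, sketched below in Python. This is an assumption about how such a classification could work, not a description of the diagnoser's actual logic; it does not cover the communication-bound or operator-fallback categories. The function name and the 1000 GB/s bandwidth are placeholders, not hardware specifications.

```python
def classify_roofline(flops: float, bytes_moved: float,
                      peak_tflops: float, mem_bandwidth_gbs: float) -> str:
    """Classify an operator by comparing its arithmetic intensity to the ridge point."""
    if bytes_moved <= 0 or peak_tflops <= 0 or mem_bandwidth_gbs <= 0:
        raise ValueError("bytes_moved, peak_tflops and mem_bandwidth_gbs must be positive")
    intensity = flops / bytes_moved                            # FLOPs per byte
    ridge = (peak_tflops * 1e12) / (mem_bandwidth_gbs * 1e9)   # FLOPs per byte at the knee
    return "compute-bound" if intensity >= ridge else "memory-bound"


n = 4096
fp16_bytes = 2
# Square FP16 matmul: read A and B, write C once each (idealized traffic).
matmul = classify_roofline(2 * n ** 3, 3 * n * n * fp16_bytes, 256.0, 1000.0)
# Elementwise FP16 add: one FLOP per element, two reads and one write.
add = classify_roofline(n * n, 3 * n * n * fp16_bytes, 256.0, 1000.0)

print(matmul)  # compute-bound
print(add)     # memory-bound
```

Real memory traffic depends on tiling and cache reuse, so profiled byte counts are a better input than the idealized ones used here.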

Parameter Confirmation

| Parameter | Description | Required |
| --- | --- | --- |
| operator | Operator type (matmul/flash_attention/gemm, etc.) | Yes |
| flops | Theoretical FLOPs of the operator | Yes |
| time_ms | Operator execution time (milliseconds) | Yes |
| peak_tflops | Hardware peak computing power (TFLOPS) | Yes |
| device | NPU device type (910B/910, etc.) | No |
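The input validation step could look roughly like the following Python sketch, which mirrors the parameter list above. The class name `MfuRequest` and the specific checks are assumptions made for illustration; the skill's real validation rules are not documented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MfuRequest:
    """Hypothetical container for the parameters listed above."""
    operator: str                  # e.g. "matmul", "flash_attention", "gemm"
    flops: float                   # theoretical FLOPs of the operator
    time_ms: float                 # execution time in milliseconds
    peak_tflops: float             # hardware peak in TFLOPS
    device: Optional[str] = None   # optional, e.g. "910B"

    def __post_init__(self) -> None:
        if not self.operator.strip():
            raise ValueError("operator must be a non-empty string")
        for name in ("flops", "time_ms", "peak_tflops"):
            value = getattr(self, name)
            # "not value > 0" also rejects NaN, which fails every comparison.
            if not value > 0:
                raise ValueError(f"{name} must be positive, got {value!r}")


request = MfuRequest(operator="matmul", flops=2 * 4096 ** 3,
                     time_ms=1.0, peak_tflops=256.0)
print(request.device)  # None
```

Rejecting non-positive values up front avoids a division by zero or a meaningless negative MFU later in the pipeline.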