Accelerator Energy Measurement

Joule measures energy consumption not just on CPUs but across GPUs, TPUs, and other accelerators. This guide covers the three-tier measurement approach, supported hardware, and how to use accelerator energy data in your programs.

Three-Tier Approach

Joule uses a tiered strategy to maximize energy measurement coverage:

Tier 1: Static Estimation

Available everywhere, no hardware access required. The compiler estimates energy from code structure using calibrated instruction costs. This is what powers #[energy_budget] at compile time.

Tier 2: CPU Performance Counters

On supported platforms, Joule reads hardware performance counters for actual CPU energy:

Platform             API                  Granularity
-------------------  -------------------  ---------------------
Intel/AMD Linux      RAPL via perf_event  Per-package, per-core
Intel/AMD Linux      RAPL via MSR         Per-package
Apple Silicon macOS  IOReport framework   Per-cluster
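Tier 2 counters such as RAPL are cumulative and wrap around at a hardware-defined maximum, so an interval measurement must handle wraparound. A minimal sketch in Python (Joule's internals are not shown here; on Linux, powercap exposes the counter as `energy_uj` and its range as `max_energy_range_uj`):

```python
def rapl_delta_joules(start_uj: int, end_uj: int, max_range_uj: int) -> float:
    """Energy between two RAPL counter samples, in joules.

    RAPL counters are cumulative microjoule values that wrap back to
    zero once they reach max_range_uj (max_energy_range_uj in the
    Linux powercap sysfs interface).
    """
    if end_uj >= start_uj:
        delta_uj = end_uj - start_uj
    else:  # counter wrapped between the two samples
        delta_uj = (max_range_uj - start_uj) + end_uj
    return delta_uj / 1e6

print(rapl_delta_joules(1_000_000, 3_500_000, 262_143_328_850))  # 2.5
```

Sampling fast enough that at most one wrap can occur between readings is what makes the single-wrap correction above sufficient.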

Tier 3: Accelerator Energy

For GPU and accelerator workloads, Joule queries vendor-specific APIs. Each backend in TensorForge implements the EnergyTelemetry trait.

Vendor Coverage

Vendor     Hardware                   API            Energy                  Power       Temperature
---------  -------------------------  -------------  ----------------------  ----------  -----------
NVIDIA     GPUs (A100, H100, etc.)    NVML           Board-level             Per-GPU     Per-GPU
AMD        GPUs (MI250, MI300, etc.)  ROCm SMI       Derived (power × time)  Per-GPU     Per-GPU
Intel      GPUs, Gaudi                Level Zero     Per-device              Per-domain  Per-device
Google     TPU v4, v5                 TPU Runtime    Per-chip                Per-chip    Per-chip
AWS        Inferentia, Trainium       Neuron SDK     Per-core                Per-core    Per-core
Groq       LPU                        HLML           Board-level             Per-device  Per-device
Cerebras   CS-2, CS-3                 CS SDK         Wafer-scale             Per-wafer   Per-wafer
SambaNova  SN30, SN40                 DataScale API  Per-RDU                 Per-RDU     Per-RDU

API Details

NVIDIA (NVML)

The NVIDIA Management Library provides direct energy readings:

nvmlDeviceGetTotalEnergyConsumption(device, &energy_mj)
  • Returns total energy in millijoules since the driver was last loaded
  • Subtract start from end measurement for per-operation energy
  • Available on all datacenter GPUs (V100, A100, H100, B100)
  • Supported on consumer GPUs (RTX 3000/4000/5000 series)
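The start/end subtraction described above is plain arithmetic on two cumulative millijoule readings. A minimal Python sketch (the readings here are made-up numbers; in practice they would come from two calls to nvmlDeviceGetTotalEnergyConsumption, e.g. via the nvidia-ml-py bindings):

```python
def nvml_energy_joules(start_mj: int, end_mj: int) -> float:
    """Per-operation GPU energy from two cumulative NVML readings.

    nvmlDeviceGetTotalEnergyConsumption returns millijoules accumulated
    since the driver was loaded, so subtracting a start sample from an
    end sample isolates the energy of the work in between.
    """
    return (end_mj - start_mj) / 1000.0

# Hypothetical readings bracketing a kernel launch:
print(nvml_energy_joules(120_000, 165_500))  # 45.5
```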

AMD (ROCm SMI)

ROCm System Management Interface provides power readings:

rsmi_dev_power_ave_get(device_index, sensor_id, &power_uw)
  • Returns average power in microwatts
  • Energy is derived from power * time
  • Available on MI series (MI250, MI300) and Radeon Pro
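Because ROCm SMI reports power rather than a cumulative energy counter, energy must be integrated from sampled power over time. A sketch of that derivation, assuming (timestamp, power) samples in the microwatt units rsmi_dev_power_ave_get returns (the trapezoidal rule here is one reasonable choice, not necessarily Joule's exact method):

```python
def energy_from_power_uw(samples):
    """Integrate (timestamp_s, power_uw) samples into joules.

    ROCm SMI exposes average power, not energy, so energy is
    approximated as the integral of power over time; each interval
    contributes its mean power times its duration (trapezoidal rule).
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += (p0 + p1) / 2 * (t1 - t0) / 1e6  # microwatts -> watts
    return joules

# 200 W held for 0.5 s, then ramping to 300 W over another 0.5 s:
print(energy_from_power_uw([(0.0, 200e6), (0.5, 200e6), (1.0, 300e6)]))  # 225.0
```

The sampling interval bounds the error: short bursts between samples are invisible, so Joule's confidence score is naturally lower for derived energy than for hardware counters.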

Intel (Level Zero)

Intel's Level Zero API provides power domain readings:

zesDeviceEnumPowerDomains(device, &count, domains)
zesPowerGetEnergyCounter(domain, &energy)
  • Energy counter in microjoules
  • Multiple power domains (package, card, memory)
  • Supports Intel Arc GPUs and Gaudi accelerators
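Since Level Zero exposes several power domains per device, a device-level figure is the sum of per-domain counter deltas. A small sketch, assuming microjoule (start, end) pairs as read from zesPowerGetEnergyCounter for each enumerated domain:

```python
def total_device_energy_j(domain_energy_uj):
    """Sum per-domain Level Zero energy deltas (microjoules) to joules.

    Each tuple is a (start, end) pair from one power domain's
    cumulative counter (package, card, memory, ...); summing the
    deltas gives the device-level energy for the interval.
    """
    return sum(end - start for start, end in domain_energy_uj) / 1e6

# Two hypothetical domains: package and memory.
print(total_device_energy_j([(0, 2_000_000), (1_000_000, 1_500_000)]))  # 2.5
```

Note that summing domains can double-count if one domain (e.g. package) already contains another; which domains to sum depends on the device's domain hierarchy.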

Google (TPU Runtime)

tpu_device_get_energy_consumption(device, &energy_j)
  • Per-chip energy in joules
  • Available on TPU v4 and v5 pods
  • Accessed through the TPU runtime API

AWS (Neuron SDK)

neuron_device_get_power(device, &power_mw)
  • Per-NeuronCore power in milliwatts
  • Available on Inferentia and Trainium instances
  • Accessed through the Neuron runtime

Groq (HLML)

Groq's Hardware Library for Machine Learning mirrors the NVML API:

hlmlDeviceGetTotalEnergyConsumption(device, &energy_mj)
  • Board-level energy in millijoules
  • Available on Groq LPU cards

Accelerator Detection

Joule automatically detects available accelerators using:

Device Files

Path               Accelerator
-----------------  -----------------------
/dev/nvidia*       NVIDIA GPU
/dev/kfd           AMD GPU (ROCm)
/dev/dri/renderD*  Intel GPU
/dev/accel*        Google TPU
/dev/neuron*       AWS Inferentia/Trainium

Environment Variables

Variable              Accelerator
--------------------  -----------------------
CUDA_VISIBLE_DEVICES  NVIDIA GPU
ROCR_VISIBLE_DEVICES  AMD GPU
ZE_AFFINITY_MASK      Intel GPU
TPU_NAME              Google TPU
NEURON_RT_NUM_CORES   AWS Inferentia/Trainium
GROQ_DEVICE_ID        Groq LPU
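The two detection signals combine straightforwardly: a device file or an environment variable is each enough to flag an accelerator as present. A Python sketch of that logic (the mapping mirrors the tables above; the function and kind names are illustrative, not Joule's internal API):

```python
import fnmatch

DEVICE_PATTERNS = {
    "/dev/nvidia*": "nvidia-gpu",
    "/dev/kfd": "amd-gpu",
    "/dev/dri/renderD*": "intel-gpu",
    "/dev/accel*": "google-tpu",
    "/dev/neuron*": "aws-neuron",
}
ENV_HINTS = {
    "CUDA_VISIBLE_DEVICES": "nvidia-gpu",
    "ROCR_VISIBLE_DEVICES": "amd-gpu",
    "ZE_AFFINITY_MASK": "intel-gpu",
    "TPU_NAME": "google-tpu",
    "NEURON_RT_NUM_CORES": "aws-neuron",
    "GROQ_DEVICE_ID": "groq-lpu",
}

def detect_accelerators(dev_paths, env):
    """Accelerator kinds hinted at by device files or env vars."""
    found = set()
    for path in dev_paths:
        for pattern, kind in DEVICE_PATTERNS.items():
            if fnmatch.fnmatch(path, pattern):
                found.add(kind)
    for var, kind in ENV_HINTS.items():
        if var in env:
            found.add(kind)
    return found

print(detect_accelerators(["/dev/nvidia0"], {"TPU_NAME": "v5e"}))
```

Detection only establishes presence; whether the matching Tier-3 telemetry API is actually usable is determined when the backend initializes.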

JSON Output

Set JOULE_ENERGY_JSON=1 to get structured JSON output with per-device breakdowns:

JOULE_ENERGY_JSON=1 joulec program.joule --emit c --energy-check -o program.c

Report Format

{
  "program": "program.joule",
  "timestamp": "2026-03-03T10:30:00Z",
  "devices": [
    {
      "type": "cpu",
      "vendor": "intel",
      "model": "Xeon w9-3595X",
      "energy_joules": 0.00042,
      "measurement": "rapl",
      "tier": 2
    },
    {
      "type": "gpu",
      "vendor": "nvidia",
      "model": "H100",
      "energy_joules": 0.0031,
      "measurement": "nvml",
      "tier": 3
    }
  ],
  "total_energy_joules": 0.00352,
  "functions": [
    {
      "name": "matrix_multiply",
      "energy_joules": 0.0028,
      "device": "gpu:0",
      "confidence": 0.95,
      "budget_joules": 0.005,
      "status": "within_budget"
    },
    {
      "name": "preprocess",
      "energy_joules": 0.00042,
      "device": "cpu",
      "confidence": 0.90,
      "budget_joules": 0.001,
      "status": "within_budget"
    }
  ]
}
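The report is ordinary JSON, so downstream tooling can consume it directly. A sketch of checking it in Python, using an abbreviated copy of the sample above (the helper names `over_budget` and `device_total` are illustrative, not part of Joule):

```python
import json

# Abbreviated report in the format shown above, with the sample's values.
raw = """
{
  "total_energy_joules": 0.00352,
  "devices": [
    {"type": "cpu", "energy_joules": 0.00042, "tier": 2},
    {"type": "gpu", "energy_joules": 0.0031, "tier": 3}
  ],
  "functions": [
    {"name": "matrix_multiply", "energy_joules": 0.0028, "budget_joules": 0.005},
    {"name": "preprocess", "energy_joules": 0.00042, "budget_joules": 0.001}
  ]
}
"""
report = json.loads(raw)

def over_budget(report):
    """Names of functions whose measured energy exceeds their budget."""
    return [f["name"] for f in report["functions"]
            if f["energy_joules"] > f["budget_joules"]]

def device_total(report):
    """Sum of per-device energy; should match total_energy_joules."""
    return sum(d["energy_joules"] for d in report["devices"])

print(over_budget(report))  # [] -- both sample functions are within budget
```

A CI job can fail the build whenever `over_budget` is non-empty, turning the energy report into a regression gate.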

Per-Device Breakdown

When multiple accelerators are present, the report includes energy per device:

{
  "devices": [
    { "type": "cpu", "energy_joules": 0.0012, "tier": 2 },
    { "type": "gpu", "vendor": "nvidia", "index": 0, "energy_joules": 0.045, "tier": 3 },
    { "type": "gpu", "vendor": "nvidia", "index": 1, "energy_joules": 0.043, "tier": 3 },
    { "type": "gpu", "vendor": "nvidia", "index": 2, "energy_joules": 0.044, "tier": 3 },
    { "type": "gpu", "vendor": "nvidia", "index": 3, "energy_joules": 0.046, "tier": 3 }
  ],
  "total_energy_joules": 0.1792
}

Using Accelerator Energy in Code

Energy Budgets on GPU Functions

#[energy_budget(max_joules = 0.05)]
#[gpu_kernel]
fn batch_matmul(a: Tensor, b: Tensor) -> Tensor {
    a.matmul(b)
}

The budget is checked against actual GPU energy consumption (Tier 3) when available, or estimated (Tier 1) otherwise.

Runtime Energy Query

use std::energy::{measure, EnergyReport};

let report: EnergyReport = measure(|| {
    model.forward(input)
});

println!("CPU energy: {} J", report.cpu_joules());
println!("GPU energy: {} J", report.gpu_joules());
println!("Total: {} J", report.total_joules());

Adaptive Energy Behavior

use std::energy::current_power_draw;

let power = current_power_draw();  // watts
if power > 200.0 {
    // Use energy-efficient path
    compute_sparse(data)
} else {
    // Full compute path
    compute_dense(data)
}

Fallback Behavior

When hardware energy APIs are unavailable, Joule falls back gracefully:

  1. If Tier 3 (accelerator) is unavailable, use Tier 2 (CPU counters) for CPU portions
  2. If Tier 2 is unavailable, use Tier 1 (static estimation)
  3. The confidence score reflects which tier was used

A program never crashes because energy hardware is missing; measurement degrades gracefully, with the reduced precision reflected in the confidence score.
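The fallback order above amounts to a simple selection over which tiers are usable. A Python sketch of that logic (the confidence values for Tiers 2 and 3 follow the sample report; the Tier-1 value is illustrative):

```python
def pick_tier(accel_counters_ok: bool, cpu_counters_ok: bool):
    """Select the best available measurement tier, mirroring the
    fallback order: accelerator counters, then CPU counters, then
    static estimation. Returns (tier, nominal confidence)."""
    if accel_counters_ok:
        return 3, 0.95   # vendor accelerator API available
    if cpu_counters_ok:
        return 2, 0.90   # RAPL / IOReport available
    return 1, 0.70       # static estimate only (illustrative confidence)

print(pick_tier(accel_counters_ok=False, cpu_counters_ok=True))  # (2, 0.9)
```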