Accelerator Energy Measurement
Joule measures energy consumption not just on CPUs but across GPUs, TPUs, and other accelerators. This guide covers the three-tier measurement approach, supported hardware, and how to use accelerator energy data in your programs.
Three-Tier Approach
Joule uses a tiered strategy to maximize energy measurement coverage:
Tier 1: Static Estimation
Available everywhere, no hardware access required. The compiler estimates energy from code structure using calibrated instruction costs. This is what powers #[energy_budget] at compile time.
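The static model can be pictured as a weighted sum of instruction counts times calibrated per-instruction costs. A minimal Python sketch of that idea (the cost table, operation names, and values are illustrative placeholders, not Joule's real calibration data):

```python
# Hypothetical per-instruction energy costs in nanojoules.
# Real calibration tables are per-microarchitecture.
COSTS_NJ = {
    "int_alu": 0.5,
    "fp_mul": 1.8,
    "load": 2.4,
    "branch": 0.9,
}

def estimate_energy_joules(instruction_counts: dict) -> float:
    """Estimate energy from static instruction counts (nJ -> J)."""
    total_nj = sum(COSTS_NJ[op] * n for op, n in instruction_counts.items())
    return total_nj * 1e-9
```

A compile-time `#[energy_budget]` check can then compare this estimate against the declared budget without touching any hardware.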
Tier 2: CPU Performance Counters
On supported platforms, Joule reads hardware performance counters for actual CPU energy:
| Platform | API | Granularity |
|---|---|---|
| Intel/AMD Linux | RAPL via perf_event | Per-package, per-core |
| Intel/AMD Linux | RAPL via MSR | Per-package |
| Apple Silicon macOS | IOReport framework | Per-cluster |
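RAPL exposes a monotonically increasing energy counter (microjoules, e.g. via the Linux powercap files under /sys/class/powercap/intel-rapl:*/energy_uj) that wraps at a platform-specific maximum. A Python sketch of the Tier 2 delta computation, assuming an example wrap value (the real one is read from max_energy_range_uj and varies per package):

```python
# Example wrap value; the real maximum is read from the
# max_energy_range_uj sysfs file and differs per package.
MAX_ENERGY_UJ = 262_143_328_850

def rapl_delta_joules(start_uj: int, end_uj: int, max_uj: int = MAX_ENERGY_UJ) -> float:
    """Energy in joules between two counter samples, handling wraparound."""
    delta = end_uj - start_uj
    if delta < 0:  # counter wrapped between the two reads
        delta += max_uj + 1
    return delta * 1e-6
```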
Tier 3: Accelerator Energy
For GPU and accelerator workloads, Joule queries vendor-specific APIs. Each backend in TensorForge implements the EnergyTelemetry trait.
Vendor Coverage
| Vendor | Hardware | API | Energy | Power | Temperature |
|---|---|---|---|---|---|
| NVIDIA | GPUs (A100, H100, etc.) | NVML | Board-level | Per-GPU | Per-GPU |
| AMD | GPUs (MI250, MI300, etc.) | ROCm SMI | Derived (power × time) | Per-GPU | Per-GPU |
| Intel | GPUs, Gaudi | Level Zero | Per-device | Per-domain | Per-device |
| Google | TPU v4, v5 | TPU Runtime | Per-chip | Per-chip | Per-chip |
| AWS | Inferentia, Trainium | Neuron SDK | Per-core | Per-core | Per-core |
| Groq | LPU | HLML | Board-level | Per-device | Per-device |
| Cerebras | CS-2, CS-3 | CS SDK | Wafer-scale | Per-wafer | Per-wafer |
| SambaNova | SN30, SN40 | DataScale API | Per-RDU | Per-RDU | Per-RDU |
API Details
NVIDIA (NVML)
The NVIDIA Management Library provides direct energy readings:
nvmlDeviceGetTotalEnergyConsumption(device, &energy_mj)
- Returns total energy in millijoules since driver load
- Subtract start from end measurement for per-operation energy
- Available on all datacenter GPUs (V100, A100, H100, B100)
- Supported on consumer GPUs (RTX 3000/4000/5000 series)
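Because the NVML counter is cumulative, per-operation energy is simply the difference between two readings, converted from millijoules to joules. A Python sketch of that arithmetic (the helper name is ours; in practice the readings would come from nvmlDeviceGetTotalEnergyConsumption, e.g. via pynvml):

```python
def nvml_operation_energy_joules(start_mj: int, end_mj: int) -> float:
    """Convert two cumulative millijoule readings into joules consumed."""
    # The counter is monotonic within a driver session, so end >= start.
    assert end_mj >= start_mj
    return (end_mj - start_mj) * 1e-3
```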
AMD (ROCm SMI)
ROCm System Management Interface provides power readings:
rsmi_dev_power_ave_get(device_index, sensor_id, &power_uw)
- Returns average power in microwatts
- Energy is derived by sampling power during the operation and integrating over elapsed time
- Available on MI series (MI250, MI300) and Radeon Pro
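Since ROCm SMI reports power rather than a cumulative energy counter, energy must be integrated from power samples taken during the operation. A Python sketch using trapezoidal integration over evenly spaced samples (the integration scheme is an assumption about the method, not a documented detail):

```python
def energy_from_power_samples(samples_uw, interval_s: float) -> float:
    """Integrate evenly spaced microwatt samples into joules (trapezoid rule)."""
    if len(samples_uw) < 2:
        return 0.0
    # Sum trapezoid areas between consecutive samples: avg power * dt.
    total_uj = sum((a + b) / 2 * interval_s
                   for a, b in zip(samples_uw, samples_uw[1:]))
    return total_uj * 1e-6
```

For example, three samples of 100 W (100,000,000 µW) at 0.5 s intervals span 1 s and integrate to roughly 100 J.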
Intel (Level Zero)
Intel's Level Zero API provides power domain readings:
zesDeviceEnumPowerDomains(device, &count, domains)
zesPowerGetEnergyCounter(domain, &energy)
- Energy counter in microjoules
- Multiple power domains (package, card, memory)
- Supports Intel Arc GPUs and Gaudi accelerators
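With multiple power domains per device, a whole-device figure sums the per-domain counter deltas. A Python sketch of that aggregation (domain names are illustrative; the real set comes from zesDeviceEnumPowerDomains):

```python
def levelzero_device_energy_joules(start_uj: dict, end_uj: dict) -> float:
    """Sum per-domain microjoule counter deltas into one joule figure."""
    total_uj = sum(end_uj[d] - start_uj[d] for d in start_uj)
    return total_uj * 1e-6

# Snapshots before and after an operation, e.g.:
#   start = {"package": 0, "memory": 0}
#   end   = {"package": 2_000_000, "memory": 500_000}
```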
Google (TPU Runtime)
tpu_device_get_energy_consumption(device, &energy_j)
- Per-chip energy in joules
- Available on TPU v4 and v5 pods
- Accessed through the TPU runtime API
AWS (Neuron SDK)
neuron_device_get_power(device, &power_mw)
- Per-NeuronCore power in milliwatts
- Available on Inferentia and Trainium instances
- Accessed through the Neuron runtime
Groq (HLML)
Groq's Hardware Library for Machine Learning mirrors the NVML API:
hlmlDeviceGetTotalEnergyConsumption(device, &energy_mj)
- Board-level energy in millijoules
- Available on Groq LPU cards
Accelerator Detection
Joule automatically detects available accelerators using:
Device Files
| Path | Accelerator |
|---|---|
| /dev/nvidia* | NVIDIA GPU |
| /dev/kfd | AMD GPU (ROCm) |
| /dev/dri/renderD* | Intel GPU |
| /dev/accel* | Google TPU |
| /dev/neuron* | AWS Inferentia/Trainium |
Environment Variables
| Variable | Accelerator |
|---|---|
| CUDA_VISIBLE_DEVICES | NVIDIA GPU |
| ROCR_VISIBLE_DEVICES | AMD GPU |
| ZE_AFFINITY_MASK | Intel GPU |
| TPU_NAME | Google TPU |
| NEURON_RT_NUM_CORES | AWS Inferentia/Trainium |
| GROQ_DEVICE_ID | Groq LPU |
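The two tables above translate directly into a probe: check device files with a glob, then check vendor environment variables. A Python sketch of that logic (the mapping mirrors the tables; the actual probing order inside Joule is an assumption):

```python
import glob
import os

DEVICE_GLOBS = {
    "/dev/nvidia*": "nvidia-gpu",
    "/dev/kfd": "amd-gpu",
    "/dev/dri/renderD*": "intel-gpu",
    "/dev/accel*": "google-tpu",
    "/dev/neuron*": "aws-neuron",
}

ENV_VARS = {
    "CUDA_VISIBLE_DEVICES": "nvidia-gpu",
    "ROCR_VISIBLE_DEVICES": "amd-gpu",
    "ZE_AFFINITY_MASK": "intel-gpu",
    "TPU_NAME": "google-tpu",
    "NEURON_RT_NUM_CORES": "aws-neuron",
    "GROQ_DEVICE_ID": "groq-lpu",
}

def detect_accelerators(env=os.environ) -> set:
    """Return the set of accelerator kinds visible on this host."""
    found = {name for pattern, name in DEVICE_GLOBS.items() if glob.glob(pattern)}
    found |= {name for var, name in ENV_VARS.items() if var in env}
    return found
```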
JSON Output
Set JOULE_ENERGY_JSON=1 to get structured JSON output with per-device breakdowns:
JOULE_ENERGY_JSON=1 joulec program.joule --emit c --energy-check -o program.c
Report Format
{
"program": "program.joule",
"timestamp": "2026-03-03T10:30:00Z",
"devices": [
{
"type": "cpu",
"vendor": "intel",
"model": "Xeon w9-3595X",
"energy_joules": 0.00042,
"measurement": "rapl",
"tier": 2
},
{
"type": "gpu",
"vendor": "nvidia",
"model": "H100",
"energy_joules": 0.0031,
"measurement": "nvml",
"tier": 3
}
],
"total_energy_joules": 0.00352,
"functions": [
{
"name": "matrix_multiply",
"energy_joules": 0.0028,
"device": "gpu:0",
"confidence": 0.95,
"budget_joules": 0.005,
"status": "within_budget"
},
{
"name": "preprocess",
"energy_joules": 0.00042,
"device": "cpu",
"confidence": 0.90,
"budget_joules": 0.001,
"status": "within_budget"
}
]
}
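The report is plain JSON, so downstream tooling can consume it with any JSON parser. A Python sketch that sums per-device energy and flags over-budget functions, using the field names from the report format above (the helper itself is ours, not part of Joule):

```python
import json

def summarize(report_json: str):
    """Return (total device energy in joules, names of over-budget functions)."""
    report = json.loads(report_json)
    total = sum(d["energy_joules"] for d in report["devices"])
    over = [f["name"] for f in report["functions"]
            if f["energy_joules"] > f["budget_joules"]]
    return total, over
```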
Per-Device Breakdown
When multiple accelerators are present, the report includes energy per device:
{
"devices": [
{ "type": "cpu", "energy_joules": 0.0012, "tier": 2 },
{ "type": "gpu", "vendor": "nvidia", "index": 0, "energy_joules": 0.045, "tier": 3 },
{ "type": "gpu", "vendor": "nvidia", "index": 1, "energy_joules": 0.043, "tier": 3 },
{ "type": "gpu", "vendor": "nvidia", "index": 2, "energy_joules": 0.044, "tier": 3 },
{ "type": "gpu", "vendor": "nvidia", "index": 3, "energy_joules": 0.046, "tier": 3 }
],
"total_energy_joules": 0.1792
}
Using Accelerator Energy in Code
Energy Budgets on GPU Functions
#[energy_budget(max_joules = 0.05)]
#[gpu_kernel]
fn batch_matmul(a: Tensor, b: Tensor) -> Tensor {
a.matmul(b)
}
The budget is checked against actual GPU energy consumption (Tier 3) when available, or estimated (Tier 1) otherwise.
Runtime Energy Query
use std::energy::{measure, EnergyReport};
let report: EnergyReport = measure(|| {
model.forward(input)
});
println!("CPU energy: {} J", report.cpu_joules());
println!("GPU energy: {} J", report.gpu_joules());
println!("Total: {} J", report.total_joules());
Adaptive Energy Behavior
use std::energy::current_power_draw;
let power = current_power_draw(); // watts
if power > 200.0 {
// Use energy-efficient path
compute_sparse(data)
} else {
// Full compute path
compute_dense(data)
}
Fallback Behavior
When hardware energy APIs are unavailable, Joule falls back gracefully:
- If Tier 3 (accelerator) is unavailable, use Tier 2 (CPU counters) for CPU portions
- If Tier 2 is unavailable, use Tier 1 (static estimation)
- The confidence score reflects which tier was used
A program never crashes because energy hardware is missing; measurement degrades gracefully with reduced precision.
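The fallback policy amounts to picking the highest available tier and attaching a confidence score. A Python sketch of that selection (the specific confidence values are illustrative, not Joule's calibrated numbers):

```python
def select_tier(accel_api_available: bool, cpu_counters_available: bool):
    """Pick the best available measurement tier and a confidence score."""
    if accel_api_available:
        return 3, 0.95  # direct accelerator telemetry
    if cpu_counters_available:
        return 2, 0.90  # CPU hardware counters; accelerator portion estimated
    return 1, 0.70      # pure static estimation
```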