Accelerator Energy Measurement
Joule measures energy consumption not just on CPUs but across GPUs, TPUs, and other accelerators. This guide covers the three-tier measurement approach, supported hardware, and how to use accelerator energy data in your programs.
Three-Tier Approach
Joule uses a tiered strategy to maximize energy measurement coverage:
Tier 1: Static Estimation
Available everywhere, no hardware access required. The compiler estimates energy from code structure using calibrated instruction costs. This is what powers #[energy_budget] at compile time.
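The static model can be pictured as a weighted sum of instruction counts times calibrated per-instruction costs. A minimal Python sketch of that idea (the cost table, operation names, and values are illustrative placeholders, not Joule's real calibration data):

```python
# Hypothetical per-instruction energy costs in nanojoules.
# Real calibration tables are per-microarchitecture.
COSTS_NJ = {
    "int_alu": 0.5,
    "fp_mul": 1.8,
    "load": 2.4,
    "branch": 0.9,
}

def estimate_energy_joules(instruction_counts: dict) -> float:
    """Estimate energy from static instruction counts (nJ -> J)."""
    total_nj = sum(COSTS_NJ[op] * n for op, n in instruction_counts.items())
    return total_nj * 1e-9
```

A compile-time `#[energy_budget]` check can then compare this estimate against the declared budget without touching any hardware.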
Tier 2: CPU Performance Counters
On supported platforms, Joule reads hardware performance counters for actual CPU energy:
| Platform | API | Granularity |
|---|---|---|
| Intel/AMD Linux | RAPL via perf_event | Per-package, per-core |
| Intel/AMD Linux | RAPL via MSR | Per-package |
| Apple Silicon macOS | IOReport framework | Per-cluster |
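RAPL exposes a monotonically increasing energy counter (microjoules, e.g. via the Linux powercap files under /sys/class/powercap/intel-rapl:*/energy_uj) that wraps at a platform-specific maximum. A Python sketch of the Tier 2 delta computation, assuming an example wrap value (the real one is read from max_energy_range_uj and varies per package):

```python
# Example wrap value; the real maximum is read from the
# max_energy_range_uj sysfs file and differs per package.
MAX_ENERGY_UJ = 262_143_328_850

def rapl_delta_joules(start_uj: int, end_uj: int, max_uj: int = MAX_ENERGY_UJ) -> float:
    """Energy in joules between two counter samples, handling wraparound."""
    delta = end_uj - start_uj
    if delta < 0:  # counter wrapped between the two reads
        delta += max_uj + 1
    return delta * 1e-6
```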
Tier 3: Accelerator Energy
For GPU and accelerator workloads, Joule queries vendor-specific APIs. Each backend in TensorForge implements the EnergyTelemetry trait.
Vendor Coverage
| Vendor | Hardware | API | Energy | Power | Temperature |
|---|---|---|---|---|---|
| NVIDIA | GPUs (A100, H100, etc.) | NVML | Board-level | Per-GPU | Per-GPU |
| AMD | GPUs (MI250, MI300, etc.) | ROCm SMI | Derived (power × time) | Per-GPU | Per-GPU |
| Intel | GPUs, Gaudi | Level Zero | Per-device | Per-domain | Per-device |
| Google | TPU v4, v5 | TPU Runtime | Per-chip | Per-chip | Per-chip |
| AWS | Inferentia, Trainium | Neuron SDK | Per-core | Per-core | Per-core |
| Groq | LPU | HLML | Board-level | Per-device | Per-device |
| Cerebras | CS-2, CS-3 | CS SDK | Wafer-scale | Per-wafer | Per-wafer |
| SambaNova | SN30, SN40 | DataScale API | Per-RDU | Per-RDU | Per-RDU |
API Details
NVIDIA (NVML)
The NVIDIA Management Library provides direct energy readings:
nvmlDeviceGetTotalEnergyConsumption(device, &energy_mj)
- Returns total energy in millijoules since driver load
- Subtract start from end measurement for per-operation energy
- Available on all datacenter GPUs (V100, A100, H100, B100)
- Supported on consumer GPUs (RTX 3000/4000/5000 series)
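Because the NVML counter is cumulative, per-operation energy is simply the difference between two readings, converted from millijoules to joules. A Python sketch of that arithmetic (the helper name is ours; in practice the readings would come from nvmlDeviceGetTotalEnergyConsumption, e.g. via pynvml):

```python
def nvml_operation_energy_joules(start_mj: int, end_mj: int) -> float:
    """Convert two cumulative millijoule readings into joules consumed."""
    # The counter is monotonic within a driver session, so end >= start.
    assert end_mj >= start_mj
    return (end_mj - start_mj) * 1e-3
```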
AMD (ROCm SMI)
ROCm System Management Interface provides power readings:
rsmi_dev_power_ave_get(device_index, sensor_id, &power_uw)
- Returns average power in microwatts
- Energy is derived by sampling power during the operation and integrating over elapsed time
- Available on MI series (MI250, MI300) and Radeon Pro
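Since ROCm SMI reports power rather than a cumulative energy counter, energy must be integrated from power samples taken during the operation. A Python sketch using trapezoidal integration over evenly spaced samples (the integration scheme is an assumption about the method, not a documented detail):

```python
def energy_from_power_samples(samples_uw, interval_s: float) -> float:
    """Integrate evenly spaced microwatt samples into joules (trapezoid rule)."""
    if len(samples_uw) < 2:
        return 0.0
    # Sum trapezoid areas between consecutive samples: avg power * dt.
    total_uj = sum((a + b) / 2 * interval_s
                   for a, b in zip(samples_uw, samples_uw[1:]))
    return total_uj * 1e-6
```

For example, three samples of 100 W (100,000,000 µW) at 0.5 s intervals span 1 s and integrate to roughly 100 J.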
Intel (Level Zero)
Intel's Level Zero API provides power domain readings:
zesDeviceEnumPowerDomains(device, &count, domains)
zesPowerGetEnergyCounter(domain, &energy)
- Energy counter in microjoules
- Multiple power domains (package, card, memory)
- Supports Intel Arc GPUs and Gaudi accelerators
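With multiple power domains per device, a whole-device figure sums the per-domain counter deltas. A Python sketch of that aggregation (domain names are illustrative; the real set comes from zesDeviceEnumPowerDomains):

```python
def levelzero_device_energy_joules(start_uj: dict, end_uj: dict) -> float:
    """Sum per-domain microjoule counter deltas into one joule figure."""
    total_uj = sum(end_uj[d] - start_uj[d] for d in start_uj)
    return total_uj * 1e-6

# Snapshots before and after an operation, e.g.:
#   start = {"package": 0, "memory": 0}
#   end   = {"package": 2_000_000, "memory": 500_000}
```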
Google (TPU Runtime)
tpu_device_get_energy_consumption(device, &energy_j)
- Per-chip energy in joules
- Available on TPU v4 and v5 pods
- Accessed through the TPU runtime API
AWS (Neuron SDK)
neuron_device_get_power(device, &power_mw)
- Per-NeuronCore power in milliwatts
- Available on Inferentia and Trainium instances
- Accessed through the Neuron runtime
Groq (HLML)
Groq's Hardware Library for Machine Learning mirrors the NVML API:
hlmlDeviceGetTotalEnergyConsumption(device, &energy_mj)
- Board-level energy in millijoules
- Available on Groq LPU cards
Accelerator Detection
Joule automatically detects available accelerators using:
Device Files
| Path | Accelerator |
|---|---|
| /dev/nvidia* | NVIDIA GPU |
| /dev/kfd | AMD GPU (ROCm) |
| /dev/dri/renderD* | Intel GPU |
| /dev/accel* | Google TPU |
| /dev/neuron* | AWS Inferentia/Trainium |
Environment Variables
| Variable | Accelerator |
|---|---|
| CUDA_VISIBLE_DEVICES | NVIDIA GPU |
| ROCR_VISIBLE_DEVICES | AMD GPU |
| ZE_AFFINITY_MASK | Intel GPU |
| TPU_NAME | Google TPU |
| NEURON_RT_NUM_CORES | AWS Inferentia/Trainium |
| GROQ_DEVICE_ID | Groq LPU |
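The two tables above translate directly into a probe: check device files with a glob, then check vendor environment variables. A Python sketch of that logic (the mapping mirrors the tables; the actual probing order inside Joule is an assumption):

```python
import glob
import os

DEVICE_GLOBS = {
    "/dev/nvidia*": "nvidia-gpu",
    "/dev/kfd": "amd-gpu",
    "/dev/dri/renderD*": "intel-gpu",
    "/dev/accel*": "google-tpu",
    "/dev/neuron*": "aws-neuron",
}

ENV_VARS = {
    "CUDA_VISIBLE_DEVICES": "nvidia-gpu",
    "ROCR_VISIBLE_DEVICES": "amd-gpu",
    "ZE_AFFINITY_MASK": "intel-gpu",
    "TPU_NAME": "google-tpu",
    "NEURON_RT_NUM_CORES": "aws-neuron",
    "GROQ_DEVICE_ID": "groq-lpu",
}

def detect_accelerators(env=os.environ) -> set:
    """Return the set of accelerator kinds visible on this host."""
    found = {name for pattern, name in DEVICE_GLOBS.items() if glob.glob(pattern)}
    found |= {name for var, name in ENV_VARS.items() if var in env}
    return found
```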
JSON Output
Set JOULE_ENERGY_JSON=1 to get structured JSON output with per-device breakdowns:
JOULE_ENERGY_JSON=1 joulec program.joule --emit c --energy-check -o program.c
Report Format
{
"program": "program.joule",
"timestamp": "2026-03-03T10:30:00Z",
"devices": [
{
"type": "cpu",
"vendor": "intel",
"model": "Xeon w9-3595X",
"energy_joules": 0.00042,
"measurement": "rapl",
"tier": 2
},
{
"type": "gpu",
"vendor": "nvidia",
"model": "H100",
"energy_joules": 0.0031,
"measurement": "nvml",
"tier": 3
}
],
"total_energy_joules": 0.00352,
"functions": [
{
"name": "matrix_multiply",
"energy_joules": 0.0028,
"device": "gpu:0",
"confidence": 0.95,
"budget_joules": 0.005,
"status": "within_budget"
},
{
"name": "preprocess",
"energy_joules": 0.00042,
"device": "cpu",
"confidence": 0.90,
"budget_joules": 0.001,
"status": "within_budget"
}
]
}
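The report is plain JSON, so downstream tooling can consume it with any JSON parser. A Python sketch that sums per-device energy and flags over-budget functions, using the field names from the report format above (the helper itself is ours, not part of Joule):

```python
import json

def summarize(report_json: str):
    """Return (total device energy in joules, names of over-budget functions)."""
    report = json.loads(report_json)
    total = sum(d["energy_joules"] for d in report["devices"])
    over = [f["name"] for f in report["functions"]
            if f["energy_joules"] > f["budget_joules"]]
    return total, over
```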
Per-Device Breakdown
When multiple accelerators are present, the report includes energy per device:
{
"devices": [
{ "type": "cpu", "energy_joules": 0.0012, "tier": 2 },
{ "type": "gpu", "vendor": "nvidia", "index": 0, "energy_joules": 0.045, "tier": 3 },
{ "type": "gpu", "vendor": "nvidia", "index": 1, "energy_joules": 0.043, "tier": 3 },
{ "type": "gpu", "vendor": "nvidia", "index": 2, "energy_joules": 0.044, "tier": 3 },
{ "type": "gpu", "vendor": "nvidia", "index": 3, "energy_joules": 0.046, "tier": 3 }
],
"total_energy_joules": 0.1792
}
Using Accelerator Energy in Code
Energy Budgets on GPU Functions
#[energy_budget(max_joules = 0.05)]
#[gpu_kernel]
fn batch_matmul(a: Tensor, b: Tensor) -> Tensor {
a.matmul(b)
}
The budget is checked against actual GPU energy consumption (Tier 3) when available, or estimated (Tier 1) otherwise.
Runtime Energy Query
use std::energy::{measure, EnergyReport};
let report: EnergyReport = measure(|| {
model.forward(input)
});
println!("CPU energy: {} J", report.cpu_joules());
println!("GPU energy: {} J", report.gpu_joules());
println!("Total: {} J", report.total_joules());
Adaptive Energy Behavior
use std::energy::current_power_draw;
let power = current_power_draw(); // watts
if power > 200.0 {
// Use energy-efficient path
compute_sparse(data)
} else {
// Full compute path
compute_dense(data)
}
Fallback Behavior
When hardware energy APIs are unavailable, Joule falls back gracefully:
- If Tier 3 (accelerator) is unavailable, use Tier 2 (CPU counters) for CPU portions
- If Tier 2 is unavailable, use Tier 1 (static estimation)
- The confidence score reflects which tier was used
A program never crashes because energy hardware is missing; measurement degrades gracefully with reduced precision.
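The fallback policy amounts to picking the highest available tier and attaching a confidence score. A Python sketch of that selection (the specific confidence values are illustrative, not Joule's calibrated numbers):

```python
def select_tier(accel_api_available: bool, cpu_counters_available: bool):
    """Pick the best available measurement tier and a confidence score."""
    if accel_api_available:
        return 3, 0.95  # direct accelerator telemetry
    if cpu_counters_available:
        return 2, 0.90  # CPU hardware counters; accelerator portion estimated
    return 1, 0.70      # pure static estimation
```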