# TensorForge
TensorForge is Joule's energy-aware machine learning framework. It provides a complete ML stack -- from tensor operations to distributed training to inference -- with energy measurement built into every layer.
## Architecture
TensorForge is organized as 22 crates in the Joule workspace:
### Foundation Crates

| Crate | Purpose |
|---|---|
| tf-core | Core types, EnergyTelemetry trait, tensor metadata |
| tf-ir | TensorIR: HighOp (14 tensor operations), graph representation |
| tf-compiler | OptimizationPass trait, graph rewriting infrastructure, 7 optimization passes |
| tf-autodiff | Automatic differentiation with real VJP (vector-Jacobian product) implementations |
| tf-hal | Hardware abstraction: Device trait, memory management |
| tf-runtime | Tensor execution runtime, memory pools, scheduling |
### Backend Crates

| Crate | Hardware Target |
|---|---|
| tf-backend-cpu | x86/ARM CPUs with SIMD |
| tf-backend-cuda | NVIDIA GPUs via CUDA |
| tf-backend-rocm | AMD GPUs via ROCm/HIP |
| tf-backend-metal | Apple GPUs via Metal |
| tf-backend-tpu | Google TPUs |
| tf-backend-level0 | Intel GPUs/accelerators via Level Zero |
| tf-backend-neuron | AWS Inferentia/Trainium via Neuron SDK |
| tf-backend-groq | Groq LPUs |
| tf-backend-gaudi | Intel Gaudi (Habana Labs) |
| tf-backend-estimated | Energy-estimated backend (no hardware required) |
### High-Level Crates

| Crate | Purpose |
|---|---|
| tf-nn | Neural network modules (Module trait, layers, activations) |
| tf-optim | Optimizers (AdamW, SGD with momentum) |
| tf-data | Data loading and batching |
| tf-serialize | Model serialization/deserialization |
| tf-distributed | Distributed training (ring, tree, halving-doubling collectives) |
| tf-infer | Inference engine (KV cache, speculative decoding, scheduling) |
## EnergyTelemetry Trait
The EnergyTelemetry trait is the foundation of TensorForge's energy awareness. Every backend implements it:
```rust
pub trait EnergyTelemetry {
    fn energy_consumed_joules(&self) -> f64;
    fn power_draw_watts(&self) -> f64;
    fn temperature_celsius(&self) -> f64;
    fn reset_counters(&mut self);
}
```
This means every tensor operation -- every matmul, every convolution, every activation -- has a measurable energy cost. The energy data flows up through the framework:
- Individual ops report energy via the backend's telemetry
- The optimizer aggregates energy per training step
- The training loop reports energy per epoch
- The distributed runtime aggregates energy across all nodes
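To make that flow concrete, here is a minimal sketch: the EnergyTelemetry trait as declared above, implemented by a hypothetical in-memory MockBackend (not a real TensorForge backend) whose counter is read and reset once per training step.

```rust
// The trait from above, plus a mock backend that accumulates per-op costs.
pub trait EnergyTelemetry {
    fn energy_consumed_joules(&self) -> f64;
    fn power_draw_watts(&self) -> f64;
    fn temperature_celsius(&self) -> f64;
    fn reset_counters(&mut self);
}

// Hypothetical backend: each op adds its measured cost to a counter.
struct MockBackend {
    joules: f64,
}

impl MockBackend {
    fn run_op(&mut self, op_cost_joules: f64) {
        self.joules += op_cost_joules;
    }
}

impl EnergyTelemetry for MockBackend {
    fn energy_consumed_joules(&self) -> f64 { self.joules }
    fn power_draw_watts(&self) -> f64 { 0.0 }    // not modeled in this sketch
    fn temperature_celsius(&self) -> f64 { 0.0 } // not modeled in this sketch
    fn reset_counters(&mut self) { self.joules = 0.0; }
}

// Aggregate one "training step": run ops, read the counter, reset it.
fn step_energy(backend: &mut MockBackend, op_costs: &[f64]) -> f64 {
    for &c in op_costs {
        backend.run_op(c);
    }
    let total = backend.energy_consumed_joules();
    backend.reset_counters();
    total
}
```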
## TensorIR
TensorForge uses a graph-based intermediate representation with 14 high-level operations:
| Operation | Description |
|---|---|
| MatMul | Matrix multiplication |
| Conv2D | 2D convolution |
| BatchNorm | Batch normalization |
| Relu | ReLU activation |
| Softmax | Softmax |
| Add | Element-wise addition |
| Mul | Element-wise multiplication |
| Reduce | Reduction (sum, mean, max) |
| Reshape | Tensor reshape |
| Transpose | Tensor transpose |
| Concat | Tensor concatenation |
| Slice | Tensor slicing |
| Gather | Index-based gathering |
| Scatter | Index-based scattering |
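For illustration, one plausible way to model these 14 operations as an IR enum. The variant payloads (strides, axes, shapes) are assumptions for the sketch, not TensorForge's actual HighOp definition:

```rust
// Illustrative modeling of the 14 HighOp variants; payloads are invented.
#[derive(Debug, Clone, PartialEq)]
enum HighOp {
    MatMul,
    Conv2D { stride: usize, padding: usize },
    BatchNorm,
    Relu,
    Softmax { axis: usize },
    Add,
    Mul,
    Reduce { kind: ReduceKind, axis: usize },
    Reshape { shape: Vec<usize> },
    Transpose { perm: Vec<usize> },
    Concat { axis: usize },
    Slice { start: usize, end: usize },
    Gather,
    Scatter,
}

#[derive(Debug, Clone, PartialEq)]
enum ReduceKind { Sum, Mean, Max }

// A graph node pairs an op with the indices of its input nodes.
struct Node {
    op: HighOp,
    inputs: Vec<usize>,
}
```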
## Graph Optimization
The tf-compiler provides 7 optimization passes:
- Operator Fusion -- Fuse sequences like Conv2D+BatchNorm+ReLU into a single kernel
- Layout Optimization -- Choose optimal memory layout (NCHW vs NHWC) per backend
- Constant Folding -- Evaluate constant subgraphs at compile time
- Dead Node Elimination -- Remove unused computation
- Common Subexpression Elimination -- Share identical computations
- Memory Planning -- Minimize peak memory usage through buffer reuse
- Energy-Aware Scheduling -- Reorder operations to minimize energy consumption
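As a sketch of what one such pass does, here is constant folding over a toy expression tree. TensorForge's real pass operates on TensorIR graphs; the Expr type below is invented for illustration:

```rust
// Toy expression tree: constants, named inputs, and two arithmetic ops.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Const(f64),
    Input(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Constant folding: evaluate any subtree whose operands are all constant.
fn fold(e: Expr) -> Expr {
    match e {
        Expr::Add(a, b) => match (fold(*a), fold(*b)) {
            // Both sides constant: evaluate at compile time.
            (Expr::Const(x), Expr::Const(y)) => Expr::Const(x + y),
            (a, b) => Expr::Add(Box::new(a), Box::new(b)),
        },
        Expr::Mul(a, b) => match (fold(*a), fold(*b)) {
            (Expr::Const(x), Expr::Const(y)) => Expr::Const(x * y),
            (a, b) => Expr::Mul(Box::new(a), Box::new(b)),
        },
        // Constants and inputs fold to themselves.
        other => other,
    }
}
```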
## Autodiff
TensorForge implements reverse-mode automatic differentiation with real VJP implementations for all operations. No stubs, no placeholders -- every backward pass computes correct gradients:
```rust
use tf_autodiff::backward;

let loss = model.forward(input);
let gradients = backward(loss); // real gradient computation
optimizer.step(gradients);
```
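The idea behind VJP-based reverse mode can be sketched on scalars: a tape records each operation, and the backward pass walks the tape in reverse, adding each op's VJP contributions to its parents. The Tape type below is illustrative, not tf-autodiff's API:

```rust
// Each tape entry records how a value was produced.
#[derive(Clone, Copy)]
enum Op {
    Leaf,
    Add(usize, usize),
    Mul(usize, usize),
}

struct Tape {
    ops: Vec<Op>,
    vals: Vec<f64>,
}

impl Tape {
    fn leaf(&mut self, v: f64) -> usize {
        self.ops.push(Op::Leaf);
        self.vals.push(v);
        self.vals.len() - 1
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        self.ops.push(Op::Add(a, b));
        self.vals.push(self.vals[a] + self.vals[b]);
        self.vals.len() - 1
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        self.ops.push(Op::Mul(a, b));
        self.vals.push(self.vals[a] * self.vals[b]);
        self.vals.len() - 1
    }
    // Returns d(output)/d(node) for every node on the tape.
    fn backward(&self, out: usize) -> Vec<f64> {
        let mut grads = vec![0.0; self.vals.len()];
        grads[out] = 1.0;
        for i in (0..self.ops.len()).rev() {
            match self.ops[i] {
                Op::Leaf => {}
                Op::Add(a, b) => {
                    grads[a] += grads[i]; // d(a+b)/da = 1
                    grads[b] += grads[i];
                }
                Op::Mul(a, b) => {
                    grads[a] += grads[i] * self.vals[b]; // d(a*b)/da = b
                    grads[b] += grads[i] * self.vals[a];
                }
            }
        }
        grads
    }
}
```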
## Neural Network API
The tf-nn crate provides a Module trait for building neural networks:
```rust
use tf_nn::{Module, Linear, Conv2d, BatchNorm2d, relu};

struct ResBlock {
    conv1: Conv2d,
    bn1: BatchNorm2d,
    conv2: Conv2d,
    bn2: BatchNorm2d,
}

impl Module for ResBlock {
    fn forward(&self, x: Tensor) -> Tensor {
        let residual = x;
        let out = self.conv1.forward(x)
            |> self.bn1.forward
            |> relu
            |> self.conv2.forward
            |> self.bn2.forward;
        relu(out + residual)
    }
}
```
## Optimizers
The tf-optim crate provides energy-tracked optimizers:
```rust
use tf_optim::{AdamW, SGD};

// AdamW with weight decay
let optimizer = AdamW::new(model.parameters(), lr: 0.001, weight_decay: 0.01);

// SGD with momentum
let optimizer = SGD::new(model.parameters(), lr: 0.01, momentum: 0.9);
```
Every optimizer step reports energy consumed:
```rust
let energy = optimizer.step(gradients);
println!("Step energy: {} J", energy.joules());
```
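For reference, the per-parameter update that AdamW performs each step is the standard decoupled weight-decay rule (Loshchilov & Hutter). The function below is a generic scalar sketch, not tf-optim's implementation:

```rust
// Optimizer state for one scalar parameter.
struct AdamWState {
    m: f64, // first-moment (mean) estimate
    v: f64, // second-moment (uncentered variance) estimate
    t: u32, // step count
}

fn adamw_step(
    param: &mut f64,
    grad: f64,
    state: &mut AdamWState,
    lr: f64,
    beta1: f64,
    beta2: f64,
    eps: f64,
    weight_decay: f64,
) {
    state.t += 1;
    state.m = beta1 * state.m + (1.0 - beta1) * grad;
    state.v = beta2 * state.v + (1.0 - beta2) * grad * grad;
    // Bias correction for the zero-initialized moments.
    let m_hat = state.m / (1.0 - beta1.powi(state.t as i32));
    let v_hat = state.v / (1.0 - beta2.powi(state.t as i32));
    // Decoupled weight decay: applied to the parameter, not the gradient.
    *param -= lr * (m_hat / (v_hat.sqrt() + eps) + weight_decay * *param);
}
```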
## Distributed Training
The tf-distributed crate supports multi-node training with three collective algorithms:
| Algorithm | Pattern | Best For |
|---|---|---|
| Ring AllReduce | Each node sends to next neighbor | Large models, high bandwidth |
| Tree AllReduce | Binary tree reduction | Low latency |
| Halving-Doubling | Recursive halving then doubling | Balanced |
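The ring pattern can be simulated in memory to show why it suits large models: every node sends and receives the same total volume over 2 * (N - 1) steps, a reduce-scatter phase followed by an allgather phase. The function below is an illustrative simulation, not tf-distributed's API:

```rust
// data[node][chunk] holds node's value for that chunk. After the two
// phases, every node holds the elementwise sum across all nodes.
fn ring_allreduce(mut data: Vec<Vec<f64>>) -> Vec<Vec<f64>> {
    let n = data.len();
    if n < 2 {
        return data;
    }
    // Reduce-scatter: at step s, node i sends chunk (i - s) mod n to its
    // right neighbour, which adds it. Afterwards node i holds the fully
    // reduced chunk (i + 1) mod n.
    for step in 0..n - 1 {
        for node in 0..n {
            let chunk = (node + n - step) % n;
            let next = (node + 1) % n;
            let v = data[node][chunk];
            data[next][chunk] += v;
        }
    }
    // Allgather: at step s, node i forwards chunk (i + 1 - s) mod n, so
    // each reduced chunk travels once around the ring.
    for step in 0..n - 1 {
        for node in 0..n {
            let chunk = (node + 1 + n - step) % n;
            let next = (node + 1) % n;
            data[next][chunk] = data[node][chunk];
        }
    }
    data
}
```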
Energy is tracked across all nodes, giving total training energy:
```rust
use tf_distributed::DistributedTrainer;

let trainer = DistributedTrainer::new(
    model,
    world_size: 8,
    algorithm: CollectiveAlgorithm::Ring,
);

let metrics = trainer.train(dataset, epochs: 10);
println!("Total energy across {} nodes: {} J", 8, metrics.total_energy_joules());
```
## Inference Engine
The tf-infer crate provides a high-performance inference engine with:
### Paged KV Cache
Efficient key-value caching for transformer models. Memory is allocated in pages, avoiding fragmentation:
```rust
use tf_infer::KvCache;

let cache = KvCache::paged(
    num_layers: 32,
    num_heads: 32,
    head_dim: 128,
    page_size: 256,
);
```
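The accounting behind paging is straightforward: a sequence needs ceil(seq_len / page_size) pages, and each page stores a K and a V tensor for every layer and head. The helpers below are a generic sketch (not tf-infer's API) using the configuration from the example above:

```rust
// Pages needed to hold a sequence, rounding up to whole pages.
fn pages_needed(seq_len: usize, page_size: usize) -> usize {
    (seq_len + page_size - 1) / page_size // ceiling division
}

// Bytes for one page: K and V entries for every layer and head.
fn page_bytes(
    num_layers: usize,
    num_heads: usize,
    head_dim: usize,
    page_size: usize,
    bytes_per_elem: usize,
) -> usize {
    2 * num_layers * num_heads * head_dim * page_size * bytes_per_elem
}
```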
### Continuous Batching
Dynamic batching that adds new requests to a running batch without waiting for all current requests to complete:
```rust
use tf_infer::ContinuousBatcher;

let batcher = ContinuousBatcher::new(max_batch_size: 64);
batcher.add_request(prompt);
let outputs = batcher.step(); // processes all pending requests
```
### Speculative Decoding
Use a smaller draft model to generate candidates, then verify with the full model:
```rust
use tf_infer::SpeculativeDecoder;

let decoder = SpeculativeDecoder::new(
    target_model: large_model,
    draft_model: small_model,
    num_speculative_tokens: 5,
);
```
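The control flow can be sketched with greedy accept/reject. Real speculative decoding compares draft and target token distributions; here the "models" are plain functions from a context to the next token, which is enough to show the accept-prefix-then-correct pattern:

```rust
// One speculative step: draft proposes k tokens, target verifies them.
// Returns the number of tokens appended to the context.
fn speculative_step(
    target: impl Fn(&[u32]) -> u32,
    draft: impl Fn(&[u32]) -> u32,
    context: &mut Vec<u32>,
    k: usize,
) -> usize {
    // 1. The draft model proposes k tokens autoregressively.
    let mut ctx = context.clone();
    let mut proposed = Vec::with_capacity(k);
    for _ in 0..k {
        let t = draft(ctx.as_slice());
        proposed.push(t);
        ctx.push(t);
    }
    // 2. The target model verifies each proposal in order.
    let mut accepted = 0;
    for &t in &proposed {
        let expected = target(context.as_slice());
        if expected == t {
            context.push(t); // proposal accepted
            accepted += 1;
        } else {
            // 3. First mismatch: keep the target's token and stop.
            context.push(expected);
            return accepted + 1;
        }
    }
    accepted
}
```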
### Sampling Pipeline
Configurable token sampling with temperature, top-k, top-p, and repetition penalty:
```rust
use tf_infer::SamplingConfig;

let config = SamplingConfig {
    temperature: 0.7,
    top_k: 50,
    top_p: 0.9,
    repetition_penalty: 1.1,
};
```
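The filtering stages can be sketched as a pure function over logits (illustrative only, not tf-infer's implementation): temperature rescales the distribution, top-k keeps the k most probable tokens, top-p keeps the smallest prefix reaching the target cumulative mass, and the survivors are renormalized before sampling.

```rust
// Returns renormalized probabilities after temperature, top-k, and top-p.
// Actual sampling would then draw a token from this distribution.
fn filter_logits(logits: &[f64], temperature: f64, top_k: usize, top_p: f64) -> Vec<f64> {
    // Temperature scaling, then a numerically stable softmax.
    let scaled: Vec<f64> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|l| (l - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    let mut probs: Vec<(usize, f64)> =
        exps.iter().map(|e| e / sum).enumerate().collect();
    // Top-k: keep only the k most probable tokens.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k);
    // Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    let mut cum = 0.0;
    let mut kept = Vec::new();
    for (i, p) in probs {
        kept.push((i, p));
        cum += p;
        if cum >= top_p {
            break;
        }
    }
    // Renormalize over the surviving tokens.
    let mass: f64 = kept.iter().map(|(_, p)| *p).sum();
    let mut out = vec![0.0; logits.len()];
    for (i, p) in kept {
        out[i] = p / mass;
    }
    out
}
```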
### Energy-Aware Scheduling
The inference scheduler factors energy cost into batch sizing and scheduling decisions, and it can enforce energy budgets on inference requests:
```rust
use tf_infer::EnergyAwareScheduler;

let scheduler = EnergyAwareScheduler::new(
    max_energy_per_request: 0.5, // joules
    max_power_draw: 200.0,       // watts
);
```
## Compiler Integration
TensorForge integrates with the Joule compiler through the joule-codegen-tensorforge crate. When Joule code uses tensor operations, the compiler:
- Lowers tensor expressions to TensorIR
- Applies graph optimization passes
- Selects the backend based on --target
- Generates backend-specific code
- Instruments energy telemetry calls
This means energy budgets work with ML code:
```rust
#[energy_budget(max_joules = 10.0)]
fn train_epoch(model: &mut Model, data: DataLoader) -> f64 {
    let mut total_loss = 0.0;
    for batch in data {
        let loss = model.forward(batch.input);
        let grads = backward(loss);
        optimizer.step(grads);
        total_loss = total_loss + loss.item();
    }
    total_loss
}
```