TensorForge

TensorForge is Joule's energy-aware machine learning framework. It provides a complete ML stack -- from tensor operations to distributed training to inference -- with energy measurement built into every layer.

Architecture

TensorForge is organized as 22 crates in the Joule workspace:

Foundation Crates

Crate         Purpose
tf-core       Core types, EnergyTelemetry trait, tensor metadata
tf-ir         TensorIR: HighOp (14 tensor operations), graph representation
tf-compiler   OptimizationPass trait, graph rewriting infrastructure, 7 optimization passes
tf-autodiff   Automatic differentiation with real VJP (vector-Jacobian product) implementations
tf-hal        Hardware abstraction: Device trait, memory management
tf-runtime    Tensor execution runtime, memory pools, scheduling

Backend Crates

Crate                  Hardware Target
tf-backend-cpu         x86/ARM CPUs with SIMD
tf-backend-cuda        NVIDIA GPUs via CUDA
tf-backend-rocm        AMD GPUs via ROCm/HIP
tf-backend-metal       Apple GPUs via Metal
tf-backend-tpu         Google TPUs
tf-backend-level0      Intel GPUs/accelerators via Level Zero
tf-backend-neuron      AWS Inferentia/Trainium via the Neuron SDK
tf-backend-groq        Groq LPUs
tf-backend-gaudi       Intel Gaudi (Habana Labs)
tf-backend-estimated   Energy-estimated backend (no hardware required)

High-Level Crates

Crate            Purpose
tf-nn            Neural network modules (Module trait, layers, activations)
tf-optim         Optimizers (AdamW, SGD with momentum)
tf-data          Data loading and batching
tf-serialize     Model serialization/deserialization
tf-distributed   Distributed training (ring, tree, halving-doubling collectives)
tf-infer         Inference engine (KV cache, speculative decoding, scheduling)

EnergyTelemetry Trait

The EnergyTelemetry trait is the foundation of TensorForge's energy awareness. Every backend implements it:

pub trait EnergyTelemetry {
    fn energy_consumed_joules(&self) -> f64;
    fn power_draw_watts(&self) -> f64;
    fn temperature_celsius(&self) -> f64;
    fn reset_counters(&mut self);
}

This means every tensor operation -- every matmul, every convolution, every activation -- has a measurable energy cost. The energy data flows up through the framework:

  • Individual ops report energy via the backend's telemetry
  • The optimizer aggregates energy per training step
  • The training loop reports energy per epoch
  • The distributed runtime aggregates energy across all nodes
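
The roll-up above can be sketched in a few lines. This is a plain-Python illustration of the aggregation path only; EnergyAccumulator and its methods are hypothetical names, not the TensorForge API:

```python
class EnergyAccumulator:
    """Toy stand-in for the telemetry roll-up: ops -> step -> epoch -> cluster."""

    def __init__(self):
        self.op_joules = []

    def record_op(self, joules):
        # Individual ops report energy via the backend's telemetry.
        self.op_joules.append(joules)

    def drain_step(self):
        # The optimizer aggregates energy once per training step.
        total, self.op_joules = sum(self.op_joules), []
        return total

# One node runs 2 steps of 3 ops each:
node = EnergyAccumulator()
step_totals = []
for _ in range(2):
    for op_cost in (0.5, 1.25, 0.25):
        node.record_op(op_cost)
    step_totals.append(node.drain_step())

epoch_total = sum(step_totals)        # training loop: energy per epoch
cluster_total = epoch_total * 4       # distributed runtime: 4 identical nodes
print(step_totals, epoch_total, cluster_total)   # [2.0, 2.0] 4.0 16.0
```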

TensorIR

TensorForge uses a graph-based intermediate representation with 14 high-level operations:

Operation   Description
MatMul      Matrix multiplication
Conv2D      2D convolution
BatchNorm   Batch normalization
Relu        ReLU activation
Softmax     Softmax normalization
Add         Element-wise addition
Mul         Element-wise multiplication
Reduce      Reduction (sum, mean, max)
Reshape     Tensor reshape
Transpose   Tensor transpose
Concat      Tensor concatenation
Slice       Tensor slicing
Gather      Index-based gathering
Scatter     Index-based scattering

Graph Optimization

The tf-compiler provides 7 optimization passes:

  1. Operator Fusion -- Fuse sequences like Conv2D+BatchNorm+ReLU into a single kernel
  2. Layout Optimization -- Choose optimal memory layout (NCHW vs NHWC) per backend
  3. Constant Folding -- Evaluate constant subgraphs at compile time
  4. Dead Node Elimination -- Remove unused computation
  5. Common Subexpression Elimination -- Share identical computations
  6. Memory Planning -- Minimize peak memory usage through buffer reuse
  7. Energy-Aware Scheduling -- Reorder operations to minimize energy consumption
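
As an illustration of the kind of rewriting these passes do, here is a toy constant-folding and dead-node-elimination pass over a dictionary-based graph. The node layout is invented for this sketch and is not tf-compiler's actual IR:

```python
OPS = {"Add": lambda a, b: a + b, "Mul": lambda a, b: a * b}

def constant_fold(graph):
    """Replace nodes whose inputs are all constants with Const nodes."""
    changed = True
    while changed:
        changed = False
        for name, node in graph.items():
            if node["op"] in OPS and all(graph[i]["op"] == "Const" for i in node["inputs"]):
                vals = [graph[i]["value"] for i in node["inputs"]]
                graph[name] = {"op": "Const", "value": OPS[node["op"]](*vals), "inputs": []}
                changed = True
    return graph

def eliminate_dead(graph, outputs):
    """Keep only nodes reachable from the graph outputs."""
    live, stack = set(), list(outputs)
    while stack:
        n = stack.pop()
        if n not in live:
            live.add(n)
            stack.extend(graph[n]["inputs"])
    return {n: graph[n] for n in live}

g = {
    "a":    {"op": "Const", "value": 2.0, "inputs": []},
    "b":    {"op": "Const", "value": 3.0, "inputs": []},
    "c":    {"op": "Mul", "inputs": ["a", "b"]},    # foldable: 2 * 3
    "x":    {"op": "Input", "inputs": []},
    "y":    {"op": "Add", "inputs": ["x", "c"]},    # graph output
    "dead": {"op": "Add", "inputs": ["a", "b"]},    # never used
}
g = eliminate_dead(constant_fold(g), outputs=["y"])
print(sorted(g))        # ['c', 'x', 'y'] -- the dead node is gone
print(g["c"]["value"])  # 6.0
```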

Autodiff

TensorForge implements reverse-mode automatic differentiation with real VJP implementations for all operations. No stubs, no placeholders -- every backward pass computes correct gradients:

use tf_autodiff::backward;

let loss = model.forward(input);
let gradients = backward(loss);  // real gradient computation
optimizer.step(gradients);
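
The mechanism behind that backward call can be illustrated with a minimal scalar reverse-mode tape in Python. Var, its VJP closures, and this backward are a teaching sketch, not tf-autodiff's actual types:

```python
class Var:
    """A scalar tracked on the tape; parents carry (node, vjp) pairs."""

    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, list(parents), 0.0

    def __add__(self, other):
        # VJP of add: the upstream gradient flows unchanged to both inputs.
        return Var(self.value + other.value,
                   [(self, lambda g: g), (other, lambda g: g)])

    def __mul__(self, other):
        # VJP of mul: each input gets the upstream grad times the other input.
        return Var(self.value * other.value,
                   [(self, lambda g: g * other.value),
                    (other, lambda g: g * self.value)])

def backward(out):
    """Accumulate d(out)/d(node) into .grad for every node in the graph."""
    # Topologically order the graph so each node's grad is complete
    # before it is propagated to its parents.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, vjp in node.parents:
            parent.grad += vjp(node.grad)

x, y = Var(3.0), Var(4.0)
loss = x * y + x          # d/dx = y + 1 = 5, d/dy = x = 3
backward(loss)
print(x.grad, y.grad)     # 5.0 3.0
```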

Neural Network API

The tf-nn crate provides a Module trait for building neural networks:

use tf_nn::{Module, Linear, Conv2d, BatchNorm2d, relu};

struct ResBlock {
    conv1: Conv2d,
    bn1: BatchNorm2d,
    conv2: Conv2d,
    bn2: BatchNorm2d,
}

impl Module for ResBlock {
    fn forward(&self, x: Tensor) -> Tensor {
        let residual = x;
        let out = self.conv1.forward(x)
            |> self.bn1.forward
            |> relu
            |> self.conv2.forward
            |> self.bn2.forward;
        relu(out + residual)
    }
}

Optimizers

The tf-optim crate provides energy-tracked optimizers:

use tf_optim::{AdamW, SGD};

// AdamW with weight decay
let optimizer = AdamW::new(model.parameters(), lr: 0.001, weight_decay: 0.01);

// SGD with momentum
let optimizer = SGD::new(model.parameters(), lr: 0.01, momentum: 0.9);

Every optimizer step reports energy consumed:

let energy = optimizer.step(gradients);
println!("Step energy: {} J", energy.joules());
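
For reference, the AdamW update rule itself (decoupled weight decay, per Loshchilov & Hutter) looks like this in plain Python. The function and state layout are illustrative, and tf-optim's energy accounting is not modeled here:

```python
import math

def adamw_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update over flat lists of scalars; mutates params in place."""
    state["t"] += 1
    t = state["t"]
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving averages of the gradient and its square.
        m = state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        v = state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)      # bias correction
        v_hat = v / (1 - beta2 ** t)
        # Decoupled weight decay: applied directly to the parameter,
        # not folded into the gradient.
        params[i] = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return params

params = [1.0, -2.0]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
adamw_step(params, grads=[0.5, -0.5], state=state)
print(params)   # both parameters nudged against their gradients
```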

Distributed Training

The tf-distributed crate supports multi-node training with three collective algorithms:

Algorithm          Pattern                                Best For
Ring AllReduce     Each node sends to its next neighbor   Large models, high bandwidth
Tree AllReduce     Binary tree reduction                  Low latency
Halving-Doubling   Recursive halving, then doubling       Balanced latency and bandwidth
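
The ring algorithm's data movement can be simulated directly: a reduce-scatter phase followed by an allgather, after which every node holds the element-wise sum of all inputs. This sketch models only the communication pattern (no transport, no energy tracking):

```python
import copy

def ring_allreduce(node_chunks):
    """node_chunks[i][c]: node i's local value for chunk c (floats here)."""
    n = len(node_chunks)              # world size; each node holds n chunks
    chunks = copy.deepcopy(node_chunks)
    # Phase 1, reduce-scatter: after n-1 steps, node i holds the complete
    # sum for chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] += chunks[i][c]
    # Phase 2, allgather: circulate the finished chunks for n-1 more steps.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c]
    return chunks

grads = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
print(ring_allreduce(grads)[0])   # [111.0, 222.0, 333.0] on every node
```

Each step moves only one chunk per node to its neighbor, which is why the ring favors large payloads and high-bandwidth links over latency.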

Energy is tracked across all nodes, giving total training energy:

use tf_distributed::DistributedTrainer;

let trainer = DistributedTrainer::new(
    model,
    world_size: 8,
    algorithm: CollectiveAlgorithm::Ring,
);

let metrics = trainer.train(dataset, epochs: 10);
println!("Total energy across {} nodes: {} J", 8, metrics.total_energy_joules());

Inference Engine

The tf-infer crate provides a high-performance inference engine with:

Paged KV Cache

Efficient key-value caching for transformer models. Memory is allocated in pages, avoiding fragmentation:

use tf_infer::KvCache;

let cache = KvCache::paged(
    num_layers: 32,
    num_heads: 32,
    head_dim: 128,
    page_size: 256,
);
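
The paging idea reduces to a small allocator: each sequence grows a page table of fixed-size pages drawn from a shared free list, so variable-length sequences cannot fragment the pool. A hypothetical Python sketch (not tf-infer's API):

```python
class PagedAllocator:
    """Allocate KV storage in fixed-size pages from a shared free list."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}   # sequence id -> list of page indices
        self.lengths = {}       # sequence id -> tokens stored

    def append_token(self, seq_id):
        pages = self.page_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(pages) * self.page_size:    # current page is full
            if not self.free_pages:
                raise MemoryError("KV pool exhausted")
            pages.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        # Pages go straight back to the pool, whole, so nothing fragments.
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedAllocator(num_pages=4, page_size=256)
for _ in range(300):                 # 300 tokens -> 2 pages of 256
    alloc.append_token("req-0")
print(len(alloc.page_tables["req-0"]), len(alloc.free_pages))   # 2 2
```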

Continuous Batching

Dynamic batching that adds new requests to a running batch without waiting for all current requests to complete:

use tf_infer::ContinuousBatcher;

let batcher = ContinuousBatcher::new(max_batch_size: 64);
batcher.add_request(prompt);
let outputs = batcher.step();  // processes all pending requests
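
The scheduling idea can be modeled in a few lines: finished sequences free their batch slots, and queued requests fill them between steps, so the batch never has to drain. The names and the token-countdown stand-in for decoding are hypothetical:

```python
class Batcher:
    """Toy continuous batcher: requests join and leave between decode steps."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.queue, self.running = [], {}

    def add_request(self, req_id, tokens_needed):
        self.queue.append((req_id, tokens_needed))

    def step(self):
        # Admit queued requests into any free batch slots.
        while self.queue and len(self.running) < self.max_batch_size:
            req_id, n = self.queue.pop(0)
            self.running[req_id] = n
        # "Decode" one token for every running request.
        finished = []
        for req_id in list(self.running):
            self.running[req_id] -= 1
            if self.running[req_id] == 0:
                finished.append(req_id)
                del self.running[req_id]
        return finished

b = Batcher(max_batch_size=2)
b.add_request("a", 1)
b.add_request("b", 3)
b.add_request("c", 2)       # waits for a free slot
print(b.step())             # ['a'] -- its slot opens up
print(b.step())             # []   -- 'c' was admitted mid-flight for 'b'
print(b.step())             # ['b', 'c']
```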

Speculative Decoding

Use a smaller draft model to generate candidates, then verify with the full model:

use tf_infer::SpeculativeDecoder;

let decoder = SpeculativeDecoder::new(
    target_model: large_model,
    draft_model: small_model,
    num_speculative_tokens: 5,
);
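
A greedy version of the draft-then-verify loop, with both models stubbed as next-token functions, shows the accept/reject logic. This is a conceptual sketch, not tf-infer's implementation:

```python
def speculative_step(target_next, draft_next, context, k):
    """target_next/draft_next: fn(context_tuple) -> next token (greedy)."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    draft, ctx = [], tuple(context)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx = ctx + (t,)
    # Verify phase: the target model checks each proposal in order.
    accepted, ctx = [], tuple(context)
    for t in draft:
        want = target_next(ctx)
        if want == t:
            accepted.append(t)        # agreement: keep the draft token
            ctx = ctx + (t,)
        else:
            accepted.append(want)     # first mismatch: correct it and stop
            break
    else:
        accepted.append(target_next(ctx))   # all k matched: one bonus token
    return accepted

# Toy models over integer tokens: the target says "previous + 1"; the
# draft agrees except it wrongly emits 99 after token 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: 99 if ctx[-1] == 3 else ctx[-1] + 1

print(speculative_step(target, draft, [1], k=4))   # [2, 3, 4]
```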

Sampling Pipeline

Configurable token sampling with temperature, top-k, top-p, and repetition penalty:

use tf_infer::SamplingConfig;

let config = SamplingConfig {
    temperature: 0.7,
    top_k: 50,
    top_p: 0.9,
    repetition_penalty: 1.1,
};
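
The four knobs compose as a pipeline over raw logits: penalize repeats, scale by temperature, softmax, truncate to top-k and then the top-p nucleus, renormalize, and draw. A sketch of that order in plain Python, following common practice; tf-infer's internals may differ:

```python
import math, random

def sample(logits, history, temperature=0.7, top_k=50, top_p=0.9,
           repetition_penalty=1.1, rng=random):
    logits = list(logits)
    # 1. Repetition penalty: push down tokens already generated.
    for t in set(history):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    # 2. Temperature: <1 sharpens, >1 flattens the distribution.
    logits = [l / temperature for l in logits]
    # 3. Softmax (shifted by the max for numerical stability).
    m = max(logits)
    probs = [math.exp(l - m) for l in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    # 4. Top-k, then top-p: keep the smallest prefix of the k most likely
    #    tokens whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # 5. Renormalize over the kept set and draw.
    z = sum(probs[i] for i in kept)
    r, acc = rng.random() * z, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

random.seed(0)
print(sample([2.0, 1.0, 0.1, -1.0], history=[0]))
```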

Energy-Aware Scheduling

The inference scheduler factors energy cost into its batch-sizing and scheduling decisions, and it can enforce energy budgets on individual inference requests:

use tf_infer::EnergyAwareScheduler;

let scheduler = EnergyAwareScheduler::new(
    max_energy_per_request: 0.5,  // joules
    max_power_draw: 200.0,        // watts
);
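
One way to read those two limits is as an admission check: a request is scheduled only if its estimated energy fits the per-request cap and the projected total power stays under the draw limit. The estimator and class below are hypothetical, not tf-infer's actual accounting:

```python
class EnergyBudgetGate:
    """Admit a request only if it fits both energy and power limits."""

    def __init__(self, max_energy_per_request, max_power_draw):
        self.max_energy = max_energy_per_request   # joules per request
        self.max_power = max_power_draw            # watts, whole device
        self.active_watts = 0.0

    def try_admit(self, est_joules, est_watts):
        if est_joules > self.max_energy:
            return False    # would bust the per-request energy cap
        if self.active_watts + est_watts > self.max_power:
            return False    # would push the device over its power limit
        self.active_watts += est_watts
        return True

    def finish(self, est_watts):
        self.active_watts -= est_watts

gate = EnergyBudgetGate(max_energy_per_request=0.5, max_power_draw=200.0)
print(gate.try_admit(0.4, 150.0))   # True
print(gate.try_admit(0.4, 100.0))   # False: 150 + 100 W > 200 W
print(gate.try_admit(0.9, 10.0))    # False: over the 0.5 J cap
```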

Compiler Integration

TensorForge integrates with the Joule compiler through the joule-codegen-tensorforge crate. When Joule code uses tensor operations, the compiler:

  1. Lowers tensor expressions to TensorIR
  2. Applies graph optimization passes
  3. Selects the backend based on --target
  4. Generates backend-specific code
  5. Instruments energy telemetry calls

This means energy budgets work with ML code:

#[energy_budget(max_joules = 10.0)]
fn train_epoch(model: &mut Model, optimizer: &mut AdamW, data: DataLoader) -> f64 {
    let mut total_loss = 0.0;
    for batch in data {
        let loss = model.forward(batch.input);
        let grads = backward(loss);
        optimizer.step(grads);
        total_loss = total_loss + loss.item();
    }
    total_loss
}