TensorForge

TensorForge is Joule's energy-aware machine learning framework. It provides a complete ML stack -- from tensor operations to distributed training to inference -- with energy measurement built into every layer.

Architecture

TensorForge is organized as 22 crates in the Joule workspace:

Foundation Crates

Crate         Purpose
tf-core       Core types, EnergyTelemetry trait, tensor metadata
tf-ir         TensorIR: HighOp (14 tensor operations), graph representation
tf-compiler   OptimizationPass trait, graph rewriting infrastructure, 7 optimization passes
tf-autodiff   Automatic differentiation with real VJP (vector-Jacobian product) implementations
tf-hal        Hardware abstraction: Device trait, memory management
tf-runtime    Tensor execution runtime, memory pools, scheduling

Backend Crates

Crate                  Hardware Target
tf-backend-cpu         x86/ARM CPUs with SIMD
tf-backend-cuda        NVIDIA GPUs via CUDA
tf-backend-rocm        AMD GPUs via ROCm/HIP
tf-backend-metal       Apple GPUs via Metal
tf-backend-tpu         Google TPUs
tf-backend-level0      Intel GPUs/accelerators via Level Zero
tf-backend-neuron      AWS Inferentia/Trainium via the Neuron SDK
tf-backend-groq        Groq LPUs
tf-backend-gaudi       Intel Gaudi (Habana Labs)
tf-backend-estimated   Energy-estimated backend (no hardware required)

High-Level Crates

Crate            Purpose
tf-nn            Neural network modules (Module trait, layers, activations)
tf-optim         Optimizers (AdamW, SGD with momentum)
tf-data          Data loading and batching
tf-serialize     Model serialization/deserialization
tf-distributed   Distributed training (ring, tree, halving-doubling collectives)
tf-infer         Inference engine (KV cache, speculative decoding, scheduling)

EnergyTelemetry Trait

The EnergyTelemetry trait is the foundation of TensorForge's energy awareness. Every backend implements it:

pub trait EnergyTelemetry {
    fn energy_consumed_joules(&self) -> f64;
    fn power_draw_watts(&self) -> f64;
    fn temperature_celsius(&self) -> f64;
    fn reset_counters(&mut self);
}

This means every tensor operation -- every matmul, every convolution, every activation -- has a measurable energy cost. The energy data flows up through the framework:

  • Individual ops report energy via the backend's telemetry
  • The optimizer aggregates energy per training step
  • The training loop reports energy per epoch
  • The distributed runtime aggregates energy across all nodes
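
The roll-up above can be sketched in a few lines. This is a plain-Python illustration of the aggregation path only; EnergyAccumulator and its methods are hypothetical names, not the TensorForge API:

```python
class EnergyAccumulator:
    """Toy stand-in for the telemetry roll-up: ops -> step -> epoch -> cluster."""

    def __init__(self):
        self.op_joules = []

    def record_op(self, joules):
        # Individual ops report energy via the backend's telemetry.
        self.op_joules.append(joules)

    def drain_step(self):
        # The optimizer aggregates energy once per training step.
        total, self.op_joules = sum(self.op_joules), []
        return total

# One node runs 2 steps of 3 ops each:
node = EnergyAccumulator()
step_totals = []
for _ in range(2):
    for op_cost in (0.5, 1.25, 0.25):
        node.record_op(op_cost)
    step_totals.append(node.drain_step())

epoch_total = sum(step_totals)        # training loop: energy per epoch
cluster_total = epoch_total * 4       # distributed runtime: 4 identical nodes
print(step_totals, epoch_total, cluster_total)   # [2.0, 2.0] 4.0 16.0
```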

TensorIR

TensorForge uses a graph-based intermediate representation with 14 high-level operations:

Operation   Description
MatMul      Matrix multiplication
Conv2D      2D convolution
BatchNorm   Batch normalization
Relu        ReLU activation
Softmax     Softmax normalization
Add         Element-wise addition
Mul         Element-wise multiplication
Reduce      Reduction (sum, mean, max)
Reshape     Tensor reshape
Transpose   Tensor transpose
Concat      Tensor concatenation
Slice       Tensor slicing
Gather      Index-based gathering
Scatter     Index-based scattering

Graph Optimization

The tf-compiler provides 7 optimization passes:

  1. Operator Fusion -- Fuse sequences like Conv2D+BatchNorm+ReLU into a single kernel
  2. Layout Optimization -- Choose optimal memory layout (NCHW vs NHWC) per backend
  3. Constant Folding -- Evaluate constant subgraphs at compile time
  4. Dead Node Elimination -- Remove unused computation
  5. Common Subexpression Elimination -- Share identical computations
  6. Memory Planning -- Minimize peak memory usage through buffer reuse
  7. Energy-Aware Scheduling -- Reorder operations to minimize energy consumption
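
As an illustration of the kind of rewriting these passes do, here is a toy constant-folding and dead-node-elimination pass over a dictionary-based graph. The node layout is invented for this sketch and is not tf-compiler's actual IR:

```python
OPS = {"Add": lambda a, b: a + b, "Mul": lambda a, b: a * b}

def constant_fold(graph):
    """Replace nodes whose inputs are all constants with Const nodes."""
    changed = True
    while changed:
        changed = False
        for name, node in graph.items():
            if node["op"] in OPS and all(graph[i]["op"] == "Const" for i in node["inputs"]):
                vals = [graph[i]["value"] for i in node["inputs"]]
                graph[name] = {"op": "Const", "value": OPS[node["op"]](*vals), "inputs": []}
                changed = True
    return graph

def eliminate_dead(graph, outputs):
    """Keep only nodes reachable from the graph outputs."""
    live, stack = set(), list(outputs)
    while stack:
        n = stack.pop()
        if n not in live:
            live.add(n)
            stack.extend(graph[n]["inputs"])
    return {n: graph[n] for n in live}

g = {
    "a":    {"op": "Const", "value": 2.0, "inputs": []},
    "b":    {"op": "Const", "value": 3.0, "inputs": []},
    "c":    {"op": "Mul", "inputs": ["a", "b"]},    # foldable: 2 * 3
    "x":    {"op": "Input", "inputs": []},
    "y":    {"op": "Add", "inputs": ["x", "c"]},    # graph output
    "dead": {"op": "Add", "inputs": ["a", "b"]},    # never used
}
g = eliminate_dead(constant_fold(g), outputs=["y"])
print(sorted(g))        # ['c', 'x', 'y'] -- the dead node is gone
print(g["c"]["value"])  # 6.0
```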

Autodiff

TensorForge implements reverse-mode automatic differentiation with real VJP implementations for all operations. No stubs, no placeholders -- every backward pass computes correct gradients:

use tf_autodiff::backward;

let loss = model.forward(input);
let gradients = backward(loss);  // real gradient computation
optimizer.step(gradients);
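
The mechanism behind that backward call can be illustrated with a minimal scalar reverse-mode tape in Python. Var, its VJP closures, and this backward are a teaching sketch, not tf-autodiff's actual types:

```python
class Var:
    """A scalar tracked on the tape; parents carry (node, vjp) pairs."""

    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, list(parents), 0.0

    def __add__(self, other):
        # VJP of add: the upstream gradient flows unchanged to both inputs.
        return Var(self.value + other.value,
                   [(self, lambda g: g), (other, lambda g: g)])

    def __mul__(self, other):
        # VJP of mul: each input gets the upstream grad times the other input.
        return Var(self.value * other.value,
                   [(self, lambda g: g * other.value),
                    (other, lambda g: g * self.value)])

def backward(out):
    """Accumulate d(out)/d(node) into .grad for every node in the graph."""
    # Topologically order the graph so each node's grad is complete
    # before it is propagated to its parents.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, vjp in node.parents:
            parent.grad += vjp(node.grad)

x, y = Var(3.0), Var(4.0)
loss = x * y + x          # d/dx = y + 1 = 5, d/dy = x = 3
backward(loss)
print(x.grad, y.grad)     # 5.0 3.0
```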

Neural Network API

The tf-nn crate provides a Module trait for building neural networks:

use tf_nn::{Module, Linear, Conv2d, BatchNorm2d, relu};

struct ResBlock {
    conv1: Conv2d,
    bn1: BatchNorm2d,
    conv2: Conv2d,
    bn2: BatchNorm2d,
}

impl Module for ResBlock {
    fn forward(&self, x: Tensor) -> Tensor {
        let residual = x;
        let out = self.conv1.forward(x)
            |> self.bn1.forward
            |> relu
            |> self.conv2.forward
            |> self.bn2.forward;
        relu(out + residual)
    }
}

Optimizers

The tf-optim crate provides energy-tracked optimizers:

use tf_optim::{AdamW, SGD};

// AdamW with weight decay
let optimizer = AdamW::new(model.parameters(), lr: 0.001, weight_decay: 0.01);

// SGD with momentum
let optimizer = SGD::new(model.parameters(), lr: 0.01, momentum: 0.9);

Every optimizer step reports energy consumed:

let energy = optimizer.step(gradients);
println!("Step energy: {} J", energy.joules());
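
For reference, the AdamW update rule itself (decoupled weight decay, per Loshchilov & Hutter) looks like this in plain Python. The function and state layout are illustrative, and tf-optim's energy accounting is not modeled here:

```python
import math

def adamw_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update over flat lists of scalars; mutates params in place."""
    state["t"] += 1
    t = state["t"]
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving averages of the gradient and its square.
        m = state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        v = state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)      # bias correction
        v_hat = v / (1 - beta2 ** t)
        # Decoupled weight decay: applied directly to the parameter,
        # not folded into the gradient.
        params[i] = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return params

params = [1.0, -2.0]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
adamw_step(params, grads=[0.5, -0.5], state=state)
print(params)   # both parameters nudged against their gradients
```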

Distributed Training

The tf-distributed crate supports multi-node training with three collective algorithms:

Algorithm          Pattern                                Best For
Ring AllReduce     Each node sends to its next neighbor   Large models, high bandwidth
Tree AllReduce     Binary tree reduction                  Low latency
Halving-Doubling   Recursive halving, then doubling       Balanced latency and bandwidth
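
The ring algorithm's data movement can be simulated directly: a reduce-scatter phase followed by an allgather, after which every node holds the element-wise sum of all inputs. This sketch models only the communication pattern (no transport, no energy tracking):

```python
import copy

def ring_allreduce(node_chunks):
    """node_chunks[i][c]: node i's local value for chunk c (floats here)."""
    n = len(node_chunks)              # world size; each node holds n chunks
    chunks = copy.deepcopy(node_chunks)
    # Phase 1, reduce-scatter: after n-1 steps, node i holds the complete
    # sum for chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] += chunks[i][c]
    # Phase 2, allgather: circulate the finished chunks for n-1 more steps.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c]
    return chunks

grads = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
print(ring_allreduce(grads)[0])   # [111.0, 222.0, 333.0] on every node
```

Each step moves only one chunk per node to its neighbor, which is why the ring favors large payloads and high-bandwidth links over latency.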

Energy is tracked across all nodes, giving total training energy:

use tf_distributed::DistributedTrainer;

let trainer = DistributedTrainer::new(
    model,
    world_size: 8,
    algorithm: CollectiveAlgorithm::Ring,
);

let metrics = trainer.train(dataset, epochs: 10);
println!("Total energy across {} nodes: {} J", 8, metrics.total_energy_joules());

Inference Engine

The tf-infer crate provides a high-performance inference engine with:

Paged KV Cache

Efficient key-value caching for transformer models. Memory is allocated in pages, avoiding fragmentation:

use tf_infer::KvCache;

let cache = KvCache::paged(
    num_layers: 32,
    num_heads: 32,
    head_dim: 128,
    page_size: 256,
);
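
The paging idea reduces to a small allocator: each sequence grows a page table of fixed-size pages drawn from a shared free list, so variable-length sequences cannot fragment the pool. A hypothetical Python sketch (not tf-infer's API):

```python
class PagedAllocator:
    """Allocate KV storage in fixed-size pages from a shared free list."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}   # sequence id -> list of page indices
        self.lengths = {}       # sequence id -> tokens stored

    def append_token(self, seq_id):
        pages = self.page_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(pages) * self.page_size:    # current page is full
            if not self.free_pages:
                raise MemoryError("KV pool exhausted")
            pages.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        # Pages go straight back to the pool, whole, so nothing fragments.
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedAllocator(num_pages=4, page_size=256)
for _ in range(300):                 # 300 tokens -> 2 pages of 256
    alloc.append_token("req-0")
print(len(alloc.page_tables["req-0"]), len(alloc.free_pages))   # 2 2
```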

Continuous Batching

Dynamic batching that adds new requests to a running batch without waiting for all current requests to complete:

use tf_infer::ContinuousBatcher;

let batcher = ContinuousBatcher::new(max_batch_size: 64);
batcher.add_request(prompt);
let outputs = batcher.step();  // processes all pending requests
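
The scheduling idea can be modeled in a few lines: finished sequences free their batch slots, and queued requests fill them between steps, so the batch never has to drain. The names and the token-countdown stand-in for decoding are hypothetical:

```python
class Batcher:
    """Toy continuous batcher: requests join and leave between decode steps."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.queue, self.running = [], {}

    def add_request(self, req_id, tokens_needed):
        self.queue.append((req_id, tokens_needed))

    def step(self):
        # Admit queued requests into any free batch slots.
        while self.queue and len(self.running) < self.max_batch_size:
            req_id, n = self.queue.pop(0)
            self.running[req_id] = n
        # "Decode" one token for every running request.
        finished = []
        for req_id in list(self.running):
            self.running[req_id] -= 1
            if self.running[req_id] == 0:
                finished.append(req_id)
                del self.running[req_id]
        return finished

b = Batcher(max_batch_size=2)
b.add_request("a", 1)
b.add_request("b", 3)
b.add_request("c", 2)       # waits for a free slot
print(b.step())             # ['a'] -- its slot opens up
print(b.step())             # []   -- 'c' was admitted mid-flight for 'b'
print(b.step())             # ['b', 'c']
```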

Speculative Decoding

Use a smaller draft model to generate candidates, then verify with the full model:

use tf_infer::SpeculativeDecoder;

let decoder = SpeculativeDecoder::new(
    target_model: large_model,
    draft_model: small_model,
    num_speculative_tokens: 5,
);
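
A greedy version of the draft-then-verify loop, with both models stubbed as next-token functions, shows the accept/reject logic. This is a conceptual sketch, not tf-infer's implementation:

```python
def speculative_step(target_next, draft_next, context, k):
    """target_next/draft_next: fn(context_tuple) -> next token (greedy)."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    draft, ctx = [], tuple(context)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx = ctx + (t,)
    # Verify phase: the target model checks each proposal in order.
    accepted, ctx = [], tuple(context)
    for t in draft:
        want = target_next(ctx)
        if want == t:
            accepted.append(t)        # agreement: keep the draft token
            ctx = ctx + (t,)
        else:
            accepted.append(want)     # first mismatch: correct it and stop
            break
    else:
        accepted.append(target_next(ctx))   # all k matched: one bonus token
    return accepted

# Toy models over integer tokens: the target says "previous + 1"; the
# draft agrees except it wrongly emits 99 after token 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: 99 if ctx[-1] == 3 else ctx[-1] + 1

print(speculative_step(target, draft, [1], k=4))   # [2, 3, 4]
```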

Sampling Pipeline

Configurable token sampling with temperature, top-k, top-p, and repetition penalty:

use tf_infer::SamplingConfig;

let config = SamplingConfig {
    temperature: 0.7,
    top_k: 50,
    top_p: 0.9,
    repetition_penalty: 1.1,
};
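
The four knobs compose as a pipeline over raw logits: penalize repeats, scale by temperature, softmax, truncate to top-k and then the top-p nucleus, renormalize, and draw. A sketch of that order in plain Python, following common practice; tf-infer's internals may differ:

```python
import math, random

def sample(logits, history, temperature=0.7, top_k=50, top_p=0.9,
           repetition_penalty=1.1, rng=random):
    logits = list(logits)
    # 1. Repetition penalty: push down tokens already generated.
    for t in set(history):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    # 2. Temperature: <1 sharpens, >1 flattens the distribution.
    logits = [l / temperature for l in logits]
    # 3. Softmax (shifted by the max for numerical stability).
    m = max(logits)
    probs = [math.exp(l - m) for l in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    # 4. Top-k, then top-p: keep the smallest prefix of the k most likely
    #    tokens whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # 5. Renormalize over the kept set and draw.
    z = sum(probs[i] for i in kept)
    r, acc = rng.random() * z, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

random.seed(0)
print(sample([2.0, 1.0, 0.1, -1.0], history=[0]))
```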

Energy-Aware Scheduling

The inference scheduler factors energy cost into its batch-sizing and scheduling decisions, and it can enforce energy budgets on individual inference requests:

use tf_infer::EnergyAwareScheduler;

let scheduler = EnergyAwareScheduler::new(
    max_energy_per_request: 0.5,  // joules
    max_power_draw: 200.0,        // watts
);
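
One way to read those two limits is as an admission check: a request is scheduled only if its estimated energy fits the per-request cap and the projected total power stays under the draw limit. The estimator and class below are hypothetical, not tf-infer's actual accounting:

```python
class EnergyBudgetGate:
    """Admit a request only if it fits both energy and power limits."""

    def __init__(self, max_energy_per_request, max_power_draw):
        self.max_energy = max_energy_per_request   # joules per request
        self.max_power = max_power_draw            # watts, whole device
        self.active_watts = 0.0

    def try_admit(self, est_joules, est_watts):
        if est_joules > self.max_energy:
            return False    # would bust the per-request energy cap
        if self.active_watts + est_watts > self.max_power:
            return False    # would push the device over its power limit
        self.active_watts += est_watts
        return True

    def finish(self, est_watts):
        self.active_watts -= est_watts

gate = EnergyBudgetGate(max_energy_per_request=0.5, max_power_draw=200.0)
print(gate.try_admit(0.4, 150.0))   # True
print(gate.try_admit(0.4, 100.0))   # False: 150 + 100 W > 200 W
print(gate.try_admit(0.9, 10.0))    # False: over the 0.5 J cap
```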

Compiler Integration

TensorForge integrates with the Joule compiler through the joule-codegen-tensorforge crate. When Joule code uses tensor operations, the compiler:

  1. Lowers tensor expressions to TensorIR
  2. Applies graph optimization passes
  3. Selects the backend based on --target
  4. Generates backend-specific code
  5. Instruments energy telemetry calls

This means energy budgets work with ML code:

#[energy_budget(max_joules = 10.0)]
fn train_epoch(model: &mut Model, optimizer: &mut AdamW, data: DataLoader) -> f64 {
    let mut total_loss = 0.0;
    for batch in data {
        let loss = model.forward(batch.input);
        let grads = backward(loss);
        optimizer.step(grads);
        total_loss = total_loss + loss.item();
    }
    total_loss
}