SIMD Vector Types

Simd[T; N] provides portable SIMD (Single Instruction, Multiple Data) operations. The compiler maps these operations to platform-native intrinsics where available (x86 SSE/AVX, ARM NEON), with a scalar fallback for portability.

Creating SIMD Vectors

// Splat — fill all lanes with the same value
let v: Simd[f32; 4] = Simd::splat(1.0);      // [1.0, 1.0, 1.0, 1.0]

// From an array
let v: Simd[f32; 4] = Simd::from_array([1.0, 2.0, 3.0, 4.0]);

// Load from a pointer + offset
let data = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let v: Simd[f32; 4] = Simd::load(&data, 0);  // first 4 elements
let w: Simd[f32; 4] = Simd::load(&data, 4);  // next 4 elements
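The semantics of `Simd::load(&data, offset)` can be modeled in plain Rust: copy N consecutive elements starting at `offset` into the vector's lanes. The `load4` helper below is illustrative only, not part of the library API:

```rust
// Model of Simd::load(&data, offset): copy N consecutive elements
// starting at `offset` into the vector's lanes.
fn load4(data: &[f32], offset: usize) -> [f32; 4] {
    let mut lanes = [0.0; 4];
    lanes.copy_from_slice(&data[offset..offset + 4]);
    lanes
}

fn main() {
    let data = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    assert_eq!(load4(&data, 0), [1.0, 2.0, 3.0, 4.0]); // first 4 elements
    assert_eq!(load4(&data, 4), [5.0, 6.0, 7.0, 8.0]); // next 4 elements
}
```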

Common Lane Widths

Type          Lanes  x86           ARM
Simd[f32; 4]  4      SSE  __m128   NEON float32x4_t
Simd[f32; 8]  8      AVX  __m256   2x NEON
Simd[f64; 2]  2      SSE2 __m128d  NEON float64x2_t
Simd[f64; 4]  4      AVX  __m256d  2x NEON
Simd[i32; 4]  4      SSE2 __m128i  NEON int32x4_t
Simd[i32; 8]  8      AVX2 __m256i  2x NEON

Arithmetic Operations

All arithmetic operates lane-by-lane:

let a: Simd[f32; 4] = Simd::from_array([1.0, 2.0, 3.0, 4.0]);
let b: Simd[f32; 4] = Simd::from_array([5.0, 6.0, 7.0, 8.0]);

let sum = a.add(&b);    // [6.0, 8.0, 10.0, 12.0]
let diff = a.sub(&b);   // [-4.0, -4.0, -4.0, -4.0]
let prod = a.mul(&b);   // [5.0, 12.0, 21.0, 32.0]
let quot = a.div(&b);   // [0.2, 0.333, 0.429, 0.5]
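The scalar fallback makes lane-wise semantics explicit: output lane i is op(a[i], b[i]), independently of every other lane. A minimal Rust model (the `lanewise` helper is illustrative, not part of the API):

```rust
// Scalar model of lane-wise arithmetic: each output lane i is
// op(a[i], b[i]). This is exactly what the portable fallback computes.
fn lanewise(a: [f32; 4], b: [f32; 4], op: impl Fn(f32, f32) -> f32) -> [f32; 4] {
    let mut out = [0.0; 4];
    for i in 0..4 {
        out[i] = op(a[i], b[i]);
    }
    out
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    assert_eq!(lanewise(a, b, |x, y| x + y), [6.0, 8.0, 10.0, 12.0]);
    assert_eq!(lanewise(a, b, |x, y| x * y), [5.0, 12.0, 21.0, 32.0]);
}
```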

Reduction Operations

Reduce all lanes to a single scalar:

let v: Simd[f32; 4] = Simd::from_array([1.0, 2.0, 3.0, 4.0]);

let total = v.sum();     // 10.0 — horizontal sum of all lanes
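Hardware typically performs a horizontal sum as log2(N) rounds of "shuffle the upper half down, add lane-wise" rather than N-1 serial adds. A sketch of that tree reduction in plain Rust (illustrative; real hardware does this with shuffle instructions on registers):

```rust
// Horizontal sum as a tree reduction: log2(N) rounds of folding
// the upper half onto the lower half with a lane-wise add.
fn horizontal_sum(mut v: Vec<f32>) -> f32 {
    while v.len() > 1 {
        let half = v.len() / 2;
        for i in 0..half {
            v[i] += v[i + half]; // add upper half onto lower half
        }
        v.truncate(half);
    }
    v[0]
}

fn main() {
    // [1,2,3,4] -> [4,6] -> [10]
    assert_eq!(horizontal_sum(vec![1.0, 2.0, 3.0, 4.0]), 10.0);
}
```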

Comparison and Selection

let a: Simd[f32; 4] = Simd::from_array([1.0, 5.0, 3.0, 8.0]);
let b: Simd[f32; 4] = Simd::from_array([2.0, 4.0, 6.0, 7.0]);

let lo = a.min(&b);     // [1.0, 4.0, 3.0, 7.0]
let hi = a.max(&b);     // [2.0, 5.0, 6.0, 8.0]
let same = a.eq(&b);    // false — lanes are compared element-wise; true only if every lane matches

Unary Operations

let v: Simd[f32; 4] = Simd::from_array([-1.0, 2.0, -3.0, 4.0]);

let pos = v.abs();       // [1.0, 2.0, 3.0, 4.0]
let neg = v.neg();       // [1.0, -2.0, 3.0, -4.0]

Memory Operations

let mut data = vec![0.0; 1024];

// Load 4 elements starting at offset 8
let chunk: Simd[f32; 4] = Simd::load(&data, 8);

// Store back to memory
chunk.store(&mut data, 8);

// Convert to/from array
let arr: [f32; 4] = chunk.to_array();
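A store is the mirror image of a load: write the vector's lanes back into N consecutive slots, leaving everything outside that window untouched. A plain-Rust model (the `store4` helper is illustrative, not the library API):

```rust
// Model of chunk.store(&mut data, offset): write the vector's lanes
// into data[offset..offset + N]; surrounding elements are untouched.
fn store4(data: &mut [f32], offset: usize, lanes: [f32; 4]) {
    data[offset..offset + 4].copy_from_slice(&lanes);
}

fn main() {
    let mut data = vec![0.0f32; 16];
    store4(&mut data, 8, [1.0, 2.0, 3.0, 4.0]);
    assert_eq!(&data[8..12], &[1.0, 2.0, 3.0, 4.0]);
    assert_eq!(data[7], 0.0);  // element before the window is untouched
    assert_eq!(data[12], 0.0); // element after the window is untouched
}
```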

Example: Vectorized Dot Product

#[energy_budget(max_joules = 0.00005)]
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len();
    let mut sum: Simd[f32; 8] = Simd::splat(0.0);
    let mut i = 0;

    // Process 8 elements at a time
    while i + 8 <= n {
        let va: Simd[f32; 8] = Simd::load(a, i);
        let vb: Simd[f32; 8] = Simd::load(b, i);
        sum = sum.add(&va.mul(&vb));
        i = i + 8;
    }

    // Horizontal sum + scalar remainder
    let mut result = sum.sum();
    while i < n {
        result = result + a[i] * b[i];
        i = i + 1;
    }
    result
}
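The blocking pattern above (8-wide main loop, horizontal sum, scalar remainder) is independent of the SIMD types themselves. A scalar Rust model of the same structure, with an 8-element array standing in for the 8-lane accumulator:

```rust
// Scalar model of the blocked dot product: same 8-wide main loop
// and scalar remainder, with an array standing in for Simd[f32; 8].
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len();
    let mut lanes = [0.0f32; 8]; // stands in for the 8-lane accumulator
    let mut i = 0;

    // Main loop: accumulate 8 products per iteration, one per "lane"
    while i + 8 <= n {
        for l in 0..8 {
            lanes[l] += a[i + l] * b[i + l];
        }
        i += 8;
    }

    // Horizontal sum of the lanes, then the scalar remainder
    let mut result: f32 = lanes.iter().sum();
    while i < n {
        result += a[i] * b[i];
        i += 1;
    }
    result
}

fn main() {
    let a: Vec<f32> = (0..10).map(|x| x as f32).collect(); // 0..9
    let b = vec![2.0f32; 10];
    // 2 * (0 + 1 + ... + 9) = 90; exercises both main loop and remainder
    assert_eq!(dot_product(&a, &b), 90.0);
}
```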

Energy Costs

Operation                          Cost    Notes
Lane arithmetic (add/sub/mul/div)  2.0 pJ  Single SIMD instruction
Horizontal reduction (sum)         2.0 pJ  log2(N) shuffles + adds
Load/store                         0.5 pJ  L1 cache, aligned access
Comparison (min/max/eq)            2.0 pJ  Single SIMD instruction

SIMD operations process N elements for roughly the same energy as one scalar operation. For a Simd[f32; 8], that's ~8x energy efficiency compared to a scalar loop — the primary reason to use SIMD in energy-aware code.
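The ~8x figure falls straight out of the cost table: one SIMD instruction covers N lanes at the same per-instruction cost as a scalar operation. A back-of-envelope check in Rust (the 2.0 pJ constant and `add_energy_pj` helper come from the table above, not from a real API):

```rust
// Back-of-envelope energy estimate from the cost table: lane
// arithmetic costs 2.0 pJ per instruction, and one instruction
// covers `lanes` elements.
fn add_energy_pj(elements: usize, lanes: usize) -> f64 {
    let instructions = (elements + lanes - 1) / lanes; // ceil(elements / lanes)
    instructions as f64 * 2.0
}

fn main() {
    let scalar = add_energy_pj(1024, 1); // 1024 instructions -> 2048 pJ
    let simd8 = add_energy_pj(1024, 8);  //  128 instructions ->  256 pJ
    assert_eq!(scalar / simd8, 8.0);     // the ~8x efficiency figure
}
```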

Platform Detection

The compiler automatically selects the best implementation:

  1. x86/x86_64: Uses SSE/AVX intrinsics via <immintrin.h>
  2. ARM64 (Apple Silicon, etc.): Uses NEON intrinsics via <arm_neon.h>
  3. Other platforms: Falls back to scalar loops (same behavior, no hardware acceleration)

No #[cfg] attributes needed in user code — the abstraction is portable.