SIMD Vector Types

Simd[T; N] provides portable SIMD (Single Instruction, Multiple Data) operations. The compiler maps these operations to platform-native intrinsics where available (x86 SSE/AVX, ARM NEON), with a scalar fallback for portability.

Creating SIMD Vectors

// Splat — fill all lanes with the same value
let v: Simd[f32; 4] = Simd::splat(1.0);      // [1.0, 1.0, 1.0, 1.0]

// From an array
let v: Simd[f32; 4] = Simd::from_array([1.0, 2.0, 3.0, 4.0]);

// Load from a pointer + offset
let data = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let v: Simd[f32; 4] = Simd::load(&data, 0);  // first 4 elements
let w: Simd[f32; 4] = Simd::load(&data, 4);  // next 4 elements
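The semantics of `Simd::load(&data, offset)` can be modeled in plain Rust: copy N consecutive elements starting at `offset` into the vector's lanes. The `load4` helper below is illustrative only, not part of the library API:

```rust
// Model of Simd::load(&data, offset): copy N consecutive elements
// starting at `offset` into the vector's lanes.
fn load4(data: &[f32], offset: usize) -> [f32; 4] {
    let mut lanes = [0.0; 4];
    lanes.copy_from_slice(&data[offset..offset + 4]);
    lanes
}

fn main() {
    let data = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    assert_eq!(load4(&data, 0), [1.0, 2.0, 3.0, 4.0]); // first 4 elements
    assert_eq!(load4(&data, 4), [5.0, 6.0, 7.0, 8.0]); // next 4 elements
}
```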

Common Lane Widths

Type          Lanes  x86           ARM
Simd[f32; 4]  4      SSE  __m128   NEON float32x4_t
Simd[f32; 8]  8      AVX  __m256   2x NEON
Simd[f64; 2]  2      SSE2 __m128d  NEON float64x2_t
Simd[f64; 4]  4      AVX  __m256d  2x NEON
Simd[i32; 4]  4      SSE2 __m128i  NEON int32x4_t
Simd[i32; 8]  8      AVX2 __m256i  2x NEON

Arithmetic Operations

All arithmetic operates lane-by-lane:

let a: Simd[f32; 4] = Simd::from_array([1.0, 2.0, 3.0, 4.0]);
let b: Simd[f32; 4] = Simd::from_array([5.0, 6.0, 7.0, 8.0]);

let sum = a.add(&b);    // [6.0, 8.0, 10.0, 12.0]
let diff = a.sub(&b);   // [-4.0, -4.0, -4.0, -4.0]
let prod = a.mul(&b);   // [5.0, 12.0, 21.0, 32.0]
let quot = a.div(&b);   // [0.2, 0.333, 0.429, 0.5]
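The scalar fallback makes lane-wise semantics explicit: output lane i is op(a[i], b[i]), independently of every other lane. A minimal Rust model (the `lanewise` helper is illustrative, not part of the API):

```rust
// Scalar model of lane-wise arithmetic: each output lane i is
// op(a[i], b[i]). This is exactly what the portable fallback computes.
fn lanewise(a: [f32; 4], b: [f32; 4], op: impl Fn(f32, f32) -> f32) -> [f32; 4] {
    let mut out = [0.0; 4];
    for i in 0..4 {
        out[i] = op(a[i], b[i]);
    }
    out
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    assert_eq!(lanewise(a, b, |x, y| x + y), [6.0, 8.0, 10.0, 12.0]);
    assert_eq!(lanewise(a, b, |x, y| x * y), [5.0, 12.0, 21.0, 32.0]);
}
```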

Reduction Operations

Reduce all lanes to a single scalar:

let v: Simd[f32; 4] = Simd::from_array([1.0, 2.0, 3.0, 4.0]);

let total = v.sum();     // 10.0 — horizontal sum of all lanes
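Hardware typically performs a horizontal sum as log2(N) rounds of "shuffle the upper half down, add lane-wise" rather than N-1 serial adds. A sketch of that tree reduction in plain Rust (illustrative; real hardware does this with shuffle instructions on registers):

```rust
// Horizontal sum as a tree reduction: log2(N) rounds of folding
// the upper half onto the lower half with a lane-wise add.
fn horizontal_sum(mut v: Vec<f32>) -> f32 {
    while v.len() > 1 {
        let half = v.len() / 2;
        for i in 0..half {
            v[i] += v[i + half]; // add upper half onto lower half
        }
        v.truncate(half);
    }
    v[0]
}

fn main() {
    // [1,2,3,4] -> [4,6] -> [10]
    assert_eq!(horizontal_sum(vec![1.0, 2.0, 3.0, 4.0]), 10.0);
}
```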

Comparison and Selection

let a: Simd[f32; 4] = Simd::from_array([1.0, 5.0, 3.0, 8.0]);
let b: Simd[f32; 4] = Simd::from_array([2.0, 4.0, 6.0, 7.0]);

let lo = a.min(&b);     // [1.0, 4.0, 3.0, 7.0]
let hi = a.max(&b);     // [2.0, 5.0, 6.0, 8.0]
let same = a.eq(&b);    // false — lanes are compared element-wise; true only if every lane matches

Unary Operations

let v: Simd[f32; 4] = Simd::from_array([-1.0, 2.0, -3.0, 4.0]);

let pos = v.abs();       // [1.0, 2.0, 3.0, 4.0]
let neg = v.neg();       // [1.0, -2.0, 3.0, -4.0]

Memory Operations

let mut data = vec![0.0; 1024];

// Load 4 elements starting at offset 8
let chunk: Simd[f32; 4] = Simd::load(&data, 8);

// Store back to memory
chunk.store(&mut data, 8);

// Convert to/from array
let arr: [f32; 4] = chunk.to_array();
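A store is the mirror image of a load: write the vector's lanes back into N consecutive slots, leaving everything outside that window untouched. A plain-Rust model (the `store4` helper is illustrative, not the library API):

```rust
// Model of chunk.store(&mut data, offset): write the vector's lanes
// into data[offset..offset + N]; surrounding elements are untouched.
fn store4(data: &mut [f32], offset: usize, lanes: [f32; 4]) {
    data[offset..offset + 4].copy_from_slice(&lanes);
}

fn main() {
    let mut data = vec![0.0f32; 16];
    store4(&mut data, 8, [1.0, 2.0, 3.0, 4.0]);
    assert_eq!(&data[8..12], &[1.0, 2.0, 3.0, 4.0]);
    assert_eq!(data[7], 0.0);  // element before the window is untouched
    assert_eq!(data[12], 0.0); // element after the window is untouched
}
```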

Example: Vectorized Dot Product

#[energy_budget(max_joules = 0.00005)]
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len();
    let mut sum: Simd[f32; 8] = Simd::splat(0.0);
    let mut i = 0;

    // Process 8 elements at a time
    while i + 8 <= n {
        let va: Simd[f32; 8] = Simd::load(a, i);
        let vb: Simd[f32; 8] = Simd::load(b, i);
        sum = sum.add(&va.mul(&vb));
        i = i + 8;
    }

    // Horizontal sum + scalar remainder
    let mut result = sum.sum();
    while i < n {
        result = result + a[i] * b[i];
        i = i + 1;
    }
    result
}
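The blocking pattern above (8-wide main loop, horizontal sum, scalar remainder) is independent of the SIMD types themselves. A scalar Rust model of the same structure, with an 8-element array standing in for the 8-lane accumulator:

```rust
// Scalar model of the blocked dot product: same 8-wide main loop
// and scalar remainder, with an array standing in for Simd[f32; 8].
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len();
    let mut lanes = [0.0f32; 8]; // stands in for the 8-lane accumulator
    let mut i = 0;

    // Main loop: accumulate 8 products per iteration, one per "lane"
    while i + 8 <= n {
        for l in 0..8 {
            lanes[l] += a[i + l] * b[i + l];
        }
        i += 8;
    }

    // Horizontal sum of the lanes, then the scalar remainder
    let mut result: f32 = lanes.iter().sum();
    while i < n {
        result += a[i] * b[i];
        i += 1;
    }
    result
}

fn main() {
    let a: Vec<f32> = (0..10).map(|x| x as f32).collect(); // 0..9
    let b = vec![2.0f32; 10];
    // 2 * (0 + 1 + ... + 9) = 90; exercises both main loop and remainder
    assert_eq!(dot_product(&a, &b), 90.0);
}
```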

Energy Costs

Operation                          Cost    Notes
Lane arithmetic (add/sub/mul/div)  2.0 pJ  Single SIMD instruction
Horizontal reduction (sum)         2.0 pJ  log2(N) shuffles + adds
Load/store                         0.5 pJ  L1 cache, aligned access
Comparison (min/max/eq)            2.0 pJ  Single SIMD instruction

SIMD operations process N elements for roughly the same energy as one scalar operation. For a Simd[f32; 8], that's ~8x energy efficiency compared to a scalar loop — the primary reason to use SIMD in energy-aware code.
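The ~8x figure falls straight out of the cost table: one SIMD instruction covers N lanes at the same per-instruction cost as a scalar operation. A back-of-envelope check in Rust (the 2.0 pJ constant and `add_energy_pj` helper come from the table above, not from a real API):

```rust
// Back-of-envelope energy estimate from the cost table: lane
// arithmetic costs 2.0 pJ per instruction, and one instruction
// covers `lanes` elements.
fn add_energy_pj(elements: usize, lanes: usize) -> f64 {
    let instructions = (elements + lanes - 1) / lanes; // ceil(elements / lanes)
    instructions as f64 * 2.0
}

fn main() {
    let scalar = add_energy_pj(1024, 1); // 1024 instructions -> 2048 pJ
    let simd8 = add_energy_pj(1024, 8);  //  128 instructions ->  256 pJ
    assert_eq!(scalar / simd8, 8.0);     // the ~8x efficiency figure
}
```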

Platform Detection

The compiler automatically selects the best implementation:

  1. x86/x86_64: Uses SSE/AVX intrinsics via <immintrin.h>
  2. ARM64 (Apple Silicon, etc.): Uses NEON intrinsics via <arm_neon.h>
  3. Other platforms: Falls back to scalar loops (same behavior, no hardware acceleration)

No #[cfg] attributes needed in user code — the abstraction is portable.