SIMD Vector Types
Simd[T; N] provides portable SIMD (Single Instruction, Multiple Data) operations. The compiler maps operations to platform-native intrinsics where available (x86 SSE/AVX, ARM NEON), with a scalar fallback for portability.
Creating SIMD Vectors
// Splat — fill all lanes with the same value
let v: Simd[f32; 4] = Simd::splat(1.0); // [1.0, 1.0, 1.0, 1.0]
// From an array
let v: Simd[f32; 4] = Simd::from_array([1.0, 2.0, 3.0, 4.0]);
// Load from a pointer + offset
let data = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let v: Simd[f32; 4] = Simd::load(&data, 0); // first 4 elements
let w: Simd[f32; 4] = Simd::load(&data, 4); // next 4 elements
Common Lane Widths
| Type | Lanes | x86 | ARM |
|---|---|---|---|
| Simd[f32; 4] | 4 | SSE __m128 | NEON float32x4_t |
| Simd[f32; 8] | 8 | AVX __m256 | 2x NEON |
| Simd[f64; 2] | 2 | SSE2 __m128d | NEON float64x2_t |
| Simd[f64; 4] | 4 | AVX __m256d | 2x NEON |
| Simd[i32; 4] | 4 | SSE2 __m128i | NEON int32x4_t |
| Simd[i32; 8] | 8 | AVX2 __m256i | 2x NEON |
Arithmetic Operations
All arithmetic operates lane-by-lane:
let a: Simd[f32; 4] = Simd::from_array([1.0, 2.0, 3.0, 4.0]);
let b: Simd[f32; 4] = Simd::from_array([5.0, 6.0, 7.0, 8.0]);
let sum = a.add(&b); // [6.0, 8.0, 10.0, 12.0]
let diff = a.sub(&b); // [-4.0, -4.0, -4.0, -4.0]
let prod = a.mul(&b); // [5.0, 12.0, 21.0, 32.0]
let quot = a.div(&b); // [0.2, 0.333, 0.429, 0.5]
Reduction Operations
Reduce all lanes to a single scalar:
let v: Simd[f32; 4] = Simd::from_array([1.0, 2.0, 3.0, 4.0]);
let total = v.sum(); // 10.0 — horizontal sum of all lanes
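Conceptually, a horizontal sum combines lanes pairwise over log2(N) rounds, one shuffle + add per round (as the Energy Costs table below notes). A plain-Python sketch of that reduction tree, for illustration only — the real lowering uses platform shuffle intrinsics:

```python
def horizontal_sum(lanes):
    """Tree-reduce a power-of-two list of lanes: log2(N) add rounds."""
    lanes = list(lanes)
    n = len(lanes)
    while n > 1:
        half = n // 2
        # Fold the upper half into the lower half (one SIMD add per round)
        lanes = [lanes[i] + lanes[i + half] for i in range(half)]
        n = half
    return lanes[0]

print(horizontal_sum([1.0, 2.0, 3.0, 4.0]))  # 10.0
```

For N = 4 this takes 2 rounds; for N = 8, 3 rounds — which is why the reduction stays cheap even at wider lane counts.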
Comparison and Selection
let a: Simd[f32; 4] = Simd::from_array([1.0, 5.0, 3.0, 8.0]);
let b: Simd[f32; 4] = Simd::from_array([2.0, 4.0, 6.0, 7.0]);
let lo = a.min(&b); // [1.0, 4.0, 3.0, 7.0]
let hi = a.max(&b); // [2.0, 5.0, 6.0, 8.0]
let same = a.eq(&b); // false — true only when every lane compares equal
Unary Operations
let v: Simd[f32; 4] = Simd::from_array([-1.0, 2.0, -3.0, 4.0]);
let pos = v.abs(); // [1.0, 2.0, 3.0, 4.0]
let neg = v.neg(); // [1.0, -2.0, 3.0, -4.0]
Memory Operations
let mut data = vec![0.0; 1024];
// Load 4 elements starting at offset 8
let chunk: Simd[f32; 4] = Simd::load(&data, 8);
// Store back to memory
chunk.store(&mut data, 8);
// Convert to/from array
let arr: [f32; 4] = chunk.to_array();
Example: Vectorized Dot Product
#[energy_budget(max_joules = 0.00005)]
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len();
    let mut sum: Simd[f32; 8] = Simd::splat(0.0);
    let mut i = 0;
    // Process 8 elements at a time
    while i + 8 <= n {
        let va: Simd[f32; 8] = Simd::load(a, i);
        let vb: Simd[f32; 8] = Simd::load(b, i);
        sum = sum.add(&va.mul(&vb));
        i = i + 8;
    }
    // Horizontal sum + scalar remainder
    let mut result = sum.sum();
    while i < n {
        result = result + a[i] * b[i];
        i = i + 1;
    }
    result
}
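To make the control flow concrete, the same vector-body-plus-scalar-remainder structure can be modeled in plain Python, with the 8-lane accumulator as a list. This is a semantic sketch of what the kernel computes, not the compiled form:

```python
def dot_product(a, b):
    lanes = 8
    n = len(a)
    acc = [0.0] * lanes          # models the Simd[f32; 8] accumulator
    i = 0
    while i + lanes <= n:        # vector body: 8 elements per step
        for l in range(lanes):
            acc[l] += a[i + l] * b[i + l]
        i += lanes
    result = sum(acc)            # horizontal sum of the accumulator
    while i < n:                 # scalar remainder (n not divisible by 8)
        result += a[i] * b[i]
        i += 1
    return result

a = [float(x) for x in range(1, 12)]  # 11 elements: one vector step + 3 remainder
b = [2.0] * 11
print(dot_product(a, b))  # 132.0
```

Note that the remainder loop is what keeps the function correct for arbitrary input lengths; dropping it silently ignores up to 7 trailing elements.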
Energy Costs
| Operation | Cost | Notes |
|---|---|---|
| Lane arithmetic (add/sub/mul/div) | 2.0 pJ | Single SIMD instruction |
| Horizontal reduction (sum) | 2.0 pJ | Log2(N) shuffle + add |
| Load/store | 0.5 pJ | L1 cache, aligned |
| Comparison (min/max/eq) | 2.0 pJ | Single SIMD instruction |
SIMD operations process N elements for roughly the same energy as one scalar operation. For a Simd[f32; 8], that's ~8x energy efficiency compared to a scalar loop — the primary reason to use SIMD in energy-aware code.
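Using the table's per-operation costs, a back-of-envelope estimate for the dot-product kernel above bears this out. One assumption not stated in the table: a scalar instruction is taken to cost about the same as its SIMD counterpart, since the table lists SIMD costs only.

```python
# Per-instruction costs from the Energy Costs table (picojoules).
LOAD_PJ = 0.5
ARITH_PJ = 2.0

def kernel_energy_pj(n, lanes):
    """Energy for the load/load/mul/add inner loop, `lanes` elements per step.

    Ignores the scalar remainder and the one-off horizontal sum.
    """
    steps = n // lanes
    return steps * (2 * LOAD_PJ + 2 * ARITH_PJ)

n = 1024
simd = kernel_energy_pj(n, lanes=8)    # Simd[f32; 8] loop
scalar = kernel_energy_pj(n, lanes=1)  # element-at-a-time loop
print(simd, scalar, scalar / simd)     # 640.0 5120.0 8.0
```

For 1024 elements the 8-lane loop runs 128 steps at 5.0 pJ each (640 pJ total), versus 1024 scalar steps (5120 pJ) — the ~8x figure above.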
Platform Detection
The compiler automatically selects the best implementation:
- x86/x86_64: uses SSE/AVX intrinsics via <immintrin.h>
- ARM64 (Apple Silicon, etc.): uses NEON intrinsics via <arm_neon.h>
- Other platforms: falls back to scalar loops (same behavior, no hardware acceleration)
No #[cfg] attributes needed in user code — the abstraction is portable.