alloy

A high-performance, Rust-native CLI for merging, diffing, and converting LLM weights.

March 2026 · GitHub

Overview

alloy is a Rust CLI for merging, diffing, and converting large language model weights. It supports 10 merge algorithms (linear, SLERP, TIES, DARE, DELLA, and others), reads mergekit-compatible YAML configs, and outputs standard safetensors files.

Model merging is just arithmetic on float arrays. It doesn't need Python, PyTorch, or a GPU. alloy memory-maps model files and processes one tensor at a time, so peak memory scales with the largest single tensor, not the whole model. On IO-bound methods like linear and SLERP, that makes it 1.5-1.8x faster than mergekit. For compute-heavy methods like TIES and DARE, mergekit with PyTorch still wins.

It compiles to a ~4 MB static binary, runs on CPU-only hardware, and produces bit-identical results to mergekit for the same config.

Why merge models?

Training LLMs is expensive. Fine-tuning one for a specific task - code generation, medical QA, creative writing - takes hundreds of GPU hours and a curated dataset. But if you want a model that's good at two things, you're stuck. You can't retrain from scratch on a combined dataset (too expensive), and running both models at inference time doubles your cost.

Model merging skips all of that. You take two fine-tuned models that share a common base and combine their weights into a single model. Same inference cost as running one model. No extra training, no access to the original datasets, no additional compute at serving time.

Why does this work at all? Fine-tuning moves weights in a structured way. The difference between a base model and its fine-tune - a task vector - captures what the model "learned" during training. Combine task vectors from different fine-tunes and you compose capabilities. A code model plus a chat model gives you a model that can do both, usually without much degradation in either skill.

This isn't a niche technique. Some of the most downloaded open models on HuggingFace - NeuralBeagle, Goliath-120B, and dozens of community favorites - are merges, not models trained from scratch.

Two fine-tuned neural networks merged into a single model by combining their weight tensors

Motivation

mergekit (Goddard et al., Arcee AI) is the standard tool for model merging, and the community has built great things with it. But it's slow, and the reason is structural: it treats merging as a PyTorch workload.

Here's what actually happens when you run a mergekit merge. Python starts up, imports dozens of modules, initializes the PyTorch runtime, builds a task DAG, and starts loading model shards through LazyTensorLoader. Each tensor access opens a file, seeks to an offset, reads bytes, deserializes into a PyTorch tensor, and allocates memory. A 7B model in FP16 is ~14 GB of raw weight data, but mergekit's peak memory can blow past 64 GB because it keeps multiple tensors alive at once during the merge.

The math behind merging is relatively simple - weighted sums, some trig, element-wise operations on float arrays. None of it needs autograd, a CUDA runtime, or PyTorch's tensor abstraction. Just fast reads and arithmetic.

We took that idea and ran with it. alloy is a Rust CLI that memory-maps safetensors files, processes one tensor at a time, and writes results straight to disk. The whole thing compiles to a ~4 MB static binary with zero runtime dependencies.

Element-wise weighted combination of model weight tensors

mergekit

  • Python + PyTorch runtime
  • DAG-based task scheduler
  • LazyTensorLoader (repeated file open/close)
  • Multiple tensors in memory at once
  • 64 GB+ peak RAM for 7B merges

alloy

  • Native Rust, no runtime
  • Sequential streaming pipeline
  • Zero-copy mmap (no deserialization)
  • One tensor in memory at a time
  • Peak RAM proportional to largest tensor

Both produce standard safetensors output. Same format, same files, loadable by any HuggingFace tool or inference server.

Architecture

Everything in alloy follows one constraint: peak memory should scale with the largest single tensor, not the entire model. A 7B model has ~200 tensors. There's no reason to hold more than one in memory at a time.

TensorStore

When alloy opens a model, it memory-maps every shard file via mmap. The OS maps file contents into the process's virtual address space without actually reading anything into RAM. Tensor data stays on disk until you touch it, and the kernel handles page-in/page-out automatically.

On open, we parse each shard's JSON header (tensor names and byte offsets) and build a HashMap index. Every lookup after that is O(1). F32 tensors are accessed as zero-copy slices directly into the mmap region. F16 and BF16 get converted to F32 on the fly, which does allocate, but only for one tensor at a time.

pub fn tensor_bytes(&self, name: &str) -> &[u8] {
    let loc = &self.index[name];           // O(1) HashMap lookup
    let shard = &self.shards[loc.shard_idx]; // which file this tensor lives in
    &shard.mmap[loc.abs_start..loc.abs_end]  // zero-copy slice into mmap
}
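The on-the-fly dtype widening is simplest for BF16, which is just the top 16 bits of an F32 bit pattern. A sketch of that case (helper names are illustrative, not alloy's actual API; F16 needs a real exponent remap and is omitted here):

```rust
/// BF16 is the top 16 bits of an F32 bit pattern, so widening is a shift.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Widen a raw little-endian BF16 byte slice into a fresh F32 buffer.
/// This Vec is the only per-tensor allocation on the read path.
fn bf16_slice_to_f32(raw: &[u8]) -> Vec<f32> {
    raw.chunks_exact(2)
        .map(|b| bf16_to_f32(u16::from_le_bytes([b[0], b[1]])))
        .collect()
}

fn main() {
    // 0x3F80 is 1.0 in BF16 (sign 0, exponent 127, mantissa 0).
    assert_eq!(bf16_to_f32(0x3F80), 1.0);
    // Raw little-endian BF16 bytes for [1.0, 2.0].
    println!("{:?}", bf16_slice_to_f32(&[0x80, 0x3F, 0x00, 0x40])); // [1.0, 2.0]
}
```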

StreamingWriter

Writing safetensors is trickier than reading them. The format puts a header with byte offsets for every tensor at the start of the file, so you need to know exact sizes before writing a single byte. We handle this in two phases:

Phase 1 (planning): walk all tensor names and shapes from the merge config, compute byte sizes, and assign them to output shards. This gives us a complete header and a shard assignment map. No tensor data gets read here - metadata only.

Phase 2 (writing): for each tensor in order, read from the TensorStore, run the merge op, cast to the output dtype, and write directly to the output file at the precomputed offset. Each tensor gets dropped from memory right after writing.

Writes go to a .tmp file first, then get atomically renamed on completion. If alloy crashes mid-write, you're left with a .tmp file and the original output stays untouched. No partial or corrupted files.

// Phase 1: plan shard layout (metadata only)
let plan = planner.assign_shards(&tensor_names, &shapes, max_shard_size);

// Phase 2: stream tensors one at a time
for name in &plan.tensor_order {
    let inputs: Vec<_> = stores.iter()            // one TensorStore per input model
        .map(|s| s.tensor_as_f32(name))
        .collect::<Result<_, _>>()?;              // mmap reads
    let merged = merge_op.merge(&inputs)?;        // compute
    writer.write_tensor(name, &merged)?;          // write + drop
}
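The crash-safe write described above boils down to the classic tmp-then-rename pattern. A minimal sketch using only std (function name hypothetical):

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write bytes to a sibling .tmp file, fsync, then atomically rename over
/// the final path. A crash mid-write leaves the original output untouched.
fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?;           // make sure data hits disk before the rename
    fs::rename(&tmp, path)   // atomic on POSIX filesystems
}

fn main() -> io::Result<()> {
    let out = std::env::temp_dir().join("model-00001.safetensors");
    write_atomic(&out, b"header+tensors")?;
    assert_eq!(fs::read(&out)?, b"header+tensors".to_vec());
    Ok(())
}
```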

Numerical stability

Weights are stored as FP16 or BF16, but we do all merge math in F32 after conversion. For operations that accumulate across many elements - dot products for cosine similarity, norm calculations for SLERP - we use F64 accumulators to prevent floating-point drift. This actually matters: accumulating 10 million FP32 values without F64 can lose you several digits of precision.
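To see why the F64 accumulator matters, here is a toy case where an F32 running sum stops making progress entirely while an F64 one is unaffected (deliberately exaggerated; in real merges the effect is gradual precision loss, not total absorption):

```rust
/// Add `n` ones onto `start`, accumulating in f32.
fn add_ones_f32(start: f32, n: u32) -> f32 {
    let mut acc = start;
    for _ in 0..n { acc += 1.0; }
    acc
}

/// Same accumulation in f64.
fn add_ones_f64(start: f64, n: u32) -> f64 {
    let mut acc = start;
    for _ in 0..n { acc += 1.0; }
    acc
}

fn main() {
    // At magnitude 1e8, one f32 ulp is 8.0, so each +1.0 rounds straight back.
    println!("{}", add_ones_f32(1.0e8, 1_000_000)); // 100000000
    println!("{}", add_ones_f64(1.0e8, 1_000_000)); // 101000000
}
```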

Parallelism comes from rayon's par_chunks, with each thread processing 8,192 elements independently. Per-chunk results are then folded sequentially (not reduced in parallel) so that output is deterministic regardless of thread count or scheduling. Same merge twice = bit-identical results.
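The determinism recipe can be sketched with std threads instead of rayon to stay dependency-free: compute per-chunk partials in parallel, collect them in chunk order, then fold sequentially.

```rust
use std::thread;

/// Deterministic parallel reduction: fixed-size chunks are summed in
/// parallel, partials are collected in chunk order, then folded
/// sequentially. (alloy uses rayon's par_chunks; std threads keep this
/// sketch self-contained.)
fn chunked_sum(xs: &[f32], chunk: usize) -> f64 {
    let partials: Vec<f64> = thread::scope(|s| {
        let handles: Vec<_> = xs
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&x| x as f64).sum::<f64>()))
            .collect();
        // join() order == chunk order, regardless of which thread finishes first
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    partials.into_iter().sum() // sequential fold -> bit-identical across runs
}

fn main() {
    let xs: Vec<f32> = (0..100_000).map(|i| (i % 7) as f32 * 0.25).collect();
    let (a, b) = (chunked_sum(&xs, 8_192), chunked_sum(&xs, 8_192));
    assert_eq!(a.to_bits(), b.to_bits()); // same result, bit for bit
    println!("{a}");
}
```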

Algorithms

We implement 10 merge methods. They're all interchangeable - the pipeline doesn't know or care which algorithm is running, it just feeds in tensors and gets a merged result back.

  • linear: weighted average of N models
  • slerp: spherical interpolation between 2 models
  • nuslerp: multi-model SLERP via sequential pairwise interpolation
  • task_arithmetic: base + scaled sum of task vectors
  • ties: trim, elect sign, disjoint merge
  • dare_linear: random dropout + rescaled linear merge
  • dare_ties: DARE dropout + TIES sign election
  • della_linear: magnitude-aware dropout + linear merge
  • della: magnitude-aware dropout + TIES sign election
  • passthrough: concatenate layer ranges from different models
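That interchangeability comes down to a single trait boundary between the pipeline and the math. A sketch (trait and method names are illustrative; alloy's actual MergeOp signature may differ):

```rust
/// Illustrative algorithm interface: the pipeline hands in one tensor's
/// worth of f32 data from each input model and gets the merged tensor back.
trait MergeOp {
    fn merge(&self, inputs: &[&[f32]]) -> Vec<f32>;
}

/// The simplest implementation: a weighted average.
struct Linear { weights: Vec<f32> }

impl MergeOp for Linear {
    fn merge(&self, inputs: &[&[f32]]) -> Vec<f32> {
        let n = inputs[0].len();
        (0..n).map(|j| {
            inputs.iter().zip(&self.weights).map(|(m, w)| w * m[j]).sum()
        }).collect()
    }
}

fn main() {
    // The pipeline only sees `dyn MergeOp`, never the concrete algorithm.
    let op: Box<dyn MergeOp> = Box::new(Linear { weights: vec![0.5, 0.5] });
    let out = op.merge(&[&[1.0, 2.0], &[3.0, 4.0]]);
    println!("{out:?}"); // [2.0, 3.0]
}
```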

Linear and Task Arithmetic

The simplest merge is a weighted average: \theta_{\text{merged}} = \sum_i w_i \theta_i. It works surprisingly well when models share the same base, because fine-tuning tends to push weights in similar directions.

Task arithmetic goes one step further by operating on task vectors rather than raw weights. A task vector is just the diff between a fine-tuned model and its base: \tau_i = \theta_i - \theta_{\text{base}}. You add a weighted sum of task vectors back onto the base, scaled by a factor \lambda:

\theta_{\text{merged}} = \theta_{\text{base}} + \lambda \sum_i w_i \, \tau_i

\lambda controls how far the merged model strays from the base. Increase it for stronger specialization, keep it low to stay closer to base model behavior.
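On flat F32 slices the formula is a few lines of code. A sketch (hypothetical helper, with F64 accumulation as in the real pipeline):

```rust
/// theta_merged = theta_base + lambda * sum_i w_i * (theta_i - theta_base),
/// computed element-wise over flat f32 slices.
fn task_arithmetic(base: &[f32], models: &[&[f32]], weights: &[f32], lambda: f32) -> Vec<f32> {
    base.iter().enumerate().map(|(j, &b)| {
        // Weighted sum of task vectors at position j, accumulated in f64.
        let delta: f64 = models.iter().zip(weights)
            .map(|(m, &w)| w as f64 * (m[j] as f64 - b as f64))
            .sum();
        b + (lambda as f64 * delta) as f32
    }).collect()
}

fn main() {
    let base = [1.0_f32, 2.0];
    let code = [2.0_f32, 2.0]; // task vector [1.0, 0.0]
    let chat = [1.0_f32, 4.0]; // task vector [0.0, 2.0]
    let merged = task_arithmetic(&base, &[&code, &chat], &[0.5, 0.5], 1.0);
    println!("{merged:?}"); // [1.5, 3.0]
}
```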

SLERP

Linear interpolation (LERP) blends two weight vectors along a straight line through the interior of the space. Problem: the interpolated vector ends up shorter than either input. For weight tensors, that magnitude loss degrades model quality.

SLERP fixes this by interpolating along the surface of the hypersphere instead, preserving the norm throughout. Geometrically, it traces the great circle arc between two points on a sphere rather than cutting through the interior:

LERP cuts a straight chord between a and b through the sphere's interior; SLERP follows the great-circle arc spanning the angle ω
\text{slerp}(\mathbf{a}, \mathbf{b}, t) = \frac{\sin((1-t)\omega)}{\sin\omega}\,\mathbf{a} + \frac{\sin(t\omega)}{\sin\omega}\,\mathbf{b}

\omega is the angle between \mathbf{a} and \mathbf{b}, computed as \omega = \arccos\!\left(\frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}\right), and t \in [0, 1] is the interpolation parameter. When the vectors are nearly parallel (\cos\omega > 0.9995), the formula blows up numerically because \sin\omega approaches zero, so we fall back to LERP.
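The formula transcribes directly, including the near-parallel LERP fallback. A sketch (illustrative, not alloy's exact implementation):

```rust
/// Spherical interpolation between two flat weight vectors.
/// Dot product and norms use f64 accumulators.
fn slerp(a: &[f32], b: &[f32], t: f32) -> Vec<f32> {
    let (mut dot, mut na, mut nb) = (0f64, 0f64, 0f64);
    for (&x, &y) in a.iter().zip(b) {
        dot += x as f64 * y as f64;
        na += (x as f64).powi(2);
        nb += (y as f64).powi(2);
    }
    let cos_omega = (dot / (na.sqrt() * nb.sqrt())).clamp(-1.0, 1.0);
    if cos_omega > 0.9995 {
        // Nearly parallel: sin(omega) ~ 0, fall back to LERP.
        return a.iter().zip(b).map(|(&x, &y)| (1.0 - t) * x + t * y).collect();
    }
    let omega = cos_omega.acos();
    let sa = ((1.0 - t as f64) * omega).sin() / omega.sin();
    let sb = (t as f64 * omega).sin() / omega.sin();
    a.iter().zip(b)
        .map(|(&x, &y)| (sa * x as f64 + sb * y as f64) as f32)
        .collect()
}

fn main() {
    // Orthogonal unit vectors: the SLERP midpoint stays on the unit sphere,
    // whereas the LERP midpoint [0.5, 0.5] would have norm ~0.707.
    let m = slerp(&[1.0, 0.0], &[0.0, 1.0], 0.5);
    let norm = (m[0] * m[0] + m[1] * m[1]).sqrt();
    println!("{m:?}, norm = {norm}");
}
```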

TIES

Merging more than two models gets messy. Task vectors from different fine-tunes often conflict: one model pushes a weight up, another pushes it down. Average those opposing signals and they cancel each other out. TIES (Yadav et al., 2023) deals with this in three steps:

1. Trim: for each task vector, zero out the smallest elements by magnitude. Gets rid of noise from insignificant weight changes. The trim fraction (density) is a hyperparameter, typically 0.2-0.5.

2. Elect sign: at each weight position, take a weighted vote across all models to pick the "consensus" sign. If models A and C increased a weight and model B decreased it, and A+C outweigh B, the consensus is positive.

3. Disjoint merge: at each position, only keep contributions from models whose sign matches the consensus. Disagreeing models get excluded entirely:

\theta_j = \theta_{\text{base},j} + \frac{\sum_{i:\,\text{sgn}(\tau_{i,j})\,=\,s_j} w_i\,\tau_{i,j}}{\sum_{i:\,\text{sgn}(\tau_{i,j})\,=\,s_j} w_i}

where s_j is the elected sign at position j. You end up preserving the direction of change that most models agree on instead of diluting it with conflicting signals.
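The elect-and-merge steps for a single weight position might look like this (trim step omitted; task vectors assumed already trimmed; names hypothetical):

```rust
/// TIES sign election and disjoint merge at one weight position.
fn ties_position(base: f32, taus: &[f32], weights: &[f32]) -> f32 {
    // Elect sign: weighted vote over the task-vector values.
    let vote: f32 = taus.iter().zip(weights).map(|(&t, &w)| w * t).sum();
    let sign = vote.signum();
    // Disjoint merge: weighted mean over models agreeing with the elected sign.
    let (mut num, mut den) = (0f32, 0f32);
    for (&t, &w) in taus.iter().zip(weights) {
        if t != 0.0 && t.signum() == sign {
            num += w * t;
            den += w;
        }
    }
    if den == 0.0 { base } else { base + num / den }
}

fn main() {
    // Models A and C push the weight up, B pushes it down. A+C win the
    // vote, so B's contribution is excluded instead of diluting the merge.
    let merged = ties_position(0.5, &[0.2, -0.3, 0.4], &[1.0, 1.0, 1.0]);
    println!("{merged}"); // base 0.5 + mean(0.2, 0.4) = ~0.8
}
```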

DARE

DARE (Yu et al., 2023) takes a completely different approach: random dropout. Before merging, each element of each task vector gets randomly zeroed out with probability 1-p, where p is the keep rate (typically 0.1-0.5). Surviving elements are rescaled by 1/p to preserve expected magnitude.

The reasoning: fine-tuning changes are heavily redundant. The learned behavior is distributed across tons of weights, and you can drop most of them without losing the capability. By randomly picking a different subset for each model, you cut down on conflicting signals while keeping each model's overall contribution intact.

DARE pairs with either linear merging (dare_linear) or TIES sign election (dare_ties).
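Dropout plus rescaling fits in a few lines. A sketch (a toy xorshift PRNG keeps it self-contained; alloy's actual RNG and seeding may differ):

```rust
/// DARE dropout on one task vector: keep each element with probability p,
/// rescale survivors by 1/p so the expected magnitude is unchanged.
fn dare_dropout(tau: &[f32], p: f32, seed: u64) -> Vec<f32> {
    let mut state = seed.max(1); // xorshift64 state must be nonzero
    tau.iter().map(|&t| {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        // Map the top 53 bits to a uniform value in [0, 1).
        let u = (state >> 11) as f32 / (1u64 << 53) as f32;
        if u < p { t / p } else { 0.0 }
    }).collect()
}

fn main() {
    let tau = vec![0.5_f32; 100_000];
    let kept = dare_dropout(&tau, 0.25, 42);
    // ~25% of elements survive at value 2.0, so the mean stays near 0.5.
    let mean: f32 = kept.iter().sum::<f32>() / kept.len() as f32;
    println!("mean after dropout: {mean}");
}
```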

DELLA

DELLA (Deep et al., 2024) builds on DARE with one change: dropout probability depends on weight magnitude. Bigger weight changes are more likely to be important, so they get dropped less often. Per-element dropout probability:

p_{\text{drop},j} = (1-d)\left(1 - \frac{|\tau_j|}{\max|\tau|}\right)^\epsilon

d is the base density (like DARE's keep rate), |\tau_j| is the task vector magnitude at position j, and \epsilon controls how aggressively dropout favors large elements. At \epsilon = 0 you get uniform dropout (plain DARE). As \epsilon increases, small changes get dropped more while large changes survive.
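The probability formula maps directly to code (hypothetical helper that returns the per-element drop probabilities):

```rust
/// DELLA per-element drop probability: larger |tau_j| -> smaller p_drop.
fn della_p_drop(tau: &[f32], d: f32, eps: f32) -> Vec<f32> {
    let max = tau.iter().fold(0f32, |m, &t| m.max(t.abs()));
    tau.iter()
        .map(|&t| (1.0 - d) * (1.0 - t.abs() / max).powf(eps))
        .collect()
}

fn main() {
    // d = 0.4, eps = 2: the tiny change is dropped most often,
    // the largest change is never dropped.
    let p = della_p_drop(&[0.01, 0.5, 1.0], 0.4, 2.0);
    println!("{p:?}");
}
```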

Model comparison

alloy also has a diff command for comparing two models tensor-by-tensor. Useful for seeing how much a fine-tune changed the base, or sanity-checking a merge. For each shared tensor, we compute four metrics in a single parallel pass:

Cosine similarity - directional alignment between two weight vectors (1.0 = same direction, 0.0 = orthogonal). L2 distance - total magnitude of change. Max absolute difference - the single weight that moved the most. Mean absolute difference - average change across all elements.

\text{cosine}(\mathbf{a},\mathbf{b}) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\;\sqrt{\sum_i b_i^2}} \qquad \|\mathbf{a}-\mathbf{b}\|_2 = \sqrt{\sum_i(a_i-b_i)^2}

Tensors that exist in only one model, or have mismatched shapes, get reported separately. Same streaming, one-tensor-at-a-time approach as the merge pipeline.
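All four metrics fall out of one pass over the pair with F64 accumulators. A sketch (struct and function names illustrative):

```rust
/// The four diff metrics for one tensor pair.
struct DiffStats { cosine: f64, l2: f64, max_abs: f64, mean_abs: f64 }

/// Single pass: accumulate dot product, both squared norms, and the
/// difference statistics together, all in f64.
fn diff(a: &[f32], b: &[f32]) -> DiffStats {
    let (mut dot, mut na, mut nb) = (0f64, 0f64, 0f64);
    let (mut sq, mut abs_sum, mut max_abs) = (0f64, 0f64, 0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        let d = x - y;
        dot += x * y;
        na += x * x;
        nb += y * y;
        sq += d * d;
        abs_sum += d.abs();
        max_abs = max_abs.max(d.abs());
    }
    DiffStats {
        cosine: dot / (na.sqrt() * nb.sqrt()),
        l2: sq.sqrt(),
        max_abs,
        mean_abs: abs_sum / a.len() as f64,
    }
}

fn main() {
    let s = diff(&[1.0, 0.0, 2.0], &[1.0, 1.0, 2.0]);
    println!("cos={:.3} l2={:.3} max={} mean={:.3}", s.cosine, s.l2, s.max_abs, s.mean_abs);
}
```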

Benchmarks

We ran alloy against mergekit on synthetic 1.24B-parameter models in FP16 (~2.3 GB each), on an Apple M2 Max with 12 cores and 34 GB unified memory. Everything was timed with hyperfine (minimum 3 runs, warmup excluded).

Method    alloy     mergekit   Performance
linear    1.75 s    2.7 s      1.5x faster
slerp     1.54 s    2.8 s      1.8x faster
dare      5.0 s     3.1 s      1.6x slower
ties      15 s      2.7 s      5.5x slower

Where alloy wins

Linear and SLERP are IO-bound. The merge computation itself - a weighted sum, some trig - is trivial compared to reading tensor data off disk. Our mmap reader skips Python startup, PyTorch init, and mergekit's repeated file open/close cycles in LazyTensorLoader. Data goes straight from disk pages into computation via zero-copy memory mapping. No deserialization step in between.

Limitations

Compute-heavy methods are slower. TIES and DARE involve per-element magnitude sorting, random mask generation, and conditional accumulation across millions of floats. PyTorch runs these through hand-tuned SIMD-vectorized C++ kernels (ATen). Our inner loops use rayon for parallelism but don't have explicit SIMD intrinsics yet. For these methods, the IO overhead we eliminate isn't enough to offset PyTorch's raw math advantage.

Benchmarks are synthetic. We tested on 1.24B-parameter models. Performance characteristics at 7B, 13B, and 70B scale could look different, especially as tensor sizes grow and the IO-vs-compute balance shifts.

Memory not formally profiled. Our streaming design guarantees at most one tensor per input model in memory at any time, but we haven't published memory benchmarks with hard numbers yet.

No GGUF support. Both alloy and mergekit operate on safetensors files. Neither can merge GGUF-quantized models directly, which means you can't merge models already quantized for llama.cpp or ollama without converting back to safetensors first.

No evaluation loop. Neither tool tells you whether a merge actually improved the model. You merge, then manually run evals. Automated merge-evaluate-iterate pipelines (like mergekit's experimental CMA-ES support) are still an open problem for both tools.

No GPU acceleration. All merging happens on CPU. For IO-bound methods this is fine, but compute-heavy algorithms could benefit from GPU offloading. Neither alloy nor mergekit currently uses the GPU for the merge computation itself (mergekit uses PyTorch's CPU kernels, not CUDA).

Future work

SIMD vectorization. Our biggest performance gap is on compute-heavy methods like TIES and DARE. PyTorch's ATen kernels are hand-tuned; our inner loops are scalar Rust. Adding explicit SIMD via std::arch or the wide crate for magnitude sorting, mask generation, and conditional accumulation should close the gap.

GGUF format support. A lot of local inference engines (llama.cpp, ollama) use GGUF, not safetensors. GGUF read/write would let us merge quantized models directly without a conversion round-trip.

Real-scale benchmarks. We've only benchmarked on synthetic 1.24B models so far. Profiling at 7B, 13B, and 70B would show how the IO vs compute trade-off shifts with scale and put real numbers on the memory advantage.

GPU acceleration. For compute-bound methods, offloading the merge kernel to a GPU via wgpu or CUDA could pair our streaming IO with hardware-accelerated math. The MergeOp trait already isolates computation, so a GPU backend would be a drop-in replacement without touching the rest of the pipeline.

Evolutionary merging. mergekit supports CMA-ES optimization to search for optimal merge weights automatically. Implementing this in alloy would mean adding an evaluation loop (merge, benchmark, adjust weights) - architecturally different from our current single-pass design, but doable.

References

mergekit - Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, Jacob Solawetz. mergekit: Tools for Merging Pretrained Language Models. 2024. The tool that started the model merging ecosystem.

Task Arithmetic - Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, Ali Farhadi. Editing Models with Task Arithmetic. ICLR 2023.

TIES - Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, Mohit Bansal. TIES-Merging: Resolving Interference When Merging Models. NeurIPS 2023.

DARE - Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li. Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch. ICML 2024.

DELLA - Pala Tej Deep, Rishabh Bhardwaj, Soujanya Poria. DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling. 2024.

SLERP for neural networks - Originally applied to model merging by the open-source community; the technique itself is from Ken Shoemake's 1985 SIGGRAPH paper on quaternion interpolation.