alloy

A high-performance, Rust-native CLI for merging, diffing, and converting LLM weights.

March 2026 · GitHub

Overview

alloy is a Rust CLI for merging, diffing, and converting large language model weights. It supports 10 merge algorithms (linear, SLERP, TIES, DARE, DELLA, and others), reads mergekit-compatible YAML configs, and outputs standard safetensors files.

Model merging is just arithmetic on float arrays. It doesn't need Python, PyTorch, or a GPU. alloy memory-maps model files and processes one tensor at a time, so peak memory scales with the largest single tensor, not the whole model. On IO-bound methods like linear and SLERP, that makes it 1.5-1.8x faster than mergekit. For compute-heavy methods like TIES and DARE, mergekit with PyTorch still wins.

It compiles to a ~4 MB static binary, runs on CPU-only hardware, and produces bit-identical results to mergekit for the same config.

Why merge models?

Training LLMs is expensive. Fine-tuning one for a specific task - code generation, medical QA, creative writing - takes hundreds of GPU hours and a curated dataset. But if you want a model that's good at two things, you're stuck. You can't retrain from scratch on a combined dataset (too expensive), and running both models at inference time doubles your cost.

Model merging skips all of that. You take two fine-tuned models that share a common base and combine their weights into a single model. Same inference cost as running one model. No extra training, no access to the original datasets, no additional compute at serving time.

Why does this work at all? Fine-tuning moves weights in a structured way. The difference between a base model and its fine-tune - a task vector - captures what the model "learned" during training. Combine task vectors from different fine-tunes and you compose capabilities. A code model plus a chat model gives you a model that can do both, usually without much degradation in either skill.

This isn't a niche technique. Some of the most downloaded open models on HuggingFace - NeuralBeagle, Goliath-120B, and dozens of community favorites - are merges, not models trained from scratch.

Two fine-tuned neural networks merged into a single model by combining their weight tensors

Motivation

mergekit (Goddard et al., Arcee AI) is the standard tool for model merging, and the community has built great things with it. But it's slow, and the reason is structural: it treats merging as a PyTorch workload.

Here's what actually happens when you run a mergekit merge. Python starts up, imports dozens of modules, initializes the PyTorch runtime, builds a task DAG, and starts loading model shards through LazyTensorLoader. Each tensor access opens a file, seeks to an offset, reads bytes, deserializes into a PyTorch tensor, and allocates memory. A 7B model in FP16 is ~14 GB of raw weight data, but mergekit's peak memory can blow past 64 GB because it keeps multiple tensors alive at once during the merge.

The math behind merging is relatively simple - weighted sums, some trig, element-wise operations on float arrays. None of it needs autograd, a CUDA runtime, or PyTorch's tensor abstraction. Just fast reads and arithmetic.

We took that idea and ran with it. alloy is a Rust CLI that memory-maps safetensors files, processes one tensor at a time, and writes results straight to disk. The whole thing compiles to a ~4 MB static binary with zero runtime dependencies.

Element-wise weighted combination of model weight tensors

mergekit

  • Python + PyTorch runtime
  • DAG-based task scheduler
  • LazyTensorLoader (repeated file open/close)
  • Multiple tensors in memory at once
  • 64 GB+ peak RAM for 7B merges

alloy

  • Native Rust, no runtime
  • Sequential streaming pipeline
  • Zero-copy mmap (no deserialization)
  • One tensor in memory at a time
  • Peak RAM proportional to largest tensor

Both produce standard safetensors output. Same format, same files, loadable by any HuggingFace tool or inference server.

Architecture

Everything in alloy follows one constraint: peak memory should scale with the largest single tensor, not the entire model. A 7B model has ~200 tensors. There's no reason to hold more than one in memory at a time.

TensorStore

When alloy opens a model, it memory-maps every shard file via mmap. The OS maps file contents into the process's virtual address space without actually reading anything into RAM. Tensor data stays on disk until you touch it, and the kernel handles page-in/page-out automatically.

On open, we parse each shard's JSON header (tensor names and byte offsets) and build a HashMap index. Every lookup after that is O(1). F32 tensors are accessed as zero-copy slices directly into the mmap region. F16 and BF16 get converted to F32 on the fly, which does allocate, but only for one tensor at a time.

pub fn tensor_bytes(&self, name: &str) -> &[u8] {
    let loc = &self.index[name];           // O(1) HashMap lookup
    let shard = &self.shards[loc.shard_idx]; // which file this tensor lives in
    &shard.mmap[loc.abs_start..loc.abs_end]  // zero-copy slice into mmap
}
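The on-the-fly dtype widening is simplest for BF16, which is just the top 16 bits of an F32 bit pattern. A sketch of that case (helper names are illustrative, not alloy's actual API; F16 needs a real exponent remap and is omitted here):

```rust
/// BF16 is the top 16 bits of an F32 bit pattern, so widening is a shift.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Widen a raw little-endian BF16 byte slice into a fresh F32 buffer.
/// This Vec is the only per-tensor allocation on the read path.
fn bf16_slice_to_f32(raw: &[u8]) -> Vec<f32> {
    raw.chunks_exact(2)
        .map(|b| bf16_to_f32(u16::from_le_bytes([b[0], b[1]])))
        .collect()
}

fn main() {
    // 0x3F80 is 1.0 in BF16 (sign 0, exponent 127, mantissa 0).
    assert_eq!(bf16_to_f32(0x3F80), 1.0);
    // Raw little-endian BF16 bytes for [1.0, 2.0].
    println!("{:?}", bf16_slice_to_f32(&[0x80, 0x3F, 0x00, 0x40])); // [1.0, 2.0]
}
```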

StreamingWriter

Writing safetensors is trickier than reading them. The format puts a header with byte offsets for every tensor at the start of the file, so you need to know exact sizes before writing a single byte. We handle this in two phases:

Phase 1 (planning): walk all tensor names and shapes from the merge config, compute byte sizes, and assign them to output shards. This gives us a complete header and a shard assignment map. No tensor data gets read here - metadata only.

Phase 2 (writing): for each tensor in order, read from the TensorStore, run the merge op, cast to the output dtype, and write directly to the output file at the precomputed offset. Each tensor gets dropped from memory right after writing.

Writes go to a .tmp file first, then get atomically renamed on completion. If alloy crashes mid-write, you're left with a .tmp file and the original output stays untouched. No partial or corrupted files.

// Phase 1: plan shard layout (metadata only)
let plan = planner.assign_shards(&tensor_names, &shapes, max_shard_size);

// Phase 2: stream tensors one at a time
for name in &plan.tensor_order {
    let inputs: Vec<_> = stores.iter()            // one TensorStore per input model
        .map(|s| s.tensor_as_f32(name))
        .collect::<Result<_, _>>()?;              // mmap reads
    let merged = merge_op.merge(&inputs)?;        // compute
    writer.write_tensor(name, &merged)?;          // write + drop
}
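The crash-safe write described above boils down to the classic tmp-then-rename pattern. A minimal sketch using only std (function name hypothetical):

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write bytes to a sibling .tmp file, fsync, then atomically rename over
/// the final path. A crash mid-write leaves the original output untouched.
fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?;           // make sure data hits disk before the rename
    fs::rename(&tmp, path)   // atomic on POSIX filesystems
}

fn main() -> io::Result<()> {
    let out = std::env::temp_dir().join("model-00001.safetensors");
    write_atomic(&out, b"header+tensors")?;
    assert_eq!(fs::read(&out)?, b"header+tensors".to_vec());
    Ok(())
}
```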

Numerical stability

Weights are stored as FP16 or BF16, but we do all merge math in F32 after conversion. For operations that accumulate across many elements - dot products for cosine similarity, norm calculations for SLERP - we use F64 accumulators to prevent floating-point drift. This actually matters: accumulating 10 million FP32 values without F64 can lose you several digits of precision.
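To see why the F64 accumulator matters, here is a toy case where an F32 running sum stops making progress entirely while an F64 one is unaffected (deliberately exaggerated; in real merges the effect is gradual precision loss, not total absorption):

```rust
/// Add `n` ones onto `start`, accumulating in f32.
fn add_ones_f32(start: f32, n: u32) -> f32 {
    let mut acc = start;
    for _ in 0..n { acc += 1.0; }
    acc
}

/// Same accumulation in f64.
fn add_ones_f64(start: f64, n: u32) -> f64 {
    let mut acc = start;
    for _ in 0..n { acc += 1.0; }
    acc
}

fn main() {
    // At magnitude 1e8, one f32 ulp is 8.0, so each +1.0 rounds straight back.
    println!("{}", add_ones_f32(1.0e8, 1_000_000)); // 100000000
    println!("{}", add_ones_f64(1.0e8, 1_000_000)); // 101000000
}
```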

Parallelism comes from rayon's par_chunks, with each thread processing 8,192 elements independently. Per-chunk results are then folded sequentially (not reduced in parallel) so that output is deterministic regardless of thread count or scheduling. Same merge twice = bit-identical results.
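The determinism recipe can be sketched with std threads instead of rayon to stay dependency-free: compute per-chunk partials in parallel, collect them in chunk order, then fold sequentially.

```rust
use std::thread;

/// Deterministic parallel reduction: fixed-size chunks are summed in
/// parallel, partials are collected in chunk order, then folded
/// sequentially. (alloy uses rayon's par_chunks; std threads keep this
/// sketch self-contained.)
fn chunked_sum(xs: &[f32], chunk: usize) -> f64 {
    let partials: Vec<f64> = thread::scope(|s| {
        let handles: Vec<_> = xs
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&x| x as f64).sum::<f64>()))
            .collect();
        // join() order == chunk order, regardless of which thread finishes first
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    partials.into_iter().sum() // sequential fold -> bit-identical across runs
}

fn main() {
    let xs: Vec<f32> = (0..100_000).map(|i| (i % 7) as f32 * 0.25).collect();
    let (a, b) = (chunked_sum(&xs, 8_192), chunked_sum(&xs, 8_192));
    assert_eq!(a.to_bits(), b.to_bits()); // same result, bit for bit
    println!("{a}");
}
```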

Algorithms

We implement 10 merge methods. They're all interchangeable - the pipeline doesn't know or care which algorithm is running, it just feeds in tensors and gets a merged result back.

  • linear: weighted average of N models
  • slerp: spherical interpolation between 2 models
  • nuslerp: multi-model SLERP via sequential pairwise interpolation
  • task_arithmetic: base + scaled sum of task vectors
  • ties: trim, elect sign, disjoint merge
  • dare_linear: random dropout + rescaled linear merge
  • dare_ties: DARE dropout + TIES sign election
  • della_linear: magnitude-aware dropout + linear merge
  • della: magnitude-aware dropout + TIES sign election
  • passthrough: concatenate layer ranges from different models
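That interchangeability comes down to a single trait boundary between the pipeline and the math. A sketch (trait and method names are illustrative; alloy's actual MergeOp signature may differ):

```rust
/// Illustrative algorithm interface: the pipeline hands in one tensor's
/// worth of f32 data from each input model and gets the merged tensor back.
trait MergeOp {
    fn merge(&self, inputs: &[&[f32]]) -> Vec<f32>;
}

/// The simplest implementation: a weighted average.
struct Linear { weights: Vec<f32> }

impl MergeOp for Linear {
    fn merge(&self, inputs: &[&[f32]]) -> Vec<f32> {
        let n = inputs[0].len();
        (0..n).map(|j| {
            inputs.iter().zip(&self.weights).map(|(m, w)| w * m[j]).sum()
        }).collect()
    }
}

fn main() {
    // The pipeline only sees `dyn MergeOp`, never the concrete algorithm.
    let op: Box<dyn MergeOp> = Box::new(Linear { weights: vec![0.5, 0.5] });
    let out = op.merge(&[&[1.0, 2.0], &[3.0, 4.0]]);
    println!("{out:?}"); // [2.0, 3.0]
}
```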

Linear and Task Arithmetic

The simplest merge is a weighted average: \theta_{\text{merged}} = \sum_i w_i \theta_i. It works surprisingly well when models share the same base, because fine-tuning tends to push weights in similar directions.

Task arithmetic goes one step further by operating on task vectors rather than raw weights. A task vector is just the diff between a fine-tuned model and its base: \tau_i = \theta_i - \theta_{\text{base}}. You add a weighted sum of task vectors back onto the base, scaled by a factor \lambda:

\theta_{\text{merged}} = \theta_{\text{base}} + \lambda \sum_i w_i \, \tau_i

\lambda controls how far the merged model strays from the base. Increase it for stronger specialization, keep it low to stay closer to base model behavior.
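On flat F32 slices the formula is a few lines of code. A sketch (hypothetical helper, with F64 accumulation as in the real pipeline):

```rust
/// theta_merged = theta_base + lambda * sum_i w_i * (theta_i - theta_base),
/// computed element-wise over flat f32 slices.
fn task_arithmetic(base: &[f32], models: &[&[f32]], weights: &[f32], lambda: f32) -> Vec<f32> {
    base.iter().enumerate().map(|(j, &b)| {
        // Weighted sum of task vectors at position j, accumulated in f64.
        let delta: f64 = models.iter().zip(weights)
            .map(|(m, &w)| w as f64 * (m[j] as f64 - b as f64))
            .sum();
        b + (lambda as f64 * delta) as f32
    }).collect()
}

fn main() {
    let base = [1.0_f32, 2.0];
    let code = [2.0_f32, 2.0]; // task vector [1.0, 0.0]
    let chat = [1.0_f32, 4.0]; // task vector [0.0, 2.0]
    let merged = task_arithmetic(&base, &[&code, &chat], &[0.5, 0.5], 1.0);
    println!("{merged:?}"); // [1.5, 3.0]
}
```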

SLERP

Linear interpolation (LERP) blends two weight vectors along a straight line through the interior of the space. Problem: the interpolated vector ends up shorter than either input. For weight tensors, that magnitude loss degrades model quality.

SLERP fixes this by interpolating along the surface of the hypersphere instead, preserving the norm throughout. Geometrically, it traces the great circle arc between two points on a sphere rather than cutting through the interior:

LERP cuts a straight chord between a and b through the sphere's interior; SLERP follows the great-circle arc spanning the angle ω
\text{slerp}(\mathbf{a}, \mathbf{b}, t) = \frac{\sin((1-t)\omega)}{\sin\omega}\,\mathbf{a} + \frac{\sin(t\omega)}{\sin\omega}\,\mathbf{b}

\omega is the angle between \mathbf{a} and \mathbf{b}, computed as \omega = \arccos\!\left(\frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}\right), and t \in [0, 1] is the interpolation parameter. When the vectors are nearly parallel (\cos\omega > 0.9995), the formula blows up numerically because \sin\omega approaches zero, so we fall back to LERP.
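The formula transcribes directly, including the near-parallel LERP fallback. A sketch (illustrative, not alloy's exact implementation):

```rust
/// Spherical interpolation between two flat weight vectors.
/// Dot product and norms use f64 accumulators.
fn slerp(a: &[f32], b: &[f32], t: f32) -> Vec<f32> {
    let (mut dot, mut na, mut nb) = (0f64, 0f64, 0f64);
    for (&x, &y) in a.iter().zip(b) {
        dot += x as f64 * y as f64;
        na += (x as f64).powi(2);
        nb += (y as f64).powi(2);
    }
    let cos_omega = (dot / (na.sqrt() * nb.sqrt())).clamp(-1.0, 1.0);
    if cos_omega > 0.9995 {
        // Nearly parallel: sin(omega) ~ 0, fall back to LERP.
        return a.iter().zip(b).map(|(&x, &y)| (1.0 - t) * x + t * y).collect();
    }
    let omega = cos_omega.acos();
    let sa = ((1.0 - t as f64) * omega).sin() / omega.sin();
    let sb = (t as f64 * omega).sin() / omega.sin();
    a.iter().zip(b)
        .map(|(&x, &y)| (sa * x as f64 + sb * y as f64) as f32)
        .collect()
}

fn main() {
    // Orthogonal unit vectors: the SLERP midpoint stays on the unit sphere,
    // whereas the LERP midpoint [0.5, 0.5] would have norm ~0.707.
    let m = slerp(&[1.0, 0.0], &[0.0, 1.0], 0.5);
    let norm = (m[0] * m[0] + m[1] * m[1]).sqrt();
    println!("{m:?}, norm = {norm}");
}
```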

TIES

Merging more than two models gets messy. Task vectors from different fine-tunes often conflict: one model pushes a weight up, another pushes it down. Average those opposing signals and they cancel each other out. TIES (Yadav et al., 2023) deals with this in three steps:

1. Trim: for each task vector, zero out the smallest elements by magnitude. Gets rid of noise from insignificant weight changes. The trim fraction (density) is a hyperparameter, typically 0.2-0.5.

2. Elect sign: at each weight position, take a weighted vote across all models to pick the "consensus" sign. If models A and C increased a weight and model B decreased it, and A+C outweigh B, the consensus is positive.

3. Disjoint merge: at each position, only keep contributions from models whose sign matches the consensus. Disagreeing models get excluded entirely:

\theta_j = \theta_{\text{base},j} + \frac{\sum_{i:\,\text{sgn}(\tau_{i,j})\,=\,s_j} w_i\,\tau_{i,j}}{\sum_{i:\,\text{sgn}(\tau_{i,j})\,=\,s_j} w_i}

where s_j is the elected sign at position j. You end up preserving the direction of change that most models agree on instead of diluting it with conflicting signals.
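The elect-and-merge steps for a single weight position might look like this (trim step omitted; task vectors assumed already trimmed; names hypothetical):

```rust
/// TIES sign election and disjoint merge at one weight position.
fn ties_position(base: f32, taus: &[f32], weights: &[f32]) -> f32 {
    // Elect sign: weighted vote over the task-vector values.
    let vote: f32 = taus.iter().zip(weights).map(|(&t, &w)| w * t).sum();
    let sign = vote.signum();
    // Disjoint merge: weighted mean over models agreeing with the elected sign.
    let (mut num, mut den) = (0f32, 0f32);
    for (&t, &w) in taus.iter().zip(weights) {
        if t != 0.0 && t.signum() == sign {
            num += w * t;
            den += w;
        }
    }
    if den == 0.0 { base } else { base + num / den }
}

fn main() {
    // Models A and C push the weight up, B pushes it down. A+C win the
    // vote, so B's contribution is excluded instead of diluting the merge.
    let merged = ties_position(0.5, &[0.2, -0.3, 0.4], &[1.0, 1.0, 1.0]);
    println!("{merged}"); // base 0.5 + mean(0.2, 0.4) = ~0.8
}
```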

DARE

DARE (Yu et al., 2023) takes a completely different approach: random dropout. Before merging, each element of each task vector gets randomly zeroed out with probability 1-p, where p is the keep rate (typically 0.1-0.5). Surviving elements are rescaled by 1/p to preserve expected magnitude.

The reasoning: fine-tuning changes are heavily redundant. The learned behavior is distributed across tons of weights, and you can drop most of them without losing the capability. By randomly picking a different subset for each model, you cut down on conflicting signals while keeping each model's overall contribution intact.

DARE pairs with either linear merging (dare_linear) or TIES sign election (dare_ties).
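Dropout plus rescaling fits in a few lines. A sketch (a toy xorshift PRNG keeps it self-contained; alloy's actual RNG and seeding may differ):

```rust
/// DARE dropout on one task vector: keep each element with probability p,
/// rescale survivors by 1/p so the expected magnitude is unchanged.
fn dare_dropout(tau: &[f32], p: f32, seed: u64) -> Vec<f32> {
    let mut state = seed.max(1); // xorshift64 state must be nonzero
    tau.iter().map(|&t| {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        // Map the top 53 bits to a uniform value in [0, 1).
        let u = (state >> 11) as f32 / (1u64 << 53) as f32;
        if u < p { t / p } else { 0.0 }
    }).collect()
}

fn main() {
    let tau = vec![0.5_f32; 100_000];
    let kept = dare_dropout(&tau, 0.25, 42);
    // ~25% of elements survive at value 2.0, so the mean stays near 0.5.
    let mean: f32 = kept.iter().sum::<f32>() / kept.len() as f32;
    println!("mean after dropout: {mean}");
}
```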

DELLA

DELLA (Deep et al., 2024) builds on DARE with one change: dropout probability depends on weight magnitude. Bigger weight changes are more likely to be important, so they get dropped less often. Per-element dropout probability:

p_{\text{drop},j} = (1-d)\left(1 - \frac{|\tau_j|}{\max|\tau|}\right)^\epsilon

d is the base density (like DARE's keep rate), |\tau_j| is the task vector magnitude at position j, and \epsilon controls how aggressively dropout favors large elements. At \epsilon = 0 you get uniform dropout (plain DARE). As \epsilon increases, small changes get dropped more while large changes survive.
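The probability formula maps directly to code (hypothetical helper that returns the per-element drop probabilities):

```rust
/// DELLA per-element drop probability: larger |tau_j| -> smaller p_drop.
fn della_p_drop(tau: &[f32], d: f32, eps: f32) -> Vec<f32> {
    let max = tau.iter().fold(0f32, |m, &t| m.max(t.abs()));
    tau.iter()
        .map(|&t| (1.0 - d) * (1.0 - t.abs() / max).powf(eps))
        .collect()
}

fn main() {
    // d = 0.4, eps = 2: the tiny change is dropped most often,
    // the largest change is never dropped.
    let p = della_p_drop(&[0.01, 0.5, 1.0], 0.4, 2.0);
    println!("{p:?}");
}
```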

Model comparison

alloy also has a diff command for comparing two models tensor-by-tensor. Useful for seeing how much a fine-tune changed the base, or sanity-checking a merge. For each shared tensor, we compute four metrics in a single parallel pass:

Cosine similarity - directional alignment between two weight vectors (1.0 = same direction, 0.0 = orthogonal). L2 distance - total magnitude of change. Max absolute difference - the single weight that moved the most. Mean absolute difference - average change across all elements.

\text{cosine}(\mathbf{a},\mathbf{b}) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\;\sqrt{\sum_i b_i^2}} \qquad \|\mathbf{a}-\mathbf{b}\|_2 = \sqrt{\sum_i(a_i-b_i)^2}

Tensors that exist in only one model, or have mismatched shapes, get reported separately. Same streaming, one-tensor-at-a-time approach as the merge pipeline.
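All four metrics fall out of one pass over the pair with F64 accumulators. A sketch (struct and function names illustrative):

```rust
/// The four diff metrics for one tensor pair.
struct DiffStats { cosine: f64, l2: f64, max_abs: f64, mean_abs: f64 }

/// Single pass: accumulate dot product, both squared norms, and the
/// difference statistics together, all in f64.
fn diff(a: &[f32], b: &[f32]) -> DiffStats {
    let (mut dot, mut na, mut nb) = (0f64, 0f64, 0f64);
    let (mut sq, mut abs_sum, mut max_abs) = (0f64, 0f64, 0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        let d = x - y;
        dot += x * y;
        na += x * x;
        nb += y * y;
        sq += d * d;
        abs_sum += d.abs();
        max_abs = max_abs.max(d.abs());
    }
    DiffStats {
        cosine: dot / (na.sqrt() * nb.sqrt()),
        l2: sq.sqrt(),
        max_abs,
        mean_abs: abs_sum / a.len() as f64,
    }
}

fn main() {
    let s = diff(&[1.0, 0.0, 2.0], &[1.0, 1.0, 2.0]);
    println!("cos={:.3} l2={:.3} max={} mean={:.3}", s.cosine, s.l2, s.max_abs, s.mean_abs);
}
```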

Benchmarks

We ran alloy against mergekit on synthetic 1.24B-parameter models in FP16 (~2.3 GB each), on an Apple M2 Max with 12 cores and 34 GB unified memory. Everything was timed with hyperfine (minimum 3 runs, warmup excluded).

Method    alloy     mergekit   Performance
linear    1.75 s    2.7 s      1.5x faster
slerp     1.54 s    2.8 s      1.8x faster
dare      5.0 s     3.1 s      1.6x slower
ties      15 s      2.7 s      5.5x slower

Where alloy wins

Linear and SLERP are IO-bound. The merge computation itself - a weighted sum, some trig - is trivial compared to reading tensor data off disk. Our mmap reader skips Python startup, PyTorch init, and mergekit's repeated file open/close cycles in LazyTensorLoader. Data goes straight from disk pages into computation via zero-copy memory mapping. No deserialization step in between.

Limitations

Compute-heavy methods are slower. TIES and DARE involve per-element magnitude sorting, random mask generation, and conditional accumulation across millions of floats. PyTorch runs these through hand-tuned SIMD-vectorized C++ kernels (ATen). Our inner loops use rayon for parallelism but don't have explicit SIMD intrinsics yet. For these methods, the IO overhead we eliminate isn't enough to offset PyTorch's raw math advantage.

Benchmarks are synthetic. We tested on 1.24B-parameter models. Performance characteristics at 7B, 13B, and 70B scale could look different, especially as tensor sizes grow and the IO-vs-compute balance shifts.

Memory not formally profiled. Our streaming design guarantees at most one tensor per input model in memory at any time, but we haven't published memory benchmarks with hard numbers yet.

No GGUF support. Both alloy and mergekit operate on safetensors files. Neither can merge GGUF-quantized models directly, which means you can't merge models already quantized for llama.cpp or ollama without converting back to safetensors first.

No evaluation loop. Neither tool tells you whether a merge actually improved the model. You merge, then manually run evals. Automated merge-evaluate-iterate pipelines (like mergekit's experimental CMA-ES support) are still an open problem for both tools.

No GPU acceleration. All merging happens on CPU. For IO-bound methods this is fine, but compute-heavy algorithms could benefit from GPU offloading. Neither alloy nor mergekit currently uses the GPU for the merge computation itself (mergekit uses PyTorch's CPU kernels, not CUDA).

Future work

SIMD vectorization. Our biggest performance gap is on compute-heavy methods like TIES and DARE. PyTorch's ATen kernels are hand-tuned; our inner loops are scalar Rust. Adding explicit SIMD via std::arch or the wide crate for magnitude sorting, mask generation, and conditional accumulation should close the gap.

GGUF format support. A lot of local inference engines (llama.cpp, ollama) use GGUF, not safetensors. GGUF read/write would let us merge quantized models directly without a conversion round-trip.

Real-scale benchmarks. We've only benchmarked on synthetic 1.24B models so far. Profiling at 7B, 13B, and 70B would show how the IO vs compute trade-off shifts with scale and put real numbers on the memory advantage.

GPU acceleration. For compute-bound methods, offloading the merge kernel to a GPU via wgpu or CUDA could pair our streaming IO with hardware-accelerated math. The MergeOp trait already isolates computation, so a GPU backend would be a drop-in replacement without touching the rest of the pipeline.

Evolutionary merging. mergekit supports CMA-ES optimization to search for optimal merge weights automatically. Implementing this in alloy would mean adding an evaluation loop (merge, benchmark, adjust weights) - architecturally different from our current single-pass design, but doable.

References

mergekit - Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, Jacob Solawetz. mergekit: Tools for Merging Pretrained Language Models. 2024. The tool that started the model merging ecosystem.

Task Arithmetic - Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, Ali Farhadi. Editing Models with Task Arithmetic. ICLR 2023.

TIES - Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, Mohit Bansal. TIES-Merging: Resolving Interference When Merging Models. NeurIPS 2023.

DARE - Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li. Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch. ICML 2024.

DELLA - Pala Tej Deep, Rishabh Bhardwaj, Soujanya Poria. DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling. 2024.

SLERP for neural networks - Originally applied to model merging by the open-source community; the technique itself is from Ken Shoemake's 1985 SIGGRAPH paper on quaternion interpolation.