What is Model Merging? A Technique to Combine Multiple LLMs Without Training to Improve Performance

Model merging is a technique that mathematically combines the weight parameters of multiple fine-tuned LLMs (Large Language Models) to enhance the capabilities of a single model without any additional training.

Its greatest advantage is the ability to integrate different skills into a single model at virtually no GPU training cost, and it is attracting attention in both research and practice as a new option for LLM customization. Since Task Arithmetic was announced in late 2022, algorithms including TIES and DARE have emerged in rapid succession, and the approach has spread quickly, particularly within the open-weight model community.

This article provides a systematic explanation covering the basic concepts and theoretical background of model merging, a comparison of representative methods, how to use the practical tool MergeKit, and use cases along with points of caution. The goal is to provide content that engineers and machine learning practitioners can immediately put into practice after reading.

Conclusion: Model merging is a technique that mathematically combines the weights of multiple LLMs to produce a single model, with its greatest distinguishing feature being that no additional training is required.

We will walk through the basic concepts of model merging, how it differs from fine-tuning, and the background behind the growing interest in this approach.

What Does "Weight Synthesis" of Models Mean?

When first hearing about model merging, the phrase "combining models together" might conjure up an image of mixing output text. However, what actually takes place is an operation that numerically sums or interpolates the model's parameters (weights) themselves.

An LLM is a collection of weight matrices containing billions to hundreds of billions of floating-point numbers. A model that has undergone fine-tuning encodes knowledge about specific tasks within these numerical values. Model merging performs element-wise operations on the weight matrices of multiple models and integrates them into a single new set of weight matrices.

As a concrete illustration, consider the most fundamental method, weighted averaging.

  • Weight matrix of Model A: W_A
  • Weight matrix of Model B: W_B
  • Merged weights: W_merged = α × W_A + (1−α) × W_B, where α is a blending coefficient between 0 and 1

Through this operation alone, there is potential to integrate Model A's Japanese language capability and Model B's coding capability into a single model.
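The weighted average above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: plain lists of floats stand in for real tensors, and the parameter names (`layer0.weight`, `layer0.bias`) and the helper `weighted_average` are hypothetical.

```python
from typing import Dict, List

# A toy "state dict": parameter name -> flat list of floats.
# Real checkpoints hold tensors; plain lists keep the sketch dependency-free.
Weights = Dict[str, List[float]]


def weighted_average(w_a: Weights, w_b: Weights, alpha: float) -> Weights:
    """Return alpha * W_A + (1 - alpha) * W_B, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if w_a.keys() != w_b.keys():
        raise ValueError("models must expose the same parameter names")
    merged: Weights = {}
    for name, values_a in w_a.items():
        values_b = w_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * a + (1.0 - alpha) * b
                        for a, b in zip(values_a, values_b)]
    return merged


# Hypothetical two-parameter "models" used only for illustration.
model_a = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
model_b = {"layer0.weight": [3.0, 6.0], "layer0.bias": [1.0]}
merged = weighted_average(model_a, model_b, alpha=0.5)
print(merged)
```

The explicit name and length checks mirror the prerequisite discussed next: the operation is only defined when both models have identically shaped parameters.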

There is one important prerequisite. The models must share the same architecture, and in practice merging works only between models derived from the same base model (foundation model). Models with different architectures have weight matrices with mismatched dimensions, making direct computation impossible. When models originate from the same base, the "semantic positions" of the weights are aligned, so the result is less likely to break down when they are combined.

Differences from Fine-Tuning and Knowledge Distillation

Model merging is often confused with other techniques that share similar goals, but the approaches are fundamentally different.

The key differences can be summarized as follows:

  • Fine-tuning: Retrains the model's weights using additional data. Requires GPU computational cost and the preparation of a new dataset.
  • Knowledge Distillation: Trains a smaller student model using the output distribution of a larger teacher model. The primary goal is reducing inference cost, and a training process is likewise involved.
  • Model merging: Simply combines existing weights mathematically, with no additional training whatsoever. No dataset or GPU retraining is required.

As a decision-making framework: fine-tuning is appropriate when the goal is adapting to a new task or improving accuracy in a specific domain, while knowledge distillation is effective when shrinking the model or reducing deployment costs is the priority. When you already have multiple fine-tuned models on hand and want to consolidate their capabilities into one, model merging becomes the lowest-cost option.

As a concrete example, after separately fine-tuning a "model strong at code generation" and a "model strong at multilingual support," merging the weights of both can produce a single model that possesses capabilities from each. This is an approach that is difficult to achieve through retraining alone.

Note, however, that model merging tends to be most effective when the models share a common base, and cannot be applied to models with different architectures.

Background and Origins of the Growing Interest in Model Merging

"I want to create a model with multiple capabilities while keeping GPU costs down, but retraining isn't realistic"—many engineers have likely felt this way. Model merging has emerged precisely as an answer to that challenge.

There are three main trends behind the growing attention it has received.

  • The spread of open-weight models: Open-weight models, led by the LLaMA family, have been widely released, giving rise to a large number of fine-tuned derivative models. The need to combine models specialized for different tasks has naturally grown.
  • The barrier of retraining costs: Full fine-tuning of large-scale models requires expensive GPU resources and lengthy training times. Demand has grown for methods that can integrate capabilities without additional training.
  • Accumulation of theoretical backing: From late 2022 through 2023, peer-reviewed papers on the mathematical manipulation of weights were published in rapid succession, including Task Arithmetic (arXiv:2212.04089, ICLR 2023) and TIES-Merging (arXiv:2306.01708, NeurIPS 2023).

In particular, the Task Arithmetic research presented an intuitive framework: subtracting the base model's weights from the fine-tuned model's weights yields a delta called a task vector, and skills can be added or removed simply by adding or subtracting these task vectors. This framing captured the interest of practitioners.

Subsequently, methods such as DARE (arXiv:2311.03099, ICML 2024), which leverages the redundancy of delta parameters, were also introduced, and the approach of integrating multiple capabilities at low cost spread rapidly from the research community into practical use.

Why Does Model Merging Improve Performance? Theoretical Basis

Conclusion: The reason model merging improves performance lies in the properties of the loss landscape, where linear interpolation of weights functions effectively, and in the mechanism by which task-specific knowledge is distributed and embedded across weights.

However, choosing the wrong method can cause interference problems, leading to degraded performance.

Loss Landscape and Linear Interpolation of Weights

It is intuitive to assume that simply averaging weights would degrade performance, but in practice, models derived from the same base model have been reported to maintain or even improve performance through linear interpolation. The key to understanding this paradox lies in the structure of the "loss landscape."

The loss landscape refers to the topography of the loss function in the space of a model's weight parameters. Research in deep learning suggests that the weights of sufficiently trained models tend to converge toward "wide, flat valleys."

  • Flat minima: Regions where the loss increases very little even when weights shift slightly
  • Sharp minima: Unstable regions where even small changes in weights cause the loss to spike sharply

Multiple models fine-tuned from the same base model are said to converge toward nearby flat valleys on the same loss landscape. As a result, the midpoint obtained by linearly interpolating the weights of two such models tends to remain in a low-loss region.

Expressed mathematically, the merged weights can be written as follows:

  • θ_merged = α × θ_A + (1 − α) × θ_B (where α is an interpolation coefficient between 0 and 1)
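One way to make the "same valley" idea concrete is to measure how far the loss rises along the straight line between two weight vectors. The sketch below is a toy under stated assumptions: `double_well` is a made-up one-dimensional loss with two valleys, and `loss_barrier` is a hypothetical helper, not a function from any library.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(theta_a: Vector, theta_b: Vector, alpha: float) -> Vector:
    """Point on the straight line between two weight vectors."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(theta_a: Vector, theta_b: Vector,
                 loss_fn: Callable[[Vector], float], steps: int = 10) -> float:
    """Largest amount by which the loss on the path exceeds the
    straight-line interpolation of the two endpoint losses."""
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for i in range(steps + 1):
        alpha = i / steps
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = alpha * loss_a + (1.0 - alpha) * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


def double_well(theta: Vector) -> float:
    """Hypothetical toy loss with two separate valleys, at -1 and +1."""
    return sum((x * x - 1.0) ** 2 for x in theta)


# Two points in the same valley versus two points in different valleys.
same_valley = loss_barrier([0.9], [1.1], double_well)
different_valleys = loss_barrier([-1.0], [1.0], double_well)
print(same_valley, different_valleys)
```

In this toy, the pair inside one valley shows no barrier, while the pair in separate valleys has to cross a ridge. A real check would evaluate the actual validation loss of the interpolated model at each step.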

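For this interpolation to work effectively, the models must share the same architecture and be derived from the same base model, so that their weights lie in the same region of the loss landscape.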

How Task-Specific Knowledge Is Embedded in Weights

When fine-tuning is performed, how do a model's weights change from those of the base model?

LLMs store general-purpose language patterns in their weights through pre-training. When a model is fine-tuned on task-specific data, the resulting change in the weights (the delta) carries the task-specific knowledge. This delta is precisely what the Task Arithmetic paper defines as a "task vector."

  • Task vector = fine-tuned weights − base model weights
  • The general-purpose knowledge held by the base model remains intact, while only the task-specific adjustments are isolated as a delta
  • By summing the task vectors of multiple tasks, the capabilities of each can be integrated into a single model

How well this works depends on how the tasks relate to each other. When tasks are closely related (e.g., English translation and Japanese translation), interference between task vectors tends to be small, and both skills are more likely to be preserved after addition. On the other hand, for entirely different domains (e.g., code generation and sentiment analysis), competing updates to the same parameters are more likely to occur, and cases have been reported where simple addition alone degrades quality.

Research on DARE reports that the majority of delta parameters after SFT (supervised fine-tuning) are redundant, and that model capabilities are maintained even after removing 90–99% of them. This suggests that task-specific knowledge is encoded in the delta with a high degree of redundancy, so a small subset of the delta parameters, once rescaled, is enough to preserve it.

It is precisely this property that makes model merging—mathematically manipulating weight deltas—viable.

Cases Where Model Merging Fails and the Interference Problem

"I merged the models expecting better performance, but accuracy dropped instead"—this is one of the first obstacles engineers encounter when attempting model merging.

The primary cause of merging failures is parameter interference. Models fine-tuned on different tasks can carry conflicting update directions for the same parameters. Simply averaging these can produce an intermediate state that is optimal for neither task, and cases have been reported where performance on both tasks degrades as a result.

Representative scenarios prone to failure are as follows:

  • Different base models: Merging models with different architectures or pre-training data tends to produce meaningless results, as the semantics of the weight spaces do not correspond
  • Semantically distant tasks: When training distributions differ significantly—such as between a code generation specialist model and a sentiment analysis specialist model—the directions of the task vectors are prone to conflict
  • Misconfigured merge coefficients: Without properly tuning the weighting coefficients, the characteristics of one model can dominate, causing the capabilities of the other to be buried

TIES-Merging (arXiv:2306.01708, NeurIPS 2023) is a method that directly addresses this interference problem. It suppresses interference through three stages: pruning redundant parameters with small magnitudes (Trim), electing for each parameter the sign that carries the larger total magnitude across models (Elect Sign), and merging only those parameters that agree with the elected sign (Merge).

What Are the Major Model Merging Methods?

Conclusion: Multiple algorithms exist for model merging, and it is necessary to select the appropriate one based on the objective and conditions.

Approaches range from weighted averaging to Task Arithmetic, TIES, DARE, and SLERP. Understanding the characteristics of each method is what determines the quality of the merge.

Basics of Linear Interpolation (Weighted Average)

The simplest model merging technique is linear interpolation (Weighted Average). It is an intuitive method that simply combines the weight parameters of two models at a specified ratio.

Expressed as a formula:

  • merged = α × model_A + (1 − α) × model_B
  • α is a blending coefficient set between 0 and 1

For example, setting α = 0.5 blends the two models equally, while α = 0.7 allows you to more strongly reflect the characteristics of model_A.

At first, it is tempting to think that "a simple average would only dilute the strengths of both models." In practice, however, many cases have been reported where linear interpolation of weights successfully retains the skills of both models, provided they are fine-tuned models derived from the same base model. This is thought to be because both models exist in the vicinity of the same loss landscape.

On the other hand, linear interpolation has clear limitations.

  • Prerequisites: Both models must share the same architecture and be derived from the same base model
  • Tuning α: The optimal ratio varies by task and requires trial and error
  • Interference risk: When merging models that have been enhanced for different domains, performance may end up mediocre

A common approach for finding the optimal α is to perform a grid search while monitoring scores on a validation set.
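A grid search over α might look like the following sketch. The scoring function is the part that matters in practice and is entirely an assumption here: `toy_score` is a stand-in for "evaluate the merged model on a validation set", and `grid_search_alpha` is a hypothetical helper name.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, List[float]]


def lerp(w_a: Weights, w_b: Weights, alpha: float) -> Weights:
    """alpha * model_A + (1 - alpha) * model_B for every parameter."""
    return {k: [alpha * a + (1.0 - alpha) * b for a, b in zip(w_a[k], w_b[k])]
            for k in w_a}


def grid_search_alpha(w_a: Weights, w_b: Weights,
                      score_fn: Callable[[Weights], float],
                      step: float = 0.1) -> Tuple[float, float]:
    """Try alpha = 0, step, 2*step, ..., 1 and keep the best-scoring value."""
    if step <= 0.0:
        raise ValueError("step must be positive")
    n = round(1.0 / step)
    best_alpha, best_score = 0.0, float("-inf")
    for i in range(n + 1):
        alpha = i / n
        score = score_fn(lerp(w_a, w_b, alpha))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


def toy_score(weights: Weights) -> float:
    """Placeholder for a validation metric; it peaks when the weight is 0.7."""
    return -(weights["w"][0] - 0.7) ** 2


best_alpha, best_score = grid_search_alpha({"w": [1.0]}, {"w": [0.0]}, toy_score)
print(best_alpha, best_score)
```

With a real model, each call to the scoring function is an evaluation run, so a coarse grid first and a finer grid around the best value keeps the cost manageable.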

Task Arithmetic and Vector Operation-Based Merging

Task Arithmetic is a technique proposed in the 2022 paper "Editing Models with Task Arithmetic." At its core is the concept of a task vector, defined by the following formula:

Task vector = fine-tuned weights − base model weights

In other words, it is an operation that extracts, as a vector, "how much the model has changed from the base model" through adaptation to a specific task.

Using this vector, weight manipulation translates into intuitive arithmetic operations:

  • Addition (additive merging): Adding together the task vectors of multiple tasks and applying them to the base model produces a model that simultaneously possesses the capabilities of each task
  • Subtraction (capability removal): Subtracting a task vector from the base model intentionally weakens the model's tendency to respond to a specific task
  • Scaling: Multiplying the vector by a scalar coefficient adjusts the degree of influence of each task
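The three operations above can be sketched as follows. This is a minimal illustration with plain lists in place of tensors; the model contents and the names `task_vector` and `apply_task_vectors` are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector = fine-tuned weights - base model weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_task_vectors(base: Weights, vectors: List[Weights],
                       coeffs: List[float]) -> Weights:
    """Return base + sum_i coeff_i * vector_i.

    A positive coefficient adds a capability, a negative one subtracts it,
    and the magnitude scales its influence.
    """
    merged = {k: list(v) for k, v in base.items()}
    for tau, c in zip(vectors, coeffs):
        for k in merged:
            merged[k] = [m + c * t for m, t in zip(merged[k], tau[k])]
    return merged


# Hypothetical toy weights: one base model and two fine-tuned variants.
base = {"w": [1.0, 1.0]}
code_ft = {"w": [1.5, 1.0]}
ja_ft = {"w": [1.0, 0.5]}

tau_code = task_vector(code_ft, base)
tau_ja = task_vector(ja_ft, base)

added = apply_task_vectors(base, [tau_code, tau_ja], [1.0, 1.0])    # addition
scaled = apply_task_vectors(base, [tau_code, tau_ja], [0.5, 0.5])   # scaling
removed = apply_task_vectors(base, [tau_code], [-1.0])              # subtraction
print(added, scaled, removed)
```

In this toy the two task vectors touch different coordinates, so addition keeps both changes intact. When they touch the same coordinates with opposite signs, the sum partially cancels.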

Which operation is sufficient depends on the tasks. When the tasks to be integrated are independent of one another, simple addition is often enough. However, when there is semantic overlap or interference between tasks, the scaling coefficients must be carefully adjusted.

That said, Task Arithmetic still faces the challenge of parameter interference. Simply adding multiple task vectors together can cause sign conflicts in the weights, leading to degraded performance.

Overview of Recent Algorithms: TIES, DARE, SLERP, and More

When weighted averaging alone does not improve accuracy enough, three more refined techniques come into consideration: TIES, DARE, and SLERP.

TIES (TRIM, ELECT SIGN & MERGE) is an algorithm presented at NeurIPS 2023. It resolves the problem of weight deltas from multiple models interfering with one another through a three-step process:

  • TRIM: Prunes delta parameters with small absolute values
  • ELECT SIGN: Chooses, for each parameter, the positive or negative direction that carries the larger total magnitude across the models
  • MERGE: Averages and integrates only the deltas whose signs agree

By explicitly eliminating sign conflicts, this approach tends to produce less interference than simple averaging.
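The three TIES steps can be sketched on flat task vectors as below. This is a simplified illustration, not the reference implementation: the sign is elected from the sum of the trimmed deltas (a plain vote count would be a small variation), `keep_fraction` and the function names are assumptions, and the result is a merged task vector that would still need to be scaled and added to the base weights.

```python
from typing import List

Vector = List[float]


def trim(tau: Vector, keep_fraction: float) -> Vector:
    """TRIM: keep the largest-magnitude entries and zero out the rest.

    Entries tied with the threshold are all kept.
    """
    k = max(1, round(len(tau) * keep_fraction))
    threshold = sorted((abs(x) for x in tau), reverse=True)[k - 1]
    return [x if abs(x) >= threshold else 0.0 for x in tau]


def ties_merge(task_vectors: List[Vector], keep_fraction: float = 0.5) -> Vector:
    """Merge task vectors with trim, sign election, and disjoint averaging."""
    trimmed = [trim(tau, keep_fraction) for tau in task_vectors]
    merged: Vector = []
    for values in zip(*trimmed):
        # ELECT SIGN: the direction with the larger summed magnitude wins.
        sign = 1.0 if sum(values) >= 0 else -1.0
        # MERGE: average only the entries that agree with the elected sign.
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Hypothetical task vectors with sign conflicts in the first two positions.
tau_a = [0.8, -0.6, 0.4, 0.05]
tau_b = [-0.5, 0.7, 0.5, -0.02]
merged_tau = ties_merge([tau_a, tau_b], keep_fraction=0.75)
print(merged_tau)
```

In the first position, a plain mean of 0.8 and -0.5 would shrink the update to roughly 0.15, whereas the sketch keeps the winning direction at full strength.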

DARE (Drop And REscale) is a method reported in a study accepted at ICML 2024. Based on the observation that fine-tuned delta parameters contain a large amount of redundancy, it randomly drops 90–99% of the deltas and then rescales the remaining values to compensate. It has been reported that capabilities are maintained even after such drastic reduction, and DARE is often used in combination with TIES.
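The drop-and-rescale step can be sketched as follows. It is a minimal illustration on a flat list of deltas; the function name `dare` and the fixed `seed` argument are assumptions made so the example is reproducible.

```python
import random
from typing import List


def dare(delta: List[float], drop_rate: float, seed: int = 0) -> List[float]:
    """Randomly zero out a fraction of the deltas and rescale the survivors.

    Dividing the kept values by (1 - drop_rate) keeps the expected value of
    each delta unchanged.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must lie in [0, 1)")
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * scale for d in delta]


# Toy delta vector: every entry equals 1.0 so the effect is easy to read.
delta = [1.0] * 10_000
sparse = dare(delta, drop_rate=0.9, seed=0)
kept = sum(1 for x in sparse if x != 0.0)
print(kept, sum(sparse) / len(sparse))
```

Roughly one in ten entries survives, each scaled up about tenfold, so the average of the sparse vector stays close to the original average. The sparsified deltas are then merged, for example with TIES or Task Arithmetic.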

SLERP (Spherical Linear Interpolation) is a technique that interpolates between two models along an arc on a sphere. Whereas ordinary linear interpolation blends models along a "straight line," which can shrink the magnitude of the blended weight vector, SLERP moves along the arc between the two weight vectors, interpolating the angle between them while better preserving their magnitude.
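A minimal sketch of SLERP on two flat vectors is shown below. The fallback to linear interpolation for near-parallel or zero-length vectors and the `eps` threshold are assumptions of this sketch; real tools typically apply the operation tensor by tensor.

```python
import math
from typing import List

Vector = List[float]


def slerp(v_a: Vector, v_b: Vector, t: float, eps: float = 1e-8) -> Vector:
    """Interpolate from v_a (t=0) to v_b (t=1) along the arc between them."""
    linear = [(1.0 - t) * a + t * b for a, b in zip(v_a, v_b)]
    norm_a = math.sqrt(sum(a * a for a in v_a))
    norm_b = math.sqrt(sum(b * b for b in v_b))
    if norm_a < eps or norm_b < eps:
        return linear  # the angle is undefined for a zero vector
    dot = sum(a * b for a, b in zip(v_a, v_b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return linear  # nearly parallel vectors: the arc is almost a line
    c_a = math.sin((1.0 - t) * omega) / sin_omega
    c_b = math.sin(t * omega) / sin_omega
    return [c_a * a + c_b * b for a, b in zip(v_a, v_b)]


# Two orthogonal unit vectors: the halfway point stays on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
midpoint_norm = math.sqrt(sum(x * x for x in midpoint))
print(midpoint, midpoint_norm)
```

For comparison, the linear midpoint of the same two vectors is [0.5, 0.5], whose length is about 0.71 rather than 1.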

How to Practice Model Merging: Tools and Procedures

Conclusion: Once you understand the theory, the next step is choosing the right implementation tool and understanding the merge procedure.

In practice, model merging comes down to three key steps: tool selection, choosing the right combination of models, and evaluation. The following sections walk through everything in order, from how to use MergeKit, a representative implementation tool, to methods for verifying output quality.

Basic Merging Workflow Using MergeKit

MergeKit is an open-source library dedicated to model merging, published by arcee-ai. It supports the major algorithms, including Task Arithmetic, TIES, DARE, and SLERP, and can execute a merge from a single YAML configuration file.

It is tempting at first to think "I can just write a Python script from scratch," but in practice, using MergeKit's configuration files yields higher reproducibility and reduces the risk of procedural errors.

The basic procedure is as follows:

  1. Installation: Install the library with pip install mergekit
  2. Create a YAML configuration file: Specify the merge method (e.g., task_arithmetic), the path to the base model, and the paths and weight coefficients for each model to be integrated
  3. Run the merge: Execute the command mergekit-yaml config.yaml output_dir/
  4. Check the output: The merged model is generated in Hugging Face-compatible format in the specified output directory

As an example of YAML syntax, when using Task Arithmetic, specify merge_method: task_arithmetic and set a weight under parameters for each model. Coefficients are typically adjusted in the range of 0.3–0.7, and there is no strict requirement to make them sum to 1. Setting values too high tends to degrade the capabilities the original model possessed, so it is safer to adjust incrementally while checking scores on a validation set.
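As a rough sketch, a Task Arithmetic configuration can be generated from Python so that coefficient sweeps stay reproducible. The key names used here (`merge_method`, `base_model`, `dtype`, `models`, and a per-model `weight` under `parameters`) reflect the MergeKit configuration format as commonly documented and should be checked against the version you install. The model paths and the helper `render_task_arithmetic_config` are hypothetical.

```python
from typing import List, Tuple


def render_task_arithmetic_config(base_model: str,
                                  models: List[Tuple[str, float]],
                                  dtype: str = "bfloat16") -> str:
    """Render a MergeKit-style YAML config as text.

    Plain string formatting avoids a YAML dependency; it assumes the paths
    contain no characters that need YAML quoting.
    """
    lines = [
        "merge_method: task_arithmetic",
        f"base_model: {base_model}",
        f"dtype: {dtype}",
        "models:",
    ]
    for path, weight in models:
        lines += [
            f"  - model: {path}",
            "    parameters:",
            f"      weight: {weight}",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical local paths; replace with real model directories or repo IDs.
config_text = render_task_arithmetic_config(
    "./models/base",
    [("./models/code-ft", 0.5), ("./models/ja-ft", 0.4)],
)
print(config_text)
```

Writing the returned text to config.yaml and running the mergekit-yaml command from step 3 would then perform the merge.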

How to Select Base Models and Merged Models

The quality of a merge result is largely determined at the stage of selecting the base model and the models to be integrated. Choosing an inappropriate combination can cause serious weight interference, potentially degrading the capabilities of both models.

Basic Requirements for Base Model Selection

First, it is a prerequisite that all models to be merged share the same architecture and the same number of parameters. Merging across different architectures is not supported by the weight-space methods covered in this article.

Next, verify whether the group of models you wish to integrate have been fine-tuned from the same base model. For example, multiple fine-tuned models derived from a Llama-series base model tend to experience less interference because they share the same weight coordinate space.
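A quick structural check before merging can be sketched as follows. It only compares parameter names and shapes, so passing it does not prove that two models share the same base model. The parameter names and shapes in the example are hypothetical.

```python
from typing import Dict, List, Tuple

# Parameter name -> tensor shape, e.g. collected from each checkpoint's metadata.
Shapes = Dict[str, Tuple[int, ...]]


def find_incompatibilities(shapes_a: Shapes, shapes_b: Shapes) -> List[str]:
    """List every structural difference that would block an element-wise merge."""
    problems: List[str] = []
    for name in sorted(shapes_a.keys() - shapes_b.keys()):
        problems.append(f"only in first model: {name}")
    for name in sorted(shapes_b.keys() - shapes_a.keys()):
        problems.append(f"only in second model: {name}")
    for name in sorted(shapes_a.keys() & shapes_b.keys()):
        if shapes_a[name] != shapes_b[name]:
            problems.append(
                f"shape mismatch for {name}: {shapes_a[name]} vs {shapes_b[name]}"
            )
    return problems


# Hypothetical shape tables for two checkpoints.
shapes_a = {"embed.weight": (32000, 4096), "layer0.q_proj.weight": (4096, 4096)}
shapes_b = {
    "embed.weight": (32000, 5120),
    "layer0.q_proj.weight": (4096, 4096),
    "layer1.mlp.weight": (4096, 11008),
}
problems = find_incompatibilities(shapes_a, shapes_b)
print(problems)
```

An empty list means the weights line up element by element. Whether the models actually descend from the same base still has to be confirmed from their model cards or training records.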

How to Select Models for Integration: Decision Criteria by Purpose

The characteristics of the models you should choose vary depending on your objective.

  • When aiming to improve generality: Combining multiple models each specialized in different domains (e.g., coding-focused, multilingual, dialogue quality improvement) makes it easier to acquire a broad range of capabilities not found in any single model.
  • When you want to maintain accuracy on a specific task while adding secondary capabilities: Weighted averaging — where the primary model is set as the "stronger" one and the auxiliary model's weights are blended at a low coefficient — is appropriate.

Combinations to Avoid

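The following patterns can significantly degrade merge quality.

  • Models with different architectures (number of layers, hidden dimensions, or number of attention heads)
  • Models that share an architecture but were fine-tuned from different base models
  • Models whose task vectors conflict strongly in sign or magnitude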

Evaluation Metrics and Quality Verification After Merging

Immediately after a merge is complete, it is not uncommon to find yourself wondering, "Which metrics should I look at to assess quality?" Having a systematic evaluation process in place allows you to iterate quickly on improving your merge configuration.

The first thing to check is benchmark scores by task. Verify individually whether scores have dropped significantly on the tasks each pre-merge model excelled at. Specifically, examine the following aspects:

  • Capability retention: Are the post-merge scores for benchmarks on which the source models scored highly (e.g., mathematical reasoning, code generation, multilingual understanding) within an acceptable range?
  • Interference detection: Has strengthening one task caused a significant drop in performance on another?
  • Hallucination rate: Has the inclusion of misinformation increased in outputs that require factual accuracy?
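The capability-retention check can be sketched as a simple comparison against the best source score per task. All scores below are made-up placeholders rather than measured results, and `max_drop` is a hypothetical tolerance you would choose for your own use case.

```python
from typing import Dict


def retention_report(source_best: Dict[str, float],
                     merged: Dict[str, float],
                     max_drop: float) -> Dict[str, bool]:
    """For each task, report whether the merged model stays within
    max_drop of the best pre-merge score."""
    return {task: merged[task] >= best - max_drop
            for task, best in source_best.items()}


# Placeholder numbers for illustration only.
source_best = {"math": 0.62, "code": 0.55, "multilingual": 0.71}
merged_scores = {"math": 0.60, "code": 0.48, "multilingual": 0.70}

report = retention_report(source_best, merged_scores, max_drop=0.03)
failed = [task for task, ok in report.items() if not ok]
print(report, failed)
```

A task that fails the check is a signal to lower the coefficient of the competing model or to try an interference-aware method such as TIES.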

Manual inspection of qualitative output samples is also indispensable. This allows you to detect stylistic inconsistencies and degradation in instruction-following that benchmarks alone cannot capture. A widely used approach in practice is to prepare a representative set of prompts and compare outputs side by side before and after the merge.

Finally, conduct a sensitivity analysis of the merge coefficients (interpolation ratios). By varying the coefficients in increments of 0.1 and recording how benchmark scores change, it becomes easier to identify the optimal balance point.

Managing evaluation results centrally in a spreadsheet — recording coefficients, methods, and scores together — makes it easy to reproduce and compare results later.
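If a spreadsheet is the target, one simple option is to log each evaluation run as a CSV row. The column names and the score values in this sketch are assumptions chosen for illustration.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["method", "coefficient", "benchmark", "score"]


def runs_to_csv(runs: List[Dict[str, object]]) -> str:
    """Serialize evaluation runs to CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(runs)
    return buffer.getvalue()


# Placeholder results: one row per (method, coefficient, benchmark) combination.
runs = [
    {"method": "task_arithmetic", "coefficient": 0.3, "benchmark": "code", "score": 0.51},
    {"method": "task_arithmetic", "coefficient": 0.4, "benchmark": "code", "score": 0.53},
]
csv_text = runs_to_csv(runs)
print(csv_text)
```

Keeping one row per configuration and benchmark makes it straightforward to filter and chart the sensitivity of each coefficient later.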

What Use Cases Is Model Merging Suited For?

Conclusion: Model merging is particularly effective in situations where you want to consolidate multiple capabilities into a single model at low cost.

The following sections describe representative use cases, including multilingualization, domain specialization, and local deployment.

Simultaneous Enhancement of Multilingual Support and Specialized Domain Knowledge

When trying to acquire both multilingual capability and specialized domain knowledge simultaneously, the initial instinct is often to "just fine-tune a single model on multilingual corpora and specialized documents at the same time." In practice, however, the two learning objectives interfere with each other, and there are reported cases where neither capability reaches a satisfactory level. Model merging is a promising way to avoid this problem.

The specific approach is as follows:

  • Prepare separately a multilingual-specialized model (e.g., fine-tuned on multiple languages such as Japanese, English, and Chinese) and a domain-specialized model (e.g., tuned on specialized corpora in fields such as medicine, law, or manufacturing)
  • Use a merge method such as Task Arithmetic to compose task vectors and integrate both capabilities into a single model
  • Achieve a multilingual × specialized domain combination with virtually no additional GPU training cost

The reason this approach is effective is that each model has already "recorded" its capabilities in its weights as independent task vectors. In areas with little interference, linear interpolation tends to yield sufficient performance.

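On the other hand, there are also points to be aware of. Both models must be derived from the same base model, and because interference can still occur, the merged model should be evaluated on both the multilingual and the domain-specific tasks before use.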

Hybrid Use with LoRA Adapters

LoRA (Low-Rank Adaptation) is a technique that trains only a small number of additional parameters, packaged as an adapter, while keeping the base model's weights frozen. When combined with model merging, a hybrid configuration that brings out the strengths of both approaches can be achieved.

Choosing Between Merging and LoRA

Model merging is suited when broad adaptability to tasks and generality are the priority, while adding a LoRA adapter with additional training is appropriate when precise adaptation to a specific domain is required. When combining these two approaches in a two-stage configuration, the following workflow is common:

  1. Merge multiple fine-tuned models to create a multi-skilled base
  2. Apply lightweight additional training via LoRA to the resulting merged model
  3. At inference time, dynamically attach the LoRA adapter to the merged weights for use

Practical Benefits

  • Integrating general-purpose capabilities (multilingual support, code generation, etc.) at the merge stage tends to reduce the amount of training data required for subsequent LoRA training
  • LoRA adapters can be managed separately from the model itself, making it easy to swap between multiple adapters to handle different tasks
  • MergeKit also has the ability to absorb (merge) LoRA adapters into the base weights prior to merging, which helps keep the workflow simple
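Absorbing a LoRA adapter into the base weights amounts to adding a scaled low-rank product to each adapted matrix. The sketch below shows the standard LoRA update W' = W + (lora_alpha / r) × B × A on small nested lists. It illustrates the arithmetic only and is not MergeKit's implementation; the matrices are toy values.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for toy-sized matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
               lora_alpha: float) -> Matrix:
    """Return W + (lora_alpha / r) * B @ A.

    lora_a has shape (r, in_features) and lora_b has shape (out_features, r),
    so B @ A has the same shape as the frozen weight W.
    """
    rank = len(lora_a)
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example: a 2x2 weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (r=1, in=2)
lora_b = [[0.5], [0.0]]    # shape (out=2, r=1)
merged_weight = merge_lora(weight, lora_a, lora_b, lora_alpha=2.0)
print(merged_weight)
```

Once the adapter is folded in like this, the result is an ordinary dense weight matrix and can take part in any of the merge methods described earlier.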

Points to Note

When stacking LoRA on top of a merged model, weight interference introduced during the merge can affect LoRA training. It is advisable to confirm that quality is stable through post-merge evaluation before proceeding with LoRA training.

Applications to Local LLM Construction Using Open-Weight Models

The need from practitioners to "operate LLMs without relying on cloud APIs and without sending internal data outside the organization" is growing rapidly alongside the proliferation of open-weight models. Model merging also functions as a practical option in these local LLM deployment scenarios.

Because open-weight models have their weights publicly available, multiple models can be merged and consolidated into a single inference endpoint. This makes the following types of configurations easier to achieve:

  • Integration of Japanese-specialized models × domain-specific models: Merging a general-purpose Japanese base model with a model fine-tuned on specialized knowledge in fields such as law, medicine, or manufacturing, and operating it as a single local model
  • Constraining parameter scale: Running a single merged model tends to consume less GPU memory than launching multiple models separately
  • Combination with LoRA adapters: A two-stage configuration is also possible, where a base model is enhanced through merging and then further lightly tuned with LoRA

MergeKit is designed with local execution in mind and can be used as-is with open-weight models in Hugging Face format. For inference servers, options such as Ollama and vLLM are available; depending on the server, the merged model can be loaded directly or after conversion to the format that server expects.

As a caveat, the models to be merged must share the same architecture and must be derived from the same base model.

Common Misconceptions and Caveats About Model Merging

Conclusion: Model merging is not a silver bullet — choosing the wrong method or violating license terms can become a significant risk in real-world deployment.

The expectation that "blending models will improve performance" is not always correct, and interference between models or compatibility issues can arise. Attention to redistribution licenses is also essential.

Is It Really True That Mixing Always Improves Results?

When first starting out with model merging, it is easy to assume that "combining multiple models will simply improve performance." In practice, however, merging is not a universal solution, and cases have been reported where performance degrades when the conditions are not met.

The main reasons why merges tend to fail are as follows:

  • Architecture mismatch: Models that differ in the number of layers, hidden dimensions, or number of attention heads cannot have their weights mapped to each other in the first place
  • Differences in base models: Even with the same architecture, if models were fine-tuned from different base models, their weight semantic spaces are misaligned, making interference more likely
  • Task conflicts: When the task vector of one model collides with the task vector of another in terms of sign or magnitude, both capabilities end up being only partially realized

Research on TIES-Merging has reported that simple weighted averaging causes an "interference problem" in which parameters cancel each other out due to sign conflicts. This is the primary reason why the intuition that "the more you blend, the better" breaks down.

Key criteria to keep in mind for maintaining quality are as follows:

  • Select models that are derived from the same base model
  • Always evaluate the merged model using benchmarks and real-world tasks, and compare it against the base model
  • Start with small merge coefficients (interpolation ratios) and adjust incrementally

Model merging is a "technology that presupposes trial and error." Pairing hypothesis design before merging with evaluation after merging is the realistic approach to achieving the expected performance gains.

Risks Related to Licensing and Redistribution

When the source models have different licenses, redistributing the resulting merged model enters a legal gray zone.

Key points to be aware of are as follows:

  • Commercial use eligibility: Llama-family models have their own terms of use, and conditions may apply to commercial use and redistribution. If even one of the source models prohibits commercial use, there is a risk that the entire merged model becomes ineligible for commercial use
  • License propagation: While open licenses such as Apache 2.0 and MIT are relatively permissive for redistribution, proprietary non-commercial licenses or "research use only" conditions may be interpreted as carrying over to the merged model
  • Restrictions on weight redistribution: Some models explicitly prohibit redistribution of the weights themselves. Since a merged model also contains weights, it may be subject to these restrictions

As a rule, license terms should be verified individually for each source model. The risk is lower when the intended use is purely internal, but when publishing or distributing on platforms such as Hugging Face, it is recommended to carefully review all source model licenses and, where necessary, consult with a legal professional.

Documentation in the model card is also important. Listing the name, version, and license of each source model allows users to make informed decisions — something that is also required from the perspective of Responsible AI. Always check the official page of each model for the latest official license terms.

Author & Supervisor

Yusuke Ishihara

Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).