Gradient-Surgical: Efficient Fine-Tuning via Gradient-Informed Dynamic Parameter Freezing
Authors: Chengyou Xin, Wenxin Zhang, Yueling Zhang, Chengjia Wu, University of Electronic Science and Technology of China, Hong Kong University of Science and Technology, Chengdu Banana Intelligent Technology Co., Ltd., Hong Kong ChongDe Industrial Limited, TKLLM Research Team
Date: October 9, 2025
Keywords: Parameter-efficient fine-tuning, gradient analysis, dynamic pruning, transfer learning, efficient AI
Abstract
This paper proposes Gradient-Surgical Parameter Freezing (GSPF), a novel parameter-efficient fine-tuning method that dynamically freezes non-sensitive parameters based on gradient analysis. Unlike existing structured adaptation methods (e.g., LoRA), GSPF requires no prior assumptions about parameter update patterns. The key insight is that task-specific adaptation primarily depends on a small subset of parameters exhibiting high gradient sensitivity during early training.
Experiments on 8 NLP and CV benchmarks show that GSPF achieves:
- Parameter efficiency: Updates only 0.1%–0.5% of parameters
- Performance: Within 0.3 points of full-parameter fine-tuning (FullFT) accuracy
- Efficiency: 2.1× faster training, 60% less memory
- Environmental impact: 67% reduction in CO₂ emissions
1. Introduction
Large pre-trained models (PTMs) are deep learning models trained on massive datasets to capture general representations. Examples include BERT [Devlin et al., 2019] in NLP and Vision Transformers (ViT) [Dosovitskiy et al., 2020] in computer vision. These models typically contain hundreds of millions to billions of parameters and can model complex patterns across domains.
Full-parameter fine-tuning (FullFT) updates all parameters using task-specific data. While effective, it poses several challenges:
- Computational cost: Fine-tuning billion-parameter models requires substantial GPU/TPU compute
- Memory demands: Optimizer states consume 2–3× the model size
- Environmental impact: Training large models produces substantial CO₂ [Strubell et al., 2019]
- Catastrophic forgetting: Pre-trained knowledge can be overwritten
Parameter-efficient fine-tuning (PEFT) methods such as LoRA [Hu et al., 2021] and Adapters [Houlsby et al., 2019] enforce predefined update structures, which limits adaptability. Importantly, existing methods do not exploit how parameter importance varies across tasks.
We introduce GSPF, a data-driven approach that:
- Identifies critical parameters via gradient sensitivity during a brief warmup phase
- Dynamically freezes the >99.5% of parameters with low sensitivity
- Achieves near-FullFT performance with dramatically reduced resources
Contributions:
- Theoretically grounded gradient-sensitivity metric for parameter importance
- Efficient dynamic parameter freezing algorithm with optional iterative refinement
- Evaluation on 8 benchmarks showing state-of-the-art efficiency
- Analysis revealing sensitive parameters concentrated in specific network regions
2. Method
2.1 Gradient Sensitivity Analysis
We hypothesize that parameters with consistently large gradients during early training are the most critical for task adaptation. For parameter $\theta_i$, we average the absolute gradient over $T$ warmup batches:
[ S_i = \frac{1}{T} \sum_{t=1}^{T} \left| \nabla_{\theta_i} \mathcal{L}(\mathcal{D}_t) \right| ]
This captures both the magnitude and consistency of parameter updates.
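For concreteness, here is a minimal PyTorch-style sketch of how these warmup statistics could be accumulated. The model, dataloader, loss function, and the helper name `warmup_sensitivity` are illustrative assumptions, not the released implementation.

```python
# Minimal sketch (assumed PyTorch setup, illustrative names): accumulate
# per-parameter |gradient| over the warmup batches to obtain S_i.
import torch

def warmup_sensitivity(model, dataloader, loss_fn, optimizer, num_batches):
    # Running sum of |gradient| for every parameter tensor, keyed by name.
    grad_abs_sum = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    seen = 0
    for inputs, targets in dataloader:
        if seen >= num_batches:
            break
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                grad_abs_sum[name] += p.grad.detach().abs()
        optimizer.step()  # warmup still updates all parameters
        seen += 1
    # S_i = (1/T) * sum_t |grad_{i,t}|
    return {name: s / max(seen, 1) for name, s in grad_abs_sum.items()}
```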
2.2 Dynamic Parameter Freezing
GSPF fine-tunes a small subset of parameters based on gradient sensitivity and magnitude. It operates in three phases: warmup, sensitivity-based selection, and surgical fine-tuning.
Algorithm: Gradient-Surgical Fine-Tuning (GSPF)
Input:  Pretrained model θ^(0), training data D, loss function L,
        trainable ratio p, warmup epochs K, re-selection interval R
Output: Fine-tuned model θ^(T)

Phase 1: Warmup Training
------------------------
for epoch = 1 to K:
    for batch D_t in D:
        Compute loss L_t = L(θ, D_t)
        Compute gradients g_t = ∇_θ L_t
        Update θ ← θ - η * g_t
        Record absolute gradients G[:, t] ← |g_t|

Phase 2: Sensitivity Scoring & Selection
----------------------------------------
Compute per-parameter sensitivity: S_i = (1/T) Σ_t G[i, t]
Normalize: S_i ← (S_i - μ_S) / σ_S
Enhance with magnitude: S̃_i = S_i * tanh(|θ_i|)
Select top-p% parameters as trainable set I_train
Freeze all θ_i not in I_train

Phase 3: Surgical Fine-Tuning
-----------------------------
for epoch = K+1 to E_max:
    for batch D_t in D:
        Compute gradients g_t = ∇_θ L(θ, D_t)
        Update θ_i using AdamW only if i ∈ I_train
    if epoch % R == 0 and epoch < 0.8*E_max:
        Recompute S̃_i using accumulated gradients
        Reselect I_train
Phase 1: Warmup Training
[ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta \mathcal{L}(D_t) ]
- Record absolute gradient values |∇θ_i L|
- Typically T_batches ≈ 1000 (GLUE) or 5000 (ImageNet)
- Minimal memory overhead
Phase 2: Parameter Selection
Enhanced sensitivity:
[ \tilde{S}_i = \frac{S_i - \mu_S}{\sigma_S} \cdot \tanh(|\theta_i|) ]
Select top-p% parameters as trainable set; freeze the rest.
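As a concrete illustration, the sketch below (building on the assumed `warmup_sensitivity` helper; `select_trainable` and the per-tensor boolean-mask layout are our own naming, not the paper's) normalizes the scores, weights them by parameter magnitude, and keeps the global top-p%:

```python
# Sketch (illustrative, not the authors' code): sensitivity -> trainability masks.
import torch

def select_trainable(model, sensitivities, keep_ratio):
    params = dict(model.named_parameters())
    names = list(sensitivities.keys())
    flat_s = torch.cat([sensitivities[n].flatten() for n in names])
    flat_w = torch.cat([params[n].detach().abs().flatten() for n in names])
    # Normalize sensitivities, then enhance with parameter magnitude (S~_i).
    s_norm = (flat_s - flat_s.mean()) / (flat_s.std() + 1e-8)
    s_tilde = s_norm * torch.tanh(flat_w)
    # Keep the global top-p% of entries; everything else is frozen.
    k = max(1, int(keep_ratio * s_tilde.numel()))
    threshold = torch.topk(s_tilde, k).values.min()
    masks, offset = {}, 0
    for n in names:
        numel = sensitivities[n].numel()
        chunk = s_tilde[offset:offset + numel].view_as(sensitivities[n])
        masks[n] = chunk >= threshold  # True = trainable, False = frozen
        offset += numel
    return masks
```

With a 0.3% keep ratio this corresponds to roughly 0.003 × 110M ≈ 330K trainable entries for BERT-base.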
Phase 3: Surgical Fine-Tuning
Update critical subset:
[ \theta_i^{(t+1)} = \begin{cases} \theta_i^{(t)} - \eta \cdot \text{AdamWUpdate}(g_{t,i}) & i \in I_\text{train} \\ \theta_i^{(t)} & \text{otherwise} \end{cases} ]
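A minimal sketch of this masked update (again with illustrative names, and assuming a fresh `torch.optim.AdamW` instance is created after selection so frozen entries start with zero moment estimates): zeroing the gradients of frozen entries before the optimizer step mimics per-element freezing without touching optimizer internals.

```python
# Sketch of one surgical fine-tuning step (illustrative, not the authors' code).
import torch

def surgical_step(model, optimizer, loss_fn, inputs, targets, masks):
    # masks[name] is boolean: True = trainable, False = frozen.
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is not None:
                # Zero the gradient of frozen entries so AdamW leaves them unchanged.
                p.grad.mul_(masks[name].to(p.grad.dtype))
    optimizer.step()
    return loss.item()
```

One caveat in such an implementation: AdamW's decoupled weight decay would still shift frozen entries slightly, so weight decay should be disabled or applied only to the trainable entries.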
Iterative refinement:
- Re-evaluate sensitivity every R epochs
- Update: $\tilde{S}_i^\text{new} = \alpha \tilde{S}_i^\text{prev} + (1-\alpha)\tilde{S}_i^\text{current}$
- Applied only during the first 80% of training (see the sketch below)
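A small sketch of the re-selection step; the helper name `refine_selection`, the default `alpha`, and keeping the S̃ scores as a per-tensor dictionary are our assumptions:

```python
# Sketch (illustrative): EMA-combine old and new enhanced sensitivities,
# then re-pick the global top-p% as the new trainable set.
import torch

def refine_selection(prev_s_tilde, curr_s_tilde, keep_ratio, alpha=0.7):
    combined = {n: alpha * prev_s_tilde[n] + (1 - alpha) * curr_s_tilde[n]
                for n in prev_s_tilde}
    flat = torch.cat([s.flatten() for s in combined.values()])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    masks = {n: s >= threshold for n, s in combined.items()}
    return combined, masks
```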
Figure 1: Three-phase GSPF workflow
3. Experiments
3.1 Experimental Setup
Models:
- NLP: BERT-base (110M), RoBERTa-large (355M)
- CV: ViT-B/16 (86M), ResNet-50 (25M)
Datasets:
- NLP: GLUE, SQuAD 1.1
- CV: CIFAR-100, ImageNet-1K, ADE20K
Baselines: FullFT, LoRA (r=8), Adapter (factor=16), BitFit, DiffPruning
Metrics: Accuracy/F1, GPU memory (GB), training time (hrs), CO₂ emissions (kg)
Details:
- Warmup epochs: 1 (GLUE/CIFAR), 2 (ImageNet)
- Keep ratio: 0.1%, 0.3%, 0.5%
- Hardware: 8× NVIDIA A100 (80GB)
3.2 Main Results
GLUE benchmark (BERT-base):
| Method | Param% | Score | Δ vs FullFT | Mem (GB) |
|---|---|---|---|---|
| FullFT | 100.0 | 85.2 | - | 15.2 |
| LoRA | 0.38 | 84.1 | -1.1 | 8.1 |
| Adapter | 2.10 | 84.7 | -0.5 | 9.3 |
| DiffPruning | 0.30 | 84.3 | -0.9 | 7.8 |
| GSPF (0.1%) | 0.1 | 84.6 | -0.6 | 6.0 |
| GSPF (0.3%) | 0.3 | 84.9 | -0.3 | 6.8 |
Figure 2: Parameter-accuracy tradeoff on CIFAR-100 and ImageNet
Analysis:
- GSPF dynamically selects the trainable subset from gradient signals rather than a fixed structure
- The selected subsets echo the lottery ticket hypothesis [Frankle & Carbin, 2018]
- Possible further improvements: layer-wise selection thresholds and higher-order sensitivity metrics
3.3 Cross-Domain Evaluation
| Task | Metric | FullFT | GSPF |
|------|--------|--------|------|
| SQuAD | F1 | 88.7 | 88.5 |
| ImageNet | Top-1 Acc | 81.8 | 81.6 |
| ADE20K | mIoU | 48.2 | 47.9 |
| Protein Fold | RMSD | 1.05 | 1.07 |
- SQuAD: F1 nearly unchanged (-0.2) while updating only 0.3% of parameters
- ImageNet: Top-1 accuracy drops by 0.2 points with 99.5% of parameters frozen
- ADE20K: mIoU drops by only 0.3, indicating spatial features are retained
- Protein folding: RMSD rises only slightly (1.05 → 1.07), showing that scientific tasks remain feasible
4. Analysis
4.1 Sensitivity Distribution
Figure 3: Parameter sensitivity distribution in BERT-base and ViT-B/16
- Transformer layers: last 3 layers most sensitive
- Attention: Query & Value > Key
- LayerNorm gain > bias
- Classification head: sensitivity ≈ 100× that of the embeddings (see the sketch below)
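One possible way to produce such a layer-wise view from the per-parameter scores (grouping by parameter-name prefix is our own convention, not something specified in the paper):

```python
# Sketch (illustrative): average the enhanced sensitivity within each module.
def layerwise_summary(sensitivities):
    summary = {}
    for name, s in sensitivities.items():
        prefix = name.rsplit(".", 1)[0]  # e.g. "encoder.layer.11.attention.self.query"
        summary.setdefault(prefix, []).append(s.mean().item())
    return {k: sum(v) / len(v) for k, v in sorted(summary.items())}
```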
4.2 Efficiency Analysis
| Method | Time (hrs) | CO₂ (kg) | Energy (kWh) |
|---|---|---|---|
| FullFT | 4.2 | 1.58 | 8.4 |
| LoRA | 2.1 | 0.79 | 4.2 |
| Adapter | 2.8 | 1.05 | 5.6 |
| GSPF (0.3%) | 1.8 | 0.52 | 3.6 |
4.3 Ablation Study
Figure 4: (a) Warmup length, (b) iterative refinement effect
- A single warmup epoch is sufficient
- Iterative refinement improves accuracy by 0.4–0.8 points
- Sensitivity normalization improves training stability by 12%
5. Related Work
- Parameter-efficient fine-tuning: LoRA, Adapters, Prefix-Tuning, IA³
- Parameter importance estimation: Fisher Information, SNIP, GraSP
- Dynamic networks: Surgical Fine-Tuning, DiffPruning
GSPF imposes no structural constraints, selecting parameters purely via gradient sensitivity.
6. Conclusion
GSPF efficiently adapts PTMs by:
- Identifying critical parameters via gradient analysis
- Updating only 0.1%–0.5% of parameters
- Achieving near FullFT performance with reduced cost
Future work: multi-task learning, federated learning, theoretical guarantees.
Appendix A: Implementation Details
Hyperparameters
| Task | Learning Rate | Batch Size | Warmup (epochs) | Keep Ratio |
|---|---|---|---|---|
| GLUE | 2e-5 | 32 | 1 | 0.3% |
| SQuAD | 3e-5 | 16 | 1 | 0.4% |
| CIFAR-100 | 5e-5 | 64 | 1 | 0.2% |
| ImageNet | 1e-4 | 256 | 2 | 0.5% |
Memory Saving Formula
[ \text{Mem}_{\text{saved}} = 3(1-p)\,|\theta| \times 4 \text{ bytes} ]
The factor of 3 accounts for the gradient and the two AdamW moment buffers (each stored in 4-byte FP32) that frozen parameters do not require. For BERT-base (110M parameters) at p = 0.003, this predicts ≈ 1.31 GB saved; we observe ≈ 1.28 GB.
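A quick back-of-the-envelope check of this estimate (rounded 110M parameter count, decimal gigabytes; a sanity check, not the paper's measurement script):

```python
# Sanity check of the memory-saving formula for BERT-base at p = 0.3%.
num_params = 110e6        # approximate BERT-base parameter count
p = 0.003                 # keep ratio
bytes_per_state = 4       # FP32
saved = 3 * (1 - p) * num_params * bytes_per_state
print(f"{saved / 1e9:.2f} GB")  # ≈ 1.32 GB with the rounded count; the paper reports ≈ 1.31 GB
```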
Full GLUE Results
| Method | MNLI | QQP | QNLI | SST-2 | CoLA | Avg |
|---|---|---|---|---|---|---|
| FullFT | 86.7 | 91.2 | 92.3 | 93.1 | 62.8 | 85.2 |
| GSPF | 86.4 | 90.9 | 92.0 | 92.8 | 61.9 | 84.9 |
Reproducibility: Code and detailed settings are available from the corresponding author upon request.