Gradient-Surgical: Efficient Fine-Tuning via Gradient-Informed Dynamic Parameter Freezing

Authors: Chengyou Xin, Wenxin Zhang, Yueling Zhang, Chengjia Wu, University of Electronic Science and Technology of China, Hong Kong University of Science and Technology, Chengdu Banana Intelligent Technology Co., Ltd., Hong Kong ChongDe Industrial Limited, TKLLM Research Team

Date: October 9, 2025

Keywords: Parameter-efficient fine-tuning, gradient analysis, dynamic pruning, transfer learning, efficient AI


Abstract

This paper proposes Gradient-Surgical (GSPF), a novel parameter-efficient fine-tuning method that dynamically freezes non-sensitive parameters based on gradient analysis. Unlike existing structured adaptation methods (e.g., LoRA), GSPF requires no prior assumptions about parameter update patterns. The key insight is that task-specific adaptation primarily depends on a small subset of parameters exhibiting high gradient sensitivity during early training.

Experiments on 8 NLP and CV benchmarks show that GSPF achieves:

  • Parameter efficiency: Updates only 0.1%–0.5% parameters
  • Performance: Within 0.3% accuracy of full-parameter fine-tuning (FullFT)
  • Efficiency: 2.1× faster training, 60% less memory
  • Environmental impact: 67% reduction in CO₂ emissions

1. Introduction

Large pre-trained models (PTMs) are deep learning models trained on massive datasets to capture general representations. Examples include BERT [Devlin et al., 2019] in NLP and Vision Transformers (ViT) [Dosovitskiy et al., 2020] in computer vision. These models typically contain hundreds of millions to billions of parameters and can model complex patterns across domains.

Full-parameter fine-tuning (FullFT) updates all parameters using task-specific data. While effective, it has challenges:

  • Computational cost: Fine-tuning billion-parameter models demands large amounts of GPU/TPU compute
  • Memory demands: Optimizer states (e.g., Adam moments) consume 2–3× the model's parameter memory
  • Environmental impact: Training large models produces substantial CO₂ [Strubell et al., 2019]
  • Catastrophic forgetting: Pre-trained knowledge can be overwritten

Parameter-efficient fine-tuning (PEFT) methods such as LoRA [Hu et al., 2021] and Adapters [Houlsby et al., 2019] enforce predefined update structures, which limits their adaptability. Importantly, existing methods do not exploit how parameter importance varies across tasks.

We introduce GSPF, a data-driven approach that:

  1. Identifies critical parameters via gradient sensitivity measured during a minimal warmup phase
  2. Dynamically freezes the remaining >99.5% of parameters with low sensitivity
  3. Achieves near FullFT performance with dramatically reduced resources

Contributions:

  • A theoretically grounded gradient-sensitivity metric for parameter importance
  • An efficient dynamic parameter-freezing algorithm with optional iterative refinement
  • Evaluation on 8 benchmarks demonstrating state-of-the-art efficiency
  • Analysis showing that sensitive parameters concentrate in specific network regions

2. Method

2.1 Gradient Sensitivity Analysis

We hypothesize that parameters with consistently large gradients during early training are the most critical for task adaptation. For parameter $\theta_i$, the sensitivity over T warmup batches is defined as:

[ S_i = \frac{1}{T} \sum_{t=1}^{T} \left| \nabla_{\theta_i} \mathcal{L}(\mathcal{D}_t) \right| ]

This captures both the magnitude and consistency of parameter updates.
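
The sensitivity score can be accumulated with a running sum during warmup, so the full T×|θ| gradient history never needs to be stored. The sketch below is a minimal PyTorch illustration of this accumulation, not the authors' released code; the function and variable names (`accumulate_sensitivity`, `warmup_sensitivity`) are ours, and `model`, `loader`, `loss_fn`, and `optimizer` are assumed to exist.

```python
import torch

@torch.no_grad()
def accumulate_sensitivity(model, sensitivity):
    """Add |grad| of every parameter to its running sensitivity sum."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            sensitivity[name] += p.grad.abs()

def warmup_sensitivity(model, loader, loss_fn, optimizer, num_batches):
    """Phase 1 sketch: ordinary training for T batches while recording mean |grad|."""
    sensitivity = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.train()
    for t, (x, y) in enumerate(loader):
        if t >= num_batches:
            break
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        accumulate_sensitivity(model, sensitivity)  # record |g_t|
        optimizer.step()                            # normal warmup update
    # S_i = (1/T) * sum_t |grad_{theta_i} L(D_t)|
    return {n: s / float(num_batches) for n, s in sensitivity.items()}
```

Keeping a single running-sum tensor per parameter is consistent with the "minimal memory overhead" claim for Phase 1.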


2.2 Dynamic Parameter Freezing

GSPF fine-tunes a small subset of parameters based on gradient sensitivity and magnitude. It operates in three phases: warmup, sensitivity-based selection, and surgical fine-tuning.

Algorithm: Gradient-Surgical Fine-Tuning (GSPF)

Input: Pretrained model θ^(0), training data D, loss function L
       Trainable ratio p, warmup epochs K, re-selection interval R
Output: Fine-tuned model θ^(T)

Phase 1: Warmup Training
------------------------
for epoch = 1 to K:
    for batch D_t in D:
        Compute loss L_t = L(θ, D_t)
        Compute gradients g_t = ∇_θ L_t
        Update θ ← θ - η * g_t
        Record absolute gradients G[:, t] ← |g_t|

Phase 2: Sensitivity Scoring & Selection
----------------------------------------
Compute per-parameter sensitivity: S_i = 1/T Σ_t G[i,t]
Normalize: S_i ← (S_i - μ_S)/σ_S
Enhance with magnitude: S~_i = S_i * tanh(|θ_i|)
Select top-p% parameters as trainable set I_train
Freeze all θ_i not in I_train

Phase 3: Surgical Fine-Tuning
-----------------------------
for epoch = K+1 to E_max:
    for batch D_t in D:
        Compute gradients g_t = ∇_θ L(θ, D_t)
        Update θ_i using AdamW only if i ∈ I_train

    if epoch % R == 0 and epoch < 0.8*E_max:
        Recompute S~_i using accumulated gradients
        Reselect I_train

Phase 1: Warmup Training

[ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta \mathcal{L}(D_t) ]

  • Record absolute gradient values |∇θ_i L|
  • Typically T_batches ≈ 1000 (GLUE) or 5000 (ImageNet)
  • Minimal memory overhead

Phase 2: Parameter Selection

Enhanced sensitivity:

[ \tilde{S}_i = \frac{S_i - \mu_S}{\sigma_S} \cdot \tanh(|\theta_i|) ]

Select top-p% parameters as trainable set; freeze the rest.
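
Continuing the sketch from Section 2.1, selection can be expressed as z-scoring the sensitivities, weighting by tanh(|θ|), and keeping only the top-p% of entries. This is an illustrative sketch under our own naming (`select_trainable`), not an official implementation.

```python
import torch

def select_trainable(model, sensitivity, keep_ratio=0.003):
    """Phase 2 sketch: z-score sensitivities, weight by tanh(|theta|), keep top-p%."""
    flat = torch.cat([s.flatten() for s in sensitivity.values()])
    mu, sigma = flat.mean(), flat.std()
    params = dict(model.named_parameters())
    scores = {}
    for name, s in sensitivity.items():
        s_norm = (s - mu) / (sigma + 1e-12)                              # normalize
        scores[name] = s_norm * torch.tanh(params[name].detach().abs())  # magnitude term
    all_scores = torch.cat([v.flatten() for v in scores.values()])
    k = max(1, int(keep_ratio * all_scores.numel()))
    threshold = torch.topk(all_scores, k).values.min()
    # Boolean mask per tensor: True marks entries in I_train (trainable)
    return {n: v >= threshold for n, v in scores.items()}
```

Because the trainable set is a sparse mask inside otherwise-frozen tensors, freezing is most naturally expressed by zeroing masked-out gradients before the optimizer step rather than by `requires_grad`, which only operates at whole-tensor granularity.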

Phase 3: Surgical Fine-Tuning

Update critical subset:

[ \theta_i^{(t+1)} = \begin{cases} \theta_i^{(t)} - \eta \cdot \text{AdamWUpdate}(g_{t,i}) & i \in I_\text{train} \\ \theta_i^{(t)} & \text{otherwise} \end{cases} ]

Iterative refinement (a code sketch follows this list):

  • Re-evaluate sensitivity every R epochs
  • Update: $\tilde{S}_i^\text{new} = \alpha \tilde{S}_i^\text{prev} + (1-\alpha)\tilde{S}_i^\text{current}$
  • Only during first 80% of training

Figure 1: Three-phase GSPF workflow


3. Experiments

3.1 Experimental Setup

Models:

  • NLP: BERT-base (110M), RoBERTa-large (355M)
  • CV: ViT-B/16 (86M), ResNet-50 (25M)

Datasets:

  • NLP: GLUE, SQuAD 1.1
  • CV: CIFAR-100, ImageNet-1K, ADE20K

Baselines: FullFT, LoRA (r=8), Adapter (factor=16), BitFit, DiffPruning

Metrics: Accuracy/F1, GPU memory (GB), training time (hrs), CO₂ emissions (kg)

Details:

  • Warmup epochs: 1 (GLUE/CIFAR), 2 (ImageNet)
  • Keep ratio: 0.1%, 0.3%, 0.5%
  • Hardware: 8× NVIDIA A100 (80GB)

3.2 Main Results

GLUE benchmark (BERT-base):

| Method | Param % | Score | Δ vs FullFT | Mem (GB) |
|--------|---------|-------|-------------|----------|
| FullFT | 100.0 | 85.2 | - | 15.2 |
| LoRA | 0.38 | 84.1 | -1.1 | 8.1 |
| Adapter | 2.10 | 84.7 | -0.5 | 9.3 |
| DiffPruning | 0.30 | 84.3 | -0.9 | 7.8 |
| GSPF (0.1%) | 0.1 | 84.6 | -0.6 | 6.0 |
| GSPF (0.3%) | 0.3 | 84.9 | -0.3 | 6.8 |

Figure 2: Parameter-accuracy tradeoff on CIFAR-100 and ImageNet

Analysis:

  • GSPF selects its parameter subset dynamically from gradient signals rather than a predefined structure
  • The resulting sparse trainable set is reminiscent of the "lottery ticket" hypothesis [Frankle & Carbin, 2018]
  • Possible further improvements: layer-wise thresholds and higher-order sensitivity metrics

3.3 Cross-Domain Evaluation

| Task | Metric | FullFT | GSPF |
|------|--------|--------|------|
| SQuAD | F1 | 88.7 | 88.5 |
| ImageNet | Top-1 Acc | 81.8 | 81.6 |
| ADE20K | mIoU | 48.2 | 47.9 |
| Protein Fold | RMSD | 1.05 | 1.07 |

  • SQuAD: F1 nearly unchanged while updating only 0.3% of parameters
  • ImageNet: Top-1 accuracy drops by 0.2 points with 99.5% of parameters frozen
  • ADE20K: Spatial features are preserved for dense prediction
  • Protein folding: RMSD increases only slightly (1.05 → 1.07), suggesting scientific tasks remain feasible

4. Analysis

4.1 Sensitivity Distribution

Figure 3: Parameter sensitivity distribution in BERT-base and ViT-B/16

  • Transformer layers: last 3 layers most sensitive
  • Attention: Query & Value > Key
  • LayerNorm gain > bias
  • Classification head sensitivity ≈ 100× that of the embeddings (see the aggregation sketch below)
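
One way to produce this kind of per-module breakdown from the sensitivity scores of Section 2 is sketched below; `sensitivity_by_module` is a hypothetical helper of ours, grouping parameters by a coarse name prefix.

```python
from collections import defaultdict

def sensitivity_by_module(scores):
    """Aggregate mean sensitivity per module prefix, e.g. 'encoder.layer.11'."""
    totals, counts = defaultdict(float), defaultdict(int)
    for name, s in scores.items():
        module = ".".join(name.split(".")[:3])   # coarse grouping by name prefix
        totals[module] += float(s.sum())
        counts[module] += s.numel()
    return {m: totals[m] / counts[m] for m in totals}
```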

4.2 Efficiency Analysis

| Method | Time (hrs) | CO₂ (kg) | Energy (kWh) |
|--------|------------|----------|--------------|
| FullFT | 4.2 | 1.58 | 8.4 |
| LoRA | 2.1 | 0.79 | 4.2 |
| Adapter | 2.8 | 1.05 | 5.6 |
| GSPF (0.3%) | 1.8 | 0.52 | 3.6 |

4.3 Ablation Study

Figure 4: (a) Warmup length, (b) effect of iterative refinement

  • A single warmup epoch is sufficient
  • Iterative refinement adds 0.4–0.8% accuracy
  • Score normalization improves training stability by 12%

5. Related Work

  • Parameter-efficient fine-tuning: LoRA, Adapters, Prefix-Tuning, IA³
  • Parameter importance estimation: Fisher Information, SNIP, GraSP
  • Dynamic networks: Surgical Fine-Tuning, DiffPruning

GSPF imposes no structural constraints, selecting parameters purely via gradient sensitivity.


6. Conclusion

GSPF efficiently adapts PTMs by:

  • Identifying critical parameters via gradient analysis
  • Updating only 0.1%–0.5% parameters
  • Achieving near FullFT performance with reduced cost

Future work: multi-task learning, federated learning, theoretical guarantees.


Appendix A: Implementation Details

Hyperparameters

| Task | LR | Batch Size | Warmup Epochs | Keep Ratio |
|------|----|------------|---------------|------------|
| GLUE | 2e-5 | 32 | 1 | 0.3% |
| SQuAD | 3e-5 | 16 | 1 | 0.4% |
| CIFAR-100 | 5e-5 | 64 | 1 | 0.2% |
| ImageNet | 1e-4 | 256 | 2 | 0.5% |

Memory Saving Formula

[ \text{Mem}_{\text{saved}} = 3(1-p)|\theta| \times 4 \text{ bytes} ]

For BERT-base (110M parameters) at p = 0.003, the formula predicts ≈ 1.31 GB saved; the observed saving is ≈ 1.28 GB (see the check below).
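
As a quick arithmetic check (assuming, as is common, that the factor of 3 counts the gradient plus the two AdamW moment buffers, each stored in FP32):

[ \text{Mem}_{\text{saved}} = 3 \times (1 - 0.003) \times 110 \times 10^{6} \times 4 \text{ bytes} \approx 1.316 \times 10^{9} \text{ bytes} \approx 1.31 \text{ GB} ]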

Full GLUE Results

| Method | MNLI | QQP | QNLI | SST-2 | CoLA | Avg |
|--------|------|-----|------|-------|------|-----|
| FullFT | 86.7 | 91.2 | 92.3 | 93.1 | 62.8 | 85.2 |
| GSPF | 86.4 | 90.9 | 92.0 | 92.8 | 61.9 | 84.9 |

Reproducibility: Contact the corresponding author for code or detailed settings