Integrating Optical Context Compression into TKLLM

A DeepSeek-OCR-Inspired Framework for Efficient Multimodal Fine-Tuning of TikTok-Specific LLMs

(Whitepaper)


Authors

Chengyou Xin, Wenxin Zhang, Yueling Zhang, Chengjia Wu

University of Electronic Science and Technology of China

The Hong Kong University of Science and Technology

Chengdu Banana Intelligent Technology Co., Ltd.

Hong Kong Chongde Industrial Limited

TKLLM Research Team


Abstract

Short-form video content generation faces three fundamental challenges: extended contextual dependencies, significant style variability, and multimodal representation misalignment. TKLLM, as a leading TikTok-specific small language model platform, targets high-quality, style-consistent video script generation from limited creator data. Traditional fine-tuning approaches, however, continue to struggle with excessive memory consumption and fragmented long-term contextual retention.

This paper introduces OCC-TKLLM, a novel fine-tuning framework inspired by DeepSeek-OCR's Optical Context Compression (OCC) methodology. By integrating an Optical Context Encoder (OCE), Mixture-of-Experts (MoE) Style Module, and Optical Forgetting Mechanism with Gradient Surgical Fine-Tuning (GSPF), our framework achieves efficient multimodal alignment and dynamic memory management. Experimental results demonstrate that OCC-TKLLM reduces GPU memory consumption by 72% while enhancing generation quality by 14.8% on the TikTok Viral Caption Benchmark. This work establishes a new paradigm for multimodal fine-tuning in content-specific language models.


1 Introduction

1.1 Background and Motivation

Within TikTok's creative ecosystem, AI models are expected to fulfill several critical functions:

  • Learn and emulate account-specific tones and rhythmic patterns
  • Integrate complementary information from textual, visual, and auditory modalities
  • Maintain stylistic coherence across multiple video sequences
  • Operate efficiently within constrained GPU memory environments

Conventional LLM fine-tuning methodologies—including LoRA, QLoRA, and adapter-based approaches—encounter significant limitations:

  1. Context Explosion: Extended TikTok scripts, captions, and comment threads frequently exceed 4,000 tokens
  2. Style Fragmentation: Substantial variations in tonal delivery, humor styles, and pacing between content creators
  3. Modality Isolation: Disjointed representations of video thumbnails, subtitles, and audio characteristics

These challenges complicate small-scale model fine-tuning while preserving both coherence and computational efficiency.

1.2 Inspiration from DeepSeek-OCR

DeepSeek-OCR introduced Optical Context Compression (OCC), a paradigm in which extended textual contexts are rendered and encoded as visual embeddings, achieving roughly 10× compression with near-lossless decoding; accuracy degrades as the ratio approaches 20×. This approach raises a pivotal question for TKLLM development:

Can TikTok-style scripts and contextual data be visually represented and compressed prior to fine-tuning, thereby preserving stylistic memory while reducing computational overhead?

1.3 Contributions

We propose the OCC-TKLLM framework, which introduces four key innovations:

  1. Optical Context Encoder (OCE): Transforms extended textual contexts (scripts and comments) into compressed visual token representations
  2. MoE Style Experts: Specializes distinct model subnets for specific TikTok content domains (educational, lifestyle, narrative, etc.)
  3. Optical Forgetting Mechanism: Implements controlled degradation of historical information through visual downsampling techniques
  4. Joint OCC + GSPF Fine-Tuning: Combines dynamic gradient freezing with visual compression for efficient multimodal training

2 Related Work

2.1 DeepSeek-OCR and Optical Compression

DeepSeek-OCR demonstrated that extended text sequences can be efficiently represented as visual embeddings:

  • Encoder Architecture: SAM for local perception, a 16× convolutional compressor, then CLIP for global semantics
  • Decoder Framework: DeepSeek-3B-MoE architecture
  • Compression Performance: ≈10× reduction at roughly 97% OCR decoding precision, degrading at higher ratios

This paradigm inspires TKLLM to employ visual tokens as compact memory representations for long-context modeling.

2.2 TKLLM and Gradient Surgical Fine-Tuning (GSPF)

TKLLM utilizes Gradient Surgical Fine-Tuning (GSPF) to dynamically freeze parameters based on gradient importance, achieving high sample efficiency for style imitation. However, in multi-video learning scenarios, GSPF alone cannot resolve context redundancy or memory expansion issues—motivating integration with OCC.

2.3 Mixture-of-Experts for Multistyle Modeling

Sparse Mixture-of-Experts (MoE) architectures, exemplified by DeepSeek-3B-MoE, enable scalable multi-domain specialization through selective expert activation per sample. TKLLM leverages this principle for style specialization, routing different TikTok content genres through distinct expert pathways.


3 Methodology

3.1 Overall Architecture

The proposed OCC-TKLLM framework comprises five integrated modules:

  1. Input Layer: Aggregates multimodal context (scripts, comments, thumbnails)
  2. Optical Context Encoder (OCE): Compresses textual context into visual embeddings
  3. MoE Style Experts: Manages diverse content style variations
  4. GSPF Adaptive Layer: Dynamically controls parameter freezing
  5. LLM Decoder: Generates final TikTok scripts and captions

The mathematical formulation is expressed as:

\[Y = f_{\text{LLM}}(f_{\text{MoE}}(f_{\text{OCE}}(I_{\text{ctx}}); \theta_{\text{moe}}); \theta_{\text{llm}})\]

where:

  • \( I_{\text{ctx}} \) represents the rendered context image
  • \( f_{\text{OCE}} \) denotes the optical encoder function
  • \( f_{\text{MoE}} \) signifies the expert routing module
  • \( f_{\text{LLM}} \) corresponds to the main decoder function
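As a minimal sketch, the composition above can be written as three nested callables. The stub bodies below stand in for the real OCE, MoE, and decoder networks; all names, shapes, and return values are illustrative only.

```python
# Sketch of the OCC-TKLLM forward pass as plain function composition,
# mirroring Y = f_LLM(f_MoE(f_OCE(I_ctx))). Stubs only, not real networks.

def f_oce(ctx_image):
    """Optical Context Encoder: rendered context image -> visual tokens (stub)."""
    return [f"vis_{i}" for i in range(4)]

def f_moe(vis_tokens):
    """Expert routing: visual tokens -> style-conditioned hidden state (stub)."""
    return {"tokens": vis_tokens, "style": "lifestyle"}

def f_llm(hidden):
    """Decoder: hidden state -> generated caption (stub)."""
    return f"[{hidden['style']}] caption from {len(hidden['tokens'])} visual tokens"

def occ_tkllm(ctx_image):
    return f_llm(f_moe(f_oce(ctx_image)))

print(occ_tkllm("rendered_context.png"))
# [lifestyle] caption from 4 visual tokens
```

In the full system each stage carries its own parameters (\( \theta_{\text{moe}} \), \( \theta_{\text{llm}} \)); the composition order, however, is exactly as shown.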

3.2 Optical Context Encoder (OCE)

3.2.1 Encoding Process

The Optical Context Encoder transforms rendered context images (scripts, comments, and thumbnails laid out on a single canvas) into semantically aligned visual embeddings. The process employs a visual backbone for dense representation extraction, followed by projection into a multimodal token space:

\[T_{\text{vis}} = \text{CLIP}(\text{Conv}_{16\times}(\text{SAM}(I_{\text{ctx}})))\]

\( T_{\text{vis}} \) comprises 64–256 visual tokens, each summarizing many underlying text tokens. This architecture enables TKLLM to preserve scene-level context—including lighting, tonal qualities, and compositional elements—while minimizing token redundancy.

3.2.2 Contrastive Alignment

To ensure semantic consistency between textual and optical modalities, we minimize an alignment objective over matched text–video pairs (the positive-pair term of a contrastive loss):

\[\mathcal{L}_{\text{align}} = 1 - \cos(T_{\text{vis}}, E_{\text{text}})\]

where \( E_{\text{text}} \) represents the text encoder output. Minimizing \(\mathcal{L}_{\text{align}}\) promotes high cosine similarity between corresponding text-video pairs, enforcing multimodal coherence within consistent TikTok style domains.
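A minimal sketch of this loss for one pooled embedding pair, in pure Python for clarity; in practice both inputs would be framework tensors and the loss would be averaged over a batch.

```python
import math

def cosine_alignment_loss(t_vis, e_text):
    """L_align = 1 - cos(T_vis, E_text) for a single pooled embedding pair."""
    dot = sum(a * b for a, b in zip(t_vis, e_text))
    norm = (math.sqrt(sum(a * a for a in t_vis))
            * math.sqrt(sum(b * b for b in e_text)))
    return 1.0 - dot / norm

print(cosine_alignment_loss([1.0, 0.0], [1.0, 0.0]))  # 0.0: perfectly aligned
print(cosine_alignment_loss([1.0, 0.0], [0.0, 1.0]))  # 1.0: orthogonal
```

The loss is 0 when the optical and text embeddings point in the same direction and grows as they diverge, which is what drives the two modalities toward a shared style space.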

3.2.3 Resolution Modes

The encoder supports multiple configurations balancing fidelity and computational requirements:

| Mode | Input Resolution | Visual Tokens | Primary Use Case |
|---|---|---|---|
| Tiny | 512×512 | 64 | Single video script processing |
| Base | 1024×1024 | 256 | Multi-script contextual fusion |
| High-Fidelity | 1280×1280 | 800 | Comprehensive account history analysis |

Higher resolution modes preserve fine-grained stylistic cues, while lower modes enable accelerated inference.
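The Tiny and Base token counts are consistent with a ViT-style patch grid followed by the 16× convolutional compressor. The patch size of 16 is our assumption (it is not stated above), and the High-Fidelity mode's 800 tokens does not fit this single-pass formula, which suggests a tiled multi-view encoding for that mode.

```python
# Token count from a single-pass encode: patch grid, then 16x conv compression.
# PATCH = 16 is an assumed ViT patch size, not specified in the paper.
PATCH = 16
DOWNSAMPLE = 16

def visual_tokens(resolution: int) -> int:
    patches = (resolution // PATCH) ** 2   # e.g. 1024 -> a 64x64 patch grid
    return patches // DOWNSAMPLE           # 16x token reduction

print(visual_tokens(512))    # 64  -> Tiny
print(visual_tokens(1024))   # 256 -> Base
# 1280 gives 400, not 800: High-Fidelity presumably tiles the input image.
```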

3.3 MoE Style Experts

Each expert within the Mixture-of-Experts layer corresponds to a distinct TikTok style domain (product showcases, daily vlogs, narrative storytelling, educational content). The routing function selects a limited expert subset per sample:

Let \( r_i = f_{\text{MoE}}(T_{\text{vis}}) \) represent the routing scores. We select the top-\( k \) experts:

\[E_{\text{active}} = \text{Top}_k(r_i)\]

The aggregated expert output is computed as:

\[h_{\text{out}} = \sum_{i \in E_{\text{active}}} r_i \cdot f_i(T_{\text{vis}})\]

Selective activation reduces computational requirements and prevents cross-style interference while maintaining stylistic fidelity.
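A minimal sketch of the two routing equations above, with toy scalar functions standing in for the feed-forward expert subnets; the scores and expert definitions are illustrative.

```python
# Top-k expert selection and score-weighted mixing, per the equations above.
# Real experts are FFN subnets; here they are toy scalar functions.

def route_topk(scores, experts, x, k=2):
    # E_active: indices of the k largest routing scores
    active = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # h_out = sum over active experts of r_i * f_i(x)
    return sum(scores[i] * experts[i](x) for i in active)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x - 3, lambda x: x * x]
scores = [0.10, 0.60, 0.05, 0.25]          # router output for one sample
print(route_topk(scores, experts, 4.0))    # 0.6*8 + 0.25*16 = 8.8
```

In training, sparse-MoE systems typically add a load-balancing term so the router does not collapse onto a few experts; that term is omitted here for brevity.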

3.4 Optical Forgetting Mechanism

To control long-term memory expansion and mitigate context redundancy, OCC-TKLLM implements optical forgetting: recent content is encoded at high resolution, while older content undergoes progressive downsampling, with expired content summarized into minimal style tokens.

We model token reduction across historical levels as:

\[N_k = \frac{N_0}{2^k}\]

where \( N_0 \) represents the initial token count and \( k \) indexes the temporal bucket. Because the bucket budgets form a geometric series, total storage stays bounded below \( 2N_0 \), preserving long-term style priors at negligible memory cost.
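The decay schedule can be sketched as below; the floor value, which guarantees that even expired buckets keep a minimal style summary, is our assumption.

```python
# Tokens retained per temporal bucket under N_k = N_0 / 2^k, with a floor so
# expired buckets still keep a minimal style summary (floor value assumed).

def bucket_tokens(n0: int, k: int, floor: int = 8) -> int:
    """Token budget for history bucket k (k = 0 is the most recent)."""
    return max(n0 // (2 ** k), floor)

n0 = 256
budgets = [bucket_tokens(n0, k) for k in range(6)]
print(budgets)       # [256, 128, 64, 32, 16, 8]
print(sum(budgets))  # 504 -- the geometric series keeps totals below 2 * N_0
```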

3.5 Joint OCC + GSPF Fine-Tuning

We jointly fine-tune the optical encoder and LLM using dual-channel gradient updates:

\[ \begin{cases} \nabla_{\theta_{vis}} \leftarrow \lambda_1 \cdot g_{\text{grad}}(\text{OCE}) \\ \nabla_{\theta_{\text{llm}}} \leftarrow \lambda_2 \cdot g_{\text{surgical}}(\text{LLM}) \end{cases} \]

This design accelerates optical encoder adaptation while maintaining style stability in the LLM through gradient-surgical freezing.
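A toy sketch of the dual-channel update: the optical encoder receives plain scaled gradients (\( \lambda_1 \) channel), while LLM parameters are masked by a gradient-magnitude threshold before scaling (\( \lambda_2 \) channel) as a stand-in for GSPF's surgical freezing. The threshold rule and the value of tau are our assumptions; GSPF's actual importance criterion is not specified here.

```python
# Dual-channel SGD-style update. The magnitude threshold tau is an assumed
# proxy for GSPF's gradient-importance criterion.

def dual_channel_step(vis_params, vis_grads, llm_params, llm_grads,
                      lr=3e-5, lam1=1.0, lam2=0.5, tau=0.01):
    # Channel 1: the optical encoder is always updated (scaled by lambda_1)
    new_vis = [p - lr * lam1 * g for p, g in zip(vis_params, vis_grads)]
    # Channel 2: freeze LLM params whose gradient magnitude falls below tau
    new_llm = [p - lr * lam2 * g if abs(g) >= tau else p
               for p, g in zip(llm_params, llm_grads)]
    return new_vis, new_llm

vis, llm = dual_channel_step([1.0], [0.2], [1.0, 1.0], [0.2, 0.001])
print(vis[0] < 1.0)   # True: encoder parameter updated
print(llm[1] == 1.0)  # True: low-importance LLM parameter frozen
```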


4 Experiments

4.1 Experimental Setup

  • Dataset: 500 TikTok creator accounts, comprising 5 million samples
  • Model Architecture: TKLLM-1.3B with OCE and MoE enhancements
  • Baselines: LoRA and GSPF, against which the proposed OCC-TKLLM is compared
  • Evaluation Metrics: GPU Memory utilization, Compression Rate, Viral Caption Score, Style Consistency

4.2 Performance Results

| Method | GPU Memory | Compression Rate | Viral Caption Score | Style Consistency |
|---|---|---|---|---|
| LoRA | 100% | 1.0× | 72.1 | 0.84 |
| GSPF | 85% | 1.0× | 77.3 | 0.89 |
| OCC-TKLLM | 28% | 9.7× | 88.7 | 0.93 |

OCC-TKLLM achieves superior performance while utilizing less than one-third of baseline memory resources.

4.3 Ablation Analysis

| Removed Component | Performance Impact | Key Observation |
|---|---|---|
| Optical Forgetting | +2.4% style drift | Context overflow re-emerges |
| MoE Experts | −5.2% Viral Caption Score | Reduced style generalization capability |

5 Discussion

5.1 Key Insights

  • OCC extends beyond compression—it provides a structured memory abstraction layer that preserves tonal, rhythmic, and emotional cues
  • Optical Forgetting introduces a biologically inspired mechanism that stabilizes long-term style learning
  • The OCC + GSPF integration demonstrates synergistic benefits between gradient-level precision and multimodal efficiency

5.2 Future Research Directions

  • Incorporate audio modality integration through voice tokenization techniques
  • Develop adaptive forgetting weight mechanisms for enhanced temporal control
  • Explore end-to-end multimodal creation models (video frames → captions → audio narration)

6 Conclusion

This paper presents OCC-TKLLM, an innovative fine-tuning framework that adapts DeepSeek-OCR's optical context compression to TikTok-specific model training. The system achieves:

  • 9.7× (≈10×) context compression ratio
  • 72% reduction in GPU memory consumption
  • 14.8% relative improvement in Viral Caption Score

By combining vision-inspired compression, MoE specialization, and GSPF-based gradient optimization, OCC-TKLLM establishes a new benchmark for multimodal, style-aware small model training methodologies.


Appendix

A. Model Architecture Specifications


Figure 1: OCC-TKLLM architectural overview: Input → OCE → MoE → GSPF → LLM decoder

B. Optical Forgetting Visualization


Figure 2: Optical memory decay profile showing exponential retention reduction across temporal buckets

C. Hyperparameter Configuration

| Parameter | Value |
|---|---|
| Learning Rate | 3e-5 |
| Batch Size | 640 |
| Context Length | 8192 |
| Optimizer | AdamW |
| Learning Rate Scheduler | Cosine Annealing |

Keywords: TKLLM, Optical Context Encoder, DeepSeek-OCR, Gradient Surgical Fine-Tuning, TikTok Language Models, Multimodal Compression, Mixture-of-Experts, Content Generation