Integrating Optical Context Compression into TKLLM

A DeepSeek-OCR-Inspired Framework for Efficient Multimodal Fine-Tuning of TikTok-Specific LLMs

(Whitepaper)


Authors

Chengyou Xin, Wenxin Zhang, Yueling Zhang, Chengjia Wu

University of Electronic Science and Technology of China

The Hong Kong University of Science and Technology

Chengdu Banana Intelligent Technology Co., Ltd.

Hong Kong Chongde Industrial Limited

TKLLM Research Team


Abstract

Short-form video content generation faces three fundamental challenges: extended contextual dependencies, significant style variability, and multimodal representation misalignment. TKLLM, as a leading TikTok-specific small language model platform, targets high-quality, style-consistent video script generation from limited creator data. Traditional fine-tuning approaches, however, continue to struggle with excessive memory consumption and fragmented long-term contextual retention.

This paper introduces OCC-TKLLM, a novel fine-tuning framework inspired by DeepSeek-OCR's Optical Context Compression (OCC) methodology. By integrating an Optical Context Encoder (OCE), Mixture-of-Experts (MoE) Style Module, and Optical Forgetting Mechanism with Gradient Surgical Fine-Tuning (GSPF), our framework achieves efficient multimodal alignment and dynamic memory management. Experimental results demonstrate that OCC-TKLLM reduces GPU memory consumption by 72% while enhancing generation quality by 14.8% on the TikTok Viral Caption Benchmark. This work establishes a new paradigm for multimodal fine-tuning in content-specific language models.


1 Introduction

1.1 Background and Motivation

Within TikTok's creative ecosystem, AI models are expected to fulfill several critical functions:

  • Learn and emulate account-specific tones and rhythmic patterns
  • Integrate complementary information from textual, visual, and auditory modalities
  • Maintain stylistic coherence across multiple video sequences
  • Operate efficiently within constrained GPU memory environments

Conventional LLM fine-tuning methodologies—including LoRA, QLoRA, and adapter-based approaches—encounter significant limitations:

  1. Context Explosion: Extended TikTok scripts, captions, and comment threads frequently exceed 4,000 tokens
  2. Style Fragmentation: Substantial variations in tonal delivery, humor styles, and pacing between content creators
  3. Modality Isolation: Disjointed representations of video thumbnails, subtitles, and audio characteristics

These challenges complicate small-scale model fine-tuning while preserving both coherence and computational efficiency.

1.2 Inspiration from DeepSeek-OCR

DeepSeek-OCR introduced Optical Context Compression (OCC), a paradigm in which extended textual contexts are rendered and encoded as visual embeddings, achieving roughly 10× compression with near-lossless decoding; accuracy degrades as the ratio approaches 20×. This approach raises a pivotal question for TKLLM development:

Can TikTok-style scripts and contextual data be visually represented and compressed prior to fine-tuning, thereby preserving stylistic memory while reducing computational overhead?

1.3 Contributions

We propose the OCC-TKLLM framework, which introduces four key innovations:

  1. Optical Context Encoder (OCE): Transforms extended textual contexts (scripts and comments) into compressed visual token representations
  2. MoE Style Experts: Specializes distinct model subnets for specific TikTok content domains (educational, lifestyle, narrative, etc.)
  3. Optical Forgetting Mechanism: Implements controlled degradation of historical information through visual downsampling techniques
  4. Joint OCC + GSPF Fine-Tuning: Combines dynamic gradient freezing with visual compression for efficient multimodal training

2 Related Work

2.1 DeepSeek-OCR and Optical Compression

DeepSeek-OCR demonstrated that extended text sequences can be efficiently represented as visual embeddings:

  • Encoder Architecture: SAM for local perception, a 16× convolutional compressor, then CLIP for global semantics
  • Decoder Framework: DeepSeek-3B-MoE architecture
  • Compression Performance: ≈10× reduction at roughly 97% OCR decoding precision, degrading at higher ratios

This paradigm inspires TKLLM to employ visual tokens as compact memory representations for long-context modeling.

2.2 TKLLM and Gradient Surgical Fine-Tuning (GSPF)

TKLLM utilizes Gradient Surgical Fine-Tuning (GSPF) to dynamically freeze parameters based on gradient importance, achieving high sample efficiency for style imitation. However, in multi-video learning scenarios, GSPF alone cannot resolve context redundancy or memory expansion issues—motivating integration with OCC.

2.3 Mixture-of-Experts for Multistyle Modeling

Sparse Mixture-of-Experts (MoE) architectures, exemplified by DeepSeek-3B-MoE, enable scalable multi-domain specialization through selective expert activation per sample. TKLLM leverages this principle for style specialization, routing different TikTok content genres through distinct expert pathways.


3 Methodology

3.1 Overall Architecture

The proposed OCC-TKLLM framework comprises five integrated modules:

  1. Input Layer: Aggregates multimodal context (scripts, comments, thumbnails)
  2. Optical Context Encoder (OCE): Compresses textual context into visual embeddings
  3. MoE Style Experts: Manages diverse content style variations
  4. GSPF Adaptive Layer: Dynamically controls parameter freezing
  5. LLM Decoder: Generates final TikTok scripts and captions

The mathematical formulation is expressed as:

\[Y = f_{\text{LLM}}(f_{\text{MoE}}(f_{\text{OCE}}(I_{\text{ctx}}); \theta_{\text{moe}}); \theta_{\text{llm}})\]

where:

  • \( I_{\text{ctx}} \) represents the rendered context image
  • \( f_{\text{OCE}} \) denotes the optical encoder function
  • \( f_{\text{MoE}} \) signifies the expert routing module
  • \( f_{\text{LLM}} \) corresponds to the main decoder function
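As a minimal sketch, the composition above can be written as three nested callables. The stub bodies below stand in for the real OCE, MoE, and decoder networks; all names, shapes, and return values are illustrative only.

```python
# Sketch of the OCC-TKLLM forward pass as plain function composition,
# mirroring Y = f_LLM(f_MoE(f_OCE(I_ctx))). Stubs only, not real networks.

def f_oce(ctx_image):
    """Optical Context Encoder: rendered context image -> visual tokens (stub)."""
    return [f"vis_{i}" for i in range(4)]

def f_moe(vis_tokens):
    """Expert routing: visual tokens -> style-conditioned hidden state (stub)."""
    return {"tokens": vis_tokens, "style": "lifestyle"}

def f_llm(hidden):
    """Decoder: hidden state -> generated caption (stub)."""
    return f"[{hidden['style']}] caption from {len(hidden['tokens'])} visual tokens"

def occ_tkllm(ctx_image):
    return f_llm(f_moe(f_oce(ctx_image)))

print(occ_tkllm("rendered_context.png"))
# [lifestyle] caption from 4 visual tokens
```

In the full system each stage carries its own parameters (\( \theta_{\text{moe}} \), \( \theta_{\text{llm}} \)); the composition order, however, is exactly as shown.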

3.2 Optical Context Encoder (OCE)

3.2.1 Encoding Process

The Optical Context Encoder transforms rendered context images (scripts, comments, and thumbnails laid out on a single canvas) into semantically aligned visual embeddings. The process employs a visual backbone for dense representation extraction, followed by projection into a multimodal token space:

\[T_{\text{vis}} = \text{CLIP}(\text{Conv}_{16\times}(\text{SAM}(I_{\text{ctx}})))\]

\( T_{\text{vis}} \) comprises 64–256 visual tokens, each summarizing many underlying text tokens. This architecture enables TKLLM to preserve scene-level context—including lighting, tonal qualities, and compositional elements—while minimizing token redundancy.

3.2.2 Contrastive Alignment

To ensure semantic consistency between textual and optical modalities, we minimize an alignment objective over matched text–video pairs (the positive-pair term of a contrastive loss):

\[\mathcal{L}_{\text{align}} = 1 - \cos(T_{\text{vis}}, E_{\text{text}})\]

where \( E_{\text{text}} \) represents the text encoder output. Minimizing \(\mathcal{L}_{\text{align}}\) promotes high cosine similarity between corresponding text-video pairs, enforcing multimodal coherence within consistent TikTok style domains.
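A minimal sketch of this loss for one pooled embedding pair, in pure Python for clarity; in practice both inputs would be framework tensors and the loss would be averaged over a batch.

```python
import math

def cosine_alignment_loss(t_vis, e_text):
    """L_align = 1 - cos(T_vis, E_text) for a single pooled embedding pair."""
    dot = sum(a * b for a, b in zip(t_vis, e_text))
    norm = (math.sqrt(sum(a * a for a in t_vis))
            * math.sqrt(sum(b * b for b in e_text)))
    return 1.0 - dot / norm

print(cosine_alignment_loss([1.0, 0.0], [1.0, 0.0]))  # 0.0: perfectly aligned
print(cosine_alignment_loss([1.0, 0.0], [0.0, 1.0]))  # 1.0: orthogonal
```

The loss is 0 when the optical and text embeddings point in the same direction and grows as they diverge, which is what drives the two modalities toward a shared style space.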

3.2.3 Resolution Modes

The encoder supports multiple configurations balancing fidelity and computational requirements:

| Mode | Input Resolution | Visual Tokens | Primary Use Case |
|---|---|---|---|
| Tiny | 512×512 | 64 | Single video script processing |
| Base | 1024×1024 | 256 | Multi-script contextual fusion |
| High-Fidelity | 1280×1280 | 800 | Comprehensive account history analysis |

Higher resolution modes preserve fine-grained stylistic cues, while lower modes enable accelerated inference.
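The Tiny and Base token counts are consistent with a ViT-style patch grid followed by the 16× convolutional compressor. The patch size of 16 is our assumption (it is not stated above), and the High-Fidelity mode's 800 tokens does not fit this single-pass formula, which suggests a tiled multi-view encoding for that mode.

```python
# Token count from a single-pass encode: patch grid, then 16x conv compression.
# PATCH = 16 is an assumed ViT patch size, not specified in the paper.
PATCH = 16
DOWNSAMPLE = 16

def visual_tokens(resolution: int) -> int:
    patches = (resolution // PATCH) ** 2   # e.g. 1024 -> a 64x64 patch grid
    return patches // DOWNSAMPLE           # 16x token reduction

print(visual_tokens(512))    # 64  -> Tiny
print(visual_tokens(1024))   # 256 -> Base
# 1280 gives 400, not 800: High-Fidelity presumably tiles the input image.
```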

3.3 MoE Style Experts

Each expert within the Mixture-of-Experts layer corresponds to a distinct TikTok style domain (product showcases, daily vlogs, narrative storytelling, educational content). The routing function selects a limited expert subset per sample:

Let \( r_i = f_{\text{MoE}}(T_{\text{vis}}) \) represent the routing scores. We select the top-\( k \) experts:

\[E_{\text{active}} = \text{Top}_k(r_i)\]

The aggregated expert output is computed as:

\[h_{\text{out}} = \sum_{i \in E_{\text{active}}} r_i \cdot f_i(T_{\text{vis}})\]

Selective activation reduces computational requirements and prevents cross-style interference while maintaining stylistic fidelity.
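A minimal sketch of the two routing equations above, with toy scalar functions standing in for the feed-forward expert subnets; the scores and expert definitions are illustrative.

```python
# Top-k expert selection and score-weighted mixing, per the equations above.
# Real experts are FFN subnets; here they are toy scalar functions.

def route_topk(scores, experts, x, k=2):
    # E_active: indices of the k largest routing scores
    active = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # h_out = sum over active experts of r_i * f_i(x)
    return sum(scores[i] * experts[i](x) for i in active)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x - 3, lambda x: x * x]
scores = [0.10, 0.60, 0.05, 0.25]          # router output for one sample
print(route_topk(scores, experts, 4.0))    # 0.6*8 + 0.25*16 = 8.8
```

In training, sparse-MoE systems typically add a load-balancing term so the router does not collapse onto a few experts; that term is omitted here for brevity.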

3.4 Optical Forgetting Mechanism

To control long-term memory expansion and mitigate context redundancy, OCC-TKLLM implements optical forgetting: recent content is encoded at high resolution, while older content undergoes progressive downsampling, with expired content summarized into minimal style tokens.

We model token reduction across historical levels as:

\[N_k = \frac{N_0}{2^k}\]

where \( N_0 \) represents the initial token count and \( k \) indexes the temporal bucket. Because the bucket budgets form a geometric series, total storage stays bounded below \( 2N_0 \), preserving long-term style priors at negligible memory cost.
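The decay schedule can be sketched as below; the floor value, which guarantees that even expired buckets keep a minimal style summary, is our assumption.

```python
# Tokens retained per temporal bucket under N_k = N_0 / 2^k, with a floor so
# expired buckets still keep a minimal style summary (floor value assumed).

def bucket_tokens(n0: int, k: int, floor: int = 8) -> int:
    """Token budget for history bucket k (k = 0 is the most recent)."""
    return max(n0 // (2 ** k), floor)

n0 = 256
budgets = [bucket_tokens(n0, k) for k in range(6)]
print(budgets)       # [256, 128, 64, 32, 16, 8]
print(sum(budgets))  # 504 -- the geometric series keeps totals below 2 * N_0
```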

3.5 Joint OCC + GSPF Fine-Tuning

We jointly fine-tune the optical encoder and LLM using dual-channel gradient updates:

\[ \begin{cases} \nabla_{\theta_{vis}} \leftarrow \lambda_1 \cdot g_{\text{grad}}(\text{OCE}) \\ \nabla_{\theta_{\text{llm}}} \leftarrow \lambda_2 \cdot g_{\text{surgical}}(\text{LLM}) \end{cases} \]

This design accelerates optical encoder adaptation while maintaining style stability in the LLM through gradient-surgical freezing.
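A toy sketch of the dual-channel update: the optical encoder receives plain scaled gradients (\( \lambda_1 \) channel), while LLM parameters are masked by a gradient-magnitude threshold before scaling (\( \lambda_2 \) channel) as a stand-in for GSPF's surgical freezing. The threshold rule and the value of tau are our assumptions; GSPF's actual importance criterion is not specified here.

```python
# Dual-channel SGD-style update. The magnitude threshold tau is an assumed
# proxy for GSPF's gradient-importance criterion.

def dual_channel_step(vis_params, vis_grads, llm_params, llm_grads,
                      lr=3e-5, lam1=1.0, lam2=0.5, tau=0.01):
    # Channel 1: the optical encoder is always updated (scaled by lambda_1)
    new_vis = [p - lr * lam1 * g for p, g in zip(vis_params, vis_grads)]
    # Channel 2: freeze LLM params whose gradient magnitude falls below tau
    new_llm = [p - lr * lam2 * g if abs(g) >= tau else p
               for p, g in zip(llm_params, llm_grads)]
    return new_vis, new_llm

vis, llm = dual_channel_step([1.0], [0.2], [1.0, 1.0], [0.2, 0.001])
print(vis[0] < 1.0)   # True: encoder parameter updated
print(llm[1] == 1.0)  # True: low-importance LLM parameter frozen
```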


4 Experiments

4.1 Experimental Setup

  • Dataset: 500 TikTok creator accounts, comprising 5 million samples
  • Model Architecture: TKLLM-1.3B with OCE and MoE enhancements
  • Baselines: LoRA and GSPF, against which the proposed OCC-TKLLM is compared
  • Evaluation Metrics: GPU Memory utilization, Compression Rate, Viral Caption Score, Style Consistency

4.2 Performance Results

| Method | GPU Memory | Compression Rate | Viral Caption Score | Style Consistency |
|---|---|---|---|---|
| LoRA | 100% | 1.0× | 72.1 | 0.84 |
| GSPF | 85% | 1.0× | 77.3 | 0.89 |
| OCC-TKLLM | 28% | 9.7× | 88.7 | 0.93 |

OCC-TKLLM achieves superior performance while utilizing less than one-third of baseline memory resources.

4.3 Ablation Analysis

| Removed Component | Performance Impact | Key Observation |
|---|---|---|
| Optical Forgetting | +2.4% style drift | Context overflow re-emerges |
| MoE Experts | −5.2% Viral Caption Score | Reduced style generalization capability |

5 Discussion

5.1 Key Insights

  • OCC extends beyond compression—it provides a structured memory abstraction layer that preserves tonal, rhythmic, and emotional cues
  • Optical Forgetting introduces a biologically inspired mechanism that stabilizes long-term style learning
  • The OCC + GSPF integration demonstrates synergistic benefits between gradient-level precision and multimodal efficiency

5.2 Future Research Directions

  • Incorporate audio modality integration through voice tokenization techniques
  • Develop adaptive forgetting weight mechanisms for enhanced temporal control
  • Explore end-to-end multimodal creation models (video frames → captions → audio narration)

6 Conclusion

This paper presents OCC-TKLLM, an innovative fine-tuning framework that adapts DeepSeek-OCR's optical context compression to TikTok-specific model training. The system achieves:

  • 9.7× (≈10×) context compression ratio
  • 72% reduction in GPU memory consumption
  • 14.8% relative improvement in Viral Caption Score

By combining vision-inspired compression, MoE specialization, and GSPF-based gradient optimization, OCC-TKLLM establishes a new benchmark for multimodal, style-aware small model training methodologies.


Appendix

A. Model Architecture Specifications


Figure 1: OCC-TKLLM architectural overview: Input → OCE → MoE → GSPF → LLM decoder

B. Optical Forgetting Visualization


Figure 2: Optical memory decay profile showing exponential retention reduction across temporal buckets

C. Hyperparameter Configuration

| Parameter | Value |
|---|---|
| Learning Rate | 3e-5 |
| Batch Size | 640 |
| Context Length | 8192 |
| Optimizer | AdamW |
| Learning Rate Scheduler | Cosine Annealing |

Keywords: TKLLM, Optical Context Encoder, DeepSeek-OCR, Gradient Surgical Fine-Tuning, TikTok Language Models, Multimodal Compression, Mixture-of-Experts, Content Generation