
🚀 LIFT: Language-Image Alignment with Fixed Text Encoders

UC Berkeley · The University of Hong Kong



The pipeline of LIFT, which adopts a dual-tower architecture similar to CLIP. LIFT uses an LLM-based text encoder \(f^{\text{text}}\) to pre-compute the embedding \(z^T\) for each text sample \(T\). During training, we solely update the image encoder \(f_{\theta}^{\text{img}}\) and the projection head \(f_{\phi}^{\text{head}}\) to align image embeddings with the pre-computed text embeddings by optimizing an alignment objective.
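
To make the training loop concrete, below is a minimal sketch of one LIFT-style update step. It assumes PyTorch, a generic image backbone, and text embeddings pre-computed offline by the frozen LLM; the module names (`image_encoder`, `proj_head`) and the contrastive alignment objective shown here are illustrative choices, not necessarily the exact objective used in the paper.

```python
# Minimal sketch of a LIFT-style training step (assumptions: PyTorch, a generic
# vision backbone, and text embeddings pre-computed offline by the frozen LLM).
import torch
import torch.nn.functional as F

def lift_training_step(image_encoder, proj_head, images, precomputed_text_emb,
                       optimizer, temperature=0.07):
    """One alignment step: only the image tower and projection head are updated."""
    optimizer.zero_grad()

    # Image tower: f_theta^img followed by the projection head f_phi^head.
    img_feat = image_encoder(images)                   # (B, D_img)
    z_img = F.normalize(proj_head(img_feat), dim=-1)   # (B, D_txt)

    # Text embeddings z^T come from the frozen LLM; no gradients flow into them.
    z_txt = F.normalize(precomputed_text_emb, dim=-1).detach()  # (B, D_txt)

    # A contrastive (InfoNCE-style) alignment objective over the batch,
    # shown here as one common choice of alignment loss.
    logits = z_img @ z_txt.t() / temperature           # (B, B)
    labels = torch.arange(images.size(0), device=images.device)
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    loss.backward()
    optimizer.step()
    return loss.item()
```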

Overview

Currently, the dominant approach to establishing language-image alignment is to jointly pre-train text and image encoders, typically from scratch, through contrastive learning, as in CLIP and its variants.

In this work, we question whether such a costly joint training is necessary.

We investigate whether a fixed, pre-trained large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose LIFT, which learns Language-Image alignment with a Fixed Text encoder from an LLM by training only the image encoder.

Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this greatly simplified framework, LIFT, is highly effective and outperforms CLIP in most scenarios that involve:

  1. Compositional information — word order, object-attribute association, object-object relation, etc.
  2. Detailed, long captions, often generated by specially fine-tuned vision-language models.
Our work takes a first step towards systematically exploring how text embeddings from LLMs can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations.



LIFT Encodes Compositional Information Much Better than CLIP

It is well known that CLIP lacks compositional understanding. Prior studies attribute this limitation to the fact that contrastive pre-training on general-purpose retrieval datasets incentivizes CLIP's encoders trained from scratch to adopt a shortcut strategy that suppresses (i.e. discards) features related to compositional information. We test the compositional understanding of LIFT and CLIP using SugarCrepe.

(Table: SugarCrepe compositional understanding results, LIFT vs. CLIP)

As shown in the table, when trained on the short captions from DataComp-1B, LIFT outperforms CLIP on all seven tasks with a 6.8% average accuracy gain; when trained on the long, synthetic captions from Recap-DataComp-1B, it leads on six tasks with a 7.9% gain. In both settings, LIFT achieves significant gains on add attribute, replace attribute, and replace relation tasks. These improvements are strong evidence that \(f^{\text{text}}\)'s auto-regressive training objective avoids the compositional oversight induced by contrastive learning and enables more accurate modeling of object–attribute associations and object–object relations.
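
As a reference for how such a pairwise compositionality check can be scored, here is a minimal sketch. It assumes helper functions `embed_image` and `embed_text` that return L2-normalized embeddings in the shared space; these names are illustrative and not part of any released code.

```python
# Minimal sketch of a SugarCrepe-style compositionality check (assumption:
# `embed_image` and `embed_text` return L2-normalized 1-D embeddings in the
# shared image-text space; both names are illustrative).
import torch

@torch.no_grad()
def prefers_positive(image, positive_caption, hard_negative_caption,
                     embed_image, embed_text):
    """Return True if the image is closer to the positive caption than to the
    hard negative, which differs only in a swapped attribute/relation/object."""
    z_img = embed_image(image)               # (D,)
    z_pos = embed_text(positive_caption)     # (D,)
    z_neg = embed_text(hard_negative_caption)
    return torch.dot(z_img, z_pos) > torch.dot(z_img, z_neg)

# Benchmark accuracy is the fraction of examples where the positive caption
# wins this pairwise comparison.
```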

LIFT's stronger compositional understanding also translates into better performance on downstream tasks for large multimodal models (LMMs). We use LLaVA to train LMMs with either LIFT or CLIP as the vision tower, and observe that LIFT leads CLIP on five out of six tasks when both are trained on short captions, and on all six tasks when trained on long captions.

(Table: LMM downstream task results with LIFT vs. CLIP as the vision tower)

We find the gains come mainly from subtasks that require compositional understanding, such as MMBench's fine-grained perception (single-instance) and relational reasoning. The former involves object localization and attribute recognition, while the latter includes identifying physical relations, both of which benefit from LIFT's accurate encoding of compositional information.



LIFT Learns from Long Captions Much Better than CLIP

Recent studies show that CLIP yields suboptimal zero-shot performance when trained on full-length, long captions (usually synthesized by VLMs), because its text encoder overemphasizes the syntactic similarity introduced by caption generators and fails to attend to semantically meaningful content. The figure below gives three straightforward examples: CLIP's text encoder tends to assign higher similarity scores to syntactically similar but semantically different caption pairs.

(Figure: example caption pairs that are syntactically similar but semantically different)


In contrast, LIFT employs an LLM-based text encoder pre-trained on large-scale data, yielding an embedding space that is more robust to such syntactic homogeneity and better at extracting semantically meaningful features to distinguish captions. As shown in the table, when trained on short, web-scraped captions, CLIP has a slight edge over LIFT on ImageNet-1K zero-shot classification and two image-to-text retrieval tasks. However, LIFT overtakes CLIP on all of these tasks, with an average accuracy gain of 11.0%, when both are trained on long, synthetic captions.

(Table: zero-shot classification and retrieval results, LIFT vs. CLIP)

LIFT Is Much More Efficient in Terms of FLOPs and Memory

Since we do not optimize \(f^{\text{text}}\), the entire text embedding process can be performed offline. Given an average per-batch maximum caption token length \(n\), CLIP's FLOPs and memory footprint scale as \(\mathcal{O}(n^2)\), whereas LIFT achieves \(\mathcal{O}(1)\) amortized complexity. We also quantitatively benchmark CLIP and LIFT on both short (\(n = 77\)) and long (\(n = 128\)) captions. On average, LIFT reduces FLOPs by 25.5% for short captions and 35.7% for long ones, while lowering memory usage by 6.8% and 12.6%, respectively.
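
A minimal sketch of the offline text-embedding pass is shown below. It assumes a generic Hugging Face encoder with mean pooling purely for illustration; the actual LLM text encoder and pooling strategy used by LIFT may differ.

```python
# Minimal sketch of offline text-embedding pre-computation (assumptions: a
# generic Hugging Face model and mean pooling; the model name is illustrative).
# Because this runs once per caption before training, the per-step text cost
# during image-tower training is amortized O(1).
import torch
from transformers import AutoTokenizer, AutoModel

@torch.no_grad()
def precompute_text_embeddings(captions,
                               model_name="sentence-transformers/all-MiniLM-L6-v2",
                               batch_size=256, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()

    all_emb = []
    for i in range(0, len(captions), batch_size):
        batch = captions[i:i + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True,
                        return_tensors="pt").to(device)
        hidden = model(**enc).last_hidden_state             # (B, n, D)
        mask = enc["attention_mask"].unsqueeze(-1)           # (B, n, 1)
        emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling
        all_emb.append(emb.cpu())
    # Stored on disk once and reused for every training epoch.
    return torch.cat(all_emb)
```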

(Table: FLOPs and memory comparison, LIFT vs. CLIP)

BibTeX

If you find our work inspiring, please consider giving a citation!

@misc{yang2025languageimagealignmentfixedtext,
      title={Language-Image Alignment with Fixed Text Encoders}, 
      author={Jingfeng Yang and Ziyang Wu and Yue Zhao and Yi Ma},
      year={2025},
      eprint={2506.04209},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.04209}, 
}