CLIP-fine-tune-registers-gated

CLIP Needs Registers. And Gated MLPs. And +20M params. πŸ§‘β€πŸ’»πŸ€–

Fixing CLIP's modality gap via happy little accidents.


Update 19/MAR/2025:

  • Added a fusion gate inspector (runs on a subset of an image dataset)
  • Compares gating towards the REG vs. the CLS token across layers (see the sketch below)
  • Usual syntax; use --deterministic for the same choice of images across runs
  • Example output: (image: visualize-gates)
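
For intuition, here is a hypothetical sketch of the kind of measurement the inspector makes. The attribute `model.visual.gate_mlps`, the token order [CLS, 4 x REG, patches], and the pre-loaded `model` / `images` are all assumptions here, not the repo's actual API:

```python
# Hypothetical sketch (NOT the repo's actual API): compare the mean gate
# activation at the CLS position vs. the 4 register positions, per layer.
# Assumes a loaded REG-XGATED model, a preprocessed image batch `images`,
# token order [CLS, 4 x REG, 256 patches], and an (assumed) per-layer
# gating-MLP list `model.visual.gate_mlps`.
import torch

gate_acts = {}  # layer index -> gate output, shape [tokens, batch, dim]

def make_hook(idx):
    def hook(module, inputs, output):
        gate_acts[idx] = output.detach()
    return hook

for i, gate in enumerate(model.visual.gate_mlps):  # assumed attribute
    gate.register_forward_hook(make_hook(i))

with torch.no_grad():
    model.encode_image(images)

for i in sorted(gate_acts):
    g = gate_acts[i].float().mean(dim=(1, 2))  # average over batch & channels
    print(f"layer {i:2d} | CLS gate: {g[0]:.4f} | REG gate (mean): {g[1:5].mean():.4f}")
```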

Update 14/MAR/2025:

  • Added feature (activation max) visualization!
  • --use_model by default expects models/ViT-L-14-REG-GATED-balanced-ckpt12.safetensors
  • You can specify layers & features as a range or as discrete values, for example:
python REG-12-XGATED-featureviz-fusion-mlps.py --layer_range 8-11 --feature_range 42,1000,77
  • That visualizes features 42, 1000, and 77 on each of layers 8, 9, 10, and 11.

  • If you exceed the valid range for layers or features, you'll get an IndexError.

  • Read the green text when you run the script to see the valid range! :)

  • Interesting observations: MLP Fusion Gate features are either sharp or dead (thanks, ReLU...). But if not dead, they're super intricate and detailed, no matter which layer.

  • On the other hand, early layers (resblocks) in the ViT encode simple structures, lines, zigzags... Then more complex textures:

(image: layers-example)

python REG-12-XGATED-featureviz-normal-mlps.py --layer_range 1-23 --feature_range 42,100,1000

(image: complexity-chaos-REG-CLIP)
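
For reference, a minimal activation-maximization sketch of what these feature-viz scripts do at their core (the repo scripts do more; the layer/feature indices, step count, and learning rate here are illustrative). It optimizes a raw input image so one MLP feature in one resblock fires maximally, using a plain OpenAI 'import clip' ViT-L/14:

```python
# Minimal activation-max sketch for an OpenAI-style CLIP ('import clip').
# Layer/feature choice, steps, and learning rate are illustrative only.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)
model = model.float()  # fp32 for stable gradients
layer, feature = 10, 42

acts = {}
def hook(module, inputs, output):
    acts["a"] = output  # resblocks run in LND format: [tokens, batch, 4096]

model.visual.transformer.resblocks[layer].mlp.c_fc.register_forward_hook(hook)

img = torch.randn(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
for step in range(200):
    opt.zero_grad()
    model.encode_image(img)
    loss = -acts["a"][:, :, feature].mean()  # maximize the chosen feature
    loss.backward()
    opt.step()
# `img` now approximates what feature 42 in layer 10 responds to.
```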


Update 11/MAR/2025:

  • Added Long-CLIP (248 tokens) version! πŸŽ‰
  • Same syntax, except prepend 'long' to all the script names. :)
  • Download the original model to fine-tune it (I have a .safetensors version so you don't need to load a danger-pickle!), or download my already fine-tuned models (including Text-Encoder-Only version for t2i / t2v / gen-AI):
  • πŸ‘‰ huggingface.co/zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14/

(image: modality-gap-before-after)

1st commit: 09/MAR/2025


  • This was initially an attempt to implement the paper Vision Transformers Need Registers...
  • ...by just fine-tuning a pre-trained model (yes, a pretty bold (or crazy) idea! 🤣).
  • TL;DR: CLIP hoards global information in local vision (image) patches -> the known phenomenon of misleading heatmaps.
  • Such 'register token' patches of global information are easy to identify: their norm is >>100, while a normal local patch stays <80, typically ~50 (see the sketch below).

(image: example-outliers)
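
A minimal sketch of how such outliers can be spotted, using a plain OpenAI 'import clip' ViT-L/14 (the >100 threshold is from the observation above; hooking the last resblock is my illustrative choice):

```python
# Sketch: flag 'register-like' outlier patches by their token norm.
# Uses OpenAI 'import clip'; hooking the last resblock is an illustrative
# choice. Token order for ViT-L/14 at 224 px: [CLS, 256 patches].
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

tokens = {}
model.visual.transformer.resblocks[-1].register_forward_hook(
    lambda m, i, o: tokens.__setitem__("x", o.detach())
)

image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)
with torch.no_grad():
    model.encode_image(image)

x = tokens["x"].permute(1, 0, 2).float()  # LND -> NLD
norms = x[0, 1:].norm(dim=-1)             # per-patch L2 norm, CLS excluded
print("outlier patches (norm > 100):", (norms > 100).sum().item())
print("median patch norm:", norms.median().item())
```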

I just want a new Text Encoder... ✨

  • ...for my Text-to-Image (Text-to-Video) AI!
  • Direct download for best / balanced model: click here
  • Enjoy! (You don't need to do anything else, they're just normal CLIP Text Encoders!)

Now, about the full model; ViT especially. πŸ”

  • After initial failures (patch norms fixed ✅, but zero-shot accuracy 84.5% -> ~70% ❌ == ruined model):
  • Added MLP gates with ReLU to the ViT resblocks. This exacerbated the patch norm outliers. 🧐
  • But: CLIP learned to steer its obsessive hoarding of global information! 🤩
  • Result: Modality Gap (Euclidean): (OpenAI pre-trained): 0.8276 --> (THIS): 0.4740 👈🤯
  • While also: zero-shot, retrieval, ... outperform the original model across the board. ✅
  • (Exception: a minor reduction in linear probe accuracy for some datasets)
  • See the 'evals_benchmarks_results' folder for CLIP_Benchmark (LAION) results & the benchmark code included here.
  • Summary of what changed in the ViT (illustrated by the sketch below):
OpenAI pre-trained: 	Total Parameters: 427,616,513

REG-X-GATED: 		Total Parameters: 452,815,117
|--> + 4 Register Tokens, visual.positional_embedding.shape[0]: 257 -> 261
|--> + MLP with ReLU for every layer (gating) + Final Fusion MLP
|--> + Only during fine-tuning: Geometric Parametrization

# See 'TRAINgmpCLIPregXGATED/model.py' for all details.
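
To make the above concrete, an illustrative sketch of the idea. The exact gating form, hidden sizes, and init are assumptions on my part; 'TRAINgmpCLIPregXGATED/model.py' is the authoritative implementation:

```python
# Illustrative sketch only - see TRAINgmpCLIPregXGATED/model.py for the
# real implementation. A resblock with an extra ReLU-MLP gate multiplying
# the MLP branch, plus 4 learnable register tokens (positional embedding
# grows 257 -> 261 for ViT-L/14).
import torch
import torch.nn as nn

class GatedResblock(nn.Module):
    def __init__(self, d_model: int = 1024, n_head: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head)
        self.ln_1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln_2 = nn.LayerNorm(d_model)
        # The extra per-layer gating MLP with ReLU (part of the +20M params);
        # its exact shape and where it multiplies in are assumptions here.
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())

    def forward(self, x):  # x: [tokens, batch, d_model]
        y = self.ln_1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.gate(x) * self.mlp(self.ln_2(x))  # gated MLP branch
        return x

# 4 appended register tokens and the enlarged positional embedding:
registers = nn.Parameter(torch.randn(4, 1024) * 0.02)
pos_embed = nn.Parameter(torch.randn(261, 1024) * 0.02)  # was [257, 1024]
```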

(image: REG-8-tsne)

I want to play with the new CLIP on the block! πŸ₯³

  • Grab the full models (not the Text-Encoder-only version) on my HuggingFace
  • The safetensors contain the standard 'import clip' model structure; that's just so you don't need to load a danger-pickle. =)
  • Recommended examples:

Gradient Ascent:

  • Assuming you saved the model to a 'models' subfolder in the cloned repo:
python REG-3-XGATED-gradient-ascent.py --no_adjust --use_model models/ViT-L-14-REG-GATED-balanced-ckpt12.safetensors --deterministic
  • --deterministic for reproducible results / comparison between models.
  • --no_adjust because it's a weird softmax expansion that doesn't work too well as of yet. :)
  • --use_image path/to/image.png to use your own image. Else uses default included example.

  • The 'OAI' versions of (ANY!) code are for original OpenAI / CLIP models ('import clip').
  • If --use_model is not provided, defaults to 'ViT-L/14'.
  • Of course you can use custom models as well. For example, directly download my ViT-L/14-GmP, then:
python REG-3-OAI-CLIP-gradient-ascent.py --no_adjust --use_model models/ViT-L-14-GmP-ft.safetensors --deterministic
  • ⚠️ Just be sure to load 'normal' models with the 'OAI' code and register-gated models with the 'REG' code. Otherwise it throws an error at you. :)
  • Example benefit of the low modality gap during gradient ascent: look at that loss! (image: REG-3-gradient-ascent)
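
For intuition, a hedged sketch of the basic technique (the repo scripts are more elaborate; the 8-token length, learning rate, and step count here are illustrative): optimize soft text-token embeddings to maximize cosine similarity with an image embedding, then read off the nearest vocabulary tokens as CLIP's "opinion":

```python
# Hedged sketch of CLIP gradient ascent on soft text tokens, using a plain
# OpenAI 'import clip' ViT-L/14. Token count, lr, and steps are illustrative.
import torch
import clip
from PIL import Image
from clip.simple_tokenizer import SimpleTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
model = model.float()

image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)
with torch.no_grad():
    img = model.encode_image(image)
    img = img / img.norm(dim=-1, keepdim=True)

emb = model.token_embedding.weight                  # [49408, 768]
soft = torch.randn(1, 8, emb.shape[1], device=device) * 0.02
soft.requires_grad_(True)
opt = torch.optim.Adam([soft], lr=0.05)

def encode_soft(tokens):
    # Build a 77-token sequence: [SOT, soft tokens, EOT, padding].
    n = tokens.shape[1]
    x = torch.cat([emb[49406].view(1, 1, -1), tokens,
                   emb[49407].view(1, 1, -1),
                   emb[0].view(1, 1, -1).expand(1, 77 - n - 2, -1)], dim=1)
    x = x + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    return x[:, n + 1] @ model.text_projection      # take the EOT position

for step in range(300):
    opt.zero_grad()
    txt = encode_soft(soft)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    loss = -(img * txt).sum()                       # maximize cosine similarity
    loss.backward()
    opt.step()

ids = (soft[0] @ emb.T).argmax(dim=-1)              # nearest real tokens
print(SimpleTokenizer().decode(ids.tolist()))       # CLIP's 'opinion'
```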

Attention Heatmaps:

  • The 'EXACT' variants create large images with exact (patch / square) attention heatmaps, if you need them. They take longer.
  • Recommended (the fast and more visually aesthetic version):
python REG-5-XGATED-visualize-attention-heatmaps.py --use_model models/ViT-L-14-REG-GATED-balanced-ckpt12.safetensors
  • You can also specify --token_folder and --image_folder. The format is "image.png" -> "tokens_image.txt", with a space as the separator. Check EX-tokens-vis and EX-image-vis for the default examples!
  • Or just use the above-mentioned gradient ascent to get a CLIP opinion (texts) about your own images; they'll be saved to a 'more-tokens' subfolder!
  • Batch processing only -> put your image(s) into a folder and use that as --image_folder path/to/myimages, plus --token_folder more-tokens after getting the CLIP opinions.
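
For intuition, a minimal CLS-attention heatmap sketch on a plain OpenAI 'import clip' ViT-L/14 (the repo scripts are more elaborate; recomputing the last resblock's attention weights inside a hook is my workaround, since CLIP itself discards them):

```python
# Minimal CLS-attention heatmap sketch for OpenAI 'import clip' ViT-L/14.
# The hook re-runs the last resblock's attention with need_weights=True.
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

attn = {}
def hook(module, inputs, output):
    y = module.ln_1(inputs[0])                      # pre-attention LayerNorm
    _, w = module.attn(y, y, y, need_weights=True)  # weights, heads averaged
    attn["w"] = w.detach()

model.visual.transformer.resblocks[-1].register_forward_hook(hook)

image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)
with torch.no_grad():
    model.encode_image(image)

cls_map = attn["w"][0, 0, 1:].reshape(16, 16).float()  # CLS -> 256 patches
heat = F.interpolate(cls_map[None, None], size=(224, 224), mode="bicubic")[0, 0]
# 'heat' can now be overlaid on the 224x224 preprocessed image.
```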

The same syntax applies to the other scripts. Please check the code for details - it's well-documented (I hope)! 🤗

  • Attention heatmap, OpenAI ViT-L/14 pre-trained: (image: openai-coffee)

  • Attention heatmap, REG-XGATED fine-tune: (image: x-reg-coffee)

Selected visual examples (as an image is worth 16x16 words):

Evals

  • Please see the respective code for details, and the table below or the 'evals_benchmarks_results' folder for the results.

I want to fine-tune my own mutant REG-XGATED CLIP. πŸ€“

  • That means you probably already know what you're doing. πŸ™ƒ
  • Run REG-0-register-token-init-kit.py on a large dataset (of images only).
  • This obtains 'natural', self-emergent CLIP register tokens as the init for the +4 appended, trainable registers.
  • Then run REG-1 (fine-tune) and REG-2 (convert the Geometric Parametrization .theta / .r back to .weight; see the sketch below).
  • Please check the (extensive!) comments inside the code for details!
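
A hedged sketch of what the REG-2 step undoes. This assumes GmP stores each weight matrix as a per-row magnitude .r and a direction .theta, recombined as r · theta/||theta||; check TRAINgmpCLIPregXGATED/model.py for the authoritative definition:

```python
# Hedged sketch of the GmP -> .weight fold-back (REG-2). The decomposition
# weight = r * theta / ||theta|| (per output row) is an assumption here;
# TRAINgmpCLIPregXGATED/model.py is authoritative.
import torch

def gmp_to_weight(theta: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    direction = theta / theta.norm(dim=1, keepdim=True)  # unit rows
    return r.unsqueeze(1) * direction                    # restore magnitudes

# Illustrative shapes for a ViT-L MLP c_fc layer:
theta = torch.randn(4096, 1024)
r = torch.randn(4096).abs()
weight = gmp_to_weight(theta, r)  # ready to be saved back as '.weight'
```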

Text-To-Image, Flux.1-dev, pure CLIP guidance (no T5)

  • See examples in the ComfyUI workflows folder!

(image: clip-vs-reg-example-flux)

Model Performance Overview

| Task / Dataset | Metric | ViT-L/14 OpenAI (Pre-trained) | X-GATED (ckpt20 xtreme) | X-GATED (ckpt12 balanced) | X-GATED (ckpt12 balanced, ablated) |
|---|---|---|---|---|---|
| VOC-2007 (Multilabel) | mAP | 0.7615 | 0.8140 | **0.8471** | 0.8247 |
| MSCOCO Retrieval | Image Recall@5 | 0.2194 | **0.3565** | 0.3532 | 0.3349 |
| | Text Recall@5 | 0.3034 | **0.5425** | 0.5278 | 0.5086 |
| Linear Probe CIFAR-10 | Acc@1 | 0.9535 | **0.9813** | **0.9813** | 0.9811 |
| | Acc@5 | 0.9966 | **0.9997** | **0.9997** | **0.9997** |
| | Mean Class Recall | 0.9535 | **0.9813** | **0.9813** | 0.9811 |
| MVT ImageNet/ObjectNet (Zero-Shot) | Accuracy | 0.8453 | 0.8686 | **0.8830** | 0.8815 |
| Linear Probe ILSVRC2012 | Top-1 | **69.86%** | 66.43% | 67.10% | 68.99% |
| | Top-5 | **92.70%** | 91.52% | 91.83% | 92.64% |
| Modality Gap Metrics | Euclidean Gap ↓ | 0.8276 | **0.4740** | 0.5395 | 0.7486 |
| | JSD ↓ | 0.5200 | 0.1601 | **0.1303** | 0.3310 |
| | Wasserstein Distance ↓ | 0.4084 | **0.1742** | 0.2102 | 0.3262 |
| | Img-Text Cos Sim (mean) ↑ | 0.2723 | **0.4926** | 0.4794 | 0.3634 |
| | Img-Text Cos Sim (std) | 0.0362 | 0.0814 | 0.0758 | 0.0537 |
| | Text-Text Cos Sim (mean) | 0.6807 | 0.6657 | 0.6896 | 0.6896 |
| | Text-Text Cos Sim (std) | 0.1344 | 0.1671 | 0.1535 | 0.1535 |

Bolded values represent the best performance for each metric (bolding is omitted where neither direction is clearly better, e.g. the std rows).
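
For reference, a short sketch of how a Euclidean modality gap like the one above can be computed, assuming the common definition as the distance between the centroids of L2-normalized image and text embeddings (the repo's eval code may differ in detail):

```python
# Sketch: Euclidean modality gap as the distance between the centroids of
# L2-normalized image and text embeddings from a paired dataset. Assumed
# definition; the repo's eval scripts are authoritative for the table above.
import torch

def euclidean_modality_gap(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    img = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img.mean(0) - txt.mean(0)).norm().item()
```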
