-
Hmm, can you help me understand the use case where you would not be able to give a prompt? The prompts used are not equivalent to those of LLMs; the model doesn't know the labels, and the prompt is typically information you already have when you apply the model. Furthermore, only a few models use prompts, e.g. for the Jina model the prompt is simply 'classification', 'clustering' and similar. We have instruct labels which we could add to the leaderboard to make this clearer. It is also possible to adapt the model implementation to run an experiment examining the effect of the prompt.
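For example, a minimal sketch of such an experiment, assuming the `mteb` Python API together with `sentence-transformers` (the task and model names here are purely illustrative, and whether a given model wrapper honors the default prompt is model-specific):

```python
# Sketch: evaluate the same model under different prompts to measure
# the effect of the instruct on a classification task.
import mteb
from sentence_transformers import SentenceTransformer

for i, prompt in enumerate(["", "classification", "Classify the following text: "]):
    # The prompt is prepended to every input via sentence-transformers'
    # prompt mechanism (assumption: the MTEB wrapper does not override it).
    model = SentenceTransformer(
        "intfloat/multilingual-e5-small",
        prompts={"default": prompt},
        default_prompt_name="default",
    )
    tasks = mteb.get_tasks(tasks=["Banking77Classification"])
    mteb.MTEB(tasks=tasks).run(model, output_folder=f"results/prompt_{i}")
```

Comparing the scores across the three runs would isolate how much of the classification accuracy comes from the instruct itself.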
-
Thanks Kenneth for your fast replies. Indeed, sorting models by performance on the MultiLabelClassification tasks is what I'm looking for. Looking at the MultiLabelClassification implementation (AbsTaskMultilabelClassification?), I'm not sure I understand what it is doing in the context of the MultiEURLEXMultilabelClassification dataset: how many labels are we talking about here? It seems we limit the classifier to n_neighbors=5; also, using a KNeighborsClassifier doesn't seem aligned with OpenAI's recommendation to rely on a RandomForestClassifier when using the embedding model as a free-text feature encoder. My expectation would be that a KNeighborsClassifier will indeed perform quite poorly on such embeddings, since nothing suggests that the labels will be isolated in convex regions of the embedding space.
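As a sanity check, this is roughly how I'd compare the two classifiers on precomputed embeddings (a sketch with stand-in data; `X` and `y` are illustrative placeholders, not the MTEB pipeline; the real task is multilabel):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384))    # stand-in for sentence embeddings
y = rng.integers(0, 21, size=1000)  # stand-in labels

for clf in (KNeighborsClassifier(n_neighbors=5),
            RandomForestClassifier(n_estimators=200, random_state=0)):
    scores = cross_val_score(clf, X, y, cv=3, scoring="f1_macro")
    print(f"{type(clf).__name__}: {scores.mean():.3f}")
```

Swapping in real embeddings and labels would show whether the k-NN assumption of locally clustered classes actually holds.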
Thanks, I'll try!
-
Hi,
I assume I'm not the only one who used this benchmark with a goal in mind: finding the best open-weight alternative to the GCP (or OpenAI/Mistral) embedding APIs. Those APIs are really useful as general free-text feature encoders within a machine learning model. On GCP, you just specify whether you want an embedding optimized for classification, regression, etc., and the resulting embedding will be useful as input for several machine learning tasks. You pay once and can reuse the embedding several times for several goals.
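For context, this is roughly what that looks like with the Vertex AI SDK (a sketch; the model name and `task_type` values follow the documented text-embedding models, adjust to your project):

```python
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

# Request an embedding optimized for a downstream classifier.
model = TextEmbeddingModel.from_pretrained("text-embedding-004")
inputs = [TextEmbeddingInput("Some document text ...", task_type="CLASSIFICATION")]
vector = model.get_embeddings(inputs)[0].values  # reusable as ML features
```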
Looking at the leaderboard makes it really easy to identify the best models, and deploying an embedding API is a breeze with Text Embeddings Inference. gte-Qwen2-1.5B-instruct seems like a nice tradeoff for my specific use case. Now, the only thing that remains is to find the best instruct for classification!
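Once TEI serves the model, querying it is trivial; a sketch assuming a local container on port 8080 (the `/embed` endpoint shape is per the TEI docs, and the instruct format here is an assumption based on the model card convention):

```python
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "Instruct: Classify the text.\nQuery: some document"},
)
embedding = resp.json()[0]  # one embedding vector (list of floats)
```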
And here is the catch: looking at the instructs used in the benchmark, I was not able to find a useful one.
To my surprise, for classification tasks, the eval seems to rely on the instruct to actually prompt the embedding model into doing the classification; the logistic regression or k-nearest-neighbors classifier then simply finds the subpart of the embedding space that captures the meaning of the classes defined in the prompt.
This means that the benchmark is actually evaluating the capacity of each model to do classification, and not the ability of the model to encode text that can then be used as input by a classifier, as documented by OpenAI here.
I believe it would make sense to add a new task that focuses on general free-text feature encoding: something that relies on instructs that make no assumption about the underlying classification or regression task, which would be handled by a RandomForest or XGBoost.
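Concretely, such a task could look like this sketch (the generic prompt and the `load_my_dataset` helper are hypothetical, not existing MTEB code):

```python
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A task-agnostic instruct: it says nothing about the labels or the task.
GENERIC_PROMPT = "Represent the text for downstream machine learning: "  # assumption

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct")
texts, labels = load_my_dataset()  # hypothetical helper returning texts and labels
X = model.encode([GENERIC_PROMPT + t for t in texts])
print(cross_val_score(RandomForestClassifier(), X, labels, cv=5).mean())
```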
Regards,
Thomas