-
Hmm, can you help me understand the use case where you would not be able to give a prompt? The prompts used are not equivalent to those of LLMs; the model doesn't know the labels, and the prompt is typically information you already have when you apply the model. Furthermore, only a few models use prompts, e.g. for the Jina model the prompt is simply 'classification', 'clustering' and similar. We have instruct labels which we could add to the leaderboard to make this clearer. It is also possible to adapt the model implementation to run an experiment examining the effect of the prompt.
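For example, a minimal sketch of such an experiment, assuming the `mteb` Python API together with `sentence-transformers` (the task and model names here are purely illustrative, and whether a given model wrapper honors the default prompt is model-specific):

```python
# Sketch: evaluate the same model under different prompts to measure
# the effect of the instruct on a classification task.
import mteb
from sentence_transformers import SentenceTransformer

for i, prompt in enumerate(["", "classification", "Classify the following text: "]):
    # The prompt is prepended to every input via sentence-transformers'
    # prompt mechanism (assumption: the MTEB wrapper does not override it).
    model = SentenceTransformer(
        "intfloat/multilingual-e5-small",
        prompts={"default": prompt},
        default_prompt_name="default",
    )
    tasks = mteb.get_tasks(tasks=["Banking77Classification"])
    mteb.MTEB(tasks=tasks).run(model, output_folder=f"results/prompt_{i}")
```

Comparing the scores across the three runs would isolate how much of the classification accuracy comes from the instruct itself.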
-
Thanks Kenneth for your fast replies. Indeed, sorting models by performance on the MultiLabelClassification tasks is what I'm looking for. Looking at the MultiLabelClassification implementation (AbsTaskMultilabelClassification?), I'm not sure I understand what it is doing in the context of the MultiEURLEXMultilabelClassification dataset: how many labels are we talking about here? It seems we limit the classifier to n_neighbors=5; also, using a KNeighborsClassifier doesn't seem aligned with OpenAI's recommendation to rely on a RandomForestClassifier when using the embedding model as a free-text feature encoder. My expectation would be that a KNeighborsClassifier will indeed perform quite poorly on such embeddings, since nothing suggests that the labels will be isolated in convex regions of the embedding space.
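As a sanity check, this is roughly how I'd compare the two classifiers on precomputed embeddings (a sketch with stand-in data; `X` and `y` are illustrative placeholders, not the MTEB pipeline; the real task is multilabel):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384))    # stand-in for sentence embeddings
y = rng.integers(0, 21, size=1000)  # stand-in labels

for clf in (KNeighborsClassifier(n_neighbors=5),
            RandomForestClassifier(n_estimators=200, random_state=0)):
    scores = cross_val_score(clf, X, y, cv=3, scoring="f1_macro")
    print(f"{type(clf).__name__}: {scores.mean():.3f}")
```

Swapping in real embeddings and labels would show whether the k-NN assumption of locally clustered classes actually holds.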
Thanks, I'll try!
-
Hi,
I assume I'm not the only one who used this benchmark with a goal in mind: finding the best open-weight alternative to the GCP (or OpenAI/Mistral) embedding APIs. Those APIs are really useful as general free-text feature encoders within a machine learning model. On GCP, you just specify whether you want an embedding optimized for classification, regression, etc., and the resulting embedding will be useful as input for several machine learning tasks. You pay once and can reuse the embedding several times for several goals.
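For context, this is roughly what that looks like with the Vertex AI SDK (a sketch; the model name and `task_type` values follow the documented text-embedding models, adjust to your project):

```python
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

# Request an embedding optimized for a downstream classifier.
model = TextEmbeddingModel.from_pretrained("text-embedding-004")
inputs = [TextEmbeddingInput("Some document text ...", task_type="CLASSIFICATION")]
vector = model.get_embeddings(inputs)[0].values  # reusable as ML features
```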
Looking at the leaderboard makes it really easy to identify the best models, and deploying an embedding API is a breeze with Text Embeddings Inference. gte-Qwen2-1.5B-instruct seems like a nice tradeoff for my specific use case. Now, the only thing that remains is to find the best instruct for classification!
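Once TEI serves the model, querying it is trivial; a sketch assuming a local container on port 8080 (the `/embed` endpoint shape is per the TEI docs, and the instruct format here is an assumption based on the model card convention):

```python
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "Instruct: Classify the text.\nQuery: some document"},
)
embedding = resp.json()[0]  # one embedding vector (list of floats)
```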
And here is the catch: looking at the instructs used in the benchmark, I was not able to find a useful one.
To my surprise, for classification tasks, the eval seems to rely on the instruct to actually prompt the embedding model into doing the classification; the logistic regression or k-nearest-neighbors classifier then simply finds the subpart of the embedding space that captures the meaning of the classes defined in the prompt.
This means that the benchmark is actually evaluating the capacity of each model to do classification, and not the ability of the model to encode text that can then be used as input by a classifier, as documented by OpenAI here.
I believe it would make sense to add a new task that focuses on general free-text feature encoding: something that relies on instructs that make no assumption about the underlying classification or regression task, which would be handled by a RandomForest or XGBoost.
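Concretely, such a task could look like this sketch (the generic prompt and the `load_my_dataset` helper are hypothetical, not existing MTEB code):

```python
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A task-agnostic instruct: it says nothing about the labels or the task.
GENERIC_PROMPT = "Represent the text for downstream machine learning: "  # assumption

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct")
texts, labels = load_my_dataset()  # hypothetical helper returning texts and labels
X = model.encode([GENERIC_PROMPT + t for t in texts])
print(cross_val_score(RandomForestClassifier(), X, labels, cv=5).mean())
```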
Regards,
Thomas