Optimizing Benchmark Dataset Size #2170

mehran-sarmadi · 2025-02-26T10:46:30Z

Hi everyone,

Some of the datasets in the Persian benchmark, like MSMARCO-Fa and FEVER-Fa, are pretty large, and users might have trouble running them efficiently. To help with this, we’re considering two options:

1. Creating a smaller version of the existing datasets (e.g., FEVERHardNegatives).
2. Adding a separate, smaller version alongside the original dataset (e.g., MTEB (eng, v2)).

Which option do you think is better?
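
For illustration, here is a minimal sketch of one way a smaller retrieval dataset could be built: sample a subset of queries, then keep only their relevant documents plus a few top-ranked non-relevant candidates per query. This is not necessarily how FEVERHardNegatives was constructed. The function name `downsample_retrieval`, the dict-based data layout, and the `candidates` input (per-query rankings from some retriever) are all assumptions made for this example.

```python
import random


def downsample_retrieval(corpus, queries, qrels, candidates,
                         n_queries, n_negatives, seed=42):
    """Shrink a retrieval dataset (hypothetical helper, not an MTEB API).

    corpus:     dict doc_id -> text
    queries:    dict query_id -> text
    qrels:      dict query_id -> set of relevant doc_ids
    candidates: dict query_id -> list of doc_ids ranked by some retriever
                (assumed to be computed beforehand)
    """
    rng = random.Random(seed)
    # Sample queries deterministically so the subset is reproducible.
    sampled = rng.sample(sorted(queries), min(n_queries, len(queries)))

    keep_docs = set()
    for qid in sampled:
        relevant = qrels.get(qid, set())
        # Always keep the relevant documents.
        keep_docs.update(relevant)
        # Keep the top-ranked non-relevant candidates as hard negatives.
        negatives = [d for d in candidates.get(qid, []) if d not in relevant]
        keep_docs.update(negatives[:n_negatives])

    small_corpus = {d: corpus[d] for d in keep_docs if d in corpus}
    small_queries = {q: queries[q] for q in sampled}
    small_qrels = {q: set(qrels.get(q, set())) for q in sampled}
    return small_corpus, small_queries, small_qrels
```

With this kind of reduction, the corpus size is bounded by roughly `n_queries * (number of relevant docs + n_negatives)`, regardless of how large the original corpus is.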

KennethEnevoldsen · 2025-02-26T20:32:08Z

Maintainer

Since MTEB(fas, beta) is in beta, we can make changes to the existing leaderboard.

For both of these datasets, however, I suspect that they are machine translated (the metadata does specify "found", but I think that is an error). I think better alternatives are available, e.g. MIRACL (which already has a HardNegatives version).

1 reply

mehran-sarmadi Mar 1, 2025
Author

Thanks for your feedback! We’ve decided to take a mixed approach, keeping some important datasets that aren’t too large while providing smaller versions for the rest. This way, users can still access key datasets in full while benefiting from more efficient versions of others.

Let us know if you have any thoughts on this approach.

michaeldinzinger · 2025-03-07T11:48:07Z

@mehran-sarmadi
Hi Mehran, as you are currently revising the Persian benchmark, you could also consider a new dataset that contains FAQs in Farsi. It is called WebFAQ, and I'm currently trying to bring the dataset to MTEB as a Retrieval task. The Pull Request is here. If you think it's a fit for the benchmark, let me know and I can add it, possibly as part of the current Pull Request.

Some more details:
The dataset is multilingual, and the Farsi subset contains 227k question-answer pairs. Each question serves as a query and has exactly one relevant document, the corresponding answer. Language identification was done using FastText with a confidence threshold of 0.63. The test split contains 10,000 queries.
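
To make the thresholding step concrete, here is a minimal sketch of confidence-based language filtering. It is not the actual WebFAQ pipeline. The function name `filter_by_language`, the `predict` callable (standing in for a language-ID model that returns a label and a confidence), the `__label__fa` label string, and the use of `>=` rather than `>` at the threshold are all assumptions for illustration.

```python
def filter_by_language(texts, predict, target_label="__label__fa", threshold=0.63):
    """Keep texts identified as the target language with enough confidence.

    predict: hypothetical callable mapping a string to (label, confidence),
             e.g. a thin wrapper around a language-identification model.
    """
    kept = []
    for text in texts:
        label, confidence = predict(text)
        # Discard texts in other languages and low-confidence predictions.
        if label == target_label and confidence >= threshold:
            kept.append(text)
    return kept


# Toy stand-in for a real language-ID model (made-up scores, demo only).
def toy_predict(text):
    scores = {
        "سلام": ("__label__fa", 0.98),
        "hello": ("__label__en", 0.99),
        "ok": ("__label__fa", 0.40),
    }
    return scores.get(text, ("__label__unknown", 0.0))


print(filter_by_language(["سلام", "hello", "ok"], toy_predict))
```

A higher threshold trades recall for precision: fewer mixed-language or very short texts slip through, at the cost of discarding some valid Farsi entries.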

1 reply

mehran-sarmadi Mar 7, 2025
Author

Hi @michaeldinzinger,
Thanks for your suggestion! I'll check it as soon as possible and let you know.
