Paper Reproduction · NLP · Embeddings · Benchmarking · Python

Reproducing MTEB —
Massive Text Embedding Benchmark

Paper Reproduction + Contribution · Banking77Classification · average_word_embeddings_komninos · ~67% accuracy
Sandar Aung & Maria Nelson
Honest scope — what's reproduction and what's ours

The core implementation — installing MTEB, loading the model, running evaluation on Banking77 — reproduces the paper's methodology. The results (~67% accuracy) validate what the paper reported for this model class. Our original contribution is the visualisation function: three separate subplots tracking Accuracy, F1, and Weighted F1 across all 10 experiments. The original paper outputs raw JSON only. We wrote the code to make those results readable and analysable across runs. That's a small but real addition — and we'd rather be clear about the distinction than oversell it.

What is MTEB?

Before MTEB, evaluating text embeddings was fragmented. Different papers used different tasks, different metrics, different datasets — making it nearly impossible to know whether one model was genuinely better than another, or just better on the specific benchmark its authors chose to highlight.

The Massive Text Embedding Benchmark, published in 2022, fixes this by providing a unified evaluation framework across 58 datasets, 112 languages, and 8 task categories — classification, clustering, retrieval, semantic similarity, and more. The value is standardisation: run your model on MTEB once and your score is directly comparable to every other model on the leaderboard.

58 datasets · 112 languages · 8 task categories · 10 experiments run

Why this benchmarking problem matters

Earlier frameworks each solved a slice of the problem. SuperGLUE evaluated natural language understanding but had limited scope. BEIR covered retrieval but nothing else. SBERT-based evaluations were expensive and narrow. None gave a complete picture of how an embedding model actually performs across the range of tasks it would encounter in practice.

Benchmark | Focus               | Limitation
--------- | ------------------- | ----------
SuperGLUE | NLU tasks           | Limited scope
BEIR      | Retrieval only      | Not general purpose
SBERT     | Sentence similarity | Narrow and expensive
MTEB      | All embedding tasks | Still being adopted at time of writing

The model — and why it's a useful baseline

We used average_word_embeddings_komninos — a model that represents a sentence by averaging its individual word vectors. There's no attention, no context window, no transformer architecture. "Bank" as a financial institution and "bank" as a riverbank produce the same vector regardless of surrounding words.

That's exactly why it makes a useful starting point. Word-averaging embeddings are lightweight and stable, but they have a ceiling that contextual models like BERT were built to break through. Establishing that ceiling is the first step in meaningful model comparison — which is what MTEB is designed to enable.
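The averaging mechanism itself fits in a few lines. A minimal sketch with toy three-dimensional vectors (hypothetical values, not the actual Komninos embeddings) makes the context-blindness concrete — "bank" contributes the identical vector to both sentences:

```python
import numpy as np

# Toy word vectors (hypothetical 3-d values, not the real Komninos embeddings)
word_vectors = {
    "the":   np.array([0.1, 0.0, 0.2]),
    "bank":  np.array([0.5, 0.3, 0.1]),  # one vector, regardless of word sense
    "river": np.array([0.0, 0.4, 0.6]),
    "loan":  np.array([0.7, 0.1, 0.0]),
}

def embed(sentence):
    """Sentence embedding = mean of its word vectors (no context, no attention)."""
    vectors = [word_vectors[w] for w in sentence.lower().split()]
    return np.mean(vectors, axis=0)

riverbank = embed("the river bank")
loan_bank = embed("the bank loan")
```

Any difference between the two sentence embeddings comes entirely from the surrounding words; the model has no mechanism to disambiguate "bank" itself.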

Implementation — the reproduction

install and authenticate
!pip install mteb sentence-transformers

from huggingface_hub import login
login(token='YOUR_HF_TOKEN')  # required to access model hub
model setup and evaluation
import mteb
from sentence_transformers import SentenceTransformer

model_name = "average_word_embeddings_komninos"
model      = SentenceTransformer(model_name)

# Banking77: intent classification across 77 banking query categories
tasks      = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results    = evaluation.run(model, output_folder=f"results/{model_name}")
reading the raw JSON output
import json

with open('results/.../Banking77Classification.json', 'r') as file:
    results_data = json.load(file)

print(f"Overall Accuracy:     {results_data['scores']['test'][0]['accuracy']:.2f}")
print(f"Overall F1:           {results_data['scores']['test'][0]['f1']:.2f}")
print(f"Overall Weighted F1:  {results_data['scores']['test'][0]['f1_weighted']:.2f}")

Our contribution — the visualisation function

The paper's output is raw JSON — useful for pipelines, but hard to read across multiple experiments. We wrote a three-panel visualisation function that plots Accuracy, F1, and Weighted F1 separately across all 10 experiment runs, making variance and patterns immediately visible rather than requiring you to scan through numbers.

visualisation function — original contribution
import matplotlib.pyplot as plt

def visualize_test_scores(test_scores):
    """Plot Accuracy, F1 and Weighted F1 per experiment from MTEB's 'test' score list."""
    experiments     = test_scores[0]['scores_per_experiment']

    accuracy_scores = [e['accuracy']   for e in experiments]
    f1_scores       = [e['f1']         for e in experiments]
    f1_weighted     = [e['f1_weighted'] for e in experiments]

    fig, axes = plt.subplots(3, 1, figsize=(10, 15))

    axes[0].plot(accuracy_scores, marker='o', color='b')
    axes[0].set_title('Accuracy per Experiment')
    axes[0].set_xlabel('Experiment')
    axes[0].set_ylabel('Accuracy')
    axes[0].grid(True)

    axes[1].plot(f1_scores, marker='o', color='g')
    axes[1].set_title('F1 Score per Experiment')
    axes[1].set_xlabel('Experiment')
    axes[1].set_ylabel('F1 Score')
    axes[1].grid(True)

    axes[2].plot(f1_weighted, marker='o', color='r')
    axes[2].set_title('F1 Weighted Score per Experiment')
    axes[2].set_xlabel('Experiment')
    axes[2].set_ylabel('F1 Weighted Score')
    axes[2].grid(True)

    plt.tight_layout()
    plt.show()

visualize_test_scores(results_data["scores"]["test"])

Results

Figure: Accuracy (blue), F1 (green), and Weighted F1 (red) per experiment across 10 runs — all three metrics track closely, confirming stable and balanced model behaviour.
Experiment | Accuracy | F1   | Weighted F1
---------- | -------- | ---- | -----------
1          | 0.67     | 0.67 | 0.67
2          | 0.67     | 0.67 | 0.67
3          | 0.68     | 0.68 | 0.68
4          | 0.67     | 0.66 | 0.66
5          | 0.66     | 0.66 | 0.66
6          | 0.66     | 0.66 | 0.66
7          | 0.66     | 0.66 | 0.66
8          | 0.67     | 0.67 | 0.67
9          | 0.67     | 0.67 | 0.67
10         | 0.66     | 0.66 | 0.66
Mean accuracy: 0.67 · Best: 0.68 (experiment 3) · Lowest: 0.66 (experiments 5, 6, 7, 10) · Spread (best − worst): 0.02
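As a quick sanity check, the summary figures follow directly from the per-experiment accuracy column (values transcribed from the results table above):

```python
# Per-experiment accuracy, transcribed from the results table
accuracy = [0.67, 0.67, 0.68, 0.67, 0.66, 0.66, 0.66, 0.67, 0.67, 0.66]

mean_acc = sum(accuracy) / len(accuracy)
best_run = max(range(len(accuracy)), key=lambda i: accuracy[i]) + 1  # 1-indexed
spread   = round(max(accuracy) - min(accuracy), 2)  # best minus worst

print(f"Mean accuracy: {mean_acc:.2f}")        # 0.67
print(f"Best run:      experiment {best_run}")  # 3
print(f"Spread:        {spread}")               # 0.02
```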

What the results actually tell us

Three things stand out. First, Accuracy, F1, and Weighted F1 track almost identically across all 10 runs — a sign of balanced class handling. The model isn't favouring any particular intent category over others across 77 classes, which is a meaningful property to confirm.
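The connection between balanced per-class performance and converging F1 variants can be illustrated with a small scikit-learn check (toy labels, not the Banking77 data): when classes have equal support and similar error rates, macro and weighted F1 coincide.

```python
from sklearn.metrics import f1_score

# Toy 3-class problem: equal class sizes, one error per class
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0]

macro    = f1_score(y_true, y_pred, average="macro")     # unweighted mean over classes
weighted = f1_score(y_true, y_pred, average="weighted")  # weighted by class support

# With equal support per class, the two averages are identical
```

The same logic applies at 77 classes: divergence between the two metrics would signal that some intent categories are being handled much worse than others.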

Second, the variance is small — just 0.02 between the best and worst experiments. That consistency suggests the reproduction is stable and the setup behaves as expected.

Third, the dip across experiments 5–7 before recovering is the most interesting pattern in the charts. A static word-embedding model has no mechanism to adapt between runs — it's deterministic given the same inputs. The variability comes from how the dataset is split and shuffled across experiments, not from the model itself. That's a useful thing to understand about what you're measuring.

67% on a 77-class classification task without any contextual understanding is a reasonable baseline. A transformer-based model like all-MiniLM-L6-v2 would likely push this towards 80% — running that comparison is the obvious next step.
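That comparison would need nothing more than a model-name swap in the pipeline above. A sketch — we have not run it; the function simply wraps the same MTEB calls used in the reproduction:

```python
def run_banking77(model_name: str, output_root: str = "results"):
    """Evaluate one sentence-transformers model on Banking77 (same pipeline as above)."""
    # Imports deferred so the sketch can be loaded without the heavy dependencies
    import mteb
    from sentence_transformers import SentenceTransformer

    model      = SentenceTransformer(model_name)
    tasks      = mteb.get_tasks(tasks=["Banking77Classification"])
    evaluation = mteb.MTEB(tasks=tasks)
    return evaluation.run(model, output_folder=f"{output_root}/{model_name}")

# Baseline vs contextual model — the only change is the model name.
# Each call downloads the model and runs a full evaluation, so neither is instant:
# run_banking77("average_word_embeddings_komninos")
# run_banking77("all-MiniLM-L6-v2")
```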

Limitations

We ran this on a single task with a single model. That's enough to validate the framework and understand the baseline, but not enough to draw broader conclusions about MTEB or embedding models generally. We didn't run any model comparisons, didn't tune anything (word-averaging models have no hyperparameters to tune), and didn't add a confusion matrix across the 77 intent categories — which is where the more interesting analysis would start.
