Text Embedding Models – Performance Comparison
One of the key challenges with text embeddings is that they do not always bring back exact-match results, primarily because of how embeddings represent and understand language.
Text embeddings focus on capturing semantic meaning rather than exact word-to-word matches. They represent words or phrases in a continuous vector space based on their context and meaning. As a result, embeddings might prioritize semantically similar but not identical words or phrases over exact matches.
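To make the effect concrete, here is a minimal sketch, assuming the sentence-transformers package and the thenlper/gte-large checkpoint (the GTE-Large model compared later in this article); the query and documents are made-up examples, not data from the experiment.

```python
# Minimal sketch: cosine similarity rewards meaning, not exact wording.
# Assumes the sentence-transformers package and the "thenlper/gte-large"
# checkpoint; the query and documents below are made-up examples.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-large")

query = "error code 0x80070005"
docs = [
    "error code 0x80070005",                            # exact match
    "access denied error while installing the update",  # semantically close
    "the weather in Lisbon is sunny today",             # unrelated
]

# Normalised embeddings make the dot product equal to cosine similarity.
query_emb = model.encode(query, normalize_embeddings=True)
doc_embs = model.encode(docs, normalize_embeddings=True)

scores = util.cos_sim(query_emb, doc_embs)[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
# Depending on the model, the semantically close document can score almost
# as high as (or even higher than) the exact match.
```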
In this article, we challenge each model’s ability to prioritize documents containing exact matches over documents that are merely semantically similar.
Embedding Models
Today we have three of the leading embedding models on the table:
| Model | Vendor | # of dimensions | Pros | Cons |
|---|---|---|---|---|
| Ada-002 | OpenAI | 1536 | most dimensions | proprietary; calls an external server; paid |
| Gecko-003 | Google | 768 | fewest dimensions | proprietary; calls an external server; paid |
| GTE-Large | Alibaba | 1024 | runs locally (on-prem); free! | requires system resources |
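For reference, here is a rough sketch of how one might obtain a vector from each of the three models; the client APIs and model identifiers are assumptions based on the vendors’ public SDKs, so adjust them to your environment.

```python
# Rough sketch of producing one embedding per model.
# Client libraries, model identifiers and credentials handling are
# assumptions based on the vendors' public SDKs; adjust to your setup.
from openai import OpenAI
from vertexai.language_models import TextEmbeddingModel
from sentence_transformers import SentenceTransformer

text = "marker phrase"

# OpenAI Ada-002: proprietary, remote call, 1536 dimensions
# (expects OPENAI_API_KEY in the environment)
ada_vec = OpenAI().embeddings.create(
    model="text-embedding-ada-002", input=text
).data[0].embedding

# Google Gecko-003: proprietary, remote call, 768 dimensions
# (assumes vertexai.init(project=..., location=...) was called beforehand)
gecko = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
gecko_vec = gecko.get_embeddings([text])[0].values

# Alibaba GTE-Large: runs locally, 1024 dimensions
gte_vec = SentenceTransformer("thenlper/gte-large").encode(text)

print(len(ada_vec), len(gecko_vec), len(gte_vec))  # 1536, 768, 1024
```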
Test Scenario #1: exact match
In a database containing 1,000 rows, only three (3) rows contain a marker phrase, while 100+ rows contain words or phrases semantically similar to the marker phrase. We ask the PGVector DB to bring back the top 10 documents using a cosine similarity search.
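The retrieval step might look roughly like this, assuming a hypothetical documents table with an embedding vector column, the psycopg driver and the pgvector Python adapter; `<=>` is pgvector’s cosine-distance operator, so smaller values mean higher similarity.

```python
# Sketch of the top-10 cosine similarity search against PGVector.
# The table/column names and connection string are assumptions.
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-large")  # or any of the three models
query_vec = model.encode("marker phrase")

with psycopg.connect("dbname=embedding_experiments") as conn:
    register_vector(conn)  # lets psycopg pass numpy vectors as pgvector values
    rows = conn.execute(
        """
        SELECT id, content, embedding <=> %s AS cosine_distance
        FROM documents
        ORDER BY embedding <=> %s
        LIMIT 10
        """,
        (query_vec, query_vec),
    ).fetchall()

for rank, (doc_id, content, dist) in enumerate(rows, start=1):
    print(f"#{rank}  distance={dist:.4f}  {content[:60]}")
```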
The scoring is based on how close the rows containing the exact matches make it to the top: position #1 scores 10 points, position #2 scores 9 points, and so on, down to 1 point for position #10.
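A simple way to express this scoring rule (the function and variable names are illustrative, not from the original experiment):

```python
# Scoring rule used in both scenarios: position #1 earns 10 points,
# position #2 earns 9, ... position #10 earns 1.
def score_results(result_ids, marker_ids, top_k=10):
    """Sum points for marker-phrase rows found within the top-k results."""
    return sum(
        top_k - rank                      # rank 0 -> 10 points, rank 9 -> 1 point
        for rank, doc_id in enumerate(result_ids[:top_k])
        if doc_id in marker_ids
    )

# Example: marker rows returned at positions 1, 5 and 6 -> 10 + 6 + 5 = 21
print(score_results(list("axyzbcqrst"), marker_ids={"a", "b", "c"}))  # 21
```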
The top 10 results contained the “marker phrase” rows as follows (the score is shown at each position where a marker-phrase row appeared):
| Result position | Ada-002 | Gecko-003 | GTE-Large |
|---|---|---|---|
| 1 | | 10 | 10 |
| 2 | | | |
| 3 | | | |
| 4 | | 7 | |
| 5 | | | 6 |
| 6 | | | 5 |
| 7 | | 4 | |
| 8 | 3 | | |
| 9 | | | |
| 10 | | | |
| Total score | 3 | 21 | 21 |
As you can see, both the Google and Alibaba models did a great job: all three marker-phrase rows made it into the top 10. OpenAI, however, found other documents more similar than the ones containing the exact match, returning only one of the three within the first 10.
Test Scenario #2: diluted exact match
Now we repeat Scenario #1, diluting the marker phrase with garbage words, so our query “marker phrase” becomes “marker phrase garbage words” (the database rows remain intact; only the query is diluted), as sketched below.
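Only the query string changes; everything else in the pipeline stays the same:

```python
from sentence_transformers import SentenceTransformer

# Scenario #2 reuses the exact same search; only the query text is diluted.
model = SentenceTransformer("thenlper/gte-large")
diluted_query = "marker phrase garbage words"
query_vec = model.encode(diluted_query)  # stored document embeddings are untouched
```

With the diluted query, the top 10 results look like this: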
| Result position | Ada-002 | Gecko-003 | GTE-Large |
|---|---|---|---|
| 1 | | | |
| 2 | | | |
| 3 | | 8 | 8 |
| 4 | | | |
| 5 | | | |
| 6 | | | |
| 7 | | 4 | |
| 8 | | | |
| 9 | | | |
| 10 | | 1 | |
| Total score | 0 | 13 | 8 |
Now, this is interesting! Google, despite losing positions and scoring 8 points less than before, still managed to keep all three rows within the first ten results. GTE comes in second, losing two rows to other results, while OpenAI didn’t deliver at all!
Conclusion
Even though this experiment is not exactly fair, our goal was to see whether different models give different weight to the occurrence of exact matches within texts. In this particular scenario, and with the data we had, our leaderboard looks as follows:
#1: Google’s Gecko-003 with 34 points,
#2: GTE-Large with 29 points, and
#3: OpenAI’s Ada-002 scoring only 3 points.
In this case, Google’s Gecko-003 is the clear winner!
(PS: your experience may differ between datasets)