Text Embedding Models – Performance Comparison
One of the key challenges with text embeddings is that they do not always bring back exact-match results, primarily because of how embeddings represent and understand language.
Text embeddings focus on capturing semantic meaning rather than exact word-to-word matches. They represent words or phrases in a continuous vector space based on their context and meaning. As a result, embeddings might prioritize semantically similar but not identical words or phrases over exact matches.
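To make the effect concrete, here is a minimal sketch, assuming the sentence-transformers package and the thenlper/gte-large checkpoint (the GTE-Large model compared later in this article); the query and documents are made-up examples, not data from the experiment.

```python
# Minimal sketch: cosine similarity rewards meaning, not exact wording.
# Assumes the sentence-transformers package and the "thenlper/gte-large"
# checkpoint; the query and documents below are made-up examples.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-large")

query = "error code 0x80070005"
docs = [
    "error code 0x80070005",                            # exact match
    "access denied error while installing the update",  # semantically close
    "the weather in Lisbon is sunny today",             # unrelated
]

# Normalised embeddings make the dot product equal to cosine similarity.
query_emb = model.encode(query, normalize_embeddings=True)
doc_embs = model.encode(docs, normalize_embeddings=True)

scores = util.cos_sim(query_emb, doc_embs)[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
# Depending on the model, the semantically close document can score almost
# as high as (or even higher than) the exact match.
```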
In this article, we challenge each model’s ability to prioritize documents containing exact matches over documents that are merely semantically similar.
Embedding Models
Today we have three of the leading embedding models on the table:
| Model | Vendor | # of dimensions | Pros | Cons |
|---|---|---|---|---|
| Ada-002 | OpenAI | 1536 | most dimensions | proprietary; calls an external server; paid |
| Gecko-003 | Google | 768 | fewest dimensions | proprietary; calls an external server; paid |
| GTE-Large | Alibaba | 1024 | runs locally (on-prem); free! | requires system resources |
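For reference, here is a rough sketch of how one might obtain a vector from each of the three models; the client APIs and model identifiers are assumptions based on the vendors’ public SDKs, so adjust them to your environment.

```python
# Rough sketch of producing one embedding per model.
# Client libraries, model identifiers and credentials handling are
# assumptions based on the vendors' public SDKs; adjust to your setup.
from openai import OpenAI
from vertexai.language_models import TextEmbeddingModel
from sentence_transformers import SentenceTransformer

text = "marker phrase"

# OpenAI Ada-002: proprietary, remote call, 1536 dimensions
# (expects OPENAI_API_KEY in the environment)
ada_vec = OpenAI().embeddings.create(
    model="text-embedding-ada-002", input=text
).data[0].embedding

# Google Gecko-003: proprietary, remote call, 768 dimensions
# (assumes vertexai.init(project=..., location=...) was called beforehand)
gecko = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
gecko_vec = gecko.get_embeddings([text])[0].values

# Alibaba GTE-Large: runs locally, 1024 dimensions
gte_vec = SentenceTransformer("thenlper/gte-large").encode(text)

print(len(ada_vec), len(gecko_vec), len(gte_vec))  # 1536, 768, 1024
```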
Test Scenario #1: exact match
In a database containing 1,000 rows, only three (3) rows contain a marker phrase, while 100+ rows contain words or phrases semantically similar to the marker phrase. We ask the PGVector DB to bring back the top 10 documents using a cosine similarity search.
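The retrieval step might look roughly like this, assuming a hypothetical documents table with an embedding vector column, the psycopg driver and the pgvector Python adapter; `<=>` is pgvector’s cosine-distance operator, so smaller values mean higher similarity.

```python
# Sketch of the top-10 cosine similarity search against PGVector.
# The table/column names and connection string are assumptions.
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-large")  # or any of the three models
query_vec = model.encode("marker phrase")

with psycopg.connect("dbname=embedding_experiments") as conn:
    register_vector(conn)  # lets psycopg pass numpy vectors as pgvector values
    rows = conn.execute(
        """
        SELECT id, content, embedding <=> %s AS cosine_distance
        FROM documents
        ORDER BY embedding <=> %s
        LIMIT 10
        """,
        (query_vec, query_vec),
    ).fetchall()

for rank, (doc_id, content, dist) in enumerate(rows, start=1):
    print(f"#{rank}  distance={dist:.4f}  {content[:60]}")
```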
The scoring is based on how close the rows containing the exact matches make it to the top: position #1 scores 10 points, position #2 scores 9 points, and so on, down to 1 point for position #10.
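A simple way to express this scoring rule (the function and variable names are illustrative, not from the original experiment):

```python
# Scoring rule used in both scenarios: position #1 earns 10 points,
# position #2 earns 9, ... position #10 earns 1.
def score_results(result_ids, marker_ids, top_k=10):
    """Sum points for marker-phrase rows found within the top-k results."""
    return sum(
        top_k - rank                      # rank 0 -> 10 points, rank 9 -> 1 point
        for rank, doc_id in enumerate(result_ids[:top_k])
        if doc_id in marker_ids
    )

# Example: marker rows returned at positions 1, 5 and 6 -> 10 + 6 + 5 = 21
print(score_results(list("axyzbcqrst"), marker_ids={"a", "b", "c"}))  # 21
```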
The top 10 results contained the “marker phrase” rows as follows (the score is shown at each position where a marker-phrase row appeared):
| Result position | Ada-002 | Gecko-003 | GTE-Large |
|---|---|---|---|
| 1 | | 10 | 10 |
| 2 | | | |
| 3 | | | |
| 4 | | 7 | |
| 5 | | | 6 |
| 6 | | | 5 |
| 7 | | 4 | |
| 8 | 3 | | |
| 9 | | | |
| 10 | | | |
| Total score | 3 | 21 | 21 |
As you can see, both the Google and Alibaba models did a great job: all three marker-phrase rows made it into the top 10. OpenAI, however, found other documents more similar than the ones containing the exact match, returning only one of the three within the first 10.
Test Scenario #2: diluted exact match
Now we repeat Scenario #1, diluting the marker phrase with garbage words, so our query “marker phrase” becomes “marker phrase garbage words” (the database rows remain intact; only the query is diluted), as sketched below.
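Only the query string changes; everything else in the pipeline stays the same:

```python
from sentence_transformers import SentenceTransformer

# Scenario #2 reuses the exact same search; only the query text is diluted.
model = SentenceTransformer("thenlper/gte-large")
diluted_query = "marker phrase garbage words"
query_vec = model.encode(diluted_query)  # stored document embeddings are untouched
```

With the diluted query, the top 10 results look like this: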
| Result position | Ada-002 | Gecko-003 | GTE-Large |
|---|---|---|---|
| 1 | | | |
| 2 | | | |
| 3 | | 8 | 8 |
| 4 | | | |
| 5 | | | |
| 6 | | | |
| 7 | | 4 | |
| 8 | | | |
| 9 | | | |
| 10 | | 1 | |
| Total score | 0 | 13 | 8 |
Now, this is interesting! Google, despite losing positions and scoring 8 points less than before, still managed to keep all three rows within the first ten results. GTE comes in second, losing two rows to other results, while OpenAI didn’t deliver at all!
Conclusion
Even though this experiment is not exactly fair, our goal was to see whether different models give different weight to the occurrence of exact matches within texts. In this particular scenario, and with the data we had, our leaderboard looks as follows:
#1: Google’s Gecko-003 with 34 points,
#2: GTE-Large with 29 points, and
#3: OpenAI’s Ada-002 scoring only 3 points.
In this case, Google’s Gecko-003 is the clear winner!
(PS: your experience may differ between datasets)