1. Embed all entity data as single text: Embed all the data you have about an entity as a single piece of text (using a model from a hub like Hugging Face, or your own proprietary model). Converting all the data you have about an entity directly into a single vector usually fails to capture complex attributes; you lose potentially relevant information.
2. Run vector similarity search: Run a vector search to identify the X nearest neighbor vectors, in order of similarity. Similarity searches on these vectors often produce less than satisfactory results, necessitating reranking.
3. Rerank results with additional filters: Rerank your results using a) additional contextual filters or b) information not in the vector (e.g., filtering out items that are out of stock). To get better results, you’ll probably have to add a custom layer to incorporate additional relevant information, and/or rerank using cross-encoders, incurring a latency cost that makes productionizing difficult; cross-encoders obtain a reranking metric by calculating the similarity of each pair of points individually, which takes a nontrivial amount of time. (A sketch of this three-step pipeline follows below.)
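To make the baseline concrete, here is a minimal sketch of the three-step pipeline, using sentence-transformers for retrieval and a cross-encoder for reranking. The model names, example items, and stock filter are illustrative assumptions, not a definitive implementation:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# Step 1: embed all entity data as a single piece of text.
model = SentenceTransformer("all-MiniLM-L6-v2")
items = [
    {"text": "Trail running shoe. Lightweight, breathable mesh. 120 likes.", "in_stock": True},
    {"text": "Waterproof hiking boot with ankle support. 45 likes.", "in_stock": False},
]
item_vectors = model.encode(
    [item["text"] for item in items], normalize_embeddings=True
)

# Step 2: vector similarity search for the X nearest neighbors, in order.
query = "shoes for trail running"
query_vector = model.encode(query, normalize_embeddings=True)
scores = item_vectors @ query_vector  # cosine similarity (vectors are normalized)
nearest = np.argsort(-scores)

# Step 3: filter on information not in the vector (stock status), then rerank
# with a cross-encoder, which scores each (query, item) pair individually -
# the step that adds the latency cost described above.
candidates = [items[i] for i in nearest if items[i]["in_stock"]]
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_scores = reranker.predict([(query, c["text"]) for c in candidates])
reranked = [c for _, c in sorted(zip(pair_scores, candidates), key=lambda p: -p[0])]
```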
Follow along in this Colab: an interactive notebook demonstrating how to combine multiple embeddings with Superlinked.
A more efficient, reliable approach to complex search
The pieces of information you have about any entity are often complex and usually include more than one attribute. Embedding each attribute so that it is properly represented gives you:
- Better quality retrieval results: Your searches can retrieve results that are more complete and relevant than when your embeddings don't effectively represent the different attributes of the same entity; see Superlinked's research outcomes on this in VectorHub
- Efficiency savings: Because you get better initial results, you're less likely to require time-consuming reranking, reducing vector retrieval processing time by roughly 10x (from hundreds of milliseconds to tens of milliseconds)

Let's take a very simple example. Say your data source contains two paragraphs, each with a certain number of likes, as sketched below.
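Here is that dataset as plain Python, using the two paragraph bodies, like counts, and IDs that appear in the results table at the end of this section:

```python
# Two paragraphs, each with a body (text) and a like_count (engagement metric).
paragraphs = [
    {
        "id": "paragraph-1",
        "body": "Glorious animals live in the wilderness.",
        "like_count": 10,
    },
    {
        "id": "paragraph-2",
        "body": "Growing computation power enables advancements in AI.",
        "like_count": 75,
    },
]
```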
Capturing complex entity data in a single multimodal vector - Superlinked’s Spaces
At Superlinked, we use Spaces to embed different pieces of data about an entity, structured or unstructured, and concatenate them into a single, representative multimodal vector that's easy to query in your vector index. Our Spaces approach achieves better retrieval by carefully building more representative vectors before you start retrieving - as opposed to embedding all the data you have about an entity as a single piece of text that inadequately captures the entity's varied attributes, and then having to rerank and filter to incorporate or omit other attributes.

Spaces lets you foreground the most relevant attributes of your data when embedding. Instead of converting your data about, for example, users or products directly into a single vector, Superlinked's approach lets you represent different attributes of the same entity in separate Spaces, each of which handles that type of data better. One Space handles data about attribute x from users, another Space captures data about attribute y from users, and so on - and the same for products.

But Spaces also enables you to combine data from more than one schema in the same Space. You can, for example, declare one Space that represents related attributes of data from both users and products - e.g., product description and user product preferences. When you use a Space to connect user preference data with product descriptions in your embeddings, searching with the user preference vector will surface significantly better product recommendations.
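As a minimal sketch of such a multi-schema Space (the User/Product schemas, field names, and model choice here are illustrative assumptions, not the article's exact code):

```python
import superlinked.framework as sl

# Hypothetical schemas: users with free-text product preferences,
# and products with a text description.
@sl.schema
class User:
    id: sl.IdField
    preferences: sl.String

@sl.schema
class Product:
    id: sl.IdField
    description: sl.String

user = User()
product = Product()

# A single Space embeds related text attributes from both schemas, so
# user preferences and product descriptions share one vector space.
preference_space = sl.TextSimilaritySpace(
    text=[user.preferences, product.description],
    model="sentence-transformers/all-mpnet-base-v2",
)
```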
After defining your entity with the @schema decorator, you embed your data using Spaces that are appropriate to your data types. For example, category information uses the CategoricalSimilaritySpace, and text uses the TextSimilaritySpace.
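A minimal sketch of that pattern, assuming a product entity with one categorical and one textual attribute (the schema, field names, and category list are illustrative):

```python
import superlinked.framework as sl

@sl.schema
class Product:
    id: sl.IdField
    category: sl.String
    description: sl.String

product = Product()

# Category membership goes into a categorical space...
category_space = sl.CategoricalSimilaritySpace(
    category_input=product.category,
    categories=["electronics", "books", "clothing"],
)
# ...while free text goes into a text similarity space.
description_space = sl.TextSimilaritySpace(
    text=product.description,
    model="sentence-transformers/all-mpnet-base-v2",
)
```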
Using our simple two-paragraph dataset above, we proceed as follows:
After defining your schema with @sl.schema, you embed these using Spaces that fit your data types:
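A sketch of what this can look like - the embedding model and the 0-100 like-count bounds are assumptions here; the companion Colab has the canonical code:

```python
import superlinked.framework as sl

@sl.schema
class Paragraph:
    id: sl.IdField
    body: sl.String
    like_count: sl.Integer

paragraph = Paragraph()

# Text attribute: embedded with a sentence-transformer model.
body_space = sl.TextSimilaritySpace(
    text=paragraph.body,
    model="sentence-transformers/all-mpnet-base-v2",
)
# Numeric attribute: embedded so that more likes means more relevant.
like_space = sl.NumberSpace(
    number=paragraph.like_count,
    min_value=0,
    max_value=100,
    mode=sl.Mode.MAXIMUM,
)

# The index concatenates both Spaces into one multimodal vector per paragraph.
paragraph_index = sl.Index([body_space, like_space])
```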
We then run a query that searches by similarity to our query text in body_space, and surface our results in the appropriate order:
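Continuing the sketch above, ingestion and querying can look like this in memory. The query text and the equal Space weights are illustrative, and exact query syntax varies across Superlinked versions:

```python
# Ingest the two-paragraph dataset sketched earlier and run an in-memory app.
source = sl.InMemorySource(paragraph)
executor = sl.InMemoryExecutor(sources=[source], indices=[paragraph_index])
app = executor.run()
source.put(paragraphs)

# Weight the text and like-count Spaces, then search by similarity to the
# query text in body_space.
query = (
    sl.Query(
        paragraph_index,
        weights={body_space: 1.0, like_space: 1.0},
    )
    .find(paragraph)
    .similar(body_space.text, sl.Param("query_text"))
)

result = app.query(query, query_text="What enables progress in AI?")
```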
| Rank | Body | Like Count | ID |
|---|---|---|---|
| 0 | Growing computation power enables advancements in AI. | 75 | paragraph-2 |
| 1 | Glorious animals live in the wilderness. | 10 | paragraph-1 |
The AI-related paragraph ranks first due to semantic similarity, while the engagement metric (like_count) contributes to the overall relevance scoring.