Create vector embeddings from text data for semantic similarity search using pre-trained language models
The TextSimilaritySpace class transforms text data into high-dimensional vector embeddings, enabling semantic similarity search. It leverages pre-trained SentenceTransformers models that are specifically optimized for encoding longer text sequences efficiently.
The text input(s) to be transformed into vectors. Accepts a single String
schema field, a ChunkingNode for processed text, or a sequence of either. Values
must be SchemaField objects, not plain Python strings.
```python
from superlinked import TextSimilaritySpace, chunk

# Create chunked text for long documents
chunked_content = chunk(
    text=document_schema.content,
    chunk_size=500,
    chunk_overlap=100,
    split_chars_keep=[".", "!", "?", ";"],
    split_chars_remove=["\n", "\r"],
)

# Use chunked text in similarity space
document_space = TextSimilaritySpace(
    text=chunked_content,
    model="sentence-transformers/all-mpnet-base-v2",
)
```
```python
from superlinked import Index, TextSimilaritySpace

# Knowledge base search
knowledge_space = TextSimilaritySpace(
    text=document_schema.content,
    model="sentence-transformers/all-mpnet-base-v2",
    cache_size=20000,
)

# Index and query
index = Index([knowledge_space])

# Query: "How to optimize database performance?"
# Finds semantically similar documents even with different wording
```
```python
# Combine title and content for better context
article_space = TextSimilaritySpace(
    text=[article_schema.title, article_schema.content],
    model="sentence-transformers/all-mpnet-base-v2",
)
```
Cache Size: Set cache_size based on your expected unique text inputs.
For high-traffic applications with repetitive queries, larger cache sizes
(50k-100k) can significantly improve performance.
Model Trade-offs: Smaller models (MiniLM) are faster but may be less
accurate. Larger models (MPNet) provide better quality but require more
computational resources.
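To make the trade-off concrete, here is a toy selector over two common SentenceTransformers checkpoints. The helper and its lookup table are illustrative assumptions (the embedding dimensions are the models' published sizes, but which profile fits is workload-dependent):

```python
# Hypothetical lookup table: smaller vectors are cheaper to compute and store.
MODEL_OPTIONS = {
    "sentence-transformers/all-MiniLM-L6-v2": {"dims": 384, "profile": "fast"},
    "sentence-transformers/all-mpnet-base-v2": {"dims": 768, "profile": "quality"},
}

def pick_model(prefer_quality: bool) -> str:
    """Toy selector (hypothetical helper): trade latency for accuracy."""
    profile = "quality" if prefer_quality else "fast"
    return next(
        name for name, meta in MODEL_OPTIONS.items() if meta["profile"] == profile
    )

assert pick_model(prefer_quality=True) == "sentence-transformers/all-mpnet-base-v2"
```

Benchmark both on a sample of your own queries before committing; retrieval quality differences are domain-specific.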
Chunk Size: Very small chunks may lose context, while very large chunks
may dilute important information. Start with 250-500 characters and adjust
based on your content type.
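The windowing involved can be sketched in plain Python. This is a simplified, hypothetical stand-in for `chunk()`, which additionally respects split characters; it only shows how chunk size and overlap interact:

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Slide a fixed-size character window over the text; consecutive chunks overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [
        text[i : i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

chunks = chunk_text("a" * 1200, chunk_size=500, chunk_overlap=100)
assert len(chunks) == 3                     # windows start at 0, 400, 800
assert chunks[0][-100:] == chunks[1][:100]  # overlap preserves boundary context
```

The overlap is what keeps a sentence straddling a chunk boundary visible to both chunks' embeddings.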
Model Consistency: Use specific model versions in production to ensure
consistent embeddings across deployments. Avoid using “latest” tags that may
change.