TextSimilaritySpace
The TextSimilaritySpace class transforms text data into high-dimensional vector embeddings, enabling semantic similarity search. It leverages pre-trained SentenceTransformers models, which encode sentences and paragraphs into dense vectors that capture semantic meaning.
Constructor
Create a new text similarity space with the specified configuration.
TextSimilaritySpace(
text,
model,
cache_size=10000,
model_cache_dir=None,
model_handler=TextModelHandler.SENTENCE_TRANSFORMERS,
embedding_engine_config=None
)
Parameters
text
String | ChunkingNode | Sequence[String | ChunkingNode]
required
The text input(s) to be transformed into vectors. Can be a single String
schema field, a ChunkingNode for processed text, or a sequence of either. Must
be SchemaField objects, not regular Python strings.
model
str
required
The SentenceTransformers model identifier to use for text embedding. Examples include “all-MiniLM-L6-v2”, “all-mpnet-base-v2”, or custom model paths.
cache_size
int
default:"10000"
The number of embeddings to store in an in-memory LRU cache for performance optimization. Set to 0 to disable caching.
model_cache_dir
Path | None
default:"None"
Directory to cache downloaded models. If None, uses the default cache
directory provided by the model library.
model_handler
TextModelHandler
default:"SENTENCE_TRANSFORMERS"
The handler for the embedding model. Currently supports SentenceTransformers
models.
embedding_engine_config
EmbeddingEngineConfig | None
default:"None"
Optional configuration for the embedding engine behavior and optimization
settings.
Example
from superlinked import TextSimilaritySpace, schema
@schema
class ArticleSchema:
id: str
title: str
content: str
summary: str
article_schema = ArticleSchema()
# Basic text similarity space
content_space = TextSimilaritySpace(
text=article_schema.content,
model="sentence-transformers/all-MiniLM-L6-v2"
)
# Advanced configuration
advanced_space = TextSimilaritySpace(
text=[article_schema.title, article_schema.summary],
model="sentence-transformers/all-mpnet-base-v2",
cache_size=50000,
model_cache_dir="/path/to/models"
)
Text Chunking
chunk() Function
For processing long documents, use the chunk()
function to split text into smaller, more manageable pieces:
chunk(
text,
chunk_size=None,
chunk_overlap=None,
split_chars_keep=None,
split_chars_remove=None
) -> ChunkingNode
text
String
required
The String schema field containing the text to be chunked.
chunk_size
int | None
default:"None"
Maximum size of each chunk in characters. Respects word boundaries to avoid splitting words.
chunk_overlap
int | None
default:"None"
Maximum overlap between consecutive chunks in characters to maintain context continuity.
split_chars_keep
list[str] | None
default:"['!', '?', '.']"
Characters to split at while keeping them in the text. Used to identify
natural breakpoints.
split_chars_remove
list[str] | None
default:"['\\n']"
Characters to split at and remove from the text. Useful for removing
formatting characters.
Chunking Example
from superlinked import TextSimilaritySpace, chunk
# Create chunked text for long documents
chunked_content = chunk(
text=document_schema.content,
chunk_size=500,
chunk_overlap=100,
split_chars_keep=[".", "!", "?", ";"],
split_chars_remove=["\n", "\r"]
)
# Use chunked text in similarity space
document_space = TextSimilaritySpace(
text=chunked_content,
model="sentence-transformers/all-mpnet-base-v2"
)
Properties
space_field_set
space_field_set: SpaceFieldSet
Manages the text fields and their processing configuration within the space.
transformation_config
transformation_config: TransformationConfig[Vector, str]
Configuration object that defines how text strings are transformed into vector representations.
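For quick debugging you can inspect these properties on an existing space. A minimal sketch using the content_space instance from the Example above (the printed representations depend on the library version):
# Inspect how the space is wired at runtime
print(content_space.space_field_set)        # schema fields feeding the space
print(content_space.transformation_config)  # text-to-vector transformation settings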
Model Selection
Popular Models
General Purpose
# Balanced performance and quality
TextSimilaritySpace(
    text=article_schema.content,
    model="sentence-transformers/all-MiniLM-L6-v2"  # 384 dimensions
)
# Higher quality, larger size
TextSimilaritySpace(
    text=article_schema.content,
    model="sentence-transformers/all-mpnet-base-v2"  # 768 dimensions
)
Use Cases
Document Search
Semantic search across document collections:
# Knowledge base search
knowledge_space = TextSimilaritySpace(
text=document_schema.content,
model="sentence-transformers/all-mpnet-base-v2",
cache_size=20000
)
# Index and query
index = Index([knowledge_space])
# Query: "How to optimize database performance?"
# Finds semantically similar documents even with different wording
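A query for this space might look like the following sketch. It assumes the Query and Param constructs from superlinked and an app built as in the Integration Example below; exact clause signatures may vary between versions:
from superlinked import Query, Param
# Reusable semantic query over the knowledge index
knowledge_query = (
    Query(index)
    .find(document_schema)
    .similar(knowledge_space, Param("query_text"))
)
# Execute with a natural-language question
result = app.query(knowledge_query, query_text="How to optimize database performance?")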
Product Recommendations
Product discovery based on descriptions:
# E-commerce product similarity
product_space = TextSimilaritySpace(
text=product_schema.description,
model="sentence-transformers/all-MiniLM-L6-v2"
)
# Combine with other features (the category list and number range below
# are illustrative values; both spaces require these arguments)
product_index = Index([
    product_space,
    CategoricalSimilaritySpace(
        category_input=product_schema.category,
        categories=["electronics", "clothing", "home"],
    ),
    NumberSpace(
        number=product_schema.price,
        min_value=0,
        max_value=1000,
        mode=Mode.MINIMUM,  # favor lower prices; Mode is importable from superlinked
    ),
])
Content Moderation
Detect similar content for moderation:
# Content similarity detection
content_space = TextSimilaritySpace(
text=post_schema.content,
model="sentence-transformers/all-MiniLM-L6-v2"
)
# Find potentially duplicate or similar posts
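For moderation workflows, an item-to-item (“more like this”) query is often more useful than free-text search. A hedged sketch using superlinked's with_vector clause, which compares against an already-ingested post's vector (index and app assumed to be set up as elsewhere on this page):
from superlinked import Query, Param
# Find posts whose embeddings are close to a reference post
duplicate_query = (
    Query(index)
    .find(post_schema)
    .with_vector(post_schema, Param("post_id"))
)
result = app.query(duplicate_query, post_id="post-123")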
Multi-Field Text Processing
Process multiple text fields together:
# Combine title and content for better context
article_space = TextSimilaritySpace(
text=[article_schema.title, article_schema.content],
model="sentence-transformers/all-mpnet-base-v2"
)
Caching Strategy
Cache Size: Set cache_size based on the number of distinct texts you expect to embed repeatedly. For high-traffic applications with repetitive queries, larger cache sizes (50k-100k) can significantly improve performance.
Model Selection
Model Trade-offs: Smaller models (MiniLM) are faster but may be less
accurate. Larger models (MPNet) provide better quality but require more
computational resources.
Chunking Optimization
Chunk Size: Very small chunks may lose context, while very large chunks
may dilute important information. Start with 250-500 characters and adjust
based on your content type.
Best Practices
Text Preprocessing
# Example: Clean text before processing
@schema
class CleanDocumentSchema:
    id: str
    raw_content: str
    clean_content: str  # Preprocessed text

clean_document_schema = CleanDocumentSchema()

# Use clean_content for better embeddings
text_space = TextSimilaritySpace(
    text=clean_document_schema.clean_content,
    model="sentence-transformers/all-MiniLM-L6-v2"
)
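The cleaning step itself happens outside superlinked. A minimal, hypothetical clean_text helper (not part of the library) showing the kind of normalization that typically improves embedding quality:
import re

def clean_text(raw: str) -> str:
    # Strip markup remnants, collapse whitespace, trim edges
    text = re.sub(r"<[^>]+>", " ", raw)  # drop HTML tags
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text.strip()

# Populate clean_content at ingestion time (illustrative record)
record = {"id": "doc-1", "raw_content": "  <p>Hello   world</p> "}
record["clean_content"] = clean_text(record["raw_content"])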
Model Versioning
Model Consistency: Use specific model versions in production to ensure
consistent embeddings across deployments. Avoid using “latest” tags that may
change.
Memory Management
Resource Usage: Text embedding models can be memory-intensive. Monitor
GPU/CPU usage and consider model size when scaling to multiple instances.
Integration Example
from superlinked import (
    TextSimilaritySpace, InMemoryExecutor, Index,
    InMemorySource, chunk, schema
)
# Complete setup for semantic search
@schema
class DocumentSchema:
id: str
title: str
content: str
category: str
document_schema = DocumentSchema()
# Create chunked text space for long documents
chunked_content = chunk(
text=document_schema.content,
chunk_size=400,
chunk_overlap=50
)
text_space = TextSimilaritySpace(
text=chunked_content,
model="sentence-transformers/all-mpnet-base-v2",
cache_size=25000
)
# Set up the application
source = InMemorySource(document_schema)
index = Index([text_space])
executor = InMemoryExecutor(sources=[source], indices=[index])
app = executor.run()
# Now ready for semantic search across chunked documents
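Ingestion and querying against this app could then look like the sketch below (put, Query, and Param follow the superlinked API; exact signatures may vary by version):
from superlinked import Query, Param
# Ingest a document; keys must match DocumentSchema's fields
source.put([{
    "id": "doc-1",
    "title": "Tuning PostgreSQL",
    "content": "Indexes, vacuum settings, and query planning ...",
    "category": "databases",
}])
# Semantic search over the chunked content
search_query = (
    Query(index)
    .find(document_schema)
    .similar(text_space, Param("query_text"))
)
result = app.query(search_query, query_text="database performance tips")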