The TextSimilaritySpace class transforms text data into high-dimensional vector embeddings, enabling semantic similarity search. It leverages pre-trained SentenceTransformers models to encode text into dense vectors, so that texts with similar meaning land close together in the embedding space.

Constructor

Create a new text similarity space with the specified configuration.
TextSimilaritySpace(
    text,
    model,
    cache_size=10000,
    model_cache_dir=None,
    model_handler=TextModelHandler.SENTENCE_TRANSFORMERS,
    embedding_engine_config=None
)

Parameters

text
String | ChunkingNode | Sequence[String | ChunkingNode]
required
The text input(s) to be transformed into vectors. Can be a single String schema field, a ChunkingNode for processed text, or a sequence of either. Must be SchemaField objects, not regular Python strings.
model
str
required
The SentenceTransformers model identifier to use for text embedding. Examples include "all-MiniLM-L6-v2", "all-mpnet-base-v2", or custom model paths.
cache_size
int
default:"10000"
The number of embeddings to store in an in-memory LRU cache for performance optimization. Set to 0 to disable caching.
model_cache_dir
Path | None
default:"None"
Directory to cache downloaded models. If None, uses the default cache directory provided by the model library.
model_handler
TextModelHandler
default:"SENTENCE_TRANSFORMERS"
The handler for the embedding model. Currently supports SentenceTransformers models.
embedding_engine_config
EmbeddingEngineConfig | None
default:"None"
Optional configuration for the embedding engine behavior and optimization settings.

Example

from superlinked import TextSimilaritySpace, schema

@schema
class ArticleSchema:
    id: str
    title: str
    content: str
    summary: str

article_schema = ArticleSchema()

# Basic text similarity space
content_space = TextSimilaritySpace(
    text=article_schema.content,
    model="sentence-transformers/all-MiniLM-L6-v2"
)

# Advanced configuration
advanced_space = TextSimilaritySpace(
    text=[article_schema.title, article_schema.summary],
    model="sentence-transformers/all-mpnet-base-v2",
    cache_size=50000,
    model_cache_dir="/path/to/models"
)

Text Chunking

chunk() Function

For processing long documents, use the chunk() function to split text into smaller, more manageable pieces:
chunk(
    text,
    chunk_size=None,
    chunk_overlap=None,
    split_chars_keep=None,
    split_chars_remove=None
) -> ChunkingNode
text
String
required
The String schema field containing the text to be chunked.
chunk_size
int | None
default:"250"
Maximum size of each chunk in characters. Respects word boundaries to avoid splitting words.
chunk_overlap
int | None
default:"50"
Maximum overlap between consecutive chunks in characters to maintain context continuity.
split_chars_keep
list[str] | None
default:"['!', '?', '.']"
Characters to split at while keeping them in the text. Used to identify natural breakpoints.
split_chars_remove
list[str] | None
default:"['\\n']"
Characters to split at and remove from the text. Useful for removing formatting characters.

Chunking Example

from superlinked import TextSimilaritySpace, chunk

# Create chunked text for long documents
chunked_content = chunk(
    text=document_schema.content,
    chunk_size=500,
    chunk_overlap=100,
    split_chars_keep=[".", "!", "?", ";"],
    split_chars_remove=["\n", "\r"]
)

# Use chunked text in similarity space
document_space = TextSimilaritySpace(
    text=chunked_content,
    model="sentence-transformers/all-mpnet-base-v2"
)

Properties

space_field_set

space_field_set: SpaceFieldSet
Manages the text fields and their processing configuration within the space.

transformation_config

transformation_config: TransformationConfig[Vector, str]
Configuration object that defines how text strings are transformed into vector representations.

Model Selection

# Balanced performance and quality
TextSimilaritySpace(
    text=schema.content,
    model="sentence-transformers/all-MiniLM-L6-v2"  # 384 dimensions
)

# Higher quality, larger size
TextSimilaritySpace(
    text=schema.content,
    model="sentence-transformers/all-mpnet-base-v2"  # 768 dimensions
)

Use Cases

Semantic Search

Semantic search across document collections:
# Knowledge base search
knowledge_space = TextSimilaritySpace(
    text=document_schema.content,
    model="sentence-transformers/all-mpnet-base-v2",
    cache_size=20000
)

# Index and query
index = Index([knowledge_space])
# Query: "How to optimize database performance?"
# Finds semantically similar documents even with different wording

Product Recommendations

Product discovery based on descriptions:
# E-commerce product similarity
product_space = TextSimilaritySpace(
    text=product_schema.description,
    model="sentence-transformers/all-MiniLM-L6-v2"
)

# Combine with other features
product_index = Index([
    product_space,
    CategoricalSimilaritySpace(category_input=product_schema.category),
    NumberSpace(number=product_schema.price)
])

Content Moderation

Detect similar content for moderation:
# Content similarity detection
content_space = TextSimilaritySpace(
    text=post_schema.content,
    model="sentence-transformers/all-MiniLM-L6-v2"
)

# Find potentially duplicate or similar posts

Multi-Field Text Processing

Process multiple text fields together:
# Combine title and content for better context
article_space = TextSimilaritySpace(
    text=[article_schema.title, article_schema.content],
    model="sentence-transformers/all-mpnet-base-v2"
)

Performance Optimization

Caching Strategy

Cache Size: Set cache_size based on your expected unique text inputs. For high-traffic applications with repetitive queries, larger cache sizes (50k-100k) can significantly improve performance.
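As a rough back-of-envelope check (an illustration, not superlinked's actual cache accounting), the cache's memory footprint can be estimated from the entry count, the embedding dimensionality, and the float width:

```python
def estimate_cache_mb(cache_size: int, dims: int, bytes_per_float: int = 4) -> float:
    """Rough lower bound: one float32 vector per cached entry (ignores keys and overhead)."""
    return cache_size * dims * bytes_per_float / (1024 ** 2)

# 50,000 cached 768-dimensional float32 embeddings ≈ 146 MB
print(round(estimate_cache_mb(50_000, 768)))  # -> 146
```

A 50k-entry cache of 768-dimensional vectors costs roughly 150 MB of RAM, which is usually an easy trade against recomputing embeddings for repeated queries.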

Model Selection

Model Trade-offs: Smaller models (MiniLM) are faster but may be less accurate. Larger models (MPNet) provide better quality but require more computational resources.

Chunking Optimization

Chunk Size: Very small chunks may lose context, while very large chunks may dilute important information. Start with 250-500 characters and adjust based on your content type.
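To build intuition for how chunk_size and chunk_overlap interact, here is a simplified fixed-window sketch (the library's actual chunker also respects word boundaries and split characters):

```python
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a fixed character window; overlap must be smaller than chunk_size."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 1000
pieces = naive_chunks(doc, chunk_size=250, chunk_overlap=50)
# a 250-char window advancing 200 chars at a time covers 1000 chars in 5 windows
print(len(pieces))  # -> 5
```

Larger overlap means more chunks (and more embeddings to compute and store) for the same document, which is the cost of better context continuity at chunk boundaries.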

Best Practices

Text Preprocessing

# Example: Clean text before processing
@schema
class CleanDocumentSchema:
    id: str
    raw_content: str
    clean_content: str  # Preprocessed text

document_schema = CleanDocumentSchema()

# Use clean_content for better embeddings
text_space = TextSimilaritySpace(
    text=document_schema.clean_content,
    model="sentence-transformers/all-MiniLM-L6-v2"
)

Model Versioning

Model Consistency: Pin specific model versions in production to ensure consistent embeddings across deployments. Avoid mutable references such as "latest" tags: if the model changes, new embeddings become incompatible with vectors already stored in the index.

Memory Management

Resource Usage: Text embedding models can be memory-intensive. Monitor GPU/CPU usage and consider model size when scaling to multiple instances.
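For capacity planning, the weights-only footprint of a model can be estimated from its parameter count (the ~110M figure below is an approximation for a base-size model like all-mpnet-base-v2; actual runtime memory adds activations and framework overhead):

```python
def model_weights_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Weights-only footprint in MB; real usage is higher (activations, buffers)."""
    return num_params * bytes_per_param / (1024 ** 2)

# ~110M parameters in float32 is roughly 420 MB of weights
print(round(model_weights_mb(110_000_000)))  # -> 420
```

This is why running several model replicas per instance, or loading multiple large models side by side, deserves an explicit memory budget.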

Integration Example

from superlinked import (
    TextSimilaritySpace, InMemoryApp, Index,
    InMemorySource, chunk, schema
)

# Complete setup for semantic search
@schema
class DocumentSchema:
    id: str
    title: str
    content: str
    category: str

document_schema = DocumentSchema()

# Create chunked text space for long documents
chunked_content = chunk(
    text=document_schema.content,
    chunk_size=400,
    chunk_overlap=50
)

text_space = TextSimilaritySpace(
    text=chunked_content,
    model="sentence-transformers/all-mpnet-base-v2",
    cache_size=25000
)

# Set up the application
source = InMemorySource(document_schema)
index = Index([text_space])
app = InMemoryApp(sources=[source], indices=[index])

# Now ready for semantic search across chunked documents