TextSimilaritySpace - Superlinked

The TextSimilaritySpace class transforms text into high-dimensional vector embeddings so that items can be searched by semantic similarity. It uses pre-trained SentenceTransformers models, which are designed to encode sentences and longer passages efficiently.

Constructor

Create a new text similarity space with the specified configuration.

TextSimilaritySpace(
    text,
    model,
    cache_size=10000,
    model_cache_dir=None,
    model_handler=TextModelHandler.SENTENCE_TRANSFORMERS,
    embedding_engine_config=None
)

Parameters

text

String | ChunkingNode | Sequence[String | ChunkingNode]

required

The text input(s) to be transformed into vectors. Can be a single String schema field, a ChunkingNode for processed text, or a sequence of either. Must be SchemaField objects, not regular Python strings.

model

str

required

The SentenceTransformers model identifier to use for text embedding. Examples include “all-MiniLM-L6-v2”, “all-mpnet-base-v2”, or custom model paths.

cache_size

int

default:"10000"

The number of embeddings to store in an in-memory LRU cache for performance optimization. Set to 0 to disable caching.

model_cache_dir

Path | None

default:"None"

Directory to cache downloaded models. If None, uses the default cache directory provided by the model library.

model_handler

TextModelHandler

default:"SENTENCE_TRANSFORMERS"

The handler for the embedding model. Currently supports SentenceTransformers models.

embedding_engine_config

EmbeddingEngineConfig | None

default:"None"

Optional configuration for the embedding engine behavior and optimization settings.

Example

from pathlib import Path

from superlinked import TextSimilaritySpace, schema

@schema
class ArticleSchema:
    id: str
    title: str
    content: str
    summary: str

article_schema = ArticleSchema()

# Basic text similarity space
content_space = TextSimilaritySpace(
    text=article_schema.content,
    model="sentence-transformers/all-MiniLM-L6-v2"
)

# Advanced configuration: several fields, larger cache, custom model directory
advanced_space = TextSimilaritySpace(
    text=[article_schema.title, article_schema.summary],
    model="sentence-transformers/all-mpnet-base-v2",
    cache_size=50000,
    model_cache_dir=Path("/path/to/models")  # model_cache_dir is typed Path | None
)

Text Chunking

chunk() Function

For processing long documents, use the chunk() function to split text into smaller, more manageable pieces:

chunk(
    text,
    chunk_size=None,
    chunk_overlap=None,
    split_chars_keep=None,
    split_chars_remove=None
) -> ChunkingNode

text

String

required

The String schema field containing the text to be chunked.

chunk_size

int | None

default:"250"

Maximum size of each chunk in characters. If None, the default of 250 is used. Respects word boundaries to avoid splitting words.

chunk_overlap

int | None

default:"50"

Maximum overlap between consecutive chunks in characters, used to maintain context continuity. If None, the default of 50 is used.

split_chars_keep

list[str] | None

default:"['!', '?', '.']"

Characters to split at while keeping them in the text. Used to identify natural breakpoints. If None, the default of ['!', '?', '.'] is used.

split_chars_remove

list[str] | None

default:"['\\n']"

Characters to split at and remove from the text. Useful for removing formatting characters. If None, the default of ['\n'] is used.

Chunking Example

from superlinked import TextSimilaritySpace, chunk

# Create chunked text for long documents
chunked_content = chunk(
    text=document_schema.content,
    chunk_size=500,
    chunk_overlap=100,
    split_chars_keep=[".", "!", "?", ";"],
    split_chars_remove=["\n", "\r"]
)

# Use chunked text in similarity space
document_space = TextSimilaritySpace(
    text=chunked_content,
    model="sentence-transformers/all-mpnet-base-v2"
)
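
To make the interplay of chunk_size and chunk_overlap concrete, here is a minimal standalone sketch of a word-boundary chunker. It is not Superlinked's implementation: chunk_text is a hypothetical helper, it ignores split_chars_keep and split_chars_remove, and it only assumes the behaviour described above (chunks limited to a maximum number of characters, words never split, trailing words repeated at the start of the next chunk).

```python
def chunk_text(text: str, chunk_size: int = 250, chunk_overlap: int = 50) -> list[str]:
    """Greedy word-boundary chunker (illustrative only, not the library's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks: list[str] = []
    current: list[str] = []
    for word in text.split():
        if current and len(" ".join(current + [word])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing words that fit within the overlap budget.
            carried: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # Drop carried words if they would push the new chunk over the limit.
            while carried and len(" ".join(carried + [word])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = chunk_text(
    "one two three four five six seven eight nine ten",
    chunk_size=20,
    chunk_overlap=8,
)
for piece in chunks:
    print(repr(piece))
```

In this sketch a single word longer than chunk_size is still emitted as its own oversized chunk, since words are never split.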

Properties

space_field_set

space_field_set: SpaceFieldSet

Manages the text fields and their processing configuration within the space.

transformation_config

transformation_config: TransformationConfig[Vector, str]

Configuration object that defines how text strings are transformed into vector representations.

Model Selection

Popular Models

General Purpose

# Balanced performance and quality
TextSimilaritySpace(
    text=article_schema.content,
    model="sentence-transformers/all-MiniLM-L6-v2"  # 384 dimensions
)

# Higher quality, larger size
TextSimilaritySpace(
    text=article_schema.content,
    model="sentence-transformers/all-mpnet-base-v2"  # 768 dimensions
)

Multilingual

# Multilingual support
TextSimilaritySpace(
    text=article_schema.content,
    model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

Domain-Specific

# Legal documents
TextSimilaritySpace(
    text=article_schema.content,
    model="sentence-transformers/all-MiniLM-L6-v2"  # General-purpose base; fine-tune it on legal text
)

# Scientific papers
TextSimilaritySpace(
    text=article_schema.content,
    model="sentence-transformers/allenai-specter"
)

Use Cases

Document Search

Semantic search across document collections:

# Knowledge base search
knowledge_space = TextSimilaritySpace(
    text=document_schema.content,
    model="sentence-transformers/all-mpnet-base-v2",
    cache_size=20000
)

# Index and query
index = Index([knowledge_space])
# Query: "How to optimize database performance?"
# Finds semantically similar documents even with different wording
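
The idea behind "semantically similar" results can be sketched without any model: documents and the query are represented as vectors, and documents are ranked by how closely their vectors point in the same direction as the query. The example below uses cosine similarity, a common choice for embedding search; whether the index uses exactly this measure is an assumption here, and the 3-dimensional vectors are hand-written stand-ins for real model output.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical): real embeddings have hundreds of dimensions.
documents = {
    "Tuning SQL indexes for faster queries": [0.9, 0.1, 0.0],
    "Baking sourdough bread at home": [0.0, 0.2, 0.9],
    "Reducing query latency in Postgres": [0.8, 0.3, 0.1],
}
# Stand-in for the embedding of "How to optimize database performance?"
query_vector = [1.0, 0.2, 0.0]

ranked = sorted(
    documents,
    key=lambda title: cosine_similarity(query_vector, documents[title]),
    reverse=True,
)
for title in ranked:
    print(f"{cosine_similarity(query_vector, documents[title]):.3f}  {title}")
```

Neither database title shares the query's wording, yet both rank above the unrelated document because their vectors are close to the query vector.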

Product Recommendations

Product discovery based on descriptions:

# E-commerce product similarity
product_space = TextSimilaritySpace(
    text=product_schema.description,
    model="sentence-transformers/all-MiniLM-L6-v2"
)

# Combine with other features
product_index = Index([
    product_space,
    CategoricalSimilaritySpace(category_input=product_schema.category),
    NumberSpace(number=product_schema.price)
])

Content Moderation

Detect similar content for moderation:

# Content similarity detection
content_space = TextSimilaritySpace(
    text=post_schema.content,
    model="sentence-transformers/all-MiniLM-L6-v2"
)

# Find potentially duplicate or similar posts

Multi-Field Text Processing

Process multiple text fields together:

# Combine title and content for better context
article_space = TextSimilaritySpace(
    text=[article_schema.title, article_schema.content],
    model="sentence-transformers/all-mpnet-base-v2"
)

Performance Optimization

Caching Strategy

Cache Size: Set cache_size based on the number of unique text inputs you expect. For high-traffic applications with repetitive queries, a larger cache (50k-100k entries) avoids re-embedding repeated inputs, at the cost of more memory.
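
One way to reason about cache sizing is to replay a representative query stream against LRU caches of different sizes and compare hit rates. The sketch below uses Python's functools.lru_cache purely as an illustration; it is not Superlinked's internal cache, and embed and hit_rate are hypothetical names with a trivial stand-in for the model call.

```python
from functools import lru_cache


def hit_rate(queries: list[str], cache_size: int) -> float:
    """Replay queries through an LRU cache of the given size and return the hit rate."""

    @lru_cache(maxsize=cache_size)
    def embed(text: str) -> tuple[float, ...]:
        # Stand-in for an expensive model call.
        return (float(len(text)),)

    for query in queries:
        embed(query)
    info = embed.cache_info()
    return info.hits / (info.hits + info.misses)


# 1,000 requests cycling through 100 distinct queries.
queries = [f"query {i % 100}" for i in range(1000)]

small = hit_rate(queries, cache_size=50)
large = hit_rate(queries, cache_size=100)
print(f"cache_size=50:  {small:.0%} hits")
print(f"cache_size=100: {large:.0%} hits")
```

With this cyclic access pattern, a cache smaller than the set of distinct queries evicts every entry before it is reused, while a cache that holds the whole set serves every repeat from memory. Real traffic is usually more skewed, so measuring against your own query logs gives a better estimate.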

Model Selection

Model Trade-offs: Smaller models (MiniLM) are faster but may be less accurate. Larger models (MPNet) provide better quality but require more computational resources.
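
Part of this trade-off is storage: the 768-dimensional vectors of all-mpnet-base-v2 take twice the space of the 384-dimensional vectors of all-MiniLM-L6-v2. A rough estimate, assuming each value is stored as a 4-byte float32 and ignoring index overhead (both are assumptions; embedding_memory_mib is a hypothetical helper):

```python
def embedding_memory_mib(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in MiB (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value / (1024 ** 2)


minilm_mib = embedding_memory_mib(1_000_000, 384)
mpnet_mib = embedding_memory_mib(1_000_000, 768)
print(f"MiniLM (384 dims): {minilm_mib:.1f} MiB per million vectors")
print(f"MPNet  (768 dims): {mpnet_mib:.1f} MiB per million vectors")
```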

Chunking Optimization

Chunk Size: Very small chunks may lose context, while very large chunks may dilute important information. Start with 250-500 characters and adjust based on your content type.
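
Chunk size and overlap also determine how many vectors each document produces, which affects indexing time and storage. A back-of-the-envelope estimate, assuming every chunk is filled to chunk_size and overlaps its predecessor by exactly chunk_overlap characters (word-boundary and split-character handling will shift the real number; estimate_chunk_count is a hypothetical helper):

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Approximate number of chunks for a text of text_length characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new characters covered by each extra chunk
    return 1 + math.ceil((text_length - chunk_size) / stride)


small_chunks = estimate_chunk_count(10_000, chunk_size=250, chunk_overlap=50)
large_chunks = estimate_chunk_count(10_000, chunk_size=500, chunk_overlap=100)
print(small_chunks, large_chunks)
```

Under these assumptions, doubling both settings roughly halves the number of embeddings per document.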

Best Practices

Text Preprocessing

# Example: Clean text before processing
@schema
class CleanDocumentSchema:
    id: str
    raw_content: str
    clean_content: str  # Preprocessed text

document_schema = CleanDocumentSchema()

# Use clean_content for better embeddings
text_space = TextSimilaritySpace(
    text=document_schema.clean_content,
    model="sentence-transformers/all-MiniLM-L6-v2"
)
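
What "clean" means depends on your data. As one possible sketch, the helper below strips HTML tags, decodes HTML entities, and collapses whitespace before the text is stored in clean_content. clean_text is a hypothetical function, not part of Superlinked, and the regex-based tag removal is a simplification that suits simple markup rather than arbitrary HTML.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")
_WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Remove tags, decode entities, and normalize whitespace."""
    without_tags = _TAG_RE.sub(" ", raw)
    unescaped = html.unescape(without_tags)
    return _WHITESPACE_RE.sub(" ", unescaped).strip()


cleaned = clean_text("<p>Fast &amp; reliable   search</p>\n\n<br/>Try it!")
print(cleaned)
```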

Model Versioning

Model Consistency: Use specific model versions in production to ensure consistent embeddings across deployments. Avoid using “latest” tags that may change.

Memory Management

Resource Usage: Text embedding models can be memory-intensive. Monitor GPU/CPU usage and consider model size when scaling to multiple instances.

Integration Example

from superlinked import (
    TextSimilaritySpace, InMemoryApp, Index,
    InMemorySource, chunk, schema
)

# Complete setup for semantic search
@schema
class DocumentSchema:
    id: str
    title: str
    content: str
    category: str

document_schema = DocumentSchema()

# Create chunked text space for long documents
chunked_content = chunk(
    text=document_schema.content,
    chunk_size=400,
    chunk_overlap=50
)

text_space = TextSimilaritySpace(
    text=chunked_content,
    model="sentence-transformers/all-mpnet-base-v2",
    cache_size=25000
)

# Set up the application
source = InMemorySource(document_schema)
index = Index([text_space])
app = InMemoryApp(sources=[source], indices=[index])

# Now ready for semantic search across chunked documents
