The Source class serves as the abstract foundation for all data source implementations in Superlinked. It defines the interface for providing data to vector indices for processing and search operations.

Constructor

Source()
This is an abstract base class and cannot be instantiated directly. Use concrete implementations for specific data source types.
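A minimal sketch of the intended usage, assuming the common `superlinked.framework` import alias and a hypothetical two-field schema:

```python
import superlinked.framework as sl

class ProductSchema(sl.Schema):
    id: sl.IdField
    description: sl.String

product = ProductSchema()

# Source itself is abstract and cannot be constructed; bind a concrete
# implementation such as InMemorySource to the schema it will feed.
source = sl.InMemorySource(product)
```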

Inheritance Hierarchy

Every concrete data source implementation derives from Source, which in turn extends SourceABC.

Available Implementations

Superlinked provides four concrete implementations: InMemorySource, InteractiveSource, RestSource, and DataLoaderSource. The sections below describe when to use each.

Source Types and Use Cases

Development Sources

InMemorySource: Perfect for development, testing, and prototyping where data is provided programmatically and stored in memory.

InteractiveSource: Ideal for interactive development environments where you want to add data incrementally and see immediate results.

Production Sources

RestSource: Designed for production applications where data arrives through REST API endpoints, enabling real-time data ingestion.

DataLoaderSource: Suitable for batch processing scenarios where data is loaded from files (CSV, JSON, Parquet) or external data sources.

Data Flow Architecture

Sources integrate into the Superlinked pipeline as the entry point for data (an end-to-end sketch follows the list):
  1. Data Input: Sources receive raw data from various inputs (APIs, files, memory)
  2. Parsing: Data is processed through associated parsers to match schema requirements
  3. Validation: Schema validation ensures data conformity
  4. Transformation: Data flows to vector spaces for embedding generation
  5. Indexing: Processed vectors are stored in indices for searching
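The sketch below walks these five steps end to end. It is a minimal illustration, not a canonical recipe: the sl.* names (InMemorySource, TextSimilaritySpace, InMemoryExecutor) follow the Superlinked docs, while the schema, model id, and record are hypothetical.

```python
import superlinked.framework as sl

class ProductSchema(sl.Schema):
    id: sl.IdField
    description: sl.String

product = ProductSchema()

# Transformation: a space turns parsed text into embeddings.
description_space = sl.TextSimilaritySpace(
    text=product.description,
    model="sentence-transformers/all-MiniLM-L6-v2",
)

# Indexing: processed vectors are stored here for searching.
index = sl.Index([description_space])

# Data input: the source is the pipeline's entry point.
source = sl.InMemorySource(product)

executor = sl.InMemoryExecutor(sources=[source], indices=[index])
app = executor.run()

# Parsing and validation happen inside put(): records must match the schema.
source.put([{"id": "p1", "description": "A lightweight rain jacket"}])
```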

Source Selection Guide

For Development

InMemorySource: Use when you want to quickly test with small datasets and don’t need persistence. Perfect for tutorials and experimentation.
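For example (a hedged sketch; the query-builder calls .find(), .similar(), and .select_all() are assumed from the Superlinked docs and may differ across versions):

```python
import superlinked.framework as sl

class NoteSchema(sl.Schema):
    id: sl.IdField
    text: sl.String

note = NoteSchema()
space = sl.TextSimilaritySpace(text=note.text, model="sentence-transformers/all-MiniLM-L6-v2")
index = sl.Index([space])

source = sl.InMemorySource(note)
app = sl.InMemoryExecutor(sources=[source], indices=[index]).run()

# Nothing is persisted between runs: feed records, query, throw away.
source.put([
    {"id": "1", "text": "Grocery list: eggs, milk, bread"},
    {"id": "2", "text": "Ideas for the quarterly planning meeting"},
])

query = sl.Query(index).find(note).similar(space.text, sl.Param("q")).select_all()
print(app.query(query, q="meeting notes"))
```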

For Interactive Development

InteractiveSource: Use when building and testing incrementally, allowing you to add data on-the-fly and observe results immediately.
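In a notebook this looks like the in-memory setup, except that put() is called repeatedly as you iterate. A hedged sketch, assuming InteractiveSource and InteractiveExecutor pair the same way as their in-memory counterparts:

```python
import superlinked.framework as sl

class NoteSchema(sl.Schema):
    id: sl.IdField
    text: sl.String

note = NoteSchema()
space = sl.TextSimilaritySpace(text=note.text, model="sentence-transformers/all-MiniLM-L6-v2")
index = sl.Index([space])

source = sl.InteractiveSource(note)
app = sl.InteractiveExecutor(sources=[source], indices=[index]).run()

# Add data in small increments, inspecting query results between calls.
source.put([{"id": "1", "text": "First draft of the release announcement"}])
source.put([{"id": "2", "text": "Revised draft with reviewer feedback"}])
```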

For Production APIs

RestSource: Use for production web applications where data is ingested through HTTP endpoints. Provides scalable real-time data processing.
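A server-side sketch, assuming the RestExecutor, RestQuery, RestDescriptor, InMemoryVectorDatabase, and SuperlinkedRegistry names from the Superlinked server docs; verify these against your installed version:

```python
import superlinked.framework as sl

class ProductSchema(sl.Schema):
    id: sl.IdField
    description: sl.String

product = ProductSchema()
space = sl.TextSimilaritySpace(
    text=product.description,
    model="sentence-transformers/all-MiniLM-L6-v2",
)
index = sl.Index([space])
query = sl.Query(index).find(product).similar(space.text, sl.Param("query_text"))

# Data arrives over HTTP rather than through programmatic put() calls.
source = sl.RestSource(product)

executor = sl.RestExecutor(
    sources=[source],
    indices=[index],
    queries=[sl.RestQuery(sl.RestDescriptor("product_search"), query)],
    vector_database=sl.InMemoryVectorDatabase(),  # swap for a production store
)
sl.SuperlinkedRegistry.register(executor)  # exposes ingestion and query endpoints
```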

For Batch Processing

DataLoaderSource: Use for ETL pipelines and batch processing scenarios where you need to load large datasets from files or databases.
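A batch-loading sketch; DataLoaderConfig and DataFormat are assumed from the Superlinked docs, and the file path and column layout are hypothetical:

```python
import superlinked.framework as sl

class ProductSchema(sl.Schema):
    id: sl.IdField
    description: sl.String

product = ProductSchema()

# Point the source at a file; the loader reads and ingests it in batches.
config = sl.DataLoaderConfig("data/products.csv", sl.DataFormat.CSV)
source = sl.DataLoaderSource(product, config)
```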

Integration Pattern

All source implementations follow a consistent pattern:
  1. Schema Association: Each source is associated with a specific schema that defines the data structure
  2. Parser Configuration: Sources can use custom parsers to handle different data formats (see the parser sketch after this list)
  3. Data Validation: Incoming data is validated against the associated schema
  4. Event Publishing: Sources publish data events to connected indices and processing components
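Step 2 is the one most often customized: a parser maps raw field names onto schema fields. A hedged sketch using the DataFrameParser mapping style from the Superlinked docs; the dataframe column names are hypothetical:

```python
import pandas as pd
import superlinked.framework as sl

class ProductSchema(sl.Schema):
    id: sl.IdField
    description: sl.String

product = ProductSchema()

# Map dataframe columns onto schema fields where the names differ.
parser = sl.DataFrameParser(
    product,
    mapping={product.id: "product_id", product.description: "desc"},
)
source = sl.InMemorySource(product, parser=parser)

# With a DataFrameParser attached, put() accepts dataframes directly.
source.put([pd.DataFrame([{"product_id": "p1", "desc": "Wool hiking socks"}])])
```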

Best Practices

Schema Design

Schema Consistency: Ensure your data source consistently provides data that matches the associated schema structure. Mismatched data will cause processing failures.
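For example, with an id-plus-description schema every incoming record must carry both fields with compatible types; a record missing a field should be rejected at ingestion rather than silently indexed. A hedged sketch reusing the pipeline setup from above:

```python
import superlinked.framework as sl

class ProductSchema(sl.Schema):
    id: sl.IdField
    description: sl.String

product = ProductSchema()
space = sl.TextSimilaritySpace(text=product.description, model="sentence-transformers/all-MiniLM-L6-v2")
index = sl.Index([space])
source = sl.InMemorySource(product)
app = sl.InMemoryExecutor(sources=[source], indices=[index]).run()

# Conforming record: every schema field present with the expected type.
source.put([{"id": "p1", "description": "Matches the schema"}])

# Non-conforming record: "description" is missing, so ingestion is
# expected to fail validation rather than index partial data.
# source.put([{"id": "p2"}])
```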

Performance Considerations

Batch Processing: For large datasets, consider using DataLoaderSource with appropriate batch sizes to optimize memory usage and processing performance.
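The same idea applies to programmatic ingestion: chunk large record lists into bounded put() calls. A framework-agnostic sketch (the batch size is an arbitrary illustration):

```python
def put_in_batches(source, records, batch_size=1_000):
    """Feed records to a source in fixed-size chunks to bound memory use."""
    for start in range(0, len(records), batch_size):
        source.put(records[start:start + batch_size])
```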

Error Handling

Data Validation: Implement proper error handling for data validation failures, especially in production environments where data quality may vary.
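A defensive ingestion sketch that isolates failures per record, so one malformed input does not abort the whole load; the exception type is deliberately broad because the concrete validation error class depends on the Superlinked version:

```python
import logging

logger = logging.getLogger(__name__)

def safe_put(source, records):
    """Ingest dict records one at a time, logging failures instead of raising."""
    failed = []
    for record in records:
        try:
            source.put([record])
        except Exception as exc:  # narrow this to the library's validation error
            logger.warning("Rejected record %r: %s", record.get("id"), exc)
            failed.append(record)
    return failed  # hand back for inspection or a dead-letter queue
```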

Data Processing Pipeline

Sources work together with other Superlinked components:
  • Schemas: Define the structure of data that sources must provide
  • Parsers: Transform raw data from sources into schema-compliant format
  • Spaces: Convert parsed data into vector representations
  • Indices: Organize and store vectors for efficient searching
  • Queries: Search through the indexed data provided by sources
The abstract nature of the Source class ensures consistent behavior across different data input methods while allowing each implementation to optimize for its specific use case and data patterns.