awesome-architecture-mds/data-analytics/splink/Data_I_O_Backend_Abstraction.md at main · CodeBoarding/awesome-architecture-mds

graph LR
    Database_Abstraction_Layer["Database Abstraction Layer"]
    Splink_Data_Abstraction["Splink Data Abstraction"]
    Backend_Specific_Database_APIs["Backend-Specific Database APIs"]
    SQL_Query_Pipeline_Orchestration["SQL Query Pipeline Orchestration"]
    SQL_Dialect_Management["SQL Dialect Management"]
    SQL_Generation_Modules["SQL Generation Modules"]
    Database_Abstraction_Layer -- "delegates operations to" --> Backend_Specific_Database_APIs
    Backend_Specific_Database_APIs -- "provides backend-specific logic to" --> Database_Abstraction_Layer
    Database_Abstraction_Layer -- "converts query results into" --> Splink_Data_Abstraction
    Splink_Data_Abstraction -- "triggers dropping of temporary tables in" --> Database_Abstraction_Layer
    SQL_Generation_Modules -- "enqueues SQL snippets into" --> SQL_Query_Pipeline_Orchestration
    SQL_Query_Pipeline_Orchestration -- "consults" --> SQL_Dialect_Management
    SQL_Query_Pipeline_Orchestration -- "passes final generated query to" --> Database_Abstraction_Layer

Details

The Data I/O & Backend Abstraction subsystem gives Splink a unified interface for loading data, managing data representations, and interacting with database backends (DuckDB, Spark, Postgres, etc.) through SQL generation and execution. It is the pluggable data access layer that decouples the core Splink logic from the underlying data storage and processing engines.

Database Abstraction Layer

Provides a unified, high-level interface for all database operations. It orchestrates SQL execution, manages table registration and deletion, and handles query result caching. It acts as the primary entry point for data interaction from other parts of the Splink library.

Related Classes/Methods:

splink.internals.database_api
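
To make the responsibilities above concrete, here is a minimal sketch of such a layer built on Python's standard `sqlite3` module. It is not Splink's implementation: the class name `SketchDatabaseAPI`, its method names, and the idea of keying the cache on a hash of the SQL text are all assumptions made for illustration.

```python
import hashlib
import sqlite3


class SketchDatabaseAPI:
    """Hypothetical stand-in for a database abstraction layer (not Splink's class)."""

    def __init__(self):
        self._con = sqlite3.connect(":memory:")
        self._cache = {}  # maps a hash of the SQL text to a physical table name

    def register_table(self, name, columns, rows):
        """Create a physical table from in-memory rows."""
        col_sql = ", ".join(columns)
        placeholders = ", ".join("?" for _ in columns)
        self._con.execute(f"CREATE TABLE {name} ({col_sql})")
        self._con.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)

    def sql_to_table(self, sql):
        """Materialise a query as a table, reusing the cached table if one exists."""
        key = hashlib.sha256(sql.encode()).hexdigest()[:12]
        if key not in self._cache:
            table = f"__sketch_{key}"
            self._con.execute(f"CREATE TABLE {table} AS {sql}")
            self._cache[key] = table
        return self._cache[key]

    def fetch_all(self, table):
        """Return every row of a table as a list of tuples."""
        return self._con.execute(f"SELECT * FROM {table}").fetchall()

    def drop_table(self, table):
        """Delete a table and forget any cache entry that points at it."""
        self._con.execute(f"DROP TABLE IF EXISTS {table}")
        self._cache = {k: v for k, v in self._cache.items() if v != table}


db = SketchDatabaseAPI()
db.register_table("people", ["id", "name"], [(1, "Ann"), (2, "Bob")])
first = db.sql_to_table("SELECT name FROM people WHERE id = 1")
second = db.sql_to_table("SELECT name FROM people WHERE id = 1")  # cache hit
```

Running the same query twice returns the same physical table name, which is the behaviour that query result caching is meant to provide.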

Splink Data Abstraction

Offers a consistent, abstract representation of data within Splink, decoupling the library's logic from specific underlying data storage (e.g., Pandas DataFrames, Spark DataFrames, database tables). It manages the lifecycle of the underlying physical table, including its eventual dropping.

Related Classes/Methods:

splink.internals.splink_dataframe
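
The idea of a handle that owns the lifecycle of a physical table can be sketched as follows. This is an illustrative assumption, not the actual class in `splink.internals.splink_dataframe`: the name `SketchDataFrame`, its attributes, and its methods are hypothetical, and a plain `sqlite3` connection stands in for the Database Abstraction Layer.

```python
import sqlite3


class SketchDataFrame:
    """Hypothetical handle to a physical table; all names are illustrative."""

    def __init__(self, con, templated_name, physical_name):
        self._con = con                       # stand-in for the database layer
        self.templated_name = templated_name  # logical name used inside SQL
        self.physical_name = physical_name    # actual table in the backend

    def to_records(self, limit=5):
        """Return up to `limit` rows as a list of dicts keyed by column name."""
        cur = self._con.execute(
            f"SELECT * FROM {self.physical_name} LIMIT {int(limit)}"
        )
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]

    def drop(self):
        """Ask the backend to drop the underlying physical table."""
        self._con.execute(f"DROP TABLE IF EXISTS {self.physical_name}")


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE __t_1 (id, name)")
con.execute("INSERT INTO __t_1 VALUES (1, 'Ann')")
df = SketchDataFrame(con, "__t", "__t_1")
records = df.to_records()
```

Callers work with `df` and its records without knowing which engine holds the table; dropping the table is a request routed back to the backend.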

Backend-Specific Database APIs

Implement concrete database operations for specific backends. They handle backend-specific SQL execution, table creation/deletion, and initialization tasks like registering User-Defined Functions (UDFs) relevant to their respective database systems. This embodies the Strategy Pattern for pluggable backends.
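
A rough sketch of the Strategy Pattern described above is shown below. The base class and both subclasses are hypothetical and do not mirror Splink's class hierarchy. The `sqlite3` backend uses the standard `Connection.create_function` call to register a UDF during initialisation, and `RecordingBackend` is a fake backend that only logs SQL.

```python
import sqlite3
from abc import ABC, abstractmethod


class BackendAPI(ABC):
    """Hypothetical strategy interface: one subclass per backend."""

    @abstractmethod
    def execute(self, sql):
        """Run SQL and return all rows."""

    def count_rows(self, table):
        # Shared logic lives in the base class and relies on the strategy.
        return self.execute(f"SELECT COUNT(*) FROM {table}")[0][0]


class SqliteBackend(BackendAPI):
    def __init__(self):
        self._con = sqlite3.connect(":memory:")
        # Backend-specific initialisation: register a user-defined function.
        self._con.create_function("reverse_text", 1, lambda s: s[::-1])

    def execute(self, sql):
        return self._con.execute(sql).fetchall()


class RecordingBackend(BackendAPI):
    """Fake backend that logs SQL instead of running it."""

    def __init__(self):
        self.log = []

    def execute(self, sql):
        self.log.append(sql)
        return [(0,)]


backend = SqliteBackend()
reversed_rows = backend.execute("SELECT reverse_text('abc')")
```

Because `count_rows` depends only on `execute`, swapping `SqliteBackend` for another subclass changes where the SQL runs without touching the calling code.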

Related Classes/Methods:

SQL Query Pipeline Orchestration

Manages the construction and flow of complex SQL queries, typically composed of multiple Common Table Expressions (CTEs). Modules enqueue SQL statements into the pipeline, which then generates the final, executable SQL query.

Related Classes/Methods:

splink.internals.pipeline
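
The enqueue-then-generate flow can be illustrated with a small sketch. The class `SketchPipeline` and its method names are assumptions for illustration and are not the API of `splink.internals.pipeline`. Every enqueued statement except the last becomes a named CTE, and the last statement becomes the body of the final query.

```python
import sqlite3


class SketchPipeline:
    """Hypothetical CTE pipeline: queue SQL steps, then emit one query."""

    def __init__(self):
        self._queue = []  # list of (sql, output_table_name) pairs

    def enqueue(self, sql, output_table_name):
        self._queue.append((sql, output_table_name))

    def generate_sql(self):
        if not self._queue:
            raise ValueError("pipeline is empty")
        # All steps but the last are wrapped as CTEs; the last is the main query.
        *ctes, (final_sql, _) = self._queue
        if not ctes:
            return final_sql
        with_clause = ",\n".join(f"{name} AS (\n{sql}\n)" for sql, name in ctes)
        return f"WITH {with_clause}\n{final_sql}"


pipeline = SketchPipeline()
pipeline.enqueue("SELECT 1 AS x UNION ALL SELECT 2", "nums")
pipeline.enqueue("SELECT x * 10 AS y FROM nums", "scaled")
pipeline.enqueue("SELECT SUM(y) AS total FROM scaled", "result")
final_sql = pipeline.generate_sql()

# Execute the combined query once against an in-memory SQLite database.
rows = sqlite3.connect(":memory:").execute(final_sql).fetchall()
```

Each step can refer to the output name of an earlier step, so modules can contribute SQL independently while the backend still receives a single statement.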

SQL Dialect Management

Provides functions and expressions specific to different SQL dialects. Its role is to ensure that the generated SQL is compatible with the target database backend by handling variations in function names, syntax, and available features.

Related Classes/Methods:

splink.internals.dialects
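
As a sketch of how dialect differences might be isolated, the example below records two well-known variations: Spark SQL exposes `rand()` and quotes identifiers with backticks, while DuckDB and Postgres expose `random()` and quote identifiers with double quotes. The `SketchDialect` class, the `DIALECTS` table, and `sample_sql` are hypothetical and are not taken from `splink.internals.dialects`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchDialect:
    """Hypothetical description of one SQL dialect's variations."""

    name: str
    random_function: str   # SQL expression returning a uniform random float
    identifier_quote: str  # character used to quote identifiers

    def quote(self, identifier):
        q = self.identifier_quote
        return f"{q}{identifier}{q}"


DIALECTS = {
    "duckdb": SketchDialect("duckdb", "random()", '"'),
    "postgres": SketchDialect("postgres", "random()", '"'),
    "spark": SketchDialect("spark", "rand()", "`"),
}


def sample_sql(table, column, fraction, dialect_name):
    """Build a random-sampling query in the requested dialect."""
    d = DIALECTS[dialect_name]
    return (
        f"SELECT {d.quote(column)} FROM {d.quote(table)} "
        f"WHERE {d.random_function} < {fraction}"
    )


spark_sql = sample_sql("people", "first name", 0.1, "spark")
duckdb_sql = sample_sql("people", "first name", 0.1, "duckdb")
```

The SQL-building function never branches on the backend name; it asks the dialect object for the right spelling, which keeps backend variations in one place.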

SQL Generation Modules

These modules generate the SQL snippets needed for specific data transformations within the Splink pipeline, such as computing term frequencies for comparisons or vertically concatenating datasets.
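
The two transformations named above can be sketched as plain functions that return SQL strings. The function names, the `source_dataset` column, and the `tf_` prefix are assumptions for illustration, and the SQL is written for SQLite so that it can be run with the standard library. Splink's own generated SQL may differ.

```python
import sqlite3


def vertical_concat_sql(table_names, columns):
    """Stack several tables with UNION ALL, tagging each row with its source."""
    cols = ", ".join(columns)
    selects = [
        f"SELECT '{name}' AS source_dataset, {cols} FROM {name}"
        for name in table_names
    ]
    return "\nUNION ALL\n".join(selects)


def term_frequency_sql(column, input_table):
    """Relative frequency of each non-null value of `column`."""
    return (
        f"SELECT {column}, "
        f"CAST(COUNT(*) AS REAL) / (SELECT COUNT({column}) FROM {input_table}) "
        f"AS tf_{column} "
        f"FROM {input_table} WHERE {column} IS NOT NULL GROUP BY {column}"
    )


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (name)")
con.execute("CREATE TABLE b (name)")
con.executemany("INSERT INTO a VALUES (?)", [("Ann",), ("Bob",)])
con.executemany("INSERT INTO b VALUES (?)", [("Ann",), ("Ann",)])

# Use the concatenation as a CTE, then compute term frequencies over it.
concat = vertical_concat_sql(["a", "b"], ["name"])
sql = (
    f"WITH combined AS ({concat}) "
    + term_frequency_sql("name", "combined")
    + " ORDER BY name"
)
tf_rows = con.execute(sql).fetchall()
```

Each function produces only a snippet; composing them into one statement is the job of the pipeline component, which this example imitates with a hand-written `WITH` clause.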

Related Classes/Methods:
