DSL Reference

SqlCAD uses an "Extended SQL" format. It wraps standard SQL in DEFINE <BLOCK> <NAME> structures.

Metadata Styles

Metadata can be defined in two styles:

1. Braceless Style (Preferred)

Clean and readable. Uses standard YAML block style. No trailing commas required.

DEFINE DIM Users primary_key: user_id AS ( ... )

2. Brace Style

Useful for single-line definitions or existing models. Uses YAML flow style inside braces.

DEFINE DIM Users { primary_key: user_id } AS ( ... )

Common Metadata

All blocks support:

  • description (string): Documentation for the block.
  • layer (string): Overrides default output folder (e.g., base, silver, gold).
  • source_file (string): Path to an external .sql file to use instead of the AS (...) body.
  • tests (list): Data quality assertions (e.g., not_null, distinct, expression). See Observability.

DEFINE TABLE

Defines a raw source table and optionally maps it to physical files for automated loading.

Metadata:

  • source (dict, optional): Configuration for physical data sources (e.g., CSV/Parquet files).
    • path (string): Path or glob pattern to source files (e.g., data/exports/*.csv).
    • format (string): File format (csv, parquet, json). Defaults to file extension.
    • selection (string): Strategy for selecting files from a glob pattern.
      • all: (Default) Read all matching files.
      • latest: Select only the single file with the most recent modification time.
    • table (string): Name of an existing table in the database (alternative to path).
  • columns (dict): Map of ColumnName to Type OR a configuration object.
    • Simple Style: ColumnName: TypeString (e.g., UserID: INTEGER)
    • Mapped Style:
      • type (string): Target SQL type.
      • source (string, optional): Raw column name in the source file.
      • expression (string, optional): SQL expression for transformation (e.g., CAST(price AS DECIMAL) / 100).
  • external (bool): If true, skips column validation. Useful for migration.

Example:

DEFINE TABLE RawCategories { source: { path: "simulation/exports/categories_*.csv", format: "csv", selection: "latest" }, columns: { CategoryID: { type: INTEGER, source: category_id }, CategoryName: { type: TEXT, source: category_name }, Description: TEXT // Maps to "Description" if source not provided } }

DEFINE DIM

Defines a Dimension in the Silver Layer. Dimensions are stateful objects that track history (SCD Type 2).

Metadata:

  • primary_key (string): Required. The unique natural identifier for the object.
  • materialization (string): view (default) or table.
  • temporal (dict, optional): Customizes SCD2 behavior.
    • valid_from_col (string): Defaults to valid_from.
    • valid_to_col (string): Defaults to valid_to.
  • cluster_by (list): Columns to cluster by (e.g., [CustomerID]).
  • auto_optimize (bool): If true, enables default clustering (PK).

Implicit Columns:

  • valid_from (TIMESTAMP)
  • valid_to (TIMESTAMP)
  • is_current (BOOLEAN)
  • updated_at (TIMESTAMP)

Example:

DEFINE DIM Customer { primary_key: CustomerID } AS ( SELECT * FROM raw.customers )

DEFINE FACT

Defines a Fact in the Silver Layer. Facts are immutable event records.

Metadata:

  • temporal (dict): Required.
    • event_time_col (string): The timestamp column representing when the event occurred.
  • materialization (string): view (default) or table.
  • partition_by (string/expr): Partition expression (e.g., DATE(OrderDate)).
  • cluster_by (list): Columns to cluster by.
  • auto_optimize (bool): If true, enables default partitioning (DATE of event_time).

Example:

DEFINE FACT Sales { temporal: { event_time_col: OrderDate } } AS ( SELECT * FROM raw.sales )

DEFINE ENTITY

Legacy / Abstract. Defines a generic Silver Layer object. Ideally, use DIM or FACT for specific semantics. ENTITY can be used for standard tables that don't fit strict Dim/Fact patterns or for Type 1 dimensions.

Metadata:

  • primary_key (string): Required.
  • type (string): dimension or fact (deprecated hint).
  • temporal (dict): Enables SCD2 logic (same as DIM).

DEFINE REPORT

Defines a final, user-facing output (e.g., for BI tools).

Rules:

  • Must specify explicit columns. No SELECT *.
  • Automatically includes a reference_time column (defaults to CURRENT_TIMESTAMP if omitted).

Metadata:

  • temporal (dict): Point-in-time configuration (optional).
  • columns (dict): Required. Explicit schema definition.
  • materialization (string): view (default) or table.

Example:

DEFINE REPORT MonthlyRevenue { columns: { Month: DATE, Revenue: NUMERIC } } AS ( ... )

DEFINE AGGREGATION

Defines a reusable intermediate metric layer (e.g., "Customer Stats").

Rules:

  • SQL Body must contain a GROUP BY clause.
  • Automatically includes a reference_time column.

Metadata:

  • primary_key (string): Required.
  • materialization (string): view (default) or table.

Example:

DEFINE AGGREGATION CustomerStats { primary_key: CustomerID } AS ( SELECT CustomerID, SUM(Amount) ... GROUP BY CustomerID )

DEFINE FEATURE_SET

Defines a collection of ML features, automatically handling Point-in-Time (PIT) correctness.

Styles:

  1. Compositional (Preferred): References a list of existing DEFINE FEATURE blocks.
  2. Monolithic: Defines SQL logic directly (discouraged).

Metadata:

  • entity (string): Required. The anchor entity (e.g., Customers).
  • features (list): List of feature names to include.
  • temporal (dict):
    • point_in_time (bool): If true, ensures features are joined respecting validity windows.

Example (Compositional):

DEFINE FEATURE_SET CustomerFeatures { entity: Customers temporal: { point_in_time: true } features: - customer_lifetime_value - customer_last_click }

DEFINE MART

Generic / Legacy. Defines a flexible gold layer table. prefer REPORT or AGGREGATION.

Metadata:

  • temporal (dict):
    • event_time_col (string): Used to calculate Point-in-Time joins against upstream Entities.
  • STRATEGY (string):
    • INCREMENTAL: Generates an append-only load script.
    • MERGE: Generates an upsert load script (Update existing, Insert new).
  • INCREMENT_BY (string): The column to use for high-water mark (required if STRATEGY is set).
  • unique_key (list/string): Required for STRATEGY: MERGE. Keys to match records for updates.
  • materialization (string): view (default) or table.

Example:

DEFINE MART UserGrowth temporal: event_time_col: signup_date STRATEGY: INCREMENTAL INCREMENT_BY: signup_date AS ( ... )

DEFINE JOIN

Defines a relationship edge in the graph.

Metadata:

  • from (string): Name of the starting Entity.
  • to (string): Name of the target Entity.
  • type (string): one_to_one, one_to_many, many_to_one, many_to_many.

Example:

DEFINE JOIN user_orders ON from: Orders to: Users type: many_to_one AS ( LEFT JOIN Users u ON u.id = Orders.user_id )

DEFINE FEATURE

Defines a reusable ML feature logic.

Metadata:

  • entity (string): The entity this feature attaches to.
  • join_key (string): The column to join on.
  • timestamp (string): The column representing the event time.
  • aggregation (string): SQL aggregation (e.g., COUNT(*)).
  • window_unit (string): days or rows.
  • window (int): Number of units to look back.
  • order_by (string): Column to sort by (required if unit is rows).

Example:

DEFINE FEATURE last_3_orders entity: Users join_key: user_id timestamp: order_date, aggregation: SUM(amount) window_unit: rows AS ( SELECT * FROM Orders )

DEFINE FEATURE_TABLE

Defines a dataset generation task (Training or Scoring) by combining a Feature Set with a population and time spine.

Metadata:

  • feature_set (string): Required. Name of the FEATURE_SET to use.
  • spine (dict): Required.
    • entity_source (string): SQL query returning the entities (e.g., SELECT CustomerID FROM Customers).
    • time (dict):
      • start_date (string/date): Start of the date range.
      • end_date (string/date): End of the date range. Defaults to CURRENT_DATE.
  • materialization (string): table (recommended) or view.
  • primary_key (list): Required for tables. Composite key [spine_date, EntityID].

Example:

DEFINE FEATURE_TABLE CustomerTrainingData { materialization: table, feature_set: CustomerFeatures, columns: spine_date: TIMESTAMP CustomerID: TEXT customer_lifetime_orders: INTEGER primary_key: - spine_date - CustomerID spine: entity_source: "SELECT CustomerID FROM Customers" time: start_date: "2023-01-01" end_date: "2023-12-31" }