Features & ML Engineering
SqlCAD treats Machine Learning features as first-class citizens. A FEATURE is a reusable definition of aggregation logic that is guaranteed to be calculated correctly at any point in time (Point-in-Time Correctness).
This allows you to define a signal (e.g., "Number of orders in the last N days") once, and reuse it across multiple Marts with different window sizes, all while ensuring no data leakage occurs when generating training datasets.
Defining a Feature
A FEATURE block defines the source stream of events and the aggregation logic. It does not hardcode the window size; that is specified when the feature is used.
Metadata Reference
- entity: The target entity this feature enriches (e.g., Customer).
- join_key: The column in the feature's SQL that links to the target entity's Primary Key.
- timestamp: The column in the feature's SQL used for time-window filtering.
- aggregation: The SQL aggregation expression (e.g., COUNT(*), SUM(amount)).
- window_unit: days (default) or rows.
- order_by: (Required for the rows unit) The column to order records by before limiting.
Example: Time-Based Window (Days)
DEFINE FEATURE recent_order_count
    entity: Customer
    join_key: customer_id
    timestamp: order_date
    aggregation: COUNT(*)
    window_unit: days
AS (
    -- The raw event stream
    SELECT customer_id, order_date, amount
    FROM Orders
)
Example: Row-Based Window (Last N items)
DEFINE FEATURE last_n_order_value
    entity: Customer
    join_key: customer_id
    timestamp: order_date
    order_by: order_date
    aggregation: SUM(amount)
    window_unit: rows
AS (
    SELECT customer_id, order_date, amount
    FROM Orders
)
Using Features in Marts
Features are applied within a MART (or ENTITY) using the special feature() function. SqlCAD parses this function and expands it into a Point-in-Time correct subquery.
Syntax: feature('feature_name', window_size)
window_size: Integer. Interpreted as days or rows depending on the Feature's window_unit.
Example Mart
DEFINE MART CustomerChurnDataset
    description: "Training dataset for churn prediction"
AS (
    SELECT
        c.customer_id,
        c.valid_from as observation_date,
        c.segment,
        -- Count orders in the 30 days PRIOR to the observation date
        feature('recent_order_count', 30) as orders_last_30d,
        -- Count orders in the 90 days PRIOR to the observation date
        feature('recent_order_count', 90) as orders_last_90d,
        -- Sum of value of the last 3 orders
        feature('last_n_order_value', 3) as val_last_3_orders
    FROM Customer c
    WHERE c.valid_from > '2023-01-01'
)
How It Works (Point-in-Time Correctness)
When SqlCAD generates the SQL for the Mart above, it automatically handles the temporal join logic to prevent data leakage.
- Identification: It identifies Customer c as the target entity.
- Snapshotting: It uses c.valid_from as the "Point-in-Time" anchor for each row.
- Lateral Join: It generates a LEFT JOIN LATERAL subquery for each feature call.
Generated SQL Logic (Conceptual)
For feature('recent_order_count', 30), the generated SQL looks like:
LEFT JOIN LATERAL (
    SELECT COUNT(*) as feature_value
    FROM (SELECT ... FROM Orders) AS source
    WHERE
        -- Link to specific customer
        source.customer_id = c.customer_id
        -- Only look at data KNOWN at the time of observation (c.valid_from)
        AND source.order_date < c.valid_from
        -- Apply the window (last 30 days)
        AND source.order_date >= c.valid_from - INTERVAL '30 days'
) AS feat_recent_order_count_30 ON TRUE
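The row-based expansion for feature('last_n_order_value', 3) is not shown above. A plausible sketch (an assumption, not the exact SQL SqlCAD emits: it orders the event stream by the feature's order_by column and limits the row count instead of applying a date interval, reusing the alias pattern from the example above):

LEFT JOIN LATERAL (
    SELECT SUM(amount) as feature_value
    FROM (
        -- The feature's source SQL, filtered point-in-time
        SELECT source.order_date, source.amount
        FROM (SELECT customer_id, order_date, amount FROM Orders) AS source
        WHERE
            -- Link to specific customer
            source.customer_id = c.customer_id
            -- Only look at data KNOWN at the time of observation (c.valid_from)
            AND source.order_date < c.valid_from
        -- order_by from the feature definition, most recent events first
        ORDER BY source.order_date DESC
        -- window_size = 3 rows
        LIMIT 3
    ) AS last_rows
) AS feat_last_n_order_value_3 ON TRUE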
This ensures that if you are looking at a Customer record from 6 months ago, the feature value reflects the data available at that time, not the current state.
Feature Sets (DEFINE FEATURE_SET)
While you can manually invoke feature() in a Mart, it is often easier to define a collection of features as a dedicated dataset using FEATURE_SET.
Feature Sets allow you to:
- Group related features (e.g., "Customer Churn Features").
- Automate the Point-in-Time join logic without writing any SQL.
- Produce a wide table ready for training or serving.
Example
DEFINE FEATURE_SET CustomerFeatures {
    entity: Customers
    temporal: {
        point_in_time: true
    }
    features:
        - recent_order_count
        - last_n_order_value
}
This generates a table containing all columns from the Customers entity plus the listed features. It identifies the temporal validity of each Customer row (via SCD2 columns or explicit Metadata) and calculates the features as of that specific time.
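Conceptually, the output follows the same lateral-join pattern shown for Marts, anchored on each Customers row's validity timestamp. The sketch below is illustrative only: the column names feature_value, customer_id, and valid_from, as well as the handling of default window sizes (omitted here), are assumptions rather than documented behaviour.

-- Conceptual shape of the CustomerFeatures output
SELECT
    cu.*,                                   -- all columns from the Customers entity
    f1.feature_value AS recent_order_count,
    f2.feature_value AS last_n_order_value
FROM Customers cu
LEFT JOIN LATERAL (
    -- recent_order_count, anchored on this row's validity timestamp
    SELECT COUNT(*) AS feature_value
    FROM (SELECT customer_id, order_date, amount FROM Orders) AS source
    WHERE source.customer_id = cu.customer_id
      AND source.order_date < cu.valid_from
) AS f1 ON TRUE
LEFT JOIN LATERAL (
    -- last_n_order_value, anchored the same way
    SELECT SUM(amount) AS feature_value
    FROM (SELECT customer_id, order_date, amount FROM Orders) AS source
    WHERE source.customer_id = cu.customer_id
      AND source.order_date < cu.valid_from
) AS f2 ON TRUE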
Training Datasets (DEFINE FEATURE_TABLE)
While FEATURE_SET defines the logical collection of features, FEATURE_TABLE defines the physical generation of a dataset, typically for training models.
A FEATURE_TABLE combines:
- Time Spine: A range of dates (e.g., daily snapshots).
- Entity Spine: A population of entities to include (e.g., "All active customers").
- Feature Set: The features to calculate for each (Entity, Date) pair.
This allows you to generate massive historical training sets with perfect Point-in-Time correctness.
Structure
DEFINE FEATURE_TABLE CustomerTrainingData {
    materialization: table,
    feature_set: CustomerFeatures,

    -- Define the columns to output (usually Spine + Features + Targets)
    columns:
        spine_date: TIMESTAMP
        CustomerID: TEXT
        customer_lifetime_orders: INTEGER

    spine:
        -- The population to include (static or dynamic query)
        entity_source: "SELECT CustomerID FROM Customers WHERE Region = 'North'"

        -- The temporal range
        time:
            start_date: "2023-01-01"
            end_date: "2023-12-31"   -- Defaults to CURRENT_DATE if omitted

    primary_key:
        - spine_date
        - CustomerID
}
How Generation Works
- Date Spine Generation: SqlCAD generates a series of dates from start_date to end_date (daily).
- Entity Spine Generation: Validates the entity_source query.
- Cross Join: Every Entity is paired with every Date in the spine.
  - Result: One row per entity per day.
- Feature Lookup: For each row, the features from feature_set are joined.
  - Crucial: The join uses the Spine Date as the reference time, NOT CURRENT_TIMESTAMP. This simulates "what would we have known about this customer on that specific day?"
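As a rough illustration of this expansion, the sketch below uses PostgreSQL-flavoured SQL (generate_series) and assumes a definition for customer_lifetime_orders (a lifetime order count), since that feature is not defined above; the actual SQL SqlCAD generates may differ.

-- Conceptual generation logic for CustomerTrainingData
WITH date_spine AS (
    -- Date Spine: one row per day from start_date to end_date
    SELECT spine_date
    FROM generate_series(DATE '2023-01-01', DATE '2023-12-31', INTERVAL '1 day') AS spine_date
),
entity_spine AS (
    -- Entity Spine: the population from entity_source
    SELECT CustomerID FROM Customers WHERE Region = 'North'
)
SELECT
    d.spine_date,
    e.CustomerID,
    f.customer_lifetime_orders
FROM date_spine d
CROSS JOIN entity_spine e                   -- one row per entity per day
LEFT JOIN LATERAL (
    -- Feature lookup anchored on the Spine Date, NOT CURRENT_TIMESTAMP
    -- (customer_lifetime_orders assumed to be a lifetime COUNT(*) over Orders)
    SELECT COUNT(*) AS customer_lifetime_orders
    FROM Orders o
    WHERE o.customer_id = e.CustomerID
      AND o.order_date < d.spine_date
) AS f ON TRUE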
Use Cases
- Backtesting: Generate features for every day last year to train a churn model.
- Scoring: Set start_date and end_date to CURRENT_DATE to generate today's feature vector for inference.
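For the scoring case, a minimal sketch (assuming the time block accepts CURRENT_DATE where the examples above use literal date strings; otherwise substitute today's date) might look like:

DEFINE FEATURE_TABLE CustomerScoringData {
    materialization: table,
    feature_set: CustomerFeatures,

    spine:
        entity_source: "SELECT CustomerID FROM Customers WHERE Region = 'North'"

        time:
            -- Single-day spine: today's feature vector for every entity
            start_date: CURRENT_DATE
            end_date: CURRENT_DATE

    primary_key:
        - spine_date
        - CustomerID
}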