Column-Level Configuration Guide
Baselinr supports fine-grained column-level configurations for profiling, drift detection, and anomaly detection. This allows you to control exactly which columns are profiled, which metrics are computed, and how drift/anomaly detection behaves on a per-column basis.
Overview
Column-level configurations follow a dbt-like nested pattern, where column configurations are nested under table definitions in your config.yml:
profiling:
tables:
- table: customers
schema: public
columns:
- name: email
metrics: [count, null_count, distinct_count]
drift:
enabled: true
thresholds:
low: 2.0
medium: 5.0
high: 10.0
Key Features
- Column Selection: Choose which columns to profile using explicit names or patterns
- Per-Column Metrics: Specify which metrics to compute for each column
- Per-Column Drift Control: Customize drift thresholds, strategies, and enable/disable per column
- Per-Column Anomaly Control: Configure anomaly detection methods and thresholds per column
- Pattern Matching: Use wildcards (
*_id) or regex patterns for column names - Dependency Management: Automatic handling of dependencies (drift/anomaly require profiling)
Configuration Structure
Column Configuration Schema
columns:
- name: <column_name_or_pattern> # Required
pattern_type: wildcard | regex # Optional (default: wildcard)
metrics: [<list_of_metrics>] # Optional (overrides table-level)
profiling: # Optional
enabled: true | false # Default: true
drift: # Optional
enabled: true | false # Default: true
strategy: <strategy_name> # Override drift strategy
thresholds: # Per-column thresholds
low: <float>
medium: <float>
high: <float>
baselines: # Override baseline selection
strategy: <strategy>
windows: {...}
anomaly: # Optional
enabled: true | false # Default: true
methods: [<list_of_methods>] # Override enabled methods
thresholds: # Per-column thresholds
iqr_threshold: <float>
mad_threshold: <float>
ewma_deviation_threshold: <float>
Column Selection
Explicit Column Names
Specify exact column names to configure:
columns:
- name: email
metrics: [count, null_count, distinct_count]
- name: age
metrics: [count, mean, stddev, min, max]
Wildcard Patterns
Use wildcard patterns to match multiple columns:
columns:
- name: "*_id" # Matches: customer_id, order_id, product_id
metrics: [count, null_count]
- name: "email*" # Matches: email, email_address, email_verified
metrics: [count, null_count, distinct_count]
Wildcard Syntax:
*matches any sequence of characters?matches a single character
Regex Patterns
For more complex patterns, use regex:
columns:
- name: "^(customer|order|product)_id$"
pattern_type: regex
metrics: [count, null_count]
Excluding Columns
To skip profiling specific columns, set profiling.enabled: false:
columns:
- name: internal_notes
profiling:
enabled: false # Column won't be profiled
- name: "*_temp" # Skip all temporary columns
profiling:
enabled: false
Profiling Configuration
Select Which Columns to Profile
By default, all columns are profiled. When you specify columns, only matching columns are profiled (unless include_defaults is used).
profiling:
tables:
- table: customers
columns:
- name: email
- name: age
- name: name
# Only email, age, and name will be profiled
To profile everything except specific columns:
profiling:
tables:
- table: customers
columns:
- name: "*" # Profile all columns
- name: internal_notes # Except this one
profiling:
enabled: false
Custom Metrics Per Column
Override table-level metrics for specific columns:
profiling:
tables:
- table: customers
columns:
- name: email
# Only compute these metrics for email
metrics: [count, null_count, distinct_count]
- name: age
# Full metrics for age
metrics: [count, mean, stddev, min, max, null_ratio]
- name: metadata_json
# Minimal metrics for large JSON columns
metrics: [count, null_count]
Available Metrics:
count- Total row countnull_count- Number of null valuesnull_ratio- Ratio of nulls (0.0-1.0)distinct_count- Number of distinct valuesunique_ratio- Ratio of distinct to totalapprox_distinct_count- Approximate distinct countmin- Minimum valuemax- Maximum valuemean- Average (numeric)stddev- Standard deviation (numeric)histogram- Distribution histogram (numeric)data_type_inferred- Inferred semantic typemin_length,max_length,avg_length- String length metrics
Drift Detection Configuration
Per-Column Drift Thresholds
Override global drift thresholds for specific columns:
profiling:
tables:
- table: customers
columns:
- name: lifetime_value
drift:
enabled: true
thresholds:
low: 5.0 # 5% change = low severity
medium: 10.0 # 10% change = medium severity
high: 20.0 # 20% change = high severity
Disable Drift Detection Per Column
Skip drift detection for specific columns:
profiling:
tables:
- table: customers
columns:
- name: internal_notes
drift:
enabled: false # No drift detection for this column
- name: "*_id"
drift:
enabled: false # No drift for ID columns
Per-Column Drift Strategy
Override drift strategy for specific columns:
profiling:
tables:
- table: customers
columns:
- name: amount
drift:
strategy: statistical # Use statistical tests for this column
thresholds: {...}
- name: status
drift:
strategy: absolute_threshold # Use simple thresholds
thresholds:
low: 2.0
medium: 5.0
high: 10.0
Per-Column Baseline Selection
Override baseline selection strategy per column:
profiling:
tables:
- table: customers
columns:
- name: daily_revenue
drift:
baselines:
strategy: prior_period # Use prior period for seasonality
windows:
prior_period: 7 # Same day last week
Anomaly Detection Configuration
Per-Column Anomaly Methods
Enable specific anomaly detection methods per column:
profiling:
tables:
- table: orders
columns:
- name: amount
anomaly:
enabled: true
methods: [control_limits, iqr, mad] # Only these methods
- name: order_date
anomaly:
enabled: true
methods: [seasonality, regime_shift] # Focus on temporal patterns
Per-Column Anomaly Thresholds
Customize anomaly detection sensitivity per column:
profiling:
tables:
- table: orders
columns:
- name: amount
anomaly:
enabled: true
thresholds:
iqr_threshold: 2.0 # More sensitive (default: 1.5)
mad_threshold: 3.5 # More sensitive (default: 3.0)
ewma_deviation_threshold: 2.5 # More sensitive (default: 2.0)
- name: quantity
anomaly:
enabled: true
thresholds:
iqr_threshold: 3.0 # Less sensitive
Disable Anomaly Detection Per Column
profiling:
tables:
- table: customers
columns:
- name: metadata_json
anomaly:
enabled: false # Skip anomaly detection
Dependency Management
Understanding Dependencies
There's a critical dependency chain:
Profiling → Drift Detection → Anomaly Detection
- Profiling must run first (produces metrics)
- Drift Detection requires profiling (compares metrics across runs)
- Anomaly Detection requires profiling (analyzes current run's metrics)
Automatic Dependency Handling
Baselinr automatically handles dependencies:
- If a column is not profiled (
profiling.enabled: false), drift and anomaly detection are automatically skipped - If drift/anomaly is configured but profiling is disabled, a warning is issued
- Columns without profiling cannot have drift or anomaly detection enabled
Example: Invalid Configuration (Will Warn)
profiling:
tables:
- table: customers
columns:
- name: metadata
profiling:
enabled: false # ❌ Profiling disabled
drift:
enabled: true # ⚠️ Warning: Drift requires profiling
anomaly:
enabled: true # ⚠️ Warning: Anomaly requires profiling
Result: Warnings are logged, and drift/anomaly are automatically skipped for this column.
Complete Examples
Example 1: Basic Column Selection
profiling:
tables:
- table: customers
schema: public
columns:
- name: email
metrics: [count, null_count, distinct_count]
- name: age
metrics: [count, mean, stddev, min, max]
- name: "*_id"
metrics: [count, null_count]
Example 2: Selective Profiling with Drift Control
profiling:
tables:
- table: customers
columns:
- name: email
metrics: [count, null_count, distinct_count]
drift:
enabled: true
thresholds:
low: 2.0
medium: 5.0
high: 10.0
- name: lifetime_value
metrics: [count, mean, stddev, min, max]
drift:
enabled: true
thresholds:
low: 5.0
medium: 15.0
high: 30.0
- name: internal_notes
profiling:
enabled: false # Not profiled
- name: "*_id"
drift:
enabled: false # No drift detection for IDs
Example 3: Full Configuration with Anomaly Detection
profiling:
tables:
- table: orders
columns:
- name: amount
metrics: [count, mean, stddev, min, max]
drift:
enabled: true
strategy: absolute_threshold
thresholds:
low: 5.0
medium: 15.0
high: 30.0
anomaly:
enabled: true
methods: [control_limits, iqr, mad]
thresholds:
iqr_threshold: 2.0
mad_threshold: 3.5
- name: order_date
metrics: [count, min, max]
anomaly:
enabled: true
methods: [seasonality, regime_shift]
- name: notes
profiling:
enabled: false # Skip entirely
Example 4: Pattern-Based Column Configuration
profiling:
tables:
- table: events
columns:
# Profile all timestamp columns
- name: "*_timestamp"
metrics: [count, min, max]
drift:
enabled: true
# Profile all ID columns with minimal metrics
- name: "*_id"
metrics: [count, null_count]
drift:
enabled: false # IDs shouldn't drift
# Use regex for complex patterns
- name: "^(email|phone|address).*"
pattern_type: regex
metrics: [count, null_count, distinct_count]
Schema-Level Configuration
Schema-level configurations allow you to apply settings to all tables within a specified schema, reducing configuration duplication and enabling organizational policy management.
Overview
Schema-level configs support all the same options as table-level configs (partition, sampling, column configs, filters), and they merge with table-level configs following the precedence: Schema → Table → Column.
Schema Configuration Structure
profiling:
schemas:
- schema: analytics
database: warehouse # Optional: database-specific schema config
partition:
strategy: latest
key: date
sampling:
enabled: true
fraction: 0.1
columns:
- name: "*_id"
drift:
enabled: false # All ID columns in analytics schema skip drift
- name: "*_metadata"
profiling:
enabled: false # Skip metadata columns
# Filter fields also supported
min_rows: 100
table_types: [table]
exclude_patterns: ["*_temp"]
tables:
- table: orders
schema: analytics # Inherits schema-level configs
columns:
- name: total_amount # Override/add table-specific config
drift:
thresholds:
low: 1.0
How Schema Configs Work
- Schema Matching: Schema configs match based on schema name (and optionally database name)
- Config Merging: Schema configs are merged with table patterns before profiling
- Precedence: Table-level configs override schema-level configs
- Column Configs: Schema column configs are merged with table column configs (table takes precedence)
Example: Schema-Level Column Configs
Apply column configurations to all tables in a schema:
profiling:
schemas:
- schema: analytics
columns:
- name: "*_id"
drift:
enabled: false # All ID columns skip drift detection
- name: "*_metadata"
profiling:
enabled: false # Skip metadata columns
tables:
- table: orders
schema: analytics # Inherits schema-level column configs
- table: customers
schema: analytics # Also inherits schema-level column configs
columns:
- name: email # Add table-specific override
drift:
enabled: true
Example: Schema-Level Sampling
Apply sampling configuration to all tables in a schema:
profiling:
schemas:
- schema: staging
sampling:
enabled: true
fraction: 0.1 # All staging tables sample 10%
tables:
- select_schema: true
schema: staging # Inherits sampling config
Example: Database-Specific Schema Configs
Use database-specific schema configs for multi-database setups:
profiling:
schemas:
- schema: analytics
database: warehouse_prod
sampling:
enabled: true
fraction: 0.05 # Production: 5% sampling
- schema: analytics
database: warehouse_dev
sampling:
enabled: true
fraction: 0.2 # Development: 20% sampling
Schema Config with Pattern Matching
Schema configs apply to tables discovered via patterns:
profiling:
schemas:
- schema: analytics
columns:
- name: "*_id"
drift:
enabled: false
tables:
- pattern: "user_*" # Pattern matches user_profiles, user_sessions, etc.
schema: analytics # All matched tables inherit schema column configs
Schema Config with select_schema
Schema configs work with select_schema to apply to all tables:
profiling:
schemas:
- schema: analytics
partition:
strategy: latest
key: date
columns:
- name: "*_temp"
profiling:
enabled: false
tables:
- select_schema: true
schema: analytics # All tables in analytics schema inherit configs
Database-Level Configuration
Database-level configurations allow you to apply settings to all schemas/tables within a specified database, providing the broadest scope for organizational policy management.
Overview
Database-level configs support all the same options as schema-level configs (partition, sampling, column configs, filters), and they merge with schema and table configs following the precedence: Database → Schema → Table → Column.
Database Configuration Structure
profiling:
databases:
- database: warehouse
partition:
strategy: latest
key: date
sampling:
enabled: true
fraction: 0.05 # All tables in warehouse database sample 5%
columns:
- name: "*_id"
drift:
enabled: false # All ID columns in warehouse database skip drift
- name: "*_temp"
profiling:
enabled: false # Skip temp columns in all tables
# Filter fields also supported
min_rows: 100
table_types: [table]
exclude_patterns: ["*_temp", "*_test"]
schemas:
- schema: analytics
database: warehouse # Inherits database-level configs
columns:
- name: "*_metadata"
profiling:
enabled: false # Schema-level override
tables:
- table: orders
schema: analytics
database: warehouse # Inherits both database and schema configs
How Database Configs Work
- Scope: Database configs apply to ALL schemas/tables in the specified database
- Merging: Database configs are merged with schema and table configs
- Precedence: Database → Schema → Table → Column (each level can override previous)
- Column Configs: Database column configs are merged with schema and table column configs (table takes highest precedence)
Example: Database-Level Column Configs
Apply policies at the database level that can be overridden at schema or table level:
profiling:
databases:
- database: warehouse
columns:
- name: "*_id"
drift:
enabled: false # Database-level: disable drift for all ID columns
schemas:
- schema: analytics
database: warehouse
columns:
- name: "customer_id"
drift:
enabled: true # Schema-level: override for customer_id in analytics schema
tables:
- table: orders
schema: analytics
database: warehouse
columns:
- name: "customer_id"
drift:
enabled: false # Table-level: override again for orders table
thresholds:
low: 1.0 # With custom thresholds
Example: Database-Level Sampling
Apply consistent sampling strategy across all tables in a database:
profiling:
databases:
- database: staging_db
sampling:
enabled: true
fraction: 0.05 # All tables in staging_db sample 5%
tables:
- select_all_schemas: true
database: staging_db # Inherits database-level sampling
Example: Multi-Level Precedence
Demonstrate how configurations merge across all levels:
profiling:
databases:
- database: warehouse
partition:
strategy: all # Database-level default
columns:
- name: "*_id"
drift:
enabled: false
schemas:
- schema: analytics
database: warehouse
partition:
strategy: latest # Schema-level override
columns:
- name: "customer_id"
drift:
enabled: true
tables:
- table: orders
schema: analytics
database: warehouse
partition:
strategy: range # Table-level override
key: created_at
columns:
- name: "customer_id"
drift:
enabled: false # Table-level override
Result:
orderstable usesrangepartition strategy (table overrides schema and database)customer_idcolumn has drift disabled (table overrides schema, which overrides database)
When to Use Database-Level Configs
Use database-level configs for:
- Organization-Wide Policies: Apply consistent policies across all tables in a database
- Cost Management: Apply sampling or filtering at the database level
- Security: Disable profiling/drift for sensitive column patterns across the entire database
- Environment-Specific Settings: Different policies for
productionvsstagingdatabases
Database Configs vs Schema Configs
- Database Configs: Apply to ALL schemas/tables in the database (broadest scope)
- Schema Configs: Apply to ALL tables in a specific schema (can be database-specific)
- Table Configs: Apply to a specific table (most specific)
Use database configs for organization-wide policies, and schema configs for schema-specific overrides.
Configuration Precedence
Configurations are merged with the following precedence (highest to lowest):
- Column-level config (most specific)
- Table-level config (from
profiling.tables) - Schema-level config (from
profiling.schemas) - Database-level config (from
profiling.databases) - Global config (defaults from
drift_detectionandstoragesections)
Example with Database, Schema, and Table Levels:
# Global defaults
drift_detection:
strategy: absolute_threshold
absolute_threshold:
low_threshold: 5.0
medium_threshold: 15.0
high_threshold: 30.0
profiling:
databases:
- database: warehouse
columns:
- name: "*_id"
drift:
enabled: false # Database-level: disable drift for all IDs
schemas:
- schema: analytics
database: warehouse
columns:
- name: "customer_id"
drift:
enabled: true # Schema-level: override for customer_id
tables:
- table: customers
schema: analytics
database: warehouse
columns:
- name: amount
# Column-level overrides schema, database, and global
drift:
enabled: true
thresholds:
low: 2.0 # Uses 2.0 instead of 5.0
medium: 10.0 # Uses 10.0 instead of 15.0
high: 20.0 # Uses 20.0 instead of 30.0
Backward Compatibility
Column-level configurations are fully backward compatible:
- If
columnsis not specified, all columns are profiled with table-level defaults - Table-level configurations continue to work as before
- Existing configurations without column configs work unchanged
- Column-level features are opt-in
Best Practices
1. Start Broad, Then Narrow
Begin with table-level configurations, then add column-level configs for specific needs:
# Start here
profiling:
tables:
- table: customers
# Then refine specific columns
profiling:
tables:
- table: customers
columns:
- name: critical_field
drift:
thresholds:
low: 1.0 # Very sensitive
2. Use Patterns for Repeated Configurations
Instead of listing every column, use patterns:
columns:
- name: "*_id"
drift:
enabled: false # IDs shouldn't drift
- name: "*_timestamp"
metrics: [count, min, max] # Timestamps don't need histograms
3. Disable Profiling for Large/Unimportant Columns
Save compute resources by skipping large JSON/text columns:
columns:
- name: raw_json_payload
profiling:
enabled: false
- name: "*_metadata"
profiling:
enabled: false
4. Adjust Sensitivity Based on Business Importance
Use tighter thresholds for critical business metrics:
columns:
- name: revenue
drift:
thresholds:
low: 1.0 # Very sensitive
medium: 3.0
high: 5.0
- name: metadata
drift:
thresholds:
low: 10.0 # More lenient
medium: 20.0
high: 40.0
5. Group Related Columns
Use patterns to configure related columns together:
columns:
- name: "*_email*" # email, email_address, email_verified
metrics: [count, null_count, distinct_count]
drift:
thresholds:
low: 2.0
Troubleshooting
Column Not Being Profiled
Problem: Column specified in config isn't being profiled.
Solutions:
- Check column name spelling (case-sensitive in some databases)
- Verify pattern matches (test with explicit name first)
- Ensure
profiling.enabledis notfalse - Check that column exists in the table
Drift Detection Not Running
Problem: Drift configured but not detecting changes.
Solutions:
- Verify column was actually profiled (check
profiled_columnsin metadata) - Check
drift.enabledis notfalse - Ensure at least 2 profiling runs exist for comparison
- Verify thresholds are appropriate for the data
Warnings About Dependencies
Problem: Warnings about drift/anomaly configured but profiling disabled.
Solutions:
- Remove drift/anomaly config if profiling is intentionally disabled
- Enable profiling if you want drift/anomaly detection
- Review configuration for typos or logic errors
See Also
- Drift Detection Guide - Comprehensive drift detection documentation
- Anomaly Detection Guide - Anomaly detection documentation
- Configuration Reference - Complete configuration schema reference