core.validation

Data validation for feature DataFrames: range checks, missing rates, monotonic timestamps, session boundaries, and distribution warnings.

Usage

from taskclf.core.validation import validate_feature_dataframe

report = validate_feature_dataframe(df, max_missing_rate=0.5)
if not report.ok:
    for finding in report.errors:
        print(f"ERROR: {finding.message}")
for finding in report.warnings:
    print(f"WARN: {finding.message}")

Hard checks (errors)

  • Non-nullable columns must not contain nulls.
  • Nullable columns must not exceed max_missing_rate.
  • Numeric values must fall within declared ranges (from features_v1.json).
  • bucket_end_ts must equal bucket_start_ts + bucket_seconds (60s unless bucket_seconds is overridden).
  • bucket_start_ts must be strictly increasing within each (user_id, session_id) group.
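
The two timestamp checks above can be sketched in pandas as follows. This is an illustrative sketch, not the code in taskclf: the helper names are hypothetical, the timestamp columns are assumed to be datetime64, and 60 is assumed as the bucket width.

```python
import pandas as pd


def bucket_end_mismatches(df: pd.DataFrame, bucket_seconds: int = 60) -> pd.Series:
    """Mask of rows where bucket_end_ts != bucket_start_ts + bucket_seconds."""
    expected_end = df["bucket_start_ts"] + pd.Timedelta(seconds=bucket_seconds)
    return df["bucket_end_ts"] != expected_end


def non_increasing_rows(df: pd.DataFrame) -> pd.Series:
    """Mask of rows whose bucket_start_ts is not later than the previous row
    in the same (user_id, session_id) group."""
    step = df.groupby(["user_id", "session_id"])["bucket_start_ts"].diff()
    # The first row of each group has a NaT step, which compares as False.
    return step <= pd.Timedelta(0)


# Hypothetical sample data: row 2 repeats a timestamp, row 3 has a 90s bucket.
df = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u2"],
        "session_id": ["s1", "s1", "s1", "s1"],
        "bucket_start_ts": pd.to_datetime(
            [
                "2024-01-01 00:00",
                "2024-01-01 00:01",
                "2024-01-01 00:01",
                "2024-01-01 00:00",
            ]
        ),
    }
)
df["bucket_end_ts"] = df["bucket_start_ts"] + pd.Timedelta(seconds=60)
df.loc[3, "bucket_end_ts"] += pd.Timedelta(seconds=30)

print(df[bucket_end_mismatches(df)])
print(df[non_increasing_rows(df)])
```

Because the comparison is `<=`, a repeated timestamp within a group counts as a violation, which matches the "strictly increasing" wording.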

Soft checks (warnings)

  • Constant-value columns (std == 0).
  • Dominant-value columns (>90% identical).
  • Class imbalance (<5% representation) if label column exists.
  • Session boundary changes with very small gaps.
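
The distribution and class-balance warnings above could be approximated like this. It is a rough sketch under stated assumptions, not the library's implementation: the function names are hypothetical, the 0.9 and 0.05 thresholds are taken from the list above, and `nunique` is used as a floating-point-safe stand-in for `std == 0`.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list[str]:
    """Numeric columns with at most one distinct non-null value."""
    numeric = df.select_dtypes(include="number")
    return [c for c in numeric.columns if numeric[c].nunique(dropna=True) <= 1]


def dominant_columns(df: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """Columns where a single value accounts for more than `threshold` of non-null rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the top share.
        shares = df[col].value_counts(normalize=True, dropna=True)
        if not shares.empty and shares.iloc[0] > threshold:
            flagged.append(col)
    return flagged


def rare_classes(labels: pd.Series, min_share: float = 0.05) -> list:
    """Label values that make up less than `min_share` of the rows."""
    shares = labels.value_counts(normalize=True)
    return sorted(shares[shares < min_share].index)


# Hypothetical 100-row sample.
sample = pd.DataFrame(
    {
        "const": [1] * 100,
        "mostly": [0] * 95 + [1] * 5,
        "varied": list(range(100)),
        "label": ["a"] * 60 + ["b"] * 37 + ["c"] * 3,
    }
)

print(constant_columns(sample))
print(dominant_columns(sample))
print(rare_classes(sample["label"]))
```

In this sketch a constant column is also reported as dominant, since its single value has a share of 1.0.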

taskclf.core.validation

Data validation: range checks, missing rates, monotonic timestamps, and more.

ValidationReport

Bases: BaseModel

Collects all findings from validate_feature_dataframe.

Source code in src/taskclf/core/validation.py
class ValidationReport(BaseModel):
    """Collects all findings from :func:`validate_feature_dataframe`."""

    findings: list[Finding] = []

    @property
    def errors(self) -> list[Finding]:
        return [f for f in self.findings if f.severity == Severity.ERROR]

    @property
    def warnings(self) -> list[Finding]:
        return [f for f in self.findings if f.severity == Severity.WARNING]

    @property
    def ok(self) -> bool:
        return len(self.errors) == 0
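
To see how the three properties interact, here is a stdlib-only stand-in. It uses dataclasses instead of pydantic, and the `Severity` values and the `example_check` name are assumptions; only the field names `severity`, `check`, and `message` come from the source on this page.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    # Assumed values; only the member names appear in the source.
    ERROR = "error"
    WARNING = "warning"


@dataclass
class Finding:
    severity: Severity
    check: str
    message: str


@dataclass
class ValidationReport:
    """Stand-in mirroring the properties of the real ValidationReport."""

    findings: list[Finding] = field(default_factory=list)

    @property
    def errors(self) -> list[Finding]:
        return [f for f in self.findings if f.severity == Severity.ERROR]

    @property
    def warnings(self) -> list[Finding]:
        return [f for f in self.findings if f.severity == Severity.WARNING]

    @property
    def ok(self) -> bool:
        return len(self.errors) == 0


report = ValidationReport()
report.findings.append(
    Finding(Severity.WARNING, "empty_dataframe", "DataFrame is empty; no checks performed.")
)
ok_with_only_warnings = report.ok  # warnings alone do not fail the report

report.findings.append(Finding(Severity.ERROR, "example_check", "hypothetical error"))
ok_after_error = report.ok  # a single error flips ok to False
```

The point of the sketch is that `ok` depends only on errors: a report carrying nothing but warnings still passes.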

validate_feature_dataframe(df, *, max_missing_rate=0.5, bucket_seconds=DEFAULT_BUCKET_SECONDS)

Run hard and soft checks on a feature DataFrame.

Hard checks (errors):

  • Non-nullable columns contain nulls.
  • Nullable columns exceed max_missing_rate.
  • Numeric values outside declared ranges.
  • bucket_end_ts != bucket_start_ts + bucket_seconds.
  • Non-monotonic bucket_start_ts within (user_id, session_id) groups.

Soft checks (warnings):

  • Constant-value columns (std == 0).
  • Dominant-value columns (>90% identical).
  • Label class imbalance (<5% representation) if label column exists.

Parameters:

Name              Type       Description                                          Default
df                DataFrame  Feature DataFrame to validate.                       required
max_missing_rate  float      Maximum allowed null fraction for nullable columns.  0.5
bucket_seconds    int        Expected window width in seconds.                    DEFAULT_BUCKET_SECONDS

Returns:

Type              Description
ValidationReport  A ValidationReport with all findings.

Source code in src/taskclf/core/validation.py
def validate_feature_dataframe(
    df: pd.DataFrame,
    *,
    max_missing_rate: float = 0.5,
    bucket_seconds: int = DEFAULT_BUCKET_SECONDS,
) -> ValidationReport:
    """Run hard and soft checks on a feature DataFrame.

    Hard checks (errors):
        * Non-nullable columns contain nulls.
        * Nullable columns exceed *max_missing_rate*.
        * Numeric values outside declared ranges.
        * ``bucket_end_ts != bucket_start_ts + bucket_seconds``.
        * Non-monotonic ``bucket_start_ts`` within ``(user_id, session_id)``
          groups.

    Soft checks (warnings):
        * Constant-value columns (std == 0).
        * Dominant-value columns (>90% identical).
        * Label class imbalance (<5% representation) if ``label`` column
          exists.

    Args:
        df: Feature DataFrame to validate.
        max_missing_rate: Maximum allowed null fraction for nullable columns.
        bucket_seconds: Expected window width in seconds.

    Returns:
        A :class:`ValidationReport` with all findings.
    """
    report = ValidationReport()
    if df.empty:
        report.findings.append(
            Finding(
                severity=Severity.WARNING,
                check="empty_dataframe",
                message="DataFrame is empty; no checks performed.",
            )
        )
        return report

    _check_non_nullable(df, report)
    _check_missing_rates(df, max_missing_rate, report)
    _check_ranges(df, report)
    _check_bucket_end_consistency(df, bucket_seconds, report)
    _check_monotonic_timestamps(df, report)
    _check_session_boundaries(df, report)
    _check_distributions(df, report)
    _check_class_balance(df, report)

    return report
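
Note that an empty DataFrame yields a single `empty_dataframe` warning, so `report.ok` stays True in that case. A caller that wants to treat empty input as a failure might wrap the report as sketched below. `ValidationFailed` and `raise_on_problems` are hypothetical names, and the `SimpleNamespace` objects are stand-ins for a real report so that the snippet runs without taskclf installed.

```python
from types import SimpleNamespace


class ValidationFailed(Exception):
    """Hypothetical exception for this sketch; not part of taskclf."""


def raise_on_problems(report, *, fail_on_empty: bool = True) -> None:
    """Raise if the report has errors, or (optionally) the empty-DataFrame warning."""
    messages = [f.message for f in report.errors]
    if fail_on_empty:
        messages += [
            f.message for f in report.warnings if f.check == "empty_dataframe"
        ]
    if messages:
        raise ValidationFailed("; ".join(messages))


# Stand-in for the report returned when df.empty is True.
empty_report = SimpleNamespace(
    errors=[],
    warnings=[
        SimpleNamespace(
            check="empty_dataframe",
            message="DataFrame is empty; no checks performed.",
        )
    ],
)

try:
    raise_on_problems(empty_report)
except ValidationFailed as exc:
    print(f"validation failed: {exc}")

# With fail_on_empty=False the same report passes silently.
raise_on_problems(empty_report, fail_on_empty=False)
```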