core.validation¶
Data validation for feature DataFrames: range checks, missing rates, monotonic timestamps, session boundaries, and distribution warnings.
Usage¶
```python
from taskclf.core.validation import validate_feature_dataframe

report = validate_feature_dataframe(df, max_missing_rate=0.5)

# Hard-check failures make the report not ok.
if not report.ok:
    for finding in report.errors:
        print(f"ERROR: {finding.message}")

# Warnings do not fail validation, so print them either way.
for finding in report.warnings:
    print(f"WARN: {finding.message}")
```
Hard checks (errors)¶
- Non-nullable columns must not contain nulls.
- Nullable columns must not exceed max_missing_rate.
- Numeric values must fall within declared ranges (from features_v1.json).
- bucket_end_ts must equal bucket_start_ts + 60s.
- bucket_start_ts must be strictly increasing within each (user_id, session_id) group.
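The timestamp and missing-rate rules above can be pictured with a small dependency-free sketch. It is not the library's implementation: the real function works on a DataFrame, while this version walks plain dict rows so it runs with the standard library alone. The helper names `check_timestamps` and `check_missing_rate` and the column `keys_per_min` are hypothetical, and the 60-second width is taken from the rule above.

```python
from datetime import datetime, timedelta

BUCKET_SECONDS = 60  # assumed width, taken from the "+ 60s" rule


def check_timestamps(rows, bucket_seconds=BUCKET_SECONDS):
    """Return error messages for bucket-width and ordering violations."""
    errors = []
    width = timedelta(seconds=bucket_seconds)
    last_start = {}  # (user_id, session_id) -> previous bucket_start_ts

    for i, row in enumerate(rows):
        # Rule: bucket_end_ts must equal bucket_start_ts + bucket width.
        if row["bucket_end_ts"] - row["bucket_start_ts"] != width:
            errors.append(f"row {i}: bucket width is not {bucket_seconds}s")

        # Rule: bucket_start_ts strictly increasing within each group,
        # judged in the order the rows are given.
        key = (row["user_id"], row["session_id"])
        prev = last_start.get(key)
        if prev is not None and row["bucket_start_ts"] <= prev:
            errors.append(f"row {i}: bucket_start_ts not strictly increasing for {key}")
        last_start[key] = row["bucket_start_ts"]
    return errors


def check_missing_rate(rows, column, max_missing_rate=0.5):
    """Return an error message if the null fraction of `column` exceeds the limit."""
    if not rows:
        return []
    nulls = sum(1 for row in rows if row.get(column) is None)
    rate = nulls / len(rows)
    if rate > max_missing_rate:
        return [f"{column}: missing rate {rate:.2f} exceeds {max_missing_rate}"]
    return []


# Hypothetical sample: the second row repeats the start time and is 90s wide.
t0 = datetime(2024, 1, 1, 9, 0)
sample_rows = [
    {"user_id": "u1", "session_id": "s1", "bucket_start_ts": t0,
     "bucket_end_ts": t0 + timedelta(seconds=60), "keys_per_min": 10.0},
    {"user_id": "u1", "session_id": "s1", "bucket_start_ts": t0,
     "bucket_end_ts": t0 + timedelta(seconds=90), "keys_per_min": None},
]
timestamp_errors = check_timestamps(sample_rows)
```

In this sketch a missing rate exactly equal to `max_missing_rate` passes, since the rule is worded as "must not exceed"; whether the library treats the boundary the same way is an assumption.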
Soft checks (warnings)¶
- Constant-value columns (std == 0).
- Dominant-value columns (>90% identical).
- Class imbalance (<5% representation) if label column exists.
- Session boundary changes with very small gaps.
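The first three warnings above reduce to simple column statistics. The following is a minimal sketch under stated assumptions, not the library's code: the function names are hypothetical, columns are passed as plain lists rather than DataFrame columns, and nulls are ignored before computing each statistic. The 90% and 5% thresholds come from the list above.

```python
from collections import Counter
from statistics import pstdev


def is_constant(values):
    """True when a numeric column has zero spread (std == 0)."""
    present = [v for v in values if v is not None]
    return len(present) > 1 and pstdev(present) == 0


def dominant_value(values, threshold=0.9):
    """Return the value filling more than `threshold` of the column, else None."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    value, count = Counter(present).most_common(1)[0]
    return value if count / len(present) > threshold else None


def rare_classes(labels, min_share=0.05):
    """Return labels whose share of all rows is below `min_share`."""
    counts = Counter(labels)
    total = len(labels)
    return sorted(label for label, n in counts.items() if n / total < min_share)


# Hypothetical sample columns.
flat_column = [3.0, 3.0, 3.0, None]
app_column = ["editor"] * 19 + ["browser"]
label_column = ["coding"] * 97 + ["meeting"] * 3

warnings_found = {
    "constant": is_constant(flat_column),
    "dominant": dominant_value(app_column),
    "rare": rare_classes(label_column),
}
```

These are warnings rather than errors because a constant or skewed column is suspicious but not necessarily wrong; a short recording from a single application would trigger all three.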
taskclf.core.validation¶
Data validation: range checks, missing rates, monotonic timestamps, and more.
ValidationReport¶
Bases: BaseModel
Collects all findings from validate_feature_dataframe.
Source code in src/taskclf/core/validation.py
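The fields of the report are not listed on this page, but the usage example reads `report.ok`, `report.errors`, `report.warnings`, and `finding.message`. A rough sketch of a compatible shape follows. Everything beyond those four attribute names is an assumption: the class names `Finding` and `ReportSketch` are hypothetical, standard-library dataclasses stand in for the pydantic `BaseModel`, and the rule that `ok` means "no errors" is a guess.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical finding record; only `message` appears in the usage example."""
    message: str


@dataclass
class ReportSketch:
    """Stand-in for ValidationReport, built only from the attributes used above."""
    errors: list[Finding] = field(default_factory=list)
    warnings: list[Finding] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # Assumption: a report is ok when no hard check failed;
        # warnings alone do not flip it.
        return not self.errors


report = ReportSketch()
report.warnings.append(Finding("column 'keys_per_min' is constant"))
ok_with_warning_only = report.ok

report.errors.append(Finding("bucket_start_ts is not strictly increasing"))
ok_after_error = report.ok
```

Keeping errors and warnings in separate lists lets a caller fail a pipeline on `ok` while still logging the softer findings.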
validate_feature_dataframe(df, *, max_missing_rate=0.5, bucket_seconds=DEFAULT_BUCKET_SECONDS)¶
Run hard and soft checks on a feature DataFrame.
Hard checks (errors):
* Non-nullable columns contain nulls.
* Nullable columns exceed max_missing_rate.
* Numeric values outside declared ranges.
* bucket_end_ts != bucket_start_ts + bucket_seconds.
* Non-monotonic bucket_start_ts within (user_id, session_id) groups.
Soft checks (warnings):
* Constant-value columns (std == 0).
* Dominant-value columns (>90% identical).
* Label class imbalance (<5% representation) if label column exists.
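The range check in the list above can be sketched as a lookup against declared bounds. This is illustrative only: the layout of features_v1.json is not shown on this page, so the `DECLARED_RANGES` mapping, its column names, and the inclusive bounds are all assumptions, as is the helper name `out_of_range`.

```python
# Hypothetical declared ranges; the real ones come from features_v1.json,
# whose layout is not documented here.
DECLARED_RANGES = {
    "keys_per_min": (0.0, 2000.0),
    "active_fraction": (0.0, 1.0),
}


def out_of_range(rows, ranges=DECLARED_RANGES):
    """Return (row_index, column, value) for every value outside its range."""
    violations = []
    for i, row in enumerate(rows):
        for column, (low, high) in ranges.items():
            value = row.get(column)
            if value is None:
                continue  # nulls belong to the missing-value checks
            if not low <= value <= high:  # bounds assumed inclusive
                violations.append((i, column, value))
    return violations


# Hypothetical sample: the second row breaks both ranges.
range_rows = [
    {"keys_per_min": 120.0, "active_fraction": 0.4},
    {"keys_per_min": -5.0, "active_fraction": 1.2},
    {"keys_per_min": None, "active_fraction": 1.0},
]
range_violations = out_of_range(range_rows)
```

Skipping nulls here keeps each finding attributable to one rule: a missing value is reported by the null checks, not as a range violation.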
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Feature DataFrame to validate. | required |
| max_missing_rate | float | Maximum allowed null fraction for nullable columns. | 0.5 |
| bucket_seconds | int | Expected window width in seconds. | DEFAULT_BUCKET_SECONDS |
Returns:
| Type | Description |
|---|---|
| ValidationReport | A ValidationReport collecting all findings from the checks. |