core.schema¶
Feature schema versioning, deterministic hashing, and DataFrame
validation. The schema is the versioned contract between feature
producers (features.build) and consumers (train, infer). Per
AGENTS.md, inference must refuse to run when the schema hash
recorded in a model bundle differs from the hash of the feature
pipeline that produced the input data.
Feature schemas¶
TaskCLF currently supports three persisted feature contracts:
- FeatureSchemaV1: original schema, includes user_id in persisted rows and model features
- FeatureSchemaV2: removes user_id from the schema/model feature contract
- FeatureSchemaV3: current default; keeps user_id on persisted rows for joins/evaluation while using v2-style model semantics plus keyed title-sketch features
FeatureSchemaV1¶
Central class that owns the canonical column registry, the schema hash,
and both row-level and DataFrame-level validators.
FeatureSchemaV1 is implemented as a frozen slotted dataclass with
class-level constants (VERSION, COLUMNS, SCHEMA_HASH).
| Attribute | Type | Description |
|---|---|---|
| `VERSION` | `str` | `"v1"` -- schema generation tag |
| `COLUMNS` | `dict[str, type]` | Ordered column-name to Python-type mapping (41 columns) |
| `SCHEMA_HASH` | `str` | Deterministic hex digest derived from column names + types |
The hash is computed at import time by JSON-serialising the ordered
[[name, type_name], ...] pairs and passing them through
stable_hash. Any column addition, removal, rename,
or type change produces a different hash automatically.
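The scheme can be reproduced in miniature with the standard library. The digest algorithm inside stable_hash is an assumption here (SHA-256 over compact JSON), and the registry is abridged to two columns:

```python
import hashlib
import json

def stable_hash(obj) -> str:
    # Assumed behaviour of core.hashing.stable_hash: SHA-256 over compact JSON.
    payload = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

pairs_v1 = [["user_id", "str"], ["keys_per_min", "float"]]   # abridged registry
pairs_mod = [["user_id", "str"], ["keys_per_min", "int"]]    # one type changed

schema_hash = stable_hash(pairs_v1)
assert stable_hash(pairs_v1) == schema_hash    # deterministic across runs
assert stable_hash(pairs_mod) != schema_hash   # any change shifts the digest
```

Because the pairs are ordered, reordering columns also changes the hash, which is exactly the behaviour a strict contract wants.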
Column registry¶
Columns are grouped by role. All columns are required; nullable
fields (e.g. keys_per_min when no input watcher is present) are
typed float but may contain None at the Pydantic model level.
Identity and time¶
| Column | Type | Description |
|---|---|---|
| `user_id` | `str` | Pseudonymous user identifier |
| `device_id` | `str` | Optional device identifier |
| `session_id` | `str` | Hash-based session ID (see `features.sessions`) |
| `bucket_start_ts` | `datetime` | UTC-aligned bucket start |
| `bucket_end_ts` | `datetime` | `bucket_start_ts` + `bucket_seconds` |
Schema metadata¶
| Column | Type | Description |
|---|---|---|
| `schema_version` | `str` | Must equal `FeatureSchemaV1.VERSION` |
| `schema_hash` | `str` | Must equal `FeatureSchemaV1.SCHEMA_HASH` |
| `source_ids` | `list` | Collector IDs that contributed (e.g. `["aw-watcher-window"]`) |
Application context¶
| Column | Type | Description |
|---|---|---|
| `app_id` | `str` | Bundle ID of the dominant app in the bucket |
| `app_category` | `str` | Semantic category (e.g. `"editor"`, `"browser"`) |
| `window_title_hash` | `str` | Privacy-safe hash of the window title |
| `is_browser` | `bool` | Whether the dominant app is a browser |
| `is_editor` | `bool` | Whether the dominant app is a code editor |
| `is_terminal` | `bool` | Whether the dominant app is a terminal |
| `domain_category` | `str` | Browser domain classification (see `features.domain`) |
| `window_title_bucket` | `int` | Hash-bucketed title ID (see `features.text`) |
| `title_repeat_count_session` | `int` | How many times this title hash appeared in the current session |
App-switching metrics¶
| Column | Type | Description |
|---|---|---|
| `app_switch_count_last_5m` | `int` | Unique-app switches in the 5-minute look-back window |
| `app_switch_count_last_15m` | `int` | Same metric over 15 minutes |
| `app_foreground_time_ratio` | `float` | Fraction of the bucket the dominant app was foreground |
| `app_change_count` | `int` | App changes within the bucket itself |
| `top2_app_concentration_15m` | `float` | Combined time share of the two most-used apps over the last 15 minutes |
Input activity¶
| Column | Type | Description |
|---|---|---|
| `keys_per_min` | `float` | Keystrokes per minute (aggregate, no raw keys stored) |
| `backspace_ratio` | `float` | Fraction of keystrokes that are backspace |
| `shortcut_rate` | `float` | Fraction of keystrokes involving modifier keys |
| `clicks_per_min` | `float` | Mouse clicks per minute |
| `scroll_events_per_min` | `float` | Scroll events per minute |
| `mouse_distance` | `float` | Total mouse travel in pixels |
| `active_seconds_keyboard` | `float` | Seconds with keyboard activity in the bucket |
| `active_seconds_mouse` | `float` | Seconds with mouse activity |
| `active_seconds_any` | `float` | Seconds with any input |
| `max_idle_run_seconds` | `float` | Longest consecutive idle stretch |
| `event_density` | `float` | Active events per second of activity |
Temporal dynamics (rolling)¶
| Column | Type | Description |
|---|---|---|
| `keys_per_min_rolling_5` | `float` | 5-bucket rolling mean of `keys_per_min` |
| `keys_per_min_rolling_15` | `float` | 15-bucket rolling mean of `keys_per_min` |
| `mouse_distance_rolling_5` | `float` | 5-bucket rolling mean of `mouse_distance` |
| `mouse_distance_rolling_15` | `float` | 15-bucket rolling mean of `mouse_distance` |
| `keys_per_min_delta` | `float` | Current `keys_per_min` minus its rolling-5 mean |
| `clicks_per_min_delta` | `float` | Current `clicks_per_min` minus its rolling-5 mean |
| `mouse_distance_delta` | `float` | Current `mouse_distance` minus its rolling-5 mean |
Calendar and session¶
| Column | Type | Description |
|---|---|---|
| `hour_of_day` | `int` | 0--23 hour extracted from `bucket_start_ts` |
| `day_of_week` | `int` | 0 (Monday) -- 6 (Sunday) |
| `session_length_so_far` | `float` | Minutes elapsed since session start |
validate_row¶
Validates a raw dict as a FeatureRow via Pydantic, then checks that
schema_version and schema_hash match the current contract.
```python
from taskclf.core.schema import FeatureSchemaV1

row = FeatureSchemaV1.validate_row(raw_dict)
# raises ValueError on schema_version or schema_hash mismatch
```
Returns the validated FeatureRow on success.
coerce_nullable_numeric¶
Converts nullable numeric columns from object dtype (caused by
None values from FeatureRow.model_dump()) to float64 (with
NaN). Call this before validate_dataframe whenever a
DataFrame is built from model-dumped rows that may contain None in
numeric fields.
The function modifies the DataFrame in place and also returns it for chaining convenience.
```python
import pandas as pd
from taskclf.core.schema import FeatureSchemaV1, coerce_nullable_numeric

df = pd.DataFrame([row.model_dump() for row in feature_rows])
coerce_nullable_numeric(df)
FeatureSchemaV1.validate_dataframe(df)
```
validate_dataframe¶
Checks that a DataFrame has exactly the expected columns (no missing, no extra) and that pandas dtype kinds are compatible with the declared Python types.
The dtype compatibility mapping:
| Python type | Accepted pandas dtype kinds |
|---|---|
| `int` | `i` (signed), `u` (unsigned) |
| `float` | `f` (float), `i`, `u` (promotion safe) |
| `bool` | `b` (bool), `i`, `u` (numpy coercion) |
| `str` | `O` (object), `U` (unicode) |
Types not in this map (e.g. datetime, list) are skipped during
dtype checking.
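The kind check described by the table can be illustrated with a short sketch. `ACCEPTED_KINDS` and `dtype_compatible` are illustrative names, not the module's API:

```python
import pandas as pd

# Kind letters per the table above; types outside the map are skipped.
ACCEPTED_KINDS = {int: "iu", float: "fiu", bool: "biu", str: "OU"}

def dtype_compatible(series: pd.Series, py_type: type) -> bool:
    kinds = ACCEPTED_KINDS.get(py_type)
    if kinds is None:          # e.g. datetime, list: no dtype check
        return True
    return series.dtype.kind in kinds

df = pd.DataFrame({"count": [1, 2], "ratio": [0.5, 0.7]})
dtype_compatible(df["count"], float)   # True: int promotes safely to float
dtype_compatible(df["ratio"], int)     # False: float is not an int kind
```

Checking dtype *kinds* rather than exact dtypes is what makes int64 columns acceptable for `float`-typed features while still rejecting, say, an object column where a number is expected.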
```python
import pandas as pd
from taskclf.core.schema import FeatureSchemaV1, coerce_nullable_numeric

df = pd.DataFrame([row.model_dump() for row in feature_rows])
coerce_nullable_numeric(df)
FeatureSchemaV1.validate_dataframe(df)  # raises ValueError on mismatch
```
See also¶
- `core.types` -- `FeatureRow` Pydantic model
- `core.hashing` -- `stable_hash`, used for schema hash computation
- `features.build` -- feature computation pipeline that produces schema-conformant rows
taskclf.core.schema¶
Feature schema versioning, deterministic hashing, and DataFrame validation.
FeatureSchemaV1 (dataclass)¶
Schema contract for feature rows (v1).
Holds the canonical column list, computes a deterministic schema hash, and validates individual rows or DataFrames against the contract.
Source code in src/taskclf/core/schema.py
validate_row(data) (classmethod)¶
Validate data as a FeatureRow and verify schema metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `dict[str, Any]` | Raw dict of field values (e.g. from JSON or …) | required |

Returns:

| Type | Description |
|---|---|
| `FeatureRowBase` | The validated `FeatureRow` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If pydantic validation fails, or `schema_version`/`schema_hash` do not match the contract |
validate_dataframe(df) (classmethod)¶
Check that df conforms to the v1 column contract.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame to validate (typically built from model-dumped rows) | required |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If columns are missing, unexpected columns are present, or pandas dtype kinds do not match the expected Python types |
FeatureSchemaV2 (dataclass)¶
Schema contract for feature rows (v2).
Identical to `FeatureSchemaV1` except `user_id` has been removed from the column registry. Personalization shifts to calibrators and per-user post-processing.
validate_row(data) (classmethod)¶
Validate data as a FeatureRow and verify schema metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `dict[str, Any]` | Raw dict of field values (e.g. from JSON or …) | required |

Returns:

| Type | Description |
|---|---|
| `FeatureRowBase` | The validated `FeatureRow` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If pydantic validation fails, or `schema_version`/`schema_hash` do not match the contract |
validate_dataframe(df) (classmethod)¶
Check that df conforms to the v2 column contract.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame to validate | required |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If columns are missing, unexpected columns are present, or pandas dtype kinds do not match the expected Python types |
FeatureSchemaV3 (dataclass)¶
Schema contract for feature rows (v3).
Extends `FeatureSchemaV1` with high-signal keyed title-sketch features while keeping `user_id` on persisted rows for joins and per-user evaluation.
get_feature_schema(schema_version)¶
Return the schema class for schema_version.
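A minimal sketch of the lookup, using stub classes in place of the real schemas; raising `ValueError` on an unknown version is an assumption here:

```python
# Stand-in stubs for the real schema classes, for illustration only.
class FeatureSchemaV1:
    VERSION = "v1"

class FeatureSchemaV2:
    VERSION = "v2"

class FeatureSchemaV3:
    VERSION = "v3"

_SCHEMAS = {cls.VERSION: cls for cls in (FeatureSchemaV1, FeatureSchemaV2, FeatureSchemaV3)}

def get_feature_schema(schema_version: str):
    """Map a version tag like 'v2' to its schema class."""
    try:
        return _SCHEMAS[schema_version]
    except KeyError:
        raise ValueError(f"unknown schema version: {schema_version!r}") from None
```

Keeping the registry keyed by each class's own `VERSION` constant means adding a schema generation is a one-line change.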
get_feature_storage_dir(schema_version)¶
iter_feature_schema_versions(preferred_schema_version=None)¶
Return schema versions ordered for lookup, newest-first by default.
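Assuming v3 is newest, the ordering behaves like this sketch (the exact tie-breaking when a preferred version is supplied is an assumption):

```python
_VERSIONS_NEWEST_FIRST = ["v3", "v2", "v1"]  # ordering assumed from the module's defaults

def iter_feature_schema_versions(preferred_schema_version=None):
    """Return schema versions for lookup: preferred first, then newest-first."""
    if preferred_schema_version is None:
        return list(_VERSIONS_NEWEST_FIRST)
    rest = [v for v in _VERSIONS_NEWEST_FIRST if v != preferred_schema_version]
    return [preferred_schema_version] + rest

iter_feature_schema_versions()      # ['v3', 'v2', 'v1']
iter_feature_schema_versions("v1")  # ['v1', 'v3', 'v2']
```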
resolve_feature_parquet_path(data_dir, target_date, *, schema_version=None)¶
Return the first existing feature parquet path for target_date.
When schema_version is provided it is checked first, then older/newer versions are tried as fallbacks. When omitted, lookup proceeds newest-first.
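A sketch of the fallback search; the on-disk layout (`<data_dir>/features/<version>/<date>.parquet`) is an assumption made for illustration:

```python
from datetime import date
from pathlib import Path

def resolve_feature_parquet_path(data_dir, target_date: date, *, schema_version=None):
    """Return the first existing parquet path, trying the preferred version first."""
    versions = ["v3", "v2", "v1"]  # newest-first
    if schema_version is not None:
        versions = [schema_version] + [v for v in versions if v != schema_version]
    for version in versions:
        # Layout assumed: <data_dir>/features/<version>/<YYYY-MM-DD>.parquet
        candidate = Path(data_dir) / "features" / version / f"{target_date.isoformat()}.parquet"
        if candidate.exists():
            return candidate
    return None
```

Returning `None` (rather than raising) lets callers decide whether a missing day is an error or simply a gap in collection.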
coerce_nullable_numeric(df)¶
Convert nullable numeric columns from object (None) to float64 (NaN).
When FeatureRow.model_dump() emits None for Optional[float]
fields, pandas stores the column as object dtype. This helper
coerces those columns to float64 so downstream validation and
parquet writing see the correct dtype.
The DataFrame is modified in-place and also returned for convenience.
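The coercion can be reproduced in miniature with pandas; `NULLABLE_NUMERIC_COLUMNS` here is an illustrative subset, not the module's real list:

```python
import pandas as pd

NULLABLE_NUMERIC_COLUMNS = ["keys_per_min"]  # illustrative subset

def coerce_nullable_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce object-dtype nullable numeric columns to float64 (None -> NaN)."""
    for col in NULLABLE_NUMERIC_COLUMNS:
        if col in df.columns and df[col].dtype == object:
            df[col] = pd.to_numeric(df[col])  # in-place column replacement
    return df

# None in an Optional[float] field leaves the column as object dtype...
df = pd.DataFrame({"keys_per_min": pd.Series([42.0, None], dtype=object)})
assert df["keys_per_min"].dtype == object
coerce_nullable_numeric(df)
assert df["keys_per_min"].dtype.kind == "f"  # ...and float64 after coercion
```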