core.types¶

Pydantic models for the core data contracts.

FeatureRow identity fields¶

Every FeatureRow carries stable identity columns alongside the schema metadata and feature values:

Field	Type	Description
`user_id`	`str`	Random UUID identifying the user (not PII).
`device_id`	`str | None`	Optional device identifier.
`session_id`	`str`	Deterministic session identifier derived from `user_id` + session start timestamp.
`bucket_start_ts`	`datetime`	Start of the 60 s bucket (UTC, inclusive).
`bucket_end_ts`	`datetime`	End of the 60 s bucket (UTC, exclusive).

The primary key is (user_id, bucket_start_ts).
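
As a rough illustration of how `bucket_start_ts` and `bucket_end_ts` relate, the sketch below floors a timestamp to its 60 s bucket. It assumes buckets are aligned to whole 60-second multiples of the Unix epoch, which the text above does not state; `bucket_bounds` and the example user ID are hypothetical and not part of `taskclf`.

```python
import math
from datetime import datetime, timedelta, timezone

BUCKET_SECONDS = 60  # bucket width from the table above


def bucket_bounds(ts: datetime) -> tuple[datetime, datetime]:
    """Return (bucket_start_ts, bucket_end_ts) for an aware timestamp.

    Hypothetical helper: assumes buckets align to epoch multiples of 60 s.
    """
    epoch_seconds = math.floor(ts.astimezone(timezone.utc).timestamp())
    start_seconds = epoch_seconds - (epoch_seconds % BUCKET_SECONDS)
    start = datetime.fromtimestamp(start_seconds, tz=timezone.utc)
    return start, start + timedelta(seconds=BUCKET_SECONDS)


ts = datetime(2024, 1, 1, 12, 0, 42, tzinfo=timezone.utc)
start, end = bucket_bounds(ts)

# Start is inclusive and end is exclusive, so each timestamp maps to one bucket.
in_bucket = start <= ts < end

# Primary key as described above: (user_id, bucket_start_ts).
primary_key = ("example-user-uuid", start)
```

Because the end bound is exclusive, a timestamp that falls exactly on a bucket boundary belongs to the bucket that starts there, not the one that ends there.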

LabelSpan fields¶

LabelSpan represents a contiguous time span carrying a single task-type label. Gold labels and weak labels share this structure. During validation, all timestamps are normalized to timezone-aware UTC via ts_utc_aware_get() so persisted label spans and in-memory comparisons use a single timestamp convention. Legacy naive inputs (from CSV, Parquet, or API) are accepted and treated as UTC on read.

Field	Type	Description
`start_ts`	`datetime`	Span start (aware UTC, inclusive).
`end_ts`	`datetime`	Span end (aware UTC, exclusive).
`label`	`str`	Task-type label from `LABEL_SET_V1`.
`provenance`	`str`	Origin tag, e.g. `"manual"` or `"weak:app_rule"`.
`user_id`	`str | None`	User who created this label (optional, default `None`).
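`confidence`	`float | None`	Labeler confidence in the range 0-1 (optional, default `None`). `NaN` is coerced to `None`.
`extend_forward`	`bool`	When true, the label extends forward until the next label is created (default `False`). In that case `end_ts` may equal `start_ts`; otherwise `end_ts` must be strictly after `start_ts`.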
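
The normalization rules described for `LabelSpan` can be sketched with the standard library alone. The two helpers below are hypothetical stand-ins: `to_aware_utc` mirrors the behaviour attributed to `ts_utc_aware_get()` (naive values are tagged as UTC, aware values are converted), and `clean_confidence` mirrors the `NaN`-to-`None` coercion. The real implementations may differ in detail.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Optional


def to_aware_utc(ts: datetime) -> datetime:
    """Stand-in for ts_utc_aware_get(): return an aware UTC datetime."""
    if ts.tzinfo is None:
        # Legacy naive input: treat the wall-clock value as UTC.
        return ts.replace(tzinfo=timezone.utc)
    # Aware input: convert to UTC, preserving the instant.
    return ts.astimezone(timezone.utc)


def clean_confidence(value: Optional[float]) -> Optional[float]:
    """Stand-in for the confidence pre-validator: NaN becomes None."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


naive = datetime(2024, 1, 1, 9, 0)
aware_plus_two = datetime(2024, 1, 1, 11, 0, tzinfo=timezone(timedelta(hours=2)))

# Both inputs normalize to 09:00 UTC.
normalized_naive = to_aware_utc(naive)
normalized_aware = to_aware_utc(aware_plus_two)

# NaN (e.g. from an empty CSV cell read as float) is treated as "no confidence".
confidence = clean_confidence(float("nan"))
```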

TitlePolicy¶

TitlePolicy controls whether raw window titles may appear in a FeatureRow.

Member	Value	Behaviour
`HASH_ONLY`	`"hash_only"`	Default. All `raw_*` fields are rejected.
`RAW_WINDOW_TITLE_OPT_IN`	`"raw_window_title_opt_in"`	`raw_window_title` is accepted but excluded from `model_dump()`, preventing leakage into `data/processed/`. All other `raw_*` fields remain prohibited.

Pass the policy via Pydantic validation context:

from taskclf.core.types import FeatureRow, TitlePolicy

row = FeatureRow.model_validate(
    data,
    context={"title_policy": TitlePolicy.RAW_WINDOW_TITLE_OPT_IN},
)
row.raw_window_title   # available on the instance
row.model_dump()       # raw_window_title is NOT included

`taskclf.core.types` ¶

Core data contracts: Event protocol, feature rows, and label spans.

`Event` ¶

Bases: Protocol

Normalized activity event contract.

Any adapter event type that exposes these read-only attributes satisfies the protocol; no inheritance is required. For example, `taskclf.adapters.activitywatch.types.AWEvent` is a valid `Event` without importing or subclassing this protocol.
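
To illustrate the structural typing involved, here is a minimal sketch using a trimmed-down protocol. `EventLike` and `FakeAdapterEvent` are hypothetical names used only for this example; `EventLike` carries three of the attributes that `Event` declares.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Protocol, runtime_checkable


@runtime_checkable
class EventLike(Protocol):
    """Trimmed-down stand-in for Event (three of its attributes)."""

    @property
    def timestamp(self) -> datetime: ...
    @property
    def duration_seconds(self) -> float: ...
    @property
    def app_id(self) -> str: ...


@dataclass(frozen=True)
class FakeAdapterEvent:
    """Hypothetical adapter event; it does not subclass EventLike."""

    timestamp: datetime
    duration_seconds: float
    app_id: str


event = FakeAdapterEvent(
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    duration_seconds=12.5,
    app_id="com.apple.Terminal",
)

# isinstance() on a runtime-checkable protocol tests attribute presence only.
matches = isinstance(event, EventLike)
rejects = isinstance(object(), EventLike)
```

Note that the runtime check does not verify attribute types; a static type checker is still needed to confirm that, for instance, `timestamp` is really a `datetime`.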

Source code in src/taskclf/core/types.py

@runtime_checkable
class Event(Protocol):
    """Normalized activity event contract.

    Any adapter event type that exposes these read-only attributes
    satisfies the protocol -- no inheritance required.  For example,
    :class:`~taskclf.adapters.activitywatch.types.AWEvent` is a valid
    ``Event`` without importing or subclassing this protocol.
    """

    @property
    def timestamp(self) -> datetime: ...
    @property
    def duration_seconds(self) -> float: ...
    @property
    def app_id(self) -> str: ...
    @property
    def window_title_hash(self) -> str: ...
    @property
    def is_browser(self) -> bool: ...
    @property
    def is_editor(self) -> bool: ...
    @property
    def is_terminal(self) -> bool: ...
    @property
    def app_category(self) -> str: ...

`CoreLabel` ¶

Bases: StrEnum

Canonical task-type labels (v1).

Member ordering matches schema/labels_v1.json label IDs. Do NOT reorder or remove members without a version bump.
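
The reason ordering matters can be seen in a small sketch that derives integer IDs from member position. It assumes label IDs are consecutive and start at 0, which is not confirmed here (the authoritative values live in `schema/labels_v1.json`). The enum is redeclared with `str, Enum` so the example also runs on Python versions without `StrEnum`, and `LABEL_TO_ID` is a hypothetical name.

```python
from enum import Enum


class CoreLabel(str, Enum):
    """Local copy of the v1 labels, in the documented order."""

    Build = "Build"
    Debug = "Debug"
    Review = "Review"
    Write = "Write"
    ReadResearch = "ReadResearch"
    Communicate = "Communicate"
    Meet = "Meet"
    BreakIdle = "BreakIdle"


# Enum iteration follows definition order, so position can serve as an ID.
# Assumption: IDs are 0-based and consecutive.
LABEL_TO_ID = {label.value: index for index, label in enumerate(CoreLabel)}
ID_TO_LABEL = {index: name for name, index in LABEL_TO_ID.items()}
```

Under this assumption, inserting or reordering a member would silently shift every later ID, which is why such changes call for a version bump.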

Source code in src/taskclf/core/types.py

class CoreLabel(StrEnum):
    """Canonical task-type labels (v1).

    Member ordering matches ``schema/labels_v1.json`` label IDs.
    Do NOT reorder or remove members without a version bump.
    """

    Build = "Build"
    Debug = "Debug"
    Review = "Review"
    Write = "Write"
    ReadResearch = "ReadResearch"
    Communicate = "Communicate"
    Meet = "Meet"
    BreakIdle = "BreakIdle"

`TitlePolicy` ¶

Bases: StrEnum

Controls whether raw window titles may appear in a `FeatureRow`.

`HASH_ONLY` (default): all `raw_*` fields are rejected. This is the standard privacy mode.

`RAW_WINDOW_TITLE_OPT_IN`: `raw_window_title` is accepted, but it is still excluded from `model_dump()` so it can never leak into `data/processed/`. All other `raw_*` fields remain prohibited.

Pass the policy via Pydantic validation context:

FeatureRow.model_validate(data, context={"title_policy": TitlePolicy.RAW_WINDOW_TITLE_OPT_IN})

Source code in src/taskclf/core/types.py

class TitlePolicy(StrEnum):
    """Controls whether raw window titles may appear in a :class:`FeatureRow`.

    ``HASH_ONLY`` (default)
        All ``raw_*`` fields are rejected — the standard privacy mode.

    ``RAW_WINDOW_TITLE_OPT_IN``
        ``raw_window_title`` is accepted (but still excluded from
        ``model_dump()`` so it can never leak into ``data/processed/``).
        All other ``raw_*`` fields remain prohibited.

    Pass the policy via Pydantic validation context::

        FeatureRow.model_validate(data, context={"title_policy": TitlePolicy.RAW_WINDOW_TITLE_OPT_IN})
    """

    HASH_ONLY = "hash_only"
    RAW_WINDOW_TITLE_OPT_IN = "raw_window_title_opt_in"

`FeatureRowBase` ¶

Bases: BaseModel

One bucketed observation (typically 60 s).

All persisted feature rows carry schema metadata so downstream consumers can detect silent drift.

Fields are grouped into five sections:

- meta: `bucket_start_ts`, `schema_version`, `schema_hash`, `source_ids`.
- context: `app_id`, `app_category`, `window_title_hash`, `is_browser`, `is_editor`, `is_terminal`, `app_switch_count_last_5m`, `app_foreground_time_ratio`, `app_change_count`, `app_dwell_time_seconds`, `app_entropy_5m`, `app_entropy_15m`, `top2_app_concentration_15m`, `idle_return_indicator`.
- keyboard / mouse: nullable until the corresponding collector is wired (`keys_per_min`, `backspace_ratio`, `shortcut_rate`, `clicks_per_min`, `scroll_events_per_min`, `mouse_distance`).
- activity occupancy: nullable; derived from `aw-watcher-input` (`active_seconds_keyboard`, `active_seconds_mouse`, `active_seconds_any`, `max_idle_run_seconds`, `event_density`).
- temporal: `hour_of_day`, `day_of_week`, `session_length_so_far`.

A pre-validator rejects any field whose name starts with raw_ to enforce the privacy invariant (no raw keystrokes / titles). The single exception is raw_window_title, which is accepted when validation context carries title_policy=TitlePolicy.RAW_WINDOW_TITLE_OPT_IN. Even then, the field is excluded from model_dump() so it cannot leak into persisted datasets.
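
The rejection rule can be sketched without Pydantic as a plain function over the input keys. This is an illustration, not the library code: `check_fields` and `is_accepted` are hypothetical helpers, `PROHIBITED_PREFIXES` is an assumed value for `_PROHIBITED_FIELD_PREFIXES`, and `TitlePolicy` is redeclared locally so the example is self-contained.

```python
from enum import Enum
from typing import Any, Mapping

PROHIBITED_PREFIXES = ("raw_",)  # assumed contents of _PROHIBITED_FIELD_PREFIXES


class TitlePolicy(str, Enum):
    HASH_ONLY = "hash_only"
    RAW_WINDOW_TITLE_OPT_IN = "raw_window_title_opt_in"


def check_fields(
    values: Mapping[str, Any],
    policy: TitlePolicy = TitlePolicy.HASH_ONLY,
) -> None:
    """Raise ValueError if any key violates the raw_* privacy rule."""
    for key in values:
        if not key.startswith(PROHIBITED_PREFIXES):
            continue
        # Single exception: raw_window_title under the opt-in policy.
        if key == "raw_window_title" and policy is TitlePolicy.RAW_WINDOW_TITLE_OPT_IN:
            continue
        raise ValueError(f"Prohibited field {key!r}")


def is_accepted(
    values: Mapping[str, Any],
    policy: TitlePolicy = TitlePolicy.HASH_ONLY,
) -> bool:
    """Convenience wrapper that reports the outcome as a boolean."""
    try:
        check_fields(values, policy)
    except ValueError:
        return False
    return True


opt_in = TitlePolicy.RAW_WINDOW_TITLE_OPT_IN
default_rejects_title = not is_accepted({"raw_window_title": "notes.txt"})
opt_in_accepts_title = is_accepted({"raw_window_title": "notes.txt"}, opt_in)
opt_in_still_rejects_others = not is_accepted({"raw_keystrokes": "abc"}, opt_in)
```

The opt-in widens the rule for exactly one field name, so any other `raw_*` key stays rejected under both policies.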

Source code in src/taskclf/core/types.py

class FeatureRowBase(BaseModel, frozen=True):
    """One bucketed observation (typically 60 s).

    All persisted feature rows carry schema metadata so downstream
    consumers can detect silent drift.

    Fields are grouped into five sections:

    - **meta** — ``bucket_start_ts``, ``schema_version``, ``schema_hash``,
      ``source_ids``.
    - **context** — ``app_id``, ``app_category``, ``window_title_hash``,
      ``is_browser``, ``is_editor``, ``is_terminal``,
      ``app_switch_count_last_5m``, ``app_foreground_time_ratio``,
      ``app_change_count``, ``app_dwell_time_seconds``,
      ``app_entropy_5m``, ``app_entropy_15m``,
      ``top2_app_concentration_15m``, ``idle_return_indicator``.
    - **keyboard / mouse** — nullable until the corresponding collector is
      wired (``keys_per_min``, ``backspace_ratio``, ``shortcut_rate``,
      ``clicks_per_min``, ``scroll_events_per_min``, ``mouse_distance``).
    - **activity occupancy** — nullable; derived from ``aw-watcher-input``
      (``active_seconds_keyboard``, ``active_seconds_mouse``,
      ``active_seconds_any``, ``max_idle_run_seconds``, ``event_density``).
    - **temporal** — ``hour_of_day``, ``day_of_week``, ``session_length_so_far``.

    A pre-validator rejects any field whose name starts with ``raw_`` to
    enforce the privacy invariant (no raw keystrokes / titles).  The
    single exception is ``raw_window_title``, which is accepted when
    validation context carries ``title_policy=TitlePolicy.RAW_WINDOW_TITLE_OPT_IN``.
    Even then, the field is excluded from ``model_dump()`` so it cannot
    leak into persisted datasets.
    """

    # -- identity --
    user_id: str = Field(description="Random UUID identifying the user (not PII).")
    device_id: str | None = Field(
        default=None, description="Optional device identifier."
    )
    session_id: str = Field(
        description="Deterministic session identifier derived from user_id + session start."
    )

    # -- meta --
    bucket_start_ts: datetime = Field(description="Start of the 60 s bucket (UTC).")
    bucket_end_ts: datetime = Field(
        description="End of the 60 s bucket (UTC, exclusive)."
    )
    schema_version: str = Field(description="Schema version tag, e.g. 'v1'.")
    schema_hash: str = Field(description="Deterministic hash of the column registry.")
    source_ids: list[str] = Field(
        min_length=1, description="Collector IDs that contributed to this row."
    )

    # -- context --
    app_id: str = Field(
        description="Reverse-domain app identifier, e.g. 'com.apple.Terminal'."
    )
    app_category: str = Field(
        description="Semantic app category, e.g. 'editor', 'chat', 'meeting'."
    )
    window_title_hash: str = Field(description="Hashed window title (never raw).")
    is_browser: bool = Field(description="True if the foreground app is a web browser.")
    is_editor: bool = Field(description="True if the foreground app is a code editor.")
    is_terminal: bool = Field(
        description="True if the foreground app is a terminal emulator."
    )
    app_switch_count_last_5m: int = Field(
        ge=0, description="Number of unique app switches in the last 5 minutes."
    )
    app_foreground_time_ratio: float = Field(
        ge=0.0,
        le=1.0,
        description="Fraction of the bucket the dominant app was foreground.",
    )
    app_change_count: int = Field(
        ge=0, description="Number of app transitions within this bucket."
    )
    app_dwell_time_seconds: float = Field(
        ge=0.0,
        description="Seconds the dominant app has been foreground continuously across consecutive buckets.",
    )
    app_entropy_5m: float | None = Field(
        default=None,
        ge=0.0,
        description="Shannon entropy of app duration distribution over the last 5 minutes.",
    )
    app_entropy_15m: float | None = Field(
        default=None,
        ge=0.0,
        description="Shannon entropy of app duration distribution over the last 15 minutes.",
    )
    top2_app_concentration_15m: float | None = Field(
        default=None,
        ge=0.0,
        le=1.0,
        description="Combined time share of the two most-used apps over the last 15 minutes.",
    )
    idle_return_indicator: bool = Field(
        default=False,
        description="True if this bucket immediately follows an idle gap (i.e., starts a new session).",
    )

    # -- keyboard (nullable until collector is wired) --
    keys_per_min: float | None = Field(
        default=None, description="Keystrokes per minute."
    )
    backspace_ratio: float | None = Field(
        default=None,
        ge=0.0,
        le=1.0,
        description="Fraction of keystrokes that are backspace.",
    )
    shortcut_rate: float | None = Field(
        default=None, ge=0.0, description="Keyboard shortcuts per minute."
    )

    # -- mouse (nullable until collector is wired) --
    clicks_per_min: float | None = Field(
        default=None, ge=0.0, description="Mouse clicks per minute."
    )
    scroll_events_per_min: float | None = Field(
        default=None, ge=0.0, description="Scroll events per minute."
    )
    mouse_distance: float | None = Field(
        default=None, ge=0.0, description="Mouse distance in pixels."
    )

    # -- activity occupancy (nullable until input collector is wired) --
    active_seconds_keyboard: float | None = Field(
        default=None,
        ge=0.0,
        description="Seconds with keyboard activity within this bucket.",
    )
    active_seconds_mouse: float | None = Field(
        default=None,
        ge=0.0,
        description="Seconds with mouse activity within this bucket.",
    )
    active_seconds_any: float | None = Field(
        default=None,
        ge=0.0,
        description="Seconds with any input activity within this bucket.",
    )
    max_idle_run_seconds: float | None = Field(
        default=None,
        ge=0.0,
        description="Longest consecutive idle run (seconds) within this bucket.",
    )
    event_density: float | None = Field(
        default=None,
        ge=0.0,
        description="Input events per active second within this bucket.",
    )

    # -- browser domain (item 38) --
    domain_category: str = Field(
        default="unknown",
        description="Privacy-preserving browser domain category (e.g. 'search', 'docs', 'social'); 'non_browser' for non-browser apps.",
    )

    # -- title clustering (item 39) --
    window_title_bucket: int = Field(
        ge=0, le=255, description="Hash-trick bucket (0-255) of window_title_hash."
    )
    title_repeat_count_session: int = Field(
        ge=0,
        description="Number of times this window_title_hash has appeared in the current session.",
    )

    # -- temporal dynamics: rolling means (item 40) --
    keys_per_min_rolling_5: float | None = Field(
        default=None, ge=0.0, description="5-bucket rolling mean of keys_per_min."
    )
    keys_per_min_rolling_15: float | None = Field(
        default=None, ge=0.0, description="15-bucket rolling mean of keys_per_min."
    )
    mouse_distance_rolling_5: float | None = Field(
        default=None, ge=0.0, description="5-bucket rolling mean of mouse_distance."
    )
    mouse_distance_rolling_15: float | None = Field(
        default=None, ge=0.0, description="15-bucket rolling mean of mouse_distance."
    )

    # -- temporal dynamics: deltas (item 40) --
    keys_per_min_delta: float | None = Field(
        default=None, description="Change in keys_per_min from previous bucket."
    )
    clicks_per_min_delta: float | None = Field(
        default=None, description="Change in clicks_per_min from previous bucket."
    )
    mouse_distance_delta: float | None = Field(
        default=None, description="Change in mouse_distance from previous bucket."
    )

    # -- temporal dynamics: extended switch count (item 40) --
    app_switch_count_last_15m: int = Field(
        ge=0, description="Unique app switches in the last 15 minutes."
    )

    # -- temporal --
    hour_of_day: int = Field(
        ge=0, le=23, description="Hour component of bucket_start_ts (0-23)."
    )
    day_of_week: int = Field(
        ge=0, le=6, description="Day of week (0=Monday, 6=Sunday)."
    )
    session_length_so_far: float = Field(
        ge=0.0, description="Minutes since session start."
    )

    # -- opt-in raw title (excluded from serialization) --
    raw_window_title: str | None = Field(
        default=None,
        exclude=True,
        description="Raw window title; only accepted when title_policy=RAW_WINDOW_TITLE_OPT_IN.",
    )

    @field_validator("bucket_start_ts", "bucket_end_ts", mode="before")
    @classmethod
    def _ensure_aware_utc(cls, v: object) -> object:
        """Tag naive datetimes as UTC; convert non-UTC aware datetimes."""
        if isinstance(v, datetime):
            if v.tzinfo is None:
                return v.replace(tzinfo=timezone.utc)
            return v.astimezone(timezone.utc)
        return v

    @model_validator(mode="before")
    @classmethod
    def reject_prohibited_fields(cls, values: dict, info: ValidationInfo) -> dict:  # type: ignore[override]
        if isinstance(values, dict):
            ctx = info.context or {}
            title_policy = ctx.get("title_policy", TitlePolicy.HASH_ONLY)
            for key in values:
                for prefix in _PROHIBITED_FIELD_PREFIXES:
                    if key.startswith(prefix):
                        if (
                            key == "raw_window_title"
                            and title_policy == TitlePolicy.RAW_WINDOW_TITLE_OPT_IN
                        ):
                            continue
                        raise ValueError(
                            f"Prohibited field '{key}': fields starting with "
                            f"'{prefix}' must not appear in a FeatureRow"
                        )
        return values

    def model_dump(self, *args, **kwargs):  # type: ignore[override]
        exclude = kwargs.pop("exclude", None)
        if self.schema_version == "v2":
            if exclude is None:
                exclude = {"user_id"}
            elif isinstance(exclude, dict):
                exclude = {**exclude, "user_id": True}
            else:
                exclude = set(exclude) | {"user_id"}
        if self.schema_version != "v3":
            if exclude is None:
                exclude = set(V3_ONLY_FEATURE_FIELDS)
            elif isinstance(exclude, dict):
                exclude = {
                    **exclude,
                    **{field: True for field in V3_ONLY_FEATURE_FIELDS},
                }
            else:
                exclude = set(exclude) | set(V3_ONLY_FEATURE_FIELDS)
        return super().model_dump(*args, exclude=exclude, **kwargs)
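
The version-dependent branches in `model_dump()` above all merge extra field names into whatever `exclude` argument the caller supplied. A minimal sketch of that merging logic follows; `merge_exclude` and `effective_exclude` are hypothetical helpers, and `hypothetical_v3_field` is a placeholder because the contents of `V3_ONLY_FEATURE_FIELDS` are not shown here.

```python
from typing import Iterable, Optional, Union

Exclude = Optional[Union[set, dict]]


def merge_exclude(exclude: Exclude, extra_fields: Iterable[str]) -> Union[set, dict]:
    """Add extra_fields to an exclude argument that may be None, a set, or a dict."""
    if exclude is None:
        return set(extra_fields)
    if isinstance(exclude, dict):
        # Dict form maps field name -> True (or a nested exclude spec).
        return {**exclude, **{name: True for name in extra_fields}}
    return set(exclude) | set(extra_fields)


def effective_exclude(
    schema_version: str,
    exclude: Exclude = None,
    v3_only: Iterable[str] = ("hypothetical_v3_field",),  # placeholder names
) -> Exclude:
    """Mirror the two version checks from the listing above."""
    if schema_version == "v2":
        exclude = merge_exclude(exclude, {"user_id"})
    if schema_version != "v3":
        exclude = merge_exclude(exclude, v3_only)
    return exclude


v2_exclude = effective_exclude("v2")
v3_exclude = effective_exclude("v3")
v1_dict_exclude = effective_exclude("v1", {"app_id": True})
```

Read this way, a `v2` row drops `user_id` and the v3-only fields, any other non-`v3` row drops only the v3-only fields, and a `v3` row passes the caller's `exclude` through unchanged.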

`LabelSpan` ¶

Bases: BaseModel

A contiguous time span carrying a single task-type label.

Gold labels and weak labels share this structure; provenance distinguishes them (e.g. "manual" vs "weak:app_rule").

Optional user_id ties the span to a specific user (required for multi-user datasets and block-to-window projection). Optional confidence records the labeler's self-assessed certainty. Both default to None for backward compatibility with existing label CSV imports.

Source code in src/taskclf/core/types.py

class LabelSpan(BaseModel, frozen=True):
    """A contiguous time span carrying a single task-type label.

    Gold labels and weak labels share this structure; ``provenance``
    distinguishes them (e.g. ``"manual"`` vs ``"weak:app_rule"``).

    Optional ``user_id`` ties the span to a specific user (required for
    multi-user datasets and block-to-window projection).  Optional
    ``confidence`` records the labeler's self-assessed certainty.
    Both default to ``None`` for backward compatibility with existing
    label CSV imports.
    """

    start_ts: datetime = Field(description="Span start (aware UTC, inclusive).")
    end_ts: datetime = Field(description="Span end (aware UTC, exclusive).")
    label: str = Field(description="Task-type label from LABEL_SET_V1.")
    provenance: str = Field(description="Origin tag, e.g. 'manual' or 'weak:app_rule'.")
    user_id: str | None = Field(
        default=None, description="User who created this label."
    )
    confidence: float | None = Field(
        default=None, ge=0.0, le=1.0, description="Labeler confidence (0-1)."
    )

    @field_validator("start_ts", "end_ts", mode="before")
    @classmethod
    def _normalize_timestamps(cls, v: object) -> object:
        """Normalize label span timestamps to aware UTC."""
        if isinstance(v, datetime):
            return ts_utc_aware_get(v)
        return v

    @field_validator("confidence", mode="before")
    @classmethod
    def _nan_confidence_to_none(cls, v: object) -> object:
        if isinstance(v, float) and math.isnan(v):
            return None
        return v

    extend_forward: bool = Field(
        default=False,
        description="When true, this label extends forward until the next label is created.",
    )

    @model_validator(mode="after")
    def _check_invariants(self) -> LabelSpan:
        if self.extend_forward:
            if self.end_ts < self.start_ts:
                raise ValueError(
                    f"end_ts ({self.end_ts}) must not be before "
                    f"start_ts ({self.start_ts})"
                )
        elif self.end_ts <= self.start_ts:
            raise ValueError(
                f"end_ts ({self.end_ts}) must be strictly after "
                f"start_ts ({self.start_ts})"
            )
        if self.label not in LABEL_SET_V1:
            raise ValueError(
                f"Unknown label {self.label!r}; must be one of {sorted(LABEL_SET_V1)}"
            )
        return self
