labels.weak_rules

Heuristic weak-labeling rules that map feature rows to task-type labels.

Overview

Weak rules provide an automated, low-confidence alternative to manual labeling. Each rule inspects a single feature column and, when matched, proposes a LabelSpan with provenance="weak:<rule_name>". Rules are evaluated in list order; the first match wins.

Weak labels share the same LabelSpan structure as gold labels. The provenance field distinguishes them — gold labels use "manual", while weak labels use the "weak:<rule_name>" convention. This allows downstream consumers (training, projection) to filter or weight labels by origin.
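As a minimal sketch of how a downstream consumer might filter or weight labels by origin: the `Span` class below is a hypothetical stand-in for LabelSpan carrying only the two fields used here, and the 0.3 weight is an assumed policy, not something the library prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for LabelSpan; only the fields used here."""

    label: str
    provenance: str


def is_weak(span: Span) -> bool:
    # Weak labels follow the "weak:<rule_name>" convention.
    return span.provenance.startswith("weak:")


def sample_weight(span: Span, weak_weight: float = 0.3) -> float:
    # Assumed policy: gold labels count fully, weak labels are down-weighted.
    return weak_weight if is_weak(span) else 1.0


spans = [
    Span("Build", "manual"),
    Span("Build", "weak:app_id:com.apple.Terminal"),
    Span("Meet", "weak:app_category:meeting"),
]

gold = [s for s in spans if not is_weak(s)]
weights = [sample_weight(s) for s in spans]
```

The provenance strings follow the naming used by the default rules, whose names take the form `<field>:<pattern>`.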

WeakRule

Frozen dataclass representing a single heuristic rule.

| Field | Type | Description |
| --- | --- | --- |
| name | str | Human-readable identifier, also used in provenance |
| field | str | Feature column to inspect (e.g. "app_id") |
| pattern | str | Value the column must equal for the rule to fire |
| label | str | Task-type label to assign (must be in LABEL_SET_V1) |
| confidence | float \| None | Optional confidence attached to produced spans |

Construction raises ValueError if label is not in LABEL_SET_V1.
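The validation can be illustrated with a self-contained sketch. The `WeakRule` class below mirrors the source listed further down this page, but the `LABEL_SET_V1` contents are an assumption, inferred from the labels that appear in the rule maps here; the real set may contain more entries.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed label set, inferred from the labels used on this page.
LABEL_SET_V1 = frozenset(
    {"Build", "Debug", "ReadResearch", "Communicate", "Meet", "Write", "BreakIdle"}
)


@dataclass(frozen=True)
class WeakRule:
    name: str
    field: str
    pattern: str
    label: str
    confidence: Optional[float] = None

    def __post_init__(self) -> None:
        # Reject unknown labels at construction time.
        if self.label not in LABEL_SET_V1:
            raise ValueError(f"WeakRule {self.name!r}: unknown label {self.label!r}")


ok = WeakRule(name="figma", field="app_id", pattern="com.figma.Desktop", label="Write")

try:
    WeakRule(name="typo", field="app_id", pattern="com.figma.Desktop", label="Writing")
    error = None
except ValueError as exc:
    error = str(exc)
```

Because the check runs in `__post_init__`, a misspelled label fails when the rule list is built rather than later, when rows are being labeled.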

Built-in rule maps

The module provides three dictionaries that feed build_default_rules(). They are ordered by specificity when assembled into the default list.

APP_ID_RULES (highest priority)

Maps reverse-domain app_id values to labels.

| app_id | Label |
| --- | --- |
| com.apple.Terminal | Build |
| com.microsoft.VSCode | Build |
| com.jetbrains.intellij | Build |
| com.googlecode.iterm2 | Build |
| org.mozilla.firefox | ReadResearch |
| com.google.Chrome | ReadResearch |
| com.apple.Safari | ReadResearch |
| com.apple.mail | Communicate |
| com.tinyspeck.slackmacgap | Communicate |
| us.zoom.xos | Meet |
| com.apple.Notes | Write |
| com.apple.finder | BreakIdle |

APP_CATEGORY_RULES

Maps app_category values to labels.

| Category | Label |
| --- | --- |
| lockscreen | BreakIdle |
| meeting | Meet |
| chat | Communicate |
| email | Communicate |
| editor | Build |
| terminal | Build |
| devtools | Debug |
| docs | Write |
| design | Write |
| media | BreakIdle |
| file_manager | BreakIdle |

The lockscreen category covers OS lock/login screens (macOS loginwindow, Windows LockApp.exe/LogonUI.exe, Linux screen lockers like i3lock, swaylock, gnome-screensaver, etc.). No productive task is possible while the screen is locked, so these are unconditionally labeled BreakIdle.

DOMAIN_CATEGORY_RULES (lowest priority)

Maps browser domain_category values to labels.

| Domain category | Label |
| --- | --- |
| code_hosting | Build |
| email_web | Communicate |
| chat | Communicate |
| social | BreakIdle |
| video | BreakIdle |
| news | ReadResearch |
| docs | ReadResearch |
| search | ReadResearch |
| productivity | Write |

Functions

build_default_rules

def build_default_rules() -> list[WeakRule]

Builds the default rule list from the three built-in maps, ordered by specificity: APP_ID_RULES first, then APP_CATEGORY_RULES, then DOMAIN_CATEGORY_RULES.

match_rule

def match_rule(
    row: dict[str, Any],
    rules: Sequence[WeakRule],
) -> tuple[str, str] | None

Matches a single feature row (as a dict) against an ordered list of rules. Returns (label, rule_name) on the first match, or None if no rule fires.
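The matching semantics can be sketched as follows. The function body copies the logic of the source listed further down this page; `Rule` is a hypothetical stand-in for WeakRule without label validation, and the example rows are made up for illustration.

```python
from typing import Any, NamedTuple, Optional, Sequence


class Rule(NamedTuple):
    """Hypothetical stand-in for WeakRule (no label validation)."""

    name: str
    field: str
    pattern: str
    label: str


def match_rule(row: dict[str, Any], rules: Sequence[Rule]) -> Optional[tuple[str, str]]:
    # Same logic as the library source shown on this page.
    for rule in rules:
        value = row.get(rule.field)
        if value is not None and value == rule.pattern:
            return rule.label, rule.name
    return None


rules = [
    Rule("app_id:com.apple.finder", "app_id", "com.apple.finder", "BreakIdle"),
    Rule("app_category:editor", "app_category", "editor", "Build"),
]

# Both rules could fire; the earlier one in the list wins.
both = match_rule({"app_id": "com.apple.finder", "app_category": "editor"}, rules)
# Matching is exact equality, so a differently cased value does not fire.
cased = match_rule({"app_id": "com.apple.Finder"}, rules)
# Missing or None columns are skipped rather than raising.
missing = match_rule({"app_id": None, "app_category": "editor"}, rules)
```

Despite its name, `pattern` is compared with `==`; no substring or regular-expression matching takes place.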

apply_weak_rules

def apply_weak_rules(
    features_df: pd.DataFrame,
    rules: Sequence[WeakRule] | None = None,
    user_id: str | None = None,
    bucket_seconds: int = DEFAULT_BUCKET_SECONDS,
) -> list[LabelSpan]

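Applies rules to every row in features_df, sorted by bucket_start_ts. Consecutive buckets that receive the same label are merged into a single LabelSpan. A new span starts when the label changes, when a bucket matches no rule, or when a bucket starts more than one second after the current span ends. A merged span takes its provenance and confidence from the rule that matched its first bucket. An empty features_df yields an empty list.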

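| Parameter | Default | Description |
| --- | --- | --- |
| features_df | required | DataFrame with bucket_start_ts, bucket_end_ts, and feature columns |
| rules | None | Rule list; defaults to build_default_rules() |
| user_id | None | User id attached to every produced span |
| bucket_seconds | DEFAULT_BUCKET_SECONDS (60) | Expected bucket duration; in the source listed on this page it supplies a bucket's end time when bucket_end_ts is absent |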
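The merging behavior can be sketched without pandas. This is a simplified illustration, not the library implementation: `merge_buckets` is a hypothetical helper that takes buckets already labeled (with None meaning no rule fired), and the one-second tolerance is taken from the contiguity check in the source listed further down this page.

```python
import datetime as dt
from typing import Optional

Bucket = tuple[dt.datetime, dt.datetime, Optional[str]]
Span = tuple[dt.datetime, dt.datetime, str]


def merge_buckets(
    buckets: list[Bucket],
    tolerance: dt.timedelta = dt.timedelta(seconds=1),
) -> list[Span]:
    """Merge (start, end, label) buckets into contiguous same-label spans."""
    spans: list[Span] = []
    current: Optional[Span] = None
    for start, end, label in sorted(buckets, key=lambda b: b[0]):
        if label is None:
            # An unmatched bucket closes the open span.
            if current is not None:
                spans.append(current)
                current = None
            continue
        if current is not None and label == current[2] and start <= current[1] + tolerance:
            # Same label and no gap: extend the open span.
            current = (current[0], end, label)
        else:
            # Label change or time gap: close the open span and start a new one.
            if current is not None:
                spans.append(current)
            current = (start, end, label)
    if current is not None:
        spans.append(current)
    return spans


t0 = dt.datetime(2024, 1, 1, 9, 0)
minute = dt.timedelta(minutes=1)


def bucket(i: int, label: Optional[str]) -> Bucket:
    return (t0 + i * minute, t0 + (i + 1) * minute, label)


spans = merge_buckets(
    [
        bucket(0, "Build"),
        bucket(1, "Build"),  # merges with the previous bucket
        bucket(2, "Meet"),   # label change starts a new span
        bucket(5, "Meet"),   # two-minute gap starts a new span
        bucket(6, None),     # no rule fired; closes the open span
        bucket(7, "Meet"),
    ]
)
```

The six buckets collapse into four spans: one two-minute Build span followed by three separate one-minute Meet spans.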

Usage

import pandas as pd
from taskclf.labels.weak_rules import apply_weak_rules, build_default_rules

rules = build_default_rules()
weak_spans = apply_weak_rules(features_df, rules=rules, user_id="u1")

for span in weak_spans:
    print(f"{span.start_ts} -> {span.end_ts}  {span.label}  ({span.provenance})")

Custom rules can be mixed with or replace the defaults:

from taskclf.labels.weak_rules import WeakRule, apply_weak_rules

custom_rules = [
    WeakRule(name="figma", field="app_id", pattern="com.figma.Desktop", label="Write"),
]
spans = apply_weak_rules(features_df, rules=custom_rules)
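When mixing custom rules with the defaults, list position decides which rule wins. The sketch below uses hypothetical stand-ins (`Rule`, `first_match`, and a two-entry `default_rules` list in place of build_default_rules()); the `chrome_meet` rule is invented purely to show an override.

```python
from typing import Any, NamedTuple, Optional


class Rule(NamedTuple):
    """Hypothetical stand-in for WeakRule."""

    name: str
    field: str
    pattern: str
    label: str


def first_match(row: dict[str, Any], rules: list[Rule]) -> Optional[tuple[str, str]]:
    # First rule whose column equals its pattern wins.
    return next(((r.label, r.name) for r in rules if row.get(r.field) == r.pattern), None)


# Two entries from the built-in tables, standing in for build_default_rules().
default_rules = [
    Rule("app_id:com.google.Chrome", "app_id", "com.google.Chrome", "ReadResearch"),
    Rule("app_category:design", "app_category", "design", "Write"),
]
# Hypothetical custom rule that targets the same app_id as a default.
custom_rules = [Rule("chrome_meet", "app_id", "com.google.Chrome", "Meet")]

row = {"app_id": "com.google.Chrome"}

# Custom rules first: they override any default that matches the same row.
overriding = first_match(row, [*custom_rules, *default_rules])
# Custom rules last: they only act as fallbacks after the defaults.
fallback = first_match(row, [*default_rules, *custom_rules])
```

With the real API the same idea would presumably be written as `rules=[*custom_rules, *build_default_rules()]`.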

See also

taskclf.labels.weak_rules

Heuristic weak-labeling rules that map feature rows to task-type labels.

Weak rules provide an automated, low-confidence alternative to manual labeling. Each rule inspects a single feature column (e.g. app_id, app_category, domain_category) and, when matched, proposes a taskclf.core.types.LabelSpan with provenance="weak:<rule_name>".

Rules are evaluated in list order; the first match wins. The default rule list is ordered by specificity: app_id rules first, then app_category, then domain_category.

WeakRule dataclass

A single heuristic labeling rule.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | Human-readable identifier (also used in provenance). |
| field | str | Feature column to inspect (e.g. "app_id"). |
| pattern | str | Value that the column must equal for the rule to fire. |
| label | str | Task-type label to assign (must be in LABEL_SET_V1). |
| confidence | float \| None | Optional confidence score attached to produced spans. |

Source code in src/taskclf/labels/weak_rules.py
@dataclass(frozen=True, slots=True)
class WeakRule:
    """A single heuristic labeling rule.

    Attributes:
        name: Human-readable identifier (also used in ``provenance``).
        field: Feature column to inspect (e.g. ``"app_id"``).
        pattern: Value that the column must equal for the rule to fire.
        label: Task-type label to assign (must be in ``LABEL_SET_V1``).
        confidence: Optional confidence score attached to produced spans.
    """

    name: str
    field: str
    pattern: str
    label: str
    confidence: float | None = None

    def __post_init__(self) -> None:
        if self.label not in LABEL_SET_V1:
            raise ValueError(
                f"WeakRule {self.name!r}: unknown label {self.label!r}; "
                f"must be one of {sorted(LABEL_SET_V1)}"
            )

build_default_rules()

Build the default rule list ordered by specificity.

Order: app_id rules, then app_category, then domain_category. Within each group the iteration order of the corresponding dict is preserved.

Returns:

| Type | Description |
| --- | --- |
| list[WeakRule] | List of WeakRule instances. |

Source code in src/taskclf/labels/weak_rules.py
def build_default_rules() -> list[WeakRule]:
    """Build the default rule list ordered by specificity.

    Order: ``app_id`` rules, then ``app_category``, then
    ``domain_category``.  Within each group the iteration order of the
    corresponding dict is preserved.

    Returns:
        List of :class:`WeakRule` instances.
    """
    rules: list[WeakRule] = []
    for app_id, label in APP_ID_RULES.items():
        rules.append(
            WeakRule(
                name=f"app_id:{app_id}", field="app_id", pattern=app_id, label=label
            )
        )
    for cat, label in APP_CATEGORY_RULES.items():
        rules.append(
            WeakRule(
                name=f"app_category:{cat}",
                field="app_category",
                pattern=cat,
                label=label,
            )
        )
    for dom, label in DOMAIN_CATEGORY_RULES.items():
        rules.append(
            WeakRule(
                name=f"domain_category:{dom}",
                field="domain_category",
                pattern=dom,
                label=label,
            )
        )
    return rules

match_rule(row, rules)

Match a single feature row against rules (first match wins).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| row | dict[str, Any] | Feature row as a dict (column name -> value). | required |
| rules | Sequence[WeakRule] | Ordered sequence of rules to evaluate. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple[str, str] \| None | (label, rule_name) of the first matching rule, or None if no rule fires. |

Source code in src/taskclf/labels/weak_rules.py
def match_rule(
    row: dict[str, Any],
    rules: Sequence[WeakRule],
) -> tuple[str, str] | None:
    """Match a single feature row against *rules* (first match wins).

    Args:
        row: Feature row as a dict (column name -> value).
        rules: Ordered sequence of rules to evaluate.

    Returns:
        ``(label, rule_name)`` of the first matching rule, or ``None``
        if no rule fires.
    """
    for rule in rules:
        value = row.get(rule.field)
        if value is not None and value == rule.pattern:
            return rule.label, rule.name
    return None

apply_weak_rules(features_df, rules=None, user_id=None, bucket_seconds=DEFAULT_BUCKET_SECONDS)

Apply weak rules to every row in features_df and merge spans.

Consecutive buckets (ordered by bucket_start_ts) that receive the same label are merged into a single LabelSpan. A new span starts whenever the label changes or there is a gap between buckets.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| features_df | DataFrame | DataFrame with at least bucket_start_ts and bucket_end_ts columns plus the feature columns referenced by rules. | required |
| rules | Sequence[WeakRule] \| None | Rule list to apply. Defaults to build_default_rules(). | None |
| user_id | str \| None | Optional user id attached to every produced span. | None |
| bucket_seconds | int | Expected bucket duration; used for gap detection. | DEFAULT_BUCKET_SECONDS |

Returns:

| Type | Description |
| --- | --- |
| list[LabelSpan] | List of LabelSpan with provenance="weak:<rule_name>". |

Source code in src/taskclf/labels/weak_rules.py
def apply_weak_rules(
    features_df: pd.DataFrame,
    rules: Sequence[WeakRule] | None = None,
    user_id: str | None = None,
    bucket_seconds: int = DEFAULT_BUCKET_SECONDS,
) -> list[LabelSpan]:
    """Apply weak rules to every row in *features_df* and merge spans.

    Consecutive buckets (ordered by ``bucket_start_ts``) that receive
    the **same label** are merged into a single :class:`LabelSpan`.
    A new span starts whenever the label changes or there is a gap
    between buckets.

    Args:
        features_df: DataFrame with at least ``bucket_start_ts`` and
            ``bucket_end_ts`` columns plus the feature columns
            referenced by *rules*.
        rules: Rule list to apply.  Defaults to
            :func:`build_default_rules`.
        user_id: Optional user id attached to every produced span.
        bucket_seconds: Expected bucket duration; used for gap detection.

    Returns:
        List of :class:`LabelSpan` with
        ``provenance="weak:<rule_name>"``.
    """
    if rules is None:
        rules = build_default_rules()

    if features_df.empty:
        return []

    df = features_df.sort_values("bucket_start_ts").reset_index(drop=True)

    spans: list[LabelSpan] = []
    current_label: str | None = None
    current_rule_name: str | None = None
    current_confidence: float | None = None
    span_start: dt.datetime | None = None
    span_end: dt.datetime | None = None
    expected_gap = dt.timedelta(seconds=bucket_seconds)

    for _, row_series in df.iterrows():
        row: dict[str, Any] = {str(k): v for k, v in row_series.to_dict().items()}
        bucket_start = row["bucket_start_ts"]
        bucket_end = row.get("bucket_end_ts", bucket_start + expected_gap)

        match = match_rule(row, rules)
        if match is None:
            if current_label is not None:
                spans.append(
                    LabelSpan(
                        start_ts=span_start,  # type: ignore[arg-type]
                        end_ts=span_end,  # type: ignore[arg-type]
                        label=current_label,
                        provenance=f"weak:{current_rule_name}",
                        user_id=user_id,
                        confidence=current_confidence,
                    )
                )
                current_label = None
                current_rule_name = None
                current_confidence = None
                span_start = None
                span_end = None
            continue

        label, rule_name = match
        confidence = next((r.confidence for r in rules if r.name == rule_name), None)

        is_contiguous = (
            span_end is not None and bucket_start <= span_end + dt.timedelta(seconds=1)
        )

        if label == current_label and is_contiguous:
            span_end = bucket_end
        else:
            if current_label is not None:
                spans.append(
                    LabelSpan(
                        start_ts=span_start,  # type: ignore[arg-type]
                        end_ts=span_end,  # type: ignore[arg-type]
                        label=current_label,
                        provenance=f"weak:{current_rule_name}",
                        user_id=user_id,
                        confidence=current_confidence,
                    )
                )
            current_label = label
            current_rule_name = rule_name
            current_confidence = confidence
            span_start = bucket_start
            span_end = bucket_end

    if current_label is not None:
        spans.append(
            LabelSpan(
                start_ts=span_start,  # type: ignore[arg-type]
                end_ts=span_end,  # type: ignore[arg-type]
                label=current_label,
                provenance=f"weak:{current_rule_name}",
                user_id=user_id,
                confidence=current_confidence,
            )
        )

    return spans