labels.weak_rules¶

Heuristic weak-labeling rules that map feature rows to task-type labels.

Overview¶

Weak rules provide an automated, low-confidence alternative to manual labeling. Each rule inspects a single feature column and, when matched, proposes a LabelSpan with provenance="weak:<rule_name>". Rules are evaluated in list order; the first match wins.

Weak labels share the same LabelSpan structure as gold labels. The provenance field distinguishes them: gold labels use "manual", while weak labels use the "weak:<rule_name>" convention. This allows downstream consumers (training, projection) to filter or weight labels by origin.
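As a minimal sketch of filtering or weighting by origin, the snippet below separates gold from weak spans using only the provenance string. The `Span` class is a hypothetical stand-in for LabelSpan reduced to two fields, and the 0.3 weight is an arbitrary illustrative value, not something taskclf prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for LabelSpan (only the fields used here)."""

    label: str
    provenance: str


def is_weak(span: Span) -> bool:
    """Return True when the span follows the "weak:<rule_name>" convention."""
    return span.provenance.startswith("weak:")


def sample_weight(span: Span, weak_weight: float = 0.3) -> float:
    """Down-weight weak labels; 0.3 is an illustrative assumption."""
    return weak_weight if is_weak(span) else 1.0


spans = [
    Span(label="Build", provenance="manual"),
    Span(label="Meet", provenance="weak:app_category:meeting"),
]

gold_only = [s for s in spans if not is_weak(s)]  # filter by origin
weights = [sample_weight(s) for s in spans]       # or weight by origin
print(gold_only, weights)
```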

WeakRule¶

Frozen dataclass representing a single heuristic rule.

Field	Type	Description
`name`	`str`	Human-readable identifier, also used in `provenance`
`field`	`str`	Feature column to inspect (e.g. `"app_id"`)
`pattern`	`str`	Value the column must equal for the rule to fire
`label`	`str`	Task-type label to assign (must be in `LABEL_SET_V1`)
`confidence`	`float | None`	Optional confidence attached to produced spans

Construction raises ValueError if label is not in LABEL_SET_V1.
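The validation can be tried in isolation with the sketch below. It is a condensed copy of the dataclass shown in the source listing on this page; the contents of `LABEL_SET_V1` here are an assumption (only the labels that appear in the tables on this page), and the real set may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: labels seen on this page only; the real LABEL_SET_V1 may differ.
LABEL_SET_V1 = frozenset(
    {"Build", "Debug", "ReadResearch", "Communicate", "Meet", "Write", "BreakIdle"}
)


@dataclass(frozen=True)
class WeakRule:
    """Condensed copy of the documented dataclass, for illustration."""

    name: str
    field: str
    pattern: str
    label: str
    confidence: Optional[float] = None

    def __post_init__(self) -> None:
        # Reject labels outside the known label set at construction time.
        if self.label not in LABEL_SET_V1:
            raise ValueError(
                f"WeakRule {self.name!r}: unknown label {self.label!r}; "
                f"must be one of {sorted(LABEL_SET_V1)}"
            )


ok = WeakRule(name="figma", field="app_id", pattern="com.figma.Desktop", label="Write")

try:
    WeakRule(name="typo", field="app_id", pattern="com.figma.Desktop", label="Writing")
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

Failing at construction means a misspelled label surfaces when the rule list is built rather than later, when spans are produced.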

Built-in rule maps¶

The module provides three dictionaries that feed build_default_rules(). They are ordered by specificity when assembled into the default list.

`APP_ID_RULES` (highest priority)¶

Maps reverse-domain app_id values to labels.

`app_id`	Label
`com.apple.Terminal`	Build
`com.microsoft.VSCode`	Build
`com.jetbrains.intellij`	Build
`com.googlecode.iterm2`	Build
`org.mozilla.firefox`	ReadResearch
`com.google.Chrome`	ReadResearch
`com.apple.Safari`	ReadResearch
`com.apple.mail`	Communicate
`com.tinyspeck.slackmacgap`	Communicate
`us.zoom.xos`	Meet
`com.apple.Notes`	Write
`com.apple.finder`	BreakIdle

`APP_CATEGORY_RULES`¶

Maps app_category values to labels.

Category	Label
`lockscreen`	BreakIdle
`meeting`	Meet
`chat`	Communicate
`email`	Communicate
`editor`	Build
`terminal`	Build
`devtools`	Debug
`docs`	Write
`design`	Write
`media`	BreakIdle
`file_manager`	BreakIdle

The lockscreen category covers OS lock/login screens (macOS loginwindow, Windows LockApp.exe/LogonUI.exe, Linux screen lockers like i3lock, swaylock, gnome-screensaver, etc.). No productive task is possible while the screen is locked, so these are unconditionally labeled BreakIdle.

`DOMAIN_CATEGORY_RULES` (lowest priority)¶

Maps browser domain_category values to labels.

`domain_category`	Label
`code_hosting`	Build
`email_web`	Communicate
`chat`	Communicate
`social`	BreakIdle
`video`	BreakIdle
`news`	ReadResearch
`docs`	ReadResearch
`search`	ReadResearch
`productivity`	Write

Functions¶

`build_default_rules`¶

def build_default_rules() -> list[WeakRule]

Builds the default rule list from the three built-in maps, ordered by specificity: APP_ID_RULES first, then APP_CATEGORY_RULES, then DOMAIN_CATEGORY_RULES.
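To illustrate how the ordering translates into priority, the sketch below assembles a rule list from one-entry excerpts of the three maps. `Rule` and `build_rules` are hypothetical stand-ins for WeakRule and build_default_rules(), kept small so the snippet runs without taskclf installed.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical stand-in for WeakRule."""

    name: str
    field: str
    pattern: str
    label: str


# One-entry excerpts of the built-in maps documented above.
APP_ID_RULES = {"com.microsoft.VSCode": "Build"}
APP_CATEGORY_RULES = {"devtools": "Debug"}
DOMAIN_CATEGORY_RULES = {"docs": "ReadResearch"}


def build_rules() -> list:
    """Assemble rules most-specific first, preserving dict order per group."""
    groups = [
        ("app_id", APP_ID_RULES),
        ("app_category", APP_CATEGORY_RULES),
        ("domain_category", DOMAIN_CATEGORY_RULES),
    ]
    return [
        Rule(name=f"{field}:{value}", field=field, pattern=value, label=label)
        for field, mapping in groups
        for value, label in mapping.items()
    ]


rules = build_rules()
names = [r.name for r in rules]

# A row that satisfies both an app_id rule and an app_category rule:
row = {"app_id": "com.microsoft.VSCode", "app_category": "devtools"}
winner = next(r for r in rules if row.get(r.field) == r.pattern)
print(names, winner.label)
```

Because the app_id rule sits earlier in the list, the row is labeled Build even though its category alone would map to Debug.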

`match_rule`¶

def match_rule(
    row: dict[str, Any],
    rules: Sequence[WeakRule],
) -> tuple[str, str] | None

Matches a single feature row (as a dict) against an ordered list of rules. Returns (label, rule_name) on the first match, or None if no rule fires.
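The matching semantics are easy to misread from the field name `pattern`: per the source listing on this page, the comparison is exact equality, and `None` values never match. The sketch below mirrors that loop with a hypothetical `Rule` stand-in for WeakRule; the bundle id `com.apple.Safari.helper` is made up for the example.

```python
from typing import Any, NamedTuple, Optional, Sequence, Tuple


class Rule(NamedTuple):
    """Hypothetical stand-in for WeakRule."""

    name: str
    field: str
    pattern: str
    label: str


def match_rule(row: dict, rules: Sequence[Rule]) -> Optional[Tuple[str, str]]:
    """Return (label, rule_name) for the first rule whose column equals its pattern."""
    for rule in rules:
        value: Any = row.get(rule.field)
        # Missing or None columns are skipped; otherwise exact equality is required.
        if value is not None and value == rule.pattern:
            return rule.label, rule.name
    return None


rules = [
    Rule("app_id:com.apple.Safari", "app_id", "com.apple.Safari", "ReadResearch"),
    Rule("app_category:chat", "app_category", "chat", "Communicate"),
]

# Both rules could fire; the earlier one wins.
hit = match_rule({"app_id": "com.apple.Safari", "app_category": "chat"}, rules)

# app_id is None, so evaluation falls through to the category rule.
fallback = match_rule({"app_id": None, "app_category": "chat"}, rules)

# A longer string that merely starts with the pattern does not match.
miss = match_rule({"app_id": "com.apple.Safari.helper"}, rules)

print(hit, fallback, miss)
```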

`apply_weak_rules`¶

def apply_weak_rules(
    features_df: pd.DataFrame,
    rules: Sequence[WeakRule] | None = None,
    user_id: str | None = None,
    bucket_seconds: int = DEFAULT_BUCKET_SECONDS,
) -> list[LabelSpan]

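Applies rules to every row in features_df, sorted by bucket_start_ts. Consecutive buckets that receive the same label are merged into a single LabelSpan. A new span starts when the label changes, when a row matches no rule (such rows produce no span and close the open one), or when there is a time gap between buckets. In the source listed further down this page, a bucket counts as contiguous if it starts no later than one second after the current span's end.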

Parameter	Default	Description
`features_df`	—	DataFrame with `bucket_start_ts`, `bucket_end_ts`, and feature columns
`rules`	`None`	Rule list; defaults to `build_default_rules()`
`user_id`	`None`	User id attached to every produced span
`bucket_seconds`	`60`	Expected bucket duration; used as the bucket length when a row has no `bucket_end_ts`
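The merge rule can be sketched without pandas. The function below is a simplified stand-in, not the library's implementation: it takes already-labeled `(start, end, label)` tuples, where `None` means no rule fired, and applies the same three span-breaking conditions (label change, unmatched row, time gap). The one-second tolerance matches the source listing on this page; the name `merge_buckets` is hypothetical.

```python
import datetime as dt


def merge_buckets(buckets, tolerance=dt.timedelta(seconds=1)):
    """Merge time-adjacent buckets with equal labels into (start, end, label) spans."""
    spans = []
    current = None  # mutable [start, end, label] for the open span
    for start, end, label in sorted(buckets, key=lambda b: b[0]):
        if label is None:
            # An unmatched bucket closes the open span and yields nothing.
            if current is not None:
                spans.append(tuple(current))
                current = None
            continue
        contiguous = current is not None and start <= current[1] + tolerance
        if contiguous and label == current[2]:
            current[1] = end  # extend the open span
        else:
            if current is not None:
                spans.append(tuple(current))
            current = [start, end, label]
    if current is not None:
        spans.append(tuple(current))
    return spans


t0 = dt.datetime(2024, 1, 1, 9, 0)
minute = dt.timedelta(minutes=1)
buckets = [
    (t0, t0 + minute, "Build"),
    (t0 + minute, t0 + 2 * minute, "Build"),         # merged with the first
    (t0 + 2 * minute, t0 + 3 * minute, None),        # no rule fired
    (t0 + 3 * minute, t0 + 4 * minute, "Build"),     # new span after the unmatched row
    (t0 + 10 * minute, t0 + 11 * minute, "Build"),   # new span after a time gap
    (t0 + 11 * minute, t0 + 12 * minute, "Meet"),    # new span on label change
]
spans = merge_buckets(buckets)
for span in spans:
    print(span)
```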

Usage¶

import pandas as pd
from taskclf.labels.weak_rules import apply_weak_rules, build_default_rules

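# features_df: an existing DataFrame with bucket_start_ts, bucket_end_ts,
# and the feature columns the rules inspect (app_id, app_category, ...).
rules = build_default_rules()
weak_spans = apply_weak_rules(features_df, rules=rules, user_id="u1")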

for span in weak_spans:
    print(f"{span.start_ts} → {span.end_ts}  {span.label}  ({span.provenance})")

Custom rules can be mixed with or replace the defaults:

from taskclf.labels.weak_rules import WeakRule, apply_weak_rules

custom_rules = [
    WeakRule(name="figma", field="app_id", pattern="com.figma.Desktop", label="Write"),
]
spans = apply_weak_rules(features_df, rules=custom_rules)
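The snippet above replaces the defaults. For mixing, list position decides priority, so where the custom rules are placed matters. The sketch below shows the effect with hypothetical stand-ins (`Rule`, `first_match`) and a one-entry excerpt of the defaults; the Build label for Figma is chosen only to make the override visible.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical stand-in for WeakRule."""

    name: str
    field: str
    pattern: str
    label: str


def first_match(row, rules):
    """First-match-wins lookup, mirroring the documented match_rule behaviour."""
    for rule in rules:
        value = row.get(rule.field)
        if value is not None and value == rule.pattern:
            return rule.label, rule.name
    return None


# One-entry excerpt standing in for build_default_rules().
defaults = [Rule("app_category:design", "app_category", "design", "Write")]
# Illustrative custom rule that disagrees with the default category mapping.
custom = [Rule("figma", "app_id", "com.figma.Desktop", "Build")]

row = {"app_id": "com.figma.Desktop", "app_category": "design"}

custom_first = first_match(row, custom + defaults)   # custom rule overrides
defaults_first = first_match(row, defaults + custom)  # default still wins
print(custom_first, defaults_first)
```

With the real API the equivalent would presumably be `custom_rules + build_default_rules()`, since build_default_rules() is documented to return a plain list.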

`taskclf.labels.weak_rules` ¶

Heuristic weak-labeling rules that map feature rows to task-type labels.

Weak rules provide an automated, low-confidence alternative to manual labeling. Each rule inspects a single feature column (e.g. app_id, app_category, domain_category) and, when matched, proposes a `taskclf.core.types.LabelSpan` with provenance="weak:<rule_name>".

Rules are evaluated in list order; the first match wins. The default rule list is ordered by specificity: app_id rules first, then app_category, then domain_category.

`WeakRule` `dataclass` ¶

A single heuristic labeling rule.

Attributes:

Name	Type	Description
`name`	`str`	Human-readable identifier (also used in `provenance`).
`field`	`str`	Feature column to inspect (e.g. `"app_id"`).
`pattern`	`str`	Value that the column must equal for the rule to fire.
`label`	`str`	Task-type label to assign (must be in `LABEL_SET_V1`).
`confidence`	`float | None`	Optional confidence score attached to produced spans.

Source code in src/taskclf/labels/weak_rules.py

@dataclass(frozen=True, slots=True)
class WeakRule:
    """A single heuristic labeling rule.

    Attributes:
        name: Human-readable identifier (also used in ``provenance``).
        field: Feature column to inspect (e.g. ``"app_id"``).
        pattern: Value that the column must equal for the rule to fire.
        label: Task-type label to assign (must be in ``LABEL_SET_V1``).
        confidence: Optional confidence score attached to produced spans.
    """

    name: str
    field: str
    pattern: str
    label: str
    confidence: float | None = None

    def __post_init__(self) -> None:
        if self.label not in LABEL_SET_V1:
            raise ValueError(
                f"WeakRule {self.name!r}: unknown label {self.label!r}; "
                f"must be one of {sorted(LABEL_SET_V1)}"
            )

`build_default_rules()` ¶

Build the default rule list ordered by specificity.

Order: app_id rules, then app_category, then domain_category. Within each group the iteration order of the corresponding dict is preserved.

Returns:

Type	Description
`list[WeakRule]`	List of `WeakRule` instances.

Source code in src/taskclf/labels/weak_rules.py

def build_default_rules() -> list[WeakRule]:
    """Build the default rule list ordered by specificity.

    Order: ``app_id`` rules, then ``app_category``, then
    ``domain_category``.  Within each group the iteration order of the
    corresponding dict is preserved.

    Returns:
        List of :class:`WeakRule` instances.
    """
    rules: list[WeakRule] = []
    for app_id, label in APP_ID_RULES.items():
        rules.append(
            WeakRule(
                name=f"app_id:{app_id}", field="app_id", pattern=app_id, label=label
            )
        )
    for cat, label in APP_CATEGORY_RULES.items():
        rules.append(
            WeakRule(
                name=f"app_category:{cat}",
                field="app_category",
                pattern=cat,
                label=label,
            )
        )
    for dom, label in DOMAIN_CATEGORY_RULES.items():
        rules.append(
            WeakRule(
                name=f"domain_category:{dom}",
                field="domain_category",
                pattern=dom,
                label=label,
            )
        )
    return rules

`match_rule(row, rules)` ¶

Match a single feature row against rules (first match wins).

Parameters:

Name	Type	Description	Default
`row`	`dict[str, Any]`	Feature row as a dict (column name -> value).	required
`rules`	`Sequence[WeakRule]`	Ordered sequence of rules to evaluate.	required

Returns:

Type	Description
`tuple[str, str] | None`	`(label, rule_name)` of the first matching rule, or `None` if no rule fires.

Source code in src/taskclf/labels/weak_rules.py

def match_rule(
    row: dict[str, Any],
    rules: Sequence[WeakRule],
) -> tuple[str, str] | None:
    """Match a single feature row against *rules* (first match wins).

    Args:
        row: Feature row as a dict (column name -> value).
        rules: Ordered sequence of rules to evaluate.

    Returns:
        ``(label, rule_name)`` of the first matching rule, or ``None``
        if no rule fires.
    """
    for rule in rules:
        value = row.get(rule.field)
        if value is not None and value == rule.pattern:
            return rule.label, rule.name
    return None

`apply_weak_rules(features_df, rules=None, user_id=None, bucket_seconds=DEFAULT_BUCKET_SECONDS)` ¶

Apply weak rules to every row in features_df and merge spans.

Consecutive buckets (ordered by bucket_start_ts) that receive the same label are merged into a single LabelSpan. A new span starts whenever the label changes or there is a gap between buckets.

Parameters:

Name	Type	Description	Default
`features_df`	`DataFrame`	DataFrame with at least `bucket_start_ts` and `bucket_end_ts` columns plus the feature columns referenced by rules.	required
`rules`	`Sequence[WeakRule] | None`	Rule list to apply. Defaults to `build_default_rules()`.	`None`
`user_id`	`str | None`	Optional user id attached to every produced span.	`None`
`bucket_seconds`	`int`	Expected bucket duration; used for gap detection.	`DEFAULT_BUCKET_SECONDS`

Returns:

Type	Description
`list[LabelSpan]`	List of `LabelSpan` with `provenance="weak:<rule_name>"`.

Source code in src/taskclf/labels/weak_rules.py

def apply_weak_rules(
    features_df: pd.DataFrame,
    rules: Sequence[WeakRule] | None = None,
    user_id: str | None = None,
    bucket_seconds: int = DEFAULT_BUCKET_SECONDS,
) -> list[LabelSpan]:
    """Apply weak rules to every row in *features_df* and merge spans.

    Consecutive buckets (ordered by ``bucket_start_ts``) that receive
    the **same label** are merged into a single :class:`LabelSpan`.
    A new span starts whenever the label changes or there is a gap
    between buckets.

    Args:
        features_df: DataFrame with at least ``bucket_start_ts`` and
            ``bucket_end_ts`` columns plus the feature columns
            referenced by *rules*.
        rules: Rule list to apply.  Defaults to
            :func:`build_default_rules`.
        user_id: Optional user id attached to every produced span.
        bucket_seconds: Expected bucket duration; used for gap detection.

    Returns:
        List of :class:`LabelSpan` with
        ``provenance="weak:<rule_name>"``.
    """
    if rules is None:
        rules = build_default_rules()

    if features_df.empty:
        return []

    df = features_df.sort_values("bucket_start_ts").reset_index(drop=True)

    spans: list[LabelSpan] = []
    current_label: str | None = None
    current_rule_name: str | None = None
    current_confidence: float | None = None
    span_start: dt.datetime | None = None
    span_end: dt.datetime | None = None
    expected_gap = dt.timedelta(seconds=bucket_seconds)

    for _, row_series in df.iterrows():
        row: dict[str, Any] = {str(k): v for k, v in row_series.to_dict().items()}
        bucket_start = row["bucket_start_ts"]
        bucket_end = row.get("bucket_end_ts", bucket_start + expected_gap)

        match = match_rule(row, rules)
        if match is None:
            if current_label is not None:
                spans.append(
                    LabelSpan(
                        start_ts=span_start,  # type: ignore[arg-type]
                        end_ts=span_end,  # type: ignore[arg-type]
                        label=current_label,
                        provenance=f"weak:{current_rule_name}",
                        user_id=user_id,
                        confidence=current_confidence,
                    )
                )
                current_label = None
                current_rule_name = None
                current_confidence = None
                span_start = None
                span_end = None
            continue

        label, rule_name = match
        confidence = next((r.confidence for r in rules if r.name == rule_name), None)

        is_contiguous = (
            span_end is not None and bucket_start <= span_end + dt.timedelta(seconds=1)
        )

        if label == current_label and is_contiguous:
            span_end = bucket_end
        else:
            if current_label is not None:
                spans.append(
                    LabelSpan(
                        start_ts=span_start,  # type: ignore[arg-type]
                        end_ts=span_end,  # type: ignore[arg-type]
                        label=current_label,
                        provenance=f"weak:{current_rule_name}",
                        user_id=user_id,
                        confidence=current_confidence,
                    )
                )
            current_label = label
            current_rule_name = rule_name
            current_confidence = confidence
            span_start = bucket_start
            span_end = bucket_end

    if current_label is not None:
        spans.append(
            LabelSpan(
                start_ts=span_start,  # type: ignore[arg-type]
                end_ts=span_end,  # type: ignore[arg-type]
                label=current_label,
                provenance=f"weak:{current_rule_name}",
                user_id=user_id,
                confidence=current_confidence,
            )
        )

    return spans
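One consequence of the listing above is worth spelling out: when contiguous buckets are merged, only the span end is updated, so the span keeps the rule name (and confidence) of its first bucket even if later buckets matched a different rule with the same label. The sketch below reproduces just that behaviour with plain dicts; `merge_with_provenance` is a hypothetical helper, not part of taskclf.

```python
import datetime as dt

TOLERANCE = dt.timedelta(seconds=1)  # same one-second slack as the listing above


def merge_with_provenance(matches):
    """Merge time-ordered (start, end, label, rule_name) tuples into span dicts."""
    spans = []
    for start, end, label, rule_name in matches:
        last = spans[-1] if spans else None
        if last and last["label"] == label and start <= last["end"] + TOLERANCE:
            # Extend only; provenance stays that of the span's first bucket.
            last["end"] = end
        else:
            spans.append(
                {
                    "start": start,
                    "end": end,
                    "label": label,
                    "provenance": f"weak:{rule_name}",
                }
            )
    return spans


t0 = dt.datetime(2024, 1, 1, 9, 0)
step = dt.timedelta(seconds=60)
matches = [
    (t0, t0 + step, "Build", "app_id:com.microsoft.VSCode"),
    (t0 + step, t0 + 2 * step, "Build", "app_category:terminal"),
]
merged = merge_with_provenance(matches)
print(merged)
```

Consumers that weight spans per rule may therefore want to treat the provenance of a merged span as "the rule that opened the span" rather than "the rule behind every bucket".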
