features/text
Hash-based window-title featurization (privacy-safe, no raw titles).
taskclf.features.text
Window-title featurization using hash-based approaches.
All functions operate on hashed or salted representations — raw window
titles are never stored or returned. This satisfies the project's
privacy invariant (see AGENTS.md).
TitleSketchFeatures (dataclass)
Privacy-safe title features derived from a raw window title.
Source code in src/taskclf/features/text.py
normalize_title(raw_title)
Normalize a raw window title before featurization.
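The exact normalization rules are not stated on this page. As a minimal sketch of what such a step could look like, the hypothetical `normalize_title_sketch` below assumes Unicode NFKC folding, case folding, and whitespace collapsing; none of these choices are confirmed to match `normalize_title`.

```python
import re
import unicodedata


def normalize_title_sketch(raw_title: str) -> str:
    """Hypothetical normalizer; an assumption, not the taskclf implementation."""
    # Fold compatibility forms (e.g. full-width letters) into canonical ones.
    text = unicodedata.normalize("NFKC", raw_title)
    # Use a case-insensitive comparison form.
    text = text.casefold()
    # Collapse runs of whitespace into single spaces and trim both ends.
    return re.sub(r"\s+", " ", text).strip()
```

Normalizing before hashing means that titles differing only in case or spacing map to the same downstream features.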
derive_title_sketch_features(raw_title, secret, *, token_buckets=DEFAULT_TITLE_TOKEN_SKETCH_BUCKETS, char3_buckets=DEFAULT_TITLE_CHAR3_SKETCH_BUCKETS)
Convert a raw title into non-reversible keyed sketch features.
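One common way to build keyed sketch features is to hash each token and each character trigram with HMAC and count hits per bucket. The sketch below illustrates that idea only: `sketch_title`, the `_bucket` helper, the `bytes` type of `secret`, the bucket counts of 8 and 16, and the tuple return shape are all assumptions, and the real function may differ in every one of them.

```python
import hashlib
import hmac


def _bucket(secret: bytes, kind: str, piece: str, n_buckets: int) -> int:
    # Keyed hash: without the secret, bucket indices cannot be recomputed.
    msg = f"{kind}:{piece}".encode("utf-8")
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def sketch_title(raw_title, secret, *, token_buckets=8, char3_buckets=16):
    """Hypothetical keyed sketch; placeholder bucket sizes, not the defaults."""
    # Assumed normalization: case folding and whitespace collapsing.
    text = " ".join(raw_title.casefold().split())

    # Count whitespace-separated tokens per keyed bucket.
    token_counts = [0] * token_buckets
    for token in text.split(" "):
        if token:
            token_counts[_bucket(secret, "tok", token, token_buckets)] += 1

    # Count overlapping character trigrams per keyed bucket.
    char3_counts = [0] * char3_buckets
    for i in range(len(text) - 2):
        trigram = text[i:i + 3]
        char3_counts[_bucket(secret, "c3", trigram, char3_buckets)] += 1

    return token_counts, char3_counts
```

Only the fixed-length count vectors leave the function; the title text itself is not part of the result.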
featurize_title(raw_title, salt)
Convert a raw window title into a privacy-safe salted hash.
This is the single entry-point for title featurization. Callers should discard the raw title immediately after calling this function.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `raw_title` | `str` | The original window title string. | required |
| `salt` | `str` | A per-installation or per-session secret used to prevent rainbow-table attacks on the hash. | required |
Returns:
| Type | Description |
|---|---|
| `str` | A 12-character hex digest that is deterministic for the same (raw_title, salt) pair but infeasible to reverse. |
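The page does not say which hash construction is used. The following sketch shows one way to get a deterministic 12-character hex digest from a (raw_title, salt) pair; the function name `salted_title_hash`, the choice of SHA-256, and the NUL separator are assumptions for illustration.

```python
import hashlib


def salted_title_hash(raw_title: str, salt: str) -> str:
    """Hypothetical stand-in for featurize_title; construction is assumed."""
    # Separate salt and title with NUL so ("ab", "c") and ("a", "bc") differ.
    payload = f"{salt}\x00{raw_title}".encode("utf-8")
    # Keep the first 12 hex characters (48 bits) of the digest.
    return hashlib.sha256(payload).hexdigest()[:12]


title_hash = salted_title_hash("Inbox - Mail", "per-install-salt")
```

After this call the caller keeps only `title_hash` and drops the raw title, as the description above recommends.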
title_hash_bucket(title_hash, n_buckets=256)
Map a title hash to an integer bucket index via the hash trick.
Useful for converting the opaque hex hash into a bounded categorical feature that tree-based or embedding models can consume directly.
If title_hash is not valid hex (e.g. from test data), falls back to a SHA-256 digest for deterministic bucketing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `title_hash` | `str` | Hex string produced by :func: | required |
| `n_buckets` | `int` | Size of the output space. Must be >= 1. | `256` |
Returns:
| Type | Description |
|---|---|
| `int` | Integer in |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If n_buckets < 1. |
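The behaviour described for this function (hex parsing, a SHA-256 fallback for non-hex input, and a `ValueError` for an invalid bucket count) can be sketched as follows. The name `bucket_for_hash` and the use of a plain modulo are assumptions; the actual implementation may reduce the hash differently.

```python
import hashlib


def bucket_for_hash(title_hash: str, n_buckets: int = 256) -> int:
    """Hypothetical sketch of hash-trick bucketing; not the taskclf source."""
    if n_buckets < 1:
        raise ValueError("n_buckets must be >= 1")
    try:
        # Interpret the hex digest as an integer.
        value = int(title_hash, 16)
    except ValueError:
        # Not valid hex: derive an integer from a SHA-256 digest instead,
        # so the same input still lands in the same bucket every time.
        digest = hashlib.sha256(title_hash.encode("utf-8")).hexdigest()
        value = int(digest, 16)
    # Python's modulo is non-negative for a positive divisor.
    return value % n_buckets
```

In this sketch the result always satisfies `0 <= bucket < n_buckets`, which makes it usable as a bounded categorical feature.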