infer.taxonomy

User-specific taxonomy mapping: core labels to user-defined buckets.

Overview

The taxonomy layer sits between the model's core 8-class predictions and the user-facing display. It maps one or more core labels into user-defined buckets via a YAML config, without altering the underlying core predictions.

core label + core_probs → TaxonomyResolver → mapped_label + mapped_probs
TaxonomyResolver is implemented as a dataclass; its only constructor argument is a validated TaxonomyConfig.

See the taxonomy guide and configs/user_taxonomy_example.yaml for configuration details.

Config model hierarchy

TaxonomyConfig

Top-level config loaded from YAML.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| version | str | "1.0" | Config schema version |
| label_schema_version | str | "labels_v1" | Expected label schema |
| user_id | str \| None | None | Optional user scope |
| display | TaxonomyDisplay | (defaults) | Display preferences |
| reject | TaxonomyReject | (defaults) | Rejection display settings |
| buckets | list[TaxonomyBucket] | (required) | At least one bucket |
| advanced | TaxonomyAdvanced | (defaults) | Tuning knobs |

TaxonomyBucket

A user-facing task category that aggregates one or more core labels.

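| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | (required) | Unique, non-empty display name |
| description | str | "" | Human-readable description |
| core_labels | list[str] | (required) | Core labels mapped to this bucket (at least one; each must be a valid LABEL_SET_V1 entry) |
| color | str | "#808080" | Hex color for display (#RRGGBB) |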

TaxonomyDisplay

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| show_core_labels | bool | False | Show underlying core labels in UI |
| default_view | "mapped" \| "core" | "mapped" | Default view mode |
| color_theme | str | "default" | Color theme name |

TaxonomyReject

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| mixed_label_name | str | "Mixed/Unknown" | Label shown for rejected predictions |
| include_rejected_in_reports | bool | False | Include rejected buckets in reports |

TaxonomyAdvanced

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| probability_aggregation | "sum" \| "max" | "sum" | How core-label probs are combined per bucket |
| min_confidence_for_mapping | float | 0.55 | Minimum confidence for mapping |
| reweight_core_labels | dict[str, float] | {} | Per-label probability multipliers |
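To make the nesting of these models concrete, the sketch below writes one config as a plain Python dict, which is roughly the shape a parsed YAML file takes before validation. The bucket names and the core labels (`label_a`, `label_b`, `label_c`) are placeholders chosen for illustration, not real LABEL_SET_V1 entries; the default values are copied from the tables above.

```python
# Plain-dict mirror of a taxonomy config (the shape a parsed YAML file takes).
# "label_a" .. "label_c" are placeholder core labels, NOT real LABEL_SET_V1 entries.
raw_config = {
    "version": "1.0",
    "label_schema_version": "labels_v1",
    "user_id": None,
    "display": {
        "show_core_labels": False,
        "default_view": "mapped",
        "color_theme": "default",
    },
    "reject": {
        "mixed_label_name": "Mixed/Unknown",
        "include_rejected_in_reports": False,
    },
    "buckets": [
        {
            "name": "Deep Work",
            "description": "Focused tasks",
            "core_labels": ["label_a", "label_b"],
            "color": "#2E86C1",
        },
        {
            "name": "Comms",
            "description": "Messaging",
            "core_labels": ["label_c"],
            "color": "#E67E22",
        },
    ],
    "advanced": {
        "probability_aggregation": "sum",
        "min_confidence_for_mapping": 0.55,
        "reweight_core_labels": {},
    },
}

# Only "buckets" is required; every other top-level key has a default.
required_keys = {"buckets"}
bucket_names = [b["name"] for b in raw_config["buckets"]]
```

With real core labels substituted, a dict of this shape is what `TaxonomyConfig.model_validate` receives inside `load_taxonomy`.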

TaxonomyResolver

Stateless mapper from core predictions to user-defined buckets. Precomputes index lookups at construction time for fast per-row resolution.

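from pathlib import Path
from taskclf.infer.taxonomy import load_taxonomy, TaxonomyResolver

# Load and validate the user's YAML config, then build the resolver once.
config = load_taxonomy(Path("configs/user_taxonomy.yaml"))
resolver = TaxonomyResolver(config)

# core_label_id: predicted core label index; core_probs: shape (8,) probability vector.
result = resolver.resolve(core_label_id, core_probs)
print(result.mapped_label, result.mapped_probs)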

resolve_batch maps an entire batch at once:

# pred_indices: shape (N,) core label indices; proba_matrix: shape (N, 8) probabilities.
results = resolver.resolve_batch(pred_indices, proba_matrix)
mapped_labels = [r.mapped_label for r in results]
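Per the source listing further down, the batch call is a row-by-row loop over `resolve` with an optional rejection mask. The sketch below mirrors that control flow without importing taskclf: `resolve_row` and `resolve_rows` are hypothetical stand-ins that use one bucket per probability entry, and rejected rows receive the reject label with an empty probability mapping.

```python
from typing import Optional

MIXED_LABEL = "Mixed/Unknown"  # default value of reject.mixed_label_name


def resolve_row(probs: list[float], is_rejected: bool = False) -> tuple[str, dict[str, float]]:
    """Hypothetical stand-in for TaxonomyResolver.resolve (identity buckets)."""
    if is_rejected:
        # Rejected rows short-circuit: reject label, no probabilities.
        return MIXED_LABEL, {}
    names = [f"bucket_{i}" for i in range(len(probs))]
    best = max(range(len(probs)), key=probs.__getitem__)
    return names[best], dict(zip(names, probs))


def resolve_rows(rows: list[list[float]], is_rejected: Optional[list[bool]] = None):
    """Hypothetical stand-in for resolve_batch: one result per row."""
    if is_rejected is None:
        is_rejected = [False] * len(rows)  # default: nothing is rejected
    return [resolve_row(r, rej) for r, rej in zip(rows, is_rejected)]


results = resolve_rows([[0.7, 0.3], [0.5, 0.5]], is_rejected=[False, True])
```

Because rejected rows carry an empty mapping, downstream code that reads `mapped_probs` should not assume every bucket name is present.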

Aggregation modes

When a bucket contains multiple core labels, their probabilities are combined using the configured aggregation mode:

  • sum (default) -- probabilities are summed, then the full vector is renormalized.
  • max -- the maximum probability among the bucket's core labels is used, then renormalized.
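The two modes can pick different winners for the same input. The following is a minimal pure-Python sketch of the aggregation step, assuming three core labels instead of eight for brevity; `aggregate` is an illustrative helper, not a taskclf function.

```python
def aggregate(probs, buckets, mode="sum"):
    """Combine core-label probabilities per bucket, then renormalize."""
    combine = sum if mode == "sum" else max
    raw = [combine(probs[i] for i in idx) for idx in buckets]
    total = sum(raw)
    # Guard against an all-zero vector, as the resolver does.
    return [p / total for p in raw] if total > 0 else raw


core_probs = [0.30, 0.25, 0.45]  # three core labels for brevity (the model has 8)
buckets = [[0, 1], [2]]          # bucket 0 holds labels 0 and 1; bucket 1 holds label 2

by_sum = aggregate(core_probs, buckets, "sum")  # about [0.55, 0.45]: bucket 0 wins
by_max = aggregate(core_probs, buckets, "max")  # raw [0.30, 0.45] -> about [0.4, 0.6]: bucket 1 wins
```

Under sum, a bucket with many moderately likely labels can outrank a single confident label; under max, bucket size gives no such advantage.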

Fallback bucket

Core labels not assigned to any user bucket are automatically collected into an "Other" fallback bucket, which is appended after the user-defined buckets. When this happens, the resolver logs an INFO-level message listing the unmapped labels.
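The fallback rule amounts to a set difference between the full label set and the labels covered by user buckets. A small sketch, assuming a four-label placeholder set (the real LABEL_SET_V1 entries are not listed on this page):

```python
# Placeholder label set; the real LABEL_SET_V1 has 8 entries not shown here.
CORE_LABELS = {"label_a", "label_b", "label_c", "label_d"}
FALLBACK = "Other"

user_buckets = {"Focus": ["label_a", "label_b"]}

# Every label that appears in at least one user bucket.
covered = {lbl for labels in user_buckets.values() for lbl in labels}
uncovered = sorted(CORE_LABELS - covered)

effective = dict(user_buckets)
if uncovered:
    # The fallback is added last, so user buckets keep their positions.
    effective[FALLBACK] = uncovered
```

If every core label is covered, `uncovered` is empty and no fallback bucket is created.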

Reweighting

advanced.reweight_core_labels adjusts core-label probabilities before mapping. Each entry maps a core label to a positive weight; the probability vector is multiplied element-wise by these weights (labels not listed keep a weight of 1.0) and then renormalized. This can bias the mapping toward or away from specific core labels without retraining.
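As a worked example of the multiply-then-renormalize step, the sketch below doubles the weight of the second label in a three-label vector (shortened from eight for readability). `reweight` is an illustrative helper, not part of taskclf.

```python
def reweight(probs, weights):
    """Scale each probability by its weight, then renormalize to sum to 1."""
    scaled = [p * w for p, w in zip(probs, weights)]
    total = sum(scaled)
    return [p / total for p in scaled] if total > 0 else scaled


core_probs = [0.5, 0.3, 0.2]
weights = [1.0, 2.0, 1.0]  # only label 1 is reweighted; the others keep 1.0

adjusted = reweight(core_probs, weights)
# scaled = [0.5, 0.6, 0.2], total = 1.3, so label 1 now has the highest probability
```

A weight above 1.0 favours a label and a weight between 0 and 1 suppresses it; the config validator rejects weights that are zero or negative.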

I/O helpers

  • load_taxonomy(path) -- load and validate a YAML config.
  • save_taxonomy(config, path) -- serialize a config to YAML.
  • default_taxonomy() -- create an identity mapping (one bucket per core label) as a starting point for customisation.
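One way to use the identity mapping as a starting point is to generate one bucket per label and then merge the ones that belong together. The sketch below does this with plain dicts and placeholder labels (`label_a` to `label_c` are assumptions, not real core labels); the description format follows `default_taxonomy()`.

```python
# Placeholder labels; the real core label set is defined elsewhere in taskclf.
CORE_LABELS = ["label_a", "label_b", "label_c"]

# Identity mapping: one bucket per core label, named after the label.
identity_buckets = [
    {"name": lbl, "description": f"Core label: {lbl}", "core_labels": [lbl]}
    for lbl in CORE_LABELS
]

# Customisation: merge two labels into one bucket, keep the rest as-is.
to_merge = {"label_a", "label_b"}
custom_buckets = [
    {"name": "Combined", "description": "label_a + label_b", "core_labels": sorted(to_merge)}
] + [b for b in identity_buckets if b["name"] not in to_merge]
```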

taskclf.infer.taxonomy

User-specific taxonomy mapping: core labels -> user-defined buckets.

This module implements the personalization mapping layer described in docs/guide/model_io.md Section 5. It converts model predictions (core label + probability vector) into user-facing bucket labels with aggregated probabilities, without altering the underlying core predictions.

Typical flow:

config = load_taxonomy(Path("configs/user_taxonomy.yaml"))
resolver = TaxonomyResolver(config)
result = resolver.resolve(core_label_id, core_probs)
# result.mapped_label, result.mapped_probs

TaxonomyBucket

Bases: BaseModel

A user-facing task category that aggregates one or more core labels.

Source code in src/taskclf/infer/taxonomy.py
class TaxonomyBucket(BaseModel, frozen=True):
    """A user-facing task category that aggregates one or more core labels."""

    name: str = Field(min_length=1, description="Unique display name for this bucket.")
    description: str = Field(default="", description="Human-readable description.")
    core_labels: list[str] = Field(
        min_length=1, description="Core labels mapped to this bucket."
    )
    color: str = Field(default="#808080", description="Hex color for display.")

    @model_validator(mode="after")
    def _validate(self) -> TaxonomyBucket:
        for label in self.core_labels:
            if label not in LABEL_SET_V1:
                raise ValueError(
                    f"Unknown core label {label!r} in bucket {self.name!r}; "
                    f"must be one of {_CORE_LABEL_NAMES}"
                )
        if not _HEX_COLOR_RE.match(self.color):
            raise ValueError(
                f"Invalid hex color {self.color!r} in bucket {self.name!r}; "
                f"expected format #RRGGBB"
            )
        return self
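The definition of `_HEX_COLOR_RE` is not shown on this page. A pattern consistent with the documented `#RRGGBB` format might look like the sketch below; the exact regex and the helper name `is_hex_color` are assumptions.

```python
import re

# Assumed pattern: "#" followed by exactly six hex digits (either case).
HEX_COLOR_RE = re.compile(r"#[0-9A-Fa-f]{6}")


def is_hex_color(value: str) -> bool:
    """Return True if value is a full #RRGGBB color string."""
    # fullmatch rejects trailing characters such as "#8080800" or a newline.
    return HEX_COLOR_RE.fullmatch(value) is not None


checks = {c: is_hex_color(c) for c in ["#808080", "#2e86c1", "#FFF", "808080", "#80808G"]}
```

Only the first two strings pass; shorthand (`#FFF`), a missing `#`, and non-hex digits are all rejected.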

TaxonomyDisplay

Bases: BaseModel

User display preferences (not used by resolver logic).

Source code in src/taskclf/infer/taxonomy.py
class TaxonomyDisplay(BaseModel, frozen=True):
    """User display preferences (not used by resolver logic)."""

    show_core_labels: bool = False
    default_view: Literal["mapped", "core"] = "mapped"
    color_theme: str = "default"

TaxonomyReject

Bases: BaseModel

How rejected predictions are surfaced.

Source code in src/taskclf/infer/taxonomy.py
class TaxonomyReject(BaseModel, frozen=True):
    """How rejected predictions are surfaced."""

    mixed_label_name: str = MIXED_UNKNOWN
    include_rejected_in_reports: bool = False

TaxonomyAdvanced

Bases: BaseModel

Advanced mapping tuning knobs.

Source code in src/taskclf/infer/taxonomy.py
class TaxonomyAdvanced(BaseModel, frozen=True):
    """Advanced mapping tuning knobs."""

    probability_aggregation: Literal["sum", "max"] = "sum"
    min_confidence_for_mapping: float = Field(default=0.55, ge=0.0, le=1.0)
    reweight_core_labels: dict[str, float] = Field(default_factory=dict)

    @model_validator(mode="after")
    def _validate_reweights(self) -> TaxonomyAdvanced:
        for label, weight in self.reweight_core_labels.items():
            if label not in LABEL_SET_V1:
                raise ValueError(
                    f"Unknown core label {label!r} in reweight_core_labels; "
                    f"must be one of {_CORE_LABEL_NAMES}"
                )
            if weight <= 0:
                raise ValueError(f"Reweight for {label!r} must be > 0, got {weight}")
        return self

TaxonomyConfig

Bases: BaseModel

Full user-specific taxonomy mapping configuration.

Loaded from a YAML file matching the format in configs/user_taxonomy_example.yaml.

Source code in src/taskclf/infer/taxonomy.py
class TaxonomyConfig(BaseModel, frozen=True):
    """Full user-specific taxonomy mapping configuration.

    Loaded from a YAML file matching the format in
    ``configs/user_taxonomy_example.yaml``.
    """

    version: str = "1.0"
    label_schema_version: str = "labels_v1"
    user_id: str | None = None
    display: TaxonomyDisplay = Field(default_factory=TaxonomyDisplay)
    reject: TaxonomyReject = Field(default_factory=TaxonomyReject)
    buckets: list[TaxonomyBucket] = Field(min_length=1)
    advanced: TaxonomyAdvanced = Field(default_factory=TaxonomyAdvanced)

    @model_validator(mode="after")
    def _validate_config(self) -> TaxonomyConfig:
        names = [b.name for b in self.buckets]
        if len(names) != len(set(names)):
            seen: set[str] = set()
            dupes = [n for n in names if n in seen or seen.add(n)]  # type: ignore[func-returns-value]
            raise ValueError(f"Duplicate bucket names: {dupes}")
        return self
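The duplicate check relies on `set.add` returning `None`, which makes the comprehension compact but easy to misread. An equivalent, more explicit sketch (the function name `find_duplicates` is illustrative):

```python
def find_duplicates(names: list[str]) -> list[str]:
    """Return every repeated occurrence of a name, in input order."""
    seen: set[str] = set()
    dupes: list[str] = []
    for name in names:
        if name in seen:
            dupes.append(name)  # second and later occurrences are reported
        else:
            seen.add(name)      # first occurrence is only recorded
    return dupes


dupes = find_duplicates(["Work", "Play", "Work", "Work"])  # ["Work", "Work"]
```

A name that appears three times is reported twice, which matches the behaviour of the one-line comprehension in the validator.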

TaxonomyResult

Bases: BaseModel

Output of the taxonomy mapping resolver for a single window.

Source code in src/taskclf/infer/taxonomy.py
class TaxonomyResult(BaseModel, frozen=True):
    """Output of the taxonomy mapping resolver for a single window."""

    mapped_label: str = Field(description="User-facing bucket label.")
    mapped_probs: dict[str, float] = Field(
        description="Bucket name -> aggregated probability (sums to 1.0)."
    )

TaxonomyResolver dataclass

Stateless mapper from core predictions to user-defined buckets.

Precomputes index lookups at construction time so that per-row resolution is fast.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config | TaxonomyConfig | Validated taxonomy config. | required |

Source code in src/taskclf/infer/taxonomy.py
@dataclass(eq=False)
class TaxonomyResolver:
    """Stateless mapper from core predictions to user-defined buckets.

    Precomputes index lookups at construction time so that per-row
    resolution is fast.

    Args:
        config: Validated taxonomy config.
    """

    config: TaxonomyConfig
    _bucket_names: list[str] = field(init=False)
    _bucket_core_indices: list[list[int]] = field(init=False)
    _has_fallback: bool = field(init=False)
    _n_buckets: int = field(init=False)
    _agg: Literal["sum", "max"] = field(init=False)
    _reweights: np.ndarray | None = field(init=False, default=None)

    def __post_init__(self) -> None:
        self._bucket_names = [b.name for b in self.config.buckets]
        self._bucket_core_indices = []
        covered: set[str] = set()
        for bucket in self.config.buckets:
            indices = [_CORE_LABEL_INDEX[lbl] for lbl in bucket.core_labels]
            self._bucket_core_indices.append(indices)
            covered.update(bucket.core_labels)

        uncovered = LABEL_SET_V1 - covered
        self._has_fallback = bool(uncovered)
        if self._has_fallback:
            logger.info(
                "Core labels %s not in any bucket; assigning to '%s' fallback",
                sorted(uncovered),
                FALLBACK_BUCKET_NAME,
            )
            fallback_indices = [_CORE_LABEL_INDEX[lbl] for lbl in sorted(uncovered)]
            self._bucket_names.append(FALLBACK_BUCKET_NAME)
            self._bucket_core_indices.append(fallback_indices)

        self._n_buckets = len(self._bucket_names)
        self._agg = self.config.advanced.probability_aggregation

        if self.config.advanced.reweight_core_labels:
            w = np.ones(len(_CORE_LABEL_NAMES), dtype=np.float64)
            for label, weight in self.config.advanced.reweight_core_labels.items():
                w[_CORE_LABEL_INDEX[label]] = weight
            self._reweights = w

    def resolve(
        self,
        core_label_id: int,  # noqa: ARG002 – kept for API symmetry
        core_probs: np.ndarray,
        *,
        is_rejected: bool = False,
    ) -> TaxonomyResult:
        """Map a single window's core prediction to a user bucket.

        Args:
            core_label_id: Index of the predicted core label (unused
                directly -- probabilities drive the mapping).
            core_probs: Probability vector of shape ``(8,)`` from the
                model.  **Not modified** by this method.
            is_rejected: Whether the prediction was below the reject
                threshold.

        Returns:
            A ``TaxonomyResult`` with ``mapped_label`` and
            ``mapped_probs``.
        """
        if is_rejected:
            return TaxonomyResult(
                mapped_label=self.config.reject.mixed_label_name,
                mapped_probs={},
            )

        probs = core_probs.astype(np.float64, copy=True)

        if self._reweights is not None:
            probs = probs * self._reweights
            total = probs.sum()
            if total > 0:
                probs /= total

        bucket_probs = np.zeros(self._n_buckets, dtype=np.float64)
        for i, indices in enumerate(self._bucket_core_indices):
            if self._agg == "sum":
                bucket_probs[i] = probs[indices].sum()
            else:
                bucket_probs[i] = probs[indices].max()

        bp_total = bucket_probs.sum()
        if bp_total > 0:
            bucket_probs /= bp_total

        best_idx = int(bucket_probs.argmax())
        mapped_label = self._bucket_names[best_idx]
        mapped_probs = {
            name: round(float(p), 6)
            for name, p in zip(self._bucket_names, bucket_probs)
        }

        return TaxonomyResult(mapped_label=mapped_label, mapped_probs=mapped_probs)

    def resolve_batch(
        self,
        core_label_ids: np.ndarray,
        core_probs: np.ndarray,
        *,
        is_rejected: np.ndarray | None = None,
    ) -> list[TaxonomyResult]:
        """Map a batch of core predictions to user buckets.

        Args:
            core_label_ids: Shape ``(N,)`` array of predicted core label
                indices.
            core_probs: Shape ``(N, 8)`` probability matrix.
            is_rejected: Optional boolean array of shape ``(N,)``.

        Returns:
            List of ``TaxonomyResult``, one per row.
        """
        n = len(core_label_ids)
        if is_rejected is None:
            is_rejected = np.zeros(n, dtype=bool)
        return [
            self.resolve(
                int(core_label_ids[i]),
                core_probs[i],
                is_rejected=bool(is_rejected[i]),
            )
            for i in range(n)
        ]

    @property
    def bucket_names(self) -> list[str]:
        """Ordered list of bucket names (including fallback if present)."""
        return list(self._bucket_names)

bucket_names property

Ordered list of bucket names (including fallback if present).

resolve(core_label_id, core_probs, *, is_rejected=False)

Map a single window's core prediction to a user bucket.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| core_label_id | int | Index of the predicted core label (unused directly -- probabilities drive the mapping). | required |
| core_probs | ndarray | Probability vector of shape (8,) from the model. Not modified by this method. | required |
| is_rejected | bool | Whether the prediction was below the reject threshold. | False |

Returns:

| Type | Description |
| --- | --- |
| TaxonomyResult | A TaxonomyResult with mapped_label and mapped_probs. |

resolve_batch(core_label_ids, core_probs, *, is_rejected=None)

Map a batch of core predictions to user buckets.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| core_label_ids | ndarray | Shape (N,) array of predicted core label indices. | required |
| core_probs | ndarray | Shape (N, 8) probability matrix. | required |
| is_rejected | ndarray \| None | Optional boolean array of shape (N,). | None |

Returns:

| Type | Description |
| --- | --- |
| list[TaxonomyResult] | List of TaxonomyResult, one per row. |

load_taxonomy(path)

Load and validate a taxonomy config from a YAML file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Path | Path to a YAML file matching the taxonomy config schema. | required |

Returns:

| Type | Description |
| --- | --- |
| TaxonomyConfig | Validated TaxonomyConfig. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If path does not exist. |
| ValueError / ValidationError | If the YAML is malformed or invalid. |

Source code in src/taskclf/infer/taxonomy.py
def load_taxonomy(path: Path) -> TaxonomyConfig:
    """Load and validate a taxonomy config from a YAML file.

    Args:
        path: Path to a YAML file matching the taxonomy config schema.

    Returns:
        Validated ``TaxonomyConfig``.

    Raises:
        FileNotFoundError: If *path* does not exist.
        ValueError / ValidationError: If the YAML is malformed or invalid.
    """
    raw = yaml.safe_load(path.read_text())
    if isinstance(raw, dict) and "version" in raw:
        raw["version"] = str(raw["version"])
    return TaxonomyConfig.model_validate(raw)
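The `version` coercion exists because a YAML loader parses an unquoted `version: 1.0` as a float, while the model field is typed `str`. The sketch below isolates that step on a plain dict; `coerce_version` is an illustrative name, and the input dicts stand in for what `yaml.safe_load` would return.

```python
def coerce_version(raw):
    """Convert a numeric 'version' value to str, leaving other input untouched."""
    if isinstance(raw, dict) and "version" in raw:
        raw["version"] = str(raw["version"])
    return raw


# What a YAML loader yields for an unquoted `version: 1.0`.
parsed = coerce_version({"version": 1.0, "buckets": []})

# An unquoted `version: 1.10` is already the float 1.1 by the time it is coerced.
lossy = coerce_version({"version": 1.10})

# Non-dict input (for example None from an empty file) passes through unchanged.
empty = coerce_version(None)
```

Since the trailing zero in `1.10` is lost before coercion, quoting the version in the YAML file (`version: "1.10"`) is the safer choice when the exact string matters.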

save_taxonomy(config, path)

Serialize a taxonomy config to YAML.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config | TaxonomyConfig | Validated taxonomy config to write. | required |
| path | Path | Destination file path. | required |

Returns:

| Type | Description |
| --- | --- |
| Path | The path that was written. |

Source code in src/taskclf/infer/taxonomy.py
def save_taxonomy(config: TaxonomyConfig, path: Path) -> Path:
    """Serialize a taxonomy config to YAML.

    Args:
        config: Validated taxonomy config to write.
        path: Destination file path.

    Returns:
        The *path* that was written.
    """
    data = config.model_dump(mode="json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(yaml.dump(data, default_flow_style=False, sort_keys=False))
    return path

default_taxonomy()

Create an identity taxonomy: one bucket per core label.

Useful as a starting point for user customisation.

Source code in src/taskclf/infer/taxonomy.py
def default_taxonomy() -> TaxonomyConfig:
    """Create an identity taxonomy: one bucket per core label.

    Useful as a starting point for user customisation.
    """
    buckets = [
        TaxonomyBucket(
            name=label,
            description=f"Core label: {label}",
            core_labels=[label],
        )
        for label in CoreLabel
    ]
    return TaxonomyConfig(buckets=buckets)