core.model_io
Model bundle persistence: save, load, and metadata for trained model artifacts.
ModelMetadata Fields
| Field | Type | Description |
|---|---|---|
| schema_version | str | Feature schema version (e.g. "v1" or "v2") |
| schema_hash | str | Deterministic hash of the feature schema |
| label_set | list[str] | Sorted list of core labels used in training |
| train_date_from | str | First date of the training range (ISO-8601) |
| train_date_to | str | Last date of the training range (ISO-8601) |
| params | dict | Model hyperparameters |
| git_commit | str | Git commit SHA at training time |
| dataset_hash | str | SHA-256 hash of the training dataset for reproducibility |
| reject_threshold | float or None | Reject threshold used during evaluation. Advisory only; the canonical runtime threshold lives in InferencePolicy. |
| data_provenance | str | Origin: "real", "synthetic", or "mixed" |
| created_at | str | ISO-8601 timestamp of bundle creation |
| unknown_category_freq_threshold | int or None | Minimum category frequency used during training (categories below this become __unknown__) |
| unknown_category_mask_rate | float or None | Fraction of known categories randomly masked to __unknown__ during training |
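To make the shape of metadata.json concrete, the following is a minimal sketch of a record with the fields listed above. It uses a frozen standard-library dataclass as a stand-in; the real ModelMetadata is a pydantic BaseModel, and the class name MetadataSketch, the label names, and all placeholder values here are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class MetadataSketch:
    """Illustrative stand-in for ModelMetadata (not the real class)."""

    schema_version: str
    schema_hash: str
    label_set: list[str]
    train_date_from: str
    train_date_to: str
    params: dict[str, Any]
    git_commit: str
    dataset_hash: str
    data_provenance: str
    created_at: str
    reject_threshold: Optional[float] = None
    unknown_category_freq_threshold: Optional[int] = None
    unknown_category_mask_rate: Optional[float] = None


# All values below are placeholders, not real hashes, labels, or commits.
example = MetadataSketch(
    schema_version="v1",
    schema_hash="<schema-hash>",
    label_set=sorted(["writing", "coding"]),
    train_date_from="2024-01-01",
    train_date_to="2024-01-31",
    params={"num_leaves": 31},
    git_commit="<git-sha>",
    dataset_hash="<sha256-of-dataset>",
    data_provenance="synthetic",
    created_at="2024-02-01T00:00:00+00:00",
)

# Serialise to JSON and read it back, as a metadata file round-trip would.
payload = json.dumps(asdict(example), indent=2, sort_keys=True)
restored = MetadataSketch(**json.loads(payload))
```

Optional fields that are not set serialise as null, which matches the "or None" types in the table.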
Schema Version Support
load_model_bundle validates bundles against a registry of known schema
versions. v1, v2, and v3 bundles are accepted; the bundle's
schema_version field selects the expected hash. A hash mismatch
(e.g. loading a v1 bundle whose hash has been altered to v2's value)
raises ValueError.
build_metadata accepts a schema_version parameter (default
LATEST_FEATURE_SCHEMA_VERSION) and fills in the matching version string
and hash automatically.
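The registry lookup described above can be sketched roughly as follows. This is not the library's implementation: the names KNOWN_SCHEMAS and check_schema and the hash strings are hypothetical, and raising ValueError for an unregistered version at load time is an assumption (the source only states it for a hash mismatch and for build_metadata).

```python
# Hypothetical registry: schema version -> expected schema hash.
KNOWN_SCHEMAS = {
    "v1": "hash-for-v1",
    "v2": "hash-for-v2",
}


def check_schema(schema_version: str, schema_hash: str,
                 registry: dict[str, str] = KNOWN_SCHEMAS) -> None:
    """Raise ValueError unless the hash matches the one registered for the version."""
    expected = registry.get(schema_version)
    if expected is None:
        # Assumption: unknown versions are rejected the same way as bad hashes.
        raise ValueError(f"unknown schema version: {schema_version!r}")
    if schema_hash != expected:
        raise ValueError(
            f"schema hash mismatch for {schema_version}: "
            f"expected {expected!r}, got {schema_hash!r}"
        )
```

Because the version selects the expected hash, a v1 bundle carrying v2's hash fails the second check even though that hash is valid for another version.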
taskclf.core.model_io
ModelMetadata
Bases: BaseModel
Immutable record stored alongside a trained model as metadata.json.
Captures the feature schema version/hash, label vocabulary, training date range, hyperparameters, and the git commit at training time so that inference can verify compatibility before predicting.
Source code in src/taskclf/core/model_io.py
reject_threshold = None
class-attribute, instance-attribute
Deprecated: advisory only. The canonical runtime reject threshold now
lives in taskclf.core.inference_policy.InferencePolicy.
generate_run_id()
Produce a unique run directory name: YYYY-MM-DD_HHMMSS_run-XXXX.
Returns:
| Type | Description |
|---|---|
| str | A run directory name in the form YYYY-MM-DD_HHMMSS_run-XXXX. |
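A name in this format could be produced along the following lines. This is a sketch only: the helper name make_run_id is hypothetical, and the choice of local time and of four random decimal digits for the XXXX suffix are assumptions, since the source does not say how the suffix is generated.

```python
import re
import secrets
from datetime import datetime
from typing import Optional


def make_run_id(now: Optional[datetime] = None) -> str:
    """Build a name shaped like YYYY-MM-DD_HHMMSS_run-XXXX."""
    now = now or datetime.now()
    # Assumption: XXXX is a zero-padded random number in 0000..9999.
    suffix = f"{secrets.randbelow(10_000):04d}"
    return f"{now:%Y-%m-%d_%H%M%S}_run-{suffix}"


# Pattern that any name in this format should satisfy.
RUN_ID_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}_\d{6}_run-\d{4}")
```

The timestamp prefix keeps run directories sortable by creation time, while the suffix lowers the chance of two runs started in the same second colliding.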
save_model_bundle(model, metadata, metrics, confusion_df, base_dir, cat_encoders=None)
Persist a complete model bundle into base_dir/<run_id>/.
Writes the core files per the Model Bundle Contract plus an optional
categorical_encoders.json mapping each categorical column to its
sorted vocabulary list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Booster | Trained LightGBM booster. | required |
| metadata | ModelMetadata | Provenance record (schema hash, label set, params, etc.). | required |
| metrics | dict | Evaluation dict (as returned by …). | required |
| confusion_df | DataFrame | Labelled confusion matrix for CSV export. | required |
| base_dir | Path | Parent directory (e.g. …). | required |
| cat_encoders | dict \| None | Optional dict mapping categorical column names to fitted … | None |
Returns:
| Type | Description |
|---|---|
| Path | Path to the newly created run directory. |
Raises:
| Type | Description |
|---|---|
| FileExistsError | If the generated run directory already exists. |
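Two behaviours described above, refusing to overwrite an existing run directory and writing each categorical column's sorted vocabulary to categorical_encoders.json, can be illustrated with the standard library alone. The helper write_encoder_vocab, the column name "app", and the directory name are hypothetical; the real save_model_bundle also writes the model, metadata, metrics, and confusion matrix, which this sketch omits.

```python
import json
import tempfile
from pathlib import Path


def write_encoder_vocab(run_dir: Path, vocab_by_column: dict[str, set[str]]) -> Path:
    """Create run_dir (failing if it exists) and write sorted vocabularies."""
    # exist_ok=False makes mkdir raise FileExistsError for an existing directory.
    run_dir.mkdir(parents=True, exist_ok=False)
    out = run_dir / "categorical_encoders.json"
    payload = {col: sorted(values) for col, values in vocab_by_column.items()}
    out.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return out


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "example-run"
    written = write_encoder_vocab(run_dir, {"app": {"zsh", "code"}})
    saved = json.loads(written.read_text(encoding="utf-8"))
    try:
        write_encoder_vocab(run_dir, {})  # same directory a second time
        second_write_failed = False
    except FileExistsError:
        second_write_failed = True
```

Failing on an existing directory, rather than overwriting it, keeps each run's artifacts immutable once written.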
load_model_bundle(run_dir, *, validate_schema=True, validate_labels=True)
Load a model bundle and optionally validate schema hash and label set.
Schema validation accepts v1, v2, and v3 bundles: the bundle's
schema_version is looked up in the known schema registry and its
schema_hash is checked against the corresponding expected hash.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
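| run_dir | Path | Path to an existing run directory (e.g. …). | required |
| validate_schema | bool | When True, check the bundle's schema hash against the expected hash for its schema version. | True |
| validate_labels | bool | When True, check the bundle's label set against the running code. | True |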
Returns:
| Type | Description |
|---|---|
| Booster | The loaded model. |
| ModelMetadata | The metadata record stored with the bundle. |
| dict[str, Any] | Categorical encoders: a dict mapping column names to fitted … instances. An empty dict when no encoder file exists. |
Raises:
| Type | Description |
|---|---|
| ValueError | If validation is enabled and the schema hash or label set recorded in the bundle does not match the running code. |
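The optional-encoder and label-validation behaviour can be sketched as below. This is a simplified illustration, not the library's loader: read_bundle_files is a hypothetical helper, it does not load the LightGBM model, and comparing the stored label_set with the sorted expected labels is an assumption about how "does not match the running code" is decided.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


def read_bundle_files(
    run_dir: Path, expected_labels: Optional[list[str]] = None
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Read metadata.json and, if present, categorical_encoders.json."""
    metadata = json.loads((run_dir / "metadata.json").read_text(encoding="utf-8"))
    if expected_labels is not None and metadata["label_set"] != sorted(expected_labels):
        raise ValueError("label set in bundle does not match the running code")
    enc_path = run_dir / "categorical_encoders.json"
    # A missing encoder file yields an empty dict rather than an error.
    encoders = json.loads(enc_path.read_text(encoding="utf-8")) if enc_path.exists() else {}
    return metadata, encoders


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "metadata.json").write_text(
        json.dumps({"label_set": ["coding", "writing"]}), encoding="utf-8"
    )
    metadata, encoders = read_bundle_files(run_dir, expected_labels=["writing", "coding"])
    try:
        read_bundle_files(run_dir, expected_labels=["other"])
        mismatch_raised = False
    except ValueError:
        mismatch_raised = True
```

Passing expected_labels=None in this sketch plays the role of validate_labels=False: the check is skipped and the files are returned as stored.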
build_metadata(label_set, train_date_from, train_date_to, params, *, dataset_hash, reject_threshold=None, data_provenance='real', unknown_category_freq_threshold=None, unknown_category_mask_rate=None, schema_version=LATEST_FEATURE_SCHEMA_VERSION)
Convenience builder that fills in schema info and git commit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| label_set | list[str] | Task-type labels used during training. | required |
| train_date_from | date | First date of the training range. | required |
| train_date_to | date | Last date (inclusive) of the training range. | required |
| params | dict[str, Any] | LightGBM (or other model) hyperparameters dict. | required |
| dataset_hash | str | Deterministic SHA-256 hash of the training dataset, used for reproducibility auditing. | required |
| reject_threshold | float \| None | Reject threshold used during evaluation. | None |
| data_provenance | Literal['real', 'synthetic', 'mixed'] | Origin of the training data ("real", "synthetic", or "mixed"). | 'real' |
| unknown_category_freq_threshold | int \| None | Minimum category frequency used during training (categories below this are mapped to __unknown__). | None |
| unknown_category_mask_rate | float \| None | Fraction of known categories randomly masked to __unknown__ during training. | None |
| schema_version | str | Feature schema version whose version string and hash are recorded in the metadata. | LATEST_FEATURE_SCHEMA_VERSION |
Returns:
| Type | Description |
|---|---|
| ModelMetadata | A populated ModelMetadata instance. |
Raises:
| Type | Description |
|---|---|
| ValueError | If schema_version is not recognised. |
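build_metadata expects the caller to supply dataset_hash. One way to compute a deterministic SHA-256 digest over a dataset file is sketched below; the helper sha256_of_stream is hypothetical, and hashing the raw file bytes is an assumption, since the source does not say which representation of the dataset the project hashes.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    # iter() with a sentinel stops once read() returns empty bytes at EOF.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# In-memory stand-in for an open dataset file (placeholder bytes).
sample_bytes = b"bucket_start,label\n2024-01-01T00:00:00,coding\n"
sample_hash = sha256_of_stream(io.BytesIO(sample_bytes), chunk_size=8)
```

Reading in chunks keeps memory use flat for large files, and the digest does not depend on the chunk size, so the same bytes always give the same dataset_hash.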