core.store

Parquet (and, in future, DuckDB) I/O primitives.

pandas is imported inside read_parquet() and write_parquet(), so importing taskclf.core.store does not eagerly load the full dataframe stack. Callers that only need other modules avoid that cost until parquet I/O actually runs.
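
The deferred-import approach described above can be sketched as follows. This is a minimal illustration, not the module's actual layout: the name read_parquet_lazy and the use of a TYPE_CHECKING guard for the annotation are assumptions made for the example.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers; this branch never runs at runtime,
    # so importing this module does not load pandas.
    import pandas as pd


def read_parquet_lazy(path: str) -> "pd.DataFrame":
    """Hypothetical stand-in showing a function-local pandas import."""
    # pandas is loaded on the first call, not when the module is imported.
    import pandas as pd

    return pd.read_parquet(path, engine="pyarrow")
```

Because the return annotation is a string, defining the function does not require pandas to be importable; the cost is paid only when the function is called.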

taskclf.core.store

Parquet I/O primitives for persisting DataFrames.

write_parquet(df, path)

Write df to a parquet file at path atomically.

Writes to a temporary file in the same directory first, then atomically replaces the target via os.replace. This prevents readers from ever seeing a partially-written file.

Parameters:

Name  Type       Description                                                         Default
df    DataFrame  DataFrame to persist.                                               required
path  Path       Destination file path (e.g. data/processed/.../features.parquet).  required

Returns:

Type  Description
Path  The path that was written, for convenient chaining.

Source code in src/taskclf/core/store.py
def write_parquet(df: pd.DataFrame, path: Path) -> Path:
    """Write *df* to a parquet file at *path* atomically.

    Writes to a temporary file in the same directory first, then
    atomically replaces the target via :func:`os.replace`.  This
    prevents readers from ever seeing a partially-written file.

    Args:
        df: DataFrame to persist.
        path: Destination file path (e.g. ``data/processed/.../features.parquet``).

    Returns:
        The *path* that was written, for convenient chaining.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".parquet.tmp")
    try:
        os.close(fd)
        df.to_parquet(tmp, engine="pyarrow", index=False)
        os.replace(tmp, path)
    except BaseException:
        with contextlib.suppress(OSError):
            os.unlink(tmp)
        raise
    return path
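
The temp-file-then-replace pattern used by write_parquet() can be tried without pandas. The sketch below applies the same steps to raw bytes; atomic_write_bytes and the file names are hypothetical and exist only for this illustration.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(data: bytes, path: Path) -> Path:
    """Hypothetical helper: the same pattern as write_parquet, for raw bytes."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so the final rename stays
    # on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # Swap the finished file into place in a single step.
        os.replace(tmp, path)
    except BaseException:
        # Clean up the temp file on any failure, then re-raise.
        with contextlib.suppress(OSError):
            os.unlink(tmp)
        raise
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "out" / "blob.bin"
    atomic_write_bytes(b"v1", target)
    atomic_write_bytes(b"v2", target)  # replaces the existing file
    final_bytes = target.read_bytes()
    leftovers = sorted(p.name for p in target.parent.iterdir())
```

After both writes the directory holds only the target file with the second payload; no temporary file is left behind on the success path.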

read_parquet(path)

Read a parquet file into a DataFrame.

Parameters:

Name  Type  Description                         Default
path  Path  Path to an existing .parquet file.  required

Returns:

Type       Description
DataFrame  The loaded DataFrame.

Source code in src/taskclf/core/store.py
def read_parquet(path: Path) -> pd.DataFrame:
    """Read a parquet file into a DataFrame.

    Args:
        path: Path to an existing ``.parquet`` file.

    Returns:
        The loaded DataFrame.
    """
    import pandas as pd

    return pd.read_parquet(path, engine="pyarrow")
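
One consequence of the index=False argument in write_parquet() is that a DataFrame's index labels are not stored, so a round trip returns a frame with a default integer index. The sketch below reproduces the same pandas calls directly instead of importing taskclf; round_trip_drops_index is a hypothetical name, and the example assumes pandas and pyarrow are installed.

```python
import tempfile
from pathlib import Path


def round_trip_drops_index() -> list:
    """Write a frame with a labelled index using index=False, then read it back."""
    # Function-local import, in line with the lazy-import note for this module.
    import pandas as pd

    df = pd.DataFrame({"value": [1, 2]}, index=["a", "b"])
    with tempfile.TemporaryDirectory() as tmp_dir:
        target = Path(tmp_dir) / "features.parquet"
        # Same engine and index settings as the write_parquet source.
        df.to_parquet(target, engine="pyarrow", index=False)
        loaded = pd.read_parquet(target, engine="pyarrow")
    # The labels "a" and "b" were not written; the loaded index is 0..n-1.
    return list(loaded.index)
```

If index values carry meaning, one option is to call df.reset_index() before writing so they are stored as an ordinary column.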