bulum.utils.dataframe_extensions module

Provides extensions to dataframes that facilitate tracking and bulk analysis.

TimeseriesDataframes (TSDF) are a wrapper around pandas dataframes, with extra fields (tags, name, source, …) and methods that facilitate working with these fields.

DataframeEnsembles are a way to organise multiple TSDFs, with methods that work (at present) primarily with the tags associated with TSDFs.

class DataframeEnsemble(dfs: Iterable[TimeseriesDataframe] | None = None)

Bases: object

A DataframeEnsemble is a collection of bulum-style timeseries dataframes, which might represent collected results from a set of model runs. Each timeseries dataframe is stored in an internal object, with a little attached metadata. All timeseries in the ensemble are expected to have the same index and the same columns.

While the class exposes the ensemble property as a dict, you should generally use the provided methods to interact with the ensemble, namely .add_dataframe() and .get().

add_dataframe(df: DataFrame | TimeseriesDataframe, key: Any | None = None, tag: str | None = None) → None

add_tag(tag: str) → None: Add a tag to all member dataframes.

assert_df_shape_matches_ensemble(new_df: DataFrame | TimeseriesDataframe) → None: Internal function to verify new dfs.

df_shape_matches_ensemble(new_df: DataFrame | TimeseriesDataframe) → bool: Internal function to verify new dfs.
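
A check of this kind could be sketched with pandas' own Index.equals, assuming that "shape" here means an identical index and identical columns, as the class description states. The helper name same_index_and_columns is hypothetical and is not part of bulum; the actual implementation may compare differently.

```python
import pandas as pd


def same_index_and_columns(reference: pd.DataFrame, candidate: pd.DataFrame) -> bool:
    """Hypothetical helper: True when both frames share index and columns."""
    # Index.equals compares labels element-wise, so order matters as well.
    return bool(
        reference.index.equals(candidate.index)
        and reference.columns.equals(candidate.columns)
    )


dates = pd.date_range("2000-01-01", periods=3, freq="D")
run_a = pd.DataFrame({"flow": [1.0, 2.0, 3.0]}, index=dates)
run_b = pd.DataFrame({"flow": [4.0, 5.0, 6.0]}, index=dates)
run_c = pd.DataFrame({"rain": [0.0, 0.1, 0.2]}, index=dates)

matches = same_index_and_columns(run_a, run_b)    # same index, same columns
mismatch = same_index_and_columns(run_a, run_c)   # different column labels
```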

filter_tag(tag: str, *, exclude: bool = False, **kwargs) → DataframeEnsemble

Return a new ensemble containing dataframes filtered by tag.

By default, it will include all dataframes whose tags partially match the provided tag.

This function delegates to TSDF.has_tag(); refer to that function for the available keyword arguments.

Parameters:

tag – The tag to match. String, regex pattern, or compiled regex pattern. (Regex matching requires the regex argument to be set, cf. TSDF.has_tag().)
exclude (bool) – If True, filter out all dataframes that match the tag instead of keeping them.
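
The include/exclude behaviour can be illustrated on plain tag strings. This is a minimal sketch, assuming that the default "partial match" is a substring test on each member's tag string; filter_by_tag and the run names are hypothetical and only stand in for the ensemble.

```python
def filter_by_tag(tagged: dict[str, str], tag: str, exclude: bool = False) -> dict[str, str]:
    """Hypothetical helper: keep entries whose tag string contains `tag`."""
    # (tag in tags) != exclude keeps matches by default,
    # and keeps the non-matches when exclude=True.
    return {key: tags for key, tags in tagged.items() if (tag in tags) != exclude}


runs = {
    "run_a": "baseline,raw",
    "run_b": "scenario1,raw",
    "run_c": "scenario1,validated",
}

included = filter_by_tag(runs, "scenario")                 # partial match on "scenario1"
excluded = filter_by_tag(runs, "scenario", exclude=True)   # everything else
```

In both cases a new dict is returned and the input is left untouched, mirroring the fact that filter_tag returns a new ensemble.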

get(key: int | str | float | None = None) → TimeseriesDataframe: Return the underlying dataframe if the ensemble is a singleton, or the dataframe at the given key.

classmethod load(filename: str | Path, format_override: Literal['json', 'pickle', 'folder'] | None = None, pickle_safety_lock: bool = True) → DataframeEnsemble

Load (deserialise) a DataframeEnsemble from disk.

Parameters:

filename (str | Path) – Path to the file or folder to load
format_override ({"json", "pickle", "folder"}, optional) – Override automatic format detection from file extension/type
pickle_safety_lock (bool, default True) – Safety lock for pickle format. If True (default), raises ValueError when attempting to load pickle files. Set to False to allow pickle loading.

Returns:

Loaded DataframeEnsemble with all member dataframes and metadata restored

Return type:

DataframeEnsemble

Raises:

FileNotFoundError – If the specified file or folder doesn’t exist
ValueError – If the file format is unsupported or data is malformed, or if attempting to load pickle with safety lock enabled

Examples

>>> ensemble = DataframeEnsemble.load("results.json")
>>> ensemble = DataframeEnsemble.load("results")  # Auto-detects folder
>>> ensemble = DataframeEnsemble.load("results.zip")  # Auto-extracts zip
>>> ensemble = DataframeEnsemble.load("results.pkl", pickle_safety_lock=False)  # Unsafe!
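
The safety lock exists because unpickling untrusted data can execute arbitrary code. The following is a minimal sketch of format detection followed by a pickle guard; the suffix table, detect_format and check_pickle_allowed are assumptions inferred from the examples above and are not bulum's actual code.

```python
from pathlib import Path

# Assumed mapping, inferred from the examples above; bulum's real table may differ.
SUFFIX_TO_FORMAT = {".json": "json", ".pkl": "pickle", ".zip": "zip"}


def detect_format(filename, format_override=None):
    """Hypothetical helper: pick a format from the override, a directory, or the suffix."""
    path = Path(filename)
    if format_override is not None:
        return format_override
    if path.is_dir():
        return "folder"
    try:
        return SUFFIX_TO_FORMAT[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file format: {path.suffix!r}") from None


def check_pickle_allowed(fmt, pickle_safety_lock=True):
    """Hypothetical guard: refuse pickle unless the caller opted in explicitly."""
    if fmt == "pickle" and pickle_safety_lock:
        raise ValueError("Refusing to load pickle; pass pickle_safety_lock=False to allow it.")


json_format = detect_format("results.json")
pickle_format = detect_format("results.pkl")

try:
    check_pickle_allowed(pickle_format)   # lock is on by default
    lock_blocked_pickle = False
except ValueError:
    lock_blocked_pickle = True
```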

map(func: Callable[[TimeseriesDataframe], TimeseriesDataframe]) → DataframeEnsemble

map(func: Callable[[DataFrame], DataFrame]) → DataframeEnsemble

Apply a function to all dataframes in the ensemble, returning a new ensemble with the results.

Parameters:

func – Univariate function on dataframes.

print_summary() → None

save(filename: str | Path, save_format: Literal['json', 'pickle', 'folder', 'zip', 'zip-folder'] = 'json', overwrite: bool = False) → None

Save (serialise) this DataframeEnsemble to disk.

Parameters:

filename (str | Path) – Path to save the file/folder (extension added automatically for some formats)
save_format ({"json", "pickle", "folder", "zip", "zip-folder"}, default "json") –
Format to save the ensemble in:
- json: Save as a single JSON file with all data
- pickle: Save as a pickle file (warning: pickle has security implications)
- folder: Save as a directory with CSV files and metadata.json
- zip: Save as a zip archive (creates folder, zips it, removes folder)
- zip-folder: Save as both a zip archive and keep the folder
overwrite (bool, default False) – If True, overwrite existing files/folders. If False, raise FileExistsError.

Raises:

ValueError – If save_format is not supported
TypeError – If ensemble contains keys with unsupported types (only int, str, float supported)
FileExistsError – If file/folder exists and overwrite=False

Examples

>>> ensemble = DataframeEnsemble()
>>> ensemble.add_dataframe(tsdf1, key=0)
>>> ensemble.save("results", save_format="folder")
>>> ensemble.save("results", save_format="json")
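
The difference between the zip and zip-folder formats can be sketched with the standard library. This only illustrates the "create folder, zip it, optionally remove the folder" sequence described above; zip_folder is a hypothetical helper and the file contents are placeholders, not bulum's on-disk layout.

```python
import shutil
import tempfile
from pathlib import Path


def zip_folder(folder: Path, keep_folder: bool = False) -> Path:
    """Hypothetical helper: zip `folder` next to itself, then optionally delete it."""
    # make_archive appends ".zip" to the base name and returns the archive path.
    archive = shutil.make_archive(str(folder), "zip", root_dir=folder)
    if not keep_folder:
        shutil.rmtree(folder)
    return Path(archive)


with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results"
    results.mkdir()
    (results / "metadata.json").write_text("{}")   # placeholder content

    archive = zip_folder(results)                  # comparable to save_format="zip"
    archive_name = archive.name
    archive_exists = archive.exists()
    folder_survived = results.exists()
```

Passing keep_folder=True would correspond to the zip-folder behaviour, where both the archive and the folder remain on disk.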

class RegexArg(*values)

Bases: Enum

Specifies the type of argument supplied to filtering functions in TSDF and DataframeEnsemble.

OBJECT = 2

PATTERN = 1

class TimeseriesDataframe(data=None, *args, name='', source='', description='', **kwargs)

Bases: DataFrame

A TimeseriesDataframe is a thinly extended pd.DataFrame, abbreviated casually as TSDF throughout the documentation. It adds the following fields:

name (str)
source (str)
description (str)
a string of tags (str)

Metadata Preservation

TimeseriesDataframe uses pandas’ _metadata and _constructor mechanisms to automatically preserve metadata across most operations:

Preserved: slicing, arithmetic with scalars/Series, fillna, apply, copy, transpose, rank, reset_index, and most other standard pandas operations
Not preserved: rolling window operations, pandas.concat()
No guarantees: Binary operations between two TimeseriesDataframes (e.g., tsdf1 + tsdf2) may preserve metadata from either operand depending on pandas internals

For operations that don’t preserve metadata, use the tsdf_apply() method.
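
The mechanism itself is standard pandas subclassing and can be shown on a toy class. This is a minimal sketch, assuming a recent pandas version; TaggedFrame is a hypothetical stand-in and not the TimeseriesDataframe implementation.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical stand-in for a metadata-carrying DataFrame subclass."""

    # Names listed here are copied to results by pandas' __finalize__ hook.
    _metadata = ["name", "tags"]

    def __init__(self, *args, name="", tags="", **kwargs):
        super().__init__(*args, **kwargs)
        self.name = name
        self.tags = tags

    @property
    def _constructor(self):
        # Tells pandas which class to build when an operation returns a frame.
        return TaggedFrame


frame = TaggedFrame({"A": [1.0, 2.0, 3.0]}, name="test", tags="raw")

copied = frame.copy()      # built via _constructor, then finalized from `frame`
sliced = frame.iloc[:2]    # slicing takes the same route
```

Operations that build their result without going through this constructor and finalize step return an object on which these attributes are not set, which is the situation tsdf_apply() is meant to cover.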

Examples

>>> df = pd.DataFrame({'A': [1, 2, 3]})
>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="file.csv")
>>> tsdf.add_tag("raw,validated")
>>>
>>> # Metadata preserved automatically in most operations:
>>> result = tsdf * 2
>>> result.name  # "test"
>>> result.tags  # "raw,validated"
>>>
>>> # For rolling operations, use tsdf_apply:
>>> result = tsdf.tsdf_apply(lambda df: df.rolling(window=2).mean())
>>> result.name  # "test" - metadata preserved

See also

TimeseriesDataframe.tsdf_apply: Apply functions while preserving metadata
TimeseriesDataframe.from_dataframe: Create from existing DataFrame

TAG_DELIMITER = ','

add_tag(tag: str | Iterable[str], check_membership: bool = False) → None

Add a tag to the TimeseriesDataframe.

This is the canonical way to add tags to a TimeseriesDataframe. It can add multiple tags separated by the designated tag delimiter (by default, a comma ",").

Examples

The check_membership flag will ensure that tag does not match with existing tags, but will not (at present) check the other way around. For example, the following will not raise an error:

df.add_tag("01", True)
df.add_tag("01a", True)
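
One reading of this one-directional check is that the new tag is tested for containment in each existing tag, but not the reverse. The sketch below follows that reading only; add_tag_checked, the list-based storage and the ValueError are assumptions, since the source does not state which error is raised.

```python
def add_tag_checked(existing: list[str], tag: str) -> list[str]:
    """Hypothetical one-directional check: is the new tag contained in an existing one?"""
    for current in existing:
        if tag in current:
            raise ValueError(f"tag {tag!r} matches existing tag {current!r}")
    return existing + [tag]


tags = add_tag_checked([], "01")
tags = add_tag_checked(tags, "01a")    # accepted: "01a" is not contained in "01"

try:
    add_tag_checked(["01a"], "01")     # reversed order: "01" is contained in "01a"
    reverse_order_rejected = False
except ValueError:
    reverse_order_rejected = True
```

Under this reading the outcome depends on insertion order, which is the asymmetry the note above describes.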

count_tags() → int

classmethod from_dataframe(df, **kwargs) → TimeseriesDataframe

Create a TimeseriesDataframe from an existing DataFrame.

Parameters:

df (pd.DataFrame) – DataFrame to convert
**kwargs – Metadata fields (name, source, description)

Returns:

New instance with data from df and metadata from kwargs

Return type:

TimeseriesDataframe

Examples

>>> df = pd.DataFrame({'A': [1, 2, 3]})
>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="file.csv")

has_tag(pattern: str | Pattern, *, regex: RegexArg | None = None, exact: bool = False) → bool

Check if the provided tag matches any of the dataframe’s tags.

Parameters:

pattern (str | Pattern) – The tag to match. String, regex pattern, or compiled regex pattern.
regex (RegexArg, optional, keyword-only) –
- None: Uses Python's in operator to check for membership; expects a string to be supplied to pattern.
- RegexArg: Uses the regex engine to search for the tag.
exact (bool) – Whether an exact match of the tag is required. This argument is superseded by a non-None regex argument; depending on the particulars, the same effect may be achieved via regex with \b<regex>\b.
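
The three matching modes can be compared on a plain tag string. This is a minimal sketch, assuming tags are stored as one delimiter-separated string; tag_matches and its use_regex flag are hypothetical simplifications and do not reproduce the RegexArg interface.

```python
import re

TAG_DELIMITER = ","


def tag_matches(tags: str, pattern, *, use_regex: bool = False, exact: bool = False) -> bool:
    """Hypothetical helper comparing substring, exact and regex matching."""
    if use_regex:
        # re.search accepts both a pattern string and a compiled pattern;
        # as documented above, regex takes precedence over `exact`.
        return re.search(pattern, tags) is not None
    if exact:
        return pattern in tags.split(TAG_DELIMITER)
    return pattern in tags   # plain substring test


tags = "raw,validated,run01a"

partial = tag_matches(tags, "run01")                              # substring of "run01a"
exact_only = tag_matches(tags, "run01", exact=True)               # no tag equals "run01"
bounded = tag_matches(tags, r"\brun01\b", use_regex=True)         # "a" follows, no boundary
compiled = tag_matches(tags, re.compile(r"\bvalidated\b"), use_regex=True)
```

The word-boundary form behaves like an exact match here because the comma delimiter is a non-word character; tags that themselves contain non-word characters would need a different pattern.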

classmethod load(filename: str | Path, format_override: Literal['json', 'csv'] | None = None) → TimeseriesDataframe

Load (deserialise) a TimeseriesDataframe from a file.

Parameters:

filename (str | Path) – Path to the file to load
format_override ({"json", "csv"}, optional) – Override automatic format detection from file extension. If csv, will attempt to find <file>.json (where filename=<file>.csv) and vice versa (i.e. via suffix replacement).

Returns:

Loaded TimeseriesDataframe with all metadata restored

Return type:

TimeseriesDataframe

Examples

>>> tsdf = TimeseriesDataframe.load("output.json")
>>> tsdf = TimeseriesDataframe.load("output.csv") # will attempt to locate metadata JSON file
>>> tsdf = TimeseriesDataframe.load("data.dat", format_override="json")

classmethod metadata_fields() → list[str]: Return the list of metadata fields defined for this class.

print_summary() → None

save(filename: str | Path, save_format: Literal['json', 'csv'] = 'json', overwrite: bool = False) → None

Save (serialise) this TimeseriesDataframe to a file.

Parameters:

filename (str | Path) – Path to save the file (extension will be added automatically)
save_format ({"json", "csv"}, default "json") –
Format to save the file in:
- json: Save as JSON with all metadata
- csv: Save as CSV file with separate metadata.json
overwrite (bool, default False) – If True, overwrite existing files. If False, raise FileExistsError.

Examples

>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="data.csv")
>>> tsdf.add_tag("validated")
>>> tsdf.save("output", save_format="json")  # Creates output.json
>>> tsdf.save("output", save_format="csv")   # Creates output.csv + output.metadata.json

tsdf_apply(func: Callable[[DataFrame], DataFrame]) → TimeseriesDataframe

Apply a function to the underlying dataframe, returning a new TimeseriesDataframe with the results. The metadata fields are copied over.

Note

Most pandas operations automatically preserve metadata via the _constructor property, including:

Standard .apply(): tsdf.apply(lambda x: x * 2) preserves metadata
Arithmetic, slicing, fillna, copy, transpose, etc.

Use tsdf_apply() only when the operation bypasses _constructor:

Rolling window operations: tsdf.rolling(window=2).mean()
Concatenation: pd.concat([tsdf1, tsdf2])
Creating new DataFrames from scratch: pd.DataFrame(tsdf.mean())

Parameters:

func (Callable[[pd.DataFrame], pd.DataFrame]) – Function that takes a DataFrame and returns a DataFrame

Returns:

New TimeseriesDataframe with the result of func(self) and all metadata fields (name, source, description, tags) copied from self

Return type:

TimeseriesDataframe

Examples

>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="data.csv")
>>> tsdf.add_tag("raw")
>>>
>>> # Pandas .apply() preserves metadata automatically (no need for tsdf_apply):
>>> result = tsdf.apply(lambda x: x * 2)
>>> result.name  # "test" - metadata preserved!
>>>
>>> # Use tsdf_apply for rolling operations (metadata would be lost otherwise):
>>> result = tsdf.tsdf_apply(lambda df: df.rolling(window=2).mean())
>>> result.name  # "test" - metadata preserved
>>>
>>> # Use tsdf_apply when creating new DataFrames from scratch:
>>> result = tsdf.tsdf_apply(lambda df: pd.DataFrame(df.mean()))
>>> result.tags  # "raw" - metadata preserved