bulum.utils.dataframe_extensions module

Provides extensions to dataframes that facilitate tracking and bulk analysis.

TimeseriesDataframes (TSDF) are a wrapper around pandas dataframes, with extra fields (tags, name, source, …) and methods that facilitate working with these fields.

DataframeEnsembles are a way to organise multiple TSDFs, with methods that work (at present) primarily with the tags associated with TSDFs.

class DataframeEnsemble(dfs: Iterable[TimeseriesDataframe] | None = None)

Bases: object

A DataframeEnsemble is a collection of bulum-style timeseries dataframes, which might represent collected results from a set of model runs. Each timeseries dataframe is stored in an internal object, with a little attached metadata. All timeseries in the ensemble are expected to have the same index and the same columns.

While the class exposes the ensemble property as a dict, you should generally interact with the ensemble through the provided methods, namely .add_dataframe() and .get().

add_dataframe(df: DataFrame | TimeseriesDataframe, key: Any | None = None, tag: str | None = None) None
add_tag(tag: str) None

Add a tag to all member dataframes.

assert_df_shape_matches_ensemble(new_df: DataFrame | TimeseriesDataframe) None

Internal function to verify new dfs.

df_shape_matches_ensemble(new_df: DataFrame | TimeseriesDataframe) bool

Internal function to verify new dfs.
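
The requirement that every member shares the same index and columns can be illustrated with plain pandas. The following is only a sketch of what such a check might look like; shape_matches is a hypothetical helper, not bulum's internal implementation.

```python
import pandas as pd


def shape_matches(reference: pd.DataFrame, candidate: pd.DataFrame) -> bool:
    """Return True if both frames share the same index and the same columns."""
    # Hypothetical helper: Index.equals compares labels element by element.
    return (reference.index.equals(candidate.index)
            and reference.columns.equals(candidate.columns))


dates = pd.date_range("2020-01-01", periods=3, freq="D")
run_a = pd.DataFrame({"flow": [1.0, 2.0, 3.0]}, index=dates)
run_b = pd.DataFrame({"flow": [4.0, 5.0, 6.0]}, index=dates)   # same shape as run_a
run_c = pd.DataFrame({"level": [4.0, 5.0, 6.0]}, index=dates)  # different column name

print(shape_matches(run_a, run_b))
print(shape_matches(run_a, run_c))
```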

filter_tag(tag: str, *, exclude: bool = False, **kwargs) DataframeEnsemble

Return a new ensemble containing dataframes filtered by tag.

By default, it will include all dataframes whose tags partially match the provided tag.

This function delegates to TSDF.has_tag(); refer to that function for keyword arguments.

Parameters:
  • tag – The tag to match. String, regex pattern, or compiled regex pattern. (Regex matching requires the regex argument to be set; cf. TSDF.has_tag().)

  • exclude (bool) – If True, it will filter out all dataframes which match the tag.
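
As a rough illustration of the default partial-match behaviour and the exclude flag, the sketch below filters a plain dict of tag strings. The filter_by_tag function and the sample keys are hypothetical stand-ins, and substring matching is assumed from the description of has_tag().

```python
def filter_by_tag(tagged: dict, tag: str, *, exclude: bool = False) -> dict:
    """Keep entries whose tag string contains `tag` (or drop them if exclude=True)."""
    # Assumption: partial matching is a plain substring test on the tag string.
    return {key: tags for key, tags in tagged.items() if (tag in tags) != exclude}


# Hypothetical ensemble members, keyed by run name, each with a tag string.
members = {
    "run_a": "baseline,validated",
    "run_b": "scenario,validated",
    "run_c": "scenario,raw",
}

kept = filter_by_tag(members, "valid")                   # partial match on "validated"
dropped = filter_by_tag(members, "valid", exclude=True)  # everything else
```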

get(key: int | str | float | None = None) TimeseriesDataframe

Return the underlying dataframe if the ensemble is a singleton, or the dataframe at the given key.

classmethod load(filename: str | Path, format_override: Literal['json', 'pickle', 'folder'] | None = None, pickle_safety_lock: bool = True) DataframeEnsemble

Load (deserialise) a DataframeEnsemble from disk.

Parameters:
  • filename (str | Path) – Path to the file or folder to load

  • format_override ({"json", "pickle", "folder"}, optional) – Override automatic format detection from file extension/type

  • pickle_safety_lock (bool, default True) – Safety lock for pickle format. If True (default), raises ValueError when attempting to load pickle files. Set to False to allow pickle loading.

Returns:

Loaded DataframeEnsemble with all member dataframes and metadata restored

Return type:

DataframeEnsemble

Raises:
  • FileNotFoundError – If the specified file or folder doesn’t exist

  • ValueError – If the file format is unsupported or data is malformed, or if attempting to load pickle with safety lock enabled

Examples

>>> ensemble = DataframeEnsemble.load("results.json")
>>> ensemble = DataframeEnsemble.load("results")  # Auto-detects folder
>>> ensemble = DataframeEnsemble.load("results.zip")  # Auto-extracts zip
>>> ensemble = DataframeEnsemble.load("results.pkl", pickle_safety_lock=False)  # Unsafe!
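
The safety lock exists because unpickling data from an untrusted source can execute arbitrary code. The sketch below shows one way format detection and the lock could fit together; detect_format and the suffix table are hypothetical and are not taken from bulum's source.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to format name.
SUFFIX_TO_FORMAT = {".json": "json", ".pkl": "pickle", ".pickle": "pickle", ".zip": "zip"}


def detect_format(filename, format_override=None, pickle_safety_lock=True):
    """Pick a format name, refusing pickle unless the lock is released."""
    path = Path(filename)
    if format_override is not None:
        fmt = format_override
    elif path.is_dir():
        fmt = "folder"
    else:
        fmt = SUFFIX_TO_FORMAT.get(path.suffix.lower())
    if fmt is None:
        raise ValueError(f"Unsupported file format: {path.suffix!r}")
    if fmt == "pickle" and pickle_safety_lock:
        raise ValueError("Refusing to load pickle; pass pickle_safety_lock=False to allow it.")
    return fmt
```

With this sketch, detect_format("results.pkl") raises ValueError, while passing pickle_safety_lock=False returns "pickle".
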
map(func: Callable[[TimeseriesDataframe], TimeseriesDataframe]) DataframeEnsemble
map(func: Callable[[DataFrame], DataFrame]) DataframeEnsemble

Apply a function to all dataframes in the ensemble, returning a new ensemble with the results.

Parameters:

func – Univariate function on dataframes.
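
Conceptually, map applies func to each member and collects the results in a new container, leaving the original untouched. The sketch below shows the idea on a plain dict of pandas DataFrames; map_frames is a hypothetical stand-in and does not use the DataframeEnsemble class.

```python
import pandas as pd


def map_frames(frames: dict, func) -> dict:
    """Apply func to every frame and return the results under the same keys."""
    return {key: func(df) for key, df in frames.items()}


frames = {
    "run_a": pd.DataFrame({"flow": [1.0, 2.0]}),
    "run_b": pd.DataFrame({"flow": [3.0, 4.0]}),
}
doubled = map_frames(frames, lambda df: df * 2)
```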

print_summary() None
save(filename: str | Path, save_format: Literal['json', 'pickle', 'folder', 'zip', 'zip-folder'] = 'json', overwrite: bool = False) None

Save (serialise) this DataframeEnsemble to disk.

Parameters:
  • filename (str | Path) – Path to save the file/folder (extension added automatically for some formats)

  • save_format ({"json", "pickle", "folder", "zip", "zip-folder"}, default "json") –

    Format to save the ensemble in:

    • json: Save as a single JSON file with all data

    • pickle: Save as a pickle file (warning: pickle has security implications)

    • folder: Save as a directory with CSV files and metadata.json

    • zip: Save as a zip archive (creates folder, zips it, removes folder)

    • zip-folder: Save as both a zip archive and keep the folder

  • overwrite (bool, default False) – If True, overwrite existing files/folders. If False, raise FileExistsError.

Raises:
  • ValueError – If save_format is not supported

  • TypeError – If ensemble contains keys with unsupported types (only int, str, float supported)

  • FileExistsError – If file/folder exists and overwrite=False

Examples

>>> ensemble = DataframeEnsemble()
>>> ensemble.add_dataframe(tsdf1, key=0)
>>> ensemble.save("results", save_format="folder")
>>> ensemble.save("results", save_format="json")
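
One plausible reason for restricting key types (an assumption, not something the documentation states) is that JSON object keys are always strings, so an int or float key does not survive a naive round trip. The sketch below demonstrates the problem and one hypothetical workaround that records each key's type; encode_key and decode_key are illustrative names only.

```python
import json

# Naive round trip: every key comes back as a string.
keys_in = {0: "run-a", 1.5: "run-b", "base": "run-c"}
keys_out = json.loads(json.dumps(keys_in))

CASTS = {"int": int, "str": str, "float": float}


def encode_key(key):
    """Store the key together with its type name (hypothetical scheme)."""
    # bool is a subclass of int; this sketch rejects it explicitly.
    if isinstance(key, bool) or not isinstance(key, (int, str, float)):
        raise TypeError(f"Unsupported key type: {type(key).__name__}")
    return {"type": type(key).__name__, "value": key}


def decode_key(record):
    """Rebuild the original key from its stored type name and value."""
    return CASTS[record["type"]](record["value"])


restored = [decode_key(json.loads(json.dumps(encode_key(k)))) for k in keys_in]
```
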
class RegexArg(*values)

Bases: Enum

Specifies the type of argument supplied to filtering functions in TSDF and DataframeEnsemble.

OBJECT = 2
PATTERN = 1
class TimeseriesDataframe(data=None, *args, name='', source='', description='', **kwargs)

Bases: DataFrame

A TimeseriesDataframe is a thinly extended pd.DataFrame, abbreviated casually as TSDF throughout the documentation. It adds the following fields:

  • name (str)

  • source (str)

  • description (str)

  • a string of tags (str)

Metadata Preservation

TimeseriesDataframe uses pandas’ _metadata and _constructor mechanisms to automatically preserve metadata across most operations:

  • Preserved: slicing, arithmetic with scalars/Series, fillna, apply, copy, transpose, rank, reset_index, and most other standard pandas operations

  • Not preserved: rolling window operations, pandas.concat()

  • No guarantees: Binary operations between two TimeseriesDataframes (e.g., tsdf1 + tsdf2) may preserve metadata from either operand depending on pandas internals

For operations that don’t preserve metadata, use the tsdf_apply() method.
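
For readers unfamiliar with the pandas subclassing hooks mentioned above, here is a minimal sketch of the general mechanism. TaggedFrame is a hypothetical class written for illustration; it is not the TimeseriesDataframe implementation, and only slicing and copying are shown because their behaviour is well established.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass carrying a name and a tag string."""

    # Attribute names listed here are copied to results by pandas' __finalize__ hook.
    _metadata = ["name", "tags"]

    def __init__(self, data=None, *args, name="", tags="", **kwargs):
        super().__init__(data, *args, **kwargs)
        self.name = name
        self.tags = tags

    @property
    def _constructor(self):
        # Tells pandas to build results as TaggedFrame rather than DataFrame.
        return TaggedFrame


frame = TaggedFrame({"A": [1, 2, 3]}, name="test", tags="raw,validated")
head = frame.iloc[:2]  # slicing keeps the subclass and its metadata
clone = frame.copy()   # so does copying
```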

Examples

>>> df = pd.DataFrame({'A': [1, 2, 3]})
>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="file.csv")
>>> tsdf.add_tag("raw,validated")
>>>
>>> # Metadata preserved automatically in most operations:
>>> result = tsdf * 2
>>> result.name  # "test"
>>> result.tags  # "raw,validated"
>>>
>>> # For rolling operations, use tsdf_apply:
>>> result = tsdf.tsdf_apply(lambda df: df.rolling(window=2).mean())
>>> result.name  # "test" - metadata preserved

See also

TimeseriesDataframe.tsdf_apply

Apply functions while preserving metadata

TimeseriesDataframe.from_dataframe

Create from existing DataFrame

TAG_DELIMITER = ','
add_tag(tag: str | Iterable[str], check_membership: bool = False) None

Add a tag to the TimeseriesDataframe.

This is the canonical way to add tags to a TimeseriesDataframe. It can add multiple tags separated by the designated tag delimiter (by default, a comma ,).

Examples

The check_membership flag will ensure that tag does not match with existing tags, but will not (at present) check the other way around. For example, the following will not raise an error:

df.add_tag("01", True)
df.add_tag("01a", True)
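
The one-directional nature of the check can be reproduced with a plain string. The sketch below assumes the check is a substring test of the new tag against the existing tag string, and it raises ValueError; both details are assumptions, and this standalone add_tag function is not the TimeseriesDataframe method.

```python
TAG_DELIMITER = ","


def add_tag(existing: str, tag: str, check_membership: bool = False) -> str:
    """Return the tag string with `tag` appended (hypothetical standalone version)."""
    # Assumption: membership is a substring test of the new tag in the existing string.
    if check_membership and tag in existing:
        raise ValueError(f"tag {tag!r} already matches existing tags {existing!r}")
    return tag if not existing else existing + TAG_DELIMITER + tag


tags = add_tag("", "01", True)
tags = add_tag(tags, "01a", True)  # accepted: "01a" is not a substring of "01"
```

Under the same assumption, adding the tags in the opposite order ("01a" first, then "01") would raise, because "01" is a substring of "01a".
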
count_tags() int
classmethod from_dataframe(df, **kwargs) TimeseriesDataframe

Create a TimeseriesDataframe from an existing DataFrame.

Parameters:
  • df (pd.DataFrame) – DataFrame to convert

  • **kwargs – Metadata fields (name, source, description)

Returns:

New instance with data from df and metadata from kwargs

Return type:

TimeseriesDataframe

Examples

>>> df = pd.DataFrame({'A': [1, 2, 3]})
>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="file.csv")
has_tag(pattern: str | Pattern, *, regex: RegexArg | None = None, exact: bool = False) bool

Check if the provided tag matches any of the dataframe’s tags.

Parameters:
  • pattern (str | Pattern) – The tag to match: a string, a regex pattern string, or a compiled regex pattern.

  • regex (RegexArg, optional, keyword-only) –

    • None: Uses Python's in operator to check for membership; expects a string to be supplied to pattern.

    • RegexArg: Uses the regex engine to search for the tag.

  • exact (bool) – Whether an exact match of the tag is required. This argument is superseded by a non-None regex argument; with regex, an exact match can often be achieved (depending on the particulars) by wrapping the pattern as \b<regex>\b.
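
The three matching modes described above can be sketched on a plain tag string as follows. This is an illustration under assumptions: the standalone has_tag function is hypothetical, the meanings given to PATTERN (regex supplied as a string) and OBJECT (compiled regex) are guesses from the member names, and the regex is assumed to be searched against the whole delimited tag string.

```python
import re
from enum import Enum


class RegexArg(Enum):
    PATTERN = 1  # assumed: pattern is a regex string
    OBJECT = 2   # assumed: pattern is a compiled regex object


def has_tag(tags: str, pattern, *, regex=None, exact=False, delimiter=","):
    """Check a delimited tag string for a match (hypothetical standalone version)."""
    if regex is not None:
        # A non-None regex argument takes precedence over `exact`.
        compiled = re.compile(pattern) if regex is RegexArg.PATTERN else pattern
        return compiled.search(tags) is not None
    if exact:
        return pattern in tags.split(delimiter)
    return pattern in tags  # default: substring membership


tags = "raw,validated"
partial = has_tag(tags, "valid")                                      # substring match
strict = has_tag(tags, "valid", exact=True)                           # whole-tag match
bounded = has_tag(tags, r"\bvalid\b", regex=RegexArg.PATTERN)         # word-boundary regex
compiled = has_tag(tags, re.compile(r"^raw"), regex=RegexArg.OBJECT)  # compiled regex
```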

classmethod load(filename: str | Path, format_override: Literal['json', 'csv'] | None = None) TimeseriesDataframe

Load (deserialise) a TimeseriesDataframe from a file.

Parameters:
  • filename (str | Path) – Path to the file to load

  • format_override ({"json", "csv"}, optional) – Override automatic format detection from file extension. If csv, will attempt to find <file>.json (where filename=<file>.csv) and vice versa (i.e. via suffix replacement).

Returns:

Loaded TimeseriesDataframe with all metadata restored

Return type:

TimeseriesDataframe

Examples

>>> tsdf = TimeseriesDataframe.load("output.json")
>>> tsdf = TimeseriesDataframe.load("output.csv") # will attempt to locate metadata JSON file
>>> tsdf = TimeseriesDataframe.load("data.dat", format_override="json")
classmethod metadata_fields() list[str]

Return the list of metadata fields defined for this class.

print_summary() None
save(filename: str | Path, save_format: Literal['json', 'csv'] = 'json', overwrite: bool = False) None

Save (serialise) this TimeseriesDataframe to a file.

Parameters:
  • filename (str | Path) – Path to save the file (extension will be added automatically)

  • save_format ({"json", "csv"}, default "json") –

    Format to save the file in:

    • json: Save as JSON with all metadata

    • csv: Save as CSV file with separate metadata.json

  • overwrite (bool, default False) – If True, overwrite existing files. If False, raise FileExistsError.

Examples

>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="data.csv")
>>> tsdf.add_tag("validated")
>>> tsdf.save("output", save_format="json")  # Creates output.json
>>> tsdf.save("output", save_format="csv")   # Creates output.csv + output.metadata.json
tsdf_apply(func: Callable[[DataFrame], DataFrame]) TimeseriesDataframe

Apply a function to the underlying dataframe, returning a new TimeseriesDataframe with the results. The metadata fields are copied over.

Note

Most pandas operations automatically preserve metadata via the _constructor property, including:

  • Standard .apply(): tsdf.apply(lambda x: x * 2) preserves metadata

  • Arithmetic, slicing, fillna, copy, transpose, etc.

Use tsdf_apply() only when the operation bypasses _constructor:

  • Rolling window operations: tsdf.rolling(window=2).mean()

  • Concatenation: pd.concat([tsdf1, tsdf2])

  • Creating new DataFrames from scratch: pd.DataFrame(tsdf.mean())

Parameters:

func (Callable[[pd.DataFrame], pd.DataFrame]) – Function that takes a DataFrame and returns a DataFrame

Returns:

New TimeseriesDataframe with the result of func(self) and all metadata fields (name, source, description, tags) copied from self

Return type:

TimeseriesDataframe

Examples

>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="data.csv")
>>> tsdf.add_tag("raw")
>>>
>>> # Pandas .apply() preserves metadata automatically (no need for tsdf_apply):
>>> result = tsdf.apply(lambda x: x * 2)
>>> result.name  # "test" - metadata preserved!
>>>
>>> # Use tsdf_apply for rolling operations (metadata would be lost otherwise):
>>> result = tsdf.tsdf_apply(lambda df: df.rolling(window=2).mean())
>>> result.name  # "test" - metadata preserved
>>>
>>> # Use tsdf_apply when creating new DataFrames from scratch:
>>> result = tsdf.tsdf_apply(lambda df: pd.DataFrame(df.mean()))
>>> result.tags  # "raw" - metadata preserved
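
The general idea behind a metadata-preserving apply can be sketched as: run the function on a plain DataFrame, wrap the result, then copy each metadata field across. The code below is a hypothetical illustration using a small stand-in subclass; TaggedFrame and apply_keeping_metadata are not bulum names, and the real tsdf_apply may work differently.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical stand-in for a metadata-carrying DataFrame subclass."""

    _metadata = ["name", "tags"]

    def __init__(self, data=None, *args, name="", tags="", **kwargs):
        super().__init__(data, *args, **kwargs)
        self.name = name
        self.tags = tags

    @property
    def _constructor(self):
        return TaggedFrame


def apply_keeping_metadata(frame: TaggedFrame, func) -> TaggedFrame:
    """Run func on a plain DataFrame view, then copy the metadata onto the result."""
    result = TaggedFrame(func(pd.DataFrame(frame)))
    for field in TaggedFrame._metadata:
        setattr(result, field, getattr(frame, field))
    return result


frame = TaggedFrame({"A": [1.0, 2.0, 3.0]}, name="test", tags="raw")
rolled = apply_keeping_metadata(frame, lambda df: df.rolling(window=2).mean())
```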

See also

TimeseriesDataframe.from_dataframe

Create TSDF from regular DataFrame

pandas.DataFrame.apply

Standard pandas apply (preserves metadata for TSDF)