bulum.utils.dataframe_extensions module

Provides extensions to dataframes which facilitate tracking and bulk analysis.

TimeseriesDataframes (TSDF) are a wrapper around pandas dataframes, with extra fields (tags, name, source, …) and methods that facilitate working with these fields.

DataframeEnsembles are a way to organise multiple TSDFs, with methods that work (at present) primarily with the tags associated with TSDFs.

class DataframeEnsemble(dfs: Iterable[TimeseriesDataframe] | None = None)

Bases: object

A DataframeEnsemble is a collection of bulum-style timeseries dataframes, which might represent collected results from a set of model runs. Each timeseries dataframe is stored in an internal object, with a little attached metadata. All timeseries in the ensemble are expected to have the same index and the same columns.

add_dataframe(df: DataFrame | TimeseriesDataframe, key: Any | None = None, tag: str | None = None) None

add_tag(tag: str) None

Add a tag to all member dataframes.

assert_df_shape_matches_ensemble(new_df: DataFrame | TimeseriesDataframe) None

Internal function to verify new dfs.

df_shape_matches_ensemble(new_df: DataFrame | TimeseriesDataframe) bool

Internal function to verify new dfs.

filter_tag(tag: str, *, exclude: bool = False, **kwargs) DataframeEnsemble

Return a new ensemble containing dataframes filtered by tag.

By default, it will include all dataframes whose tags partially match the provided tag.

This function delegates to TSDF.has_tag(); refer to that function for the available keyword arguments.

Parameters:
  • tag – The tag to match: a string, a regex pattern, or a compiled regex pattern. (Regex matching requires the regex argument to be set; see TSDF.has_tag().)

  • exclude (bool) – If True, it will filter out all dataframes which match the tag.
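The include/exclude behaviour described above can be sketched without bulum installed. This is a minimal stand-in, not the library's implementation: filter_by_tag and the members dictionary are hypothetical names, members are represented only by their tag strings, and plain substring matching is assumed for the default (partial) match.

```python
def filter_by_tag(members: dict, tag: str, *, exclude: bool = False) -> dict:
    """Hypothetical stand-in for DataframeEnsemble.filter_tag.

    `members` maps a key to that member's tag string; a real ensemble
    holds TimeseriesDataframes instead.
    """
    # Assumption: the default partial match is a substring test.
    # With exclude=True the test is inverted, so matching members are dropped.
    return {
        key: tags
        for key, tags in members.items()
        if (tag in tags) != exclude
    }


runs = {
    "a": "baseline,wet",
    "b": "baseline,dry",
    "c": "scenario,wet",
}
wet_runs = filter_by_tag(runs, "wet")                # keeps "a" and "c"
not_wet = filter_by_tag(runs, "wet", exclude=True)   # keeps "b"
```

In both cases a new mapping is returned and the input is left unchanged, which mirrors the documented behaviour of returning a new ensemble.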

get(key: Any | None = None) TimeseriesDataframe

Return the underlying dataframe if the ensemble is a singleton, or the dataframe at the given key.

print_summary() None

class RegexArg(*values)

Bases: Enum

Specifies the type of argument supplied to filtering functions in TSDF and DataframeEnsemble.

OBJECT = 2
PATTERN = 1

class TimeseriesDataframe(*, name='', source='', description='')

Bases: DataFrame

A TimeseriesDataframe is a thinly extended pd.DataFrame, abbreviated casually as TSDF throughout the documentation. It adds the following fields:

  • name (str)

  • source (str)

  • description (str)

  • a string of tags (str)

TAG_DELIMITER = ','

Used to consistently separate tags. Kept as a variable for semantic purposes.

add_tag(tag: str, check_membership: bool = False) None

Add a tag to the TimeseriesDataframe.

This is the canonical way to add tags to a TimeseriesDataframe. It can add multiple tags separated by the designated tag delimiter (by default, a comma ,).

Examples

The check_membership flag ensures that the new tag does not match any existing tag, but does not (at present) check the other way around. For example, the following will not raise an error: `df.add_tag("01", True)` followed by `df.add_tag("01a", True)`.
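The one-directional nature of this check can be illustrated with a small stand-in. This sketch does not use bulum: TagHolder is a hypothetical class, and it assumes that the check is a substring test of the new tag against the stored tag string and that a failed check raises ValueError. Neither assumption is confirmed by the documentation above.

```python
TAG_DELIMITER = ","  # same delimiter as documented above


class TagHolder:
    """Hypothetical stand-in for the tag field of a TSDF (not bulum code)."""

    def __init__(self) -> None:
        self.tags = ""  # tags stored as a single delimited string

    def add_tag(self, tag: str, check_membership: bool = False) -> None:
        # Assumption: the check is a substring test of the new tag
        # against the stored tag string, and failure raises ValueError.
        if check_membership and self.tags and tag in self.tags:
            raise ValueError(f"tag {tag!r} matches an existing tag")
        self.tags = tag if not self.tags else self.tags + TAG_DELIMITER + tag


holder = TagHolder()
holder.add_tag("01", True)
holder.add_tag("01a", True)  # accepted: "01a" is not a substring of "01"

reverse = TagHolder()
reverse.add_tag("01a", True)
try:
    reverse.add_tag("01", True)  # rejected here: "01" is a substring of "01a"
    reverse_raised = False
except ValueError:
    reverse_raised = True
```

Under these assumptions the order of insertion matters: the shorter tag is accepted first and the longer one second, but not the reverse. Whether bulum rejects the reverse order in the same way should be checked against the library itself.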

copy_from_dataframe(df)

count_tags() int

classmethod from_dataframe(df, **kwargs)

has_tag(pattern: str | Pattern, *, regex: RegexArg | None = None, exact: bool = False) bool

Check if the provided tag matches any of the dataframe’s tags.

Parameters:
  • pattern (str | Pattern) – The tag to match: a string, a regex pattern, or a compiled regex pattern.

  • regex (RegexArg, optional, keyword-only) –

    • None: Uses Python's in operator to check for membership; expects a string to be supplied to pattern.

    • RegexArg: Uses the regex engine to search for the tag.

  • exact (bool) – Whether an exact match of the tag is required. This argument is superseded by a non-None regex argument; depending on the particulars, the same effect may be achieved via regex with \b<regex>\b.
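The three matching modes described by these parameters can be sketched as follows. This is an illustrative stand-in that operates on a plain tag string, not bulum's implementation. It assumes that RegexArg.PATTERN means pattern is a regex string, that RegexArg.OBJECT means pattern is a compiled regex, and that the regex is searched against the whole delimited tag string. The delimiter parameter is a hypothetical addition for the sketch.

```python
import re
from enum import Enum


class RegexArg(Enum):
    # Values mirror the enum documented above; the meanings are assumed.
    PATTERN = 1  # pattern is a regex string
    OBJECT = 2   # pattern is a compiled regex object


def has_tag(tags: str, pattern, *, regex=None, exact=False, delimiter=","):
    """Hypothetical stand-in for TSDF.has_tag, operating on a tag string."""
    # A non-None regex argument takes precedence over `exact`.
    if regex is RegexArg.PATTERN:
        return re.search(pattern, tags) is not None
    if regex is RegexArg.OBJECT:
        return pattern.search(tags) is not None
    if exact:
        # Exact mode: the pattern must equal one whole tag.
        return pattern in tags.split(delimiter)
    # Default mode: partial match using the `in` operator.
    return pattern in tags


tags = "run01,baseline,wet"
partial = has_tag(tags, "run")                                  # True
exact_miss = has_tag(tags, "run", exact=True)                   # False
exact_hit = has_tag(tags, "run01", exact=True)                  # True
by_pattern = has_tag(tags, r"run\d+", regex=RegexArg.PATTERN)   # True
by_object = has_tag(tags, re.compile(r"\bwet\b"), regex=RegexArg.OBJECT)  # True
bounded = has_tag(tags, r"\brun\b", regex=RegexArg.PATTERN)     # False
```

The last call shows the \b<regex>\b idiom: "run" followed directly by "01" has no word boundary after it, so the bounded pattern does not match the tag "run01".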

print_summary() None