bulum.utils.dataframe_extensions module
Provides extensions to dataframes which facilitate tracking and bulk analysis.
TimeseriesDataframes (TSDF) are a wrapper around pandas dataframes, with extra fields (tags, name, source, …) and methods that facilitate working with these fields.
DataframeEnsembles are a way to organise multiple TSDFs, with methods that work (at present) primarily with the tags associated with TSDFs.
- class DataframeEnsemble(dfs: Iterable[TimeseriesDataframe] | None = None)
Bases: object
A DataframeEnsemble is a collection of bulum-style timeseries dataframes, which might represent collected results from a set of model runs. Each timeseries dataframe is stored in an internal object, with a little attached metadata. All timeseries in the ensemble are expected to have the same index and the same columns.
- add_dataframe(df: DataFrame | TimeseriesDataframe, key: Any | None = None, tag: str | None = None) None
- assert_df_shape_matches_ensemble(new_df: DataFrame | TimeseriesDataframe) None
Internal function to verify new dfs.
- df_shape_matches_ensemble(new_df: DataFrame | TimeseriesDataframe) bool
Internal function to verify new dfs.
- filter_tag(tag: str, *, exclude: bool = False, **kwargs) DataframeEnsemble
Return a new ensemble containing dataframes filtered by tag.
By default, it will include all dataframes whose tags partially match the provided tag.
This function delegates to TSDF.has_tag(), refer to that function for keyword arguments.
- Parameters:
tag – The tag to match. String, regex pattern, or compiled regex pattern. (Regex matching requires the regex argument to be set; cf. TSDF.has_tag().)
exclude (bool) – If True, it will filter out all dataframes which match the tag.
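The include/exclude behaviour described above can be sketched with plain dictionaries. This is a minimal stand-in, not bulum's implementation: `filter_by_tag` and the `runs` mapping are hypothetical, and partial matching is assumed to be a substring test on each tag string.

```python
def filter_by_tag(tagged: dict, tag: str, *, exclude: bool = False) -> dict:
    """Keep entries whose tag string contains `tag`; invert when exclude=True.

    Hypothetical helper: `tagged` maps an ensemble key to a tag string.
    """
    return {
        key: tags
        for key, tags in tagged.items()
        if (tag in tags) != exclude  # substring match, flipped by `exclude`
    }


# Hypothetical ensemble contents: key -> comma-delimited tag string.
runs = {
    "run_a": "baseline,wet",
    "run_b": "scenario,dry",
    "run_c": "baseline,dry",
}

kept = filter_by_tag(runs, "baseline")
dropped = filter_by_tag(runs, "baseline", exclude=True)
```

Here `kept` holds the two baseline runs and `dropped` holds the remaining one; in both cases a new mapping is returned and the original is left untouched, mirroring how filter_tag returns a new ensemble.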
- get(key: Any | None = None) TimeseriesDataframe
Return the underlying dataframe if the ensemble is a singleton, or the dataframe at the given key.
- class RegexArg(*values)
Bases: Enum
Specifies the type of argument supplied to filtering functions in TSDF and
- OBJECT = 2
- PATTERN = 1
- class TimeseriesDataframe(*, name='', source='', description='')
Bases: DataFrame
A TimeseriesDataframe is a thinly extended pd.DataFrame. Abbreviated casually as TSDF throughout the documentation. It adds the following fields:
name (str)
source (str)
description (str)
a string of tags (str)
- TAG_DELIMITER = ','
Used to consistently separate tags. Kept as a variable for semantic purposes.
- add_tag(tag: str, check_membership: bool = False) None
Add a tag to the TimeseriesDataframe.
This is the canonical way to add tags to a TimeseriesDataframe. It can add multiple tags separated by the designated tag delimiter (by default, a comma ,).
Examples
The check_membership flag will ensure that tag does not match with existing tags, but will not (at present) check the other way around. For example, the following will not raise an error:
`df.add_tag("01", True)`
`df.add_tag("01a", True)`
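One way to picture the one-directional check is as a substring test of the new tag against the existing tag string. The sketch below assumes exactly that; `TagHolder` and the `ValueError` it raises are hypothetical and do not reproduce bulum's code or its error type.

```python
TAG_DELIMITER = ","  # same delimiter as TimeseriesDataframe.TAG_DELIMITER


class TagHolder:
    """Hypothetical stand-in for the tag string carried by a TSDF."""

    def __init__(self) -> None:
        self.tags = ""

    def add_tag(self, tag: str, check_membership: bool = False) -> None:
        # Assumed check: is the NEW tag a substring of the EXISTING tags?
        # Nothing tests whether an existing tag is a substring of the new one.
        if check_membership and self.tags and tag in self.tags:
            raise ValueError(f"tag {tag!r} matches existing tags {self.tags!r}")
        self.tags = tag if not self.tags else self.tags + TAG_DELIMITER + tag


holder = TagHolder()
holder.add_tag("01", True)
holder.add_tag("01a", True)  # "01a" is not a substring of "01": accepted

reverse = TagHolder()
reverse.add_tag("01a", True)
try:
    reverse.add_tag("01", True)  # "01" is a substring of "01a": rejected here
    rejected = False
except ValueError:
    rejected = True
```

Under this assumption the order of insertion matters: adding the shorter tag first passes the check, while adding it second does not.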
- copy_from_dataframe(df)
- classmethod from_dataframe(df, **kwargs)
- has_tag(pattern: str | Pattern, *, regex: RegexArg | None = None, exact: bool = False) bool
Check if the provided tag matches any of the dataframe’s tags.
- Parameters:
regex (RegexArg, optional, keyword-only) –
None: Uses python in operation to check for membership; expects a string to be supplied to pattern.
RegexArg: Uses the regex engine to search for the tag.
exact (bool) – Whether we require an exact match of the tag. This argument is superseded by a non-None regex argument, and may be accomplished (depending on the particulars) via regex by \b<regex>\b.
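The difference between membership matching and the \b<regex>\b idiom can be seen with the standard library alone. This sketch does not call has_tag(); the `tags` string is a made-up example, and it only illustrates why the regex route works "depending on the particulars".

```python
import re

# Hypothetical comma-delimited tag string.
tags = "01a,calibration,run-2"

# Membership with Python's `in`: a partial match is enough.
partial = "01" in tags

# Word-boundary regex: "01" is followed by "a", so there is no boundary.
bounded_miss = re.search(r"\b01\b", tags) is not None

# A whole tag surrounded by delimiters matches as expected.
bounded_hit = re.search(r"\bcalibration\b", tags) is not None

# Caveat: "-" is not a word character, so \b also matches inside "run-2".
hyphen_hit = re.search(r"\brun\b", tags) is not None

# Splitting on the delimiter gives a strict whole-tag comparison instead.
strict = "run" in tags.split(",")
```

The hyphen case is the kind of particular to watch for: \b treats any non-word character as a boundary, not only the tag delimiter.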