bulum.utils.dataframe_extensions module
Provides extensions to dataframes that facilitate tracking and bulk analysis.
TimeseriesDataframes (TSDF) are a wrapper around pandas dataframes, with extra fields (tags, name, source, …) and methods that facilitate working with these fields.
DataframeEnsembles are a way to organise multiple TSDFs, with methods that work (at present) primarily with the tags associated with TSDFs.
- class DataframeEnsemble(dfs: Iterable[TimeseriesDataframe] | None = None)
Bases: object
A DataframeEnsemble is a collection of bulum-style timeseries dataframes, which might represent collected results from a set of model runs. Each timeseries dataframe is stored in an internal object, with a little attached metadata. All timeseries in the ensemble are expected to have the same index and the same columns.
While the class exposes the ensemble property as a dict, you should generally use the provided methods to interact with the ensemble, viz. .add_dataframe(), and .get().
- add_dataframe(df: DataFrame | TimeseriesDataframe, key: Any | None = None, tag: str | None = None) None
- assert_df_shape_matches_ensemble(new_df: DataFrame | TimeseriesDataframe) None
Internal function to verify new dfs.
- df_shape_matches_ensemble(new_df: DataFrame | TimeseriesDataframe) bool
Internal function to verify new dfs.
- filter_tag(tag: str, *, exclude: bool = False, **kwargs) DataframeEnsemble
Return a new ensemble containing dataframes filtered by tag.
By default, it will include all dataframes whose tags partially match the provided tag.
This function delegates to TSDF.has_tag(); refer to that function for the available keyword arguments.
- Parameters:
tag – The tag to match: a string, a regex pattern string, or a compiled regex pattern. (Regex matching requires the regex argument to be set; see TSDF.has_tag().)
exclude (bool) – If True, it will filter out all dataframes which match the tag.
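The include/exclude behaviour can be pictured with a small, library-independent sketch. It assumes each member carries a comma-delimited tag string and that matching is a plain substring test; the function name filter_by_tag and the sample data are hypothetical and are not part of bulum.

```python
def filter_by_tag(members: dict, tag: str, *, exclude: bool = False) -> dict:
    """Return a new dict holding only the members selected by `tag`.

    Hypothetical stand-in for DataframeEnsemble.filter_tag(): `members` maps
    key -> comma-delimited tag string, and matching is a substring test.
    With exclude=True the selection is inverted.
    """
    return {
        key: tags
        for key, tags in members.items()
        if (tag in tags) != exclude
    }


# Hypothetical ensemble: key -> tag string
members = {0: "raw,scenario_a", 1: "validated,scenario_a", 2: "raw,scenario_b"}

kept = filter_by_tag(members, "raw")
dropped = filter_by_tag(members, "raw", exclude=True)
partial = filter_by_tag(members, "scenario")  # partial match hits every member

print(sorted(kept))     # keys whose tags contain "raw"
print(sorted(dropped))  # keys whose tags do not contain "raw"
print(sorted(partial))
```

The original collection is left untouched, mirroring the "return a new ensemble" wording above.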
- get(key: int | str | float | None = None) TimeseriesDataframe
Return the underlying dataframe if the ensemble is a singleton, or the dataframe at the given key.
- classmethod load(filename: str | Path, format_override: Literal['json', 'pickle', 'folder'] | None = None, pickle_safety_lock: bool = True) DataframeEnsemble
Load (deserialise) a DataframeEnsemble from disk.
- Parameters:
filename (str | Path) – Path to the file or folder to load
format_override ({"json", "pickle", "folder"}, optional) – Override automatic format detection from file extension/type
pickle_safety_lock (bool, default True) – Safety lock for pickle format. If True (default), raises ValueError when attempting to load pickle files. Set to False to allow pickle loading.
- Returns:
Loaded DataframeEnsemble with all member dataframes and metadata restored
- Return type:
DataframeEnsemble
- Raises:
FileNotFoundError – If the specified file or folder doesn’t exist
ValueError – If the file format is unsupported or data is malformed, or if attempting to load pickle with safety lock enabled
Examples
>>> ensemble = DataframeEnsemble.load("results.json")
>>> ensemble = DataframeEnsemble.load("results") # Auto-detects folder
>>> ensemble = DataframeEnsemble.load("results.zip") # Auto-extracts zip
>>> ensemble = DataframeEnsemble.load("results.pkl", pickle_safety_lock=False) # Unsafe!
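As a rough illustration of how format detection and the pickle safety lock might fit together, here is a minimal sketch. It is not the bulum implementation: the function names detect_format and guard_pickle, the suffix mapping, and the error messages are all assumptions made for this example.

```python
from pathlib import Path
from typing import Optional

# Assumed suffix-to-format mapping, for illustration only.
_SUFFIX_FORMATS = {".json": "json", ".pkl": "pickle", ".zip": "zip"}


def detect_format(filename, format_override: Optional[str] = None) -> str:
    """Pick a format from an explicit override, a directory, or the file suffix."""
    if format_override is not None:
        return format_override
    path = Path(filename)
    if path.is_dir():
        return "folder"
    try:
        return _SUFFIX_FORMATS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file format: {path.suffix!r}") from None


def guard_pickle(fmt: str, pickle_safety_lock: bool = True) -> None:
    """Refuse the pickle format unless the caller explicitly releases the lock."""
    if fmt == "pickle" and pickle_safety_lock:
        raise ValueError("Refusing to load pickle; pass pickle_safety_lock=False to allow it.")


fmt = detect_format("results.pkl")
try:
    guard_pickle(fmt)  # lock engaged by default
except ValueError as err:
    print(f"blocked: {err}")

guard_pickle(fmt, pickle_safety_lock=False)  # explicit opt-in, no error raised
```

Unpickling can execute arbitrary code, which is why an opt-in lock of this kind is a common precaution; only release it for files from a trusted source.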
- map(func: Callable[[TimeseriesDataframe], TimeseriesDataframe]) DataframeEnsemble
- map(func: Callable[[DataFrame], DataFrame]) DataframeEnsemble
Apply a function to all dataframes in the ensemble, returning a new ensemble with the results.
- Parameters:
func – Univariate function on dataframes.
- save(filename: str | Path, save_format: Literal['json', 'pickle', 'folder', 'zip', 'zip-folder'] = 'json', overwrite: bool = False) None
Save (serialise) this DataframeEnsemble to disk.
- Parameters:
filename (str | Path) – Path to save the file/folder (extension added automatically for some formats)
save_format ({"json", "pickle", "folder", "zip", "zip-folder"}, default "json") –
Format to save the ensemble in:
json: Save as a single JSON file with all data
pickle: Save as a pickle file (warning: pickle has security implications)
folder: Save as a directory with CSV files and metadata.json
zip: Save as a zip archive (creates folder, zips it, removes folder)
zip-folder: Save as both a zip archive and keep the folder
overwrite (bool, default False) – If True, overwrite existing files/folders. If False, raise FileExistsError.
- Raises:
ValueError – If save_format is not supported
TypeError – If ensemble contains keys with unsupported types (only int, str, float supported)
FileExistsError – If file/folder exists and overwrite=False
Examples
>>> ensemble = DataframeEnsemble()
>>> ensemble.add_dataframe(tsdf1, key=0)
>>> ensemble.save("results", save_format="folder")
>>> ensemble.save("results", save_format="json")
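The restriction on key types listed under Raises is easier to appreciate from the behaviour of Python's standard json module. The sketch below is an assumption about the motivation, not a quotation of bulum's code, and the helper name check_keys is hypothetical.

```python
import json


def check_keys(keys) -> None:
    """Raise TypeError for any key that is not an int, str or float (hypothetical helper)."""
    for key in keys:
        # Note: bool is a subclass of int, so it also passes this check.
        if not isinstance(key, (int, str, float)):
            raise TypeError(f"Unsupported ensemble key type: {type(key).__name__}")


check_keys([0, "baseline", 1.5])  # accepted silently

# The standard json module coerces numeric keys to strings on output...
encoded = json.dumps({0: "a", 1.5: "b"})
decoded = json.loads(encoded)
print(decoded)

# ...and rejects keys it cannot represent, such as tuples.
try:
    json.dumps({(0, 1): "a"})
    tuple_rejected = False
except TypeError as err:
    tuple_rejected = True
    print(f"rejected: {err}")
```

Because plain JSON returns every key as a string, a loader has to restore the original key types separately; how bulum handles this is not described in this section.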
- class RegexArg(*values)
Bases: Enum
Specifies the type of argument supplied to filtering functions in TSDF and DataframeEnsemble.
- OBJECT = 2
- PATTERN = 1
- class TimeseriesDataframe(data=None, *args, name='', source='', description='', **kwargs)
Bases: DataFrame
A TimeseriesDataframe is a thinly extended pd.DataFrame, abbreviated casually as TSDF throughout the documentation. It adds the following fields:
name (str)
source (str)
description (str)
tags (str), a delimited string of tags
Metadata Preservation
TimeseriesDataframe uses pandas' _metadata and _constructor mechanisms to automatically preserve metadata across most operations:
Preserved: slicing, arithmetic with scalars/Series, fillna, apply, copy, transpose, rank, reset_index, and most other standard pandas operations
Not preserved: rolling window operations, pandas.concat()
No guarantees: Binary operations between two TimeseriesDataframes (e.g., tsdf1 + tsdf2) may preserve metadata from either operand depending on pandas internals
For operations that don't preserve metadata, use the tsdf_apply() method.
Examples
>>> df = pd.DataFrame({'A': [1, 2, 3]})
>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="file.csv")
>>> tsdf.add_tag("raw,validated")
>>>
>>> # Metadata preserved automatically in most operations:
>>> result = tsdf * 2
>>> result.name # "test"
>>> result.tags # "raw,validated"
>>>
>>> # For rolling operations, use tsdf_apply:
>>> result = tsdf.tsdf_apply(lambda df: df.rolling(window=2).mean())
>>> result.name # "test" - metadata preserved
See also
TimeseriesDataframe.tsdf_apply: Apply functions while preserving metadata
TimeseriesDataframe.from_dataframe: Create from existing DataFrame
- TAG_DELIMITER = ','
- add_tag(tag: str | Iterable[str], check_membership: bool = False) None
Add a tag to the TimeseriesDataframe.
This is the canonical way to add tags to a TimeseriesDataframe. It can add multiple tags separated by the designated tag delimiter (by default, a comma ,).
Examples
The check_membership flag ensures that the new tag does not match any existing tag, but it does not (at present) check the reverse, i.e. whether an existing tag matches the new one. For example, the following will not raise an error:
df.add_tag("01", True)
df.add_tag("01a", True)
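The one-directional nature of this check can be reproduced with a short sketch. It assumes the membership test is a substring search of the new tag within the existing tag string; that assumption, the function name add_tag_checked, and the choice of ValueError are illustrative only and are not taken from bulum.

```python
TAG_DELIMITER = ","


def add_tag_checked(existing: str, tag: str) -> str:
    """Append `tag` to a delimited tag string, refusing it if it already matches.

    Assumed check: the new tag is searched for inside the existing string,
    so "01a" is not found in "01", whereas "01" is found in "01a".
    """
    if existing and tag in existing:
        raise ValueError(f"Tag {tag!r} matches existing tags {existing!r}")
    return tag if not existing else existing + TAG_DELIMITER + tag


# Same order as the example above: accepted, since "01a" is not a substring of "01".
tags = add_tag_checked("", "01")
tags = add_tag_checked(tags, "01a")
print(tags)

# Reverse order: "01" is a substring of "01a", so the second call is refused.
try:
    add_tag_checked(add_tag_checked("", "01a"), "01")
    reverse_refused = False
except ValueError as err:
    reverse_refused = True
    print(f"refused: {err}")
```

Under this assumption the outcome depends on the order in which overlapping tags are added, so tags that are prefixes of one another are best avoided.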
- classmethod from_dataframe(df, **kwargs) TimeseriesDataframe
Create a TimeseriesDataframe from an existing DataFrame.
- Parameters:
df (pd.DataFrame) – DataFrame to convert
**kwargs – Metadata fields (name, source, description)
- Returns:
New instance with data from df and metadata from kwargs
- Return type:
TimeseriesDataframe
Examples
>>> df = pd.DataFrame({'A': [1, 2, 3]})
>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="file.csv")
- has_tag(pattern: str | Pattern, *, regex: RegexArg | None = None, exact: bool = False) bool
Check if the provided tag matches any of the dataframe’s tags.
- Parameters:
pattern (str | Pattern) – The tag to match.
regex (RegexArg, optional, keyword-only) –
None: Uses the Python in operator to check for membership; expects a string to be supplied to pattern.
RegexArg: Uses the regex engine to search for the tag.
exact (bool) – Whether an exact match of the tag is required. This argument is superseded by a non-None regex argument; depending on the particulars, an exact match may instead be accomplished via regex with \b<regex>\b.
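The difference between the matching modes can be seen with the standard library alone. The sketch below works on a hypothetical comma-delimited tag string; it illustrates substring, exact, and word-boundary regex matching in general and does not reproduce has_tag() itself.

```python
import re

tags = "01a,raw-data,validated"

# Substring matching with the `in` operator: "01" is found inside "01a".
substring_hit = "01" in tags

# Exact matching, sketched here by splitting on the delimiter first.
exact_hit = "01" in tags.split(",")

# Regex search with \b word boundaries: "01" followed by "a" is not a whole word.
boundary_hit = re.search(r"\b01\b", tags) is not None
boundary_validated = re.search(r"\bvalidated\b", tags) is not None

# Caveat: "-" is not a word character, so \braw\b still matches inside "raw-data".
boundary_raw = re.search(r"\braw\b", tags) is not None

print(substring_hit, exact_hit, boundary_hit, boundary_validated, boundary_raw)
```

The last case is one example of why the \b approach only approximates exact matching: tags containing punctuation such as hyphens are split at word boundaries by the regex engine.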
- classmethod load(filename: str | Path, format_override: Literal['json', 'csv'] | None = None) TimeseriesDataframe
Load (deserialise) a TimeseriesDataframe from a file.
- Parameters:
filename (str | Path) – Path to the file to load
format_override ({"json", "csv"}, optional) – Override automatic format detection from file extension. If csv, will attempt to find <file>.json (where filename=<file>.csv) and vice versa (i.e. via suffix replacement).
- Returns:
Loaded TimeseriesDataframe with all metadata restored
- Return type:
TimeseriesDataframe
Examples
>>> tsdf = TimeseriesDataframe.load("output.json")
>>> tsdf = TimeseriesDataframe.load("output.csv") # will attempt to locate metadata JSON file
>>> tsdf = TimeseriesDataframe.load("data.dat", format_override="json")
- save(filename: str | Path, save_format: Literal['json', 'csv'] = 'json', overwrite: bool = False) None
Save (serialise) this TimeseriesDataframe to a file.
- Parameters:
filename (str | Path) – Path to save the file (extension will be added automatically)
save_format ({"json", "csv"}, default "json") –
Format to save the file in:
json: Save as JSON with all metadata
csv: Save as CSV file with separate metadata.json
overwrite (bool, default False) – If True, overwrite existing files. If False, raise FileExistsError.
Examples
>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="data.csv")
>>> tsdf.add_tag("validated")
>>> tsdf.save("output", save_format="json") # Creates output.json
>>> tsdf.save("output", save_format="csv") # Creates output.csv + output.metadata.json
- tsdf_apply(func: Callable[[DataFrame], DataFrame]) TimeseriesDataframe
Apply a function to the underlying dataframe, returning a new TimeseriesDataframe with the results. The metadata fields are copied over.
Note
Most pandas operations automatically preserve metadata via the _constructor property, including:
Standard .apply(): tsdf.apply(lambda x: x * 2) preserves metadata
Arithmetic, slicing, fillna, copy, transpose, etc.
Use tsdf_apply() only when the operation bypasses _constructor:
Rolling window operations: tsdf.rolling(window=2).mean()
Concatenation: pd.concat([tsdf1, tsdf2])
Creating new DataFrames from scratch: pd.DataFrame(tsdf.mean())
- Parameters:
func (Callable[[pd.DataFrame], pd.DataFrame]) – Function that takes a DataFrame and returns a DataFrame
- Returns:
New TimeseriesDataframe with the result of func(self) and all metadata fields (name, source, description, tags) copied from self
- Return type:
TimeseriesDataframe
Examples
>>> tsdf = TimeseriesDataframe.from_dataframe(df, name="test", source="data.csv")
>>> tsdf.add_tag("raw")
>>>
>>> # Pandas .apply() preserves metadata automatically (no need for tsdf_apply):
>>> result = tsdf.apply(lambda x: x * 2)
>>> result.name # "test" - metadata preserved!
>>>
>>> # Use tsdf_apply for rolling operations (metadata would be lost otherwise):
>>> result = tsdf.tsdf_apply(lambda df: df.rolling(window=2).mean())
>>> result.name # "test" - metadata preserved
>>>
>>> # Use tsdf_apply when creating new DataFrames from scratch:
>>> result = tsdf.tsdf_apply(lambda df: pd.DataFrame(df.mean()))
>>> result.tags # "raw" - metadata preserved
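The underlying pattern, running a function on the raw data and then re-attaching the metadata to the result, can be sketched without pandas. The class TaggedSeries, its method apply_keeping_metadata, and the helper rolling_mean_2 are hypothetical stand-ins for illustration; they are not bulum or pandas APIs.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class TaggedSeries:
    """Hypothetical stand-in for a TSDF: some data plus metadata fields."""
    values: list
    name: str = ""
    source: str = ""
    description: str = ""
    tags: str = ""

    def apply_keeping_metadata(self, func: Callable[[list], list]) -> "TaggedSeries":
        """Run func on the raw data only, then copy this object's metadata onto the result."""
        return replace(self, values=func(self.values))


def rolling_mean_2(values: list) -> list:
    """Two-point rolling mean; the first position has no full window, so it is None."""
    return [None] + [(a + b) / 2 for a, b in zip(values, values[1:])]


series = TaggedSeries([1.0, 2.0, 3.0], name="test", source="data.csv", tags="raw")
result = series.apply_keeping_metadata(rolling_mean_2)
print(result.values, result.name, result.tags)
```

The function never sees the metadata, so it cannot drop it; the wrapper is solely responsible for carrying the fields across, and the original object is left unchanged.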
See also
TimeseriesDataframe.from_dataframe: Create TSDF from regular DataFrame
pandas.DataFrame.apply: Standard pandas apply (preserves metadata for TSDF)