rics.ml.time_split.integration.pandas

Integration with the Pandas library.

Examples

Splitting a pandas.Series with split_pandas().

>>> import pandas as pd
>>> from rics.ml.time_split.integration.pandas import split_pandas
>>> index = pd.date_range("2022", "2022-1-10", freq="h")
>>> series = pd.Series(range(len(index)), index=index)
>>> series.sample(3, random_state=1999)
2022-01-04 00:00:00    72
2022-01-03 21:00:00    69
2022-01-04 14:00:00    86
dtype: int64

Series may only be split on the index. The log_progress keyword argument is optional.

>>> for fold in split_pandas(series, schedule="1d", log_progress="progress"):
...     print(
...         f"Summary of fold {tuple(map(pd.Timestamp.isoformat, fold.bounds))}:"
...         f"\n  {fold.data.mean()=}"
...         f"\n  {fold.future_data.mean()=}",
...     )
INFO:progress:Begin fold 1/2: ('2022-01-01' <= [schedule: '2022-01-08' (Saturday)] < '2022-01-09').
Summary of fold ('2022-01-01T00:00:00', '2022-01-08T00:00:00', '2022-01-09T00:00:00'):
  fold.data.mean()=83.5
  fold.future_data.mean()=179.5
INFO:progress:Finished fold 1/2 [schedule: '2022-01-08' (Saturday)] after 1ms.
INFO:progress:Begin fold 2/2: ('2022-01-02' <= [schedule: '2022-01-09' (Sunday)] < '2022-01-10').
Summary of fold ('2022-01-02T00:00:00', '2022-01-09T00:00:00', '2022-01-10T00:00:00'):
  fold.data.mean()=107.5
  fold.future_data.mean()=203.5
INFO:progress:Finished fold 2/2 [schedule: '2022-01-09' (Sunday)] after 873μs.

The split_pandas function yields PandasDatetimeSplit tuples.

Functions

split_pandas(data, schedule, *[, before, ...])

Split a pandas type.

Classes

PandasDatetimeSplit(data, future_data, bounds)

Time-based split of a pandas type.

split_pandas(data: PandasT, schedule: DatetimeIndex | Iterable[str | Timestamp | datetime | date | datetime64] | str | Timedelta | timedelta | timedelta64, *, before: int | Literal['all'] | str | Timedelta | timedelta | timedelta64 = '7d', after: int | Literal['all'] | str | Timedelta | timedelta | timedelta64 = 1, n_splits: int | None = None, flex: bool | Literal['auto'] | str = 'auto', step: int = 1, time_column: Hashable = None, inclusive: Literal['left', 'right', 'neither'] = 'left', log_progress: str | bool | Dict[str, Any] | Logger | LoggerAdapter = False) -> Iterable[PandasDatetimeSplit[PandasT]]

Split a pandas type.

This function splits indexed data (i.e. Series and DataFrame), not the index itself. To split a pandas Index type, use time_split.split instead, setting available=data.index.

Parameters:
  • data – A pandas data container type to split.

  • schedule – A collection of timestamps, a pandas offset alias, or a cron expression.

  • before – Range before schedule timestamps. Either a pandas offset alias, an integer (schedule-based offsets), or ‘all’ (requires available data).

  • after – Range after schedule timestamps. Either a pandas offset alias, an integer (schedule-based offsets), or ‘all’ (requires available data).

  • step – Select a subset of folds, preferring folds later in the schedule.

  • n_splits – Maximum number of folds, preferring folds later in the schedule.

  • flex – A pandas offset alias used to expand available data to its likely “true” limits. Pass False to disable.

  • time_column – A column in data to split on. Use index if None.

  • inclusive – Which side to make the splits inclusive on.

  • log_progress – Controls logging of fold progress. See log_split_progress() for details.

For more information about the schedule, before/after and flex-arguments, see the User guide.
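To make the meaning of the bounds and of inclusive='left' concrete, the sketch below reproduces fold 2/2 from the example at the top of this page using plain pandas boolean masks. It is only an illustration of the semantics implied by the example output, not the library's implementation; the bound values are copied from the printed fold summary, and the names start, mid and end are chosen here for readability.

```python
import pandas as pd

index = pd.date_range("2022", "2022-1-10", freq="h")
series = pd.Series(range(len(index)), index=index)

# Bounds of fold 2/2, copied from the example output on this page.
start, mid, end = map(pd.Timestamp, ("2022-01-02", "2022-01-09", "2022-01-10"))

# Left-inclusive ranges: each range keeps its left edge and drops its right edge.
data = series[(series.index >= start) & (series.index < mid)]
future_data = series[(series.index >= mid) & (series.index < end)]

print(data.mean(), future_data.mean())  # prints: 107.5 203.5
```

The means match the values logged for fold 2/2, which suggests that with the default inclusive='left' the schedule timestamp itself belongs to future_data rather than data.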

Yields:

Tuples (data, future_data, bounds).

Raises:
  • TypeError – If the chosen split attribute is not a timestamp.

  • ValueError – For disallowed inclusive values.

class PandasDatetimeSplit(data: PandasT, future_data: PandasT, bounds: DatetimeSplitBounds)

Bases: NamedTuple, Generic[PandasT]

Time-based split of a pandas type.

Warning

When running Python < 3.11, this is a @dataclass rather than a tuple.

data: PandasT

Data before bounds.mid.

future_data: PandasT

Data after bounds.mid.

bounds: DatetimeSplitBounds

The underlying bounds that produced this split.
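Because the class is a tuple on some Python versions and a @dataclass on others, it may help to see how the two access styles differ. The sketch below uses a hypothetical stand-in, SplitSketch, which only mirrors the three field names documented above; it is not the library's class, and the list and string values are placeholders for real pandas data and bounds.

```python
from typing import Any, NamedTuple


class SplitSketch(NamedTuple):
    """Hypothetical stand-in with the same field names as PandasDatetimeSplit."""

    data: Any
    future_data: Any
    bounds: Any


# Placeholder values; a real split holds pandas objects and DatetimeSplitBounds.
fold = SplitSketch(data=[1, 2, 3], future_data=[4], bounds=("start", "mid", "end"))

# Attribute access is available on named tuples and dataclasses alike.
by_name = (fold.data, fold.future_data, fold.bounds)

# Positional unpacking relies on tuple behaviour.
data, future_data, bounds = fold
```

Given the warning above, attribute access (fold.data, fold.future_data, fold.bounds) is likely the more portable choice for code that must run on Python versions both below and above 3.11.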