rics.ml.time_split.integration.pandas
Integration with the Pandas library.
Examples
Splitting a pandas.Series with split_pandas().
>>> import pandas as pd
>>> from rics.ml.time_split.integration.pandas import split_pandas
>>> index = pd.date_range("2022", "2022-1-10", freq="h")
>>> series = pd.Series(range(len(index)), index=index)
>>> series.sample(3, random_state=1999)
2022-01-04 00:00:00 72
2022-01-03 21:00:00 69
2022-01-04 14:00:00 86
dtype: int64
Series may only be split on the index. The schedule keyword argument is required, but log_progress is not.
>>> for fold in split_pandas(
...     series, schedule="1d", log_progress="progress"
... ):
...     print(
...         f"Summary of fold {tuple(map(pd.Timestamp.isoformat, fold.bounds))}:"
...         f"\n {fold.data.mean()=}"
...         f"\n {fold.future_data.mean()=}",
...     )
INFO:progress:Begin fold 1/2: ('2022-01-01' <= [schedule: '2022-01-08' (Saturday)] < '2022-01-09').
Summary of fold ('2022-01-01T00:00:00', '2022-01-08T00:00:00', '2022-01-09T00:00:00'):
fold.data.mean()=83.5
fold.future_data.mean()=179.5
INFO:progress:Finished fold 1/2 [schedule: '2022-01-08' (Saturday)] after 1ms.
INFO:progress:Begin fold 2/2: ('2022-01-02' <= [schedule: '2022-01-09' (Sunday)] < '2022-01-10').
Summary of fold ('2022-01-02T00:00:00', '2022-01-09T00:00:00', '2022-01-10T00:00:00'):
fold.data.mean()=107.5
fold.future_data.mean()=203.5
INFO:progress:Finished fold 2/2 [schedule: '2022-01-09' (Sunday)] after 873μs.
When splitting dataframes, you may optionally pass a time_column argument as well. By default, both frames and series are split along the index.
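To make the fold bounds concrete, the first fold of the Series example can be reproduced with plain pandas masks. This is only a sketch, not how split_pandas() is implemented: the half-open interval reading (start <= data < mid <= future_data < end) is an assumption inferred from the log line and the printed means above.

```python
import pandas as pd

index = pd.date_range("2022", "2022-1-10", freq="h")
series = pd.Series(range(len(index)), index=index)

# Bounds of fold 1/2, copied from the example output: (start, mid, end).
start, mid, end = map(pd.Timestamp, ("2022-01-01", "2022-01-08", "2022-01-09"))

# Assumed half-open intervals: start <= data < mid <= future_data < end.
data = series[(series.index >= start) & (series.index < mid)]
future_data = series[(series.index >= mid) & (series.index < end)]

print(data.mean(), future_data.mean())  # 83.5 179.5
```

The means match those printed for fold 1/2, which supports reading the middle bound as the cutoff between data and future_data.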
Functions
split_pandas() | Split a pandas type.
split_pandas(data: PandasT, time_column: Hashable = None, *, log_progress: str | bool | dict[str, Any] | Logger | LoggerAdapter = False, **kwargs: Unpack[DatetimeIndexSplitterKwargs]) -> Iterable[DatetimeSplit[PandasT]]
Split a pandas type.
This function splits indexed data (i.e. Series and DataFrame), not the index itself. Use time_split.split for pandas Index types, setting available=data.index.
- Parameters:
data – A pandas data container type to split; either a Series or a DataFrame.
time_column – A column in data to split on. Use data.index if None.
log_progress – Controls logging of fold progress. See log_split_progress() for details.
**kwargs – See split(). The available keyword is managed by the integration.
For more information about the schedule, before/after and flex-arguments, see the User guide.
- Yields:
Tuples (data, future_data, bounds).
- Raises:
TypeError – If time_column does not denote a datetime index-like field.
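The time_column and TypeError rules above can be illustrated with a small stand-alone sketch. The helper resolve_time_values() is hypothetical and not part of rics; it only mirrors the documented behaviour (use data.index when time_column is None, raise TypeError for non-datetime fields). Rejecting time_column for a Series is an assumption based on the note that a Series may only be split on the index.

```python
from typing import Hashable, Optional, Union

import pandas as pd


def resolve_time_values(
    data: Union[pd.Series, pd.DataFrame],
    time_column: Optional[Hashable] = None,
) -> pd.Index:
    """Hypothetical helper (not a rics function): pick the values to split on."""
    if time_column is None:
        values = data.index  # Documented default: split along the index.
    elif isinstance(data, pd.Series):
        # Assumed behaviour: a Series has no columns to select from.
        raise TypeError("A Series may only be split on the index.")
    else:
        values = pd.Index(data[time_column])

    if not pd.api.types.is_datetime64_any_dtype(values.dtype):
        raise TypeError(f"Not a datetime-like field: dtype={values.dtype}.")
    return values


df = pd.DataFrame(
    {
        "time": pd.date_range("2022", periods=4, freq="D"),
        "value": [1, 2, 3, 4],
    }
)
times = resolve_time_values(df, "time")  # Datetime column: accepted.
```

Calling resolve_time_values(df) without time_column raises TypeError in this sketch, because the default integer index of df is not datetime-like.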