rics.pandas

Utility functions for pandas.

Classes

`DatetimeSplitter`(schedule, before, after, ...)	See `TimeFold.make_sklearn_splitter()`.
`TimeFold`(time, data, future_data)	Create temporal k-folds from a `DataFrame` for cross-validation.

class TimeFold(time: Timestamp, data: DataFrame, future_data: DataFrame)

Bases: NamedTuple

Create temporal k-folds from a DataFrame for cross-validation.

Folds are half-open: the left bound of each range is inclusive and the right bound is exclusive.

Use TimeFold.iter() to create folds, or TimeFold.make_sklearn_splitter() to create a scikit-learn compatible splitter for cross-validation.

The ranges surrounding the scheduled times are determined by the before and after arguments, interpreted as follows based on type:

Before/after argument options.
Argument type	Interpretation
String `'all'`	Include all data before/after the scheduled time. Equivalent to `max_train_size=None` when using TimeSeriesSplit.
`int > 0`	Include all data within N schedule periods from the scheduled time.
Anything else	Passed as-is to the `pandas.Timedelta` class. Must be positive. See Offset aliases for valid frequency strings.
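
The three rows of the table can be read as a small dispatch on the argument type. The sketch below is only an illustration of that reading, not the library's code: `resolve_window` is a hypothetical helper, and it assumes a fixed-period schedule so that "N schedule periods" is simply N times the period.

```python
import pandas as pd


def resolve_window(arg, schedule_period):
    """Hypothetical helper: map a before/after argument to a Timedelta.

    Returns None for 'all', meaning the window is unbounded.
    Assumes a fixed-period schedule (e.g. '68h'), not a cron expression.
    """
    if isinstance(arg, str) and arg == "all":
        return None  # include all data before/after the scheduled time
    if isinstance(arg, int):
        if arg <= 0:
            raise ValueError("integer arguments must be > 0")
        # N schedule periods from the scheduled time
        return arg * pd.Timedelta(schedule_period)
    delta = pd.Timedelta(arg)  # anything else is passed as-is
    if delta <= pd.Timedelta(0):
        raise ValueError("time deltas must be positive")
    return delta


print(resolve_window("all", "68h"))
print(resolve_window(2, "68h"))
print(resolve_window("1 day", "68h"))
```

With a cron schedule the gaps between scheduled times may differ, so an integer argument cannot be reduced to a single Timedelta in this way.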

Folds always lie fully within the available time span, but empty data or future_data frames are possible if the data is not continuous.

Examples

Iterating over folds using TimeFold.iter.

>>> import pandas as pd
>>> from rics.pandas import TimeFold
>>> df = pd.DataFrame({'time': pd.date_range('2022', '2022-1-15', freq='7h')})
>>> for fold in TimeFold.iter(df, schedule='68h', after='1d'):
...     print(fold)
TimeFold('2022-01-06 16:00:00': data.shape=(17, 1), future_data.shape=(3, 1))
TimeFold('2022-01-09 12:00:00': data.shape=(18, 1), future_data.shape=(3, 1))
TimeFold('2022-01-12 08:00:00': data.shape=(17, 1), future_data.shape=(4, 1))

The TimeFold class is a named tuple, so it can be unpacked.

>>> for t, d, fd in TimeFold.iter(df, schedule='68h', after='1d'):
...     print(f"{t}: {len(d)=}, {len(fd)=}")
2022-01-06 16:00:00: len(d)=17, len(fd)=3
2022-01-09 12:00:00: len(d)=18, len(fd)=3
2022-01-12 08:00:00: len(d)=17, len(fd)=4
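
The row counts above can be reproduced without rics, which helps show where the window boundaries fall. This is a sketch and not the library's implementation. It assumes half-open windows `[t - before, t)` and `[t, t + after)`, and a `before` window of 5 days; that value is inferred from the counts and is not stated in this section.

```python
from datetime import datetime, timedelta

# Timestamps every 7 hours from 2022-01-01 through 2022-01-15, as in the example.
start, end = datetime(2022, 1, 1), datetime(2022, 1, 15)
times = []
t = start
while t <= end:
    times.append(t)
    t += timedelta(hours=7)

schedule_step = timedelta(hours=68)
before = timedelta(days=5)  # ASSUMPTION: inferred from the row counts above
after = timedelta(days=1)


def count_between(lo, hi):
    """Count timestamps in the half-open interval [lo, hi)."""
    return sum(lo <= ts < hi for ts in times)


folds = []
anchor = start
while anchor <= end:
    # Keep only folds whose windows lie fully within the available time span.
    if anchor - before >= start and anchor + after <= end:
        n_data = count_between(anchor - before, anchor)
        n_future = count_between(anchor, anchor + after)
        folds.append((anchor, n_data, n_future))
    anchor += schedule_step

for anchor, n_data, n_future in folds:
    print(f"{anchor}: len(d)={n_data}, len(fd)={n_future}")
```

Under these assumptions the sketch yields the same three scheduled times and counts as the example. The second fold is the informative one: its 18 rows include the timestamp that falls exactly on the left bound, 2022-01-04 12:00:00.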

Including all data before the scheduled time.

>>> for fold in TimeFold.iter(df, schedule='68h', before='all'):
...     print(fold)
TimeFold('2022-01-03 20:00:00': data.shape=(10, 1), future_data.shape=(10, 1))
TimeFold('2022-01-06 16:00:00': data.shape=(20, 1), future_data.shape=(10, 1))
TimeFold('2022-01-09 12:00:00': data.shape=(30, 1), future_data.shape=(9, 1))

Plotting folds using TimeFold.plot.

>>> from rics import configure_stuff; configure_stuff()
>>> data = pd.date_range('2022', '2022-1-21', freq='38min')
>>> TimeFold.plot(data, schedule='0 0 * * MON,FRI', before='all')

The expression '0 0 * * MON,FRI' means “every Monday and Friday at midnight”. The numbers shown are the row counts for the fold.
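
To make the cron expression concrete, the following sketch lists the timestamps it matches over the plotted date range, using only the standard library. It does not use croniter or rics; the weekday filter is a hand-written stand-in for the expression.

```python
from datetime import datetime, timedelta

start, end = datetime(2022, 1, 1), datetime(2022, 1, 21)
MONDAY, FRIDAY = 0, 4  # datetime.weekday() numbering

# Walk the range one day at a time, keeping midnights that fall on MON or FRI.
anchors = []
day = start
while day <= end:
    if day.weekday() in (MONDAY, FRIDAY):
        anchors.append(day)
    day += timedelta(days=1)

for anchor in anchors:
    print(anchor)
```

These are candidate scheduled times only. Which of them produce a fold still depends on the before and after arguments, since folds must lie fully within the available time span.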

With after=1 (the default), our Future data expands until the next scheduled time. This may be interpreted as “taking a step forward” in the schedule. Using integer before arguments works analogously, in the opposite direction. Vertical lines indicate outer limits of the data.
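
Because the Monday/Friday schedule is unevenly spaced, an integer argument gives windows of varying width. The sketch below shows one reading of "taking a step forward". `future_window` is a hypothetical helper, and the scheduled times are hard-coded for the plotted range rather than derived from the cron expression.

```python
from datetime import datetime

# Mondays and Fridays at midnight between 2022-01-01 and 2022-01-21.
anchors = [datetime(2022, 1, day) for day in (3, 7, 10, 14, 17, 21)]


def future_window(index, after=1):
    """Hypothetical helper: span from anchors[index] to the anchor `after` steps later.

    Returns None when the schedule has no anchor that far ahead.
    """
    end_index = index + after
    if end_index >= len(anchors):
        return None
    return anchors[index], anchors[end_index]


for i, anchor in enumerate(anchors):
    window = future_window(i)
    if window is not None:
        days = (window[1] - window[0]).days
        print(f"{anchor.date()}: future window spans {days} days")
```

Windows that start on a Monday span four days and windows that start on a Friday span three, so the row counts per fold differ even when the data is evenly sampled.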

Notes

This method may be used to create temporal folds from heterogeneous/unaggregated data, typically used for training models (e.g. on raw transaction data). If your data is a well-formed time series, consider using the TimeSeriesSplit class from scikit-learn instead.

time: Timestamp – The scheduled time. Determined by the schedule argument.

data: DataFrame – Data before the scheduled time. Determined by the before argument.

future_data: DataFrame – ("Future") data, after the scheduled time. Determined by the after argument.

TimeFold.iter()

Create temporal k-folds from a heterogeneous DataFrame.

Parameters:

df – A pandas DataFrame.
schedule – Timestamps which denote the anchor dates of the folds (e.g. training dates). If a Timedelta or str, create schedule from the start of df[time_column]. Alternatively, you may pass a cron expression (requires croniter).
before – The period before the scheduled time to include. See Before/after argument options.
after – The period after the scheduled time to include. See Before/after argument options.
n_splits – Maximum number of splits, preferring later folds. Has no effect if the actual number of splits given df is less than n_splits.
time_column – Column to base the folds on if DataFrame-type data is given, ignored otherwise. Pass None to use DataFrame.index.
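
The n_splits behavior described above, keeping at most N folds and preferring the later ones, can be pictured with a bounded deque. This is a sketch of the described behavior, not the library's code; `keep_latest` is a hypothetical helper, and treating None as "no limit" is an assumption.

```python
from collections import deque


def keep_latest(folds, n_splits=None):
    """Hypothetical helper: keep at most n_splits items, preferring the latest.

    ASSUMPTION: n_splits=None means no limit.
    """
    if n_splits is None:
        return list(folds)
    # A deque with maxlen discards the oldest items as new ones arrive.
    return list(deque(folds, maxlen=n_splits))


print(keep_latest(["fold-1", "fold-2", "fold-3", "fold-4"], n_splits=2))
print(keep_latest(["fold-1", "fold-2"], n_splits=5))
```

When fewer folds are available than requested, all of them are kept, which matches the note that n_splits then has no effect.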

Yields:

Tuples TimeFold(time, data, future_data).