Utility functions for pandas.
Classes
| TimeFold | Create temporal k-folds from a DataFrame for cross-validation. |
Bases: NamedTuple
Create temporal k-folds from a DataFrame for cross-validation.
Fold ranges include their left endpoint and exclude their right endpoint (inclusive left).
Use TimeFold.iter() to create folds, or TimeFold.make_sklearn_splitter() to create a scikit-learn-compatible
splitter for cross-validation.
The ranges surrounding the scheduled times are determined by the before and after arguments,
interpreted as follows based on type:
| Argument type | Interpretation |
|---|---|
| String 'all' | Include all data before/after the scheduled time. |
| Integer N | Include all data within N schedule periods from the scheduled time. |
| Anything else | Passed as-is to the |
Folds always lie fully within the available time span, but empty data or future_data frames are
possible if the data is not continuous.
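The range logic described above can be approximated in plain pandas. The following is a minimal sketch, not the TimeFold implementation: the function name sketch_folds is hypothetical, it supports only Timedelta-like schedule, before and after arguments, and it assumes left-inclusive ranges and a five-day before window.

```python
import pandas as pd


def sketch_folds(df, schedule, before, after, time_column="time"):
    """Yield (anchor, data, future_data) tuples; a simplified stand-in for TimeFold.iter."""
    time = df[time_column]
    first, last = time.min(), time.max()
    before, after = pd.Timedelta(before), pd.Timedelta(after)

    # Anchors are spaced `schedule` apart, starting at the first timestamp.
    for anchor in pd.date_range(first, last, freq=pd.Timedelta(schedule)):
        start, stop = anchor - before, anchor + after
        if start < first or stop > last:
            continue  # the fold would extend beyond the available time span
        data = df[(time >= start) & (time < anchor)]  # [start, anchor)
        future_data = df[(time >= anchor) & (time < stop)]  # [anchor, stop)
        yield anchor, data, future_data


df = pd.DataFrame({"time": pd.date_range("2022", "2022-1-15", freq="7h")})
folds = list(sketch_folds(df, schedule="68h", before="5d", after="1d"))
for anchor, data, future_data in folds:
    print(anchor, data.shape, future_data.shape)
```

Under these assumptions, the sketch yields three folds anchored at 2022-01-06 16:00, 2022-01-09 12:00 and 2022-01-12 08:00, with 17/3, 18/3 and 17/4 rows of data/future data respectively.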
Examples
Iterating over folds using TimeFold.iter.
>>> df = pd.DataFrame({'time': pd.date_range('2022', '2022-1-15', freq='7h')})
>>> for fold in TimeFold.iter(df, schedule='68h', after='1d'):
... print(fold)
TimeFold('2022-01-06 16:00:00': data.shape=(17, 1), future_data.shape=(3, 1))
TimeFold('2022-01-09 12:00:00': data.shape=(18, 1), future_data.shape=(3, 1))
TimeFold('2022-01-12 08:00:00': data.shape=(17, 1), future_data.shape=(4, 1))
The TimeFold class is a named tuple, so it can be unpacked.
>>> for t, d, fd in TimeFold.iter(df, schedule='68h', after='1d'):
... print(f"{t}: {len(d)=}, {len(fd)=}")
2022-01-06 16:00:00: len(d)=17, len(fd)=3
2022-01-09 12:00:00: len(d)=18, len(fd)=3
2022-01-12 08:00:00: len(d)=17, len(fd)=4
Including all data before the scheduled time.
>>> for fold in TimeFold.iter(df, schedule='68h', before='all'):
... print(fold)
TimeFold('2022-01-03 20:00:00': data.shape=(10, 1), future_data.shape=(10, 1))
TimeFold('2022-01-06 16:00:00': data.shape=(20, 1), future_data.shape=(10, 1))
TimeFold('2022-01-09 12:00:00': data.shape=(30, 1), future_data.shape=(9, 1))
Plotting folds using TimeFold.plot.
>>> from rics import configure_stuff; configure_stuff()
>>> df = pd.DataFrame({'time': pd.date_range('2022', '2022-1-21')})
>>> TimeFold.plot(df, schedule='0 0 * * MON,FRI')
The expression '0 0 * * MON,FRI' means “every Monday and Friday at midnight”.
With after=1 (the default), the “Future” data extends until the next scheduled time. This may be interpreted
as “taking a step forward” in the schedule. Integer before arguments work analogously, in the
opposite direction. Vertical lines indicate the outer limits of df.
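To make the “step forward” interpretation concrete, the sketch below rebuilds the Monday/Friday schedule with plain pandas instead of a cron parser and counts the rows that fall between consecutive scheduled times. It is an illustration only: the variable names are hypothetical, the weekday filter is a stand-in for the cron expression, and the right-open window is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"time": pd.date_range("2022", "2022-1-21")})

# Stand-in for '0 0 * * MON,FRI': midnights that fall on a Monday (0) or Friday (4).
days = pd.date_range(df["time"].min(), df["time"].max(), freq="D")
anchors = days[days.dayofweek.isin([0, 4])]

# after=1: the future window runs from each anchor up to (not including) the next one.
windows = []
for start, stop in zip(anchors[:-1], anchors[1:]):
    future = df[(df["time"] >= start) & (df["time"] < stop)]
    windows.append((start.date(), stop.date(), len(future)))

for start, stop, n_rows in windows:
    print(f"{start} -> {stop}: {n_rows} rows")
```

Because the schedule is irregular, the windows alternate between four rows (Monday to Friday) and three rows (Friday to Monday).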
Notes
This method may be used to create temporal folds from heterogeneous/unaggregated data, typically used for training models (e.g. on raw transaction data). If your data is a well-formed time series, consider using the TimeSeriesSplit class from scikit-learn instead.
future_data – (“Future”) data after the scheduled time. Determined by the after argument.
Create temporal k-folds from a heterogeneous DataFrame.
df – A pandas DataFrame.
schedule – Timestamps which denote the anchor dates of the folds (e.g. training dates). If a Timedelta or
str, create schedule from the start of df[time_column]. Alternatively, you may pass a cron
expression (requires croniter).
before – The period before the scheduled time to include. See Before/after argument options.
after – The period after the scheduled time to include. See Before/after argument options.
time_column – Column to base the folds on. Use index if None.
Tuples TimeFold(time, data, future_data).
See also
The TimeFold.plot() method, which may be used to visualize temporal folds.
Create a scikit-learn compatible splitter.
schedule – Timestamps which denote the anchor dates of the folds (e.g. training dates). If a Timedelta or
str, create schedule from the start of df[time_column]. Alternatively, you may pass a cron
expression (requires croniter).
before – The period before the scheduled time to include. See Before/after argument options.
after – The period after the scheduled time to include. See Before/after argument options.
time_column – Column to base the folds on. Use index if None. If given, the returned splitter will not
be able to handle y-arguments.
A sklearn-compatible splitter backed by TimeFold.iter().
See also
The TimeFold.plot() method, which may be used to visualize temporal folds.
Plot the intervals that would be returned by TimeFold.iter() if invoked with the same parameters.
df – A pandas DataFrame.
schedule – Timestamps which denote the anchor dates of the folds (e.g. training dates). If a Timedelta or
str, create schedule from the start of df[time_column]. Alternatively, you may pass a cron
expression (requires croniter).
before – The period before the scheduled time to include for each iteration. See Before/after argument options.
after – The period after the scheduled time to include for each iteration. See Before/after argument options.
time_column – Column to base the folds on. Use index if None.
**kwargs – Keyword arguments for matplotlib.pyplot.subplots().
A Figure object.
ValueError – For empty ranges.
Bases: object
See TimeFold.make_sklearn_splitter().
Returns the number of splitting iterations with the given arguments.
Generate indices to split data into training and test set.
X – Training data (features). Must be a pandas type.
y – Target variable. Must be a pandas type.
groups – Always ignored, exists for compatibility.
The training/test set indices for that split.
ValueError – If both X and y are None.
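The splitter interface described here (a split method yielding training/test indices, plus a split count) can be illustrated with a small self-contained class. This is a hedged sketch rather than the class returned by TimeFold.make_sklearn_splitter(): SketchTimeSplitter and its helper methods are hypothetical, it reads timestamps from the index of X (or y), and it accepts only Timedelta-like arguments.

```python
import numpy as np
import pandas as pd


class SketchTimeSplitter:
    """Hypothetical stand-in for a time-based, scikit-learn-style splitter."""

    def __init__(self, schedule, before, after):
        self.schedule = pd.Timedelta(schedule)
        self.before = pd.Timedelta(before)
        self.after = pd.Timedelta(after)

    @staticmethod
    def _time(X, y):
        if X is None and y is None:
            raise ValueError("At least one of X and y must be given.")
        return (X if X is not None else y).index  # assumed to be a DatetimeIndex

    def _anchors(self, time):
        first, last = time.min(), time.max()
        anchors = pd.date_range(first, last, freq=self.schedule)
        # Keep only anchors whose windows lie fully within the available span.
        return [a for a in anchors if a - self.before >= first and a + self.after <= last]

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(self._anchors(self._time(X, y)))

    def split(self, X=None, y=None, groups=None):
        time = self._time(X, y)
        for anchor in self._anchors(time):
            train = (time >= anchor - self.before) & (time < anchor)
            test = (time >= anchor) & (time < anchor + self.after)
            # Yield positional indices, as scikit-learn splitters do.
            yield np.flatnonzero(train), np.flatnonzero(test)


index = pd.date_range("2022", "2022-1-15", freq="7h")
X = pd.DataFrame({"x": range(len(index))}, index=index)
splitter = SketchTimeSplitter(schedule="68h", before="5d", after="1d")
splits = list(splitter.split(X))
```

Yielding positional integer arrays keeps the sketch usable with both DataFrame.iloc and NumPy-style indexing; groups is accepted but ignored, mirroring the documented behavior.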