Fetching data from remote sources
Get data from a remote source, then cache it locally. Postprocessing is also supported, in which case both the raw and the postprocessed data are stored.
- misc.get_local_or_remote(remote_root: Union[str, bytes, PathLike], local_root: Union[str, bytes, PathLike] = '.', force: bool = False, postprocessor: Optional[Callable[[str], Any]] = None, show_progress: bool = False) → Path
Retrieve the path of a local file, downloading it if needed.
If the file is not available under the local root path, it will be downloaded using requests.get. A postprocessor may be given, in which case the name of the final file will be
local_root/<name-of-postprocessor>/file. Removing a raw local file (i.e. local_root/file) will invalidate postprocessed files as well.
- Parameters
file – A file to retrieve or download.
remote_root – Remote URL where the data may be retrieved using requests.get.
local_root – Local directory where the file may be cached.
force – If True, always download and apply processing (if applicable). Existing files will be overwritten.
postprocessor – A function which takes a single argument input_path and returns a pickleable type.
show_progress – If True, show a progress bar. Requires the tqdm package.
- Returns
An absolute path to the data.
- Raises
ValueError – If local root path does not exist or is not a directory.
ValueError – If the local file does not exist and remote==None.
ModuleNotFoundError – If the tqdm package is not installed but show_progress==True.
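The cache-then-download behaviour described above can be sketched roughly as follows. This is an illustration only and is not the library's implementation: fetch_with_cache and its download argument are hypothetical names, and the download step is passed in as a callable so that the sketch runs without network access (the real function is documented to use requests.get).

```python
from pathlib import Path
from typing import Callable


def fetch_with_cache(
    file: str,
    local_root: str,
    download: Callable[[str], bytes],
    force: bool = False,
) -> Path:
    """Hypothetical helper: return the cached path, calling `download` only when needed."""
    root = Path(local_root)
    if not root.is_dir():
        raise ValueError(f"Local root '{root}' does not exist or is not a directory.")

    path = root / file
    if force or not path.is_file():
        # Cache miss (or forced refresh): fetch the bytes and overwrite the local copy.
        path.write_bytes(download(file))
    return path.absolute()
```

With this shape, a second call for the same file finds the local copy and never invokes download, which mirrors the "returns immediately" behaviour of a cached file.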
Warning
This function is meant for manual work. There is no automatic handling of failures of any kind.
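Since failures are not handled automatically, callers who need some resilience could wrap the call themselves. The following is a minimal sketch of one way to do that; call_with_retries is a hypothetical helper and not part of the library, and the linear back-off is an arbitrary choice.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(func: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Hypothetical helper: call `func` until it succeeds or `attempts` is exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller.
            time.sleep(delay * attempt)  # Linear back-off between attempts.
    raise AssertionError("unreachable")
```

It might be used as call_with_retries(lambda: get_local_or_remote(file, remote_root, local_root)). If a failed download can leave a partial file behind, passing force=True inside the lambda may be the safer choice, though that depends on behaviour not documented here.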
Example: Downloading data with local cache
Fetch the Name Basics table (a gzipped TSV file) of the IMDb dataset.
>>> from rics.utility.misc import get_local_or_remote
>>> import pandas as pd
>>>
>>> file = "name.basics.tsv.gz"
>>> local_root = "my-data" # default = "."
>>> remote_root = "https://datasets.imdbws.com"
>>> path = get_local_or_remote(file, remote_root, local_root, show_progress=True)
>>> pd.read_csv(path, sep="\t").shape
https://datasets.imdbws.com/name.basics.tsv.gz: 100%|██████████| 214M/214M [00:05<00:00, 39.3MiB/s]
(11453719, 6)
We had to download name.basics.tsv.gz the first time, but get_local_or_remote returns immediately the second
time it is called. Fetching can be forced using force=True.
>>> path = get_local_or_remote(file, remote_root, local_root, show_progress=True)
>>> pd.read_csv(path, sep="\t").shape
(11453719, 6)
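A postprocessor is documented as a function that takes a single input_path argument and returns a pickleable value. As a rough illustration, the sketch below defines a hypothetical postprocessor, read_header, which returns the column names of a gzipped TSV file. It is exercised on a small stand-in file with made-up columns rather than the real download, so it runs without network access.

```python
import csv
import gzip
import pickle
import tempfile
from pathlib import Path
from typing import List


def read_header(input_path: str) -> List[str]:
    """Hypothetical postprocessor: return the column names of a gzipped TSV file."""
    with gzip.open(input_path, mode="rt", encoding="utf-8", newline="") as f:
        return next(csv.reader(f, delimiter="\t"))


# Try it on a tiny stand-in file (made-up columns), not the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.tsv.gz"
    with gzip.open(sample, mode="wt", encoding="utf-8", newline="") as f:
        f.write("id\tname\n1\tAlice\n")
    header = read_header(str(sample))

# The result should survive a pickle round trip to be usable as a postprocessor output.
assert pickle.loads(pickle.dumps(header)) == header
print(header)  # ['id', 'name']
```

Passing postprocessor=read_header to get_local_or_remote should then, going by the description above, store the processed result under local_root/<name-of-postprocessor>/file alongside the raw download.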