Miscellaneous utility methods for Python applications.
Functions
Combine import_module() and getattr() to retrieve items by name.
Retrieve the path of a local file, downloading it if needed.
Read an environment variable if arg is prefixed by env_marker, otherwise return arg as-is.
Check if obj is serializable using Pickle.
Get name of method or class.
Combine import_module() and getattr() to retrieve items by name.
name – A name or fully qualified name.
default_module – A namespace to search if name is not fully qualified (contains no '.'-characters).
An object with the fully qualified name name.
ValueError – If name does not contain any dots and default_module=None.
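The lookup described above can be illustrated with a minimal sketch. The function name get_by_name_sketch is hypothetical and is not part of this module; the sketch only assumes the documented behaviour (split on the last dot, fall back to default_module, raise ValueError otherwise).

```python
from importlib import import_module
from typing import Any, Optional


def get_by_name_sketch(name: str, default_module: Optional[str] = None) -> Any:
    """Hypothetical helper: resolve ``name`` using import_module() and getattr()."""
    if "." in name:
        # Fully qualified: everything before the last dot is the module path.
        module_name, _, attr_name = name.rpartition(".")
    elif default_module is not None:
        # Not qualified: search the given default namespace instead.
        module_name, attr_name = default_module, name
    else:
        raise ValueError(f"Name {name!r} contains no dots and default_module=None.")
    return getattr(import_module(module_name), attr_name)
```

Under these assumptions, get_by_name_sketch("collections.OrderedDict") and get_by_name_sketch("OrderedDict", default_module="collections") would resolve to the same class.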
Get name of method or class.
arg – Something to get a name for.
A type name.
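One way such a name lookup could work is sketched below. The helper type_name_sketch is hypothetical, and the exact rules used by this module are not stated here; the sketch assumes that classes report their own name, callables report their qualified name, and other objects report the name of their class.

```python
from typing import Any


def type_name_sketch(arg: Any) -> str:
    """Hypothetical helper: return a readable name for a class, callable or instance."""
    if isinstance(arg, type):
        return arg.__name__  # a class: use its own name
    if hasattr(arg, "__qualname__"):
        return arg.__qualname__  # functions and methods, qualified by their owner
    return type(arg).__name__  # any other object: fall back to the name of its class
```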
Retrieve the path of a local file, downloading it if needed.
If file is not available at the local root path, it will be downloaded using requests.get. A postprocessor may
be given in which case the name of the final file will be local_root/<name-of-postprocessor>/file. Removing
a raw local file (i.e. local_root/file) will invalidate postprocessed files as well.
file – A file to retrieve or download.
remote_root – Remote URL where the data may be retrieved using requests.get.
local_root – Local directory where the file may be cached.
force – If True, always download and apply processing (if applicable). Existing files will be overwritten.
postprocessor – A function which takes a single argument input_path and returns a pickleable type.
show_progress – If True, show a progress bar. Requires the tqdm package.
An absolute path to the data.
ValueError – If local root path does not exist or is not a directory.
ValueError – If the local file does not exist and remote_root=None.
ModuleNotFoundError – If the tqdm package is not installed but show_progress=True.
Examples
Fetch the Name Basics table (a gzipped TSV file) of the IMDb dataset.
>>> from rics.utility.misc import get_local_or_remote
>>> import pandas as pd
>>>
>>> file = "name.basics.tsv.gz"
>>> local_root = "my-data" # default = "."
>>> remote_root = "https://datasets.imdbws.com"
>>> path = get_local_or_remote(file, remote_root, local_root, show_progress=True)
>>> pd.read_csv(path, sep="\t").shape
https://datasets.imdbws.com/name.basics.tsv.gz: 100%|██████████| 214M/214M [00:05<00:00, 39.3MiB/s]
(11453719, 6)
We had to download name.basics.tsv.gz the first time, but get_local_or_remote returns immediately the second
time it is called. Fetching can be forced using force=True.
>>> path = get_local_or_remote(file, remote_root, local_root, show_progress=True)
>>> pd.read_csv(path, sep="\t").shape
(11453719, 6)
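The caching rule behind this behaviour can be sketched without any network access. Both helpers below are hypothetical and are not part of this module; the sketch assumes that the postprocessor's __name__ is what appears as <name-of-postprocessor> in the final path.

```python
from pathlib import Path
from typing import Callable, Optional


def cached_path_sketch(
    file: str,
    local_root: str = ".",
    postprocessor: Optional[Callable] = None,
) -> Path:
    """Hypothetical helper: compute where the final file is expected to live."""
    root = Path(local_root)
    if not root.is_dir():
        raise ValueError(f"Local root {str(root)!r} does not exist or is not a directory.")
    if postprocessor is None:
        return (root / file).resolve()  # raw file: local_root/file
    # Postprocessed file: local_root/<name-of-postprocessor>/file
    # (assumption: the name is taken from postprocessor.__name__)
    return (root / postprocessor.__name__ / file).resolve()


def needs_download_sketch(file: str, local_root: str = ".", force: bool = False) -> bool:
    """Return True if the raw file must be fetched: either forced or missing locally."""
    return force or not (Path(local_root) / file).is_file()
```

With these rules, a second call for the same file finds local_root/file on disk and skips the download, unless force=True is given.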
Read an environment variable if arg is prefixed by env_marker, otherwise return arg as-is.
arg – A literal value or environment variable to read.
env_marker – A prefix which indicates that arg should be interpreted as environment variable name.
default – Default value to use if the variable denoted by arg doesn’t exist.
A processed version of arg, where the final response is ans_type(processed-arg).
ValueError – If arg does not start with env_marker and enforce_env_var is True.
Notes
The constructor of desired_return_type may raise errors not listed here.
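The marker-based dispatch can be illustrated with a minimal sketch. The function read_env_or_literal_sketch, its parameter type_, the "$" default marker and the empty-string default are all assumptions made for illustration; they are not taken from this module's actual signature.

```python
import os
from typing import Any, Callable


def read_env_or_literal_sketch(
    arg: str,
    default: str = "",
    env_marker: str = "$",  # assumed marker, chosen for illustration only
    type_: Callable[[str], Any] = str,
) -> Any:
    """Hypothetical helper: read an environment variable if arg starts with env_marker."""
    if arg.startswith(env_marker):
        # Strip the marker and look up the remainder as a variable name.
        var_name = arg[len(env_marker):]
        value = os.environ.get(var_name, default)
    else:
        value = arg  # a literal: use it as-is
    # The converter may raise its own errors, e.g. int("abc") raises ValueError.
    return type_(value)
```

Under these assumptions, read_env_or_literal_sketch("$PORT", default="8080", type_=int) would read the PORT variable if it is set and fall back to 8080 otherwise, while read_env_or_literal_sketch("8080", type_=int) would convert the literal directly.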