Fetching data using `PandasFetcher`#

Translating using pickle files.

[1]:

import sys

import rics

# Print relevant versions
print(f"{rics.__version__=}")
print(f"{sys.version=}")
!git log --pretty=oneline --abbrev-commit -1

rics.__version__='0.11.1'
sys.version='3.9.7 (default, Sep 16 2021, 13:09:58) \n[GCC 7.5.0]'
a1dea9c (HEAD -> master) Changed `like_database_table` from score function to alias function

[2]:

from rics.utility.logs import basic_config, logging

basic_config(level=logging.INFO, rics_level=logging.DEBUG)

Make local Pickle files#

We’ll download data from https://datasets.imdbws.com and clean it so that every value is present (which means that all remaining actors are dead and all remaining titles have stopped airing).

[3]:

sources = ["name.basics", "title.basics"]

[4]:

from data import load_imdb

for dataset in sources:
    load_imdb(dataset)

2022-07-03T11:20:23.406 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/rics/jupyterlab/data-cache/name.basics.tsv.gz'.
2022-07-03T11:20:23.409 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/name.basics.tsv.gz'.
2022-07-03T11:20:23.538 [rics.utility.misc.get_local_or_remote:INFO] Fetching data from 'https://datasets.imdbws.com/name.basics.tsv.gz'..

2022-07-03T11:20:29.111 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/rics/jupyterlab/data-cache/clean_and_fix_ids/name.basics.tsv.pkl'.
2022-07-03T11:20:29.111 [rics.utility.misc.get_local_or_remote:INFO] Running clean_and_fix_ids..
2022-07-03T11:21:06.903 [rics.utility.misc.get_local_or_remote:INFO] Serializing processed data to '/home/dev/git/rics/jupyterlab/data-cache/clean_and_fix_ids/name.basics.tsv.pkl'..
2022-07-03T11:21:07.394 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/rics/jupyterlab/data-cache/title.basics.tsv.gz'.
2022-07-03T11:21:07.395 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/title.basics.tsv.gz'.
2022-07-03T11:21:07.396 [rics.utility.misc.get_local_or_remote:INFO] Fetching data from 'https://datasets.imdbws.com/title.basics.tsv.gz'..

2022-07-03T11:21:11.717 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/rics/jupyterlab/data-cache/clean_and_fix_ids/title.basics.tsv.pkl'.
2022-07-03T11:21:11.718 [rics.utility.misc.get_local_or_remote:INFO] Running clean_and_fix_ids..
/home/dev/git/rics/jupyterlab/data.py:35: DtypeWarning: Columns (4,5) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(input_path, sep="\t", header=0, engine="c")
2022-07-03T11:21:42.736 [rics.utility.misc.get_local_or_remote:INFO] Serializing processed data to '/home/dev/git/rics/jupyterlab/data-cache/clean_and_fix_ids/title.basics.tsv.pkl'..
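The cleaning idea described at the top of this section can be sketched with the standard library alone. This is a minimal illustration, not the `clean_and_fix_ids` function from `data.py` (which is not shown here and uses pandas). The helper name `drop_incomplete` and the row `nm9999999` are hypothetical. IMDb marks missing values with `\N`.

```python
import csv
import io

MISSING = "\\N"  # IMDb's marker for a missing value

# Tiny stand-in for name.basics.tsv; the second row is a made-up example.
raw = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm0000001\tFred Astaire\t1899\t1987\n"
    "nm9999999\tExample Person\t1950\t\\N\n"
)


def drop_incomplete(rows, required):
    """Keep only rows where every required column has a value."""
    return [row for row in rows if all(row[col] != MISSING for col in required)]


rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))
complete = drop_incomplete(rows, required=["birthYear", "deathYear"])

for row in complete:
    print(row["nconst"], row["primaryName"])
```

Only rows with both a birth year and a death year survive, which is why the translations further down always show both years.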

Create translator from config#

The translator configuration is stored in the file `config.toml`.

[5]:

from rics.translation import Translator

translator = Translator.from_config("config.toml")
translator

[5]:

Translator(online=True: fetcher=PandasFetcher(read_function=read_pickle, read_path_format='../../data-cache/clean_and_fix_ids/{source}.tsv.pkl'))
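The `read_path_format` shown in the output above contains a `{source}` placeholder. A plausible reading, stated here as an assumption rather than as the library's actual implementation, is that the fetcher fills in the source name to locate one pickle file per source. The sketch below only builds the paths; it does not read any files.

```python
# Values copied from the notebook; how PandasFetcher uses them is assumed.
read_path_format = "../../data-cache/clean_and_fix_ids/{source}.tsv.pkl"
sources = ["name.basics", "title.basics"]

# One file path per source, obtained by plain string formatting.
paths = {source: read_path_format.format(source=source) for source in sources}

for source, path in paths.items():
    print(f"{source}: {path}")
```

Under this assumption, each resulting path would be handed to the configured `read_function` (`read_pickle` in this notebook).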

[6]:

tmap = translator.store()._cached_tmap

2022-07-03T11:21:43.131 [rics.translation.fetching.PandasFetcher:DEBUG] Sources initialized: ['name.basics', 'title.basics']
2022-07-03T11:21:43.221 [rics.mapping.Mapper:DEBUG] Begin mapping value='original_name' in context='name.basics' to candidates={'int_id_nconst', 'birthYear', 'primaryProfession', 'knownForTitles', 'deathYear', 'nconst', 'primaryName'} using HeuristicScore([force_lower_case] -> default_score_function).
2022-07-03T11:21:43.222 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'primaryName', score=0.182 < 1.0.
2022-07-03T11:21:43.223 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'int_id_nconst', score=0.077 < 1.0.
2022-07-03T11:21:43.225 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'deathYear', score=0.022 < 1.0.
2022-07-03T11:21:43.227 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'primaryProfession', score=0.015 < 1.0.
2022-07-03T11:21:43.229 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'birthYear', score=0.000 < 1.0.
2022-07-03T11:21:43.230 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'knownForTitles', score=0.000 < 1.0.
2022-07-03T11:21:43.232 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'nconst', score=0.000 < 1.0.
2022-07-03T11:21:43.236 [rics.mapping.Mapper:DEBUG] Could not map value='original_name' in context='name.basics' to any of candidates={'int_id_nconst', 'birthYear', 'primaryProfession', 'knownForTitles', 'deathYear', 'nconst', 'primaryName'}.
2022-07-03T11:21:43.237 [rics.translation.fetching.AbstractFetcher:DEBUG] Placeholder mappings for source='name.basics': {'to': 'deathYear', 'from': 'birthYear', 'name': 'primaryName', 'id': 'nconst', 'original_name': None}.
2022-07-03T11:21:43.653 [rics.translation.fetching.AbstractFetcher:DEBUG] Fetched ('nconst', 'primaryName', 'birthYear', 'deathYear', 'primaryProfession', 'knownForTitles', 'int_id_nconst') for 165894 IDS from 'name.basics' in 0.415318 sec.
2022-07-03T11:21:43.655 [rics.translation.fetching.AbstractFetcher:DEBUG] Placeholder mappings for source='title.basics': {'to': 'endYear', 'from': 'startYear', 'name': 'primaryTitle', 'original_name': 'originalTitle', 'id': 'tconst'}.
2022-07-03T11:21:43.776 [rics.translation.fetching.AbstractFetcher:DEBUG] Fetched ('tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'endYear', 'runtimeMinutes', 'genres', 'int_id_tconst') for 44003 IDS from 'title.basics' in 0.121071 sec.
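The log above shows every candidate for `original_name` in `name.basics` being rejected because its score is below 1.0, leaving the placeholder mapped to `None`. The following is a minimal sketch of that kind of threshold-based selection. It is not the `rics.mapping.Mapper` implementation; the function `best_match` is hypothetical, and the second set of scores is made up for illustration.

```python
from typing import Dict, Optional


def best_match(scores: Dict[str, float], min_score: float = 1.0) -> Optional[str]:
    """Return the highest-scoring candidate, or None if it scores below min_score."""
    if not scores:
        return None
    candidate, score = max(scores.items(), key=lambda item: item[1])
    return candidate if score >= min_score else None


# Scores copied from the log output for value='original_name'.
name_basics_scores = {
    "primaryName": 0.182,
    "int_id_nconst": 0.077,
    "deathYear": 0.022,
    "primaryProfession": 0.015,
    "birthYear": 0.0,
    "knownForTitles": 0.0,
    "nconst": 0.0,
}
unmapped = best_match(name_basics_scores)

# Hypothetical scores, showing a candidate that reaches the threshold.
mapped = best_match({"originalTitle": 1.0, "primaryTitle": 0.3})

print(unmapped, mapped)
```

With the logged scores, the best candidate (`primaryName` at 0.182) stays below the threshold, so nothing is selected.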

[7]:

for source in tmap:
    translations = tmap[source]
    print(f"Translations for {source=};")
    # Show only the first three translations for each source
    for i, (idx, translation) in enumerate(translations.items()):
        print(f"    {idx!r} -> {translation!r}")
        if i == 2:
            break

Translations for source='name.basics';
    'nm0000001' -> 'nm0000001:Fred Astaire *1899†1987'
    'nm0000002' -> 'nm0000002:Lauren Bacall *1924†2014'
    'nm0000004' -> 'nm0000004:John Belushi *1949†1982'
Translations for source='title.basics';
    'tt0025509' -> 'tt0025509:Les Misérables (original: Les misérables) *1934†1934'
    'tt0035803' -> 'tt0035803:The German Weekly Review (original: Die Deutsche Wochenschau) *1940†1945'
    'tt0038276' -> 'tt0038276:You Are an Artist (original: You Are an Artist) *1946†1955'
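The translations above follow a common pattern: ID, name, an optional `(original: ...)` part, then the start and end years. The sketch below reproduces that pattern by hand to make the role of the optional placeholder explicit. It is an illustration only, not how `Translator` formats strings internally; the function `render` is hypothetical, and the format actually used is defined in `config.toml`.

```python
from typing import Any, Mapping


def render(record: Mapping[str, Any]) -> str:
    """Format one record; the 'original' part is skipped when no value exists."""
    original = record.get("original_name")
    optional = f" (original: {original})" if original is not None else ""
    return f"{record['id']}:{record['name']}{optional} *{record['from']}†{record['to']}"


# name.basics has no column mapped to 'original_name', so it is None here.
person = {"id": "nm0000001", "name": "Fred Astaire", "original_name": None,
          "from": 1899, "to": 1987}
title = {"id": "tt0025509", "name": "Les Misérables", "original_name": "Les misérables",
         "from": 1934, "to": 1934}

print(render(person))
print(render(title))
```

This matches the output above: entries from `name.basics` omit the original name because that placeholder was mapped to `None`, while entries from `title.basics` include it.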

Prepare for `SqlFetcher` demo#

PostgreSQL must be running locally, with a user called `postgres` whose password is `your_password`, and a database called `imdb` must already exist.

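import sqlalchemy

# Connect to the local `imdb` database as the `postgres` user
engine = sqlalchemy.create_engine("postgresql://postgres:your_password@localhost:5432/imdb")

# `sources` and `load_imdb` are defined earlier in this notebook
for source in sources:
    df = load_imdb(source)[0]
    # Table names use underscores instead of dots, e.g. name_basics
    df.to_sql(source.replace(".", "_"), engine, if_exists="replace")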

Copy and paste the snippet above, then run it to load the data into the SQL database.
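The snippet stores each source in a table whose name has the dot replaced by an underscore, presumably because a dot in a SQL name normally separates a schema from a table. As a small sketch of the resulting names (the variable `table_names` is introduced here for illustration only):

```python
sources = ["name.basics", "title.basics"]

# Same transformation as in the loading snippet: dots become underscores.
table_names = {source: source.replace(".", "_") for source in sources}

for source, table in table_names.items():
    print(f"{source} -> {table}")
```

These are the table names to look for in the `imdb` database once the data has been loaded.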
