PandasFetcher
Translating using pickle files.
[1]:
import sys
import rics
# Print relevant versions
print(f"{rics.__version__=}")
print(f"{sys.version=}")
!git log --pretty=oneline --abbrev-commit -1
rics.__version__='0.17.0.dev1'
sys.version='3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0]'
61bf123 (HEAD -> main, origin/main, origin/HEAD) Add primer examples
[2]:
from rics.utility.logs import basic_config, logging
basic_config(level=logging.INFO, rics_level=logging.DEBUG)
We’ll download data from https://datasets.imdbws.com and clean it so that every value is present (which means that all actors are dead and all titles have stopped airing).
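As a rough illustration of the cleaning step described above, the sketch below keeps only rows in which every required value is present. It assumes that missing values are encoded as \N, as in the IMDb TSV files. The helper name keep_complete_rows, the second sample row, and the choice of required columns are hypothetical; this is not the code of the data module used in this notebook.

```python
import io

import pandas as pd

# Tiny in-memory sample in the name.basics layout. The second row is made up
# for illustration and uses the IMDb marker "\N" for a missing value.
RAW = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm0000001\tFred Astaire\t1899\t1987\n"
    "nm0000000\tExample Person\t1980\t\\N\n"
)


def keep_complete_rows(df, required):
    """Drop every row that lacks a value in any of the required columns."""
    return df.dropna(subset=list(required)).reset_index(drop=True)


# Treat "\N" as missing while reading, then filter on the year columns.
df = pd.read_csv(io.StringIO(RAW), sep="\t", na_values="\\N", dtype=str)
clean = keep_complete_rows(df, ["birthYear", "deathYear"])
print(clean)
```

Only the row with both a birth year and a death year survives, which matches the statement that all remaining actors are dead.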
[3]:
sources = ["name.basics", "title.basics"]
[4]:
from data import load_imdb
for dataset in sources:
    load_imdb(dataset)
2022-11-13T17:28:44.530 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/rics/jupyterlab/data-cache/name.basics.tsv.gz'.
2022-11-13T17:28:44.534 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/name.basics.tsv.gz'.
2022-11-13T17:28:44.635 [rics.utility.misc.get_local_or_remote:INFO] Fetching data from 'https://datasets.imdbws.com/name.basics.tsv.gz'..
2022-11-13T17:28:49.717 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/rics/jupyterlab/data-cache/clean_and_fix_ids/name.basics.tsv.pkl'.
2022-11-13T17:28:49.722 [rics.utility.misc.get_local_or_remote:INFO] Running clean_and_fix_ids..
2022-11-13T17:29:34.774 [rics.utility.misc.get_local_or_remote:INFO] Serializing processed data to '/home/dev/git/rics/jupyterlab/data-cache/clean_and_fix_ids/name.basics.tsv.pkl'..
2022-11-13T17:29:35.067 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/rics/jupyterlab/data-cache/title.basics.tsv.gz'.
2022-11-13T17:29:35.068 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/title.basics.tsv.gz'.
2022-11-13T17:29:35.069 [rics.utility.misc.get_local_or_remote:INFO] Fetching data from 'https://datasets.imdbws.com/title.basics.tsv.gz'..
2022-11-13T17:29:38.829 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/rics/jupyterlab/data-cache/clean_and_fix_ids/title.basics.tsv.pkl'.
2022-11-13T17:29:38.830 [rics.utility.misc.get_local_or_remote:INFO] Running clean_and_fix_ids..
/home/dev/git/rics/jupyterlab/data.py:37: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(input_path, sep="\t", header=0, engine="c")
2022-11-13T17:30:13.333 [rics.utility.misc.get_local_or_remote:INFO] Serializing processed data to '/home/dev/git/rics/jupyterlab/data-cache/clean_and_fix_ids/title.basics.tsv.pkl'..
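The DtypeWarning in the output above is emitted by pandas when a column contains values of mixed types. Judging by the column order logged later in this notebook, column 4 of title.basics is isAdult. One common way to avoid the warning is to pass an explicit dtype for that column. The sketch below uses a made-up two-row sample and is not necessarily how data.py handles it.

```python
import io

import pandas as pd

# Made-up sample using the first five title.basics columns.
RAW = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\n"
    "tt0000001\tshort\tA\tA\t0\n"
    "tt0000002\tshort\tB\tB\t\\N\n"
)

# Reading isAdult as text means pandas never has to guess its type.
# Passing low_memory=False instead is the other option named in the warning.
df = pd.read_csv(
    io.StringIO(RAW), sep="\t", header=0, engine="c", dtype={"isAdult": str}
)
print(df["isAdult"].tolist())
```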
The translator is created from the configuration file config.toml.
[5]:
from rics.translation import Translator
translator = Translator.from_config("config.toml")
translator
2022-11-13T17:30:13.855 [rics.translation.fetching.PandasFetcher:DEBUG] Sources initialized: ['title.basics', 'name.basics']
[5]:
Translator(online=True: fetcher=PandasFetcher(sources=['title.basics', 'name.basics']))
[6]:
tmap = translator.store().cache
2022-11-13T17:30:14.029 [rics.mapping.Mapper:DEBUG] Begin computing match scores for values=('from', 'original_name', 'to', 'name', 'id') in context='title.basics' to candidates=('primaryTitle', 'endYear', 'startYear', 'titleType', 'isAdult', 'tconst', 'int_id_tconst', 'runtimeMinutes', 'originalTitle', 'genres') using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2022-11-13T17:30:14.046 [rics.mapping.Mapper:DEBUG] Computed 5x10 match scores in 0.00430678 sec:
candidates     primaryTitle  endYear  startYear  titleType  isAdult  tconst  int_id_tconst  runtimeMinutes  originalTitle  genres
values
from                   -inf     -inf        inf       -inf     -inf    -inf           -inf            -inf           -inf    -inf
original_name          -inf     -inf       -inf       -inf     -inf    -inf           -inf            -inf            inf    -inf
to                     -inf      inf       -inf       -inf     -inf    -inf           -inf            -inf           -inf    -inf
name                    inf     -inf       -inf       -inf     -inf    -inf           -inf            -inf           -inf    -inf
id                     -inf     -inf       -inf       -inf     -inf     inf           -inf            -inf           -inf    -inf
2022-11-13T17:30:14.068 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'from' -> 'startYear'; score=inf (short-circuit or override).
2022-11-13T17:30:14.069 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 9 other matches:
'from' -> 'primaryTitle'; score=-inf (superseded by short-circuit or override).
'from' -> 'titleType'; score=-inf (superseded by short-circuit or override).
'from' -> 'isAdult'; score=-inf (superseded by short-circuit or override).
'from' -> 'tconst'; score=-inf (superseded by short-circuit or override).
'from' -> 'int_id_tconst'; score=-inf (superseded by short-circuit or override).
'from' -> 'runtimeMinutes'; score=-inf (superseded by short-circuit or override).
'from' -> 'originalTitle'; score=-inf (superseded by short-circuit or override).
'from' -> 'genres'; score=-inf (superseded by short-circuit or override).
'from' -> 'endYear'; score=-inf (superseded by short-circuit or override).
2022-11-13T17:30:14.070 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'name' -> 'primaryTitle'; score=inf (short-circuit or override).
2022-11-13T17:30:14.071 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 9 other matches:
'name' -> 'runtimeMinutes'; score=-inf (superseded by short-circuit or override).
'name' -> 'endYear'; score=-inf (superseded by short-circuit or override).
'name' -> 'startYear'; score=-inf (superseded by short-circuit or override).
'name' -> 'titleType'; score=-inf (superseded by short-circuit or override).
'name' -> 'isAdult'; score=-inf (superseded by short-circuit or override).
'name' -> 'tconst'; score=-inf (superseded by short-circuit or override).
'name' -> 'int_id_tconst'; score=-inf (superseded by short-circuit or override).
'name' -> 'originalTitle'; score=-inf (superseded by short-circuit or override).
'name' -> 'genres'; score=-inf (superseded by short-circuit or override).
2022-11-13T17:30:14.072 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'id' -> 'tconst'; score=inf (short-circuit or override).
2022-11-13T17:30:14.073 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 9 other matches:
'id' -> 'primaryTitle'; score=-inf (superseded by short-circuit or override).
'id' -> 'endYear'; score=-inf (superseded by short-circuit or override).
'id' -> 'startYear'; score=-inf (superseded by short-circuit or override).
'id' -> 'titleType'; score=-inf (superseded by short-circuit or override).
'id' -> 'isAdult'; score=-inf (superseded by short-circuit or override).
'id' -> 'int_id_tconst'; score=-inf (superseded by short-circuit or override).
'id' -> 'runtimeMinutes'; score=-inf (superseded by short-circuit or override).
'id' -> 'originalTitle'; score=-inf (superseded by short-circuit or override).
'id' -> 'genres'; score=-inf (superseded by short-circuit or override).
2022-11-13T17:30:14.074 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'to' -> 'endYear'; score=inf (short-circuit or override).
2022-11-13T17:30:14.075 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 9 other matches:
'to' -> 'genres'; score=-inf (superseded by short-circuit or override).
'to' -> 'runtimeMinutes'; score=-inf (superseded by short-circuit or override).
'to' -> 'originalTitle'; score=-inf (superseded by short-circuit or override).
'to' -> 'tconst'; score=-inf (superseded by short-circuit or override).
'to' -> 'int_id_tconst'; score=-inf (superseded by short-circuit or override).
'to' -> 'primaryTitle'; score=-inf (superseded by short-circuit or override).
'to' -> 'startYear'; score=-inf (superseded by short-circuit or override).
'to' -> 'titleType'; score=-inf (superseded by short-circuit or override).
'to' -> 'isAdult'; score=-inf (superseded by short-circuit or override).
2022-11-13T17:30:14.076 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'original_name' -> 'originalTitle'; score=inf (short-circuit or override).
2022-11-13T17:30:14.077 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 9 other matches:
'original_name' -> 'startYear'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'primaryTitle'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'endYear'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'titleType'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'isAdult'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'tconst'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'int_id_tconst'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'runtimeMinutes'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'genres'; score=-inf (superseded by short-circuit or override).
2022-11-13T17:30:14.082 [rics.mapping.Mapper:DEBUG] Match selection with cardinality='ManyToOne' completed in 0.0346657 sec.
2022-11-13T17:30:14.083 [rics.translation.fetching.AbstractFetcher:DEBUG] Placeholder mappings for source='title.basics': {'from': 'startYear', 'original_name': 'originalTitle', 'to': 'endYear', 'name': 'primaryTitle', 'id': 'tconst'}.
2022-11-13T17:30:14.305 [rics.translation.fetching.AbstractFetcher:DEBUG] Fetched ('tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'endYear', 'runtimeMinutes', 'genres', 'int_id_tconst') for 46376 IDS from 'title.basics' in 0.221314 sec using PandasFetcher(sources=['title.basics', 'name.basics']).
2022-11-13T17:30:14.306 [rics.mapping.Mapper:DEBUG] Begin computing match scores for values=('from', 'original_name', 'to', 'name', 'id') in context='name.basics' to candidates=('nconst', 'primaryName', 'deathYear', 'int_id_nconst', 'birthYear', 'knownForTitles', 'primaryProfession') using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2022-11-13T17:30:14.314 [rics.mapping.Mapper:DEBUG] Computed 5x7 match scores in 0.00501394 sec:
candidates     nconst  primaryName  deathYear  int_id_nconst  birthYear  knownForTitles  primaryProfession
values
from             -inf         -inf       -inf           -inf        inf            -inf               -inf
original_name     0.0     0.181818   0.022222       0.076923        0.0             0.0           0.015385
to               -inf         -inf        inf           -inf       -inf            -inf               -inf
name             -inf          inf       -inf           -inf       -inf            -inf               -inf
id                inf         -inf       -inf           -inf       -inf            -inf               -inf
2022-11-13T17:30:14.318 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'id' -> 'nconst'; score=inf (short-circuit or override).
2022-11-13T17:30:14.320 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 6 other matches:
'id' -> 'primaryName'; score=-inf (superseded by short-circuit or override).
'id' -> 'deathYear'; score=-inf (superseded by short-circuit or override).
'id' -> 'int_id_nconst'; score=-inf (superseded by short-circuit or override).
'id' -> 'birthYear'; score=-inf (superseded by short-circuit or override).
'id' -> 'knownForTitles'; score=-inf (superseded by short-circuit or override).
'id' -> 'primaryProfession'; score=-inf (superseded by short-circuit or override).
2022-11-13T17:30:14.323 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'from' -> 'birthYear'; score=inf (short-circuit or override).
2022-11-13T17:30:14.326 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 6 other matches:
'from' -> 'nconst'; score=-inf (superseded by short-circuit or override).
'from' -> 'primaryName'; score=-inf (superseded by short-circuit or override).
'from' -> 'primaryProfession'; score=-inf (superseded by short-circuit or override).
'from' -> 'knownForTitles'; score=-inf (superseded by short-circuit or override).
'from' -> 'int_id_nconst'; score=-inf (superseded by short-circuit or override).
'from' -> 'deathYear'; score=-inf (superseded by short-circuit or override).
2022-11-13T17:30:14.328 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'name' -> 'primaryName'; score=inf (short-circuit or override).
2022-11-13T17:30:14.330 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 6 other matches:
'name' -> 'int_id_nconst'; score=-inf (superseded by short-circuit or override).
'name' -> 'birthYear'; score=-inf (superseded by short-circuit or override).
'name' -> 'knownForTitles'; score=-inf (superseded by short-circuit or override).
'name' -> 'primaryProfession'; score=-inf (superseded by short-circuit or override).
'name' -> 'deathYear'; score=-inf (superseded by short-circuit or override).
'name' -> 'nconst'; score=-inf (superseded by short-circuit or override).
2022-11-13T17:30:14.332 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'to' -> 'deathYear'; score=inf (short-circuit or override).
2022-11-13T17:30:14.334 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 6 other matches:
'to' -> 'int_id_nconst'; score=-inf (superseded by short-circuit or override).
'to' -> 'primaryProfession'; score=-inf (superseded by short-circuit or override).
'to' -> 'knownForTitles'; score=-inf (superseded by short-circuit or override).
'to' -> 'birthYear'; score=-inf (superseded by short-circuit or override).
'to' -> 'primaryName'; score=-inf (superseded by short-circuit or override).
'to' -> 'nconst'; score=-inf (superseded by short-circuit or override).
2022-11-13T17:30:14.335 [rics.mapping.Mapper.unmapped.details:DEBUG] Could not map value='original_name':
'original_name' -> 'primaryName'; score=0.182 < 1.0 (below threshold).
'original_name' -> 'int_id_nconst'; score=0.077 < 1.0 (below threshold).
'original_name' -> 'deathYear'; score=0.022 < 1.0 (below threshold).
'original_name' -> 'primaryProfession'; score=0.015 < 1.0 (below threshold).
'original_name' -> 'birthYear'; score=0.000 < 1.0 (below threshold).
'original_name' -> 'nconst'; score=0.000 < 1.0 (below threshold).
'original_name' -> 'knownForTitles'; score=0.000 < 1.0 (below threshold).
2022-11-13T17:30:14.343 [rics.mapping.Mapper.unmapped:DEBUG] Could not map {'original_name'} in context='name.basics' to any of candidates={'nconst', 'primaryName', 'deathYear', 'int_id_nconst', 'birthYear', 'knownForTitles', 'primaryProfession'}.
2022-11-13T17:30:14.344 [rics.mapping.Mapper:DEBUG] Match selection with cardinality='ManyToOne' completed in 0.0275069 sec.
2022-11-13T17:30:14.346 [rics.translation.fetching.AbstractFetcher:DEBUG] Placeholder mappings for source='name.basics': {'from': 'birthYear', 'to': 'deathYear', 'name': 'primaryName', 'id': 'nconst', 'original_name': None}.
2022-11-13T17:30:14.657 [rics.translation.fetching.AbstractFetcher:DEBUG] Fetched ('nconst', 'primaryName', 'birthYear', 'deathYear', 'primaryProfession', 'knownForTitles', 'int_id_nconst') for 169189 IDS from 'name.basics' in 0.309258 sec using PandasFetcher(sources=['title.basics', 'name.basics']).
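The logs above show two kinds of outcomes: placeholders with a score of inf are accepted immediately, while original_name is left unmapped in name.basics because its best score (0.182) is below the threshold of 1.0. The sketch below reproduces that outcome with a simple "best score at or above the threshold" rule. It is a simplified illustration, not the algorithm of rics.mapping.Mapper, which also handles overrides and cardinality. The function name select_matches is hypothetical.

```python
import math

CANDIDATES = (
    "nconst", "primaryName", "deathYear", "int_id_nconst",
    "birthYear", "knownForTitles", "primaryProfession",
)
NEG, POS = -math.inf, math.inf

# Score matrix for context='name.basics', copied from the log output.
SCORES = {
    "from": (NEG, NEG, NEG, NEG, POS, NEG, NEG),
    "original_name": (0.0, 0.181818, 0.022222, 0.076923, 0.0, 0.0, 0.015385),
    "to": (NEG, NEG, POS, NEG, NEG, NEG, NEG),
    "name": (NEG, POS, NEG, NEG, NEG, NEG, NEG),
    "id": (POS, NEG, NEG, NEG, NEG, NEG, NEG),
}


def select_matches(scores, candidates, threshold=1.0):
    """Map each value to its best-scoring candidate, or None if below threshold."""
    result = {}
    for value, row in scores.items():
        best_score, best_candidate = max(zip(row, candidates))
        result[value] = best_candidate if best_score >= threshold else None
    return result


mapping = select_matches(SCORES, CANDIDATES)
print(mapping)
```

The resulting dictionary has the same content as the placeholder mappings logged for source='name.basics', including original_name mapped to None.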
[7]:
for source in tmap:
    translations = tmap[source]
    print(f"Translations for {source=};")
    for i, (idx, translation) in enumerate(translations.items()):
        print(f" {repr(idx)} -> {repr(translation)}")
        if i == 2:
            break
Translations for source='name.basics';
'nm0000001' -> 'nm0000001:Fred Astaire *1899†1987'
'nm0000002' -> 'nm0000002:Lauren Bacall *1924†2014'
'nm0000004' -> 'nm0000004:John Belushi *1949†1982'
Translations for source='title.basics';
'tt0025509' -> 'tt0025509:Les Misérables (original: Les misérables) *1934†1934'
'tt0035803' -> 'tt0035803:The German Weekly Review (original: Die Deutsche Wochenschau) *1940†1945'
'tt0038276' -> 'tt0038276:You Are an Artist (original: You Are an Artist) *1946†1955'
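The translated strings above follow a fixed pattern: the ID, the name, an optional "(original: ...)" block, and the from/to years. The block is absent for name.basics, where original_name could not be mapped. The plain-Python sketch below mimics that pattern to make the optional part explicit. The function render is hypothetical; the real format is defined in config.toml, which is not shown here.

```python
def render(identifier, name, start, end, original_name=None):
    """Build a translation string; the 'original' block is optional."""
    original = f" (original: {original_name})" if original_name is not None else ""
    return f"{identifier}:{name}{original} *{start}†{end}"


# No original_name available, as for name.basics.
print(render("nm0000001", "Fred Astaire", 1899, 1987))
# original_name available, as for title.basics.
print(render("tt0025509", "Les Misérables", 1934, 1934, "Les misérables"))
```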
SqlFetcher demo
PostgreSQL must be running locally, with a user called postgres using the password your_password, and a database called imdb must already exist.
import sqlalchemy
engine = sqlalchemy.create_engine("postgresql+pg8000://postgres:your_password@localhost:5432/imdb")
for source in sources:
    df = load_imdb(source)[0]
    # Table names use underscores instead of dots, e.g. name_basics.
    df.to_sql(source.replace(".", "_"), engine, if_exists="replace")
Copy and paste this snippet, then run it to load the data into the SQL database.
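The snippet above replaces the dot in each source name with an underscore before writing the table. The notebook does not say why; a likely reason, stated here as an assumption, is that an unquoted dotted name in SQL is read as a qualified name rather than as a single table name. The sketch below demonstrates this with the standard-library sqlite3 module instead of PostgreSQL, so that it runs without a database server. The helper table_name is hypothetical.

```python
import sqlite3


def table_name(source: str) -> str:
    """Turn a source name such as 'name.basics' into 'name_basics'."""
    return source.replace(".", "_")


conn = sqlite3.connect(":memory:")

# An unquoted dotted name is parsed as <database>.<table>, which fails here
# because no database called "name" is attached.
try:
    conn.execute("CREATE TABLE name.basics (nconst TEXT)")
    dotted_ok = True
except sqlite3.OperationalError:
    dotted_ok = False

# The underscore variant is an ordinary table name.
conn.execute(f"CREATE TABLE {table_name('name.basics')} (nconst TEXT)")
tables = [row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(dotted_ok, tables)
conn.close()
```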