SqlFetcher#Translating using a SQL database. This notebook assumes that the Prepare for ``SqlFetcher`` demo-step from the PickleFetcher demo notebook has been completed.
[1]:
import sys
import rics
# Print relevant versions
print(f"{rics.__version__=}")
print(f"{sys.version=}")
!git log --pretty=oneline --abbrev-commit -1
rics.__version__='0.17.0.dev1'
sys.version='3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]'
cbd8da2 (HEAD -> main, origin/main, origin/HEAD) Rerun some notebooks
[2]:
from rics.utility.logs import basic_config, logging
basic_config(level=logging.INFO, rics_level=logging.DEBUG)
Click here to see the file.
[3]:
from rics.translation import Translator
translator = Translator.from_config("config.toml")
translator
2022-10-08T15:45:39.660 [rics.translation.fetching.SqlFetcher:DEBUG] Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb): Metadata created in 0.109644 sec.
2022-10-08T15:45:39.662 [rics.mapping.Mapper:DEBUG] Begin computing match scores for values=('id',) in context='title_basics' to candidates=('titleType', 'index', 'tconst', 'isAdult', 'startYear', 'primaryTitle', 'endYear', 'genres', 'runtimeMinutes', 'originalTitle', 'int_id_tconst') using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2022-10-08T15:45:39.670 [rics.mapping.Mapper:DEBUG] Computed 1x11 match scores in 0.00208911 sec:
candidates titleType index tconst isAdult startYear primaryTitle endYear genres runtimeMinutes originalTitle int_id_tconst
values
id -inf -inf inf -inf -inf -inf -inf -inf -inf -inf -inf
2022-10-08T15:45:39.680 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'id' -> 'tconst'; score=inf (short-circuit or override).
2022-10-08T15:45:39.681 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 10 other matches:
'id' -> 'titleType'; score=-inf (superseded by short-circuit or override).
'id' -> 'index'; score=-inf (superseded by short-circuit or override).
'id' -> 'isAdult'; score=-inf (superseded by short-circuit or override).
'id' -> 'startYear'; score=-inf (superseded by short-circuit or override).
'id' -> 'primaryTitle'; score=-inf (superseded by short-circuit or override).
'id' -> 'endYear'; score=-inf (superseded by short-circuit or override).
'id' -> 'genres'; score=-inf (superseded by short-circuit or override).
'id' -> 'runtimeMinutes'; score=-inf (superseded by short-circuit or override).
'id' -> 'originalTitle'; score=-inf (superseded by short-circuit or override).
'id' -> 'int_id_tconst'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:39.682 [rics.mapping.Mapper:DEBUG] Match selection with cardinality='ManyToOne' completed in 0.00932053 sec.
2022-10-08T15:45:39.696 [rics.translation.fetching.SqlFetcher:DEBUG] Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb): Size of 'title_basics' resolved in 0.0125943 sec: 45674 rows.
2022-10-08T15:45:39.699 [rics.mapping.Mapper:DEBUG] Begin computing match scores for values=('id',) in context='name_basics' to candidates=('index', 'primaryProfession', 'knownForTitles', 'birthYear', 'deathYear', 'nconst', 'primaryName', 'int_id_nconst') using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2022-10-08T15:45:39.705 [rics.mapping.Mapper:DEBUG] Computed 1x8 match scores in 0.00296836 sec:
candidates index primaryProfession knownForTitles birthYear deathYear nconst primaryName int_id_nconst
values
id -inf -inf -inf -inf -inf inf -inf -inf
2022-10-08T15:45:39.713 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'id' -> 'nconst'; score=inf (short-circuit or override).
2022-10-08T15:45:39.714 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 7 other matches:
'id' -> 'index'; score=-inf (superseded by short-circuit or override).
'id' -> 'primaryProfession'; score=-inf (superseded by short-circuit or override).
'id' -> 'knownForTitles'; score=-inf (superseded by short-circuit or override).
'id' -> 'birthYear'; score=-inf (superseded by short-circuit or override).
'id' -> 'deathYear'; score=-inf (superseded by short-circuit or override).
'id' -> 'primaryName'; score=-inf (superseded by short-circuit or override).
'id' -> 'int_id_nconst'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:39.716 [rics.mapping.Mapper:DEBUG] Match selection with cardinality='ManyToOne' completed in 0.0090395 sec.
2022-10-08T15:45:39.740 [rics.translation.fetching.SqlFetcher:DEBUG] Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb): Size of 'name_basics' resolved in 0.021451 sec: 168310 rows.
2022-10-08T15:45:39.742 [rics.translation.fetching.SqlFetcher:INFO] Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb): Processed 2 tables in 0.190995 sec. Lengths={'name_basics': 168310, 'title_basics': 45674}.
[3]:
Translator(online=True: fetcher=SqlFetcher(Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb), tables=['title_basics', 'name_basics']))
[4]:
import pandas as pd
engine = translator._fetcher._engine
def first_title(seed=None, n=1000):
df = pd.read_sql("SELECT * FROM name_basics;", engine).sample(n, random_state=seed)
df["firstTitle"] = df.knownForTitles.str.split(",").str[0]
return df[["nconst", "firstTitle"]]
[5]:
translator.store().cache
2022-10-08T15:45:39.771 [rics.mapping.Mapper:DEBUG] Begin computing match scores for values=('original_name', 'to', 'name', 'from') in context='title_basics' to candidates=('titleType', 'index', 'tconst', 'isAdult', 'startYear', 'primaryTitle', 'endYear', 'genres', 'runtimeMinutes', 'originalTitle', 'int_id_tconst') using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2022-10-08T15:45:39.785 [rics.mapping.Mapper:DEBUG] Computed 4x11 match scores in 0.00633102 sec:
candidates titleType index tconst isAdult startYear primaryTitle endYear genres runtimeMinutes originalTitle int_id_tconst
values
original_name -inf -inf -inf -inf -inf -inf -inf -inf -inf inf -inf
to -inf -inf -inf -inf -inf -inf inf -inf -inf -inf -inf
name -inf -inf -inf -inf -inf inf -inf -inf -inf -inf -inf
from -inf -inf -inf -inf inf -inf -inf -inf -inf -inf -inf
2022-10-08T15:45:39.791 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'name' -> 'primaryTitle'; score=inf (short-circuit or override).
2022-10-08T15:45:39.792 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 10 other matches:
'name' -> 'int_id_tconst'; score=-inf (superseded by short-circuit or override).
'name' -> 'isAdult'; score=-inf (superseded by short-circuit or override).
'name' -> 'startYear'; score=-inf (superseded by short-circuit or override).
'name' -> 'endYear'; score=-inf (superseded by short-circuit or override).
'name' -> 'genres'; score=-inf (superseded by short-circuit or override).
'name' -> 'runtimeMinutes'; score=-inf (superseded by short-circuit or override).
'name' -> 'originalTitle'; score=-inf (superseded by short-circuit or override).
'name' -> 'index'; score=-inf (superseded by short-circuit or override).
'name' -> 'tconst'; score=-inf (superseded by short-circuit or override).
'name' -> 'titleType'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:39.793 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'to' -> 'endYear'; score=inf (short-circuit or override).
2022-10-08T15:45:39.795 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 10 other matches:
'to' -> 'int_id_tconst'; score=-inf (superseded by short-circuit or override).
'to' -> 'titleType'; score=-inf (superseded by short-circuit or override).
'to' -> 'index'; score=-inf (superseded by short-circuit or override).
'to' -> 'tconst'; score=-inf (superseded by short-circuit or override).
'to' -> 'isAdult'; score=-inf (superseded by short-circuit or override).
'to' -> 'startYear'; score=-inf (superseded by short-circuit or override).
'to' -> 'primaryTitle'; score=-inf (superseded by short-circuit or override).
'to' -> 'genres'; score=-inf (superseded by short-circuit or override).
'to' -> 'runtimeMinutes'; score=-inf (superseded by short-circuit or override).
'to' -> 'originalTitle'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:39.798 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'from' -> 'startYear'; score=inf (short-circuit or override).
2022-10-08T15:45:39.798 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 10 other matches:
'from' -> 'index'; score=-inf (superseded by short-circuit or override).
'from' -> 'titleType'; score=-inf (superseded by short-circuit or override).
'from' -> 'tconst'; score=-inf (superseded by short-circuit or override).
'from' -> 'isAdult'; score=-inf (superseded by short-circuit or override).
'from' -> 'primaryTitle'; score=-inf (superseded by short-circuit or override).
'from' -> 'endYear'; score=-inf (superseded by short-circuit or override).
'from' -> 'genres'; score=-inf (superseded by short-circuit or override).
'from' -> 'runtimeMinutes'; score=-inf (superseded by short-circuit or override).
'from' -> 'originalTitle'; score=-inf (superseded by short-circuit or override).
'from' -> 'int_id_tconst'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:39.799 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'original_name' -> 'originalTitle'; score=inf (short-circuit or override).
2022-10-08T15:45:39.800 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 10 other matches:
'original_name' -> 'titleType'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'index'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'tconst'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'isAdult'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'startYear'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'primaryTitle'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'endYear'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'genres'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'runtimeMinutes'; score=-inf (superseded by short-circuit or override).
'original_name' -> 'int_id_tconst'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:39.802 [rics.mapping.Mapper:DEBUG] Match selection with cardinality='ManyToOne' completed in 0.0161652 sec.
2022-10-08T15:45:39.803 [rics.translation.fetching.AbstractFetcher:DEBUG] Placeholder mappings for source='title_basics': {'id': 'tconst', 'original_name': 'originalTitle', 'to': 'endYear', 'name': 'primaryTitle', 'from': 'startYear'}.
2022-10-08T15:45:40.191 [rics.translation.fetching.AbstractFetcher:DEBUG] Fetched ('tconst', 'primaryTitle', 'originalTitle', 'startYear', 'endYear') for 45674 IDS from 'title_basics' in 0.385863 sec using SqlFetcher(Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb), tables=['title_basics', 'name_basics']).
2022-10-08T15:45:40.192 [rics.mapping.Mapper:DEBUG] Begin computing match scores for values=('original_name', 'to', 'name', 'from') in context='name_basics' to candidates=('index', 'primaryProfession', 'knownForTitles', 'birthYear', 'deathYear', 'nconst', 'primaryName', 'int_id_nconst') using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2022-10-08T15:45:40.197 [rics.mapping.Mapper:DEBUG] Computed 4x8 match scores in 0.00238939 sec:
candidates index primaryProfession knownForTitles birthYear deathYear nconst primaryName int_id_nconst
values
original_name 0.022222 0.015385 0.0 0.0 0.022222 0.0 0.181818 0.076923
to -inf -inf -inf -inf inf -inf -inf -inf
name -inf -inf -inf -inf -inf -inf inf -inf
from -inf -inf -inf inf -inf -inf -inf -inf
2022-10-08T15:45:40.200 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'name' -> 'primaryName'; score=inf (short-circuit or override).
2022-10-08T15:45:40.202 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 7 other matches:
'name' -> 'int_id_nconst'; score=-inf (superseded by short-circuit or override).
'name' -> 'deathYear'; score=-inf (superseded by short-circuit or override).
'name' -> 'nconst'; score=-inf (superseded by short-circuit or override).
'name' -> 'index'; score=-inf (superseded by short-circuit or override).
'name' -> 'birthYear'; score=-inf (superseded by short-circuit or override).
'name' -> 'knownForTitles'; score=-inf (superseded by short-circuit or override).
'name' -> 'primaryProfession'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:40.202 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'from' -> 'birthYear'; score=inf (short-circuit or override).
2022-10-08T15:45:40.203 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 7 other matches:
'from' -> 'primaryProfession'; score=-inf (superseded by short-circuit or override).
'from' -> 'index'; score=-inf (superseded by short-circuit or override).
'from' -> 'deathYear'; score=-inf (superseded by short-circuit or override).
'from' -> 'knownForTitles'; score=-inf (superseded by short-circuit or override).
'from' -> 'nconst'; score=-inf (superseded by short-circuit or override).
'from' -> 'primaryName'; score=-inf (superseded by short-circuit or override).
'from' -> 'int_id_nconst'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:40.206 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'to' -> 'deathYear'; score=inf (short-circuit or override).
2022-10-08T15:45:40.208 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 7 other matches:
'to' -> 'int_id_nconst'; score=-inf (superseded by short-circuit or override).
'to' -> 'primaryName'; score=-inf (superseded by short-circuit or override).
'to' -> 'nconst'; score=-inf (superseded by short-circuit or override).
'to' -> 'birthYear'; score=-inf (superseded by short-circuit or override).
'to' -> 'knownForTitles'; score=-inf (superseded by short-circuit or override).
'to' -> 'primaryProfession'; score=-inf (superseded by short-circuit or override).
'to' -> 'index'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:40.210 [rics.mapping.Mapper.unmapped.details:DEBUG] Could not map value='original_name':
'original_name' -> 'primaryName'; score=0.182 < 1.0 (below threshold).
'original_name' -> 'int_id_nconst'; score=0.077 < 1.0 (below threshold).
'original_name' -> 'index'; score=0.022 < 1.0 (below threshold).
'original_name' -> 'deathYear'; score=0.022 < 1.0 (below threshold).
'original_name' -> 'primaryProfession'; score=0.015 < 1.0 (below threshold).
'original_name' -> 'knownForTitles'; score=0.000 < 1.0 (below threshold).
'original_name' -> 'birthYear'; score=0.000 < 1.0 (below threshold).
'original_name' -> 'nconst'; score=0.000 < 1.0 (below threshold).
2022-10-08T15:45:40.211 [rics.mapping.Mapper.unmapped:DEBUG] Could not map {'original_name'} in context='name_basics' to any of candidates={'index', 'knownForTitles', 'primaryProfession', 'birthYear', 'deathYear', 'nconst', 'primaryName', 'int_id_nconst'}.
2022-10-08T15:45:40.212 [rics.mapping.Mapper:DEBUG] Match selection with cardinality='ManyToOne' completed in 0.0142098 sec.
2022-10-08T15:45:40.213 [rics.translation.fetching.AbstractFetcher:DEBUG] Placeholder mappings for source='name_basics': {'id': 'nconst', 'to': 'deathYear', 'name': 'primaryName', 'from': 'birthYear', 'original_name': None}.
2022-10-08T15:45:41.573 [rics.translation.fetching.AbstractFetcher:DEBUG] Fetched ('nconst', 'primaryName', 'birthYear', 'deathYear') for 168310 IDS from 'name_basics' in 1.36 sec using SqlFetcher(Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb), tables=['title_basics', 'name_basics']).
2022-10-08T15:45:41.575 [rics.translation.fetching.SqlFetcher:DEBUG] Dispose Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb)
[5]:
TranslationMap('name_basics': 168310 IDs, 'title_basics': 45674 IDs)
In the IMDb list anyway. I have no idea how they’re ordered in “knownForTitles”.
[6]:
df = first_title(seed=5)
df.head()
[6]:
| nconst | firstTitle | |
|---|---|---|
| 33993 | nm0260875 | tt0255068 |
| 22215 | nm0167306 | tt0252264 |
| 76585 | nm0604711 | tt0052933 |
| 47602 | nm0369739 | tt0125301 |
| 164264 | nm6981261 | tt3952746 |
[7]:
translator.translate(df).head(5)
2022-10-08T15:45:43.539 [rics.mapping.Mapper:DEBUG] Begin computing match scores for values=('nconst', 'firstTitle') to candidates=('title_basics', 'name_basics') using HeuristicScore([like_database_table()] -> modified_hamming).
2022-10-08T15:45:43.545 [rics.mapping.Mapper:DEBUG] Computed 2x2 match scores in 0.00440562 sec:
candidates title_basics name_basics
values
nconst -inf inf
firstTitle inf -inf
2022-10-08T15:45:43.548 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'nconst' -> 'name_basics'; score=inf (short-circuit or override).
2022-10-08T15:45:43.550 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 1 other matches:
'nconst' -> 'title_basics'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:43.551 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'firstTitle' -> 'title_basics'; score=inf (short-circuit or override).
2022-10-08T15:45:43.552 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 1 other matches:
'firstTitle' -> 'name_basics'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:43.554 [rics.mapping.Mapper:DEBUG] Match selection with cardinality='ManyToOne' completed in 0.0068388 sec.
2022-10-08T15:45:43.711 [rics.translation.Translator:DEBUG] Failed to translate 0.000% of IDs for name='nconst' using source='name_basics'.
2022-10-08T15:45:43.712 [rics.translation.Translator:DEBUG] Failed to translate 87.200% of IDs for name='firstTitle' using source='title_basics'.
2022-10-08T15:45:43.712 [rics.translation.Translator:DEBUG] Verified 2000 IDs from 2 different sources in 0.156801 sec.
[7]:
| nconst | firstTitle | |
|---|---|---|
| 33993 | nm0260875:Margarito Esparza *1936†2016 | tt0255068 not translated; default name=Title u... |
| 22215 | nm0167306:Rick Cluchey *1933†2015 | tt0252264 not translated; default name=Title u... |
| 76585 | nm0604711:Henry Morgan *1915†1994 | tt0052933 not translated; default name=Title u... |
| 47602 | nm0369739:Svatopluk Havelka *1925†2009 | tt0125301 not translated; default name=Title u... |
| 164264 | nm6981261:Tyler Sanders *2004†2022 | tt3952746:Just Add Magic (original: Just Add M... |
[8]:
translator.translate(df, inplace=True) # returns None
df.head(5)
2022-10-08T15:45:43.906 [rics.mapping.Mapper:DEBUG] Begin computing match scores for values=('nconst', 'firstTitle') to candidates=('title_basics', 'name_basics') using HeuristicScore([like_database_table()] -> modified_hamming).
2022-10-08T15:45:43.911 [rics.mapping.Mapper:DEBUG] Computed 2x2 match scores in 0.00335408 sec:
candidates title_basics name_basics
values
nconst -inf inf
firstTitle inf -inf
2022-10-08T15:45:43.914 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'nconst' -> 'name_basics'; score=inf (short-circuit or override).
2022-10-08T15:45:43.916 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 1 other matches:
'nconst' -> 'title_basics'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:43.917 [rics.mapping.Mapper.accept:DEBUG] Accepted: 'firstTitle' -> 'title_basics'; score=inf (short-circuit or override).
2022-10-08T15:45:43.919 [rics.mapping.Mapper.accept.details:DEBUG] This match supersedes 1 other matches:
'firstTitle' -> 'name_basics'; score=-inf (superseded by short-circuit or override).
2022-10-08T15:45:43.920 [rics.mapping.Mapper:DEBUG] Match selection with cardinality='ManyToOne' completed in 0.00831991 sec.
2022-10-08T15:45:44.058 [rics.translation.Translator:DEBUG] Failed to translate 0.000% of IDs for name='nconst' using source='name_basics'.
2022-10-08T15:45:44.059 [rics.translation.Translator:DEBUG] Failed to translate 87.200% of IDs for name='firstTitle' using source='title_basics'.
2022-10-08T15:45:44.061 [rics.translation.Translator:DEBUG] Verified 2000 IDs from 2 different sources in 0.139807 sec.
[8]:
| nconst | firstTitle | |
|---|---|---|
| 33993 | nm0260875:Margarito Esparza *1936†2016 | tt0255068 not translated; default name=Title u... |
| 22215 | nm0167306:Rick Cluchey *1933†2015 | tt0252264 not translated; default name=Title u... |
| 76585 | nm0604711:Henry Morgan *1915†1994 | tt0052933 not translated; default name=Title u... |
| 47602 | nm0369739:Svatopluk Havelka *1925†2009 | tt0125301 not translated; default name=Title u... |
| 164264 | nm6981261:Tyler Sanders *2004†2022 | tt3952746:Just Add Magic (original: Just Add M... |
[ ]: