SqlFetcher#Translating using a SQL database. This notebook assumes that the Prepare for ``SqlFetcher`` demo-step from PickleFetcher.ipynb has been completed.
[1]:
import sys
import rics
# Print relevant versions
print(f"{rics.__version__=}")
print(f"{sys.version=}")
!git log --pretty=oneline --abbrev-commit -1
rics.__version__='0.11.1'
sys.version='3.9.7 (default, Sep 16 2021, 13:09:58) \n[GCC 7.5.0]'
a1dea9c (HEAD -> master) Changed `like_database_table` from score function to alias function
[2]:
from rics.utility.logs import basic_config, logging
basic_config(level=logging.INFO, rics_level=logging.DEBUG)
Click here to see the file.
[3]:
from rics.translation import Translator
translator = Translator.from_config("config.toml")
translator
2022-07-03T11:17:53.831 [rics.translation.fetching.SqlFetcher:DEBUG] Metadata created in 0.170319 sec.
2022-07-03T11:17:53.881 [rics.translation.fetching.SqlFetcher:DEBUG] Size of title_basics=42074 resolved in 0.0484876 sec.
2022-07-03T11:17:54.005 [rics.translation.fetching.SqlFetcher:DEBUG] Size of name_basics=163559 resolved in 0.12327 sec.
2022-07-03T11:17:54.007 [rics.translation.fetching.SqlFetcher:INFO] Processed 2 tables in 0.347049 sec. Lengths={'name_basics': 163559, 'title_basics': 42074}.
[3]:
Translator(online=True: fetcher=SqlFetcher(engine=Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb), tables=['title_basics', 'name_basics']))
[4]:
import pandas as pd
engine = translator._fetcher._engine
def first_title(seed=None, n=1000):
df = pd.read_sql("SELECT * FROM name_basics;", engine).sample(n, random_state=seed)
df["firstTitle"] = df.knownForTitles.str.split(",").str[0]
return df[["nconst", "firstTitle"]]
[5]:
tmap = translator.store()
2022-07-03T11:17:54.033 [rics.translation.fetching.AbstractFetcher:DEBUG] Placeholder mappings for source='title_basics': {'id': 'tconst', 'name': 'primaryTitle', 'from': 'startYear', 'to': 'endYear', 'original_name': 'originalTitle'}.
2022-07-03T11:17:54.451 [rics.translation.fetching.AbstractFetcher:DEBUG] Fetched ('tconst', 'primaryTitle', 'originalTitle', 'startYear', 'endYear') for 42074 IDS from 'title_basics' in 0.415381 sec.
2022-07-03T11:17:54.454 [rics.mapping.Mapper:DEBUG] Begin mapping value='original_name' in context='name_basics' to candidates={'birthYear', 'deathYear', 'int_id_nconst', 'primaryProfession', 'nconst', 'knownForTitles', 'primaryName', 'index'} using HeuristicScore([force_lower_case] -> default_score_function).
2022-07-03T11:17:54.456 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'primaryName', score=0.182 < 1.0.
2022-07-03T11:17:54.457 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'int_id_nconst', score=0.077 < 1.0.
2022-07-03T11:17:54.457 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'deathYear', score=0.022 < 1.0.
2022-07-03T11:17:54.458 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'index', score=0.022 < 1.0.
2022-07-03T11:17:54.459 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'primaryProfession', score=0.015 < 1.0.
2022-07-03T11:17:54.459 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'birthYear', score=0.000 < 1.0.
2022-07-03T11:17:54.461 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'nconst', score=0.000 < 1.0.
2022-07-03T11:17:54.462 [rics.mapping.Mapper.reject:DEBUG] Rejected: 'original_name' -> 'knownForTitles', score=0.000 < 1.0.
2022-07-03T11:17:54.462 [rics.mapping.Mapper:DEBUG] Could not map value='original_name' in context='name_basics' to any of candidates={'birthYear', 'deathYear', 'int_id_nconst', 'primaryProfession', 'nconst', 'knownForTitles', 'primaryName', 'index'}.
2022-07-03T11:17:54.463 [rics.translation.fetching.AbstractFetcher:DEBUG] Placeholder mappings for source='name_basics': {'id': 'nconst', 'name': 'primaryName', 'from': 'birthYear', 'to': 'deathYear', 'original_name': None}.
2022-07-03T11:17:55.848 [rics.translation.fetching.AbstractFetcher:DEBUG] Fetched ('nconst', 'primaryName', 'birthYear', 'deathYear') for 163559 IDS from 'name_basics' in 1.38 sec.
2022-07-03T11:17:55.849 [rics.translation.fetching.SqlFetcher:INFO] Deleting Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb)
In the IMDb list anyway. I have no idea how they’re ordered in “knownForTitles”.
[6]:
df = first_title(seed=5)
df.head()
[6]:
| nconst | firstTitle | |
|---|---|---|
| 5536 | nm0038172 | tt0063897 |
| 5882 | nm0040962 | tt0043440 |
| 105691 | nm0865925 | tt0373558 |
| 115067 | nm0941259 | tt5558956 |
| 126661 | nm1229926 | tt0329418 |
[7]:
translator.translate(df).head(5)
[7]:
| nconst | firstTitle | |
|---|---|---|
| 5536 | nm0038172:Peter Aryans *1918†2001 | tt0063897:Floris (original: Floris) *1969†1969 |
| 5882 | nm0040962:Ugo Attanasio *1887†1969 | tt0043440 not translated; default name=Title u... |
| 105691 | nm0865925:Manuel Toledano *1974†2007 | tt0373558:Lo + plus (original: Lo + plus) *199... |
| 115067 | nm0941259:Roberta Woolley *1938†2017 | tt5558956 not translated; default name=Title u... |
| 126661 | nm1229926:Carla Hansen *1906†2001 | tt0329418 not translated; default name=Title u... |
[8]:
translator.translate(df, inplace=True) # returns None
df.head(5)
[8]:
| nconst | firstTitle | |
|---|---|---|
| 5536 | nm0038172:Peter Aryans *1918†2001 | tt0063897:Floris (original: Floris) *1969†1969 |
| 5882 | nm0040962:Ugo Attanasio *1887†1969 | tt0043440 not translated; default name=Title u... |
| 105691 | nm0865925:Manuel Toledano *1974†2007 | tt0373558:Lo + plus (original: Lo + plus) *199... |
| 115067 | nm0941259:Roberta Woolley *1938†2017 | tt5558956 not translated; default name=Title u... |
| 126661 | nm1229926:Carla Hansen *1906†2001 | tt0329418 not translated; default name=Title u... |
[ ]: