DVD Rental Database#

This example translates a query from the DVD Rental Sample Database. It covers most of the more advanced features that have been implemented.

Start the database#

Using Docker, start the database by running:

docker run -p 5001:5432 moertel/postgresql-sample-dvdrental@sha256:e35f8dc4011d053777631208c85e3976a422b65e12383579d8a856a7849082c5

from the terminal. To describe the database, run

psql -h localhost -p 5001 -U postgres -d dvdrental -c "\d+"

from a separate terminal window. Leave out the last part (-c "\d+") to query the database manually.

Query to translate#

We will use a small query to describe rental transactions in the database:

Query returning DVD rental transactions.#
 1SELECT customer_id,
 2       inventory.film_id,
 3       film_category.category_id,
 4       staff_id,
 5       store_id,
 6       rental_date,
 7       return_date
 8FROM rental
 9LEFT JOIN inventory ON rental.inventory_id = inventory.inventory_id
10LEFT JOIN film_category on film_category.film_id = inventory.film_id;

The query above shows who rented what and when, what store they rented from and from whom.

Randomly sampled rows from the query. The first column is the record index in the query.#

customer_id

film_id

staff_id

store_id

rental_date

return_date

9564

385

129

1

2

2005-07-31 02:40:37

2005-08-01 01:59:37

14933

328

158

1

1

2005-08-22 05:53:52

2005-08-27 06:20:52

544

506

448

2

1

2005-05-28 07:24:28

2005-06-06 01:31:28

1869

206

786

2

1

2005-06-17 02:27:03

2005-06-20 22:41:03

928

421

743

1

2

2005-05-30 12:53:01

2005-05-31 14:28:01

Configuration files#

The database has a few quirks, which are managed by configuration. See the Translator Configuration Files page to learn more about config files.

Translation configuration, mapping, and definition of the categories.#
 1################################################################################
 2# For help, see https://rics.readthedocs.io                                    #
 3################################################################################
 4[translator]
 5fmt = "{id}:{name}[ {last_name}]"
 6
 7# ------------------------------------------------------------------------------
 8# Name-to-source mapping configuration. Binds names to source, eg 'cute_animals'
 9# -> 'my_database.animals'. Overrides take precedence over scoring logic.
10[translator.mapping]
11unmapped_values_action = "raise"
12score_function = "modified_hamming"
13
14[[translator.mapping.score_function_heuristics]]
15function = "like_database_table"
16
17[[translator.mapping.filter_functions]]
18function = "require_regex_match"
19regex = ".*_id$"
20where = "name"
21
22################################################################################
23# Fetching configuration.
24################################################################################
25[fetching.MemoryFetcher.data.category]
26id = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
27name = [
28    "Action", "Animation",
29    "Children", "Classics", "Comedy",
30    "Documentary", "Drama",
31    "Family", "Foreign",
32    "Games", "Horror", "Music", "New",
33    "Sci-Fi", "Sports", "Travel"
34]
Configuration for fetching SQL data.#
 1[fetching.SqlFetcher]
 2connection_string = "postgresql+pg8000://postgres@localhost:5001/dvdrental"
 3blacklist_tables = [ "category" ]  # Let's pretend this table is off-limits
 4
 5# ------------------------------------------------------------------------------
 6# Placeholder mapping configuration. Binds actual names in sources (such as
 7# column names in an SQL table) to wanted names, eg id = 'animal_id'.
 8[fetching.mapping]
 9unmapped_values_action = "ignore"
10
11[[fetching.mapping.filter_functions]]
12function = "banned_substring"
13substrings = ["p2007"]
14where = "context"
15
16[[fetching.mapping.score_function_heuristics]]
17function = "value_fstring_alias"
18fstring = "first_{value}"
19[[fetching.mapping.score_function_heuristics]]
20function = "value_fstring_alias"
21fstring = "{context}_{value}"
22
23[fetching.mapping.overrides.addess]
24code = "postal_code"
25[fetching.mapping.overrides.film]
26name = "title"

Translating#

Translating now becomes a simple matter. The following snippet is a test case which translates all of the ~16000 rows returned by the query, verifying a random sample of 5 rows.

 1import pandas as pd
 2import pytest
 3
 4from rics.translation import Translator
 5
 6from .conftest import DATE_COLUMNS, DVD_RENTAL_SKIP_REASON, wait_for_dvdrental
 7
 8
 9@pytest.fixture(scope="module")
10def translator():
11    extra_fetchers = ["tests/translation/dvdrental/sql-fetcher.toml"]
12    config = "tests/translation/dvdrental/translation.toml"
13    return Translator.from_config(config, extra_fetchers)
14
15
16@pytest.mark.skipif(not wait_for_dvdrental(), reason=DVD_RENTAL_SKIP_REASON)
17def test_dvd_rental(translator):
18    expected = pd.read_csv("tests/translation/dvdrental/translated.csv", index_col=0, parse_dates=DATE_COLUMNS)
19
20    with open("tests/translation/dvdrental/query.sql") as f:
21        query = f.read()
22
23    engine = translator._fetcher.fetchers[-1]._engine  # The SQL fetcher was given last
24    df: pd.DataFrame = pd.read_sql(query, con=engine)
25    translator.translate(df, inplace=True)
26
27    assert (df.select_dtypes("datetime").columns == DATE_COLUMNS).all()
28    actual = df.sample(len(expected), random_state=0)
29    assert expected.equals(actual), actual.to_string()

The translated rows are:

The rows translated. Dates are not translated, nor is the first (row number/index) column.#

customer_id

film_id

category_id

staff_id

store_id

rental_date

return_date

12503

29:ANGELA HERNANDEZ

235:DIVIDE MONSTER

14:Sci-Fi

1:Mike Hillyer

2:MySQL

2005-08-18 13:20:13

2005-08-22 12:55:13

13357

565:JAIME NETTLES

191:CROOKED FROGMEN

3:Children

2:Jon Stephens

1:MySakila

2005-08-19 21:07:54

2005-08-20 17:29:54

1638

259:LENA JENSEN

643:ORIENT CLOSER

7:Drama

2:Jon Stephens

2:MySQL

2005-06-16 08:46:26

2005-06-24 14:39:26

7129

574:JULIAN VEST

235:DIVIDE MONSTER

14:Sci-Fi

1:Mike Hillyer

2:MySQL

2005-07-27 06:33:06

2005-07-28 06:15:06

5931

258:MYRTLE FLEMING

739:ROCKY WAR

7:Drama

2:Jon Stephens

2:MySQL

2005-07-10 22:11:04

2005-07-19 18:12:04