[1]:
import sys
import rics
# Print relevant versions
print(f"{rics.__version__=}")
print(f"{sys.version=}")
!git log --pretty=oneline --abbrev-commit -1
rics.__version__='0.14.0'
sys.version='3.9.12 (main, Jun 1 2022, 11:38:51) \n[GCC 7.5.0]'
120c623 (HEAD -> master) fixup! Add perftest tests and enable coverage
[2]:
from rics.utility.logs import basic_config, logging
basic_config(level=logging.DEBUG, matplotlib_level=logging.INFO)
What’s faster at various counts of requests IDs?
[3]:
import sqlalchemy
engine = sqlalchemy.create_engine(
"postgresql+pg8000://postgres:your_password@localhost:5432/imdb"
)
Define the functions or classes we’re testing. See candidates.py for the actual file, or the end of this notebook for a printout.
[4]:
# candidates.py is located in the same folder as this notebook
from candidates import Candidates
c = Candidates(engine)
candidates = [c.as_is, c.between, c.is_in, c.heuristic]
Make sure candidates are equivalent.
[5]:
import numpy as np
all_ids = [int(row[0]) for row in c.as_is([])]
int_list = lambda size: list(
map(int, np.random.choice(all_ids, size=size, replace=False))
)
test_data = {
f"{size} ids": int_list(size) for size in [50, 250, 500, 1000, 2500, 5000, 10_000]
}
[6]:
first_case = next(iter(test_data.keys()))
must_include = set(test_data[first_case])
for cand in candidates:
cand_output = set(row[0] for row in cand(test_data[first_case]))
assert not must_include.difference(
cand_output
), f"Bad candidate: {cand}. {cand_output=} != {reference_output=}"
Run comparison, show results per candidate and test data set.
[7]:
from rics.performance import run_multivariate_test, format_perf_counter
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme("notebook")
data = run_multivariate_test(candidates, test_data, time_per_candidate=3)
2022-07-21T17:41:30.412 [rics.performance:INFO] Expected total runtime: 60.00 sec.
2022-07-21T17:41:30.413 [rics.performance:DEBUG] Run candidate 'as_is' 5x2 times...
2022-07-21T17:41:45.203 [rics.performance:DEBUG] Run candidate 'between' 5x2 times...
2022-07-21T17:41:59.897 [rics.performance:DEBUG] Run candidate 'is_in' 5x8 times...
2022-07-21T17:42:14.991 [rics.performance:DEBUG] Run candidate 'heuristic' 5x5 times...
Summarized results per candidate.
[8]:
data.groupby("Candidate")["Time [ms]"].agg(["std", "mean", "min", "max"]).sort_values(
"mean"
).round(1)
[8]:
| std | mean | min | max | |
|---|---|---|---|---|
| Candidate | ||||
| is_in | 43.9 | 53.9 | 7.5 | 154.8 |
| heuristic | 86.3 | 83.0 | 7.5 | 241.1 |
| between | 32.6 | 209.9 | 181.9 | 368.5 |
| as_is | 22.0 | 211.1 | 183.7 | 273.6 |
Probaby unfair in favour of fetching “too much” since the DB is running locally. We’re also fetching the same things over and over again which may lead to caching.
The heuristic method seems good and will only get better with tuning.
IDs in this data set (title_basics) are very sparse which causes poor performance for the heuristic on the “5000 ids” set; performance drops earlier than for pure as_in.
Heuristic method is suscptible to sparse ID collections
Results for is_in are not cached, it just performs horrendously bad for large amounts of “in”-values.
Use heuristic method.
run_multivariate_test signature#Click here for docs.
[9]:
?run_multivariate_test
Signature:
run_multivariate_test(
candidate_method: Union[Callable[[Any], NoneType], Collection[Callable[[Any], NoneType]], Dict[str, Callable[[Any], NoneType]]],
test_data: Union[Any, Dict[str, Any]],
time_per_candidate: float = 6.0,
plot: bool = True,
**figure_kwargs: Any,
) -> pandas.core.frame.DataFrame
Docstring:
Run performance tests for multiple candidate methods on collections of test data.
This is a convenience method which combines :meth:`MultiCaseTimer.run() <rics.performance.MultiCaseTimer.run>`,
:meth:`~rics.performance.to_dataframe` and -- if plotting is enabled -- :meth:`~rics.performance.plot_run`. For full
functionally these methods should be use directly.
Args:
candidate_method: Candidate methods to evaluate.
test_data: Test data to evaluate.
time_per_candidate: Desired runtime for each repetition per candidate label.
plot: If ``True``, plot a figure using :meth:`~rics.performance.plot_run`.
**figure_kwargs: Keyword arguments for the barplot. Ignored if ``plot=False``.
Returns:
A long-format DataFrame of results.
Raises:
ModuleNotFoundError: If Seaborn isn't installed and ``plot=True``.
File: ~/git/rics/src/rics/performance/_wrapper.py
Type: function
candidates.py#[10]:
!pygmentize candidates.py
import sqlalchemy
# This value is probably a bit off compared to other real-world scenarios since IDs are very sparse in the IMDb data
MAX_OVERFETCH_FACTOR = 50
class Candidates:
def __init__(self, engine):
self.engine = engine
metadata = sqlalchemy.MetaData()
metadata.reflect(engine)
self.id_column = metadata.tables["title_basics"].columns["int_id_tconst"]
extra_cols = (
metadata.tables["title_basics"].columns["primaryTitle"],
metadata.tables["title_basics"].columns["originalTitle"],
metadata.tables["title_basics"].columns["runtimeMinutes"],
)
self.bare_select = sqlalchemy.select(self.id_column, *extra_cols)
def _exec(self, stmt):
return list(self.engine.execute(stmt))
def as_is(self, ids):
return self._exec(self.bare_select)
def between(self, ids):
min_id = min(ids)
max_id = max(ids)
select = self.bare_select.where(self.id_column.between(min_id, max_id))
return self._exec(select)
def is_in(self, ids):
return self._exec(self.bare_select.where(self.id_column.in_(ids)))
def heuristic(self, ids):
stmt = self.bare_select
num_ids = len(ids)
in_clause = stmt.where(self.id_column.in_(ids))
if num_ids < 1200:
return self._exec(in_clause)
min_id = min(ids)
max_id = max(ids)
between_clause = stmt.where(self.id_column.between(min_id, max_id))
if num_ids > 4500:
return self._exec(between_clause)
overfetch_factor = (max_id - min_id) / num_ids
return self._exec(in_clause if overfetch_factor > MAX_OVERFETCH_FACTOR else between_clause)
[ ]: