LeapDB includes "star schema optimization" which is a technique for improving query performance on star schema queries that filter out rows based on filters in dimension tables. This reduces IO for star schema queries and provides opportunities for query parallelization.
Hi. I’m Justin Swanhart, the CEO and founder of Leap DB. I’ve been working with databases and distributed systems for my entire twenty-two-year career.
In order to optimize star schema queries, WARP uses a parallel vectorized hash join internally to collect
the rowids of fact table and dimension table columns and project out only these rowids to MySQL during the
JOIN process.
Internally, WARP partitions data into partitions of 1 million rows (by default). These
partitions are processed in parallel and hash joined to the dimension tables. WARP supports dimension tables
that are joined by INTEGER or BIGINT columns. As each fact row is processed, rows are discarded that don't
match dimension table lookups. Rows not discarded are added to an internal list of rowids (per-partition)
which allows WARP to then prune partitions that do not have any matching dimension rows, reducing the amount
of IO done by the query proportional to the number of partitions eliminated.
By processing partitions in
parallel and eliminating extra IO (in addition to column store IO reduction) queries can be improved in
performance by 100X or more on the star schema benchmark, compared to InnoDB! In fact, in the following
example, cold performance is 400X faster and hot performance is over 5000x faster! This is without using
materialized views, just the column store included with LeapDB.
mysql> select c_city, s_city, d_year, sum(lo_revenue) -> as revenue -> from customer -> join lineorder -> on lo_custkey = c_customerkey -> join supplier -> on lo_suppkey = s_suppkey -> join dim_date -> on lo_orderdatekey = d_datekey -> where -> (c_city='UNITED KI1' or c_city='UNITED KI5') -> and (s_city='UNITED KI1' or s_city='UNITED KI5') -> and d_year >= 1992 and d_year <= 1997 -> group by c_city, s_city, d_year -> order by d_year asc, revenue desc; +------------+------------+--------+------------+ | c_city | s_city | d_year | revenue | +------------+------------+--------+------------+ | UNITED KI1 | UNITED KI1 | 1992 | 5426878207 | | UNITED KI5 | UNITED KI1 | 1992 | 5359647842 | | UNITED KI1 | UNITED KI5 | 1992 | 5087392899 | | UNITED KI5 | UNITED KI5 | 1992 | 5030188109 | | UNITED KI5 | UNITED KI1 | 1993 | 5487855346 | | UNITED KI1 | UNITED KI1 | 1993 | 5388735268 | | UNITED KI1 | UNITED KI5 | 1993 | 5264963555 | | UNITED KI5 | UNITED KI5 | 1993 | 5082171171 | | UNITED KI5 | UNITED KI1 | 1994 | 5536002134 | | UNITED KI1 | UNITED KI1 | 1994 | 5447718221 | | UNITED KI5 | UNITED KI5 | 1994 | 5141415853 | | UNITED KI1 | UNITED KI5 | 1994 | 5094359575 | | UNITED KI5 | UNITED KI1 | 1995 | 5321015879 | | UNITED KI1 | UNITED KI1 | 1995 | 5300521630 | | UNITED KI1 | UNITED KI5 | 1995 | 5190621611 | | UNITED KI5 | UNITED KI5 | 1995 | 5113257875 | | UNITED KI1 | UNITED KI1 | 1996 | 5582184547 | | UNITED KI5 | UNITED KI1 | 1996 | 5344938723 | | UNITED KI1 | UNITED KI5 | 1996 | 5261127108 | | UNITED KI5 | UNITED KI5 | 1996 | 5202073275 | | UNITED KI5 | UNITED KI1 | 1997 | 5433144227 | | UNITED KI1 | UNITED KI1 | 1997 | 5369255242 | | UNITED KI1 | UNITED KI5 | 1997 | 5180772174 | | UNITED KI5 | UNITED KI5 | 1997 | 5090286645 | +------------+------------+--------+------------+ 24 rows in set (1 min 23.05 sec)Now repeat the query so that the cache can take effect:
mysql> select c_city, s_city, d_year, sum(lo_revenue) as revenue from customer join lineorder on lo_custkey = c_customerkey join supplier on lo_suppkey = s_suppkey join dim_date on lo_orderdatekey = d_datekey where (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED KI5') and d_year >= 1992 and d_year <= 1997 group by c_city, s_city, d_year order by d_year asc, revenue desc;Compare this to InnoDB. Instead of 1 minute (or 7 seconds hot), it takes 13 hours and 8 minutes (hot or cold)!
+------------+------------+--------+------------+ | c_city | s_city | d_year | revenue | +------------+------------+--------+------------+ | UNITED KI1 | UNITED KI1 | 1992 | 5426878207 | | UNITED KI5 | UNITED KI1 | 1992 | 5359647842 | | UNITED KI1 | UNITED KI5 | 1992 | 5087392899 | | UNITED KI5 | UNITED KI5 | 1992 | 5030188109 | | UNITED KI5 | UNITED KI1 | 1993 | 5487855346 | | UNITED KI1 | UNITED KI1 | 1993 | 5388735268 | | UNITED KI1 | UNITED KI5 | 1993 | 5264963555 | | UNITED KI5 | UNITED KI5 | 1993 | 5082171171 | | UNITED KI5 | UNITED KI1 | 1994 | 5536002134 | | UNITED KI1 | UNITED KI1 | 1994 | 5447718221 | | UNITED KI5 | UNITED KI5 | 1994 | 5141415853 | | UNITED KI1 | UNITED KI5 | 1994 | 5094359575 | | UNITED KI5 | UNITED KI1 | 1995 | 5321015879 | | UNITED KI1 | UNITED KI1 | 1995 | 5300521630 | | UNITED KI1 | UNITED KI5 | 1995 | 5190621611 | | UNITED KI5 | UNITED KI5 | 1995 | 5113257875 | | UNITED KI1 | UNITED KI1 | 1996 | 5582184547 | | UNITED KI5 | UNITED KI1 | 1996 | 5344938723 | | UNITED KI1 | UNITED KI5 | 1996 | 5261127108 | | UNITED KI5 | UNITED KI5 | 1996 | 5202073275 | | UNITED KI5 | UNITED KI1 | 1997 | 5433144227 | | UNITED KI1 | UNITED KI1 | 1997 | 5369255242 | | UNITED KI1 | UNITED KI5 | 1997 | 5180772174 | | UNITED KI5 | UNITED KI5 | 1997 | 5090286645 | +------------+------------+--------+------------+ 24 rows in set (7.08 sec)
mysql> select c_city, s_city, d_year, sum(lo_revenue) as revenue from customer join lineorder on lo_custkey = c_customerkey join supplier on lo_suppkey = s_suppkey join dim_date on lo_orderdatekey = d_datekey where (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED KI5') and d_year >= 1992 and d_year <= 1997 group by c_city, s_city, d_year order by d_year asc, revenue desc; +------------+------------+--------+------------+ | c_city | s_city | d_year | revenue | +------------+------------+--------+------------+ | UNITED KI1 | UNITED KI1 | 1992 | 5426878207 | | UNITED KI5 | UNITED KI1 | 1992 | 5359647842 | | UNITED KI1 | UNITED KI5 | 1992 | 5087392899 | | UNITED KI5 | UNITED KI5 | 1992 | 5030188109 | | UNITED KI5 | UNITED KI1 | 1993 | 5487855346 | | UNITED KI1 | UNITED KI1 | 1993 | 5388735268 | | UNITED KI1 | UNITED KI5 | 1993 | 5264963555 | | UNITED KI5 | UNITED KI5 | 1993 | 5082171171 | | UNITED KI5 | UNITED KI1 | 1994 | 5536002134 | | UNITED KI1 | UNITED KI1 | 1994 | 5447718221 | | UNITED KI5 | UNITED KI5 | 1994 | 5141415853 | | UNITED KI1 | UNITED KI5 | 1994 | 5094359575 | | UNITED KI5 | UNITED KI1 | 1995 | 5321015879 | | UNITED KI1 | UNITED KI1 | 1995 | 5300521630 | | UNITED KI1 | UNITED KI5 | 1995 | 5190621611 | | UNITED KI5 | UNITED KI5 | 1995 | 5113257875 | | UNITED KI1 | UNITED KI1 | 1996 | 5582184547 | | UNITED KI5 | UNITED KI1 | 1996 | 5344938723 | | UNITED KI1 | UNITED KI5 | 1996 | 5261127108 | | UNITED KI5 | UNITED KI5 | 1996 | 5202073275 | | UNITED KI5 | UNITED KI1 | 1997 | 5433144227 | | UNITED KI1 | UNITED KI1 | 1997 | 5369255242 | | UNITED KI1 | UNITED KI5 | 1997 | 5180772174 | | UNITED KI5 | UNITED KI5 | 1997 | 5090286645 | +------------+------------+--------+------------+ 24 rows in set (13 hours 8 min 15.81 sec)This is a well-tuned InnoDB database!