min read

Star schema optimization in LeapDB

LeapDB includes "star schema optimization" which is a technique for improving query performance on star schema queries that filter out rows based on filters in dimension tables. This reduces IO for star schema queries and provides opportunities for query parallelization.

LeapDB star schema optimization with parallel query

LeapDB includes "star schema optimization" which is a technique for improving query performance on star schema
queries that filter out rows based on filters in dimension tables. This reduces IO for star schema queries and provides opportunities for query parallelization.

In order to optimize star schema queries, WARP uses a parallel vectorized hash join internally to collect
the rowids of fact table and dimension table columns and project out only these rowids to MySQL during the
JOIN process.

Internally, WARP partitions data into partitions of 1 million rows (by default). These
partitions are processed in parallel and hash joined to the dimension tables. WARP supports dimension tables
that are joined by INTEGER or BIGINT columns. As each fact row is processed, rows are discarded that don't
match dimension table lookups. Rows not discarded are added to an internal list of rowids (per-partition)
which allows WARP to then prune partitions that do not have any matching dimension rows, reducing the amount
of IO done by the query proportional to the number of partitions eliminated.

By processing partitions in parallel and eliminating extra IO (in addition to column store IO reduction) queries can be improved in
performance by 100X or more on the star schema benchmark, compared to InnoDB! In fact, in the following
example, cold performance is 400X faster and hot performance is over 5000x faster! This is without using
materialized views, just the column store included with LeapDB.

The following is query 3.3 of the "star schema benchmark" at scale factor 100 (64GB of data in the fact
table). This first result set is a "cold" database. The query takes about 1 and a half minutes cold
and only 7 seconds hot on WARP.

mysql> select c_city, s_city, d_year, sum(lo_revenue)
    -> as revenue
    -> from customer
    -> join lineorder
    ->   on lo_custkey = c_customerkey
    -> join supplier
    ->   on lo_suppkey = s_suppkey
    -> join dim_date
    ->   on lo_orderdatekey = d_datekey
    -> where
    -> (c_city='UNITED KI1' or c_city='UNITED KI5')
    -> and (s_city='UNITED KI1' or s_city='UNITED KI5')
    -> and d_year >= 1992 and d_year <= 1997
    -> group by c_city, s_city, d_year
    -> order by d_year asc, revenue desc;
+------------+------------+--------+------------+
| c_city     | s_city     | d_year | revenue    |
+------------+------------+--------+------------+
| UNITED KI1 | UNITED KI1 |   1992 | 5426878207 |
| UNITED KI5 | UNITED KI1 |   1992 | 5359647842 |
| UNITED KI1 | UNITED KI5 |   1992 | 5087392899 |
| UNITED KI5 | UNITED KI5 |   1992 | 5030188109 |
| UNITED KI5 | UNITED KI1 |   1993 | 5487855346 |
| UNITED KI1 | UNITED KI1 |   1993 | 5388735268 |
| UNITED KI1 | UNITED KI5 |   1993 | 5264963555 |
| UNITED KI5 | UNITED KI5 |   1993 | 5082171171 |
| UNITED KI5 | UNITED KI1 |   1994 | 5536002134 |
| UNITED KI1 | UNITED KI1 |   1994 | 5447718221 |
| UNITED KI5 | UNITED KI5 |   1994 | 5141415853 |
| UNITED KI1 | UNITED KI5 |   1994 | 5094359575 |
| UNITED KI5 | UNITED KI1 |   1995 | 5321015879 |
| UNITED KI1 | UNITED KI1 |   1995 | 5300521630 |
| UNITED KI1 | UNITED KI5 |   1995 | 5190621611 |
| UNITED KI5 | UNITED KI5 |   1995 | 5113257875 |
| UNITED KI1 | UNITED KI1 |   1996 | 5582184547 |
| UNITED KI5 | UNITED KI1 |   1996 | 5344938723 |
| UNITED KI1 | UNITED KI5 |   1996 | 5261127108 |
| UNITED KI5 | UNITED KI5 |   1996 | 5202073275 |
| UNITED KI5 | UNITED KI1 |   1997 | 5433144227 |
| UNITED KI1 | UNITED KI1 |   1997 | 5369255242 |
| UNITED KI1 | UNITED KI5 |   1997 | 5180772174 |
| UNITED KI5 | UNITED KI5 |   1997 | 5090286645 |
+------------+------------+--------+------------+
24 rows in set (1 min 23.05 sec)

Now repeat the query so that the cache can take effect:

mysql> select c_city, s_city, d_year, sum(lo_revenue) as revenue from customer join lineorder   on 
lo_custkey = c_customerkey join supplier   on lo_suppkey = s_suppkey join dim_date   on lo_orderdatekey = 
d_datekey where (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED 
KI5') and d_year >= 1992 and d_year <= 1997 group by c_city, s_city, d_year order by d_year asc, revenue 
desc;

+------------+------------+--------+------------+
| c_city     | s_city     | d_year | revenue    |
+------------+------------+--------+------------+
| UNITED KI1 | UNITED KI1 |   1992 | 5426878207 |
| UNITED KI5 | UNITED KI1 |   1992 | 5359647842 |
| UNITED KI1 | UNITED KI5 |   1992 | 5087392899 |
| UNITED KI5 | UNITED KI5 |   1992 | 5030188109 |
| UNITED KI5 | UNITED KI1 |   1993 | 5487855346 |
| UNITED KI1 | UNITED KI1 |   1993 | 5388735268 |
| UNITED KI1 | UNITED KI5 |   1993 | 5264963555 |
| UNITED KI5 | UNITED KI5 |   1993 | 5082171171 |
| UNITED KI5 | UNITED KI1 |   1994 | 5536002134 |
| UNITED KI1 | UNITED KI1 |   1994 | 5447718221 |
| UNITED KI5 | UNITED KI5 |   1994 | 5141415853 |
| UNITED KI1 | UNITED KI5 |   1994 | 5094359575 |
| UNITED KI5 | UNITED KI1 |   1995 | 5321015879 |
| UNITED KI1 | UNITED KI1 |   1995 | 5300521630 |
| UNITED KI1 | UNITED KI5 |   1995 | 5190621611 |
| UNITED KI5 | UNITED KI5 |   1995 | 5113257875 |
| UNITED KI1 | UNITED KI1 |   1996 | 5582184547 |
| UNITED KI5 | UNITED KI1 |   1996 | 5344938723 |
| UNITED KI1 | UNITED KI5 |   1996 | 5261127108 |
| UNITED KI5 | UNITED KI5 |   1996 | 5202073275 |
| UNITED KI5 | UNITED KI1 |   1997 | 5433144227 |
| UNITED KI1 | UNITED KI1 |   1997 | 5369255242 |
| UNITED KI1 | UNITED KI5 |   1997 | 5180772174 |
| UNITED KI5 | UNITED KI5 |   1997 | 5090286645 |
+------------+------------+--------+------------+
24 rows in set (7.08 sec)

Compare this to InnoDB. Instead of 1 minute (or 7 seconds hot), it takes 13 hours and 8 minutes (hot or cold)!

mysql> select c_city, s_city, d_year, sum(lo_revenue) as revenue from customer join lineorder   on 
lo_custkey = c_customerkey join supplier   on lo_suppkey = s_suppkey join dim_date   on lo_orderdatekey =
d_datekey where (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED 
KI5') and d_year >= 1992 and d_year <= 1997 group by c_city, s_city, d_year order by d_year asc, revenue 
desc;

+------------+------------+--------+------------+
| c_city     | s_city     | d_year | revenue    |
+------------+------------+--------+------------+
| UNITED KI1 | UNITED KI1 |   1992 | 5426878207 |
| UNITED KI5 | UNITED KI1 |   1992 | 5359647842 |
| UNITED KI1 | UNITED KI5 |   1992 | 5087392899 |
| UNITED KI5 | UNITED KI5 |   1992 | 5030188109 |
| UNITED KI5 | UNITED KI1 |   1993 | 5487855346 |
| UNITED KI1 | UNITED KI1 |   1993 | 5388735268 |
| UNITED KI1 | UNITED KI5 |   1993 | 5264963555 |
| UNITED KI5 | UNITED KI5 |   1993 | 5082171171 |
| UNITED KI5 | UNITED KI1 |   1994 | 5536002134 |
| UNITED KI1 | UNITED KI1 |   1994 | 5447718221 |
| UNITED KI5 | UNITED KI5 |   1994 | 5141415853 |
| UNITED KI1 | UNITED KI5 |   1994 | 5094359575 |
| UNITED KI5 | UNITED KI1 |   1995 | 5321015879 |
| UNITED KI1 | UNITED KI1 |   1995 | 5300521630 |
| UNITED KI1 | UNITED KI5 |   1995 | 5190621611 |
| UNITED KI5 | UNITED KI5 |   1995 | 5113257875 |
| UNITED KI1 | UNITED KI1 |   1996 | 5582184547 |
| UNITED KI5 | UNITED KI1 |   1996 | 5344938723 |
| UNITED KI1 | UNITED KI5 |   1996 | 5261127108 |
| UNITED KI5 | UNITED KI5 |   1996 | 5202073275 |
| UNITED KI5 | UNITED KI1 |   1997 | 5433144227 |
| UNITED KI1 | UNITED KI1 |   1997 | 5369255242 |
| UNITED KI1 | UNITED KI5 |   1997 | 5180772174 |
| UNITED KI5 | UNITED KI5 |   1997 | 5090286645 |
+------------+------------+--------+------------+
24 rows in set (13 hours 8 min 15.81 sec)

This is a well-tuned InnoDB database!

The difference between InnoDB and WARP with star-schema optimization is night-and-day.

Latest Articles

View All Articles

Using a materialized view to accelerate multiple queries on a large dataset

I use my benchmark queries on the BTS Ontime data set in a materialized view context and compare performance of the queries when using the materialized view against querying the base table.

Query Optimisation