min read

Using a materialized view to accelerate multiple queries on a large dataset

I use my benchmark queries on the BTS Ontime data set in a materialized view context and compare performance of the queries when using the materialized view against querying the base table.

Using a materialized view to quickly answer multiple queries

This blog post is a followup to my last blog post comparing WARP and InnoDB. At the end of that post I promised to follow up with how materialized views can help with query performance when btree indexing can't effectively speed up queries enough. One example I gave was using a covering index to speed up queries, but it only speed up the queries by 3x. Materialized views speed up queries 100x or more!

mysql> create incremental materialized view ontime_OriginAirport_DestAirport_Year_cnt 
as 
select OriginAirportId, DestAirportId, Year, count(*) cnt 
  from ontime_innodb 
 group by OriginAirportId, DestAirportId, Year;

Query OK, 0 rows affected (9 min 37.53 sec)

The first query that I demonstrated in the earlier blog post was:

select Year, count(*) cnt from ontime_innodb group by 1;

We can answer this query using our materialized view

mysql> select Year, sum(cnt) as `count(*)` 
         from ontime_OriginAirport_DestAirport_Year_cnt 
        group by Year order by Year;
+------+----------+
| Year | count(*) |
+------+----------+
| 1987 |  1311827 |
| 1988 |  5202096 |
| 1989 |  5041200 |
| 1990 |  5270893 |
| 1991 |  5076925 |
| 1992 |  5092166 |
| 1993 |  5070501 |
| 1994 |  5180048 |
| 1995 |  5327435 |
| 1996 |  5351983 |
| 1997 |  5411843 |
| 1998 |  5384721 |
| 1999 |  5527884 |
| 2000 |  5683047 |
| 2001 |  3422767 |
| 2002 |  4935434 |
| 2003 |  6488540 |
| 2004 |  7129270 |
| 2005 |  7140596 |
| 2006 |  7141922 |
| 2007 |  7455458 |
| 2008 |  7009726 |
| 2009 |  6450285 |
| 2010 |  6450117 |
| 2011 |  6085281 |
| 2012 |  6096762 |
| 2013 |  6369482 |
| 2014 |  5819811 |
+------+----------+
28 rows in set (0.05 sec)

Compare that to running the query against the base table:

mysql> select Year, count(*) cnt 
         from ontime_innodb 
        group by 1 order by 1;
+------+----------+
| Year | count(*) |
+------+----------+
| 1987 |  1311827 |
| 1988 |  5202096 |
| 1989 |  5041200 |
| 1990 |  5270893 |
| 1991 |  5076925 |
| 1992 |  5092166 |
| 1993 |  5070501 |
| 1994 |  5180048 |
| 1995 |  5327435 |
| 1996 |  5351983 |
| 1997 |  5411843 |
| 1998 |  5384721 |
| 1999 |  5527884 |
| 2000 |  5683047 |
| 2001 |  3422767 |
| 2002 |  4935434 |
| 2003 |  6488540 |
| 2004 |  7129270 |
| 2005 |  7140596 |
| 2006 |  7141922 |
| 2007 |  7455458 |
| 2008 |  7009726 |
| 2009 |  6450285 |
| 2010 |  6450117 |
| 2011 |  6085281 |
| 2012 |  6096762 |
| 2013 |  6369482 |
| 2014 |  5819811 |
+------+----------+
28 rows in set (8 min 7.4194 sec)

Using materialized views provides an impressive speedup. Maintaining the materialized view is easy. I deleted a month of data from the base table:

mysql> delete from ontime_innodb where FlightDate < '1987-11-01';
Query OK, 448621 rows affected (5.80 sec)

The background process that reads from the binary logs takes some time to process them. I waited until the background process caught up (about 4000K rows/sec) and then I refreshed the view:

mysql> call leapdb.refresh2('ontime','ontime_OriginAirport_DestAirport_Year_cnt');
Query OK, 0 rows affected (1.46 sec)

As you can see, it takes very little time to refresh the view once the background process is caught up. If you modify data in batches or nightly, the background process should not be behind often.

Now we can do the query by year and see that the number of rows in 1987 has decreased.

mysql> select Year, sum(cnt) as `count(*)` 
         from ontime_OriginAirport_DestAirport_Year_cnt 
        group by Year order by Year;
+------+----------+
| Year | count(*) |
+------+----------+
| 1987 |   914406 |
| 1988 |  5202096 |
| 1989 |  5041200 |
| 1990 |  5270893 |
| 1991 |  5076925 |
| 1992 |  5092166 |
| 1993 |  5070501 |
| 1994 |  5180048 |
| 1995 |  5327435 |
| 1996 |  5351983 |
| 1997 |  5411843 |
| 1998 |  5384721 |
| 1999 |  5527884 |
| 2000 |  5683047 |
| 2001 |  3422767 |
| 2002 |  4935434 |
| 2003 |  6488540 |
| 2004 |  7129270 |
| 2005 |  7140596 |
| 2006 |  7141922 |
| 2007 |  7455458 |
| 2008 |  7009726 |
| 2009 |  6450285 |
| 2010 |  6450117 |
| 2011 |  6085281 |
| 2012 |  6096762 |
| 2013 |  6369482 |
| 2014 |  5819811 |
+------+----------+
28 rows in set (0.03 sec)

The next query I ran in the previous blog post used WARP automatic indexing. The query when run on InnoDB was:

mysql> select count(*) cnt 
              from ontime_innodb 
             where Year = 1998;
+---------+
| cnt     |
+---------+
| 5384721 |
+---------+
1 row in set (8 min 17.9652 sec)

The materialized view can be used to answer such a query much faster than scanning the table. In this way, the materialized view is acting as a kind of index.

mysql> select Year, sum(cnt) as `count(*)` 
        from ontime_OriginAirport_DestAirport_Year_cnt 
       where Year = '1998';
+------+----------+
| Year | count(*) |
+------+----------+
| 1998 |  5384721 |
+------+----------+
1 row in set (0.02 sec)

Finally, in the previous blog post I demonstrated a query that could not be answered efficiently using a btree index:

mysql> select count(*) cnt 
         from ontime_innodb 
        where OriginAirportId = 10135 or DestAirportId = 10135;
+--------+
| cnt    |
+--------+
| 275132 |
+--------+
1 row in set (8 min 10.2135 sec)

Using the materialized view, this query can be answered efficiently:

mysql> select sum(cnt) as `count(*)` 
         from ontime_OriginAirport_DestAirport_Year_cnt 
        where OriginAirportId = 10135 or DestAirportId = 10135;
+----------+
| count(*) |
+----------+
|   275132 |
+----------+
1 row in set (0.02 sec)

Using simple query rewrites to acess the materialized view, you can speed up lots of different queries on your system. This single materialized view answers many different questions. Choose how you build your views and you can get a lot of bang for your buck.

Latest Articles

View All Articles

Star schema optimization in LeapDB

LeapDB includes "star schema optimization" which is a technique for improving query performance on star schema queries that filter out rows based on filters in dimension tables. This reduces IO for star schema queries and provides opportunities for query parallelization.