8
 min read

WARP vs InnoDB on a wide 75GiB table

A comparison between InnoDB and WARP on two decades of the BTS ontime flight data, a 109 column wide table.

WARP vs InnoDB on a wide 75GiB table

The benefits of a column store are many, especially on wide tables with many columns. The BTS Ontime flight statistics data is a large (100GB+) data set that consists of a single large table. There are lookup tables available for some fields, but they are not included in this test. I loaded a subset of this data to cover over two decades of statistics, and the raw file clocks in at 79GB. I had to post-process the files somewhat because they use empty delimiters for NULL instead of the \N convention for NULL that MySQL uses with LOAD DATA INFILE.

Test machine

6 core (12 thread) AMD processor [AMD Ryzen 5 5600@4.2GHz] with a Samsung 970 1TB NVME disk and 48GB ram. The disk is quite fast compared to say EBS cloud storage, so results would be different in the cloud or using SATA SSD.

Load times

Load WARP table (11 mins)
File '/data/ontime/ontime.csv' (79.68 GB) was imported in 9 min 58.1429 sec at 133.22 MB/s
Total rows affected in ontime.ontime_warp: Records: 157928010  Deleted: 0  Skipped: 0  Warnings: 0

Load InnoDB table (37.5 mins)
File '/data/ontime/ontime.csv' (79.68 GB) was imported in 37 min 28.6070 sec at 35.44 MB/s
Total rows affected in ontime.ontime_innodb: Records: 157928010  Deleted: 0  Skipped: 0  Warnings: 0
Note that there is an index on FlightDate on ontime_innodb.

No more FULL TABLE SCANS

WARP is a column store. This means that is can access and do IO for only the subset of columns that are accessed by the query. The only time a "full table scan" is executed by WARP is when SELECT * FROM TABLE; is used. Because only the columns that are used in the query are accessed on disk, query performance is increased mostly due to decreased IO. If a star schema is used, star schema optimization increases performance even further by adding parallelism to the joins

Select COUNT(*) FROM TABLE; reads only the smallest amount of data from disk necessary to answer the query (the row metadata). Even though InnoDB supports parallel query for SELECT COUNT(*) it is significantly slower and uses significantly more IO than WARP. Here are the two compared:
mysql> select count(*) from ontime_warp;
+-----------+
| count(*)  |
+-----------+
| 157928010 |
+-----------+
1 row in set (42.19 sec)

mysql> select count(*) from ontime_innodb;
+-----------+
| count(*)  |
+-----------+
| 157928020 |
+-----------+
1 row in set (4 min 13.73 sec)

The parallel query on InnoDB is not very effective, it returns data only about 2x faster than a normal full table scan.

Let's take a look at a query that accesses the full table, but only one column.
mysql> select Year, count(*) cnt from ontime_innodb group by 1;
+------+---------+
| Year | cnt     |
+------+---------+
| 1987 | 1311827 |
| 1992 | 5092166 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 3422767 |
| 2002 | 4935434 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 2014 | 5819811 |
+------+---------+
28 rows in set (8 min 7.4194 sec)

> select Year, count(*) cnt from ontime group by 1;
+------+---------+
| Year | cnt     |
+------+---------+
| 1987 | 1311826 |
| 2014 | 5819811 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 1990 | 5270893 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 1991 | 5076925 |
| 1992 | 5092157 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 3422767 |
| 2002 | 4935434 |
| 2003 | 6488540 |
| 2004 | 7129270 |
+------+---------+
28 rows in set (1 min 9.2067 sec)
As you can see, WARP is about 8x faster than InnoDB when accessing only a single column in an analytics type query.

Automatic bitmap indexing

If a WHERE clause is added to queries, instead of doing full table scans, WARP will scan the involved columns and build bitmap indexes. Bitmap indexes are really cool types of indexes that answer range queries efficiently, but can also answer complex WHERE conditions, such as OR conditions against two different columns. These conditions can't be answered by traditional b-tree indexes.

Before looking at a complex query with an OR over multiple columns, first let's look at the performance of a simple one column filter.

mysql> select count(*) cnt from ontime_innodb where Year = 1998;
+---------+
| cnt     |
+---------+
| 5384721 |
+---------+
1 row in set (8 min 17.9652 sec)

mysql> select count(*) cnt from ontime_warp where Year = 1998;
+---------+
| cnt     |
+---------+
| 5384721 |
+---------+
1 row in set (8.1808 sec)

mysql> select count(*) cnt from ontime_warp where Year = 1998;
+---------+
| cnt     |
+---------+
| 5384721 |
+---------+
1 row in set (2.1808 sec)

As you can see, WARP is extremely fast compared to InnoDB. InnoDB does a full table scan. I ran the query twice on WARP because the 8 second time included the time to build the index and do the IO. The second query uses cached data and the index. Two seconds compared to eight minutes is a pretty good performance improvement, I think you will agree!

Bitmap index verses b-tree index

A covering btree index is often considered a good performance tradeoff when a table is wide. I indexed the FlightDate column of the InnoDB table and compared the performance of the InnoDB index with the bitmap index created by WARP.
mysql> select count(*) cnt from ontime_innodb where FlightDate between 19980101 and 19981231;
+---------+
| cnt     |
+---------+
| 5384721 |
+---------+
1 row in set (8.1397 sec)

> select count(*) cnt from ontime_innodb where FlightDate between 19980101 and 19981231;
+---------+
| cnt     |
+---------+
| 5384721 |
+---------+
1 row in set (1.9824 sec)

> select count(*) cnt from ontime_warp where FlightDate between 19980101 and 19981231;
+---------+
| cnt     |
+---------+
| 5384721 |
+---------+
1 row in set (9.8641 sec)

> select count(*) cnt from ontime_warp where FlightDate between 19980101 and 19981231;
+---------+
| cnt     |
+---------+
| 5384721 |
+---------+
1 row in set (2.1297 sec)

And now a full column scan (WARP) versus a full index scan (InnoDB)

 mysql> select Year(FlightDate), count(*) cnt from ontime_innodb group by 1;
+------------------+---------+
| Year(FlightDate) | cnt     |
+------------------+---------+
|             1987 | 1311827 |
|             1988 | 5202096 |
|             1989 | 5041200 |
|             1990 | 5270893 |
|             1991 | 5076925 |
|             1992 | 5092166 |
|             1993 | 5070501 |
|             1994 | 5180048 |
|             1995 | 5327435 |
|             1996 | 5351983 |
|             1997 | 5411843 |
|             1998 | 5384721 |
|             1999 | 5527884 |
|             2000 | 5683047 |
|             2001 | 3422767 |
|             2002 | 4935434 |
|             2003 | 6488540 |
|             2004 | 7129270 |
|             2005 | 7140596 |
|             2006 | 7141922 |
|             2007 | 7455458 |
|             2008 | 7009726 |
|             2009 | 6450285 |
|             2010 | 6450117 |
|             2011 | 6085281 |
|             2012 | 6096762 |
|             2013 | 6369482 |
|             2014 | 5819811 |
+------------------+---------+
28 rows in set (3 min 38.2186 sec)

> select Year(FlightDate), count(*) cnt from ontime group by 1;
+------------------+---------+
| Year(FlightDate) | cnt     |
+------------------+---------+
|             1987 | 1311826 |
|             2014 | 5819811 |
|             1988 | 5202096 |
|             1989 | 5041200 |
|             2005 | 7140596 |
|             2006 | 7141922 |
|             2007 | 7455458 |
|             2008 | 7009726 |
|             1990 | 5270893 |
|             2009 | 6450285 |
|             2010 | 6450117 |
|             2011 | 6085281 |
|             2012 | 6096762 |
|             2013 | 6369482 |
|             1991 | 5076925 |
|             1992 | 5092157 |
|             1993 | 5070501 |
|             1994 | 5180048 |
|             1995 | 5327435 |
|             1996 | 5351983 |
|             1997 | 5411843 |
|             1998 | 5384721 |
|             1999 | 5527884 |
|             2000 | 5683047 |
|             2001 | 3422767 |
|             2002 | 4935434 |
|             2003 | 6488540 |
|             2004 | 7129270 |
+------------------+---------+
28 rows in set (1 min 11.8889 sec)
Note that the InnoDB index-only scan produced a sorted result, because the index is read in sorted order. The cost of the query on WARP is essentially the same as the query that used the Year column. A full column scan on WARP is 3x faster than a index-only scan on InnoDB. This is because WARP uses sequential IO instead of random IO when traversing the data.

Bitmap indexes with OR condition

In the following queries, I have indexes OriginAirportId and DestAirportId in the InnoDB table. However, InnoDB can not efficiently answer this type of query usign two indexes. WARP on the other hand will index both columns and combine the indexes to answer the query.
mysql> select count(*) cnt from ontime_innodb where OriginAirportId = 10135 or DestAirportId = 10135;
+--------+
| cnt    |
+--------+
| 275922 |
+--------+
1 row in set (8 min 10.1118 sec)

mysql> select count(*) cnt from ontime_warp where OriginAirportId = 10135 or DestAirportId = 10135;
+--------+
| cnt    |
+--------+
| 275922 |
+--------+
1 row in set (22.0516 sec)

mysql> select count(*) cnt from ontime_warp where OriginAirportId = 10135 or DestAirportId = 10135;
+--------+
| cnt    |
+--------+
| 275922 |
+--------+
1 row in set (1.0812 sec)
In my next blog post I will use this same data set to create materialiaed views for these queries so that results are (nearly) instant!