A simple comparison for three SQL engines

I have a data from our production environment. The data is just words list, which contains 11922695 items.

I load the data into Mysql, Spark and Drill for a simple test against their query performance.

The test environment:

  • OS: Ubuntu 18.04 Linux x86_64, KVM instance
  • Hardware specs: 2 AMD cores, 4GB memory, 40GB NVME disk
  • Java: Openjdk version 11.0.13
  • Mysql: version 5.7.36, default installation, only one change by setting key_buffer_size to 256MB
  • Spark: version 3.2.0, deployed as local mode
  • Drill: version 1.20, run as drill-embedded

I load the words into the above databases first.

In spark:

scala> val list = sc.textFile("words.txt").toDF("word")
list: org.apache.spark.sql.DataFrame = [word: string]

In mysql:

mysql> create table words (word varchar(32));
Query OK, 0 rows affected (0.01 sec)

mysql> load data local infile './words.txt' into table words;
Query OK, 11922695 rows affected (29.88 sec)

In drill I need to rename the file to a csv, then query it directly like:

apache drill (dfs.pyh)> select * from `words.csv` limit 2;
+------+
| WORD |
+------+
| on   |
| jan  |
+------+
2 rows selected (0.235 seconds)

Hence I can make a simple SQL query to the three data objects. The purpose is to group the words and count their numbers.

First query in spark:

scala> list.groupBy("word").count.orderBy(desc("count")).show(20)
+----+------+                                                                   
|word| count|
+----+------+
| the|597563|
|  to|466569|
|   a|228042|
|  is|223282|
| and|215099|
|  in|176847|
|   i|174579|
| for|165258|
|  on|164483|
|  of|161266|
|this|159244|
| you|128854|
|that|126949|
|  at|126562|
|  it|117224|
|  be| 89777|
|from| 87112|
|with| 86998|
| not| 85245|
|  if| 82798|
+----+------+
only showing top 20 rows

Spark doesn’t show its query time. I got the time roughly by iOS’s Stopwatch app. I have run above queries some times, the average query time is about 3.7 second.

Second query in mysql:

mysql> select word,count(*) as dd from words group by word order by dd desc limit 20;
+------+--------+
| word | dd     |
+------+--------+
| the  | 597563 |
| to   | 466569 |
| a    | 228042 |
| is   | 223282 |
| and  | 215099 |
| in   | 176847 |
| i    | 174579 |
| for  | 165258 |
| on   | 164483 |
| of   | 161266 |
| this | 159244 |
| you  | 128854 |
| that | 126949 |
| at   | 126562 |
| it   | 117224 |
| be   |  89777 |
| from |  87112 |
| with |  86998 |
| not  |  85245 |
| if   |  82798 |
+------+--------+
20 rows in set (11.05 sec)

Then I created an index for this mysql table and query again:

mysql> create index wordIndex on words(word);
Query OK, 0 rows affected (40.21 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select word,count(*) as dd from words group by word order by dd desc limit 20;
+------+--------+
| word | dd     |
+------+--------+
| the  | 597563 |
| to   | 466569 |
| a    | 228042 |
| is   | 223282 |
| and  | 215099 |
| in   | 176847 |
| i    | 174579 |
| for  | 165258 |
| on   | 164483 |
| of   | 161266 |
| this | 159244 |
| you  | 128854 |
| that | 126949 |
| at   | 126562 |
| it   | 117224 |
| be   |  89777 |
| from |  87112 |
| with |  86998 |
| not  |  85245 |
| if   |  82798 |
+------+--------+
20 rows in set (5.10 sec)

As you see mysql has huge performance improvement after adding the index.

The last query in drill:

apache drill (dfs.pyh)> select word,count(*) as dd from `words.csv` group by word order by dd desc limit 20;
+------+--------+
| word |   dd   |
+------+--------+
| the  | 597563 |
| to   | 466569 |
| a    | 228042 |
| is   | 223282 |
| and  | 215099 |
| in   | 176847 |
| i    | 174579 |
| for  | 165258 |
| on   | 164483 |
| of   | 161266 |
| this | 159244 |
| you  | 128854 |
| that | 126949 |
| at   | 126562 |
| it   | 117224 |
| be   | 89777  |
| from | 87112  |
| with | 86998  |
| not  | 85245  |
| if   | 82798  |
+------+--------+
20 rows selected (3.507 seconds)

I have run the above queries some times, choose an average running time 3.5 second.

So the comparison of query time in three applications:

SparkMysqlDrill
Query time3.7swith index: 5.1s
without index: 11s
3.5s
Query time comparison in three SQL engines

It’s not surprised that for big data Mysql is not as efficient as the other two. Spark and Drill have the similar performance in this scenario. While Drill is much easier to use as you don’t have to know Scala programming.