Tag Archives: Reviews

A simple comparison for three SQL engines

I have a data from our production environment. The data is just words list, which contains 11922695 items.

I load the data into Mysql, Spark and Drill for a simple test against their query performance.

The test environment:

  • OS: Ubuntu 18.04 Linux x86_64, KVM instance
  • Hardware specs: 2 AMD cores, 4GB memory, 40GB NVME disk
  • Java: Openjdk version 11.0.13
  • Mysql: version 5.7.36, default installation, only one change by setting key_buffer_size to 256MB
  • Spark: version 3.2.0, deployed as local mode
  • Drill: version 1.20, run as drill-embedded

I load the words into the above databases first.

In spark:

scala> val list = sc.textFile("words.txt").toDF("word")
list: org.apache.spark.sql.DataFrame = [word: string]

In mysql:

mysql> create table words (word varchar(32));
Query OK, 0 rows affected (0.01 sec)

mysql> load data local infile './words.txt' into table words;
Query OK, 11922695 rows affected (29.88 sec)

In drill I need to rename the file to a csv, then query it directly like:

apache drill (dfs.pyh)> select * from `words.csv` limit 2;
+------+
| WORD |
+------+
| on   |
| jan  |
+------+
2 rows selected (0.235 seconds)

Hence I can make a simple SQL query to the three data objects. The purpose is to group the words and count their numbers.

First query in spark:

scala> list.groupBy("word").count.orderBy(desc("count")).show(20)
+----+------+                                                                   
|word| count|
+----+------+
| the|597563|
|  to|466569|
|   a|228042|
|  is|223282|
| and|215099|
|  in|176847|
|   i|174579|
| for|165258|
|  on|164483|
|  of|161266|
|this|159244|
| you|128854|
|that|126949|
|  at|126562|
|  it|117224|
|  be| 89777|
|from| 87112|
|with| 86998|
| not| 85245|
|  if| 82798|
+----+------+
only showing top 20 rows

Spark doesn’t show its query time. I got the time roughly by iOS’s Stopwatch app. I have run above queries some times, the average query time is about 3.7 second.

Second query in mysql:

mysql> select word,count(*) as dd from words group by word order by dd desc limit 20;
+------+--------+
| word | dd     |
+------+--------+
| the  | 597563 |
| to   | 466569 |
| a    | 228042 |
| is   | 223282 |
| and  | 215099 |
| in   | 176847 |
| i    | 174579 |
| for  | 165258 |
| on   | 164483 |
| of   | 161266 |
| this | 159244 |
| you  | 128854 |
| that | 126949 |
| at   | 126562 |
| it   | 117224 |
| be   |  89777 |
| from |  87112 |
| with |  86998 |
| not  |  85245 |
| if   |  82798 |
+------+--------+
20 rows in set (11.05 sec)

Then I created an index for this mysql table and query again:

mysql> create index wordIndex on words(word);
Query OK, 0 rows affected (40.21 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select word,count(*) as dd from words group by word order by dd desc limit 20;
+------+--------+
| word | dd     |
+------+--------+
| the  | 597563 |
| to   | 466569 |
| a    | 228042 |
| is   | 223282 |
| and  | 215099 |
| in   | 176847 |
| i    | 174579 |
| for  | 165258 |
| on   | 164483 |
| of   | 161266 |
| this | 159244 |
| you  | 128854 |
| that | 126949 |
| at   | 126562 |
| it   | 117224 |
| be   |  89777 |
| from |  87112 |
| with |  86998 |
| not  |  85245 |
| if   |  82798 |
+------+--------+
20 rows in set (5.10 sec)

As you see mysql has huge performance improvement after adding the index.

The last query in drill:

apache drill (dfs.pyh)> select word,count(*) as dd from `words.csv` group by word order by dd desc limit 20;
+------+--------+
| word |   dd   |
+------+--------+
| the  | 597563 |
| to   | 466569 |
| a    | 228042 |
| is   | 223282 |
| and  | 215099 |
| in   | 176847 |
| i    | 174579 |
| for  | 165258 |
| on   | 164483 |
| of   | 161266 |
| this | 159244 |
| you  | 128854 |
| that | 126949 |
| at   | 126562 |
| it   | 117224 |
| be   | 89777  |
| from | 87112  |
| with | 86998  |
| not  | 85245  |
| if   | 82798  |
+------+--------+
20 rows selected (3.507 seconds)

I have run the above queries some times, choose an average running time 3.5 second.

So the comparison of query time in three applications:

SparkMysqlDrill
Query time3.7swith index: 5.1s
without index: 11s
3.5s
Query time comparison in three SQL engines

It’s not surprised that for big data Mysql is not as efficient as the other two. Spark and Drill have the similar performance in this scenario. While Drill is much easier to use as you don’t have to know Scala programming.

Yandex disk API access is very slow

I bought Yandex Mail 360 service, which gives me another 100GB disk space.

But, the access to Yandex disk via API is very slow. I tried it with two ways from one of my VPS.

  • access with yandex disk plugin of rclone, whose endpoint is “cloud-api.yandex.com
  • access with webdav protocol, whose endpoint is “webdav.yandex.ru

Unfortunately, both ways failed to work.

I uploaded a file of 245MB to Yandex disk with the methods above, they can’t finish the jobs in about 1+ hour.

The network connection from my VPS to their service endpoints has no problem. As you see below:

$ ping webdav.yandex.ru -c 10 
PING webdav.yandex.ru (213.180.204.148) 56(84) bytes of data. 
64 bytes from disk-webdav.stable.qloud-b.yandex.net (213.180.204.148):  icmp_seq=1 ttl=247 time=40.9 ms 
64 bytes from disk-webdav.stable.qloud-b.yandex.net (213.180.204.148):  icmp_seq=2 ttl=247 time=40.9 ms 
64 bytes from disk-webdav.stable.qloud-b.yandex.net (213.180.204.148):  icmp_seq=3 ttl=247 time=40.9 ms 
64 bytes from disk-webdav.stable.qloud-b.yandex.net (213.180.204.148):  icmp_seq=4 ttl=247 time=40.9 ms 
64 bytes from disk-webdav.stable.qloud-b.yandex.net (213.180.204.148):  icmp_seq=5 ttl=247 time=41.0 ms 
64 bytes from disk-webdav.stable.qloud-b.yandex.net (213.180.204.148):  icmp_seq=6 ttl=247 time=40.9 ms 
64 bytes from disk-webdav.stable.qloud-b.yandex.net (213.180.204.148):  icmp_seq=7 ttl=247 time=41.0 ms 
64 bytes from disk-webdav.stable.qloud-b.yandex.net (213.180.204.148):  icmp_seq=8 ttl=247 time=40.9 ms 
64 bytes from disk-webdav.stable.qloud-b.yandex.net (213.180.204.148):  icmp_seq=9 ttl=247 time=40.9 ms 
64 bytes from disk-webdav.stable.qloud-b.yandex.net (213.180.204.148):  icmp_seq=10 ttl=247 time=40.9 ms 
 
--- webdav.yandex.ru ping statistics --- 
10 packets transmitted, 10 received, 0% packet loss, time 9008ms 
rtt min/avg/max/mdev = 40.964/40.988/41.044/0.182 ms 
$ ping cloud-api.yandex.com -c 10 
PING api.disk.yandex.net (213.180.204.127) 56(84) bytes of data. 
64 bytes from cloud-api.yandex.net (213.180.204.127): icmp_seq=1 ttl=54  time=40.6 ms 
64 bytes from cloud-api.yandex.net (213.180.204.127): icmp_seq=2 ttl=54  time=40.7 ms 
64 bytes from cloud-api.yandex.net (213.180.204.127): icmp_seq=3 ttl=54  time=40.6 ms 
64 bytes from cloud-api.yandex.net (213.180.204.127): icmp_seq=4 ttl=54  time=40.7 ms 
64 bytes from cloud-api.yandex.net (213.180.204.127): icmp_seq=5 ttl=54  time=40.7 ms 
64 bytes from cloud-api.yandex.net (213.180.204.127): icmp_seq=6 ttl=54  time=40.7 ms 
64 bytes from cloud-api.yandex.net (213.180.204.127): icmp_seq=7 ttl=54  time=40.6 ms 
64 bytes from cloud-api.yandex.net (213.180.204.127): icmp_seq=8 ttl=54  time=40.7 ms 
64 bytes from cloud-api.yandex.net (213.180.204.127): icmp_seq=9 ttl=54  time=40.7 ms 
64 bytes from cloud-api.yandex.net (213.180.204.127): icmp_seq=10 ttl=54  time=40.7 ms 
 
--- api.disk.yandex.net ping statistics --- 
10 packets transmitted, 10 received, 0% packet loss, time 9015ms 
rtt min/avg/max/mdev = 40.671/40.723/40.776/0.204 ms 

When uploading, the command ‘vnstat’ shows the bandwidth usage for transfer out is only 139.28 kbit/s. This really sucks.

eth0  /  traffic statistics 
 
                           rx         |       tx 
--------------------------------------+------------------ 
  bytes                    31.21 MiB  |       49.41 MiB 
--------------------------------------+------------------ 
          max             537 kbit/s  |    19.34 Mbit/s 
      average           87.96 kbit/s  |   139.28 kbit/s 
          min               3 kbit/s  |        0 kbit/s 
--------------------------------------+------------------ 
  packets                     525460  |           10556 
--------------------------------------+------------------ 
          max                924 p/s  |         121 p/s 
      average                176 p/s  |           3 p/s 
          min                  5 p/s  |           0 p/s 
--------------------------------------+------------------ 
  time                 49.60 minutes 

Then I tried their official client for Linux, the help page is here. The result is surprised.

Yandex disk’s linux client works very fast. I uploaded a file about 1GB, it got finished in several minutes.

So in conclusion:

  • Don’t use API to access Yandex disk, neither rclone nor webdav. Yandex may have bandwidth limits on it.
  • Use their official clients, for instance, Mac, Windows and Linux’s.

Benchmark for VPS of Virmach, Hosthatch, Naranja

These three providers: Virmach, Hosthatch, and Naranja, are active on LET. They have got good reputation due to their good products or service etc.

I have VPS on all of them. So I took a benchmark with yabs.sh for the comparison of their performance, networking, disk IO etc.

This is the result of Virmach:

Mon Mar  1 08:51:38 HKT 2021
 
 Basic System Information:
 ---------------------------------
 Processor  : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
 CPU cores  : 2 @ 2199.998 MHz
 AES-NI     : ✔ Enabled
 VM-x/AMD-V : ❌ Disabled
 RAM        : 1.2 GiB
 Swap       : 256.0 MiB
 Disk       : 19.4 GiB
 
 fio Disk Speed Tests (Mixed R/W 50/50):
 ---------------------------------
 Block Size | 4k            (IOPS) | 64k           (IOPS)
   ------   | ---            ----  | ----           ---- 
 Read       | 42.03 MB/s   (10.5k) | 485.07 MB/s   (7.5k)
 Write      | 42.09 MB/s   (10.5k) | 487.62 MB/s   (7.6k)
 Total      | 84.12 MB/s   (21.0k) | 972.70 MB/s  (15.1k)
            |                      |                     
 Block Size | 512k          (IOPS) | 1m            (IOPS)
   ------   | ---            ----  | ----           ---- 
 Read       | 621.31 MB/s   (1.2k) | 332.98 MB/s    (325)
 Write      | 654.32 MB/s   (1.2k) | 355.16 MB/s    (346)
 Total      | 1.27 GB/s     (2.4k) | 688.15 MB/s    (671)
 
 iperf3 Network Speed Tests (IPv4):
 ---------------------------------
 Provider        | Location (Link)           | Send Speed      | Recv Speed     
                 |                           |                 |                
 Clouvider       | London, UK (10G)          | 488 Mbits/sec   | 182 Mbits/sec  
 Online.net      | Paris, FR (10G)           | 543 Mbits/sec   | 216 Mbits/sec  
 WorldStream     | The Netherlands (10G)     | 719 Mbits/sec   | 342 Mbits/sec  
 Biznet          | Jakarta, Indonesia (1G)   | busy            | busy           
 Clouvider       | NYC, NY, US (10G)         | 679 Mbits/sec   | 653 Mbits/sec  
 Velocity Online | Tallahassee, FL, US (10G) | 530 Mbits/sec   | 327 Mbits/sec  
 Clouvider       | Los Angeles, CA, US (10G) | 551 Mbits/sec   | 383 Mbits/sec  
 Iveloz Telecom  | Sao Paulo, BR (2G)        | 556 Mbits/sec   | 206 Mbits/sec  
 

 Geekbench 5 test failed. Run manually to determine cause.

As you see, Virmach’s disk IO and network speed are common, not bad and also not good. But their CPU is very worse. This is a VPS with double cores, it even can’t finish geekbench’s test.

Then, this is the result of Hosthatch:

Mon Mar  1 09:05:59 HKT 2021 

 Basic System Information:
 ---------------------------------
 Processor  : Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
 CPU cores  : 1 @ 2999.998 MHz
 AES-NI     : ✔ Enabled
 VM-x/AMD-V : ❌ Disabled
 RAM        : 1.9 GiB
 Swap       : 64.0 MiB
 Disk       : 19.6 GiB
 
 fio Disk Speed Tests (Mixed R/W 50/50):
 ---------------------------------
 Block Size | 4k            (IOPS) | 64k           (IOPS)
   ------   | ---            ----  | ----           ---- 
 Read       | 121.41 MB/s  (30.3k) | 1.26 GB/s    (19.7k)
 Write      | 121.73 MB/s  (30.4k) | 1.26 GB/s    (19.8k)
 Total      | 243.14 MB/s  (60.7k) | 2.53 GB/s    (39.5k)
            |                      |                     
 Block Size | 512k          (IOPS) | 1m            (IOPS)
   ------   | ---            ----  | ----           ---- 
 Read       | 2.20 GB/s     (4.3k) | 2.32 GB/s     (2.2k)
 Write      | 2.32 GB/s     (4.5k) | 2.48 GB/s     (2.4k)
 Total      | 4.52 GB/s     (8.8k) | 4.80 GB/s     (4.6k)
 
 iperf3 Network Speed Tests (IPv4):
 ---------------------------------
 Provider        | Location (Link)           | Send Speed      | Recv Speed     
                 |                           |                 |                
 Clouvider       | London, UK (10G)          | 466 Mbits/sec   | 6.24 Mbits/sec 
 Online.net      | Paris, FR (10G)           | 615 Mbits/sec   | 311 Mbits/sec  
 WorldStream     | The Netherlands (10G)     | 608 Mbits/sec   | 241 Mbits/sec  
 Biznet          | Jakarta, Indonesia (1G)   | busy            | busy           
 Clouvider       | NYC, NY, US (10G)         | 598 Mbits/sec   | 383 Mbits/sec  
 Velocity Online | Tallahassee, FL, US (10G) | 494 Mbits/sec   | 405 Mbits/sec  
 Clouvider       | Los Angeles, CA, US (10G) | 626 Mbits/sec   | 476 Mbits/sec  
 Iveloz Telecom  | Sao Paulo, BR (2G)        | busy            | busy           
 
 Running GB5 benchmark test... *cue elevator music*
 Geekbench 5 Benchmark Test:
 ---------------------------------
 Test            | Value                         
                 |                               
 Single Core     | 574                           
 Multi Core      | 578                           
 Full Test       | https://browser.geekbench.com/v5/cpu/6738467 

Hosthatch is well known for its storage VPS, their disk IO is always good. As you see above, the IO test gets a good result. Networking and CPU seem common.

The last part is for naranja:

Mon Mar  1 09:11:19 HKT 2021
 
 Basic System Information:
 ---------------------------------
 Processor  : AMD EPYC 7302 16-Core Processor
 CPU cores  : 1 @ 2999.998 MHz
 AES-NI     : ✔ Enabled
 VM-x/AMD-V : ❌ Disabled
 RAM        : 985.0 MiB
 Swap       : 1024.0 MiB
 Disk       : 19.7 GiB
 
 fio Disk Speed Tests (Mixed R/W 50/50):
 ---------------------------------
 Block Size | 4k            (IOPS) | 64k           (IOPS)
   ------   | ---            ----  | ----           ---- 
 Read       | 192.76 MB/s  (48.1k) | 816.09 MB/s  (12.7k)
 Write      | 193.27 MB/s  (48.3k) | 820.38 MB/s  (12.8k)
 Total      | 386.03 MB/s  (96.5k) | 1.63 GB/s    (25.5k)
            |                      |                     
 Block Size | 512k          (IOPS) | 1m            (IOPS)
   ------   | ---            ----  | ----           ---- 
 Read       | 1.06 GB/s     (2.0k) | 1.15 GB/s     (1.1k)
 Write      | 1.11 GB/s     (2.1k) | 1.23 GB/s     (1.2k)
 Total      | 2.17 GB/s     (4.2k) | 2.39 GB/s     (2.3k)
 
 iperf3 Network Speed Tests (IPv4):
 ---------------------------------
 Provider        | Location (Link)           | Send Speed      | Recv Speed     
                 |                           |                 |                
 Clouvider       | London, UK (10G)          | 920 Mbits/sec   | 606 Mbits/sec  
 Online.net      | Paris, FR (10G)           | 923 Mbits/sec   | 374 Mbits/sec  
 WorldStream     | The Netherlands (10G)     | 906 Mbits/sec   | 932 Mbits/sec  
 Biznet          | Jakarta, Indonesia (1G)   | busy            | busy           
 Clouvider       | NYC, NY, US (10G)         | 830 Mbits/sec   | 68.2 Mbits/sec 
 Velocity Online | Tallahassee, FL, US (10G) | 831 Mbits/sec   | 139 Mbits/sec  
 Clouvider       | Los Angeles, CA, US (10G) | 801 Mbits/sec   | 151 Mbits/sec  
 Iveloz Telecom  | Sao Paulo, BR (2G)        | 739 Mbits/sec   | 84.0 Mbits/sec 
 
 Running GB5 benchmark test... *cue elevator music*
 Geekbench 5 Benchmark Test:
 ---------------------------------
 Test            | Value                         
                 |                               
 Single Core     | 934                           
 Multi Core      | 926                           
 Full Test       | https://browser.geekbench.com/v5/cpu/6738468 

Naranja is using AMD CPU, as you see geekbench for a single core can even get 934 scores. This is far better than Virmach and Hosthatch. Besides CPU performance, their disk IO and network speed are also not bad.

So my simple summary:

When you need a VPS for study purpose, Virmach is good, their price is cheap for the scale. That test configuration needs only 13 USD/year.

When you need a storage VPS with good disk IO, Hosthatch is suitable. Their storage plan especially the BF ones, are cheap and good.

When you need the CPU power as well as not bad IO/networking, Naranja is great choice. And their price is cheap, my test configuration needs only 9.x EUR/year.

Vodafone Germany email migrated to their own platform

I had a vodafone.de email for long days. In the past days, Vodafone Germany hosted their email on Open Xchange, which brought me the worse experience.

Open Xchange is an open source email hosting solution. It has rich features include email, cloud, contact, calendar, push notify, docs etc. It has got a lot of customers during its years of operations. For example, Namecheap, IONOS, Virgilio.it are using their services.

But I don’t like this platform. It is developed by PHP language. It’s slow and not solid. Somethings I can’t open the page, and new email flushing is delayed.

I think many customers at Vodafone.de had the same feelings as me. While I am glad to see recently Vodafone.de has migrated their email service to their own platform, which is the same one as their global platform, such as Vodafone UK.

Their own platform is written by ASP language. It’s clean and fast, and more solid than the OX one. It has the main features of a modern online service, including email, cloud, calendar, contact, note etc.

Comparing to the old OX platform, I much like this new one. I wish vodafone.de gets better and better in their new days.