Time difference on handling the file larger than memory

Apache Spark can deal with big file well even if file size is much larger than the memory.

For instance, I have the file “rate.csv” which is 3.2GB in size:

$ du -h rate.csv 
3.2G	rate.csv

But my VPS has only 2GB total memory:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1992        1652         206           4         134         197
Swap:          1023         521         502

Even though Spark can read and handle this file well. For example, I load the file into spark and issue a SQL query via its Dataframe API.

// define the schema
scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
val schema: String = uid STRING,item STRING,rate FLOAT,time INT

// load csv file into spark as a dataframe
scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]

// print the schema
scala> df.printSchema()
root
 |-- uid: string (nullable = true)
 |-- item: string (nullable = true)
 |-- rate: float (nullable = true)
 |-- time: integer (nullable = true)

// get the items count
scala> df.count()
val res1: Long = 82677131   

// get the partitions number
scala> df.rdd.getNumPartitions
val res3: Int = 26

// issue a query by grouping item and aggregating rate
scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
+----------+--------+                                                           
|      item|avg_rate|
+----------+--------+
|0001061100|     5.0|
|0001543849|     5.0|
|0001061127|     5.0|
|0001019880|     5.0|
|0001062395|     5.0|
|0000143502|     5.0|
|000014357X|     5.0|
|0001527665|     5.0|
|000107461X|     5.0|
|0000191639|     5.0|
|0001127748|     5.0|
|0000791156|     5.0|
|0001203088|     5.0|
|0001053744|     5.0|
|0001360183|     5.0|
|0001042335|     5.0|
|0001374400|     5.0|
|0001046810|     5.0|
|0001380877|     5.0|
|0001050230|     5.0|
+----------+--------+
only showing top 20 rows

As you see above, the query run pretty well. Spark didn’t throw out any memory overflow error. The total query time is: 1 min 46 seconds.

Apache Drill is quite memory sensitive, even so it can make the job done under this limited memory. For example:

> select columns[1] as `item`, avg(cast(columns[2] as Float)) as `avg_rate`
> from `rate.csv` group by `item` order by `avg_rate` desc limit 20;
+------------+----------+
|    item    | avg_rate |
+------------+----------+
| 0004133900 | 5.0      |
| 0005019222 | 5.0      |
| 0002557991 | 5.0      |
| 0002214830 | 5.0      |
| 0002720213 | 5.0      |
| 0002326817 | 5.0      |
| 0002254352 | 5.0      |
| 000225316X | 5.0      |
| 0002199203 | 5.0      |
| 0001712705 | 5.0      |
| 0002617501 | 5.0      |
| 0002113848 | 5.0      |
| 0001840266 | 5.0      |
| 0002117576 | 5.0      |
| 0001360183 | 5.0      |
| 0000202010 | 5.0      |
| 0002179083 | 5.0      |
| 0001954792 | 5.0      |
| 0002005468 | 5.0      |
| 0005097231 | 5.0      |
+------------+----------+
20 rows selected (119.311 seconds)

As you see Drill run well on this scenario too. The total query time is: 1 min 59 seconds.

Then I gave Mysql a try. I know for limited memory Mysql handles this scale of data very slow. Though the VM has only 2GB memory, I had to adjust two important memory arguments for Mysql.

mysql> show variables like 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name           | Value     |
+-------------------------+-----------+
| innodb_buffer_pool_size | 402653184 |
+-------------------------+-----------+
1 row in set (0.01 sec)

mysql> show variables like 'key_buffer_size';
+-----------------+-----------+
| Variable_name   | Value     |
+-----------------+-----------+
| key_buffer_size | 134217728 |
+-----------------+-----------+
1 row in set (0.02 sec)

The tuning to Mysql includes ‘innodb_buffer_pool_size’ was set to 384MB, ‘key_buffer_size’ was set to 128MB.

After then I created the table, and loaded the data into this table by using the statements below.

mysql> create table rate(userId varchar(24), itemId varchar(24),rating float, time int);
Query OK, 0 rows affected (0.11 sec)

mysql> load data local infile 'skydrive/rate.csv' into table rate FIELDS TERMINATED BY ',';
Query OK, 82677131 rows affected (21 min 21.25 sec)
Records: 82677131  Deleted: 0  Skipped: 0  Warnings: 0

Loading the data takes time 21 min 21 sec. And I need to add an index to the queried column, in this case it’s itemId.

mysql> create index itemIndex on rate(itemId);
Query OK, 0 rows affected (9 min 22.60 sec)
Records: 0  Duplicates: 0  Warnings: 0

Finally I made the query in Mysql:

mysql> select itemId, avg(rating) as avg_rate from rate group by itemId order by avg_rate desc limit 20;
+------------+----------+
| itemId     | avg_rate |
+------------+----------+
| 0000191639 |        5 |
| 0000143529 |        5 |
| 0000143502 |        5 |
| 0000202010 |        5 |
| 0000053155 |        5 |
| 0001019880 |        5 |
| 0001018043 |        5 |
| 000014357X |        5 |
| 000077135X |        5 |
| 0001026038 |        5 |
| 0000401048 |        5 |
| 0000000078 |        5 |
| 0000230022 |        5 |
| 0000913154 |        5 |
| 0000143588 |        5 |
| 0000466603 |        5 |
| 0001024388 |        5 |
| 0001006657 |        5 |
| 0000791156 |        5 |
| 0000174076 |        5 |
+------------+----------+
20 rows in set (16 min 29.75 sec)

Mysql’s total query time is: 16 min 30 seconds.

For comparison to Mysql, I installed Postgresql in this VM. It’s the default installation from Ubuntu apt source, with the basic tuning as below.

postgres=# show shared_buffers;
 shared_buffers 
----------------
 384MB
(1 row)

postgres=# show work_mem;
 work_mem 
----------
 16MB
(1 row)

I created the same table as in Mysql and setup the index to the queried column.

bigdata=# \d rate
                       Table "public.rate"
 Column |         Type          | Collation | Nullable | Default 
--------+-----------------------+-----------+----------+---------
 userid | character varying(24) |           |          | 
 itemid | character varying(24) |           |          | 
 rating | double precision      |           |          | 
 time   | integer               |           |          | 
Indexes:
    "itemindex" btree (itemid)

Then I loaded the data from csv to the table:

bigdata=# \copy rate FROM '/home/pyh/skydrive/rate.csv' DELIMITER ',' CSV;
COPY 82677131

And the same query was issued to Postgresql. As you see the command top’s output following, Postgresql uses 3 threads in the progress of shuffling. This consumes out all the available memory.

0945 postgres  20   0 1311264 640036   3336 D  20.9 31.4   0:20.64 postgres                                                                      
30944 postgres  20   0 1335840 637304   3356 D  18.6 31.2   0:20.86 postgres                                                                      
30330 postgres  20   0 1383316 558332   3272 R  49.3 27.4   4:45.60 postgres 

But finally Postgresql finished the job as well:

bigdata=# select itemId, avg(rating) as avg_rate from rate group by itemId order by avg_rate desc limit 20;
   itemid   | avg_rate 
------------+----------
 0001006657 |        5
 0001024388 |        5
 0000913154 |        5
 0000791156 |        5
 0001019880 |        5
 0000466603 |        5
 000077135X |        5
 0000191639 |        5
 0000174076 |        5
 000014357X |        5
 0001018043 |        5
 0000202010 |        5
 0000143529 |        5
 0000230022 |        5
 0000000078 |        5
 0000401048 |        5
 0000053155 |        5
 0000143588 |        5
 0000143502 |        5
 0001026038 |        5
(20 rows)

Time: 156100.056 ms (02:36.100)

The total query time in Postgresql is: 2 min 36 seconds.

So, for this limited memory and this scale of dataset, the specific big data apps have much more efficiency than the traditional Mysql. Though Postgresql is one of the RDBMs, it’s much faster than Mysql too. Because it uses multi-threading for the shuffling progress, while the drawback is it consumes the most memory than all the three others.

All queries above were run for several times, the minimum values were taken as the final results.

The query time comparison:

Data AppSparkDrillMysqlPostgresql
Query Time106 sec119 sec990 sec156 sec

Here are the App’s versions:

// Mysql version
mysql> select version();
+-------------------------+
| version()               |
+-------------------------+
| 5.7.37-0ubuntu0.18.04.1 |
+-------------------------+
1 row in set (0.01 sec)

// Postgresql version
bigdata=# select version();
                                                               version                                                                
--------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 10.19 (Ubuntu 10.19-0ubuntu0.18.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
(1 row)

// Drill version
apache drill (dfs.skydrive)> select version from sys.version;
+---------+
| version |
+---------+
| 1.20.0  |
+---------+
1 row selected (0.641 seconds)

// Spark version
version 3.2.1        
Using Scala version 2.13.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_321)

The VM I was using for this test has 2G dedicated memory, 2 Intel cores, 50G NVME disk. It’s good to see all Apps run well under this limited memory. But Mysql is really slow comparing to the others.