computing performance comparison for words statistics

Here I provided a dataset for words appearing in the public tech lists.

The file is a text file, one line with one word. The total lines and size:

$ wc -l words.txt 
283218160 words.txt

$ du -h words.txt 
1.5G	words.txt

You can download this file from here (tgz compressed).

I just did a test to count the words in this file by grouping and sorting, with three methods: spark RDD API, spark dataframe API, and the scala program.

This is the syntax of spark RDD API in pyspark:

>>> rdd = sc.textFile("/tmp/words.txt")
>>> x:(x,1)).reduceByKey(lambda x,y:x+y).sortBy(lambda x:x[1],ascending=False).take(20)

This is the syntax of spark dataframe API in pyspark:

>>> df ="/tmp/words.txt")

This is the scala program:


object BigWords extends App {

  val file = Source.fromFile("/tmp/words.txt").getLines()
  val hash = scala.collection.mutable.Map[String,Int]()

  for (x <- file) {
    if (hash.contains(x)) hash(x) += 1 else hash(x) = 1


I compiled the scala program and run it:

$ scalac bigwords.scala 
$ time scala BigWords

real	0m49.031s
user	0m50.390s
sys	0m0.734s

As you see it takes 49s to finish the job.

While spark’s dataframe API is a bit slower, it takes 56s to finish the job (timer from my iOS stopwatch app):

|value|   count|
|  the|14218320|
|   to|11045040|
|    a| 5677600|
|  and| 5205760|
|   is| 4972080|
|    i| 4447440|
|   in| 4228200|
|   of| 3982280|
|   on| 3899320|
|  for| 3760800|
| this| 3684640|
|  you| 3485360|
|   at| 3238480|
| that| 3230920|
|   it| 2925320|
| with| 2181160|
|   be| 2172400|
|  not| 2124320|
| from| 2097120|
|   if| 1993560|
only showing top 20 rows

But, spark RDD API is quite slow in my use case. It takes 7m15s to finish the job:

[('the', 14218320), ('to', 11045040), ('a', 5677600), ('and', 5205760), ('is', 4972080), ('i', 4447440), ('in', 4228200), ('of', 3982280), ('on', 3899320), ('for', 3760800), ('this', 3684640), ('you', 3485360), ('at', 3238480), ('that', 3230920), ('it', 2925320), ('with', 2181160), ('be', 2172400), ('not', 2124320), ('from', 2097120), ('if', 1993560)]

I doubt it’s due to the slow python parser, so I re-run the RDD API with spark’s scala shell. The syntax and results as below:

scala> val rdd = sc.textFile("/tmp/words.txt")

scala>{(_,1)}.reduceByKey{_ + _}.sortBy{-_._2}.take(20)

res1: Array[(String, Int)] = Array((the,14218320), (to,11045040), (a,5677600), (and,5205760), (is,4972080), (i,4447440), (in,4228200), (of,3982280), (on,3899320), (for,3760800), (this,3684640), (you,3485360), (at,3238480), (that,3230920), (it,2925320), (with,2181160), (be,2172400), (not,2124320), (from,2097120), (if,1993560))

This time it takes 1m31s to finish the job.

The results summary in a table:

scala program49s
pyspark dataframe56s
scala RDD1m31s
pyspark RDD7m15s

I am so surprised pyspark’s RDD API is too slow as this. I want to give a research on the case.

Application environment:

  • OS: Ubuntu 18.04.6 LTS, a KVM instance
  • Spark version: 3.2.0 (with scala 2.12.15, and python 3.6.9)
  • Scala version: 2.13.7
  • CPU: double AMD EPYC 7302
  • Ram: 4GB dedicated
  • Disk: 40GB SSD