Tag Archives: Scala

Regex comparison in Perl and Scala

Perl’s regex is still very fast. Its running speed is amazing. Scala’s regex can work, but it’s as 3 times slower as Perl. Just got the result from my experience.

Here is the use case. Reads from a text file which is as big as gigs. Filters the lines with regex, and splits the line into words, then filters the words with another regex. Finally prints out the words.

This is perl script:

use strict;

open HDW,">","words.txt" or die $!;
open HD,"msg.txt" or die $!;

while(<HD>) {
  next if /^[^0-9a-zA-Z\s]/;
  chomp;
  my @words = split/\s+/,$_;
  for my $w (@words) {
    $w=lc($w);
    if ($w=~/^[a-z0-9]+$/ and length($w) < 30){
       print HDW $w,"\n";
    }
  }
}

close HD;
close HDW;

This is scala script:

import scala.io.Source

val patt1 = """^[^0-9a-zA-Z\s].*$"""
val patt2 = """^[a-z0-9]+$"""

val lines = Source.fromFile("msg.txt").getLines().filter(! _.matches(patt1))

for (x <- lines) {
  x.split("""\s+""").map(_.toLowerCase).filter(_.matches(patt2)).filter(_.size < 30).foreach {println}
}

Though scala is compiled as class, its executing time is 3 times to perl.

$ scalac -Xscript SplitWords words-parse.scala 
$ time scala SplitWords > scala-words.txt 

real	0m36.858s
user	0m25.494s
sys	0m13.449s

$ time perl words-parse.pl 

real	0m12.115s
user	0m11.770s
sys	0m0.184s

And, I found a feature that, scala’s regex must be full matching, while perl’s can be part matching.

Such as this matching in scala gets false:

scala> val str = "hello word"
val str: String = hello word

scala> str.matches("^hello")
val res0: Boolean = false

But in perl it’s always true:

$ perl -le '$str ="hello word"; print "true" if $str=~ /^hello/'
true

Regardless of language features, doing the right thing with the right tool is always right.

[ Update 1 ]

Thanks to the guy on scala forum, who points out that I can compile the regex only once. Then I improved the program as below:

import scala.io.Source

val patt1 = """[^0-9a-zA-Z\s].*""".r
val patt2 = """[a-z0-9]+""".r

val lines = Source.fromFile("msg.txt").getLines()

for {
  line <- lines
  if ! patt1.matches(line)
  word <- line.split("""\s+""").map(_.toLowerCase)
  if patt2.matches(word) && word.size < 30
} {
  println(word) 
}

Re-run and it takes less 6 seconds than before, about 30 seconds to finish the job. Still much slower than perl.

Please notice: this updated program works only in scala 2.13. My Spark application requires scala 2.12, which doesn’t work as the way.

[ Update 2 ]

Scala’s regex is anchored by default. So it takes the full matching. To take a part matching as perl, could use this (in scala 2.13):

scala> val regex = """^hello""".r.unanchored
val regex: scala.util.matching.UnanchoredRegex = ^hello

scala> regex.matches("hello word")
val res0: Boolean = true

As you see, when declared as unanchored, the regex can take part matching.

programming with Spark structure streaming

In my last blog I played programming with Spark streaming, which is called DStream.

This article I program with Spark structure streaming, which is a new way for streaming handling.

What’s the difference between DStream and structure streaming? The former is based on Spark’s traditional RDD API, while the latter is based on Spark’s SQL engine.

From my thought, if you want more transforms to the original data such as data cleaning, DStream is more useful. Since RDD provides a lot of higher order functions for data transform, such as map, reduce etc.

If you want more aggregate functions, such as statistics and reports, structure streaming is more convenient. Since SQL engine is powerful for implementation of this kind of jobs, such as group, count, sort etc.

But, structure streaming is much easier to use than DStream. It’s higher optimized. You can just program with structure streaming as the way of regular spark session.

To play with the demo, first we need a socket server which prints the data continuously to remote socket. And a streaming client will receive the data from the socket.

The socket server has been given in my last blog. Please start up it in a separate terminal.

Then, I follow the official demo to get the scala code below to handle the streaming.

import org.apache.spark.sql.SparkSession

object Myjob {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: StructuredNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    val host = args(0)
    val port = args(1).toInt

    val spark = SparkSession
      .builder
      .appName("StructuredNetworkWordCount")
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .load()

    // Split the lines into words
    val words = lines.as[String].flatMap(_.split(" "))

    // Generate running word count
    val wordCounts = words.groupBy("value").count()

    // Start running the query that prints the running counts to the console
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}

How to package and deploy the app? Please see my last blog, they are almost the same.

The final step is to run the structure streaming from Spark:

$ spark-submit --class "Myjob" --master local[2] target/scala-2.12/my-strucutre-streaming-job_2.12-1.0.jar localhost 9999

We can see the continuous output in the screen for words counting:

-------------------------------------------
Batch: 190
-------------------------------------------
+-------+-----+
|  value|count|
+-------+-----+
| orange|  360|
|  apple|  372|
|  mango|  380|
| tomato|  362|
|apricot|  357|
| cherry|  366|
| banana|  328|
|  lemon|  341|
|   plum|  365|
|  peach|  372|
+-------+-----+


-------------------------------------------
Batch: 191
-------------------------------------------
+-------+-----+
|  value|count|
+-------+-----+
| orange|  362|
|  apple|  378|
|  mango|  382|
| tomato|  363|
|apricot|  358|
| cherry|  368|
| banana|  332|
|  lemon|  342|
|   plum|  367|
|  peach|  377|
+-------+-----+

-------------------------------------------
Batch: 192
-------------------------------------------
+-------+-----+
|  value|count|
+-------+-----+
| orange|  363|
|  apple|  381|
|  mango|  386|
| tomato|  364|
|apricot|  361|
| cherry|  374|
| banana|  333|
|  lemon|  344|
|   plum|  369|
|  peach|  378|
+-------+-----+

The application environment:

  • OS: ubuntu 18.04 x86_64, a KVM instance
  • Ram: 4GB dedicated
  • CPU: double AMD 7302 processor
  • Spark: 3.2.0
  • Scala: 2.12.15

programming with Spark streaming

Spark can read streaming from many sources, including file system, object storage, socket, message queue etc. Here I run the app to receive streaming from a socket server, handle the streaming and print out the results.

First I have to prepare a socket server. I wrote one with Perl, the content is following:

package MyPackage;
 
use strict;
use base qw(Net::Server::PreFork);
 
MyPackage->run(host=>'localhost',port=>9999);
 
# over-ride the default echo handler
sub process_request {
    my $self = shift;
    my @fruits=qw(apple orange tomato plum cherry peach apricot banana mango lemon);

    while(1) {
        my $fruit = $fruits[int(rand(scalar @fruits))];
        print "$fruit\r\n" or return;
        select(undef, undef, undef, 0.25);
    }
}
 
1;

Please open a terminal and run this script. This script run as a socket server, listening on port 9999 for accepting connections. Once the connect has been established, it prints the content to the remote socket, one line one word.

Here is the scala program who reads the streaming from socket and handles it.

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.storage.StorageLevel


object Myjob {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    // Create the context with a 15 second batch size
    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(15))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (e.g. generated by 'nc')
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))

    // Convert RDDs of the words DStream to DataFrame and run SQL query
    words.foreachRDD { (rdd: RDD[String], time: Time) =>

      // Get the singleton instance of SparkSession
      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._

      // Convert RDD[String] to RDD[case class] to DataFrame
      val wordsDataFrame = rdd.map(w => Record(w)).toDF()

      // Creates a temporary view using the DataFrame
      wordsDataFrame.createOrReplaceTempView("words")

      // Do word count on table using SQL and print it
      val wordCountsDataFrame =
        spark.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCountsDataFrame.show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}


/** Case class for converting RDD to DataFrame */
case class Record(word: String)


/** Lazily instantiated singleton instance of SparkSession */
object SparkSessionSingleton {

  @transient  private var instance: SparkSession = _

  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession
        .builder
        .config(sparkConf)
        .getOrCreate()
    }
    instance
  }
}

You can see the source code from spark’s official sample.

How to deploy it? First we need Apache spark running on localhost.

Then we start a new project by creating a project root dir. In the project dir, the dir tree looks as this:

./build.sbt
./src/main/scala/Myjob.scala

“build.sbt” is the configuration file for project building, whose content is as following:

name := "My Streaming Job"

version := "1.0"

scalaVersion := "2.12.15"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.0"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "3.2.0" % "provided"

Here “name” specifies the project name. “version” specifies the project version. “scalaVersion” specifies scala’s version. please notice for spark 3.2.x we can use scala 2.12.x only. “libraryDependencies” is the project’s dependencies packages.

In the project dir, we run the following command to build the project:

$ sbt package
[info] welcome to sbt 1.6.1 (Ubuntu Java 11.0.11)
[info] loading project definition from /home/pyh/ops/spark/job6/project
[info] loading settings for project job6 from build.sbt ...
[info] set current project to My Streaming Job (in build file:/home/pyh/ops/spark/job6/)
[success] Total time: 3 s, completed Feb 7, 2022, 3:56:56 PM

After the building you will find the project dir becomes:

$ ls
build.sbt  project  src  target

There are new dir “project” and “target” added. And the compiled class is located at:

target/scala-2.12/my-streaming-job_2.12-1.0.jar

Finally we run the project by submitting the job to local spark server:

$ spark-submit --class "Myjob" --master local[2] target/scala-2.12/my-streaming-job_2.12-1.0.jar localhost 9999

Since I have only 2 cores in the VM, here I specify local[2] to use max 2 threads.

The results are printed out to the screen periodically:

========= 1644221235000 ms =========
+-------+-----+
|   word|total|
+-------+-----+
| banana|    6|
|  lemon|    5|
|  mango|    9|
|  apple|    6|
|   plum|    5|
| tomato|   12|
| cherry|    8|
|  peach|    2|
|apricot|    5|
| orange|    2|
+-------+-----+

========= 1644221250000 ms =========
+-------+-----+
|   word|total|
+-------+-----+
|  peach|    9|
|   plum|    8|
| banana|    7|
|  apple|    5|
|apricot|    6|
| tomato|    5|
| cherry|    6|
|  lemon|    4|
|  mango|    8|
| orange|    2|
+-------+-----+

========= 1644221265000 ms =========
+-------+-----+
|   word|total|
+-------+-----+
|  apple|    2|
|  mango|    4|
|  lemon|    4|
|  peach|    9|
| orange|    6|
|apricot|    9|
| cherry|   10|
| banana|    4|
|   plum|    6|
| tomato|    6|
+-------+-----+

Until now all the job has been done.

The application environment:

  • OS: ubuntu 18.04 x86_64, a KVM instance
  • Ram: 4GB dedicated
  • CPU: double AMD 7302 processor
  • Spark: 3.2.0
  • Scala: 2.12.15

computing performance comparison for words statistics

Here I provided a dataset for words appearing in the public tech lists.

The file is a text file, one line with one word. The total lines and size:

$ wc -l words.txt 
283218160 words.txt

$ du -h words.txt 
1.5G	words.txt

You can download this file from here (tgz compressed).

I just did a test to count the words in this file by grouping and sorting, with three methods: spark RDD API, spark dataframe API, and the scala program.

This is the syntax of spark RDD API in pyspark:

>>> rdd = sc.textFile("/tmp/words.txt")
>>> rdd.map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).sortBy(lambda x:x[1],ascending=False).take(20)

This is the syntax of spark dataframe API in pyspark:

>>> df = spark.read.text("/tmp/words.txt")
>>> df.select("*").groupBy("value").count().orderBy("count",ascending=False).show()

This is the scala program:

import scala.io.Source

object BigWords extends App {

  val file = Source.fromFile("/tmp/words.txt").getLines()
  val hash = scala.collection.mutable.Map[String,Int]()

  for (x <- file) {
    if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
  }

  hash.toList.sortBy(-_._2).take(20).foreach(println)
}

I compiled the scala program and run it:

$ scalac bigwords.scala 
$ time scala BigWords
(the,14218320)
(to,11045040)
(a,5677600)
(and,5205760)
(is,4972080)
(i,4447440)
(in,4228200)
(of,3982280)
(on,3899320)
(for,3760800)
(this,3684640)
(you,3485360)
(at,3238480)
(that,3230920)
(it,2925320)
(with,2181160)
(be,2172400)
(not,2124320)
(from,2097120)
(if,1993560)

real	0m49.031s
user	0m50.390s
sys	0m0.734s

As you see it takes 49s to finish the job.

While spark’s dataframe API is a bit slower, it takes 56s to finish the job (timer from my iOS stopwatch app):

+-----+--------+                                                                
|value|   count|
+-----+--------+
|  the|14218320|
|   to|11045040|
|    a| 5677600|
|  and| 5205760|
|   is| 4972080|
|    i| 4447440|
|   in| 4228200|
|   of| 3982280|
|   on| 3899320|
|  for| 3760800|
| this| 3684640|
|  you| 3485360|
|   at| 3238480|
| that| 3230920|
|   it| 2925320|
| with| 2181160|
|   be| 2172400|
|  not| 2124320|
| from| 2097120|
|   if| 1993560|
+-----+--------+
only showing top 20 rows

But, spark RDD API is quite slow in my use case. It takes 7m15s to finish the job:

[('the', 14218320), ('to', 11045040), ('a', 5677600), ('and', 5205760), ('is', 4972080), ('i', 4447440), ('in', 4228200), ('of', 3982280), ('on', 3899320), ('for', 3760800), ('this', 3684640), ('you', 3485360), ('at', 3238480), ('that', 3230920), ('it', 2925320), ('with', 2181160), ('be', 2172400), ('not', 2124320), ('from', 2097120), ('if', 1993560)]

I doubt it’s due to the slow python parser, so I re-run the RDD API with spark’s scala shell. The syntax and results as below:

scala> val rdd = sc.textFile("/tmp/words.txt")

scala> rdd.map{(_,1)}.reduceByKey{_ + _}.sortBy{-_._2}.take(20)

res1: Array[(String, Int)] = Array((the,14218320), (to,11045040), (a,5677600), (and,5205760), (is,4972080), (i,4447440), (in,4228200), (of,3982280), (on,3899320), (for,3760800), (this,3684640), (you,3485360), (at,3238480), (that,3230920), (it,2925320), (with,2181160), (be,2172400), (not,2124320), (from,2097120), (if,1993560))

This time it takes 1m31s to finish the job.

The results summary in a table:

programtime
scala program49s
pyspark dataframe56s
scala RDD1m31s
pyspark RDD7m15s

I am so surprised pyspark’s RDD API is too slow as this. I want to give a research on the case.

Application environment:

  • OS: Ubuntu 18.04.6 LTS, a KVM instance
  • Spark version: 3.2.0 (with scala 2.12.15, and python 3.6.9)
  • Scala version: 2.13.7
  • CPU: double AMD EPYC 7302
  • Ram: 4GB dedicated
  • Disk: 40GB SSD