Regex comparison in Perl and Scala

Perl’s regex engine is still very fast. Scala’s regex works too, but in my test it was about 3 times slower than Perl. This result comes from my own experience with a single workload.

Here is the use case: read a text file that is several gigabytes in size, filter the lines with a regex, split each remaining line into words, filter the words with another regex, and finally print the words.

This is the Perl script:

use strict;

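# Output file for the accepted words, input file with the raw text.
open HDW,">","words.txt" or die $!;
open HD,"msg.txt" or die $!;

while(<HD>) {
  # Skip lines that start with anything other than a letter, digit, or whitespace.
  next if /^[^0-9a-zA-Z\s]/;
  chomp;
  my @words = split/\s+/,$_;
  for my $w (@words) {
    $w=lc($w);
    # Keep only purely alphanumeric words shorter than 30 characters.
    if ($w=~/^[a-z0-9]+$/ and length($w) < 30){
       print HDW $w,"\n";
    }
  }
}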

close HD;
close HDW;

This is the Scala script:

import scala.io.Source

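// Lines that start with anything other than a letter, digit, or whitespace are skipped.
val patt1 = """^[^0-9a-zA-Z\s].*$"""
// A word must consist only of lowercase letters and digits.
val patt2 = """^[a-z0-9]+$"""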

val lines = Source.fromFile("msg.txt").getLines().filter(! _.matches(patt1))

for (x <- lines) {
  x.split("""\s+""").map(_.toLowerCase).filter(_.matches(patt2)).filter(_.size < 30).foreach {println}
}

Although the Scala script is compiled to a class file, its running time is about 3 times that of Perl.

$ scalac -Xscript SplitWords words-parse.scala 
$ time scala SplitWords > scala-words.txt 

real	0m36.858s
user	0m25.494s
sys	0m13.449s

$ time perl words-parse.pl 

real	0m12.115s
user	0m11.770s
sys	0m0.184s

I also found that a match in Scala must cover the whole string, while a match in Perl can cover just a part of it.

For example, this match returns false in Scala:

scala> val str = "hello word"
val str: String = hello word

scala> str.matches("^hello")
val res0: Boolean = false

But the same match in Perl is true:

$ perl -le '$str ="hello word"; print "true" if $str=~ /^hello/'
true
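For comparison, here is a minimal sketch of how the Perl behaviour could be reproduced in Scala without changing the pattern. It relies on findFirstIn from scala.util.matching.Regex, which searches for the pattern anywhere in the string. The object name PartialMatchDemo is only illustrative.

```scala
object PartialMatchDemo {
  def main(args: Array[String]): Unit = {
    val str = "hello word"

    // String.matches is the Java method: the whole string has to match.
    println(str.matches("^hello"))    // false
    println(str.matches("^hello.*"))  // true, the pattern now covers the whole string

    // findFirstIn looks for the pattern anywhere in the string, like Perl's =~.
    val prefix = "^hello".r
    println(prefix.findFirstIn(str).isDefined)  // true
  }
}
```

Either extending the pattern with .* or switching to findFirstIn gives the partial-match result; which one reads better depends on the surrounding code.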

Regardless of language features, picking the right tool for the job is what matters.

[ Update 1 ]

Thanks to the person on the Scala forum who pointed out that I can compile each regex only once. I improved the program as follows:

import scala.io.Source

val patt1 = """[^0-9a-zA-Z\s].*""".r
val patt2 = """[a-z0-9]+""".r

val lines = Source.fromFile("msg.txt").getLines()

for {
  line <- lines
  if ! patt1.matches(line)
  word <- line.split("""\s+""").map(_.toLowerCase)
  if patt2.matches(word) && word.size < 30
} {
  println(word) 
}

Re-running it takes about 6 seconds less than before, roughly 30 seconds in total. That is still much slower than Perl.

Please note: this updated program works only in Scala 2.13, because it calls matches on a Regex. My Spark application requires Scala 2.12, where that call does not work.
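For Scala 2.12, one possible workaround is to go through the underlying java.util.regex.Pattern, which Regex exposes as pattern. The sketch below is untimed and only meant to show the idea; the names SplitWords212, skipLine, goodWord and keepWords are hypothetical, and msg.txt is the same input file as above.

```scala
import scala.io.Source

object SplitWords212 {
  // Compile each regex once; .pattern is the underlying java.util.regex.Pattern.
  private val skipLine = """[^0-9a-zA-Z\s].*""".r.pattern
  private val goodWord = """[a-z0-9]+""".r.pattern

  // Hypothetical helper: returns the accepted words of one line.
  // Matcher.matches() is a full match, so no ^ or $ is needed.
  def keepWords(line: String): Seq[String] =
    if (skipLine.matcher(line).matches()) Seq.empty
    else line.split("""\s+""").toSeq
      .map(_.toLowerCase)
      .filter(w => goodWord.matcher(w).matches() && w.length < 30)

  def main(args: Array[String]): Unit = {
    val src = Source.fromFile("msg.txt")
    try src.getLines().flatMap(keepWords).foreach(println)
    finally src.close()
  }
}
```

Because Matcher.matches() is part of the Java standard library, this form should compile under both 2.12 and 2.13.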

[ Update 2 ]

Scala’s Regex is anchored by default, so it requires a full match. To get a partial match as in Perl, you can use unanchored (in Scala 2.13):

scala> val regex = """^hello""".r.unanchored
val regex: scala.util.matching.UnanchoredRegex = ^hello

scala> regex.matches("hello word")
val res0: Boolean = true

As you can see, when declared as unanchored, the regex accepts a partial match.
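The same distinction shows up when a Regex is used as an extractor in a pattern match. The following is a small sketch, with AnchoredDemo and the value names chosen only for illustration, that puts the anchored and unanchored forms side by side:

```scala
object AnchoredDemo {
  def main(args: Array[String]): Unit = {
    val anchored   = """hello""".r             // default: must match the whole input
    val unanchored = """hello""".r.unanchored  // may match any part of the input

    val str = "hello word"

    // A pattern without capture groups is matched with empty parentheses.
    val full = str match {
      case anchored() => true
      case _          => false
    }
    val part = str match {
      case unanchored() => true
      case _            => false
    }

    println(full)  // false
    println(part)  // true
  }
}
```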