
Benchmark for Scala, Ruby and Perl

This benchmark may not mean much, but I would like to give a simple comparison of the run speed of Scala, Ruby and Perl.

The results first: for this job, Perl is the fastest at 1.9s, Ruby is second at 3.0s, and the Scala script is the slowest at 4.0s.

The two input files used by the scripts can be downloaded from here:

words.txt.tgz (11MB)

stopwords.txt.tgz (4KB)

Here is the Scala script:

import scala.io.Source

val li = Source.fromFile("words.txt").getLines()
val set_sw = Source.fromFile("stopwords.txt").getLines().toSet
val hash = scala.collection.mutable.Map[String,Int]()

for (x <- li) {
    if ( ! set_sw.contains(x) ) {
      if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
    }
}

val sorted = hash.toList.sortBy(-_._2)
sorted.take(20).foreach {println}

Here is the Ruby script:

stopwords = {}
File.open("stopwords.txt").each_line do |s|
  s.strip!
  stopwords[s] =1
end

count = {}
File.open("words.txt").each_line do |s|
  s.strip!
  if ! stopwords.has_key?(s)
    if count.has_key?(s) 
       count[s] += 1
    else
       count[s] = 1
    end
  end
end
      
z = count.sort {|a1,a2| a2[1]<=>a1[1]}
z.take(20).each do |s| puts "#{s[0]} -> #{s[1]}" end

Here is the Perl script:

use strict;

my %stopwords;

open HD,"stopwords.txt" or die $!;
while(<HD>) {
    chomp;
    $stopwords{$_} =1;
}
close HD;

my %count;

open HD,"words.txt" or die $!;
while(<HD>) {
    chomp;
    unless ( $stopwords{$_} ) {
        $count{$_} ++;
    }
}
close HD;

my $i=0;
for (sort {$count{$b} <=> $count{$a}} keys %count) {
    if ($i < 20) {
        print "$_ -> $count{$_}\n"
    } else {
       last; 
    }
    $i ++;
}

The basic idea of the scripts above is the same. The only difference is that the Scala script keeps the stop words in a Set, while the Perl and Ruby scripts keep them in a Hash.
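
For comparison, the Ruby version could also keep the stop words in a Set, mirroring the Scala script. The following is only a sketch: the method name `top_words` and the in-memory sample are made up for illustration, and this variant was not part of the timings in this post.

```ruby
require "set"

# Count the words in `lines` that are not stop words and return the
# `top` most frequent ones as [word, count] pairs.
def top_words(lines, stop_lines, top = 20)
  stopwords = stop_lines.map(&:strip).to_set   # Set instead of Hash
  counts = Hash.new(0)                         # missing keys start at 0
  lines.each do |line|
    word = line.strip
    counts[word] += 1 unless stopwords.include?(word)
  end
  counts.sort_by { |_, n| -n }.first(top)
end

# Tiny in-memory sample standing in for words.txt and stopwords.txt
sample = ["send\n", "the\n", "list\n", "send\n", "the\n"]
top_words(sample, ["the\n"]).each { |word, n| puts "#{word} -> #{n}" }
```

With the real files, the call would be `top_words(File.foreach("words.txt"), File.foreach("stopwords.txt"))`.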

And this is Scala’s run result:

$ time scala scala-set.sc 
(send,20987)
(message,17516)
(unsubscribe,15541)
(2021,15221)
(list,13017)
(mailing,12402)
(mail,11647)
(file,11133)
(flink,10114)
(email,9919)
(pm,9248)
(group,8865)
(problem,8853)
(code,8659)
(data,8657)
(2020,8398)
(received,8246)
(google,7921)
(discussion,7920)
(jan,7893)

real	0m4.096s
user	0m6.725s
sys	0m0.187s

This is Ruby’s run result:

$ time ruby ruby-hash.rb 
send -> 20987
message -> 17516
unsubscribe -> 15541
2021 -> 15221
list -> 13017
mailing -> 12402
mail -> 11647
file -> 11133
flink -> 10114
email -> 9919
pm -> 9248
group -> 8865
problem -> 8853
code -> 8659
data -> 8657
2020 -> 8398
received -> 8246
google -> 7921
discussion -> 7920
jan -> 7893

real	0m3.062s
user	0m3.028s
sys	0m0.032s

Finally, this is Perl’s run result:

$ time perl perl-hash.pl 
send -> 20987
message -> 17516
unsubscribe -> 15541
2021 -> 15221
list -> 13017
mailing -> 12402
mail -> 11647
file -> 11133
flink -> 10114
email -> 9919
pm -> 9248
group -> 8865
problem -> 8853
code -> 8659
data -> 8657
2020 -> 8398
received -> 8246
google -> 7921
discussion -> 7920
jan -> 7893

real	0m1.924s
user	0m1.893s
sys	0m0.029s

I ran each of the three scripts many times, and the results were similar each time.

Versions of the languages:
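
Repeated runs can also be collected by a small harness instead of by hand. This is a sketch of one possible way, not the procedure used for the numbers in this post; the helper names `time_runs` and `summarize` are invented here.

```ruby
require "benchmark"

# Run the given block `n` times and return the wall-clock seconds of each run.
def time_runs(n)
  Array.new(n) { Benchmark.realtime { yield } }
end

# Summarise timings as minimum, median (upper middle for even counts) and maximum.
def summarize(times)
  sorted = times.sort
  { min: sorted.first, median: sorted[sorted.size / 2], max: sorted.last }
end

# Example: time a trivial in-process workload five times.
times = time_runs(5) { (1..100_000).reduce(:+) }
puts summarize(times).map { |k, v| format("%s=%.4fs", k, v) }.join(" ")
```

To time one of the scripts instead, the block could be something like `{ system("perl", "perl-hash.pl", out: File::NULL) }`.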

$ ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux-gnu]

$ perl -v
This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-linux-gnu-thread-multi
(with 71 registered patches, see perl -V for more detail)

Copyright 1987-2017, Larry Wall

$ scala -version
Scala code runner version 2.13.7 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc.

The OS is Ubuntu 18.04 on a KVM VPS. The hardware is 4 GB RAM, a 40 GB SSD, and dual AMD 7302 processors.

I am surprised that Perl is this fast among the three languages. I may not have written the best Ruby or Scala program for performance, but this simple test still suggests that Perl has a big performance advantage on common text parsing jobs.

[updated 2022-01-29] Below is the updated content:

After I compiled the Scala script, the running time became much shorter. So I think the Scala script above was slow because the script runner, which has to parse and compile the script on every run, starts up too slowly.

The Scala script changed to this:

import scala.io.Source

object CountWords {
  def main(args: Array[String]):Unit = {

    val li = Source.fromFile("words.txt").getLines()
    val stopwords = Source.fromFile("stopwords.txt").getLines().toSet
    val hash = scala.collection.mutable.Map[String,Int]()

    for (x <- li) {
        if ( ! stopwords.contains(x) ) {
            if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
        }
    }

    hash.toList
     .sortBy(-_._2)
     .take(20)
     .foreach {println}
  }
}

And compiled it with:

$ scalac CountWords.scala 

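Here is how its running speed compares with Perl's. The word counts differ from the first run, so the input data has evidently changed in the meantime: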

$ time scala CountWords
(send,21919)
(message,19347)
(unsubscribe,16617)
(2021,15344)
(list,14271)
(mailing,13098)
(file,12537)
(mail,12122)
(jan,12070)
(email,10701)
(flink,10249)
(pm,9940)
(code,9562)
(group,9547)
(problem,9536)
(data,9373)
(2022,8932)
(received,8760)
(return,8566)
(discussion,8441)

real	0m2.107s
user	0m2.979s
sys	0m0.142s

$ time perl perl-hash.pl 
send -> 21919
message -> 19347
unsubscribe -> 16617
2021 -> 15344
list -> 14271
mailing -> 13098
file -> 12537
mail -> 12122
jan -> 12070
email -> 10701
flink -> 10249
pm -> 9940
code -> 9562
group -> 9547
problem -> 9536
data -> 9373
2022 -> 8932
received -> 8760
return -> 8566
discussion -> 8441

real	0m2.418s
user	0m2.380s
sys	0m0.036s

Now Perl runs in 2.4s while compiled Scala runs in 2.1s, so the latter is faster.

For this simple comparison, the final order of running speed, fastest first, is:

compiled scala > perl > ruby > scala script

How to auto backup the wordpress site

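Backups are important when you run a website. I back up this blog, which is powered by WordPress, automatically whenever its content gets updated.

This is the Perl script to run from crontab. It checks the database for a post ID newer than the one recorded in /tmp/last-id.txt and, if it finds one, runs a backup. The file /tmp/last-id.txt must already exist before the first run (containing 0, for example), because the script dies if it cannot open it.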

#!/usr/bin/perl
 use strict;
 use MySQL::mycrud;
  
 my $db = MySQL::mycrud->new('my_user','127.0.0.1',3306,'my_database','my_passwd');
 my ($last_id) = $db->get_row("select ID from wp_posts order by ID desc limit 1");
 $db->disconnect;
 
 open HD,"/tmp/last-id.txt" or die $!;
 my $record_id = <HD>;
 close HD;

 chomp $record_id;

 if ($last_id > $record_id) {
     system "/path/to/backup.sh";  # run the bash backup script
     open HDW,">","/tmp/last-id.txt" or die $!;
     print HDW $last_id;
     close HDW;
 }  

And this is the bash script called by the Perl script above. It makes a full backup of the WordPress site, including the site files and the database.

#!/bin/bash

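 # work in /tmp; stop if we cannot get there
 cd /tmp || exit 1
 DATE=$(date +%Y-%m-%d)
 DIR="mysite.$DATE"

 mkdir -p "$DIR"

 # copy the site files from webdir
 sudo cp -rf /var/www/mysite/ "$DIR/"

 # dump database
 sudo mysqldump -uroot my_database > "$DIR/my_database.sql"
 sudo chown -R your_user_id "$DIR"

 # pack the backup and remove the working copy
 tar zcf "$DIR.tgz" "$DIR/"
 rm -rf "$DIR"

 # upload the archive to the rclone remote named "dropbox"
 rclone copy "$DIR.tgz" dropbox:webbackup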

You should change the script to match your use case: directory name, database name, user ID and so on. I upload the backup file to Dropbox via rclone; you may want to use another method.

Perl binary search function

Recently I needed to implement a binary search in Perl, and ended up with this code:

 use strict;
 use warnings;
 
 my $want = shift;
 die "$0 number" unless defined $want;
 
 my @list= (3,5,7,11,13,17,19);

 my $pos=bin_search(\@list,$want);
 print "Position: ", defined $pos ? $pos : "undef","\n";

 # binary search
 sub bin_search {
     my $array = shift;
     my $find  = shift;
 
     my ($l,$r)=(0,$#$array);
 
     while ($l<=$r) {
         my $m=int(($l+$r)/2);
 
         if ($find<$array->[$m]) {
             $r=$m-1;
         } elsif ($find>$array->[$m]) {
             $l=$m+1;
         } else {
             return $m;
         }
     }
     return undef;
 } 

It’s smart. Run it this way:

$ perl binarysearch.pl 11
 Position: 3
$ perl binarysearch.pl 13
 Position: 4
$ perl binarysearch.pl 15
 Position: undef 

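On a sorted array it's much faster than looping through every element to find a match: each comparison halves the remaining range, so it needs about log2(n) comparisons instead of up to n.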
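
To make the number of comparisons visible, here is the same algorithm sketched in Ruby with a step counter added. The counter and the return shape (an index plus a step count) are additions for illustration and are not part of the Perl code above.

```ruby
# Binary search over a sorted array.
# Returns [index_or_nil, number_of_elements_examined].
def bin_search(array, find)
  l = 0
  r = array.size - 1
  steps = 0
  while l <= r
    m = (l + r) / 2
    steps += 1
    if find < array[m]
      r = m - 1          # continue in the left half
    elsif find > array[m]
      l = m + 1          # continue in the right half
    else
      return [m, steps]  # found
    end
  end
  [nil, steps]           # not found
end

list = [3, 5, 7, 11, 13, 17, 19]
[11, 13, 15].each do |want|
  pos, steps = bin_search(list, want)
  puts "#{want}: position #{pos.inspect}, #{steps} step(s)"
end
```

For these seven elements no search examines more than three of them, while a linear scan may need all seven.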

Batch checking the existence of gmail accounts

When you try to register a Gmail account, you will most probably find that all the usernames you want have already been taken.

So a good tool for batch checking available mailboxes becomes attractive.

Here is the method I use myself. It’s a simple Perl script:

#!/usr/bin/perl 

 use strict;
 use Gmail::Mailbox::Validate;
 
 my $username = shift || die "$0 username\n";
  
 my $v = Gmail::Mailbox::Validate->new();
 print "$username mailbox exists\n" if $v->validate($username); 

Given a username, this script tells you whether that mailbox exists at Google.

For example:

$ ./gmbox wesley9807
 wesley9807 mailbox exists

$ ./gmbox wesley98076
 

The first one tells you the username “wesley9807” exists. The second one returns nothing, so that username may not be registered, and you may have the chance to register “wesley98076”.

Please note: when the command returns nothing, it means the username has no mailbox at Google. It still does not mean you can take the username.

For example, Google seems to reserve some good usernames, which have no mailboxes but cannot be registered. Also, a Google user may delete his/her mailbox but keep other Google services running (Google Drive etc.), so you cannot register that username either.

Anyway, with this method you can check a lot of usernames quickly. There is no need to try them one by one on Google’s registration page.

How do you install the required Perl module? Just use the cpanm tool. For example:

$ sudo cpanm Gmail::Mailbox::Validate
 Gmail::Mailbox::Validate is up to date. (0.01) 

Lastly, do not abuse it, otherwise Google may block your IP or network.
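
To check a whole list of candidates, the script can be wrapped in a small loop. The sketch below assumes the Perl script above is saved as `./gmbox` in the current directory; the method name `possibly_free`, the `checker` parameter and the one-second default delay are choices made for this example only.

```ruby
require "open3"

# Default checker: run ./gmbox (the Perl script above, assumed to be in the
# current directory) and return whatever it prints.
GMBOX = ->(username) { Open3.capture2("./gmbox", username).first }

# Return the usernames for which the checker printed nothing, i.e. the
# candidates that may still be free. `delay` seconds are slept between
# checks so the lookups are not fired off in a burst.
def possibly_free(usernames, checker: GMBOX, delay: 1)
  usernames.select.with_index do |name, i|
    sleep(delay) if i > 0
    checker.call(name).strip.empty?
  end
end

# Usage, with a hypothetical candidates.txt holding one username per line:
#   puts possibly_free(File.readlines("candidates.txt", chomp: true))
```

The checker is passed in as a parameter so that the loop can be tried out with a stub before any real lookups are sent.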

How to perform a rDNS lookup

rDNS (reverse DNS) is important for identifying an IP address. Some internet services, for instance sending email from an IP, need rDNS to be set up correctly.

Here I show how to perform an rDNS lookup on Linux. There are two simple ways.

The first way uses the dig command. The full command is “dig -x IP”, as below:

 $ dig -x 23.95.246.240


 ; <<>> DiG 9.10.3-P4-Ubuntu <<>> -x 23.95.246.240
 ;; global options: +cmd
 ;; Got answer:
 ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11872
 ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
 

 ;; OPT PSEUDOSECTION:
 ; EDNS: version: 0, flags:; udp: 512
 ;; QUESTION SECTION:
 ;240.246.95.23.in-addr.arpa. IN PTR
 

 ;; ANSWER SECTION:
 240.246.95.23.in-addr.arpa. 3599 IN PTR 23-95-246-240-host.colocrossing.com.
 

 ;; Query time: 178 msec
 ;; SERVER: 8.8.8.8#53(8.8.8.8)
 ;; WHEN: Tue Feb 23 10:31:28 HKT 2021
 ;; MSG SIZE  rcvd: 104 

In the “ANSWER SECTION” you will see the PTR record; the value following it is the IP’s rDNS.

The second way uses the curl command to query the rDNS of your own host’s IP. The full command is “curl -sL hostname.cloudcache.net”, as below:
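
The name in the “QUESTION SECTION” is just the IP with its four parts reversed, followed by in-addr.arpa. The sketch below shows that mapping with Ruby’s standard library; the helper names `ptr_name` and `rdns` are made up here, and `rdns` needs a working resolver, so it is only defined and not called.

```ruby
require "ipaddr"
require "resolv"

# Name that is queried for a reverse lookup, as seen in the dig output.
def ptr_name(ip)
  IPAddr.new(ip).reverse
end

# Reverse lookup through the system resolver; nil when no name is found.
def rdns(ip)
  Resolv.getname(ip)
rescue Resolv::ResolvError
  nil
end

puts ptr_name("23.95.246.240")   # 240.246.95.23.in-addr.arpa
```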

 $ curl -sL hostname.cloudcache.net
 Your IP: 23.95.246.240, Hostname: 23-95-246-240-host.colocrossing.com. 

As you see, the “Hostname:” part is the rDNS value for your host’s IP.