Small Inversions with Raku

by Arne Sommer

Small Inversions with Raku

[31] Published 5. September 2019

Perl 6 → Raku

This article has been moved from «perl6.eu» and updated to reflect the language rename in 2019.

This is my response to the Perl Weekly Challenge #024.

Challenge #24.1

Create a smallest script in terms of size that on execution doesn’t throw any error. The script doesn’t have to do anything special. You could even come up with smallest one-liner.

A liberal interpretation of «doesn't have to do anything special» gives «doesn't have to do anything at all». Let us try with an empty file, as there certainly isn't any shorter way of doing nothing:

$ touch 0
$ perl 0
$ raku 0

The touch program creates a file with zero size, and I have named it «0» (the digit zero) in honour of the file size.

The script works (but does nothing) in both Perl 5 and Raku.

Do nothing one-liners also work:

$ perl -E ''
$ raku -e ''

So we are done? Perhaps not, as the challenge requested «the smallest script in terms of size». It can be argued (as I do now) that an empty file isn't a script. (Trying to look up the definition of «script» on wikipedia doesn't help, so this is all on me.)

We can try with a file with an empty line (just a newline):

$ echo "" > 1
$ perl 1
$ raku 1

The echo program behaves as the «say» keyword, and adds a trailing newline.

The handy «hex-dump» program (written by yours truly in Raku, included in the zip file, and presented in the «Bonus» section below) can be used to show that the file indeed has a newline and nothing else:

$ raku hex-dump 1 
0A                            | 

Note that the (LF) character is what you get on a Unix-like system (including Linux and Mac OSX). Windows uses ␍␊ (CR + LF), and Mac (before OSX) uses (CR). Feel free to be confused. Also note that Raku hides the problem, as long as we read and write files in (the default) text mode. The «hex-dump» uses binary mode, so we get the file as it is byte for byte.

If we assume that a script must have a statement (of sorts), we can try with an empty statement, a single semicolon:

echo ";" > 2
$ raku hex-dump 2
3B 0A                         | ;
$ perl 2
$ raku 2

That works as well.

The file has a trailing newline as well. We can create a file without this newline with a one-liner in Raku (using «print» instead of «say») like this:

$ raku -e 'print ";"' > 1-semi
$ raku hex-dump 1-semi
3B                            | ;
$ perl 1
$ raku 1

And again, it works.

But a semicolon isn't really a statement. It is merely a statement separator.

So let us try with a statement. Create a file «1-one» with just the single digit «1» in it (no newlines), and try:

$ raku -e 'print "1"' > 1-one
$ perl 1-one
$ raku 1-one
WARNINGS for 1-one:
Useless use of constant integer 1 in sink context (line 1)

It works in Perl 5. But Raku gives us a warning. That isn't nice, but it is better than throwing an error (as the challenge told us to avoid). But I'll disregard this one in Raku even so.

Raku has the «sink» keyword that tells the compiler to throw away the value (as in «put it in the sink»). We can try to get rid of the value that way:

$ raku -e 'print "sink 1" > 6'
$ raku 6
WARNINGS for 6:
Useless use of constant integer 1 in sink context (line 1)

See docs.raku.org/language/statement-prefixes#index-entry-sink_(statement_prefix)-sink for more information about sink.

So the script has to do something after all. Let us try with an empty string, or two quotes (and note the use of three types of quotes to make it work):

$ raku -e 'print «\"\"»' > 2-quote
$ raku hex-dump 2-quote
22 22                         | ""
$ perl 2-quote
$ raku 2-quote
WARNINGS for /home/arne/Dropbox/raku/kurs/content/code/Challenge24/2-quote:
Useless use of constant string "" in sink context (line 1)

This warning should look familiar. I'll silently forget about this one.

The shortest statement I can come up with that doesn't have a side effect (a return value) that runs in Perl 5 (but see below) and Raku (without a sink context warning) is say 1.

$ raku -e 'print "say 1"' > 5
$ raku 5 
1

It works, in Raku.

Perl 5 is interesting. Running the script fails, as «say» must be enabled manually. But when we do it on the command line magic has been applied:

$ perl -E "say 1"
1

$ perl -e "say 1"
Number found where operator expected at -e line 1, near "say 1"
	(Do you need to predeclare say?)
syntax error at -e line 1, near "say 1"
Execution of -e aborted due to compilation errors.

The documentation explains -E like this: -E «behaves just like -e, except that it implicitly enables all optional features (in the main compilation unit).»

5 characters (say 1) isn't bad, but surely we can do better?

Let us try the anonymous state variable $:

raku -e '$'
WARNINGS for -e:
Useless use of unnamed $ variable in sink context (line 1)

The same old «sink context» problem. But we can do something that changes the variable, as that will remove the sink context. Normal assignment does the trick, but requires too much typing. The «++» or «--» operators (either in- or prefix) work:

$ raku -e '$++'
$ raku -e '$--'
$ raku -e '--$'
$ raku -e '++$'

That is 3 characters. Let's try the last one as a script:

$ raku -e "print '++$'" > 3
$ raku hex-dump 3
2B 2B 24                      | ++$
$ raku 3

It works.

The file sizes are as expected:

$ ls -l
-rw-r--r-- 1 arne arne 0 sep.   2 17:08 0
-rw-r--r-- 1 arne arne 1 sep.   2 17:09 1
-rw-r--r-- 1 arne arne 1 sep.   2 18:15 1-one
-rw-r--r-- 1 arne arne 1 sep.   2 17:51 1-semi
-rw-r--r-- 1 arne arne 2 sep.   2 17:51 2
-rw-r--r-- 1 arne arne 2 sep.   2 18:05 2-quote
-rw-r--r-- 1 arne arne 5 sep.   2 18:10 3
-rw-r--r-- 1 arne arne 5 sep.   2 18:16 5
-rw-r--r-- 1 arne arne 6 sep.   2 18:09 6

Summary

An empty file may be the shortest solution, depending on the definition of «script» and «statement». If not take your pick in this table:

    Filename         Content       Perl 5         Raku         Python         Ruby         Node.js    
0 ok ok ok okok
1 ok ok ok okok
1-semi; ok ok error okok
1-one 1 ok warning ok okok
2 ; ok ok error okok
3 ++$ error ok error error error
5 say 1 error1 ok error error error
6 sink 1 error warning error error error

1 = But it works when run as a one-liner with «-E», as shown above.

I have added columns for Python, Ruby and Node.js, in case anybody are interested.

Bonus: hex-dump

Use this program to see what is actually stored in a file, byte for byte. Here it is, without comments.

constant CR  = 9229.chr; # This is the unicode "C/R" symbol
constant LF  = 9226.chr; # This is the unicode "L/F" symbol
constant BOX = 9617.chr; # This is a unicode gray box

multi sub MAIN ($file where $file.IO && $file.IO.r, :$linesize = 10)
{
  my $fh = open $file, :bin;
  say-blob($fh.read, $linesize);
  $fh.close;
}

multi sub MAIN ($string, :$linesize = 10)
{
  say-blob($string.encode("utf-8"), $linesize);
}

sub say-blob ($blob, $linesize)
{
  my $ascii = "";
  my $elems = @$blob.elems;
  my $count = 0;
  for @$blob -> $byte
  {
    $count++;
    print $byte.fmt("%02X ");
    if $byte == 10
    {
      $ascii ~= CR; 
    }
    elsif $byte == 13
    {
      $ascii ~= LF; 
    }
    else
    {
      $ascii ~= 31 < $byte < 127 ?? $byte.chr !! BOX;
    }
    if $count == $linesize
    {
      say "| $ascii"; $count = 0; $ascii = "";
    }
  }

  if $count && $count < $linesize
  {
    print "   " x $linesize - $count; # Fill the last line
    say "| $ascii";
  }
}

Using it on itself, only showing the first 9 lines:

$ raku hex-dump hex-dump
23 21 20 2F 75 73 72 2F 62 69 | #! /usr/bi
6E 2F 65 6E 76 20 72 61 6B 75 | n/env raku
0A 0A 63 6F 6E 73 74 61 6E 74 | ␍␍constant
20 43 52 20 20 3D 20 39 32 32 |  CR  = 922
39 2E 63 68 72 3B 20 23 20 54 | 9.chr; # T
68 69 73 20 69 73 20 74 68 65 | his is the
20 75 6E 69 63 6F 64 65 20 22 |  unicode "
43 2F 52 22 20 73 79 6D 62 6F | C/R" symbo
6C 0A 63 6F 6E 73 74 61 6E 74 | l␍constant
...

It can also take a text on the command line (handled by the second «multi»), like this:

$ raku hex-dump "Kåss"
4B C3 A5 73 73                | K░░ss

The whole point of a hex dump is to take one byte at time, so the «å» which uses two bytes in Unicode/UTF-8 (C3 + A5)) is shown as two gray boxes as the bytes are non-printable in the ascii encoding.

The «␍» (C/R) and «␊» (L/F) symbols are not exactly readable. The «␤» (N/L) symbol (which is obtainable as «9252.chr») is much better, but is an abstraction that we cannot use as we are interested in the distinction between C/R and L/F.

Challenge #24.2

Create a script to implement full text search functionality using Inverted Index. According to wikipedia:

«In computer science, an inverted index (also referred to as a postings file or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents (named in contrast to a forward index, which maps from documents to content). The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database.»

Here is a nice example of Inverted Index.

To keep it simple, I'll start with a version where the indexer stores the index in a file (a plain text file, a very simplistic database), and then we use a second program to parse that file (instead of building up the index) each time we do a search.

File: make-inverted-index
sub MAIN (*@files where @files.elems >= 1, :$index = "index.txt")     # [1]
{
  my %index;                                         # [2]
  for @files -> $file                                # [3]
  {
    die "No such file: $file"    unless $file.IO.e;  # [4]
    die "Unreadable file: $file" unless $file.IO.r;  # [5]

    my @words = (slurp $file).words;                 # [6]

    @words.map({ %index{$_}.{$file} = True });       # [7]
  }

  die "Unable to write index" unless my $fh = open $index, :w;        # [8]
  $fh.say("$_:{ %index{$_}.keys.join(",") }") for %index.keys.sort;   # [9]
  $fh.close;                                                          # [10]
}

[1] Specify one or more files to index. Use the «--index=xxx» (where «xxx» is the filename) command line argument, if you want to save the index in another file than the default «index.txt».

[2] We collect the index in this variable.

[3] Go through the list of files,

[4] • exit if the file doesn't exist («IO.e»).

[5] • exit if the file is unreadable («IO.r»).

[6] Read the entire file (with «slurp»), and split it into a list of words (with «words»).

[7] The first dimension in the %index hash is the word, and the second is the file. I have chosen to use «map» to assign all the words at once, instead of a loop.

[8] Exit if unable to write the index file.

[9] Print the index. Note that I have used a loop here, as «map» would feel convoluted.

[10] Close the file.

Saving the index as a text file gives us (as developers) an easy way of checking that the index is correct. For now, I'll just say that it isn't. We'll get back to why later on.

The program doing the search:

File: search-inverted-index
sub MAIN ($word, :$index = "index.txt")      # [1]
{
  for (slurp $index).lines -> $line          # [2]
  {
    my ($entry, $files) = $line.split(":");  # [3]

    if $word eq $entry                       # [4]
    {
      say "«$word» found in: $files";        # [4a]
      exit;                                  # [4b]
    }
  }
  say "Not found.";                          # [5]
}

[1] Specify the word to search for. Specify the index file with the «--index=xxx» (where «xxx» is the filename) command line argument, if the default «index.txt» is wrong.

[2] I could have used «$file.IO.lines» instead of «(slurp $file).lines» - and probably should have.

[3] Separate the (key)word from the list of files.

[4] If we have found the word, say so (4a) and exit (4b).

[5] If we get here, we didn't find the word. Say so.

You may have noticed that the search program isn't very efficient, as it reads the index file each time we do a search. It is presumably faster than indexing all the source files, depending on the number of files and size of the vocabulary.

Here is a better version:

File: search-inverted-index-loop
multi sub MAIN ($word, :$index = "index.txt")  # [1]
{
  for (slurp $index).lines -> $line
  {
    my ($entry, $files) = $line.split(":");

    if $word eq $entry
    {
      say "«$word» found in: $files";
      exit;
    }
  }
  say "Not found.";
}

multi sub MAIN (:$index = "index.txt")       # [2]
{
  my %index;
  for (slurp $index).lines -> $line          # [3]
  {
    my ($entry, $files) = $line.split(":");  # [4]
    %index{$entry} = $files;                 # [5]
  }

  loop                                       # [6]
  {
    my $word = prompt("Search for word: ");  # [7]
    
    say %index{$word}                        # [8]
      ?? "«$word» found in: { %index{$word} }"
      !! "Not found.";
  }
}

[1] I have kept the old functionality; if a string is specified on the command line, the program searches for it, and exits.

[2] If run without arguments, it:

[3] • reads the index file, doing a loop for each line.

[4] • The part before the colon is the word, and the part after it is the list of files (separated by colons, but we ignore that).

[5] • Add the key/value pair to the index.

[6] An eternal loop,

[7] • ask for a new word to search for.

[8] • present the result.

See docs.raku.org/routine/prompt for more information about prompt.

The Problem with «words»

I used «words» to split the input files into (for want of a better word) words, but it isn't really suitable as this example shows:

$ raku
> "This is it, or\nelse! No-One.".words.join("|").say
This|is|it,|or|else!|No-One.

We have to write a custom splitter, which splits on everything that isn't a (Unicode) letter, digit or hyphen. (And «split» is the megical word.)

> "This is it, or\nelse! No-One.".split(/\W+/).join("|").say
This|is|it|or|else|No|One|

\w is a character class that includes letters (alphabetic characters), underscores and digits. Writing it in upper case inverts the meaning.

This one protects the hyphen, and gets rid of the underscore:

> "This is_it, or\nelse! No-One.".split(/<+[\W] - [\-] + [_] >+/).join("|").say
This|is|it|or|else|No-One|

The fixed indexer looks like this:

File: make-inverted-index-fixed
sub MAIN (*@files where @files.elems >= 1, :$index = "index.txt")
{
  my %index;
  for @files -> $file
  {
    die "No such file: $file"    unless $file.IO.e;
    die "Unreadable file: $file" unless $file.IO.r;

    my @words = (slurp $file).split(/<+[\W] - [\-] + [_] >+/);

    @words.map({ %index{$_}.{$file} = True });
  }

  die "Unable to write index" unless my $fh = open $index, :w;
  $fh.say("$_:{ %index{$_}.keys.join(",") }") for %index.keys.sort;
  $fh.close;
}

It works as intended. But don't take my word for it; try it out yourself...

Note that it will get rid of «.» when used as end of sentence periods, but also when used as a decimal point. So e.g. «3.14» will result in the two entries «3» and «14» - which isn't right. It is possible to fix this, but I'll let it be.

Bonus: A Client/Server Application with Cro

The eternal loop lookup client is a partial (a very little part) client/server application.

We can make a real client/server application with the Cro framework. Cro is a «set of libraries and tools for building distributed systems in Raku». And a distributed system «involves multiple processes that communicate with each other.»

See cro.services for more information. Note the code in the «Set up your routes, and serve them» section on the front page. I have used that code almost unchanged in my program.

Install the «Cro» and «Cro::HTTP» modules, if you haven't done so already:

$ zef install Cro
$ zef install Cro::HTTP
File: cro-index
use Cro::HTTP::Router;
use Cro::HTTP::Server;

sub MAIN (*@files where @files.elems >= 1, :$port = 10000)  # [1]
{
  my %index;
  for @files -> $file
  {
    say "Reading file: $file (start)";
    die "No such file: $file"    unless $file.IO.e;
    die "Unreadable file: $file" unless $file.IO.r;

    my @words = (slurp $file).split(/<+[\W] - [\-] + [_] >+/);
    say "Reading file: $file (finished)";

    @words.map({ %index{$_}.{$file} = True });
  }

  say "Port: $port";                                        # [2]
  
  my $application = route                                   # [3]
  {
    get -> 'search', $search                                # [3]
    {
      content 'text/plain', "$search: { %index{$search}.keys.join(", ") }\n";
    }                                                       # [4]
  }
  
  my Cro::Service $search-app                               # [5]
    = Cro::HTTP::Server.new: :host, :$port, :$application; 

  $search-app.start;                                        # [6]

  react whenever signal(SIGINT) { $search-app.stop; exit; } # [7]
}

[1] The default port number is 10000. Use the «--port=xxxx» command line argument to change it.

[2] The lines above this are the same as before. This line is just for information.

[3] We have only one route (web address) here. It is «/search/» followed by a string, and the string is placed in the «$search» variable.

[4] The respons is a plain text document, containg the result of running the code in the double quotes. (It prints the search key, and a list of matching files.

[5] Set up the service.

[6] Start the service.

[7] Tell the service to exit when we press <Control-C>.

Start the service like this:

$ raku cro-index sample-text.txt
Reading file: sample-text.txt (start)
Reading file: sample-text.txt (finished)
Port: 10000

Query the service like a normal web page. E.g. like this:

$ curl localhost:10000/search/The
The: sample-text.txt, small-text.txt

$ curl localhost:10000/search/Them
Them: 

We can use it to search for certain words in the program files for this week's challenges:

$ raku cro-index *
Reading file: 0 (start)
Reading file: 0 (finished)
Reading file: 1 (start)
Reading file: 1 (finished)
Reading file: 1-one (start)
Reading file: 1-one (finished)
Reading file: 1-semi (start)
Reading file: 1-semi (finished)
Reading file: 2 (start)
Reading file: 2 (finished)
Reading file: 5 (start)
Reading file: 5 (finished)
Reading file: 6 (start)
Reading file: 6 (finished)
Reading file: cro-index (start)
Reading file: cro-index (finished)
Reading file: hex-dump (start)
Reading file: hex-dump (finished)
Reading file: index.txt (start)
Reading file: index.txt (finished)
Reading file: make-inverted-index (start)
Reading file: make-inverted-index (finished)
Reading file: make-inverted-index-fixed (start)
Reading file: make-inverted-index-fixed (finished)
Reading file: sample-text.txt (start)
Reading file: sample-text.txt (finished)
Reading file: search-inverted-index (start)
Reading file: search-inverted-index (finished)
Reading file: search-inverted-index-loop (start)
Reading file: search-inverted-index-loop (finished)
Reading file: small-text.txt (start)
Reading file: small-text.txt (finished)
Port: 10000
$ curl localhost:10000/search/raku
raku: search-inverted-index, hex-dump, make-inverted-index-fixed, \
  make-inverted-index, cro-index, search-inverted-index-loop

$ curl localhost:10000/search/perl
perl:

The output is in almost the same format as the index file we made earlier. A real world application should format the output better, and this is really a web application waiting to be written...

But I'll stop here, and press <Control-C>.