Usage


WARNING!!!

This library includes both Ruby and C versions of StringScanner. Since the two classes are completely different, please read this whole page before using them.

Purpose of StringScanner

StringScanner is a Ruby extension for fast scanning.

Since Ruby's Regexp class cannot perform sub-string matches, scanning a sub-string requires first making a new String. For example, this tries to match a word starting at index 1:

p " I_want_to_match_this_word but can't".index( /\A\w+/, 1 )

This code displays "nil", because \A anchors to the beginning of the whole string, not to the offset passed to 'index'. Another way to match it is like this:

str = " word word word"
while str.size > 0 do
  if /\A[ \t]+/ === str then
    str = $'
  elsif /\A\w+/ === str then
    str = $'
  end
end

But this method has a big performance problem: $' makes a new string EVERY time. So, in the above example, all these strings are created:

" word word word"
"word word word"
" word word"
"word word"
" word"
"word"
""

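This results in a heavy load. Each token here is about 2.5 bytes long on average, so a 'str' of n bytes yields about n / 2.5 intermediate strings averaging n / 2 bytes each. If the length of 'str' is 50KB, nearly 50KB ** 2 / 5 = 500MB of string data is created in total.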
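The estimate can be checked with a small sketch. The method name bytes_created_by_post_match is hypothetical and only for illustration. It counts the logical size of every string that $' returns; a modern Ruby interpreter may share buffers between strings, so real memory use can be lower than this count.

```ruby
# Sum the sizes of every string that $' produces while consuming `str`.
# Hypothetical helper, not part of strscan.
def bytes_created_by_post_match(str)
  total = 0
  until str.empty?
    if /\A[ \t]+/ === str || /\A\w+/ === str
      str = $'               # a new String holding the rest of the input
      total += str.bytesize
    else
      # Unlike the loop above, stop instead of spinning on unknown input.
      raise ArgumentError, "cannot scan #{str[0].inspect}"
    end
  end
  total
end

puts bytes_created_by_post_match(" word word word")  # 14 + 10 + 9 + 5 + 4 + 0 = 42

# Closed form for " word" repeated r times: 5 * r**2 - r.
r = 10_000                   # 50,000 bytes of input
puts 5 * r**2 - r            # 499990000, roughly 500MB
```

The count grows with the square of the input length, which is why the approach gets expensive quickly on large inputs.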

StringScanner resolves this problem.
StringScanner has a C string and a pointer to it. When scanning, StringScanner will only increment the pointer, so no new strings are created. As a result, speed will increase and memory usage will decrease.
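As a minimal sketch of the difference, the following repeats the earlier 'index' attempt and then does the same match with a scanner positioned at index 1. It assumes the strscan library bundled with current Ruby, where the scan pointer is exposed as 'pos'; the version described on this page may name things differently.

```ruby
require 'strscan'

str = " I_want_to_match_this_word but can't"
p str.index(/\A\w+/, 1)      # nil: \A refers to the start of the whole string

s = StringScanner.new(str)
s.pos = 1                    # start at index 1; no sub-string is created for this
word = s.scan(/\w+/)         # scan always matches at the scan pointer
p word                       # "I_want_to_match_this_word"
p s.pos                      # 26
```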

Simple examples and methods

Here are two short examples of scanning routines.
The first one is easy to write but performs quite poorly. The second is still easy to write, but is FAST thanks to the code in the StringScanner class.

First example:

ATOM = /\A\w+/
SPACE = /\A[ \t]+/

while str.size > 0 do
  if ATOM === str then
    str = $'
    return $&
  elsif SPACE === str then
    str = $'
    return $&
  end
end

Second example:

ATOM = /\A\w+/
SPACE = /\A[ \t]+/

s = StringScanner.new( str )
while s.rest? do
  if tmp = s.scan( ATOM ) then
    return tmp
  elsif tmp = s.scan( SPACE ) then
    return tmp
  end
end
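The second example is a fragment of a larger method. A self-contained variant might look like the sketch below. The method name tokenize is hypothetical, the regular expressions drop \A because 'scan' already matches only at the scan pointer, and the calls used ('eos?', 'peek', 'pos') are assumed from the strscan bundled with current Ruby.

```ruby
require 'strscan'

# Collect every word in `str`, skipping spaces and tabs.
# Hypothetical helper built on the second example.
def tokenize(str)
  atom  = /\w+/
  space = /[ \t]+/
  s = StringScanner.new(str)
  tokens = []
  until s.eos?
    if (tmp = s.scan(atom))
      tokens << tmp
    elsif s.skip(space)
      next                   # whitespace carries no token
    else
      # Neither pattern matched: fail loudly instead of looping forever.
      raise ArgumentError, "unexpected #{s.peek(1).inspect} at index #{s.pos}"
    end
  end
  tokens
end

p tokenize(" word word word")   # ["word", "word", "word"]
```

The explicit else branch is a design choice: without it, input containing any other character would leave the scan pointer where it is and the loop would never finish.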

The usage of StringScanner is simple.
First, create a StringScanner object. Next, call the 'scan' method. It returns the matched string and at the same time advances its internally maintained "scan pointer", which is implemented as a pointer to char (char*).
The 'skip' method is similar to 'scan', but returns the length of the matched string instead of the string itself.

s = StringScanner.new( "abcdefg" )   # scan pointer is on 'a', index 0
puts s.scan( /a/ )        # returns 'a'. scan pointer is on 'b', index 1
puts s.skip( /bc/ )       # returns 2. scan pointer is on 'd', index 3

After calling 'scan' or 'skip', the previous "scan pointer" is preserved in the StringScanner object. So, str[ prev pointer...current pointer ] (end exclusive) is the "matched string" (the string returned from 'scan') -- we can get it by calling the 'matched' method. Here's an example:

puts s.matched            # returns 'bc'. scan pointer doesn't move
puts s.scan( /a/ )        # returns nil. again, scan pointer doesn't move.
puts s.matched            # returns 'bc'.

It is also possible to put the scan pointer back to its previous position. This can be accomplished by using the 'unscan' method. However, 'unscan' can only undo one 'scan' because the StringScanner object can only preserve one "previous pointer" at a time.

puts s.scan( /de/ )       # returns 'de'. scan pointer is on 'f', index 5
s.unscan                  # scan pointer is on 'd', index 3
puts s.scan( /def/ )      # returns 'def'. scan pointer is on 'g', index 6
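Because 'unscan' undoes only one 'scan', backing out of several scans needs a different approach. One possible sketch is to save the scan pointer yourself and restore it later. This assumes the 'pos' and 'pos=' accessors of the strscan bundled with current Ruby; they are not described on this page.

```ruby
require 'strscan'

s = StringScanner.new("abcdefg")
saved = s.pos                # remember index 0 before a multi-step attempt
s.scan(/ab/)                 # scan pointer moves to index 2
s.scan(/cd/)                 # scan pointer moves to index 4
s.pos = saved                # rewind past both scans; unscan would undo only /cd/
p s.scan(/abcdef/)           # "abcdef"
p s.pos                      # 6
```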

For more details, see the reference manual. But of course the source code is the most important documentation, I think :-)

Ruby version of strscan

The Ruby version of StringScanner (StringScanner_R) resembles the C version, but has these requirements:

This is troublesome, but there's no resolution to this problem.

If you only want to use the C version, simply put this in your code:

StringScanner.must_C_version

Copyright (c) 1999-2001 Minero Aoki <aamine@loveruby.net>