"zero-copy" parsing of in-memory data

NB: This is loosely based on techniques described in chapter 8 of
'Higher Order Perl', a book well worth having, provided one is capable
of restricting oneself to reading about the concepts instead of getting
distracted by the typically grotty code examples.

perl maintains a 'current matching position' (=> perldoc -f pos) when
matching strings against regexes. Normally, the position is reset to 0
after each match but this can be avoided by adding the 'g' (keep
position after a successful match) and 'c' (don't reset position after a
failed match) flags. Further, a \G assertion is available which anchors
the match at the current position even when trying to match a different
regex. As also explained in 'perldoc perlop', this can be used to take a
string apart piece by piece ('lexical analysis').
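A minimal sketch of such a piece-by-piece lexer, to show the /gc and \G interplay in isolation. The routine name tokenize and the token labels are made up for this example; it is not taken from perlop or from the book.

```perl
use strict;
use warnings;

# Split an arithmetic expression into tokens by walking pos() forward.
# tokenize() and the NUM:/OP: labels are invented for this illustration.
sub tokenize
{
    my ($text) = @_;
    my @tokens;

    for ($text) {       # alias $_ to $text so the bare matches apply to it
        {
            /\G\s+/gc && redo;                                  # skip blanks
            /\G(\d+)/gc && do { push(@tokens, "NUM:$1"); redo; };
            /\G([-+*\/()])/gc && do { push(@tokens, "OP:$1"); redo; };
            /\G\z/gc && last;                                   # end of input
            die(sprintf("unexpected character at offset %d\n", pos() // 0));
        }
    }

    return @tokens;
}

print(join(' ', tokenize('12 + 3*(4 - 1)')), "\n");
```

Every failed alternative leaves pos() where it was because of the 'c' flag, so the next regex gets a chance at the same position.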

A match operator not bound to a specific variable defaults to matching against
$_ which is a dynamically scoped symbol/variable ('global').
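A small illustration of both points, assuming a made-up helper first_word: the match inside it has no explicit target, so it sees whatever $_ happens to be bound to at call time.

```perl
use strict;
use warnings;

# first_word() is a hypothetical helper; its match operator is not bound
# to a variable, so it matches against the current $_.
sub first_word
{
    return /(\w+)/ ? $1 : undef;
}

$_ = 'outer text';
my $in_block;
{
    local $_ = 'inner text';    # dynamically scoped: visible in called subs
    $in_block = first_word();
}
my $after = first_word();       # the previous value of $_ is back

print("$in_block $after\n");
```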

Subroutine arguments are always passed by reference, eg,
invoking a subroutine with $a as its first argument makes the actual
scalar $a (not a copy of it) available as first argument of the
subroutine, ie, as $_[0].
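This is easy to check by writing to $_[0]. The subroutine name below is hypothetical and only serves the demonstration.

```perl
use strict;
use warnings;

# upcase_in_place() is a made-up name used only for this illustration.
sub upcase_in_place
{
    $_[0] = uc($_[0]);          # writes through to the caller's scalar
}

my $a = 'json';
upcase_in_place($a);
print("$a\n");                  # the caller's $a was changed
```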

It's possible to bind 'an entity' to a 'package-global' name by assigning
a reference to the entity to the corresponding glob. Such an assignment
can use 'local' to restrict visibility of the binding to 'subroutines
called from within the current block' (and subroutines called from such
subroutines). The value previously bound to the name is saved and
automatically restored once the binding goes out of scope. A nice
side-effect of that is that it can be used to make an argument to a
function accessible by name without copying its value, as the more
conventional

    my $bla = $_[0];

would do.
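A sketch of the glob assignment with an ordinary package variable. The names $text, show_length and with_alias are assumptions made for this example. Comparing the two references shows that no copy was made.

```perl
use strict;
use warnings;

our $text;      # package variable; the name is made up for this example

sub show_length
{
    return length($text);       # sees whatever $text is currently bound to
}

sub with_alias
{
    local *text = \$_[0];       # bind $text to the caller's scalar, no copy
    my $same = (\$text == \$_[0]);  # same address means same scalar
    return ($same, show_length());
}

my $big = 'x' x 1000;
my ($same, $len) = with_alias($big);
print('same scalar: ', ($same ? 'yes' : 'no'), ", length $len\n");
print('binding restored: ', (defined($text) ? 'no' : 'yes'), "\n");
```

Once with_alias returns, $text is unbound again because the 'local' went out of scope.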

Pulling this together, it's possible to start a top-level parsing
routine like this

sub parse_json
{
    local *_ = \$_[0];

which will make the first argument accessible as $_ without copying the
value and then descend into a recursive descent parser operating on $_.
Example routine (starting position is on the leading "):

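sub parse_string()
{
    my $s;

    {
        # a " at the current position terminates the string
        /\G"/gc && last;

        # run of ordinary characters
        /\G([^"\\]+)/gc && do {
            $s .= decode('UTF-8', $1);
            redo;
        };

        # escaped quote, slash or backslash
        /\G\\(["\/\\])/gc && do {
            $s .= $1;
            redo;
        };

        # single-letter escapes
        /\G\\([bfnrt])/gc && do {
            $s .= $json_escapes{$1};
            redo;
        };

        /\G\\u/gc && do {
            $s .= parse_u_escapes();
            redo;
        };

        parser_error('weird shit in string');
    }

    p_alloc('string \'%s\'', $s);
    return $s;
}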

Re: "zero-copy" parsing of in-memory data



There are (at least) two bugs in this, namely,


A $s = '' should appear in the declaration (my $s = '';) so that empty
strings are returned as empty strings and not as undef.


A check for 'end of input' should appear here for more sensible error
messages, eg

die('unexpected EOF') if pos() == length();
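With both fixes applied, a cut-down version of the routine could look like the sketch below. Several things in it are assumptions made to keep it self-contained: the escape table, the parser_error helper and the one-string-only parse_json are invented here, the decode() and \u handling are left out, and putting the EOF check just before the generic error is only one plausible spot for it.

```perl
use strict;
use warnings;

# Escape table and error helper are assumptions made for this sketch.
my %json_escapes = (b => "\b", f => "\f", n => "\n", r => "\r", t => "\t");

sub parser_error
{
    die(sprintf("%s at offset %d\n", $_[0], pos() // 0));
}

# Expects pos() to be just past the opening quote.
sub parse_string()
{
    my $s = '';                 # fix 1: empty strings stay empty strings

    {
        /\G"/gc && last;
        /\G([^"\\]+)/gc && do { $s .= $1; redo; };
        /\G\\(["\/\\])/gc && do { $s .= $1; redo; };
        /\G\\([bfnrt])/gc && do { $s .= $json_escapes{$1}; redo; };

        die("unexpected EOF\n") if pos() == length();   # fix 2
        parser_error('unexpected character in string');
    }

    return $s;
}

# Hypothetical top-level routine that only understands a single string.
sub parse_json
{
    local *_ = \$_[0];          # alias $_ to the caller's scalar, no copy

    pos($_) = 0;                # start from the beginning on every call
    /\G\s*"/gc || parser_error('expected a string');
    return parse_string();
}

my $input = '"tab\\there"';
print(parse_json($input), "\n");
```

The input is passed in a variable rather than as a literal because the match position is recorded on the aliased scalar itself.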


Re: "zero-copy" parsing of in-memory data

On 1/5/2015 00:20, Rainer Weikusat wrote:

This is the best Perl book for me. I think it will help people
using other languages as well.
