# Decoding "BER" Integers

#### Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

•  Subject
• Author
• Posted on
Perl pack/ unpack support a "Binary Encoded Representation" for
non-negative integers of (theoretically) arbitrary width. This works by
dividing the actual value into 7-bit groups (from least significant to
most significant) and putting each 7-bit group in the lower 7 bits of an
octet/ byte, starting with the most significant group (eg, 267, binary
100001011 would be grouped into 0000010 0001101 and stored
left-to-right). The 8th bit of each octet except the last making up the
number is set to one (so 267 would be encoded as bytes 10000010
00001101). Unfortunately, while unpack can be used to decode this
format, there doesn't seem to be a way to determine the encoded length
of a decoded integer (other than determining the position of the highest
set bit in the decoded value and calculate the number of needed octets
from that).

I'm planning to use/ using integers encoded in this way to represent
string lengths in some binary message format for representing a sequence
of structures composed of a sequence of members of certain types. In
order to decode this, it's obviously necessary to skip over the already
decoded parts of the message string, hence, I need to know the length of
a string length. My idea for dealing with this problem was to decode the
numbers octet-by-octet and count them by doing so. So far, I came up
with three algorithms for doing this and the one doing the unpack('C5',
...) seems best to me.

Any other ideas?

Two additional remarks: The length of a complete message is restricted
to a 32-bit integer. Perl supports O(1) deletion from the beginning of a
string.

-----------
use Benchmark;

my @lens = map { pack('w', rand(1 << 9)) } 0 .. 100;

sub uni
{
my \$v = \$_[0];
my (\$cur, \$n, \$x);

do {
++\$n;

\$cur = unpack('C', \$v);
\$x <<= 7;
\$x |= \$cur & 0x7f;

substr(\$v, 0, 1, '');
} while \$cur > 127;

return (\$x, \$n);
}

sub s1
{
my \$v = \$_[0];
my (\$cur, \$n, \$x);

while ((\$cur = unpack('C', \$v)) > 127) {
\$x |= \$cur & 0x7f;
\$x <<= 7;

++\$n;
substr(\$v, 0, 1, '');
}

substr(\$v, 0, 1, '');
return (\$x | \$cur, \$n + 1);
}

sub un_all
{
my \$v = \$_[0];
my (\$x, \$n);

for (unpack('C5', \$v)) {
++\$n;

if (\$_ < 128) {
substr(\$v, 0, \$n, '');
return (\$x | \$_, \$n);
}

\$x |= \$_ & 0x7f;
\$x <<= 7;
}
}

timethese(-3,
{
uni => sub {
uni(\$_) for @lens;
},

s1 => sub {
s1(\$_) for @lens;
},

un_all => sub {
un_all(\$_) for @lens;
}});

## Re: Decoding "BER" Integers

If it's speed you're after, a) avoid copying the string and b) use a regexp.

sub w_len
{
if (\$_[0] =~ /\A[\x80-\xFF]*[\x00-\x7F]/) {
( unpack("w",\$_[0]), \$+[0] );
} else {
# input contains no octets below 0x80
();
}
}

/Bo Lindbergh

## Re: Decoding "BER" Integers

While this sort-of answers the question you pulled out of the text I
wrote, it's not usuable in the context of the benchmark I posted because
the string has to be copied to avoid changing the original as the
encoded length also has to be removed from it (because of the
requirements for actual use I posted --- in particular, I was also
interested in observable effects - if any - of calling substr per octet
or calling it once after the encoded length was determined). Also, at
least in my testing, using \$+[0] caused the performance to become
seriously grotty. But /g and pos can be used instead. The regex is also
more complicated than necessary as it really only needs to find the
position of the first octet/ char with a numerical value < 128.

Implementation taking the above into account:

sub re
{
my \$v = \$_[0];
my (\$x, \$n);

\$x = unpack('w', \$v);

\$v =~ /[\x00-\x7f]/g;
\$n = pos(\$v);
substr(\$v, 0, \$n, '');

return (\$x, \$n);
}

Starts to pull ahead clearly (for me) once lengths reach 2**15 (1 << 15, 32768)
below, it's slower than the otherwise fasted and becomes worse as maximum
lengths decrease. Since 'equally distributed set of numbers from 0
.. 32767' is not realistic in practice, I won't use it, but the general
idea is not bad (aka 'I should have though of that myself, thanks').

Random thought: Considering that regex matches are string operations
(and substr could be considered one, too), can I count on perl leaving
my binary data alone instead of starting to group it into characters
according to some supposedly secret encoding convention?

Might someone be hiding behind the coalshed here?