Parsing Question...

Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

Threaded View
As a learning exercise, I am trying to write a web-based version of
'drawbot' in PHP.


I am interested in hearing ideas about how to approach the user input
parsing problem.  I would like to allow people to type in simple code
and have it executed, but I need to limit the code they can write to a
few pre-defined drawing functions, as well as control structures like
loops, if thens, etc...

Short of writing a parser, which is clearly beyond me, what are some
reasonable approaches to handling user input that will be executed?

Thanks in advance,

Re: Parsing Question...

cjl wrote:

Quoted text here. Click to load it

Writing a parser is the best option in the long-run. If you were to
attempt to interpret the user input some other way, like pure regular
expressions, then you would fall into a lot of traps, and your interpreter
would behave oddly in many cases.

A full parser is a much better option: it will behave far more reliably
and would be a lot easier to extend, should you feel the need to add extra
features to the language at a later date.

Although it's a lot of work, there are some fairly well established
methods on writing them. What you basically need to write is three fairly
independent components: a tokeniser, a parser and an interpreter. None of
these share any code in common, except for the definitions of a few
constants and classes.

Firstly, a tokeniser, which reads the user input and splits it into a long
list of tokens. Each token should have the form:

    class Token
        var $token_type;     // integer
        var $token_value;    // mixed
        var $line;           // integer
        var $char;           // integer

Such that when you tokenize the following PHP:

    echo "Foo";

You end up with something like this (though imagine the inner arrays are
actually Token objects!):

        array(TOKEN_BUILTIN, "echo", 1, 1),
        array(TOKEN_STRING_DQUOTED, "Foo", 1, 6),
        array(TOKEN_TERMINATOR, NULL, 1, 11)

Note the $line and $char which contain the line number and character
number where this token was found? That helps when a later stage of your
program needs to print an error message -- it can inform the user of the
exact location where the error occurred.

Writing a tokeniser is probably the easiest step. The only slightly
difficult bits are things like "dealing with strings that contain
\"special\" characters", but even they are not too difficult!

Your tokeniser then passes this list over to the parser. The parser is
probably the hardest part you have to write. You have to convert the
stream of tokens into an "abstract syntax tree".

First you need to define the classes you'll build the AST out of. PHP 5's
object oriented features will be very useful here.  

    abstract class AstNode
        public $token;
        final public function __construct($t)
            $this->token = $t;
        abstract public function evaluate($machine);
    class AstNode_Script extents AstNode
        public $statements;
        public function evaluate($machine)
            foreach ($this->statements as $s)
    class AstNode_If extends AstNode
        public $condition_expression;
        public $execution_block;

        public function evaluate()
            if ($this->condition_expression->evaluate($machine))
    class AstNode_Constant_False extends AstNode
        public function evaluate($machine) { return FALSE; }
    // etc

Then write the parser itself, which takes the form:

    class Parser
        private $tokens;

        public function __construct($T)
            if (is_array($T))
                $this->tokens = $T;
                throw new Exception('Argh!');

        public function next()
            return array_shift($this->tokens);

        public function peek()
            return $this->tokens[0];

        public function get($type, $hissy_fit=FALSE)
            $next = $this->peek;
            if ($next->token_type==$type)
                return $this->next();
            elseif ($hissy_fit)
                throw new Exception('hissy fit');
                return FALSE;

        public function parseScript()
            $ast = new AstNode_Script($this->peek());
            $ast->statements = $this->parseCommand();
            while ($this->peek())
                $ast->statements = $this->parseCommand();
            return $ast;

        // And then you write parseCommand, which in turn probably
        // calls things like parseConditional, parseExpression,
        // parseFunctionCall and so forth.

The third part of the job is interpreting the AST, but if you look at my
AstNode_* classes above, you'll see they have the logic built into them.
All you then need to do is:


Where machine is an object capable of keeping track of things like
variable values, function definitions and so forth.

It's quite a bit of work, but it's certainly do-able. It helps if you have
a good book on compilers -- I'd recommend Watt & Brown "Programming
Language Processors in Java". As you might guess from the title, it
teaches you to write parsers, compilers and interpreters in Java, but the
same techniques can easily be applied to any object-oriented language, and
with a little more imagination, to non-OO languages too.

A few months back, partly as an experiment, but partly because I thought
it would be useful for a project of mine, I designed my own scripting
language and wrote a tokeniser, parser and machine for it in PHP. It
supports variables (numeric, string and multi-dimensional arrays),
functions, comments, and has all the normal numeric, string and array
operators built-in. Scalar (non-array) variables, are automatically
typecast as arrays (such that they become single-element arrays) and array
variables are automatically typecast as scalars (the first value in the
array is used, the rest are discarded).  

The reason I wrote it is that it would allow user-supplied code to run in a
"sandbox" environment, so that if it crashed, or tampered with variables,
or whatever, it wouldn't cause any problems for the rest of the site.

It's half-finished, the syntax is sort of crazy and it needs improving,
which is why I've not foisted it upon the general public. But if you want
a copy, I'd be happy to send you one, licensed under the GPL.

Here's an example of using it:

$p = <<<PROG

/* Function names can be arbitrary strings. No parentheses used. */
function "my concatenation function", $a, $b
    /* Uses "let VAR := EXPR" for assignment. A bit like Turing. */
    let $result := $a . $b;
    /* Perlish syntax for returning function results. */

let $quux = call "my concatenation function", "foo", "bar";

/* Print automatically appends a new line, a la Python */
print $quux;


$r = eval_programme($p);


Toby A Inkster BSc (Hons) ARCS
Contact Me ~
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!

Re: Parsing Question...


That is the single best response ever given to a newsgroup post. Thank

Obviously, I have got a lot of reading to do.

I'm wondering if there is a simpler approach...after all, I want the
user input to be valid php, I just want to limit what they can type to
a few functions I write ( circle(), line(), etc..) and a few control
structures. Maybe I could create an object which includes member
functions which override all native php functions, and have the user
input actually be calls to that objects methods, and only pass through
the ones that I want to allow?

As far as the approach you are suggesting, some googling showed:

Which maybe can help me?

Anyway, back to the drawing board.


Re: Parsing Question...

cjl wrote:

Quoted text here. Click to load it

If the user input is to be valid PHP, the "obvious" solution is eval(),
but this will totally destroy your security. You could use regular
expressions to check for "naughty" functions (like SQL queries, file
system manipulation, TCP sockets, etc), but then you end up:

    (a) playing catch-up with the features of PHP itself. As new
        functions are added to the language, you'll need to evaluate
        how naughty they are, and add them to the block list.  

    (b) naively blocking more innocent PHP like:
        print "fopen";

(It's worth mentioning, that you'll also need to include in your block
list "naughty" functions from any of your own or third-party libraries you

Quoted text here. Click to load it

Aye -- that is indeed the dream: the ability to have an eval() function
that works within a single object, such that any function calls are
silently re-written to "$this->function()", any globals to
"$this->variable" and any constants to "self::CONSTANT".

Although PHP doesn't have such a "safe eval" function built in, it
shouldn't be too difficult to build one. As your language follows PHP
syntax rules, you can use PHP's built-in tokeniser:

    $tokens = token_get_all($source);

Then loop through that list, looking for all tokens of type T_VARIABLE and
re-pointing them at object members; finding T_STRING (which despite the
name is an "non-$-identifier" token, so could be either a function call or
a constant) and heuristically (e.g. UPPERCARE is assumed to be a
constant; MixedOr_lower_case is assumed to be a method.) re-pointing it at
a class constant or object method; and finally finding T_EVAL and
replacing it with T_ECHO. You would then need to loop through the token
list and re-assemble it as source code before passing it through an eval()
function wrapper within the object.

Sounds complicated; but is simpler than implementing your own real parser
and interpreter; and could probably be done in less than 50 lines of code.

I wouldn't be happy running it on a production system though without
substantial hack-testing!

Quoted text here. Click to load it

Quite possibly -- it does look quite good. If I'd known of its existence
when I started my scripting language, I might not have attempted to write
a scripting language. But I certainly learnt a lot --especially about OO
PHP -- from doing so, so I don't regret it.

Quoted text here. Click to load it

No problem -- I'd guessed that nobody else in this group had been crazy
enough to attempt a scripting language parser and interpreter in PHP, so
if I didn't help you, nobody would!

Toby A Inkster BSc (Hons) ARCS
Contact Me ~
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!

Re: Parsing Question...


Once again, thanks.

After reading up on the (very) little information I could find about
the tokenizer, I am going to try and give it a shot by parsing the
tokens. However, instead of a blacklist, I was thinking about using a
whitelist, which should be slightly more hack-proof than a blacklist?

Anyway, back to the drawing board. If I end up with anything that
works, I'll repost to this thread.


Site Timeline