RSS XML Streams


Hi all,

I'd like to process RSS XML downloaded from the web.

What I mean is this:
I start downloading the RSS XML from the website, and while
downloading I start processing the data; this way I can choose what
to do with the data without completing the download.

For this I'm using LWP::UserAgent, which calls my callback routine, i.e.:

my $res = $ua->request(
        HTTP::Request->new( GET => $rss ),
        \&rss_stream_SAX,
);

together with XML::SAX and my MySAXHandler.

What I was wondering is if there are some CPAN modules already written
for this goal.


I've already searched CPAN, but I found nothing
for my task. Since there are so many options, I'd like feedback
from somebody "in the field" ;-)

Re: RSS XML Streams

On Fri, 1 May 2009 11:01:02 +0200, (isecc) wrote:


Not from the field, but I know what you are looking for.
Actually, you probably are not looking for stream parsing.
Instead you need to instantiate a non-blocking parser that knows how
to parse arbitrary chunks of a continuous document, one at a time.
This requires that, after a chunk is parsed, the parser keep a small buffer
holding just the data from the end of that chunk that couldn't be parsed yet.
The next chunk is then appended to that leftover, and the cycle
repeats until there are no more chunks to pass in.

Meanwhile, as the parser works through the chunks, it dishes out SAX events
as they are found.
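The chunk-and-remainder cycle described above can be sketched in plain Perl. This is a hypothetical illustration, not expat: the "parseable unit" here is an invented complete <item> element, and anything after the last complete unit stays in the buffer until the next chunk arrives.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $buffer = '';
my @items;

# Append a chunk, then peel off every complete unit currently in the buffer.
sub feed_chunk {
    my ($chunk) = @_;
    $buffer .= $chunk;
    while ( $buffer =~ s{\A.*?<item>(.*?)</item>}{}s ) {
        push @items, $1;   # a complete unit was parsed
    }
    # Whatever remains is an incomplete fragment; it waits for the next chunk.
}

# The document arrives split at an arbitrary point, even mid-tag:
feed_chunk('<rss><item>one</item><it');
feed_chunk('em>two</item></rss>');

print "$_\n" for @items;   # prints "one" then "two"
```

A real parser does the same bookkeeping at the tokenizer level, which is why you can hand it chunks split anywhere.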

From your SAX callback handlers you can set a flag that is read by
your agent request handler, which can in turn cancel the parse and then 'die'
within the agent handler to bail out of the request.

As I noticed, XML::Parser (which is a wrapper for Expat) just happens to have
ExpatNB (non-blocking). It's only non-blocking in the sense that it instantiates
the parser and then lets you pass in 'chunks' of XML to be parsed as they arrive.

So 'parse_more( $string )' calls your SAX callbacks, where you set a flag
once you've got what you want. Back from parsing the chunk, check whether
the flag was set. If you want to bail out of the parse, call 'parse_done',
then bail out of your request handler with a 'die'.

I don't know whether the parser will let you continue if it encounters a real
XML error, though; it may emit a die on you. I wouldn't like that if it does.

I already have this mechanism in place with an XML parser of my own, so if this
doesn't get you what you want, let me know and I can work with you.

Some INFO:

XML::Parser::ExpatNB Methods
The class XML::Parser::ExpatNB is a subclass of XML::Parser::Expat used for
non-blocking access to the expat library. It does not support the parse,
parsestring, or parsefile methods, but it does have these additional methods:

parse_more(DATA) - Feed expat more text to munch on.
parse_done  - Tell expat that it's gotten the whole document.

parse_start([ OPT => OPT_VALUE [...]])
Create and return a new instance of XML::Parser::ExpatNB. Constructor options
may be provided. If an init handler has been provided, it is called before
returning the ExpatNB object. Documents are parsed by making incremental calls
to the parse_more method of this object, which takes a string. A single call
to the parse_done method of this object, which takes no arguments, indicates
that the document is finished.

If there is a final handler installed, it is executed by the parse_done method
before returning, and the parse_done method returns whatever is returned by
the final handler.
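A minimal usage sketch of those three methods, assuming XML::Parser is installed; the element names and the chunk boundary (deliberately mid-tag) are arbitrary choices for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my @seen;
my $parser = XML::Parser->new(
    Handlers => {
        # Record every start tag the parser delivers.
        Start => sub { my ( undef, $elem ) = @_; push @seen, $elem },
    },
);

# parse_start returns an XML::Parser::ExpatNB object.
my $nb = $parser->parse_start;
$nb->parse_more('<rss><channel><tit');                  # ends mid-tag; expat buffers it
$nb->parse_more('le>feed</title></channel></rss>');     # rest of the document
$nb->parse_done;                                        # whole document delivered

print join( ',', @seen ), "\n";   # rss,channel,title
```

Note that the Start handler fires during the parse_more call in which the tag becomes complete, so a flag set inside it is already visible when parse_more returns.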

The code below is not fleshed out; you would create a new parser, add
your SAX callback functions, and set up the flags. In stream_h, under TEST:

sub stream_h {
    my $string = shift;
    print $string;        # or to a file, or append to $result->content

    # TEST:
    $parse->parse_more($string);           # send more text to the parser
    if ($we_have_sax_info_lets_quit) {     # check flags set by your SAX handlers
                                           # (parse_more won't return until it's done)
        $parse->parse_done;                # tell the parser you're done
        die "dont need no more";           # tell the UA request to stop collecting data
    }
    return $string;   # the docs say to do this, but they are wrong: the content
                      # is not assigned. At this point you could concatenate it
                      # to $result->content, but why would you
}

use strict;
use warnings;
use Data::Dumper;
require LWP::UserAgent;
require HTTP::Request;

my $cnt = 0;

sub stream_h {
    my $string = shift;
    print $string;        # or to a file, or append to $result->content

    # TEST: stop the request if $cnt > 2
    # check X-Died in the result header below, or just set a user flag
    die "dont need no more" if (++$cnt > 2);
    return $string;
}

my $ua = LWP::UserAgent->new;
my $request = HTTP::Request->new(GET =>
' ');
my $result = $ua->request( $request, \&stream_h );

print "\n\n", Dumper($result);
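On the X-Died note in the comment above: when the content callback dies, LWP::UserAgent catches the exception and records the message in the X-Died response header instead of letting it propagate. A minimal sketch of the check, using a hand-built HTTP::Response (with an invented die message) so it runs without a network request:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Simulate what LWP::UserAgent does when a content callback dies: it
# catches the exception and stores the message in the X-Died header.
my $result = HTTP::Response->new(200);
$result->header( 'X-Died' => 'dont need no more at script.pl line 12.' );

if ( my $died = $result->header('X-Died') ) {
    print "request stopped early: $died\n";
}
```

In the real program you would run this check on the $result returned by $ua->request.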


What I found out is that the agent has a special thread that fills up a default
buffer in 4k chunks. After that it tends to sleep until the buffer falls
below that level, then requests more data. The buffer empties upon each return
from the request handler, and filling stops when the buffer is full. So there
won't be a case where the download continues in the background even if you pause
the program, prompting for user input or something. This can be checked with
various sleeps in the code, but I think this is the case.

Good luck, let me know what comes of this.
Like I said, I have special parse code if this parser doesn't do what you need.

