How to read from URL line-wise?


First, profuse apologies for the original posting of this query,
which had gibberish ("sdfsdfsf") in the Subject: line.  I goofed.


I'm looking for the "moral equivalent" of the (fictitious) `openremote`
function below:

    my $handle = openremote( ' ' ) or die $!;
    while ( <$handle> ) {
        # etc.
        # do stuff with $_
    }
    close $handle;

IOW, I'm looking for a way to open a read handle to a remote file
so that I can read from it *line-by-line*.  (Typically this file
will be larger than I want to read all at once into memory.  IOW,
I want to avoid solutions based on stuffing the value returned by
LWP::Simple::get into an IO::String.)

I'm sure this is really basic stuff, but I have not been able to
find it after a lot of searching.



Re: How to read from URL line-wise?


I very much doubt that HTTP supports such line-by-line retrieval. And
if line-by-line is not supported by the underlying protocol, then at the
very best you can only hope for a local simulation, but at that point
the resource has already been retrieved in full.


Re: How to read from URL line-wise?


Not line-by-line (files don't support that either on most platforms),
but byte-ranges are supported by HTTP/1.1. Whether the server supports
it for the file is another question, but most servers do for files
stored in the file system (but not dynamically created content).

But I associate "line-by-line" with sequential access, not random
access, and you are of course always free to process the response in
little chunks as you receive it (see "Handlers" in LWP::UserAgent for a
standard way of doing this).
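
By way of illustration, a minimal sketch of that approach, assuming a
placeholder URL: it uses the :content_cb hook documented for
LWP::UserAgent's get() (the handler interface offers an equivalent
"response_data" hook) and a small buffer to turn chunks into lines.

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $url = 'http://www.example.com/big-file.txt';   # placeholder
    my $ua  = LWP::UserAgent->new;
    my $buf = '';

    # The callback receives the body in chunks as it arrives; the content
    # is not stored in the response object, so memory use stays small.
    my $response = $ua->get(
        $url,
        ':content_cb' => sub {
            my ($chunk, $response, $protocol) = @_;
            $buf .= $chunk;
            # Peel complete lines off the front of the buffer.
            while ($buf =~ s/\A(.*\n)//) {
                my $line = $1;
                # ... do stuff with $line ...
            }
        },
    );
    die $response->status_line unless $response->is_success;
    # ... $buf may still hold a final line without a trailing newline ...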


   _  | Peter J. Holzer    | The curse of electronic word processing:
|_|_) |                    | you keep filing away at your text until
| |   |                    | the parts of the sentence no longer
__/   |                    | fit together. -- Ralph Babel

Re: How to read from URL line-wise?


Thanks for this pointer!  This approaches what I'm after.  I'd
hoped to find a package (in some obscure corner of LWP) that already
implemented this line-oriented interface to the stream, but I guess
I'll have to write it myself.  (Conceptually it's not a hard thing
to do, but IME *robust* implementations of even simple tasks like
this one can take a lot more work than one would expect.)


Re: How to read from URL line-wise?


I was under the impression that HTTP supported incremental downloads
(some fixed number of bytes at a time); if so, a client could easily
implement a line-by-line interface to that stream...  But now I
think I need to do some homework and review HTTP.



Re: How to read from URL line-wise?


HTTP does not support "line-by-line" retrieval. You can get stuff in
chunks smaller than the whole file, however, and use a small buffer
to emulate some sort of record-based read. I've written HTTP read/write
code from bare sockets in Perl; it's certainly doable, but it's a
project that needs a lot of testing time: there can be a lot of
variation in the way things are returned depending on the server.

Things to consider:

At a high level HTTP/1.1 has a "Range:" header that can be used to
request a fragment of a large resource. Most servers support returning
just a portion of a resource IF that resource is a static file on disk.
If it is a dynamic page, YMMV.
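
As a rough illustration (placeholder URL, and only useful if the server
honors ranges), asking LWP for just the first 64 KiB might look like
this:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    my $url = 'http://www.example.com/big-file.txt';   # placeholder
    my $ua  = LWP::UserAgent->new;

    my $req = HTTP::Request->new(GET => $url);
    $req->header(Range => 'bytes=0-65535');            # first 64 KiB

    my $res = $ua->request($req);
    if ($res->code == 206) {        # 206 Partial Content: range honored
        # ... process the partial body, then request the next range ...
    }
    elsif ($res->is_success) {
        # Server ignored Range: and sent the whole resource anyway.
    }
    else {
        die $res->status_line;
    }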

In HTTP/1.0, you don't get Range:, but you also don't have to deal with
"Transfer-Encoding: chunked" (more in a bit). You either have a
"Content-Length" header specifying the whole length of the result or you
work blind. In either scenario, you sysread off the socket until you get
your record separator or a zero-length read. In theory, you could read one
byte at a time and just let the kernel handle your buffering. That might
be slow, and can induce confusion between "bytes" and "characters" when
dealing with 21st-century character-encoding awareness.
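
A bare-socket sketch of that HTTP/1.0 case (host and path are
placeholders; note that the status line and headers come off the socket
before the body, and real code has to separate them):

    use strict;
    use warnings;
    use IO::Socket::INET;

    my ($host, $path) = ('www.example.com', '/big-file.txt');  # placeholders

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
    ) or die "connect: $!";

    print $sock "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";

    my $buf = '';
    while (1) {
        my $n = sysread($sock, $buf, 4096, length $buf);
        die "sysread: $!" unless defined $n;
        last if $n == 0;                     # zero-length read: EOF
        # Peel complete lines off the front of the buffer.
        while ($buf =~ s/\A(.*\n)//) {
            my $line = $1;
            # ... do stuff with $line (status line and headers come first) ...
        }
    }
    # ... $buf may still hold a final partial line ...
    close $sock;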

In HTTP/1.1, besides the Content-Length or flying-blind options, the
server can do its own breaking into useful-sized bits. This results in a
"Transfer-Encoding: chunked" header on the response, and interleaved chunk
sizes in the body of the response. Again you can use sysread. This sort
of response is very common for dynamic content like CGI or compressed-on-
the-fly pages. (Chunk sizes I've observed often look like the output of
compressing a page in 4096-byte blocks and then sending the output as
an HTTP chunk.) Unless you understand the chunking protocol, the chunk
sizes will corrupt the body content.
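
For illustration only, a much-simplified de-chunking routine, assuming
the header block has already been stripped and the whole chunked body is
in memory (real code also has to cope with trailers and partial reads):

    # Turn a "Transfer-Encoding: chunked" body back into plain bytes.
    sub dechunk {
        my ($raw) = @_;
        my $body = '';
        while ($raw =~ s/\A([0-9a-fA-F]+)[^\r\n]*\r\n//) {
            my $size = hex $1;            # chunk size is hexadecimal
            last if $size == 0;           # zero-size chunk ends the body
            $body .= substr($raw, 0, $size, '');
            $raw =~ s/\A\r\n//;           # CRLF terminating this chunk
        }
        return $body;
    }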

The server will probably never compress on the fly unless you add the
appropriate "Accept-Encoding" header, but if you are dealing with
truly large text files (as implied by the question), you do the network
a favor by allowing the server to compress them. Then, of course, you need
to do chunked decompression, too.
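
If streaming is not essential, LWP can do that negotiation and
decompression for you; a small sketch (placeholder URL), at the cost of
holding the whole body in memory:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Message;

    my $url = 'http://www.example.com/big-file.txt';   # placeholder
    my $ua  = LWP::UserAgent->new;

    # Advertise whatever encodings this LWP installation can decode.
    my $res = $ua->get($url, 'Accept-Encoding' => HTTP::Message::decodable());
    die $res->status_line unless $res->is_success;

    # decoded_content() undoes Content-Encoding (e.g. gzip), but it needs
    # the whole body in memory, so it trades streaming for convenience.
    for my $line (split /^/m, $res->decoded_content) {
        # ... do stuff with $line ...
    }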

And there is a whole level of insanity to doing SSL stuff on your own.
I punted that in my own code by using Net::SSLeay::Handle, whose docs
show a read-HTTPS-line-by-line example; that example does not handle
any of the complexity or subtlety of real HTTP/HTTPS. (In particular,
the lack of a Host: header breaks name-based virtual hosting, and with
it a large part of the modern web.)

I would not be surprised if there is easy-to-find code for the OP's problem.
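
For reference, a minimal sketch along the lines of the
Net::SSLeay::Handle synopsis, with a Host: header added (host and path
are placeholders, and the caller still has to parse the status line,
headers, and any chunking):

    use strict;
    use warnings;
    use Net::SSLeay::Handle;

    my ($host, $port, $path) = ('www.example.com', 443, '/big-file.txt');

    # Tie a filehandle to an SSL connection, then speak plain HTTP over it.
    tie(*SSL, 'Net::SSLeay::Handle', $host, $port);

    print SSL "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";

    while (my $line = <SSL>) {
        # ... status line and headers arrive before the body ...
    }
    close SSL;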
