[ANN] Net::ChooseFName

If you need (or have thought of) something like this module, please
comment on features you miss.  It took a lot of time to chip it out of my
version of a recursive net downloader; I may be able to spend some more
time to make it better still...



    Net::ChooseFName - Perl extension for choosing a name of a local mirror
    of a net (e.g., FTP or HTTP) resource.

      use Net::ChooseFName;
      $namer = Net::ChooseFName->new(max_length => 64);     # Copies to CD ok

      $name = $namer->find_name_by_response($LWP_response);
      $name = $namer->find_name_by_response($LWP_response, $as_if_content_type);

      $name = $namer->find_name_by_url($url, $suggested_name,
                                       $content_type, $content_encoding);
      $name = $namer->find_name_by_url($url, $suggested_name, $content_type);
      $name = $namer->find_name_by_url($url, $suggested_name);
      $name = $namer->find_name_by_url($url);

      $namer_returns_undef = Net::ChooseFName->failer();    # Funny constructor

    This module helps to pick a local file name for a remote resource
    (e.g., one downloaded from the Internet). It turns out that this is a
    tricky business; keep in mind that most servers are misconfigured, most
    URLs are malformed, and most filesystems are limited w.r.t. possible
    filenames. As a result, most downloaders fail in some situations, since
    they choose names which are not supported on particular filesystems, or
    not useful for "file:///"-related work.

    Because of the many possible twists and ramifications, this module is
    designed to be as configurable as possible. One way to configure it is
    a rich system of options which influence different steps of the
    process. To cover cases when options are not flexible enough, the
    process is broken into many steps; each step is easily overridable by
    subclassing "Net::ChooseFName".

    The defaults are chosen to be as safe as possible without getting in
    the way too much. For example, since "%" is a special character in
    DOSish shells, to simplify working from the command line on such
    systems, we avoid this character in generated file names. Similarly,
    since MacOS has problems with filenames with 8-bit characters, we avoid
    them too; since many Unix programs have problems with spaces in file
    names, we massage them into underscores; the length of the longest file
    path component is restricted to 255 chars.

    Note that in many situations it is advisable to make these restrictions
    stronger still. For example, for copying to CD one should restrict names
    further ("max_length => 64"); for copying to MSDOS file systems, enable
    the option "'8+3' => 1".

    [In the description of methods the $self argument is omitted.]

  Principal methods
    new(OPT1 => $val1, ...)
        Constructor method. Creates an object with given options. Default
        values for the unspecified options are (comments list in which
        methods this option is used):

          protect       =>              # protect_characters()
                                        # $1 should contain the match
          protect_pref  => '@',         # protect_characters(),
          root          => '.',         # find_directory()
          dir_mode      => 0775,        # directory_found()
          mkpath        => 1,           # directory_found()
          max_suff_len  => 4,           # split_suffix()        'jpeg'
          keepsuff_same_mediatype => 1, # choose_suffix()
          type_suff     =>              # choose_suffix()
                           {'text/ftp-dir-listing' => '.dirl'}
          keep_suff     => { 'text/plain' => 1,
                             'application/octet-stream' => 1 },
          short_suffices =>             # eight_plus_three()
                           {jpeg => 'jpg', html => 'htm',
                            'tar.bz2' => 'tbz', 'tar.gz' => 'tgz'},
          suggest_disposition => 1,     # find_name_by_response()
          suggested_only_basename => 1, # find_name_by_response(), raw_name()
          fix_url_backslashes => 1,     # protect_characters()
          max_length    => 255,         # fix_dups(), fix_component()
          cache_name    => 1,           # name_found()
          queryless_types =>            # url_takes_query()
                 # http://filext.com/detaillist.php?extdetail=DJV 2005/01
                 { map(($_ => 1),
                       qw(image/djvu image/x-djvu image/dejavu image/x-dejavu
                          image/djvw image/x.djvu image/vnd.djvu ))},
          queryless_ext => { 'djvu' => 1, 'djv' => 1 }, # url_takes_query()

        The option "type_suff" is special: the user-specified value is
        *added* to this hash rather than *replacing* it. Similarly, the
        value of the option "html_suff" is used to populate the value for
        "text/html" of this hash.

        Other options have "undef" as the default value. Their effects are
        documented in the documentation of the methods they affect. With the
        exception of "known_names", these options are booleans.

          html_suff                     # new()
          known_names                   # known_names() name_found(); hash ref or undef
          only_known                    # known_names()
          hierarchical                  # raw_name(), find_directory()
          use_query                     # raw_name()
          8+3                           # fix_basename(), fix_component()
          keep_space                    # fix_component()
          keep_dots                     # fix_component()
          tolower                       # fix_component()
          dir_query                     # find_directory()
          site_dir                      # find_directory()
          ignore_existing_files         # fix_dups

          keep_nosuff, type_suff_no_enc, type_suff_fallback,
          type_suff_fallback_no_enc     # choose_suffix()

        Summary of the options most useful in applications (with defaults,
        if any):

          html_suff                     # Suffix for HTML (dot will be prepended)
          root          => '.',         # Where to put files?
          mkpath        => 1,           # Create directories with chosen names?
          max_length    => 255,         # Maximal length of a path component
          ignore_existing_files         # Should the filename be "new"?
          cache_name    => 1,           # Return the same filename on the same
                                        #   even if file jumped to existence?
          hierarchical                  # Only the last component of URL path
          suggested_only_basename => 1, # Should suggested name be relative the
          use_query                     # Do not ignore the query part of URL?
                                        # Value is used as (literal) prefix of
          dir_query                     # Make the non-query part of URL a
          site_dir                      # Put the hostname part of URL into
          keepsuff_same_mediatype       # Preserve the file extensions matching
          8+3                           # Is the filesystem DOSish?
          keep_space                    # Map spaces in URL to spaces in
          tolower                       # Translate filenames to lowercase?

          type_suff, type_suff_no_enc, type_suff_fallback,
          keep_suff, keep_nosuff        # Hashes indexed by lowercased types;
                                        # Allow tuning choosing the suffix

    find_name_by_url($url, $suggested_name, $type, $enc)
        This method returns a suitable filename for the resource given its
        URL. Optional arguments are a suggested name (possibly, it will be
        modified according to options of the object), the content-type, and
        the content-encoding of the resource. If multiple content-encodings
        are required, specify them as an array reference.

        A chain of helper methods ("Transformation chain") is called to
        apply certain transformations to the name. "undef" is returned if
        any of the helper methods (except known_names() and protect_query())
        returns an undefined value; the caller is free to interpret this as
        "load to memory", if appropriate. These helper methods are listed in
        the following section.

    find_name_by_response($response [, $content_type])
        This method returns a name given an LWP response object (and,
        optionally, an overriding "Content-Type"). If the option
        "suggest_disposition" is TRUE, it uses the header
        "Content-Disposition" from the response as the suggested name, then
        passes the fields from the response object to the method
        find_name_by_url().

  Transformation chain
    url_2resource($url [, $type, $encoding])
        This method returns $url modified by removing the parts related to
        access to *parts* of the resource. In particular, the *fragment*
        part is removed, as well as the *query* part if url_takes_query()
        returns TRUE.

    known_names($url, $suggested, $type, $enc)
        The method find_name_by_url() will return the return value of this
        method (unless undef) immediately. Unless overridden, this method
        returns the value of the hash option "known_names" indexed by the
        $url. (By default this hash is empty.)

        If the option "only_known" is true, it is a fatal error if $url is
        not a key of this hash.

    raw_name($url, $suggested, $type, $enc)
        Returns the 0th approximation to the filename of the resource; the
        return value has two parts: the principal part, and the query string
        ("undef" if it should not be used).

        If $suggested is undefined, returns the path part of the $url, and
        the query part (if present and if option "use_query" is TRUE).
        Otherwise either returns $suggested, or (if options
        "suggested_only_basename" and "hierarchical" are both true), returns
        the *path* part of the $url with the last component changed to
        $suggested; the query part is ignored in this case. In the latter
        case, if option "suggested_basename" is TRUE, only the last path
        component of $suggested is used.

    protect_characters($f, $query, $url, $suggested, $type, $enc)
        Returns the filename $f with necessary character-by-character
        translations performed. Unless overridden, it translates backslashes
        to slashes if the option "fix_url_backslashes" is TRUE, replaces
        characters matched by regular expression in the option "protect" by
        their hexadecimal representation (with the leader being the value of
        the option "protect_pref"), and replaces percent signs by the value
        of the option "protect_pref".
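
        As a rough illustration of the default behaviour described above,
        here is a self-contained sketch in plain Perl. It is not the
        module's code: the name protect_characters_sketch and the pattern
        in $PROTECT are hypothetical (the real default for "protect" is not
        shown in this text), and the order of the three steps is an
        assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the default "protect" pattern (an assumption);
# as the option comment says, $1 should contain the match.
my $PROTECT = qr/([?*|"<>:#\x00-\x1F\x7F-\xFF])/;

sub protect_characters_sketch {
    my ($f, %opt) = @_;
    my $pref    = defined $opt{protect_pref} ? $opt{protect_pref} : '@';
    my $protect = $opt{protect} || $PROTECT;
    my $fix_bs  = exists $opt{fix_url_backslashes}
                ? $opt{fix_url_backslashes} : 1;

    # Backslashes become slashes.
    $f =~ tr{\\}{/} if $fix_bs;
    # Protected characters become prefix + two hex digits ("?" gives "@3F").
    $f =~ s/$protect/sprintf('%s%02X', $pref, ord($1))/ge;
    # Percent signs become the prefix ("%20" gives "@20").
    $f =~ s/%/$pref/g;
    return $f;
}

print protect_characters_sketch('/a\\b/what?.html'), "\n";
print protect_characters_sketch('/my%20file.txt'), "\n";
```

        Under these assumptions the first call prints "/a/b/what@3F.html"
        and the second prints "/my@20file.txt", so neither "%" nor "?"
        survives into the local name.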

    protect_query($f, $query, $url, $suggested, $type, $enc)
        Returns $query with necessary character-by-character translations
        performed. Unless overridden, it translates slashes, backslashes, and
        characters matched by regular expression in the option "protect" by
        their hexadecimal representation (with the leader being the value of
        the option "protect_pref"), and replaces percent signs by the value
        of the option "protect_pref".

    find_directory($f, $query, $url, $suggested, $type, $enc)
        Returns a triple of the appropriate directory name, the relative
        filename, and a string to append to the filename, based on
        processed-so-far filename $f and the $query string.

        Unless overridden, does the following: unless the option
        "hierarchical" is TRUE, all but the last path components of $f are
        ignored. If the option "site_dir" is TRUE, the host part of the URL
        (as well as the port part - if non-standard) are prepended to the
        filename. The leading backslash is always stripped, and the option
        "root" is used as the lead components of the directory name. If
        $query is defined, and the option "dir_query" is true, $f is used as
        the last component of the directory, and $query as file name (with
        option "use_query" prepended).

        (Dirname is assumed to be "/"-terminated.)

    protect_directory($dirname, $f, $append, $url, $suggested, $type, $enc)
        Returns the provisional directory part of the filename. Unless
        overridden, replaces empty components by the string "empty" preceded
        by the value of "protect_pref" option; then applies the method
        fix_component() to each component of the directory.

    directory_found($dirname, $f, $append, $url, $suggested, $type, $enc)
        A callback to process the calculated directory name. Unless
        overridden, it creates the directory (with permissions per option
        "dir_mode") if the option "mkpath" is TRUE.

        Actually, the directory name is the return value, so this is the
        last chance to change the directory name...

    split_suffix($f, $dirname, $append, $url, $suggested, $type, $enc)
        Breaks the last component $f of the filename into a pair of basename
        and suffix, which are returned. $dirname consists of other
        components of the filename, $append is the string to append to the
        basename in the future.

        Suffix may be empty, and is supposed to contain the leading dot (if
        applicable); it may contain more than one dot. Unless overridden, the
        suffix consists of all trailing non-empty started-by-dot groups with
        length no more than given by the option "max_suff_len" (not
        including the leading dot).
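
        The default splitting rule can be sketched with a single regular
        expression. The sub below is a hypothetical stand-alone helper, not
        the module's method; it takes only the last path component and the
        value of "max_suff_len" (4 if omitted, matching the default listed
        for new()).

```perl
use strict;
use warnings;

# Split the last path component into (basename, suffix).
sub split_suffix_sketch {
    my ($f, $max_suff_len) = @_;
    $max_suff_len = 4 unless defined $max_suff_len;
    # Shortest possible basename, followed by as many trailing ".xxx"
    # groups (1 to $max_suff_len non-dot characters each) as reach the end.
    my ($base, $suff) = $f =~ /^(.*?)((?:\.[^.]{1,$max_suff_len})*)\z/s;
    return ($base, $suff);
}

for my $name (qw(archive.tar.gz photo.jpeg my.document.html README)) {
    my ($base, $suff) = split_suffix_sketch($name);
    print "$name => [$base] [$suff]\n";
}
```

        Here "archive.tar.gz" splits into "archive" and ".tar.gz", while
        "my.document.html" keeps "my.document" as the basename, since
        "document" is longer than 4 characters.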

    choose_suffix($f, $suff, $dirname, $append, $url, $suggested, $type,
        Returns a pair of basename and appropriate suffix for a file. $f is
        the basename of the file, $suff is its suffix, $dirname consists of
        other components of file names, $append is the string to append to
        the basename.

        Different strategies applicable to this problem are:

        *   keep the file extension;

        *   replace by the "best" extension for this $type (and $enc);

        *   replace by the user-specified type-specific extension.

        Any of these has two variants: whether we want the encodings
        reflected in the suffix, or not. Unless overridden, choosing a
        strategy/variant consists of several rounds.

        In the first round, choose user-specified suffix if $type is
        defined, and is (lowercased) in the option-hashes "type_suff" and
        "type_suff_no_enc" (choosing the variant based on which hash
        matched). Keep the current suffix if $type is not defined, or option
        "keepsuff_same_mediatype" is TRUE and the current suffix of the file
        matches $type and $enc (per database of known types and encodings).

        The second round runs if none of these was applicable. Choose
        user-specified suffix if $type is (lowercased) in the hashes
        "type_suff_fallback" or "type_suff_fallback_no_enc" (choosing
        variant as above); keep the current suffix if the type (lowercased)
        is in the hashes "keep_nosuff" or "keep_suff" (depending on whether
        $suff is empty or not).

        If none of these was applicable, the last round chooses the
        appropriate suffix by the database of known types and encodings; if
        not found, the existing suffix is preserved.
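
        The rounds above can be sketched as a short decision function. This
        is a simplified illustration, not the module's implementation: it
        ignores $enc and the "_no_enc" variants, returns only the suffix,
        and uses a tiny hypothetical %KNOWN table in place of the real
        database of types. The names choose_suffix_sketch and suffix_fits
        are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical miniature "database" of known types; the first suffix
# listed is treated as the preferred one.  Not the module's real data.
my %KNOWN = (
    'text/html'  => [qw(.html .htm)],
    'image/jpeg' => [qw(.jpg .jpeg)],
    'text/plain' => [qw(.txt)],
);

# Does $suff already fit $type according to the toy database?
sub suffix_fits {
    my ($suff, $type) = @_;
    return scalar grep { $_ eq lc $suff } @{ $KNOWN{$type} || [] };
}

sub choose_suffix_sketch {
    my ($suff, $type, %opt) = @_;
    return $suff unless defined $type;      # round 1: no type, keep as is
    my $t = lc $type;

    # Round 1: user-specified suffix, or a current suffix that already fits.
    return $opt{type_suff}{$t} if exists $opt{type_suff}{$t};
    return $suff if $opt{keepsuff_same_mediatype} && suffix_fits($suff, $t);

    # Round 2: fallback suffix, or types whose suffix is left alone.
    return $opt{type_suff_fallback}{$t}
        if exists $opt{type_suff_fallback}{$t};
    my $keep = length($suff) ? 'keep_suff' : 'keep_nosuff';
    return $suff if $opt{$keep}{$t};

    # Round 3: preferred suffix from the database, else keep what we have.
    return $KNOWN{$t} ? $KNOWN{$t}[0] : $suff;
}

# Defaults copied from the option list of new().
my %defaults = (
    keepsuff_same_mediatype => 1,
    type_suff => { 'text/ftp-dir-listing' => '.dirl' },
    keep_suff => { 'text/plain' => 1, 'application/octet-stream' => 1 },
);
for my $case (['.jpeg', 'image/jpeg'], ['.php', 'text/html'],
              ['.log', 'text/plain'], ['', 'text/ftp-dir-listing']) {
    my ($suff, $type) = @$case;
    print "[$suff] $type => [",
          choose_suffix_sketch($suff, $type, %defaults), "]\n";
}
```

        With the defaults, ".jpeg" is kept for "image/jpeg" (round 1),
        ".log" is kept for "text/plain" (round 2, via "keep_suff"), and
        ".php" served as "text/html" becomes ".html" (last round).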

    fix_basename($f, $dirname, $suff, $url, $suggested, $type, $enc)
        Returns a pair of basename and suffix for a file. $f is the last
        component of the name of the file, $dirname consists of other
        components. Unless overridden, this method replaces an empty basename
        by "index" and applies the fix_component() method to the basename;
        finally, if the '8+3' option is set, it converts the filename and
        suffix to a name suitable for 8+3 filesystems.

    fix_dups($f, $dirname, $suff, $url, $suggested, $type, $enc)
        Given a basename, extension, and the directory part of the filename,
        modifies the basename (if needed) to avoid duplicates; should return
        the complete file name (combining the dirname, basename, and
        suffix). Unless overridden, appends a number to the basename
        (shortening basename if needed) so that the result is unique.

        This is a prime candidate for overriding (e.g., to ask user for
        confirmation of overwrite).
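
        A minimal sketch of the "append a number" idea follows. It is not
        the module's code: fix_dups_sketch is a hypothetical name, the hash
        reference $taken stands in for the check against existing files so
        that the example does not touch the disk, and the exact form of the
        appended number is an assumption. The sketch also assumes that the
        suffix is much shorter than "max_length".

```perl
use strict;
use warnings;

# $taken: hash ref of full names already in use (replaces a file test).
sub fix_dups_sketch {
    my ($base, $dirname, $suff, $taken, $max_length) = @_;
    $max_length = 255 unless defined $max_length;
    my $room = $max_length - length($suff);   # characters left for basename
    my $try  = substr($base, 0, $room);
    my $n    = 0;
    while ($taken->{"$dirname$try$suff"}) {
        my $num = ++$n;
        # Shorten the basename so that basename + number + suffix fits.
        $try = substr($base, 0, $room - length($num)) . $num;
    }
    $taken->{"$dirname$try$suff"} = 1;
    return "$dirname$try$suff";               # $dirname ends with "/"
}

my %taken;
print fix_dups_sketch('index', './', '.html', \%taken), "\n";
print fix_dups_sketch('index', './', '.html', \%taken), "\n";
print fix_dups_sketch('verylongname', './', '.txt',
                      { './verylong.txt' => 1 }, 12), "\n";
```

        The first two calls print "./index.html" and "./index1.html"; the
        third prints "./verylon1.txt", where the basename was shortened to
        keep the component within 12 characters.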

    name_found($url, $f, $dirname, $suff, $suggested, $type, $enc)
        The callback method to register the found name. Unless overridden,
        behaves as follows: if option "cache_name" is TRUE, stores the
        found name in the "known_names" hash. Otherwise just returns the
        found name.

  Helper methods
    fix_component($component, $isdir)
        Returns a suitably modified value of a path component of a filename.
        The non-overridden method unescapes embedded SPACE characters; it
        removes leading/trailing ones, and converts the rest to "_" unless
        the option "keep_space" is TRUE; removes trailing dots unless the
        option "keep_dots" is TRUE; translates to lowercase if the option
        "tolower" is TRUE; truncates to "max_length" if this option is set;
        and applies the eight_plus_three() method if the option '8+3' is
        set.
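
        The sequence of steps can be sketched as below. This is an
        illustration under assumptions, not the module's code: the name
        fix_component_sketch is hypothetical, an escaped space is assumed
        to look like the "protect_pref" value followed by "20" (e.g.
        "@20"), "keep_space" is assumed to switch off both the trimming and
        the conversion to "_", and the 8+3 step is left out.

```perl
use strict;
use warnings;

sub fix_component_sketch {
    my ($c, %opt) = @_;
    my $pref = defined $opt{protect_pref} ? $opt{protect_pref} : '@';
    $c =~ s/\Q$pref\E20/ /g;                 # assumed form of escaped space
    unless ($opt{keep_space}) {
        $c =~ s/^ +| +$//g;                  # drop leading/trailing spaces
        $c =~ tr/ /_/;                       # the rest become underscores
    }
    $c =~ s/\.+$// unless $opt{keep_dots};   # drop trailing dots
    $c = lc $c if $opt{tolower};
    $c = substr($c, 0, $opt{max_length}) if $opt{max_length};
    return $c;
}

print fix_component_sketch('@20My@20Photos..', max_length => 255), "\n";
print fix_component_sketch('Report@202005.', tolower => 1,
                           max_length => 6), "\n";
```

        The first call prints "My_Photos"; the second prints "report",
        since the lowercased "report_2005" is cut to 6 characters.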

    eight_plus_three($fname, $suffix)
        Returns the value of filename modified for filesystems with 8+3
        restriction on the filename (such as DOS). If $suffix is not given,
        calculates it from $fname; otherwise $suffix should include the
        leading dot, and $fname should have $suffix already removed. (Some
        parts of info may be moved between suffix and filename if judged

    url_takes_query($url [, $type, $encoding])
        This method returns TRUE if the *query* part of the URL is selecting
        a part of the resource (i.e., if it behaves as a *fragment* part,
        and it is the client which should process this part). Such URLs are
        detected by $type (should be in hash option "queryless_types"), or
        by extension of the last path component (should be in hash option

Net::ChooseFName::Failer class
    A class which behaves as Net::ChooseFName, but always returns "undef".
    For convenience, the constructor is duplicated as a class method
    failer() in the class Net::ChooseFName.

    None by default.

    Documentation keeps mentioning *"unless overridden"*... Of course it is a
    generic remark applicable to any method of any class; however, please
    remember that methods of this class are designed to be overridden.

    There is no protection against a wanted directory name being already
    taken by a file.

    There is no restriction on length of overall file name, only on length
    of a component name.




    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself, either Perl version 5.8.2 or, at
    your option, any later version of Perl 5 you may have available.
