potential changes to Locale-PO

I have been using Locale-PO-0.16 for two scripts that are now in
the source tree of ELinks <http://elinks.cz/>.  Along the way, I
have patched PO.pm to add new features and fix bugs, and mailed
the patches to the maintainer Alan Schwartz.

New features:
- Locale::PO supports obsolete entries.
- The PO file parser no longer requires newlines between entries.
- The PO file parser tries to preserve even semantically
  insignificant newlines in strings.
- The PO file parser remembers the line number where each msgid
  or msgstr begins.
- The save_file method returns undef and remembers $! if print fails.

Bug fixes:
- Locale::PO preserves the complete set of flags in each entry,
  even those flags that it does not directly support.
- The PO file parser compares names of flags exactly and
  case-sensitively, like GNU Gettext does.  It no longer
  truncates e.g. objc-format to c-format.
- The php-format flag is now tristate, like c-format.
- The PO file parser binds $/ and $_ dynamically, thus insulating
  itself from the caller.
- The dump method dumps comments even if they are eq "0".
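
To illustrate the flag fixes, here is a minimal sketch of exact,
case-sensitive flag matching.  The helpers parse_flags and has_flag
are hypothetical names made up for this example; they are not taken
from PO.pm, and the patched parser may well do this differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not from PO.pm): split a "#," flag comment
# into the complete list of flag names, known or not.
sub parse_flags {
    my ($line) = @_;
    $line =~ s/^#,\s*//;    # drop the "#," prefix
    $line =~ s/\s+$//;      # drop trailing whitespace and newline
    return grep { length } split /\s*,\s*/, $line;
}

# Hypothetical helper: exact, case-sensitive membership test.
sub has_flag {
    my ($flags, $name) = @_;
    return scalar(grep { $_ eq $name } @$flags);
}

my $line  = "#, fuzzy, objc-format, x-custom\n";
my @flags = parse_flags($line);

# A substring match wrongly finds c-format inside objc-format.
my $substring_hit = ($line =~ /c-format/) ? 1 : 0;

# An exact comparison does not.
my $exact_hit = has_flag(\@flags, "c-format") ? 1 : 0;
```

Keeping the whole list around, rather than one boolean per known
flag, is what lets an unknown flag such as x-custom survive a
load and save.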

Documentation changes:
- Copied the copyright notice from README to PO.pm itself.
- Documented quoting and newlines in strings passed to/from methods.
- Documented the php_format, load_file, and save_file methods.
- Documented error handling in load_file_asarray and load_file_ashash.
- Documented the bugs that I know of.
- Separated getter and setter synopses from each other.  Also,
  repeated the synopsis above the description of each method.

Other changes:
- "use fields" and "my Locale::PO".
- Renamed normalize_str to _normalize_str, and dump_multi_comment
  to _dump_multi_comment.
- Locale::PO objects store the flags in a different format.
- Flag-setting functions silently map unsupported values
  (e.g. 42) to supported ones (e.g. 1), which they also return.
- It is possible to get a Locale::PO object without a msgid by
  loading an invalid PO file.  Writing such an entry back out
  does not generate a msgid, either.
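
A rough sketch of what a tristate flag setter that maps unsupported
values could look like.  The undef/0/1 encoding, the plain hash, and
the names normalize_tristate, c_format, and flag_text are assumptions
for this example only, not the actual code or storage format in PO.pm.

```perl
use strict;
use warnings;

# Assumed encoding: undef = flag absent, 0 = "no-c-format",
# 1 = "c-format".  Any other defined value is mapped to 0 or 1.
sub normalize_tristate {
    my ($value) = @_;
    return undef unless defined $value;
    return $value ? 1 : 0;
}

# Hypothetical combined getter/setter on a plain hash.  When called
# with a value, it stores the normalized value and returns it.
sub c_format {
    my $entry = shift;
    $entry->{c_format} = normalize_tristate(shift) if @_;
    return $entry->{c_format};
}

# Render the flag the way it would appear after "#," in a PO file.
sub flag_text {
    my ($entry) = @_;
    my $v = $entry->{c_format};
    return "" unless defined $v;
    return $v ? "c-format" : "no-c-format";
}

my %entry;
my $returned = c_format(\%entry, 42);    # unsupported value
my $text     = flag_text(\%entry);
```

Here the caller passes 42 and gets back the supported value that was
actually stored, so it can notice the mapping if it cares.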

I have not yet updated the tests, primarily because I've been
working in the ELinks source tree and importing the tests there
did not seem right.  (ELinks uses Locale::PO at build time only;
it doesn't install the patched version to the user's system.)
I intend to rectify this after I install some scripts to help
propagate changes between version control systems.

Now, Alan Schwartz has suggested that I take over the Locale-PO
module.  I am afraid of doing that: I don't know how long my
interest in this module will last, and I don't want to become
trapped in supporting it.  I don't currently have a PAUSE
account, either.  So, I'd like to know the c.l.p.m opinions on
such a change.

In any case, whether I become the maintainer or just submit
patches, I think it would be good to get in touch with the users
of the module, so that I could be sure that the changes are going
in the right direction and don't gratuitously break people's
programs.  Specifically:

- How important is it to run fast and use little memory?

- Is it necessary to support anything older than Perl 5.6.0?

- If a malformed PO file is being loaded, do you want warnings
  during the load, afterwards, or not at all?

- Do you access the hash of Locale::PO directly?

- Do you define subclasses of Locale::PO?

- Do you define any variables as 'my Locale::PO $foo' or check that
  '$foo->isa("Locale::PO")' or that 'ref($foo) eq "Locale::PO"'?

- The msgstr_n method returns a reference to a hash.  Do you
  modify that hash?  If so, do you expect the modifications
  to affect the Locale::PO object?

- The msgid, msgid_plural, msgstr, and msgstr_n accessor methods
  return strings in one format and want new strings in a
  different format.  I'd like to straighten this out so that the
  same format can be used in both directions.  Also, I'd like to
  make it possible and hopefully even easy to get the string with
  all \n etc. backslash sequences expanded out.  How should these
  things be done in a compatible way?

  (a) Keep the inconsistency:

  (b) New methods for different formats:

  (c) First arg is a hash of options:
      $po->msgid(, $po->msgid())
      This would require extra trickery with msgstr_n, which
      already takes a hash; and it might be too easy to mistype
      an option.

- PO files normally declare their charset.  In Unicode-capable
  Perls, it should be easy for users of Locale::PO to get the
  strings converted to Perl's internal Unicode representation.
  This applies both to the actual strings and to any comments.
  However, for the sake of applications that don't call
  bindtextdomain(), Locale::PO should preserve the exact bytes
  (including redundant shift sequences) as far as possible.
  Note also that some encodings can use the backslash ASCII code
  0x5C as part of a multibyte character, which may affect the
  quote and dequote methods.  How should the Unicode strings be
  handled?

  (a) Each Locale::PO object holds the byte strings and the
      name of the charset.  Methods convert from/to Unicode
      when necessary.  There are two methods for changing the
      charset: one preserves the byte strings, and the other
      recodes them.
      Con: If you build a Locale::PO object from scratch (as
           opposed to loading it from a file), you need to select
           the charset before you set any Unicode strings.
      Con: If you change the charset in the Content-Type of the
           header entry and then save_file_fromarray, the other
           entries will keep their previous encoding.  To avoid
           that, one must loop over the entries and change the
           charset of each.

  (b) Each Locale::PO object holds a mix of byte strings and
      Unicode strings and remembers which is which (or perhaps it
      just tests with Encode::is_utf8).  It also holds the name
      of the charset of the byte strings.  Loading from a file
      stores byte strings and copies the name of the charset from
      the header entry.  Saving a string to a file recodes it to
      match the charset listed in the Content-Type of the header
      entry, unless it is a byte string and the charset is
      already correct.
      Pro: If one changes neither the Content-Type nor the
           strings, then loading a file and writing it back out
           does not alter the bytes.
      Pro: To recode the file, one only has to change the
           Content-Type; one need not know all the fields of
           Locale::PO that may hold strings.
      ???: To change the Content-Type without recoding the
           strings, one must loop over the entries and change
           their stored charset to match the Content-Type.
           But this is unusual to want, so it's OK if it is

  (c) As in (b) but entries hold a pointer to the header entry,
      instead of the name of the charset.  That way, they can
      also access Plural-Forms and whatever.
      Con: Too easy to get miscoded strings by changing the

  (d) Locale::PO objects hold byte strings only.  There are
      separate methods for converting strings to/from Unicode.
      Con: Recoding the whole file becomes difficult.

- Should changing msgid and msgstr wipe out the saved line numbers?
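
To make the 0x5C concern from the charset question concrete, here is
a small sketch using the core Encode module.  The expand_escapes
helper is hypothetical and handles only a few escapes; it is not the
dequote method of Locale::PO.  In Shift_JIS, U+8868 is encoded as the
bytes 0x95 0x5C, so a byte-level pass sees a backslash that is not
really there.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not Locale::PO's dequote): expand a few
# C-style backslash sequences.
my %ESCAPE = (n => "\n", t => "\t", r => "\r", '\\' => '\\', '"' => '"');

sub expand_escapes {
    my ($str) = @_;
    $str =~ s/\\([ntr\\"])/$ESCAPE{$1}/g;
    return $str;
}

# U+8868 followed by the letter "n", as Shift_JIS bytes: 95 5C 6E.
my $bytes = encode("shiftjis", "\x{8868}") . "n";

# Byte-level expansion mistakes 5C 6E for a \n escape and
# corrupts the multibyte character.
my $naive = expand_escapes($bytes);

# Decoding to characters first leaves the string intact, because
# the decoded text contains no backslash at all.
my $safe = encode("shiftjis", expand_escapes(decode("shiftjis", $bytes)));
```

This suggests that whichever option is chosen, quoting and dequoting
would have to work on decoded characters, or at least know the
charset, for such encodings.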

Interested persons may find a patched version as po/perl/Locale/PO.pm
in the ELinks GIT repository.  An even newer version is temporarily
at <http://www.iki.fi/kon/tmp/PO.pm> until the end of March 2006.
I am not yet the maintainer of Locale-PO, this is not a formal
release, and future releases may be incompatible with these versions.