Regex and chemistry


I'd like to parse equilibria and linear dependencies of concentrations as follows.

An equilibrium has three parts:
1. Reactants
2. Products
3. Equilibrium constant.

1,2: Reactants and products are separated by /\s+[^\s]*>[^\s]*\s+/.
3: The equilibrium constant is in parentheses, with an optional value.
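
As an illustration of the separator rule above, here is a minimal sketch that applies the same pattern to three arrow styles. The helper name split_sides is hypothetical and only serves the example.

```perl
use strict;
use warnings;

# Separator from the post: whitespace, a non-space run containing '>', whitespace.
my $arrow = qr/\s+[^\s]*>[^\s]*\s+/;

# Hypothetical helper: return the text on either side of the arrow.
sub split_sides {
    my ($line) = @_;
    return split $arrow, $line;
}

for my $line ('2A + 3B <=> C', 'H2 + 1/2O2 => H2O', 'EP+ <-> E + P+') {
    my @sides = split_sides($line);
    print "$line: reactants '$sides[0]', products '$sides[1]'\n";
}
```

Because the pattern only requires a '>' somewhere in the arrow token, '<=>', '=>' and '<->' are all accepted, but so is any other token containing '>'.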

Individual items in reactants and products are separated by /\s+\+\s+/.
Each item has a coefficient (decimal, integer or fraction (1/2)) and an identifier.
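
The coefficient rule can be sketched in isolation as below. The helper parse_item and its pattern are assumptions made for illustration. They are stricter than the character class [\d\.]* used in the code further down, and a missing coefficient is taken to mean 1.

```perl
use strict;
use warnings;

# Hypothetical helper: split one item such as "1/2O2" into (coefficient, identifier).
sub parse_item {
    my ($item) = @_;
    $item =~ m{^\s*(\d*\.?\d*)(?:/(\d+\.?\d*))?\s*(\S.*?)\s*$}
        or die "cannot parse item '$item'\n";
    my ($num, $den, $ident) = ($1, $2, $3);
    my $coeff = length $num ? $num : 1;    # a missing coefficient means 1
    $coeff /= $den if defined $den;        # fraction notation such as 1/2
    return ($coeff + 0, $ident);
}

for my $item ('2A', '1/2O2', '0.5C', 'H2O') {
    my ($coeff, $ident) = parse_item($item);
    print "$item: coefficient $coeff, identifier $ident\n";
}
```

Testing length($num) rather than the truth of $num keeps a literal 0 coefficient from being silently replaced by 1.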

The result is the equilibrium constant with its optional value, then the
reactants with coefficients*-1 and the products with coefficients, like this:

2A + 3B <=> C (K=1.02e-2)
parses to: ['K=1.02e-2', [-2, 'A'], [-3, 'B'], [1, 'C']]

 (Kox) H2 + 1/2O2 => H2O
parses to: ['Kox', [-1, 'H2'], [-0.5, 'O2'], [1, 'H2O']]

EP+ <-> E + P+ (Kdiss=1.06e-7)
parses to: ['Kdiss=1.06e-7', [-1,'EP+'], [1, 'E'], [1, 'P+']]

A linear dependency has two or three parts divided by /=/.
If three parts: one of the parts is a number (any notation).
Of the two additional parts one is identifier and the other is a linear
combination (integer or decimal notation) of concentrations.
Concentration is noted as [_ident_].
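
A linear combination in this notation could be walked term by term as in the sketch below. The helper parse_terms is hypothetical, and the pattern is an assumption derived from the description above. It is lenient in that it does not insist on a sign between consecutive terms.

```perl
use strict;
use warnings;

# Hypothetical helper: list the [coefficient, identifier] pairs of a linear combination.
sub parse_terms {
    my ($expr) = @_;
    my @terms;
    # \G with /gc continues each match exactly where the previous one stopped.
    while ($expr =~ m{\G\s*([+-]?)\s*(?:(\d+\.?\d*|\.\d+)\s*\*\s*)?\[\s*(.+?)\s*\]}gc) {
        my ($sign, $coeff, $ident) = ($1, $2, $3);
        $coeff = 1 unless defined $coeff;           # "[A]" means 1*[A]
        push @terms, [ $coeff * ($sign eq '-' ? -1 : 1), $ident ];
    }
    # Anything left over after the last term is treated as a syntax error.
    $expr =~ m{\G\s*$}gc or die "unparsed text in '$expr'\n";
    return @terms;
}

for my $term (parse_terms('[Na+] - 2 * [SO4 2-]')) {
    print "coefficient $term->[0], concentration of $term->[1]\n";
}
```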

The result is the identifier with its optional value, then the concentrations with coefficients.

CAtot = 2*[  CA2  ] + [CAKE]=1e-6
Parses to: ['CAtot=1e-6', ['2', 'CA2'], ['1', 'CAKE']]

 charge=0=[Na+] - 2 * [SO4 2-]
Parses to: ['charge=0', ['1', 'Na+'], ['-2', 'SO4 2-']]

[A] + [B] + 0.5*[C] = tot
Parses to: ['tot', ['1', 'A'], ['1', 'B'], ['0.5', 'C']]
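
One possible way to tell the parts apart, sketched under assumptions: the helper classify_parts is hypothetical, and it uses looks_like_number from the core module Scalar::Util instead of a hand-written number pattern. That function accepts a somewhat wider set of strings than plain decimal or exponent notation (for example a leading sign, or "Inf" and "NaN"), so an identifier with such a name would be misclassified.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: split a line on '=' and label each part by its role.
sub classify_parts {
    my ($line) = @_;
    my %role;
    for my $part (split /=/, $line) {
        $part =~ s/^\s+|\s+$//g;                  # trim surrounding whitespace
        if    ($part =~ /\[/)                { $role{combination} = $part }
        elsif (looks_like_number($part))     { $role{value}       = $part }
        else                                 { $role{identifier}  = $part }
    }
    return \%role;
}

my $parts = classify_parts(' charge=0=[Na+] - 2 * [SO4 2-]');
print "$_: $parts->{$_}\n" for sort keys %$parts;
```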

Below is some code that croaks on some input errors and otherwise parses
the strings. Would you please be kind enough to comment on it and propose some
improvements? The croaks should give some meaningful hints to the user, but
that is left out for now...
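
As a rough idea of what such hints could look like, here is a minimal sketch. The helper split_equilibrium and the wording of the message are assumptions, not part of the code below.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical wrapper: split an equilibrium line and say what was wrong with it.
sub split_equilibrium {
    my ($line) = @_;
    my @sides = split /\s+[^\s]*>[^\s]*\s+/, $line;
    croak "Expected exactly one arrow (e.g. '<=>' or '=>') surrounded by spaces in '$line'"
        unless @sides == 2;
    return @sides;
}

# eval traps the exception so the caller can show the hint instead of dying.
my @sides = eval { split_equilibrium('A + B') };
print "Error: $@" if $@;
```

Passing the offending input back in the message lets the user see which line failed without reading the parser.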


use strict;
use warnings;
use Carp;

my @test=(
"2A + 3B <=> C (K=1.02e-2)",
" (Kox) H2 + 1/2O2 => H2O",
"EP+ <-> E + P+ (Kdiss=1.06e-7)"
);
for (@test) {
 my $equi=ParseEqui($_);
 print "$_\nparses to: ";
 Dumpit($equi);
}
print "-" x 80,"\n";

my @test2=(
"CAtot = 2*[CA2] + [CAKE]=1e-6",
" charge=0=[Na+] - 2 * [SO4 2-]",
"[A] + [B] + 0.5*[C] = tot "
);
for (@test2) {
 my $tot=ParseTot($_);
 print "$_\nParses to: ";
 Dumpit($tot);
}

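sub ParseEqui {
 my ($in)=@_;
 # Split into reactants and products on the arrow (any token containing '>').
 croak unless (my @bits=split /\s+[^\s]*>[^\s]*\s+/,$in)==2;
 # Cut the parenthesised equilibrium constant out of whichever side holds it.
 $bits[0]=~s/(^|\s)\(([^\)]+)\)(\s|$)// ||
 $bits[1]=~s/(^|\s)\(([^\)]+)\)(\s|$)// or croak;
 my $equi=[$2];
 for my $lr (0,1) {  #left or right?
  for (split /\s+\+\s+/,$bits[$lr]) {
   # $1 numerator, $3 optional denominator, $4 identifier
   m/^\s*([\d\.]*)(\/([\d\.]*)|)\s*(.+?)\s*$/ or croak;
   my $coeff=$1?$2?$1/$3:$1:1;
   push @$equi,[$lr?$coeff:-$coeff,$4];
  }
 }
 return $equi;
}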

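sub ParseTot {
 my ($in)=@_;
 $in=~s/^\s+|\s+$//g;  # trim, so the identifier carries no stray blanks
 my @bits=split /\s*=\s*/,$in;
 croak unless @bits==2 || @bits==3;
 my $num;
 if (@bits==3) {
  # Find the part that is a plain number and pull it out.
  my $i=0;
  while ($bits[$i]!~/^\s*([\d\.]+([eE](\+|-|)\d+)?)\s*$/) {
   $i++;
   croak if $i==3;
  }
  $num=splice @bits,$i,1;
 }
 # Put the identifier first and the linear combination second.
 @bits=reverse @bits if $bits[0]=~/\[/ && $bits[1]!~/\[/;
 my $tot=[defined $num?"$bits[0]=$num":$bits[0]];
 $bits[1]="+ $bits[1]";
 # Consume one "sign coeff*[ident]" term per pass; this pattern is an
 # assumption based on the examples above. $1 sign, $3 coefficient, $4 identifier.
 push @$tot,[0+($2?$1.$3:"${1}1"),$4] while
  $bits[1]=~s/^\s*([+-])\s*(([\d\.]+)\s*\*\s*)?\[\s*(.+?)\s*\]//;
 croak if $bits[1]=~/[^\s]/;
 return $tot;
}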

sub Dumpit {
 my $in=shift;
 print "[";
 for (@$in) {
  if (ref($_)) {
   # nested [coefficient, identifier] pair
   print "['",join("',\t'",@$_),"'],\t";
  } else {
   print "'$_',\t";
  }
 }
 print "]\n\n";
}
