preg_split problem

Hi, all. I have a problem, and I'd really appreciate it if someone could
help me solve it.

The problem is that I want to take the first two sentences of a string.
To do that I need to split the string wherever a dot occurs and then join
the first two array elements into a new string. But in Croatian the dot
is not always a sentence delimiter; it is often used in conjunction with
numbers and acronyms. So I wanted to use a regular expression that splits
the string on every dot, except when the dot is preceded by a number, a
'd' or an 'o'.
This is my best shot at it:

$string='Glavna skupština Društva će se održati 27.12.2007. (četvrtak) u  
11 sati u prostorijama Doma hrvatske vojske u Lori u Splitu.Atlas  
turistička agencija d.d. stekla je 22. i 23. studenog 2007. godine 2800  
vlastitih dionica.';
$uvod=preg_split('/((d\.o\.o\.)!|(d\.d\.)!|[0-9]!)|\./', $string);

But it doesn't work right. If someone knows how to solve this problem,
any help would be really appreciated.
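
A likely reason the pattern above misbehaves: in PCRE, `!` is a literal character unless it appears inside a lookaround such as `(?!...)` or `(?<!...)`. The three alternatives ending in `!` can therefore only match text that contains an exclamation mark, and every plain dot still falls through to the final `\.` branch. A minimal sketch, using a shortened ASCII excerpt of the sample string (the variable names are illustrative only):

```php
<?php
// Pattern exactly as posted; '!' here is a literal exclamation mark, not "not".
$pattern = '/((d\.o\.o\.)!|(d\.d\.)!|[0-9]!)|\./';

// Shortened excerpt of the sample string (ASCII only, for brevity).
$excerpt = 'Atlas d.d. stekla je 22. i 23. studenog';

// No '!' occurs in the text, so only the trailing \. alternative can match
// and the string is cut at every single dot.
$pieces = preg_split($pattern, $excerpt);

print_r($pieces);
```

The result has five pieces: both the acronym `d.d.` and the ordinal numbers are cut apart, which is the behaviour the post is trying to avoid.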



Re: preg_split problem

taps128 wrote:
Well, I've made some progress.

$uvod=preg_split('(\D[^dDoO]\.\s)', $string );

I used this regex. It splits the string OK, but the two characters just
before the dot are missing from the split string.

From 'Lori u Splitu' the last letters 'tu' are gone.
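
The missing letters follow from how preg_split() works: everything the pattern matches is treated as the delimiter and discarded, and here `\D[^dDoO]` matches the two characters in front of the dot. One way around this, sketched below on a made-up ASCII sample (not the string from the post), is to move those checks into a lookbehind so they are tested but not consumed:

```php
<?php
// Hypothetical ASCII-only sample with a space after the sentence-ending dot.
$sample = 'Skup je u Splitu. Atlas d.d. stekla je dionice.';

// The pattern from the post (written with / delimiters): the two characters
// before the dot belong to the match, so preg_split() throws them away.
$consumed = preg_split('/\D[^dDoO]\.\s/', $sample);

// Same checks inside a lookbehind: they are tested but not consumed,
// so only the dot and the whitespace are removed.
$kept = preg_split('/(?<=\D[^dDoO])\.\s/', $sample);

print_r($consumed); // first piece ends in "Spli"
print_r($kept);     // first piece ends in "Splitu"
```

The lookbehind works here because it has a fixed length of two characters, which PCRE requires.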

Re: preg_split problem


Well, your main problem here is deciding WHEN a dot ends a sentence.
That is not a simple task without lists of known acronyms. Also, a dot
after a number can end a sentence: "The coldest winter I remember was in
1985. The temperature the day my sister was born dropped as low as
-21C." How do you propose this is handled? Formulating the exact
requirements before writing the regex is more than half the work.

A start for you (by no means complete):
A sentence is ended by a dot:
- followed by either $ (in which case we don't need a regex, as the
string ends there anyway), or:
- at least one whitespace character (\s) (well, there should be one, damn
those kids nowadays), followed by a capital letter (\p{Lu}, use UTF-8
mode).

That would be:
...which doesn't split your string anywhere, as my rules for
'sentence-ending' seem to be inadequate for your string, or no sentence
is ended.
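
One possible reading of the rule above, offered as a sketch and not as the pattern originally posted: a dot, one or more whitespace characters, and a lookahead for an uppercase letter, with the `u` modifier for UTF-8. The pattern and variable names are assumptions. The Croatian string is written on one line exactly as shown in the first post (with no space in `Splitu.Atlas`), and the English string is the example from this reply.

```php
<?php
// Assumed reading of the rule: a dot, at least one whitespace character,
// then an uppercase letter (checked with a lookahead so it is not consumed).
$rule = '/\.\s+(?=\p{Lu})/u';

$croatian = 'Glavna skupština Društva će se održati 27.12.2007. (četvrtak) u '
    . '11 sati u prostorijama Doma hrvatske vojske u Lori u Splitu.Atlas '
    . 'turistička agencija d.d. stekla je 22. i 23. studenog 2007. godine 2800 '
    . 'vlastitih dionica.';

$english = 'The coldest winter I remember was in 1985. The temperature the '
    . 'day my sister was born dropped as low as -21C.';

$croatianParts = preg_split($rule, $croatian);
$englishParts  = preg_split($rule, $english);

var_dump(count($croatianParts)); // a single piece: no dot qualifies
print_r($englishParts);          // two pieces, cut after "1985"
```

Under this reading the Croatian string stays in one piece, because no dot in it is followed by whitespace and then a capital letter, while the English example is cut after "1985".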

I think this will require some hefty '(not) preceded/followed by'
operators (lookbehind and lookahead assertions), as you'd like the
matched text to stay in the split results. Even then, a 100% success rate
will most definitely be out of the question.
Rik Wasmus
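
As a rough illustration of the lookaround idea, the sketch below combines the original poster's exclusions (no split after a digit, 'd' or 'o') with the capital-letter rule. The helper name `firstSentences` and the exact pattern are assumptions made for this example, not something posted in the thread.

```php
<?php
/**
 * Hypothetical helper: return the first $n sentences of $text.
 * Assumed heuristic: a sentence ends at a dot that is not preceded by a
 * digit or by d/D/o/O and is followed by optional whitespace and an
 * uppercase letter.
 */
function firstSentences(string $text, int $n = 2): string
{
    // Everything except the whitespace sits in lookarounds, so the dot and
    // the letters around it stay in the pieces.
    $boundary = '/(?<![0-9dDoO]\.)(?<=\.)\s*(?=\p{Lu})/u';
    $sentences = preg_split($boundary, $text);

    return implode(' ', array_slice($sentences, 0, $n));
}

$string = 'Glavna skupština Društva će se održati 27.12.2007. (četvrtak) u '
    . '11 sati u prostorijama Doma hrvatske vojske u Lori u Splitu.Atlas '
    . 'turistička agencija d.d. stekla je 22. i 23. studenog 2007. godine 2800 '
    . 'vlastitih dionica.';

echo firstSentences($string, 1), "\n"; // ends with "Splitu."
echo firstSentences($string, 2), "\n"; // both sentences, joined by a space
```

This is a heuristic with known gaps: a sentence that really ends in a number ("... in 1985. The ...") or in a word ending in 'd' or 'o' is not split, so a list of known acronyms would still be needed for better accuracy.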

Re: preg_split problem

Thanks, that was what I feared.
