# find the longest common substring among 10 thousand strings

#### Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

•  Subject
• Author
• Posted on

Hi, here's my problem.

I have 10k strings, each between 40 and 250 chars long, and I need to
find the longest common substrings and the respective number of
occurences among those 10k strings.

I have easily solved a simpler problem which is to assume that certain
characters (,,/,:,..) can be taken as token separators. This way I was
able to count the popularity of tokens. I would like to solve the
problem where absolutely all the character are same-class citizens and
can contribute to the longest string...

ideas?

Thanks

fillmore

## Re: find the longest common substring among 10 thousand strings

Old problem.
The longest common substring was already described some 25-30 years ago
in e.g. "Introductions to Algorithms" by Cormen, Leiserson, and Rivest.

However implementation is not necessarily easy and for 10k strings you
can expect a long runtime and heavy memory consumption,

jue

## Re: find the longest common substring among 10 thousand strings

On 09/01/2015 11:00 PM, Jürgen Exner wrote:

yeah. I should have mentioned that I was already aware of ways to find
the longest substring given two strings, but what I was really hoping
for is that someone had a way to do the job that was faster than 10k x
10k runs of an expensive algorithm :(

## Re: find the longest common substring among 10 thousand strings

On 02.09.15 05:31, Fillmore wrote:

In times of big money given to those analyzing Big Data
from Twitter, there should be an incentive for finding clever
solutions, but also lack of generosity in making any of these
freely available, I'd suppose.
OTOH, analysis of series of similar data has generated entire
database solutions, such as Kx. Although, this one originated
in the financial and insurance markets, I think. Based on ideas
from APL.

## Re: find the longest common substring among 10 thousand strings

Well, I think my gut is trying to tell me that if you find a solution,
then you can also write it up as a PhD thesis.

jue

## Re: find the longest common substring among 10 thousand strings

On 2/9/2015 05:32, Fillmore wrote:

In a single string there are  (n(n+1))/2  possible substrings
are you sure for what you are asking ?

## Re: find the longest common substring among 10 thousand strings

On 09/02/2015 03:04 AM, George Mpouras wrote:>
>
> In a single string there are  (n(n+1))/2  possible substrings
> are you sure for what you are asking ?

yes. I also strongly suspected this problem was hard. I was hoping
someone could suggest a novel approach to solving this problem before
I'm a pensionist :)

as I wrote, relying on certain chars to represent word boundaries has
solved a sub-problem the solution to which can be taken as a decent
approximation of the more general problem...

Thank you anyway

## Re: find the longest common substring among 10 thousand strings

On 2/9/2015 05:32, Fillmore wrote:

# good luck with that !

#!/usr/bin/perl
use strict; use warnings; use feature 'say';
my %db;
my \$string;
my \$substring;

while (my \$line = <DATA>) {
chomp \$line;
my %linenum;

for (my \$i=0; \$i<length \$line; \$i++) {
\$string = substr \$line, \$i;

for (my \$j=1; \$j<=length \$string; \$j++) {
my \$substring = substr \$string, 0, \$j;
\$db->++;

unless (exists \$linenum->) {
\$linenum->=1;
\$db->++
}
}
}
}

for (sort{length \$b <=> length \$a} keys %db) {
say "substring : \$_";
say "length    : ", length \$_;
say "instances : \$db->";
say "in lines  : \$db->";
say '-'x20
}

__DATA__
aaabbbb
aawerwr

## Re: find the longest common substring among 10 thousand strings

I'll give this a spin in the morning, but George I commend your perl
skills and your willingness to help. Thank you!

Luca

On 09/02/2015 07:36 PM, George Mpouras wrote:

## Re: find the longest common substring among 10 thousand strings

On 3/9/2015 06:37, Fillmore wrote:

Please do NOT it in any kind of production with all these strings you
mention.

It will consume all computer memory and kernel will start killing random
processes

## Re: find the longest common substring among 10 thousand strings

On 09/03/2015 02:30 AM, George Mpouras wrote:

no worries. In reality I just need to provide a high-level spec for a
project. Perl helps me bring me ahead with work and perform some
preventive action against those who say "it can't be done" or "it will
take a week of work" :)

Thanks again

## Re: find the longest common substring among 10 thousand strings

George, you are a rockstar. I know this probably won't help, my I'll
give it a spin in the morning. Thank you for your willingness to help.
It is very appreciated

On 09/02/2015 07:36 PM, George Mpouras wrote: