# sorting just the largest values

#### Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

•  Subject
• Author
• Posted on
I have a very large list of data (up to 20,000 lines), and I just want
to print out the largest 50 values. What is the quickest way to sort
this using perl?

- Alex Hart

## Re: sorting just the largest values

Alex Hart wrote:
> I have a very large list of data (up to 20,000 lines), and I just want
> to print out the largest 50 values. What is the quickest way to sort
> this using perl?

my @largest = `sort -n yourfile | tail -50`;

John
--
use Perl;
program
fulfillment

## Re: sorting just the largest values

> I have a very large list of data (up to 20,000 lines), and I just want
> to print out the largest 50 values. What is the quickest way to sort
> this using perl?

Most people wouldn't consider 20,000 items to be a "very large list."
I'd suggest first just trying the brute-force approach of reading them
into an array, sorting them, and taking the last 50 results.  Only if
this turns out to be unacceptably slow or memory-consuming for your
application should you consider something different.  Remember that
perl's sort routines are implemented in optimized C code, so even if you
implement a lower-time-complexity algorithm yourself, the value of N for
which it becomes faster may be quite large.

If the brute-force approach turns out to be unacceptable (again, based on
actual measurement, not guesswork), here's a linear-time (O(N)) selection
algorithm:

1) Read the first 50 values into the top 50 list.

2) For each subsequent value, find the smallest value in the top 50 list
that's less than it.  If there isn't one, do nothing.  If there is, kick
it out of the list and add the new value to the list.

Step 2 will be simpler to program if you sort the list once after step 1.
You'll find shift() and splice() particularly helpful.

[UNTESTED]

my @top;
while (<>) {
push @top,\$_ if \$.<=50;
@top=sort @top if \$.=50;
if (\$.>50) {
for (my \$i=49;\$i>=0;--\$i) {
if (\$top[\$i] lt \$_) {
# this is where the new value belongs
splice @top,\$i,0,\$_; # insert the new value
shift @top; # remove the smallest value
last;
}
}
}
}

## Re: sorting just the largest values

> Most people wouldn't consider 20,000 items to be a "very large list."
> I'd suggest first just trying the brute-force approach of reading
them
> into an array, sorting them, and taking the last 50 results. Only if

I'm developing a real-time application, and a particular sort was
taking several minutes when the list got up to 20,000. It might be a
problem with the data being too ordered to start with (or does per
shuffle before it sorts?).  I already extract the sort key and just
sort on that. I can try the packed string sort-key, but I think even a
slow method of finding the top 50 must be faster than sorting the whole
list.

I'll try some of the suggestions here, and I'll be sure to benchmark
along the way.

## Re: sorting just the largest values

> > Most people wouldn't consider 20,000 items to be a "very large list."
> > I'd suggest first just trying the brute-force approach of reading
> them
> > into an array, sorting them, and taking the last 50 results. Only if
>
> I'm developing a real-time application, and a particular sort was
> taking several minutes when the list got up to 20,000.

That is definitely pathological.

> It might be a
> problem with the data being too ordered to start with (or does per
> shuffle before it sorts?).

Depends on the version.  Which version are you using?

> I'll try some of the suggestions here, and I'll be sure to benchmark
> along the way.

Unfortunately, benchmarking is extremely difficult in cases where there are
unknown and unpredictable pathological patterns in the data.  Be careful.

Xho

--
Usenet Newsgroup Service                        \$9.95/Month 30GB

## Re: sorting just the largest values

> > > Most people wouldn't consider 20,000 items to be a "very large list."
> > > I'd suggest first just trying the brute-force approach of reading
> > them
> > > into an array, sorting them, and taking the last 50 results. Only if
> >
> > I'm developing a real-time application, and a particular sort was
> > taking several minutes when the list got up to 20,000.
>
> That is definitely pathological.

I agree. There's some gross inefficiency going on.  We should see the
sort, it's certainly speed-up-able.

In general, the solution to the "bottom-n" problem is a heap.  In pseudo-
code (using a mythical module Heap, but there are real ones on CPAN):

use Heap;
my \$h = Heap->new();
my \$n = 50;
for ( @data ) {
\$h->insert( \$_);
\$h->extract_maximum if \$h->size > \$n;
}
my @top_n = map \$h->extract_maximum, 1 .. \$h->size;

If the heap implementation doesn't have a ->size method, it is easy to

If \$n is larger than the data size, the algorithm does a heap sort on
the data.  If \$n is smaller, it is faster than the equivalent sort.

Anno

## Re: sorting just the largest values

>...
>In general, the solution to the "bottom-n" problem is a heap.  In pseudo-
>code (using a mythical module Heap, but there are real ones on CPAN):
>
>    use Heap;
>    my \$h = Heap->new();
>    my \$n = 50;
>    for ( @data ) {
>        \$h->insert( \$_);
>        \$h->extract_maximum if \$h->size > \$n;
>    }
>    my @top_n = map \$h->extract_maximum, 1 .. \$h->size;
>
>If the heap implementation doesn't have a ->size method, it is easy to
>
>If \$n is larger than the data size, the algorithm does a heap sort on
>the data.  If \$n is smaller, it is faster than the equivalent sort.
>
>Anno

to use one of the *newer* (than heapsort) priority-queue methods,
eg splay trees and the like?

(Or maybe no real difference?)

David

## Re: sorting just the largest values

> I have a very large list of data (up to 20,000 lines), and I just want
> to print out the largest 50 values. What is the quickest way to sort
> this using perl?

For very large lists of data: while reading in the lines, keep a
sorted array with the up to 50 largest seen values.

But 20000 is not very large, or even large. Read the whole file and
sort it.

If you say "sort" to search.cpan.org, it will show you some modules
that are likely to be useful.

## Re: sorting just the largest values

Alex Hart wrote:

> I have a very large list of data (up to 20,000 lines), and I just want
> to print out the largest 50 values. What is the quickest way to sort
> this using perl?
>