
## Re: sorting just the largest values

> I have a very large list of data (up to 20,000 lines), and I just want

> to print out the largest 50 values. What is the quickest way to sort

> this using perl?

Most people wouldn't consider 20,000 items to be a "very large list."

I'd suggest first just trying the brute-force approach of reading them

into an array, sorting them, and taking the last 50 results. Only if

this turns out to be unacceptably slow or memory-consuming for your

application should you consider something different. Remember that

perl's sort routines are implemented in optimized C code, so even if you

implement a lower-time-complexity algorithm yourself, the value of N for

which it becomes faster may be quite large.
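A minimal sketch of that brute-force approach (assuming numeric data; `@values` and `$n` are illustrative stand-ins):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Brute force: numeric-sort everything, take the last $n.
my @values = (17, 3, 99, 42, 8);   # stand-in for the 20,000 values
my $n      = 3;                    # 50 in the real case
my @sorted = sort { $a <=> $b } @values;
my @top    = @sorted[-$n .. -1];   # the $n largest, ascending
print "@top\n";                    # prints "17 42 99"
```

For 20,000 items this is a single optimized C-level sort; measure it before reaching for anything cleverer.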

If the brute-force approach turns out to be unacceptable (again, based on

actual measurement, not guesswork), here's a linear-time (O(N)) selection

algorithm:

1) Read the first 50 values into the top 50 list.

2) For each subsequent value, find the smallest value in the top 50 list

that's less than it. If there isn't one, do nothing. If there is, kick

it out of the list and add the new value to the list.

Step 2 will be simpler to program if you sort the list once after step 1.

You'll find shift() and splice() particularly helpful.

[UNTESTED]

```perl
my @top;
while (<>) {
    chomp;                                       # assumes one numeric value per line
    push @top, $_ if $. <= 50;
    @top = sort { $a <=> $b } @top if $. == 50;  # note: ==, not =
    if ($. > 50) {
        for (my $i = 49; $i >= 0; --$i) {
            if ($top[$i] < $_) {
                # this is where the new value belongs
                splice @top, $i + 1, 0, $_;      # insert just above the smaller value
                shift @top;                      # remove the smallest value
                last;
            }
        }
    }
}
@top = sort { $a <=> $b } @top if $. < 50;       # handle inputs shorter than 50 lines
print "$_\n" for @top;
```

## Re: sorting just the largest values

> Most people wouldn't consider 20,000 items to be a "very large list."

> I'd suggest first just trying the brute-force approach of reading

them

> into an array, sorting them, and taking the last 50 results. Only if

I'm developing a real-time application, and a particular sort was

taking several minutes when the list got up to 20,000. It might be a

problem with the data being too ordered to start with (or does perl

shuffle before it sorts?). I already extract the sort key and just

sort on that. I can try the packed string sort-key, but I think even a

slow method of finding the top 50 must be faster than sorting the whole

list.

I'll try some of the suggestions here, and I'll be sure to benchmark

along the way.


## Re: sorting just the largest values

> > I'd suggest first just trying the brute-force approach of reading

> them

> > into an array, sorting them, and taking the last 50 results. Only if

>

> I'm developing a real-time application, and a particular sort was

> taking several minutes when the list got up to 20,000.

That is definitely pathological.

> It might be a

> problem with the data being too ordered to start with (or does per

> shuffle before it sorts?).

Depends on the version. Which version are you using?

> I'll try some of the suggestions here, and I'll be sure to benchmark

> along the way.

Unfortunately, benchmarking is extremely difficult in cases where there are

unknown and unpredictable pathological patterns in the data. Be careful.
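For what it's worth, Perl's core Benchmark module makes the comparison itself easy; the hard part is feeding it representative data. A sketch (the random `@data` and the `scan` routine are illustrative, not the OP's actual workload):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Compare a full sort against a scan that keeps a running top-50.
# Data pattern matters: also try ascending and nearly-sorted inputs.
my @data = map { int(rand(1_000_000)) } 1 .. 20_000;

cmpthese(-2, {    # run each candidate for at least 2 CPU seconds
    full_sort => sub {
        my @top = (sort { $a <=> $b } @data)[-50 .. -1];
    },
    scan => sub {
        my @top = sort { $a <=> $b } @data[0 .. 49];
        for my $v (@data[50 .. $#data]) {
            next if $v <= $top[0];      # not in the top 50, skip
            shift @top;                 # evict the current smallest
            my $i = 0;
            $i++ while $i < @top && $top[$i] < $v;
            splice @top, $i, 0, $v;     # insert, keeping @top sorted
        }
    },
});
```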

Xho


## Re: sorting just the largest values

> > > Most people wouldn't consider 20,000 items to be a "very large list."

> > > I'd suggest first just trying the brute-force approach of reading

> > them

> > > into an array, sorting them, and taking the last 50 results. Only if

> >

> > I'm developing a real-time application, and a particular sort was

> > taking several minutes when the list got up to 20,000.

>

> That is definitely pathological.

I agree. There's some gross inefficiency going on. We should see the
sort; it's certainly speed-up-able.

In general, the solution to the "bottom-n" problem is a heap. In pseudo-

code (using a mythical module Heap, but there are real ones on CPAN):

```perl
use Heap;

my $h = Heap->new();
my $n = 50;
for ( @data ) {
    $h->insert($_);
    $h->extract_maximum if $h->size > $n;
}
my @top_n = map $h->extract_maximum, 1 .. $h->size;
```

If the heap implementation doesn't have a ->size method, it is easy to

use an external counter instead.

If $n is larger than the data size, the algorithm does a heap sort on

the data. If $n is smaller, it is faster than the equivalent sort.
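The mythical module can also be approximated with a plain array-based binary heap. Here's a pure-Perl sketch, adapted to the top-n-largest case the OP asked about (numeric data assumed; a min-heap whose root is the smallest of the current top $n, so anything larger than the root evicts it):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @heap;    # array-based binary min-heap, 0-indexed

sub heap_push {
    my ($v) = @_;
    push @heap, $v;
    my $i = $#heap;
    while ($i > 0) {                       # sift up until parent <= child
        my $p = int(($i - 1) / 2);
        last if $heap[$p] <= $heap[$i];
        @heap[$p, $i] = @heap[$i, $p];
        $i = $p;
    }
}

sub heap_pop {                             # remove and return the minimum
    my $min  = $heap[0];
    my $last = pop @heap;
    if (@heap) {
        $heap[0] = $last;
        my $i = 0;
        while (1) {                        # sift down to the smaller child
            my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
            my $s = $i;
            $s = $l if $l <= $#heap && $heap[$l] < $heap[$s];
            $s = $r if $r <= $#heap && $heap[$r] < $heap[$s];
            last if $s == $i;
            @heap[$s, $i] = @heap[$i, $s];
            $i = $s;
        }
    }
    return $min;
}

my $n = 3;                                 # 50 in the real case
for my $v (5, 1, 9, 7, 3, 8) {
    heap_push($v);
    heap_pop() if @heap > $n;              # evict the smallest once over capacity
}
my $count = scalar @heap;
my @top_n = map { heap_pop() } 1 .. $count;  # pops come out ascending
print "@top_n\n";                            # prints "7 8 9"
```

The running cost is O(N log n) rather than O(N log N), and only $n values are ever held in memory.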

Anno


## Re: sorting just the largest values

>...

>In general, the solution to the "bottom-n" problem is a heap. In pseudo-

>code (using a mythical module Heap, but there are real ones on CPAN):

>

> use Heap;

> my $h = Heap->new();

> my $n = 50;

> for ( @data ) {

> $h->insert( $_);

> $h->extract_maximum if $h->size > $n;

> }

> my @top_n = map $h->extract_maximum, 1 .. $h->size;

>

>If the heap implementation doesn't have a ->size method, it is easy to

>use an external counter instead.

>

>If $n is larger than the data size, the algorithm does a heap sort on

>the data. If $n is smaller, it is faster than the equivalent sort.

>

>Anno

Perhaps (not sure about this) it would be faster (fewer steps?) to use
one of the *newer* (than heapsort) priority-queue structures, e.g.
splay trees and the like? (Or maybe no real difference?)

David

## Re: sorting just the largest values

> I have a very large list of data (up to 20,000 lines), and I just want

> to print out the largest 50 values. What is the quickest way to sort

> this using perl?

For very large lists of data: while reading in the lines, keep a
sorted array containing the (up to 50) largest values seen so far.

But 20000 is not very large, or even large. Read the whole file and

sort it.

If you say "sort" to search.cpan.org, it will show you some modules

that are likely to be useful.

## Re: sorting just the largest values

Alex Hart wrote:

> I have a very large list of data (up to 20,000 lines), and I just want

> to print out the largest 50 values. What is the quickest way to sort

> this using perl?

>

> Thanks in advance.

>

> - Alex Hart

Selection sort - just do the outer loop 50 times.

http://en.wikipedia.org/wiki/Selection_sort

The time complexity is O(k·n) when selecting the top k of n items
(effectively O(n) when k << n), compared to O(n log n) for sorting the
whole list.
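A sketch of that partial selection sort in Perl (numeric data assumed; `@data` and `$k` are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Partial selection sort: run the outer loop only $k times,
# moving the largest remaining value to the front on each pass.
my @data = (4, 11, 2, 9, 7, 5);
my $k    = 3;

for my $i (0 .. $k - 1) {
    my $max = $i;
    for my $j ($i + 1 .. $#data) {
        $max = $j if $data[$j] > $data[$max];
    }
    @data[$i, $max] = @data[$max, $i];   # swap this pass's maximum into place
}
my @top_k = @data[0 .. $k - 1];          # largest $k, descending
print "@top_k\n";                        # prints "11 9 7"
```

Each pass scans the unsorted tail once, hence the O(k·n) bound above.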

gtoomey

