Posted by cvh@LE on February 22, 2008, 2:23 pm
>
>
> > I have a web page calling a Perl script that searches for patterns in
> > 20,000+ files and returns links to the files and lines found matching
> > the pattern. I use a call to `find` and `egrep`.
>
> > Q: The script works, but it is straining under the load - the files
> > are in the GBs.
> > How to speed up the process? How simple is it to employ threads or
> > split off new processes?
>
> > I know I should RTFM (LOL) and I will, but I'm just looking for some
> > quick guidance/suggestions.
>
> > pseudo code:
>
> > cd root of document directory
>
> > Load array with names of directories
>
> > foreach subdir in @dirnames
>
> >     cd $subdir
> >     lots of if statements to figure out which find command and which
> >     options to use
> >     @temp_array=`$long_find_grep_command`
> >     push @temp_array onto big array
> >     other processing
> > end foreach
>
> > What I'd like to do is search more than one subdirectory
> > simultaneously.
>
> > TX for your help -
>
> Your idea is only likely to help if the directories reside on different
> disks; otherwise it will slow down the search by thrashing the disks.
>
> Better would be to analyze the type of requests. Maybe there
> are common searches you can cache. For example, a search for
> /the magic words are squeamish ossifrage/ need only be performed
> on files known to contain the common word "ossifrage".
To me this sounds very much like the 20k+ files are changed too often.
If that is the case, you may well be able to speed up the process by
using an index of some sort that is updated at regular intervals by a
separate Perl process, e.g. one run from cron. I personally recommend an
SQL database of some sort against which your web requests run their
queries. This db can be updated every x minutes.
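
A minimal sketch of such a periodically rebuilt index, in Perl. To stay dependency-free it swaps the SQL database for a Storable file holding a word-to-files hash. `build_index`, `candidate_files` and the command-line paths are made-up names for illustration, and a full word index over GBs of text may well be too large to hold in memory as written.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;                     # core module
use Storable qw(nstore retrieve);   # core module, stands in for the SQL db

# Build a word => { file => 1 } index for every plain file under $root.
# Meant to be run from cron, not from the web request.
sub build_index {
    my ($root) = @_;
    my %index;
    find({ no_chdir => 1, wanted => sub {
        my $path = $File::Find::name;
        return unless -f $path;
        open my $fh, '<', $path or return;   # skip unreadable files
        while (my $line = <$fh>) {
            # Record each lower-cased word once per file.
            $index{lc $1}{$path} = 1 while $line =~ /(\w+)/g;
        }
        close $fh;
    } }, $root);
    return \%index;
}

# Return the sorted list of files that contain every one of @words.
sub candidate_files {
    my ($index, @words) = @_;
    my %count;
    for my $word (map { lc } @words) {
        $count{$_}++ for keys %{ $index->{$word} || {} };
    }
    my @hits = grep { $count{$_} == @words } keys %count;
    return sort @hits;
}

# Hypothetical cron usage: perl build_index.pl <docroot> <index-file>
if (@ARGV == 2) {
    my ($root, $index_file) = @ARGV;
    nstore(build_index($root), $index_file);
}
```

The web script would then `retrieve` the stored index, call `candidate_files` with the literal words in the query, and hand only those files to `egrep`.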
Another idea would be to keep several flat-file index databases and
query them with awk in subprocesses, since awk can be a lot faster than
Perl in specific cases ...
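
A rough sketch of the flat-file variant, assuming a tab-separated "word, file" layout that I made up for illustration. `write_flat_index` and `awk_lookup` are hypothetical names, and whether awk actually beats a pure Perl lookup here would need measuring.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Write one "word<TAB>file" line per distinct (word, file) pair.
sub write_flat_index {
    my ($index_file, @paths) = @_;
    open my $out, '>', $index_file or die "cannot write $index_file: $!";
    for my $path (@paths) {
        open my $in, '<', $path or next;     # skip unreadable files
        my %seen;
        while (my $line = <$in>) {
            $seen{lc $1} = 1 while $line =~ /(\w+)/g;
        }
        close $in;
        print {$out} "$_\t$path\n" for sort keys %seen;
    }
    close $out or die "cannot close $index_file: $!";
}

# Ask awk, in a subprocess, for every file listed under $word.
sub awk_lookup {
    my ($index_file, $word) = @_;
    die "unsupported search word\n" unless $word =~ /\A\w+\z/;
    my @cmd = ('awk', '-F', '\t', '-v', 'w=' . lc($word),
               # Appending "" forces a string comparison, not a numeric one.
               '($1 "") == (w "") { print $2 }', $index_file);
    # List-form pipe open: no shell is involved, so nothing needs quoting.
    open my $awk, '-|', @cmd or die "cannot run awk: $!";
    chomp(my @files = <$awk>);
    close $awk;
    return @files;
}
```

As with the database approach, the index would be rebuilt from cron, and the web request would only ever run `awk_lookup` before passing the resulting files to `egrep`.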