Help: Content extraction

Posted by Amy Lee on May 9, 2008, 10:30 pm
Hello,

I have a problem while I'm processing my sequence file. The file content
is like this.

>seq1
ACGGTC
ACTG
>seq2
CGATCC
ACCTC
>seq3
......

I would like to write every sequence into its own file. For example, the
content of file "seq1" would be
>seq1
ACGGTC
ACTG
and the content of file "seq2" would be
>seq2
CGATCC
ACCTC
and so on.

However, I'm only a newbie in Perl and I don't know how to do this. Could
anyone post some sample code? I don't want to use BioPerl because, although
it's quite useful, the other machines don't have that package installed.

Thank you very much~

Regards,

Amy Lee

Posted by Jürgen Exner on May 9, 2008, 11:43 pm
>I have a problem while I'm processing my sequence file.

I know text files, binary files, random access files, sequential files,
but I've never heard of a sequence file.

>The file content
>is like this.
>
>>seq1
>ACGGTC
>ACTG
>>seq2
>CGATCC
>ACCTC
>>seq3
>......
>
>I would like to write every sequence into its own file. For example, the

What is a sequence?

>content of file "seq1" would be
>>seq1
>ACGGTC
>ACTG
>and the content of file "seq2" would be
>>seq2
>CGATCC
>ACCTC
>and so on.

How is this desired content different from the original content? They
seem to be identical to me.

>However, I'm only a newbie in Perl and I don't know how to do this. Could
>anyone post some sample code?

Probably not without a much improved specification.

jue

Posted by Amy Lee on May 10, 2008, 3:16 am
Jue,

Most of my work is processing DNA, so I save DNA sequences in a format called
FastA, as you've seen above. Say my file is called dna.fasta; its
content is

>seq1
ACGGTC
ACTG
>seq2
CGATCC
ACCTC
>seq3
......

"seq1", "seq2", "seq3" and so on ("seqx") are the names of the sequences; you
could call each one a marker. The lines under a "seqx" marker are the DNA
sequence itself. My goal is quite simple: I want to extract every sequence from
dna.fasta and save each one as its own file.

There's an example.

From dna.fasta I would get 3 sequence files, named after the markers: seq1,
seq2 and seq3. In file seq1, the content is
>seq1
ACGGTC
ACTG
In file seq2, its content is
>seq2
CGATCC
ACCTC
And so on. That way I can deal with each sequence easily.

Thank you very much~

Regards,

Amy Lee

Posted by Jürgen Exner on May 10, 2008, 8:11 am
>Most of my work is processing DNA, so I save DNA sequences in a format called
>FastA, as you've seen above. Say my file is called dna.fasta; its
>content is
>
>>seq1
>ACGGTC
>ACTG
>>seq2
>CGATCC
>ACCTC
>>seq3
>......

From your previous description I thought those were 3 separate files.
Obviously I was wrong.

>"seq1", "seq2", "seq3" and so on ("seqx") are the names of the sequences; you
>could call each one a marker. The lines under a "seqx" marker are the DNA
>sequence itself. My goal is quite simple: I want to extract every sequence from
>dna.fasta and save each one as its own file.

So you want to split the file at each ">seq*" marker.

Well, then why not just loop (while (<>)) through the input file and
whenever you encounter such a marker (m//) close() the current output
file and open() a new one?
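
A minimal sketch of that loop, not a definitive implementation. It assumes every marker line has the form ">name", that the names are plain words which are safe to use as file names, and that an explicit file handle is acceptable in place of the bare while (<>) loop. The sub name split_fasta and the script name split_fasta.pl are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Split a FASTA file into one file per record, each named after its marker.
# Returns the list of file names written, in input order.
# (split_fasta is a hypothetical helper name, not from the thread.)
sub split_fasta {
    my ($in_path, $out_dir) = @_;
    open my $in, '<', $in_path or die "Cannot open $in_path: $!";

    my ($out, @written);
    while (my $line = <$in>) {
        # A marker line starts with '>' followed by the sequence name.
        if ($line =~ /^>(\S+)/) {
            my $name = $1;
            # Assumption: names are plain words; refuse anything path-like.
            die "Unsafe sequence name '$name'" unless $name =~ /^\w[\w.-]*$/;
            if ($out) { close $out or die "Cannot close output: $!"; }
            my $path = File::Spec->catfile($out_dir, $name);
            open $out, '>', $path or die "Cannot open $path: $!";
            push @written, $name;
        }
        # Lines before the first marker have no file to go to; skip them.
        print {$out} $line if $out;
    }
    if ($out) { close $out or die "Cannot close output: $!"; }
    close $in;
    return @written;
}

# Usage: perl split_fasta.pl dna.fasta [output_dir]
split_fasta($ARGV[0], defined $ARGV[1] ? $ARGV[1] : '.') if @ARGV;
```

Run as "perl split_fasta.pl dna.fasta", this writes the files seq1, seq2 and so on into the current directory, each starting with its own ">seqN" line. An existing file with the same name is overwritten.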

jue

Posted by Amy Lee on May 10, 2008, 11:04 am
On Sat, 10 May 2008 12:11:30 +0000, Jürgen Exner wrote:

> So you want to split the file at each ">seq*" marker.
>
> Well, then why not just loop (while (<>)) through the input file and
> whenever you encounter such a marker (m//) close() the current output
> file and open() a new one?
>
> jue

Yes, you are right, and that code works for what I need.

Thank you again~

Amy
