[wplug] Text searching

Mon Mar 17 09:50:39 EST 2003

Doug Green wrote:
> 
> Hi all-
> 
> I have some large text files that I need to search. They are genomic
> sequences, and consist of 4 letters in a block of 10, separated by a
> space. There are 6 such blocks on a line, and each line is numbered
> for the order of the first letter (maybe 20,000+ lines per file?).
> Essentially, the format looks like this (obviously, the content is
> different):
> 
> 1       atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg
> 61     atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg
> 
> I need to be able to search within this kind of text file for a string
> of letters that is maybe 30-40 letters long, ignoring the spaces and
> numbers. The whole point is that I need to locate the position of my
> search string within the original text. Is there some fancy way to
> grep the file, ignoring spaces and numbers? Or to somehow filter out
> the spaces and numbers, creating a new file (maybe some cat option
> piped into a new file??)?
> 
> Any help/suggestions are greatly appreciated! Thanks!
> 
> Doug

Doug,

I don't have a nifty filter command for you, but you can find a
solution to this problem in the O'Reilly book "Beginning Perl for
Bioinformatics" (http://www.oreilly.com/catalog/begperlbio/). The
Carnegie Library has a copy. The book provides an open-source Perl
module, BeginPerlBioinfo.pm, at
http://examples.oreilly.com/begperlbio/BeginPerlBioinfo.pm.

Also, if you want more powerful Perl tools for bioinformatics, check
out BioPerl at www.bioperl.org.
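
If you would rather roll your own, the idea is simple enough to sketch:
throw away everything that is not a letter, search the resulting single
long string, and report the match position. Because each line number in
your files is the position of that line's first letter, a position in
the stripped string is already a position in the original numbering.
Below is a minimal Python sketch of that approach. It is only an
illustration, not code from the book or from BioPerl; the function names
are made up, and the default of 60 letters per line is an assumption
taken from the format you described.

```python
import re

# Anything that is not a letter: line numbers, spaces, newlines.
NON_LETTERS = re.compile(r"[^A-Za-z]")


def load_sequence(text):
    """Return the sequence as one lowercase string of letters only."""
    return NON_LETTERS.sub("", text).lower()


def find_all(sequence, query):
    """Return the 1-based start position of every match of query.

    Overlapping matches are reported. Spaces or digits in the query
    are ignored, so a block copied from the file can be pasted as-is.
    """
    query = NON_LETTERS.sub("", query).lower()
    if not query:
        return []
    positions = []
    start = sequence.find(query)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based
        start = sequence.find(query, start + 1)
    return positions


def line_label(position, per_line=60):
    """Return the number printed at the start of the line holding position.

    Assumes every line carries exactly per_line letters (6 blocks of 10).
    """
    return (position - 1) // per_line * per_line + 1


def search_file(path, query, per_line=60):
    """Return (position, line_label) pairs for each match in the file."""
    with open(path) as handle:
        sequence = load_sequence(handle.read())
    return [(pos, line_label(pos, per_line)) for pos in find_all(sequence, query)]
```

Calling search_file("genome.txt", "atacaatagg atacaatagg") (the file
name is hypothetical) would return a list of pairs: where each match
starts, and the number of the line to look at in the original file. This
reads the whole file into memory, which should be fine for files of
20,000 or so lines.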

Good luck,

Paul