[wplug] Text searching

John Harrold jmh17 at pitt.edu
Mon Mar 17 09:57:56 EST 2003


Sometime in March Doug Green assaulted the keyboard and produced:

| Hi all-
| 
| I have some large text files that I need to search. They are genomic
| sequences, and consist of 4 letters in a block of 10, separated by a space.
| There are 6 such blocks on a line, and each line is numbered for the order
| of the first letter (maybe 20,000+ lines per file?). Essentially, the format
| looks like this (obviously, the content is different):
| 
| 1       atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg
| 61     atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg
| 
| I need to be able to search within this kind of text file for a string of
| letters that is maybe 30-40 letters long, ignoring the spaces and numbers.
| The whole point is that I need to locate the position of my search string
| within the original text. Is there some fancy way to grep the file, ignoring
| spaces and numbers? Or to somehow filter out the spaces and numbers,
| creating a new file (maybe some cat option piped into a new file??)?
| 
| Any help/suggestions are greatly appreciated! Thanks!
| 
| Doug
| 

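There is probably a more elegant way to do this with sed and awk, but this is a
solution in perl. It takes the text in 'infile.txt', strips out all of the
newlines, whitespace and digits, and writes the remaining text to 'outfile.txt'.
You should then be able to grep through outfile.txt for what you need. Since
outfile.txt ends up as one long line, a plain grep will just print the whole
thing; with GNU grep, 'grep -b -o' prints the 0-based byte offset of each match,
so add 1 to get the position in the sequence.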


#! /usr/bin/perl

use strict;
use warnings;

MAIN:
{
    # this is the input file
    my $infile = "infile.txt";
    # this is the output file
    my $outfile = "outfile.txt";


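    # stop with a message if either file cannot be opened
    open(my $in,  '<', $infile)  or die "cannot read $infile: $!";
    open(my $out, '>', $outfile) or die "cannot write $outfile: $!";

    while (<$in>) {
      chomp $_;
      # removing numbers
      $_ =~ s/\d//g;
      # removing whitespace
      $_ =~ s/\s//g;

      # writing to new file
      print $out $_;
    }

    close($in);
    close($out) or die "cannot close $outfile: $!";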
    exit 0;
}
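
Since the whole point is to locate the match, here is a rough sketch of doing
the search in perl as well, instead of grepping the stripped file. It assumes
the format described in the question (6 blocks of 10 letters, so 60 letters per
line) and uses the two sample lines from the original message as data. The
names strip_sequence and find_positions and the query string are made up for
illustration; this has not been tried on a real sequence file.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Join the given lines into one bare sequence, dropping digits and whitespace.
sub strip_sequence {
    my (@lines) = @_;
    my $sequence = '';
    for my $line (@lines) {
        (my $clean = $line) =~ s/[\d\s]+//g;
        $sequence .= $clean;
    }
    return $sequence;
}

# Return the 1-based start position of every occurrence of $query in
# $sequence, overlapping occurrences included.
sub find_positions {
    my ($sequence, $query) = @_;
    my @positions;
    my $offset = 0;
    while ((my $hit = index($sequence, $query, $offset)) != -1) {
        push @positions, $hit + 1;   # index() counts from 0
        $offset = $hit + 1;          # resume one letter past this hit
    }
    return @positions;
}

# Sample data: the two example lines from the question.
my @lines = (
    "1       atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg",
    "61     atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg atacaatagg",
);

# Hypothetical query; it crosses a block boundary in the sample data.
my $query = "taggatac";

my $sequence = strip_sequence(@lines);
for my $pos (find_positions($sequence, $query)) {
    # Assumes 60 letters per line, numbered by the first letter on the line.
    my $line_label = int(($pos - 1) / 60) * 60 + 1;
    printf "%d (on the line numbered %d)\n", $pos, $line_label;
}
```

Positions are 1-based, so they line up with the numbers at the start of each
line in the file. Because the search runs on the stripped text, a hit that
straddles a space or a line break (like the one starting at position 57 in the
sample) is still found. For a real file, read the lines from a file handle
instead of the hard-coded list.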



-- 
---------------------------------------------------------------
john harrold               | "They that can give up essential  
     jmh at member.fsf.org |  liberty to obtain a little       
/"\                        |  temporary safety deserve neither 
\ / ASCII ribbon campaign  |  liberty nor safety."             
 X  against HTML mail      |                                  
/ \                        |  Benjamin Franklin
---------------------------------------------------------------
gpg --keyserver keys.indymedia.org --recv-key F65A739E
---------------------------------------------------------------
Beware of all enterprises that require new clothes.
--Henry David Thoreau