I'm new to Perl regex and would appreciate some help. I am parsing BLAST outputs. Right now, I can only account for hits whose E-value consists of integers and decimals. How can I also include hits whose E-value is in scientific notation?
blastoutput.txt
                                                                  score     e
sequences producing significant alignments:                       (bits)  value

ref|wp_001577367.1| hypothetical protein [escherichia coli] >...   75.9    4e-15
ref|wp_001533923.1| cytotoxic necrotizing factor 1 [escherich...   75.9    7e-15
ref|wp_001682680.1| cytotoxic necrotizing factor 1 [escherich...   75.9    7e-15
ref|zp_15044188.1| cytotoxic necrotizing factor 1 domain prot...   40.0    0.002
ref|yp_650655.1| hypothetical protein ypa_0742 [yersinia pest...   40.0    0.002

alignments
>ref|wp_001577367.1| hypothetical protein [escherichia coli]
parse.pl
open (file, './blastoutput.txt');
$marker = 0;
@one; @acc; @desc; @score; @evalue;
$counter=0;
while(<file>){
    chomp;
    if($marker==1){
        if(/^(\d+)\|(.+?)\|\s(.*?)\s(\d+)(\.\d+)? +(\d+)([\.\d+]?) *$/) {
        #if(/^(\d+)\|(.+?)\|\s(.*?)\s(\d+)(\.\d+)? +(\d+)((\.\d+)?(e.*?)?) *$/)
            $one[$counter] = $1;
            $acc[$counter] = $2;
            $desc[$counter] = $3;
            $score[$counter] = $4+$5;
            if(! $7){
                $evalue[$counter] = $6;
            }else{
                $evalue[$counter] = $6+$7;
            }
            $counter++;
        }
    }
    if(/sequences producing significant alignments/){
        $marker = 1;
    }elsif(/alignments/){
        $marker = 0;
    }elsif(/no significant similarity found/){
        last;
    }
}
for(my $i=0; $i < scalar(@one); $i++){
    print "$one[$i] | $acc[$i] | $desc[$i] | $score[$i] | $evalue[$i]\n";
}
close file;
You can match a number, whether or not it is in scientific notation, with this:
\d+(?:\.\d+)?+(?:e[+-]?\d+)?+
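To see what this pattern accepts on its own, here is a minimal sketch that anchors it and tries it on a few values. The helper name is_evalue, the /i flag, and the values not taken from the BLAST output above are assumptions added for illustration only.

```perl
use strict;
use warnings;

# The number pattern from above. The /i flag is an addition here (an
# assumption that an upper-case "E" exponent should also be accepted).
my $num = qr/\d+(?:\.\d+)?+(?:e[+-]?\d+)?+/i;

# Hypothetical helper: true only if the whole string is one such number.
sub is_evalue {
    my ($text) = @_;
    return $text =~ /\A$num\z/ ? 1 : 0;
}

for my $value ('4e-15', '0.002', '75', '1.5E+10', '4e', 'e-15') {
    printf "%-8s %s\n", $value, is_evalue($value) ? 'match' : 'no match';
}
```

The first four values match. '4e' and 'e-15' do not, because the pattern requires digits both before and after the exponent marker.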
With your code:
if (/^([^|]+)\|([^|]+)\|\s++(.*?)\s(\d+(?:\.\d+)?+)\s+(\d+(?:\.\d+)?+(?:e[+-]?\d+)?+)\s*$/) {
    $one[$counter] = $1;
    $acc[$counter] = $2;
    $desc[$counter] = $3;
    $score[$counter] = $4;
    $evalue[$counter] = $5;
    $counter++;
}
(I have added the possessive quantifiers ++ and ?+ to reduce the number of backtracking steps as much as possible, because the 3rd group uses a lazy quantifier. It is best to use a more precise pattern for the description part if possible.)
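As a quick check, the full pattern can be tried on two of the hit lines from blastoutput.txt. This is only a sketch: the sub name parse_hit is hypothetical, and the lines are hard-coded instead of being read from the file.

```perl
use strict;
use warnings;

# Two hit lines copied from the BLAST output in the question.
my @lines = (
    'ref|wp_001577367.1| hypothetical protein [escherichia coli] >... 75.9 4e-15',
    'ref|zp_15044188.1| cytotoxic necrotizing factor 1 domain prot... 40.0 0.002',
);

# Hypothetical helper: in list context, returns the five captured fields
# (database tag, accession, description, score, E-value), or an empty
# list when the line is not a hit line.
sub parse_hit {
    my ($line) = @_;
    return $line =~ /^([^|]+)\|([^|]+)\|\s++(.*?)\s(\d+(?:\.\d+)?+)\s+(\d+(?:\.\d+)?+(?:e[+-]?\d+)?+)\s*$/;
}

for my $line (@lines) {
    my @fields = parse_hit($line) or next;   # skip non-matching lines
    print join(' | ', @fields), "\n";
}
```

The captured E-value is a string such as 4e-15, but Perl converts it to a number whenever it is used in a numeric context, so a filter like $evalue[$i] < 1e-5 works without any extra conversion.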