| Class | Bio::FastaFormat |
| In: | lib/bio/db/fasta.rb |
| Parent: | DB |
Treats a FASTA formatted entry, such as:
>id and/or some comments                    <== comment line
ATGCATGCATGCATGCATGCATGCATGCATGCATGC        <== sequence lines
ATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGC
The leading ’>’ can be omitted, and the trailing ’>’ is removed automatically.
f_str = <<END_OF_STRING
>sce:YBR160W CDC28, SRM5; cyclin-dependent protein kinase catalytic subunit [EC:2.7.1.-] [SP:CC28_YEAST]
MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEG
VPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYME
GIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNL
KLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGC
IFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFP
QWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES
>sce:YBR274W CHK1; probable serine/threonine-protein kinase [EC:2.7.1.-] [SP:KB9S_YEAST]
MSLSQVSPLPHIKDVVLGDTVGQGAFACVKNAHLQMDPSIILAVKFIHVP
TCKKMGLSDKDITKEVVLQSKCSKHPNVLRLIDCNVSKEYMWIILEMADG
GDLFDKIEPDVGVDSDVAQFYFQQLVSAINYLHVECGVAHRDIKPENILL
DKNGNLKLADFGLASQFRRKDGTLRVSMDQRGSPPYMAPEVLYSEEGYYA
DRTDIWSIGILLFVLLTGQTPWELPSLENEDFVFFIENDGNLNWGPWSKI
EFTHLNLLRKILQPDPNKRVTLKALKLHPWVLRRASFSGDDGLCNDPELL
AKKLFSHLKVSLSNENYLKFTQDTNSNNRYISTQPIGNELAELEHDSMHF
QTVSNTQRAFTSYDSNTNYNSGTGMTQEAKWTQFISYDIAALQFHSDEND
CNELVKRHLQFNPNKLTKFYTLQPMDVLLPILEKALNLSQIRVKPDLFAN
FERLCELLGYDNVFPLIINIKTKSNGGYQLCGSISIIKIEEELKSVGFER
KTGDPLEWRRLFKKISTICRDIILIPN
END_OF_STRING

f = Bio::FastaFormat.new(f_str)
puts "### FastaFormat"
puts "# entry"
puts f.entry
puts "# entry_id"
p f.entry_id
puts "# definition"
p f.definition
puts "# data"
p f.data
puts "# seq"
p f.seq
puts "# seq.class"
p f.seq.class   # Object#type no longer exists in current Ruby; use #class
puts "# length"
p f.length
puts "# aaseq"
p f.aaseq
puts "# aaseq.class"
p f.aaseq.class
puts "# aaseq.composition"
p f.aaseq.composition
puts "# aalen"
p f.aalen
| DELIMITER | = | RS = "\n>" | Entry delimiter in flatfile text. |
| DELIMITER_OVERRUN | = | 1 | (Integer) excess read size included in DELIMITER. |
| data | [RW] | The sequence lines in text. |
| definition | [RW] | The comment line of the FASTA formatted data. |
| entry_overrun | [R] |
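To illustrate what the DELIMITER and DELIMITER_OVERRUN constants describe, here is a minimal sketch in plain Ruby. It is not Bio::FlatFile's actual reader: the helper `split_fasta_chunks` and the constant names `FASTA_DELIMITER` and `FASTA_DELIMITER_OVERRUN` are hypothetical, and the sketch only assumes that text is read up to and including "\n>", so that the final ’>’ (one character) really belongs to the next entry.

```ruby
require 'stringio'

# Hypothetical stand-ins for the constants documented above.
FASTA_DELIMITER = "\n>"
FASTA_DELIMITER_OVERRUN = 1 # the '>' read in excess belongs to the next entry

# Hypothetical helper: reads delimiter-terminated chunks and moves the
# overrun character from the end of one chunk to the start of the next.
def split_fasta_chunks(io)
  chunks = []
  carry = ''
  while (chunk = io.gets(FASTA_DELIMITER))
    chunk = carry + chunk
    if chunk.end_with?(FASTA_DELIMITER)
      carry = chunk[-FASTA_DELIMITER_OVERRUN, FASTA_DELIMITER_OVERRUN]
      chunk = chunk[0...-FASTA_DELIMITER_OVERRUN]
    else
      carry = ''
    end
    chunks << chunk
  end
  chunks
end

chunks = split_fasta_chunks(StringIO.new(">a\nAT\n>b\nGC\n"))
p chunks # => [">a\nAT\n", ">b\nGC\n"]
```

Each resulting chunk starts with its own ’>’ and could then be handed to a parser one entry at a time.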
Stores the comment and sequence information from one entry of the FASTA format string. If the argument contains more than one entry, only the first entry is used.
# File lib/bio/db/fasta.rb, line 119
def initialize(str)
  @definition = str[/.*/].sub(/^>/, '').strip # 1st line
  @data = str.sub(/.*/, '') # rests
  @data.sub!(/^>.*/m, '') # remove trailing entries for sure
  @entry_overrun = $&
end
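The effect of those three substitutions can be tried without the library. The following is a sketch only: `split_fasta_entry` is a hypothetical helper that applies the same regular expressions to a plain string and returns the pieces instead of storing them in instance variables.

```ruby
# Hypothetical helper mirroring the steps of initialize; not part of BioRuby.
def split_fasta_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip # first line, leading '>' dropped
  data = str.sub(/.*/, '')                   # everything after the first line
  overrun = data[/^>.*/m]                    # following entries, or nil if none
  data = data.sub(/^>.*/m, '')               # keep only the first entry's lines
  [definition, data, overrun]
end

text = ">seq1 first entry\nATGC\nATGC\n>seq2 second entry\nGGCC\n"
definition, data, overrun = split_fasta_entry(text)
p definition # => "seq1 first entry"
p data       # => "\nATGC\nATGC\n"
p overrun    # => ">seq2 second entry\nGGCC\n"
```

Because the leading ’>’ is removed with an optional substitution, a string such as "seq1\nATGC\n" yields the same kind of result, with no overrun.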
Returns the sequence as a Bio::Sequence::AA object.
# File lib/bio/db/fasta.rb, line 204
def aaseq
  Sequence::AA.new(seq)
end
Parses the FASTA defline (using the identifiers method) and returns the accession numbers as an array of strings.
# File lib/bio/db/fasta.rb, line 260
def accessions
  identifiers.accessions
end
Returns comments.
# File lib/bio/db/fasta.rb, line 183
def comment
  seq
  @comment
end
Parses the FASTA defline (using the identifiers method) and returns a possibly unique identifier as a string.
# File lib/bio/db/fasta.rb, line 239
def entry_id
  identifiers.entry_id
end
Parses the FASTA defline (using the identifiers method) and returns the GI, locus, accession, or accession with version number. If an entry has two or more such IDs, only the first one is returned. Returns a string or nil.
# File lib/bio/db/fasta.rb, line 248
def gi
  identifiers.gi
end
Parses the FASTA defline and extracts the IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or ":"-separated IDs. Returns a Bio::FastaDefline instance.
# File lib/bio/db/fasta.rb, line 229
def identifiers
  unless defined?(@ids) then
    @ids = FastaDefline.new(@definition)
  end
  @ids
end
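The listing above caches its result with a `defined?` check, so the defline is parsed at most once per object. A minimal sketch of that idiom follows; the class `LazyDefline`, its `parse_count` counter, and the `split(':')` stand-in for `FastaDefline.new` are all hypothetical and exist only for illustration.

```ruby
# Hypothetical class showing the defined?(@ivar) memoization idiom.
class LazyDefline
  attr_reader :parse_count

  def initialize(definition)
    @definition = definition
    @parse_count = 0
  end

  def identifiers
    unless defined?(@ids)
      @parse_count += 1
      @ids = @definition.split(':') # stand-in for FastaDefline.new(@definition)
    end
    @ids
  end
end

d = LazyDefline.new('sce:YBR160W')
first = d.identifiers
second = d.identifiers
p first.equal?(second) # => true (the same cached object)
p d.parse_count        # => 1
```

One consequence of testing `defined?` rather than using `||=` is that a nil or false result would also be cached instead of being recomputed on every call.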
Returns the length of the sequence as a Bio::Sequence::NA.
# File lib/bio/db/fasta.rb, line 199
def nalen
  self.naseq.length
end
Returns the sequence as a Bio::Sequence::NA object.
# File lib/bio/db/fasta.rb, line 194
def naseq
  Sequence::NA.new(seq)
end
Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.
#!/usr/bin/env ruby
require 'bio'
factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
flatfile.each do |entry|
  p entry.definition
  result = entry.fasta(factory)
  result.each do |hit|
    print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
    p hit.lap_at
  end
end
# File lib/bio/db/fasta.rb, line 150
def query(factory)
  factory.query(@entry)
end
Returns a joined sequence line as a String.
# File lib/bio/db/fasta.rb, line 157
def seq
  unless defined?(@seq)
    unless /\A\s*^\#/ =~ @data then
      @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up
    else
      a = @data.split(/(^\#.*$)/)
      i = 0
      cmnt = {}
      s = []
      a.each do |x|
        if /^# ?(.*)$/ =~ x then
          cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1
        else
          x.tr!(" \t\r\n0-9", '') # lazy clean up
          i += x.length
          s << x
        end
      end
      @comment = cmnt
      @seq = Bio::Sequence::Generic.new(s.join(''))
    end
  end
  @seq
end
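The clean-up logic in the listing above can be exercised on plain strings. The sketch below is an approximation rather than library code: `clean_fasta_data` is a hypothetical helper that returns a String and a Hash (or nil) instead of setting `@seq` and `@comment`, and it builds no Bio::Sequence::Generic object.

```ruby
# Hypothetical standalone helper; returns [sequence_string, comments].
def clean_fasta_data(data)
  strip_chars = " \t\r\n0-9"
  # Fast path: the data does not begin with a '#' comment line.
  return [data.tr(strip_chars, ''), nil] unless /\A\s*^\#/ =~ data

  comments = {}
  pieces = []
  offset = 0
  data.split(/(^\#.*$)/).each do |chunk|
    if /^# ?(.*)$/ =~ chunk
      # Comment line: file it under the sequence offset reached so far.
      text = $1
      if comments[offset]
        comments[offset] << "\n" << text
      else
        comments[offset] = text
      end
    else
      cleaned = chunk.tr(strip_chars, '') # drop whitespace and digits
      offset += cleaned.length
      pieces << cleaned
    end
  end
  [pieces.join, comments]
end

seq, comments = clean_fasta_data("\n# first comment\nATGC ATGC 8\n# second comment\nGGCC\n")
p seq      # => "ATGCATGCGGCC"
p comments # => {0=>"first comment", 8=>"second comment"}
```

The hash keys are positions in the cleaned sequence, so each comment is associated with the residue offset at which it appeared. As in the listing, comments are only collected when the data starts with a comment line.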
Returns sequence as a Bio::Sequence object.
Note: If you modify the returned Bio::Sequence object, the sequence or definition in this FastaFormat object may also change (though not always), because data is shared between the two objects for efficiency.
# File lib/bio/db/fasta.rb, line 220
def to_biosequence
  Bio::Sequence.adapter(self, Bio::Sequence::Adapter::FastaFormat)
end