| Class | Bio::KEGG::Taxonomy |
| In: | lib/bio/db/kegg/taxonomy.rb |
| Parent: | Object |
Parses the KEGG ‘taxonomy’ file, which describes the taxonomic classification of organisms.
The KEGG ‘taxonomy’ file is available at
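| leaves | [R] | Leaf nodes (organism codes) found under each node, as a Hash of Arrays |
| path | [R] | The taxonomic tree as a list of arrays (full paths) |
| root | [RW] | Name of the root node (tentatively 'Genes') |
| tree | [R] | The taxonomic tree as a Hash keyed by node name |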
# File lib/bio/db/kegg/taxonomy.rb, line 26
def initialize(filename, orgs = [])
  # Stores the taxonomic tree as a linked list (implemented in Hash), so
  # every node needs to have a unique name (key) to work correctly
  @tree = Hash.new

  # Also stores the taxonomic tree as a list of arrays (full path)
  @path = Array.new

  # Also stores all leaf nodes (organism codes) of every intermediate node
  @leaves = Hash.new

  # tentative name for the root node (use accessor to change)
  @root = 'Genes'

  hier = Array.new
  level = 0
  label = nil

  # File.foreach closes the file once iteration finishes
  File.foreach(filename) do |line|
    next if line.strip.empty?

    # line for taxonomic hierarchy (indent according to the number of # marks)
    if line[/^#/]
      level = line[/^#+/].length
      label = line[/[A-z].*/]
      hier[level] = sanitize(label)

    # line for organism name (unify different strains of a species)
    else
      tax, org, name, desc = line.chomp.split("\t")
      if orgs.nil? or orgs.empty? or orgs.include?(org)
        species, strain, = name.split('_')
        # (0) Grouping of the strains of the same species.
        #  If the name of species is the same as the previous line,
        #  add the species to the same species group.
        #   ex. Gamma/enterobacteria has a large number of organisms,
        #       so sub grouping of strains is needed for E.coli strains etc.
        #
        # However, if the species name is already used, need to avoid
        # collision of species name as the current implementation stores
        # the tree as a Hash, which may cause an infinite loop.
        #
        # (1) If species name == the intermediate node of other lineage
        #  Add '_sp' to the species name to avoid the conflict (1-1), and if
        #  'species_sp' is already taken, use 'species_strain' instead (1-2).
        #   ex.  Bacteria/Proteobacteria/Beta/T.denitrificans/tbd
        #        Bacteria/Proteobacteria/Epsilon/T.denitrificans_ATCC33889/tdn
        #     -> Bacteria/Proteobacteria/Beta/T.denitrificans/tbd
        #        Bacteria/Proteobacteria/Epsilon/T.denitrificans_sp/tdn
        #
        # (2) If species name == the intermediate node of the same lineage
        #  Add '_sp' to the species name to avoid the conflict.
        #   ex.  Bacteria/Cyanobacteria/Cyanobacteria_CYA/cya
        #        Bacteria/Cyanobacteria/Cyanobacteria_CYB/cya
        #        Bacteria/Proteobacteria/Magnetococcus/Magnetococcus_MC1/mgm
        #     -> Bacteria/Cyanobacteria/Cyanobacteria_sp/cya
        #        Bacteria/Cyanobacteria/Cyanobacteria_sp/cya
        #        Bacteria/Proteobacteria/Magnetococcus/Magnetococcus_sp/mgm
        sp_group = "#{species}_sp"
        if @tree[species]
          if hier[level+1] == species
            # case (0)
          else
            # case (1-1)
            species = sp_group
            # case (1-2)
            if @tree[sp_group] and hier[level+1] != species
              species = name
            end
          end
        else
          if hier[level] == species
            # case (2)
            species = sp_group
          end
        end
        # 'hier' is an array of the taxonomic tree + species and strain name.
        #  ex. [nil, Eukaryotes, Fungi, Ascomycetes, Saccharomycetes] +
        #      [S_cerevisiae, sce]
        hier[level+1] = species # sanitize(species)
        hier[level+2] = org
        ary = hier[1, level+2]
        warn ary.inspect if $DEBUG
        add_to_tree(ary)
        add_to_leaves(ary)
        add_to_path(ary)
      end
    end
  end
  return tree
end
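The input layout is only implied by the parser above: lines starting with one or more `#` marks give the hierarchy level, and the other lines are tab-separated `tax, org, name, desc` records. The following is a minimal standalone sketch of that reading loop, not the library code. `SAMPLE` and `parse_paths` are hypothetical names, the sample records are invented placeholders rather than real KEGG data, and both `sanitize` and the species-name collision handling (cases 0 to 2) are left out.

```ruby
# Invented sample in the layout the parser appears to expect (not real KEGG data).
SAMPLE = <<~TEXT
  # Eukaryotes
  ## Fungi
  tax1\tsce\tS.cerevisiae\tSaccharomyces cerevisiae
  ## Plants
  tax2\tosa\tO.sativa\tOryza sativa
TEXT

# Hypothetical helper: returns one full path per organism line.
def parse_paths(text)
  hier = []
  level = 0
  paths = []
  text.each_line do |line|
    next if line.strip.empty?

    if line.start_with?('#')
      # Hierarchy line: the number of '#' marks is the depth.
      level = line[/^#+/].length
      hier[level] = line[/[A-Za-z].*/].strip
    else
      # Organism line: tab-separated fields; the strain suffix is dropped.
      _tax, org, name, _desc = line.chomp.split("\t")
      species, _strain = name.split('_')
      hier[level + 1] = species
      hier[level + 2] = org
      paths << hier[1, level + 2]
    end
  end
  paths
end

parse_paths(SAMPLE).each { |path| puts path.join(' / ') }
# Eukaryotes / Fungi / S.cerevisiae / sce
# Eukaryotes / Plants / O.sativa / osa
```

Each resulting array has the same shape as the `ary` that the constructor hands to `add_to_tree`, `add_to_leaves`, and `add_to_path`.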
Adds a new path [node, subnode, subsubnode, …, leaf] under the root node and records the leaf node, in an Array, for every node on the path.
# File lib/bio/db/kegg/taxonomy.rb, line 140
def add_to_leaves(ary)
  leaf = ary.last
  ary.each do |node|
    @leaves[node] ||= Array.new
    @leaves[node] << leaf
  end
end
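To make the resulting structure concrete, here is a small standalone sketch of the same bookkeeping. It is an illustration only: the hash is passed in as an argument instead of living in `@leaves`, and the two paths are made-up sample data.

```ruby
# Standalone variant of the method above; `leaves` replaces @leaves.
def add_to_leaves(leaves, ary)
  leaf = ary.last
  ary.each do |node|
    (leaves[node] ||= []) << leaf
  end
  leaves
end

leaves = {}
add_to_leaves(leaves, %w[Eukaryotes Fungi S.cerevisiae sce])
add_to_leaves(leaves, %w[Eukaryotes Plants O.sativa osa])

p leaves['Eukaryotes'] # => ["sce", "osa"]
p leaves['Fungi']      # => ["sce"]
p leaves['sce']        # => ["sce"]
```

Because the leaf is itself the last element of the path, every organism code also maps to an Array containing only itself.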
Adds a new path [node, subnode, subsubnode, …, leaf] under the root node; every intermediate node stores its child nodes as a Hash.
# File lib/bio/db/kegg/taxonomy.rb, line 129
def add_to_tree(ary)
  parent = @root
  ary.each do |node|
    @tree[parent] ||= Hash.new
    @tree[parent][node] = nil
    parent = node
  end
end
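The Hash produced this way is flat: each node name maps to a Hash whose keys are its children, and leaves have no entry of their own. The sketch below shows that shape under the assumption that the tree and root are passed explicitly (the real method uses `@tree` and `@root`); the paths are made-up sample data.

```ruby
# Standalone variant of the method above; `tree` and `root` replace
# the instance variables @tree and @root.
def add_to_tree(tree, root, ary)
  parent = root
  ary.each do |node|
    tree[parent] ||= {}
    tree[parent][node] = nil
    parent = node
  end
  tree
end

tree = {}
add_to_tree(tree, 'Genes', %w[Eukaryotes Fungi S.cerevisiae sce])
add_to_tree(tree, 'Genes', %w[Eukaryotes Plants O.sativa osa])

p tree['Genes'].keys      # => ["Eukaryotes"]
p tree['Eukaryotes'].keys # => ["Fungi", "Plants"]
p tree['sce']             # => nil
```

Since node names are the Hash keys, two different nodes with the same name would share one entry, which is why the constructor renames colliding species.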
Compacts the intermediate nodes of the resulting taxonomic tree.
- If a child node has only one child node (grandchild), the grandchild is removed and its children become the grandchildren. ex. Plants / Monocotyledons / grass family / osa --> Plants / Monocotyledons / osa
# File lib/bio/db/kegg/taxonomy.rb, line 161
def compact(node = root)
  # if the node has children
  if subnodes = @tree[node]
    # obtain grandchildren for each child
    subnodes.keys.each do |subnode|
      if subsubnodes = @tree[subnode]
        # if the number of grandchild nodes is 1
        if subsubnodes.keys.size == 1
          # obtain the name of the grandchild node
          subsubnode = subsubnodes.keys.first
          # obtain the children of the grandchild node
          if subsubsubnodes = @tree[subsubnode]
            # make the children of the grandchild node children of the child node
            @tree[subnode] = subsubsubnodes
            # delete grandchild node
            @tree[subnode].delete(subsubnode)
            warn "--- compact: #{subsubnode} is replaced by #{subsubsubnodes}" if $DEBUG
            # repeat while the new grandchild also needs to be compacted.
            # ('retry' outside a rescue clause is a SyntaxError in Ruby 1.9
            # and later; 'redo' re-runs the current iteration instead)
            redo
          end
        end
      end
      # repeat recursively
      compact(subnode)
    end
  end
end
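The effect of compaction can be tried on a hand-built tree. The sketch below is a standalone variant, not the library method: the tree is passed as an argument, a `while` loop stands in for restarting the iteration, and the `Eudicots` branch is invented so that `Plants` has more than one child.

```ruby
# Standalone sketch: collapse single-child intermediate nodes.
def compact(tree, node)
  return unless (subnodes = tree[node])

  subnodes.keys.each do |subnode|
    # While the child has exactly one grandchild, and that grandchild has
    # children of its own, let the child adopt those children directly.
    while (subsubnodes = tree[subnode]) && subsubnodes.size == 1 &&
          (below = tree[subsubnodes.keys.first])
      tree[subnode] = below
    end
    compact(tree, subnode)
  end
end

tree = {
  'Genes'          => { 'Plants' => nil },
  'Plants'         => { 'Monocotyledons' => nil, 'Eudicots' => nil },
  'Monocotyledons' => { 'grass family' => nil },
  'grass family'   => { 'osa' => nil },
  'Eudicots'       => { 'mustard family' => nil },
  'mustard family' => { 'ath' => nil }
}
compact(tree, 'Genes')

p tree['Monocotyledons'].keys # => ["osa"]
p tree['Eudicots'].keys       # => ["ath"]
```

The bypassed nodes (`grass family`, `mustard family`) keep their own entries in the Hash, but they are no longer reachable from the root.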
Traverses the taxonomic tree depth-first, starting from the given (root or intermediate) node.
# File lib/bio/db/kegg/taxonomy.rb, line 224
def dfs(parent, &block)
  if children = @tree[parent]
    yield parent, children
    children.keys.each do |child|
      dfs(child, &block)
    end
  end
end
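As a rough illustration of the traversal order, the standalone sketch below runs the same recursion over a hand-built Hash. Passing the tree as an argument is an assumption made to keep the example self-contained, and the node names are sample data.

```ruby
# Standalone variant of the method above; `tree` replaces @tree.
def dfs(tree, parent, &block)
  return unless (children = tree[parent])

  yield parent, children
  children.keys.each { |child| dfs(tree, child, &block) }
end

tree = {
  'Genes'      => { 'Eukaryotes' => nil },
  'Eukaryotes' => { 'Fungi' => nil, 'Plants' => nil },
  'Fungi'      => { 'sce' => nil },
  'Plants'     => { 'osa' => nil }
}

visited = []
dfs(tree, 'Genes') { |parent, _children| visited << parent }
p visited # => ["Genes", "Eukaryotes", "Fungi", "Plants"]
```

Only nodes that have children are yielded, so leaf nodes such as `sce` appear in the `children` argument but never as `parent`.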
Similar to the dfs method, but also passes the current nesting level to the block.
# File lib/bio/db/kegg/taxonomy.rb, line 235
def dfs_with_level(parent, &block)
  @level ||= 0
  if children = @tree[parent]
    yield parent, children, @level
    @level += 1
    children.keys.each do |child|
      dfs_with_level(child, &block)
    end
    @level -= 1
  end
end
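A typical use of the level is printing an indented outline. The sketch below is a standalone variation rather than the library method: it threads the level through as an argument instead of keeping it in `@level`, and `outline` is a hypothetical helper added for the example.

```ruby
# Standalone variation: the level is an explicit argument, not @level.
def dfs_with_level(tree, parent, level = 0, &block)
  return unless (children = tree[parent])

  yield parent, children, level
  children.keys.each { |child| dfs_with_level(tree, child, level + 1, &block) }
end

# Hypothetical helper: one indented line per internal node.
def outline(tree, root)
  lines = []
  dfs_with_level(tree, root) do |node, _children, level|
    lines << "#{'  ' * level}#{node}"
  end
  lines
end

tree = {
  'Genes'      => { 'Eukaryotes' => nil },
  'Eukaryotes' => { 'Fungi' => nil, 'Plants' => nil },
  'Fungi'      => { 'sce' => nil },
  'Plants'     => { 'osa' => nil }
}
puts outline(tree, 'Genes')
# Genes
#   Eukaryotes
#     Fungi
#     Plants
```

Carrying the level as an argument avoids shared state between calls; the instance-variable approach shown in the listing yields the same levels for a traversal that runs to completion.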
Reduces the leaf nodes of the resulting taxonomic tree.
- If a parent node has only one leaf node, the parent node is replaced with that leaf node. ex. Plants / Monocotyledons / osa --> Plants / osa
# File lib/bio/db/kegg/taxonomy.rb, line 196
def reduce(node = root)
  # if the node has children
  if subnodes = @tree[node]
    # obtain grandchildren for each child
    subnodes.keys.each do |subnode|
      if subsubnodes = @tree[subnode]
        # if the number of grandchild node is 1
        if subsubnodes.keys.size == 1
          # obtain the name of the grandchild node
          subsubnode = subsubnodes.keys.first
          # if the grandchild node is a leaf node
          unless @tree[subsubnode]
            # make the grandchild node as a child node
            @tree[node].update(subsubnodes)
            # delete child node
            @tree[node].delete(subnode)
            warn "--- reduce: #{subnode} is replaced by #{subsubnode}" if $DEBUG
          end
        end
      end
      # repeat recursively
      reduce(subnode)
    end
  end
end
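The reduction step can be checked the same way on a small hand-built tree. This is a standalone sketch with the tree passed as an argument (the real method works on `@tree`); the node names are sample data chosen to reproduce the Plants / Monocotyledons / osa example.

```ruby
# Standalone sketch: replace a node that has a single leaf with that leaf.
def reduce(tree, node)
  return unless (subnodes = tree[node])

  subnodes.keys.each do |subnode|
    if (subsubnodes = tree[subnode]) && subsubnodes.size == 1
      only_child = subsubnodes.keys.first
      unless tree[only_child]
        # The only grandchild is a leaf: attach it to `node` directly
        # and drop the intermediate child.
        tree[node].update(subsubnodes)
        tree[node].delete(subnode)
      end
    end
    reduce(tree, subnode)
  end
end

tree = {
  'Genes'          => { 'Eukaryotes' => nil },
  'Eukaryotes'     => { 'Plants' => nil, 'Fungi' => nil },
  'Plants'         => { 'Monocotyledons' => nil },
  'Monocotyledons' => { 'osa' => nil },
  'Fungi'          => { 'sce' => nil, 'spo' => nil }
}
reduce(tree, 'Genes')

p tree['Plants'].keys # => ["osa"]
p tree['Fungi'].keys  # => ["sce", "spo"]
```

`Fungi` is left alone because it has two leaves. The traversal is a single pass, so a node that only becomes a one-leaf parent after its subtree was reduced (here `Plants`) is not reduced again in the same call.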