Let's start by looking at the GFF file format. Because these are meant to be a flexible file format, it contains some required information and lots of optional information. The rows include:
Here is the first line of one of the homework files:
NC_060925.1
Curated Genomic
pseudogene
144134
146717
.
-
.
ID=gene-SEPTIN14P14;Dbxref=GeneID:107105354,HGNC:HGNC:51701;Name=SEPTIN14P14;description=septin 14 pseudogene 14;gbkey=Gene;gene=SEPTIN14P14;gene_biotype=pseudogene;gene_synonym=SEPT14P14;pseudo=true
In order to focus on the information we are interested in, we first need to filter out lines that aren't of the correct type of feature.
Where will we find that information?
In order to focus on the information we are interested in, we first need to filter out lines that aren't of the correct type of feature.
Where will we find that information?
1for line in open(fname):
2 line = line.rstrip().split("\t")
3 if line[2] != "pseudogene":
4 continue
We also want to know how long is each pseudogene. How can we get that information?
We also want to know how long is each pseudogene. How can we get that information?
1for line in open(fname):
2 line = line.rstrip().split("\t")
3 if line[2] != "pseudogene":
4 continue
5 length = int(line[6]) - int(line[5])
Finally, we need to separate all of the attributes. Since we know that they should be in tag-value pairs, this seems like an ideal case for a dictionary, especially since there is no gaurentee that each line will have all the same attributes.
To begin with, we need a break apart the attributes field. We know that each pair is separated by a semicolon so let's make a list using that as the delimiter.
1fields = line[8].split(";")
Now if we look at fields[0], we see it is equal to "ID=gene-SEPTIN14P14"
How do we handle working on one list item at a time? A for loop! And inside the loop, we need to break up the tag-value pairs into their two parts.
1info = {}
2for pair in fields:
3 key, value = pair.split("=")
4 info[key] = value
The gene name is one of the attributes under the tag "gene". But before we add it to our master dictionary, we need to check if there has already been an entry with the gene name (there are multiple occurances of some pseudogenes). Also, if it is a duplicate, we want to keep track of how many copies and the largest size.
To do this, we can use a conditional statement. Assuming that we have a dictionary named genes to keep track of gene entries...
1name = info['gene']
2if name not in genes: # We've never seen this gene
3 genes[name] = info
4 genes[name]['count'] = 1 # We need to be able to keep count
5 genes[name]['length'] = length # We need to keep track of lengths
6else:
7 genes[name]['count'] += 1 # We add another sighting
8 # Only keep the biggest length
9 genes[name]['length'] = max(genes[name]['length'], length)
1genes = {}
2for line in open(fname):
3 line = line.rstrip().split("\t")
4 if line[2] != "pseudogene":
5 continue
6 length = int(line[6]) - int(line[5])
7 info = {}
8 for pair in fields:
9 key, value = pair.split("=")
10 info[key] = value
11 name = info['gene']
12 if name not in genes:
13 genes[name] = info
14 genes[name]['count'] = 1
15 genes[name]['length'] = length
16 else:
17 genes[name]['count'] += 1
18 genes[name]['length'] = max(genes[name]['length'], length)
Writing a GFF file parser | 1 |
---|
Table of Contents | t |
---|---|
Exposé | ESC |
Full screen slides | e |
Presenter View | p |
Source Files | s |
Slide Numbers | n |
Toggle screen blanking | b |
Show/hide slide context | c |
Notes | 2 |
Help | h |