Writing a GFF file parser

Let's start by looking at the GFF file format. Because these are meant to be a flexible file format, it contains some required information and lots of optional information. The rows include:

  1. seqid The name of the sequence where the feature is located
  2. source The procedure or algorithm used to generate the feature
  3. type The type of feature
  4. start The starting base position (1-based)
  5. end The ending base position (1-based, inclusive)
  6. score Numeric value generally indicating the confidence of the source in this feature (. for null value)
  7. strand Single character indicating the strand of the feature: +, -, . for undertermined, ? for unknown
  8. phase The phase of CDS features, 0, 1, or 2 (for CDS features) or . for everything else
  9. attributes A list of tag-value pairs separated by semicolons with additional information

Presenter Notes

And example entry

Here is the first line of one of the homework files:

NC_060925.1
Curated Genomic
pseudogene
144134
146717
.
-
.
ID=gene-SEPTIN14P14;Dbxref=GeneID:107105354,HGNC:HGNC:51701;Name=SEPTIN14P14;description=septin 14 pseudogene 14;gbkey=Gene;gene=SEPTIN14P14;gene_biotype=pseudogene;gene_synonym=SEPT14P14;pseudo=true

Presenter Notes

Which fields are relevant?

In order to focus on the information we are interested in, we first need to filter out lines that aren't of the correct type of feature.

Where will we find that information?

Presenter Notes

Which fields are relevant?

In order to focus on the information we are interested in, we first need to filter out lines that aren't of the correct type of feature.

Where will we find that information?

1for line in open(fname):
2    line = line.rstrip().split("\t")
3    if line[2] != "pseudogene":
4        continue

Presenter Notes

How long?

We also want to know how long is each pseudogene. How can we get that information?

Presenter Notes

How long?

We also want to know how long is each pseudogene. How can we get that information?

1for line in open(fname):
2    line = line.rstrip().split("\t")
3    if line[2] != "pseudogene":
4        continue
5    length = int(line[6]) - int(line[5])

Presenter Notes

How can we break up the attributes field?

Finally, we need to separate all of the attributes. Since we know that they should be in tag-value pairs, this seems like an ideal case for a dictionary, especially since there is no gaurentee that each line will have all the same attributes.

To begin with, we need a break apart the attributes field. We know that each pair is separated by a semicolon so let's make a list using that as the delimiter.

1fields = line[8].split(";")

Now if we look at fields[0], we see it is equal to "ID=gene-SEPTIN14P14"

Presenter Notes

Going from a list to a dictionary

How do we handle working on one list item at a time? A for loop! And inside the loop, we need to break up the tag-value pairs into their two parts.

1info = {}
2for pair in fields:
3    key, value = pair.split("=")
4    info[key] = value

Presenter Notes

Keeping track of one entry per gene

The gene name is one of the attributes under the tag "gene". But before we add it to our master dictionary, we need to check if there has already been an entry with the gene name (there are multiple occurances of some pseudogenes). Also, if it is a duplicate, we want to keep track of how many copies and the largest size.

To do this, we can use a conditional statement. Assuming that we have a dictionary named genes to keep track of gene entries...

1name = info['gene']
2if name not in genes:               # We've never seen this gene
3    genes[name] = info
4    genes[name]['count'] = 1        # We need to be able to keep count
5    genes[name]['length'] = length  # We need to keep track of lengths
6else:
7    genes[name]['count'] += 1       # We add another sighting
8    # Only keep the biggest length
9    genes[name]['length'] = max(genes[name]['length'], length)

Presenter Notes

Putting it all together

 1genes = {}
 2for line in open(fname):
 3    line = line.rstrip().split("\t")
 4    if line[2] != "pseudogene":
 5        continue
 6    length = int(line[6]) - int(line[5])
 7    info = {}
 8    for pair in fields:
 9        key, value = pair.split("=")
10        info[key] = value
11    name = info['gene']
12    if name not in genes:
13        genes[name] = info
14        genes[name]['count'] = 1
15        genes[name]['length'] = length
16    else:
17        genes[name]['count'] += 1
18        genes[name]['length'] = max(genes[name]['length'], length)

Presenter Notes