Loading GFF file(s) into a pandas dataframe and searching by features/attributes

GFF: gene-finding format or generic feature format.

Sep 22, 2023

GFF files store gene co-ordinates of a genome. For more on GFF format, please visit: https://en.wikipedia.org/wiki/General_feature_format

Anyway, I was looking into ways of reading GFF files in python and came up with couple of python modules that can do that, namely - gffutils and gffpandas. Tried to check them out, but gffutils seemed a bit complicated and gffpandas was unable to open the GFF file for Arabidopsis 11 genome release (link).

So I directly attempted to load the data into pandas dataframe and filter/search using pandas functions/features.

Loading the GFF file (Gzipped) was pretty straitforward:

import pandas as pd
tair_df = pd.read_csv('Araport11_GFF3_genes_transposons.current.gff.gz', 
                      sep='\t', 
                      comment='#', 
                      encoding='cp1252', 
                      header=None, 
                     compression='gzip')

tair_df.columns = ['chromosome', 'data_source', 'type', 'start', 'end', 'score', 'strand', 'frame', 'features']

This was pretty straightforward. The features colum, however, had a lot of info. For example, the feature colum for the first row had:

'ID=AT1G01010;Name=AT1G01010;Note=NAC domain containing protein 1;symbol=NAC001;full_name=NAC domain containing protein 1;computational_description=NAC domain containing protein 1;locus=2200935;locus_type=protein_coding'

It seemes these are some of the features (or attributes in gffpandas vocabulary) that one would like to filter on. So, how to achieve that?

One way of doing so would be to convert this features string to a dictionary, where each feature will be a dictionary key and the feature description will be in the values of the dictionary. To convert the features string into a dictionary, we can use the following fuction:

def string_to_dict(input_string):
    '''Function to convert feature string into a dictionary.'''
    try:
        # Split the input string into a list of key-value pairs using semicolon as the delimiter
        feature_list = input_string.split(';')

        # Filter the list to only include items that contain an equal sign ('=')
        feature_list = [x for x in feature_list if '=' in x]

        # Create a dictionary by splitting each key-value pair at the first equal sign encountered
        feature_dict = dict(x.split('=', 1) for x in feature_list)
    except:
        # If there's an exception (e.g., invalid input), print the original input string and return an empty dictionary
        print(input_string)
        feature_dict = dict()
    
    # Return the resulting dictionary
    return feature_dict

# The code provided assumes there's a variable 'inp_str' containing the input string
# You should replace 'inp_str' with an actual input string before calling the function.
# Example usage:
# input_string = "key1=value1;key2=value2;key3=value3"
# result = string_to_dict(input_string)
# This will convert the input string into a dictionary and store it in the 'result' variable.

To convert the features string into a dictionary and storing as a new column:

tair_df['features_dict'] = tair_df.features.apply(string_to_dict)

Filtering entries:

The dataframe has two types of data to filter on. Some key attributes/features are present in single column (e.g., data_source, type, start, end, frame, strand). It is realtively easy to filter based on them. For example, looking for all proteins that are in the “+” strand:

tair_df[(tair_df.type == 'protein') & (tair_df.strand == '+')]

       chromosome data_source     type     start       end score strand frame  \
15           Chr1   Araport11  protein    3760.0    5630.0     .      +     .   
200          Chr1   Araport11  protein   23519.0   31079.0     .      +     .   
244          Chr1   Araport11  protein   23519.0   31079.0     .      +     .   
603          Chr1   Araport11  protein   53022.0   54494.0     .      +     .   
617          Chr1   Araport11  protein   52239.0   54494.0     .      +     .   
...           ...         ...      ...       ...       ...   ...    ...   ...   
927135       ChrC   Araport11  protein  139856.0  140650.0     .      +     .   
927140       ChrC   Araport11  protein  140704.0  141171.0     .      +     .   
927147       ChrC   Araport11  protein  141854.0  143708.0     .      +     .   
927168       ChrC   Araport11  protein  152506.0  152787.0     .      +     .   
927175       ChrC   Araport11  protein  152806.0  154312.0     .      +     .   

                                                 features  \
15      ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derive...   
200     ID=AT1G01040.1-Protein;Name=AT1G01040.1;Derive...   
244     ID=AT1G01040.2-Protein;Name=AT1G01040.2;Derive...   
603     ID=AT1G01110.1-Protein;Name=AT1G01110.1;Derive...   
617     ID=AT1G01110.2-Protein;Name=AT1G01110.2;Derive...   
...                                                   ...   
927135  ID=ATCG01230.1-Protein;Name=ATCG01230.1;Derive...   
927140  ID=ATCG01240.1-Protein;Name=ATCG01240.1;Derive...   
927147  ID=ATCG01250.1-Protein;Name=ATCG01250.1;Derive...   
927168  ID=ATCG01300.1-Protein;Name=ATCG01300.1;Derive...   
927175  ID=ATCG01310.1-Protein;Name=ATCG01310.1;Derive...

Filtering based on the data within the feature string (or feature dictionary), for example - looking for all protein_coding genes:

tair_df[tair_df.features_dict.apply(lambda x: x.get('locus_type', '') == 'protein_coding')]

       chromosome data_source  type     start       end score strand frame  \
0            Chr1   Araport11  gene    3631.0    5899.0     .      +     .   
17           Chr1   Araport11  gene    6788.0    9130.0     .      -     .   
138          Chr1   Araport11  gene   11649.0   13714.0     .      -     .   
157          Chr1   Araport11  gene   23121.0   31227.0     .      +     .   
249          Chr1   Araport11  gene   31170.0   33171.0     .      -     .   
...           ...         ...   ...       ...       ...   ...    ...   ...   
927349       ChrM   Araport11  mRNA   79559.0   80005.0     .      +     .   
927351       ChrM   Araport11  gene   58315.0   59481.0     .      +     .   
927352       ChrM   Araport11  mRNA   58315.0   59481.0     .      +     .   
927363       ChrM   Araport11  gene  114762.0  116345.0     .      -     .   
927364       ChrM   Araport11  mRNA  114762.0  116345.0     .      -     .   

                                                 features  \
0       ID=AT1G01010;Name=AT1G01010;Note=NAC domain co...   
17      ID=AT1G01020;Name=AT1G01020;Note=Arv1-like pro...   
138     ID=AT1G01030;Name=AT1G01030;Note=AP2/B3-like t...   
157     ID=AT1G01040;Name=AT1G01040;Note=dicer-like 1;...   
249     ID=AT1G01050;Name=AT1G01050;Note=pyrophosphory...   
...                                                   ...   
927349  ID=ATMG01270.1;Parent=ATMG01270;id2=gene-rps7;...   
927351  ID=ATMG01275;id2=gene-nad1;Name=nad1;exception...   
927352  ID=ATMG01275.1;Parent=ATMG01275;id2=gene-nad1;...   
927363  ID=ATMG01360;id2=gene-cox1;Name=cox1;gbkey=Gen...   
927364  ID=ATMG01360.1;Parent=ATMG01360;id2=gene-cox1;...   

                                            features_dict  
0       {'ID': 'AT1G01010', 'Name': 'AT1G01010', 'Note...  
17      {'ID': 'AT1G01020', 'Name': 'AT1G01020', 'Note...  
138     {'ID': 'AT1G01030', 'Name': 'AT1G01030', 'Note...  
157     {'ID': 'AT1G01040', 'Name': 'AT1G01040', 'Note...  
249     {'ID': 'AT1G01050', 'Name': 'AT1G01050', 'Note...  
...                                                   ...  
927349  {'ID': 'ATMG01270.1', 'Parent': 'ATMG01270', '...  
927351  {'ID': 'ATMG01275', 'id2': 'gene-nad1', 'Name'...  
927352  {'ID': 'ATMG01275.1', 'Parent': 'ATMG01275', '...  
927363  {'ID': 'ATMG01360', 'id2': 'gene-cox1', 'Name'...  
927364  {'ID': 'ATMG01360.1', 'Parent': 'ATMG01360', '...  

[27483 rows x 10 columns]

That’s it for now!

Link to GitHub repo: https://github.com/Mahdi-Moosa/Load_GFF_to_Pandas/

Mahdi’s Musings

Discussion about this post