Loading GFF file(s) into a pandas dataframe and searching by features/attributes
GFF: gene-finding format or generic feature format.
GFF files store gene co-ordinates of a genome. For more on GFF format, please visit: https://en.wikipedia.org/wiki/General_feature_format
Anyway, I was looking into ways of reading GFF files in python and came up with couple of python modules that can do that, namely - gffutils and gffpandas. Tried to check them out, but gffutils seemed a bit complicated and gffpandas was unable to open the GFF file for Arabidopsis 11 genome release (link).
So I directly attempted to load the data into pandas dataframe and filter/search using pandas functions/features.
Loading the GFF file (Gzipped) was pretty straitforward:
import pandas as pd
tair_df = pd.read_csv('Araport11_GFF3_genes_transposons.current.gff.gz',
sep='\t',
comment='#',
encoding='cp1252',
header=None,
compression='gzip')
tair_df.columns = ['chromosome', 'data_source', 'type', 'start', 'end', 'score', 'strand', 'frame', 'features']
This was pretty straightforward. The features colum, however, had a lot of info. For example, the feature colum for the first row had:
'ID=AT1G01010;Name=AT1G01010;Note=NAC domain containing protein 1;symbol=NAC001;full_name=NAC domain containing protein 1;computational_description=NAC domain containing protein 1;locus=2200935;locus_type=protein_coding'
It seemes these are some of the features (or attributes in gffpandas vocabulary) that one would like to filter on. So, how to achieve that?
One way of doing so would be to convert this features string to a dictionary, where each feature will be a dictionary key and the feature description will be in the values of the dictionary. To convert the features string into a dictionary, we can use the following fuction:
def string_to_dict(input_string):
'''Function to convert feature string into a dictionary.'''
try:
# Split the input string into a list of key-value pairs using semicolon as the delimiter
feature_list = input_string.split(';')
# Filter the list to only include items that contain an equal sign ('=')
feature_list = [x for x in feature_list if '=' in x]
# Create a dictionary by splitting each key-value pair at the first equal sign encountered
feature_dict = dict(x.split('=', 1) for x in feature_list)
except:
# If there's an exception (e.g., invalid input), print the original input string and return an empty dictionary
print(input_string)
feature_dict = dict()
# Return the resulting dictionary
return feature_dict
# The code provided assumes there's a variable 'inp_str' containing the input string
# You should replace 'inp_str' with an actual input string before calling the function.
# Example usage:
# input_string = "key1=value1;key2=value2;key3=value3"
# result = string_to_dict(input_string)
# This will convert the input string into a dictionary and store it in the 'result' variable.
To convert the features string into a dictionary and storing as a new column:
tair_df['features_dict'] = tair_df.features.apply(string_to_dict)
Filtering entries:
The dataframe has two types of data to filter on. Some key attributes/features are present in single column (e.g., data_source, type, start, end, frame, strand). It is realtively easy to filter based on them. For example, looking for all proteins that are in the “+” strand:
tair_df[(tair_df.type == 'protein') & (tair_df.strand == '+')]
chromosome data_source type start end score strand frame \
15 Chr1 Araport11 protein 3760.0 5630.0 . + .
200 Chr1 Araport11 protein 23519.0 31079.0 . + .
244 Chr1 Araport11 protein 23519.0 31079.0 . + .
603 Chr1 Araport11 protein 53022.0 54494.0 . + .
617 Chr1 Araport11 protein 52239.0 54494.0 . + .
... ... ... ... ... ... ... ... ...
927135 ChrC Araport11 protein 139856.0 140650.0 . + .
927140 ChrC Araport11 protein 140704.0 141171.0 . + .
927147 ChrC Araport11 protein 141854.0 143708.0 . + .
927168 ChrC Araport11 protein 152506.0 152787.0 . + .
927175 ChrC Araport11 protein 152806.0 154312.0 . + .
features \
15 ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derive...
200 ID=AT1G01040.1-Protein;Name=AT1G01040.1;Derive...
244 ID=AT1G01040.2-Protein;Name=AT1G01040.2;Derive...
603 ID=AT1G01110.1-Protein;Name=AT1G01110.1;Derive...
617 ID=AT1G01110.2-Protein;Name=AT1G01110.2;Derive...
... ...
927135 ID=ATCG01230.1-Protein;Name=ATCG01230.1;Derive...
927140 ID=ATCG01240.1-Protein;Name=ATCG01240.1;Derive...
927147 ID=ATCG01250.1-Protein;Name=ATCG01250.1;Derive...
927168 ID=ATCG01300.1-Protein;Name=ATCG01300.1;Derive...
927175 ID=ATCG01310.1-Protein;Name=ATCG01310.1;Derive...
Filtering based on the data within the feature string (or feature dictionary), for example - looking for all protein_coding genes:
tair_df[tair_df.features_dict.apply(lambda x: x.get('locus_type', '') == 'protein_coding')]
chromosome data_source type start end score strand frame \
0 Chr1 Araport11 gene 3631.0 5899.0 . + .
17 Chr1 Araport11 gene 6788.0 9130.0 . - .
138 Chr1 Araport11 gene 11649.0 13714.0 . - .
157 Chr1 Araport11 gene 23121.0 31227.0 . + .
249 Chr1 Araport11 gene 31170.0 33171.0 . - .
... ... ... ... ... ... ... ... ...
927349 ChrM Araport11 mRNA 79559.0 80005.0 . + .
927351 ChrM Araport11 gene 58315.0 59481.0 . + .
927352 ChrM Araport11 mRNA 58315.0 59481.0 . + .
927363 ChrM Araport11 gene 114762.0 116345.0 . - .
927364 ChrM Araport11 mRNA 114762.0 116345.0 . - .
features \
0 ID=AT1G01010;Name=AT1G01010;Note=NAC domain co...
17 ID=AT1G01020;Name=AT1G01020;Note=Arv1-like pro...
138 ID=AT1G01030;Name=AT1G01030;Note=AP2/B3-like t...
157 ID=AT1G01040;Name=AT1G01040;Note=dicer-like 1;...
249 ID=AT1G01050;Name=AT1G01050;Note=pyrophosphory...
... ...
927349 ID=ATMG01270.1;Parent=ATMG01270;id2=gene-rps7;...
927351 ID=ATMG01275;id2=gene-nad1;Name=nad1;exception...
927352 ID=ATMG01275.1;Parent=ATMG01275;id2=gene-nad1;...
927363 ID=ATMG01360;id2=gene-cox1;Name=cox1;gbkey=Gen...
927364 ID=ATMG01360.1;Parent=ATMG01360;id2=gene-cox1;...
features_dict
0 {'ID': 'AT1G01010', 'Name': 'AT1G01010', 'Note...
17 {'ID': 'AT1G01020', 'Name': 'AT1G01020', 'Note...
138 {'ID': 'AT1G01030', 'Name': 'AT1G01030', 'Note...
157 {'ID': 'AT1G01040', 'Name': 'AT1G01040', 'Note...
249 {'ID': 'AT1G01050', 'Name': 'AT1G01050', 'Note...
... ...
927349 {'ID': 'ATMG01270.1', 'Parent': 'ATMG01270', '...
927351 {'ID': 'ATMG01275', 'id2': 'gene-nad1', 'Name'...
927352 {'ID': 'ATMG01275.1', 'Parent': 'ATMG01275', '...
927363 {'ID': 'ATMG01360', 'id2': 'gene-cox1', 'Name'...
927364 {'ID': 'ATMG01360.1', 'Parent': 'ATMG01360', '...
[27483 rows x 10 columns]
That’s it for now!
Link to GitHub repo: https://github.com/Mahdi-Moosa/Load_GFF_to_Pandas/