FONtools combines:
With FONtools:
FON is a program-friendly and extensible format to store genomic annotations. Because FON stores data as JSON (JavaScript Object Notation), we named the format FON, for Feature Object Notation.
Genomic annotations are mainly stored in files using the BED or GFF (specifically GFF3) formats.
Format | Base format | Parse | Hierarchical | Extensible | Coordinates |
---|---|---|---|---|---|
BED | tab | Simple | No | Limited | 0-based |
GFF | tab | Complex | Yes | Yes | 1-based |
FON | JSON | Existing JSON libraries | Possible | Yes | 0-based |
While BED is simple to parse, it was not designed to store hierarchical annotations, such as exons on a transcript. Instead, BED12 "sub-splits" columns using commas instead of tabs to store the exon coordinates of transcripts. GFF, in contrast, allows hierarchical annotations but is difficult to parse: it translates such structures into multiple records linked by a common ID. This approach is generic and describes annotations as a graph, and therefore requires more complex parsing code.
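To make the Coordinates column of the table concrete, here is a minimal Python sketch converting between a GFF3 interval (1-based, end inclusive) and a 0-based interval. It assumes half-open [start, end) intervals on the 0-based side, which is the BED convention; the function names are illustrative and not part of FONtools.

```python
def gff_to_zero_based(start, end):
    """Convert a 1-based, end-inclusive interval to 0-based, half-open."""
    return [start - 1, end]


def zero_based_to_gff(start, end):
    """Convert a 0-based, half-open interval to 1-based, end-inclusive."""
    return [start + 1, end]


# The first 100 bases of a sequence: 1..100 in GFF3, [0, 100) when 0-based.
print(gff_to_zero_based(1, 100))
```

With half-open intervals, the length of a feature is simply end minus start.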
To overcome these limitations, FON format enables simple parsing and hierarchical annotation storage by capitalizing on the strengths of the JSON format:
FON isn't intended to replace GFF to share genomic annotations, but rather to simplify, ease and streamline the use of annotations within programs and pipelines.
FON1 is the first version of the Feature Object Notation format. It stores features in a list. Each feature is a dictionary with a set of defined keys. Keys can be freely added to or removed from each feature; none of them is required. Programs using specific key(s) should provide an option to select, by name, which key(s) to use (for example the --key option of fon_mask_fasta). Chromosomes and scaffolds can be described in the assembly key with their name, level and length.
Example for one zebrafish transcript (one feature):
{
"fon_version": 1,
"assembly": [
{
"name": "1",
"level": "chromosome",
"length": 59578282
},
{
"name": "2",
"level": "chromosome",
"length": 59640629
},
{
"name": "KN149708.1",
"level": "scaffold",
"length": 20567
}
],
"features": [
{
"transcript_stable_id": "ENSDART00000171909",
"gene_stable_id": "ENSDARG00000099339",
"gene_name": "pacsin3",
"protein_stable_id": "ENSDARP00000138886",
"chrom": "7",
"strand": "+",
"transcript_version": "2",
"gene_version": "4",
"transcript_biotype": "protein_coding",
"gene_biotype": "protein_coding",
"exons": [[54260003, 54260129], [54263505, 54263662]],
"exons_on_transcript": [[0, 126], [126, 283]],
"cds_exons": [[54260075, 54260129], [54263505, 54263662]],
"cds_exons_on_transcript": [[72, 126], [126, 283]],
"cds_exons_frame": [0, 0],
"cds_exons_frame_on_transcript": [0, 0],
"utr5_exons": [[54260003, 54260075]],
"utr5_exons_on_transcript": [[0, 72]],
"utr3_exons": [],
"utr3_exons_on_transcript": [],
"seq": "TTGGTCTCGCGTCTTGTTCTTCACAGTTTGACGACAGCCGCCATCATTCCGTGCTGCAAGGGCGACCCCAAAATGTCTTCCAACGGTGATCTGCAGGACGTTGGGAGTTGGGACAGCTTCTGGGAGCCTGGAAACTACAAGAGGACGGTTAAGCGCATTGACGACGGCTACAAACTTTGCAACGAGCTGGTCAGCTGCTTCCAGGAGCGGGCCAAGATTGAGAAGGGCTATTCCCAGCAGCTGAGCGACTGGGCTAGGAAATGGAGAGGCATTGTGGAGAAAG",
"go": {
"GO:0097320": {
"term": "plasma membrane tubulation",
"domain": "biological_process",
"sources": []
}
}
}
]
}
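Since FON1 is plain JSON, it can be read with any JSON library. The following minimal Python sketch parses a shortened copy of the example above (only a few keys are kept) and computes exon lengths. It is an illustration using the standard json module, not a FONtools API.

```python
import json

# Shortened copy of the FON1 example, kept inline so the sketch is self-contained.
fon_text = """
{
  "fon_version": 1,
  "features": [
    {
      "transcript_stable_id": "ENSDART00000171909",
      "chrom": "7",
      "strand": "+",
      "exons": [[54260003, 54260129], [54263505, 54263662]],
      "exons_on_transcript": [[0, 126], [126, 283]]
    }
  ]
}
"""

fon = json.loads(fon_text)
for feature in fon["features"]:
    # Each exon is a [start, end] pair; its length is end - start.
    exon_lengths = [end - start for start, end in feature["exons"]]
    print(feature["transcript_stable_id"], feature["chrom"], feature["strand"], exon_lengths)
```

For a file on disk, json.load on an open file handle would replace json.loads.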
Field descriptions:
exons, cds_exons, utr5_exons, utr3_exons: Lists of exons, exons from the coding sequence (CDS), and exons from the 5' and 3' UTRs. Each exon is a list of start and end genomic coordinates. All coordinates are 0-based and relative to the forward genomic strand. In the example above, the transcript ENSDART00000171909 starts at position 54260003 on chromosome 7; this is equal to the first exon start.
XX_on_transcript: Lists of exon, CDS, etc. coordinates translated to transcript coordinates. The first exon start is equal to 0 and the last exon end is equal to the length of the transcript. Translation takes the strand of the transcript into account: coordinates run forward along the transcript, making them directly usable on the transcript sequence stored in seq.
cds_exons_frame: Frame of the first nucleotide. With the coding sequence ATGGCA, the following 4 exons would have frame 0, 1, 2 and 0 respectively:
0
|
ATG-GCA
1
|
TG-GCA
2
|
G-GCA
0
|
GCA
seq: contains the transcript sequence.
go: holds Gene Ontology (GO) terms, domains and sources. Optional: GO is only imported with the --go option of the import_ensembl script.
Future versions might address limitations of the currently available FON version. For example, FON1 doesn't allow features to be stored hierarchically; they are stored in a list. Contributions to add new FON versions are welcome.
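The relation between genomic exons, the XX_on_transcript coordinates and cds_exons_frame described in the field descriptions can be sketched in a few lines of Python. This is a sketch under assumptions, not the FONtools implementation: it only covers forward-strand ("+") transcripts with exons listed in genomic order, and it assumes the frame of an exon is the number of coding bases preceding it, modulo 3, which is consistent with the ATGGCA illustration and with the zebrafish example. The function names are hypothetical.

```python
def exons_on_transcript(exons):
    """Translate forward-strand genomic exons to transcript coordinates."""
    translated, offset = [], 0
    for start, end in exons:
        length = end - start
        translated.append([offset, offset + length])
        offset += length
    return translated


def cds_frames(cds_exons):
    """Frame of the first nucleotide of each CDS exon (assumed convention)."""
    frames, coding_bases = [], 0
    for start, end in cds_exons:
        frames.append(coding_bases % 3)
        coding_bases += end - start
    return frames


# Values taken from the zebrafish example.
print(exons_on_transcript([[54260003, 54260129], [54263505, 54263662]]))
print(cds_frames([[54260075, 54260129], [54263505, 54263662]]))
# ATGGCA split after its first base: the second exon starts with TG-GCA.
print(cds_frames([[0, 1], [1, 6]]))
```

Reverse-strand transcripts would additionally require reversing the exon order and direction, which this sketch deliberately leaves out.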
See tags page.
pip3 install fontools
If you don't have root permission, install in your home directory using the --user option:
pip3 install fontools --user
Scripts are installed in $HOME/.local/bin, which should be added to your shell PATH to run the scripts. After adding $HOME/.local/bin to your PATH, try:
import_ensembl -h
If you get an error message like import_ensembl: command not found, then your PATH isn't properly configured.
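The same check can be done from Python, for instance at the start of a pipeline. This small sketch uses shutil.which from the standard library; the helper name script_on_path is illustrative.

```python
import shutil


def script_on_path(name):
    """Return True if an executable called `name` is found in PATH."""
    return shutil.which(name) is not None


if not script_on_path("import_ensembl"):
    print("import_ensembl not found: check that $HOME/.local/bin is in your PATH")
```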
FONtools depends on pyfaidx for reading FASTA and pyfnutils for logging.
Script | Description |
---|---|
FON | |
import_ensembl | Import Ensembl sequence and annotations |
fon_import | Import annotations to FON (from GFF3 for now) |
fon_transform | Transform FON file |
FON/GFF3/FASTA/TAB | |
merge_annot | Merge FON/GFF3/FASTA files |
ensembl2ucsc | Convert names from Ensembl to UCSC (in FASTA, GFF3 and tab) |
FASTA | |
fasta_format | Format and/or Sort FASTA file (split sequence) |
fon_mask_fasta | Mask sequence (FASTA) using FON |
fasta_seq_length | Create tab file with sequence(s) length from FASTA file |
The import_ensembl script creates and maintains an Ensembl-based annotation repository including:
Annotations are imported using fon_import, then fon_transform is used:
To create FON files restricted to a biotype, for example protein coding transcripts,
To create FON files selecting the longest isoform of each gene,
To create "metagene" FON files obtained by merging all isoforms of a gene together. Example of how a metagene is obtained from 3 isoforms:
These "metagenes" can be used for counting HTS reads per gene, where reads mapping to any isoforms will map to the metagene.
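The idea of a metagene can be illustrated with a short Python sketch that merges the exons of several isoforms into a single set of non-overlapping intervals. This is only an illustration of the union principle, not the code used by fon_transform; it assumes half-open [start, end) intervals (so touching intervals are merged), and the coordinates are toy values.

```python
def merge_isoforms(isoforms):
    """Merge the exons of all isoforms into sorted, non-overlapping intervals."""
    intervals = sorted(exon for exons in isoforms for exon in exons)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


# Three toy isoforms of the same gene.
isoforms = [
    [[100, 200], [300, 400]],
    [[150, 250], [300, 400]],
    [[100, 200], [500, 600]],
]
print(merge_isoforms(isoforms))  # [[100, 250], [300, 400], [500, 600]]
```

A read overlapping any exon of any isoform then overlaps the merged intervals, which is what makes metagenes convenient for per-gene counting.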
The script is compatible with Ensembl and Ensembl Genomes (see option --division/-n).
The import_ensembl script aims to maintain a local Ensembl-based repository. Using it requires setting multiple options, but most of them will be the same each time import_ensembl is used. In most cases, the data will always be stored in the same directories, and only the options specifying the release number or the species will change and be given on the command line. For convenience, all import_ensembl options can therefore be set in a JSON config file, in addition to the command line. This config file can be placed:
In the directory given by $HTS_CONFIG_PATH. $HTS_CONFIG_PATH can be defined by the user.
In $XDG_CONFIG_HOME/hts. $XDG_CONFIG_HOME is defined by your desktop environment.
In any directory passed with the --path_config option, which sets where to find a fontools.json config file. This option is not used in this tutorial.
To configure the import_ensembl script:
mkdir /data/sai
mkdir /data/sai/download
This tutorial stores data in the /data/sai directory (sai stands for Sequence Annotations & Indices). This is intended for a system-wide installation. Alternatively, use a directory of your choice; for example, to store the data in a sai directory in your home:
mkdir ~/sai
mkdir ~/sai/download
Add to your ~/.bashrc:
export HTS_CONFIG_PATH="/etc/hts"
For a sai directory in your home:
mkdir ~/sai/config
To set the HTS_CONFIG_PATH environment variable, add to your ~/.bashrc:
export HTS_CONFIG_PATH="$HOME/sai/config"
Create a fontools.json config file in /etc/hts:
{
"fontools_path_main": "/data/sai",
"fontools_path_download": "/data/sai/download"
}
For a sai directory in your home (please replace smith with your username), create a fontools.json config file in ~/sai/config:
{
"fontools_path_main": "/home/smith/sai",
"fontools_path_download": "/home/smith/sai/download"
}
To log import_ensembl actions, you can create the following directory. A different log file is then automatically created for each Ensembl release. To specify the location of the log from the script, use the --path_log/-l option:
mkdir /data/sai/log
For a sai directory in your home:
mkdir ~/sai/log
mkdir /data/sai/annots
cd /data/sai/annots
For a sai directory in your home:
mkdir ~/sai/annots
cd ~/sai/annots
wget https://github.com/dpryan79/ChromosomeMappings/archive/refs/heads/master.tar.gz
tar xvfz master.tar.gz
rm -f master.tar.gz
mv ChromosomeMappings-master ChromosomeMappings
Add fontools_path_mapping to the fontools.json config file:
{
"fontools_path_main": "/data/sai",
"fontools_path_download": "/data/sai/download",
"fontools_path_mapping": "/data/sai/annots/ChromosomeMappings"
}
For a sai directory in your home (please replace smith with your username):
{
"fontools_path_main": "/home/smith/sai",
"fontools_path_download": "/home/smith/sai/download",
"fontools_path_mapping": "/home/smith/sai/annots/ChromosomeMappings"
}
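Rather than editing the username by hand, the home-directory variant of this config can be generated. The Python sketch below builds the same three keys shown above from a base directory; the helper name build_config is hypothetical, and only keys that appear in this tutorial are used.

```python
import json
from pathlib import Path


def build_config(base):
    """Build the fontools.json content for data stored under `base`."""
    base = Path(base)
    return {
        "fontools_path_main": str(base),
        "fontools_path_download": str(base / "download"),
        "fontools_path_mapping": str(base / "annots" / "ChromosomeMappings"),
    }


# Config for a sai directory in the current user's home.
config = build_config(Path.home() / "sai")
print(json.dumps(config, indent=4))
```

The printed JSON can be saved as fontools.json in the directory that HTS_CONFIG_PATH points to.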
If you haven't set the environment variable HTS_CONFIG_PATH (see above), then use either:
The --path_config option to set the directory in which to find a fontools.json config file, or
--fontools_path_main and --fontools_path_download on the command line.
To list available species (for Ensembl 104), use:
import_ensembl -r 104 -s list
To get Ensembl 104 data for 4 species using 10 cores:
import_ensembl -r 104 -s danio_rerio,saccharomyces_cerevisiae,homo_sapiens,mus_musculus -p 10
To select what data are generated, use the --steps/-t option. Currently, the following steps are available:
genome step: downloads FASTA genome sequences, maps chromosome/contig names to UCSC names if requested, sorts FASTA files, and creates the chromosome length file.
gene step: downloads GFF annotations, imports them to FON files, maps chromosome/contig names to UCSC names if requested, and creates FON files with:
bowtie2 and star steps: create indices for Bowtie2 and STAR respectively.
all: all of the above steps. This is the default.
To import terms, domains and sources from Gene Ontology (GO), add the --go option.
To import from Ensembl (-n) release 104 (-r), get FASTA and GFF (-t genome,gene) and convert to FON:
import_ensembl -n ensembl -r 104 -s caenorhabditis_elegans -t genome,gene
This command prints a detailed log, which is recorded in log/ensembl104.log:
2021-05-06 13:53:05,501 - import_ensembl - INFO - Starting (caenorhabditis_elegans,104)
2021-05-06 13:53:06,303 - import_ensembl - INFO - Found assembly WBcel235 (toplevel)
2021-05-06 13:53:06,304 - import_ensembl - INFO - Downloading fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz
2021-05-06 13:53:49,983 - import_ensembl - INFO - Downloading gff3/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.104.gff3.gz
2021-05-06 13:54:02,400 - import_ensembl - INFO - Downloading fasta/caenorhabditis_elegans/cdna/Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz
2021-05-06 13:54:22,319 - import_ensembl - INFO - Downloading fasta/caenorhabditis_elegans/ncrna/Caenorhabditis_elegans.WBcel235.ncrna.fa.gz
2021-05-06 13:54:25,103 - import_ensembl - INFO - Sorting to /data/sai/seqs/caeele_genome_all_ensembl_wbcel235.fa
2021-05-06 13:54:25,104 - import_ensembl - INFO - Start ['fasta_format', '--sort', '--input', '/data/sai/download/ftp.ensembl.org/pub/release-104/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz', '--output', '/data/sai/seqs/caeele_genome_all_ensembl_wbcel235.fa']
2021-05-06 13:54:28,613 - import_ensembl - INFO - Start ['cp', '/data/sai/download/ftp.ensembl.org/pub/release-104/gff3/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.104.gff3.gz', '/data/sai/annots/caeele_cdna_all_ensembl104.gff3.gz']
2021-05-06 13:54:28,617 - import_ensembl - INFO - Start ['gzip', '-d', '/data/sai/annots/caeele_cdna_all_ensembl104.gff3.gz']
2021-05-06 13:54:28,858 - import_ensembl - INFO - Creating chromosome length file /data/sai/annots/caeele_genome_all_ensembl_wbcel235_chrom_length.tab
2021-05-06 13:54:28,859 - import_ensembl - INFO - Start ['fasta_seq_length', '--input', '/data/sai/seqs/caeele_genome_all_ensembl_wbcel235.fa', '--output', '/data/sai/annots/caeele_genome_all_ensembl_wbcel235_chrom_length.tab']
2021-05-06 13:54:29,322 - import_ensembl - INFO - Creating chromosome length file for UCSC /data/sai/annots/caeele_genome_all_ensembl_wbcel235_ucsc_names_chrom_length.tab
2021-05-06 13:54:29,322 - import_ensembl - INFO - Start ['ensembl2ucsc', '--input', '/data/sai/annots/caeele_genome_all_ensembl_wbcel235_chrom_length.tab', '--output', '/data/sai/annots/caeele_genome_all_ensembl_wbcel235_ucsc_names_chrom_length.tab', '--path_mapping', '/data/sai/annots/ChromosomeMappings/WBcel235_ensembl2UCSC.txt']
2021-05-06 13:54:29,344 - import_ensembl - INFO - Importing annotation
2021-05-06 13:54:29,344 - import_ensembl - INFO - Start ['fon_import', '--annotation', '/data/sai/download/ftp.ensembl.org/pub/release-104/gff3/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.104.gff3.gz', '--data_source', 'ensembl', '--fasta', '/data/sai/download/ftp.ensembl.org/pub/release-104/fasta/caenorhabditis_elegans/cdna/Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz', '--fasta', '/data/sai/download/ftp.ensembl.org/pub/release-104/fasta/caenorhabditis_elegans/ncrna/Caenorhabditis_elegans.WBcel235.ncrna.fa.gz', '--cdna', '--exclude_no_seq', '--biotype', 'all,protein_coding', '--output', '/data/sai/annots/caeele_cdna_${biotype}_ensembl104.fon${version}.json', '--output_format', 'fon']
2021-05-06 13:54:39,531 - import_ensembl - INFO - Transform FON (union,protein_coding)
2021-05-06 13:54:39,531 - import_ensembl - INFO - Start ['fon_transform', '--fon', '/data/sai/annots/caeele_cdna_protein_coding_ensembl104.fon1.json', '--method', 'union', '--output', '/data/sai/annots/caeele_cdna_union2gene_protein_coding_ensembl104.fon${version}.json']
2021-05-06 13:54:42,214 - import_ensembl - INFO - Transform FON (longest,protein_coding)
2021-05-06 13:54:42,214 - import_ensembl - INFO - Start ['fon_transform', '--fon', '/data/sai/annots/caeele_cdna_protein_coding_ensembl104.fon1.json', '--method', 'longest', '--output', '/data/sai/annots/caeele_cdna_longest_transcript_protein_coding_ensembl104.fon${version}.json']
2021-05-06 13:54:44,178 - import_ensembl - INFO - Transform FON (union,all)
2021-05-06 13:54:44,178 - import_ensembl - INFO - Start ['fon_transform', '--fon', '/data/sai/annots/caeele_cdna_all_ensembl104.fon1.json', '--method', 'union', '--output', '/data/sai/annots/caeele_cdna_union2gene_all_ensembl104.fon${version}.json']
2021-05-06 13:54:48,050 - import_ensembl - INFO - Transform FON (longest,all)
2021-05-06 13:54:48,050 - import_ensembl - INFO - Start ['fon_transform', '--fon', '/data/sai/annots/caeele_cdna_all_ensembl104.fon1.json', '--method', 'longest', '--output', '/data/sai/annots/caeele_cdna_longest_transcript_all_ensembl104.fon${version}.json']
The following files will be created:
├── annots
│   ├── caeele_cdna_all_ensembl104.fon1.json  <-- All transcripts
│   ├── caeele_cdna_all_ensembl104.gff3  <-- All transcripts (GFF3)
│   ├── caeele_cdna_longest_transcript_all_ensembl104.fon1.json  <-- Longest transcript per all gene
│   ├── caeele_cdna_longest_transcript_protein_coding_ensembl104.fon1.json  <-- Longest transcript per protein-coding gene
│   ├── caeele_cdna_protein_coding_ensembl104.fon1.json  <-- All transcripts of protein-coding gene
│   ├── caeele_cdna_union2gene_all_ensembl104.fon1.json  <-- Metagenes of all genes
│   ├── caeele_cdna_union2gene_protein_coding_ensembl104.fon1.json  <-- Metagenes of protein-coding genes
│   ├── caeele_genome_all_ensembl_wbcel235_chrom_length.tab  <-- Chromosome lengths (TAB)
│   ├── caeele_genome_all_ensembl_wbcel235_ucsc_names_chrom_length.tab  <-- Chromosome lengths with UCSC names (TAB)
│   └── ChromosomeMappings  <-- Ensembl to/from UCSC name mapping
│       .
│       .
│       └── Zv9_UCSC2ensembl.txt
├── config
│   └── fontools.json
├── download
│   └── ftp.ensembl.org
│       └── pub
│           └── release-104
│               ├── fasta
│               │   └── caenorhabditis_elegans
│               │       ├── cdna
│               │       │   └── Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz
│               │       ├── dna
│               │       │   └── Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz
│               │       └── ncrna
│               │           └── Caenorhabditis_elegans.WBcel235.ncrna.fa.gz
│               └── gff3
│                   └── caenorhabditis_elegans
│                       └── Caenorhabditis_elegans.WBcel235.104.gff3.gz
├── log
│   └── ensembl104.log
└── seqs
    ├── caeele_genome_all_ensembl_wbcel235.fa  <-- Sequence (FASTA)
    └── caeele_genome_all_ensembl_wbcel235.fa.fai  <-- FASTA index
Annotations can be imported into FON from any GFF source. For example, to import gene annotations for Xenopus tropicalis from Xenbase:
cd /data/sai/downloads
wget -m http://ftp.xenbase.org/pub/Genomics/JGI/Xentr10.0/XENTR_10.0_Xenbase.gff3
cd /data/sai/downloads
wget -m http://ftp.xenbase.org/pub/Genomics/JGI/Xentr10.0/XENTR_10.0_genome.fasta.gz
Use the --cdna option to specify that the FASTA file contains cDNA instead of genomic sequence.
fon_import --annotation "/data/sai/downloads/ftp.xenbase.org/pub/Genomics/JGI/Xentr10.0/XENTR_10.0_Xenbase.gff3" \
--output '/data/sai/annots/xentro_cdna_${biotype}_xenbase100.fon${version}.json' \
--fasta "/data/sai/downloads/ftp.xenbase.org/pub/Genomics/JGI/Xentr10.0/XENTR_10.0_genome.fasta.gz" \
--output_format fon \
--biotype all
--output is a path written as a simple string or a template string (string.Template).
Transform FON files: select longest isoform or merge isoforms. For example, to select the longest isoform of each gene:
fon_transform --fon "/data/sai/annots/caeele_cdna_protein_coding_ensembl104.fon1.json" \
--method "longest" \
--output '/data/sai/annots/caeele_cdna_longest_transcript_protein_coding_ensembl104.fon${version}.json'
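The --output values above contain ${biotype} and ${version} placeholders (hence the single quotes, which stop the shell from expanding them). The following Python sketch shows how such a template string expands with string.Template from the standard library. How FONtools fills these placeholders internally is an assumption here; the placeholder names and the path are taken from the import_ensembl log shown earlier.

```python
from string import Template

# Path template as passed to --output in the import_ensembl log.
output = Template("/data/sai/annots/caeele_cdna_${biotype}_ensembl104.fon${version}.json")

for biotype in ["all", "protein_coding"]:
    print(output.substitute(biotype=biotype, version=1))
```

The two printed paths correspond to the caeele_cdna_all_ensembl104.fon1.json and caeele_cdna_protein_coding_ensembl104.fon1.json files listed in the directory tree.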
An easy way to merge the sequences and annotations of multiple species is to pass each sequence and annotation file in a comma-separated list to the merge_annot script. In this example, annotations and sequences from zebrafish and yeast are merged:
cd /data/sai
merge_annot --input_fasta "seqs/zebrafish.fa,seqs/yeast.fa" \
--output_fasta "seqs/zebrafish_plus_yeast.fa" \
--input_gff "annots/zebrafish.gff3,annots/yeast.gff3" \
--output_gff "annots/zebrafish_plus_yeast.gff3" \
--input_fon "annots/zebrafish.fon1.json,annots/yeast.fon1.json" \
--output_fon "annots/zebrafish_plus_yeast.fon1.json"
The script ensembl2ucsc can be used to translate chromosome/contig names from Ensembl to UCSC names (here for C. elegans) using ChromosomeMappings:
cd /data/sai/annots
ensembl2ucsc --input "caeele_genome_all_ensembl_wbcel235_chrom_length.tab" \
--output "caeele_genome_all_ensembl_wbcel235_ucsc_names_chrom_length.tab" \
--path_mapping "ChromosomeMappings/WBcel235_ensembl2UCSC.txt"
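The renaming principle can be sketched in Python as follows. This is not the ensembl2ucsc implementation: it assumes the mapping file has two tab-separated columns (Ensembl name, UCSC name), that the name to translate is the first column of the input, and that rows without a mapped name are dropped, which is a choice made for this sketch. The function names and the input values are illustrative.

```python
import csv
import io


def load_mapping(handle):
    """Read a two-column, tab-separated mapping (Ensembl name -> UCSC name)."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2 and row[1]:
            mapping[row[0]] = row[1]
    return mapping


def rename_first_column(handle, mapping):
    """Yield rows with the first column translated; unmapped rows are skipped."""
    for row in csv.reader(handle, delimiter="\t"):
        if row and row[0] in mapping:
            yield [mapping[row[0]]] + row[1:]


# Toy inputs standing in for the mapping file and the chromosome length file.
mapping = load_mapping(io.StringIO("I\tchrI\nMtDNA\tchrM\n"))
lengths = io.StringIO("I\t1000\nMtDNA\t200\n")
for row in rename_first_column(lengths, mapping):
    print("\t".join(row))
```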
fasta_format: Format FASTA file (options include --sort and --seq_length)
fon_mask_fasta: Mask part(s) of sequence (FASTA) with Ns
fon_mask_fasta --input_fon "selected_loci.fon1.json" \
--input_fasta "genome.fa" \
--output_fasta "genome_mask.fa" \
--extension "50" \
--exterior_extension "100"
Features from the FON file (--input_fon) are masked with Ns in the sequence (--input_fasta). By default, the list of interval coordinates from the exons key of each feature in the FON file is used. To use a different key, use the --key option; for example use --key "cds_exons" to mask the coding sequences.
Each interval can be extended by --extension and each feature can be extended by --exterior_extension. The --exterior_extension value is by default equal to the --extension value. For example, using --extension 2 on [[10,15], [20, 30]] will mask [[8,17], [18, 32]], while --exterior_extension 2 will mask [[8,15], [20, 32]].
With --inverse, only interval coordinates from FON are kept intact; the rest of the sequence is replaced by Ns.
fasta_seq_length: Create tabulated file with sequence(s) name and length from FASTA file.
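The extension and masking rules described for fon_mask_fasta can be reproduced with a short Python sketch. It is a sketch under assumptions, not the script's implementation: intervals are taken to be sorted and half-open, coordinates are clamped to the sequence bounds, and the function names are illustrative.

```python
def extend_intervals(intervals, extension=0, exterior_extension=None):
    """Extend each interval; the outer ends of the feature use exterior_extension."""
    if exterior_extension is None:
        exterior_extension = extension  # Same default as described above
    last = len(intervals) - 1
    extended = []
    for i, (start, end) in enumerate(intervals):
        left = exterior_extension if i == 0 else extension
        right = exterior_extension if i == last else extension
        extended.append([max(0, start - left), end + right])
    return extended


def mask(seq, intervals, inverse=False):
    """Replace bases inside the intervals (or outside, if inverse) with N."""
    covered = [False] * len(seq)
    for start, end in intervals:
        for i in range(max(0, start), min(len(seq), end)):
            covered[i] = True
    return "".join("N" if c != inverse else base for base, c in zip(seq, covered))


print(extend_intervals([[10, 15], [20, 30]], extension=2))           # [[8, 17], [18, 32]]
print(extend_intervals([[10, 15], [20, 30]], exterior_extension=2))  # [[8, 15], [20, 32]]
print(mask("ACGTACGTAC", [[2, 4]]))                # ACNNACGTAC
print(mask("ACGTACGTAC", [[2, 4]], inverse=True))  # NNGTNNNNNN
```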
FONtools are distributed under the Mozilla Public License Version 2.0 (see /LICENSE).
Copyright © 2015-2023 Charles E. Vejnar