Welcome to the UROPA documentation!

UROPA is a command line based tool intended for genomic region annotation.
Advantages of UROPA
- Detect the most appropriate annotation of peaks, utilizing parameters such as
- feature type
- feature anchor
- feature direction relative to peak location
- filtering for attribute values, e.g. “protein_coding”
- strand specificity
- Utilization of any available GTF files as annotation reference
- Multiple queries can be processed in a single run
- Graduated annotation due to prioritization
- Multiple output tables (allhits, finalhits, besthits)
- Visual summary for annotation evaluation
- Preparation of custom annotation files with the UROPA to GTF utility
How to cite
Please cite the paper describing UROPA when using it in your research: Kondili M, Fust A, Preussner J, Kuenne C, Braun T, Looso M. UROPA: a tool for Universal RObust Peak Annotation. Scientific Reports. 2017;7:2593. doi:10.1038/s41598-017-02464-y.
Contribute
Support
If you have any issues, feel free to send an email to Mario Looso.
Licence
The project is licensed under the MIT license (see MIT License)
About UROPA
UROPA (‘Universal RObust Peak Annotator’) is a command line based tool, intended for universal genomic range (e.g. peaks) annotation. Based on a configuration file, different target features can be prioritized with multiple integrated queries. These can be sensitive to feature type, distance, strand specificity, feature attributes (e.g. protein_coding) or anchor position relative to the feature. UROPA can incorporate reference annotation files (GTF) from different sources, as well as custom reference annotation files produced by the user. Statistics and plots transparently summarize the annotation process. UROPA is implemented in Python and R.
- The user provides a reference annotation file in GTF format
- The user provides ranges/peaks of interest in BED file format
- The user provides a configuration file in JavaScript Object Notation (JSON) format
The tool permits linkage of individual queries including prioritization (see Configuration file). Arbitrary features in the reference annotation file can be addressed in a granular way (see Usage). Filtering on additional annotation columns in the reference annotation file is supported. UROPA generates publication ready graphical statistics. The output is given in easily-readable tab-delimited tables (see Output files).
Many usage examples are presented in the Application examples.
Run UROPA
To start the UROPA peak annotation, the minimal command is:
uropa -i <config.json>
A template of the configuration file is available from our GitHub repository (see sample_config.json). An overview of the usage and available parameters is displayed with
uropa --help
Installation
There are three different ways to install UROPA. We recommend installing UROPA using the conda package manager.
Conda package manager
Make sure to have conda installed, e.g. via Miniconda:
- Download the Miniconda installer for Python 3.
- Run
bash Miniconda3-latest-Linux-x86_64.sh
to install Miniconda.
- Answer the question “Do you wish the installer to prepend the Miniconda install location to PATH in your /home/…/.bashrc ?” with yes, OR run
export PATH=dir/to/miniconda3:$PATH
after the installation process.
The UROPA installation is now as easy as:
conda create --name uropa
conda activate uropa
conda install python uropa
Biocontainers / Docker
If you have a running Docker environment, you can pull a biocontainer with UROPA and all dependencies via
docker pull quay.io/biocontainers/uropa:latest_tag
using the latest tag from the taglist, e.g. 1.2.1--py27r3.3.2_0
docker pull loosolab/uropa
Installation from source
You can also install UROPA from the source PyPI package. Note that this comes without the R dependencies for auxiliary scripts:
pip install uropa
To fulfill all other dependencies, R/Rscript v3.3.0 or higher (follow the instructions on url) is needed. Furthermore, follow the subsequent instructions within the R environment to install the mandatory packages:
install.packages(c("ggplot2","devtools","gplots","gridExtra","jsonlite", "VennDiagram","snow","getopt","tidyr","UpSetR"))
source("https://bioconductor.org/biocLite.R")
biocLite(c("RBGL","graph"))
- further package infos can be found at CRAN and Bioconductor
- In order to visualize the Chow-Ruskey plot with uropa_summary.R, install the modified Vennerable package from our fork:
library(devtools)
install_github("jenzopr/Vennerable")
Usage
## Command-line parameters
List of current parameters for command-line usage and their explanation:
Usage: uropa [options]
Available options:
-h, --help print this help message and further details on the configuration file
-i config.json, --input config.json filename of configuration file [mandatory]
-p prefix, --prefix prefix prefix for result files, can include subdirectories [basename of config]
-r, --reformat create an additional compact and line-reduced table as result file
-s, --summary additional visualisation of results in graphical format will be created
-t n, --threads n multiprocessed run: n = number of threads to run annotation process
-add-comments show comment lines in output files explaining the columns
-l uropa.log, --log uropa.log log file name for messages and warnings
-d, --debug print verbose messages (for debugging purposes)
-v, --version print the version and exit
Note
The -p parameter can contain either the prefix of the result files (e.g. -p abc gives abc_allhits.txt in the working directory) or the result directory and prefix (e.g. -p xy/abc stores the result files in directory xy). If the directory xy does not exist, it will be created.
## Docker container usage
sudo docker run --rm -v <path-to-input-files-on-HOST>:<path-to-container-mnt> UROPA:LATEST uropa <UROPA-Parameters> -p <path-to-container-mnt>/'your-file-prefix'
Note
The -v parameter mounts a HOST folder into your Docker CONTAINER. This folder should contain the input files for UROPA, and the result files will also be stored here. No files will be stored in the container!
The --rm parameter removes the container after the run.
Make sure to use the uropa -p option specifying the output directory and prefix, otherwise results are lost in the container environment!
Configuration file
The configuration file is a JavaScript Object Notation (JSON) formatted file that allows keys and values to be set easily. For running UROPA, a template of the config file is provided as shown below:
{
"queries":[
{"feature":"", "feature.anchor": "", "distance":"", "strand":"",
"direction":"", "internals":"", "filter.attribute":"",
"attribute.value":"", "show.attributes":"" }
],
"priority": "",
"gtf": ".gtf",
"bed": ".bed"
}
Three global keys are mandatory: 'queries', 'gtf', and 'bed'. Additionally, there is an optional global key 'priority'.
In a default annotation, only the 'gtf' and 'bed' keys need to be specified by the user (relative file paths). The key 'queries' has to be present in the config file, but can be left empty (e.g. "queries": []). Empty or missing key-value pairs are filled with their default values by UROPA.
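As an illustration of the rules above, a minimal default configuration can be written and checked from a short Python script. This is only a sketch: the helper name `check_config` and the file names are hypothetical, and only the key names are taken from this documentation.

```python
import json

# Global keys that the documentation lists as mandatory.
MANDATORY_KEYS = ("queries", "gtf", "bed")

def check_config(config):
    """Return the mandatory global keys that are missing from a config dict."""
    return [key for key in MANDATORY_KEYS if key not in config]

# Hypothetical file names; the empty query list requests the default annotation.
config = {"queries": [], "gtf": "annotation.gtf", "bed": "peaks.bed"}
config_text = json.dumps(config, indent=2)

# Round trip through JSON to confirm the text parses and is complete.
missing = check_config(json.loads(config_text))
print(config_text)
print("missing keys:", missing)
```

Writing `config_text` to a file would give a config that could be passed to `uropa -i`.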
Queries
The 'queries' key field is a list with 1 to n entries, each specifying one valid annotation query with specific parameters for UROPA's annotation process.
Hint
- If more than one query is given, each query should be enclosed in its own curly brackets, for example [{}, {}].
- Double-check the spelling of keywords and the placement of commas, otherwise the UROPA annotation might differ from expectations.
Each query can specify the following keys:
Query-specific keys
feature: Peaks will be annotated only to listed features from the 3rd column of the file specified by 'gtf'.
Default: All available features from 'gtf'.
Example: 'feature': ['gene','transcript'] or 'feature': 'exon'.
feature.anchor: The position(s) from which the distance to the peak center will be calculated. If multiple anchors are defined, the best annotation is the one with the minimum distance. Valid distances are less than or equal to the value of the 'distance' key (see below).
Default: ['start', 'center', 'end']
Example: 'feature.anchor': ['start']
distance: Maximum permitted distance from the genomic feature anchor to the peak center. If only one value is given, this distance is valid in both directions from the feature anchor. If two values are given, the first value corresponds to the distance upstream of the feature anchor, and the second value to the distance downstream of the feature anchor.
Default: 100000
Example: 'distance': [2000,5000] or 'distance': [5000] or 'distance': 5000.
strand: The desired strand of the annotated feature relative to the peak. A constraint on strand specificity is only successfully evaluated if strand information is available for the feature and the peak.
Default: ['ignore', 'same', 'opposite']
Example: 'strand': ['same'] or 'strand': 'same'.
direction: Defines the peak location relative to the feature’s location, with respect to its orientation. A peak is ‘upstream’ if its center is upstream of a feature anchor position. Accordingly, a peak is ‘downstream’ if its center is downstream of a feature anchor position. A visual representation is given in Figure 1 of section Application examples.
Default: 'any_direction'
Example: 'direction': ['upstream','downstream']
internals: A minimum overlap fraction for annotations, where peaks are located (partially) within features or vice versa. This key allows peaks to be annotated to features with a wide size range, such as large genes, which would otherwise be removed due to the distance thresholds. Further, this key can be used to allow partial overlaps to be reported, even if the feature anchor is too far away. A visual representation is included in Figure 1 of section Application examples.
Default: 0.0, which does not report partially overlapping annotations if they are located outside of the distance thresholds. Furthermore, 'T', 'True', 'Y', 'Yes' are allowed and will be converted to 1.0; 'F', 'False', 'N', 'No' are allowed and will be converted to 0.0.
Example: 'internals':'1.0' for allowing annotations where peaks are fully within features or vice versa. 'internals':'0.1' for allowing annotations where peaks overlap features (or vice versa) by at least 10 percent.
filter.attribute: This key filters on the attributes found in the 9th column of the GTF file. If a 'filter.attribute' is given, only features that have an 'attribute.value' for this attribute are kept as valid annotations. If this key is set, the key 'attribute.value' is mandatory, too (see below).
Default: 'None'
Example: 'filter.attribute': ['gene_type']
attribute.value: Corresponding attribute value for the 'filter.attribute' found in the 9th column of the GTF file. If a 'filter.attribute' is given, only features that have an 'attribute.value' for this attribute can be valid annotations.
Default: 'None'
Example: 'attribute.value': ['protein_coding']
show.attributes: A list of attributes found in the 9th column of the GTF file which should appear in the final output tables. If non-existent attributes are specified, annotated peaks will display 'NA' for those attributes. If set to 'all', the list of attributes will be defined from all possible attributes in the annotated features from the GTF.
Default: 'None'
Example: 'show.attributes':['gene_id', 'gene_biotype']
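To make the interplay of the feature.anchor and distance keys more concrete, the following Python sketch computes the closest valid anchor for one peak and one feature. It is a simplified illustration and not UROPA's actual implementation: the function names are hypothetical, and it assumes that on the minus strand the feature 'start' is the highest coordinate and that 'upstream' follows the feature orientation.

```python
def anchor_positions(feat_start, feat_end, feat_strand, anchors):
    """Map anchor names to coordinates, taking the feature strand into account."""
    # Assumption: on the minus strand, 'start' and 'end' swap coordinates.
    if feat_strand == "-":
        start, end = feat_end, feat_start
    else:
        start, end = feat_start, feat_end
    positions = {"start": start, "center": (feat_start + feat_end) / 2, "end": end}
    return {name: positions[name] for name in anchors}

def closest_anchor(peak_center, feat_start, feat_end, feat_strand, anchors, distance):
    """Return (anchor name, distance) of the closest valid anchor, or None."""
    # A single value applies in both directions; two values mean
    # [upstream, downstream] of the feature anchor.
    if isinstance(distance, (int, float)):
        upstream = downstream = distance
    elif len(distance) == 1:
        upstream = downstream = distance[0]
    else:
        upstream, downstream = distance
    best = None
    positions = anchor_positions(feat_start, feat_end, feat_strand, anchors)
    for name, position in positions.items():
        offset = peak_center - position
        if feat_strand == "-":
            offset = -offset  # flip so that negative always means upstream
        limit = upstream if offset < 0 else downstream
        dist = abs(offset)
        if dist <= limit and (best is None or dist < best[1]):
            best = (name, dist)
    return best

# Coordinates of peak_356 and gene HNRNPF from the example tables in Output files.
print(closest_anchor(43904360.5, 43881065, 43904614, "-", ["start"], 10000))
```

Only the anchor with the smallest distance is returned, which mirrors the rule that the minimum distance defines the best annotation when several anchors are given.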
Prioritizing queries
priority: Allows multiple queries to be treated as a hierarchy, which means that a peak can be annotated according to subsequent queries only if no match to the preceding query is found.
If ‘False’, all given queries are weighted equally and any feature matching any of these queries will be a valid annotation.
If only one query is provided, the value of ‘priority’ has no influence on the annotation process.
Allowed values are one of 'T', 'True', 'Y', 'Yes' or 'F', 'False', 'N', 'No'.
Default: 'False'
Example: 'priority':'Yes'
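The effect of the priority key can be sketched in a few lines of Python. The functions below are hypothetical and only model the behaviour described above; treating the flag values case-insensitively and mapping an empty value to the default 'False' are assumptions of this sketch.

```python
TRUE_VALUES = {"t", "true", "y", "yes"}
FALSE_VALUES = {"f", "false", "n", "no"}

def parse_flag(value):
    """Convert a priority value such as 'Yes' or 'F' to a boolean."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES or text == "":
        return False  # an empty value falls back to the default 'False'
    raise ValueError(f"not a valid priority value: {value!r}")

def select_valid(hits_per_query, priority):
    """hits_per_query holds one list of hits per query, in query order."""
    if priority:
        # Hierarchy: only the first query that yields any hit is used.
        for query_index, hits in enumerate(hits_per_query):
            if hits:
                return [(query_index, hit) for hit in hits]
        return []
    # No hierarchy: hits of all queries are equally valid.
    return [(query_index, hit)
            for query_index, hits in enumerate(hits_per_query)
            for hit in hits]

# Query 0 finds nothing for this peak, queries 1 and 2 find one feature each.
hits = [[], ["feature_b"], ["feature_c"]]
print(select_valid(hits, parse_flag("Yes")))
print(select_valid(hits, parse_flag("False")))
```

With priority enabled, the peak is annotated by query 1 only, because query 0 found no match and query 2 is never consulted.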
Annotation database (GTF)
gtf: A path to a file in standard GTF format (9 columns), as described by Ensembl GTF format. The GTF file acts as annotation database. If your annotation database is not in the Ensembl GTF format, a conversion can be done by UROPA. For more information see UROPA to GTF utility.
Required, no default.
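Since several query keys refer to the 9th GTF column, a short Python sketch may help to show how such attributes can be read and filtered. The function names and the example line are hypothetical (the gene_id is a placeholder), and this is not UROPA's own parser; it only handles quoted attribute values.

```python
import re

# Matches pairs such as: gene_type "protein_coding"
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')

def parse_attributes(column9):
    """Turn the 9th GTF column into a dict of attribute name to value."""
    return dict(ATTRIBUTE_PATTERN.findall(column9))

def passes_filter(gtf_line, filter_attribute, attribute_values):
    """Check one GTF line against filter.attribute and attribute.value."""
    if isinstance(attribute_values, str):
        attribute_values = [attribute_values]
    fields = gtf_line.rstrip("\n").split("\t")
    attributes = parse_attributes(fields[8])
    return attributes.get(filter_attribute) in attribute_values

# Illustrative line with 9 tab-separated columns; source and gene_id are made up.
example_line = "\t".join([
    "chr10", "example_source", "gene", "43881065", "43904614", ".", "-", ".",
    'gene_id "PLACEHOLDER_ID"; gene_type "protein_coding"; gene_name "HNRNPF";',
])

print(parse_attributes(example_line.split("\t")[8]))
print(passes_filter(example_line, "gene_type", ["protein_coding"]))
```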
Genomic regions (BED)
bed: A path to a file in BED format, as described by Ensembl BED format. The BED file can be any tab-delimited file containing the genomic regions, e.g. enriched regions from a peak-calling tool (e.g. MACS2, MUSIC, FindPeaks, CisGenome, PeakSeq), with a minimum of 3 columns (chr/start/stop).
Required, no default.
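A minimal Python sketch of reading such a BED file is given below. It is an illustration under assumptions, not UROPA code: the function name is hypothetical, the peak center is taken as the midpoint of start and end (which agrees with the example tables in Output files), and a peak id in chr:start-end format is created when no 4th column is present.

```python
def read_peaks(bed_lines):
    """Parse BED lines with at least 3 tab-separated columns into peak dicts."""
    peaks = []
    for line in bed_lines:
        # Skip empty lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Use the 4th column as id if available, otherwise build one.
        peak_id = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        peaks.append({
            "peak_id": peak_id,
            "peak_chr": chrom,
            "peak_start": start,
            "peak_center": (start + end) / 2,
            "peak_end": end,
        })
    return peaks

example = [
    "chr10\t43902516\t43906205\tpeak_356\n",
    "chr15\t79211550\t79222698\n",
]
for peak in read_peaks(example):
    print(peak)
```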
Output files
UROPA provides multiple output files, each presenting the annotation in either a more extended or a more condensed way in order to cover the needs of a user. Details are given below:
File overview
- allhits.txt: Basic output table. It reports all valid annotations for each peak and, additionally, NA rows for peaks without a valid annotation.
- finalhits.txt: Filtered output table. It reports the best (closest) feature according to the config criteria for each peak. If multiple queries are given, it reports the best annotation taking all queries into account.
- finalhits.bed: Similar to finalhits.txt, but in BED format. There is no header, and the column order is as follows: peak_chr peak_start peak_end peak_id peak_strand peak_score feature feat_start feat_end feat_strand feat_anchor distance genomic_location + all attributes that are given in the config file.
- besthits.txt: This table is only produced if more than one query is given. It reports the best annotation per query for each peak.
- besthits_compact.txt: Another format of the besthits, produced by using the -r parameter. A compact table with the best annotation per query for each peak in one row is presented.
- summary.pdf: Statistical summary of the UROPA annotation, generated by using the -s parameter.
Note
If no prefix is specified, the result files will be stored in the working directory with the basename of the config file. For example, uropa -i histonMarkPeaks.json leads to an allhits file called histonMarkPeaks_allhits.txt stored in the working directory.
With a specified prefix, the result files will be stored in the working directory with the prefix as part of the output file names. For example,
uropa -i histonMarkPeaks.json -p abc
results in an allhits file called abc_allhits.txt.
If the prefix is defined as -p xy/abc, all results are stored in the folder xy, which is created if it does not already exist, with abc as the prefix of the output files.
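The naming rules from this note can be summarised in a small Python sketch. The function is hypothetical, and it assumes that all tables follow the same prefix_table pattern that the note shows for the allhits file.

```python
from pathlib import Path

def output_paths(config_file, prefix=None, multiple_queries=False):
    """Derive the expected result file names from config name and prefix."""
    # Without a prefix, the basename of the config file is used.
    base = prefix if prefix is not None else Path(config_file).stem
    tables = ["allhits.txt", "finalhits.txt", "finalhits.bed"]
    if multiple_queries:
        tables.append("besthits.txt")  # only produced for more than one query
    return [f"{base}_{table}" for table in tables]

print(output_paths("histonMarkPeaks.json"))
print(output_paths("histonMarkPeaks.json", prefix="abc"))
print(output_paths("histonMarkPeaks.json", prefix="xy/abc", multiple_queries=True))
```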
Output columns explanation
The four output tables mentioned above contain informative columns about the performed peak annotation:
- peak_id, peak_start, peak_center, peak_end, peak_strand: Peak information with id if available, otherwise a peak id in chr:start-end format will be created.
- feature, feat_start, feat_end, feat_strand: The information of the genomic feature that annotates the peak as extracted by the GTF file.
- feat_anchor: The position of the annotated genomic feature which was used for distance calculation. If feature.anchor is given in the config, only this will be used. If multiple feature.anchor values were given, the distance to all of them is calculated and the minimum distance is chosen.
- distance: Absolute distance from peak center to feature anchor. Closest feature anchor if multiple are specified, see column feat_anchor.
- genomic_location: The position of the peak relative to the annotated feature direction (e.g. upstream = peak located upstream of the feature, see Figure 2 in Application examples).
- feat_ovl_peak: Percentage of the peak that is covered by the feature (1.0 = 100%, this corresponds to the genomic_location “PeakInsideFeature”).
- peak_ovl_feat: Percentage of the feature that is covered by the peak (1.0 corresponds to the genomic_location “FeatureInsidePeak”).
- gene_name, gene_id, gene_type,…: Attributes that have been given in the key show.attributes with their values extracted from the GTF.
Hint
- If the filter.attribute key is used, make sure that these attributes are also selected in the show.attributes key, to have an easy confirmation of the filtering.
- Make sure to define attributes to be shown in the output. Otherwise the annotated peaks will be reported without any information on the assigned features.
- query: Valid query ID for the current annotation (0-based).
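The two overlap columns feat_ovl_peak and peak_ovl_feat can be illustrated with a short Python sketch. The function name is hypothetical and the interval arithmetic (simple coordinate differences) is an assumption of this sketch, so results may differ slightly from UROPA's own rounding.

```python
def overlap_fractions(peak_start, peak_end, feat_start, feat_end):
    """Return (feat_ovl_peak, peak_ovl_feat) as fractions between 0 and 1."""
    # Length of the shared interval, or 0 if peak and feature do not overlap.
    overlap = max(0, min(peak_end, feat_end) - max(peak_start, feat_start))
    feat_ovl_peak = overlap / (peak_end - peak_start)  # share of the peak covered
    peak_ovl_feat = overlap / (feat_end - feat_start)  # share of the feature covered
    return feat_ovl_peak, peak_ovl_feat

# peak_356 and gene HNRNPF from the example tables below.
fractions = overlap_fractions(43902516, 43906205, 43881065, 43904614)
print([round(value, 2) for value in fractions])
```

A value of 1.0 in the first position would correspond to “PeakInsideFeature”, and a value of 1.0 in the second position to “FeatureInsidePeak”.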
Output files (one query)
UROPA annotation with one query results in two output tables: the allhits and the finalhits. For a configuration as given below, the allhits is given in Table 5.1, and the finalhits is displayed in Table 5.2. Peak and annotation files are further described in the Application examples section.
The UROPA annotation process for one query can run into three cases for each peak:
- Case 1: No query gives any feature for annotating the peak, this leads to no valid annotation at all -> NA row in allhits and finalhits.
- Case 2: There is one valid annotation for the specified query -> annotation will be given in allhits and finalhits.
- Case 3: There are multiple valid annotations for the specified query -> all valid annotations will be given in the allhits, the best annotation (smallest distance) will be presented in the finalhits.
{
"queries":[
{"feature":"gene", "distance":10000, "feature.anchor":"start", "internals":"True",
"filter.attribute":"gene_type", "attribute.value":"protein_coding",
"show.attributes":["gene_name","gene_type"]}],
"priority" : "False",
"gtf":"gencode.v19.annotation.gtf" ,
"bed":"ENCFF001VFA.bed"
}
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | gene_type | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | |||||||||||||||||
peak_355 | chr15 | 79211550 | 79217124 | 79222698 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 |
peak_356 | chr10 | 43902516 | 43904360.5 | 43906205 | . | gene | 43881065 | 43904614 | - | start | 253 | overlapStart | 0.57 | 0.09 | HNRNPF | protein_coding | 0 |
… | |||||||||||||||||
peak_765 | chr5 | 98262863 | 98264852.5 | 98266842 | . | gene | 98190908 | 98262240 | - | start | 261 | upstream | 0.0 | 0.0 | CHD1 | protein_coding | 0 |
… | |||||||||||||||||
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | gene | 175810949 | 175815976 | + | start | 937 | overlapStart | 0.31 | 0.3 | NOP16 | protein_coding | 0 |
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | gene | 175815748 | 175816772 | + | start | 1165 | FeatureInsidePeak | 0.22 | 1.0 | HIGD2A | protein_coding | 0 |
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | gene | 175792471 | 175828666 | + | start | 24442 | PeakInsideFeature | 1.0 | 0.14 | ARL10 | protein_coding | 0 |
… |
Table 5.1: Excerpt of table allhits for one query as described in the configuration above.
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | gene_type | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | |||||||||||||||||
peak_355 | chr15 | 79211550 | 79217124 | 79222698 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 |
peak_356 | chr10 | 43902516 | 43904360.5 | 43906205 | . | gene | 43881065 | 43904614 | - | start | 253 | overlapStart | 0.57 | 0.09 | HNRNPF | protein_coding | 0 |
… | |||||||||||||||||
peak_765 | chr5 | 98262863 | 98264852.5 | 98266842 | . | gene | 98190908 | 98262240 | - | start | 261 | upstream | 0.0 | 0.0 | CHD1 | protein_coding | 0 |
… | |||||||||||||||||
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | gene | 175810949 | 175815976 | + | start | 937 | overlapStart | 0.31 | 0.3 | NOP16 | protein_coding | 0 |
… |
Table 5.2: Excerpt of table finalhits for one query as described in the configuration above.
As displayed in Table 5.1 and Table 5.2, peak 355 is a representative of Case 1. There is no valid annotation at all, thus there is an NA row in both output tables.
The peaks 356 and 765 belong to Case 2. There is one valid annotation for each of them, and it is displayed in the same way in allhits and finalhits.
Peak 769 has three valid annotations for the specified query (Case 3). All of them are displayed in the allhits output. In the finalhits only the best annotation, the one for gene NOP16 with the minimal distance of 937, is represented.
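The selection of the finalhits entry from the allhits rows of one peak can be sketched in Python as follows. This is a simplified model of the three cases, not UROPA's implementation; the function name is hypothetical, and rows without annotation are represented by a distance of None.

```python
def final_hit(all_hits):
    """Pick the annotation with the smallest distance, or None for an NA row."""
    valid = [hit for hit in all_hits if hit["distance"] is not None]
    if not valid:
        return None  # Case 1: no valid annotation at all
    # Case 2 and Case 3: the closest annotation is reported.
    return min(valid, key=lambda hit: hit["distance"])

# The three allhits rows of peak_769 from the table excerpt above.
peak_769 = [
    {"gene_name": "NOP16", "distance": 937},
    {"gene_name": "HIGD2A", "distance": 1165},
    {"gene_name": "ARL10", "distance": 24442},
]
print(final_hit(peak_769))
print(final_hit([{"gene_name": None, "distance": None}]))
```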
Output files (multiple queries)
UROPA annotation with multiple queries and default priority results in at least three output tables: the allhits, finalhits, and besthits. If the -r parameter is added to the command line call, an additional compact output file is produced. Furthermore, if the -s parameter is also added, the summary file is generated. With a configuration as given below, the output files are generated as presented in Tables 5.3 to 5.6 and Figure 1. Peak and annotation files are further described in the Application examples section.
The UROPA annotation process for multiple queries allows an additional case in relation to the cases described for one query above:
- Case 1 to 3 as described for one query
- Case 4: There are valid annotations for one peak for multiple queries -> all valid annotations will be given in the allhits, the best annotation (smallest distance across all queries) will be presented in the finalhits. Additionally, the best annotation per query will be displayed in the besthits output.
{
"queries":[
{"feature":"gene", "distance":10000, "feature.anchor":"start", "internals":"True",
"filter.attribute":"gene_type", "attribute.value":"protein_coding",
"show.attributes":["gene_name","gene_type"]},
{"feature":"gene", "distance":10000, "feature.anchor":"start", "internals":"True",
"filter.attribute":"gene_type", "attribute.value":"lincRNA"},
{"feature":"gene", "distance":10000, "feature.anchor":"start", "internals":"True",
"filter.attribute":"gene_type", "attribute.value":"misc_RNA"}
],
"priority" : "False",
"gtf": "gencode.v19.annotation.gtf",
"bed": "ENCFF001VFA.peaks.bed"
}
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | gene_type | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | |||||||||||||||||
peak_355 | chr15 | 79211550 | 79217124 | 79222698 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 |
peak_355 | chr15 | 79211550 | 79217124 | 79222698 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 |
peak_355 | chr15 | 79211550 | 79217124 | 79222698 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 |
peak_356 | chr10 | 43902516 | 43904360.5 | 43906205 | . | gene | 43881065 | 43904614 | - | start | 253 | overlapStart | 0.57 | 0.09 | HNRNPF | protein_coding | 0 |
peak_356 | chr10 | 43902516 | 43904360.5 | 43906205 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 |
peak_356 | chr10 | 43902516 | 43904360.5 | 43906205 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 |
… | |||||||||||||||||
peak_765 | chr5 | 98262863 | 98264852.5 | 98266842 | . | gene | 98190908 | 98262240 | - | start | 261 | upstream | 0.0 | 0.0 | CHD1 | protein_coding | 0 |
peak_765 | chr5 | 98262863 | 98264852.5 | 98266842 | . | gene | 98264875 | 98330717 | + | start | 22 | overlapStart | 0.5 | 0.03 | CTD-2007H13.3 | protein_coding | 1 |
peak_765 | chr5 | 98262863 | 98264852.5 | 98266842 | . | gene | 98272342 | 98272451 | - | start | 7598 | downstream | 0.0 | 0.0 | Y_RNA | protein_coding | 2 |
… | |||||||||||||||||
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | gene | 175810949 | 175815976 | + | start | 937 | overlapStart | 0.31 | 0.3 | NOP16 | protein_coding | 0 |
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | gene | 175815748 | 175816772 | + | start | 1165 | FeatureInsidePeak | 0.22 | 1.0 | HIGD2A | protein_coding | 0 |
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | gene | 175792471 | 175828666 | + | start | 24442 | PeakInsideFeature | 1.0 | 0.14 | ARL10 | protein_coding | 0 |
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 |
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 |
… |
Table 5.3: Excerpt of table allhits for three queries as described in the configuration above.
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | gene_type | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | |||||||||||||||||
peak_355 | chr15 | 79211550 | 79217124 | 79222698 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0,1,2 |
peak_356 | chr10 | 43902516 | 43904360.5 | 43906205 | . | gene | 43881065 | 43904614 | - | start | 253 | overlapStart | 0.57 | 0.09 | HNRNPF | protein_coding | 0 |
… | |||||||||||||||||
peak_765 | chr5 | 98262863 | 98264852.5 | 98266842 | . | gene | 98264875 | 98330717 | + | start | 22 | overlapStart | 0.5 | 0.03 | CTD-2007H13.3 | protein_coding | 1 |
… | |||||||||||||||||
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | gene | 175810949 | 175815976 | + | start | 937 | overlapStart | 0.31 | 0.3 | NOP16 | protein_coding | 0 |
… |
Table 5.4: Excerpt of table finalhits for three queries as described in the configuration above.
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | gene_type | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | |||||||||||||||||
peak_355 | chr15 | 79211550 | 79217124 | 79222698 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 |
peak_355 | chr15 | 79211550 | 79217124 | 79222698 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 |
peak_355 | chr15 | 79211550 | 79217124 | 79222698 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 |
peak_356 | chr10 | 43902516 | 43904360.5 | 43906205 | . | gene | 43881065 | 43904614 | - | start | 253 | overlapStart | 0.57 | 0.09 | HNRNPF | protein_coding | 0 |
peak_356 | chr10 | 43902516 | 43904360.5 | 43906205 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 |
peak_356 | chr10 | 43902516 | 43904360.5 | 43906205 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 |
… | |||||||||||||||||
peak_765 | chr5 | 98262863 | 98264852.5 | 98266842 | . | gene | 98190908 | 98262240 | - | start | 261 | upstream | 0.0 | 0.0 | CHD1 | protein_coding | 0 |
peak_765 | chr5 | 98262863 | 98264852.5 | 98266842 | . | gene | 98264875 | 98330717 | + | start | 22 | overlapStart | 0.5 | 0.03 | CTD-2007H13.3 | protein_coding | 1 |
peak_765 | chr5 | 98262863 | 98264852.5 | 98266842 | . | gene | 98272342 | 98272451 | - | start | 7598 | downstream | 0.0 | 0.0 | Y_RNA | protein_coding | 2 |
… | |||||||||||||||||
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | gene | 175810949 | 175815976 | + | start | 937 | overlapStart | 0.31 | 0.3 | NOP16 | protein_coding | 0 |
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 |
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 |
… |
Table 5.5: Excerpt of table besthits for three queries as described in the configuration above.
Note
The besthits is only generated if multiple queries are specified and the priority flag is set to FALSE! If this flag is TRUE, there will be only one valid query. There can be multiple valid annotations for one peak, but all based on one query. In this case only the allhits and finalhits are produced.
As in the first example with one query, peak_355 has no valid annotation and is represented as an NA row (corresponds to Case 1). In the allhits (Table 5.3) and besthits (Table 5.5) there is one NA row for each query. In the finalhits (Table 5.4) there is only one NA row for all queries.
peak_356 has a valid annotation for only one query, as given in allhits, finalhits, and besthits (Case 2). In allhits and besthits there are additional NA rows for this peak for the other queries without a hit.
For peak_765 there are valid annotations for all given queries, as displayed in the allhits (Case 4). The best of these is the annotation for the lincRNA, therefore this annotation is displayed in the finalhits. Because there is only one valid annotation for each query, they are displayed in the same way in the besthits.
This is different for peak_769. As described above, this peak corresponds to Case 3. With multiple queries, there are additional NA rows for the queries without a hit in the allhits and besthits.
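How the besthits rows relate to the allhits rows can be sketched in Python. The function below is a hypothetical model, not UROPA's code: it keeps one row per peak and query, namely the one with the smallest distance, and keeps an NA row (distance None) when a query has no hit.

```python
def best_hits(all_hits):
    """Keep the closest annotation per (peak_id, query) pair."""
    best = {}
    for hit in all_hits:
        key = (hit["peak_id"], hit["query"])
        current = best.get(key)
        if current is None:
            best[key] = hit  # first row seen for this peak and query
        elif hit["distance"] is not None and (
            current["distance"] is None or hit["distance"] < current["distance"]
        ):
            best[key] = hit  # a closer annotation replaces the stored one
    return [best[key] for key in sorted(best)]

# The allhits rows of peak_769 for three queries, reduced to a few columns.
allhits_peak_769 = [
    {"peak_id": "peak_769", "query": 0, "gene_name": "NOP16", "distance": 937},
    {"peak_id": "peak_769", "query": 0, "gene_name": "HIGD2A", "distance": 1165},
    {"peak_id": "peak_769", "query": 0, "gene_name": "ARL10", "distance": 24442},
    {"peak_id": "peak_769", "query": 1, "gene_name": None, "distance": None},
    {"peak_id": "peak_769", "query": 2, "gene_name": None, "distance": None},
]
for row in best_hits(allhits_peak_769):
    print(row)
```

For peak_769 this leaves the NOP16 row for query 0 and one NA row each for queries 1 and 2, as in the besthits excerpt.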
With multiple queries, UROPA provides an option to reformat the besthits table in order to condense the best per query annotations for each peak in one row. A reformatted example for the besthits of Table 5.5 is presented in Table 5.6. The Reformatted_HitsperPeak represents all information for each peak in one row. Within this format the information for query 0 is always given at the first position, for query 1 at the second position, etc.
To receive this output format, the parameter -r has to be added to the command line call.
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | gene_type | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | |||||||||||||||||
peak_355 | chr15 | 79211550 | 79217124 | 79222698 | . | NA,NA,NA | NA,NA,NA | NA,NA,NA | NA,NA,NA | NA,NA,NA | NA,NA,NA | NA,NA,NA | NA,NA,NA | NA,NA,NA | NA,NA,NA | NA,NA,NA | 0,1,2 |
peak_356 | chr10 | 43902516 | 43904360.5 | 43906205 | . | gene,NA,NA | 43881065,NA,NA | 43904614,NA,NA | -,NA,NA | start,NA,NA | 253,NA,NA | overlapStart,NA,NA | 0.57,NA,NA | 0.09,NA,NA | HNRNPF,NA,NA | protein_coding,NA,NA | 0,1,2 |
… | |||||||||||||||||
peak_765 | chr5 | 98262863 | 98264852.5 | 98266842 | . | gene,gene,gene | 98190908,98264875,98272342 | 98262240,98330717,98272451 | -,+,- | start,start,start | 261,22,7598 | upstream,overlapStart,downstream | 0.0,0.5,0.0 | 0.0,0.03,0.0 | CHD1,CTD-2007H13.3,Y_RNA | protein_coding,protein_coding,protein_coding | 0,1,2 |
… | |||||||||||||||||
peak_769 | chr5 | 175814508 | 175816913.5 | 1751574319 | . | gene,NA,NA | 175810949,NA,NA | 175815976,NA,NA | +,NA,NA | start,NA,NA | 937,NA,NA | overlapStart,NA,NA | 0.31,NA,NA | 0.3,NA,NA | NOP16 | protein_coding | 0,1,2 |
… |
Table 5.6: Excerpt of table Reformatted_HitsperPeak for three queries as described in the configuration above.
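The idea behind the compact format can be sketched in Python: the per query rows of each peak are joined column by column, in query order, with commas. The function name and the reduced set of columns are hypothetical, and missing values are written as NA.

```python
def compact_rows(hits, columns):
    """Collapse the besthits rows of each peak into one comma-joined row."""
    grouped = {}
    # Sorting by query keeps query 0 at the first position, query 1 second, etc.
    for hit in sorted(hits, key=lambda row: row["query"]):
        grouped.setdefault(hit["peak_id"], []).append(hit)
    compact = []
    for peak_id, rows in grouped.items():
        row = {"peak_id": peak_id}
        for column in columns:
            row[column] = ",".join(
                "NA" if item[column] is None else str(item[column]) for item in rows
            )
        compact.append(row)
    return compact

# The besthits rows of peak_765, reduced to three columns.
besthits_peak_765 = [
    {"peak_id": "peak_765", "query": 0, "gene_name": "CHD1", "distance": 261},
    {"peak_id": "peak_765", "query": 1, "gene_name": "CTD-2007H13.3", "distance": 22},
    {"peak_id": "peak_765", "query": 2, "gene_name": "Y_RNA", "distance": 7598},
]
print(compact_rows(besthits_peak_765, ["distance", "gene_name", "query"]))
```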
Summary Visualisation
In order to generate a global summary, one can apply the -s parameter during the command line call. This summary visualises a global overview of the generated UROPA annotations and several useful plots, such as the distribution of the annotated peaks across the features of interest and the genomic locations relative to the features of interest. Details on the different plots and the output files they are based on are given below.
Overview of what can be found in the summary visualisation
- An abstract of the UROPA annotation including the used peak and annotation files
- Number of peaks and number of annotated peaks
- Specified queries, and value of the priority flag (Figure 1A)
- Number of peaks per query
Graphs based on the ‘finalhits’ output:
- Density plot displaying the distance per feature across all queries (Figure 1B)
- Pie chart illustrating the genomic locations of the peaks per annotated feature (Figure 1C)
- Bar plot displaying the occurrence of the different features, if there is more than one feature assigned for peak annotation (not illustrated due to one feature in this example)
Figure 1A-C show the summary for the first example UROPA run given above, which uses only one query.
Graphs based on the ‘besthits’ output:
- Distribution of the distances per feature per query is displayed in a histogram (Figure 1D)
- Pie chart illustrating the genomic locations of the peaks per annotated feature (not illustrated)
- Pairwise comparisons among all queries are shown in Venn diagrams (more than one query is needed; one pairwise comparison is displayed in Figure 1E)
- Chow-Ruskey plot with a comparison across all defined queries (for three to five annotation queries) (Figure 1F)

Figure 1: Summary file example for queries as described above: (A) Summary of specified queries, used annotation and peak files, and how many peaks were present and annotated, (B) Distance density for all features based on finalhits, (C) Pie Chart representing genomic location for each feature across finalhits, (D) Distance per query per feature across besthits, (E) Pairwise comparison across all queries displayed in Venn diagrams, (F) Chow Ruskey plot to compare all queries.
Application examples¶
In this section several typical use cases for UROPA are presented. A detailed introduction to all available keys can be found in the section Configuration file. Detailed information about the output formats can be found in the section Output files.
Figure 1 gives a general overview of the genomic locations of peaks and their directions.
It illustrates the meaning of config keys such as direction or internals.
Direction is illustrated by the entities upstream and downstream of a feature, taking the strand into account.
The genomic location of a peak is illustrated with its entities (as reported in the Output files).

Figure 1: Locations and directions of peaks (green) and features (blue).
There are five peaks in close proximity to gene A. The first peak (from left) is located upstream of gene A and would be annotated if direction:upstream is configured. The second peak would also be annotated with a direction:upstream configuration, because the peak center is upstream of the feature start position (the genomic location for this peak is overlapStart). The third peak would not be annotated with a direction:upstream configuration. However, if it is annotated (due to another configuration), the genomic location of this peak is PeakInsideFeature.
The next two peaks (fourth and fifth from left) are annotated in a direction:downstream configuration. As described for upstream, both peaks are located downstream of gene A but report different genomic locations: overlapEnd and downstream.
The next peak could be annotated for gene B, which is located inside the peak, representing the genomic location FeatureInsidePeak.
The last two peaks, close to gene C, have the genomic locations overlapStart and overlapEnd. Only the last peak has a peak center located upstream of the gene and is found in a direction:upstream configuration.
The following examples are based on either H3K4me1 (ENCFF001SUE) or POLR2A (ENCFF001VFA) peaks, and on either Ensembl (Homo_sapiens.GRCh37.75) or Gencode v19 (gencode.v19.annotation) GTF files. For detailed references, see Used example peak and annotation files.
Example 1: feature.anchor¶
UROPA maximises flexibility in terms of annotation. With the key feature.anchor it is possible to decide from which anchor point of the feature the distance to the peak center is calculated.
Other tools typically calculate the distance from the TSS of a gene, which corresponds to start in UROPA. Besides start, UROPA supports the center and end of the feature.
This can be useful for the annotation of histone marks and is applied in the query definition by setting feature.anchor to the suitable position.
By default, the distances from all three positions to the peak center are calculated and the closest distance is chosen.
The peak is annotated for that feature only if the chosen distance is smaller than or equal to the maximum distance defined in the distance key.
The allhits excerpt in Table 6.1 shows the annotation for a peak located inside a feature (like the third peak from left in Figure 1).
Assuming gene A is very large, the peak center is far away from the start of the gene. If center is used as feature.anchor, the gene is found as a valid annotation.
This is what happens for peak 71 with the following configuration:
{
"queries": [
{"feature":"gene", "distance":5000, "feature.anchor": "start", "show.attributes":"gene_name"},
{"feature": "gene","distance":5000, "feature.anchor":"center"}],
"priority" : "False",
"gtf": "gencode.v19.annotation.gtf",
"bed": "ENCFF001SUE.bed"
}
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | ||||||||||||||||
peak_71 | chr22 | 18161387 | 18161441.5 | 18161496 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 |
peak_71 | chr22 | 18161387 | 18161441.5 | 18161496 | . | gene | 18111621 | 18213388 | + | center | 1063 | PeakInsideFeature | 1.0 | 0.01 | BCL2L13 | 1 |
… |
Table 6.1: Excerpt of allhits result file for a configuration with two queries applying different feature.anchor values. With "feature.anchor":"start" there is no valid annotation (query 0); with "feature.anchor":"center" the annotation for gene BCL2L13 is valid (query 1).
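The distances in Table 6.1 can be reproduced with a few lines of Python. The sketch below is not UROPA's actual implementation; the function names `anchor_positions` and `anchor_distance` are hypothetical. Two behaviours are inferred from the example tables rather than stated in the text: on the minus strand `start` and `end` swap coordinates, and distances are truncated to whole base pairs.

```python
def anchor_positions(feat_start, feat_end, strand):
    """Return the genomic coordinate of each feature anchor.

    Assumption (inferred from the example tables): on the '-' strand the
    feature 'start' is its higher coordinate and 'end' its lower one.
    """
    if strand == "-":
        start, end = feat_end, feat_start
    else:
        start, end = feat_start, feat_end
    return {"start": start, "center": (feat_start + feat_end) / 2, "end": end}


def anchor_distance(peak_start, peak_end, feat_start, feat_end, strand, anchor=None):
    """Distance from the peak center to one anchor, or to the closest anchor.

    With anchor=None all three anchors are tried and the closest one wins,
    mirroring the default behaviour described above.
    Returns (anchor_name, distance); the distance is truncated to an integer.
    """
    peak_center = (peak_start + peak_end) / 2
    positions = anchor_positions(feat_start, feat_end, strand)
    if anchor is not None:
        positions = {anchor: positions[anchor]}
    name = min(positions, key=lambda a: abs(positions[a] - peak_center))
    return name, int(abs(positions[name] - peak_center))


# peak_71 and gene BCL2L13 from Table 6.1, with a maximum distance of 5000
peak = (18161387, 18161496)
gene = (18111621, 18213388, "+")
for anchor in ("start", "center"):
    name, dist = anchor_distance(*peak, *gene, anchor=anchor)
    print(anchor, dist, "valid" if dist <= 5000 else "rejected")
```

With `start` the distance is 49820 bp, so query 0 is rejected; with `center` it is 1063 bp, which matches the valid row of query 1.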
Hint
If the size of the chosen feature is highly variable, the key internals could be used (see Example 3).
Example 2: direction¶
The following example illustrates the utility of the key direction, which can be important for a specialized annotation.
Compare the peaks with upstream direction in Figure 1.
If direction:upstream is used, a peak is annotated to a feature if the peak center is upstream of the feature start and the distance from the feature.anchor is smaller than the chosen distance value.
The same applies to direction:downstream, where the peak is expected to be located downstream of the gene end.
Thus, the location of the peak is relative to the feature’s direction. A partial overlap of the feature with the start or end of the peak is possible, but the overlap of the peak needs to be < 50%.
This example is illustrated with the peak displayed in Figure 2. It is located between two genes. Because both distances are very small, it can be tricky to decide which annotation is the best. With respect to gene ATAD3C it is located downstream; with respect to ATAD3B it is located upstream. Depending on the nature of the peaks, the more suitable configuration can be chosen with the direction key.

Figure 2: H3K4me1 peak_21044 (chr1:1,403,500-1,408,500) annotated with the Gencode GTF. By eye, there are two valid annotations, the genes ATAD3B and ATAD3C. Depending on the biological background of the peak, one allocation is more suitable than the other. Because the peaks are known to represent H3K4me1 marks, a location upstream of a gene is more likely than downstream.
In the following excerpt of a result file, the UROPA annotation was performed with two queries:
{
"queries": [
{"feature": "gene", "distance":1000, "show.attributes":"gene_name"},
{"feature": "gene", "distance":1000, "direction":"upstream"}],
"gtf": "gencode.v19.annotation.gtf",
"bed": "ENCFF001SUE.bed"
}
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | ||||||||||||||||
peak_21044 | chr1 | 1406116 | 1406250.5 | 1406385 | . | gene | 1407143 | 1433228 | + | start | 892 | upstream | 0.0 | 0.0 | ATAD3B | 0 |
peak_21044 | chr1 | 1406116 | 1406250.5 | 1406385 | . | gene | 1385069 | 1405538 | + | end | 712 | downstream | 0.0 | 0.0 | ATAD3C | 0 |
peak_21044 | chr1 | 1406116 | 1406250.5 | 1406385 | . | gene | 1407143 | 1433228 | + | start | 892 | upstream | 0.0 | 0.0 | ATAD3B | 1 |
… |
Table 6.2: Result table allhits for H3K4me1 peak 21044, annotated for two genes. Query 0 represents the default direction ("direction":"any_direction"). The second query represents an annotation with a specified direction: within query 1 only annotations upstream of the feature are allowed.
Peak 21044, displayed in Figure 2, would be annotated for both genes as shown in Table 6.2. For query 0 the final hit for this peak is gene ATAD3C due to the smaller distance. However, the annotation for gene ATAD3B might be more suitable, because H3K4me1 is known to flank enhancers, which are located upstream of genes. This annotation behaviour is achieved with query 1. In this case the annotation in which the peak lies downstream of the feature (ATAD3C) is no longer valid.
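The effect of the direction key on Table 6.2 can be sketched as follows. This is a simplified illustration rather than UROPA's code: the helper names `relative_direction` and `passes_direction` are hypothetical, only the peak center is considered, and both the distance check and the < 50% overlap rule are left out.

```python
def relative_direction(peak_center, feat_start, feat_end, strand):
    """Classify the peak center as upstream, downstream or inside the feature.

    The strand is taken into account: for a '-' strand feature, positions
    with higher coordinates than the feature are upstream.
    """
    if feat_start <= peak_center <= feat_end:
        return "inside"
    left_of_feature = peak_center < feat_start
    if strand == "-":
        return "downstream" if left_of_feature else "upstream"
    return "upstream" if left_of_feature else "downstream"


def passes_direction(direction, peak_center, feat_start, feat_end, strand):
    """Apply a 'direction' setting; 'any_direction' accepts every candidate."""
    if direction == "any_direction":
        return True
    return relative_direction(peak_center, feat_start, feat_end, strand) == direction


# peak_21044 and its two neighbouring genes from Table 6.2
peak_center = 1406250.5
genes = {"ATAD3B": (1407143, 1433228, "+"), "ATAD3C": (1385069, 1405538, "+")}
for direction in ("any_direction", "upstream"):
    kept = [name for name, g in genes.items()
            if passes_direction(direction, peak_center, *g)]
    print(direction, kept)
```

With `any_direction` both genes remain candidates, while `upstream` keeps only ATAD3B, as in query 1 of Table 6.2.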
Example 3: internals¶
In some cases the sizes of features and peaks differ considerably. Peak annotations might then get lost even if the peak is located within a feature, or vice versa, because the limiting distance value is exceeded.
To avoid this, the internals key can be used. With this key, peaks located inside a feature (and features located inside a peak) are annotated even if the distance is larger than specified.
By default the parameter is set to False.
Note
Compare to Example 1: With "internals":"True" it would not be necessary to identify the most appropriate feature.anchor, because the peak is located inside the feature and would not be rejected for exceeding the given distance value.
This example is based on the peak displayed in Figure 3. The peak is very large and the region includes three different genes.

Figure 3: POLR2A peak 13 (chr6:27,858,000-27,863,000) annotated with Ensembl. The peak is very large: without using internals, some features get lost because of large distances.
Setting this key ensures that features located within peaks, and vice versa, are kept.
The first query (query 0) of the following configuration shows the default behaviour of the internals key. In the second query (query 1) the key is set to True:
{
"queries":[
{"feature":"gene", "distance":500, "show.attributes":"gene_name"},
{"feature":"gene", "distance":500, "internals":"True"}],
"gtf":"Homo_sapiens.GRCh37.75.gtf",
"bed":"ENCFF001VFA.bed"
}
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | ||||||||||||||||
peak_13 | chr6 | 27857165 | 27860401 | 27863637 | . | gene | 27858093 | 27860884 | - | start | 483 | FeatureInsidePeak | 0.43 | 1.0 | HIST1H3J | 0 |
peak_13 | chr6 | 27857165 | 27860401 | 27863637 | . | gene | 27860477 | 27860963 | - | end | 76 | FeatureInsidePeak | 0.08 | 1.0 | HIST1H2AM | 0 |
peak_13 | chr6 | 27857165 | 27860401 | 27863637 | . | gene | 27861203 | 27861669 | + | start | 802 | FeatureInsidePeak | 0.07 | 1.0 | HIST1H2BO | 1 |
peak_13 | chr6 | 27857165 | 27860401 | 27863637 | . | gene | 27858093 | 27860884 | - | start | 483 | FeatureInsidePeak | 0.43 | 1.0 | HIST1H3J | 1 |
peak_13 | chr6 | 27857165 | 27860401 | 27863637 | . | gene | 27860477 | 27860963 | - | end | 76 | FeatureInsidePeak | 0.08 | 1.0 | HIST1H2AM | 1 |
… |
Table 6.3: allhits for POLR2A peak_13 with query key "internals":"False" for query 0 and "internals":"True" for query 1.
As displayed in Table 6.3, there are two valid annotations for the given configuration for query 0. But the third gene in this genomic region is missed due to its large distance to any feature.anchor.
This is different for query 1: even though the distance limit is exceeded, the third gene is annotated for this peak. The annotation for this peak is also a good example of the usage of the different output tables.
The annotation for gene HIST1H2AM would be represented in the finalhits for both queries.
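The decision behind Table 6.3 can be written down as a small predicate. This is a minimal sketch under the assumption that "internal" means full containment of the peak in the feature or of the feature in the peak; the names `is_internal` and `is_valid_hit` are hypothetical and not part of UROPA.

```python
def is_internal(peak_start, peak_end, feat_start, feat_end):
    """True if the peak lies inside the feature or the feature inside the peak."""
    peak_in_feature = feat_start <= peak_start and peak_end <= feat_end
    feature_in_peak = peak_start <= feat_start and feat_end <= peak_end
    return peak_in_feature or feature_in_peak


def is_valid_hit(distance, max_distance, internals, peak, feature):
    """Accept a hit within max_distance, or any internal hit if internals is set."""
    if distance <= max_distance:
        return True
    return internals and is_internal(*peak, *feature)


# peak_13 and gene HIST1H2BO from Table 6.3 (distance 802, limit 500)
peak_13 = (27857165, 27863637)
hist1h2bo = (27861203, 27861669)
print(is_valid_hit(802, 500, False, peak_13, hist1h2bo))  # query 0
print(is_valid_hit(802, 500, True, peak_13, hist1h2bo))   # query 1
```

HIST1H2BO is rejected without internals and accepted with it, while the two closer genes pass the distance check in both queries.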
Example 4: filter.attribute + attribute.value¶
If the annotation should be more specific, the linked keys filter.attribute and attribute.value can be used to further restrict the annotation.
For example, peaks should be annotated not to any gene but only to protein-coding genes.
The most general key aims for a specific feature, e.g. genes (with the key feature). Further characteristics that should be fulfilled are added with the linked keys.
The first query of the following config has no further filter than "feature":"gene". The second query aims for genes that are “protein_coding”. The corresponding attribute is “gene_biotype” (see GTF file).
That attribute name has to be given in the filter.attribute key. The corresponding value that should be accepted has to be given in the attribute.value key.
{
"queries":[
{"feature":"gene", "distance":5000, "show.attributes":["gene_name","gene_biotype"]},
{"feature":"gene", "distance":5000, "filter.attribute": "gene_biotype",
"attribute.value": "protein_coding"}],
"gtf":"Homo_sapiens.GRCh37.75.gtf",
"bed":"ENCFF001VFA.bed"
}
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | gene_biotype | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | |||||||||||||||||
peak_10 | chr1 | 28832002 | 28836390 | 28840778 | . | gene | 28832492 | 28837404 | + | end | 1014 | FeatureInsidePeak | 0.56 | 1.0 | SNHG3 | sense_intronic | 0 |
peak_10 | chr1 | 28832002 | 28836390 | 28840778 | . | gene | 28832455 | 28865812 | + | start | 3935 | overlapStart | 0.95 | 0.25 | RCC1 | protein_coding | 0 |
peak_10 | chr1 | 28832002 | 28836390 | 28840778 | . | gene | 28835071 | 28835274 | + | end | 1116 | FeatureInsidePeak | 0.03 | 1.0 | SNORA73B | snoRNA | 0 |
peak_10 | chr1 | 28832002 | 28836390 | 28840778 | . | gene | 28832455 | 28865812 | + | start | 3935 | overlapStart | 0.95 | 0.25 | RCC1 | protein_coding | 1 |
… |
Table 6.4: Excerpt of table allhits with key feature:gene and distance:5000. For query 0 all annotations for the feature gene are valid; in query 1 the gene has to be protein coding to be a valid annotation.
As shown in the allhits Table 6.4, there are three valid annotations for peak 10 for query 0 but only one valid annotation for query 1. The final hit for query 0 would be the annotation for SNHG3 with a distance of 1014 bp. But this might not be what one expects from an annotation run, as the gene biotype is sense_intronic.
Tip
It is only possible to filter for values given in the attribute column. GTF source files can contain different attribute keys and values, so make sure the chosen values are present.
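To check which attribute keys and values a GTF file offers before writing a filter, the attribute column can be parsed directly. The following sketch assumes the usual `key "value";` layout of the ninth GTF column. The function names and the example attribute strings (including the placeholder gene ids) are hypothetical, and unquoted values such as `level 2` are ignored by this simple pattern.

```python
import re

# Matches pairs such as gene_biotype "protein_coding" in a GTF attribute column.
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(attribute_column):
    """Turn the ninth GTF column into a dict of attribute key -> value."""
    return dict(ATTRIBUTE_PATTERN.findall(attribute_column))


def matches_filter(attribute_column, filter_attribute, attribute_value):
    """True if the feature carries filter_attribute with the requested value."""
    return parse_attributes(attribute_column).get(filter_attribute) == attribute_value


# Hypothetical attribute strings in Ensembl style, for illustration only
rcc1 = 'gene_id "GENE_A"; gene_name "RCC1"; gene_biotype "protein_coding";'
snora73b = 'gene_id "GENE_B"; gene_name "SNORA73B"; gene_biotype "snoRNA";'
print(matches_filter(rcc1, "gene_biotype", "protein_coding"))
print(matches_filter(snora73b, "gene_biotype", "protein_coding"))
```

A feature that lacks the attribute altogether fails the filter in this sketch, which is one reason to confirm that the chosen key exists in the GTF source.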
Example 5: priority¶
More than one query can be defined.
If there are several queries, it is important to decide whether they should be prioritized. In the preceding examples no priority was used, which means that all queries were evaluated.
The following examples illustrate the effect of priority.
Configuration with two queries and without prioritisation:
{
"queries":[
{"feature":"gene", "distance":1000, "show.attributes":"gene_name"},
{"feature":"transcript", "distance":1000}],
"priority" : "False",
"gtf":"Homo_sapiens.GRCh37.75.gtf",
"bed":"ENCFF001VFA.bed"
}
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | ||||||||||||||||
peak_6 | chr7 | 5562617 | 5567820 | 5573023 | . | gene | 5567734 | 5567817 | - | start | 3 | FeatureInsidePeak | 0.01 | 1.0 | AC006483.1 | 0 |
peak_6 | chr7 | 5562617 | 5567820 | 5573023 | . | transcript | 5566782 | 5567729 | - | start | 91 | FeatureInsidePeak | 0.09 | 1.0 | ACTB | 1 |
peak_6 | chr7 | 5562617 | 5567820 | 5573023 | . | transcript | 5566787 | 5570232 | - | center | 689 | FeatureInsidePeak | 0.33 | 1.0 | ACTB | 1 |
peak_6 | chr7 | 5562617 | 5567820 | 5573023 | . | transcript | 5567734 | 5567817 | - | start | 3 | FeatureInsidePeak | 0.01 | 1.0 | AC006483.1 | 1 |
… | ||||||||||||||||
peak_10 | chr1 | 28832002 | 28836390 | 28840778 | . | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 |
peak_10 | chr1 | 28832002 | 28836390 | 28840778 | . | transcript | 28832863 | 28836145 | + | end | 245 | FeatureInsidePeak | 0.37 | 1.0 | SNHG3 | 1 |
peak_10 | chr1 | 28832002 | 28836390 | 28840778 | . | transcript | 28836589 | 28862538 | + | start | 199 | overlapStart | 0.48 | 0.16 | RCC1 | 1 |
Table 6.5: Excerpt of allhits table for two queries without prioritisation.
The above set of queries allows UROPA to annotate peaks for genes and transcripts. As priority is False (default), no query is prioritized. As presented in the allhits Table 6.5, there are valid annotations for peak 6 with both queries. The annotation for the feature gene would be presented in the finalhits. For peak 10, there are valid annotations only for the second query; the annotation for the gene RCC1 corresponds to the best annotation and would be represented in the finalhits.
Next, after changing the priority flag in the configuration above to "priority":"TRUE", the result looks like:
peak_id | peak_chr | peak_start | peak_center | peak_end | peak_strand | feature | feat_start | feature_end | feat_strand | feat_anchor | distance | genomic_location | feat_ovl_peak | peak_ovl_feat | gene_name | query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | ||||||||||||||||
peak_6 | chr7 | 5562617 | 5567820 | 5573023 | . | gene | 5567734 | 5567817 | - | start | 3 | FeatureInsidePeak | 0.01 | 1.0 | AC006483.1 | 0 |
… | ||||||||||||||||
peak_10 | chr1 | 28832002 | 28836390 | 28840778 | . | transcript | 28832863 | 28836145 | + | end | 245 | FeatureInsidePeak | 0.37 | 1.0 | SNHG3 | 1 |
peak_10 | chr1 | 28832002 | 28836390 | 28840778 | . | transcript | 28836589 | 28862538 | + | start | 199 | overlapStart | 0.48 | 0.16 | RCC1 | 1 |
Table 6.6: Excerpt of allhits table with two queries with prioritisation.
If priority is TRUE, UROPA will annotate peaks with the first feature found, taking the order of queries into account. Once an annotation is assigned, the following queries are not evaluated at all for the current peak. The example is based on the same cases as above. As peak 6 was annotated by query 0, query 1 is not evaluated. For peak 10, there was no valid annotation for query 0, thus query 1 was evaluated and a valid annotation was identified.
Hint
- For priority true there will not be an NA row for queries without valid annotations.
- If there is no valid annotation for a peak across all queries, there is a combined NA row for all queries (NA NA … NA 0,1).
- There will be no besthits if priority is TRUE, as there is only one final annotation per peak.
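The difference between Table 6.5 and Table 6.6 can be reproduced with a short sketch of the query loop. This is an illustration under stated assumptions, not UROPA's implementation: `annotate_peak` is a hypothetical function, each hit is reduced to a name and a distance, and ties in distance are resolved in favour of the earlier query.

```python
def annotate_peak(hits_per_query, priority):
    """Select the allhits rows and the final hit for one peak.

    hits_per_query: one list per query, in config order, holding
    (feature_name, distance) tuples of valid annotations.
    With priority, evaluation stops at the first query that yields a hit.
    """
    allhits = []
    for query_index, hits in enumerate(hits_per_query):
        allhits.extend((name, dist, query_index) for name, dist in hits)
        if priority and hits:
            break
    # The final hit is the annotation with the smallest distance.
    finalhit = min(allhits, key=lambda hit: hit[1]) if allhits else None
    return allhits, finalhit


# Valid annotations from Table 6.5: query 0 targets genes, query 1 transcripts
peak_6 = [[("AC006483.1", 3)], [("ACTB", 91), ("ACTB", 689), ("AC006483.1", 3)]]
peak_10 = [[], [("SNHG3", 245), ("RCC1", 199)]]

for label, hits in (("peak_6", peak_6), ("peak_10", peak_10)):
    for priority in (False, True):
        allhits, finalhit = annotate_peak(hits, priority)
        print(label, priority, len(allhits), finalhit)
```

For peak 6 the prioritized run keeps only the gene hit of query 0, whereas for peak 10 query 0 is empty, so query 1 is still evaluated and RCC1 becomes the final hit.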
Used example peak and annotation files¶
Annotation:
- Ensembl database of the human genome, version hg19 (GRCh37): Ensembl genome
- Human Gencode genome, version hg19: Gencode genome
Peak and signal files based on ChIP-seq of GM12878 immortalized cell line:
Note
To find the used peak files you have to choose hg19 at Processed data (default is GRCh38).
Hint
Peak ids are manually added to simplify the description of different peaks.
UROPA to GTF utility¶
The GTF file is a common format used for annotation. UROPA accepts GTF files downloaded from any online database, such as UCSC, Ensembl, or Gencode. Additionally, custom GTF files can be used.
The gencode v19 annotation GTF is illustrated in Table 7.1.
chr1 | HAVANA | gene | 11869 | 14412 | . | + | . | gene_id “ENSG00000223972” ; transcript_id “ENSG00000223972.4”;gene_type “pseudogene”; gene_status “KNOWN”; gene_name “DDX11L1”;transcript_type “pseudogene”;transcript_status “KNOWN”;transcript_name “DDX11L1”; level 2; havana_gene “OTTHUMG00000000961.2”; |
… |
Table 7.1: First row of gencode v19 GTF file
The uropa2gtf tool transforms annotation files that are not in GTF format. Such annotation files can be generated, for instance, by the UCSC Table Browser or many other databases. For the conversion, the input annotation file needs to have a header, and there need to be columns with information about the location: ‘chr’, ‘start’, and ‘end’. Additionally, the file should be tab separated.
Another requirement is a file name without dots.
For example, if transcription factor binding sites (tfbs) from the UCSC table browser were downloaded, the name of the file should be the name of the table: ‘wgEncodeAwgTfbsBroadHuvecCtcfUniPk.txt’. The file name will be used in the attribute column.
There are two variations of the GTF file generator:
- One file should be converted and used for annotation. The GTF file keeps the base name of the input file.
- Several files should be used for annotation. In this case the input should be a folder with all annotation files included (but no others). The files will be converted one by one; additionally one merged GTF file (called uropa2gtf_’basename of the input dir’.gtf) will be created. For the merged file, the explicit file names are important for distinguishing the annotated features.
The generated files will be stored in the same directory as the input file.
Besides the mandatory input, there are three optional arguments that can be given for the transformation: the source, the feature, and the number of threads. The usage of this utility is very simple:
uropa2gtf.R -i input
This is the basic usage. The optional arguments can be specified with
uropa2gtf.R -i input -s yourSource -f yourFeature -t number-threads
e.g.
uropa2gtf.R -i wgEncodeAwgTfbsBroadHuvecCtcfUniPk.txt -s ucsc -f tfbs -t 5
The two arguments source and feature are used for the GTF reformatting itself. The argument threads can be used if multiprocessing should be used. There should be no spaces between the character and the equal sign when using the parameters in the command line call.
If optional arguments are given, they will overwrite information from the input file(s).
During the transformation, the input file is checked for information that is necessary for the GTF file format, like chr, start, end, strand, and others.
If the information is present in the input file, it will be adopted in the new GTF file.
The optional arguments are used in the GTF file for the corresponding columns. If an optional argument is not given and this information is also not present in the input file, the column will be filled with undefined. For other information that is not present in the input file, the column will be filled with dots.
All additional columns present in the input file will be merged into the attributes column of the new GTF file. All that information can be shown as annotation specification with the show.attributes key in UROPA.
Furthermore, these are the attributes which can be filtered for specific values with the two linked keys filter.attribute and attribute.value.
The custom GTF transformation is useful if the peaks should not be annotated to a gene, but for example to known tfbs or other regulatory elements. For instance, this is handy for an ATAC-seq peak annotation.
#bin | chrom | chromStart | chromEnd | name | score | strand | signalValue | pValue | qValue | peak |
74 | chr1 | 1310465 | 1310835 | . | 244 | . | 372.141 | -1 | 482.217 | 185 |
76 | chr1 | 3407792 | 3408060 | . | 1000 | . | 178.305 | -1 | 482.214 | 129 |
… |
Table 7.2: Downloaded table from UCSC Table Browser (wgEncodeAwgTfbsBroadHuvecCtcfUniPk) for CTCF transcription factor from Uniform TFBS track.
After transformation with -f tfbs and -s ucsc, the GTF format annotation file will look as displayed in Table 7.3.
chr1 | ucsc | tfbs | 1310465 | 1310835 | 244 | . | . | bin 74; signalvalue 372.141; pvalue -1; qvalue 482.217; peak 185; table wgEncodeAwgTfbsBroadHuvecCtcfUniPk |
chr1 | ucsc | tfbs | 3407792 | 3408060 | 1000 | . | . | bin 76; signalvalue 178.305; pvalue -1; qvalue 482.217; peak 129; table wgEncodeAwgTfbsBroadHuvecCtcfUniPk |
Table 7.3: GTF file generated from the UCSC Table Browser table wgEncodeAwgTfbsBroadHuvecCtcfUniPk
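The mapping from Table 7.2 to Table 7.3 can be mimicked in a few lines of Python. The actual utility is the R script uropa2gtf.R; the sketch below only illustrates the column mapping. The function `row_to_gtf` is hypothetical, and three details are assumptions chosen to match Table 7.3: the UCSC column names are used for the location, attribute keys are lower-cased, and extra columns that contain only a dot are skipped.

```python
def row_to_gtf(header, row, source, feature, table_name):
    """Convert one tab-separated table row into a GTF-like line.

    Assumptions for this sketch: location columns are named as in Table 7.2,
    attribute keys are lower-cased, and extra columns holding '.' are skipped.
    """
    record = dict(zip((h.lstrip("#") for h in header), row))
    core = {"chrom", "chromStart", "chromEnd", "score", "strand"}
    attributes = [
        f"{key.lower()} {value}"
        for key, value in record.items()
        if key not in core and value != "."
    ]
    # The file (table) name is appended as an additional attribute.
    attributes.append(f"table {table_name}")
    fields = [
        record["chrom"], source, feature,
        record["chromStart"], record["chromEnd"],
        record.get("score", "."), record.get("strand", "."),
        ".",  # frame column, not available in the input
        "; ".join(attributes),
    ]
    return "\t".join(fields)


header = ["#bin", "chrom", "chromStart", "chromEnd", "name", "score",
          "strand", "signalValue", "pValue", "qValue", "peak"]
row = ["74", "chr1", "1310465", "1310835", ".", "244",
       ".", "372.141", "-1", "482.217", "185"]
print(row_to_gtf(header, row, "ucsc", "tfbs", "wgEncodeAwgTfbsBroadHuvecCtcfUniPk"))
```

The printed line corresponds to the first row of Table 7.3; every key in its last column could then be used with filter.attribute and attribute.value.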
UROPA GUI¶
A Web Platform for genomic region annotation
The annotation of genomic ranges such as peaks resulting from ChIP-seq/ATAC-seq or other techniques represents a fundamental task of bioinformatics analysis with crucial impact on many downstream analyses. In our previous work, we introduced the Universal Robust Peak Annotator (UROPA), a flexible command line based tool which considerably extends the functionality of existing annotation software. In order to reduce the complexity for biologists and clinicians, we have implemented an intuitive web-based graphical user interface (GUI) and fully functional service platform for UROPA. This extension will empower all users to generate annotations for regions of interest interactively.
Features¶
- Input: BED file of peaks or other genomic regions
- Reference: GTF file of desired target features (e.g. genes, transcripts, probes, repeats, …); source = Gencode/Ensembl (102 organisms included) or custom upload
- Association rules: VERY diverse and easily combinable to a complex ruleset (see Configuration file)
- Persistence: Unique identifier is created on the server and results will remain available temporarily using the respective link
- Hosted: Either online on our web server or as a local R Shiny installation
Availability and Implementation¶
The open source UROPA GUI server was implemented in R Shiny and Python and is available from UROPA_GUI . Please make sure to check loosolab for a comprehensive overview of all our projects and implemented tools.
Try UROPA GUI¶
The UROPA_GUI contains all necessary data to quickly sample the capabilities of UROPA GUI.
- UROPA GUI user guide: Stepwise tutorial.
- Example data: The following demo GTF and BED files are available on the server.
- Homo_sapiens.hg19.GRCh37.75_genes_v2.gft -> Human GTF file with reference genes
- ENCFF001VFA.pol2.sub.bed -> POLR2A ChIP-seq experiment (14989 peaks)
Note
It is mandatory to select filter.attribute before attribute.value. Depending on the GTF file, it might take some time for the attribute.value to be loaded. Please be patient!
How to cite¶
- Kondili M, Fust A, Preussner J, Kuenne C, Braun T, and Looso M. UROPA: a tool for Universal RObust Peak Annotation. Scientific Reports 7 (2017), doi: https://www.nature.com/articles/s41598-017-02464-y
- Hendrik Schultheis, Jens Preussner, Annika Fust, Mette Bentsen, Carsten Kuenne, Mario Looso: UROPA: A Web Platform for genomic region annotation
License¶
This project is licensed under the MIT license.
MIT License¶
Copyright (c) 2016 MPI for Heart and Lung Research
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Contact
Questions? Please contact Mario Looso. Thanks!
Help¶
If you have trouble please email Mario Looso. Please report bugs on our Github issue tracker. They will be addressed as soon as possible.
FAQ
- How to define multiple queries in the configuration file?
- “queries”:[{},{}] All queries together are enclosed in square brackets, and each single query is enclosed in curly brackets. Different queries are separated by commas.
- What is the tabix Error about?
- This error occurs if not all genomic locations represented in the peak file are also present in the GTF annotation file. Those peaks cannot be annotated and will be NA in the outputs.
- How to annotate with default values?
- In order to activate default values, the key itself shouldn’t be present in the config file. If you want to annotate with default values only, do not remove the queries key, but leave it empty like “queries”:[]
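The bracket structure from the first FAQ entry and the empty queries list from the last one can be checked by building the configuration in Python and serialising it to JSON. The keys and file names are taken from Example 1; the variable names are only illustrative, and the result would still need to be written to a configuration file.

```python
import json

# Two queries inside the "queries" list; keys left out fall back to defaults.
config = {
    "queries": [
        {"feature": "gene", "distance": 5000, "show.attributes": "gene_name"},
        {"feature": "gene", "distance": 5000, "feature.anchor": "center"},
    ],
    "priority": "False",
    "gtf": "gencode.v19.annotation.gtf",
    "bed": "ENCFF001SUE.bed",
}

# Annotation with default values only: keep the key, leave the list empty.
default_config = {**config, "queries": []}

config_text = json.dumps(config, indent=2)
print(config_text)
print(json.dumps(default_config))
```

Generating the file this way rules out syntax slips such as a missing comma between two queries.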