Output files¶

UROPA provides multiple output files, each providing valuable information in either a more extended or a more condense way in order to cover the needs of a user. Details are given below:

File overview¶

allhits.txt: Basic output table, reports for each peak all valid annotations and additionally NA rows for peaks without valid annotation.
finalhits.txt: Filtered output table, it reports the best (closest) feature according to the config criteria for each peak. If multiple queries are given, it reports the best annotation taking multiple queries into account.
finalhits.bed: Similar tp finalhits.txt in bed format. This means there is no header, the column order is as followed: peak_chr peak_start peak_end peak_id peak_strand peak_score feature feat_start feat_end feat_strand feat_anchor distance genomic_location + all attributes that are given in the config file
besthits.txt: This table is only produced if more than one query is given. It reports the best annotation per query for each peak.
besthits_compact.txt: Another format of the besthits produced by using the -r parameter. A compact table with best per query annotation for each peak in one row is presented.
summary.pdf: Statistical summary of the UROPA annotation generated by using the -s parameter.

Note

If no prefix is specified, the result files will be stored in the working diractory with the basename of the config file. For example,``uropa -i histonMarkPeaks.json`` leads to an allhits file called histonMarkPeaks_allhits.txt stored in the working directory.

With specified prefix, the result files will be stored in the working diractory with the prefix as part of the output file names. uropa -i histonMarkPeaks.json -p abc results in an allhits file called abc_allhits.txt. But if for instance the prefix would be defined as -p xy/abc all results would be stored in folder xy, which is be created if it does not exist already, with abc as prefixes of the output files.

Output columns explanation¶

The four output tables mentioned above contain informative columns about the performed peak annotation:

peak_id, peak_start, peak_center, peak_end, peak_strand: Peak information with id if available, otherwise a peak id in chr:start-end format will be created.
feature, feat_start, feat_end, feat_strand: The information of the genomic feature that annotates the peak as extracted by the GTF file.
feat_anchor: The position of the annotated genomic feature which was used for distance calculation. If feature.anchor is given in config, only this will be used. If multiple feature.anchor were given, the distance to all of them is calculated and the minimum distance is chosen.
distance : Absolute distance from peak center to feature anchor. Closest feature anchor if multiple are specified, see column feat_anchor.
genomic_location: The position of the peak relative to the annotated feature direction (e.g. upstream = peak located upstream of the feature, see Figure 2 in Application examples).
feat_ovl_peak: Percentage of the peak that is covered by the feature (1.0 = 100%, this corresponds to the genomic_location “PeakInsideFeature”).
peak_ovl_feat: Percentage of the feature that is covered by the peak (1.0 corresponds to the genomic_location “FeatureInsidePeak”).
gene_name, gene_id, gene_type,…: Attributes that have been given in the key show.atttributes with their values extracted from the GTF.

Hint

If filter.attribute key is used, make sure to specify these attributes are also selected in the show.attributes key, to have an easy confirmation of the filtering.
Make sure to define attributes for a notification in the output. Otherwise the annotated peaks will be reported without any information of the assigned features.

query: Valid query ID for the current annotation (0-based).

Output files (one query)¶

UROPA annotation with one query results in two output tables. Those are the allhits and finalhits. For a configuration as given below, the allhits is given in Table 1, and the finalhits is displayed in Table 2. Peak and annotation files are further described in the Application examples section.

The UROPA annotation process for one query can run into three cases for each peak:

Case 1: No query gives any feature for annotating the peak, this leads to no valid annotation at all -> NA row in allhits and finalhits.
Case 2: There is one valid annotation for the specified query -> annotation will be given in allhits and finalhits.
Case 3: There are multiple valid annotations for the specified query -> all valid annotations will be given in the allhits, the best annotation (smallest distance) will be presented in the finalhits.

{
"queries":[
        {"feature":"gene", "distance":10000, "feature.anchor":"start", "internals":"True",
            "filter.attribute":"gene_type", "attribute.value":"protein_coding",
            "show.attributes":["gene_name","gene_type"]}],
 "priority" : "False",
 "gtf":"gencode.v19.annotation.gtf" ,
 "bed":"ENCFF001VFA.bed"
}

peak_id	peak_chr	peak_start	peak_center	peak_end	peak_strand	feature	feat_start	feature_end	feat_strand	feat_anchor	distance	genomic_location	feat_ovl_peak	peak_ovl_feat	gene_name	gene_type	query
…
peak_355	chr15	79211550	79217124	79222698	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	0
peak_356	chr10	43902516	43904360.5	43906205	.	gene	43881065	43904614	-	start	253	overlapStart	0.57	0.09	HNRNPF	protein_coding	0
…
peak_765	chr5	98262863	98264852.5	98266842	.	gene	98190908	98262240	-	start	261	upstream	0.0	0.0	CHD1	protein_coding	0
…
peak_769	chr5	175814508	175816913.5	1751574319	.	gene	175810949	175815976	+	start	937	overlapStart	0.31	0.3	NOP16	protein_coding	0
peak_769	chr5	175814508	175816913.5	1751574319	.	gene	175815748	175816772	+	start	1165	FeatureInsidePeak	0.22	1.0	HIGD2A	protein_coding	0
peak_769	chr5	175814508	175816913.5	1751574319	.	gene	175792471	175828666	+	start	24442	PeakInsideFeature	1.0	0.14	ARL10	protein_coding	0
…

Table 5.1: Excerpt of table allhits for one query as described in the configuration above.

peak_id	peak_chr	peak_start	peak_center	peak_end	peak_strand	feature	feat_start	feature_end	feat_strand	feat_anchor	distance	genomic_location	feat_ovl_peak	peak_ovl_feat	gene_name	gene_type	query
…
peak_355	chr15	79211550	79217124	79222698	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	0
peak_356	chr10	43902516	43904360.5	43906205	.	gene	43881065	43904614	-	start	253	overlapStart	0.57	0.09	HNRNPF	protein_coding	0
…
peak_765	chr5	98262863	98264852.5	98266842	.	gene	98190908	98262240	-	start	261	upstream	0.0	0.0	CHD1	protein_coding	0
…
peak_769	chr5	175814508	175816913.5	1751574319	.	gene	175810949	175815976	+	start	937	overlapStart	0.31	0.3	NOP16	protein_coding	0
…

Table 5.2: Excerpt of table finalhits for one query as described in the configuration above.

As displayed in Table 1 and Table 2, peak 355 is a representative of Case 1. There is no valid annotation at all, thus there is an NA row in both output tables.

The peaks 356 and 765 belong to Case 2, there is one valid annotation for them, their annotation is displayed in the same way in allhits and finalhits.

Peak 769 has three valid annotations for the specified query (Case 3). All of them are displayed in the allhits output. In the finalhits only the best annotation, the one for gene NOP16 with the minimal distance of 937, is represented.

Output files (multiple queries)¶

UROPA annotation with multiple queries and default priority results in at least three output tables. Those are the allhits, finalhits, and besthits. If the -r parameter is added in the command line call, there will the additional output compact file. Furthermore, if the -s parameter is also added, the summary file is generated. With a configuration as given below, the generated output files are generated as presented in Tables 3 to 6 and Figure 1.

Peak and annotation files are further described in the Application examples section.

The UROPA annotation process for multiple queries allows an additional case in relation to the cases described for one query above:

Case 1 to 3 as described for one query
Case 4: There are valid annotations for one peak for multiple queries -> all valid annotations will be given in the allhits, the best annotation (smallest distance across all queries) will be presented in the finalhits. Additionally, the best annotation per query will be displayed in the besthits output.

{
    "queries":[
        {"feature":"gene", "distance":10000, "feature.anchor":"start", "internals":"True",
            "filter.attribute":"gene_type",  "attribute.value":"protein_coding",
            "show.attributes":["gene_name","gene_type"]},
        {"feature":"gene", "distance":10000, "feature.anchor":"start", "internals":"True",
            "filter.attribute":"gene_type",  "attribute.value":"lincRNA"},
        {"feature":"gene", "distance":10000, "feature.anchor":"start", "internals":"True",
            "filter.attribute":"gene_type",  "attribute.value":"misc_RNA"},
          ],
"priority" : "False",
"gtf": "gencode.v19.annotation.gtf",
"bed": "ENCFF001VFA.peaks.bed"
}

peak_id	peak_chr	peak_start	peak_center	peak_end	peak_strand	feature	feat_start	feature_end	feat_strand	feat_anchor	distance	genomic_location	feat_ovl_peak	peak_ovl_feat	gene_name	gene_type	query
…
peak_355	chr15	79211550	79217124	79222698	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	0
peak_355	chr15	79211550	79217124	79222698	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1
peak_355	chr15	79211550	79217124	79222698	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2
peak_356	chr10	43902516	43904360.5	43906205	.	gene	43881065	43904614	-	start	253	overlapStart	0.57	0.09	HNRNPF	protein_coding	0
peak_356	chr10	43902516	43904360.5	43906205	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1
peak_356	chr10	43902516	43904360.5	43906205	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2
…
peak_765	chr5	98262863	98264852.5	98266842	.	gene	98190908	98262240	-	start	261	upstream	0.0	0.0	CHD1	protein_coding	0
peak_765	chr5	98262863	98264852.5	98266842	.	gene	98264875	98330717	+	start	22	overlapStart	0.5	0.03	CTD-2007H13.3	protein_coding	1
peak_765	chr5	98262863	98264852.5	98266842	.	gene	98272342	98272451	-	start	7598	downstream	0.0	0.0	Y_RNA	protein_coding	2
…
peak_769	chr5	175814508	175816913.5	1751574319	.	gene	175810949	175815976	+	start	937	overlapStart	0.31	0.3	NOP16	protein_coding	0
peak_769	chr5	175814508	175816913.5	1751574319	.	gene	175815748	175816772	+	start	1165	FeatureInsidePeak	0.22	1.0	HIGD2A	protein_coding	0
peak_769	chr5	175814508	175816913.5	1751574319	.	gene	175792471	175828666	+	start	24442	PeakInsideFeature	1.0	0.14	ARL10	protein_coding	0
peak_769	chr5	175814508	175816913.5	1751574319	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1
peak_769	chr5	175814508	175816913.5	1751574319	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2
…

Table 5.3: Excerpt of table allhits for three queries as described in the configuration above.

peak_id	peak_chr	peak_start	peak_center	peak_end	peak_strand	feature	feat_start	feature_end	feat_strand	feat_anchor	distance	genomic_location	feat_ovl_peak	peak_ovl_feat	gene_name	gene_type	query
…
peak_355	chr15	79211550	79217124	79222698	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	0,1,2
peak_356	chr10	43902516	43904360.5	43906205	.	gene	43881065	43904614	-	start	253	overlapStart	0.57	0.09	HNRNPF	protein_coding	0
…
peak_765	chr5	98262863	98264852.5	98266842	.	gene	98264875	98330717	-	start	22 \| overlapStart		0.5	0.03	CTD-2007H13.3	protein_coding	1
…
peak_769	chr5	175814508	175816913.5	1751574319	.	gene	175810949	175815976	+	start	937	overlapStart	0.31	0.3	NOP16	protein_coding	0
…

Table 5.4: Excerpt of table finalhits for three queries as described in the configuration above.

peak_id	peak_chr	peak_start	peak_center	peak_end	peak_strand	feature	feat_start	feature_end	feat_strand	feat_anchor	distance	genomic_location	feat_ovl_peak	peak_ovl_feat	gene_name	gene_type	query
…
peak_355	chr15	79211550	79217124	79222698	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	0
peak_355	chr15	79211550	79217124	79222698	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1
peak_355	chr15	79211550	79217124	79222698	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2
peak_356	chr10	43902516	43904360.5	43906205	.	gene	43881065	43904614	-	start	253	overlapStart	0.57	0.09	HNRNPF	protein_coding	0
peak_356	chr10	43902516	43904360.5	43906205	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1
peak_356	chr10	43902516	43904360.5	43906205	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2
…
peak_765	chr5	98262863	98264852.5	98266842	.	gene	98190908	98262240	-	start	261	upstream	0.0	0.0	CHD1	protein_coding	0
peak_765	chr5	98262863	98264852.5	98266842	.	gene	98264875	98330717	+	start	22	overlapStart	0.5	0.03	CTD-2007H13.3	protein_coding	1
peak_765	chr5	98262863	98264852.5	98266842	.	gene	98272342	98272451	-	start	7598	downstream	0.0	0.0	Y_RNA	protein_coding	2
…
peak_769	chr5	175814508	175816913.5	1751574319	.	gene	175810949	175815976	+	start	937	overlapStart	0.31	0.3	NOP16	protein_coding	0
peak_769	chr5	175814508	175816913.5	1751574319	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1
peak_769	chr5	175814508	175816913.5	1751574319	.	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2
…

Table 5.5: Excerpt of table besthits for three queries as described in the configuration above.

Note

The besthits is only generated if multiple queries are specified and the priority flag is set to FALSE! If this flag is TRUE, there will be only one valid query. There can be multiple valid annotations for one peak, but all based on one query. In this case only the allhits and finalhits are produced.

Same as in the first example with one query, peak_355 has no valid annotation and is represented as NA row (correspond to Case 1). In the allhits (Table 3) and besthits (Table 5) there will be one NA row for each query. In the finalhits (Table 4) there will be only one NA row for all queries.

The peak_356 has only for one query a valid annotation, as given in allhits, finalhits, and besthits (Case 2). In allhits and besthits there are additional NA rows for this peak for the other queries without a hit.

For peak_765 there are valid annotations for all given queries as displayed in the allhits, (Case 4). The best of these is the annotation for the lincRNA, therefore this annotation is displayed in the finalhits. Because there is only one valid annotation for each query, they will be displayed in the same way in the besthits.

This is different for peak_769. As described above this peaks equates to Case 3. With multiple queries, there will be additional NA rows for the invalid queries in the allhits and besthits.

With multiple queries UROPA provides an option to reformat the besthits table in order to condense the best per query annotations for each peak in one row. A reformatted example for the besthits of Table 5 is presented in Tables 6. The Reformatted_HitsperPeak represents all information for each peak in one row. Within this format the information for query 0 is always given at the first position, for query 1 at second positon etc.

To receive this output format, the parameter -r has to be added to the command line call.

peak_id	peak_chr	peak_start	peak_center	peak_end	peak_strand	feature	feat_start	feature_end	feat_strand	feat_anchor	distance	genomic_location	feat_ovl_peak	peak_ovl_feat	gene_name	gene_type	query
…
peak_355	chr15	79211550	79217124	79222698	.	NA,NA,NA	NA,NA,NA	NA,NA,NA	NA,NA,NA	NA,NA,NA	NA,NA,NA	NA,NA,NA	NA,NA,NA	NA,NA,NA	NA,NA,NA	NA,NA,NA	0,1,2
peak_356	chr10	43902516	43904360.5	43906205	.	gene,NA,NA	43881065,NA,NA	43904614,NA,NA	-,NA,NA	start,NA,NA	253,NA,NA	overlapStart,NA,NA	0.57,NA,NA	0.09,NA,NA	HNRNPF,NA,NA	protein_coding,NA,NA	0,1,2
…
peak_765	chr5	98262863	98264852.5	98266842	.	gene,gene,gene	98190908,98264875,98272342	98262240,98330717,98272451	-,+,-	start,start,start	261,22,7598	upstream,overlapStart,downstream	0.0,0.5,0.0	0.0,0.03,0.0	CHD1,CTD-2007H13.3,Y_RNA	protein_coding,protein_coding,protein_coding	0,1,2
…
peak_769	chr5	175814508	175816913.5	1751574319	.	gene,NA,NA	175810949,NA,NA	175815976,NA,NA	+,NA,NA	start,NA,NA	937,NA,NA	overlapStart,NA,NA	0.31,NA,NA	0.3,NA,NA	NOP16	protein_coding	0,1,2
…

Table 5.6: Excerpt of table Reformatted_HitsperPeak for three queries as described in the configuration above.

Summary Visualisation¶

In order to generate a global summary, one can apply the -s parameter during the command line call. This summary is visualising a global overview of the generated UROPA annotations, and several usefull plots, like the distribution of the annotated peaks to the features of interst and the genomic locations according to the features of interest. Detailed information about the different plots and how they are dedicated to the different output file formats is explained below.

Overview of what can be found in the summary visualsation

An abstract of the UROPA annotation including the used peak and annotation files
Number of peaks and number of annotated peaks
Specified queries, and value of the priority flag (Figure 1A)
Number of peaks per query

Graphs based on the ‘finalhits’ output:

Density plot displaying the distance per feature across all queries (Figure 1B)
Pie chart illustrating the genomic locations of the peaks per annotated feature (Figure 1C)
Bar plot displaying the occurrence of the different features, if there is more than one feature assigned for peak annotation (not illustrated due to one feature in this example)

Figure 1A-C would be the summary for the first example UROPA run given above with only one query

Graphs based on the ‘besthits’ output:

Distribution of the distances per feature per query is displayed in a histogram (Figure 1D)
Pie chart illustrating the genomic locations of the peaks per annotated feature (not illustrated)
Pairwise comparisons among all queries is evaluated within a venn diagram (more than one query is needed; one pairwise comparison displayed in Figure 1E)
Chow Ruskey plot with comparison across all defined queries (for three to five annotation queries)(Figure 1F)

Figure 1: Summary file example for queries as described above: (A) Summary of specified queries, used annotation and peak files, and how many peaks were present and annotated, (B) Distance density for all features based on finalhits, (C) Pie Chart representing genomic location for each feature across finalhits, (D) Distance per query per feature across besthits, (E) Pairwise comparison across all queries displayed in Venn diagrams, (F) Chow Ruskey plot to compare all queries.