ENCODE sets analyzed
- Coding: This is a selection of 300 coding mRNAs
- lncRNAs: This is a list of all approx. 1000 lincRNAs that were found to be significantly expressed at the 5% level in humans, plus approx. 50 lincRNAs that were selected from lncRNAdb.
- Random: At the moment this is a list of random regions that were found to be significantly expressed. It is used to debug the random regions used for the background.
Main table fields
- Gene Id: This is the ENSEMBL ID of the gene. In some rare cases, for known lncRNAs that have been added but are not in GENCODE, it is just the plain name.
- GENCODE: Summary of information from GENCODE including the name,
annotation type (protein_coding, antisense, lincRNA, ...) and the
annotation status (known, putative, novel). Note: all lincRNAs have
been filtered to be "intergenic", and some lincRNAs annotated as
antisense do not actually overlap a protein-coding gene but are
just antisense to its neighbouring gene.
- lncRNAdb: Name of the lincRNA from lncRNAdb. Details can be looked up by searching for the name directly at the database.
- RNAcode: Minimum p-value observed while scanning the exonic regions of the GENCODE transcript. A p-value cutoff of 0.01 has a false positive and false negative rate of around 5%.
- Fraction mappable / Map diversity: upper line: The
fraction of mappable (i.e. alignable) positions of the exons of the
original ENSEMBL transcript. Important: The number is a hyperlink
to the UCSC genome browser. Lower line: A metric
describing how ambiguous the mapping is. Roughly speaking, if it is
0.5 there are two distinct but equally likely orthologous regions in
the genome. 1.0 means it is uniquely mapped.
- Avg. reads exon / p-value: upper line: The average
number of reads covering exons in human and the exonic mapped
regions in other species. A plus sign indicates that there is an
overlapping Cufflinks transcript.
Lower line: Empirical p-value of the read count calculated
from the background distribution of the random regions. Note: the
coloring at the moment is according to the read count and not the
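As a rough illustration of the empirical p-value described in the "Avg. reads exon / p-value" field, the following minimal sketch computes the fraction of background (random region) read counts that are at least as large as the observed count. The function name `empirical_p_value` and the +1 pseudocount are assumptions made for this example; the exact estimator used to build the table is not stated here.

```python
from bisect import bisect_left


def empirical_p_value(observed, background):
    """Empirical p-value of `observed` against a background sample.

    Returns the fraction of background values >= observed.
    The +1 pseudocount (an assumption of this sketch) keeps the
    estimate from ever being exactly zero.
    """
    if not background:
        raise ValueError("background must not be empty")
    ranked = sorted(background)
    # Number of background values that are >= the observed value.
    n_at_least = len(ranked) - bisect_left(ranked, observed)
    return (n_at_least + 1) / (len(ranked) + 1)


# Hypothetical read counts for four random background regions.
background_counts = [1, 2, 3, 4]
print(empirical_p_value(10, background_counts))
print(empirical_p_value(3, background_counts))
```

With more random regions in the background, the smallest p-value that can be reported becomes smaller, so the resolution of this estimate depends on the size of the random set.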
Read coverage diagrams
- The red bars on the top are the exons in the reference GENCODE
transcript (more precisely the union of all isoforms of a GENCODE
gene). The diagram is always shown in the plus direction, i.e. genes
on the negative strand are shown from left to right in 3'-5' orientation.
- The actual data tracks show the number of reads at each position. The
tracks are scaled uniformly; the maximum number of reads (corresponding
to the maximum y-value) is shown next to the track on the
- Red regions overlap exonic regions in the GENCODE transcript, blue
regions are outside exonic regions. Green regions consist of spliced
- The gray background shows that a region is mappable from human to
the respective species
- The upper diagram shows the sum over all tissues, the lower diagram shows all the data for all tissues.
- The human number in the "Fraction mappable" field takes you to the
UCSC genome browser. There is an entry for every unique exon observed
in all Cufflinks/(soon Scripture) transcripts overlapping the GENCODE
transcript. This allows one to discriminate between exon boundaries
and gaps because of missing positions in the alignment chain file. In
addition there is a fake "transcript" with a union of all exons. The
ends of this transcript indicate the region that is mappable between
human and the other species. If there are no Cufflinks/Scripture
transcripts in the region, either the region could not be mapped or
no overlapping transcript was found in the Cufflinks/Scripture
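The "union of all exons" used both for the red exon bars and for the fake union "transcript" can be pictured as a simple interval merge. The sketch below is only an illustration under stated assumptions: the function names `merge_exons` and `union_span`, the half-open `(start, end)` coordinate convention, and the example coordinates are all hypothetical and not taken from the actual pipeline.

```python
def merge_exons(exons):
    """Merge overlapping or touching exons into a union.

    `exons` is an iterable of (start, end) pairs on one chromosome,
    assumed to be half-open intervals (start inclusive, end exclusive).
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union_span(exons):
    """Return (start, end) of the whole union, i.e. the ends of the
    fake union transcript."""
    merged = merge_exons(exons)
    if not merged:
        raise ValueError("no exons given")
    return merged[0][0], merged[-1][1]


# Hypothetical exons from two isoforms of the same gene.
isoform_exons = [(100, 200), (400, 500), (150, 250)]
print(merge_exons(isoform_exons))
print(union_span(isoform_exons))
```

The gaps between merged intervals do not by themselves say whether a gap is an intron or a stretch missing from the alignment chain, which is why the browser view lists the individual exons as well.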