The ENCODE Encyclopedia

The ENCODE Encyclopedia comprises two levels of epigenomic and transcriptomic annotations (Figure 1). The ground level includes annotations such as peaks and quantifications for individual data types produced by the ENCODE uniform processing pipelines. The integrative level contains annotations generated by integrating multiple ground-level annotations. The core of the integrative level is the Registry of candidate cis-Regulatory Elements (cCREs), which are displayed in SCREEN, a web-based visualization engine designed specifically for the Registry. SCREEN allows the user to explore cCREs and investigate how these elements relate to other Encyclopedia annotations and raw ENCODE data.
The Registry of cCREs
cCREs are the subset of representative DNase hypersensitivity sites (rDHSs) supported by either histone modifications (H3K4me3 and H3K27ac) or CTCF-binding data. We started with 93 million individual DHSs across 706 DNase-seq profiles in human and 20 million individual DHSs from 173 DNase-seq profiles in mouse. For each species, we iteratively clustered the DHSs across all profiles and selected the DHS with the highest signal (read-depth-normalized signal) as the rDHS for each cluster. This iterative clustering and selection process continued until it produced a list of non-overlapping rDHSs (2.2 million rDHSs in human and 1.2 million rDHSs in mouse) representing all DHSs (Figure 3). We then selected rDHSs with high DNase signal in at least one biosample (defined as a Z-score > 1.64; see details on defining high signal below). Finally, from this subset of high-signal rDHSs, we selected all elements that were also supported by high H3K4me3, H3K27ac, and/or CTCF ChIP-seq signals in concerted biosamples (i.e., samples with complementary assay coverage). This resulted in a total of 926,535 human cCREs and 339,815 mouse cCREs (Figure 2).
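The iterative clustering and selection can be sketched as follows. This is a minimal illustration rather than the ENCODE pipeline itself: the tuple layout `(chrom, start, end, signal)`, the function names, and the rule of discarding every DHS that overlaps an already chosen representative before re-clustering are assumptions made for the example.

```python
def overlaps(a, b):
    """True if two (chrom, start, end, signal) records share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def cluster(dhss):
    """Group DHSs into clusters of transitively overlapping intervals."""
    clusters = []
    for dhs in sorted(dhss, key=lambda d: (d[0], d[1])):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == dhs[0] and dhs[1] < last["end"]:
            last["members"].append(dhs)
            last["end"] = max(last["end"], dhs[2])
        else:
            clusters.append({"chrom": dhs[0], "end": dhs[2], "members": [dhs]})
    return [c["members"] for c in clusters]


def select_rdhs(dhss):
    """Repeat cluster-and-pick until every DHS is represented by an rDHS."""
    remaining = list(dhss)
    rdhss = []
    while remaining:
        # Pick the highest-signal DHS of each cluster as its representative.
        reps = [max(members, key=lambda d: d[3]) for members in cluster(remaining)]
        rdhss.extend(reps)
        # Assumed rule: a DHS overlapping a representative is now accounted for.
        remaining = [d for d in remaining if not any(overlaps(d, r) for r in reps)]
    return sorted(rdhss, key=lambda d: (d[0], d[1]))


example = [
    ("chr1", 100, 200, 5.0),
    ("chr1", 190, 300, 9.0),
    ("chr1", 290, 400, 3.0),
    ("chr1", 390, 500, 4.0),
]
rdhss = select_rdhs(example)
```

In this toy input all four DHSs form one chain. The first round keeps the strongest DHS (190-300), and the second round keeps the only DHS that does not overlap it (390-500), so the final list is non-overlapping.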
Defining high epigenomic signals
For each rDHS, we computed the Z-scores of the log10 of DNase, H3K4me3, H3K27ac, and CTCF signals in each biosample with such data. Z-score computation is necessary for the signals to be comparable across biosamples because the uniform processing pipelines for DNase-seq and ChIP-seq data produce different types of signals. The DNase-seq signal is in sequencing-depth normalized read counts, whereas the ChIP-seq signal is the fold change of ChIP over input. Even for the ChIP-seq signal, which is normalized using a control experiment, substantial variation remains in the range of signals among biosamples.
To implement this Z-score normalization, we used the UCSC tool bigWigAverageOverBed to compute the signal for each rDHS for a DNase, H3K4me3, H3K27ac, or CTCF experiment. For DNase and CTCF, the signal was averaged across the genomic positions in the rDHS. The signals of H3K4me3 and H3K27ac were averaged across an extended region (the rDHS plus a 500-bp flanking region on each side) to account for these histone marks at the flanking nucleosomes. We then took the log10 of these signals and computed a Z-score for each rDHS compared with all other rDHSs within a biosample. rDHSs with a raw signal of 0 were assigned a Z-score of -10. For all analyses, we defined "high signal" as a Z-score greater than 1.64, a threshold corresponding to the 95th percentile of a one-tailed test. We define the max-Z of an rDHS as the maximum Z-score for a signal across all surveyed biosamples.
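A minimal sketch of this normalization is shown below. It assumes the per-rDHS averages have already been computed (for example with bigWigAverageOverBed). The function names are illustrative, and two details are assumptions that the text does not specify: zero-signal rDHSs are excluded from the mean and standard deviation, and the population standard deviation is used.

```python
import math
from statistics import mean, pstdev

HIGH_Z = 1.64          # "high signal" threshold from the text
ZERO_SIGNAL_Z = -10.0  # Z-score assigned to rDHSs with a raw signal of 0


def signal_z_scores(signals):
    """Z-scores of log10(signal) across all rDHSs within one biosample."""
    logs = [math.log10(s) for s in signals if s > 0]
    if not logs:
        return [ZERO_SIGNAL_Z] * len(signals)
    mu, sigma = mean(logs), pstdev(logs)
    z_scores = []
    for s in signals:
        if s <= 0:
            z_scores.append(ZERO_SIGNAL_Z)
        elif sigma == 0:
            z_scores.append(0.0)  # all non-zero signals identical
        else:
            z_scores.append((math.log10(s) - mu) / sigma)
    return z_scores


def max_z(z_by_biosample):
    """Per-rDHS maximum Z-score across biosamples (lists share rDHS order)."""
    return [max(zs) for zs in zip(*z_by_biosample.values())]


z = signal_z_scores([0.0, 1.0, 10.0, 100.0])
best = max_z({"biosample_a": [0.5, 2.0], "biosample_b": [1.0, -10.0]})
is_high = [value > HIGH_Z for value in best]
```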

Classification of cCREs
Many uses of cCREs are based on the regulatory role associated with their biochemical signatures. Thus, we putatively assigned each cCRE to one of the following annotation groups based on the element’s dominant biochemical signals across all available biosamples. Analogous to GENCODE's catalog of genes, which are defined irrespective of their varying expression levels and alternative transcripts across different cell types, we provide a general, cell type-agnostic classification of cCREs based on their max-Zs as well as their proximity to the nearest annotated TSS:
- cCREs with promoter-like signatures (cCRE-PLS) fall within 200 bp (center to center) of an annotated GENCODE TSS and have high DNase and H3K4me3 signals (evaluated as DNase and H3K4me3 max-Z scores, defined as the maximal DNase or H3K4me3 Z scores across all biosamples with data; see Methods).
- cCREs with enhancer-like signatures (cCRE-ELS) have high DNase and H3K27ac max-Z scores and must additionally have a low H3K4me3 max-Z score if they are within 200 bp of an annotated TSS. The subset of cCREs-ELS within 2 kb of a TSS is denoted proximal (cCRE-pELS), while the remaining subset is denoted distal (cCRE-dELS).
- DNase-H3K4me3 cCREs have high H3K4me3 max-Z scores but low H3K27ac max-Z scores and do not fall within 200 bp of a TSS.
- CTCF-only cCREs have high DNase and CTCF max-Z scores and low H3K4me3 and H3K27ac max-Z scores.
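The four rules above can be written as a small decision function. This is a sketch under stated assumptions, not the Registry's code: the function name, the order in which the rules are tested, and treating "within" a distance as less than or equal to it are choices made for illustration. Inputs are max-Z scores and the distance in bp to the nearest annotated TSS.

```python
def classify_ccre(dnase, h3k4me3, h3k27ac, ctcf, tss_distance, high=1.64):
    """Return a cell type-agnostic group label, or None if no rule applies."""
    if dnase <= high:
        return None  # every cCRE is required to have a high DNase max-Z
    high_k4 = h3k4me3 > high
    high_k27 = h3k27ac > high
    high_ctcf = ctcf > high
    near_tss = tss_distance <= 200

    if near_tss and high_k4:
        return "PLS"
    if high_k27:
        # Reaching this point means the element is not (near a TSS and high H3K4me3).
        return "pELS" if tss_distance <= 2000 else "dELS"
    if high_k4 and not near_tss:
        return "DNase-H3K4me3"
    if high_ctcf:
        # H3K4me3 and H3K27ac are both low on this branch.
        return "CTCF-only"
    return None


labels = [
    classify_ccre(2.0, 2.0, 0.0, 0.0, 100),
    classify_ccre(2.0, 0.0, 2.0, 0.0, 100),
    classify_ccre(2.0, 0.0, 2.0, 0.0, 5000),
    classify_ccre(2.0, 2.0, 0.0, 0.0, 5000),
    classify_ccre(2.0, 0.0, 0.0, 2.0, 5000),
]
```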
In addition to the cell type-agnostic classification described above, we evaluated the biochemical activity of each cCRE in each individual cell type using the corresponding DNase, H3K4me3, H3K27ac, and CTCF data (Figure 3). All cCREs with low DNase Z-scores in a particular cell type are bundled into one “inactive” state for that cell type; the remaining “active” cCREs are divided into eight states according to their epigenetic signal Z-scores, producing nine possible states in total. The three groups described above—cCRE-PLS, cCRE-ELS, and CTCF-only cCRE—apply to the active cCREs within a particular cell type. Two additional groups are defined with respect to individual cell types: an inactive group, containing all cCREs in the inactive state, and a DNase-only group, containing cCREs with high DNase Z-scores but low H3K4me3, H3K27ac, and CTCF Z-scores within the cell type. Importantly, while the classification schemes in Figures 2 and 3 place each cCRE into only one activity group, the signal strengths for all recorded epigenetic features are retained for each cCRE in the Registry, and these can be used for customized searches by users.
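One way to read the nine states is as a single inactive state plus the eight high/low combinations of H3K4me3, H3K27ac, and CTCF among DNase-high elements. The sketch below encodes that reading; the interpretation, the function name, and the label format are all assumptions for illustration.

```python
from itertools import product


def activity_state(dnase, h3k4me3, h3k27ac, ctcf, high=1.64):
    """Label a cCRE's state in one cell type from its four Z-scores."""
    if dnase <= high:
        return "inactive"
    marks = (("H3K4me3", h3k4me3), ("H3K27ac", h3k27ac), ("CTCF", ctcf))
    high_marks = [name for name, z in marks if z > high]
    return "+".join(["DNase"] + high_marks)


# Enumerate every state by switching each of the three marks on or off.
all_states = {"inactive"}
for flags in product([False, True], repeat=3):
    z_values = [2.0 if on else 0.0 for on in flags]
    all_states.add(activity_state(2.0, *z_values))
```

Under this encoding the plain `"DNase"` label corresponds to the DNase-only group described above.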
We also attempt to make group assignments for cCREs in biosamples not fully covered by the four core assays, making some approximations. For samples with DNase data, we classify elements using the available marks. For example, if a sample lacks H3K27ac, its cCREs will be assigned to the PLS and DNase-H3K4me3 groups but not the pELS or dELS groups. For biosamples lacking DNase data, we do not have the resolution to identify specific elements. Therefore, for these biosamples, we simply label each cCRE as having a high or low signal for every available assay. In these biosamples, cCREs with low H3K4me3, H3K27ac, or CTCF signals are labeled “unclassified” because we are unable to classify them as low-DNase without DNase data. In both SCREEN and the downloadable files, biosamples lacking data are clearly labeled as such.

Genomic Footprint of the cCREs
We analyzed the percentage of the genome covered by each group of cCREs, considering only regions of the genome that are not blacklisted (~3.2 billion bases for human and ~2.7 billion bases for mouse). In total, 7.9% of the mappable human genome is covered by cCREs (0.3% by cCREs-PLS, 1.1% by cCREs-pELS, 5.8% by cCREs-dELS, 0.2% by DNase-H3K4me3 cCREs, and 0.7% by CTCF-only cCREs), and 3.4% of the mappable mouse genome is covered by cCREs (Figure 4). The lower coverage for mouse is due to the smaller number of cell types with data to define cCREs.
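A coverage percentage of this kind can be computed by merging overlapping intervals and dividing the covered bases by the mappable genome size. The sketch below assumes half-open `(chrom, start, end)` tuples and uses made-up coordinates; it is an illustration of the arithmetic, not the analysis code behind the figures above.

```python
def covered_bases(intervals):
    """Total bases covered by (chrom, start, end) intervals, counting overlaps once."""
    total = 0
    cur_chrom, cur_start, cur_end = None, 0, 0
    for chrom, start, end in sorted(intervals):
        if chrom == cur_chrom and start <= cur_end:
            cur_end = max(cur_end, end)    # extend the current merged block
        else:
            total += cur_end - cur_start   # close the previous block
            cur_chrom, cur_start, cur_end = chrom, start, end
    return total + (cur_end - cur_start)


def percent_covered(intervals, mappable_bases):
    """Percentage of the mappable genome covered by the intervals."""
    return 100.0 * covered_bases(intervals) / mappable_bases


toy_ccres = [("chr1", 0, 100), ("chr1", 50, 150), ("chr2", 0, 50)]
toy_percent = percent_covered(toy_ccres, mappable_bases=1000)
```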

Additional properties of cCREs
We performed additional analyses on the cCREs, including:
- Comprehensiveness of the Registry
- Conservation
- Overlap with other epigenomic and transcriptomic data
- Experimental validation from mouse transgenic assays
Details and results from these analyses can be found at our companion website, http://encyclopedia.wenglab.org/.
Integration with ground level annotations
In addition to hosting the Registry of cCREs, SCREEN also integrates the ground-level Encyclopedia annotations. The cCRE Details page for each cCRE lists the overlapping ground-level annotations with links to the experiments from which they were derived. Additionally, some annotations, such as histone mark and TF ChIP-seq peaks, gene expression, and RAMPAGE transcription levels, are highlighted further in dedicated tabs on the cCRE Details page.
Using cCREs to Interpret GWAS Variants
Curating GWAS Results
We downloaded associations reported in the NHGRI-EBI genome-wide association studies (GWAS) catalog as of January 1, 2019. Because mixed populations complicate linkage disequilibrium (LD) structures, we selected only studies that were performed on a single population. For each study, we downloaded all reported SNPs (p < 10⁻⁶), even those that were just under genome-wide significance. We then intersected all reported SNPs, together with SNPs in high LD with them (r² > 0.7), with GRCh38 cCREs. These results are available through the SCREEN GWAS app.
Determining Cell Types with cCREs Enriched in GWAS SNPs
For studies with more than 25 lead SNPs, we performed biosample enrichment analysis. For each study, we generated a matching set of control SNPs as follows: for each SNP in the study (p-value < 10⁻⁶), we selected a SNP on Illumina and Affymetrix SNP chips that fell within the same population-specific minor allele frequency (MAF) quartile and the same distance-to-TSS quartile. We repeated this process 500 times, generating 500 random control SNPs for each GWAS SNP. Then, for both GWAS and control SNPs, we retrieved all SNPs in high linkage disequilibrium (LD r² > 0.7), creating LD groups. This method was adapted and modified from the Uncovering Enrichment through Simulation (UES) method developed by the Klein Lab (Hayes et al. 2015).
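The matching step can be sketched as stratified sampling. In this illustration each SNP is a `(MAF, distance to TSS)` pair keyed by an ID. Computing quartile boundaries from the chip SNP pool, sampling with replacement, and the fixed random seed are all assumptions, and the function names are hypothetical.

```python
import random
from bisect import bisect_right
from statistics import quantiles


def quartile_index(value, cuts):
    """Quartile (0 to 3) of a value given the three quartile cut points."""
    return bisect_right(cuts, value)


def matched_controls(gwas_snps, chip_snps, n_controls=500, seed=0):
    """Draw control SNPs from the same MAF and TSS-distance quartile bin."""
    maf_cuts = quantiles([maf for maf, _ in chip_snps.values()], n=4)
    dist_cuts = quantiles([dist for _, dist in chip_snps.values()], n=4)

    def bin_of(maf, dist):
        return quartile_index(maf, maf_cuts), quartile_index(dist, dist_cuts)

    bins = {}
    for snp_id, (maf, dist) in chip_snps.items():
        bins.setdefault(bin_of(maf, dist), []).append(snp_id)

    rng = random.Random(seed)
    controls = {}
    for snp_id, (maf, dist) in gwas_snps.items():
        pool = [s for s in bins.get(bin_of(maf, dist), []) if s != snp_id]
        # Assumed: sample with replacement so small bins still yield n_controls.
        controls[snp_id] = rng.choices(pool, k=n_controls) if pool else []
    return controls


# Toy pool in which MAF and TSS distance both increase with the index.
chip = {f"c{i}": (0.01 + 0.004 * i, 1000 * i) for i in range(100)}
controls = matched_controls({"lead1": (0.012, 500)}, chip)
```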
To assess whether the cCREs in a biosample were enriched in the GWAS SNPs, we intersected GWAS and control LD groups with cCREs with an H3K27ac Z-score > 1.64 in the biosample. To avoid overcounting, we pruned the overlaps, counting each LD group once per biosample. We modified the UES method by calculating p-values from Z-scores for statistical testing. We calculated enrichment for overlapping cCREs by comparing the GWAS LD groups with the 500 matched controls. Finally, we applied a false discovery rate threshold of 5% to each study.
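The statistical test can be sketched as follows. For one biosample, the number of GWAS LD groups overlapping active cCREs is compared with the same count in each of the 500 control sets. The one-sided normal p-value and the Benjamini-Hochberg procedure are assumptions here, since the text states only that p-values were derived from Z-scores and that a 5% FDR threshold was applied. The counts below are made up.

```python
import math
from statistics import mean, stdev


def enrichment(observed, control_counts):
    """Z-score and upper-tail normal p-value of observed vs. control overlaps."""
    mu, sd = mean(control_counts), stdev(control_counts)
    if sd == 0:
        raise ValueError("control counts have zero variance")
    z = (observed - mu) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return z, p


def benjamini_hochberg(p_values, fdr=0.05):
    """Flag which p-values pass the Benjamini-Hochberg procedure at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / m:
            cutoff_rank = rank
    passed = [False] * m
    for i in order[:cutoff_rank]:
        passed[i] = True
    return passed


toy_controls = [10, 12, 8, 10]  # overlaps seen in four toy control sets
z_null, p_null = enrichment(10, toy_controls)
z_hit, p_hit = enrichment(20, toy_controls)
significant = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
```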

How to Cite the ENCODE Encyclopedia, the Registry of cCREs, and SCREEN
- ENCODE Project Consortium, Jill E. Moore, Michael J. Purcaro, Henry E. Pratt, Charles B. Epstein, Noam Shoresh, Jessika Adrian, et al. 2020. “Expanded Encyclopaedias of DNA Elements in the Human and Mouse Genomes.” Nature 583 (7818): 699–710.
Introduction
SCREEN and the ENCODE Encyclopedia are deeply integrated with the UCSC Genome Browser to facilitate genome-wide visualization of all of the Encyclopedia’s annotations. You can visualize all ground-level annotations from the Encyclopedia using our "mega-trackhub", which contains peaks and signal for all the core DNA-based and RNA-based assays available at ENCODE as well as integrative annotations related to the Registry of cCREs. Alternatively, SCREEN offers the capability to generate customized trackhubs with cCRE-related data including DNase-seq, H3K4me3 ChIP-seq, H3K27ac ChIP-seq, CTCF ChIP-seq, and RNA-seq. Select a tab below for more information or to access the mega-trackhub.
cCRE tracks
If you are interested in visualizing the Registry of cCREs alongside your own data, you can use the buttons below:
Human (GRCh38) Mouse (mm10)
Mega trackhubs
We offer mega trackhubs for human and mouse which provide access to all the ground level data in the ENCODE Encyclopedia. These mega hubs are divided into three hubs for each species: a DNA-based hub, containing assays targeting DNase accessibility, DNA binding by transcription factors, DNA methylation, and other DNA-related features; an RNA-based hub, containing assays targeting RNA expression, RNA binding protein occupancy, and other RNA-related features; and an integrative hub, containing cCREs and the epigenetic data used to derive them. Mouse hubs are available on the mm10 genome; human hubs are available both on hg19 and GRCh38. You can use the buttons below to access the trackhubs at UCSC:
DNA RNA Integrative
Custom trackhubs in SCREEN
Nearly everywhere a genomic feature with genomic coordinates is presented on SCREEN, an accompanying UCSC button is available, which leads to a view of the surrounding genomic neighborhood in the UCSC Genome Browser. Examples of features with associated UCSC Genome Browser buttons include cCREs, genes, RAMPAGE TSSs, and annotations from external datasets such as the FANTOM Consortium’s catalogs. The locations of these UCSC Genome Browser buttons are presented in Figure 1.

Figure 1. Locations of UCSC Genome Browser buttons in the main search table (top left), RNA-seq expression view (top right), RAMPAGE expression view (bottom left), and FANTOM intersection view (bottom right).
Clicking a UCSC Genome Browser button will bring you to a Genome Browser Configuration view, which allows you to select which data you are interested in viewing in the region surrounding your feature of interest. An example of this view is shown in Figure 2. The configuration view displays the coordinates of the selected feature at the top; when the genome browser is opened, it will be centered on these coordinates, expanded 7,500 base pairs upstream and downstream.
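The window arithmetic is simple enough to sketch. The example below expands a feature by 7,500 bp on each side and builds a UCSC Genome Browser link using the browser's standard `db` and `position` URL parameters. The function names, the example coordinates, and clamping the start at 1 are assumptions, and the link that SCREEN itself generates (which also loads its trackhub) is not reproduced here.

```python
from urllib.parse import urlencode


def browser_window(chrom, start, end, flank=7500):
    """Expand a feature by `flank` bp on each side, never going below position 1."""
    return chrom, max(1, start - flank), end + flank


def ucsc_url(chrom, start, end, db="hg38"):
    """Build a UCSC Genome Browser URL for the expanded window."""
    chrom, lo, hi = browser_window(chrom, start, end)
    query = urlencode({"db": db, "position": f"{chrom}:{lo}-{hi}"})
    return f"https://genome.ucsc.edu/cgi-bin/hgTracks?{query}"


window = browser_window("chr1", 10000, 10350)
url = ucsc_url("chr1", 10000, 10350)
```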

Figure 2. The Genome Browser configuration view, showing a selected cCRE with coordinates, the cell type selector with four biosamples selected, and handles for rearranging cell type order highlighted in red.
The default view includes cell type-agnostic cCRE tracks; using the 7-group/9-state toggle buttons, you may select whether to view a single track of 7-group classifications or an expanded set of three 9-state classification tracks, one each for H3K4me3, H3K27ac, and CTCF. cCREs within the group classification track will be red for promoter-like, yellow for enhancer-like, blue for CTCF-bound, or gray for inactive; cCREs within the 9-state tracks will be colored if the corresponding Z-score is greater than 1.64 and will be gray otherwise.
Next, you may use the cell type selector to add tracks for individual cell types to your visualization. Selecting a cell type will add the cell type’s cCRE tracks; the 7-group/9-state selection you made above will apply to these tracks too. Signal tracks for DNase-seq, H3K4me3 ChIP-seq, H3K27ac ChIP-seq, CTCF ChIP-seq, and RNA-seq will also be added when they are available. The colored box to the right of the cell type name shows which of the four core epigenomic marks have signal available for the given cell type, and a check mark in the rightmost RNA-seq column indicates that a cell type has RNA-seq signal available.
Clicking a cell type will add it to the Selected Biosamples list. If you would like to rearrange the order in which the cell type tracks appear in the UCSC genome browser before you open it, you may do so by clicking the handles to the left of the cell type names and dragging the cell types up or down. When you are content with your selections, click the Open in UCSC button to open the Genome Browser.
Part of the Genome Browser view for the selection in Figure 2 is shown in Figure 3. The selected cCRE is highlighted in blue at the center of the screen. The view shows General, or cell type-agnostic, cCREs on top, followed by cCRE activity and the available DNase-seq and RNA-seq data for A172, and then the cCRE activity and the available histone mark ChIP-seq data for ACC112. You may further customize your view with additional tracks from the Encyclopedia Trackhub if desired, as described below.

Figure 3. Part of the genome browser view for the selection in Figure 2, showing cell type-agnostic cCREs (top) and data for A172 and A549.