Genome Biology & Applied Bioinformatics: Day 2 Practical

Mehmet Tevfik DORAK

Course web page: http://www.dorak.info/genbiol

Bioinformatics links web page: http://www.dorak.info/mtd/bioinf.html

For SNP/gene lists and input files used in this practical, see the course web page.

I. Genes/Proteins

Go to: ENTREZ GENE (http://www.ncbi.nlm.nih.gov/gene)

Enter your gene symbol (for example, TFRC)

Choose the human gene from search results

Check the Table of Contents at the top right corner of the page

You will notice that there is much more information than just basic information, most notably:

· RNA-seq results (within Genomic regions, transcripts, and products)

· Link to dbSNP list of SNPs (See SNP Geneview Report)

· Pathways

· Interactions

Go to: UNIPROT (http://www.uniprot.org/uniprot/?query=*&fil=organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22)

Enter your protein symbol (for example TFRC)

Choose the correct protein (preferably the reviewed version) from the search results

Click on the Entry code

Check the Table of Contents of the page on the left

Note that under Sequence, protein variants are listed

Cross-references lists useful links

How to get a complete list of SNPs in a gene:

Go to Ensembl: http://www.ensembl.org

Choose Human and enter your gene name (for example, TFRC), and choose the direct link, which is: http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000072274;r=3:196027183-196082189

From the menu on the left, under Variation, click on Variant Table.

The complete list of variants will appear

At the top right of the list, there is an Excel icon to save the complete list.

Now try to get the same information using the Ensembl BioMart facility: http://www.ensembl.org/biomart

Select the options you want (the Ensembl ID for TFRC is ENSG00000072274), and click on Count at the top right to get an idea of how many variants have been selected. Then, click on Results. You can save the results in the selected format from this screen.
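The same variant list can also be requested programmatically. The following is a minimal sketch that uses the Ensembl REST API rather than the BioMart web interface described above; it assumes the `overlap/id` endpoint with `feature=variation` behaves as documented by Ensembl, and the function names are illustrative only.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def variant_overlap_url(gene_id: str) -> str:
    """Build the REST URL that lists variation features overlapping a gene."""
    return f"{ENSEMBL_REST}/overlap/id/{gene_id}?feature=variation"


def fetch_variants(gene_id: str) -> list:
    """Download the overlapping variants as a list of JSON records.

    Requires network access; the structure of each record is whatever the
    Ensembl service returns and is not assumed here.
    """
    request = urllib.request.Request(
        variant_overlap_url(gene_id),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example (not run here because it needs network access):
# variants = fetch_variants("ENSG00000072274")  # Ensembl ID for TFRC
# print(len(variants))
```

Comparing the number of records returned with the Count shown in BioMart is a simple check that both routes select the same variants.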

Other useful resources to obtain complete information on genes/proteins:

· PheGenI: http://www.ncbi.nlm.nih.gov/gap/phegeni (more than one gene can be searched; most useful for eQTL and dbGaP association results)

· HumanMine: http://www.humanmine.org (genetic, proteomic and metabolomic data)

For gene/protein expression analysis, see the PowerPoint presentation.

II. SNPs

For already known SNPs (with rsIDs) you want to assess, try the following protocol:

(If you do not have your own SNP to analyse, try rs1800562)

· SNiPA

· rVARbase

· mQTLdb

· GWASdb - GRASP - MR catalogue

· HaploReg v4 - RegulomeDB; VaDE

· SNPnexus, PheGenI, wANNOVAR, CADD (the score is provided by SNiPA, but not the 80+ features)

· Just in case: GTEx (if no result, try “test your own”, Blood eQTL Browser, Chicago eQTL Browser)

· PheWAS catalogue

· Open SNP, SNPedia

· PharmGKB, Immunobase, HumanMine

For new SNPs with no rsID, try the following:

· Reference Variant Store (https://rvs.u.hpc.mssm.edu/queries): The most comprehensive database of variants (currently 520 million). Can be queried by chromosome coordinates. Enter the coordinates as shown on the page (i.e., 7:140430000-140440000).

· Structure PPi (http://structureppi.bioinfo.cnio.es/Structure): Provides precomputed scores, and allows new queries in a very simple format as explained on the webpage. Also extracts data from dbNSFP (see below). Try TFRC:S142G for practice.

· CADD (http://cadd.gs.washington.edu/home): Precomputed scores for all potential genetic variants are available for download (very large files). In addition, a list of variants can be submitted as a simple five-column VCF-format file (chr, coordinate (hg19), rsID, ref_allele, alt_allele). See the provided sample file. The file can be uploaded at http://cadd.gs.washington.edu/score. Make sure you check the box for “Include underlying annotation in output (not only the scores)”. The results will be provided as a ZIP file (use 7-Zip to open).

· DANN and EIGEN scores: These are improved versions of CADD, but not yet available for queries via a web interface. The complete set of precomputed scores can be downloaded with help from a computer scientist/bioinformatician.

· RegulomeDB: Missing data rate is high, but worth a try for non-coding region mutations. Use the following address as a template and enter your own chromosome number (instead of 6) and hg19 coordinate (instead of 31636741): http://regulomedb.org/snp/chr6/31636741. For each position genomic features will be listed.

· SNPeffect (http://snpeffect.switchlab.org): Lists protein mutations with and without rsIDs. Worth checking if your mutation is already assessed here. Otherwise, you can submit a new job for analysis, providing a FASTA sequence and, in the next step, the position you want to mutate (see: http://snpeffect.switchlab.org/help#Input_formats).

· dbWGFP (http://bioinfo.au.tsinghua.edu.cn/dbwgfp): Check the examples for input file formats, and follow the instructions for file upload. Complete results can be downloaded as a single file for each chromosome (http://bioinfo.au.tsinghua.edu.cn/dbwgfp/downloads.php).

· dbNSFP (https://sites.google.com/site/jpopgen/dbNSFP): This database cannot be accessed via a web interface, but via ANNOVAR and some other tools. See the web page for further information. Results can also be downloaded with help from a computer scientist/bioinformatician.
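Several of the tools above are queried by coordinate rather than by rsID. As a minimal sketch, the helpers below build the RegulomeDB address from the template given above and the chromosome:start-end string expected by the Reference Variant Store; the function names are illustrative and not part of either tool.

```python
def strip_chr_prefix(chromosome: str) -> str:
    """Return the chromosome name without a leading 'chr' (e.g. 'chr6' -> '6')."""
    if chromosome.lower().startswith("chr"):
        return chromosome[3:]
    return chromosome


def regulomedb_url(chromosome: str, position: int) -> str:
    """Fill in the RegulomeDB URL template with a chromosome and hg19 coordinate."""
    return f"http://regulomedb.org/snp/chr{strip_chr_prefix(chromosome)}/{position}"


def region_query(chromosome: str, start: int, end: int) -> str:
    """Format a region as chromosome:start-end, as shown for the Reference Variant Store."""
    if start > end:
        raise ValueError("start must not be greater than end")
    return f"{strip_chr_prefix(chromosome)}:{start}-{end}"


print(regulomedb_url("6", 31636741))
print(region_query("7", 140430000, 140440000))
```

Generating the addresses this way avoids typing errors when many positions need to be checked one by one.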

III. SNP Sets

Generation of an ssSNP list:

HaploReg v4 (http://www.broadinstitute.org/mammals/haploreg/haploreg_v4.php): Without changing the Set Options (but check to see what they are), enter rs2395185 in the Query box. Click Submit. The resulting list contains all SNPs correlating with the queried SNP with an r²>0.80. Now go to Set Options, and change the LD threshold to 0.6, and the Output Mode to Text to download more detailed results as a text file. Open the file in Excel. Now you have a SNP set. Analyse this list in the following tools:

· RegulomeDB

· PheGenI

· SNPnexus

· CADD (using the SNPnexus Genomic Coordinates output with some rearrangement, you can obtain a CADD input file).

What is your conclusion? Does the lead SNP rs2395185 look like the causal SNP, or is one of its proxies more likely to be the causal SNP for its associations?
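If the proxy list has already been downloaded, the LD thresholding step can also be reproduced outside HaploReg. The sketch below assumes each proxy has been reduced to an rsID and its r² with the lead SNP; the `Proxy` record and the proxy names used in the comment are invented for illustration and do not reflect the actual column layout of the HaploReg text output.

```python
from typing import List, NamedTuple


class Proxy(NamedTuple):
    """One SNP and its r-squared with the lead SNP (hypothetical record layout)."""
    rsid: str
    r2: float


def select_proxies(proxies: List[Proxy], min_r2: float = 0.6) -> List[Proxy]:
    """Keep proxies with r2 at or above the threshold, strongest first."""
    kept = [proxy for proxy in proxies if proxy.r2 >= min_r2]
    return sorted(kept, key=lambda proxy: proxy.r2, reverse=True)


# Illustrative use with made-up proxies (only rs2395185 is from the practical):
# snp_set = select_proxies([Proxy("rs2395185", 1.0), Proxy("proxy_a", 0.55)])
```

Changing `min_r2` from 0.8 to 0.6 shows how quickly the SNP set grows as the threshold is relaxed.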

Analyzing an existing set of independent SNPs:

To functionally annotate a group of SNPs in one try, the following tools can be used:

· RegulomeDB

· PheGenI

· SNPnexus

· CADD

Of these, RegulomeDB provides a score (1a to 6) mainly for regulatory functions of non-coding region SNPs. Thus, a very functional missense SNP may be scored as non-functional unless it also has some regulatory function. Try, for example, rs1800562, which is a missense mutation causing hereditary hemochromatosis. Missing data rate is high.

PheGenI is an NCBI database, and accepts around 400 SNPs for each analysis. It provides basic information on SNPs, their associations in the GWAS catalog and NCBI dbGaP, and their eQTL status (but only from the NCBI eQTL Browser, which is based on lymphoblastoid cells).

SNPnexus can analyse up to 100,000 SNPs and provides comprehensive results including basic information on SNPs (from up to three browsers as selected by the user), HapMap frequencies in different populations, information on locations in microRNA sequences or CpG islands, whether the SNP is listed in COSMIC database of somatic cancer mutations and whether it changes a microRNA or transcription factor binding site, and associations as listed in GWAS catalog. Complete options are in the opening page and selections should be made by the user. The analysis may take a while depending on the number of SNPs submitted, but if an e-mail address is entered, a link will be sent when the analysis is complete. The complete set of results can be downloaded as a single Excel file with multiple worksheets.

The CADD website requires an input file in VCF format. This is basically a five-column tab-separated list of SNPs including the following columns (in this order):

Chromosome # | position (hg19) | rsID | reference allele | alternative allele

All of this information is available in the SNPnexus output (genomic coordinates worksheet), but not in this order. Therefore, a CADD input file can be prepared with some rearrangement of the SNPnexus output. If the rsID is not known, a dot should be put in its place (leaving it blank or entering anything other than a dot will not work). Make sure the file is tab-separated and that no cosmetic adjustment has been made to it (if the columns do not look aligned, leave them as they are). The file needs to be saved as a plain text (.txt) file (Word or Excel formats will not work). The file cannot be larger than 2 MB; a larger file needs to be gzip-compressed (for example, using 7-Zip) before uploading.

The CADD Score Variants page (http://cadd.gs.washington.edu/score) has all the instructions, and the file is also uploaded from this page. Make sure you check the box next to “Include underlying annotation in output (not only the scores)”. This way, the CADD output will include not only raw and normalized CADD scores (under the PHRED column), but also 80+ features of the SNPs. Detailed information on the output is given at http://cadd.gs.washington.edu/info (see also Supplementary Table 1 on page 22 of http://www.nature.com/ng/journal/v46/n3/extref/ng.2892-S1.pdf).

The normalized (or scaled) C-scores provided by CADD are called PHRED-like and are listed in the last column of the output file. The score is calculated as -10*log10(rank/total) and ranks a SNP relative to all possible substitutions in the human genome (8.6x10^9). A PHRED score of 10 or greater indicates that the SNP is predicted to be among the 10% most deleterious substitutions in the human genome (including hypothetical ones), a score of 20 or greater indicates the 1% most deleterious, and a score of 30 or greater indicates the 0.1% most deleterious (the example SNP rs1800562 has a PHRED score of 26). CADD scores can be obtained from SNiPA for individual SNPs, but for SNP sets, it is easier to obtain them directly from the CADD website.
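The input format and the PHRED-like scaling described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the variant in the usage comment is a placeholder rather than a real SNP, and the scaling function simply applies the -10*log10(rank/total) formula quoted above, with rank 1 taken to be the most deleterious substitution.

```python
import math
from typing import Iterable, Optional, Tuple

# (chromosome, hg19 position, reference allele, alternative allele, rsID or None)
Variant = Tuple[str, int, str, str, Optional[str]]


def cadd_line(chrom: str, pos: int, ref: str, alt: str, rsid: Optional[str] = None) -> str:
    """Format one variant as a five-column tab-separated line.

    Column order: chromosome, position, rsID, reference allele, alternative allele.
    A missing rsID is written as a dot, never left blank.
    """
    identifier = rsid if rsid else "."
    return "\t".join([str(chrom), str(pos), identifier, ref, alt])


def write_cadd_input(path: str, variants: Iterable[Variant]) -> None:
    """Write the variants to a plain text file, one line per variant."""
    with open(path, "w", newline="\n") as handle:
        for chrom, pos, ref, alt, rsid in variants:
            handle.write(cadd_line(chrom, pos, ref, alt, rsid) + "\n")


def phred_scaled(rank: float, total: float) -> float:
    """Apply the PHRED-like scaling -10*log10(rank/total)."""
    return -10 * math.log10(rank / total)


# Placeholder usage (not a real variant):
# write_cadd_input("cadd_input.txt", [("1", 12345, "A", "G", None)])
```

With this formula, a substitution ranked in the top 10% of all possible substitutions gets a score of 10 or more, the top 1% a score of 20 or more, and the top 0.1% a score of 30 or more, in line with the thresholds given above.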

CADD has been superseded by DANN and EIGEN, but there is no easy way of getting these scores (DANN scores can be obtained from ANNOVAR). They can be downloaded, but these would be very large files and will require help from computer scientists or bioinformaticians.

IV. Gene Sets

Gene sets can be analysed for a number of enrichments or overlaps (Gene Ontology (GO) functions; KEGG pathways; microRNA targets; sets of genes upregulated or downregulated in certain conditions/treatments; etc). In this practical, we will use DAVID and GSEA/MSigDB. Another noteworthy web-based tool is WebGestalt. HumanMine also offers gene set analysis (and is particularly useful for metabolic disorders).

The main issue with gene set analysis is the generation of the gene sets from GWAS results. In the past, the nearest gene was used as the target gene of each SNP, but this approach has been invalidated. The target of a SNP is the gene in which it lies only if the SNP is a coding-region SNP with some effect on protein function, or a splice-site SNP with an effect on splicing. Otherwise, the SNP usually influences a phenotype via an eQTL effect, and its target gene(s) is the gene(s) whose expression correlates with the genotypes of the SNP. Note that a SNP may show eQTL effects on multiple genes, and all of those will be represented in the gene set.
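The rule above for assigning target genes can be written down as a short sketch. The dictionary keys (`host_gene`, `affects_protein`, `affects_splicing`, `eqtl_genes`) are hypothetical annotation fields chosen for illustration; in practice they would be filled in from the SNP annotation tools used earlier in this practical.

```python
from typing import Dict, List, Set


def target_genes(snp: Dict) -> Set[str]:
    """Return the target gene(s) of one annotated SNP.

    A coding SNP affecting protein function, or a splice-site SNP affecting
    splicing, targets the gene it lies in; any other SNP targets the genes
    whose expression correlates with its genotypes (its eQTL genes).
    """
    if snp.get("affects_protein") or snp.get("affects_splicing"):
        return {snp["host_gene"]}
    return set(snp.get("eqtl_genes", []))


def build_gene_set(snps: List[Dict]) -> List[str]:
    """Combine the targets of all SNPs into one sorted, duplicate-free gene list."""
    genes: Set[str] = set()
    for snp in snps:
        genes |= target_genes(snp)
    return sorted(genes)


# Illustrative annotations: rs1800562 is the missense SNP used in this practical
# (it lies in HFE); the second SNP and its eQTL genes are placeholders.
example_snps = [
    {"rsid": "rs1800562", "host_gene": "HFE", "affects_protein": True},
    {"rsid": "snp_placeholder", "host_gene": "GENE_X", "eqtl_genes": ["GENE_A", "GENE_B"]},
]
print(build_gene_set(example_snps))
```

Note that the placeholder SNP contributes its two eQTL genes and not the gene it lies in, which is the point of abandoning the nearest-gene approach.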

Another issue is that most SNPs have very strong proxies and those SNPs are frequently present together. The question is then whether to include statistically similar SNPs (ssSNPs) in the process of generating a gene set and, if so, what r² value should be used. These are as yet unresolved issues and no evidence-based guidelines are available. However, a common-sense approach would be to include ssSNPs with an r² value of 0.6 or greater, since it has been documented that a causal SNP may have a correlation with the lead SNP with an r² as low as 0.50.

The gene list should only use gene symbols that are officially recognized. It is best to stick with official gene symbols as listed in NCBI ENTREZ GENE. Whichever nomenclature is used, consistency is important and all gene names should be in the same format in the list. Once the list is ready, it is simply a matter of submitting the list to DAVID or GSEA and following the steps as shown on the screen of each tool. Almost all analyses will yield statistically significant results. It will then be a matter of common sense how to interpret the results. The statistical significance as well as the false discovery rate (FDR) levels are helpful, but consistency and biological plausibility are also important. Doing this analysis is not the real challenge; interpreting the results is.
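To give a feel for what "statistically significant enrichment" means, the sketch below computes a generic over-representation p-value from the hypergeometric distribution. It is an illustration only and is not claimed to be the exact statistic used by DAVID or GSEA; the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, category: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` category genes in the gene set by chance.

    population: number of genes in the background
    category:   number of background genes annotated to the term or pathway
    sample:     number of genes in the submitted gene set
    hits:       number of gene-set genes annotated to the term or pathway
    """
    upper = min(category, sample)
    total = comb(population, sample)
    tail = sum(
        comb(category, k) * comb(population - category, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 5 in the pathway, gene set of 4, all 4 in the pathway.
print(enrichment_p_value(10, 5, 4, 4))
```

Because one such test is run for every term or pathway, many small p-values are expected by chance alone, which is why the FDR levels mentioned above matter when reading the output.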

Sample SNP and gene sets, and input files are provided via the course web page. Please also consult the practical presentations for each practical posted on the course web page.

Mehmet Tevfik DORAK