Why and how to use biomaRt?
Gene annotation is a routine part of bioinformatics work. In recent years more and more biological data has become available, so knowing how to access these valuable data resources and analyse them is important for any comprehensive bioinformatics analysis. biomaRt is a very useful tool for this. That leaves two questions: 1. Why use biomaRt? 2. How do we use biomaRt?
Let us first introduce the concept of BioMart. The BioMart project (http://www.biomart.org) provides free software and data services to the international scientific community in order to foster scientific collaboration and facilitate the scientific discovery process. Examples of BioMart databases are Ensembl, UniProt and HapMap. However, when a dataset is large and converting identifiers between different databases by hand is troublesome, we need a bioinformatics tool that can do it automatically. biomaRt is a package that provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way, without the need to know the underlying database schemas or write complex SQL queries. The major databases (e.g. Ensembl, UniProt) give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from R and Bioconductor.
The first way to use BioMart is online ID conversion. Go to the website http://useast.ensembl.org/biomart/martview/ and select the dataset, filters and attributes you need. Clicking the 'Results' button then shows the final output.
The second way is to use biomaRt, which is an R/Bioconductor package. There are two steps: (1) select the Mart database and (2) use getBM to get the gene annotation. However, how many Mart databases does the package offer? And how do we find the correct filter and attribute names for a given dataset? The functions 'listMarts' and 'listDatasets' show the available databases and datasets, while 'listFilters' and 'listAttributes' show the valid filter and attribute names. Let's check the corresponding results from R.
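The discovery calls named above can be sketched as follows. This is a minimal sketch, assuming the biomaRt package is installed, the Ensembl BioMart server is reachable, and the Mart is addressed as "ensembl"; the variable names (marts, datasets, filter_table, attribute_table) are only illustrative. The human dataset is chosen as an example, and the Marts and datasets listed by your own session will depend on the server and release you connect to.

```r
library(biomaRt)

# Step 1a: list the Mart databases offered by the default host.
marts <- listMarts()
head(marts)

# Step 1b: connect to the Ensembl genes Mart and list its datasets.
ensembl <- useMart("ensembl")
datasets <- listDatasets(ensembl)
head(datasets$dataset)

# Step 1c: select one dataset (here: human) inside that Mart.
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)

# Look up the valid filter and attribute names for the selected dataset.
filter_table <- listFilters(ensembl)
attribute_table <- listAttributes(ensembl)
head(filter_table)
head(attribute_table)
```

The object called ensembl here is the one passed to listDatasets, listFilters and listAttributes in the listings that follow.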
Available Marts, as returned by the command listMarts():
   version
1  ENSEMBL GENES 78 (SANGER UK)
2  ENSEMBL VARIATION 78 (SANGER UK)
3  ENSEMBL REGULATION 78 (SANGER UK)
4  VEGA 58 (SANGER UK)
5  ENSEMBL FUNGI 25 (EBI UK)
6  ENSEMBL FUNGI VARIATION 25 (EBI UK)
7  ENSEMBL METAZOA 25 (EBI UK)
8  ENSEMBL METAZOA VARIATION 25 (EBI UK)
9  ENSEMBL PLANTS 25 (EBI UK)
10 ENSEMBL PLANTS VARIATION 25 (EBI UK)
11 ENSEMBL PROTISTS 25 (EBI UK)
12 ENSEMBL PROTISTS VARIATION 25 (EBI UK)
13 MSD (EBI UK)
14 WTSI MOUSE GENETICS PROJECT (SANGER UK)
15 WORMBASE 220 (CSHL US)
16 MGI (JACKSON LABORATORY US)
17 PRIDE (EBI UK)
18 INTERPRO (EBI UK)
19 UNIPROT (EBI UK)
20 PARAMECIUM GENOME (CNRS FRANCE)
21 PARAMECIUM BIBLIOGRAPHY (CNRS FRANCE)
22 EUREXPRESS (MRC EDINBURGH UK)
23 Phytozome
24 Metazome
25 HAPMAP 27 (NCBI US)
26 INTOGEN ONCOMODULES
27 EUROPHENOME
28 IKMC GENES AND PRODUCTS (IKMC)
29 EMAGE GENE EXPRESSION
30 EMAP ANATOMY ONTOLOGY
31 EMAGE BROWSE REPOSITORY
32 GERMONLINE
33 SIGENAE OLIGO ANNOTATION (ENSEMBL 61)
34 SIGENAE OLIGO ANNOTATION (ENSEMBL 59)
35 SIGENAE OLIGO ANNOTATION (ENSEMBL 56)
36 BCCTB Bioinformatics Portal (UK and Ireland)
37 Predictive models of gene regulation from processed high-throughput epigenomics data: K562 vs. Gm12878
38 Predictive models of gene regulation from processed high-throughput epigenomics data: Hsmm vs. Hmec
39 PANCREATIC EXPRESSION DATABASE (BARTS CANCER INSTITUTE UK)
40 Multi-species: marker, QTL, SNP, gene, germplasm, phenotype, association, with Gene annotations
41 Grapevine 8x, stuctural annotation with Genetic maps (genetic markers..)
42 Grapevine 12x.0, stuctural and functional annotation with Genetic maps (genetic markers..)
43 Wheat, stuctural annotation with Genetic maps (genetic markers..)
44 Arabidopsis Thaliana TAIRV10, genes functional annotation
45 Zea mays ZmB73, genes functional annotation
46 Tomato, stuctural and functional annotation
47 Populus trichocarpa, genes functional annotation
48 Populus trichocarpa, genes functional annotation V2.0
49 Botrytis cinerea T4, genes functional annotation
50 Botrytis cinerea B0510, genes functional annotation
51 Leptosphaeria maculans, genes functional annotation
52 VectorBase Genes
53 VectorBase Variation
54 VectorBase Expression
55 GRAMENE 40 ENSEMBL GENES (CSHL/CORNELL US)
56 GRAMENE 40 VARIATION (CSHL/CORNELL US)
Available datasets (names only), as returned by the command listDatasets(ensembl):
 [1] "oanatinus_gene_ensembl"         "cporcellus_gene_ensembl"
 [3] "gaculeatus_gene_ensembl"        "lafricana_gene_ensembl"
 [5] "itridecemlineatus_gene_ensembl" "choffmanni_gene_ensembl"
 [7] "csavignyi_gene_ensembl"         "fcatus_gene_ensembl"
 [9] "rnorvegicus_gene_ensembl"       "psinensis_gene_ensembl"
[11] "cjacchus_gene_ensembl"          "ttruncatus_gene_ensembl"
[13] "scerevisiae_gene_ensembl"       "celegans_gene_ensembl"
[15] "csabaeus_gene_ensembl"          "oniloticus_gene_ensembl"
[17] "trubripes_gene_ensembl"         "amexicanus_gene_ensembl"
[19] "pmarinus_gene_ensembl"          "eeuropaeus_gene_ensembl"
[21] "falbicollis_gene_ensembl"       "ptroglodytes_gene_ensembl"
[23] "etelfairi_gene_ensembl"         "cintestinalis_gene_ensembl"
[25] "nleucogenys_gene_ensembl"       "sscrofa_gene_ensembl"
[27] "ocuniculus_gene_ensembl"        "dnovemcinctus_gene_ensembl"
[29] "pcapensis_gene_ensembl"         "tguttata_gene_ensembl"
[31] "mlucifugus_gene_ensembl"        "hsapiens_gene_ensembl"
[33] "pformosa_gene_ensembl"          "mfuro_gene_ensembl"
[35] "tbelangeri_gene_ensembl"        "ggallus_gene_ensembl"
[37] "xtropicalis_gene_ensembl"       "ecaballus_gene_ensembl"
[39] "pabelii_gene_ensembl"           "xmaculatus_gene_ensembl"
[41] "drerio_gene_ensembl"            "lchalumnae_gene_ensembl"
[43] "tnigroviridis_gene_ensembl"     "amelanoleuca_gene_ensembl"
[45] "mmulatta_gene_ensembl"          "pvampyrus_gene_ensembl"
[47] "panubis_gene_ensembl"           "mdomestica_gene_ensembl"
[49] "acarolinensis_gene_ensembl"     "vpacos_gene_ensembl"
[51] "tsyrichta_gene_ensembl"         "ogarnettii_gene_ensembl"
[53] "dmelanogaster_gene_ensembl"     "mmurinus_gene_ensembl"
[55] "loculatus_gene_ensembl"         "olatipes_gene_ensembl"
[57] "ggorilla_gene_ensembl"          "oprinceps_gene_ensembl"
[59] "dordii_gene_ensembl"            "oaries_gene_ensembl"
[61] "mmusculus_gene_ensembl"         "mgallopavo_gene_ensembl"
[63] "gmorhua_gene_ensembl"           "aplatyrhynchos_gene_ensembl"
[65] "saraneus_gene_ensembl"          "sharrisii_gene_ensembl"
[67] "meugenii_gene_ensembl"          "btaurus_gene_ensembl"
[69] "cfamiliaris_gene_ensembl"
Available filters (first six rows), as returned by the command listFilters(ensembl):
  name             description
1 chromosome_name  Chromosome name
2 start            Gene Start (bp)
3 end              Gene End (bp)
4 band_start       Band Start
5 band_end         Band End
6 marker_start     Marker Start
Available attributes (first six rows), as returned by the command listAttributes(ensembl):
  name                   description
1 ensembl_gene_id        Ensembl Gene ID
2 ensembl_transcript_id  Ensembl Transcript ID
3 ensembl_peptide_id     Ensembl Protein ID
4 ensembl_exon_id        Ensembl Exon ID
5 description            Description
6 chromosome_name        Chromosome Name
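With a Mart and dataset selected and the filter and attribute names known, step (2) is a single getBM call. The sketch below is only an illustration, assuming the human dataset and a working connection to Ensembl. The two Ensembl gene IDs are arbitrary example inputs, and the attribute name hgnc_symbol does not appear in the excerpt above, so it is an assumption that should be confirmed with listAttributes before use.

```r
library(biomaRt)

# Step 1: select the Mart database and the dataset in one call.
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Example input: two human Ensembl gene IDs (illustrative values only).
gene_ids <- c("ENSG00000139618", "ENSG00000141510")

# Step 2: convert the IDs to gene symbols and fetch basic annotation.
# 'filters' names the kind of ID we pass in, 'values' holds the IDs,
# and 'attributes' names the columns we want back.
annotation <- getBM(
  attributes = c("ensembl_gene_id", "hgnc_symbol",
                 "chromosome_name", "description"),
  filters    = "ensembl_gene_id",
  values     = gene_ids,
  mart       = ensembl
)
print(annotation)
```

The result is an ordinary data frame with one column per requested attribute, so it can be merged with other tables in the usual way.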
Besides database ID conversion (e.g. ID, symbol, name), biomaRt can also retrieve information on SNPs, alternative splicing, exons, introns, 5' UTRs and 3' UTRs.
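As a hedged illustration of two of these uses, the sketch below lists the exons of every transcript of one gene and fetches its 5' UTR sequences. It assumes the human dataset and a working connection to Ensembl. The gene symbol BRCA1 is just an example input, and the filter name hgnc_symbol and the seqType value "5utr" are assumptions to check against listFilters and the getSequence help page for your biomaRt version.

```r
library(biomaRt)

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Exons per transcript: a gene with several transcripts (alternative
# splicing) returns several transcript IDs, each with its own exon IDs.
exons <- getBM(
  attributes = c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_exon_id"),
  filters    = "hgnc_symbol",
  values     = "BRCA1",
  mart       = ensembl
)
head(exons)

# 5' UTR sequences for the same gene; getSequence() is the biomaRt
# helper for sequence retrieval.
utr5 <- getSequence(
  id      = "BRCA1",
  type    = "hgnc_symbol",
  seqType = "5utr",
  mart    = ensembl
)
head(utr5)
```

Changing seqType to "3utr" requests 3' UTR sequences in the same way.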
The third way is to use the BioMart Perl API, which is also one of the most convenient ways to access BioMart programmatically. We will not cover it in detail in this post.
Generally speaking, biomaRt is an amazing bioinformatics tool, and moreover, it is free!