Why and how to use biomaRt?

Bioinformatics work often includes gene annotation. In recent years more and more biological data has become available, so knowing how to access these valuable data resources and analyse them is important for comprehensive bioinformatics analysis. biomaRt is a very useful tool for this. That raises two questions: 1. Why use biomaRt? 2. How do we use biomaRt?

Let us first introduce the concept of BioMart. The BioMart project (http://www.biomart.org) provides free software and data services to the international scientific community in order to foster scientific collaboration and facilitate the scientific discovery process. Examples of BioMart databases are Ensembl, UniProt and HapMap. However, when a dataset is large, converting identifiers between different databases by hand is troublesome, so we need a bioinformatics tool that can do it automatically. biomaRt is an R package that provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. The major databases (e.g. Ensembl, UniProt) give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from R and Bioconductor.

The first way to use BioMart is online ID conversion. We can go to the website http://useast.ensembl.org/biomart/martview/ and select the corresponding dataset, filters and attributes. Clicking the ‘Results’ button then shows the final output.

The second way is to use biomaRt, which is an R/Bioconductor package. There are two steps: (1) select the Mart database and dataset, and (2) use getBM to retrieve the gene annotation. But how many Mart databases does the package offer? And how do we find the correct filter and attribute names for a given dataset? The functions ‘listMarts’ and ‘listDatasets’ show the available databases and datasets, while ‘listFilters’ and ‘listAttributes’ show the valid filter and attribute names. Let’s check the corresponding results from R.
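The exploration steps described above can be sketched in R as follows. This is a minimal sketch, not a definitive recipe: the function name explore_biomart and its default arguments are hypothetical, and it assumes the biomaRt package is installed and the Ensembl BioMart server is reachable.

```r
# Sketch: inspect what is available before building a query.
# explore_biomart is a hypothetical helper name; the biomaRt functions it
# calls (listMarts, useMart, listDatasets, useDataset, listFilters,
# listAttributes) are part of the package.
explore_biomart <- function(biomart = "ensembl",
                            dataset = "hsapiens_gene_ensembl") {
  # Which Mart databases can we connect to?
  marts <- biomaRt::listMarts()

  # Connect to one Mart without choosing a dataset yet
  mart <- biomaRt::useMart(biomart)

  # Which datasets (one per species) does this Mart contain?
  datasets <- biomaRt::listDatasets(mart)

  # Select a dataset, then list its valid filter and attribute names
  mart <- biomaRt::useDataset(dataset, mart = mart)
  filters <- biomaRt::listFilters(mart)
  attributes <- biomaRt::listAttributes(mart)

  # Return everything as data frames for inspection
  list(marts = marts, datasets = datasets,
       filters = filters, attributes = attributes)
}
```

Calling info <- explore_biomart() and then head(info$filters) or head(info$attributes) should give listings of the kind shown in this post, although the exact contents depend on the current Ensembl release.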

Available Marts, from the command listMarts():

version
1 ENSEMBL GENES 78 (SANGER UK)
2 ENSEMBL VARIATION 78 (SANGER UK)
3 ENSEMBL REGULATION 78 (SANGER UK)
4 VEGA 58 (SANGER UK)
5 ENSEMBL FUNGI 25 (EBI UK)
6 ENSEMBL FUNGI VARIATION 25 (EBI UK)
7 ENSEMBL METAZOA 25 (EBI UK)
8 ENSEMBL METAZOA VARIATION 25 (EBI UK)
9 ENSEMBL PLANTS 25 (EBI UK)
10 ENSEMBL PLANTS VARIATION 25 (EBI UK)
11 ENSEMBL PROTISTS 25 (EBI UK)
12 ENSEMBL PROTISTS VARIATION 25 (EBI UK)
13 MSD (EBI UK)
14 WTSI MOUSE GENETICS PROJECT (SANGER UK)
15 WORMBASE 220 (CSHL US)
16 MGI (JACKSON LABORATORY US)
17 PRIDE (EBI UK)
18 INTERPRO (EBI UK)
19 UNIPROT (EBI UK)
20 PARAMECIUM GENOME (CNRS FRANCE)
21 PARAMECIUM BIBLIOGRAPHY (CNRS FRANCE)
22 EUREXPRESS (MRC EDINBURGH UK)
23 Phytozome
24 Metazome
25 HAPMAP 27 (NCBI US)
26 INTOGEN ONCOMODULES
27 EUROPHENOME
28 IKMC GENES AND PRODUCTS (IKMC)
29 EMAGE GENE EXPRESSION
30 EMAP ANATOMY ONTOLOGY
31 EMAGE BROWSE REPOSITORY
32 GERMONLINE
33 SIGENAE OLIGO ANNOTATION (ENSEMBL 61)
34 SIGENAE OLIGO ANNOTATION (ENSEMBL 59)
35 SIGENAE OLIGO ANNOTATION (ENSEMBL 56)
36 BCCTB Bioinformatics Portal (UK and Ireland)
37 Predictive models of gene regulation from processed high-throughput epigenomics data: K562 vs. Gm12878
38 Predictive models of gene regulation from processed high-throughput epigenomics data: Hsmm vs. Hmec
39 PANCREATIC EXPRESSION DATABASE (BARTS CANCER INSTITUTE UK)
40 Multi-species: marker, QTL, SNP, gene, germplasm, phenotype, association, with Gene annotations
41 Grapevine 8x, stuctural annotation with Genetic maps (genetic markers..)
42 Grapevine 12x.0, stuctural and functional annotation with Genetic maps (genetic markers..)
43 Wheat, stuctural annotation with Genetic maps (genetic markers..)
44 Arabidopsis Thaliana TAIRV10, genes functional annotation
45 Zea mays ZmB73, genes functional annotation
46 Tomato, stuctural and functional annotation
47 Populus trichocarpa, genes functional annotation
48 Populus trichocarpa, genes functional annotation V2.0
49 Botrytis cinerea T4, genes functional annotation
50 Botrytis cinerea B0510, genes functional annotation
51 Leptosphaeria maculans, genes functional annotation
52 VectorBase Genes
53 VectorBase Variation
54 VectorBase Expression
55 GRAMENE 40 ENSEMBL GENES (CSHL/CORNELL US)
56 GRAMENE 40 VARIATION (CSHL/CORNELL US)

Dataset names, from the command listDatasets(ensembl), where ensembl is a Mart object such as the one returned by useMart("ensembl"):

[1] "oanatinus_gene_ensembl" "cporcellus_gene_ensembl"
[3] "gaculeatus_gene_ensembl" "lafricana_gene_ensembl"
[5] "itridecemlineatus_gene_ensembl" "choffmanni_gene_ensembl"
[7] "csavignyi_gene_ensembl" "fcatus_gene_ensembl"
[9] "rnorvegicus_gene_ensembl" "psinensis_gene_ensembl"
[11] "cjacchus_gene_ensembl" "ttruncatus_gene_ensembl"
[13] "scerevisiae_gene_ensembl" "celegans_gene_ensembl"
[15] "csabaeus_gene_ensembl" "oniloticus_gene_ensembl"
[17] "trubripes_gene_ensembl" "amexicanus_gene_ensembl"
[19] "pmarinus_gene_ensembl" "eeuropaeus_gene_ensembl"
[21] "falbicollis_gene_ensembl" "ptroglodytes_gene_ensembl"
[23] "etelfairi_gene_ensembl" "cintestinalis_gene_ensembl"
[25] "nleucogenys_gene_ensembl" "sscrofa_gene_ensembl"
[27] "ocuniculus_gene_ensembl" "dnovemcinctus_gene_ensembl"
[29] "pcapensis_gene_ensembl" "tguttata_gene_ensembl"
[31] "mlucifugus_gene_ensembl" "hsapiens_gene_ensembl"
[33] "pformosa_gene_ensembl" "mfuro_gene_ensembl"
[35] "tbelangeri_gene_ensembl" "ggallus_gene_ensembl"
[37] "xtropicalis_gene_ensembl" "ecaballus_gene_ensembl"
[39] "pabelii_gene_ensembl" "xmaculatus_gene_ensembl"
[41] "drerio_gene_ensembl" "lchalumnae_gene_ensembl"
[43] "tnigroviridis_gene_ensembl" "amelanoleuca_gene_ensembl"
[45] "mmulatta_gene_ensembl" "pvampyrus_gene_ensembl"
[47] "panubis_gene_ensembl" "mdomestica_gene_ensembl"
[49] "acarolinensis_gene_ensembl" "vpacos_gene_ensembl"
[51] "tsyrichta_gene_ensembl" "ogarnettii_gene_ensembl"
[53] "dmelanogaster_gene_ensembl" "mmurinus_gene_ensembl"
[55] "loculatus_gene_ensembl" "olatipes_gene_ensembl"
[57] "ggorilla_gene_ensembl" "oprinceps_gene_ensembl"
[59] "dordii_gene_ensembl" "oaries_gene_ensembl"
[61] "mmusculus_gene_ensembl" "mgallopavo_gene_ensembl"
[63] "gmorhua_gene_ensembl" "aplatyrhynchos_gene_ensembl"
[65] "saraneus_gene_ensembl" "sharrisii_gene_ensembl"
[67] "meugenii_gene_ensembl" "btaurus_gene_ensembl"
[69] "cfamiliaris_gene_ensembl"

The first few filters, from the command listFilters(ensembl):

             name     description
1 chromosome_name Chromosome name
2           start Gene Start (bp)
3             end   Gene End (bp)
4      band_start      Band Start
5        band_end        Band End
6    marker_start    Marker Start

The first few attributes, from the command listAttributes(ensembl):

                   name           description
1       ensembl_gene_id       Ensembl Gene ID
2 ensembl_transcript_id Ensembl Transcript ID
3    ensembl_peptide_id    Ensembl Protein ID
4       ensembl_exon_id       Ensembl Exon ID
5           description           Description
6       chromosome_name       Chromosome Name
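With filter and attribute names like those listed above, the getBM step could look roughly like this. It is a sketch under stated assumptions: annotate_genes is a hypothetical helper name, the human dataset is used as an example, and "hgnc_symbol" is assumed to be available as both a filter and an attribute (it does not appear in the truncated listings above, so check it with listFilters and listAttributes first).

```r
# Sketch: convert gene symbols to Ensembl gene IDs plus basic annotation.
# annotate_genes is a hypothetical helper name; "hgnc_symbol" is an assumed
# filter/attribute for the human dataset.
annotate_genes <- function(symbols, dataset = "hsapiens_gene_ensembl") {
  # Step 1: select the Mart database and dataset
  mart <- biomaRt::useMart("ensembl", dataset = dataset)

  # Step 2: retrieve the requested attributes for the given filter values
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "hgnc_symbol",
                   "chromosome_name", "description"),
    filters    = "hgnc_symbol",
    values     = symbols,
    mart       = mart
  )
}
```

For example, annotate_genes(c("BRCA1", "TP53")) would be expected to return a data frame with one row per matching gene. The function needs network access, and the rows returned depend on the Ensembl release being queried.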

Besides ID conversion between databases (e.g. ID, symbol, name), biomaRt can also retrieve information on SNPs, alternative splicing, exons, introns, 5’ UTRs and 3’ UTRs.
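As an illustration of these other data types, the sketch below retrieves SNPs in a genomic region and the 5’ UTR sequence of a gene. The helper names get_snps_in_region and get_5utr are hypothetical. The Mart, dataset, filter and attribute names follow the examples in the biomaRt documentation, but they are assumptions here and may differ between Ensembl releases.

```r
# Sketch: SNPs in a region, using the Ensembl variation Mart.
# get_snps_in_region is a hypothetical helper name.
get_snps_in_region <- function(chromosome, start, end) {
  snp_mart <- biomaRt::useMart("ENSEMBL_MART_SNP", dataset = "hsapiens_snp")
  biomaRt::getBM(
    attributes = c("refsnp_id", "allele", "chrom_start", "chrom_strand"),
    filters    = c("chr_name", "start", "end"),
    # With several filters, values is a list with one element per filter
    values     = list(chromosome, start, end),
    mart       = snp_mart
  )
}

# Sketch: 5' UTR sequence for a gene symbol, using the gene Mart.
# get_5utr is a hypothetical helper name.
get_5utr <- function(symbol) {
  mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  biomaRt::getSequence(id = symbol, type = "hgnc_symbol",
                       seqType = "5utr", mart = mart)
}
```

A call such as get_snps_in_region(8, 148350, 148612) or get_5utr("BRCA1") needs network access, and large regions can take a while to return.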

The third way is to use the BioMart Perl API, which is also one of the most convenient ways to access BioMart programmatically. We will not cover it in detail in this post.

Generally speaking, biomaRt is an amazing bioinformatics tool, and moreover, it is free!

 
