pubmed2ensembl: a resource for mining the biological literature on genes

PLoS One. 2011;6(9):e24716. doi: 10.1371/journal.pone.0024716. Epub 2011 Sep 29.

Abstract

Background: The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms dedicated teams manually curate publications about genes; however for species with no such dedicated staff many thousands of articles are never mapped to genes or genomic regions.

Methodology/principal findings: To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several sources of curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features. Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data.

Conclusion/significance: By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome informatics inspired solution to accessing the ever-increasing biomedical literature.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Chromosome Mapping / methods
  • Computational Biology / methods*
  • Data Mining
  • Database Management Systems
  • Databases, Genetic
  • Genome
  • Genomics
  • Humans
  • Information Storage and Retrieval
  • MEDLINE
  • Models, Biological
  • Models, Genetic
  • PubMed*
  • Software
  • User-Computer Interface