Text mining

Information hidden in the literature needs to be made accessible to computation. Efficient text-mining techniques are therefore crucial.

Articles can be linked to genes by their gene symbols, names, identifiers or primary sequences [21325301] and used to annotate genes automatically [21980353]. Text mining can provide valuable evidence linking diverse resources such as MEDLINE, PubMed Central, GEO and PDB [22685160].

Text mining provides easy access to the evidence through links to the relevant literature [18487273; 19468046]. Processing biomedical data resources and the biomedical scientific literature with computational linguistics to produce innovative solutions and gain new insights remains a challenge.

The Arts of Text-Mining

Text mining is the extraction, processing and discovery of knowledge from text.

It can consist of four phases:

  1. Information retrieval
  2. Information extraction
  3. Building a knowledge base
  4. Knowledge discovery

Information retrieval is the process of acquiring a selection of relevant documents, usually as the result of queries submitted to search engines. Search queries can be enhanced, e.g. by query expansion: adding synonyms to a search query.
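Query expansion can be illustrated with a minimal sketch. The synonym table and the function name expand_query are assumptions made for this example only; a real system would draw its synonyms from a curated thesaurus or gene nomenclature resource, and the exact boolean syntax depends on the search engine used.

```python
# Illustrative synonym table (an assumption for this sketch, not a curated resource).
SYNONYMS = {
    "foxo3a": ["foxo3", "fkhrl1"],
    "aging": ["ageing", "senescence"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return a boolean query in which each term is OR-ed with its synonyms."""
    groups = []
    for term in query.lower().split():
        variants = [term] + synonyms.get(term, [])
        if len(variants) > 1:
            # Terms with known synonyms become a parenthesised OR group.
            groups.append("(" + " OR ".join(variants) + ")")
        else:
            groups.append(term)
    # All groups must match, so they are joined with AND.
    return " AND ".join(groups)


print(expand_query("Foxo3a aging"))
# (foxo3a OR foxo3 OR fkhrl1) AND (aging OR ageing OR senescence)
```

Expanding each term separately keeps the original intent of the query (all concepts must be present) while widening the ways each concept may be spelled.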

Information extraction is the identification of all entities and the relationships between them. It requires natural language processing, including syntactic parsing, as well as the mapping of entities onto unique identifiers (e.g. Entrez Gene IDs). The process is non-trivial because of the very nature of natural language. Entity disambiguation is particularly difficult: a single name such as Foxo3a might denote the gene as well as the protein, and it is often unclear which species it refers to. Relationships are inferred from co-occurrences of entities within textual structures.
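As a rough sketch of the simplest form of this idea, the following maps names onto identifiers with a dictionary lookup and counts sentence-level co-occurrences. The lexicon, the placeholder identifiers (GENE:1 and so on, not real Entrez Gene IDs) and all function names are assumptions for illustration. The sketch deliberately ignores the hard parts described above, namely syntactic parsing and gene/protein/species disambiguation.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical lexicon: surface forms mapped onto placeholder identifiers.
LEXICON = {
    "foxo3a": "GENE:1",
    "foxo3": "GENE:1",
    "sirt1": "GENE:2",
    "akt": "GENE:3",
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(sentence, lexicon=LEXICON):
    """Return the set of identifiers whose names occur as tokens in the sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return {lexicon[t] for t in tokens if t in lexicon}


def cooccurrences(text, lexicon=LEXICON):
    """Count how often each pair of identifiers shares a sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        ids = sorted(find_entities(sentence, lexicon))
        for pair in combinations(ids, 2):
            counts[pair] += 1
    return counts


text = "Foxo3a is deacetylated by Sirt1. Akt phosphorylates FOXO3. Sirt1 and Foxo3 interact."
print(cooccurrences(text))
```

Because synonyms are normalised to one identifier before counting, "Foxo3a" and "FOXO3" contribute to the same pair. The sentence is used here as the textual structure; abstracts or paragraphs could be substituted.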

The outcome is a set of statements that may represent facts. Facts extracted in this way can be used either directly to populate a database or to assist curation.

Building a knowledge base encompasses the integration of the extracted information into a database.
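One possible way to hold extracted statements is a single table of subject, relation, object and source, sketched here with SQLite from the Python standard library. The table layout, the relation name and the placeholder identifiers (GENE:1, DOC:1) are assumptions for illustration, not a prescribed schema.

```python
import sqlite3


def build_knowledge_base(statements):
    """Load (subject, relation, object, source) tuples into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE statement (
               subject  TEXT NOT NULL,
               relation TEXT NOT NULL,
               object   TEXT NOT NULL,
               source   TEXT NOT NULL,
               UNIQUE (subject, relation, object, source)
           )"""
    )
    # OR IGNORE skips statements already recorded for the same source document.
    conn.executemany("INSERT OR IGNORE INTO statement VALUES (?, ?, ?, ?)", statements)
    conn.commit()
    return conn


def evidence_counts(conn):
    """Return each distinct statement with the number of documents supporting it."""
    return conn.execute(
        "SELECT subject, relation, object, COUNT(*) AS n FROM statement "
        "GROUP BY subject, relation, object "
        "ORDER BY n DESC, subject, relation, object"
    ).fetchall()


# Placeholder statements; identifiers and sources are hypothetical.
statements = [
    ("GENE:1", "co-occurs_with", "GENE:2", "DOC:1"),
    ("GENE:1", "co-occurs_with", "GENE:2", "DOC:2"),
    ("GENE:1", "co-occurs_with", "GENE:3", "DOC:2"),
    ("GENE:1", "co-occurs_with", "GENE:2", "DOC:1"),  # duplicate, ignored
]
conn = build_knowledge_base(statements)
print(evidence_counts(conn))
```

Keeping the source document with every statement preserves the link back to the literature, so a curator can inspect the evidence behind each entry.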

Knowledge discovery aims to identify hidden or as yet undiscovered knowledge by applying data mining algorithms to the extracted information.
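The text does not name a particular algorithm. As one simple illustration, the sketch below follows the idea of Swanson's ABC model: if A is linked to B and B is linked to C, but A and C are never linked directly, then A and C form a candidate hidden connection. The function name and the example entities are hypothetical.

```python
from collections import defaultdict


def infer_hidden_links(pairs):
    """Return candidate indirect links as {(a, c): set of bridging entities}.

    `pairs` is an iterable of directly observed, undirected links.
    """
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    candidates = {}
    for bridge, linked in neighbours.items():
        for a in linked:
            for c in linked:
                # Consider each unordered pair once and skip pairs already linked directly.
                if a < c and c not in neighbours[a]:
                    candidates.setdefault((a, c), set()).add(bridge)
    return candidates


# Hypothetical extracted links.
observed = [
    ("geneA", "processB"),
    ("processB", "diseaseC"),
    ("geneA", "diseaseD"),
]
print(infer_hidden_links(observed))
```

Candidates produced this way are hypotheses rather than facts; the number of bridging entities can serve as a crude ranking before manual review.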

Phases of Implementation

  1. Retrieve all the full texts of relevant articles.
  2. Apply deep-text mining.
  3. Integrate the information into an entity-relation schema.
  4. Reanalyse the literature regularly to keep the content up to date and to support hypothesis generation.

Text mining is essential for managing the huge amount of biological information automatically. The automated creation of an ontology via natural language processing should certainly be considered as a complementary approach. However, this needs to be handled carefully so that data integrity can be guaranteed, which is why sophisticated natural language processing should be involved. Only a combination of human curation and machine-driven mining can scale with the ever-increasing amount of data.
