Data Mining Protocol -belongs to-> Ontology

Created by Hevok on Dec. 16, 2012, 1:06 a.m.


Data Mining Protocol

Denigma is constructed to break the genetic code of life and thereby pave the way to finding effective interventions that make aging negligible. However, the vast amount of biological data is hidden in the scientific literature and inaccessible to computation.

Introduction

The identification of topics and concepts associated with a document or collection of documents is a common task for Denigma and can help in:

  • Annotation and categorization of documents in a corpus of scientific literature
  • Modelling biological processes related to aging
  • Artificial intelligence
  • Selecting effective anti-aging interventions

Concept

The concept is based on the assumption that it is possible to describe what a scientist has been working on in order to support collaboration. Theoretically, this can be achieved by:

  • tracking the documents she/he reads
  • mapping these documents to terms in an ontology
  • aggregating the terms to produce a short list of topics

The first question that arises is how to map the documents she/he reads to the ontology terms. The solution is to use document-to-data-entry similarity for the mapping.

The second question is how to aggregate the terms to get a shorter list. The answer is to use a spreading activation algorithm for the aggregation.

Approach

Denigma data entries and categories are used as ontology terms. Categories, as generalized concepts, are themselves defined by data entries.

What a certain document is about can be approached in two ways:

  1. Statistically: select words and phrases that characterize the document, using TF-IDF
  2. Controlled vocabulary / ontology: map the document to a list of terms from a controlled vocabulary

The first approach is flexible and does not require creating and maintaining an ontology, while the second approach can tie documents to a rich knowledge base and make them accessible for computation.
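The statistical approach can be illustrated with a minimal sketch. It assumes a plain term-frequency times log-inverse-document-frequency weighting and a toy corpus; the function names and the example texts are hypothetical and not part of Denigma.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_keywords(document, corpus, top_n=3):
    """Return the top_n (word, score) pairs of `document` by TF-IDF.

    Assumption: `corpus` is a list of texts that includes `document`,
    so every term has a document frequency of at least 1.
    """
    doc_tokens = tokenize(document)
    term_counts = Counter(doc_tokens)
    corpus_token_sets = [set(tokenize(text)) for text in corpus]
    scores = {}
    for term, count in term_counts.items():
        tf = count / len(doc_tokens)
        df = sum(1 for tokens in corpus_token_sets if term in tokens)
        idf = math.log(len(corpus) / df)
        scores[term] = tf * idf
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# Hypothetical toy corpus.
corpus = [
    "rapamycin extends lifespan in mice",
    "caloric restriction extends lifespan in yeast",
    "telomere length in human cells",
]
keywords = tf_idf_keywords(corpus[0], corpus, top_n=2)
print(keywords)
```

Words that occur in every document (here "in") receive a score of zero, while words unique to the document rank highest.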

Using Denigma's data entries as an ontology offers the best of both approaches.

Each data entry is a concept in the ontology.

Terms are linked via Denigma's tag, category and hierarchy system as well as by inter-data entry links and data entry relations.

It is a consensus ontology created, kept current and maintained by a diverse community. The overall content quality is high. Terms have unique IDs (URLs) and are "self-describing" for people as well as machines. The underlying graphs provide the structure of data entry tags, categories, hierarchy, links and relations.

Data Graph

The Denigma data entry graph is a thesaurus. The graph composed of data entry links is similar to the World Wide Web network, but highly systematic and structured in a unified fashion (i.e. easily accessible for computation).

Methods

The goal is, given one or more documents, to compute a ranked list of the top N data entries and/or categories that describe them.

The basic metric is document similarity between data entries and document(s). Variants to explore are the following:

  • Role of categories
  • Eliminating uninteresting data entries
  • Use of spreading activation
  • Using similarity scores for weighting links
  • Number of spreading activation pulses
  • Individual or set of query documents, etc.
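The basic metric can be sketched as follows, assuming raw term-count vectors and a set of query documents that is merged into a single vector. The data entries and the query text are hypothetical examples, and a real system would more likely use TF-IDF weighted vectors.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of lower-cased word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_entries(query_docs, entries, top_n=2):
    """Rank data entries (name -> text) by similarity to the query documents."""
    query = Counter()
    for doc in query_docs:
        query.update(term_vector(doc))
    scored = [
        (name, cosine_similarity(query, term_vector(text)))
        for name, text in entries.items()
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical data entries and query document.
entries = {
    "Rapamycin": "rapamycin inhibits mtor and extends lifespan",
    "Telomere": "telomere shortening limits cell division",
    "Caloric Restriction": "caloric restriction extends lifespan",
}
query_docs = ["rapamycin treatment extends lifespan in mice"]
ranking = rank_entries(query_docs, entries, top_n=2)
print(ranking)
```

Passing several texts in `query_docs` corresponds to the "set of query documents" variant listed above.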

Spreading Activation

Associative retrieval means that it is possible to retrieve relevant documents if they are associated with other documents that have been considered relevant by the user.

Documents can be represented as nodes and their associations as links in a network. At each pulse (iteration), activation spreads to adjacent nodes, so that some nodes end up with higher activation than others.

The constraints are:

  • Distance
  • Fan out
  • Path constraints
  • Activation threshold
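A minimal sketch of spreading activation with three of these constraints is given below. It assumes that the number of pulses acts as the distance constraint, that a node splits a decayed share of its activation evenly among its neighbours, and that every active node fires again at each pulse. Path constraints are not modelled. The `decay` parameter, the function name and the example graph are assumptions made for illustration.

```python
def spread_activation(graph, initial, pulses=2, decay=0.5,
                      threshold=0.1, max_fan_out=None):
    """Spread activation over `graph` (node -> list of adjacent nodes).

    pulses      -- distance constraint: activation travels at most this far
    threshold   -- nodes with lower activation do not fire
    max_fan_out -- nodes with more outgoing links than this do not fire
    """
    activation = dict(initial)
    for _ in range(pulses):
        incoming = {}
        for node, value in activation.items():
            neighbours = graph.get(node, [])
            if value < threshold or not neighbours:
                continue
            if max_fan_out is not None and len(neighbours) > max_fan_out:
                continue
            share = decay * value / len(neighbours)
            for neighbour in neighbours:
                incoming[neighbour] = incoming.get(neighbour, 0.0) + share
        # Apply all contributions of this pulse at once.
        for node, value in incoming.items():
            activation[node] = activation.get(node, 0.0) + value
    return sorted(activation.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical network.
graph = {
    "Rapamycin": ["mTOR", "Lifespan"],
    "Caloric Restriction": ["mTOR", "Lifespan"],
    "mTOR": ["Aging"],
    "Lifespan": ["Aging"],
}
initial = {"Rapamycin": 1.0, "Caloric Restriction": 0.8}
for node, score in spread_activation(graph, initial, pulses=2):
    print(node, round(score, 3))
```

With a single pulse the node "Aging" is not reached at all; with two pulses it receives activation from both intermediate nodes.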

1. Method: Ranking Categories Directly

The first method is to use Denigma data entry text and categories to predict concepts:

Input Query doc(s) -similar to (Cosine similarity)-> Similar data entries -> Denigma category graph

The output is a list of categories, ranked by:

  1. Links
  2. Cosine similarity
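The following sketch shows one possible reading of the two ranking criteria: counting how many similar data entries link to a category, and summing the cosine similarity of those entries. This interpretation, the function name and the example data are assumptions.

```python
from collections import defaultdict


def rank_categories(similar_entries, entry_categories):
    """Rank categories of the data entries most similar to the query.

    similar_entries  -- data entry -> cosine similarity to the query doc(s)
    entry_categories -- data entry -> list of its categories
    Returns two rankings: by number of links and by summed similarity.
    """
    link_counts = defaultdict(int)
    similarity_sums = defaultdict(float)
    for entry, similarity in similar_entries.items():
        for category in entry_categories.get(entry, []):
            link_counts[category] += 1
            similarity_sums[category] += similarity
    by_links = sorted(link_counts, key=lambda c: (-link_counts[c], c))
    by_similarity = sorted(similarity_sums,
                           key=lambda c: (-similarity_sums[c], c))
    return by_links, by_similarity


# Hypothetical similar entries and their categories.
similar_entries = {"Rapamycin": 0.9, "Metformin": 0.3, "Resveratrol": 0.2}
entry_categories = {
    "Rapamycin": ["mTOR Inhibitor"],
    "Metformin": ["Drug"],
    "Resveratrol": ["Drug"],
}
by_links, by_similarity = rank_categories(similar_entries, entry_categories)
print(by_links)
print(by_similarity)
```

In this example the two criteria disagree: "Drug" has more links, while "mTOR Inhibitor" is backed by the single most similar data entry.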

2. Method: Spreading Activation on Category Links Graph

The second method is similar to the first but uses spreading activation on the category links graph to obtain aggregated concepts. The output is a list of concepts ranked by final activation score.

3. Method: Spreading Activation on Entry Links Graph

It is possible to predict concepts that are NOT present in the category hierarchy by using the data entries themselves as concepts. For this, spreading activation is applied to the data entry links graph.

As a threshold, spreading activation to articles with a cosine similarity score of less than 0.4 is ignored. The edge weights are the cosine similarities between linked articles. The output is a list of concepts ranked by final activation score.
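A sketch of this variant follows. It assumes that the 0.4 threshold is applied to the edge weight, i.e. to the similarity between the two linked articles, and that the seed activations are the similarities to the query. The function name and the example data are hypothetical.

```python
def spread_on_entry_links(links, similarity, seeds, pulses=1,
                          min_similarity=0.4):
    """Weighted spreading activation on a data entry links graph.

    links      -- entry -> list of linked entries
    similarity -- (entry, linked entry) -> cosine similarity (edge weight)
    seeds      -- entry -> initial activation
    """
    activation = dict(seeds)
    for _ in range(pulses):
        incoming = {}
        for source, value in activation.items():
            for target in links.get(source, []):
                weight = similarity.get((source, target), 0.0)
                if weight < min_similarity:
                    continue  # do not spread over weakly related links
                incoming[target] = incoming.get(target, 0.0) + value * weight
        for target, value in incoming.items():
            activation[target] = activation.get(target, 0.0) + value
    return sorted(activation.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical links and similarities.
links = {"Rapamycin": ["mTOR", "Announcement"], "mTOR": ["Autophagy"]}
similarity = {
    ("Rapamycin", "mTOR"): 0.8,
    ("Rapamycin", "Announcement"): 0.1,
    ("mTOR", "Autophagy"): 0.5,
}
seeds = {"Rapamycin": 1.0}
print(spread_on_entry_links(links, similarity, seeds, pulses=1))
```

The weakly related "Announcement" entry never receives activation, which is the intended effect of the threshold.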

Evaluation

In an initial informal evaluation, the results are compared against our own judgments. Scientific articles are downloaded from the internet and their concepts are predicted, using both single documents and groups of related documents.

For a single document, more pulses in general lead to more generalized concepts.

For the prediction of a set of test documents (e.g. data entries) concepts can be discovered that are not in the category hierarchy.

Select data entries randomly and predict their categories, links, and relations:

Query doc(s) -similar to (Cosine similarity)-> Average Similarity

It is observed that data entries are often linked with both super- and subcategories.

If the system predicts a category three levels higher in the hierarchy than the original category, the prediction is considered to be correct.
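This correctness criterion can be sketched as a walk up the hierarchy. The sketch assumes that each category has a single parent and reads the criterion as "the original category itself or an ancestor up to three levels above it"; both points, as well as the example hierarchy, are assumptions.

```python
def levels_above(parent_of, category, candidate):
    """Return how many levels `candidate` lies above `category`, else None."""
    level = 0
    current = category
    seen = set()
    while current is not None and current not in seen:
        if current == candidate:
            return level
        seen.add(current)  # guards against cycles in the hierarchy
        current = parent_of.get(current)
        level += 1
    return None


def is_correct(parent_of, original, predicted, max_levels=3):
    """A prediction counts as correct within `max_levels` above the original."""
    distance = levels_above(parent_of, original, predicted)
    return distance is not None and distance <= max_levels


# Hypothetical category hierarchy (child -> parent).
parent_of = {
    "mTOR Inhibitor": "Drug",
    "Drug": "Intervention",
    "Intervention": "Aging",
    "Aging": "Biology",
}
print(is_correct(parent_of, "mTOR Inhibitor", "Aging"))
print(is_correct(parent_of, "mTOR Inhibitor", "Biology"))
```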

Category Prediction Evaluation

Spreading activation with two pulses works best. Considering only data entries with similarity > 0.5 is a good threshold.

Data Entry Prediction Evaluation

Spreading activation with one pulse works best, and again considering only data entries with similarity > 0.5 is a good threshold.

Prediction Accuracy

The prediction accuracy is affected by three issues:

  • To what extent the concept is represented in Denigma.
  • Presence of links between semantically related concepts.
  • Presence of links between irrelevant data entries (term definitions, announcements)

Therefore two possible solutions are suggested:

  • Use average similarity score to measure the extent of concept representation within Denigma
  • Use existing semantic relatedness measures to handle presence or absence of semantically related links
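The first suggestion can be sketched as follows, assuming that the average is taken over the similarity scores of the top-ranked data entries for a query. The cut-off `min_average` and the function names are hypothetical; the text does not specify a value.

```python
def average_similarity(scores, top_k=None):
    """Average the (optionally top_k highest) similarity scores."""
    ranked = sorted(scores, reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if not ranked:
        return 0.0
    return sum(ranked) / len(ranked)


def is_well_represented(scores, min_average=0.5, top_k=None):
    """Treat a concept as well represented if the average is high enough.

    `min_average` is an illustrative assumption, not a value from the text.
    """
    return average_similarity(scores, top_k) >= min_average


# Hypothetical similarity scores of the retrieved data entries.
scores = [0.8, 0.6, 0.1]
print(round(average_similarity(scores), 3))
print(is_well_represented(scores, top_k=2))
```

A low average indicates that the query concept is poorly covered, so that predictions for it should be trusted less.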

Potential Applications

There are two immediately obvious applications for the described approach:

  • Recommending categories and links for new data entries.
  • Automating the process of building a knowledge base from a corpus (scientific literature).

Further Enhancements

The links in Denigma can be classified with machine learning techniques in order to:

  • Predict semantic type of data entries
  • Control the flow of spreading activation

To bring the computation down to a time frame of a few seconds, heterogeneous parallel programming on multiple processors / clusters shall be exploited.

The data entry corpus and ontology should be redefined.

A further enhancement is document expansion with Denigma-derived ontology terms.

Lastly, the gap between Denigma and formal ontologies needs to be bridged.

Conclusion

The fundamental data unit (data entry) can be used to describe documents, with different methods employing the data entry text, tags, categories, links and relations. The average similarity should be used to judge the accuracy of prediction. The method is easily extendable to other data units.

Categories: Quest

Parent: Research


belongs to

The belongs to relationship is the reverse of the has a relationship and defines that a certain data entry is a subcategory of another one.
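As a minimal sketch, the reverse relationship can be derived mechanically from stored has a relations. The tuple layout and the function name are assumptions made for illustration.

```python
def invert(triples, relationship, inverse_name):
    """Derive the inverse relation for every triple of `relationship`."""
    return [
        (obj, inverse_name, subj)
        for subj, rel, obj in triples
        if rel == relationship
    ]


# Illustrative relation in the form (subject, relationship, object).
has_a = [("Ontology", "has a", "Data Mining Protocol")]
belongs_to = invert(has_a, "has a", "belongs to")
print(belongs_to)
```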

Tags: has a
Categories: Relationship

Ontology

Ontology (from Greek on, participle of "to be", and logia, "science") is the philosophical study of the nature of being, Existence or Reality, as well as the basic Categories of being and their Relations.

It deals with what is real in the World. Therefore, the basic Question is what really exists and what can be said to exist. It is a question of general Metaphysics in Philosophy. It stands in contrast to Epistemology, which deals only with the things of our Perception, i.e. what we see, what we hear and so on. We can only experience the World through our Perception, but our perceptions sometimes betray us, so one has to know what is real in the World, i.e. what is True. To define what is really True, independent of our Perception, is what Ontology was originally intended to do.

An Ontology is an explicit, formal specification of a shared conceptualization. The Term is borrowed from Philosophy, where an Ontology is a systematic account of Existence. For Artificial Intelligence Systems, what "exists" is that which can be represented.

A Conceptualization is nothing else than a Model. One tries to form a Model of the Domain one is talking about. Inside this Domain one tries to identify the relevant Concepts and how these Concepts are related to each other. This Model (i.e. the Conceptualization) has to be explicit, which means that the Meanings of all Concepts have to be defined; nothing may be left out. The Definition must be formal, which means it must be understood by the Machine, i.e. it has to be Machine-Understandable: not only Machine-Readable, but also interpreted correctly. Only what is read and interpreted correctly is understood. Most importantly, the things one refers to must be shared among the communication partners, so the Conceptualization must be a shared Conceptualization: there must be consensus about the Ontology. Otherwise one cannot communicate.

For Communication the Semantic Triangle applies. In Language one has a Symbol that stands for a certain Object. However, Language is ambiguous: a Term might have multiple Meanings. One can only communicate with others if two or more communication partners apply a shared Concept (i.e. the same Concept). Then communication and understanding are possible.

Ontology is the most critical enabling Technology in Semantic Web Applications. Basically, an Ontology describes Terms and Types of Relationships between Pairs of Terms. Such an Ontology can be expressed/represented by a List of Tuples in the form of (term x, relationship r, term y). For instance,

Denigma, is a kind of, Decipher Machine or

Aging, is a, Problem.
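The tuple form can be sketched directly as a data structure. The field names and the lookup helper are assumptions; only the two example tuples come from the text above.

```python
from collections import namedtuple

# One ontology statement in the form (term x, relationship r, term y).
Triple = namedtuple("Triple", ["subject", "relationship", "object"])

ontology = [
    Triple("Denigma", "is a kind of", "Decipher Machine"),
    Triple("Aging", "is a", "Problem"),
]


def objects_of(triples, subject, relationship):
    """Return all terms y for which (subject, relationship, y) is stated."""
    return [
        triple.object
        for triple in triples
        if triple.subject == subject and triple.relationship == relationship
    ]


print(objects_of(ontology, "Aging", "is a"))
```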

The basic Tasks in Ontology Development are Term Selection, Relationship Assignment and Evolution.

Normally an Ontology is developed by a small Group of Experts. However, this Approach does not scale with the ever increasing amount of Information. Specifically, Experts have difficulty keeping up with Advances in Knowledge in the open, dynamic World Wide Web Environment. Crowd Sourcing has the potential to be the most influential way to solve the problem of Ontology Development, by outsourcing a Task traditionally done by Experts also to non-Experts (typically a large Group of People) in the form of an open call (The Call of Duty).

An expression is correct if the majority of the users agree on it.
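This majority rule can be sketched as follows, assuming that each user casts a simple agree/disagree vote on an expression and that a tie does not count as a majority. The votes and the second expression are hypothetical.

```python
def is_accepted(votes):
    """An expression is accepted if more than half of the votes agree."""
    return sum(votes) > len(votes) / 2


# Hypothetical votes: expression -> list of agree (True) / disagree (False).
votes = {
    ("Aging", "is a", "Problem"): [True, True, False],
    ("Aging", "is a", "Disease"): [True, False, False],
}
accepted = [expression for expression, v in votes.items() if is_accepted(v)]
print(accepted)
```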

One approach to implement this is to aggregate Knowledge from common Web Users. Ontology can be used to redefine a Search Query [http://www.hahia.com] and in this way the Ontology evolves indirectly.

Definitely Crowd Sourcing will significantly change the Approach of Ontology Maintenance and Evolution.

An Ontology is a Data Model that represents a Domain and is utilized to reason about both the Objects in this Domain and the Relations between them. The applications of Ontologies include Artificial Intelligence, Semantic Web, Software Engineering and Information Architecture, where they are used as a form of Knowledge Representation about the World. An Ontology can also be understood as a set of Definitions of a formal Vocabulary with a huge potential in Information Technology.

Ontology is a Semantic skeleton of a Domain, i.e. it defines what we can have in Annotations. It defines Categories, Properties, Rules, etc. We can have many Ontologies to describe one Domain. It is only an Idea. We are not there.


Categories: News

Parent: Research

