Topic Modelling

Created on Feb. 14, 2013, 3:22 p.m. by Hevok & updated by Hevok on May 2, 2013, 5:22 p.m.

Gensim is a library for topic modelling, document indexing and similarity retrieval with large corpora. It is used for Natural Language Processing (NLP) and Information Retrieval.

Gensim makes it possible to find similar texts. First, a dictionary needs to be created from the most interesting texts. Second, all Articles need to be indexed using this dictionary. Third, new Articles can be passed in, and Gensim will return all similar Articles ordered by similarity.

  1. Formulate criteria for searching Lifespan Articles and create a Dictionary for Gensim.
  2. Different Articles are crawled by the Web Crawler and then checked by Gensim. Those whose similarity is higher than some threshold are indexed.

In this way, new and more relevant Articles can be found automatically, starting from an existing set of Articles.
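The filtering step can be illustrated with a small pure-Python sketch. It does not use Gensim itself. The function names (`word_set`, `binary_cosine`, `filter_relevant`), the example dictionary and the threshold of 0.5 are all assumptions made for illustration, and each Article is reduced to the set of dictionary terms it contains.

```python
import math

def word_set(text, dictionary):
    """Return the dictionary terms that occur in the text (hypothetical helper)."""
    return set(text.lower().split()) & dictionary

def binary_cosine(a, b):
    """Cosine similarity of two binary term vectors, computed on their term sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def filter_relevant(candidates, known_articles, dictionary, threshold=0.5):
    """Keep crawled candidates whose best similarity to a known Article exceeds the threshold."""
    known = [word_set(text, dictionary) for text in known_articles]
    kept = []
    for text in candidates:
        terms = word_set(text, dictionary)
        best = max((binary_cosine(terms, k) for k in known), default=0.0)
        if best > threshold:
            kept.append((text, best))
    return kept

# Illustrative data only, not taken from a real corpus.
dictionary = {"lifespan", "ageing", "longevity", "caloric", "restriction"}
known_articles = ["caloric restriction and lifespan in mice"]
crawled = [
    "longevity and caloric restriction",
    "football results from the weekend",
]
kept = filter_relevant(crawled, known_articles, dictionary)
```

Here only the first crawled text shares dictionary terms with the known Article, so it is the only one that would be passed on for indexing. A real setup would need a proper tokenizer and a threshold tuned on actual data.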

  1. We have 1 billion papers.
  2. We build a dictionary of n phrases and words.
  3. We build a two-dimensional matrix of documents by dictionary words. If a document contains a word from the dictionary we put 1 in the cell, otherwise 0. So in the end we have 1 billion n-dimensional vectors.
  4. Next, we have 1 document and we want to find similar documents. Following the same steps as in point 3, we build one n-dimensional vector for this Article.
  5. Compare the vector from point 4 with all vectors from point 3. The vectors which give the smallest angle are the most similar.

That is how it works in a naive way. A real implementation uses a lot of tricks to reduce the amount of computation.
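The naive procedure from the list above can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: words are split on whitespace, the dictionary is simply every word in the corpus, and the helper names (`build_dictionary`, `to_binary_vector`, `cosine`, `rank_by_similarity`) are made up for this example and are not Gensim functions. The smallest angle between two vectors corresponds to the largest cosine, so documents are ranked by descending cosine similarity.

```python
import math

def build_dictionary(documents):
    """Point 2: map every distinct word in the corpus to a column index."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: i for i, word in enumerate(words)}

def to_binary_vector(text, dictionary):
    """Points 3 and 4: put 1 in a cell if the dictionary word occurs in the text, else 0."""
    vector = [0] * len(dictionary)
    for word in set(text.lower().split()):
        if word in dictionary:
            vector[dictionary[word]] = 1
    return vector

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_similarity(query, documents):
    """Point 5: compare the query vector with every document vector."""
    dictionary = build_dictionary(documents)
    vectors = [to_binary_vector(doc, dictionary) for doc in documents]
    q = to_binary_vector(query, dictionary)
    scores = [(i, cosine(q, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: -pair[1])

# Illustrative data only.
documents = [
    "caloric restriction extends lifespan in mice",
    "stock markets fell sharply today",
    "lifespan extension by caloric restriction in yeast",
]
ranked = rank_by_similarity("caloric restriction and lifespan", documents)
```

`ranked` holds `(document index, similarity)` pairs with the most similar document first. Every query is compared against every document, which is exactly the cost a real implementation tries to avoid.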


Tags: articles, literature, data mining, models
Parent: Text mining
