Gene Expression Data

Accessing publicly available gene expression (microarray and RNA-seq) data is a crucial first step towards gaining insight into the changes in gene expression activity that underlie processes like aging and the interventions that counteract it.

Several resources provide gene expression data in a common format.

Here we will use GEO (Gene Expression Omnibus), the most widely established and referenced repository.

First, a query needs to be formulated and used to search the datasets [http://www.ncbi.nlm.nih.gov/gds/]. For instance, the term "Aging" or a boolean combination of terms such as "(dietary OR caloric OR calorie) AND restriction" will filter for only those expression datasets that are relevant to the aging process and its modulation by a restricted diet. Queries similar to those used for identifying Lifespan Factors for the Lifespan App can be employed.
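Such a query can also be sent programmatically. The following is a minimal sketch that only builds the request URL. It assumes the NCBI E-utilities ESearch endpoint, which this tutorial does not mention, and the function name build_gds_search_url and the default of 20 results are chosen purely for illustration.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities ESearch endpoint (not part of the tutorial).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gds_search_url(term, retmax=20):
    """Return an ESearch URL that queries the GEO DataSets database (db=gds)."""
    params = {
        "db": "gds",        # GEO DataSets
        "term": term,       # the query, e.g. a boolean combination of terms
        "retmax": retmax,   # maximum number of ids to return (illustrative default)
        "retmode": "json",  # ask for a JSON response
    }
    # urlencode escapes spaces and parentheses so the query is URL-safe.
    return ESEARCH_URL + "?" + urlencode(params)


url = build_gds_search_url("(dietary OR caloric OR calorie) AND restriction")
print(url)
```

Opening the printed URL should return the ids of the matching records, which can then be looked up in GEO.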

If a GEO Series (GSE) of interest is identified, its detail view provides more information and links about this particular dataset, e.g. [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE38635].

The data is provided in different formats. The Series Matrix File format is quite straightforward to work with, and its link redirects to the NCBI FTP server [ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE38nnn/GSE38635/matrix/]. There the dataset(s) can be downloaded [ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE38nnn/GSE38635/matrix/GSE38635_series_matrix.txt.gz].
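The FTP path follows a pattern that can be derived from the accession: the directory GSE38nnn is the accession with its last three digits replaced by "nnn". A small sketch of that pattern, inferred only from the example URL above, is shown here. The function name series_matrix_url is hypothetical, and the sketch assumes a series with a single matrix file named after the accession.

```python
def series_matrix_url(accession):
    """Build the FTP URL of the Series Matrix File for a GSE accession."""
    if not (accession.startswith("GSE") and accession[3:].isdigit()):
        raise ValueError("expected an accession such as 'GSE38635'")
    digits = accession[3:]
    # Replace the last three digits with "nnn", e.g. GSE38635 -> GSE38nnn.
    # Assumption: accessions with three or fewer digits map to "GSEnnn".
    stem = "GSE" + digits[:-3] + "nnn"
    return (
        "ftp://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stem}/{accession}/matrix/{accession}_series_matrix.txt.gz"
    )


print(series_matrix_url("GSE38635"))
```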

The downloaded file is a gzip-compressed text file and can be extracted on Linux via the command gunzip GSE38635_series_matrix.txt.gz. On Windows, extraction software like WinRAR [http://www.rarlab.com/] or WinZip [http://www.winzip.com/] is required.
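Since the downloaded file name ends in .txt.gz, it is a single gzip-compressed text file, and it can also be decompressed with Python's standard library on any operating system. The helper name gunzip_file below is chosen for illustration only.

```python
import gzip
import shutil


def gunzip_file(src_path, dest_path):
    """Decompress a .gz file to dest_path, streaming instead of loading it whole."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
```

For the example dataset, the call would be gunzip_file("GSE38635_series_matrix.txt.gz", "GSE38635_series_matrix.txt").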

The extracted text file can be viewed in LibreOffice or Microsoft Excel by opening it as tab-separated data.

In the file, the first rows are meta data describing various information associated with this particular dataset. They are followed by a header row that starts with ID_REF (identifier reference) and provides a unique identifier for each sample/contrast. After the ID_REF row, the actual probe ids are listed with the corresponding expression value/ratio for each sample/contrast. The probe ids need to be mapped against the sample platform file given in the meta data in order to obtain information about each transcript.
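The layout described above can be read with a few lines of Python. The sketch below assumes that meta data lines start with "!" and that the table header starts with ID_REF, which is how Series Matrix Files are commonly laid out. The function names and all identifiers in the sample text (GPL0000, GSM1, GSM2, probe_a, probe_b) are made up for illustration.

```python
import csv
import io


def to_number(text):
    """Convert a table cell to float; return None for empty or non-numeric cells."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_series_matrix(lines):
    """Split a Series Matrix File into meta data, sample ids and probe rows."""
    metadata = {}  # meta data key -> list of values
    samples = []   # sample/contrast identifiers from the ID_REF row
    values = {}    # probe id -> one expression value per sample
    for row in csv.reader(lines, delimiter="\t"):
        if not row or not row[0]:
            continue  # skip blank lines
        key = row[0]
        if key.startswith("!"):
            # Meta data line; table begin/end markers carry no values.
            if len(row) > 1:
                metadata.setdefault(key[1:], []).extend(row[1:])
        elif key == "ID_REF":
            samples = row[1:]
        else:
            values[key] = [to_number(cell) for cell in row[1:]]
    return metadata, samples, values


# Tiny made-up example in the assumed layout (not real GEO data).
sample_text = (
    '!Series_title\t"Example series"\n'
    '!Series_platform_id\t"GPL0000"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"probe_a"\t1.5\t2.25\n'
    '"probe_b"\t\t-0.5\n'
    '!series_matrix_table_end\n'
)
metadata, samples, values = parse_series_matrix(io.StringIO(sample_text))
print(metadata["Series_platform_id"], samples, values)
```

The platform id found in the meta data points to the platform file against which the probe ids are then mapped.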

It may even be possible to fully automate the process of obtaining one or all expression datasets for a given query and analyzing them.
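As a starting point for such automation, the download step can be scripted. The following sketch fetches a list of file URLs into a local directory using only the Python standard library. The function name download_all is hypothetical, and the list of URLs is assumed to have been collected beforehand, for example from the detail views of the series of interest.

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def download_all(urls, dest_dir):
    """Download every URL into dest_dir and return the local file paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for url in urls:
        # Name the local file after the last path component of the URL.
        filename = os.path.basename(urlparse(url).path)
        dest_path = os.path.join(dest_dir, filename)
        with urlopen(url) as response, open(dest_path, "wb") as dest:
            shutil.copyfileobj(response, dest)
        paths.append(dest_path)
    return paths
```

Calling download_all with the Series Matrix File URL given above and a directory name such as "geo_data" would place GSE38635_series_matrix.txt.gz in that directory, provided the server is reachable.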

The expression profiles of individual genes can also be retrieved separately [http://www.ncbi.nlm.nih.gov/geoprofiles/].
