Title: | Centrality-Based Pathway Enrichment |
---|---|
Description: | This package aims to find significant pathways through network topology information. It has several advantages compared with current pathway enrichment tools. First, pathway node instead of single gene is taken as the basic unit when analysing networks to meet the fact that genes must be constructed into complexes to hold normal functions. Second, multiple network centrality measures are applied simultaneously to measure importance of nodes from different aspects to make a full view on the biological system. CePa extends standard pathway enrichment methods, which include both over-representation analysis procedure and gene-set analysis procedure. <https://doi.org/10.1093/bioinformatics/btt008>. |
Authors: | Zuguang Gu |
Maintainer: | Zuguang Gu <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.8.0 |
Built: | 2024-11-21 04:35:03 UTC |
Source: | https://github.com/jokergoo/cepa |
Centrality-based pathway enrichment
Gene set enrichment analysis is broadly used in microarray data analysis
aimed to find which biological functions are affected by a group of
related genes behind the massive information. A lot of methods have been
developed under the framework of over-represented analysis (ORA) such
as GOstats
and GSEABase
. For a specific
form of gene sets, biological pathways are collections of correlated genes/proteins,
RNAs and compounds that work together to regulate specific biological
processes. Instead of just being a list of genes, a pathway contains
the most important information that is how the member genes interact
with each other. Thus network structure information is necessary for
the intepretation of the importance of the pathways.
In this package, the original pathway enrichment method (ORA) is extended by introducing network centralities as the weight of nodes which have been mapped from differentially expressed genes in pathways. There are two advantages compared to former work. First, for the diversity of genes' characters and the difficulties of covering the importance of genes from all aspects, we do not design a fixed measurement for each gene but set it as an optional parameter in the model. Researchers can select from candidate choices where different measurement reflects different aspect of the importance of genes. In our model, network centralities are used to measure the importance of genes in pathways. Different centrality measurements assign the importance to nodes from different aspects. For example, degree centrality measures the amount of neighbours that a node directly connects to, and betweenness centrality measures how many information streams must pass through a certain node. Generally speaking, nodes having large centrality values are central nodes in the network. It's observed that nodes represented as metabolites, proteins or genes with high centralities are essential to keep the steady state of biological networks. Moreover, different centrality measurements may relate to different biological functions. The selection of centralities for researchers depends on what kind of genes they think important. Second, we use nodes as the basic units of pathways instead of genes. We observe that nodes in the pathways include different types of molecules, such as single gene, complex and protein families. Assuming a complex or family contains ten differentially expressed member genes, in traditional ORA, these ten genes behave as the same position as other genes represented as single nodes, and thus they have effect of ten. It is not proper because these ten genes stay in a same node in the pathway and make functions with the effect of one node. Also, a same gene may locate in different complexes in a pathway and if taking the gene with effect of one, it would greatly decrease the importance of the gene. Therefore a mapping procedure from genes to pathway nodes is applied in our model. What's more, the nodes in pathways also include none-gene nodes such as microRNAs and compounds. These nodes also contribute to the topology of the pathway. So, when analyzing pathways, all types of nodes are retained.
The core function of the package is cepa.all
. There is also a parallel version
cepa.all.parallel
. User can refer to the vignette to find
how to use it (vignette("CePa")
).
Zuguang Gu <[email protected]>
Gu Z, Liu J, Cao K, Zhang J, Wang J. Centrality-based pathway enrichment: a systematic approach for finding significant pathways dominated by key genes. BMC Syst Biol. 2012 Jun 6;6(1):56.
## Not run: # load the pathway database data(PID.db) # if you only have a differential gene list or other genes of interest # in form of a list, you can apply the centrality-extended ORA method res = cepa.all(dif = dif, bk = bk, pc = PID.db$NCI) # in the above code, dif is your differential gene list, bk is your background # gene list which always be whole set of genes on a certain microarray. If you # do not have a background gene list, do not set it and the function would use # the whole human genome genes as default. pc is the pathway catalogue which in # this example is the NCI catalogue gathered from PID database. # after about 20 min, you can obtain a detailed report of the analysis report(res) # if you have a complete gene expression data, you can apply the centrality-extended # GSA methods res = cepa.all(mat = mat, label = label, pc = PID.db$NCI) # mat is your expression value matrix, label is the design of the microarray experiment. # By default, we use absolute value of t-value as the statistic for each gene and # use the mean value of genes' statistics as the pathway statistic. # after about 50 min, you can obtain a detailed report of the analysis report(res) ## End(Not run)
## Not run: # load the pathway database data(PID.db) # if you only have a differential gene list or other genes of interest # in form of a list, you can apply the centrality-extended ORA method res = cepa.all(dif = dif, bk = bk, pc = PID.db$NCI) # in the above code, dif is your differential gene list, bk is your background # gene list which always be whole set of genes on a certain microarray. If you # do not have a background gene list, do not set it and the function would use # the whole human genome genes as default. pc is the pathway catalogue which in # this example is the NCI catalogue gathered from PID database. # after about 20 min, you can obtain a detailed report of the analysis report(res) # if you have a complete gene expression data, you can apply the centrality-extended # GSA methods res = cepa.all(mat = mat, label = label, pc = PID.db$NCI) # mat is your expression value matrix, label is the design of the microarray experiment. # By default, we use absolute value of t-value as the statistic for each gene and # use the mean value of genes' statistics as the pathway statistic. # after about 50 min, you can obtain a detailed report of the analysis report(res) ## End(Not run)
Apply CePa algorithm on a single pathway
cepa(dif = NULL, bk = NULL, mat = NULL, label = NULL, pc, pathway = NULL, id = NULL, cen = "equal.weight", cen.name = if(is.function(cen)) deparse(substitute(cen)) else if(mode(cen) == "name") deparse(cen) else cen, nlevel = "tvalue_abs", plevel = "mean", iter = 1000)
cepa(dif = NULL, bk = NULL, mat = NULL, label = NULL, pc, pathway = NULL, id = NULL, cen = "equal.weight", cen.name = if(is.function(cen)) deparse(substitute(cen)) else if(mode(cen) == "name") deparse(cen) else cen, nlevel = "tvalue_abs", plevel = "mean", iter = 1000)
dif |
differential gene list |
bk |
background gene list. If background gene list are not specified, use whole human genes |
mat |
expression matrix in which rows are genes and columns are samples |
label |
a |
pc |
a |
pathway |
an |
id |
identify which pathway should be analysis in the pathway catalogue |
cen |
centrality measuments, it can ce a string, or function has been quote |
cen.name |
centrality measurement names. This argument should be set if the |
nlevel |
node level transformation, should be one of "tvalue", "tvalue_sq", "tvalue_abs". Also self-defined functions are allowed, see |
plevel |
pathway level transformation, should be one of "max", "min", "median", "sum", "mean", "rank". Also, self-defined functions are allowed, see |
iter |
number of simulations |
The function is a wrapper of cepa.ora
and cepa.univariate
.
Selection of which function depends on the arguments specified.
If dif
, bk
, pc
, pathway
, id
, cen
, cen.name
and iter
are specified, the arguments are passed to cepa.ora
. The centrality-extension
of over-representation analysis (ORA) will be applied on the list of differential genes.
If mat
, label
, pc
, pathway
, id
, cen
, cen.name
, nlevel
,
plevel
and iter
are specified, the arguments are passed to cepa.univariate
.
The centrality-extension of gene-set analysis (GSA) will be applied on the whole gene expressions.
This function is always called by cepa.all
. But you can still use it
if you want to analysis a single pathway under a specific centrality.
A cepa
class object
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI, id = 2) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will take about 45 min res.gsa = cepa(mat = eset, label = label, pc = PID.db$NCI, id = 2) ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI, id = 2) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will take about 45 min res.gsa = cepa(mat = eset, label = label, pc = PID.db$NCI, id = 2) ## End(Not run)
Apply CePa algorithm on a list of pathways under multiple centralities
cepa.all(dif = NULL, bk = NULL, mat = NULL, label = NULL, pc, cen = default.centralities, cen.name = sapply(cen, function(x) ifelse(mode(x) == "name", deparse(x), x)), nlevel = "tvalue_abs", plevel = "mean", iter = 1000)
cepa.all(dif = NULL, bk = NULL, mat = NULL, label = NULL, pc, cen = default.centralities, cen.name = sapply(cen, function(x) ifelse(mode(x) == "name", deparse(x), x)), nlevel = "tvalue_abs", plevel = "mean", iter = 1000)
dif |
differential gene list |
bk |
background gene list. If background gene list are not specified, use whole human genes |
mat |
expression matrix in which rows are genes and columns are samples |
label |
a |
pc |
a |
cen |
centrality measuments, it can ce a string, or a function |
cen.name |
centrality measurement names. By default it is parsed from |
nlevel |
node level transformation, should be one of "tvalue", "tvalue_sq", "tvalue_abs". Also self-defined functions are allowed, see |
plevel |
pathway level transformation, should be one of "max", "min", "median", "sum", "mean", "rank". Also, self-defined functions are allowed, see |
iter |
number of simulations |
All the calculation can be achieved by this function. The function is wrapper of both ORA extension and GSA extension. It chooses corresponding procedure according to the arguments specified. If the arguments contain gene lists, then the calculation is sent to functions doing ORA extension. While if the arguments contain an expression matrix and a phenotype label, the GSA extension is evoked.
The function is a wrapper of cepa.ora.all
and cepa.univariate.all
.
This is the core function of the package. User can refer to the vignette to find
how to use it (vignette("CePa")
).
If dif
, bk
, pc
, cen
, cen.name
and iter
are specified, the arguments are passed to cepa.ora.all
. The centrality-extension
of over-representation analysis (ORA) will be applied on the list of differential genes.
If mat
, label
, pc
, cen
, cen.name
, nlevel
,
plevel
and iter
are specified, the arguments are passed to cepa.univariate.all
.
The centrality-extension of gene-set analysis (GSA) will be applied on the whole gene expressions.
There is a parallel version of the function: cepa.all.parallel
.
A cepa.all
class object
cen.name
cen
node level transformation, should be one of "tvalue", "tvalue_sq", "tvalue_abs". Also self-defined functions are allowed, see cepa.univariate.all
for detail.
pathway level transformation, should be one of "max", "min", "median", "sum", "mean", "rank". Also, self-defined functions are allowed, see cepa.univariate.all
for detail.
number of simulations
cen.name
cen
node level transformation, should be one of "tvalue", "tvalue_sq", "tvalue_abs". Also self-defined functions are allowed, see cepa.univariate.all
for detail.
pathway level transformation, should be one of "max", "min", "median", "sum", "mean", "rank". Also, self-defined functions are allowed, see cepa.univariate.all
for detail.
number of simulations
Zuguang Gu <[email protected]>
Gu Z, Liu J, Cao K, Zhang J, Wang J. Centrality-based pathway enrichment: a systematic approach for finding significant pathways dominated by key genes. BMC Syst Biol. 2012 Jun 6;6(1):56.
cepa
, cepa.ora.all
, cepa.univariate.all
, cepa.all.parallel
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53_symbol.gct") label = read.cls("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53_symbol.gct") label = read.cls("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) ## End(Not run)
use CePa package through parallel computing
cepa.all.parallel(dif = NULL, bk = NULL, mat = NULL, label = NULL, pc, cen = default.centralities, cen.name = sapply(cen, function(x) ifelse(mode(x) == "name", deparse(x), x)), nlevel = "tvalue_abs", plevel = "mean", iter = 1000, ncores = 2)
cepa.all.parallel(dif = NULL, bk = NULL, mat = NULL, label = NULL, pc, cen = default.centralities, cen.name = sapply(cen, function(x) ifelse(mode(x) == "name", deparse(x), x)), nlevel = "tvalue_abs", plevel = "mean", iter = 1000, ncores = 2)
dif |
differential gene list |
bk |
background gene list. If background gene list are not specified, use whole human genes |
mat |
expression matrix in which rows are genes and columns are samples |
label |
a |
pc |
a |
cen |
centrality measuments, it can ce a string, or a function |
cen.name |
centrality measurement names. By default it is parsed from |
nlevel |
node level transformation, should be one of "tvalue", "tvalue_sq", "tvalue_abs". Also self-defined functions are allowed, see |
plevel |
pathway level transformation, should be one of "max", "min", "median", "sum", "mean", "rank". Also, self-defined functions are allowed, see |
iter |
number of simulations |
ncores |
number of cores for parallel computing |
The function divides the pathway list into several parts and each part is sent to a core for parallel computing.
The package for parallel computing is snow
.
Note: there may be warnings saying connections not closed. In fact I have closed connections after the parallel computing is done. I don't know why this happens. Maybe you breaked the computing ahead manually. However it does not matter unless you have obsessive compulsive disorder.
A cepa.all
class object
Zuguang Gu <[email protected]>
Gu Z, Liu J, Cao K, Zhang J, Wang J. Centrality-based pathway enrichment: a systematic approach for finding significant pathways dominated by key genes. BMC Syst Biol. 2012 Jun 6;6(1):56.
cepa.all
## Not run: data(PID.db) # ORA extension data(gene.list) res.ora = cepa.all.parallel(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI, ncores = 4) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53_symbol.gct") label = read.cls("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53.cls", treatment="MUT", control="WT") res.gsa = cepa.all.parallel(mat = eset, label = label, pc = PID.db$NCI, ncores = 4) ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) res.ora = cepa.all.parallel(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI, ncores = 4) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53_symbol.gct") label = read.cls("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53.cls", treatment="MUT", control="WT") res.gsa = cepa.all.parallel(mat = eset, label = label, pc = PID.db$NCI, ncores = 4) ## End(Not run)
Apply centrality-extended ORA on a single pathway
cepa.ora(dif, pc, bk = NULL, pathway = NULL, id = NULL, cen = "equal.weight", cen.name = if(is.function(cen)) deparse(substitute(cen)) else if(mode(cen) == "name") deparse(cen) else cen, iter = 1000)
cepa.ora(dif, pc, bk = NULL, pathway = NULL, id = NULL, cen = "equal.weight", cen.name = if(is.function(cen)) deparse(substitute(cen)) else if(mode(cen) == "name") deparse(cen) else cen, iter = 1000)
dif |
differential gene list |
pc |
a |
bk |
background gene list. If background gene list are not specified, use whole human genes |
pathway |
|
id |
identify which pathway in the catalogue |
cen |
centrality measuments, it can ce a string, function, or function that has been quoted |
cen.name |
centrality measurement names. This argument should be set if the |
iter |
number of simulations |
The function is always called by cepa.ora.all
. But you can still
use it if you realy want to analysis just one pathway under one centrality.
A cepa
class object
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI, id = 2) ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI, id = 2) ## End(Not run)
Apply centrality-extented ORA on a list of pathways
cepa.ora.all(dif, pc, bk = NULL, cen = default.centralities, cen.name = sapply(cen, function(x) ifelse(mode(x) == "name", deparse(x), x)), iter = 1000)
cepa.ora.all(dif, pc, bk = NULL, cen = default.centralities, cen.name = sapply(cen, function(x) ifelse(mode(x) == "name", deparse(x), x)), iter = 1000)
dif |
differential gene list |
pc |
a |
bk |
background gene list. If background gene list are not specified, use whole human genes |
cen |
centrality measuments, it can ce a string, or a function |
cen.name |
centrality measurement names. By default it is parsed from |
iter |
number of simulations |
The traditional over-representation analysis (ORA) to find significant pathways uses a 2x2 contingency table to test the independency of genes belonging to a functional category and these genes being differentially expressed, usually by Fisher's exact test. The ORA only consider the number of genes and the function extend traditional ORA with network centralities.
The differential gene list and the background gene list should be indicated
with the same identifiers (e.g. gene symbol or refseq ID). All genes in
the differential gene list should exist in the background gene list. If users
use the PID.db
data, all genes should be formatted in gene symbol.
If the centrality measurement is set as a string, only pre-defined "equal.weight", "in.degree", "out.degree", "degree", "betweenness", "in.reach", "out.reach", "reach", "in.spread", "out.spread" and "spread" are allowed. More centrality measurements can be used by setting it as a function (such as closeness, cluster coefficient). In the function, we recommand users choose at least two centrality measurements. The default centralities are "equal.weight", "in.degree", "out.degree", "betweenness", "in.reach" and "out.reach".
However, in most circumstance, the function is called by cepa.all
.
A cepa.all
class object
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.ora.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.ora.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) ## End(Not run)
Apply centrality-extended GSA on a single pathway
cepa.univariate(mat, label, pc, pathway = NULL, id = NULL, cen = "equal.weight", cen.name = if(is.function(cen)) deparse(substitute(cen)) else if(mode(cen) == "name") deparse(cen) else cen, iter = 1000, nlevel = "tvalue_abs", plevel = "mean", node.level.from.expr = NULL, node.level.t.value = NULL, r.node.level.from.expr = NULL)
cepa.univariate(mat, label, pc, pathway = NULL, id = NULL, cen = "equal.weight", cen.name = if(is.function(cen)) deparse(substitute(cen)) else if(mode(cen) == "name") deparse(cen) else cen, iter = 1000, nlevel = "tvalue_abs", plevel = "mean", node.level.from.expr = NULL, node.level.t.value = NULL, r.node.level.from.expr = NULL)
mat |
expression matrix in which rows are genes and columns are samples |
label |
a |
pc |
a |
pathway |
|
id |
identify the number of the pathway in the catalogue |
cen |
centrality measuments, it can ce a string, or function has been quote |
cen.name |
centrality measurement names |
nlevel |
node level transformation, should be one of "tvalue", "tvalue_sq", "tvalue_abs". Also self-defined functions are allowed, see |
plevel |
pathway level transformation, should be one of "max", "min", "median", "sum", "mean", "rank". Also, self-defined functions are allowed, see |
node.level.from.expr |
for simplicity of computing |
node.level.t.value |
for simplicity of computing |
r.node.level.from.expr |
for simplicity of computing |
iter |
number of simulations |
The function is always called by cepa.univariate.all
. But you can still
use it if you realy want to analysis just one pathway under one centrality.
A cepa
class object
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.univariate(mat = eset, label = label, pc = PID.db$NCI, id = 2) ## End(Not run)
## Not run: data(PID.db) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.univariate(mat = eset, label = label, pc = PID.db$NCI, id = 2) ## End(Not run)
Apply centrality-extented GSA on a list of pathways
cepa.univariate.all(mat, label, pc, cen = default.centralities, cen.name = sapply(cen, function(x) ifelse(mode(x) == "name", deparse(x), x)), nlevel = "tvalue_abs", plevel = "mean", iter = 1000)
cepa.univariate.all(mat, label, pc, cen = default.centralities, cen.name = sapply(cen, function(x) ifelse(mode(x) == "name", deparse(x), x)), nlevel = "tvalue_abs", plevel = "mean", iter = 1000)
mat |
expression matrix in which rows are genes and columns are samples |
label |
a |
pc |
a |
cen |
centrality measuments, it can ce a string, or a function |
cen.name |
centrality measurement names. By default it is parsed from |
nlevel |
node level transformation, should be one of "tvalue", "tvalue_sq", "tvalue_abs". Also self-defined functions are allowed |
plevel |
pathway level transformation, should be one of "max", "min", "median", "sum", "mean", "rank". Also, self-defined functions are allowed |
iter |
number of simulations |
The traditional gene-set analysis (GSA) to find significant pathways uses the whole expression matrix. GSA methods are implemented via either a univariate or a multivariate procedure. In univariate analysis, node level statistics are initially calculated from fold changes or statistical tests (e.g., t-test). These statistics are then combined into a pathway level statistic by summation or averaging. Multivariate analysis considers the correlations between genes in the pathway and calculates the pathway level statistic directly from the expression value matrix using Hotelling's T^2 test or MANOVA models. The function implement univariate procedure of GSA with network centralities.
If users use the PID.db
data, all genes should be formatted in gene symbol.
If the centrality measurement is set as a string, only pre-defined "equal.weight", "in.degree", "out.degree", "degree", "betweenness", "in.reach", "out.reach", "reach", "in.spread", "out.spread" and "spread" are allowed. More centrality measurements can be used by setting it as a function (such as closeness, cluster coefficient). In the function, we recommand users choose at least two centrality measurements. Note that the self-defined function should only contain one argument which is an igraph object. The default centralities are "equal.weight", "in.degree", "out.degree", "betweenness", "in.reach" and "out.reach".
The node level statistic can be self-defined. The self-defined function should contain two arguments: a vector for expression value in treatment class and a vector for expression value in control class.
The pathway level statistic can be self-defined. The self-defined function should only contain one argument: the vector of node-level statistic.
However, in most circumstance, the function is called by cepa.all
.
We are sorry that only the univariate procedures in GSA are extended. We are still trying to figure out the extension for the multivariate procedures in GSA.
A cepa.all
class object
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # GSA extension # P53_symbol.gct and P53.cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53_symbol.gct") label = read.cls("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.univariate.all(mat = eset, label = label, pc = PID.db$NCI) ## End(Not run)
## Not run: data(PID.db) # GSA extension # P53_symbol.gct and P53.cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53_symbol.gct") label = read.cls("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.univariate.all(mat = eset, label = label, pc = PID.db$NCI) ## End(Not run)
Differential gene list and background gene list
data(gene.list)
data(gene.list)
Differential gene list and background gene list was extracted from microarray data from GEO database. The accession number for the data set is GSE22058. The t-test was applied to find differentially expressed genes. Top 2000 genes were selected as the gene list.
A list containing two componets:
bk
background gene list, gene symbol
dif
differentially expressed gene list, gene symbol
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22058
data(gene.list) names(gene.list)
data(gene.list) names(gene.list)
Generate igraph object from edge list
generate.pathway(el)
generate.pathway(el)
el |
edge list, matrix with two columns. The first column is the input node and the second column is the output node. |
The function is a wrapper of graph.edgelist
and it generates
a directed graph.
In the function, repeated edged for two nodes will be eliminated.
Zuguang Gu <[email protected]>
edgelist = rbind(c("a", "b"), c("a", "b"), c("a", "c")) g = generate.pathway(edgelist)
edgelist = rbind(c("a", "b"), c("a", "b"), c("a", "c")) g = generate.pathway(edgelist)
get single cepa object from cepa.all object
get.cepa(x, id = NULL, cen = 1)
get.cepa(x, id = NULL, cen = 1)
x |
a |
id |
index or the name of the pathway |
cen |
index or the name of the centrality |
The 'cepa.all object contains the result for pathways under several centrality
measurements. In cepa.all
object, each pathway under a specific centrality
is a single cepa
object. The get.cepa
function is used to get the cepa
object from the cepa.all
object.
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) ora = get.cepa(res.ora, id = 5, cen = 3) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) gsa = get.cepa(res.gsa, id = 5, cen = 3) ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) ora = get.cepa(res.ora, id = 5, cen = 3) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) gsa = get.cepa(res.gsa, id = 5, cen = 3) ## End(Not run)
Table of p-values of pathways
p.table(x, adj.method = NA, cutoff = ifelse(adj.method == "none", 0.01, 0.05))
p.table(x, adj.method = NA, cutoff = ifelse(adj.method == "none", 0.01, 0.05))
x |
a |
adj.method |
methods in |
cutoff |
cutoff for significance |
Since the p-values for each pathway are calculated for several centralities, the whole p-values are represented as a table.
Also it can extract significant pathways only.
A data matrix where rows are pathways and columns are centralities.
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) p.table(res.ora) p.table(res.ora, adj.method = "BH") # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) p.table(res.gsa) ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) p.table(res.ora) p.table(res.ora, adj.method = "BH") # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) p.table(res.gsa) ## End(Not run)
names of the pathway nodes
pathway.nodes(pathway)
pathway.nodes(pathway)
pathway |
an |
If nodes in the pathway have names, then it returns a vector of nodes names. If nodes in the pathway have no name, it just returns the index of nodes (start from 1, after igraph version 0.6).
Zuguang Gu <[email protected]>
interaction = rbind(c("a", "b"), c("a", "c")) g = generate.pathway(interaction) pathway.nodes(g)
interaction = rbind(c("a", "b"), c("a", "c")) g = generate.pathway(interaction) pathway.nodes(g)
pathway catalogues from Pathway Interaction Database(PID)
data(PID.db)
data(PID.db)
The pathway data is parsed from XML format file provided by PID FTP site. The perl code for parsing pathway data can be found at author's website (https://mcube.nju.edu.cn/jwang/lab/soft/cepa/index.html ).
There are four pathway catalogues which are NCI_Nature, BioCarta, KEGG and Reactome.
Each catalogue contains at least three members: pathway list (pathList), interaction list (interactionList) and mappings from node id to gene id (mapping). The pathway list contains a list of pathways in which each pathway is represented by a list of interaction id. The interactions can be queried from the interaction list by the interaction id. The interaction list data is represented as a data frame in which the first column is the interaction id, the second column is the input node id and the third column is the output node id. In real biological pathways, a node in the pathway can be proteins, complex, families and none-gene nodes, so the mapping from node ids to gene ids is also provided. It must be noted that in this package, gene symbol is selected as the primary gene id, so if users apply the PID.db data, they should pay attension to the gene ids they transform.
Besides, in each catalogue, there also a node.name, node.type and version data. The node.name provides the custumed name for each node. The node.type provides the type for each node (e.g. it is a complex or compound). The version provides the version of the catalogue data.
Data has been updated to the lastest version by the day the package released (at 2012_07_19 09:34::20).
Note only part of pathways in the XML file are listed on the PID website. Also, we have set the minimum and maximum connected nodes when extracting pathways from PID, so not all the pathways listed on the PID website are in PID.db.
A list containing four component:
NCI_Nature-curated pathway catalogue
BioCarta pathway catalogue
KEGG pathway catalogue
Reactome pathway catalogue
Each pathway catalogue is a pathway.catalogue
class object. Each pathway
catalogue can be used directly in cepa.all
and cepa
data(PID.db) names(PID.db) PID.db$NCI plot(PID.db$NCI)
data(PID.db) names(PID.db) PID.db$NCI plot(PID.db$NCI)
Plot the cepa object
## S3 method for class 'cepa' plot(x, type = c("graph", "null"), ...)
## S3 method for class 'cepa' plot(x, type = c("graph", "null"), ...)
x |
a |
type |
identify the type for the plot |
... |
arguments passed to |
The function is wrapper of plotGraph
and plotNull
.
If type is specified to "graph", the graph of the network will be plotted (see plotGraph
for details).
If type is specified to "null", the null distribution of the pathway score
in the pathway will be plotted (see plotNull
for details).
if type is set to "graph", the function will return a igraph
object or a graphML
object of the pathway. Else it is NULL.
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI, id = 2) plot(res.ora) plot(res.ora, type = "null") # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa(mat = eset, label = label, pc = PID.db$NCI, id = 2) plot(res.gsa, type = "null") ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI, id = 2) plot(res.ora) plot(res.ora, type = "null") # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa(mat = eset, label = label, pc = PID.db$NCI, id = 2) plot(res.gsa, type = "null") ## End(Not run)
plot the cepa.all object
## S3 method for class 'cepa.all' plot(x, id = NULL, cen = 1, type = c("graph", "null"), tool = c("igraph", "Rgraphviz"), node.name = NULL, node.type = NULL, adj.method = "none", only.sig = FALSE, cutoff = ifelse(adj.method == "none", 0.01, 0.05), ...)
## S3 method for class 'cepa.all' plot(x, id = NULL, cen = 1, type = c("graph", "null"), tool = c("igraph", "Rgraphviz"), node.name = NULL, node.type = NULL, adj.method = "none", only.sig = FALSE, cutoff = ifelse(adj.method == "none", 0.01, 0.05), ...)
x |
a |
id |
index or the name for the pathway |
cen |
index or the name for the centrality |
type |
If the aim is to plot single pathway, then this argument is to identify the kind of the plotting. |
tool |
Use which tool to visualize the graph. Choices are 'igraph' and 'Rgraphviz' |
node.name |
node.name for each node |
node.type |
node.type for each node |
adj.method |
method of |
only.sig |
whether to show all pathways. If just show significant pathways, the names for each significant pathway will be draw. |
cutoff |
cutoff for significance |
... |
other arguments |
This function has two applications. First, it can draw heatmaps of p-values
of all pathways under different centrality measurements. To do it, users should set
x
, adj.method
, only.sig
, cutoff
arguments.
Second, it can draw figures of single
pathway under specific centrality measurement. Under this circumstance,
this function is just a wrapper of plot.cepa
. To do it,
users should set x
, id
, cen
, type
, tool
, node.name
and node.type
arguments. The
id
and cen
arguments are used to get single cepa
object that sent to the
plot function.
It must be noted that these two kinds of arguments should not be mixed.
There is also another popular method qvalue
to adjust p-values. However, errors
may occur when adjusting some kind of p-value list by qvalue
.
So qvalue
was not implemented into CePa. But still users can override the default
p.adjust to support qvalue by themselves, see the vignette.
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) plot(res.ora) plot(res.ora, id = 3) plot(res.ora, id = 3, type = "null") # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) plot(res.gsa) plot(res.gsa, id = 3, cen = 2) plot(res.gsa, id = 3, cen = 2, type = "null") ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) plot(res.ora) plot(res.ora, id = 3) plot(res.ora, id = 3, type = "null") # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) plot(res.gsa) plot(res.gsa, id = 3, cen = 2) plot(res.gsa, id = 3, cen = 2, type = "null") ## End(Not run)
plot pathway.catalogue object
## S3 method for class 'pathway.catalogue' plot(x, ...)
## S3 method for class 'pathway.catalogue' plot(x, ...)
x |
a |
... |
other arguments |
There are three fugures: A) Distribution of the number of member genes in each node; B) Distribution of the number of nodes in which a single gene resides; C) Relationship between node count and gene count in biological pathways.
Zuguang Gu <[email protected]>
data(PID.db) NCI = PID.db$NCI plot(NCI)
data(PID.db) NCI = PID.db$NCI plot(NCI)
Plot graph for the pathway network
plotGraph(x, node.name = NULL, node.type = NULL, draw = TRUE, tool = c("igraph", "Rgraphviz"), graph.node.max.size = 20, graph.node.min.size = 3, graph.layout.method = NULL)
plotGraph(x, node.name = NULL, node.type = NULL, draw = TRUE, tool = c("igraph", "Rgraphviz"), graph.node.max.size = 20, graph.node.min.size = 3, graph.layout.method = NULL)
x |
a |
node.name |
node.name for each node |
node.type |
node.type for each node |
draw |
Whether to draw the graph |
tool |
Use which tool to visualize the graph. Choices are 'igraph' and 'Rgraphviz' |
graph.node.max.size |
max size of the node in the graph |
graph.node.min.size |
min size of the node in the graph |
graph.layout.method |
function of the layout method. For the list of available methods, see |
Graph view of the pathway where the size of node is proportional to centrality value of the node.
By default, the layout for the pathway tree-like. If the number of pathway nodes is large, the layout would be a random layout.
Two packages can be selected to visualize the graph: igraph
and Rgraphviz
.
Default package is igraph
(in fact, this package just uses the data generated from
the layout function in igraph
package, which are the coordinate of nodes and edges.
And the I re-wrote the plotting function to generate the graph). From my personal view,
Rgraphviz
package generated more beautiful graphs.
If the tool
is set as igraph
, the function returns a igraph
object. And
if the tool
is set as Rgraphviz
, the function returns a graphAM
class object.
So if users don't satisfy, they can draw graphs of the network with their
own settings.
The function is always called through plot.cepa.all
and plot.cepa
.
A igraph
object of the pathway
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) ora = get.cepa(res.ora, id = 5, cen = 3) plotGraph(ora) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) gsa = get.cepa(res.gsa, id = 5, cen = 3) plotGraph(gsa) ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) ora = get.cepa(res.ora, id = 5, cen = 3) plotGraph(ora) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) gsa = get.cepa(res.gsa, id = 5, cen = 3) plotGraph(gsa) ## End(Not run)
Plot the null distribution of the pathway score
plotNull(x)
plotNull(x)
x |
a |
There are two figures in the plotting.
A) Distribution of node score in the pathway under simulation. Since a pathway contains a list of nodes. The distribution of node score for the pathway in each simulation is measures by maximum value, the 75th quantile, median value and minimum value. The distribution of node score for the pathway in the real data is highlighted.
B) Histogram of simulated pathway scores.
The function is always called through plot.cepa.all
and plot.cepa
.
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) ora = get.cepa(res.ora, id = 5, cen = 3) plotNull(ora) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) gsa = get.cepa(res.gsa, id = 5, cen = 3) plotNull(gsa) ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) ora = get.cepa(res.ora, id = 5, cen = 3) plotNull(ora) # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) gsa = get.cepa(res.gsa, id = 5, cen = 3) plotNull(gsa) ## End(Not run)
print the cepa object
print the cepa object
## S3 method for class 'cepa' print(x, ...)
## S3 method for class 'cepa' print(x, ...)
x |
a |
... |
other arguments |
The function print procedure of the analysis, the centrality and the p-value for the pathway.
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI, id = 2) res.ora # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa(mat = eset, label = label, pc = PID.db$NCI, id = 2) res.gsa ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI, id = 2) res.ora # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa(mat = eset, label = label, pc = PID.db$NCI, id = 2) res.gsa ## End(Not run)
print the cepa.all object
## S3 method for class 'cepa.all' print(x, ...)
## S3 method for class 'cepa.all' print(x, ...)
x |
a |
... |
other arguments |
The function print the number of significant pathways under various centrality measures at p-value <= 0.01.
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) res.ora # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) res.gsa ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) res.ora # GSA extension # P53_symbol.gct and P53_cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("P53_symbol.gct") label = read.cls("P53.cls", treatment="MUT", control="WT") # will spend about 45 min res.gsa = cepa.all(mat = eset, label = label, pc = PID.db$NCI) res.gsa ## End(Not run)
print pathway.catalogue object
## S3 method for class 'pathway.catalogue' print(x, ...)
## S3 method for class 'pathway.catalogue' print(x, ...)
x |
a |
... |
other arguments |
Simply print the number of pathways in the catalogue
Zuguang Gu <[email protected]>
data(PID.db) NCI = PID.db$NCI NCI
data(PID.db) NCI = PID.db$NCI NCI
Calculate radiality centrality
radiality(graph, mode = c("all", "in", "out"))
radiality(graph, mode = c("all", "in", "out"))
graph |
an |
mode |
mode of the centrality |
The radiality is defined as sum(d_G + 1 - d(v, w))/(n - 1)
. where d(w, v)
is the
length of the shortest path from node w
to node v
, d_G
is the diameter of the network,
n is the size of the network.
Zuguang Gu <[email protected]>
require(igraph) pathway = barabasi.game(200) radiality(pathway)
require(igraph) pathway = barabasi.game(200) radiality(pathway)
Calculate largest reach centrality
reach(graph, weights=E(graph)$weight, mode=c("all", "in", "out"))
reach(graph, weights=E(graph)$weight, mode=c("all", "in", "out"))
graph |
an |
mode |
mode of the centrality |
weights |
If the edges in the graph have weight, then by default, the weight is used to calculate the length of the shortest path. Set it to NULL to supress the weight. |
The largest reach centrality measures how far a node can send or receive the information in the network. It is defined as the largest length of the shortest path from all the other nodes in the network.
# There is no example NULL
# There is no example NULL
Read CLS file which stores the phenotype data
read.cls(file, treatment, control)
read.cls(file, treatment, control)
file |
cls file path |
treatment |
string of treatment label in cls file |
control |
string of control label in cls file |
The CLS file format defines the phenotype data of microarray experiments. The first line is the number of samples, number of classes and the third number always be 1. These three numbers are seperated by spaces or tabs. The second line begins with #. The next two strings usually are the label of the phenotype. The third line is the label of each samples where same label represents the same class.
The first and the second line is ignored by this function and class labels are taken from the factor of the vector parsed from the third line.
A sampleLabel
class object
https://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
Zuguang Gu <[email protected]>
## Not run: # P53.cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ label = read.cls("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53.cls", treatment="MUT", control="WT") ## End(Not run)
## Not run: # P53.cls can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ label = read.cls("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53.cls", treatment="MUT", control="WT") ## End(Not run)
Read GCT format file which stores the expression values
read.gct(file)
read.gct(file)
file |
gct file path |
The GCT format is a tab delimited file format that stores the expression value matrix. The first line of the file is the version number which always be #1.2. The second line is the number of the size of genes and samples, seperated by space, usually for the initiation of reading the expression matrix. The third line contains a list of identifiers for the samples associated with each of the columns in the remainder of the file. From the fourth line will be the expression value of each gene.
GCT file is used together with CLS file.
A matrix of the expression values, with rows correponding to genes and cols to samples.
https://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
Zuguang Gu <[email protected]>
## Not run: # expression data stored in a gct format file # P53_symbol.gct can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53_symbol.gct") head(eset) ## End(Not run)
## Not run: # expression data stored in a gct format file # P53_symbol.gct can be downloaded from # https://mcube.nju.edu.cn/jwang/lab/soft/cepa/ eset = read.gct("https://mcube.nju.edu.cn/jwang/lab/soft/cepa/P53_symbol.gct") head(eset) ## End(Not run)
Generate report for CePa analysis
report(x, adj.method = "none", cutoff = ifelse(adj.method == "none", 0.01, 0.05), template.file = system.file(package = "CePa", "extdata", "cepa.template"), only.sig = TRUE, dir.path = NULL, ...)
report(x, adj.method = "none", cutoff = ifelse(adj.method == "none", 0.01, 0.05), template.file = system.file(package = "CePa", "extdata", "cepa.template"), only.sig = TRUE, dir.path = NULL, ...)
x |
a |
adj.method |
methods in |
cutoff |
cutoff for significance |
template.file |
path of the template file |
only.sig |
whether to generate detailed report for every pathway. If it is set to FALSE, the page for every pathway under every centrality would be generated (there would be so many images!). |
dir.path |
dir name |
... |
other arguments |
The report is in HTML format that you can view it in you web browser. Networks for pathways can be visualized interactively (by using Cytoscape Web, in which you can drag the network, zoom in and zoom out the network). To load Flash Player successful in you browser, you may need to set the Flash security settings on your machine. See https://cytoscapeweb.cytoscape.org/tutorial and change the settings via https://www.macromedia.com/support/documentation/en/flashplayer/help/settings_manager04.html .
The report would locate at the current working directory. View the report
by clicking index.html
in the report directory.
There is also another popular method qvalue to adjust p-values. Turn to plot.cepa.all
to find out how to use qvalue.
https://cytoscapeweb.cytoscape.org/
Zuguang Gu <[email protected]>
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) report(res.ora) ## End(Not run)
## Not run: data(PID.db) # ORA extension data(gene.list) # will spend about 20 min res.ora = cepa.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI) report(res.ora) ## End(Not run)
Generate data structure of sample labels
sampleLabel(label, treatment, control)
sampleLabel(label, treatment, control)
label |
sample label vector |
treatment |
treatment label |
control |
control label |
Since sample label will not be modified in the analysis, this function is used to integrate all the label information in one single data object.
A sampleLabel
class object
Zuguang Gu <[email protected]>
sampleLabel(c("A", "B", "B", "A", "A", "A", "B", "B"), treatment = "A", control = "B")
sampleLabel(c("A", "B", "B", "A", "A", "A", "B", "B"), treatment = "A", control = "B")
store pathway data and pre-processing
set.pathway.catalogue(pathList, interactionList, mapping, min.node = 5, max.node = 500, min.gene = min.node, max.gene = max.node, ...)
set.pathway.catalogue(pathList, interactionList, mapping, min.node = 5, max.node = 500, min.gene = min.node, max.gene = max.node, ...)
pathList |
list of pathways |
interactionList |
list of interactions |
mapping |
a data frame or matrix providing mappings from gene id to pathway node id. The first column is node id and the second column is gene id. |
min.node |
minimum number of connected nodes in each pathway |
max.node |
maximum number of connected nodes in each pathway |
min.gene |
minimum number of genes in each pathway |
max.gene |
maximum number of genes in each pathway |
... |
other arguments, should have names, these data will be stored as a list member in the returned value from the function |
The pathway data will not be changed in the analysis, so the pathway data is integrated in one single data object by this function. Also, the function will do a little preprocess of the pathway data.
Basicly, a pathway contains a list of interactions. The pathList argument is a list where elements in the list is the vector of interaction IDs in the pathway. The interactions in the pathway can be got from a interaction list pool represented as interactionList argument. The interactionList argument stores the total interaction list in the pathway catalogue. It represents as a three columns data frame or matrix where the first column is the interaction id, the second column is the input node id and the third column is the output node id.
The mapping data frame provide the mapping from node id to gene id. The first column is the node id and the second column is the gene id.
Besides the pathList, interactionList and mapping arguments, more arguments can
be added to the function. These new data will be stored as the member of the list
that returned by the function. E.g., in the PID.db
data, each catalogue
is a pathway.catalogue
object. Except the pathList, interactionList and mapping arguments,
there are also node.name, node.type and version arguments.
The summary can be visualized by plot.pathway.catalogue
.
A pathway.catalogue
class object
Zuguang Gu <[email protected]>
## Not run: data(PID.db) catalogue = set.pathway.catalogue(pathList = PID.db$NCI$pathList[1:20], interactionList = PID.db$NCI$intertionList, mapping = PID.db$NCI$mapping) ## End(Not run)
## Not run: data(PID.db) catalogue = set.pathway.catalogue(pathList = PID.db$NCI$pathList[1:20], interactionList = PID.db$NCI$intertionList, mapping = PID.db$NCI$mapping) ## End(Not run)
Calculate radiality centrality
spread(graph, mode = c("all", "in", "out"), weights = E(graph)$weight, f = function(x) 1/x)
spread(graph, mode = c("all", "in", "out"), weights = E(graph)$weight, f = function(x) 1/x)
graph |
an |
mode |
mode of the centrality |
weights |
If edges in the graph have weight, then by default, the weight is used to calculate the length of the shortest path. Set it to NULL to supress the weight |
f |
function for the weaken rate |
The spread centrality measures how wide the node can send or receive the information in the network. Like the water wave, the effect would be weakened with the increase of the distance to other nodes.
# There is no example NULL
# There is no example NULL