--- title: "Check the heaviness of package dependencies" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Check the heaviness of package dependencies} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message = FALSE} library(pkgndep) pkgndep:::load_all_pkg_dep() ``` When developing R packages, we should try to avoid directly setting dependencies on "heavy packages". The "heaviness" for a package means, the number of additional dependency packages it brings to. If your package directly depends on a heavy package, it would bring several consequences: 1. Users need to install a lot of additional packages when installing your package which brings the risk that installation of some packages may fail and it makes your package cannot be installed. 2. The namespaces that are loaded into your R session after loading your package will be huge (you can see the loaded namespaces by `sessionInfo()`). 3. You package will be "heavy" as well and it may take long time to load your package. In the DESCRIPTION file of your package, there are "direct dependency pakcages" listed in the `Depends`, `Imports` and `LinkingTo` fields. There are also "indirect dependency packages" that can be found recursively for each of the direct dependency packages. Here what we called "dependency packages" are the union of the direct and indirect dependency packages. There are also packages listed in `Suggests` and `Enhances` fields in DESCRIPTION file, but they are not enforced to be installed when installing your package. Of course, they also have "indirect dependency packages". To get rid of the heavy packages that are not often used in your package, it is better to move them into the `Suggests`/`Enhances` fields and to load/install them only when they are needed. Here the **pkgndep** package checks the heaviness of the dependency packages of your package. For each package listed in the `Depends`, `Imports`, `LinkingTo` and `Suggests`/`Enhances` fields in the DESCRIPTION file, **pkgndep** checks how many additional packages your package requires. The summary of the dependency is visualized by a customized heatmap. As an example, I am developing a package called [**cola**](https://github.com/jokergoo/cola/) which depends on [a lot of other packages](https://github.com/jokergoo/ComplexHeatmap/blob/master/DESCRIPTION). The dependency heatmap looks like follows (please drag the figure to a new tab to see it in its actual size): ```{r, echo = FALSE} x = pkgndep:::ENV$all_pkg_dep[["cola"]] pdf(NULL) size = dependency_heatmap(x, help = FALSE) invisible(dev.off()) ``` ```{r, fig.width = size[1], fig.height = size[2], out.width = "1000px", echo = FALSE} dependency_heatmap(x) ``` In the heatmap, rows are the packages listed in `Depends`, `Imports` and `Suggests` fields, columns are the additional dependency packages required for each row package. The barplots on the right show the number of required package, the number of imported functions/methods/classes (parsed from NAMESPACE file) and the quantitative measure "heaviness" (the definition of heaviness will be introduced later). We can see if all the packages are put in the `Depends` or `Imports` field (i.e. movig all suggsted packages to `Imports`), in total `r x$n_by_all` packages are required, which are really a lot. Actually some of the heavy packages such as **WGCNA**, **clusterProfiler** and **ReactomePA** (the last three packages in the heatmap rows) are not very frequently used in **cola**, moving them to `Suggests` field and using them only when they are needed greatly helps to reduce the dependencies of **cola**. Now the number of required packages are reduced to only `r x$n_by_strong`. ## Usage To use this package: ```r library(pkgndep) pkg = pkgndep("package-name") # if the package is already installed dependency_heatmap(pkg) ``` or ```r pkg = pkgndep("path-of-the-package") # if the package has not been installed yet dependency_heatmap(pkg) ``` The value for `pkgndep()` should be 1. a CRAN/Bioconductor package, 2. an installed package, 3. a path of a local package, 4. URL of a GitHub repository. Executable examples: ```{r, echo = FALSE} if(grepl("devel", R.version$status)) { pkgndep = function(...) { pkgndep::pkgndep(..., online = FALSE) } } ``` ```{r} library(pkgndep) pkg = pkgndep("ComplexHeatmap") pkg ``` `pkgndep()` first needs to retrieve package databases both from remote repositories and local libraries, as you can see the message from above code. This only happens once and the database is internally saved and re-used. We can directly use `dependency_heatmap()` function to create the dependency heatmap: ```{r, echo = FALSE} pdf(NULL) size = dependency_heatmap(pkg, help = FALSE) invisible(dev.off()) ``` ```{r, fig.width = size[1], fig.height = size[2], out.width = "1000px"} dependency_heatmap(pkg) ``` You can set the `file` argument to directly save the image into a figure where the figure size is automatically calculated. Supported image formats are `png`/`jpg`/`svg`/`pdf`. ```{r, eval = FALSE} dependency_heatmap(pkg, file = "test.png") ``` `heaviness_report()` function can generate an HTML report for the dependency heaviness analysis on the package. ```{r, eval = FALSE} heaviness_report(pkg) ``` ## Heaviness The heaviness of package dependency can be measured quantitatively. **pkgndep** provides two measures: the absolute measure and the relative measure. The heaviness of a dependency package is calculated as follows. If package _B_ is in the `Depends`/`Imports`/`LinkingTo` fields of package _A_, which means, package _B_ is directly required for package _A_, denote `v1` as the total number of packages for package _A_, and denote `v2` as the total number of required packages if moving package _B_ to `Suggests` in package _A_ (which means, now _B_ is not enforced to be installed for package _A_). The absolute measure of heaviness is simply `v1 - v2` and relative measure is `(v1 + a)/(v2 + a)` where `a` is a small constant, e.g. 10. So here the absolute heaviness for package _B_ on package _A_ is the number of additional packages that package _B_ uniquely brings in. In the second scenario, if package _B_ is in the `Suggests`/`Enhances` fields of package _A_, now `v2` is the total number of required packages if moving package _B_ to `Imports` in package _A_, the absolute measure of heaviness is `v2 - v1` and relative measure is `(v2 + a)/(v1 + a)`. The heaviness score can be calculated by the function `heaviness()`: ```{r} heaviness(pkg) heaviness(pkg, rel = TRUE) ``` ## A fast version of `tools::package_dependencies()` The package dependencies are based on "package database" which is normally retrieved by `available.packages()`. In **tools** package, there is a `package_dependencies()` function that can be used to get a list of dependency packages. In the following example code, we retrieve the dependency packages for package **ggplot2**. ```{r} chooseCRANmirror(ind = 1) # choose the mirror fro RStudio db = available.packages() ``` ```{r} system.time(p1 <- tools::package_dependencies("ggplot2", db = db, recursive = TRUE)[[1]]) ``` In **pkgndep**, we implement a faster version of `package_dependencies()` function. First the database needs to be reformatted by `reformat_db()` function. The returned variable `db2` is a reference class object and its method `db2$package_dependencies()` can be used to retrieve dependency packages. ```{r} db2 = reformat_db(db) db2 system.time(p2 <- db2$package_dependencies("ggplot2", recursive = TRUE, simplify = TRUE)) ``` `p1` and `p2` are actually identical: ```{r} identical(sort(p1), sort(p2)) ``` ## Session info ```{r} sessionInfo() ```