Check the heaviness of package dependencies

When developing R packages, we should try to avoid depending directly on “heavy packages”. The “heaviness” of a package is the number of additional dependency packages it brings in. If your package directly depends on a heavy package, there are several consequences:

  1. Users need to install many additional packages when installing your package. If the installation of any one of them fails, your package cannot be installed either.
  2. The set of namespaces loaded into the R session after loading your package will be huge (you can see the loaded namespaces with sessionInfo()).
  3. Your package will be “heavy” as well, and it may take a long time to load.
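
As a rough way to observe the second point, the number of namespaces that loading a package adds to the session can be counted with loadedNamespaces(). This is only a sketch: “splines”, which ships with R, is used as a stand-in for your own package, and the result depends on what is already loaded in the session.

```r
# Namespaces loaded before and after loading a package.
# "splines" is a stand-in here; replace it with your own package name.
before <- loadedNamespaces()
loadNamespace("splines")
after <- loadedNamespaces()

# Namespaces that were newly loaded (empty if everything was loaded already)
new_ns <- setdiff(after, before)
length(new_ns)
```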

In the DESCRIPTION file of your package, the “direct dependency packages” are those listed in the Depends, Imports and LinkingTo fields. There are also “indirect dependency packages”, which are found recursively from each of the direct dependency packages. What we call “dependency packages” here is the union of the direct and indirect dependency packages.
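
To make the notion of direct dependencies concrete, the following sketch reads the dependency fields from a DESCRIPTION file with read.dcf(). The DESCRIPTION content and the helper parse_field() are made up for illustration; this is not how pkgndep itself parses the file.

```r
# Write a toy DESCRIPTION file (hypothetical content) to a temporary location
desc_file <- tempfile()
writeLines(c(
  "Package: mypkg",
  "Version: 0.1.0",
  "Depends: R (>= 3.6.0)",
  "Imports: methods, grid, circlize (>= 0.4.8)",
  "Suggests: knitr, rmarkdown"
), desc_file)

fields <- c("Depends", "Imports", "LinkingTo", "Suggests", "Enhances")
desc <- read.dcf(desc_file, fields = fields)  # missing fields become NA

# Hypothetical helper: split a field into package names, dropping
# version constraints and the special entry "R"
parse_field <- function(x) {
  if (is.na(x)) return(character(0))
  pkgs <- strsplit(x, ",")[[1]]
  pkgs <- trimws(sub("\\(.*\\)", "", pkgs))
  setdiff(pkgs[nzchar(pkgs)], "R")
}

direct <- lapply(setNames(fields, fields), function(f) parse_field(desc[1, f]))
direct$Imports   # strong (enforced) dependencies
direct$Suggests  # optional dependencies
```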

There are also packages listed in the Suggests and Enhances fields of the DESCRIPTION file, but they are not enforced to be installed when installing your package. Of course, they also have “indirect dependency packages”. To get rid of heavy packages that are not often used in your package, it is better to move them into the Suggests/Enhances fields and to load/install them only when they are needed.
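
A common way to use a package from Suggests only when it is needed is to check for it with requireNamespace() at the point of use. The helper use_suggested() below is a hypothetical name for illustration; “stats” stands in for a heavy, rarely used package so that the example runs anywhere.

```r
# Hypothetical helper: fetch a function from an optional (Suggests) package,
# failing with an informative message if the package is not installed.
use_suggested <- function(pkg, fun) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is needed for this feature. ",
         "Please install it first.", call. = FALSE)
  }
  getExportedValue(pkg, fun)
}

# 'stats' ships with R; in a real package `pkg` would be a heavy package
# listed in Suggests.
med <- use_suggested("stats", "median")
med(c(1, 5, 9))
```

With this pattern the optional package is loaded only when the feature is actually called, so it does not add to the packages required for installing your package.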

The pkgndep package checks the heaviness of the dependency packages of your package. For each package listed in the Depends, Imports, LinkingTo and Suggests/Enhances fields of the DESCRIPTION file, pkgndep checks how many additional packages your package requires because of it. The summary of the dependencies is visualized as a customized heatmap.

As an example, I am developing a package called cola which depends on a lot of other packages. The dependency heatmap looks as follows:

## The best device size to visualize the complete plot is 38.91 x 11.82 (in inches),
## or use `plot(obj, fix_size = FALSE)` so that heatmap cells are not in fixed sizes.

In the heatmap, rows are the packages listed in the Depends, Imports and Suggests fields, and columns are the additional dependency packages required by each row package. The barplots on the right show the number of required packages, the number of imported functions/methods/classes (parsed from the NAMESPACE file) and the quantitative measure “heaviness” (the definition of heaviness is introduced later).

We can see that if all the packages are put in the Depends or Imports fields (i.e. moving all suggested packages to Imports), 257 packages are required in total, which is really a lot. Actually, some of the heavy packages such as WGCNA, clusterProfiler and ReactomePA (the last three packages in the heatmap rows) are not used very frequently in cola. Moving them to the Suggests field and using them only when they are needed greatly reduces the dependencies of cola: the number of required packages drops to only 65.

Usage

To use this package:

library(pkgndep)
pkg = pkgndep("package-name")  # if the package is already installed
dependency_heatmap(pkg)

or

pkg = pkgndep("path-of-the-package")  # if the package has not been installed yet
dependency_heatmap(pkg)

The argument of pkgndep() should be one of: 1. the name of a CRAN/Bioconductor package, 2. the name of an installed package, 3. the path of a local package, 4. the URL of a GitHub repository.

Executable examples:

library(pkgndep)
pkg = pkgndep("ComplexHeatmap")
## retrieve package database from CRAN/Bioconductor (3.20)...
##   - 25066 remote packages on CRAN/Bioconductor.
##   - 191 packages installed locally.
## prepare dependency table...
## prepare reverse dependency table...
pkg
## 'ComplexHeatmap', version 2.23.0
## - 31 packages are required for installing 'ComplexHeatmap'.
## - 131 packages are required if installing packages listed in all fields in DESCRIPTION.

pkgndep() first needs to retrieve the package databases from both remote repositories and local libraries, as the messages printed by the code above show. This happens only once; the database is saved internally and reused.

We can directly use dependency_heatmap() function to create the dependency heatmap:

dependency_heatmap(pkg)

## The best device size to visualize the complete plot is 21.36 x 8.54 (in inches),
## or use `plot(obj, fix_size = FALSE)` so that heatmap cells are not in fixed sizes.

You can set the file argument to save the plot directly to a file, in which case the figure size is calculated automatically. Supported image formats are png/jpg/svg/pdf.

dependency_heatmap(pkg, file = "test.png")

The heaviness_report() function generates an HTML report of the dependency heaviness analysis for the package.

heaviness_report(pkg)

Heaviness

The heaviness of package dependency can be measured quantitatively. pkgndep provides two measures: the absolute measure and the relative measure.

The heaviness of a dependency package is calculated as follows. If package B is listed in the Depends/Imports/LinkingTo fields of package A, i.e. package B is directly required by package A, denote v1 as the total number of packages required by package A, and v2 as the total number of required packages after moving package B to Suggests in package A (so that B is no longer enforced to be installed for package A). The absolute measure of heaviness is simply v1 - v2 and the relative measure is (v1 + a)/(v2 + a), where a is a small constant, e.g. 10. The absolute heaviness of package B on package A is thus the number of additional packages that package B uniquely brings in.

In the second scenario, if package B is in the Suggests/Enhances fields of package A, v2 is instead the total number of required packages after moving package B to Imports in package A. The absolute measure of heaviness is then v2 - v1 and the relative measure is (v2 + a)/(v1 + a).
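
The two scenarios can be illustrated on a small, made-up dependency graph. The sketch below is not pkgndep's implementation: the graph deps, the helper required() and the choice a = 10 are assumptions for illustration only.

```r
# Hypothetical toy graph: each package maps to its direct strong
# dependencies (Depends/Imports/LinkingTo).
deps <- list(
  A = c("B", "C"),
  B = c("D", "E"),
  C = c("E"),
  D = character(0),
  E = character(0),
  S = c("F", "G"),   # S is only suggested by A
  F = character(0),
  G = character(0)
)

# All packages reachable from `direct` (direct + indirect dependencies)
required <- function(direct, deps) {
  seen <- character(0)
  queue <- direct
  while (length(queue) > 0) {
    p <- queue[1]
    queue <- queue[-1]
    if (!(p %in% seen)) {
      seen <- c(seen, p)
      queue <- c(queue, deps[[p]])
    }
  }
  seen
}

a  <- 10
v1 <- length(required(deps$A, deps))                  # B, C, D, E

# Scenario 1: B is imported by A; move it to Suggests
v2_B  <- length(required(setdiff(deps$A, "B"), deps)) # C, E
abs_B <- v1 - v2_B                 # B and D are brought in only by B
rel_B <- (v1 + a) / (v2_B + a)

# Scenario 2: S is suggested by A; move it to Imports
v2_S  <- length(required(c(deps$A, "S"), deps))       # B, C, D, E, S, F, G
abs_S <- v2_S - v1                 # S, F and G would be added
rel_S <- (v2_S + a) / (v1 + a)

c(abs_B = abs_B, rel_B = rel_B, abs_S = abs_S, rel_S = rel_S)
```

Note that E does not count towards the heaviness of B, because C requires it as well.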

The heaviness score can be calculated by the function heaviness():

heaviness(pkg)
##       grDevices        graphics            grid           stats         methods 
##               0               0               0               0               0 
##    RColorBrewer             png     matrixStats       codetools          digest 
##               1               1               1               0               1 
##         foreach      colorspace   GlobalOptions            clue      doParallel 
##               0               0               0               2               4 
##      GetoptLong        circlize         IRanges        dendsort            jpeg 
##               3               2               5               1               1 
##            tiff     fastcluster           Cairo    gridGraphics            glue 
##               1               1               1               1               1 
##        markdown        grImport          magick          gplots           knitr 
##               3               2               4               5               5 
##       grImport2        pheatmap        gridtext   GenomicRanges        testthat 
##               4              12              16              14              22 
##       rmarkdown      dendextend EnrichedHeatmap 
##              25              31              19
heaviness(pkg, rel = TRUE)
##       grDevices        graphics            grid           stats         methods 
##        1.000000        1.000000        1.000000        1.000000        1.000000 
##    RColorBrewer             png     matrixStats       codetools          digest 
##        1.025000        1.025000        1.025000        1.000000        1.025000 
##         foreach      colorspace   GlobalOptions            clue      doParallel 
##        1.000000        1.000000        1.000000        1.051282        1.108108 
##      GetoptLong        circlize         IRanges        dendsort            jpeg 
##        1.078947        1.051282        1.138889        1.024390        1.024390 
##            tiff     fastcluster           Cairo    gridGraphics            glue 
##        1.024390        1.024390        1.024390        1.024390        1.024390 
##        markdown        grImport          magick          gplots           knitr 
##        1.073171        1.048780        1.097561        1.121951        1.121951 
##       grImport2        pheatmap        gridtext   GenomicRanges        testthat 
##        1.097561        1.292683        1.390244        1.341463        1.536585 
##       rmarkdown      dendextend EnrichedHeatmap 
##        1.609756        1.756098        1.463415
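
As a sanity check, the relative values above can be recomputed from the absolute ones. The sketch assumes a = 10 and v1 = 31 (the number of packages required for installing ComplexHeatmap, as printed earlier), and that the first three packages are strong dependencies while the last three are in Suggests, which is what the printed values are consistent with. The helper names are hypothetical.

```r
a  <- 10
v1 <- 31   # packages required for installing ComplexHeatmap

# Strong dependency with absolute heaviness h: v2 = v1 - h
rel_imported <- function(h) (v1 + a) / (v1 - h + a)
# Suggested package with absolute heaviness h: v2 = v1 + h
rel_suggested <- function(h) (v1 + h + a) / (v1 + a)

imp <- round(rel_imported(c(RColorBrewer = 1, clue = 2, IRanges = 5)), 6)
sug <- round(rel_suggested(c(dendsort = 1, rmarkdown = 25, dendextend = 31)), 6)
imp
sug
```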

A fast version of tools::package_dependencies()

The package dependencies are based on the “package database”, which is normally retrieved by available.packages(). The tools package provides a package_dependencies() function that returns the dependency packages. In the following example, we retrieve the dependency packages of ggplot2.

chooseCRANmirror(ind = 1) # choose the mirror from RStudio
db = available.packages()
system.time(p1 <- tools::package_dependencies("ggplot2", db = db, recursive = TRUE)[[1]])
##    user  system elapsed 
##   0.217   0.000   0.217

In pkgndep, we implement a faster version of package_dependencies(). First, the database needs to be reformatted by reformat_db(). The returned object db2 is a reference class object, and its method db2$package_dependencies() can be used to retrieve dependency packages.

db2 = reformat_db(db)
## prepare dependency table...
## prepare reverse dependency table...
db2
## A package database of 21681 packages.
##   - 21681 CRAN / 0 Bioconductor / 0 other packages.
system.time(p2 <- db2$package_dependencies("ggplot2", recursive = TRUE, simplify = TRUE))
##    user  system elapsed 
##   0.002   0.000   0.003

p1 and p2 are actually identical:

identical(sort(p1), sort(p2))
## [1] TRUE

Session info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ComplexHeatmap_2.23.0 pkgndep_1.99.1        knitr_1.49           
## [4] rmarkdown_2.29       
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.8.9      compiler_4.4.2      rjson_0.2.23       
##  [4] crayon_1.5.3        parallel_4.4.2      cluster_2.1.6      
##  [7] jquerylib_0.1.4     IRanges_2.41.1      png_0.1-8          
## [10] yaml_2.3.10         fastmap_1.2.0       R6_2.5.1           
## [13] generics_0.1.3      shape_1.4.6.1       BiocGenerics_0.53.3
## [16] iterators_1.0.14    GetoptLong_1.1.0    circlize_0.4.16    
## [19] maketools_1.3.1     RColorBrewer_1.1-3  bslib_0.8.0        
## [22] rlang_1.1.4         cachem_1.1.0        xfun_0.49          
## [25] sass_0.4.9          sys_3.4.3           GlobalOptions_0.1.2
## [28] doParallel_1.0.17   cli_3.6.3           digest_0.6.37      
## [31] foreach_1.5.2       clue_0.3-66         lifecycle_1.0.4    
## [34] S4Vectors_0.45.2    evaluate_1.0.1      codetools_0.2-20   
## [37] buildtools_1.0.0    hash_2.2.6.3        stats4_4.4.2       
## [40] colorspace_2.1-1    BiocVersion_3.21.1  matrixStats_1.4.1  
## [43] tools_4.4.2         htmltools_0.5.8.1