Title: A Suite of Packages for Analysis of Big Genomic Data
Description: An umbrella package providing a phenotype/genotype data structure and scalable and efficient computational methods for large genomic datasets in combination with several other packages: 'BEDMatrix', 'LinkedMatrix', and 'symDMatrix'.
Authors: Gustavo de los Campos [aut], Alexander Grueneberg [aut, cre], Paulino Perez [ctb], Ana Vazquez [ctb]
Maintainer: Alexander Grueneberg <[email protected]>
License: MIT + file LICENSE
Version: 2.4.1
Built: 2024-11-12 03:40:48 UTC
Source: https://github.com/quantgen/bgdata
Modern genomic datasets are big (large n), high-dimensional (large p), and multi-layered. The challenges that need to be addressed are memory requirements and computational demands. Our goal is to develop software that will enable researchers to carry out analyses with big genomic data within the R environment.
We have identified several approaches to tackle those challenges within R:
File-backed matrices: The data is stored on the hard drive, and users can read it in smaller chunks when they are needed.
Linked arrays: For very large datasets a single file-backed array may not be enough or convenient. A linked array is an array whose content is distributed over multiple file-backed nodes.
Multiple dispatch: Methods are provided so that users can treat these arrays much as if they were regular in-memory arrays.
Multi-level parallelism: Exploit multi-core and multi-node computing.
Inputs: Users can create these arrays from standard formats (e.g., PLINK .bed).
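As a brief illustration of these ideas, the following sketch (not part of the package's own examples) creates a file-backed BEDMatrix from the example .bed file that ships with the package and wraps it into a BGData object; only the requested subsets are ever read into memory.

library(BGData)

# Path to the example data shipped with the package
path <- system.file("extdata", package = "BGData")

# File-backed matrix: the genotypes stay on disk
chr1 <- BEDMatrix::BEDMatrix(paste0(path, "/chr1.bed"))
bg <- as.BGData(chr1)

dim(geno(bg))
geno(bg)[1:5, 1:5]  # reads only a small chunk on demand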
The BGData package is an umbrella package that comprises several packages: BEDMatrix, LinkedMatrix, and symDMatrix. It features scalable and efficient computational methods for large genomic datasets such as genome-wide association studies (GWAS) or genomic relationship matrices (G matrix). It also contains a container class called BGData that holds genotypes, sample information, and variant information.
The extdata folder contains example files that were generated from the 250k SNP and phenotype data in Atwell et al. (2010). Only the first 300 SNPs of chromosomes 1, 2, and 3 were included to keep the size of the example dataset small. PLINK was used to convert the data to .bed and .raw files. FT10 was chosen as a phenotype and is provided as an alternate phenotype file. The file is intentionally shuffled to demonstrate that additional phenotypes are put in the same order as the rest of the phenotypes.
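For orientation, the example files can be listed directly from the installed package (this snippet only assumes that BGData is installed):

list.files(system.file("extdata", package = "BGData"))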
See BEDMatrix-package, LinkedMatrix-package, and symDMatrix-package for an introduction to the respective packages; file-backed-matrices for more information on file-backed matrices; and multi-level-parallelism for more information on multi-level parallelism.
Converts other objects to BGData objects by loading supplementary phenotypes and map files referenced by the object to be used for the sample information and variant information, respectively. Currently supported are BEDMatrix objects, plain or nested in ColumnLinkedMatrix objects.
as.BGData(x, alternatePhenotypeFile = NULL, ...)

## S3 method for class 'BEDMatrix'
as.BGData(x, alternatePhenotypeFile = NULL, ...)

## S3 method for class 'ColumnLinkedMatrix'
as.BGData(x, alternatePhenotypeFile = NULL, ...)

## S3 method for class 'RowLinkedMatrix'
as.BGData(x, alternatePhenotypeFile = NULL, ...)
x |
An object. Currently supported are BEDMatrix objects, plain or nested in ColumnLinkedMatrix objects. |
alternatePhenotypeFile |
Path to an alternate phenotype file. |
... |
Additional arguments to be passed to the read.table or fread call that parses the alternate phenotype file. |
The .ped and .raw formats only allow for a single phenotype. If more phenotypes are required, it is possible to store them in an alternate phenotype file. The path to such a file can be provided with alternatePhenotypeFile and will be merged with the existing sample information. The first and second columns of that file must contain family and within-family IDs, respectively.
For BEDMatrix objects: If a .fam file (which corresponds to the first six columns of a .ped or .raw file) of the same name and in the same directory as the .bed file exists, the sample information will be populated with the data stored in that file. Otherwise, a stub that only contains an IID column populated with the rownames of geno(x) will be generated. The same applies to a .bim file for the variant information.
For ColumnLinkedMatrix objects: See the case for BEDMatrix objects, but only the .fam file of the first node of the LinkedMatrix will be read and used for the sample information, and the .bim files of all nodes will be combined and used for the variant information.
A BGData object.
See readRAW() to convert text files to BGData objects; BGData-class, BEDMatrix-class, and ColumnLinkedMatrix-class for more information on the above mentioned classes; and read.table and fread to learn more about extra arguments that can be passed via the ... argument.
# Path to example data
path <- system.file("extdata", package = "BGData")

# Convert a single BEDMatrix object to a BGData object
chr1 <- BEDMatrix::BEDMatrix(paste0(path, "/chr1.bed"))
bg1 <- as.BGData(chr1)

# Convert multiple BEDMatrix objects in a ColumnLinkedMatrix to a BGData object
chr2 <- BEDMatrix::BEDMatrix(paste0(path, "/chr2.bed"))
chr3 <- BEDMatrix::BEDMatrix(paste0(path, "/chr3.bed"))
clm <- ColumnLinkedMatrix(chr1, chr2, chr3)
bg2 <- as.BGData(clm)

# Load additional (alternate) phenotypes
bg3 <- as.BGData(clm, alternatePhenotypeFile = paste0(path, "/pheno.txt"))
This function constructs a new BGData object.
BGData(geno, pheno = NULL, map = NULL)
geno |
A matrix-like object that holds genotypes; see geno-class for supported types. |
pheno |
A data.frame that holds sample information (optional). |
map |
A data.frame that holds variant information (optional). |
See BGData-class and geno-class for more information on the above mentioned classes.
The BGData class is a container for genotypes, sample information, and variant information. The class is inspired by the .bed/.fam/.bim (binary) and .ped/.fam/.map (text) phenotype/genotype file formats of PLINK. It is used by several functions of this package such as GWAS for performing a genome-wide association study or getG for calculating a genomic relationship matrix.
There are several ways to create an instance of this class:
from arbitrary phenotype/genotype data using the BGData constructor function.
from a .bed file using as.BGData and BEDMatrix.
from a previously saved BGData object using load.BGData.
from multiple files (even a mixture of different file types) using LinkedMatrix.
from a .raw file (or a .ped-like file) using readRAW, readRAW_matrix, or readRAW_big.matrix.
A .ped file can be recoded to a .raw file in PLINK using plink --file myfile --recodeA, or converted to a .bed file using plink --file myfile --make-bed. Conversely, a .bed file can be transformed back to a .ped file using plink --bfile myfile --recode or to a .raw file using plink --bfile myfile --recodeA without losing information.
In the following code snippets, x is a BGData object.
geno(x), geno(x) <- value: Get or set genotypes.
pheno(x), pheno(x) <- value: Get or set sample information.
map(x), map(x) <- value: Get or set variant information.
See BGData, as.BGData, load.BGData, and readRAW to create BGData objects, and LinkedMatrix-class and BEDMatrix-class for more information on the above mentioned classes.
X <- matrix(data = rnorm(100), nrow = 10, ncol = 10)
Y <- data.frame(y = runif(10))
MAP <- data.frame(means = colMeans(X), freqNA = colMeans(is.na(X)))

DATA <- BGData(geno = X, pheno = Y, map = MAP)

dim(geno(DATA))
head(pheno(DATA))
head(map(DATA))
Similar to apply, but designed for file-backed matrices. The function brings chunks of an object into physical memory by taking subsets, and applies a function to either the rows or the columns of the chunks using an optimized version of apply. If nCores is greater than 1, the function will be applied in parallel using mclapply. In that case, the subsets of the object are taken on the worker processes.
chunkedApply(X, MARGIN, FUN, i = seq_len(nrow(X)), j = seq_len(ncol(X)), chunkSize = 5000L, nCores = getOption("mc.cores", 2L), verbose = FALSE, ...)
X |
A file-backed matrix, typically the genotypes of a BGData object. |
MARGIN |
The subscripts which the function will be applied over. 1 indicates rows, 2 indicates columns. |
FUN |
The function to be applied. |
i |
Indicates which rows of X should be used. Can be integer, boolean, or character. By default, all rows are used. |
j |
Indicates which columns of X should be used. Can be integer, boolean, or character. By default, all columns are used. |
chunkSize |
The number of rows or columns of X that are brought into physical memory for processing per core. If NULL, all elements in i or j are used. Defaults to 5000. |
nCores |
The number of cores (passed to mclapply). Defaults to getOption("mc.cores", 2L). |
verbose |
Whether progress updates will be posted. Defaults to FALSE. |
... |
Additional arguments to be passed to the |
See file-backed-matrices for more information on file-backed matrices, multi-level-parallelism for more information on multi-level parallelism, and BGData-class for more information on the BGData class.
# Restrict number of cores to 1 on Windows
if (.Platform$OS.type == "windows") {
    options(mc.cores = 1)
}

# Load example data
bg <- BGData:::loadExample()

# Compute standard deviation of columns
chunkedApply(X = geno(bg), MARGIN = 2, FUN = sd)
Similar to lapply, but designed for file-backed matrices. The function brings chunks of an object into physical memory by taking subsets, and applies a function to them. If nCores is greater than 1, the function will be applied in parallel using mclapply. In that case, the subsets of the object are taken on the worker processes.
chunkedMap(X, FUN, i = seq_len(nrow(X)), j = seq_len(ncol(X)), chunkBy = 2L, chunkSize = 5000L, nCores = getOption("mc.cores", 2L), verbose = FALSE, ...)
X |
A file-backed matrix, typically the genotypes of a BGData object. |
FUN |
The function to be applied on each chunk. |
i |
Indicates which rows of X should be used. Can be integer, boolean, or character. By default, all rows are used. |
j |
Indicates which columns of X should be used. Can be integer, boolean, or character. By default, all columns are used. |
chunkBy |
Whether to extract chunks by rows (1) or by columns (2). Defaults to columns (2). |
chunkSize |
The number of rows or columns of X that are brought into physical memory for processing per core. If NULL, all elements in i or j are used. Defaults to 5000. |
nCores |
The number of cores (passed to mclapply). Defaults to getOption("mc.cores", 2L). |
verbose |
Whether progress updates will be posted. Defaults to FALSE. |
... |
Additional arguments to be passed to the
|
See file-backed-matrices for more information on file-backed matrices, multi-level-parallelism for more information on multi-level parallelism, and BGData-class for more information on the BGData class.
# Restrict number of cores to 1 on Windows
if (.Platform$OS.type == "windows") {
    options(mc.cores = 1)
}

# Load example data
bg <- BGData:::loadExample()

# Compute column sums of each chunk
chunkedMap(X = geno(bg), FUN = colSums)
Functions with the chunkSize parameter work best with file-backed matrices such as BEDMatrix objects. To avoid loading the whole, potentially very large matrix into memory, these functions will load chunks of the file-backed matrix into memory and perform the operations on one chunk at a time. The size of the chunks is determined by the chunkSize parameter. Care must be taken to not set chunkSize too high to avoid memory shortage, particularly when combined with parallel computing.
See BEDMatrix-class as an example of a file-backed matrix.
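As a small illustration of this trade-off (a sketch using the package's example data; the chunkSize value is arbitrary), a smaller chunkSize lowers the peak memory per core while producing the same result:

# Load example data
bg <- BGData:::loadExample()

# Same computation with the default and with a much smaller chunk size
sd1 <- chunkedApply(X = geno(bg), MARGIN = 2, FUN = sd, nCores = 1L)
sd2 <- chunkedApply(X = geno(bg), MARGIN = 2, FUN = sd, chunkSize = 100L, nCores = 1L)
all.equal(sd1, sd2)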
Find related individuals in a relationship matrix.
findRelated(x, ...)

## S3 method for class 'matrix'
findRelated(x, cutoff = 0.03, ...)

## S3 method for class 'symDMatrix'
findRelated(x, cutoff = 0.03, verbose = FALSE, ...)
x |
A matrix-like object with dimnames. |
... |
Additional arguments for methods. |
cutoff |
The cutoff between 0 and 1 for related individuals to be included in the output. Defaults to 0.03. |
verbose |
Whether progress updates will be posted. Defaults to FALSE. |
A vector of names or indices of related individuals.
matrix: Find related individuals in matrices.
symDMatrix: Find related individuals in symDMatrix objects.
# Load example data
bg <- BGData:::loadExample()

G <- getG(geno(bg))

findRelated(G)
Performs forward regression of y on the columns of X. Predictors are added, one at a time, each time adding the one that produces the largest reduction in the residual sum of squares (RSS). The function returns estimates and summaries for the entire forward search. This function performs a search similar to that of step(, direction = 'forward'); however, FWD() is optimized for computational speed for linear models with very large sample sizes. To achieve fast computations, the software first computes the sufficient statistics X'X and X'y. At each step, the function first finds the predictor that produces the largest reduction in the sum of squares (this can be derived from X'X, X'y, and the current solution of effects), and then updates the estimates of effects for the resulting model using Gauss-Seidel iterations performed on the linear system (X'X)b = X'y, iterating only over the elements of b that are active in the model.
FWD(y, X, df = 20, tol = 1e-7, maxIter = 1000, centerImpute = TRUE, verbose = TRUE)
y |
The response vector (numeric nx1). |
X |
An (nxp) numeric matrix. Columns are the features (aka predictors)
considered in the forward search. The rows of X must match the entries of y. |
df |
Defines the maximum number of predictors to be included in the model.
For complete forward search, set |
tol |
A tolerance parameter to control when to stop the Gauss Seidel algorithm. |
maxIter |
The maximum number of iterations for the Gauss Seidel algorithm (only used when the algorithm is not stopped by the tolerance parameter). |
centerImpute |
Whether to center the columns of X and impute missing values. Defaults to TRUE. |
verbose |
Use |
A list with two entries:
B: A p x (df + 1) matrix with the estimated effects for each predictor (rows) at each step of the forward search (columns).
path: A data frame providing the order in which variables were added to the model (variable) and statistics for each step of the forward search (RSS, LogLik, VARE (the residual variance), DF, AIC, and BIC).
A set of generic functions for getting/setting the genotypes, sample information, and variant information.
geno(x)
geno(x) <- value

pheno(x)
pheno(x) <- value

map(x)
map(x) <- value
x |
The object from/on which to get/set genotypes, sample information, and
variant information. Typically a BGData object. |
value |
For geno(x) <- value, typically a matrix-like object; for pheno(x) <- value and map(x) <- value, typically a data.frame. |
# Load example data
bg <- BGData:::loadExample()

# Access genotypes
geno(bg)

# Access sample information
pheno(bg)

# Access variant information
map(bg)
geno is a class union of several matrix-like types, many of them suitable for very large datasets. Currently supported are LinkedMatrix, BEDMatrix, big.matrix, ff_matrix, and matrix.
See LinkedMatrix-class, BEDMatrix-class, big.matrix-class, ff, and matrix for more information on each matrix-like type, and BGData-class for more information on the BGData class, in particular its geno accessor that accepts geno objects.
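As a brief sketch contrasting two supported geno types (the .bed path and the as.BGData() call follow the examples elsewhere in this documentation):

path <- system.file("extdata", package = "BGData")

# A plain matrix as geno
X <- matrix(rnorm(20), nrow = 4)
bg1 <- BGData(geno = X, pheno = data.frame(y = rnorm(4)), map = data.frame(mu = colMeans(X)))
class(geno(bg1))

# A file-backed BEDMatrix as geno
bg2 <- as.BGData(BEDMatrix::BEDMatrix(paste0(path, "/chr1.bed")))
class(geno(bg2))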
Computes a positive semi-definite symmetric genomic relationship matrix G = XX', offering options for centering and scaling the columns of X beforehand.
getG(X, center = TRUE, scale = TRUE, impute = TRUE, scaleG = TRUE, minVar = 1e-05, i = seq_len(nrow(X)), j = seq_len(ncol(X)), i2 = NULL, chunkSize = 5000L, nCores = getOption("mc.cores", 2L), verbose = FALSE)
X |
A matrix-like object, typically the genotypes of a BGData object. |
center |
Either a logical value or a numeric vector of length equal to the
number of columns of |
scale |
Either a logical value or a numeric vector of length equal to the
number of columns of |
impute |
Indicates whether missing values should be imputed. Defaults to
|
scaleG |
Whether XX' should be scaled. Defaults to |
minVar |
Columns with variance lower than this value will not be used in the
computation (only if |
i |
Indicates which rows of X should be used. Can be integer, boolean, or character. By default, all rows are used. |
j |
Indicates which columns of X should be used. Can be integer, boolean, or character. By default, all columns are used. |
i2 |
Indicates which rows should be used to compute a block of the genomic
relationship matrix. Will compute XY' where X is determined by |
chunkSize |
The number of columns of X that are brought into physical memory for processing per core. If NULL, all columns are used. Defaults to 5000. |
nCores |
The number of cores (passed to mclapply). Defaults to getOption("mc.cores", 2L). |
verbose |
Whether progress updates will be posted. Defaults to FALSE. |
If center = FALSE, scale = FALSE, and scaleG = FALSE, getG produces the same outcome as tcrossprod.
A positive semi-definite symmetric numeric matrix.
See file-backed-matrices for more information on file-backed matrices, multi-level-parallelism for more information on multi-level parallelism, and BGData-class for more information on the BGData class.
# Restrict number of cores to 1 on Windows
if (.Platform$OS.type == "windows") {
    options(mc.cores = 1)
}

# Load example data
bg <- BGData:::loadExample()

# Compute a scaled genomic relationship matrix from centered and scaled
# genotypes
g1 <- getG(X = geno(bg))

# Disable scaling of G
g2 <- getG(X = geno(bg), scaleG = FALSE)

# Disable centering of genotypes
g3 <- getG(X = geno(bg), center = FALSE)

# Disable scaling of genotypes
g4 <- getG(X = geno(bg), scale = FALSE)

# Provide own scales
scales <- chunkedApply(X = geno(bg), MARGIN = 2, FUN = sd)
g4 <- getG(X = geno(bg), scale = scales)

# Provide own centers
centers <- chunkedApply(X = geno(bg), MARGIN = 2, FUN = mean)
g5 <- getG(X = geno(bg), center = centers)

# Only use the first 50 individuals (useful to account for population structure)
g6 <- getG(X = geno(bg), i = 1:50)

# Only use the first 100 markers (useful to ignore some markers)
g7 <- getG(X = geno(bg), j = 1:100)

# Compute unscaled G matrix by combining blocks of $XX_{i2}'$ where $X_{i2}$ is
# a horizontal partition of X. This is useful for distributed computing as each
# block can be computed in parallel. Centers and scales need to be precomputed.
block1 <- getG(X = geno(bg), i2 = 1:100, center = centers, scale = scales)
block2 <- getG(X = geno(bg), i2 = 101:199, center = centers, scale = scales)
g8 <- cbind(block1, block2)

# Compute unscaled G matrix by combining blocks of $X_{i}X_{i2}'$ where both
# $X_{i}$ and $X_{i2}$ are horizontal partitions of X. Similarly to the example
# above, this is useful for distributed computing, in particular to compute
# very large G matrices. Centers and scales need to be precomputed. This
# approach is similar to the one taken by the symDMatrix package, but the
# symDMatrix package adds memory-mapped blocks, only stores the upper side of
# the triangular matrix, and provides a type that allows for indexing as if the
# full G matrix is in memory.
block11 <- getG(X = geno(bg), i = 1:100, i2 = 1:100, center = centers, scale = scales)
block12 <- getG(X = geno(bg), i = 1:100, i2 = 101:199, center = centers, scale = scales)
block21 <- getG(X = geno(bg), i = 101:199, i2 = 1:100, center = centers, scale = scales)
block22 <- getG(X = geno(bg), i = 101:199, i2 = 101:199, center = centers, scale = scales)
g9 <- rbind(
    cbind(block11, block12),
    cbind(block21, block22)
)
Computes a positive semi-definite symmetric genomic relationship matrix G = XX' as a symDMatrix object, offering options for centering and scaling the columns of X beforehand.
getG_symDMatrix(X, center = TRUE, scale = TRUE, impute = TRUE, scaleG = TRUE, minVar = 1e-05, blockSize = 5000L, folderOut = paste0("symDMatrix_", randomString()), vmode = "double", i = seq_len(nrow(X)), j = seq_len(ncol(X)), chunkSize = 5000L, nCores = getOption("mc.cores", 2L), verbose = FALSE)
X |
A matrix-like object, typically the genotypes of a BGData object. |
center |
Either a logical value or a numeric vector of length equal to the
number of columns of |
scale |
Either a logical value or a numeric vector of length equal to the
number of columns of |
impute |
Indicates whether missing values should be imputed. Defaults to
|
scaleG |
Whether XX' should be scaled. Defaults to TRUE. |
minVar |
Columns with variance lower than this value will not be used in the
computation (only if |
blockSize |
The number of rows and columns of each block. If |
folderOut |
The path to the folder where to save the |
vmode |
vmode of |
i |
Indicates which rows of |
j |
Indicates which columns of |
chunkSize |
The number of columns of X that are brought into physical memory for processing per core. If NULL, all columns are used. Defaults to 5000. |
nCores |
The number of cores (passed to mclapply). Defaults to getOption("mc.cores", 2L). |
verbose |
Whether progress updates will be posted. Defaults to FALSE. |
Even very large genomic relationship matrices are supported by partitioning X into blocks and calling getG on these blocks. This function performs the block computations sequentially, which may be slow. In an HPC environment, performance can be improved by manually distributing these operations to different nodes.
A symDMatrix object.
See multi-level-parallelism for more information on multi-level parallelism, symDMatrix-class and BGData-class for more information on these classes, and getG to learn more about the underlying method.
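As a hedged sketch (not part of the package's own examples), the call below computes a block-wise G matrix on the example data; the blockSize, folderOut, and nCores values are illustrative:

# Load example data
bg <- BGData:::loadExample()

folder <- file.path(tempdir(), "G_symDMatrix")
G <- getG_symDMatrix(X = geno(bg), blockSize = 100L, folderOut = folder, nCores = 1L)

dim(G)
G[1:3, 1:3]  # indexed as if the full matrix were in memory

unlink(folder, recursive = TRUE)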
Implements single-marker regressions. The regression model includes all the covariates specified in the right-hand side of the formula plus one column of the genotypes at a time. The data for the association tests is obtained from a BGData object.
GWAS(formula, data, method = "lsfit", i = seq_len(nrow(geno(data))), j = seq_len(ncol(geno(data))), chunkSize = 5000L, nCores = getOption("mc.cores", 2L), verbose = FALSE, ...)
formula |
The formula for the GWAS model without the variant, e.g. y ~ 1 or y ~ factor(sex) + age. The variables included in the formula must be column names in the sample information of the BGData object. |
data |
A BGData object. |
method |
The regression method to be used. Currently, the following methods are
implemented: rayOLS, lsfit, lm, lm.fit, glm, lmer, and SKAT. Defaults to lsfit. |
i |
Indicates which rows of the genotypes should be used. Can be integer, boolean, or character. By default, all rows are used. |
j |
Indicates which columns of the genotypes should be used. Can be integer, boolean, or character. By default, all columns are used. |
chunkSize |
The number of columns of the genotypes that are brought into physical memory for processing per core. If NULL, all columns indicated by j are used. Defaults to 5000. |
nCores |
The number of cores (passed to mclapply). Defaults to getOption("mc.cores", 2L). |
verbose |
Whether progress updates will be posted. Defaults to FALSE. |
... |
Additional arguments for chunkedApply and regression method. |
The rayOLS method is a regression through the origin that can only be used with a y ~ 1 formula, i.e., it only allows for one quantitative response variable y and one variant at a time as an explanatory variable (the variant is not included in the formula, hence 1 is used as a dummy). If covariates are needed, consider preadjusting y. Among the provided methods, it is by far the fastest.
Some regression methods may require the data to not contain columns with variance 0 or too many missing values. We suggest running summarize to detect variants that do not clear the desired minor-allele frequency and rate of missing genotype calls, and filtering these variants out using the j parameter of the GWAS function (see example below).
The same matrix that would be returned by coef(summary(model)).
See file-backed-matrices for more information on file-backed matrices, multi-level-parallelism for more information on multi-level parallelism, BGData-class for more information on the BGData class, and lsfit, lm, lm.fit, glm, lmer, and SKAT for more information on regression methods.
# Restrict number of cores to 1 on Windows
if (.Platform$OS.type == "windows") {
    options(mc.cores = 1)
}

# Load example data
bg <- BGData:::loadExample()

# Detect variants that do not pass MAF and missingness thresholds
summaries <- summarize(geno(bg))
maf <- ifelse(summaries$allele_freq > 0.5, 1 - summaries$allele_freq, summaries$allele_freq)
exclusions <- maf < 0.01 | summaries$freq_na > 0.05

# Perform a single marker regression
res1 <- GWAS(formula = FT10 ~ 1, data = bg, j = !exclusions)

# Draw a Manhattan plot
plot(-log10(res1[, 4]))

# Use lm instead of lsfit (the default)
res2 <- GWAS(formula = FT10 ~ 1, data = bg, method = "lm", j = !exclusions)

# Use glm instead of lsfit (the default)
y <- pheno(bg)$FT10
pheno(bg)$FT10.01 <- y > quantile(y, 0.8, na.rm = TRUE)
res3 <- GWAS(formula = FT10.01 ~ 1, data = bg, method = "glm", j = !exclusions)

# Perform a single marker regression on the first 50 markers (useful for
# distributed computing)
res4 <- GWAS(formula = FT10 ~ 1, data = bg, j = 1:50)
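As a hedged sketch of the preadjustment suggested above for rayOLS (the covariate used here is a random stand-in, purely for illustration):

# Load example data
bg <- BGData:::loadExample()

# Regress the phenotype on the covariate and carry the residuals forward
covariate <- rnorm(nrow(geno(bg)))  # stand-in for a real covariate
pheno(bg)$FT10_adj <- residuals(lm(pheno(bg)$FT10 ~ covariate, na.action = na.exclude))

# Single-marker regressions on the preadjusted phenotype
res5 <- GWAS(formula = FT10_adj ~ 1, data = bg, method = "rayOLS", nCores = 1L)
head(res5)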
This function is similar to load, but also initializes the different types of objects that can be used as genotypes in a BGData object. Currently supported are ff_matrix, big.matrix, and BEDMatrix objects. If the object is of type LinkedMatrix, all nodes will be initialized with their appropriate method.
load.BGData(file, envir = parent.frame())
file |
The name of the .RData file to be loaded. |
envir |
The environment where to load the data. |
See BGData-class, ff, big.matrix-class, BEDMatrix-class, and LinkedMatrix-class for more information on the above mentioned classes.
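A hedged sketch of saving and reloading a BGData object whose genotypes are file-backed; the readRAW() call and file names follow the readRAW documentation below, and the name under which the object was saved can be inspected with ls():

path <- system.file("extdata", package = "BGData")
folder <- file.path(tempdir(), "BGData_chr1")

# readRAW() stores the genotypes on disk and saves BGData.RData in folderOut
bg <- readRAW(fileIn = paste0(path, "/chr1.raw"), folderOut = folder)
rm(bg)

# Reload and re-initialize the file-backed nodes
load.BGData(file.path(folder, "BGData.RData"))
ls()

unlink(folder, recursive = TRUE)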
Functions with the nCores, i, and j parameters provide capabilities for both parallel and distributed computing.
For parallel computing, nCores determines the number of cores the code is run on. Memory usage can be an issue for higher values of nCores as R is not particularly memory-efficient. As a rule of thumb, at least around (nCores * object_size(chunk)) + object_size(result) MB of total memory will be needed for operations on file-backed matrices, not including potential copies of your data that might be created (for example, lsfit runs cbind(1, X)). i and j can be used to include or exclude certain rows or columns. Internally, the mclapply function is used, and therefore parallel computing will not work on Windows machines.
For distributed computing, i and j determine the subset of the input matrix that the code runs on. In an HPC environment, this can be used not just to include or exclude certain rows or columns, but also to partition the task among many nodes rather than cores. Scheduler-specific code and code to aggregate the results need to be written by the user. It is recommended to set nCores to 1 as nodes are often cheaper than cores.
See mclapply to learn more about the function used to implement parallel computing, and detectCores to detect the number of available cores.
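As a hedged sketch of the distributed pattern described above, each "node" computes summaries on its own column range (simulated here with a plain loop) and a user-written aggregation step combines the pieces:

# Load example data
bg <- BGData:::loadExample()

# Partition the columns into three ranges; on a cluster, each range would be
# handled by a separate job submitted to the scheduler
p <- ncol(geno(bg))
ranges <- split(seq_len(p), cut(seq_len(p), breaks = 3, labels = FALSE))

# Each job runs with nCores = 1, as recommended above
pieces <- lapply(ranges, function(j) summarize(X = geno(bg), j = j, nCores = 1L))

# User-written aggregation step
combined <- do.call(rbind, pieces)
nrow(combined) == p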
This is a simplified version of merge useful for merging additional data into a BGData object while keeping the order of the data in the BGData object.
orderedMerge(x, y, by = c(1L, 2L))
x |
Data frame |
y |
Data frame |
by |
Specifications of the columns used for merging. Defaults to the first two columns of the data frame, which traditionally has the family ID and the individual ID. |
The merged data frame.
See BGData-class for more information on the BGData class.
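A small illustrative sketch with made-up IDs (assuming orderedMerge is exported as shown in the usage above; column names and values are hypothetical):

x <- data.frame(FID = c("A", "A", "B"), IID = c("1", "2", "1"), y1 = c(1.0, 2.0, 3.0))
y <- data.frame(FID = c("B", "A"), IID = c("1", "1"), y2 = c(10, 20))

# Rows of the result follow the order of x; y2 is NA where no match exists
orderedMerge(x, y)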
A faster version of scale with a similar interface that also allows for imputation. The main difference is that this version scales by the standard deviation regardless of whether centering is enabled or not. If centering is enabled, missing values are imputed by 0; otherwise, by the mean of the column that contains the value.
preprocess(X, center = FALSE, scale = FALSE, impute = FALSE, nCores = getOption("mc.cores", 2L))
X |
A numeric matrix. |
center |
Either a logical value or numeric vector of length equal to the number
of columns of X. Defaults to FALSE. |
scale |
Either a logical value or numeric vector of length equal to the number
of columns of X. Defaults to FALSE. |
impute |
Indicates whether missing values should be imputed. Defaults to FALSE. |
nCores |
The number of cores (passed to mclapply). Defaults to getOption("mc.cores", 2L). |
The centered, scaled, and imputed matrix.
See scale, which this function tries to improve upon.
# Load example data
bg <- BGData:::loadExample()

# Center and scale genotypes
W <- preprocess(as.matrix(geno(bg)), center = TRUE, scale = TRUE)
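A hedged sketch of the imputation behavior described above: with centering disabled, a missing value is replaced by the mean of its column.

X <- matrix(rnorm(20), nrow = 5)
X[2, 3] <- NA

W <- preprocess(X, center = FALSE, scale = FALSE, impute = TRUE, nCores = 1L)
all.equal(W[2, 3], mean(X[, 3], na.rm = TRUE))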
Creates a BGData object from a .raw file (generated with --recodeA in PLINK). Other text-based file formats are supported as well by tweaking some of the parameters, as long as the records of individuals are in rows, and phenotypes, covariates, and markers are in columns.
readRAW(fileIn, header = TRUE, dataType = integer(), n = NULL, p = NULL,
    sep = "", na.strings = "NA", nColSkip = 6L, idCol = c(1L, 2L),
    nNodes = NULL, linked.by = "rows",
    folderOut = paste0("BGData_", sub("\\.[[:alnum:]]+$", "", basename(fileIn))),
    outputType = "byte", dimorder = if (linked.by == "rows") 2L:1L else 1L:2L,
    verbose = FALSE)

readRAW_matrix(fileIn, header = TRUE, dataType = integer(), n = NULL, p = NULL,
    sep = "", na.strings = "NA", nColSkip = 6L, idCol = c(1L, 2L),
    verbose = FALSE)

readRAW_big.matrix(fileIn, header = TRUE, dataType = integer(), n = NULL,
    p = NULL, sep = "", na.strings = "NA", nColSkip = 6L, idCol = c(1L, 2L),
    folderOut = paste0("BGData_", sub("\\.[[:alnum:]]+$", "", basename(fileIn))),
    outputType = "char", verbose = FALSE)
fileIn |
The path to the plaintext file. |
header |
Whether fileIn contains a header row. Defaults to TRUE. |
dataType |
The coding type of genotypes in |
n |
The number of individuals. Auto-detect if NULL. |
p |
The number of markers. Auto-detect if NULL. |
sep |
The field separator character. Values on each line of the file are
separated by this character. If |
na.strings |
The character string used in the plaintext file to denote missing
value. Defaults to "NA". |
nColSkip |
The number of columns to be skipped to reach the genotype information
in the file. Defaults to 6. |
idCol |
The index of the ID column. If more than one index is given, both
columns will be concatenated with "_". Defaults to c(1, 2). |
nNodes |
The number of nodes to create. Auto-detect if NULL. |
linked.by |
If |
folderOut |
The path to the folder where to save the binary files. Defaults to the
name of the input file ( |
outputType |
The |
dimorder |
The physical layout of the underlying |
verbose |
Whether progress updates will be posted. Defaults to FALSE. |
The data included in the first couple of columns (up to nColSkip) is used to populate the sample information of a BGData object, and the remaining columns are used to fill the genotypes. If the first row contains a header (header = TRUE), data in this row is used to determine the column names for sample information and genotypes.
The genotypes can take several forms, depending on the function that is called (readRAW, readRAW_matrix, or readRAW_big.matrix). The following sections illustrate each function in detail.
For readRAW, genotypes are stored in a LinkedMatrix object where each node is an ff instance. Multiple ff files are used because the array size in ff is limited to the largest integer that can be represented on the system (.Machine$integer.max), and for genetic data this limitation is often exceeded. The LinkedMatrix package makes it possible to link several ff files together by columns or by rows and treat them similarly to a single matrix. By default a ColumnLinkedMatrix is used for the genotypes, but the user can modify this using the linked.by argument. The number of nodes to generate is either specified by the user using the nNodes argument or determined internally so that each ff object has a number of cells that is smaller than .Machine$integer.max / 1.2. A folder (see folderOut) that contains the binary flat files (named geno_*.bin) and an external representation of the BGData object in BGData.RData is created.
For readRAW_matrix, genotypes are stored in a regular matrix object. Therefore, this function will only work if the .raw file is small enough to fit into memory.
For readRAW_big.matrix, genotypes are stored in a file-backed big.matrix object. A folder (see folderOut) that contains the binary flat file (named BGData.bin), a descriptor file (named BGData.desc), and an external representation of the BGData object in BGData.RData is created.
To reload a BGData object, it is recommended to use the load.BGData function instead of the load function, as load does not initialize ff objects or attach big.matrix objects.
See load.BGData() to load a previously saved BGData object, as.BGData() to create BGData objects from non-text files (e.g., .bed files), and BGData-class, ColumnLinkedMatrix-class, RowLinkedMatrix-class, big.matrix-class, and ff for more information on the above mentioned classes.
# Path to example data
path <- system.file("extdata", package = "BGData")

# Convert RAW files of chromosome 1 to a BGData object
bg <- readRAW(fileIn = paste0(path, "/chr1.raw"))

unlink("BGData_chr1", recursive = TRUE)
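A hedged sketch of the in-memory variant described above, using the same example .raw file (small enough to fit comfortably in memory):

path <- system.file("extdata", package = "BGData")
bg <- readRAW_matrix(fileIn = paste0(path, "/chr1.raw"))
class(geno(bg))  # a plain matrix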
Given a summary statistic and a threshold, this function computes the number of non-overlapping segments, each defined as a discovery (i.e., statistic[i] <= threshold) +/- a gap, in the same units as bp (often base-pair position).
segments(statistic, chr, bp, threshold, gap, trim = FALSE, verbose = FALSE)
statistic |
A statistic (e.g., BFDR or p-values). |
chr |
A vector containing the chromosome for each value of statistic. |
bp |
A vector containing the base-pair positions for each value of statistic. |
threshold |
The threshold to determine 'significance' (e.g., |
gap |
1/2 of the length of the desired segments. |
trim |
Whether to collapse segments that were artificially inflated by the gap. Defaults to FALSE. |
verbose |
Whether progress updates will be posted. Defaults to FALSE. |
A data frame containing the following information:
chr |
Chromosome |
start |
Index where segment starts within statistic. |
end |
Index where segment ends within statistic. |
length |
Length of segment. |
bpStart |
Base-pair position where segment starts. |
bpEnd |
Base-pair position where segment ends. |
bpLength |
Length of segment in base-pair positions. |
minValue |
Smallest value of statistic within the segment. |
minValuePos |
Position of variant with the smallest value of statistic within the segment. |
library(BGData)

# Load example data
bg <- BGData:::loadExample()

# Perform GWAS
pValues <- GWAS(
    formula = FT10 ~ 1,
    data = bg,
    method = "rayOLS"
)

# Determine segments within +/- 1MB from a significant variant
segments <- segments(
    statistic = pValues[, 4],
    chr = map(bg)$chromosome,
    bp = map(bg)$base_pair_position,
    threshold = 1e-5,
    gap = 1e6,
    trim = FALSE,
    verbose = FALSE
)
Computes the frequency of missing values, the (minor) allele frequency, and the standard deviation of each column of X.
summarize(X, i = seq_len(nrow(X)), j = seq_len(ncol(X)), chunkSize = 5000L, nCores = getOption("mc.cores", 2L), verbose = FALSE)
X |
A matrix-like object, typically the genotypes of a BGData object. |
i |
Indicates which rows of X should be used. Can be integer, boolean, or character. By default, all rows are used. |
j |
Indicates which columns of X should be used. Can be integer, boolean, or character. By default, all columns are used. |
chunkSize |
The number of columns of X that are brought into physical memory for processing per core. If NULL, all columns are used. Defaults to 5000. |
nCores |
The number of cores (passed to mclapply). Defaults to getOption("mc.cores", 2L). |
verbose |
Whether progress updates will be posted. Defaults to FALSE. |
A data.frame with three columns: freq_na for frequencies of missing values, allele_freq for allele frequencies of the counted allele, and sd for standard deviations.
See file-backed-matrices for more information on file-backed matrices, multi-level-parallelism for more information on multi-level parallelism, and BGData-class for more information on the BGData class.
# Restrict number of cores to 1 on Windows
if (.Platform$OS.type == "windows") {
    options(mc.cores = 1)
}

# Load example data
bg <- BGData:::loadExample()

# Summarize the whole dataset
sum1 <- summarize(X = geno(bg))

# Summarize the first 50 individuals
sum2 <- summarize(X = geno(bg), i = 1:50)

# Summarize the first 100 markers (useful for distributed computing)
sum3 <- summarize(X = geno(bg), j = 1:100)

# Summarize the first 50 individuals on the first 100 markers
sum4 <- summarize(X = geno(bg), i = 1:50, j = 1:100)

# Summarize by names
sum5 <- summarize(X = geno(bg), j = c("snp81233_C", "snp81234_C", "snp81235_T"))

# Convert to minor allele frequencies (useful if the counted alleles are not
# the minor alleles)
maf <- ifelse(sum1$allele_freq > 0.5, 1 - sum1$allele_freq, sum1$allele_freq)