Package 'hdImpute' reference manual

Title:	A Batch Process for High Dimensional Imputation
Description:	A correlation-based batch process for fast, accurate imputation for high dimensional missing data problems via chained random forests. See Waggoner (2023) <doi:10.1007/s00180-023-01325-9> for more on 'hdImpute', Stekhoven and Bühlmann (2012) <doi:10.1093/bioinformatics/btr597> for more on 'missForest', and Mayer (2022) <https://github.com/mayer79/missRanger> for more on 'missRanger'.
Authors:	Philip Waggoner [aut, cre]
Maintainer:	Philip Waggoner <[email protected]>
License:	MIT + file LICENSE
Version:	0.2.1
Built:	2024-10-27 05:20:37 UTC
Source:	https://github.com/pdwaggoner/hdimpute

Find features with (specified amount of) missingness

Description

Find features with (specified amount of) missingness

Usage

check_feature_na(data, threshold)
check_feature_na(data, threshold)

Arguments

`data`	A data frame or tibble.
`threshold`	Missingness threshold in a given column/feature as a proportion bounded between 0 and 1. Default set to sensitive level at 1e-04.

Value

A vector of column/feature names that contain missingness greater than threshold.

Examples

## Not run: 
check_feature_na(data = any_data_frame, threshold = 1e-04)

## End(Not run)
## Not run: 
check_feature_na(data = any_data_frame, threshold = 1e-04)

## End(Not run)

Find number of and which rows contain any missingness

Description

Find number of and which rows contain any missingness

Usage

check_row_na(data, which)
check_row_na(data, which)

Arguments

`data`	A data frame or tibble.
`which`	Logical. Should a list be returned with the row numbers corresponding to each row with missingness? Default set to FALSE.

Value

Either an integer value corresponding to the number of rows in data with any missingness (if which = FALSE), or a tibble containing: 1) number of rows in data with any missingness, and 2) a list of which rows/row numbers contain missingness (if which = TRUE).

Examples

## Not run: 
check_row_na(data = any_data_frame, which = FALSE)

## End(Not run)
## Not run: 
check_row_na(data = any_data_frame, which = FALSE)

## End(Not run)

High dimensional imputation via batch processed chained random forests Build correlation matrix

Description

High dimensional imputation via batch processed chained random forests Build correlation matrix

Usage

feature_cor(data, return_cor)
feature_cor(data, return_cor)

Arguments

`data`	A data frame or tibble.
`return_cor`	Logical. Should the correlation matrix be printed? Default set to FALSE.

Value

A cross-feature correlation matrix

References

Waggoner, P. D. (2023). A batch process for high dimensional imputation. Computational Statistics, 1-22. doi: <10.1007/s00180-023-01325-9>

van Buuren S, Groothuis-Oudshoorn K (2011). "mice: Multivariate Imputation by Chained Equations in R." Journal of Statistical Software, 45(3), 1-67. doi: <10.18637/jss.v045.i03>

Examples

## Not run: 
feature_cor(data = data, return_cor = FALSE)

## End(Not run)
## Not run: 
feature_cor(data = data, return_cor = FALSE)

## End(Not run)

Flatten and arrange cor matrix to be df

Description

Flatten and arrange cor matrix to be df

Usage

flatten_mat(cor_mat, return_mat)
flatten_mat(cor_mat, return_mat)

Arguments

`cor_mat`	A correlation matrix output from running `feature_cor()`
`return_mat`	Logical. Should the flattened matrix be printed? Default set to FALSE.

Value

A vector of correlation-based ranked features

Examples

## Not run: 
flatten_mat(cor_mat = cor_mat, return_mat = FALSE)

## End(Not run)
## Not run: 
flatten_mat(cor_mat = cor_mat, return_mat = FALSE)

## End(Not run)

Complete hdImpute process: correlation matrix, flatten, rank, create batches, impute, join

Description

Complete hdImpute process: correlation matrix, flatten, rank, create batches, impute, join

Usage

hdImpute(data, batch, pmm_k, n_trees, seed, save)
hdImpute(data, batch, pmm_k, n_trees, seed, save)

Arguments

`data`	Original data frame or tibble (with missing values)
`batch`	Numeric. Batch size.
`pmm_k`	Integer. Number of neighbors considered in imputation. Default set at 5.
`n_trees`	Integer. Number of trees used in imputation. Default set at 15.
`seed`	Integer. Seed to be set for reproducibility.
`save`	Should the list of individual imputed batches be saved as .rds file to working directory? Default set to FALSE.

Details

Step 1. group data by dividing the row_number() by batch size (batch, number of batches set by user) using integer division. Step 2. pass through group_split() to return a list. Step 3. impute each batch individually and time. Step 4. generate completed (unlisted/joined) imputed data frame

Value

A completed, imputed data set

References

Waggoner, P. D. (2023). A batch process for high dimensional imputation. Computational Statistics, 1-22. doi: <10.1007/s00180-023-01325-9>

Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. doi: <10.1093/bioinformatics/btr597>

Examples

## Not run: 
impute_batches(data = data,
batch = 2,  pmm_k = 5, n_trees = 15,
seed = 123, save = FALSE)

## End(Not run)
## Not run: 
impute_batches(data = data,
batch = 2,  pmm_k = 5, n_trees = 15,
seed = 123, save = FALSE)

## End(Not run)

Impute batches and return completed data frame

Description

Impute batches and return completed data frame

Usage

impute_batches(data, features, batch, pmm_k, n_trees, seed, save)
impute_batches(data, features, batch, pmm_k, n_trees, seed, save)

Arguments

`data`	Original data frame or tibble (with missing values)
`features`	Correlation-based vector of ranked features output from running `flatten_mat()`
`batch`	Numeric. Batch size.
`pmm_k`	Integer. Number of neighbors considered in imputation. Default at 5.
`n_trees`	Integer. Number of trees used in imputation. Default at 15.
`seed`	Integer. Seed to be set for reproducibility.
`save`	Should the list of individual imputed batches be saved as .rds file to working directory? Default set to FALSE.

Details

Value

A completed, imputed data set

References

Waggoner, P. D. (2023). A batch process for high dimensional imputation. Computational Statistics, 1-22. doi: <10.1007/s00180-023-01325-9>

Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. doi: <10.1093/bioinformatics/btr597>

Examples

## Not run: 
impute_batches(data = data, features = flat_mat,
batch = 2,  pmm_k = 5, n_trees = 15, seed = 123,
save = FALSE)

## End(Not run)
## Not run: 
impute_batches(data = data, features = flat_mat,
batch = 2,  pmm_k = 5, n_trees = 15, seed = 123,
save = FALSE)

## End(Not run)

Compute variable-wise mean absolute differences (MAD) between original and imputed dataframes.

Description

Compute variable-wise mean absolute differences (MAD) between original and imputed dataframes.

Usage

mad(original, imputed, round)
mad(original, imputed, round)

Arguments

`original`	A data frame or tibble with original values.
`imputed`	A data frame or tibble that has been imputed/completed.
`round`	Integer. Number of places to round MAD scores. Default set to 3.

Value

'mad_scores' as 'p' x 2 tibble. One row for each variable in original, from 1 to 'p'. Two columns: first is variable names ('var') and second is associated MAD score ('mad') as percentages for each variable.

Examples

## Not run: 
mad(original = original_data, imputed = imputed_data, round = 3)

## End(Not run)
## Not run: 
mad(original = original_data, imputed = imputed_data, round = 3)

## End(Not run)

Package 'hdImpute'

Help Index

Find features with (specified amount of) missingness

Description

Usage

Arguments

Value

Examples

Find number of and which rows contain any missingness

Description

Usage

Arguments

Value

Examples

High dimensional imputation via batch processed chained random forests Build correlation matrix

Description

Usage

Arguments

Value

References

Examples

Flatten and arrange cor matrix to be df

Description

Usage

Arguments

Value

Examples

Complete hdImpute process: correlation matrix, flatten, rank, create batches, impute, join

Description

Usage

Arguments

Details

Value

References

Examples

Impute batches and return completed data frame

Description

Usage

Arguments

Details

Value

References

Examples

Compute variable-wise mean absolute differences (MAD) between original and imputed dataframes.

Description

Usage

Arguments

Value

Examples