Title: | A Batch Process for High Dimensional Imputation |
---|---|
Description: | A correlation-based batch process for fast, accurate imputation for high dimensional missing data problems via chained random forests. See Waggoner (2023) <doi:10.1007/s00180-023-01325-9> for more on 'hdImpute', Stekhoven and Bühlmann (2012) <doi:10.1093/bioinformatics/btr597> for more on 'missForest', and Mayer (2022) <https://github.com/mayer79/missRanger> for more on 'missRanger'. |
Authors: | Philip Waggoner [aut, cre] |
Maintainer: | Philip Waggoner <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.1 |
Built: | 2024-10-27 05:20:37 UTC |
Source: | https://github.com/pdwaggoner/hdimpute |
Find features with (specified amount of) missingness
check_feature_na(data, threshold)
data |
A data frame or tibble. |
threshold |
Missingness threshold in a given column/feature as a proportion bounded between 0 and 1. Default set to sensitive level at 1e-04. |
A vector of column/feature names that contain missingness greater than threshold.
## Not run:
check_feature_na(data = any_data_frame, threshold = 1e-04)
## End(Not run)
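To make the threshold concrete, here is a minimal base R sketch of the same idea: compute the share of missing values per column and keep the names above the threshold. It is not the package's implementation; the data frame `toy` and the objects `na_share` and `flagged` are hypothetical names used only for illustration.

```r
# Hypothetical toy data; column names are illustrative only.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", "z", "w"),
  c = c(NA, NA, 1, 2),
  stringsAsFactors = FALSE
)

# Proportion of missing values in each column (between 0 and 1).
na_share <- colMeans(is.na(toy))

# Names of the columns whose missingness exceeds the threshold.
threshold <- 1e-04
flagged <- names(na_share)[na_share > threshold]
flagged
```

With a threshold this small, any column holding at least one missing value is flagged in a data set of ordinary size.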
Find number of and which rows contain any missingness
check_row_na(data, which)
data |
A data frame or tibble. |
which |
Logical. Should a list be returned with the row numbers corresponding to each row with missingness? Default set to FALSE. |
Either an integer value corresponding to the number of rows in data with any missingness (if which = FALSE), or a tibble containing: 1) the number of rows in data with any missingness, and 2) a list of which rows/row numbers contain missingness (if which = TRUE).
## Not run:
check_row_na(data = any_data_frame, which = FALSE)
## End(Not run)
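The two kinds of output described above (a count, and the row numbers themselves) can be illustrated with a short base R sketch. This is an assumption-level illustration, not the package's code; `toy`, `rows_na`, and `n_rows_na` are hypothetical names.

```r
# Hypothetical toy data with missing values in rows 2 and 3.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "w"),
  stringsAsFactors = FALSE
)

# Row numbers that contain at least one missing value.
rows_na <- which(!complete.cases(toy))

# Number of rows with any missingness.
n_rows_na <- length(rows_na)

n_rows_na
rows_na
```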
High dimensional imputation via batch processed chained random forests
Build correlation matrix
feature_cor(data, return_cor)
data |
A data frame or tibble. |
return_cor |
Logical. Should the correlation matrix be printed? Default set to FALSE. |
A cross-feature correlation matrix
Waggoner, P. D. (2023). A batch process for high dimensional imputation. Computational Statistics, 1-22. doi: <10.1007/s00180-023-01325-9>
van Buuren S, Groothuis-Oudshoorn K (2011). "mice: Multivariate Imputation by Chained Equations in R." Journal of Statistical Software, 45(3), 1-67. doi: <10.18637/jss.v045.i03>
## Not run:
feature_cor(data = data, return_cor = FALSE)
## End(Not run)
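As a rough picture of what a cross-feature correlation matrix looks like when the data contain missing values, the sketch below uses base R's `cor()` with pairwise complete observations. Both the all-numeric toy data and the choice of `use = "pairwise.complete.obs"` are assumptions made for illustration; how the package treats missing values and non-numeric features is not stated here.

```r
# Hypothetical numeric toy data with a few missing values.
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(5, 3, NA, 1, 0)
)

# Each pairwise correlation uses only the rows where both features are observed.
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Square matrix with one row and one column per feature.
dim(cor_mat)
round(cor_mat, 2)
```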
Flatten and arrange the correlation matrix as a data frame
flatten_mat(cor_mat, return_mat)
cor_mat |
A correlation matrix output from running |
return_mat |
Logical. Should the flattened matrix be printed? Default set to FALSE. |
A vector of correlation-based ranked features
## Not run:
flatten_mat(cor_mat = cor_mat, return_mat = FALSE)
## End(Not run)
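One way to picture the flatten-and-rank step is sketched below in base R: turn the upper triangle of a correlation matrix into a long data frame, sort the pairs by absolute correlation, and read off the features in order of first appearance. The ranking rule here is an assumption chosen for illustration and may differ from the rule the package applies; `flat` and `ranked` are hypothetical names.

```r
# Hypothetical 3 x 3 correlation matrix.
cor_mat <- matrix(
  c(1.0, 0.1, 0.8,
    0.1, 1.0, 0.4,
    0.8, 0.4, 1.0),
  nrow = 3,
  dimnames = list(c("x", "y", "z"), c("x", "y", "z"))
)

# Keep each feature pair once: upper triangle, diagonal excluded.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
flat <- data.frame(
  row = rownames(cor_mat)[idx[, "row"]],
  col = colnames(cor_mat)[idx[, "col"]],
  cor = cor_mat[idx],
  stringsAsFactors = FALSE
)

# Sort pairs by absolute correlation, strongest first.
flat <- flat[order(abs(flat$cor), decreasing = TRUE), ]

# Ranked features: order of first appearance across the sorted pairs.
ranked <- unique(as.vector(t(as.matrix(flat[, c("row", "col")]))))
ranked
```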
Complete hdImpute process: correlation matrix, flatten, rank, create batches, impute, join
hdImpute(data, batch, pmm_k, n_trees, seed, save)
data |
Original data frame or tibble (with missing values) |
batch |
Numeric. Batch size. |
pmm_k |
Integer. Number of neighbors considered in imputation. Default set at 5. |
n_trees |
Integer. Number of trees used in imputation. Default set at 15. |
seed |
Integer. Seed to be set for reproducibility. |
save |
Should the list of individual imputed batches be saved as .rds file to working directory? Default set to FALSE. |
Step 1. Group data by dividing the row_number() by the batch size (batch, set by the user) using integer division.
Step 2. Pass through group_split() to return a list.
Step 3. Impute each batch individually and time.
Step 4. Generate completed (unlisted/joined) imputed data frame.
A completed, imputed data set
Waggoner, P. D. (2023). A batch process for high dimensional imputation. Computational Statistics, 1-22. doi: <10.1007/s00180-023-01325-9>
Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. doi: <10.1093/bioinformatics/btr597>
## Not run:
hdImpute(data = data, batch = 2, pmm_k = 5, n_trees = 15, seed = 123, save = FALSE)
## End(Not run)
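The grouping by integer division in Steps 1 and 2 can be sketched in base R with `%/%` and `split()`, standing in for `row_number()` and `group_split()`. The feature names are hypothetical, and the zero-based offset (`seq_along(features) - 1`) is a choice made here so that every batch except possibly the last has exactly `batch` elements; the package's own indexing may differ.

```r
# Hypothetical ranked feature names.
features <- c("f1", "f2", "f3", "f4", "f5")
batch <- 2  # batch size

# Integer division of the zero-based position assigns a group id to each feature.
group_id <- (seq_along(features) - 1) %/% batch

# Split into a list with one element per batch.
batches <- split(features, group_id)
batches
```

With five features and a batch size of 2, this yields three batches, the last holding the single leftover feature.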
Impute batches and return completed data frame
impute_batches(data, features, batch, pmm_k, n_trees, seed, save)
data |
Original data frame or tibble (with missing values) |
features |
Correlation-based vector of ranked features output from running |
batch |
Numeric. Batch size. |
pmm_k |
Integer. Number of neighbors considered in imputation. Default at 5. |
n_trees |
Integer. Number of trees used in imputation. Default at 15. |
seed |
Integer. Seed to be set for reproducibility. |
save |
Should the list of individual imputed batches be saved as .rds file to working directory? Default set to FALSE. |
Step 1. Group data by dividing the row_number() by the batch size (batch, set by the user) using integer division.
Step 2. Pass through group_split() to return a list.
Step 3. Impute each batch individually and time.
Step 4. Generate completed (unlisted/joined) imputed data frame.
A completed, imputed data set
Waggoner, P. D. (2023). A batch process for high dimensional imputation. Computational Statistics, 1-22. doi: <10.1007/s00180-023-01325-9>
Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. doi: <10.1093/bioinformatics/btr597>
## Not run:
impute_batches(data = data, features = flat_mat, batch = 2, pmm_k = 5, n_trees = 15, seed = 123, save = FALSE)
## End(Not run)
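Steps 3 and 4 (impute each batch on its own, then join the results) can be sketched as follows. To stay self-contained, the sketch swaps in a simple column-median fill where the package uses chained random forests, so it shows only the batch-then-join structure and not the imputation method. The helper `impute_median()` and all object names are hypothetical.

```r
# Hypothetical toy data and feature batches.
toy <- data.frame(
  f1 = c(1, NA, 3),
  f2 = c(4, 5, NA),
  f3 = c(NA, 8, 9)
)
batches <- list(c("f1", "f2"), "f3")

# Stand-in imputer (NOT a chained random forest): fill NAs with the column median.
impute_median <- function(df) {
  df[] <- lapply(df, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  })
  df
}

# Impute each batch of columns individually.
imputed_batches <- lapply(batches, function(cols) {
  impute_median(toy[, cols, drop = FALSE])
})

# Join the batches column-wise and restore the original column order.
completed <- do.call(cbind, imputed_batches)[, names(toy)]
completed
```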
Compute variable-wise mean absolute differences (MAD) between original and imputed dataframes.
mad(original, imputed, round)
original |
A data frame or tibble with original values. |
imputed |
A data frame or tibble that has been imputed/completed. |
round |
Integer. Number of places to round MAD scores. Default set to 3. |
'mad_scores' as a 'p' x 2 tibble. One row for each variable in original, from 1 to 'p'. Two columns: the first is variable names ('var') and the second is the associated MAD score ('mad') as a percentage for each variable.
## Not run:
mad(original = original_data, imputed = imputed_data, round = 3)
## End(Not run)
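The description above does not spell out the MAD formula, so the following base R sketch rests on an assumption: for each variable, compare the proportion of each observed value before imputation (missing values dropped) with its proportion after imputation, and report the mean absolute difference as a percentage. The helper `mad_one()` and the toy data are hypothetical, and the package's actual computation may differ.

```r
# Hypothetical original data (with NAs) and a completed version of it.
original <- data.frame(
  g = c("a", "a", "b", NA),
  h = c("u", NA, "v", "v"),
  stringsAsFactors = FALSE
)
imputed <- data.frame(
  g = c("a", "a", "b", "b"),
  h = c("u", "v", "v", "v"),
  stringsAsFactors = FALSE
)

# Assumed score for one variable: mean absolute difference in value proportions, in percent.
mad_one <- function(o, i) {
  lev <- sort(unique(c(o[!is.na(o)], i)))              # shared set of values
  p_o <- prop.table(table(factor(o, levels = lev)))   # NAs are dropped by table()
  p_i <- prop.table(table(factor(i, levels = lev)))
  mean(abs(p_o - p_i)) * 100
}

# One row per variable: name and rounded score.
mad_scores <- data.frame(
  var = names(original),
  mad = round(mapply(mad_one, original, imputed), 3),
  stringsAsFactors = FALSE
)
mad_scores
```

Under this reading, a score near zero means imputation left the variable's distribution close to that of the observed values.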