Go back to the attrition data set and prepare an ensemble prediction algorithm. Show the following:
# 1. Display the ensemble model results showing Risk and Coefficient Weights.
# 2. Show the Confusion Matrix appraising ensemble performance
# (using the test set)
# 3. Display the top ten employees most likely to leave
# using the full data set. Note that this means that you have to
# rerun your "prediction" using the full data set.
Here are MY top 10 employees most likely to bolt (displaying a reduced number of variables):
# person predictions$pred Age Attrition BusinessTravel DailyRate
#1 72 0.6542388 26 No Travel_Rarely 1443
#2 1961 0.6230436 38 No Travel_Rarely 1404
#3 2021 0.5715788 21 No Travel_Rarely 501
#4 1539 0.5619409 50 No Travel_Frequently 333
#5 1646 0.5464278 24 No Travel_Frequently 567
#6 669 0.5384468 21 No Travel_Rarely 1343
#7 1198 0.5316000 20 No Travel_Rarely 805
#8 1019 0.5134828 22 No Travel_Rarely 217
#9 1296 0.4857101 28 No Travel_Frequently 193
#10 1286 0.4848228 28 No Travel_Frequently 791
Hints: (1) Take a look at the factor variable “Over18.” It is a factor variable with only 1 level; meaning, it offers no variation, thus it carries no information.
I suggest removing it. In fact, if you leave it in, you will see that it jams up some of the algorithms in the ensemble selection. (A quick programmatic check appears after the reference below.)
(2) Recall that some dplyr commands appear to be “masked” once SuperLearner’s dependencies are attached, so make sure you invoke dplyr explicitly using ::, e.g., dplyr::select().
(3) The dependent variable is a 2-level factor that codes numerically as 1/2; you need to convert it to a 0/1 variable for SuperLearner.
REFERENCE: https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html
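As a quick check for hint (1), here is a minimal sketch (runnable once employee is read in below) that flags factor columns with fewer than two levels:
# A single-level factor carries no variation, hence no information;
# this flags Over18
one_level <- sapply(employee, function(col) is.factor(col) && nlevels(col) < 2)
names(employee)[one_level]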
Read Employee Attrition Data
library(dplyr)        # data wrangling (invoke masked verbs as dplyr::)
library(SuperLearner) # ensemble estimation
employee = read.csv("HR-Employee-Attrition.csv", stringsAsFactors = TRUE) # factors as shown by str() (the pre-R 4.0 default)
str(employee)
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
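The str() output also shows two constant columns: EmployeeCount is always 1 and StandardHours is always 80. A sketch using caret (already needed later for confusionMatrix) to list every zero-variance column; dropping EmployeeCount and StandardHours along with Over18 should also silence the rank-deficient-fit warnings that appear further down:
# saveMetrics = TRUE returns a data frame with a zeroVar flag per column
zv <- caret::nearZeroVar(employee, saveMetrics = TRUE)
rownames(zv)[zv$zeroVar]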
Training and Testing Sets
# 80/20 train/test split; the row-number ID lets anti_join() recover the
# held-out rows (call set.seed() first to make the split reproducible)
employee_df = employee %>% mutate( ID = row_number())
train_employee = employee_df %>% sample_frac(0.8)
test_employee = employee_df %>% anti_join(train_employee, by = "ID")
table(employee$Attrition)
##
## No Yes
## 1233 237
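Only about 16% of employees left (237 of 1470), so the classes are imbalanced; a one-liner shows the proportions:
# Roughly 0.839 No / 0.161 Yes
round(prop.table(table(employee$Attrition)), 3)
Worth remembering when reading the confusion matrix later: always predicting "No" would already be about 84% accurate.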
# Convert the 1/2 factor outcome to 0/1 (hint 3): "No" is level 1, so No -> 0, Yes -> 1
y_train = train_employee %>% dplyr::select(Attrition)
y_train = as.numeric(y_train$Attrition) - 1
y_test = test_employee %>% dplyr::select(Attrition)
y_test = as.numeric(y_test$Attrition) - 1
# Drop the outcome, the split ID, and the zero-information columns (hint 1)
train_predictors = train_employee %>% dplyr::select(-Attrition, -ID, -EmployeeNumber, -Over18)
test_predictors = test_employee %>% dplyr::select(-Attrition, -ID, -EmployeeNumber, -Over18)
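A quick sanity check that the recode produced the 0/1 outcome SuperLearner expects:
# Should show only 0s and 1s, in roughly the 84/16 split seen above
table(y_train)
stopifnot(all(y_train %in% c(0, 1)))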
Single Model - to warm up to the task
# ====================================
# single model; to warm up
single.model <- SuperLearner(y_train, train_predictors,
family=binomial(),
SL.library=list("SL.randomForest"))
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
single.model
##
## Call:
## SuperLearner(Y = y_train, X = train_predictors, family = binomial(),
## SL.library = list("SL.randomForest"))
##
##
## Risk Coef
## SL.randomForest_All 0.102999 1
names(single.model)
## [1] "call" "libraryNames" "SL.library"
## [4] "SL.predict" "coef" "library.predict"
## [7] "Z" "cvRisk" "family"
## [10] "fitLibrary" "varNames" "validRows"
## [13] "method" "whichScreen" "control"
## [16] "cvControl" "errorsInCVLibrary" "errorsInLibrary"
## [19] "metaOptimizer" "env" "times"
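The names() listing shows the fitted object carries the cross-validated risks and ensemble weights directly; they can be pulled out by hand:
# The Risk printed above is the cross-validated risk (mean squared error of
# the CV predictions under the default NNLS method); with a single learner
# the coefficient is trivially 1
single.model$cvRisk
single.model$coef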
A couple more single-model warm-ups
single.modelA <- SuperLearner(y_train,
train_predictors,
family=binomial(),
SL.library=list(c("SL.randomForest",
"screen.randomForest")
))
single.modelA
##
## Call:
## SuperLearner(Y = y_train, X = train_predictors, family = binomial(),
## SL.library = list(c("SL.randomForest", "screen.randomForest")))
##
##
## Risk Coef
## SL.randomForest_screen.randomForest 0.1135698 1
single.modelB <- SuperLearner(y_train,
train_predictors,
family=binomial(),
SL.library=
list(c("SL.glm", "screen.corP")))
## Error in cor.test.default(x, y = Y, method = method) :
## 'x' must be a numeric vector
## Warning in FUN(X[[i]], ...): replacing failed screening algorithm, screen.corP , with All()
##
## Warning in FUN(X[[i]], ...): prediction from a rank-deficient fit may be misleading
(The error/warning triple above repeats once per cross-validation fold, and a final time "in full data." screen.corP calls cor.test(), which requires numeric inputs; the factor predictor columns make it fail, so SuperLearner falls back to the All() screen.)
single.modelB
##
## Call:
## SuperLearner(Y = y_train, X = train_predictors, family = binomial(),
## SL.library = list(c("SL.glm", "screen.corP")))
##
##
## Risk Coef
## SL.glm_screen.corP 0.09458614 1
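One possible workaround for the screen.corP failures (a sketch, not required for the assignment): one-hot encode the factor columns with model.matrix() so every predictor is numeric before screening.
# Expand factors into numeric dummy columns; "~ . - 1" drops the intercept
train_pred_numeric <- as.data.frame(model.matrix(~ . - 1, data = train_predictors))
names(train_pred_numeric) <- make.names(names(train_pred_numeric)) # syntactic names for glm
single.modelB2 <- SuperLearner(y_train, train_pred_numeric,
                               family = binomial(),
                               SL.library = list(c("SL.glm", "screen.corP")))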
Full Model
full.model <- SuperLearner(y_train,
train_predictors,
family=binomial(),
SL.library=list("SL.randomForest",
"SL.glm",
"SL.ipredbagg",
"SL.bayesglm"
))
## Loading required package: arm
## Loading required package: Matrix
## Loading required package: lme4
##
## arm (Version 1.10-1, built: 2018-4-12)
## Working directory is /home/guest15/projects
## Loading required package: ipred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
(The rank-deficiency warning repeats for each cross-validation fold. It comes from the glm-based learners and is expected here: the constant columns EmployeeCount and StandardHours make the design matrix rank-deficient, so glm drops them with NA coefficients.)
# Return the model
full.model
##
## Call:
## SuperLearner(Y = y_train, X = train_predictors, family = binomial(),
## SL.library = list("SL.randomForest", "SL.glm", "SL.ipredbagg", "SL.bayesglm"))
##
##
##
## Risk Coef
## SL.randomForest_All 0.10208231 0.2726391
## SL.glm_All 0.09449071 0.0000000
## SL.ipredbagg_All 0.10537071 0.1016873
## SL.bayesglm_All 0.09343285 0.6256736
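The Coef column holds the weights the default NNLS meta-learner assigns to the base learners (note that SL.glm is zeroed out here). Per the SuperLearner vignette linked above, an honest estimate of the ensemble's own performance comes from an external cross-validation layer; a sketch with 5 folds to keep runtime down:
# External CV of the whole ensemble; summary() reports the CV risk of the
# SuperLearner itself alongside each base learner
cv.model <- CV.SuperLearner(Y = y_train, X = train_predictors,
                            family = binomial(),
                            cvControl = list(V = 5),
                            SL.library = list("SL.randomForest", "SL.glm",
                                              "SL.ipredbagg", "SL.bayesglm"))
summary(cv.model)
plot(cv.model) # plotting requires ggplot2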
Predictions
# Predict on held-out data, passing only the columns the ensemble was trained on
predictions <- predict(full.model, newdata = test_predictors)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
Recode probabilities into class labels (0.5 cutoff)
conv.preds <- ifelse(predictions$pred>=0.5,1,0)
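The 0.5 cutoff is conventional, not mandatory; with only ~16% leavers, a lower cutoff catches more true leavers at the price of more false alarms. A small sketch of sweeping the threshold:
# For each cutoff, count how many actual leavers (y_test == 1) are caught
# and how many employees get flagged overall
for (t in c(0.3, 0.4, 0.5)) {
  pred_t <- ifelse(predictions$pred >= t, 1, 0)
  cat(sprintf("cutoff %.1f: caught %d of %d leavers, flagged %d employees\n",
              t, sum(pred_t == 1 & y_test == 1), sum(y_test == 1), sum(pred_t)))
}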
Create the confusion matrix
cm <- caret::confusionMatrix(as.factor(conv.preds), as.factor(y_test))
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 241 34
## 1 4 15
##
## Accuracy : 0.8707
## 95% CI : (0.8269, 0.9069)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 0.04679
##
## Kappa : 0.3838
##
## Mcnemar's Test P-Value : 2.546e-06
##
## Sensitivity : 0.9837
## Specificity : 0.3061
## Pos Pred Value : 0.8764
## Neg Pred Value : 0.7895
## Prevalence : 0.8333
## Detection Rate : 0.8197
## Detection Prevalence : 0.9354
## Balanced Accuracy : 0.6449
##
## 'Positive' Class : 0
##
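The confusion matrix reflects a single cutoff; an AUC summarizes performance across all cutoffs. A sketch assuming the pROC package is installed (it is not used elsewhere in this write-up):
# AUC of the ensemble's predicted probabilities on the test set
roc_obj <- pROC::roc(response = y_test, predictor = as.vector(predictions$pred))
pROC::auc(roc_obj)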
Display the Top Ten Most Likely to Leave (Using the Full Data Set)
# Rerun the prediction on the full data set, again using only the trained-on columns
predictions <- predict(full.model, newdata = employee %>% dplyr::select(-Attrition, -EmployeeNumber, -Over18))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
Remove some variables for presentation purposes
fin = cbind(person = employee$EmployeeNumber, predictions$pred, employee[,1:4])
Single out those still at the company
fin_10 = fin %>% dplyr::filter(Attrition == "No")
Rename for ease
fin_10 = fin_10 %>% dplyr::rename(Top_Ten = "predictions$pred")
Sort in descending order of predicted probability
fin_10 = fin_10 %>% dplyr::arrange(desc(Top_Ten))
Present the 10 most likely to bolt
head(fin_10, 10)
## person Top_Ten Age Attrition BusinessTravel DailyRate
## 1 1646 0.6529479 24 No Travel_Frequently 567
## 2 913 0.6142489 44 No Travel_Rarely 661
## 3 2021 0.5792467 21 No Travel_Rarely 501
## 4 436 0.5275962 32 No Travel_Rarely 588
## 5 1269 0.5129093 19 No Travel_Rarely 265
## 6 939 0.4959162 33 No Travel_Rarely 1198
## 7 390 0.4942819 26 No Travel_Frequently 496
## 8 1286 0.4833113 28 No Travel_Frequently 791
## 9 72 0.4764485 26 No Travel_Rarely 1443
## 10 1131 0.4745496 21 No Travel_Rarely 984
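If the list needs to leave R for HR, a one-liner exports it (the file name is my own choosing):
# Save the ranked watch list as a CSV
write.csv(head(fin_10, 10), "top10_attrition_risk.csv", row.names = FALSE)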
#============================================================================#