Go back to the attrition data set and prepare an ensemble prediction algorithm. Show the following:
# 1. Display the ensemble model results showing Risk and Coefficient Weights.
# 2. Show the Confusion Matrix appraising ensemble performance
# (using the test set)
# 3. Display the top ten employees most likely to leave
# using the full data set. Note that this means that you have to
# rerun your "prediction" using the full data set.
Here are MY top 10 employees most likely to bolt (displaying a reduced number of variables):
# person predictions$pred Age Attrition BusinessTravel DailyRate
#1 72 0.6542388 26 No Travel_Rarely 1443
#2 1961 0.6230436 38 No Travel_Rarely 1404
#3 2021 0.5715788 21 No Travel_Rarely 501
#4 1539 0.5619409 50 No Travel_Frequently 333
#5 1646 0.5464278 24 No Travel_Frequently 567
#6 669 0.5384468 21 No Travel_Rarely 1343
#7 1198 0.5316000 20 No Travel_Rarely 805
#8 1019 0.5134828 22 No Travel_Rarely 217
#9 1296 0.4857101 28 No Travel_Frequently 193
#10 1286 0.4848228 28 No Travel_Frequently 791
Hints: (1) Take a look at the factor variable “Over18.” It is a factor variable with only 1 level; meaning, it offers no variation, thus it carries no information.
I suggest removing it. In fact, if you leave it in, you will see that it jams up some of the algorithms in the ensemble selection. (A quick programmatic check appears after the reference below.)
(2) Recall that some dplyr commands appear to be “masked” once SuperLearner’s dependencies are attached, so make sure you invoke dplyr explicitly using ::, e.g., dplyr::select().
(3) The dependent variable is a 2-level factor that codes numerically as 1/2; you need to convert it to a 0/1 variable for SuperLearner.
REFERENCE: https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html
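As a quick check for hint (1), here is a minimal sketch (runnable once employee is read in below) that flags factor columns with fewer than two levels:
# A single-level factor carries no variation, hence no information;
# this flags Over18
one_level <- sapply(employee, function(col) is.factor(col) && nlevels(col) < 2)
names(employee)[one_level]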
Read Employee Attrition Data
library(dplyr)        # data wrangling (invoke masked verbs as dplyr::)
library(SuperLearner) # ensemble estimation
employee = read.csv("HR-Employee-Attrition.csv", stringsAsFactors = TRUE) # factors as shown by str() (the pre-R 4.0 default)
str(employee)
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
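The str() output also shows two constant columns: EmployeeCount is always 1 and StandardHours is always 80. A sketch using caret (already needed later for confusionMatrix) to list every zero-variance column; dropping EmployeeCount and StandardHours along with Over18 should also silence the rank-deficient-fit warnings that appear further down:
# saveMetrics = TRUE returns a data frame with a zeroVar flag per column
zv <- caret::nearZeroVar(employee, saveMetrics = TRUE)
rownames(zv)[zv$zeroVar]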
Training and Testing Sets
# 80/20 train/test split; the row-number ID lets anti_join() recover the
# held-out rows (call set.seed() first to make the split reproducible)
employee_df = employee %>% mutate( ID = row_number())
train_employee = employee_df %>% sample_frac(0.8)
test_employee = employee_df %>% anti_join(train_employee, by = "ID")
table(employee$Attrition)
##
## No Yes
## 1233 237
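Only about 16% of employees left (237 of 1470), so the classes are imbalanced; a one-liner shows the proportions:
# Roughly 0.839 No / 0.161 Yes
round(prop.table(table(employee$Attrition)), 3)
Worth remembering when reading the confusion matrix later: always predicting "No" would already be about 84% accurate.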
# Convert the 1/2 factor outcome to 0/1 (hint 3): "No" is level 1, so No -> 0, Yes -> 1
y_train = train_employee %>% dplyr::select(Attrition)
y_train = as.numeric(y_train$Attrition) - 1
y_test = test_employee %>% dplyr::select(Attrition)
y_test = as.numeric(y_test$Attrition) - 1
# Drop the outcome, the split ID, and the zero-information columns (hint 1)
train_predictors = train_employee %>% dplyr::select(-Attrition, -ID, -EmployeeNumber, -Over18)
test_predictors = test_employee %>% dplyr::select(-Attrition, -ID, -EmployeeNumber, -Over18)
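A quick sanity check that the recode produced the 0/1 outcome SuperLearner expects:
# Should show only 0s and 1s, in roughly the 84/16 split seen above
table(y_train)
stopifnot(all(y_train %in% c(0, 1)))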
Single Model - to warm up to the task
# ====================================
# single model; to warm up
single.model <- SuperLearner(y_train, train_predictors,
family=binomial(),
SL.library=list("SL.randomForest"))
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
single.model
##
## Call:
## SuperLearner(Y = y_train, X = train_predictors, family = binomial(),
## SL.library = list("SL.randomForest"))
##
##
## Risk Coef
## SL.randomForest_All 0.102999 1
names(single.model)
## [1] "call" "libraryNames" "SL.library"
## [4] "SL.predict" "coef" "library.predict"
## [7] "Z" "cvRisk" "family"
## [10] "fitLibrary" "varNames" "validRows"
## [13] "method" "whichScreen" "control"
## [16] "cvControl" "errorsInCVLibrary" "errorsInLibrary"
## [19] "metaOptimizer" "env" "times"
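The names() listing shows the fitted object carries the cross-validated risks and ensemble weights directly; they can be pulled out by hand:
# The Risk printed above is the cross-validated risk (mean squared error of
# the CV predictions under the default NNLS method); with a single learner
# the coefficient is trivially 1
single.model$cvRisk
single.model$coef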
A couple more single-model warm-ups
single.modelA <- SuperLearner(y_train,
train_predictors,
family=binomial(),
SL.library=list(c("SL.randomForest",
"screen.randomForest")
))
single.modelA
##
## Call:
## SuperLearner(Y = y_train, X = train_predictors, family = binomial(),
## SL.library = list(c("SL.randomForest", "screen.randomForest")))
##
##
## Risk Coef
## SL.randomForest_screen.randomForest 0.1135698 1
single.modelB <- SuperLearner(y_train,
train_predictors,
family=binomial(),
SL.library=
list(c("SL.glm", "screen.corP")))
## Error in cor.test.default(x, y = Y, method = method) :
## 'x' must be a numeric vector
## Warning in FUN(X[[i]], ...): replacing failed screening algorithm, screen.corP , with All()
##
## Warning in FUN(X[[i]], ...): prediction from a rank-deficient fit may be misleading
(The error/warning triple above repeats once per cross-validation fold, and a final time "in full data." screen.corP calls cor.test(), which requires numeric inputs; the factor predictor columns make it fail, so SuperLearner falls back to the All() screen.)
single.modelB
##
## Call:
## SuperLearner(Y = y_train, X = train_predictors, family = binomial(),
## SL.library = list(c("SL.glm", "screen.corP")))
##
##
## Risk Coef
## SL.glm_screen.corP 0.09458614 1
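One possible workaround for the screen.corP failures (a sketch, not required for the assignment): one-hot encode the factor columns with model.matrix() so every predictor is numeric before screening.
# Expand factors into numeric dummy columns; "~ . - 1" drops the intercept
train_pred_numeric <- as.data.frame(model.matrix(~ . - 1, data = train_predictors))
names(train_pred_numeric) <- make.names(names(train_pred_numeric)) # syntactic names for glm
single.modelB2 <- SuperLearner(y_train, train_pred_numeric,
                               family = binomial(),
                               SL.library = list(c("SL.glm", "screen.corP")))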
Full Model
full.model <- SuperLearner(y_train,
train_predictors,
family=binomial(),
SL.library=list("SL.randomForest",
"SL.glm",
"SL.ipredbagg",
"SL.bayesglm"
))
## Loading required package: arm
## Loading required package: Matrix
## Loading required package: lme4
##
## arm (Version 1.10-1, built: 2018-4-12)
## Working directory is /home/guest15/projects
## Loading required package: ipred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
(The rank-deficiency warning repeats for each cross-validation fold. It comes from the glm-based learners and is expected here: the constant columns EmployeeCount and StandardHours make the design matrix rank-deficient, so glm drops them with NA coefficients.)
# Return the model
full.model
##
## Call:
## SuperLearner(Y = y_train, X = train_predictors, family = binomial(),
## SL.library = list("SL.randomForest", "SL.glm", "SL.ipredbagg", "SL.bayesglm"))
##
##
##
## Risk Coef
## SL.randomForest_All 0.10208231 0.2726391
## SL.glm_All 0.09449071 0.0000000
## SL.ipredbagg_All 0.10537071 0.1016873
## SL.bayesglm_All 0.09343285 0.6256736
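The Coef column holds the weights the default NNLS meta-learner assigns to the base learners (note that SL.glm is zeroed out here). Per the SuperLearner vignette linked above, an honest estimate of the ensemble's own performance comes from an external cross-validation layer; a sketch with 5 folds to keep runtime down:
# External CV of the whole ensemble; summary() reports the CV risk of the
# SuperLearner itself alongside each base learner
cv.model <- CV.SuperLearner(Y = y_train, X = train_predictors,
                            family = binomial(),
                            cvControl = list(V = 5),
                            SL.library = list("SL.randomForest", "SL.glm",
                                              "SL.ipredbagg", "SL.bayesglm"))
summary(cv.model)
plot(cv.model) # plotting requires ggplot2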
Predictions
# Predict on held-out data, passing only the columns the ensemble was trained on
predictions <- predict(full.model, newdata = test_predictors)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
Recode probabilities into class labels (0.5 cutoff)
conv.preds <- ifelse(predictions$pred>=0.5,1,0)
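The 0.5 cutoff is conventional, not mandatory; with only ~16% leavers, a lower cutoff catches more true leavers at the price of more false alarms. A small sketch of sweeping the threshold:
# For each cutoff, count how many actual leavers (y_test == 1) are caught
# and how many employees get flagged overall
for (t in c(0.3, 0.4, 0.5)) {
  pred_t <- ifelse(predictions$pred >= t, 1, 0)
  cat(sprintf("cutoff %.1f: caught %d of %d leavers, flagged %d employees\n",
              t, sum(pred_t == 1 & y_test == 1), sum(y_test == 1), sum(pred_t)))
}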
Create the confusion matrix
cm <- caret::confusionMatrix(as.factor(conv.preds), as.factor(y_test))
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 241 34
## 1 4 15
##
## Accuracy : 0.8707
## 95% CI : (0.8269, 0.9069)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 0.04679
##
## Kappa : 0.3838
##
## Mcnemar's Test P-Value : 2.546e-06
##
## Sensitivity : 0.9837
## Specificity : 0.3061
## Pos Pred Value : 0.8764
## Neg Pred Value : 0.7895
## Prevalence : 0.8333
## Detection Rate : 0.8197
## Detection Prevalence : 0.9354
## Balanced Accuracy : 0.6449
##
## 'Positive' Class : 0
##
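The confusion matrix reflects a single cutoff; an AUC summarizes performance across all cutoffs. A sketch assuming the pROC package is installed (it is not used elsewhere in this write-up):
# AUC of the ensemble's predicted probabilities on the test set
roc_obj <- pROC::roc(response = y_test, predictor = as.vector(predictions$pred))
pROC::auc(roc_obj)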
Display the Top Ten Most Likely to Leave (Using the Full Data Set)
# Rerun the prediction on the full data set, again using only the trained-on columns
predictions <- predict(full.model, newdata = employee %>% dplyr::select(-Attrition, -EmployeeNumber, -Over18))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
Remove some variables for presentation purposes
fin = cbind(person = employee$EmployeeNumber, predictions$pred, employee[,1:4])
Single out those still at the company
fin_10 = fin %>% dplyr::filter(Attrition == "No")
Rename for ease
fin_10 = fin_10 %>% dplyr::rename(Top_Ten = "predictions$pred")
Sort in descending order of predicted probability
fin_10 = fin_10 %>% dplyr::arrange(desc(Top_Ten))
Present the 10 most likely to bolt
head(fin_10, 10)
## person Top_Ten Age Attrition BusinessTravel DailyRate
## 1 1646 0.6529479 24 No Travel_Frequently 567
## 2 913 0.6142489 44 No Travel_Rarely 661
## 3 2021 0.5792467 21 No Travel_Rarely 501
## 4 436 0.5275962 32 No Travel_Rarely 588
## 5 1269 0.5129093 19 No Travel_Rarely 265
## 6 939 0.4959162 33 No Travel_Rarely 1198
## 7 390 0.4942819 26 No Travel_Frequently 496
## 8 1286 0.4833113 28 No Travel_Frequently 791
## 9 72 0.4764485 26 No Travel_Rarely 1443
## 10 1131 0.4745496 21 No Travel_Rarely 984
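If the list needs to leave R for HR, a one-liner exports it (the file name is my own choosing):
# Save the ranked watch list as a CSV
write.csv(head(fin_10, 10), "top10_attrition_risk.csv", row.names = FALSE)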
#============================================================================#