1. Data Source:

The data for this analysis was obtained from the ‘Practical Machine Learning’ course offered by Johns Hopkins University on Coursera.

The course in turn sourced the data from: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th Augmented Human (AH) International Conference in cooperation with ACM SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

2. Introduction:

The growth of fitness-tracking gadgets has led to a ‘quantified self’ movement, in which tech and fitness enthusiasts monitor the quantity and quality of their physical activity in order to meet their fitness goals. The study above focuses on the quality aspect of those physical activities.

Specifically, the study collected and analyzed data for weight lifting exercises. The data has five target classes, viz. A, B, C, D and E. Class A corresponds to the correct execution of a weight lifting exercise, while the other four classes correspond to common mistakes. The features for these classes are different body measurements obtained through wearable digital gadgets. A detailed overview of the features can be found here.

In this paper, we examine this data to build a machine-learning classifier for the five classes above. Columns 1:7 in the data have been dropped, as they contain row numbers, subject names and observation timestamps, all of which are irrelevant to our classification problem.

library(data.table)  # fread and data.table syntax
library(caret)       # createDataPartition, preProcess, train, confusionMatrix
library(ggplot2)     # visualization

train_data_url <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
test_data_url <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
train_data_path <- '.\\Project_data\\train_data.csv'
test_data_path <- '.\\Project_data\\test_data.csv'

if (!file.exists(train_data_path)) {
    download.file(url = train_data_url, destfile = train_data_path, method = 'curl')
}
if (!file.exists(test_data_path)) {
    download.file(url = test_data_url, destfile = test_data_path, method = 'curl')
}

# Drop columns 1:7 (row numbers, subject names, timestamps) at read time
train_data <- fread(train_data_path, drop = 1:7)
test_data <- fread(test_data_path, drop = 1:7)

3. Exploratory Data Analysis:

  1. Overview of the Data:
set.seed(333)
inTrain <- createDataPartition(train_data$classe, p=0.95, list=FALSE)
training <- train_data[inTrain,]
validation <- train_data[-inTrain,]
  2. Cleaning the Data:
dim(train_data)
[1] 19622   153
table(train_data[,classe])

   A    B    C    D    E 
5580 3797 3422 3216 3607 
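As a sanity check, the class counts above imply a majority-class (‘no information’) baseline of roughly 28% accuracy, which any useful classifier must beat. A minimal base-R sketch using the tabulated counts:

```r
# Class counts copied from the table above
counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

total <- sum(counts)             # 19622 observations in total
baseline <- max(counts) / total  # accuracy of always predicting class A

round(baseline, 4)               # ~0.2844
```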
# Flag columns where more than 5% of values are missing, then drop them
# from the training, validation and test sets alike
na_percentage <- training[, lapply(.SD, function(x) sum(is.na(x))/.N > 0.05)]
cols_to_drop <- which(as.logical(na_percentage))

training[, c(cols_to_drop) := NULL]
validation[, c(cols_to_drop) := NULL]
test_data[, c(cols_to_drop) := NULL]

training[, classe:=as.factor(classe)]
validation[, classe:=as.factor(classe)]

dim(training)
[1] 18643    53
sum(is.na(training))
[1] 0
sum(is.na(validation))
[1] 0
  3. Dimensionality Reduction:
nsv <- nearZeroVar(training, saveMetrics = TRUE)
sum(nsv$nzv==TRUE)
[1] 0
descrCor <- cor(training[,!c('classe')])
length(findCorrelation(descrCor, cutoff = .75))
[1] 21
preProc <- preProcess(training[,!'classe'], method = 'pca', thresh = 0.8)
training_transformed <- predict(preProc, training[,!'classe'])
validation_transformed <- predict(preProc, validation[,!'classe'])
test_transformed <- predict(preProc, test_data[,!'problem_id'])
preProc$numComp
[1] 12
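The `thresh = 0.8` argument tells `preProcess` to keep just enough components to explain 80% of the total variance, which here works out to 12. The mechanics of that selection can be sketched in base R on a stand-in dataset (`mtcars` here, since the course data is not bundled with R):

```r
# Stand-in illustration: how many principal components are needed
# to explain at least 80% of the variance
x <- scale(mtcars)  # centre and scale, as caret's PCA preprocessing does
pca <- prcomp(x)

# Cumulative proportion of variance explained by the first k components
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
num_comp <- which(var_explained >= 0.8)[1]
num_comp
```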
  4. Visualizing Features:
g <- ggplot()+
    geom_point(mapping = aes(x=training_transformed$PC1, 
                             y=training_transformed$PC2,
                             col=training$classe), alpha=0.4) +
    labs(x='Principal Component 1', 
         y='Principal Component 2',
         title='Classes as per top 2 PCA components', 
         color='class')
print(g)

4. Building Classifier:

  1. Cross Validation:
fitControl <- trainControl(method = 'cv', number = 5)
  2. Classifier:
set.seed(111)
knn_classifier <- train(x=training_transformed, 
                        y=training$classe, 
                        method='knn',
                        trControl = fitControl)

knn_predictions <- predict(knn_classifier, validation_transformed)
knn_cm_matrix <- confusionMatrix(validation$classe, knn_predictions)
knn_cm_matrix$overall
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
     0.9509704      0.9379460      0.9355165      0.9636314      0.2900919 
AccuracyPValue  McnemarPValue 
     0.0000000            NaN 
set.seed(222)
rf_classifier <- train(x=training_transformed, 
              y=training$classe, 
              method='rf',
              ntree = 400,
              trControl = fitControl)

rf_predictions <- predict(rf_classifier, validation_transformed)
rf_cm_matrix <- confusionMatrix(validation$classe, rf_predictions)
rf_cm_matrix$overall
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
     0.9693565      0.9612287      0.9565419      0.9792314      0.2860061 
AccuracyPValue  McnemarPValue 
     0.0000000            NaN 
  1. Confusion Matrix Table:
rf_cm_matrix$table
          Reference
Prediction   A   B   C   D   E
         A 272   3   2   2   0
         B   3 182   3   0   1
         C   2   2 162   2   3
         D   3   0   2 154   1
         E   0   0   0   1 179

5. Conclusion:

  1. Out of Sample Predicted Error:
1 - rf_cm_matrix$overall[['Accuracy']]
[1] 0.03064351
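The same figure can be recovered directly from the confusion-matrix table in the previous section; a minimal base-R check (matrix values copied from that table):

```r
# Random-forest confusion matrix on the validation set, copied from above
cm <- matrix(c(272,   3,   2,   2,   0,
                 3, 182,   3,   0,   1,
                 2,   2, 162,   2,   3,
                 3,   0,   2, 154,   1,
                 0,   0,   0,   1, 179),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = c('A','B','C','D','E'),
                             Reference  = c('A','B','C','D','E')))

accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
1 - accuracy                         # ~0.03064, matching the error above
```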
  2. Predictions for the Unseen Test Set:
data.table(problem_id=1:20, 
           class=predict(rf_classifier, test_transformed))
    problem_id class
 1:          1     B
 2:          2     A
 3:          3     A
 4:          4     A
 5:          5     A
 6:          6     E
 7:          7     D
 8:          8     B
 9:          9     A
10:         10     A
11:         11     B
12:         12     C
13:         13     B
14:         14     A
15:         15     E
16:         16     E
17:         17     A
18:         18     B
19:         19     B
20:         20     B