The data for this analysis has been obtained from the ‘Practical Machine Learning’ course offered by Johns Hopkins University on Coursera.
The course in turn has sourced the data from Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th Augmented Human (AH) International Conference in cooperation with ACM SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
The growth of fitness-tracking gadgets has led to the ‘quantified self’ movement, wherein tech and fitness enthusiasts monitor the quality and quantity of their physical activity to achieve their fitness goals. The study above focuses on the quality aspect of those physical activities.
Specifically, the study collected and analyzed data for weight-lifting exercises. The data has five target classes, viz. A, B, C, D and E. Class A corresponds to the correct method of performing a weight-lifting exercise, while the other classes correspond to erroneous methods. The features for these classes are different body measurements obtained through digital gadgets. A detailed overview of the features can be found here.
In this paper, we examine this data to build an ML classifier for the above five classes. Columns 1 to 7 of the data have been dropped, as they correspond to row numbers, subject names and observation timestamps, which are irrelevant to our classification problem.
# Libraries used throughout the analysis
library(data.table)
library(caret)
library(ggplot2)
train_data_url <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
test_data_url <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
train_data_path <- '.\\Project_data\\train_data.csv'
test_data_path <- '.\\Project_data\\test_data.csv'
# Download the data only if it is not already present locally
if(!file.exists(train_data_path))
{
download.file(url=train_data_url, destfile=train_data_path, method='curl')
}
if(!file.exists(test_data_path))
{
download.file(url=test_data_url, destfile=test_data_path, method='curl')
}
# Read the data, dropping columns 1 to 7 (row numbers, subject names, timestamps)
train_data <- fread(train_data_path, drop=1:7)
test_data <- fread(test_data_path, drop=1:7)
set.seed(333)
# Hold out 5% of the labelled data as a validation set for model comparison
inTrain <- createDataPartition(train_data$classe, p=0.95, list=FALSE)
training <- train_data[inTrain,]
validation <- train_data[-inTrain,]
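As a quick sanity check, one can verify that the stratified split produced by createDataPartition keeps the class proportions roughly equal in both partitions; a minimal sketch:
# Class proportions should be roughly the same in the two partitions
round(prop.table(table(training$classe)), 3)
round(prop.table(table(validation$classe)), 3)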
The dimensions of the full training data and the class counts for the target variable classe are as shown below. Columns in which more than 5% of the values are NA are then dropped from the training, validation and test datasets; once that is done, no NA values are left, so no imputation strategy is needed.
dim(train_data)
[1] 19622 153
table(train_data[,classe])
   A    B    C    D    E
5580 3797 3422 3216 3607
# Identify columns in which more than 5% of the values are NA ...
na_percentage <- training[,lapply(.SD, function(x) sum(is.na(x))/.N >0.05)]
cols_to_drop <- which(as.logical(na_percentage))
# ... and drop them from the training, validation and test datasets
training[, c(cols_to_drop):=NULL]
validation[, c(cols_to_drop):=NULL]
test_data[, c(cols_to_drop):=NULL]
# The target variable must be a factor for classification with caret
training[, classe:=as.factor(classe)]
validation[, classe:=as.factor(classe)]
dim(training)
[1] 18643 53
sum(is.na(training))
[1] 0
sum(is.na(validation))
[1] 0
nsv <- nearZeroVar(training, saveMetrics = TRUE)
sum(nsv$nzv==TRUE)
[1] 0
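No near-zero-variance predictors remain, so nothing further needs to be removed at this step. Had any columns been flagged, they could be dropped along the lines of the sketch below (hypothetical here, since the count is zero):
# Hypothetical clean-up: drop any columns flagged as near-zero-variance
nzv_cols <- rownames(nsv)[nsv$nzv]
if(length(nzv_cols) > 0)
{
training[, (nzv_cols):=NULL]
validation[, (nzv_cols):=NULL]
test_data[, (nzv_cols):=NULL]
}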
descrCor <- cor(training[,!c('classe')])
length(findCorrelation(descrCor, cutoff = .75))
[1] 21
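21 predictors are highly correlated (|r| > 0.75) with at least one other predictor, which motivates dimensionality reduction through PCA. An alternative, not taken here, would be to drop the flagged columns directly; a rough sketch:
# Alternative (not used): remove the highly correlated predictors themselves
high_cor_cols <- findCorrelation(descrCor, cutoff = .75, names = TRUE)
# training[, (high_cor_cols):=NULL]   # and likewise for validation and test_data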
# PCA on the predictors, keeping enough components to explain 80% of the variance
preProc <- preProcess(training[,!'classe'], method = 'pca', thresh = 0.8)
training_transformed <- predict(preProc, training[,!'classe'])
validation_transformed <- predict(preProc, validation[,!'classe'])
test_transformed <- predict(preProc, test_data[,!'problem_id'])
preProc$numComp
[1] 12
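Retaining 80% of the variance thus requires 12 principal components. As a rough cross-check (a sketch, assuming the centring and scaling that caret applies before its PCA step), the cumulative variance explained can also be inspected with base R's prcomp:
# Cross-check: cumulative proportion of variance explained by the components
pca_check <- prcomp(training[,!'classe'], center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca_check$sdev^2)/sum(pca_check$sdev^2)
which(cum_var >= 0.8)[1]   # expected to agree with preProc$numComp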
g <- ggplot()+
geom_point(mapping = aes(x=training_transformed$PC1,
y=training_transformed$PC2,
col=training$classe), alpha=0.4) +
labs(x='Principal Component 1',
y='Principal Component 2',
title='Classes as per top 2 PCA components',
color='class')
print(g)
fitControl <- trainControl(method = 'cv', number = 5)
Two classifiers, k-nearest neighbours (KNN) and random forest (RF), are trained with 5-fold cross-validation on the training_transformed data. Random Forest provides higher accuracy on the validation dataset. Hence, the RF classifier has been chosen for our predictions.
# KNN classifier trained on the PCA-transformed features
set.seed(111)
knn_classifier <- train(x=training_transformed,
y=training$classe,
method='knn',
trControl = fitControl)
knn_predictions <- predict(knn_classifier, validation_transformed)
knn_cm_matrix <- confusionMatrix(validation$classe, knn_predictions)
knn_cm_matrix$overall
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
0.9509704 0.9379460 0.9355165 0.9636314 0.2900919
AccuracyPValue McnemarPValue
0.0000000 NaN
# Random forest classifier with 400 trees on the same PCA-transformed features
set.seed(222)
rf_classifier <- train(x=training_transformed,
y=training$classe,
method='rf',
ntree = 400,
trControl = fitControl)
rf_predictions <- predict(rf_classifier, validation_transformed)
rf_cm_matrix <- confusionMatrix(validation$classe, rf_predictions)
rf_cm_matrix$overall
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
0.9693565 0.9612287 0.9565419 0.9792314 0.2860061
AccuracyPValue McnemarPValue
0.0000000 NaN
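For a side-by-side comparison of the two models on the validation set, the accuracies can be tabulated; a small sketch:
# Validation accuracy of both classifiers
data.table(model = c('KNN', 'Random Forest'),
           accuracy = c(knn_cm_matrix$overall[['Accuracy']],
                        rf_cm_matrix$overall[['Accuracy']]))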
rf_cm_matrix$table
Reference
Prediction A B C D E
A 272 3 2 2 0
B 3 182 3 0 1
C 2 2 162 2 3
D 3 0 2 154 1
E 0 0 0 1 179
# Estimated out-of-sample error: 1 - validation accuracy
1 - rf_cm_matrix$overall[['Accuracy']]
[1] 0.03064351
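The estimated out-of-sample error is therefore about 3.1%. The same figure can be obtained directly from the confusion-matrix table as the share of misclassified (off-diagonal) validation samples:
# Out-of-sample error = misclassified validation samples / total validation samples
cm <- rf_cm_matrix$table
1 - sum(diag(cm))/sum(cm)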
# Predictions of the chosen RF classifier for the 20 test cases
data.table(problem_id=1:20,
class=predict(rf_classifier, test_transformed))
problem_id class
1: 1 B
2: 2 A
3: 3 A
4: 4 A
5: 5 A
6: 6 E
7: 7 D
8: 8 B
9: 9 A
10: 10 A
11: 11 B
12: 12 C
13: 13 B
14: 14 A
15: 15 E
16: 16 E
17: 17 A
18: 18 B
19: 19 B
20: 20 B
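The predictions could then be written out for submission; a minimal sketch, in which the output file name is an assumption:
# Save the 20 test-set predictions to a CSV file (file name is illustrative)
test_predictions <- data.table(problem_id=1:20,
                               class=predict(rf_classifier, test_transformed))
fwrite(test_predictions, file='.\\Project_data\\test_predictions.csv')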