Case Study

This is my first case study, showcasing the skills I have acquired in R programming. For this purpose I will use public data that explores the daily habits of smart device users.

DATASET: FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius). This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users who consented to the submission of their data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Prepare

  • Setting up my environment.

I downloaded the dataset from Kaggle in .csv format and saved it on my hard drive. Then in RStudio I installed the packages needed for the task.

# skimr and janitor are loaded below, so they are installed here as well
install.packages(c("tidyverse", "data.table", "rmarkdown", "skimr", "janitor"), repos = "http://cran.us.r-project.org")
library(data.table)
library(readr)
library(dplyr)
library(tidyr)
library(skimr)
library(janitor)
library(lubridate)
library(ggplot2)
library(scales)
library(rmarkdown)

Then I set my working directory to the folder in which I saved this dataset using the setwd() function and proceeded to load the .csv files.

hourly_calories <- read.csv("mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlyCalories_merged.csv")
hourly_intensities <- read.csv("mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlyIntensities_merged.csv")
hourly_steps <- read.csv("mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlySteps_merged.csv")
weight_log_info <- read.csv("mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/weightLogInfo_merged.csv")
daily_activity <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
sleep_day <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
  • Getting acquainted with the data.

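After loading all the files into RStudio I began to examine them. To save time I combined them into a single object. Note that c() applied to data frames does not keep six separate tables; it flattens them into one list whose elements are the individual columns, which is why glimpse() reports a list of 38.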

all_files <- c(hourly_calories, hourly_intensities, hourly_steps, weight_log_info, daily_activity, sleep_day)

First I used glimpse() to get some insights from the data.

glimpse(all_files)
## List of 38
##  $ Id                      : num [1:24084] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour            : chr [1:24084] "3/12/2016 12:00:00 AM" "3/12/2016 1:00:00 AM" "3/12/2016 2:00:00 AM" "3/12/2016 3:00:00 AM" ...
##  $ Calories                : int [1:24084] 48 48 48 48 48 48 48 48 48 49 ...
##  $ Id                      : num [1:24084] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour            : chr [1:24084] "3/12/2016 12:00:00 AM" "3/12/2016 1:00:00 AM" "3/12/2016 2:00:00 AM" "3/12/2016 3:00:00 AM" ...
##  $ TotalIntensity          : int [1:24084] 0 0 0 0 0 0 0 0 0 1 ...
##  $ AverageIntensity        : num [1:24084] 0 0 0 0 0 ...
##  $ Id                      : num [1:24084] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour            : chr [1:24084] "3/12/2016 12:00:00 AM" "3/12/2016 1:00:00 AM" "3/12/2016 2:00:00 AM" "3/12/2016 3:00:00 AM" ...
##  $ StepTotal               : int [1:24084] 0 0 0 0 0 0 0 0 0 8 ...
##  $ Id                      : num [1:33] 1.50e+09 1.93e+09 2.35e+09 2.87e+09 2.87e+09 ...
##  $ Date                    : chr [1:33] "4/5/2016 11:59:59 PM" "4/10/2016 6:33:26 PM" "4/3/2016 11:59:59 PM" "4/6/2016 11:59:59 PM" ...
##  $ WeightKg                : num [1:33] 53.3 129.6 63.4 56.7 57.2 ...
##  $ WeightPounds            : num [1:33] 118 286 140 125 126 ...
##  $ Fat                     : int [1:33] 22 NA 10 NA NA NA NA NA NA NA ...
##  $ BMI                     : num [1:33] 23 46.2 24.8 21.5 21.6 ...
##  $ IsManualReport          : chr [1:33] "True" "False" "True" "True" ...
##  $ LogId                   : num [1:33] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int [1:940] 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int [1:940] 728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int [1:940] 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
##  $ Id                      : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay                : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords       : int [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep      : int [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed          : int [1:413] 346 407 442 367 712 320 377 364 384 449 ...

Then, to get more detail about each file, I used the head() function, and counted participants and rows with n_distinct() and nrow().

head(hourly_calories)
##           Id          ActivityHour Calories
## 1 1503960366 3/12/2016 12:00:00 AM       48
## 2 1503960366  3/12/2016 1:00:00 AM       48
## 3 1503960366  3/12/2016 2:00:00 AM       48
## 4 1503960366  3/12/2016 3:00:00 AM       48
## 5 1503960366  3/12/2016 4:00:00 AM       48
## 6 1503960366  3/12/2016 5:00:00 AM       48
head(hourly_intensities)
##           Id          ActivityHour TotalIntensity AverageIntensity
## 1 1503960366 3/12/2016 12:00:00 AM              0                0
## 2 1503960366  3/12/2016 1:00:00 AM              0                0
## 3 1503960366  3/12/2016 2:00:00 AM              0                0
## 4 1503960366  3/12/2016 3:00:00 AM              0                0
## 5 1503960366  3/12/2016 4:00:00 AM              0                0
## 6 1503960366  3/12/2016 5:00:00 AM              0                0
head(hourly_steps)
##           Id          ActivityHour StepTotal
## 1 1503960366 3/12/2016 12:00:00 AM         0
## 2 1503960366  3/12/2016 1:00:00 AM         0
## 3 1503960366  3/12/2016 2:00:00 AM         0
## 4 1503960366  3/12/2016 3:00:00 AM         0
## 5 1503960366  3/12/2016 4:00:00 AM         0
## 6 1503960366  3/12/2016 5:00:00 AM         0
head(weight_log_info)
##           Id                 Date WeightKg WeightPounds Fat   BMI
## 1 1503960366 4/5/2016 11:59:59 PM     53.3     117.5064  22 22.97
## 2 1927972279 4/10/2016 6:33:26 PM    129.6     285.7191  NA 46.17
## 3 2347167796 4/3/2016 11:59:59 PM     63.4     139.7731  10 24.77
## 4 2873212765 4/6/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 4/7/2016 11:59:59 PM     57.2     126.1044  NA 21.65
## 6 2891001357 4/5/2016 11:59:59 PM     88.4     194.8886  NA 25.03
##   IsManualReport        LogId
## 1           True 1.459901e+12
## 2          False 1.460313e+12
## 3           True 1.459728e+12
## 4           True 1.459987e+12
## 5           True 1.460074e+12
## 6           True 1.459901e+12
head(daily_activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
head(sleep_day)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
nrow(daily_activity)
## [1] 940
nrow(sleep_day)
## [1] 413

After examining the results, here is what I found:

  • all files share one column, “Id”, so merging them is tempting
  • the files describing hourly activity have the same number of rows
  • there are only 3 column types (character, numeric and integer)
  • column names are written in PascalCase, which is not to my liking, so I will change them to snake_case
  • there were more participants in the daily_activity dataset than in the sleep dataset

Process

  • Merging files
data_by_hours <- bind_cols(hourly_calories, hourly_intensities, hourly_steps)
## New names:
## • `Id` -> `Id...1`
## • `ActivityHour` -> `ActivityHour...2`
## • `Id` -> `Id...4`
## • `ActivityHour` -> `ActivityHour...5`
## • `Id` -> `Id...8`
## • `ActivityHour` -> `ActivityHour...9`
  • Examining the new dataset
head(data_by_hours)
##       Id...1      ActivityHour...2 Calories     Id...4      ActivityHour...5
## 1 1503960366 3/12/2016 12:00:00 AM       48 1503960366 3/12/2016 12:00:00 AM
## 2 1503960366  3/12/2016 1:00:00 AM       48 1503960366  3/12/2016 1:00:00 AM
## 3 1503960366  3/12/2016 2:00:00 AM       48 1503960366  3/12/2016 2:00:00 AM
## 4 1503960366  3/12/2016 3:00:00 AM       48 1503960366  3/12/2016 3:00:00 AM
## 5 1503960366  3/12/2016 4:00:00 AM       48 1503960366  3/12/2016 4:00:00 AM
## 6 1503960366  3/12/2016 5:00:00 AM       48 1503960366  3/12/2016 5:00:00 AM
##   TotalIntensity AverageIntensity     Id...8      ActivityHour...9 StepTotal
## 1              0                0 1503960366 3/12/2016 12:00:00 AM         0
## 2              0                0 1503960366  3/12/2016 1:00:00 AM         0
## 3              0                0 1503960366  3/12/2016 2:00:00 AM         0
## 4              0                0 1503960366  3/12/2016 3:00:00 AM         0
## 5              0                0 1503960366  3/12/2016 4:00:00 AM         0
## 6              0                0 1503960366  3/12/2016 5:00:00 AM         0
colnames(data_by_hours)
##  [1] "Id...1"           "ActivityHour...2" "Calories"         "Id...4"          
##  [5] "ActivityHour...5" "TotalIntensity"   "AverageIntensity" "Id...8"          
##  [9] "ActivityHour...9" "StepTotal"
  • Deleting duplicate columns
data_by_hours <- subset(data_by_hours, select = c(-4,-5,-8,-9))
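bind_cols() pairs rows purely by position, so the merge above is only correct if all three hourly tables list the same Id and ActivityHour in the same order. A hedged sketch of a key-based alternative using base R merge(); the small tables are stand-ins, the StepTotal values and the reversed row order are made up for illustration, and the names `cal`, `int`, `stp` and `joined` are my own.

```r
keys  <- c("Id", "ActivityHour")
hours <- c("3/12/2016 12:00:00 AM", "3/12/2016 1:00:00 AM", "3/12/2016 2:00:00 AM")

# Stand-in tables; StepTotal values are illustrative, not taken from the dataset
cal <- data.frame(Id = 1503960366, ActivityHour = hours, Calories = c(48L, 48L, 48L))
int <- data.frame(Id = 1503960366, ActivityHour = hours,
                  TotalIntensity = 0L, AverageIntensity = 0)
stp <- data.frame(Id = 1503960366, ActivityHour = rev(hours),  # deliberately reversed
                  StepTotal = c(30L, 20L, 10L))

# Join the three tables on the shared keys; row order no longer matters
joined <- Reduce(function(x, y) merge(x, y, by = keys), list(cal, int, stp))

names(joined)
joined
```

A join keeps a single copy of Id and ActivityHour, so no duplicate columns need to be dropped afterwards. In dplyr the equivalent would be inner_join() with the same `by` columns.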
  • Renaming columns
data_by_hours <- data_by_hours %>% rename(id=Id...1, activity_hour=ActivityHour...2, calories=Calories, total_intensity=TotalIntensity, avg_intensity=AverageIntensity, steps=StepTotal)

colnames(daily_activity)
daily_activity <- daily_activity %>% rename(id=Id, activity_date = ActivityDate, total_steps = TotalSteps, total_distance = TotalDistance, tracker_distance = TrackerDistance, logged_activities_distance = LoggedActivitiesDistance, very_active_distance = VeryActiveDistance, moderately_active_distance = ModeratelyActiveDistance, light_active_distance = LightActiveDistance, sedentary_active_distance = SedentaryActiveDistance, very_active_minutes = VeryActiveMinutes, fairly_active_minutes = FairlyActiveMinutes, lightly_active_minutes = LightlyActiveMinutes, sedentary_minutes = SedentaryMinutes, calories = Calories)

colnames(sleep_day)
sleep_day <- sleep_day %>% rename(id = Id, sleep_day = SleepDay, total_sleep_records = TotalSleepRecords, total_minutes_asleep = TotalMinutesAsleep, total_time_in_bed = TotalTimeInBed)
head(data_by_hours)
##           id         activity_hour calories total_intensity avg_intensity steps
## 1 1503960366 3/12/2016 12:00:00 AM       48               0             0     0
## 2 1503960366  3/12/2016 1:00:00 AM       48               0             0     0
## 3 1503960366  3/12/2016 2:00:00 AM       48               0             0     0
## 4 1503960366  3/12/2016 3:00:00 AM       48               0             0     0
## 5 1503960366  3/12/2016 4:00:00 AM       48               0             0     0
## 6 1503960366  3/12/2016 5:00:00 AM       48               0             0     0
head(daily_activity)
##           id activity_date total_steps total_distance tracker_distance
## 1 1503960366     4/12/2016       13162           8.50             8.50
## 2 1503960366     4/13/2016       10735           6.97             6.97
## 3 1503960366     4/14/2016       10460           6.74             6.74
## 4 1503960366     4/15/2016        9762           6.28             6.28
## 5 1503960366     4/16/2016       12669           8.16             8.16
## 6 1503960366     4/17/2016        9705           6.48             6.48
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 1.88                       0.55
## 2                          0                 1.57                       0.69
## 3                          0                 2.44                       0.40
## 4                          0                 2.14                       1.26
## 5                          0                 2.71                       0.41
## 6                          0                 3.19                       0.78
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  6.06                         0                  25
## 2                  4.71                         0                  21
## 3                  3.91                         0                  30
## 4                  2.83                         0                  29
## 5                  5.04                         0                  36
## 6                  2.51                         0                  38
##   fairly_active_minutes lightly_active_minutes sedentary_minutes calories
## 1                    13                    328               728     1985
## 2                    19                    217               776     1797
## 3                    11                    181              1218     1776
## 4                    34                    209               726     1745
## 5                    10                    221               773     1863
## 6                    20                    164               539     1728
head(sleep_day)
##           id             sleep_day total_sleep_records total_minutes_asleep
## 1 1503960366 4/12/2016 12:00:00 AM                   1                  327
## 2 1503960366 4/13/2016 12:00:00 AM                   2                  384
## 3 1503960366 4/15/2016 12:00:00 AM                   1                  412
## 4 1503960366 4/16/2016 12:00:00 AM                   2                  340
## 5 1503960366 4/17/2016 12:00:00 AM                   1                  700
## 6 1503960366 4/19/2016 12:00:00 AM                   1                  304
##   total_time_in_bed
## 1               346
## 2               407
## 3               442
## 4               367
## 5               712
## 6               320
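The rename() calls above list every column by hand. As a sketch of a more compact route, assuming a plain PascalCase to snake_case conversion is acceptable, the names can be converted with a small helper. `to_snake` is a hypothetical helper of mine, not part of any package; janitor's clean_names() (the package is already loaded above) performs a similar conversion.

```r
# Hypothetical helper: convert PascalCase names to snake_case
to_snake <- function(x) {
  # Put "_" between a lowercase letter or digit and the capital that follows,
  # then lowercase everything
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

sleep_cols <- c("Id", "SleepDay", "TotalSleepRecords",
                "TotalMinutesAsleep", "TotalTimeInBed")
to_snake(sleep_cols)

# Applied to a data frame it would look like:
# names(sleep_day) <- to_snake(names(sleep_day))
```

This helper would produce average_intensity and step_total rather than the shorter avg_intensity and steps chosen above, so hand-written names are still needed wherever a custom name is preferred.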
  • Changing DATE format
# Parse the month/day/year 12-hour timestamps straight into POSIXct (no need to format back to text and parse again)
data_by_hours$activity_hour <- as_datetime(data_by_hours$activity_hour, format = "%m/%d/%Y %I:%M:%S %p")

str(data_by_hours)
## 'data.frame':    24084 obs. of  6 variables:
##  $ id             : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ activity_hour  : POSIXct, format: "2016-03-12 00:00:00" "2016-03-12 01:00:00" ...
##  $ calories       : int  48 48 48 48 48 48 48 48 48 49 ...
##  $ total_intensity: int  0 0 0 0 0 0 0 0 0 1 ...
##  $ avg_intensity  : num  0 0 0 0 0 ...
##  $ steps          : int  0 0 0 0 0 0 0 0 0 8 ...
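A quick way to check that the 12-hour timestamps parse as intended, sketched in base R rather than lubridate. The first two strings come from the data shown above; the PM value is added for illustration, and the AM/PM marker assumes an English or C locale.

```r
raw <- c("3/12/2016 12:00:00 AM", "3/12/2016 1:00:00 AM", "3/12/2016 1:00:00 PM")

# %I is the 12-hour clock hour and %p the AM/PM marker
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Show the result on a 24-hour clock: 12 AM should become 00, 1 PM should become 13
as_24h <- format(parsed, "%Y-%m-%d %H:%M:%S")
as_24h
```

If any element came back as NA, the format string would not match the raw text and the column would need another look before analysis.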

Analysis

  • Summary statistics and visuals
daily_activity %>%
  select(total_steps,
         total_distance,
         sedentary_minutes,
         calories) %>%
  summary()
##   total_steps    total_distance   sedentary_minutes    calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0    Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8    1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :1057.5    Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2    Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5    3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0    Max.   :4900

Here I made a basic summary of the daily_activity dataset. At first glance the pattern seems clear: the more you walk, the more calories you burn. But summary() only describes each column on its own, so is this accurate for all tested subjects? Next is a visualization built from two variables, total_steps and calories.
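The chart itself is not reproduced in this text. As a rough sketch of how a steps versus calories scatterplot could be drawn with ggplot2, assuming the renamed daily_activity columns: the six rows are copied from the head(daily_activity) output above as a stand-in, and the object name and labels are my own choice, so the original figure may have looked different.

```r
library(ggplot2)

# Stand-in sample: first six rows of daily_activity as shown earlier
daily_sample <- data.frame(
  total_steps = c(13162, 10735, 10460, 9762, 12669, 9705),
  calories    = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# One point per day: steps on the x axis, calories on the y axis
steps_vs_calories <- ggplot(daily_sample, aes(x = total_steps, y = calories)) +
  geom_point() +
  labs(title = "Daily steps vs. calories burned",
       x = "Total steps", y = "Calories")

print(steps_vs_calories)
```

On the full data, daily_sample would simply be replaced by daily_activity.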

This visualization contradicts the previous impression. Steps and calories are not directly proportional, which suggests that other factors also affect the calories burned. The dataset needs further examination. I decided to explore these two values further and created a visualization showing the total steps and calories burned by each Fitbit user over the whole month.

calories <- hourly_calories %>% group_by(Id) %>% 
  summarise(sum_calories=sum(Calories),
            .groups = 'drop')

steps <- hourly_steps %>% group_by(Id) %>% 
  summarise(sum_steps=sum(StepTotal),
            .groups = 'drop')

calories_steps_correlation <- merge(calories, steps, by ="Id")

Moving on to another summary:

sleep_day %>%
  select(total_sleep_records,
         total_minutes_asleep,
         total_time_in_bed) %>%
  summary()
##  total_sleep_records total_minutes_asleep total_time_in_bed
##  Min.   :1.000       Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:1.000       1st Qu.:361.0        1st Qu.:403.0    
##  Median :1.000       Median :433.0        Median :463.0    
##  Mean   :1.119       Mean   :419.5        Mean   :458.6    
##  3rd Qu.:1.000       3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :3.000       Max.   :796.0        Max.   :961.0

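As before, this summary suggests that the more one sleeps, the more time one spends in bed: the quartiles of total_time_in_bed sit consistently a little above those of total_minutes_asleep (median 463 versus 433 minutes).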

This visualization supports the previous insight. The points of the scatterplot lie very close to the trend line, which implies a strong correlation between minutes asleep and time spent in bed.
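The strength of that relationship can also be put into a number. A small sketch using the Pearson correlation from cor(); the six rows are the ones shown in the head(sleep_day) output above, so the result describes only this sample, not the full 413 rows.

```r
# Stand-in sample: first six rows of sleep_day as shown earlier
sleep_sample <- data.frame(
  total_minutes_asleep = c(327, 384, 412, 340, 700, 304),
  total_time_in_bed    = c(346, 407, 442, 367, 712, 320)
)

# Pearson correlation: values near 1 mean the points lie close to a rising line
r <- cor(sleep_sample$total_minutes_asleep, sleep_sample$total_time_in_bed)
r

# On the full data the same call would be:
# cor(sleep_day$total_minutes_asleep, sleep_day$total_time_in_bed)
```

A coefficient close to 1 would be consistent with the tight fit around the trend line.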

Summary

In this case study I showed what I have learned about data analysis in R. I downloaded a dataset from a public source and loaded it into RStudio, used functions from the tidyverse, cleaned and processed the data, ran a few aggregations, plotted clear and easy-to-understand charts, and delivered a couple of insights based on the story the data told me. I welcome constructive criticism on how to improve my skills and on what could have been done better. I am always open to acquiring new knowledge :)