This is my first case study, showcasing the skills I have acquired in R programming. For this purpose I will use public data on the daily habits of smart device users.
DATASET: FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users' habits.
I downloaded the dataset from Kaggle in .csv format and saved it on my hard drive. Then, in RStudio, I installed the packages needed for the task.
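install.packages("tidyverse", repos = "https://cran.us.r-project.org")
install.packages("data.table", repos = "https://cran.us.r-project.org")
install.packages("rmarkdown", repos = "https://cran.us.r-project.org")
# skimr and janitor are loaded below but are not part of the tidyverse
install.packages(c("skimr", "janitor"), repos = "https://cran.us.r-project.org")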
library(data.table)
library(readr)
library(dplyr)
library(tidyr)
library(skimr)
library(janitor)
library(lubridate)
library(ggplot2)
library(scales)
library(rmarkdown)
Then I set my working directory to the folder in which I saved the dataset, using the setwd() function, and loaded the .csv files.
hourly_calories <- read.csv("mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlyCalories_merged.csv")
hourly_intensities <- read.csv("mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlyIntensities_merged.csv")
hourly_steps <- read.csv("mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlySteps_merged.csv")
weight_log_info <- read.csv("mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/weightLogInfo_merged.csv")
daily_activity <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
sleep_day <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
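The six read.csv() calls above repeat the same long folder names. A minimal sketch of one way to shorten them, assuming a hypothetical helper called read_fitbit() (my own name, not part of any package). The demonstration writes a tiny made-up file to a temporary folder so that it runs anywhere:

```r
# Hypothetical helper: builds the path with file.path() and stops early
# with a readable message when the file is missing.
read_fitbit <- function(export_dir, file_name) {
  path <- file.path(export_dir, file_name)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path)
}

# Demonstration on a temporary folder (made-up values, illustration only).
demo_dir <- file.path(tempdir(), "fitbit_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(
  data.frame(Id = c(1, 1), Calories = c(48, 49)),
  file.path(demo_dir, "hourlyCalories_merged.csv"),
  row.names = FALSE
)

demo_calories <- read_fitbit(demo_dir, "hourlyCalories_merged.csv")
str(demo_calories)
```

With the real data, the first argument would be one of the export folders used above and the second the file name, for example "hourlyCalories_merged.csv".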
After loading all the files into RStudio I began to examine them. To save time, I combined all the tables into a single object. Note that c() does not keep the data frames separate: it flattens them into one list of their 38 columns.
all_files <- c(hourly_calories, hourly_intensities, hourly_steps, weight_log_info, daily_activity, sleep_day)
First I used glimpse() to get an overview of the columns and their types.
glimpse(all_files)
## List of 38
## $ Id : num [1:24084] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour : chr [1:24084] "3/12/2016 12:00:00 AM" "3/12/2016 1:00:00 AM" "3/12/2016 2:00:00 AM" "3/12/2016 3:00:00 AM" ...
## $ Calories : int [1:24084] 48 48 48 48 48 48 48 48 48 49 ...
## $ Id : num [1:24084] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour : chr [1:24084] "3/12/2016 12:00:00 AM" "3/12/2016 1:00:00 AM" "3/12/2016 2:00:00 AM" "3/12/2016 3:00:00 AM" ...
## $ TotalIntensity : int [1:24084] 0 0 0 0 0 0 0 0 0 1 ...
## $ AverageIntensity : num [1:24084] 0 0 0 0 0 ...
## $ Id : num [1:24084] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour : chr [1:24084] "3/12/2016 12:00:00 AM" "3/12/2016 1:00:00 AM" "3/12/2016 2:00:00 AM" "3/12/2016 3:00:00 AM" ...
## $ StepTotal : int [1:24084] 0 0 0 0 0 0 0 0 0 8 ...
## $ Id : num [1:33] 1.50e+09 1.93e+09 2.35e+09 2.87e+09 2.87e+09 ...
## $ Date : chr [1:33] "4/5/2016 11:59:59 PM" "4/10/2016 6:33:26 PM" "4/3/2016 11:59:59 PM" "4/6/2016 11:59:59 PM" ...
## $ WeightKg : num [1:33] 53.3 129.6 63.4 56.7 57.2 ...
## $ WeightPounds : num [1:33] 118 286 140 125 126 ...
## $ Fat : int [1:33] 22 NA 10 NA NA NA NA NA NA NA ...
## $ BMI : num [1:33] 23 46.2 24.8 21.5 21.6 ...
## $ IsManualReport : chr [1:33] "True" "False" "True" "True" ...
## $ LogId : num [1:33] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int [1:940] 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int [1:940] 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int [1:940] 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep : int [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int [1:413] 346 407 442 367 712 320 377 364 384 449 ...
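The output above is one flat list of 38 columns, so it is hard to tell where one table ends and the next begins. A minimal sketch of an alternative, assuming a named list() is acceptable for this step. The two small tables are made-up stand-ins for the real ones:

```r
# Two small stand-in tables (made-up values, for illustration only).
calories_demo <- data.frame(Id = c(1, 2), Calories = c(48L, 49L))
steps_demo <- data.frame(Id = c(1, 2), StepTotal = c(0L, 8L))

# c() flattens the data frames into one list of columns.
flattened <- c(calories_demo, steps_demo)
length(flattened)  # 4: one element per column

# list() keeps each table intact and lets us name it.
tables <- list(calories = calories_demo, steps = steps_demo)
length(tables)     # 2: one element per table

# Dimensions of every table, then the structure of each one in turn.
table_dims <- lapply(tables, dim)
for (name in names(tables)) {
  cat("\n", name, "\n")
  str(tables[[name]])
}
```

The same loop would work on the six real tables once they are put into a named list.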
Then, to look at each file in more detail, I used the head() function.
head(hourly_calories)
## Id ActivityHour Calories
## 1 1503960366 3/12/2016 12:00:00 AM 48
## 2 1503960366 3/12/2016 1:00:00 AM 48
## 3 1503960366 3/12/2016 2:00:00 AM 48
## 4 1503960366 3/12/2016 3:00:00 AM 48
## 5 1503960366 3/12/2016 4:00:00 AM 48
## 6 1503960366 3/12/2016 5:00:00 AM 48
head(hourly_intensities)
## Id ActivityHour TotalIntensity AverageIntensity
## 1 1503960366 3/12/2016 12:00:00 AM 0 0
## 2 1503960366 3/12/2016 1:00:00 AM 0 0
## 3 1503960366 3/12/2016 2:00:00 AM 0 0
## 4 1503960366 3/12/2016 3:00:00 AM 0 0
## 5 1503960366 3/12/2016 4:00:00 AM 0 0
## 6 1503960366 3/12/2016 5:00:00 AM 0 0
head(hourly_steps)
## Id ActivityHour StepTotal
## 1 1503960366 3/12/2016 12:00:00 AM 0
## 2 1503960366 3/12/2016 1:00:00 AM 0
## 3 1503960366 3/12/2016 2:00:00 AM 0
## 4 1503960366 3/12/2016 3:00:00 AM 0
## 5 1503960366 3/12/2016 4:00:00 AM 0
## 6 1503960366 3/12/2016 5:00:00 AM 0
head(weight_log_info)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 4/5/2016 11:59:59 PM 53.3 117.5064 22 22.97
## 2 1927972279 4/10/2016 6:33:26 PM 129.6 285.7191 NA 46.17
## 3 2347167796 4/3/2016 11:59:59 PM 63.4 139.7731 10 24.77
## 4 2873212765 4/6/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 4/7/2016 11:59:59 PM 57.2 126.1044 NA 21.65
## 6 2891001357 4/5/2016 11:59:59 PM 88.4 194.8886 NA 25.03
## IsManualReport LogId
## 1 True 1.459901e+12
## 2 False 1.460313e+12
## 3 True 1.459728e+12
## 4 True 1.459987e+12
## 5 True 1.460074e+12
## 6 True 1.459901e+12
head(daily_activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
head(sleep_day)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
nrow(daily_activity)
## [1] 940
nrow(sleep_day)
## [1] 413
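After examining the results, here is what I found: daily_activity has 940 rows from 33 distinct users, while sleep_day has 413 rows from only 24 users, so not every user logged sleep. The three hourly tables each have 24084 rows and share the Id and ActivityHour columns, so I combined them into one table: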
data_by_hours <- bind_cols(hourly_calories, hourly_intensities, hourly_steps)
## New names:
## • `Id` -> `Id...1`
## • `ActivityHour` -> `ActivityHour...2`
## • `Id` -> `Id...4`
## • `ActivityHour` -> `ActivityHour...5`
## • `Id` -> `Id...8`
## • `ActivityHour` -> `ActivityHour...9`
head(data_by_hours)
## Id...1 ActivityHour...2 Calories Id...4 ActivityHour...5
## 1 1503960366 3/12/2016 12:00:00 AM 48 1503960366 3/12/2016 12:00:00 AM
## 2 1503960366 3/12/2016 1:00:00 AM 48 1503960366 3/12/2016 1:00:00 AM
## 3 1503960366 3/12/2016 2:00:00 AM 48 1503960366 3/12/2016 2:00:00 AM
## 4 1503960366 3/12/2016 3:00:00 AM 48 1503960366 3/12/2016 3:00:00 AM
## 5 1503960366 3/12/2016 4:00:00 AM 48 1503960366 3/12/2016 4:00:00 AM
## 6 1503960366 3/12/2016 5:00:00 AM 48 1503960366 3/12/2016 5:00:00 AM
## TotalIntensity AverageIntensity Id...8 ActivityHour...9 StepTotal
## 1 0 0 1503960366 3/12/2016 12:00:00 AM 0
## 2 0 0 1503960366 3/12/2016 1:00:00 AM 0
## 3 0 0 1503960366 3/12/2016 2:00:00 AM 0
## 4 0 0 1503960366 3/12/2016 3:00:00 AM 0
## 5 0 0 1503960366 3/12/2016 4:00:00 AM 0
## 6 0 0 1503960366 3/12/2016 5:00:00 AM 0
colnames(data_by_hours)
## [1] "Id...1" "ActivityHour...2" "Calories" "Id...4"
## [5] "ActivityHour...5" "TotalIntensity" "AverageIntensity" "Id...8"
## [9] "ActivityHour...9" "StepTotal"
data_by_hours <- subset(data_by_hours, select = c(-4,-5,-8,-9))
data_by_hours <- data_by_hours %>% rename(id=Id...1, activity_hour=ActivityHour...2, calories=Calories, total_intensity=TotalIntensity, avg_intensity=AverageIntensity, steps=StepTotal)
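bind_cols() glues the tables together by row position, so it only gives correct results when all three tables are in exactly the same row order. A minimal sketch of a key-based alternative, assuming Id and ActivityHour together identify a row. The tables below are made-up stand-ins, with the second one deliberately shuffled:

```r
# Stand-in hourly tables (made-up values); steps_demo is in a different
# row order on purpose.
calories_demo <- data.frame(
  Id = c(1, 1, 2),
  ActivityHour = c("3/12/2016 12:00:00 AM", "3/12/2016 1:00:00 AM",
                   "3/12/2016 12:00:00 AM"),
  Calories = c(48L, 49L, 60L)
)
steps_demo <- data.frame(
  Id = c(2, 1, 1),
  ActivityHour = c("3/12/2016 12:00:00 AM", "3/12/2016 1:00:00 AM",
                   "3/12/2016 12:00:00 AM"),
  StepTotal = c(100L, 8L, 0L)
)

# merge() matches rows on the shared key columns, not on row position,
# and the key columns appear only once in the result.
joined_demo <- merge(calories_demo, steps_demo, by = c("Id", "ActivityHour"))
joined_demo
```

A join also removes the need to drop the duplicated Id and ActivityHour columns afterwards.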
colnames(daily_activity)
daily_activity <- daily_activity %>% rename(id=Id, activity_date = ActivityDate, total_steps = TotalSteps, total_distance = TotalDistance, tracker_distance = TrackerDistance, logged_activities_distance = LoggedActivitiesDistance, very_active_distance = VeryActiveDistance, moderately_active_distance = ModeratelyActiveDistance, light_active_distance = LightActiveDistance, sedentary_active_distance = SedentaryActiveDistance, very_active_minutes = VeryActiveMinutes, fairly_active_minutes = FairlyActiveMinutes, lightly_active_minutes = LightlyActiveMinutes, sedentary_minutes = SedentaryMinutes, calories = Calories)
colnames(sleep_day)
sleep_day <- sleep_day %>% rename(id = Id, sleep_day = SleepDay, total_sleep_records = TotalSleepRecords, total_minutes_asleep = TotalMinutesAsleep, total_time_in_bed = TotalTimeInBed)
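Typing every new column name by hand is easy to get wrong. A minimal sketch of doing the same renaming with a rule, assuming a hypothetical helper to_snake_case() (my own name) and that every column follows the CamelCase pattern seen above. The janitor package, already loaded at the top, offers clean_names() for the same purpose. Note that a rule like this gives average_intensity and step_total rather than the shorter avg_intensity and steps chosen by hand.

```r
# Hypothetical helper: insert "_" between a lower-case letter or digit and
# the capital letter that follows it, then lower-case everything.
to_snake_case <- function(x) {
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

# First row of sleep_day as printed by head() above.
sleep_demo <- data.frame(
  Id = 1503960366,
  SleepDay = "4/12/2016 12:00:00 AM",
  TotalSleepRecords = 1L,
  TotalMinutesAsleep = 327L,
  TotalTimeInBed = 346L
)

names(sleep_demo) <- to_snake_case(names(sleep_demo))
names(sleep_demo)
```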
head(data_by_hours)
## id activity_hour calories total_intensity avg_intensity steps
## 1 1503960366 3/12/2016 12:00:00 AM 48 0 0 0
## 2 1503960366 3/12/2016 1:00:00 AM 48 0 0 0
## 3 1503960366 3/12/2016 2:00:00 AM 48 0 0 0
## 4 1503960366 3/12/2016 3:00:00 AM 48 0 0 0
## 5 1503960366 3/12/2016 4:00:00 AM 48 0 0 0
## 6 1503960366 3/12/2016 5:00:00 AM 48 0 0 0
head(daily_activity)
## id activity_date total_steps total_distance tracker_distance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## logged_activities_distance very_active_distance moderately_active_distance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## light_active_distance sedentary_active_distance very_active_minutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## fairly_active_minutes lightly_active_minutes sedentary_minutes calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
head(sleep_day)
## id sleep_day total_sleep_records total_minutes_asleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## total_time_in_bed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
# Parse the month/day/year 12-hour timestamps into POSIXct (UTC) in one step
data_by_hours$activity_hour <- as_datetime(data_by_hours$activity_hour, format = "%m/%d/%Y %I:%M:%S %p")
str(data_by_hours)
## 'data.frame': 24084 obs. of 6 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_hour : POSIXct, format: "2016-03-12 00:00:00" "2016-03-12 01:00:00" ...
## $ calories : int 48 48 48 48 48 48 48 48 48 49 ...
## $ total_intensity: int 0 0 0 0 0 0 0 0 0 1 ...
## $ avg_intensity : num 0 0 0 0 0 ...
## $ steps : int 0 0 0 0 0 0 0 0 0 8 ...
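The 12-hour clock is the easiest part of this conversion to get wrong, because 12:00:00 AM is midnight and 12:00:00 PM is noon. A minimal sketch of a quick check, written with base R as.POSIXct() so it needs no extra packages, and assuming an English locale for the AM/PM marker:

```r
# Timestamps in the same format as the ActivityHour column.
hours_demo <- c("3/12/2016 12:00:00 AM", "3/12/2016 1:00:00 AM",
                "3/12/2016 12:00:00 PM", "3/12/2016 11:00:00 PM")

# %m/%d/%Y = month/day/year, %I = hour on the 12-hour clock, %p = AM/PM.
parsed_demo <- as.POSIXct(hours_demo,
                          format = "%m/%d/%Y %I:%M:%S %p",
                          tz = "UTC")

# Show the result on the 24-hour clock to confirm midnight and noon.
format(parsed_demo, "%Y-%m-%d %H:%M")
```

Any NA in the result would mean that a timestamp did not match the format string.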
daily_activity %>%
select(total_steps,
total_distance,
sedentary_minutes,
calories) %>%
summary()
## total_steps total_distance sedentary_minutes calories
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.:1828
## Median : 7406 Median : 5.245 Median :1057.5 Median :2134
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean :2304
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :4900
Here I made a basic summary of the daily_activity dataset. At first glance the pattern seems clear: the more you walk, the more time you spend resting and the more calories you burn. But a summary only describes each column separately, so does this hold for all tested subjects? Next is a visualization of two variables, total_steps and calories:
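A minimal sketch of how such a plot could be drawn with ggplot2, assuming a scatterplot with a linear trend line is wanted. The sample holds only the first six rows of daily_activity printed by head() above, which all belong to one user, so it exercises the plotting code without saying anything about the full dataset:

```r
library(ggplot2)

# First six rows of daily_activity as printed by head() above (one user).
daily_activity_sample <- data.frame(
  total_steps = c(13162, 10735, 10460, 9762, 12669, 9705),
  calories = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# One point per day, plus a straight trend line without a confidence band.
steps_calories_plot <- ggplot(daily_activity_sample,
                              aes(x = total_steps, y = calories)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Total steps vs. calories",
       x = "Total steps per day",
       y = "Calories per day")

steps_calories_plot
```

On the full table, the data argument would simply be daily_activity.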
This visualization contradicts the previous insight. Steps and calories are not directly proportional, which suggests that other factors also affect the calories burnt. The dataset needs further examination. I decided to explore these two values some more and created a visualization showing the steps taken and calories burnt by each Fitbit user over a whole month.
calories <- hourly_calories %>% group_by(Id) %>%
summarise(sum_calories=sum(Calories),
.groups = 'drop')
steps <- hourly_steps %>% group_by(Id) %>%
summarise(sum_steps=sum(StepTotal),
.groups = 'drop')
calories_steps_corellation <- merge(calories, steps, by ="Id")
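Besides plotting the per-user totals, the relationship can be summarised in a single number. A minimal sketch, assuming the Pearson correlation coefficient is a suitable measure here. It uses base R aggregate() in place of group_by() and summarise() so that it runs without extra packages, and the hourly rows are made up for illustration:

```r
# Made-up hourly rows for three users (illustration only).
hourly_calories_demo <- data.frame(
  Id = c(1, 1, 2, 2, 3, 3),
  Calories = c(50, 70, 60, 90, 80, 120)
)
hourly_steps_demo <- data.frame(
  Id = c(1, 1, 2, 2, 3, 3),
  StepTotal = c(0, 400, 100, 900, 300, 1500)
)

# Per-user totals: the base R counterpart of group_by() + summarise().
calories_demo <- aggregate(Calories ~ Id, data = hourly_calories_demo, FUN = sum)
steps_demo <- aggregate(StepTotal ~ Id, data = hourly_steps_demo, FUN = sum)

# One row per user with both totals side by side.
totals_demo <- merge(calories_demo, steps_demo, by = "Id")

# Pearson correlation between total steps and total calories per user.
r_demo <- cor(totals_demo$StepTotal, totals_demo$Calories)
r_demo
```

With only 33 users in the real data, a coefficient like this should be read as a rough indication rather than firm evidence.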
Moving on to another summary:
sleep_day %>%
select(total_sleep_records,
total_minutes_asleep,
total_time_in_bed) %>%
summary()
## total_sleep_records total_minutes_asleep total_time_in_bed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
Just like before, this summary suggests that the more time a user spends asleep, the more time they spend in bed.
This visualization supports the previous insight. The points lie very close to the trend line of the scatterplot, which implies a strong correlation between minutes asleep and time spent in bed.
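How close the points are to the trend line can also be expressed as a number. A minimal sketch using base R cor(), assuming the Pearson coefficient is an acceptable measure. The sample holds only the first six rows of sleep_day printed by head() above (a single user), so the value is an illustration and not a result for the whole dataset:

```r
# First six rows of sleep_day as printed by head() above (one user only).
sleep_sample <- data.frame(
  total_minutes_asleep = c(327, 384, 412, 340, 700, 304),
  total_time_in_bed = c(346, 407, 442, 367, 712, 320)
)

# Pearson correlation: values near 1 mean the points hug a rising line.
sleep_r <- cor(sleep_sample$total_minutes_asleep,
               sleep_sample$total_time_in_bed)
sleep_r

# Minutes spent in bed but not asleep, per night.
sleep_sample$minutes_awake_in_bed <-
  sleep_sample$total_time_in_bed - sleep_sample$total_minutes_asleep
sleep_sample
```

Time asleep is part of time in bed, so a high correlation between the two is expected. The minutes spent awake in bed may be the more informative quantity to look at next.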
In this case study I showed what I have learned about data analysis in R. I downloaded a dataset from a public source and loaded it into RStudio, used functions from the tidyverse, cleaned and processed the data, ran a few aggregations, plotted clear and easy-to-understand charts, and delivered a couple of insights based on the story the data told me. I am open to constructive criticism on how to improve my skills and what could have been done better for next time. I am always open to acquiring new knowledge :)