Submitted as Course Project for the course Getting and Cleaning Data (Coursera) by Diksha Jain


This file gives an overview of the files uploaded as part of this project (repo) and a detailed, step-by-step description of how the data is tidied.


Human Activity Recognition Using Smartphones

Files in this repository

  1. run_analysis.R: This file is an R script that performs the following actions as required by this project:
  • Merges the training and the test sets to create one data set.
  • Extracts only the measurements on the mean and standard deviation for each measurement.
  • Uses descriptive activity names to name the activities in the data set
  • Appropriately labels the data set with descriptive variable names.
  • From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.
  2. The codebook: explains the features present in the tidy data set contained in AggData.txt.

  3. AggData.txt: A text file containing the final tidy data set created in the last step of the project goals.

How run_analysis.R works

The file run_analysis.R is an R script that performs the following tasks on Human Activity Recognition data recorded using a Samsung Galaxy S smartphone:
1 - Merges the training and the test sets to create one data set.
2 - Extracts only the measurements on the mean and standard deviation for each measurement.
3 - Uses descriptive activity names to name the activities in the data set
4 - Appropriately labels the data set with descriptive variable names.
5 - From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

Downloading the files

Download the zip file containing the data files into a folder on your local computer and extract it. The extracted folder will be your working directory for the rest of this exercise.

Reading Data Set

  1. Read train/X_train.txt into a variable. Convert it into a 7352x561 matrix, then into a data frame.

  2. Read the train/subject_train.txt into a variable containing information about subjects.

  3. Read the train/y_train.txt into a variable containing information about activity labels.

  4. Using cbind(), bind the three variables created above and assign the result to a new variable.

Along the same lines (steps 1-4 above), read the test data from the test/ directory into variables and bind them into a new variable.
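A minimal sketch of steps 1-4 for the training set follows. So that it runs without the real download, it first writes tiny stand-in files to a temporary directory. The variable names (x_train, subject_train, y_train, train) and the column names "subject" and "y" are assumptions made for illustration, not necessarily the names used in run_analysis.R. The sketch also relies on read.table() returning a data frame directly, rather than converting through a matrix.

```r
# Tiny stand-in files so the sketch runs without the real data set (assumption).
data_dir <- file.path(tempdir(), "har_demo")
dir.create(file.path(data_dir, "train"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("0.1 0.2 0.3", "0.4 0.5 0.6"),
           file.path(data_dir, "train", "X_train.txt"))
writeLines(c("1", "3"), file.path(data_dir, "train", "subject_train.txt"))
writeLines(c("5", "2"), file.path(data_dir, "train", "y_train.txt"))

# Step 1: measurements (read.table() already returns a data frame)
x_train <- read.table(file.path(data_dir, "train", "X_train.txt"))

# Step 2: subject identifiers
subject_train <- read.table(file.path(data_dir, "train", "subject_train.txt"),
                            col.names = "subject")

# Step 3: activity codes
y_train <- read.table(file.path(data_dir, "train", "y_train.txt"),
                      col.names = "y")

# Step 4: bind the three pieces column-wise
train <- cbind(x_train, subject_train, y_train)
dim(train)
```

With the real files, pointing data_dir at the extracted folder and repeating the same calls on the test/ directory would give the second combined variable.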

1. Merging the training and test sets to create one data set

Using rbind(), merge the combined training and test variables (created in the previous section) into a new variable.
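As a small illustration of the merge, the sketch below stacks two toy data frames with rbind(). The data and the names train, test, and merged are hypothetical stand-ins; the only requirement is that both data frames have the same columns.

```r
# Toy stand-ins for the combined training and test data (hypothetical values).
train <- data.frame(V1 = c(0.1, 0.4), subject = c(1L, 3L), y = c(5L, 2L))
test  <- data.frame(V1 = 0.7,         subject = 2L,        y = 1L)

# rbind() stacks the rows; the column names must match in both data frames.
merged <- rbind(train, test)
nrow(merged)
```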

2. Extracting only the measurements on the mean and standard deviation for each measurement

  1. Read the feature names from the file features.txt into a variable features.

  2. Set the column names of the data frame from the vector features created in the previous step.

  3. Using grep(), find the indices of the column names that contain the words "mean" or "std".

  4. Keep only the columns at the indices found in step 3, together with the subject and y columns, in the data frame.
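The selection can be sketched as below on a toy data frame. The four feature names are examples in the style of features.txt, and the variable names (merged, keep, selected) are assumptions for illustration. Note that the pattern "mean|std" also matches names containing "meanFreq".

```r
# Example feature names in the style of features.txt.
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
              "tBodyAcc-mad()-X",  "fBodyAcc-meanFreq()-X")

# Toy merged data: four measurement columns plus subject and y.
merged <- data.frame(matrix(1:8, nrow = 2), subject = c(1L, 2L), y = c(5L, 2L))

# Step 2: name the measurement columns after the features.
names(merged)[seq_along(features)] <- features

# Step 3: indices of the columns whose names contain "mean" or "std".
keep <- grep("mean|std", names(merged))

# Step 4: keep those columns plus subject and y.
selected <- merged[, c(keep, which(names(merged) %in% c("subject", "y")))]
names(selected)
```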

3. Using descriptive activity names to name the activities in the data set

  1. Read the activity names from the file activity_labels.txt into a variable.

  2. Using a for loop, replace each activity code in column y of the data frame with the corresponding activity name read from the file in the previous step.
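One possible shape for this loop is sketched below. The inline text stands in for activity_labels.txt, and the names activity_labels, code, label, selected, and codes are assumptions for illustration. The original codes are copied first so that a label written in one iteration cannot be mistaken for a code in a later one.

```r
# Stand-in for activity_labels.txt: one code and one label per line.
activity_labels <- read.table(
  text = "1 WALKING\n2 WALKING_UPSTAIRS\n3 WALKING_DOWNSTAIRS\n4 SITTING\n5 STANDING\n6 LAYING",
  col.names = c("code", "label"),
  stringsAsFactors = FALSE
)

# Toy data frame with activity codes in column y (hypothetical values).
selected <- data.frame(subject = c(1L, 2L, 2L), y = c(5L, 2L, 5L))

# Keep the numeric codes and turn y into a character column.
codes <- selected$y
selected$y <- as.character(codes)

# Replace each code with its label, one activity at a time.
for (i in seq_len(nrow(activity_labels))) {
  selected$y[codes == activity_labels$code[i]] <- activity_labels$label[i]
}
selected$y
```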

4. Appropriately labeling the data set with descriptive variable names

  1. Assign the name 'activity' to the column y of the data frame

  2. Using sub(), make the following replacements in the column names of the data frame:

  • change "acc" to "Acceleration"
  • change beginning "t" to "time"
  • change beginning "f" to "frequency"
  • remove the first occurrence of "-"
  • change "Mag" to "Magnitude"
  • change "mean()" to "Mean"
  • change "std()" to "STD"
  • change "freq()" to "Frequency"
  • change "Gyro" to "Gyroscope"
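The replacements above can be sketched on a few example column names as follows. This is an illustration under assumptions, not the exact calls in run_analysis.R: the patterns use the capitalisation found in features.txt ("Acc", "Freq"), and fixed = TRUE is added so that the parentheses in "mean()", "std()", and "Freq()" are matched literally rather than as a regular expression group.

```r
# Example column names (the first two follow the style of features.txt).
cols <- c("tBodyAcc-mean()-X", "fBodyGyroMag-std()", "subject", "activity")

cols <- sub("Acc", "Acceleration", cols)                # "Acc"  -> "Acceleration"
cols <- sub("^t", "time", cols)                         # leading "t" -> "time"
cols <- sub("^f", "frequency", cols)                    # leading "f" -> "frequency"
cols <- sub("-", "", cols)                              # drop the first "-"
cols <- sub("Mag", "Magnitude", cols)                   # "Mag"  -> "Magnitude"
cols <- sub("mean()", "Mean", cols, fixed = TRUE)       # "mean()" -> "Mean"
cols <- sub("std()", "STD", cols, fixed = TRUE)         # "std()"  -> "STD"
cols <- sub("Freq()", "Frequency", cols, fixed = TRUE)  # "Freq()" -> "Frequency"
cols <- sub("Gyro", "Gyroscope", cols)                  # "Gyro" -> "Gyroscope"
cols
```

Because sub() replaces only the first match, a second "-" (as in the "-X" suffix) is left in place.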

5. From the data set in step 4, creating a second, independent tidy data set with the average of each variable for each activity and each subject

  1. Load the package dplyr.

  2. Using chaining (the %>% operator), group the data with group_by() by activity and subject.

  3. Chain the grouped data into summarize_all() with mean as the function to average every column other than subject and activity, and save the result into a variable.

  4. Finally, use write.table() with row.names = FALSE to save the result into a file named AggData.txt.
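A minimal sketch of this aggregation on toy data is shown below. The data frame tidy, its single measurement column, and the variable names agg and out_file are assumptions for illustration. The sketch writes to a temporary directory so that it does not overwrite an existing AggData.txt.

```r
library(dplyr)

# Toy labelled data: one measurement column (hypothetical values).
tidy <- data.frame(
  subject  = c(1L, 1L, 2L, 2L),
  activity = c("WALKING", "WALKING", "WALKING", "SITTING"),
  timeBodyAccelerationMean = c(1, 3, 5, 7),
  stringsAsFactors = FALSE
)

# Group by activity and subject, then average every remaining column.
agg <- tidy %>%
  group_by(activity, subject) %>%
  summarize_all(mean)

# Save the aggregated data without row names.
out_file <- file.path(tempdir(), "AggData.txt")
write.table(agg, out_file, row.names = FALSE)
```

The saved file can be read back with read.table(out_file, header = TRUE) to check the result.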