Computing Composite Scores (Mean)
Introduction
In this post, I am going to demonstrate how to compute composite scores or means aggregated over multiple items. There are at least two approaches to achieve the goal.
Let’s load necessary packages first.
library(tidyverse)
Create hypothetical data sets: one with complete data and one with missing values.
#id = unique id number for each participant
#se refers self-efficacy
#se1 refers to the first survey item of the self-efficacy scale
mydata<-data.frame(
id = c(1:3),
se1 = c(1, 3, 4),
se2 = c(2, 2, 5),
se3 = c(1, 3, 3)
)
missdata<-data.frame(
id = c(1:3),
se1 = c(1, 3, 4),
se2 = c(NA, 2, 5),
se3 = c(1, 3, NA)
)
Check our data set.
mydata
## id se1 se2 se3
## 1 1 1 2 1
## 2 2 3 2 3
## 3 3 4 5 3
missdata
## id se1 se2 se3
## 1 1 1 NA 1
## 2 2 3 2 3
## 3 3 4 5 NA
We are ready to explore the data.
First Approach: R Base Functions
Since we are interested in computing means, rowMeans
will do the work. We need to create a new variable called se to represent each participant’s overall level of self-efficacy and specify what columns or items are needed for computing the composite score for each person (mean in this case). Let’s play with our complete data set mydata
first.
mydata$se<-rowMeans(mydata[, c("se1", "se2", "se3")], na.rm=T)
mydata
## id se1 se2 se3 se
## 1 1 1 2 1 1.333333
## 2 2 3 2 3 2.666667
## 3 3 4 5 3 4.000000
na.rm
is an argument for determining how to deal with cases with missing values. It is not particularly relevant here since there is no missing value in mydata
.
Second Approach: Tidyverse
mydata %>% mutate (se = rowMeans(select(., c("se1", "se2", "se3")), na.rm=T))
## id se1 se2 se3 se
## 1 1 1 2 1 1.333333
## 2 2 3 2 3 2.666667
## 3 3 4 5 3 4.000000
mutate
is a great function to create new variables. select
is another function to select the variables needed. . (dot) refers to mydata
. Here, na.rm
is also not particularly relevant here since there is no missing value. It looks like we have identical values. Good!
Deal with Missing Values
na.rm
will be relevant when dealing with data containing missing values. na.rm = FALSE
is very similar to the idea of list-wise deletion. That is, R will not compute the composite score for any row or person that contains a missing value for the items you selected. On the contrary, na.rm = TRUE
is very similar to the idea of the full information approach. That is, R will utilize all the possible information from the items to compute the mean. If there is a missing value in one of the three items, R will still compute the mean based on the values of the other two items.
#list-wise deletion approach
missdata$list<-rowMeans(missdata[, c("se1", "se2", "se3")], na.rm=F)
#full information approach
missdata$full<-rowMeans(missdata[, c("se1", "se2", "se3")], na.rm=T)
missdata
## id se1 se2 se3 list full
## 1 1 1 NA 1 NA 1.000000
## 2 2 3 2 3 2.666667 2.666667
## 3 3 4 5 NA NA 4.500000
As you can see here, since there is a missing value for person 1 and person 3 in one of the self-efficacy items, na.rm=F
will discard all the other information from items that do contain information and will not compute the mean for that person. On the contrary, na.rm=T
will still compute the mean based on the information from items that do not have missing values. This idea is the same when using Tidyverse.
#list-wise deletion approach
missdata %>% mutate (list = rowMeans(select(., c("se1", "se2", "se3")), na.rm=F))
## id se1 se2 se3 list
## 1 1 1 NA 1 NA
## 2 2 3 2 3 2.666667
## 3 3 4 5 NA NA
#full information approach
missdata %>% mutate (full = rowMeans(select(., c("se1", "se2", "se3")), na.rm=T))
## id se1 se2 se3 full
## 1 1 1 NA 1 1.000000
## 2 2 3 2 3 2.666667
## 3 3 4 5 NA 4.500000
We have the identical results here. Using the full information approach na.rm=T
, for person 1, the mean is 1 ((1+1)/2) despite a missing value for item 2.