R

Missing Value (notes for myself)

p.16-17 Introduction

This part walks through the steps for calculating Little’s test statistic, $T_L$, which checks the Missing Completely at Random (MCAR) assumption, following the provided example. The test compares the observed means within each missing-data pattern against the overall estimated means; the notes cover how $T_L$ is computed and how to interpret it.
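For orientation, here is the usual textbook form of the statistic. Only $T_L$ comes from the notes; the other symbols are my own notation and may differ from the example's.

$$ T_L = \sum_{j=1}^{J} n_j \left(\bar{y}_j - \hat{\mu}_j\right)^{\top} \hat{\Sigma}_j^{-1} \left(\bar{y}_j - \hat{\mu}_j\right) $$

Here $J$ is the number of missing-data patterns, $n_j$ is the number of cases in pattern $j$, $\bar{y}_j$ holds the means of the variables observed in pattern $j$, and $\hat{\mu}_j$ and $\hat{\Sigma}_j$ are the maximum likelihood estimates of the overall mean vector and covariance matrix, restricted to those same variables. Under MCAR, $T_L$ is approximately chi-square distributed with $\sum_{j} k_j - k$ degrees of freedom, where $k_j$ is the number of variables observed in pattern $j$ and $k$ is the total number of variables.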

Jul 19, 2025

SEM (notes for myself)

p.21 Covariance Derivation

The covariance between two random variables $X_1$ and $X_2$ is defined as:

$$ \text{Cov}(X_1, X_2) = \mathbb{E}\left[(X_1 - \mathbb{E}(X_1))(X_2 - \mathbb{E}(X_2))\right] $$

Step 1: Expanding the expression

First, expand the product inside the expectation:
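A sketch of where that expansion leads, using only linearity of expectation and the fact that $\mathbb{E}(X_1)$ and $\mathbb{E}(X_2)$ are constants (the full post may present the steps differently):

$$ \begin{aligned} \text{Cov}(X_1, X_2) &= \mathbb{E}\left[X_1 X_2 - X_1\,\mathbb{E}(X_2) - \mathbb{E}(X_1)\,X_2 + \mathbb{E}(X_1)\,\mathbb{E}(X_2)\right] \\ &= \mathbb{E}(X_1 X_2) - \mathbb{E}(X_1)\,\mathbb{E}(X_2) - \mathbb{E}(X_1)\,\mathbb{E}(X_2) + \mathbb{E}(X_1)\,\mathbb{E}(X_2) \\ &= \mathbb{E}(X_1 X_2) - \mathbb{E}(X_1)\,\mathbb{E}(X_2) \end{aligned} $$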

Jul 2, 2025

Why Do We Need to Care About Endogeneity?

Let’s consider a simple linear regression model:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

Where:

- $y$: outcome variable
- $x$: predictor
- $\varepsilon$: error term (disturbance)
- $\beta_0, \beta_1$: parameters to estimate

What OLS Assumes

For the OLS (Ordinary Least Squares) estimator to be unbiased, one of the Gauss-Markov assumptions is:
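To see why this matters, here is a minimal simulation sketch. I am assuming the assumption in question is exogeneity, usually written $\mathbb{E}[\varepsilon \mid x] = 0$. All variable names and parameter values below are made up for illustration: the true slope is 2, and an unobserved variable `u` enters both the error term and one of the two predictors.

```r
set.seed(1)
n <- 10000

u   <- rnorm(n)        # unobserved factor (hypothetical)
eps <- u + rnorm(n)    # error term that contains u

x_exog <- rnorm(n)      # predictor unrelated to the error term
x_endo <- rnorm(n) + u  # predictor that shares u with the error term

# Same true model in both cases: intercept 1, slope 2
y_exog <- 1 + 2 * x_exog + eps
y_endo <- 1 + 2 * x_endo + eps

b_exog <- unname(coef(lm(y_exog ~ x_exog))["x_exog"])
b_endo <- unname(coef(lm(y_endo ~ x_endo))["x_endo"])

# Compare the two slope estimates against the true value of 2
print(c(exogenous = b_exog, endogenous = b_endo))
```

In this setup the endogenous slope should drift toward $2 + \text{Cov}(x, \varepsilon)/\text{Var}(x) = 2.5$ rather than 2, and collecting more data does not fix it.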

Jun 28, 2025

When `reorder` Fails

I am using the mpg dataset included in the ggplot2 package.

    #load packages
    library(tidyverse)
    library(tidytext)

    #view data
    mpg
    ## Rows: 234
    ## Columns: 11
    ## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
    ## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
    ## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
    ## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
    ## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
    ## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
    ## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
    ## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
    ## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
    ## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
    ## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

    mpg %>%
      group_by(trans) %>%
      count() %>%
      ungroup() %>%
      ggplot(mapping = aes(trans, n)) +
      geom_bar(stat = "identity")
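One situation where `reorder` commonly falls short is ordering within groups (for example, within facets). The base-R sketch below uses a made-up toy data frame, not mpg, and is my guess at the kind of failure the title refers to. A factor has a single set of levels, so `reorder` can only produce one ordering for the whole data set. Pasting the group into the label gives each group its own levels, which is the idea `tidytext::reorder_within()` builds on.

```r
# Hypothetical toy data: counts of two items within two groups
df <- data.frame(
  group = c("a", "a", "b", "b"),
  item  = c("x", "y", "x", "y"),
  n     = c(1, 5, 3, 2)
)

# reorder() sorts the levels by a summary of n (the mean by default)
# computed over ALL rows, so there is one ordering shared by every group.
global <- reorder(factor(df$item), df$n)
print(levels(global))

# Workaround: make each label unique to its group, then reorder.
# The "___" separator is only an illustrative choice here.
df$item_in_group <- reorder(
  factor(paste(df$item, df$group, sep = "___")),
  df$n
)
print(levels(df$item_in_group))
```

With the combined labels, "y" sorts before "x" in group b and after it in group a, which a single shared ordering cannot express.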

Jul 13, 2024

Coefficient H

Please check McNeish (2018) for details. Here, I am using the NSCH dataset as an example.

    #load packages
    library(haven)
    library(tidyverse)
    library(userfriendlyscience)

    #import data
    data<-read_sav("nsch.sav")

    #clear the current graphics frame and get ready for the next plot
    plot.new()
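For intuition, coefficient H can also be computed by hand from standardized factor loadings. The sketch below assumes the standard formula $H = \left[1 + \left(\sum_i \frac{\ell_i^2}{1 - \ell_i^2}\right)^{-1}\right]^{-1}$; the function name `coefficient_h` and the loadings are hypothetical, and this is not necessarily how the packages loaded above compute it.

```r
# Coefficient H from standardized loadings (hypothetical helper)
coefficient_h <- function(loadings) {
  # Each item contributes l^2 / (1 - l^2), so stronger items weigh more
  s <- sum(loadings^2 / (1 - loadings^2))
  1 / (1 + 1 / s)
}

# Made-up standardized loadings for a three-item scale
h <- coefficient_h(c(0.7, 0.7, 0.7))
print(h)
```

Because every item adds a positive term to the sum, adding an item never lowers H under this formula, even when its loading is weak.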

Apr 8, 2024

Create Dummy Variables

Load Packages

    library(fastDummies)
    library(tidyverse)
    library(psych)

Create a Dataset

    # Create a vector of race categories
    race <- c("White", "Black", "Asian", "Hispanic", "Other")

    # Generate random income values (100 cases)
    set.seed(123) # for reproducibility
    income <- round(runif(100, min = 20000, max = 100000), digits = 2)

    # Repeat each race 20 times to get 100 cases
    race <- rep(race, each = 20)

    # Combine race and income into a data frame
    data <- data.frame(race, income)

    # Print the first few rows of the dataset
    print(head(data))
    ##    race   income
    ## 1 White 43006.20
    ## 2 White 83064.41
    ## 3 White 52718.15
    ## 4 White 90641.39
    ## 5 White 95237.38
    ## 6 White 23644.52

Create Dummy Variables

    data<-data %>%
      dummy_cols(select_columns = "race")

Regress Income on Race (African Americans as the Reference Category)

The `race_Black` dummy is left out of the model, which makes it the reference category.

    fit<-lm(income ~ race_Asian + race_Hispanic + race_Other + race_White, data=data)
    summary(fit)
    ##
    ## Call:
    ## lm(formula = income ~ race_Asian + race_Hispanic + race_Other +
    ##     race_White, data = data)
    ##
    ## Residuals:
    ##    Min     1Q Median     3Q    Max
    ## -44169 -19531  -1137  18010  40481
    ##
    ## Coefficients:
    ##               Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)      66138       5066  13.055   <2e-16 ***
    ## race_Asian      -15015       7165  -2.096   0.0388 *
    ## race_Hispanic    -7004       7165  -0.977   0.3308
    ## race_Other       -7173       7165  -1.001   0.3193
    ## race_White       -2073       7165  -0.289   0.7730
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 22660 on 95 degrees of freedom
    ## Multiple R-squared:  0.05237, Adjusted R-squared:  0.01247
    ## F-statistic: 1.313 on 4 and 95 DF,  p-value: 0.2709
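As a cross-check, the same model can be fitted in base R without building dummy columns by hand. This sketch is an alternative, not the post's method: it rebuilds the data with the same seed, sets "Black" as the first factor level with `relevel()`, and compares the coefficients with those from hand-made 0/1 indicators. The `race_f` column and the `fit_factor` and `fit_dummy` names are mine.

```r
# Rebuild the example data with the same seed
set.seed(123)
income <- round(runif(100, min = 20000, max = 100000), digits = 2)
race <- rep(c("White", "Black", "Asian", "Hispanic", "Other"), each = 20)
data <- data.frame(race, income)

# Option 1: let lm() create the contrasts, with "Black" as the reference
data$race_f <- relevel(factor(data$race), ref = "Black")
fit_factor <- lm(income ~ race_f, data = data)

# Option 2: hand-made 0/1 indicators, omitting the "Black" indicator
data$race_Asian    <- as.integer(data$race == "Asian")
data$race_Hispanic <- as.integer(data$race == "Hispanic")
data$race_Other    <- as.integer(data$race == "Other")
data$race_White    <- as.integer(data$race == "White")
fit_dummy <- lm(income ~ race_Asian + race_Hispanic + race_Other + race_White,
                data = data)

# The two fits should agree coefficient by coefficient
print(cbind(factor = coef(fit_factor), dummy = coef(fit_dummy)))
```

In both fits the intercept is the mean income of the reference group, and each slope is that group's difference from it.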

Feb 15, 2024

How to Display Chinese Characters in ggplot2

Introduction

In this post, I am going to demonstrate how to display Chinese characters in ggplot2.

    #loading necessary packages
    library(rvest)
    library(xml2)
    library(tidyverse)
    library(tidytext)
    library(knitr)

Let’s get our text data by scraping Chien-Ming Wang’s page on Wikipedia. I will not explain the process of web scraping in this post (but will do so in another post). The focus is on displaying Chinese characters in ggplot2.

Sep 10, 2021

Re-coding Values

Introduction

Recoding values is one of the most common tasks a researcher needs to do before data analysis. I often need to prepare my data in R before using it for advanced statistical analyses in Mplus. In that case, it is important to recode missing values to a specific extreme value (e.g., -999), since this makes it easier for Mplus to recognize and handle missing values. In this post, I will demonstrate one way (Oh yeah! This is the beauty of R: all roads lead to Rome.) to handle the recoding task using the case_when function from the tidyverse. There are different ways to get this job done, but case_when makes the most sense to me. Let’s get started.
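A minimal sketch of the idea, assuming dplyr is installed. The `survey` data frame and its columns are made up for illustration and are not the data used in the post.

```r
library(dplyr)

# Hypothetical toy data: two survey items with some missing responses
survey <- data.frame(
  id    = 1:4,
  item1 = c(3, NA, 5, 2),
  item2 = c(NA, 4, 1, NA)
)

# Recode NA to -999 and leave every other value unchanged
recoded <- survey %>%
  mutate(
    item1 = case_when(is.na(item1) ~ -999, TRUE ~ item1),
    item2 = case_when(is.na(item2) ~ -999, TRUE ~ item2)
  )

print(recoded)
```

Conditions in case_when are checked in order, so the `TRUE ~ ...` line acts as the fallback that keeps non-missing values as they are.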

Aug 26, 2021

Computing Composite Scores (Mean)

Introduction In this post, I am going to demonstrate how to compute composite scores or means aggregated over multiple items. There are at least two approaches to achieve the goal.
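One possible approach, sketched in base R with made-up item names and values (it may or may not be one of the two the post covers), is `rowMeans()` over the item columns:

```r
# Hypothetical toy data: three items, one missing response
items <- data.frame(
  item1 = c(4, 2, 5),
  item2 = c(3, NA, 5),
  item3 = c(5, 1, 4)
)

# Mean of the available items per respondent; na.rm = TRUE skips missing items
items$composite <- rowMeans(items[, c("item1", "item2", "item3")], na.rm = TRUE)

print(items)
```

With `na.rm = TRUE`, a respondent who skipped an item still gets a score based on the items they answered. Whether that is appropriate depends on how much missingness you are willing to tolerate per person.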

Jul 23, 2021