Statistical Analysis

Chi-Square test - Borough vs Violation Category

program type and violation category to test the association between the two variables.

H0:There is no association between program type and violation category

H1:There is association between program type and violation category

tbl1=table(Childcare_center$borough,Childcare_center$violation_category)
chisq.test(tbl1)

## 
##  Pearson's Chi-squared test
## 
## data:  tbl1
## X-squared = 134.57, df = 8, p-value < 2.2e-16

chisq.test(tbl1)%>% 
  broom::tidy() %>% 
  knitr::kable()

statistic	p.value	parameter	method
134.5698	0	8	Pearson’s Chi-squared test

There is significant relationship between program type and violation category suggested by the computed p-value much lower than 0.05

Chi-Square test - Program Types vs Violation Category

We conducted a chi-squared test between childcare type and violation category to test the association between the two variables.

H0:There is no association between childcare type and violation category

H1:There is association between childcare type and violation category

tbl2=table(Childcare_center$child_care_type,Childcare_center$violation_category)

chisq.test(tbl2)

## 
##  Pearson's Chi-squared test
## 
## data:  tbl2
## X-squared = 37.101, df = 2, p-value = 8.781e-09

chisq.test(tbl2) %>% 
  broom::tidy() %>% 
  knitr::kable()

statistic	p.value	parameter	method
37.1013	0	2	Pearson’s Chi-squared test

There is significant relationship between childcare type and violation category suggested by the computed p-value much lower than 0.05

ANOVA Test - Borough and Violation

We want to examining the relationship between the mean of total violation event (GENERAL+CRITICAL+PUBLIC HEALTH HAZARD) and borough. The frequency of violation events were separated by five boroughs (Manhattan, Brooklyn, Queens, Staten Island, Bronx).

H0: The mean of violation are not different across borough

H1: The mean of violation are different across borough

clean_1 = children_center<-
  raw_data%>%
  janitor::clean_names()%>%
  mutate(
    center_name=tolower(center_name),
    center_name=gsub('[[:punct:] ]+',' ',center_name),
    center_name=gsub(" ","",center_name),
    center_name=gsub("llc","",center_name),
    center_name=gsub("inc","",center_name),
    center_name=gsub("th","",center_name),
    center_name=gsub("school","",center_name),
    center_name=gsub("center","",center_name),
    center_name=gsub("ctr","",center_name)
  )%>%
  filter(status=="Permitted") %>% 
  mutate(borough = as.factor(borough), program_type = as.factor(program_type)) %>%
   filter(violation_category != "NO VIOLATION") %>% 
  mutate(violation_category = "VIOLATION") %>% 
  group_by(center_name,borough) %>% 
  count() 


fit = lm(n~ borough, data = clean_1)
anova(fit) %>% 
  knitr::kable(caption = "One way anova of Violation frequency and Brough")

One way anova of Violation frequency and Brough
	Df	Sum Sq	Mean Sq	F value	Pr(>F)
borough	4	15688.56	3922.141	25.89965	0
Residuals	985	149164.49	151.436	NA	NA

The p-value is very small(p<0.05). We reject the null hypothesis and saying that there is enough evidence the mean of violation are different across borough.

Logistic Regression

To predict whether a childcare facility will violate or not, we fit the data to a logistic regression model.

Our model is shown below:

# New dataset for modelling
childcare_model <- Childcare_center %>% 
  mutate(violation_category = ifelse(is.na(violation_category), 0, 1),
         borough = as.factor(borough)) %>% # 0 is no liolation, 1 is with violation
  select(borough,child_care_type,total_educational_workers,maximum_capacity, violation_category)

# fit logistic reg
logistic_model = childcare_model %>% 
  glm(violation_category ~ borough + child_care_type + maximum_capacity, data = ., family = binomial()) %>% 
  broom::tidy() %>% 
  knitr::kable(caption = "Effect of Predictors on Violation Status")

logistic_model

Effect of Predictors on Violation Status
term	estimate	std.error	statistic	p.value
(Intercept)	0.4692795	0.0559961	8.380571	0.0000000
boroughBROOKLYN	-1.1458556	0.0514892	-22.254278	0.0000000
boroughMANHATTAN	-1.0180213	0.0580792	-17.528165	0.0000000
boroughQUEENS	-0.3377869	0.0502914	-6.716594	0.0000000
boroughSTATEN ISLAND	-1.6359735	0.0995010	-16.441776	0.0000000
child_care_typeChild Care - Pre School	0.1660567	0.0488473	3.399506	0.0006751
maximum_capacity	0.0023471	0.0003790	6.192338	0.0000000

As we can see in the summary, all predictors are significant with p value < 0.01, suggesting association with violation status. Different borough has different coefficients, suggesting regional difference in violation status. The odds of violation for childcare center in Bronx is 0.469, holding all other variables constant.For each of the other borough, the coefficient tells us that the log-odds of violation for a given group is smaller than that of the reference group.

I also used a simple method to test our model accuracy.Firstly, I split our dataset to training and testing dataset, and then fit the model using the training data, and test for accuracy using testing data. Lastly, I used confusion matrix to access our logistic model.

library(rsample)   # for data splitting

# Modeling packages
library(caret)     # for logistic regression modeling

# Model interpretability packages
library(vip)       # variable importance


childcare_split <- initial_split(childcare_model, prop = .7)
childcare_train <- training(childcare_split)
childcare_test  <- testing(childcare_split)

set.seed(123)

# predict violation using model
# fit logistic reg
logistic_model_train = childcare_train %>% 
  glm(violation_category ~ borough + child_care_type + maximum_capacity, data = ., family = binomial()) 

logistics_predict = predict(logistic_model_train, childcare_test, type = 'response')
# convert predicted value to TRUE/FALSE
childcare_test$predict_violation = ifelse(logistics_predict >= .5, 1, 0)

# Used confusion matrix to measure model performance
confusion_data = childcare_test %>% 
  select(predict_violation, violation_category) %>% 
  mutate(predict_violation = as.factor(predict_violation), violation_category = as.factor(violation_category))
confusionMatrix(data=confusion_data$predict_violation, reference = confusion_data$violation_category) %>% 
  broom::tidy() %>% 
  select(term, estimate) %>% 
  filter(term %in% c("accuracy", "kappa", "sensitivity", "specificity", "f1")) %>% 
  knitr::kable(caption = "Measuring Model Performance ")

Measuring Model Performance
term	estimate
accuracy	0.6202091
kappa	0.2380403
sensitivity	0.6015857
specificity	0.6366419
f1	0.5975880

It turns out that our model has accuracy rate of 62%, with Kappa coefficient at around 0.24. This means that the predictability of our model is relatively week. But the significance of our predictors indicating that borough, childcare types and maximum capacity are all associated with violation. The reason behind regional disparities in violation should be investigated in the future.

The main reason for our weak predictability is that we lack of essential predictors for the model. Combined with the top 10 violation summary we analyzed before, useful information may include: hours of training, celling and floor maintance frequecny, stuff medical clearance status, etc.