program type and violation category to test the association between the two variables.
H0:There is no association between program type and violation category
H1:There is association between program type and violation category
tbl1=table(Childcare_center$borough,Childcare_center$violation_category)
chisq.test(tbl1)## 
##  Pearson's Chi-squared test
## 
## data:  tbl1
## X-squared = 134.57, df = 8, p-value < 2.2e-16chisq.test(tbl1)%>% 
  broom::tidy() %>% 
  knitr::kable()| statistic | p.value | parameter | method | 
|---|---|---|---|
| 134.5698 | 0 | 8 | Pearson’s Chi-squared test | 
There is significant relationship between program type and violation category suggested by the computed p-value much lower than 0.05
We conducted a chi-squared test between childcare type and violation category to test the association between the two variables.
H0:There is no association between childcare type and violation category
H1:There is association between childcare type and violation category
tbl2=table(Childcare_center$child_care_type,Childcare_center$violation_category)
chisq.test(tbl2)## 
##  Pearson's Chi-squared test
## 
## data:  tbl2
## X-squared = 37.101, df = 2, p-value = 8.781e-09chisq.test(tbl2) %>% 
  broom::tidy() %>% 
  knitr::kable()| statistic | p.value | parameter | method | 
|---|---|---|---|
| 37.1013 | 0 | 2 | Pearson’s Chi-squared test | 
There is significant relationship between childcare type and violation category suggested by the computed p-value much lower than 0.05
We want to examining the relationship between the mean of total violation event (GENERAL+CRITICAL+PUBLIC HEALTH HAZARD) and borough. The frequency of violation events were separated by five boroughs (Manhattan, Brooklyn, Queens, Staten Island, Bronx).
H0: The mean of violation are not different across borough
H1: The mean of violation are different across borough
clean_1 = children_center<-
  raw_data%>%
  janitor::clean_names()%>%
  mutate(
    center_name=tolower(center_name),
    center_name=gsub('[[:punct:] ]+',' ',center_name),
    center_name=gsub(" ","",center_name),
    center_name=gsub("llc","",center_name),
    center_name=gsub("inc","",center_name),
    center_name=gsub("th","",center_name),
    center_name=gsub("school","",center_name),
    center_name=gsub("center","",center_name),
    center_name=gsub("ctr","",center_name)
  )%>%
  filter(status=="Permitted") %>% 
  mutate(borough = as.factor(borough), program_type = as.factor(program_type)) %>%
   filter(violation_category != "NO VIOLATION") %>% 
  mutate(violation_category = "VIOLATION") %>% 
  group_by(center_name,borough) %>% 
  count() 
fit = lm(n~ borough, data = clean_1)
anova(fit) %>% 
  knitr::kable(caption = "One way anova of Violation frequency and Brough")| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| borough | 4 | 15688.56 | 3922.141 | 25.89965 | 0 | 
| Residuals | 985 | 149164.49 | 151.436 | NA | NA | 
The p-value is very small(p<0.05). We reject the null hypothesis and saying that there is enough evidence the mean of violation are different across borough.
To predict whether a childcare facility will violate or not, we fit the data to a logistic regression model.
Our model is shown below:
# New dataset for modelling
childcare_model <- Childcare_center %>% 
  mutate(violation_category = ifelse(is.na(violation_category), 0, 1),
         borough = as.factor(borough)) %>% # 0 is no liolation, 1 is with violation
  select(borough,child_care_type,total_educational_workers,maximum_capacity, violation_category)
# fit logistic reg
logistic_model = childcare_model %>% 
  glm(violation_category ~ borough + child_care_type + maximum_capacity, data = ., family = binomial()) %>% 
  broom::tidy() %>% 
  knitr::kable(caption = "Effect of Predictors on Violation Status")
logistic_model| term | estimate | std.error | statistic | p.value | 
|---|---|---|---|---|
| (Intercept) | 0.4692795 | 0.0559961 | 8.380571 | 0.0000000 | 
| boroughBROOKLYN | -1.1458556 | 0.0514892 | -22.254278 | 0.0000000 | 
| boroughMANHATTAN | -1.0180213 | 0.0580792 | -17.528165 | 0.0000000 | 
| boroughQUEENS | -0.3377869 | 0.0502914 | -6.716594 | 0.0000000 | 
| boroughSTATEN ISLAND | -1.6359735 | 0.0995010 | -16.441776 | 0.0000000 | 
| child_care_typeChild Care - Pre School | 0.1660567 | 0.0488473 | 3.399506 | 0.0006751 | 
| maximum_capacity | 0.0023471 | 0.0003790 | 6.192338 | 0.0000000 | 
As we can see in the summary, all predictors are significant with p value < 0.01, suggesting association with violation status. Different borough has different coefficients, suggesting regional difference in violation status. The odds of violation for childcare center in Bronx is 0.469, holding all other variables constant.For each of the other borough, the coefficient tells us that the log-odds of violation for a given group is smaller than that of the reference group.
I also used a simple method to test our model accuracy.Firstly, I split our dataset to training and testing dataset, and then fit the model using the training data, and test for accuracy using testing data. Lastly, I used confusion matrix to access our logistic model.
library(rsample)   # for data splitting
# Modeling packages
library(caret)     # for logistic regression modeling
# Model interpretability packages
library(vip)       # variable importance
childcare_split <- initial_split(childcare_model, prop = .7)
childcare_train <- training(childcare_split)
childcare_test  <- testing(childcare_split)
set.seed(123)
# predict violation using model
# fit logistic reg
logistic_model_train = childcare_train %>% 
  glm(violation_category ~ borough + child_care_type + maximum_capacity, data = ., family = binomial()) 
logistics_predict = predict(logistic_model_train, childcare_test, type = 'response')
# convert predicted value to TRUE/FALSE
childcare_test$predict_violation = ifelse(logistics_predict >= .5, 1, 0)
# Used confusion matrix to measure model performance
confusion_data = childcare_test %>% 
  select(predict_violation, violation_category) %>% 
  mutate(predict_violation = as.factor(predict_violation), violation_category = as.factor(violation_category))
confusionMatrix(data=confusion_data$predict_violation, reference = confusion_data$violation_category) %>% 
  broom::tidy() %>% 
  select(term, estimate) %>% 
  filter(term %in% c("accuracy", "kappa", "sensitivity", "specificity", "f1")) %>% 
  knitr::kable(caption = "Measuring Model Performance ")| term | estimate | 
|---|---|
| accuracy | 0.6202091 | 
| kappa | 0.2380403 | 
| sensitivity | 0.6015857 | 
| specificity | 0.6366419 | 
| f1 | 0.5975880 | 
It turns out that our model has accuracy rate of 62%, with Kappa coefficient at around 0.24. This means that the predictability of our model is relatively week. But the significance of our predictors indicating that borough, childcare types and maximum capacity are all associated with violation. The reason behind regional disparities in violation should be investigated in the future.
The main reason for our weak predictability is that we lack of essential predictors for the model. Combined with the top 10 violation summary we analyzed before, useful information may include: hours of training, celling and floor maintance frequecny, stuff medical clearance status, etc.