We can see from the following boxplot that the distributions maximum capacity in preschool and Infants / Toddlers are significantly different. Overall, Preschool has higher maximum capacity than Infants/Toddlers, and the range of capicity is also wider than it is in the Infants/Toddlers.The max maximum capacity in childcare center is 371, while the max maximum capacity in Infants/Toddlers is 135.
plotly_capacity_type = Childcare_center %>%
group_by(violation_category, maximum_capacity, child_care_type) %>%
plot_ly(y = ~maximum_capacity, color = ~child_care_type, type = "box", colors = "viridis")
plotly_capacity_type
We then want to examine whether there is connection between maximum capacity and violation. Since the maximum capacity in different child care types are different, we display the distribution of maximum capacity vs violation category by child care type.
We can conclude from the plot that there is no significant difference between the distribution of maximum capacity for each violation category. Besides, there are many outliers in each groups, which will affect our analysis results.
boxplot_capacity = Childcare_center %>%
mutate(violation_category = ifelse( is.na(violation_category), "NO VIOLATION", violation_category)) %>%
group_by(violation_category, maximum_capacity, child_care_type) %>%
ggplot(aes(x = violation_category, y = maximum_capacity, color = child_care_type)) +
geom_boxplot() +
labs(
x = "Violation category",
y = "Maximum capacity",
color = ""
) +
theme(legend.position = "bottom")
boxplot_capacity
To further analyze the association, we set maximum capacity less than its median as small center, and larger and equal than its median as larger center by child care type, and then compare the frequecny of violation types.
(1). Pre School
We can see from the frequency bar chart that pre schools with larger maximum capacity are more likely to violate related rules: the frequecnies are higher in all violation categories. In contract, pre schools with smaller maximum capacity have more chance of no violation.
(2). Infants/Toddlers
The frequency of violation is higher in Infants/Toddlers with larger maximum capacity; and the frequency of no violation is higher in Infants/Toddlers with smaller maximum capacity.
library(patchwork)
bar_max_cap1 = Childcare_center %>%
mutate(violation_category = ifelse( is.na(violation_category), "NO VIOLATION", violation_category),
maximum_capacity_group = ifelse(maximum_capacity<66, "small", "large")) %>%
filter(child_care_type == "Child Care - Pre School") %>%
group_by(violation_category, maximum_capacity_group) %>%
summarise(count = n()) %>%
ggplot(aes(x = violation_category, y = count, fill = maximum_capacity_group)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=count), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5)+
labs(
x = "Violation Category",
y = "Frequency",
title = "Pre School",
fill = "Group"
) +
theme(legend.position = "bottom", axis.text.x = element_text(size=10, angle=45, hjust = 1))
bar_max_cap2 = Childcare_center %>%
mutate(violation_category = ifelse( is.na(violation_category), "NO VIOLATION", violation_category),
maximum_capacity_group = ifelse(maximum_capacity<26, "small", "large")) %>%
filter(child_care_type == "Child Care - Infants/Toddlers") %>%
group_by(violation_category, maximum_capacity_group) %>%
summarise(count = n()) %>%
ggplot(aes(x = violation_category, y = count, fill = maximum_capacity_group)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=count), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5)+
labs(
x = "Violation Category",
y = "Frequency",
title = "Infants/Toddlers",
fill = "Group"
) +
theme(legend.position = "bottom", axis.text.x = element_text(size=10, angle=45, hjust = 1))
bar_max_cap1 + bar_max_cap2
As we can see, there’s no significant difference between the distribution of total education workers in each violation categories.
boxplot_worker = Childcare_center %>%
group_by(violation_category, total_educational_workers, child_care_type) %>%
mutate(violation_category = ifelse( is.na(violation_category), "NO VIOLATION", violation_category)) %>%
ggplot(aes(x =violation_category , y = total_educational_workers , color = child_care_type)) +
geom_boxplot() +
labs(
x = "Violation category",
y = "Total educational workers",
color = ""
) +
theme(legend.position = "bottom")
boxplot_worker
As we can see, there’s no significant different within each child care type. But the educational workers per child is higher in Infants/Toddlers than it is in Pre school.
plotly_capacity_type = Childcare_center %>%
mutate(workers_per_child = total_educational_workers/maximum_capacity) %>%
group_by(workers_per_child, violation_category, child_care_type) %>%
plot_ly(y = ~workers_per_child, type = "box", x = ~violation_category, color = ~child_care_type, colors = "viridis") %>%
layout(boxmode = "group")
plotly_capacity_type