2 Data exploration
In this section we explore the data to look for insights.
2.1 Missing values checking and fixing
Do we have any missing data?
# Count the missing values in each column and render the result as a styled table
sapply(suicide, is.na) %>% colSums() %>% kable() %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
column | missing values |
---|---|
country | 0 |
year | 0 |
sex | 0 |
age | 0 |
suicides_no | 0 |
population | 0 |
suicides.100k.pop | 0 |
country.year | 0 |
HDI.for.year | 19456 |
gdp_for_year…. | 0 |
gdp_per_capita…. | 0 |
generation | 0 |
Only the HDI.for.year column contains missing values. What proportion of this column is missing?
## [1] 69.9353
Nearly 70% of the values in this column are missing. We will see later how this variable can still be used.
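The code that produced this percentage is not shown in the report. The sketch below is one possible way to compute it, not necessarily the original call. It uses a small invented data frame, `toy`, as a stand-in for `suicide`, and then repeats the arithmetic with the figures reported in this report: 19456 missing values out of 27820 rows (the sum of the generation counts in section 2.2.1).

```r
# Toy stand-in for the `suicide` data frame (values are invented for illustration).
toy <- data.frame(
  country      = c("A", "A", "B", "B", "C"),
  HDI.for.year = c(0.71, NA, NA, 0.80, NA)
)

# is.na() returns a logical vector; its mean is the share of TRUE values.
pct_missing <- mean(is.na(toy$HDI.for.year)) * 100
pct_missing  # 3 of 5 values are missing, i.e. 60

# Same arithmetic with the counts reported in this report.
19456 / 27820 * 100  # prints 69.9353 with R's default 7 significant digits
```

On the real data, replacing `toy` with `suicide` in the `mean(is.na(...))` line should give the same percentage as the output above.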
2.2 Qualitative variables frequencies
2.2.1 Generation
# Number of rows per generation
suicide %>% group_by(generation) %>%
summarize(nb = n()) %>% kable() %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
generation | nb |
---|---|
Boomers | 4990 |
G.I. Generation | 2744 |
Generation X | 6408 |
Generation Z | 1470 |
Millenials | 5844 |
Silent | 6364 |
These are the numbers of occurrences of each generation in the dataset.
hcbar(x = suicide$generation, name = "Generation") %>%
hc_add_theme(hc_theme_economist()) %>%
hc_title(text = "Distribution of generation counts")
Generation X and Silent are the most frequent generations in the dataset. Generation Z is the smallest group.
2.2.2 Age groups
Let’s now visualize the age groups.
# Use the full column name: `suicide$ag` only works through partial matching
hcbar(x = suicide$age, name = "Age categories") %>%
hc_add_theme(hc_theme_economist()) %>%
hc_title(text = "Counts per age category")
The age groups are evenly represented in the dataset.
2.2.3 By sex
What about sex? Both categories are equally represented.
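The report does not show the code behind this check. A minimal sketch of one way to verify the balance is to count the rows per sex and compute each group's share. The data frame `toy` and its values are invented for illustration, and the column names `nb` and `share` are our own choice.

```r
library(dplyr)

# Toy stand-in for `suicide`; the rows are invented for illustration.
toy <- data.frame(
  sex         = c("male", "female", "male", "female", "male", "female"),
  suicides_no = c(21, 14, 16, 1, 9, 6)
)

# Count rows per sex, then add each group's share of all rows.
sex_counts <- toy %>%
  count(sex, name = "nb") %>%
  mutate(share = nb / sum(nb))

sex_counts
```

If the two groups are balanced, both values of `share` equal 0.5. The same pattern applies to any other categorical column, such as `age`.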
2.3 Data by year
Do we have the same amount of data for each year?
The dataset does not contain the same number of records for every year. For example, the last year, 2016, has the fewest records. We need to keep this in mind when interpreting the results of the analysis.
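The per-year counts behind this observation are not shown in the report. As a hedged sketch, they could be obtained with the same `group_by()` / `summarize()` pattern used for the generations, sorted so that the sparsest year comes first. The data frame `toy` below is invented for illustration and only mimics the situation described, with 2016 as the year with the fewest rows.

```r
library(dplyr)

# Toy stand-in for `suicide`: three years with unequal numbers of rows (invented).
toy <- data.frame(
  year = c(2014, 2014, 2014, 2015, 2015, 2015, 2015, 2016)
)

# Count rows per year and sort ascending, so the sparsest year comes first.
year_counts <- toy %>%
  group_by(year) %>%
  summarize(nb = n()) %>%
  arrange(nb)

# The first row is the year with the fewest records.
head(year_counts, 1)
```

Years with few records contribute less reliable totals, so yearly comparisons are safer on rates such as `suicides.100k.pop` than on raw counts.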