2 Data exploration

In this section we explore the data to look for insights.

2.1 Missing values checking and fixing

Do we have any missing data?

sapply(suicide, is.na) %>% colSums %>% kable () %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

	x
country	0
year	0
sex	0
age	0
suicides_no	0
population	0
suicides.100k.pop	0
country.year	0
HDI.for.year	19456
gdp_for_year….	0
gdp_per_capita….	0
generation	0

Only the HDI.for.year column contains missing values. What is the proportion of missing data in this column?

sum(is.na(suicide$HDI.for.year))/length(suicide$HDI.for.year) * 100

## [1] 69.9353

Nearly 70% of the values in this column are missing. We will see later how we can still make use of this variable.
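The count and the percentage above can also be obtained for every column in one step. The following is only a minimal sketch in base R: `toy` is a hypothetical data frame with made-up values standing in for `suicide`, and `missing_pct` is an illustrative name.

```r
# Toy stand-in for the `suicide` data frame (hypothetical values, illustration only)
toy <- data.frame(
  country      = c("A", "A", "B", "B"),
  year         = c(1990, 1991, 1990, 1991),
  HDI.for.year = c(NA, 0.71, NA, NA)
)

# is.na() on a data frame returns a logical matrix; colMeans() then gives
# the fraction of missing values in each column
missing_pct <- colMeans(is.na(toy)) * 100
print(missing_pct)

# Keep only the columns that actually contain missing values
print(missing_pct[missing_pct > 0])
```

Applied to the real data, the same expression would be `colMeans(is.na(suicide)) * 100`.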

2.2 Qualitative variables frequencies

2.2.1 Generation

suicide %>% group_by(generation) %>% 
summarize(nb = n()) %>% kable () %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

generation	nb
Boomers	4990
G.I. Generation	2744
Generation X	6408
Generation Z	1470
Millenials	5844
Silent	6364

This is the number of occurrences of each generation in the dataset.

hcbar(x = suicide$generation, name = "Generation") %>% 
hc_add_theme(hc_theme_economist()) %>%
  hc_title(text = "Distribution of generation counts")

Generation X and the Silent generation have the most records, while Generation Z is the smallest group.
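To compare group sizes more directly, the counts can be expressed as shares of the total and sorted. This is a rough sketch in base R, assuming a hypothetical toy vector `generation` in place of `suicide$generation`; the values are invented for illustration.

```r
# Hypothetical toy vector standing in for suicide$generation
generation <- c("Boomers", "Silent", "Silent", "Generation X",
                "Generation X", "Generation X", "Generation Z")

counts <- table(generation)                    # records per generation
shares <- round(prop.table(counts) * 100, 1)   # the same counts as percentages

summary_df <- data.frame(
  generation = names(counts),
  nb         = as.vector(counts),
  pct        = as.vector(shares)
)

# Sort from the most to the least frequent generation
summary_df <- summary_df[order(-summary_df$nb), ]
print(summary_df)
```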

2.2.2 Age groups

Let’s now visualize the age groups.

hcbar(x = suicide$age, name = "Age categories") %>% 
hc_add_theme(hc_theme_economist()) %>%
  hc_title(text = "Bar chart representing the counts of age category")

The age groups are all equally represented.

2.2.3 By sex

How about sex? Both groups are equally represented.

hcbar(x = suicide$sex, name = "Gender") %>% 
hc_add_theme(hc_theme_economist()) %>%
  hc_title(text = "Bar charts of Gender counts")
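The balance seen in the age and sex bar charts can also be checked numerically instead of by eye. Here is a small sketch in base R; `is_balanced` is a hypothetical helper written for this illustration, and the `sex` and `age` vectors are toy data, not values from the dataset.

```r
# Hypothetical helper: TRUE when every level of x occurs the same number of times
is_balanced <- function(x) {
  counts <- table(x)
  length(unique(as.vector(counts))) == 1
}

# Toy stand-ins for suicide$sex and suicide$age (illustrative values only)
sex <- c("male", "female", "male", "female")
age <- c("15-24 years", "15-24 years", "25-34 years", "35-54 years")

print(is_balanced(sex))  # both levels appear twice in the toy data
print(is_balanced(age))  # one level appears more often than the others
print(table(age))        # the underlying counts
```

On the real data, one would call `is_balanced(suicide$sex)` and `is_balanced(suicide$age)`.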

2.3 Data by year

Do we have the same amount of data for each year?

hcbar(x = as.character(suicide$year), name = "Years") %>% 
hc_add_theme(hc_theme_economist())

The dataset does not contain the same number of records for every year. For example, the last year, 2016, has the fewest records. We need to keep this in mind when interpreting the results of the analysis.
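One way to keep track of this uneven coverage is to tabulate the records per year and relate each year to the best-covered one. The sketch below uses base R and a hypothetical toy vector `year` with invented values; with the real data it would be `suicide$year`.

```r
# Toy stand-in for suicide$year (hypothetical values, illustration only)
year <- c(rep(2014, 6), rep(2015, 5), rep(2016, 2))

records_per_year <- table(year)
print(records_per_year)

# Year with the fewest records
sparsest_year <- names(records_per_year)[which.min(records_per_year)]
print(sparsest_year)

# Record count of each year relative to the best-covered year
coverage <- round(as.vector(records_per_year) / max(records_per_year), 2)
names(coverage) <- names(records_per_year)
print(coverage)
```

Years with a low coverage ratio are the ones whose totals should be interpreted with the most caution.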