A3a: Exploratory Visual Analysis

Sunny Cui
Nov 26, 2019

Introduction:

This proposal presents an exploratory visual analysis of a real-world data set, the World Development Indicators. While analyzing the data, I formulated two hypotheses and then used Tableau, an interactive data visualization tool, to investigate the questions I raised. The write-up is organized into five parts: an introduction, a short profile of the data, the first question and analysis, the second question and analysis, and a reflection at the end.

Data Profile:

For this project, I focused on a subset of data from the World Bank’s World Development Indicators dataset. I downloaded the dataset from the World Bank’s website (https://datacatalog.worldbank.org/dataset/world-development-indicators). The World Bank (worldbank.org) is an organization that provides financial and technical assistance to countries around the world, especially developing countries. It was established in 1944 and is currently based in Washington, D.C.

The World Bank provides one of the most current and credible sources of global development data, ranging from gender and health to economic growth and education. Its World Development Indicators (WDI) database is updated quarterly, in April, July, September and December, and its temporal coverage runs from 1960 to 2019 (still updating) [2]. The WDI download is a 57 MB zip file that contains 6 CSV files; for this exploratory analysis, I focus only on the WDIData.csv file. WDIData.csv (195.6 MB) has 65 columns and 377,786 rows, and includes both qualitative and quantitative data.

The qualitative data include country name, country code, indicator name, and indicator code; the quantitative data consist of the indicator values (in numerical form). The indicators cover a wide variety of topics, such as GDP, CO2 emissions, population, birth rate, mortality rate, education, and endangered animal species. One thing that caught my attention was that although the indicator topics are very thorough, the numerical value for each indicator is not always available (missing values appear as “null”). Another thing I noticed was that certain country names, such as “Arab World” and “European Union”, are not actual countries. They are pre-aggregated entries that represent groups of countries.

Question 1: How do higher education attainment rate and suicide mortality rate change worldwide over time?

Inspiration:

This analysis is inspired by the in-class exercise “FDS Activity”. During the activity, I looked at the change in suicide mortality rate and higher education attainment rate in China and Norway. After the activity, I read through a few academic articles on education and suicide rate, and they mostly focused on investigating the association between higher education attainment and suicide rate. In this analysis, I’m interested in identifying how these two variables change worldwide over time. My hypothesis is that both higher education attainment rate and suicide mortality rate increase over time.

Data exploration:

To prepare the data for this analysis, I started by importing WDIData.csv into Tableau and cleaned the data by taking field names from the first row, pivoting the date (year) columns into rows, and filtering out “null” values. Then, I dragged “Country Name” to Detail and filtered out 47 pre-aggregated country groups, such as “Arab World” (shown in Figure 1.1).

Figure 1.1: filtering out pre-aggregated country groups
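For readers who want to reproduce these cleaning steps outside Tableau, here is a minimal sketch in Python using only the standard library. It is not what I ran for this project. The sample rows, their values, and the two-entry `AGGREGATE_GROUPS` set are made up for illustration; the real file has one column per year from 1960 onward and 47 groups to exclude.

```python
import csv
import io

# Tiny made-up sample in the same wide layout as WDIData.csv
# (one column per year); the values are illustrative, not real data.
SAMPLE_CSV = """Country Name,Country Code,Indicator Name,Indicator Code,2015,2016
Arab World,ARB,"Suicide mortality rate (per 100,000 population)",SH.STA.SUIC.P5,4.0,4.1
Norway,NOR,"Suicide mortality rate (per 100,000 population)",SH.STA.SUIC.P5,11.0,
China,CHN,"Suicide mortality rate (per 100,000 population)",SH.STA.SUIC.P5,8.0,7.9
"""

# Hypothetical subset of the pre-aggregated groups to exclude.
AGGREGATE_GROUPS = {"Arab World", "European Union"}

ID_COLUMNS = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]


def tidy_rows(csv_text, aggregates=AGGREGATE_GROUPS):
    """Pivot year columns into rows, dropping empty values and country groups."""
    reader = csv.DictReader(io.StringIO(csv_text))
    year_columns = [c for c in reader.fieldnames if c not in ID_COLUMNS]
    rows = []
    for record in reader:
        if record["Country Name"] in aggregates:
            continue  # skip groups such as "Arab World"
        for year in year_columns:
            raw = (record[year] or "").strip()
            if raw == "":
                continue  # the equivalent of filtering out nulls
            rows.append({
                "country": record["Country Name"],
                "indicator": record["Indicator Name"],
                "year": int(year),
                "value": float(raw),
            })
    return rows


rows = tidy_rows(SAMPLE_CSV)
for row in rows:
    print(row)
```

The result is one row per country, indicator and year, which is the shape Tableau produces after the pivot.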

The indicators I chose are “Educational attainment, at least Bachelor’s or equivalent (%)” and “Suicide mortality rate (per 100,000 population)”. “Educational attainment, at least Bachelor’s or equivalent (%)” measures the percentage of people who have received at least a Bachelor’s degree or equivalent. “Suicide mortality rate (per 100,000 population)” is defined as “the number of suicide deaths in a year per 100,000 population” [3].

Various visualizations:

To begin, I created two calculated fields, one for education attainment and one for suicide mortality rate. Since I wanted to investigate how these variables change worldwide over time, I chose the map view. By adding a second “Latitude (generated)” to Rows, I was able to place two maps on one sheet (see Figure 1.2 below).

The two maps were edited separately: the upper map (green) shows education attainment and the other map (red) shows suicide mortality rate. Color saturation encodes education attainment, with darker colors meaning a higher attainment rate. Suicide mortality rate, on the other hand, is encoded by the size of the dots: larger dots indicate a higher suicide mortality rate.

Figure 1.2: preliminary visualization I

Next, I applied “dual axis” to combine the two separate maps; the result is shown below in Figure 1.3. Combining the maps not only reduces cognitive load but also makes the relationship between education attainment and suicide mortality easier to see.

Figure 1.3: preliminary visualization II (applying “dual axis”)

In addition, red and green are complementary colors, so they contrast strongly and stand out pre-attentively to viewers. I chose them not only for this visual effect but also for their psychological meaning, since hue can strongly influence human perception. According to colorpsychology.org, red attracts our attention and at times signifies danger and death, whereas green symbolizes growth and renewal, which suits the data each color represents [4].

Figure 2.1: final visualization I

Since I wanted to investigate how education attainment and suicide mortality rate change worldwide over time, I also had to display the change over time. To do this, I dragged the variable “Years” to Pages, so I was able to see how these variables change over time by interacting with the legend on the right. The final visualization is shown in Figures 2.1 and 2.2.

A static visualization cannot show change over time, so I created an animated GIF in which viewers can easily see the variation. Unfortunately, the two variables are only available for 2000, 2005, 2010, 2015 and 2016, but we can still see an increase in both variables, which supports my hypothesis.

Figure 2.2: final visualization II (gif)
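The animation shows the trend visually. One way to double-check such a trend numerically would be to average an indicator across countries for each available year and see whether the averages rise. The sketch below assumes the data is already in long (country, year, value) form; the country labels and numbers are made up and are not WDI values.

```python
from collections import defaultdict
from statistics import mean

# Made-up (country, year, value) records for one indicator; not real WDI values.
records = [
    ("A", 2000, 10.0), ("B", 2000, 14.0),
    ("A", 2005, 12.0), ("B", 2005, 15.0),
    ("A", 2010, 13.0), ("B", 2010, 18.0),
]


def yearly_means(records):
    """Average the indicator over all countries for each available year."""
    by_year = defaultdict(list)
    for _country, year, value in records:
        by_year[year].append(value)
    return {year: mean(values) for year, values in sorted(by_year.items())}


def is_increasing(series):
    """True if every yearly mean is larger than the previous one."""
    values = list(series.values())
    return all(a < b for a, b in zip(values, values[1:]))


means = yearly_means(records)
print(means)
print(is_increasing(means))
```

A simple average treats every country equally regardless of population, so this is a rough check rather than a true worldwide rate.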

Question 2: Are women’s attitudes towards domestic violence related to their education level?

Inspiration:

The topic of female victims of domestic violence was recently brought to my attention. A week ago I read a post on Weibo, a Chinese microblogging platform, about domestic violence in China; it stated that nearly 30% of Chinese women have experienced or are currently experiencing domestic violence. This statistic surprised me, and I felt upset when reading the post. Domestic violence is a major contributor to ill health among women, as it has serious consequences for both their physical and mental health. This motivated me to learn more about the topic. In particular, I was interested in whether women’s attitudes towards domestic violence are correlated with their educational background.

Data exploration:

I explored the data file by reading different “indicator names” and looking for useful indicators for my analysis. To make this step easier, I opened a new sheet in Tableau and dragged “Indicator Name” to the blank sheet (see Figure 3.1).

Figure 3.1: browsing different Indicators

I soon found several useful indicators. Since my question is whether women’s attitudes towards domestic violence are related to their educational background, I selected two indicators: “Women who believe a husband is justified in beating his wife (any of five reasons)(%)” and “Adolescents out of school, female (% of female lower secondary school age).”

“Women who believe a husband is justified in beating his wife (any of five reasons)(%)” measures the percentage of women (ages 15–49) who believe a husband/partner is justified in hitting or beating his wife/partner for any of the following five reasons: she argues with him; refuses to have sex; burns the food; goes out without telling him; or neglects the children. I predict that this variable is positively correlated with the rate of female adolescents out of school, as women’s level of education is a significant predictor of their likelihood of experiencing domestic violence. “Adolescents out of school, female (% of female lower secondary school age)” is defined as the percentage of female adolescents of lower secondary school age who are not enrolled in school.

Although there are many indicators that can be used to measure “education level”, I eventually chose adolescents out of school (female), since lower secondary school is an intermediate education level. I predict a positive correlation between women who believe a husband is justified in beating his wife (%) and female adolescents out of school (%).

Various visualizations:

I began by opening a new sheet and creating two calculated fields, which I named “female adolescents out of school(%)” and “women believing a husband is justified in beating his wife(%)”. I used an “IF, THEN, ELSE” statement to create each calculated field. Then, I dragged one variable to Columns and the other to Rows, and dragged the nominal variable “Country Name” to Detail. The result is the scatterplot shown below in Figure 3.2.

Figure 3.2: preliminary visualization I (using summed variables)

Tableau automatically plotted the sum of each measure, so every dot represented a country’s summed female out-of-school rate and summed rate of women believing a husband is justified in beating his wife. This does not make sense for two percentage variables: the x-axis and y-axis should have maximum values of 100, not 1600 and 400. I therefore changed the aggregation of both variables to average; the result is shown below in Figure 3.3.

Figure 3.3: preliminary visualization II (using averaged variables)
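The difference between the two aggregations can be sketched in a few lines of Python. This assumes the sums came from adding up one value per available year for each country, which is my reading of what Tableau did rather than something I verified. The `calculated_field` helper is a hypothetical stand-in for an IF ... THEN ... END calculated field, and all values are made up.

```python
from statistics import mean

OUT_OF_SCHOOL = "Adolescents out of school, female (% of female lower secondary school age)"

# Made-up long-format rows for one hypothetical country: (indicator, year, value).
rows = [
    (OUT_OF_SCHOOL, 2010, 40.0),
    (OUT_OF_SCHOOL, 2011, 38.0),
    (OUT_OF_SCHOOL, 2012, 35.0),
    (OUT_OF_SCHOOL, 2013, 33.0),
    ("GDP per capita", 2010, 1000.0),
]


def calculated_field(rows, indicator_name):
    """Keep the value when the indicator matches, as an IF ... THEN ... END field would."""
    return [value for name, _year, value in rows if name == indicator_name]


values = calculated_field(rows, OUT_OF_SCHOOL)
total = sum(values)     # SUM aggregation: grows with the number of years
average = mean(values)  # AVG aggregation: stays on the 0-100 percentage scale
print(total, average)
```

With four yearly values the sum is already above 100, while the average remains a valid percentage.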

Judging by the overall trend, the two variables are positively correlated, with a few outliers. Next, I added a trend line, which fits a linear regression model in Tableau, to gauge the strength of the correlation (see Figure 4.1 below). The trend line has an R-squared value of 0.512. Taking the square root of 0.512 gives a correlation coefficient R of about 0.72, which indicates a strong positive association between the two variables. Although correlation does not imply causation, the result supports my hypothesis that the percentage of women who believe a husband is justified in beating his wife is positively correlated with the percentage of female adolescents out of school.

Figure 4.1: preliminary visualization III (with trend line)
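The arithmetic behind the conversion from R-squared to R can be sketched as follows. For a linear fit with a single predictor, R-squared is the square of the Pearson correlation coefficient, and R takes the sign of the slope. The helper names below are my own, and the points passed to `pearson_r` are made up to show the formula, not taken from the scatterplot.

```python
import math


def r_from_r_squared(r_squared, slope_sign=1):
    """Recover R from R-squared for a one-predictor linear fit."""
    return math.copysign(math.sqrt(r_squared), slope_sign)


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# R-squared reported by the trend line, with a positive slope.
print(round(r_from_r_squared(0.512), 2))

# Made-up points on a perfect upward line, so R should be 1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```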

After creating the scatterplot, I applied the “cluster model” feature, with which Tableau automatically creates statistically based segments (clusters) that provide insight into how the data is distributed (shown in Figure 4.2 below).

Figure 4.2: preliminary visualization III (showing clusters)

The scatterplot is clustered into three groups, colored blue, yellow and red. The cluster feature segments the data into groups that were not defined in advance. Once the data has been segmented into clusters, I can easily identify where the outliers are. After applying clusters, I found that most outliers fall in the yellow cluster (medium out-of-school rate and medium tolerance of domestic violence).
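Tableau’s clustering is based on the k-means algorithm. To give a sense of how such a segmentation works, here is a plain k-means sketch in Python. It is not Tableau’s implementation: the number of clusters is fixed at three, the starting centers are hand-picked, and the points are made-up (out-of-school %, tolerance %) pairs rather than values from the scatterplot.

```python
import math


def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        for i in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Made-up (out-of-school %, tolerance %) pairs, not values from the chart.
points = [
    (5, 10), (8, 12), (6, 9),       # low on both variables
    (30, 35), (35, 40), (32, 60),   # medium, with one point further from the rest
    (70, 75), (75, 80), (72, 78),   # high on both variables
]
initial_centers = [(0, 0), (40, 40), (90, 90)]  # hand-picked starting centers

labels, centers = kmeans(points, initial_centers)
print(labels)
print(centers)
```

In this toy data the point (32, 60) is assigned to the middle cluster but sits far from that cluster’s center, which is the kind of point that stands out as an outlier once the clusters are colored.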

This pattern suggests that the correlation is polarized: it is stronger where both variables are very high (red) or very low (blue). To test this, I removed 5 outliers from the yellow cluster and added the trend line back to the graph. The coefficient R increased from 0.72 to 0.81 (see Figure 4.3).

Figure 4.3: preliminary visualization IV (with trend line)

Next, I decided to explore this correlation a little further by taking other variables into account. I wondered whether other indicators, such as women’s “wealth” and “career achievement”, are related to this correlation.

For the final visualization, I decided to create more than just a scatterplot of out-of-school rate against tolerance of domestic violence. I selected “GDP per capita” and “Female in Senior and Middle Management (%)” as the two variables to add to the current scatterplot. “GDP per capita” measures a country’s economic output relative to its population, which can be an effective measure of people’s wealth or standard of living. “Female in Senior and Middle Management (%)” is defined as “the proportion of females in total employment in senior and middle management” [5], which can indicate women’s career achievement in a given country.

I started by creating two new calculated fields. Next, I dragged “Female in Senior and Middle Management (%)” to Color, and changed its aggregation from sum to average (see Figure 5.1 below). According to the legend, darker circles mean higher career achievement. This suggests that countries with a higher share of women in senior and middle management tend to have lower female out-of-school rates and lower tolerance of domestic violence.

Figure 5.1: preliminary visualization V (adding another indicator)

To create the final visualization, I dragged “GDP per capita” to Size, and changed its aggregation to average. Larger circles mean higher GDP per capita, and the chart suggests that higher living standards go together with lower female out-of-school rates and lower tolerance of domestic violence. The final visualization shows the relationships among the four indicators. I chose “color” and “size” to represent career achievement and GDP per capita, respectively, because these two visual attributes are perceived easily and pre-attentively by the human eye, which makes the visualization more intuitive. The final visualization is shown below in Figure 5.2.

Figure 5.2: the final visualization

Reflection:

Through this exploratory visual analysis, I was able to explore different features in Tableau and use them to solve problems and create meaningful visualizations. Besides learning the skills to clean, organize and display data, I also learned how to focus on the data that is important to me and filter out irrelevant information, which I believe is a valuable skill, especially when working with a large data set. I enjoyed the process of exploring, formulating questions, creating visualizations, and iterating.

In particular, when creating the final visualization for the first question, I primarily focused on showing the worldwide patterns of the variables, so I chose to use a map. I learned that map visualizations are well suited to providing orientation and displaying geographic data; however, they are not very effective at showing changes over time. The final map visualization succeeds in showing data patterns across geographic locations, but it is weaker than a line graph at showing changes over time.

Overall, I think this project was challenging but rewarding. The most important lesson I learned from this project is that there is no perfect data visualization. When crafting the visualizations, I kept asking myself: are there any guidelines or frameworks for me to follow? Of aesthetics, intuitiveness, and interactivity, which should I prioritize? I realized that although there are some conventions to follow, there is no single right or wrong way to craft a visualization. However, as designers and scientists, it is important to understand the skillset and mindset of our audience in order to visualize data successfully and meaningfully.

References:

[1]: What We Do. (n.d.). Retrieved November 22, 2019, from https://www.worldbank.org/en/about/what-we-do.

[2]: World Development Indicators. (2019, October 28). Retrieved November 22, 2019, from https://datacatalog.worldbank.org/dataset/world-development-indicators.

[3]: Suicide mortality rate (per 100,000 population). (2019, October 22). Retrieved from https://datacatalog.worldbank.org/suicide-mortality-rate-100000-population-1.

[4]: The Ultimate Guide to Color Meanings. (n.d.). Retrieved November 22, 2019, from https://www.colorpsychology.org/.

[5]: Female share of employment in senior and middle management (%). (n.d.). Retrieved November 23, 2019, from https://datacatalog.worldbank.org/female-share-employment-senior-and-middle-management-1.
