A Nation's Happiness - Quantified via Life Expectancy

by Abby Schultz and Mariana Ceballos

Introduction

This project will analyze the Global Development dataset from CORGIS, a collection of 6,427 observations with 25 features across 189 countries.

The objective of this project is to identify the key factors that influence quality of life at a national level. Specifically, we will investigate the determinants of life expectancy within the dataset. We expect that features including infrastructure, population density, and the distribution of rural versus urban population will be significant predictors of life expectancy. This expectation stems from the understanding that infrastructure facilitates access to opportunities, population density can reflect the quality of living conditions, and the rural/urban population balance provides information about the labor force and access to essential services, all of which are potentially associated with wellbeing.

Exploratory Analysis


Data Cleaning

The initial exploratory data analysis involved several key steps. First, we removed the Country column because it did not provide unique identifying information. Next, we checked for null values (missing data) and found that the dataset contained very few, most of them in the mobile_cellular_subscriptions and mobile_cellular_subscriptions_per_100_people columns. These gaps likely reflect the limited adoption and availability of mobile phones in the earlier time periods covered by the data, so we concluded it would be appropriate to impute these missing values with 0.

We also identified 34 missing values in the rural_population column. Upon closer examination, we discovered that these pertained primarily to Singapore, which consistently reported zero rural population. Our research confirmed that Singapore's rapid urbanization and economic shift have essentially eliminated officially recognized rural areas (Source). Because of the missing data, we decided to exclude Singapore from the prediction data to reduce the risk of a potential outlier skewing our models.

After removing Singapore from the data, imputing the remaining null values with 0 and removing features that won't be used in our analysis, this is how our data is organized:

        tot_pop  tel_lines_per_100  rural_population  population_density  birth_rate  death_rate  fertility_rate  life_expectancy  arable_land_perc  agr_land_perc
0    24277000.0          39.562351           5918004            2.669706        15.4         7.0           1.754        74.866341          4.879744       7.357225
1    24593000.0          40.712240           5985198            2.704456        15.4         7.1           1.740        75.078049          4.918123       7.301471
2    24900000.0          41.570761           6047712            2.738217        15.3         7.0           1.700        75.478537          4.956502       7.245717
3    25202000.0          41.320642           6080235            2.771427        15.1         7.1           1.690        75.680488          4.960241       7.288275
4    25456000.0          41.429346           6100530            2.799359        15.0         7.0           1.680        76.036341          4.963870       7.330943
...         ...                ...               ...                 ...         ...         ...             ...              ...               ...            ...
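The cleaning steps described in this section can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the helper name clean_development_data and the toy values are hypothetical, and the sketch filters out Singapore before dropping the Country column, which is an assumption about the order of operations.

```python
import numpy as np
import pandas as pd


def clean_development_data(df: pd.DataFrame) -> pd.DataFrame:
    """Exclude Singapore, drop the Country column, and impute nulls with 0."""
    # Assumption: Singapore rows are filtered while Country is still available.
    cleaned = df[df["Country"] != "Singapore"]
    cleaned = cleaned.drop(columns=["Country"])
    # Remaining gaps (mostly mobile subscription columns) are imputed with 0.
    cleaned = cleaned.fillna(0)
    return cleaned.reset_index(drop=True)


# Tiny made-up frame for illustration only (not values from the dataset).
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "Singapore", "B"],
        "rural_population": [100.0, 110.0, np.nan, 50.0],
        "mobile_cellular_subscriptions": [np.nan, 5.0, 9.0, np.nan],
    }
)
cleaned_toy = clean_development_data(toy)
print(cleaned_toy)
```

Imputing with 0 is a reasonable choice here only because the missing mobile subscription values plausibly correspond to years before mobile phones were available; for other columns a different strategy might be more appropriate.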

Univariate Analysis

Distribution of the birth_rate Feature

Birth rate is multimodal, with several peaks throughout its distribution. This may be due to factors such as varying fertility preferences and disparities in healthcare access, which we will explore in depth during our analysis.


 

Distribution of the life_expectancy Feature

Life expectancy is skewed left, as the mortality rate generally increases as people get older. We see several fluctuations and even a slight trough as values approach the mean of 64.7 years, implying that a multitude of factors can affect the variable we are attempting to predict.

Bivariate Analysis

Telephone Lines per 100 People vs Life Expectancy

As visualized in the graph above, the life_expectancy and tel_lines_per_100 features have a substantial correlation of about 0.71. This supports our intuition that technology accessibility is positively correlated with life expectancy. One possible explanation is that countries with better access to technology also tend to have more advanced healthcare technology and resources, which are likely to prolong lifespan.
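A correlation like the one quoted above could be computed along these lines. The sketch assumes the Pearson coefficient (the report does not say which coefficient was used), and the helper name correlation_with_target and the toy values are hypothetical.

```python
import pandas as pd


def correlation_with_target(
    df: pd.DataFrame, feature: str, target: str = "life_expectancy"
) -> float:
    """Pearson correlation between one feature and the target column."""
    # Series.corr uses the Pearson coefficient by default.
    return float(df[feature].corr(df[target]))


# Made-up, perfectly linear toy data; the real dataset gives about 0.71.
toy = pd.DataFrame(
    {
        "tel_lines_per_100": [1.0, 2.0, 3.0, 4.0],
        "life_expectancy": [50.0, 55.0, 60.0, 65.0],
    }
)
r = correlation_with_target(toy, "tel_lines_per_100")
print(round(r, 3))
```

A Pearson coefficient only measures linear association, so it says nothing about whether telephone access causes longer lives.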

Prediction Problem


The prediction task for this project is framed as a regression problem, where we seek to predict the life_expectancy feature.

Recognizing that life expectancy can serve as an indicator of a nation's overall wellbeing, we are attempting to model the relationship between various factors and this outcome. Based on our data cleaning and preliminary exploration of the dataset, we anticipate that tel_lines_per_100 (a measure of technology access), features related to the population distribution such as rural vs. urban and density, and birth rate will be particularly relevant for predicting life expectancy.

Model performance will be assessed using mean squared error (MSE) on held-out data. This approach is preferred over relying solely on training error, as models can overfit the training data, resulting in deceptively low training error and poor performance on unseen data. Evaluating our model on data it was not trained on provides a more realistic assessment of its ability to generalize.
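The gap between training and held-out performance can be illustrated with a deliberately overfit model. The sketch below uses synthetic data and an unconstrained decision tree purely as a stand-in; neither is part of the project, and the split size and random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: three numeric features and a noisy target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 65 + X @ np.array([5.0, -2.0, 1.0]) + rng.normal(scale=3.0, size=300)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# A fully grown tree memorizes the training rows, noise included.
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

Here the training error is essentially zero while the held-out error is not, which is why the training number alone would be misleading.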

Baseline Model

Our baseline model uses 3 features to predict life expectancy:

  1. Rural population percentage: This feature, calculated as rural population divided by total population, represents the proportion of the population living in rural areas. We included this variable because our exploratory groupings suggested a positive correlation between birth rate and life expectancy, and we hypothesized that rural population percentage might be related to birth rates.

    Quantitative. Calculated via a FunctionTransformer.

  2. Population density: This feature, defined as the total population divided by surface area, was selected so that we could assess how strongly it correlates with life expectancy.

    Quantitative. Passed through the pipeline unaltered with a 'passthrough' FunctionTransformer.

  3. Telephone Lines per 100 People: We chose tel_lines_per_100 as a proxy for technology accessibility, as it provides a quantifiable measure of access to communication infrastructure. We hypothesize that technology accessibility is positively correlated with life expectancy.

    Quantitative. Passed through the pipeline unaltered with a 'passthrough' FunctionTransformer.
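A pipeline with the three features listed above might look roughly like the sketch below. It assumes a linear regression estimator, which the report does not name explicitly, and it uses the five sample rows shown in the Data Cleaning section only to demonstrate the mechanics. The helper name rural_share and the step names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def rural_share(X):
    """Rural population divided by total population, as a single column."""
    X = np.asarray(X, dtype=float)
    return (X[:, 0] / X[:, 1]).reshape(-1, 1)


preprocess = ColumnTransformer(
    transformers=[
        # Engineered feature: rural population percentage.
        ("rural_pct", FunctionTransformer(rural_share),
         ["rural_population", "tot_pop"]),
        # These two columns are forwarded unchanged.
        ("unchanged", "passthrough",
         ["population_density", "tel_lines_per_100"]),
    ]
)

# Assumption: ordinary least squares as the baseline estimator.
baseline = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

# The five sample rows from the cleaned-data preview.
sample = pd.DataFrame(
    {
        "tot_pop": [24277000.0, 24593000.0, 24900000.0, 25202000.0, 25456000.0],
        "rural_population": [5918004, 5985198, 6047712, 6080235, 6100530],
        "population_density": [2.669706, 2.704456, 2.738217, 2.771427, 2.799359],
        "tel_lines_per_100": [39.562351, 40.712240, 41.570761, 41.320642, 41.429346],
        "life_expectancy": [74.866341, 75.078049, 75.478537, 75.680488, 76.036341],
    }
)
X_sample = sample.drop(columns=["life_expectancy"])
baseline.fit(X_sample, sample["life_expectancy"])
print(baseline.predict(X_sample))
```

Five rows are far too few to say anything about model quality; the fit here only confirms that the transformers and the estimator are wired together correctly.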

 

 

Model Results

        Training MSE  Validation MSE
Fold 0     40.759611       39.007372
Fold 1     40.376617       40.549943
Fold 2     40.530343       39.964884
Fold 3     40.221227       41.223159
Fold 4     40.090048       41.722994

 

We consistently achieved a training MSE and validation MSE of around 40 with our baseline model, using 5 folds for cross-validation. Taking the square root gives a root mean squared error (RMSE) of about 6.3 years, which means our model's predictions typically miss the true life expectancy by roughly 6.3 years. We concluded that this error was relatively large, and aimed to construct a more accurate model.
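Per-fold training and validation errors of this kind can be collected with scikit-learn's cross_validate. The sketch below runs on synthetic data with a plain linear regression, both of which are assumptions made for illustration; the shuffling and random seed are likewise arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (not the Global Development dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=cv,
    scoring="neg_mean_squared_error",  # scikit-learn reports negated MSE
    return_train_score=True,
)

# Flip the sign to recover ordinary (positive) MSE values per fold.
train_mse = -scores["train_score"]
val_mse = -scores["test_score"]

# RMSE is in the same units as the target, which makes it easier to read.
val_rmse = np.sqrt(val_mse.mean())

for fold, (tr, va) in enumerate(zip(train_mse, val_mse)):
    print(f"Fold {fold}: train MSE {tr:.3f}, validation MSE {va:.3f}")
print(f"Validation RMSE: {val_rmse:.3f}")
```

Training and validation errors that stay close to each other across folds, as in the table above, suggest the model is underfitting rather than overfitting.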

Final Model

For our final model, we aimed to enhance our baseline model from Step 4 by incorporating new features to create a more accurate predictor. We chose to experiment with sklearn's PolynomialFeatures transformer, hypothesizing that the data may not follow a linear pattern and might be better modeled by a polynomial equation. We also introduced some additional features to our model, as listed below:

  1. Rate of Natural Increase: We added the rate of natural increase (RNI), defined as birth rate minus death rate, as an additional feature. RNI serves as an indicator of population change that excludes the effects of migration to or from a country (Definition).

  2. Standardized Population: We are curious to determine if a standardized total population feature could be useful in our final model, as the raw values of total_population are very large. Standardizing helps to prevent total_population from dominating other features due to its scale.

  3. Fertility Rate: Another feature we introduced was fertility_rate, which is defined as the number of live births divided by the female population labeled as 'fertile'. Intuitively, a higher fertility rate would be associated with a potential for a higher life expectancy, since regions with higher fertility may also invest more in healthcare for their citizens.
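The three additional features above could be combined with a polynomial expansion roughly as sketched here. For brevity the sketch includes only the new features, uses synthetic data, and assumes a linear regression on top of the expanded features; the helper name natural_increase and the step names are hypothetical. The degree of 3 matches the value reported in the results below.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    PolynomialFeatures,
    StandardScaler,
)


def natural_increase(X):
    """Rate of natural increase: birth rate minus death rate."""
    X = np.asarray(X, dtype=float)
    return (X[:, 0] - X[:, 1]).reshape(-1, 1)


preprocess = ColumnTransformer(
    transformers=[
        ("rni", FunctionTransformer(natural_increase), ["birth_rate", "death_rate"]),
        # Standardizing keeps the very large population values on a small scale.
        ("std_pop", StandardScaler(), ["tot_pop"]),
        ("unchanged", "passthrough", ["fertility_rate"]),
    ]
)

final = Pipeline(
    [
        ("preprocess", preprocess),
        ("poly", PolynomialFeatures(degree=3)),
        ("regressor", LinearRegression()),  # assumed estimator
    ]
)

# Synthetic stand-in data with plausible ranges (not real observations).
rng = np.random.default_rng(1)
X = pd.DataFrame(
    {
        "birth_rate": rng.uniform(8, 50, size=100),
        "death_rate": rng.uniform(3, 25, size=100),
        "tot_pop": rng.uniform(1e5, 1e9, size=100),
        "fertility_rate": rng.uniform(1, 7, size=100),
    }
)
y = 80 - 4 * X["fertility_rate"] + rng.normal(scale=2.0, size=100)

final.fit(X, y)
print(final.named_steps["poly"].n_output_features_)
```

Scaling before the polynomial step matters in practice, because cubing an unscaled population count would produce values large enough to cause numerical trouble.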

 

 

Model Results

         Training MSE  Validation MSE
Results      7.079275        7.041261

 

After some experimenting with the degree hyperparameter of our PolynomialFeatures transformer, we determined that the ideal value (i.e., the value that minimizes the mean squared error) is degree 3.
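One way to run this kind of degree search is a cross-validated grid search, sketched below. The report does not say how the experiment was carried out, so GridSearchCV, the candidate degrees, and the synthetic cubic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a cubic relationship plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(scale=0.3, size=200)

pipe = Pipeline(
    [
        ("poly", PolynomialFeatures()),
        ("regressor", LinearRegression()),
    ]
)

# Try each candidate degree with 5-fold cross-validation.
search = GridSearchCV(
    pipe,
    param_grid={"poly__degree": [1, 2, 3, 4]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_degree = search.best_params_["poly__degree"]
best_val_mse = -search.best_score_  # undo the sign flip
print(best_degree, round(best_val_mse, 3))
```

Selecting the degree on validation error rather than training error is important, since training error keeps falling as the degree grows even when the extra terms only fit noise.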

Introducing these new features reduced our validation MSE to approximately 7.04 (training MSE: approximately 7.08). This corresponds to an RMSE of about 2.65 years, which is over a 50% reduction from our baseline model's error of about 6.3 years. For reference, the standard deviation of the life expectancy variable is about 10.6 years. Our model's RMSE is much smaller than this standard deviation, which suggests that the model captures most of the variation in life expectancy rather than simply predicting values near the mean.
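The error figures quoted above can be checked with a few lines of arithmetic. The inputs are the rounded values reported in this write-up (a baseline MSE of about 40, a final validation MSE of 7.041261, and a standard deviation of about 10.6), so the results are approximate. Comparing MSE against the squared standard deviation is an added rough check, not a statistic computed in the project.

```python
import math

baseline_mse = 40.0        # approximate baseline validation MSE
final_val_mse = 7.041261   # final model validation MSE from the results table
life_exp_sd = 10.6         # approximate standard deviation of life_expectancy

# RMSE puts the error back into years.
baseline_rmse = math.sqrt(baseline_mse)
final_rmse = math.sqrt(final_val_mse)

# Relative reduction in RMSE from baseline to final model.
rmse_reduction = 1 - final_rmse / baseline_rmse

# MSE must be compared with the variance (SD squared), not the SD itself.
unexplained_share = final_val_mse / life_exp_sd**2

print(f"baseline RMSE: {baseline_rmse:.2f} years")
print(f"final RMSE:    {final_rmse:.2f} years")
print(f"RMSE reduction: {rmse_reduction:.0%}")
print(f"MSE / variance: {unexplained_share:.3f}")
```

Because MSE is measured in squared years and the standard deviation in years, the like-for-like comparisons are RMSE against standard deviation or MSE against variance.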