Introduction

This project will analyze the Global Development dataset from CORGIS, a collection of 6,427 observations with 25 features across 189 countries.

The objective of this project is to identify the key factors that influence quality of life at a national level. Specifically, we will investigate the determinants of life expectancy within the dataset. We expect that features including infrastructure, population density, and the distribution of rural versus urban population will be significant predictors of life expectancy. This expectation stems from the understanding that infrastructure facilitates access to opportunities, population density can reflect the quality of living conditions, and the rural/urban population balance provides information about the labor force and access to essential services, all of which are potentially associated with wellbeing.

Exploratory Analysis

Data Cleaning

The initial exploratory data analysis involved several key steps. First, we removed the Country column because it did not provide unique identifying information. Next, we checked for null values (missing data) and found that the dataset contained very few, most of them occurring in the mobile_cellular_subscriptions and mobile_cellular_subscriptions_per_100_people columns. These gaps likely reflect the limited adoption and availability of mobile phones in the earlier time periods included in the data, so we concluded it would be appropriate to impute these missing values with 0.

We also identified 34 missing values in the rural_population column. Upon closer examination, we discovered that these pertained primarily to Singapore, which consistently reported zero rural population. Our research confirmed that Singapore's rapid urbanization and economic shift have essentially eliminated officially recognized rural areas (Source). Because of the missing data, we decided to exclude Singapore from the prediction data to reduce the risk of an outlier distorting our models.

After removing Singapore from the data, imputing the remaining null values with 0 and removing features that won't be used in our analysis, this is how our data is organized:

	tot_pop	tel_lines_per_100	rural_population	population_density	birth_rate	death_rate	fertility_rate	life_expectancy	arable_land_perc	agr_land_perc
0	24277000.0	39.562351	5918004	2.669706	15.4	7.0	1.754	74.866341	4.879744	7.357225
1	24593000.0	40.712240	5985198	2.704456	15.4	7.1	1.740	75.078049	4.918123	7.301471
2	24900000.0	41.570761	6047712	2.738217	15.3	7.0	1.700	75.478537	4.956502	7.245717
3	25202000.0	41.320642	6080235	2.771427	15.1	7.1	1.690	75.680488	4.960241	7.288275
4	25456000.0	41.429346	6100530	2.799359	15.0	7.0	1.680	76.036341	4.963870	7.330943
...	...	...	...	...	...	...	...	...	...	...
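
The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tiny `raw` frame and its values are made up, and the helper name `clean` is hypothetical. Only the column names come from the text.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the CORGIS data (values are illustrative only).
raw = pd.DataFrame({
    "Country": ["Australia", "Australia", "Singapore", "Kenya"],
    "rural_population": [5_918_004, 5_985_198, np.nan, 13_000_000],
    "mobile_cellular_subscriptions": [np.nan, np.nan, np.nan, 0.0],
    "life_expectancy": [74.9, 75.1, 72.2, 57.8],
})

# Count missing values per column before deciding how to handle them.
null_counts = raw.isna().sum()

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the steps described in the text."""
    out = df[df["Country"] != "Singapore"]   # exclude Singapore
    out = out.drop(columns=["Country"])      # drop the Country column
    out = out.fillna(0)                      # impute remaining nulls with 0
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(null_counts)
print(cleaned)
```

Filtering Singapore before dropping the Country column matters here, since the filter needs that column to identify the rows.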

Univariate Analysis

Distribution of the `birth_rate` Feature

Birth rate is multimodal, with several peaks throughout its distribution. This may be due to factors such as varying fertility preferences and disparities in healthcare access, which we will explore in depth during our analysis.

Distribution of the `life_expectancy` Feature

Life expectancy is skewed left, as the mortality rate generally increases as people get older. We see several fluctuations and even a slight trough as values approach the mean of 64.7 years, suggesting that a multitude of factors can affect the variable we are attempting to predict.

Bivariate Analysis

Telephone Lines per 100 People vs Life Expectancy

As visualized in the graph above, the life_expectancy and tel_lines_per_100 features have a substantial correlation of about 0.71. This supports our intuition that technology accessibility is positively correlated with life expectancy. One plausible explanation is that countries with better access to technology also tend to have more advanced healthcare technology and resources, which are likely to prolong lifespan.
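
A correlation like the one quoted above can be computed with pandas. The sketch below assumes the value reported is the Pearson correlation (the text does not say which coefficient was used) and runs on synthetic data, so the number it prints is not the project's 0.71.

```python
import numpy as np
import pandas as pd

# Synthetic data with a positive linear trend; not the real dataset.
rng = np.random.default_rng(0)
tel = rng.uniform(0, 60, size=500)
life = 55 + 0.35 * tel + rng.normal(0, 5, size=500)
df = pd.DataFrame({"tel_lines_per_100": tel, "life_expectancy": life})

# Series.corr defaults to the Pearson correlation coefficient.
r = df["tel_lines_per_100"].corr(df["life_expectancy"])
print(f"Pearson r = {r:.2f}")
```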

Prediction Problem

The prediction task for this project is framed as a regression problem, where we seek to predict the `life_expectancy` feature.

Recognizing that life expectancy can serve as an indicator of a nation's overall wellbeing, we are attempting to model the relationship between various factors and this outcome. Based on our data cleaning and preliminary exploration of the dataset, we anticipate that tel_lines_per_100 (a measure of technology access), features related to the population distribution such as rural vs. urban and density, and birth rate will be particularly relevant for predicting life expectancy.

Model performance will be assessed via error on held-out test data, measured as mean squared error (MSE). This approach is preferred over relying solely on training error, as models can overfit the training data, resulting in deceptively low training error and poor performance on unseen data. Evaluating our model on a separate test set provides a more realistic assessment of its ability to generalize.
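
To illustrate why training error alone can mislead, the sketch below fits an unconstrained decision tree to synthetic noisy data. The tree is chosen only because it overfits readily; it is an assumption for demonstration and not one of the project's models.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy linear data (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 60, size=(400, 1))
y = 55 + 0.35 * X[:, 0] + rng.normal(0, 5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unconstrained tree can memorize every training point.
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The training MSE is essentially zero because the tree memorizes the noise, while the test MSE reflects how the model behaves on data it has not seen.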

Baseline Model

Our baseline model uses 3 features to predict life expectancy:

Rural population percentage: This feature, calculated as rural population divided by total population, represents the proportion of the population living in rural areas. We included this variable because our exploratory groupings suggested a positive correlation between birth rate and life expectancy, and we hypothesized that rural population percentage might be related to birth rates.
Quantitative. Calculated via a FunctionTransformer.

Population density: This feature, defined as the total population divided by surface area, was selected so that we can assess how strongly it correlates with life expectancy.
Quantitative. Passed through the pipeline unaltered with a 'passthrough' FunctionTransformer.

Telephone lines per 100 people: We chose tel_lines_per_100 as a proxy for technology accessibility, as it provides a quantifiable measure of access to communication infrastructure. We hypothesize that technology accessibility is positively correlated with life expectancy.
Quantitative. Passed through the pipeline unaltered with a 'passthrough' FunctionTransformer.
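
One way the three baseline features could be wired together in scikit-learn is sketched below. Several things here are assumptions rather than details from the report: the regressor is taken to be `LinearRegression`, the helper `rural_share` is hypothetical, and the pass-through columns use `ColumnTransformer`'s built-in `"passthrough"` option. The five rows are the sample rows shown in the data cleaning table.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# The five sample rows from the cleaned-data table above.
df = pd.DataFrame({
    "tot_pop": [24277000.0, 24593000.0, 24900000.0, 25202000.0, 25456000.0],
    "rural_population": [5918004, 5985198, 6047712, 6080235, 6100530],
    "population_density": [2.669706, 2.704456, 2.738217, 2.771427, 2.799359],
    "tel_lines_per_100": [39.562351, 40.712240, 41.570761, 41.320642, 41.429346],
    "life_expectancy": [74.866341, 75.078049, 75.478537, 75.680488, 76.036341],
})

def rural_share(X):
    """Hypothetical helper: rural population / total population."""
    X = np.asarray(X, dtype=float)
    return X[:, [0]] / X[:, [1]]

preprocess = ColumnTransformer([
    ("rural_pct", FunctionTransformer(rural_share),
     ["rural_population", "tot_pop"]),
    # Leave these two columns unchanged.
    ("keep", "passthrough", ["population_density", "tel_lines_per_100"]),
])

# Assumed regressor; the report does not name the baseline estimator.
baseline = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

features = preprocess.fit_transform(df)
baseline.fit(df, df["life_expectancy"])
print(features[:2])
```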

Model Results

	Training MSE	Validation MSE
Fold 0	40.759611	39.007372
Fold 1	40.376617	40.549943
Fold 2	40.530343	39.964884
Fold 3	40.221227	41.223159
Fold 4	40.090048	41.722994

We consistently achieved a training MSE and validation MSE of around 40 with our baseline model, using 5 folds for cross-validation. This corresponds to a root mean squared error (RMSE) of about 6.3 years, meaning our model's predictions typically miss the true life expectancy by roughly that much. We concluded that this error was relatively large, and aimed to construct a more accurate model.
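
A per-fold table of training and validation MSE like the one above can be produced with `cross_validate`. The sketch below uses synthetic data and a plain `LinearRegression` as stand-ins, so its numbers will not match the reported folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 65 + X @ np.array([4.0, -2.0, 1.0]) + rng.normal(0, 6, size=500)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    LinearRegression(), X, y, cv=cv,
    scoring="neg_mean_squared_error",  # sklearn reports negated MSE
    return_train_score=True,
)

train_mse = -scores["train_score"]
valid_mse = -scores["test_score"]
valid_rmse = np.sqrt(valid_mse)  # back in the units of the target

for fold, (tr, va) in enumerate(zip(train_mse, valid_mse)):
    print(f"Fold {fold}: train MSE = {tr:.3f}, validation MSE = {va:.3f}")
```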

Final Model

For our final model, we aimed to enhance our baseline model from Step 4 by incorporating new features to create a more accurate predictor. We chose to experiment with sklearn's PolynomialFeatures transformer, hypothesizing that the data may not follow a linear pattern and would be better modeled by a polynomial equation. We also introduced some additional features to our model, as listed below:

Rate of Natural Increase: We decided that the rate of natural increase (RNI), defined as birth rate minus death rate, has potential as a good additional feature. RNI serves as an indicator of population change that excludes the effects of migration to and from a country (Definition).
Standardized Population: We wanted to determine whether a standardized total population feature could be useful in our final model, as the raw values of total_population are very large. Standardizing helps prevent total_population from dominating other features due to its scale.
Fertility Rate: Another feature we introduced was fertility_rate, which is defined as the number of live births divided by the female population labeled as 'fertile'. Our intuition was that a higher fertility rate could be associated with a higher life expectancy, since regions with higher fertility may also invest more in healthcare for their citizens.
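
The three added features could be combined with a polynomial expansion roughly as follows. This is a sketch under stated assumptions: it shows only the new features (not the baseline ones), uses `tot_pop` from the sample table as the population column, assumes `LinearRegression` as the estimator, and the helper `natural_increase` is hypothetical. Five rows are far too few to fit a degree-3 model meaningfully; they only demonstrate the wiring.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer, PolynomialFeatures, StandardScaler,
)

# The five sample rows from the cleaned-data table.
df = pd.DataFrame({
    "tot_pop": [24277000.0, 24593000.0, 24900000.0, 25202000.0, 25456000.0],
    "birth_rate": [15.4, 15.4, 15.3, 15.1, 15.0],
    "death_rate": [7.0, 7.1, 7.0, 7.1, 7.0],
    "fertility_rate": [1.754, 1.740, 1.700, 1.690, 1.680],
    "life_expectancy": [74.866341, 75.078049, 75.478537, 75.680488, 76.036341],
})

def natural_increase(X):
    """Hypothetical helper: birth rate minus death rate."""
    X = np.asarray(X, dtype=float)
    return X[:, [0]] - X[:, [1]]

features = ColumnTransformer([
    ("rni", FunctionTransformer(natural_increase),
     ["birth_rate", "death_rate"]),
    ("std_pop", StandardScaler(), ["tot_pop"]),
    ("fertility", "passthrough", ["fertility_rate"]),
])

final_model = Pipeline([
    ("features", features),
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("reg", LinearRegression()),  # assumed estimator
])

engineered = features.fit_transform(df)
final_model.fit(df, df["life_expectancy"])
print(engineered[:2])
```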

Model Results

	Training MSE	Validation MSE
Results	7.079275	7.041261

After some experimentation with the degree hyperparameter of the PolynomialFeatures transformer, we determined that the ideal value (i.e., the value that minimizes the mean squared error) is degree 3.
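
A search over the polynomial degree can be automated with `GridSearchCV`. The report does not say how the degree was tuned, so the grid, the 5-fold setting, and the synthetic cubic data below are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a cubic relationship (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 1))
y = X[:, 0] ** 3 - X[:, 0] + rng.normal(0, 0.1, size=300)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("reg", LinearRegression()),
])

# Try each candidate degree and keep the one with the lowest CV MSE.
search = GridSearchCV(
    pipe,
    param_grid={"poly__degree": [1, 2, 3, 4, 5]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_degree = search.best_params_["poly__degree"]
best_mse = -search.best_score_
print(f"best degree = {best_degree}, CV MSE = {best_mse:.4f}")
```

Selecting the degree on validation folds rather than on training error guards against simply picking the highest degree, which always fits the training data at least as well.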

Introducing these new features has resulted in our validation MSE decreasing to approximately 7.04. This corresponds to an RMSE of about 2.65 years, which is over a 50% reduction in error from our baseline model's RMSE of about 6.3 years. For reference, the standard deviation of the life expectancy variable is about 10.6 years. Our model's RMSE is much smaller than this standard deviation, which suggests that the model captures most of the variation in the data.
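
The arithmetic behind these comparisons can be checked directly from the reported tables. The sketch below averages the five baseline validation folds, which is one reasonable way to summarize them and gives a value close to the rounded 6.3 quoted in the text.

```python
import math

# Validation MSE values copied from the two results tables.
baseline_valid_mse = [39.007372, 40.549943, 39.964884, 41.223159, 41.722994]
final_valid_mse = 7.041261
life_expectancy_std = 10.6  # standard deviation quoted in the text

# RMSE puts the error back into years, the unit of the target.
baseline_rmse = math.sqrt(sum(baseline_valid_mse) / len(baseline_valid_mse))
final_rmse = math.sqrt(final_valid_mse)

reduction = 1 - final_rmse / baseline_rmse       # relative drop in RMSE
rmse_to_std = final_rmse / life_expectancy_std   # RMSE as a fraction of std

print(f"baseline RMSE = {baseline_rmse:.2f} years")
print(f"final RMSE    = {final_rmse:.2f} years")
print(f"reduction     = {reduction:.0%}")
print(f"RMSE / std    = {rmse_to_std:.2f}")
```

Comparing RMSE with the standard deviation keeps both quantities in years; comparing MSE with the standard deviation would mix squared and unsquared units.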

A Nation's Happiness - Quantified via Life Expectancy

by Abby Schultz and Mariana Ceballos
