import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
data = pd.read_csv('/kaggle/input/countrydatacsv/Country-data.csv')
data
 | country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
1 | Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
2 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
3 | Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
4 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
162 | Vanuatu | 29.2 | 46.6 | 5.25 | 52.7 | 2950 | 2.62 | 63.0 | 3.50 | 2970 |
163 | Venezuela | 17.1 | 28.5 | 4.91 | 17.6 | 16500 | 45.90 | 75.4 | 2.47 | 13500 |
164 | Vietnam | 23.3 | 72.0 | 6.84 | 80.2 | 4490 | 12.10 | 73.1 | 1.95 | 1310 |
165 | Yemen | 56.3 | 30.0 | 5.18 | 34.4 | 4480 | 23.60 | 67.5 | 4.67 | 1310 |
166 | Zambia | 83.1 | 37.0 | 5.89 | 30.9 | 3280 | 14.00 | 52.0 | 5.40 | 1460 |
167 rows × 10 columns
data.isnull().sum()
country       0
child_mort    0
exports       0
health        0
imports       0
income        0
inflation     0
life_expec    0
total_fer     0
gdpp          0
dtype: int64
data.dtypes
country        object
child_mort    float64
exports       float64
health        float64
imports       float64
income          int64
inflation     float64
life_expec    float64
total_fer     float64
gdpp            int64
dtype: object
data['country'].describe()
count             167
unique            167
top       Afghanistan
freq                1
Name: country, dtype: object
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   country     167 non-null    object
 1   child_mort  167 non-null    float64
 2   exports     167 non-null    float64
 3   health      167 non-null    float64
 4   imports     167 non-null    float64
 5   income      167 non-null    int64
 6   inflation   167 non-null    float64
 7   life_expec  167 non-null    float64
 8   total_fer   167 non-null    float64
 9   gdpp        167 non-null    int64
dtypes: float64(7), int64(2), object(1)
memory usage: 13.2+ KB
data['country'].nunique()
167
Let's begin by standardizing the data and performing PCA.
# Excluding the 'country' column for standardization
features = data.columns[1:]
x = data.loc[:, features].values
# Standardizing the features
x = StandardScaler().fit_transform(x)
# Performing PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents, columns=['principal component 1', 'principal component 2'])
# Concatenating the country names for visualization
finalDf = pd.concat([principalDf, data[['country']]], axis=1)
# Plotting the PCA results
plt.figure(figsize=(10,6))
plt.scatter(finalDf['principal component 1'], finalDf['principal component 2'])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('2 component PCA')
plt.show()
# Displaying the variance explained by each principal component
pca.explained_variance_ratio_
array([0.4595174 , 0.17181626])
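The two ratios above sum to about 0.63, so a two-component view leaves roughly a third of the variance unexplained. A quick way to judge how many components are worth keeping is to look at the cumulative explained variance of a full PCA. The sketch below is only an illustration: `n_components_for` and the 0.90 threshold are hypothetical choices, and `X_demo` is synthetic stand-in data with the same shape as the feature matrix (167 × 9), not the country dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def n_components_for(X, threshold=0.90):
    """Return (k, cumulative): the smallest k whose cumulative explained
    variance ratio reaches `threshold`, plus the cumulative ratios.
    Hypothetical helper, not part of scikit-learn."""
    X_std = StandardScaler().fit_transform(X)
    # PCA() with no n_components keeps every component
    cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
    # first index whose cumulative ratio is >= threshold, as a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative


# Synthetic stand-in: 3 latent factors mixed into 9 noisy features
rng = np.random.default_rng(0)
latent = rng.normal(size=(167, 3))
mixing = rng.normal(size=(3, 9))
X_demo = latent @ mixing + 0.5 * rng.normal(size=(167, 9))

k, cumulative = n_components_for(X_demo, threshold=0.90)
print(k, np.round(cumulative, 3))
```

With the real data, the same call would take `data[features].values` in place of `X_demo`.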
The PCA loadings on the original features reveal the following insights:
Principal Component 1 (PC1): It is most positively influenced by life expectancy (0.43), income (0.40), and GDP per capita (0.39), and most negatively by child mortality (-0.42) and total fertility (-0.40). This suggests that PC1 represents overall socio-economic development and health standards.
Principal Component 2 (PC2): It has its largest positive loadings on imports (0.67) and exports (0.61), indicating that this component may represent the degree of trade engagement or economic openness.
Based on these findings, if the goal is to allocate $100 million effectively, the indicators that most influence PC1 are the natural place to look, since PC1 accounts for the largest share of variance in the dataset (about 46%). Investments in areas that improve life expectancy, income, and GDP per capita, while reducing child mortality and total fertility, would be impactful. This could include funding healthcare, education, and economic development initiatives in countries where these indicators are lagging.
# Getting the PCA components
pca_components = pca.components_
# Creating a DataFrame for better visualization of the PCA components
pca_components_df = pd.DataFrame(data=pca_components, columns=features, index=['PC1', 'PC2'])
# Displaying the PCA components
pca_components_df.transpose().sort_values(by='PC1', ascending=False)
 | PC1 | PC2 |
---|---|---|
life_expec | 0.425839 | -0.222707 |
income | 0.398441 | 0.022536 |
gdpp | 0.392645 | -0.046022 |
exports | 0.283897 | 0.613163 |
imports | 0.161482 | 0.671821 |
health | 0.150838 | -0.243087 |
inflation | -0.193173 | -0.008404 |
total_fer | -0.403729 | 0.155233 |
child_mort | -0.419519 | 0.192884 |
Such targeted investments are likely to yield significant improvements in the overall socio-economic and health conditions of the countries in question, aligning with the principal component analysis results.
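To make "countries where these indicators are lagging" concrete, one option is to rank countries by their PC1 score. The sketch below assumes that a low PC1 score means lagging development, which only holds once the sign of PC1 is fixed (the sign of a principal component is arbitrary), so the code flips it to make the life expectancy loading positive. `rank_by_pc1` and `orient_by` are hypothetical names, and `sample` contains only the ten rows visible in the preview above, so the resulting order is illustrative rather than the full 167-country ranking.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

columns = ["country", "child_mort", "exports", "health", "imports",
           "income", "inflation", "life_expec", "total_fer", "gdpp"]

# The ten rows shown in the data preview (not the full dataset)
sample = pd.DataFrame([
    ("Afghanistan", 90.2, 10.0, 7.58, 44.9, 1610, 9.44, 56.2, 5.82, 553),
    ("Albania", 16.6, 28.0, 6.55, 48.6, 9930, 4.49, 76.3, 1.65, 4090),
    ("Algeria", 27.3, 38.4, 4.17, 31.4, 12900, 16.10, 76.5, 2.89, 4460),
    ("Angola", 119.0, 62.3, 2.85, 42.9, 5900, 22.40, 60.1, 6.16, 3530),
    ("Antigua and Barbuda", 10.3, 45.5, 6.03, 58.9, 19100, 1.44, 76.8, 2.13, 12200),
    ("Vanuatu", 29.2, 46.6, 5.25, 52.7, 2950, 2.62, 63.0, 3.50, 2970),
    ("Venezuela", 17.1, 28.5, 4.91, 17.6, 16500, 45.90, 75.4, 2.47, 13500),
    ("Vietnam", 23.3, 72.0, 6.84, 80.2, 4490, 12.10, 73.1, 1.95, 1310),
    ("Yemen", 56.3, 30.0, 5.18, 34.4, 4480, 23.60, 67.5, 4.67, 1310),
    ("Zambia", 83.1, 37.0, 5.89, 30.9, 3280, 14.00, 52.0, 5.40, 1460),
], columns=columns)


def rank_by_pc1(df, orient_by="life_expec"):
    """Rank countries by PC1 score, lowest first (hypothetical helper)."""
    feature_cols = [c for c in df.columns if c != "country"]
    X = StandardScaler().fit_transform(df[feature_cols])
    pca = PCA(n_components=1).fit(X)
    # Flip the component if needed so that `orient_by` loads positively
    loading = pca.components_[0][feature_cols.index(orient_by)]
    sign = 1.0 if loading >= 0 else -1.0
    scores = sign * pca.transform(X)[:, 0]
    ranking = pd.DataFrame({"country": df["country"].to_numpy(),
                            "pc1_score": scores})
    return ranking.sort_values("pc1_score").reset_index(drop=True)


ranking = rank_by_pc1(sample)
print(ranking)
```

On the full dataset the call would be `rank_by_pc1(data)`; the countries at the top of the resulting table would be the candidates for a closer look.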
from sklearn.model_selection import train_test_split
# Splitting the dataset into training and testing sets
X_train, X_test = train_test_split(x, test_size=0.2, random_state=42)
# Applying PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# Extracting the explained variance
explained_variance = pca.explained_variance_ratio_
# Visualizing the PCA-transformed training data
plt.figure(figsize=(10,6))
plt.scatter(X_train_pca[:,0], X_train_pca[:,1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Training Data')
plt.show()
explained_variance
array([0.47172968, 0.14401055])
The PCA-transformed training data is visualized above, with the first two principal components shown. In this representation:
Principal Component 1 (PC1) explains approximately 47.17% of the variance.
Principal Component 2 (PC2) accounts for about 14.40% of the variance.
Together the two components capture roughly 61.6% of the variance in the training data. PC1 explains more than three times as much variance as PC2, but over a third of the variance lies outside this two-dimensional view.
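One caveat about the split above: `x` was standardized on all 167 rows before `train_test_split` was called, so the scaling statistics already include the test rows. A possible alternative, sketched here under assumptions, is to split the raw features first and fit the scaler and PCA together on the training rows only. `X_raw` is synthetic stand-in data with the same shape as the feature matrix; with the real data it would be `data[features].values`. The explained variance ratios from this variant would generally differ slightly from the ones printed above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled 167 x 9 feature matrix
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(167, 9))

# Split the raw features first, then fit every step on the training rows only
X_train_raw, X_test_raw = train_test_split(X_raw, test_size=0.2, random_state=42)

pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
train_scores = pipeline.fit_transform(X_train_raw)  # fits scaler + PCA on train
test_scores = pipeline.transform(X_test_raw)        # reuses the train statistics

ratios = pipeline.named_steps["pca"].explained_variance_ratio_
print(train_scores.shape, test_scores.shape, ratios)
```

Wrapping both steps in one pipeline keeps the scaler and the PCA in sync, so the test rows are always transformed with parameters learned from the training rows.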
Now, let's connect back to the original features to see how they contribute to these principal components. This will provide insights into which socio-economic and health indicators are most significant in the dataset, guiding where the hypothetical $100 million could be most effectively spent. We'll visualize the loadings of the original features on these principal components.
# Plotting the PCA loadings for the original features
plt.figure(figsize=(12, 8))
# For each feature
for i, feature in enumerate(features):
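    # Arrow from the origin to the feature's (PC1, PC2) loading
    plt.arrow(0, 0, pca.components_[0, i], pca.components_[1, i], color='r', alpha=0.5)
    # Label placed slightly beyond the arrow tip
    plt.text(pca.components_[0, i] * 1.2, pca.components_[1, i] * 1.2, feature, color='g', ha='center', va='center')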
# Setting limits, labels and title
plt.xlim(-1, 1)
plt.ylim(-1, 1)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Loadings Plot')
plt.grid()
# Adding unit circle
circle = plt.Circle((0, 0), 1, color='blue', fill=False)
plt.gca().add_artist(circle)
plt.show()
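The PCA Loadings Plot visualizes how each original feature influences the two principal components. The arrows represent the loadings of each feature. Features whose arrows point in the same direction tend to be positively correlated, features pointing in opposite directions tend to be negatively correlated, and arrows at roughly right angles suggest little correlation. Because the two components capture only about 62% of the variance, these angles are an approximation of the true correlations.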
Key observations from the plot:
Child Mortality (child_mort) and Total Fertility (total_fer) are strongly negatively correlated with Life Expectancy (life_expec), Income, and GDP per Capita (gdpp). This is consistent with better health and economic conditions going hand in hand with lower child mortality and fertility rates, although PCA on its own does not establish the direction of cause.
Exports and Imports point in a similar direction, indicating that the two trade measures are positively correlated with each other and, to a lesser degree, with the economic indicators along PC1.
The length of the arrows indicates the strength of the feature's influence on the principal components. Features with longer arrows have a greater impact.
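The plot above draws the raw rows of `pca.components_`, which are unit-length directions. One common convention, sketched below as an assumption rather than what the notebook does, scales each component by the square root of its eigenvalue so that every loading equals the correlation between a feature and a component. Under that convention the squared arrow length is the share of the feature's variance captured by the plotted components, which is why such arrows never leave the unit circle. `correlation_loadings` is a hypothetical helper and `X_demo` is synthetic data.

```python
import numpy as np


def correlation_loadings(X, n_components=2):
    """Return (loadings, scores) with loadings[j, k] = corr(feature j, PC k).
    Hypothetical helper based on the eigendecomposition of the correlation matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    R = np.corrcoef(Z, rowvar=False)           # feature correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)      # scale each direction by sqrt(eigenvalue)
    scores = Z @ eigvecs                       # component scores per row
    return loadings, scores


# Synthetic stand-in: 9 features driven by 2 latent factors plus noise
rng = np.random.default_rng(1)
latent = rng.normal(size=(167, 2))
X_demo = latent @ rng.normal(size=(2, 9)) + rng.normal(size=(167, 9))

loadings, scores = correlation_loadings(X_demo)
arrow_length_sq = (loadings ** 2).sum(axis=1)  # variance share captured per feature
print(np.round(loadings, 3))
print(np.round(arrow_length_sq, 3))
```

A feature with a short arrow under this scaling is poorly represented in the PC1-PC2 plane, so its direction in the plot should be read with more caution.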
Given this analysis, strategic investments should focus on improving life expectancy, income, and GDP per capita, as these indicators load most strongly on PC1, the component that tracks overall development. This could include healthcare improvements, education, and economic development projects. Fostering trade and economic openness could also be considered, although exports and imports load mainly on PC2 and are only modestly tied to the development indicators (PC1 loadings of 0.28 and 0.16 in the full-data fit).