Stat 4770/7770 Final Presentation¶

  • Ritvik Ellendula
  • A Data Analysis Exploration into Environmental & Socioeconomic Factors on Poor Health Outcomes

Introduction¶

  • This project analyzes how strongly different environmental and socioeconomic factors are associated with poor health at the county level
  • To do this, the project uses the Python libraries pandas, NumPy, seaborn, Matplotlib, and scikit-learn to graph the data and build models that gauge each variable's relationship with health

Importing Libraries¶

Summaries¶

In [159]:
# describe() returns summary statistics for every numeric column
print("General Summary Statistics on Data")
print(county_data.describe())
General Summary Statistics on Data
       Five-digit.FIPS.Code  State.FIPS.Code  County.FIPS.Code  Poor.Health  \
count           2715.000000      2715.000000       2715.000000  2715.000000   
mean           30479.473665        30.376059        103.414733     0.174385   
std            15292.329045        15.272748        108.697370     0.045176   
min             1001.000000         1.000000          1.000000     0.082900   
25%            19016.000000        19.000000         35.000000     0.140350   
50%            29163.000000        29.000000         79.000000     0.167300   
75%            46025.000000        46.000000        133.000000     0.203950   
max            56045.000000        56.000000        840.000000     0.407300   

         Uninsured  Primary.Care.Physicians.Per.1000  \
count  2715.000000                       2715.000000   
mean      0.107790                          0.557937   
std       0.047841                          0.329909   
min       0.020700                          0.000000   
25%       0.069200                          0.337500   
50%       0.100700                          0.500000   
75%       0.134650                          0.722000   
max       0.308800                          4.470000   

       Mental.health.providers.Per.1000  Adult.Obesity  Proportion.of.Smokers  \
count                       2715.000000    2715.000000            2715.000000   
mean                           1.486268       0.320086               0.178243   
std                            1.499236       0.045581               0.034748   
min                            0.042000       0.138000               0.067354   
25%                            0.474500       0.294000               0.152513   
50%                            1.036000       0.323000               0.173543   
75%                            1.992000       0.350000               0.202849   
max                           15.579000       0.495000               0.331744   

       High.School.Graduation  ...  Physical.Inactivity  Excessive.Drinking  \
count             2715.000000  ...          2715.000000         2715.000000   
mean                 0.882899  ...             0.255718            0.174894   
std                  0.070744  ...             0.051730            0.031926   
min                  0.363000  ...             0.084000            0.092652   
25%                  0.846900  ...             0.221000            0.152109   
50%                  0.894500  ...             0.256000            0.174272   
75%                  0.933600  ...             0.291000            0.196637   
max                  1.000000  ...             0.451000            0.294401   

       Median.Household.Income  Severe.Housing.Problems  Unemployment  \
count              2715.000000              2715.000000   2715.000000   
mean              51606.557274                 0.143692      0.046099   
std               13650.706994                 0.040969      0.015427   
min               25569.000000                 0.030400      0.016200   
25%               42680.000000                 0.115950      0.035800   
50%               49350.000000                 0.139900      0.043800   
75%               57274.500000                 0.164750      0.053350   
max              136191.000000                 0.394100      0.190700   

       Percent.Rural      Over.65  Percent.Females  Life.Expectancy  \
count    2715.000000  2715.000000      2715.000000      2715.000000   
mean        0.543305     0.185704         0.500014        77.464751   
std         0.304532     0.044622         0.021505         2.832091   
min         0.000000     0.048000         0.266000        67.068555   
25%         0.302300     0.157000         0.496000        75.596650   
50%         0.545000     0.182000         0.504000        77.544959   
75%         0.777600     0.210000         0.511000        79.232825   
max         1.000000     0.569000         0.570000        97.965235   

         Population  
count  2.715000e+03  
mean   1.181631e+05  
std    3.560418e+05  
min    1.718000e+03  
25%    1.452450e+04  
50%    3.159400e+04  
75%    8.273800e+04  
max    1.016351e+07  

[8 rows x 21 columns]

Mean/Median Poor Health Proportions Across State¶

In [160]:
state_mean=county_data.groupby("State.Abbreviation")["Poor.Health"].mean()
state_median=county_data.groupby("State.Abbreviation")["Poor.Health"].median()
# Sorting ascending puts the smallest Poor.Health values (healthiest states) first
print("Lowest Mean and Median (healthiest states):")
print(state_mean.sort_values(ascending=True).head(5))
print(state_median.sort_values(ascending=True).head(5))
print("Highest Mean and Median (least healthy states):")
print(state_mean.sort_values(ascending=True).tail(5))
print(state_median.sort_values(ascending=True).tail(5))
Lowest Mean and Median (healthiest states):
State.Abbreviation
CT    0.113317
VT    0.116242
SD    0.119667
RI    0.121780
MN    0.123187
Name: Poor.Health, dtype: float64
State.Abbreviation
SD    0.11215
RI    0.11260
VT    0.11495
CT    0.11600
MN    0.12165
Name: Poor.Health, dtype: float64
Highest Mean and Median (least healthy states):
State.Abbreviation
KY    0.221119
MS    0.223572
WV    0.225047
AL    0.229694
AR    0.231626
Name: Poor.Health, dtype: float64
State.Abbreviation
KY    0.22040
WV    0.22070
AL    0.22090
MS    0.22195
AR    0.22510
Name: Poor.Health, dtype: float64
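The same comparison can be sketched in one pass with `groupby(...).agg` plus `nsmallest`/`nlargest`, which makes the direction of each ranking explicit. The small `toy` frame below is invented illustrative data, not the county dataset; only the column names come from the analysis.

```python
import pandas as pd

# Hypothetical toy data standing in for county_data (values are invented).
toy = pd.DataFrame({
    "State.Abbreviation": ["CT", "CT", "AR", "AR", "MN", "MN"],
    "Poor.Health": [0.10, 0.12, 0.22, 0.24, 0.12, 0.14],
})

# One groupby produces both statistics per state.
state_stats = toy.groupby("State.Abbreviation")["Poor.Health"].agg(["mean", "median"])

# nsmallest = lowest share of poor health (healthiest states);
# nlargest = highest share of poor health (least healthy states).
healthiest = state_stats["mean"].nsmallest(2)
least_healthy = state_stats["mean"].nlargest(2)

print(healthiest)
print(least_healthy)
```

Naming the result by what it contains (`healthiest`, `least_healthy`) avoids the ambiguity of "highest" and "lowest" when a higher value means a worse outcome.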

Graphical Data on Poor Health Averages Across States¶

In [161]:
statesgroupsascend = county_data.groupby('State.Abbreviation')['Poor.Health'].mean().sort_values(ascending = True)
df_tmp = pd.DataFrame(statesgroupsascend)
df_tmp['State'] = df_tmp.index
sns.catplot(data=df_tmp, y="Poor.Health", x = 'State', kind="bar", height=4, aspect=3)
plt.xticks(rotation=45);
plt.show()
[Figure: bar chart of mean Poor.Health by state, sorted in ascending order]

Important Considerations¶

  • The associations and relationships found within this analysis are not necessarily causal
  • The data dictionary groups the variables into five categories: 1) Outcome, 2) Access/Quality of Care, 3) Social Determinants of Health (Behavioral), 4) Social Determinants of Health (Environmental), and 5) Demographic Characteristics

Key Questions of Interest¶

  1. Is a higher number of physicians in the area associated with fewer poor health outcomes?
  2. Is a higher rate of unemployment related to poor health outcomes?
  3. Is a higher percent rural related to poor health outcomes?
  4. Is excessive drinking related to poor health outcomes?

Y-Variable Rationale¶

  • According to the data dictionary, there are two "outcome variables": Poor Health and Election Results

This analysis uses Poor Health as the outcome because numerous environmental and socioeconomic factors are strongly associated with it, which gives many candidate "predictor variables" to consider. Poor Health, in this context, refers to the "portion of county that has a poor health status".

Possible Predictor Variable Correlations¶

In [162]:
y = 'Poor.Health'
selected_vars = ['Primary.Care.Physicians.Per.1000', 'Unemployment', 
          'Percent.Rural', 'Excessive.Drinking']
broad_vars = ['Primary.Care.Physicians.Per.1000', 'Mental.health.providers.Per.1000', 'Adult.Obesity', 'Proportion.of.Smokers','High.School.Graduation',
              'Physical.Inactivity', 'Excessive.Drinking', 'Median.Household.Income', 'Severe.Housing.Problems', 'Unemployment', 'Percent.Rural', 'Over.65', 'Percent.Females', 'Life.Expectancy']
print("Correlation of Poor Health with majority of predictors:\n")
print(county_data[[y] + broad_vars].corr()[y].sort_values(ascending=False))
Correlation of Poor Health with majority of predictors:

Poor.Health                         1.000000
Proportion.of.Smokers               0.723864
Physical.Inactivity                 0.603145
Unemployment                        0.544357
Adult.Obesity                       0.456427
Severe.Housing.Problems             0.280768
Percent.Rural                       0.124330
Percent.Females                    -0.008175
Mental.health.providers.Per.1000   -0.098010
Over.65                            -0.104271
High.School.Graduation             -0.110337
Primary.Care.Physicians.Per.1000   -0.306522
Life.Expectancy                    -0.644491
Excessive.Drinking                 -0.660582
Median.Household.Income            -0.680146
Name: Poor.Health, dtype: float64

Selected Predictor Variable Correlations¶

In [163]:
print("\nCorrelation of Poor.Health with selected predictors:\n")
print(county_data[[y] + selected_vars].corr()[y].sort_values(ascending=False))
Correlation of Poor.Health with selected predictors:

Poor.Health                         1.000000
Unemployment                        0.544357
Percent.Rural                       0.124330
Primary.Care.Physicians.Per.1000   -0.306522
Excessive.Drinking                 -0.660582
Name: Poor.Health, dtype: float64

Review y-variable's association with potential predictors¶

  • As the data dictionary shows, the predictor variables span four core areas, so this analysis looks at a subset of columns that each represent a different kind of health determinant.
  • Specifically, Primary Care Physicians Per 1000 = access to healthcare, Unemployment = health-environmental, Percent Rural = demographics, and Excessive Drinking = health-behavioral

The predictors used to assess "poor health" are therefore the number of primary care physicians, the unemployment rate, the percent rural, and the rate of excessive drinking.

Looking at the correlations: Unemployment has a moderate positive correlation with poor health (0.54), Percent Rural has a weak positive correlation (0.12), Primary Care Physicians has a modest negative correlation (-0.31), meaning counties with more physicians tend to report less poor health, and Excessive Drinking has a strong negative correlation (-0.66).

Association Graphics¶

  • Four simple linear fits, one for each predictor variable's relationship with Poor Health
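As a rough sketch of what one linear fit per predictor looks like numerically, the example below fits a separate single-variable `LinearRegression` for each column and records its slope and R^2. The `demo` frame and its coefficients are invented for illustration and only two of the four predictors are simulated; this is not the code or data behind the original graphics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for county_data; the coefficients below are invented.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "Unemployment": rng.uniform(0.02, 0.12, n),
    "Excessive.Drinking": rng.uniform(0.09, 0.29, n),
})
demo["Poor.Health"] = (
    0.20
    + 1.0 * demo["Unemployment"]
    - 0.5 * demo["Excessive.Drinking"]
    + rng.normal(0, 0.01, n)
)

# Fit one single-predictor linear model per variable; keep slope and R^2.
results = {}
for col in ["Unemployment", "Excessive.Drinking"]:
    model = LinearRegression().fit(demo[[col]], demo["Poor.Health"])
    results[col] = {
        "slope": model.coef_[0],
        "r2": model.score(demo[[col]], demo["Poor.Health"]),
    }

print(pd.DataFrame(results).T)
```

The sign of each slope gives the direction of the trend line, and for a single predictor R^2 equals the squared Pearson correlation.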

Relationship between predictor variables¶

  • Each predictor is related to poor health, so it is also worth visualizing how the predictors relate to one another
  • One such case is the relationship between excessive drinking and unemployment
In [165]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=county_data, x='Excessive.Drinking', y='Unemployment', hue='State.Abbreviation',
    alpha=0.4, s=20, palette='rocket'
)
plt.title('Relationship Between Excessive Drinking and Unemployment by State')
plt.xlabel('Excessive Drinking (Proportion)')
plt.ylabel('Unemployment Rate (Proportion)')
plt.legend(bbox_to_anchor=(1.2, 1), loc='upper left', ncol=3)
plt.show()
[Figure: scatterplot of Excessive Drinking (x) vs. Unemployment Rate (y), colored by state]

Analyzing Relationship between Predictor Variables Including Region Mapping¶

  • The plot gets cluttered with so many states, so we assign each state to a region such as the Midwest or Northeast and view the relationship by region instead
In [166]:
#next, we can add a column to the dataset with regions 
region_map = {
    'ME':'Northeast','NH':'Northeast','VT':'Northeast','MA':'Northeast','RI':'Northeast','CT':'Northeast','NY':'Northeast','NJ':'Northeast','PA':'Northeast',
    'OH':'Midwest','MI':'Midwest','IN':'Midwest','IL':'Midwest','WI':'Midwest','MN':'Midwest','IA':'Midwest','MO':'Midwest','ND':'Midwest','SD':'Midwest','NE':'Midwest','KS':'Midwest',
    'DE':'South','MD':'South','DC':'South','VA':'South','WV':'South','KY':'South','NC':'South','SC':'South','GA':'South','FL':'South','AL':'South','TN':'South','MS':'South','AR':'South','LA':'South','OK':'South','TX':'South',
    'MT':'West','ID':'West','WY':'West','CO':'West','NM':'West','AZ':'West','UT':'West','NV':'West','CA':'West','OR':'West','WA':'West','AK':'West','HI':'West'
}

county_data['Region'] = county_data['State.Abbreviation'].map(region_map)
#After all this grueling identification, we now have a cohesive column added to the original data that assigns the region map to the same column as the state it's associated with
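Because `Series.map` silently returns `NaN` for any key missing from the dictionary, it can be worth checking for unmapped states after adding the column. A minimal sketch, using an invented miniature frame and an abbreviated mapping (`toy` and `toy_region_map` are hypothetical names):

```python
import pandas as pd

# Hypothetical miniature frame; 'PR' is included on purpose because it is
# not a key in the abbreviated mapping below.
toy = pd.DataFrame({"State.Abbreviation": ["VT", "OH", "TX", "CA", "PR"]})
toy_region_map = {"VT": "Northeast", "OH": "Midwest", "TX": "South", "CA": "West"}

toy["Region"] = toy["State.Abbreviation"].map(toy_region_map)

# Series.map returns NaN for keys missing from the dictionary.
unmapped = toy.loc[toy["Region"].isna(), "State.Abbreviation"].unique().tolist()
print("States without a region:", unmapped)
```

An empty list means every row received a region; otherwise the listed abbreviations would be dropped from any plot that uses `hue='Region'`.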

Region Mapping Graph¶

In [167]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=county_data,x='Excessive.Drinking',y='Unemployment',
                hue='Region',alpha=0.4,s=20,palette='rocket'
)
plt.title('Relationship Between Excessive Drinking and Unemployment by Region')
plt.xlabel('Excessive Drinking (Proportion)')
plt.ylabel('Unemployment Rate (Proportion)')
plt.legend(bbox_to_anchor=(1.2, 1), loc='upper left', ncol=2)
plt.show()
[Figure: scatterplot of Excessive Drinking (x) vs. Unemployment Rate (y), colored by region]

Heat Mapping Graph¶

  • This heatmap depicts the pairwise correlations among the predictor variables, with more intense colors indicating stronger correlations
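A heatmap like the one described can be sketched as follows. The `demo` frame is random, invented data that borrows only the four selected column names, and the palette, title, and figure size are assumptions rather than the settings used for the original figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the four selected predictors (values are invented).
rng = np.random.default_rng(1)
demo = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=["Primary.Care.Physicians.Per.1000", "Unemployment",
             "Percent.Rural", "Excessive.Drinking"],
)

corr = demo.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots(figsize=(6, 5))
# A diverging palette centered at 0 separates positive from negative values.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, center=0, ax=ax)
ax.set_title("Correlation Between Predictor Variables")
fig.tight_layout()
```

Inside a notebook the `matplotlib.use("Agg")` line can be dropped so the figure renders inline.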

Review of distribution¶

  • The correlations show that better access to physicians is associated with better health outcomes, and that more excessive drinking is associated with better health. The latter may seem counterintuitive, but a strong drinking culture may accompany higher wealth, which in turn relates to better access to healthcare, healthy food, etc.
  • The weak positive association for percent rural may mean that rural areas have somewhat limited access to healthcare, which would be consistent with worse health outcomes. Furthermore, higher unemployment rates may signify less access to necessities, limiting the resources available to improve health.

Additionally, the scatterplot and the heatmap show that some predictors are noticeably related to one another, which is important to consider in future work.

The association graphs show a weak positive trend between percent rural and poor health, and a clearer, moderate positive trend between unemployment and poor health.

The graphs for poor health versus primary care physician count and excessive drinking depict a negative trend between those predictors and the outcome variable.

Creating a Predictive Model for Poor Health¶

  • Next, we build a predictive model: a decision tree regressor
  • This will help quantify the relationship between our selected predictor variables and poor health

Decision Tree Explanation¶

  • Decision trees work by breaking a complex decision down into a series of simpler, smaller choices
  • The process begins at a root node, which splits the data on one predictor; each resulting decision node splits again based on its own threshold
  • This splitting repeats, creating sub-nodes, until a stopping rule is reached; the final nodes (leaves) each hold a predicted value
  • Cost-complexity pruning, a useful technique not used in this analysis, reduces a large decision tree to a simpler one by repeatedly removing the "weakest" subtree, the one whose removal costs the least accuracy relative to the complexity it saves
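Although pruning is not used in this analysis, the idea can be sketched with scikit-learn's `cost_complexity_pruning_path` and the `ccp_alpha` parameter. The data below are synthetic and invented, and choosing alpha by 5-fold cross-validation is one reasonable option rather than the only one.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (invented), standing in for X_train / y_train.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 1, size=(300, 2))
y_demo = np.where(X_demo[:, 0] > 0.5, 0.22, 0.14) + rng.normal(0, 0.02, 300)

# Each alpha on the path corresponds to one step of pruning the full tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)
# Take every 10th alpha to keep the search small; clip guards against
# tiny negative values caused by floating-point round-off.
alphas = np.clip(path.ccp_alphas[::10], 0, None)

# Score each alpha with 5-fold cross-validation (default scoring is R^2).
cv_scores = [
    cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a),
                    X_demo, y_demo, cv=5).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]

full = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X_demo, y_demo)
print(f"leaves: full={full.get_n_leaves()}, pruned={pruned.get_n_leaves()}")
```

Larger values of `ccp_alpha` penalize complexity more heavily and therefore yield smaller trees; `ccp_alpha=0` leaves the tree unpruned.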

Decision Tree Model¶

In [169]:
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

#redefining variables quickly
y = county_data['Poor.Health']
X = county_data[['Unemployment', 'Excessive.Drinking',
                 'Percent.Rural', 'Primary.Care.Physicians.Per.1000']]
#train test split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1940)

#create initial tree
regtree = DecisionTreeRegressor(random_state=1940, max_leaf_nodes=8)
print("The specific decision tree specifications used are: ")
print(regtree.fit(X_train, y_train))
The specific decision tree specifications used are: 
DecisionTreeRegressor(max_leaf_nodes=8, random_state=1940)

Tree Visualization¶

  • The tree shows the path the predictive model takes, including the specific threshold used at each split
  • max_leaf_nodes=8, so the tree stops growing once it has 8 leaves (at most 7 splits)
In [170]:
#model tree
fig = plt.figure(num=None, figsize=(12, 8), dpi=80, facecolor='w', edgecolor='k')
plot_tree(regtree, filled=True, feature_names=list(X.columns))
plt.title("Simple Decision Tree for Predicting Poor Health")
plt.show()
[Figure: Simple Decision Tree for Predicting Poor Health]
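The size of a fitted tree can also be checked directly instead of being read off the plot. A minimal sketch on invented data (`X_demo`, `y_demo`, and `tree` are hypothetical names, not the notebook's `regtree`):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (invented) used only to grow a small tree.
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 1, size=(400, 4))
y_demo = 0.1 + 0.2 * X_demo[:, 0] - 0.1 * X_demo[:, 1] + rng.normal(0, 0.01, 400)

tree = DecisionTreeRegressor(random_state=0, max_leaf_nodes=8).fit(X_demo, y_demo)

n_leaves = tree.get_n_leaves()      # terminal nodes that hold predictions
n_nodes = tree.tree_.node_count     # all nodes, internal and terminal
n_splits = n_nodes - n_leaves       # internal (decision) nodes

print(f"leaves={n_leaves}, splits={n_splits}, depth={tree.get_depth()}")
```

In a binary tree the number of splits is always one less than the number of leaves, so capping the leaves at 8 caps the splits at 7.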

Checking Accuracy¶

In [171]:
#get accuracy checkers 
y_pred = regtree.predict(X_test)
r2 = regtree.score(X_test, y_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Next, to assess how strong our tree is, we calculate two values: the coefficient of determination (R^2) and the Root Mean Squared Error (RMSE)\n")
print(f"R^2 (share of variance explained): {r2:.3f}") #.3f rounds to three decimal places
print(f"Root Mean Squared Error (RMSE): {rmse:.3f}") #.3f rounds to three decimal places
Next, to assess how strong our tree is, we calculate two values: the coefficient of determination (R^2) and the Root Mean Squared Error (RMSE)

R^2 (share of variance explained): 0.554
Root Mean Squared Error (RMSE): 0.031

Model Accuracy Explanation¶

  • In this scenario, an R^2 of 0.554 means that the model explains 55.4% of the variation in poor health outcomes on the held-out test set, so the tree does a moderate job of capturing the variation in the data.
  • The Root Mean Squared Error measures the typical size of the gap between predicted and observed values, in the same units as the outcome. Since Poor Health is a proportion, an RMSE of 0.031 means predictions are typically off by about 3.1 percentage points, compared with an overall standard deviation in Poor Health of about 0.045
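To make the two metrics concrete, the sketch below computes R^2 and RMSE by hand on four invented values and checks them against scikit-learn. The numbers are toy values on the same 0 to 1 proportion scale as Poor Health, not results from the model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented toy values on the same 0-1 proportion scale as Poor.Health.
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_hat = np.array([0.12, 0.18, 0.33, 0.37])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot
rmse_manual = np.sqrt(ss_res / len(y_true))

# The manual values agree with scikit-learn's implementations.
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))

print(f"R^2 = {r2_manual:.3f}, RMSE = {rmse_manual:.4f}")  # R^2 = 0.948, RMSE = 0.0255
```

R^2 is unitless and compares the model against always predicting the mean, while RMSE stays in the units of the outcome, which is why the two are reported together.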

Respective Variables' Importance¶

  • Next, we'll rank each variable and their respective importance and contributions to our model.
In [172]:
importances = pd.Series(regtree.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Variable Importance (in proportions, from scale of 0 to 1) used by Decision Trees:")
print(importances)
print("\nThese values are indicative that Excessive Drinking and Unemployment Rate \nserved as the major contributors to the decision tree's modeling process")
Variable Importance (in proportions, from scale of 0 to 1) used by Decision Trees:
Excessive.Drinking                  0.716903
Unemployment                        0.283097
Percent.Rural                       0.000000
Primary.Care.Physicians.Per.1000    0.000000
dtype: float64

These values are indicative that Excessive Drinking and Unemployment Rate 
served as the major contributors to the decision tree's modeling process

Interpretations in Context to Original Questions¶

  • The decision tree model indicates that the variables that matter most for predicting "poor health" are excessive drinking and unemployment. With respect to the original questions, this suggests that percent rural and the number of primary care physicians add little predictive value beyond those two variables. Because the tree is limited to 8 leaves, an importance of zero means a variable was never chosen for a split, not that it has no relationship with poor health.
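Impurity-based importances describe only the splits that one fitted tree happened to make. The sketch below, on invented data with hypothetical column names `x_strong` and `x_weak`, illustrates how a predictor with a real but modest effect can receive far less importance in an 8-leaf tree than its correlation with the outcome might suggest.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Invented data: x_weak has a genuine but small effect on the outcome.
rng = np.random.default_rng(4)
n = 3000
X_demo = pd.DataFrame({
    "x_strong": rng.uniform(0, 1, n),
    "x_weak": rng.uniform(0, 1, n),
})
y_demo = 0.2 * X_demo["x_strong"] + 0.02 * X_demo["x_weak"] + rng.normal(0, 0.01, n)

small_tree = DecisionTreeRegressor(random_state=0, max_leaf_nodes=8).fit(X_demo, y_demo)

# Importance from the tree's splits (sums to 1 across features).
impurity_imp = pd.Series(small_tree.feature_importances_, index=X_demo.columns)
# Marginal Pearson correlation of each feature with the outcome.
marginal_corr = X_demo.corrwith(y_demo)

print(pd.DataFrame({"importance": impurity_imp, "correlation": marginal_corr}))
```

Comparing the two columns is a quick way to see whether a low importance reflects a weak relationship or simply a tree too small to spend a split on that variable.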

Conclusions Pt. 1¶

  1. Q1: Is a higher number of physicians in the area associated with fewer poor health outcomes?
  • A higher number of physicians is associated with better health outcomes (correlation of -0.31), but the strength of this relationship appears modest, as the decision tree never split on the number of physicians.
  2. Q2: Is a higher rate of unemployment related to poor health outcomes?
  • A higher rate of unemployment is associated with poorer health outcomes (correlation of 0.54), and unemployment was the second most important variable in our model.

Conclusions Pt. 2¶

  3. Q3: Is a higher percent rural related to poor health outcomes?
  • A higher percent rural appears to have little relationship with poor health outcomes, as evidenced by a weak correlation (0.12) and zero importance in the decision tree model.
  4. Q4: Is excessive drinking related to poor health outcomes?
  • Excessive drinking has a strong negative relationship with poor health outcomes (correlation of -0.66) and was the most important variable in the tree, indicating that counties with higher rates of excessive drinking tend to report less poor health. This is an association, not evidence that drinking improves health.

Suggestions for further research¶

  • The main surprise in this analysis is the negative relationship between excessive drinking and poor health outcomes
  • Future analysis may look into the relationship between excessive drinking and median household income to see if there are socioeconomic-related interaction effects between these variables
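One way such a follow-up could start is with a partial correlation, which asks whether drinking and poor health are still related once income is held fixed. The sketch below uses simulated data in which income drives both variables; `partial_corr` is a hypothetical helper written for this example, and nothing here is a result from the county dataset.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly removing z from both."""
    x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
    y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(x_resid, y_resid)[0, 1]

# Invented data: income raises drinking and lowers poor health;
# drinking has no direct effect on poor health in this simulation.
rng = np.random.default_rng(5)
n = 2000
income = rng.normal(0, 1, n)  # standardized for simplicity
drinking = 0.7 * income + rng.normal(0, 0.7, n)
poor_health = -0.7 * income + rng.normal(0, 0.7, n)

raw_r = np.corrcoef(drinking, poor_health)[0, 1]
partial_r = partial_corr(drinking, poor_health, income)
print(f"raw r = {raw_r:.2f}, partial r given income = {partial_r:.2f}")
```

If the real data behaved like this simulation, the raw negative correlation would shrink toward zero after adjusting for income, which would point to income as a confounder rather than a direct effect of drinking.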