# Summary statistics for every numeric column
print("General Summary Statistics on Data")
print(county_data.describe())

General Summary Statistics on Data
       Five-digit.FIPS.Code  State.FIPS.Code  County.FIPS.Code  Poor.Health  \
count           2715.000000      2715.000000       2715.000000  2715.000000   
mean           30479.473665        30.376059        103.414733     0.174385   
std            15292.329045        15.272748        108.697370     0.045176   
min             1001.000000         1.000000          1.000000     0.082900   
25%            19016.000000        19.000000         35.000000     0.140350   
50%            29163.000000        29.000000         79.000000     0.167300   
75%            46025.000000        46.000000        133.000000     0.203950   
max            56045.000000        56.000000        840.000000     0.407300   

         Uninsured  Primary.Care.Physicians.Per.1000  \
count  2715.000000                       2715.000000   
mean      0.107790                          0.557937   
std       0.047841                          0.329909   
min       0.020700                          0.000000   
25%       0.069200                          0.337500   
50%       0.100700                          0.500000   
75%       0.134650                          0.722000   
max       0.308800                          4.470000   

       Mental.health.providers.Per.1000  Adult.Obesity  Proportion.of.Smokers  \
count                       2715.000000    2715.000000            2715.000000   
mean                           1.486268       0.320086               0.178243   
std                            1.499236       0.045581               0.034748   
min                            0.042000       0.138000               0.067354   
25%                            0.474500       0.294000               0.152513   
50%                            1.036000       0.323000               0.173543   
75%                            1.992000       0.350000               0.202849   
max                           15.579000       0.495000               0.331744   

       High.School.Graduation  ...  Physical.Inactivity  Excessive.Drinking  \
count             2715.000000  ...          2715.000000         2715.000000   
mean                 0.882899  ...             0.255718            0.174894   
std                  0.070744  ...             0.051730            0.031926   
min                  0.363000  ...             0.084000            0.092652   
25%                  0.846900  ...             0.221000            0.152109   
50%                  0.894500  ...             0.256000            0.174272   
75%                  0.933600  ...             0.291000            0.196637   
max                  1.000000  ...             0.451000            0.294401   

       Median.Household.Income  Severe.Housing.Problems  Unemployment  \
count              2715.000000              2715.000000   2715.000000   
mean              51606.557274                 0.143692      0.046099   
std               13650.706994                 0.040969      0.015427   
min               25569.000000                 0.030400      0.016200   
25%               42680.000000                 0.115950      0.035800   
50%               49350.000000                 0.139900      0.043800   
75%               57274.500000                 0.164750      0.053350   
max              136191.000000                 0.394100      0.190700   

       Percent.Rural      Over.65  Percent.Females  Life.Expectancy  \
count    2715.000000  2715.000000      2715.000000      2715.000000   
mean        0.543305     0.185704         0.500014        77.464751   
std         0.304532     0.044622         0.021505         2.832091   
min         0.000000     0.048000         0.266000        67.068555   
25%         0.302300     0.157000         0.496000        75.596650   
50%         0.545000     0.182000         0.504000        77.544959   
75%         0.777600     0.210000         0.511000        79.232825   
max         1.000000     0.569000         0.570000        97.965235   

         Population  
count  2.715000e+03  
mean   1.181631e+05  
std    3.560418e+05  
min    1.718000e+03  
25%    1.452450e+04  
50%    3.159400e+04  
75%    8.273800e+04  
max    1.016351e+07  

[8 rows x 21 columns]

# Mean and median Poor.Health proportion per state
state_mean = county_data.groupby("State.Abbreviation")["Poor.Health"].mean()
state_median = county_data.groupby("State.Abbreviation")["Poor.Health"].median()
# With an ascending sort, head() returns the lowest values and tail() the highest
print("Lowest Mean and Median:")
print(state_mean.sort_values(ascending=True).head(5))
print(state_median.sort_values(ascending=True).head(5))
print("Highest Mean and Median:")
print(state_mean.sort_values(ascending=True).tail(5))
print(state_median.sort_values(ascending=True).tail(5))

Lowest Mean and Median:
State.Abbreviation
CT    0.113317
VT    0.116242
SD    0.119667
RI    0.121780
MN    0.123187
Name: Poor.Health, dtype: float64
State.Abbreviation
SD    0.11215
RI    0.11260
VT    0.11495
CT    0.11600
MN    0.12165
Name: Poor.Health, dtype: float64
Highest Mean and Median:
State.Abbreviation
KY    0.221119
MS    0.223572
WV    0.225047
AL    0.229694
AR    0.231626
Name: Poor.Health, dtype: float64
State.Abbreviation
KY    0.22040
WV    0.22070
AL    0.22090
MS    0.22195
AR    0.22510
Name: Poor.Health, dtype: float64
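The per-state mean and median can also be computed in a single groupby pass, with `nsmallest` and `nlargest` making the sort direction explicit so the printed labels are harder to mix up. The sketch below is only an illustration: `toy` and its values are made up and stand in for `county_data`.

```python
import pandas as pd

# Hypothetical toy data standing in for county_data (values are invented).
toy = pd.DataFrame({
    "State.Abbreviation": ["AA", "AA", "BB", "BB", "BB", "CC"],
    "Poor.Health": [0.10, 0.14, 0.20, 0.22, 0.30, 0.16],
})

# One groupby pass yields both statistics, one column each.
state_stats = (
    toy.groupby("State.Abbreviation")["Poor.Health"]
       .agg(["mean", "median"])
       .sort_values("mean")
)

# nsmallest/nlargest state the direction directly instead of relying on head/tail.
lowest = state_stats.nsmallest(2, "mean")
highest = state_stats.nlargest(2, "mean")

print("Lowest mean Poor.Health:")
print(lowest)
print("Highest mean Poor.Health:")
print(highest)
```

On the real data the same pattern would use `county_data` in place of `toy` and 5 in place of 2.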

# Bar chart of state-level mean Poor.Health, sorted from lowest to highest
statesgroupsascend = county_data.groupby('State.Abbreviation')['Poor.Health'].mean().sort_values(ascending=True)
df_tmp = pd.DataFrame(statesgroupsascend)
df_tmp['State'] = df_tmp.index
sns.catplot(data=df_tmp, y="Poor.Health", x='State', kind="bar", height=4, aspect=3)
plt.xticks(rotation=45)
plt.show()

y = 'Poor.Health'
selected_vars = ['Primary.Care.Physicians.Per.1000', 'Unemployment', 
          'Percent.Rural', 'Excessive.Drinking']
broad_vars = ['Primary.Care.Physicians.Per.1000', 'Mental.health.providers.Per.1000', 'Adult.Obesity', 'Proportion.of.Smokers','High.School.Graduation',
              'Physical.Inactivity', 'Excessive.Drinking', 'Median.Household.Income', 'Severe.Housing.Problems', 'Unemployment', 'Percent.Rural', 'Over.65', 'Percent.Females', 'Life.Expectancy']
print("Correlation of Poor.Health with the broad set of predictors:\n")
print(county_data[[y] + broad_vars].corr()[y].sort_values(ascending=False))

Correlation of Poor.Health with the broad set of predictors:

Poor.Health                         1.000000
Proportion.of.Smokers               0.723864
Physical.Inactivity                 0.603145
Unemployment                        0.544357
Adult.Obesity                       0.456427
Severe.Housing.Problems             0.280768
Percent.Rural                       0.124330
Percent.Females                    -0.008175
Mental.health.providers.Per.1000   -0.098010
Over.65                            -0.104271
High.School.Graduation             -0.110337
Primary.Care.Physicians.Per.1000   -0.306522
Life.Expectancy                    -0.644491
Excessive.Drinking                 -0.660582
Median.Household.Income            -0.680146
Name: Poor.Health, dtype: float64

print("\nCorrelation of Poor.Health with selected predictors:\n")
print(county_data[[y] + selected_vars].corr()[y].sort_values(ascending=False))

Correlation of Poor.Health with selected predictors:

Poor.Health                         1.000000
Unemployment                        0.544357
Percent.Rural                       0.124330
Primary.Care.Physicians.Per.1000   -0.306522
Excessive.Drinking                 -0.660582
Name: Poor.Health, dtype: float64
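Sorting correlations by signed value puts the strongest negative associations at the bottom of the list, even though they may matter as much as the strongest positive ones. A minimal sketch of ranking by absolute correlation follows; the synthetic columns `target`, `x_pos`, `x_neg`, and `x_noise` are hypothetical and are not part of the county dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 500
x_pos = rng.normal(size=n)
x_neg = rng.normal(size=n)
x_noise = rng.normal(size=n)
toy = pd.DataFrame({
    "target": 0.8 * x_pos - 0.6 * x_neg + rng.normal(scale=0.5, size=n),
    "x_pos": x_pos,
    "x_neg": x_neg,
    "x_noise": x_noise,
})

# Pearson correlation of each predictor with the target.
corr = toy.drop(columns="target").corrwith(toy["target"])

# Reorder by strength (absolute value) while keeping the sign visible.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```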

plt.figure(figsize=(8,6))
sns.scatterplot(data=county_data, x='Excessive.Drinking', y='Unemployment', hue='State.Abbreviation',
    alpha=0.4, s=20, palette='rocket'
)
plt.title('Relationship Between Excessive Drinking and Unemployment by State')
plt.xlabel('Excessive Drinking (Proportion)')
plt.ylabel('Unemployment Rate (Proportion)')
plt.legend(bbox_to_anchor=(1.2, 1), loc='upper left', ncol=3)
plt.show()

#next, we can add a column to the dataset with regions 
region_map = {
    'ME':'Northeast','NH':'Northeast','VT':'Northeast','MA':'Northeast','RI':'Northeast','CT':'Northeast','NY':'Northeast','NJ':'Northeast','PA':'Northeast',
    'OH':'Midwest','MI':'Midwest','IN':'Midwest','IL':'Midwest','WI':'Midwest','MN':'Midwest','IA':'Midwest','MO':'Midwest','ND':'Midwest','SD':'Midwest','NE':'Midwest','KS':'Midwest',
    'DE':'South','MD':'South','DC':'South','VA':'South','WV':'South','KY':'South','NC':'South','SC':'South','GA':'South','FL':'South','AL':'South','TN':'South','MS':'South','AR':'South','LA':'South','OK':'South','TX':'South',
    'MT':'West','ID':'West','WY':'West','CO':'West','NM':'West','AZ':'West','UT':'West','NV':'West','CA':'West','OR':'West','WA':'West','AK':'West','HI':'West'
}

county_data['Region'] = county_data['State.Abbreviation'].map(region_map)
# Each county row now has a Region value looked up from its state abbreviation
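`Series.map` with a dictionary returns NaN for any key that is missing from the dictionary, so it is worth checking that every abbreviation was mapped. The sketch below uses a deliberately incomplete, hypothetical `toy_region_map` to show the check; on the real data the same lines would run against `county_data` and `region_map`.

```python
import pandas as pd

# Hypothetical toy frame; "PR" is intentionally absent from the map below.
toy = pd.DataFrame({"State.Abbreviation": ["VT", "TX", "PR", "CA"]})
toy_region_map = {"VT": "Northeast", "TX": "South", "CA": "West"}

toy["Region"] = toy["State.Abbreviation"].map(toy_region_map)

# Abbreviations with no entry in the map come back as NaN.
unmapped = toy.loc[toy["Region"].isna(), "State.Abbreviation"].unique()
print("Unmapped abbreviations:", list(unmapped))

# Row counts per region, including any unmapped rows.
print(toy["Region"].value_counts(dropna=False))
```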

plt.figure(figsize=(8,6))
sns.scatterplot(data=county_data,x='Excessive.Drinking',y='Unemployment',
                hue='Region',alpha=0.4,s=20,palette='rocket'
)
plt.title('Relationship Between Excessive Drinking and Unemployment by Region')
plt.xlabel('Excessive Drinking (Proportion)')
plt.ylabel('Unemployment Rate (Proportion)')
plt.legend(bbox_to_anchor=(1.2, 1), loc='upper left', ncol=2)
plt.show()

from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Redefine y (previously the column name) as the target Series, and X as the selected predictors
y = county_data['Poor.Health']
X = county_data[['Unemployment', 'Excessive.Drinking',
                 'Percent.Rural', 'Primary.Care.Physicians.Per.1000']]
#train test split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1940)

# Create and fit the initial tree, capped at 8 leaf nodes
regtree = DecisionTreeRegressor(random_state=1940, max_leaf_nodes=8)
print("The decision tree settings used are: ")
print(regtree.fit(X_train, y_train))

The decision tree settings used are: 
DecisionTreeRegressor(max_leaf_nodes=8, random_state=1940)
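The tree above fixes `max_leaf_nodes=8`. One common way to check whether a different size would generalize better is a cross-validated grid search. The sketch below is an illustration on synthetic data, not the procedure used in this notebook; the candidate grid and the arrays `X_toy` and `y_toy` are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (an assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(400, 2))
y_toy = np.sin(6 * X_toy[:, 0]) + 0.1 * rng.normal(size=400)

# Compare several tree sizes by 5-fold cross-validated R^2.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_leaf_nodes": [2, 4, 8, 16, 32]},
    scoring="r2",
    cv=5,
)
search.fit(X_toy, y_toy)

print("Best max_leaf_nodes:", search.best_params_["max_leaf_nodes"])
print(f"Mean cross-validated R^2: {search.best_score_:.3f}")
```

With the county data, the search would be fit on `X_train` and `y_train` only, so the test set stays untouched for the final evaluation.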

#model tree
fig = plt.figure(num=None, figsize=(12, 8), dpi=80, facecolor='w', edgecolor='k')
plot_tree(regtree, filled=True, feature_names=list(X.columns))
plt.title("Simple Decision Tree for Predicting Poor Health")
plt.show()

# Evaluate the tree on the held-out test set
y_pred = regtree.predict(X_test)
r2 = regtree.score(X_test, y_test)  # coefficient of determination (R^2)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Next, to assess how well the tree predicts, we calculate two values: R^2 (proportion of variance explained) and Root Mean Squared Error (RMSE)\n")
print(f"R^2 (proportion of variance explained): {r2:.3f}")  # .3f rounds to three decimal places
print(f"Root Mean Squared Error (RMSE): {rmse:.3f}")

Next, to assess how well the tree predicts, we calculate two values: R^2 (proportion of variance explained) and Root Mean Squared Error (RMSE)

R^2 (proportion of variance explained): 0.554
Root Mean Squared Error (RMSE): 0.031
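For a regressor, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot, and RMSE is the square root of the mean squared residual, expressed in the same units as the target. The sketch below computes both by hand on made-up numbers (`y_true` and `y_hat` are hypothetical) and compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values (invented for illustration).
y_true = np.array([0.10, 0.15, 0.20, 0.25, 0.30])
y_hat = np.array([0.12, 0.14, 0.22, 0.24, 0.27])

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(f"R^2 by hand: {r2_manual:.3f}, via r2_score: {r2_score(y_true, y_hat):.3f}")
print(f"RMSE by hand: {rmse_manual:.4f}, "
      f"via mean_squared_error: {np.sqrt(mean_squared_error(y_true, y_hat)):.4f}")
```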

importances = pd.Series(regtree.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Variable Importance (proportions on a scale of 0 to 1) in the Decision Tree:")
print(importances)
print("\nThese values indicate that Excessive Drinking and Unemployment Rate \nwere the only contributors to the decision tree's splits")

Variable Importance (proportions on a scale of 0 to 1) in the Decision Tree:
Excessive.Drinking                  0.716903
Unemployment                        0.283097
Percent.Rural                       0.000000
Primary.Care.Physicians.Per.1000    0.000000
dtype: float64

These values indicate that Excessive Drinking and Unemployment Rate 
were the only contributors to the decision tree's splits
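`feature_importances_` is computed from impurity reduction on the training data. A complementary check, not used in this notebook, is permutation importance on the held-out set: each column is shuffled in turn and the resulting drop in R^2 is recorded. The sketch below runs it on synthetic data; the column names `strong`, `weak`, and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 600
X_toy = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_toy = 2.0 * X_toy["strong"] + 0.5 * X_toy["weak"] + rng.normal(scale=0.3, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.4, random_state=0)
tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X_tr, y_tr)

# Shuffle each column 20 times on the test set and average the drop in score.
result = permutation_importance(tree, X_te, y_te, n_repeats=20, random_state=0)
perm = pd.Series(result.importances_mean, index=X_toy.columns).sort_values(ascending=False)
print(perm)
```

On the real model the call would take `regtree`, `X_test`, and `y_test`.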

Stat 4770/7770 Final Presentation

Introduction

Importing Libraries

Summaries

Mean/Median Poor Health Proportions Across States

Graphical Data on Poor Health Averages Across States

Important Considerations

Key Questions of Interest

Y-Variable Rationale

Possible Predictor Variable Correlations

Selected Predictor Variable Correlations

Review y-variable's association with potential predictors

Association Graphics

Relationship between predictor variables

Analyzing Relationship between Predictor Variables Including Region Mapping

Region Mapping Graph

Heat Mapping Graph

Review of distribution

Creating a Predictive Model for Poor Health

Decision Tree Explanation

Decision Tree Model

Tree Visualization

Checking Accuracy

Model Accuracy Explanation

Respective Variables' Importance

Interpretations in Context to Original Questions

Conclusions Pt. 1

Conclusions Pt. 2

Suggestions for further research