Data Analysis Tools, week 1: hypothesis testing and ANOVA

Welcome to the first week of the second course in Coursera’s Data Management and Visualization specialization. To use ANOVA and post hoc testing, I needed a different explanatory variable than the one I used in the previous course. This analysis examines whether a person’s gender and race are associated with their support of the death penalty as punishment for murder. I am still using the Outlook on Life (OOL) surveys, made available by ICPSR for Coursera students.

Hypotheses to be tested:

Null hypothesis: the death penalty is supported equally among all gender-racial groups.
Alternate hypothesis: support for the death penalty varies among gender-racial groups.

The categorical response variable (W2_QK3) records each respondent’s preferred punishment for murder. This variable has two categories: the death penalty and life in prison.

The new categorical explanatory variable (ETH_GEN) has four categories: white men, white women, men of color, and women of color. The gender choices in the dataset were extremely limited: male, female, and no response. Because it was not possible to know whether a “no response” was a refusal to answer or came from a transgender or nonbinary person, these individuals were omitted from the sample. For more comprehensive data analysis, the survey should have used “masculine” and “feminine” instead of “male” and “female,” and included options for transgender, nonbinary, and possibly other gender identities.

An analysis of variance (ANOVA) revealed that, in this sample, an individual’s gender and race (collapsed into four categories as the categorical explanatory variable) are significantly associated with a preference for the death penalty. Using an ordinary least squares (OLS) model, I obtained the following results: F(3, 1531) = 32.57, p = 2.10e-20. Tukey’s Honestly Significant Difference post hoc test was then conducted to determine which groups differed significantly from each other. There was no significant difference between white men and white women, so for that pair we fail to reject the null hypothesis. The other five pairs all differed significantly (white women and men of color, white men and men of color, men of color and women of color, white women and women of color, and white men and women of color), so for those pairs we reject the null hypothesis in favor of the alternate hypothesis.
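To make the F-statistic less of a black box, here is a minimal sketch of how a one-way ANOVA F value is computed by hand. The toy responses and the function name one_way_anova_f are made up for illustration and are not the OOL data. The degrees of freedom follow the same rule as in the reported result: 4 groups and 1535 observations give 4 - 1 = 3 and 1535 - 4 = 1531.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # between-group variation: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # within-group variation: how far each response sits from its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical toy responses coded like W2_QK3 (1 = death penalty, 2 = life in prison)
toy_groups = [
    [1, 1, 1, 2],  # group A
    [1, 1, 2, 2],  # group B
    [1, 2, 2, 2],  # group C
]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```

A large F means the group means are spread out more than the scatter inside the groups would explain by chance.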

The punishment preferences among the groups are as follows: 66.7% of white men favor the death penalty, 58.8% of white women favor the death penalty, 55.6% of men of color favor imprisonment, and 64.7% of women of color favor imprisonment.

I was unable to calculate standard deviation for these results. I do not think this is possible, because the explanatory variable has 4 categories and the response variable has 2 categories: neither is quantitative. After spending many hours (at least 16!) trying to find and code quantitative variables relevant to my original thesis, I was unsuccessful. I understand the code involved in calculating means and standard deviations, but I was unable to show that in this assignment. If any of my classmates have some input or resources, I’d welcome the assistance. I would very much like to calculate the deviation within each of the four gender-ethnic groups.
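One possible way around this, offered as a sketch rather than a definitive answer: if the two-category response is coded 0/1 (1 = favors the death penalty), the group mean is simply the proportion p, and the population standard deviation is sqrt(p * (1 - p)). The helper name binary_std is hypothetical. The proportions are copied from the output in this post, and the toy list at the end is made up purely as a sanity check.

```python
from math import sqrt
from statistics import mean, pstdev

def binary_std(p):
    """Population standard deviation of a 0/1 variable whose mean is p."""
    return sqrt(p * (1 - p))

# Share favoring the death penalty in each group, copied from the output in this post
death_share = {
    "White Men": 0.666667,
    "White Women": 0.587814,
    "Men of Color": 0.443925,
    "Women of Color": 0.352713,
}

group_std = {label: binary_std(p) for label, p in death_share.items()}
for label, sd in group_std.items():
    print(f"{label}: proportion = {death_share[label]:.3f}, std = {sd:.3f}")

# Sanity check on hypothetical toy responses (1 = death penalty, 0 = prison):
# the formula should agree with the standard deviation computed from raw values
toy = [1, 1, 0]
toy_std_from_values = pstdev(toy)
toy_std_from_formula = binary_std(mean(toy))
```

With the real data in a pandas DataFrame, a 0/1 copy of the response grouped by ETH_GEN and passed to agg(['mean', 'std']) should give comparable numbers. Note that pandas reports the sample standard deviation, which is slightly larger than the population value used here.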


Is there a relationship between the gender and race of people who
favor the death penalty as punishment for murder?
All responses, death penalty vs. life in prison:
count 1535
unique 2
top Prison
freq 791
Name: W2_QK3, dtype: object
Preferences by ethnicity-gender subsets:
WHITE MEN:
Death 0.666667
Prison 0.333333
Name: W2_QK3, dtype: float64
WHITE WOMEN:
Death 0.587814
Prison 0.412186
Name: W2_QK3, dtype: float64
MEN OF COLOR:
Death 0.443925
Prison 0.556075
Name: W2_QK3, dtype: float64
WOMEN OF COLOR:
Death 0.352713
Prison 0.647287
Name: W2_QK3, dtype: float64

Ordinary Least Squares:
OLS Regression Results
Dep. Variable:     W2_QK3            R-squared:           0.060
Model:             OLS               Adj. R-squared:      0.058
Method:            Least Squares     F-statistic:         32.57
Date:              Thu, 21 Sep 2017  Prob (F-statistic):  2.10e-20
Time:              10:23:36          Log-Likelihood:      -1065.9
No. Observations:  1535              AIC:                 2140.
Df Residuals:      1531              BIC:                 2161.
Df Model:          3
Covariance Type:   nonrobust

                   coef     std err  t       P>|t|  [0.025  0.975]
Intercept          1.3333   0.027    48.542  0.000  1.279   1.387
C(ETH_GEN)[T.2]    0.0789   0.040    1.972   0.049  0.000   0.157
C(ETH_GEN)[T.3]    0.2227   0.036    6.167   0.000  0.152   0.294
C(ETH_GEN)[T.4]    0.3140   0.035    9.023   0.000  0.246   0.382

Omnibus:           1.131             Durbin-Watson:       1.938
Prob(Omnibus):     0.568             Jarque-Bera (JB):    196.108
Skew:              -0.066            Prob(JB):            2.60e-43
Kurtosis:          1.254             Cond. No.            5.32

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
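
The coefficients in this table can be read straight off the group percentages. Here is a small check, assuming (as the category renaming in the code implies) that W2_QK3 is coded 1 = death penalty and 2 = life in prison, and that ETH_GEN levels 2, 3, and 4 are white women, men of color, and women of color. The variable names are mine and only illustrative.

```python
# Share favoring prison in each group, copied from the output above
prison_share = {"WM": 0.333333, "WW": 0.412186, "MOC": 0.556075, "WOC": 0.647287}

# With the response coded 1 = death penalty and 2 = life in prison,
# a group's mean response is 1 + (share choosing prison)
group_mean = {g: 1 + p for g, p in prison_share.items()}

# The intercept is the mean of the reference group (white men, ETH_GEN = 1)
intercept = group_mean["WM"]

# Every other coefficient is that group's mean minus the reference mean
offsets = {g: group_mean[g] - intercept for g in ("WW", "MOC", "WOC")}

print("Intercept:", round(intercept, 4))
for g, d in offsets.items():
    print(g, round(d, 4))
```

Up to rounding, the printed values match the coef column: the intercept is the white men’s mean, and each C(ETH_GEN) term is the gap between that group and white men.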

Post hoc test:
Multiple Comparison of Means – Tukey HSD,FWER=0.05
group1  group2  meandiff  lower    upper    reject
MOC     WM      -0.2227   -0.3156  -0.1299  True
MOC     WOC     0.0912    0.0096   0.1728   True
MOC     WW      -0.1439   -0.2399  -0.0479  True
WM      WOC     0.314     0.2245   0.4034   True
WM      WW      0.0789    -0.024   0.1817   False
WOC     WW      -0.2351   -0.3278  -0.1424  True
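
The reject column follows from the confidence bounds: a pair is flagged as significantly different exactly when its interval from lower to upper does not contain zero. A minimal sketch that re-derives the column from the numbers in the table (the helper name interval_excludes_zero is hypothetical):

```python
# (group1, group2, meandiff, lower, upper) copied from the Tukey HSD table above
tukey_rows = [
    ("MOC", "WM", -0.2227, -0.3156, -0.1299),
    ("MOC", "WOC", 0.0912, 0.0096, 0.1728),
    ("MOC", "WW", -0.1439, -0.2399, -0.0479),
    ("WM", "WOC", 0.314, 0.2245, 0.4034),
    ("WM", "WW", 0.0789, -0.024, 0.1817),
    ("WOC", "WW", -0.2351, -0.3278, -0.1424),
]

def interval_excludes_zero(lower, upper):
    """True when the whole confidence interval lies on one side of zero."""
    return lower > 0 or upper < 0

decisions = {
    (g1, g2): interval_excludes_zero(lower, upper)
    for g1, g2, _meandiff, lower, upper in tukey_rows
}
not_different = [pair for pair, reject in decisions.items() if not reject]
print(not_different)
```

Only the white men / white women interval straddles zero, which is why that is the one pair with reject = False.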



# Python program that produced the output above
import pandas
import numpy
import warnings
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# setting up to work with the data:
warnings.simplefilter(action = "ignore", category = FutureWarning)
pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)
pandas.set_option('display.float_format', lambda x:'%f'%x)
data = pandas.read_csv('ool_pds.csv', low_memory = False)

print('Is there a relationship between the gender and race of people who \nfavor the death penalty as punishment for murder?')

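# categorical response variable, death-vs-prison:
# (convert_objects() has been removed from pandas; to_numeric() replaces it)
data['W2_QK3'] = pandas.to_numeric(data['W2_QK3'], errors='coerce')
# treat the -1 code as missing; these rows are dropped when sub2 is built below
data['W2_QK3'] = data['W2_QK3'].replace(-1, numpy.nan)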

# make a subset: 
sub1 = data.copy()

# categorical explanatory variables, race/ethnicity and gender:
# plz note: this is incredibly cis/binary and trans-/nb-exclusive :(
# (convert_objects() has been removed from pandas; to_numeric() replaces it)
sub1['PPETHM'] = pandas.to_numeric(sub1['PPETHM'], errors='coerce')
sub1['PPETHM'] = sub1['PPETHM'].replace([-1, -2], numpy.nan)  # negative codes = missing
sub1['PPGENDER'] = pandas.to_numeric(sub1['PPGENDER'], errors='coerce')
sub1['PPGENDER'] = sub1['PPGENDER'].replace([-1, -2], numpy.nan)  # negative codes = missing

# create a new categorical explanatory variable based on gender and race
# white women, white men, women of color, men of color
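def ETH_GEN(row):
    # leave the group missing when race or gender is missing, so that
    # non-responses are omitted rather than counted as female / POC
    if pandas.isnull(row['PPETHM']) or pandas.isnull(row['PPGENDER']):
        return numpy.nan
    if row['PPETHM'] == 1:  # white
        if row['PPGENDER'] == 1: return 1 # white male
        else: return 2 # white female
    else:  # POC
        if row['PPGENDER'] == 1: return 3 # men of color
        else: return 4  # women of color
sub1['ETH_GEN'] = sub1.apply(ETH_GEN, axis = 1)
sub1['ETH_GEN'] = pandas.to_numeric(sub1['ETH_GEN'], errors='coerce')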

# make a subset of individuals for whom the relevant data is available
sub2 = sub1[[ 'W2_QK3', 'ETH_GEN' ]].dropna()

#recoding group names
recode1 = {1: 'White Men', 2: 'White Women', 3: 'Men of Color', 4: 'Women of Color'}
sub2['ETH_GEN_LABELS']= sub2['ETH_GEN'].map(recode1)
sub2['ETH_GEN_LABELS']= sub2['ETH_GEN_LABELS'].astype('category')

# look at some data
print('All responses, death penalty vs. life in prison:')
sub2['W2_QK3'] = sub2['W2_QK3'].astype('category')
sub2['W2_QK3'] = sub2['W2_QK3'].cat.rename_categories(['Death', 'Prison'])
desc = sub2['W2_QK3'].describe()
print(desc)
print('Preferences by ethnicity-gender subsets:')
# share of each answer within each group, in the order the groups were coded
for label in ['White Men', 'White Women', 'Men of Color', 'Women of Color']:
    print(label.upper() + ':')
    group = sub2[sub2['ETH_GEN_LABELS'] == label]
    print(group['W2_QK3'].value_counts(sort = False, normalize = True))
# now let's run some tests!
# ANOVA: ordinary least squares
print('\nOrdinary Least Squares:')
sub2['ETH_GEN'] = sub2['ETH_GEN'].astype('category')
sub2['ETH_GEN'] = sub2['ETH_GEN'].cat.rename_categories(['WM', 'WW', 'MOC', 'WOC'])
model1 = smf.ols(formula = 'W2_QK3 ~ C(ETH_GEN)', data = sub1)
results1 = model1.fit()
print(results1.summary())
print('\nPost hoc test:')
# post hoc, Tukey's Honestly Significant Difference Test
sub4 = sub1[['ETH_GEN', 'W2_QK3']].dropna()
sub4['ETH_GEN'] = sub4['ETH_GEN'].astype('category')
sub4['ETH_GEN'] = sub4['ETH_GEN'].cat.rename_categories(['WM', 'WW', 'MOC', 'WOC'])
mc1 = multi.MultiComparison(sub4['W2_QK3'], sub4['ETH_GEN'])
res1 = mc1.tukeyhsd()
print(res1.summary())
