Introduction

Seneca, the Roman philosopher and political adviser to the emperor Nero, was quoted as saying that "the mind is slow to unlearn what it learnt early." Our ability to learn as humans is among the characteristics that give us an advantage in life. Learning has allowed us to instinctively adapt and effectively cope with our environment for survival. But what does it mean to unlearn something?

An important aspect of learning comes from being able to establish connections between objects and events. This is called associative learning. Dr. J. R. Stroop was an American psychologist who studied how the strength of these cognitive associations can sometimes interfere with certain tasks. He showed that things for which we have formed very strong associations (e.g., recognizing words) can inhibit us when we face a challenge that requires deliberately dissociating those connections. It takes effort to unlearn something.

The purpose of this project is to revisit Dr. Stroop's color-word task and demonstrate the use of descriptive statistics, exploratory data analysis, and statistical inference to analyze how interference affects the reaction time involved in naming the ink color of printed color words. Specifically, I would like to ask the following research question:

Research Question

Is there a difference in reaction time between naming the ink color of a color word when the word does not conflict with its ink color and naming the ink color of a color word when the word does conflict with its ink color?

Methodology

Two stimuli are involved in the Stroop experiment: the name of a color (the word stimulus) and the color of the ink in which the name is printed (the color stimulus). A condition we will refer to as the conflicting word stimulus describes the situation in which a word naming one color is printed in the ink of another color.

The interference of word stimuli upon naming colors is a construct that has been operationalized as the increase in the time needed to react to colors caused by the presence of conflicting word stimuli. This interference is what I would like to measure in this project, using a within-subject, repeated-measures design in which the same experimental unit is subjected to two levels of treatment, yielding measures that can be used to infer the effects of interference. I discuss the implementation of the repeated-measures design in the following section.

Subjects and Procedure

Twenty-four individuals participated in a Stroop Effect experiment. Each participant is presented with a list of color names printed in matching ink colors (e.g., the word "BLUE" is printed in blue, the word "GREEN" in green). A timer is started and the participant is asked to go through the list and, for each word, verbally state the color of the ink in which the word is printed. The timer is stopped as soon as the participant finishes all the words in the list, and the elapsed time is recorded as the participant's result under the "Congruent" condition (non-conflicting word stimuli).

The same participant is then presented with another, similarly sized list of color names. This time, each color name is printed in an ink color that does not match the name (e.g., the word "ORANGE" is printed in red, the word "PURPLE" in red). Once again, a timer is started and the participant is asked to go through the list and, for each word, verbally state the color of the ink in which the word is printed. The timer is stopped as soon as the participant finishes all the words in the list, and the elapsed time is recorded as the participant's result under the "Incongruent" condition (conflicting word stimuli).

A dataset of 24 observations is thus obtained. Each observation pertains to one participant and consists of two variables (Congruent, Incongruent) that record the participant's performance under the congruent (non-conflicting word stimuli) and incongruent (conflicting word stimuli) conditions.

Independent and Dependent Variables

The independent variable, the factor that is manipulated in the experiment, is the presence or absence of a conflicting word stimulus. This factor has two levels: the first is the absence of a conflicting word stimulus (i.e., the condition in which the name of the color matches the ink color) and the second is the presence of a conflicting word stimulus (i.e., the condition in which a color name is printed in the ink of a different color).

On the other hand, the time it takes a participant to react to the colors in the presence or absence of a conflicting word stimulus serves as our dependent variable.

Statement of Hypotheses

Null Hypothesis

The population mean time to react to colors in the presence of a conflicting word stimulus (i.e., the condition in which a color name is printed in the ink of a different color) is equal to the population mean time to react to colors in the presence of a non-conflicting word stimulus (i.e., the condition in which the name of the color matches the ink color).

$$ \text{H}_0: \mu_\text{C} = \mu_\text{I} $$

where:

  • $\mu_\text{C}$ is the population mean of the participants' reaction times when the test is conducted in the absence of a conflicting word stimulus (Congruent condition).

  • $\mu_\text{I}$ is the population mean of the participants' reaction times when the test is conducted in the presence of a conflicting word stimulus (Incongruent condition).

Alternative Hypothesis

The population mean time to react to colors in the presence of a conflicting word stimulus (i.e., the condition in which a color name is printed in the ink of a different color) is not equal to the population mean time to react to colors in the presence of a non-conflicting word stimulus (i.e., the condition in which the name of the color matches the ink color).

$$ \text{H}_\text{A}: \mu_\text{C} \ne \mu_\text{I} $$

Another way of stating the null and alternative hypotheses

Since the repeated measures design generates paired data for each of the participants, we can also express the null hypothesis in another way and state that the true population mean difference between the paired observations is zero. When the population mean difference is zero, the population mean values of the two groups must be the same.

$$ \text{H}_0: \mu_\text{C} - \mu_\text{I} = 0 $$

Conversely, the alternative hypothesis assumes that the true population mean difference between the paired reaction times is not equal to zero.

$$ \text{H}_\text{A}: \mu_\text{C} - \mu_\text{I} \ne 0 $$

Note that I have taken a conservative stance on what counts as satisfying the alternative hypothesis (as opposed to taking a directional view, i.e., $\mu_\text{C} < \mu_\text{I}$ or $\mu_\text{C} > \mu_\text{I}$). Under this two-sided view, any substantial observed difference in the population mean reaction times between the two groups, in either direction, is treated as likely reflecting a true difference rather than chance alone.

Implications of Type I and Type II errors

If we say that there is a difference between the reaction times under the two conditions when in fact there is none, we would be committing a Type I error. We would then run the risk of incurring unnecessary costs should the finding prompt further studies of interference, particularly studies aimed at determining the causes of a difference that does not actually exist.

On the other hand, if we commit a Type II error, we are claiming that there is no difference between the reaction times under the two conditions when such a difference does exist. Failing to reject the null hypothesis when it is false carries the risk of missing the chance to analyze the causes of that difference.

Analysis Plan

Statement of the Significance Level ($\alpha$)

For those cases where the null hypothesis is indeed true, I choose not to incorrectly reject $\text{H}_0$ more than 5% of the time. Thus, I have set the significance level at $\alpha = 0.05$. This also means that I am setting the probability of committing a Type I error (rejecting the null hypothesis given that it is true) at 5%.

The non-directional nature of our alternative hypothesis ($\text{H}_\text{A}$) means that this probability is split evenly between the two tails of the t-distribution (0.025 in each tail), which defines the critical regions.

Statistical Test

A non-directional, paired-sample t-test (also known as the dependent t-test for paired samples) will be employed to test our hypothesis. This parametric procedure is particularly suited to the repeated-measures design, which generates a dataset of paired data. The paired nature of the test comes from the within-subject design: each observation in the dataset records one participant's performance under the two condition levels (congruent and incongruent stimuli), resulting in paired measurements.

Moreover, a t-test is used because the population standard deviation of the differences in reaction times, $\sigma$, is unknown, and the test is robust with small sample sizes such as ours. The underlying t-distribution helps compensate for the extra variation in the standard error, $s/\sqrt{n}$, brought about by the combination of (1) using the sample standard deviation as a point estimate and (2) the small sample size.

The non-directional nature of the t-test follows from how we formulated the alternative hypothesis: it states only that the population mean reaction times of the two conditions differ, without specifying which condition's mean is expected to be greater. A non-directional alternative also implies a two-tailed statistical test to guide our assessment of the difference between the sample means.

Dependent t-test for Paired Sample Data

I will now proceed with the hypothesis test. The goal is to compute a test statistic from the data that follows a t-distribution under the assumption that the null hypothesis is true. Let's load the data.

In [65]:
from __future__ import print_function, division

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import scipy.stats as stats
import math
import pprint as pp

sns.set(rc={"figure.figsize": (9, 6)})
sns.set_style('whitegrid')
current_palette = matplotlib.colors.hex2color('#b3e5fc')
crlf = "\n"
In [66]:
print(crlf)
filepath = '.\\data\\stroopdata.csv'
df = pd.read_csv(filepath, names=['Congruent','Incongruent'], header=0)
pp.pprint(df[:5])
print(crlf)

   Congruent  Incongruent
0     12.079       19.278
1     16.791       18.741
2      9.564       21.214
3      8.630       15.687
4     14.669       22.803


In [67]:
print(crlf)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 2 columns):
Congruent      24 non-null float64
Incongruent    24 non-null float64
dtypes: float64(2)
memory usage: 456.0 bytes

The dataset has 24 records and two continuous numeric variables, stored in the columns "Congruent" and "Incongruent". Each row is complete: no missing values were detected in the dataset by the info() function.

Column Name    Data Type    Description
Congruent      float        Duration of time (in seconds) to complete the list of color names printed in non-conflicting ink colors
Incongruent    float        Duration of time (in seconds) to complete the list of color names printed in conflicting ink colors
In [68]:
print(crlf)
df.describe()

Out[68]:
Congruent Incongruent
count 24.000000 24.000000
mean 14.051125 22.015917
std 3.559358 4.797057
min 8.630000 15.687000
25% 11.895250 18.716750
50% 14.356500 21.017500
75% 16.200750 24.051500
max 22.328000 35.255000


The Stroop Effect dataset has a small sample size (n=24), which will affect the choice of test and the shape of the t-distribution when we conduct the hypothesis test later.

The sample mean of the Incongruent variable (22.02) is notably higher than the sample mean of the Congruent variable (14.05). Both groups show a moderate spread around their means, as indicated by their standard deviations (Congruent sd = 3.56, Incongruent sd = 4.80).

We are going to extract the Congruent and Incongruent data into their own subsets and calculate some descriptive statistics for each group.

In [69]:
print(crlf)

# Subset the variables
dfc = df['Congruent']
dfi = df['Incongruent']

dfc_mean = np.mean(dfc)
dfc_std  = np.std(dfc, ddof=1)

print("Congruent Condition:")
print("Sample mean: {0}".format(dfc_mean))
print("Sample stdev: {0}".format(dfc_std))
print(crlf)

Congruent Condition:
Sample mean: 14.051125
Sample stdev: 3.55935795765


The Congruent variable contains the time (in seconds) it took each participant to go through a list of 25 color words and say aloud the ink color of each. The name "Congruent" is used because each color word is printed in an ink color that matches the name of the color word (e.g., the word "BLUE" is indeed printed in blue ink).

The Congruent variable has a mean reaction time of 14.05 seconds and a standard deviation of 3.56 seconds. Under the empirical rule, roughly 68% of the Congruent values are expected to lie between 10.49 and 17.61 seconds.
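As a quick sanity check on that figure, the one-standard-deviation interval and the fraction of observations inside it can be computed directly from the sample. The sketch below is an addition to the original analysis; it simply reuses the dfc, dfc_mean, and dfc_std objects defined above, and for a sample this small the observed proportion will only roughly approximate 68%.

# Quick empirical-rule check on the Congruent sample (illustrative only)
lower, upper = dfc_mean - dfc_std, dfc_mean + dfc_std
within_one_sd = ((dfc >= lower) & (dfc <= upper)).mean()
print("One-sd interval: ({0:.2f}, {1:.2f})".format(lower, upper))
print("Proportion of Congruent times within one sd: {0:.2f}".format(within_one_sd))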

A histogram showing the density distribution of the continuous, univariate Congruent condition variable is shown below.

In [70]:
print(crlf)
plt.figure(figsize=(9, 6)) 
sns.distplot(dfc, bins=7,  \
             hist_kws={"alpha": 0.8, "color": "#b3e5fc"},\
             kde=True)
plt.title("Histogram of Congruent Condition Data\n", fontsize=15)
plt.xlabel("Reaction Time (seconds)")
plt.show()

The Kernel Density Estimate (KDE) of the Congruent condition's reaction times, shown above, looks fairly normal. The density is centered around 14.05 seconds with a modest spread (sd = 3.56), so we can expect the majority of the reaction times to lie within a small distance of the mean.


In [71]:
print(crlf)
dfi_mean = np.mean(dfi)
dfi_std  = np.std(dfi, ddof=1)
print("Incongruent Condition:")
print("Sample mean: {0}".format(dfi_mean))
print("Sample stdev: {0}".format(dfi_std))

Incongruent Condition:
Sample mean: 22.0159166667
Sample stdev: 4.79705712247

The Incongruent variable consists of the time it took each participant to finish a similarly sized list of color words. Each color word is printed in an ink color that conflicts with the name of the color word (e.g., the word "RED" is printed in blue ink), hence the variable name "Incongruent".

The average reaction time for the Incongruent observations is 22.02 seconds, which is 7.97 seconds longer than the mean Congruent time. The Incongruent standard deviation of 4.80 is larger than its Congruent counterpart, so we expect the Incongruent values to be more spread out around their sample mean. Under the empirical rule, about 68% of the Incongruent values lie between 17.22 and 26.82 seconds.

A histogram showing the density distribution of the continuous, univariate Incongruent condition variable is shown below.


In [72]:
print(crlf)
plt.figure(figsize=(9, 6)) 
sns.distplot(dfi, bins=7, \
             hist_kws={"alpha": 0.8, "color": "#b3e5fc"},\
             kde=True)
plt.title("Histogram of Incongruent Condition Data\n", fontsize=15)
plt.xlabel("Reaction Time (seconds)")
plt.show()

The distribution of reaction times for the Incongruent condition appears to contain a few outliers: observations that took longer than 30 seconds to finish. These cause the calculated Kernel Density Estimate (KDE) to be positively skewed.

The distribution's mean of 22.02 seconds is substantially higher than that of the Congruent group. The relatively large standard deviation (sd = 4.80) is also visible in the plot, with the reaction times spread out around the sample mean.

It would be interesting to plot both distributions side-by-side so we would have a better understanding of how they compare with each other in terms of shape, central tendencies, and spread.


In [73]:
print(crlf)
plt.figure(figsize=(9, 6)) 
for col in ['Congruent', 'Incongruent']:
    if col == 'Congruent':
        plt.hist(df[col],  
                 alpha=0.9, color="#b3e5fc", 
                 bins=7, label='Congruent')
    elif col == 'Incongruent':
        plt.hist(df[col],
                 alpha=0.3, color="#ffa726", 
                 bins=7, label='Incongruent')
        
plt.title("Combined Histograms of Condition Data\n", fontsize=15)
plt.xlabel("\nReaction Time (seconds)", fontsize=15)
plt.legend()
plt.show()

It is apparent from the histograms that the Incongruent group's distribution (shown in yellow) breaks away from the Congruent group's (shown in blue), to the point that it may well belong to a distinct distribution. We will keep this observation in mind and see whether it finds statistical support when we run a formal hypothesis test on the data.

We are not limited to histograms when comparing distributions across paired data. Another way to compare the Congruent and Incongruent distributions is a side-by-side boxplot that places both sets of data on a common scale of reaction times.


In [74]:
plt.figure(figsize=(6,8))
ax = sns.boxplot(data=df[['Congruent','Incongruent']], color="#b3e5fc") \
          .set(ylabel="Reaction Time (seconds)\n", xlabel="\nInterference Condition")

The boxplots show that half of the participants finished the list of words printed in non-conflicting ink colors in about 14.36 seconds or less, whereas half of the participants needed about 21.01 seconds or less to finish the list printed in conflicting ink colors. Further observations about the boxplots follow:

  • Both distributions lean toward positive skew: the Congruent group looks closer to normal and symmetric, while the Incongruent group shows a clearer positive skew owing to its outlying observations.
  • The Incongruent group has a greater variation among reaction times (SD = 4.80) compared to the Congruent group whose reaction time values tend to be closer to its sample mean (SD = 3.56).
  • There is a very noticeable difference between the middle 50% (Q1 to Q3) of reaction times in the Congruent and Incongruent groups, with the latter being higher and more spread out. Even the slowest quarter of Congruent reaction times (from about 16.2 seconds upward) would look fast against most of the Incongruent distribution, whose median sits near 21 seconds; only the fastest Incongruent times dip into the Congruent range.

Coefficient of variation

I have defined a function below to help us compute the coefficient of variation for each of our univariate variables.

In [75]:
def cv(mean, std):
    """
    Compute the coefficient of variation
    """
    result = std/mean
    return float(result)
In [76]:
print(crlf)
dfc_cv = round(cv(dfc_mean, dfc_std), 2)
dfi_cv = round(cv(dfi_mean, dfi_std), 2)

print("Coefficient of variation - Congruent: {0}".format(dfc_cv))
print("Coefficient of variation - Incongruent: {0}".format(dfi_cv))

Coefficient of variation - Congruent: 0.25
Coefficient of variation - Incongruent: 0.22


The standard deviation of the reaction times for reading the entire list of color names with non-conflicting ink colors is 25% of its sample mean. In this standardized form, the relative spread of the two groups' reaction times is about the same, with the Congruent group slightly more variable than the Incongruent group, whose standard deviation amounts to only 22% of its sample mean.

The reaction times for reading words printed in conflicting colors may be longer, but they are consistently long, with a relative spread around the mean comparable to that of the Congruent group.

Checking the Assumptions for using the t-test

                      Interference Condition
                      Congruent    Incongruent
Sample size           24           24
Mean                  14.05        22.02
Standard Deviation    3.56         4.80

The differences are normally distributed

A paired t-test does not assume that the observations within each condition group are normal, only that the differences between the paired observations are.

The difference between each participant's Congruent and Incongruent reaction times is added to the existing dataset as a new calculated field called Difference through the code below.

In [77]:
# Create a new column to hold each participant's difference (Congruent - Incongruent)
print(crlf)
df['Difference'] = df['Congruent'] - df['Incongruent']
diff = df['Difference']
df[0:5][['Congruent','Incongruent', 'Difference']]

Out[77]:
Congruent Incongruent Difference
0 12.079 19.278 -7.199
1 16.791 18.741 -1.950
2 9.564 21.214 -11.650
3 8.630 15.687 -7.057
4 14.669 22.803 -8.134

We will plot the distribution of the differences in reaction times between the Congruent and Incongruent conditions to visualize its shape, center, and spread.

In [78]:
print(crlf)
plt.figure(figsize=(9, 6)) 
sns.distplot(df['Difference'],  \
             hist_kws={"alpha": 0.8, "color": "#b3e5fc"},\
             kde=True)
plt.title("Histogram of Difference Between Condition Reaction Times\n", fontsize=15)
plt.xlabel("Reaction Time (seconds)")
plt.show()

The distribution of the differences between the paired Congruent and Incongruent reaction times appears slightly skewed to the negative side. This slight skew can be explained by the presence of outliers in the data, which produce larger calculated differences. Most of the recorded differences cluster around -7 seconds. The largest (most negative) difference is around -22 seconds, while the smallest is around -2 seconds.
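The histogram only gives a visual impression of approximate normality. A formal check such as the Shapiro-Wilk test can supplement it; the sketch below is an addition to the original analysis and tests the null hypothesis that the differences were drawn from a normal distribution, so a p-value above 0.05 would indicate no strong evidence against normality.

# Shapiro-Wilk normality check on the paired differences (supplementary sketch)
w_stat, w_pvalue = stats.shapiro(df['Difference'])
print("Shapiro-Wilk W = {0:.4f}, p-value = {1:.4f}".format(w_stat, w_pvalue))
if w_pvalue > 0.05:
    print("No strong evidence against normality of the differences.")
else:
    print("The differences deviate significantly from normality.")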

We will assume that the data was collected from participants who volunteered to do the Stroop task and that the individuals are unrelated to one another, so that the performance of one participant does not affect the performance of another.

Homogeneity of variance

The test for homogeneity of variance, while relevant for t-tests on independent means, is not required for a paired-sample t-test such as ours. Because the samples are paired, analyzing them as if they were independent would not follow.

Adequacy of the sample size

The sample size of our dataset is 24. Although this sample is small (fewer than the proverbial 30 observations), choosing a Student's t-test to test our hypothesis is appropriate for our purpose because the t-test was designed for, and works well with, small samples where the population standard deviation is unknown. One concern with a small sample, though, is statistical power. I will rely on the t-distribution's heavier tails, determined by the sample's degrees of freedom, to compensate for the small sample size.
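For readers who want to quantify that power concern, a rough post-hoc power calculation could look like the sketch below. It assumes the statsmodels package is available (it is not imported elsewhere in this notebook), and the effect size of 0.8 (a "large" effect by Cohen's convention) is a hypothetical illustration rather than a value estimated from this dataset.

# Rough power check for a paired t-test, assuming statsmodels is installed.
# effect_size=0.8 is a hypothetical "large" effect, not estimated from the data.
from statsmodels.stats.power import TTestPower

power = TTestPower().power(effect_size=0.8, nobs=24, alpha=0.05,
                           alternative='two-sided')
print("Approximate power for n=24, d=0.8, alpha=0.05: {0:.2f}".format(power))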

Point estimate

The point estimate for the difference between the population mean reaction times of the two groups (Congruent and Incongruent) is based on our paired data sample and comes from the difference between the sample means:

$$ \bar{x}_\text{C} - \bar{x}_\text{I} \Rightarrow 14.051 - 22.016 = -7.96 \ \text{seconds} $$

where $\bar{x}_{\text{C}}$ is the sample mean of the reaction times under the Congruent condition and $\bar{x}_{\text{I}}$ is the sample mean of the reaction times under the Incongruent condition.

What follows are the calculations to generate the components for our hypothesis test: the degrees of freedom, the mean difference in reaction times and the standard deviation of the differences in reaction times.

Degrees of freedom

In [79]:
print(crlf)

# Sample size
n = np.shape(df)[0]

# Degrees of freedom
dof = n - 1

print("Sample size: {0}".format(n))
print("Degrees of freedom: {0}".format(dof))

Sample size: 24
Degrees of freedom: 23

Descriptive statistics on the difference in reaction times

In [80]:
print(crlf)
diff.describe()

Out[80]:
count    24.000000
mean     -7.964792
std       4.864827
min     -21.919000
25%     -10.258500
50%      -7.666500
75%      -3.645500
max      -1.950000
Name: Difference, dtype: float64

Mean and standard deviation of the differences in reaction times

In [81]:
print(crlf)
diff_mean = np.mean(diff)
print("Mean of difference of reaction times: {0}".format(diff_mean))
print(crlf)

Mean of difference of reaction times: -7.96479166667


In [82]:
print(crlf)
diff_std = np.std(diff, ddof=1)
print("Standard deviation of the difference in reaction times: {0}".format(diff_std))
print(crlf)

Standard deviation of the difference in reaction times: 4.86482691036


The difference between the average time for naming the ink colors of a list of non-conflicting color words and the average time for naming the ink colors of a list of conflicting color words is -7.96 seconds. This means that, on average, a participant is slower at naming the color of a word when the word itself conflicts with its ink color. Whether a difference of this size could plausibly have occurred by chance is one of the questions this project sets out to answer.

The standard deviation of 4.86 seconds tells us that there is considerable variation among the differences in reaction times; under the empirical rule, about 68% of the differences fall between -12.82 and -3.10 seconds.

t-critical value for $\alpha$ = 0.05, dof = 23

A significance level of 0.05 implies that we are looking for the 5% of values that would be rare under the null hypothesis. Since we are running a two-sided (two-tailed) test, we need the t-critical values marking 2.5% in each tail of the t-distribution.

The code below returns a two-tail critical value for an alpha of 0.05 and a degrees of freedom of 23.

In [83]:
print(crlf)
alpha = 0.05  # significance level (corresponds to 95% confidence)
alpha_p = alpha/2 # two-tail test

t_critical = round(abs(stats.t.ppf(alpha_p, dof)), 3)
print("t-critical value (two-tail, alpha=0.05, dof=23): {0}".format(t_critical))
print(crlf)

t-critical value (two-tail, alpha=0.05, dof=23): 2.069


Decision Rule

If the t-statistic falls below the negative t-critical value or above the positive t-critical value, it lies inside our critical region. This means we would consider our t-statistic to represent a sample (n=24, $\bar{x}_{D}$ = -7.96) that is very unlikely to have been obtained by chance under the null hypothesis, which in turn implies that the Congruent and Incongruent reaction times probably come from different populations.

In other words, under the null hypothesis we would expect our t-statistic to fall between -2.069 and 2.069. If our t-statistic falls outside this interval (above 2.069 or below -2.069), we reject the null hypothesis.

Standard error of the sampling distribution of the difference in reaction times

The formula for the standard error is given below. The standard error estimates the standard deviation of the sampling distribution of the mean difference in reaction times between the Congruent and Incongruent data, taking the mean of that distribution to be the mean of the observed differences.

$$ SE = \frac{s}{\sqrt{n}} $$

where n is the sample size and s is the standard deviation of the differences in reaction times between the Congruent and Incongruent groups. The standard error will be used in calculating the t-statistic for our sample as well as in determining the confidence interval for the difference in reaction times later in this project.

In [84]:
print(crlf)
diff_SE = diff_std/math.sqrt(n)
print("Standard error of the sampling distribution of the difference in reaction times: {0}".format(diff_SE))
print(crlf)

Standard error of the sampling distribution of the difference in reaction times: 0.993028634778


Calculate the t-statistic of the difference of sample means (paired data)

Remember that the t-statistic is the ratio of our point estimate (the difference between the mean Congruent and Incongruent reaction times) to the standard deviation of that point estimate (the standard error).

$$t = \frac{\bar{x}_{C} - \bar{x}_{I}}{\frac{s}{\sqrt{n}}} \Rightarrow \frac{\bar{x}_{C} - \bar{x}_{I}}{SE}$$

In [85]:
def t_stat(xbar, mu, s, n):
    """
    Compute the t-statistic
    """
    mean_diff = xbar - mu
    std_err = s/math.sqrt(n)
    result = mean_diff/std_err
    return result  
In [86]:
print(crlf)
diff_tstat = t_stat(dfc_mean, dfi_mean, diff_std, n)
print("t-statistic for point estimate (difference in reaction times): {0}".format(diff_tstat))

t-statistic for point estimate (difference in reaction times): -8.02070694411

p-value

Given a t-statistic and the degrees of freedom, we can use the survival function of the t-distribution (stats.t.sf, also known as the complementary CDF) to compute the one-sided p-value. The two-sided p-value is then obtained by doubling the one-sided value.

In [87]:
from scipy.stats import t

print(crlf)
# one-tailed p-value (t-distribution)
diff_pvalue_1S = t.sf(np.abs(diff_tstat), dof)  

# two-tailed p-value (t-distribution)
diff_pvalue_2S = 2 * diff_pvalue_1S

# below: alternative to get two-tailed p-value 
# diff_pvalue_2S = stats.t.sf(np.abs(diff_tstat), n-1)*2

print ("t-statistic = {0}\npvalue      =  {1:.12f}".format(diff_tstat, diff_pvalue_2S))
print(crlf)

t-statistic = -8.02070694411
pvalue      =  0.000000041030


The difference between the mean reaction times under the Congruent and Incongruent conditions (-7.96 seconds) is statistically significant because the p-value (0.000000041030) associated with the t-statistic of -8.02 is lower than 0.05.

The likelihood of getting a test statistic at least as extreme as -8.02 (in either direction), if the null hypothesis is in fact true, is about 0.000000041.

Using the scipy.stats.ttest_rel() function

I have demonstrated the computation of the t-statistic and the p-value manually for the sake of clearly showing the steps involved in conducting a Student's t hypothesis test on paired data.

The scipy module provides a more straightforward way to arrive at the same results. The ttest_rel() function calculates and returns the t-statistic and the p-value of a two-sided test of the null hypothesis that two related or repeated samples have identical average (expected) values.

In [88]:
stats.ttest_rel(dfc,dfi)
Out[88]:
Ttest_relResult(statistic=-8.020706944109957, pvalue=4.1030005857111781e-08)

Decision

Since the p-value of 0.000000041030 is so small, the null hypothesis provides a very poor explanation of the data. Moreover, the result is highly significant given that the p-value is considerably less than the alpha of 0.05. Therefore, we reject the null hypothesis in favor of the alternative: we find very strong evidence that the difference in reaction times between the Congruent and Incongruent groups is not zero.

The data provide strong evidence at an alpha level of 0.05 (specifically, t(23) = -8.02, p < 0.05) that the presence of a conflicting word stimulus increases the response time for naming the ink color of a word by about 7.96 seconds, with a margin of error of 2.05 seconds.

The graphic below shows how our t-statistic falls inside the critical region demarcated by the t-critical value of -2.069 on the left side of the t-distribution.

In [103]:
# Plot distribution
plt.figure(figsize=(15, 6)) 
bounds = np.arange(-15, 7, 0.001)
t_dist = stats.t(n-1).pdf
plt.xlabel("t-scores", fontsize=15)
plt.ylabel("")

plt.plot(bounds, t_dist(bounds), color="#03a9f4")

# Plot critical region
plt.fill_between(bounds, t_dist(bounds), facecolor="#4fc3f7",
                 where=np.logical_or(bounds < -t_critical,bounds > t_critical))

# Plot t statistic of t-test
plt.title("Student's t Distribution (df = 23)", fontsize=15)
plt.annotate("t-statistic = {:.2f}".format(diff_tstat), 
             xy=(diff_tstat, 0), 
             xytext=(diff_tstat, 0.05),
             arrowprops=dict(facecolor="#ffc107"))

plt.show()

Effect size

I am curious about the effect size: the distance between the means of the two groups, this time expressed in standard deviation units. How many standard deviations apart are the two means? For this, I use Cohen's d for the difference in sample mean reaction times.

In [90]:
from __future__ import division

def cohens(xbar, mu, s):
    """
    Compute Cohen's d
    """
    mean_diff = xbar - mu
    result = mean_diff/s
    return result   

print(crlf)
diff_cohens = cohens(dfc_mean, dfi_mean, diff_std)
print("cohen's d: {0}".format(diff_cohens))

cohen's d: -1.63721994912

The mean of the Congruent group is about 1.64 standard deviations lower than the mean of the Incongruent group. A Cohen's effect size of d = -1.64 suggests that the difference in reaction times associated with the use of conflicting words amounts to a very large effect, i.e., high practical significance.
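Note that the d computed above uses the standard deviation of the paired differences as its denominator (sometimes written d_z). As a point of comparison, the sketch below computes a pooled-standard-deviation variant; this alternative convention is my own addition and not part of the original analysis.

# Alternative Cohen's d using a pooled standard deviation (illustrative convention,
# not the calculation used above); reuses dfc_mean, dfi_mean, dfc_std, dfi_std
pooled_std = math.sqrt((dfc_std**2 + dfi_std**2) / 2)
diff_cohens_pooled = (dfc_mean - dfi_mean) / pooled_std
print("Cohen's d (pooled-sd variant): {0:.2f}".format(diff_cohens_pooled))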

Confidence interval for mean population difference

We have just used the p-value and test statistic to reject the null hypothesis. I am also curious about the range of population mean differences that is consistent with the difference we found between the mean response times of the two groups.

In the section that follows, I will compute a 95% confidence interval for the population mean difference in response times. We begin with the formula for the confidence interval (C.I.):

$$ \left( \bar{x}_{D} - t \cdot \frac{ s } {\sqrt{n}} \ \ , \ \ \bar{x}_{D} + t \cdot \frac{ s } {\sqrt{n}} \right)$$

where:

  • $\bar{x}_{D}$ - the mean of the differences in sample response times (in seconds) between the Congruent and Incongruent data

  • $t$ - the t-critical value for a 95% confidence level with 23 degrees of freedom (2.069)

  • $s$ - the standard deviation of the differences in response times

  • $n$ - the sample size

In [91]:
# Calculate the margin of error
print(crlf)
diff_ME = t_critical * diff_SE
print("Margin of error {0}".format(diff_ME))
print(crlf)

Margin of error 2.05457624536


In [92]:
# CI lower bound
print(crlf)
diff_CI_lower = diff_mean - diff_ME
print("C.I. lower bound: {0}".format(diff_CI_lower))
print(crlf)

C.I. lower bound: -10.019367912


In [93]:
# CI upper bound
print(crlf)
diff_CI_upper = diff_mean + diff_ME
print("C.I. upper bound: {0}".format(diff_CI_upper))

C.I. upper bound: -5.91021542131

Confidence Interval

We are 95% confident that the true population mean difference in reaction times between the Congruent and Incongruent groups is between -10.02 and -5.91 seconds.

It is quite clear that our null hypothesis finds no support in this confidence interval: a difference of 0 lies nowhere inside the bounds of a 95% confidence interval centered at the current point estimate of -7.96 seconds.
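As a cross-check, the same interval can be obtained in a single call to scipy's t.interval, reusing the dof, diff_mean, and diff_SE values computed above. This is a supplementary sketch; it should reproduce the bounds above within rounding.

# Cross-check of the 95% confidence interval using scipy directly
ci_lower, ci_upper = stats.t.interval(0.95, dof, loc=diff_mean, scale=diff_SE)
print("95% C.I. via stats.t.interval: ({0:.3f}, {1:.3f})".format(ci_lower, ci_upper))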

Digging Deeper

The Stroop effect is a fascinating phenomenon. I tried the test myself using an online, interactive application, and I can attest that I went through the list of words more slowly when confronted with the set in which the ink color of each word conflicts with the word's meaning (I was supposed to name the color of the ink).

I believe that my response was influenced by the conflict between the process in my brain responsible for recognizing color and the process responsible for recognizing words. The fact that I performed faster on the set of words where the color and the word matched (the congruent condition) suggests that I recognize the meaning of a word before I recognize its color. It also means that when it came time to name the conflicting colors in the incongruent set, I had to deliberately inhibit my word-recognition process in order to let my color-recognition process come forward. This effort, however small, takes extra time.

The inhibition I just described finds support in one of the theories used to explain the Stroop effect, the Selective Attention Theory:

"The Selective Attention Theory [holds] that color recognition, as opposed to reading a word, requires more attention; the brain needs to use more attention to recognize a color than to encode a word, so it takes a little longer." - M. McMahon

Relevant Topics

Dr. J. R. Stroop's experiment has influenced many studies in psychology that deal with the effects of interference or inhibition. It has also found practical application in tests that rely on concentration or attentiveness; the Stroop task, for instance, has been used to assess cognitive attention and focus among extreme mountain climbers.

Peter Wuhr of Erlangen University in Germany used a variation of the Stroop experiment to study how the processing of the orientation of visual objects is affected by congruent or incongruent orientation words. His finding that congruent orientation words produce faster orientation-naming responses essentially reveals a new Stroop effect for spatial orientation.

References

Stroop, John Ridley (1935). "Studies of interference in serial verbal reactions". Journal of Experimental Psychology. 18 (6): 643–662. doi:10.1037/h0054651. Retrieved 2008-10-08.

Wuhr, P (2007). "A Stroop Effect For Spatial Orientation". The Journal of General Psychology. 134 (3): 285–294. doi:10.3200/genp.134.3.285-294.

McMahon, M. "What Is the Stroop Effect". Retrieved November 11, 2013.