Scientific rationale, metric formulas, and clinical interpretation guide for health professionals and researchers.
The Stroop test is one of the most widely used instruments in cognitive neuropsychology. It was first described by John Ridley Stroop in 1935 and has since generated over 700 empirical studies, making it one of the most replicated paradigms in all of experimental psychology (MacLeod, 1991).
The test exploits a fundamental characteristic of skilled readers: word reading is automatic and difficult to suppress, while colour naming requires controlled attentional processing. When a colour word is printed in an incongruent ink colour (e.g., the word RED printed in blue ink), the automatic word-reading response competes with the slower colour-naming response, generating measurable interference in both reaction time and accuracy.
This interference — the Stroop Effect — is considered a robust index of selective attention, response inhibition, and cognitive control. It reflects the activity of frontal executive systems, particularly the anterior cingulate cortex and dorsolateral prefrontal cortex, which mediate conflict monitoring and resolution.
This implementation uses up to three conditions depending on the selected mode. All stimuli are colour words displayed in a specific ink colour; the participant must identify the ink colour, ignoring the word meaning.
Congruent: the word meaning and the ink colour match. Example: the word RED printed in red. No conflict exists between automatic reading and colour naming; this condition serves as the cognitive baseline.
Incongruent: the word meaning and the ink colour differ. Example: the word RED printed in blue. This is the conflict condition: the automatic reading response must be suppressed in favour of the correct colour response.
Neutral: a word with no colour connotation (e.g., TABLE, HOUSE) is printed in a colour. Example: the word TABLE printed in a coloured ink. The word carries no semantic interference for colour naming. This condition allows estimation of baseline processing time without the congruency benefit; the gap between Neutral and Congruent RT estimates the facilitation effect.
Each trial follows a fixed sequence: (1) a fixation cross (+) appears at the centre of the screen for the duration of the inter-trial interval (400 ms by default), directing the participant's gaze to the stimulus location; (2) the fixation cross is replaced by the colour word stimulus, which remains visible until a response is made or the time limit is reached; (3) after each response or timeout, the screen clears briefly before the next fixation cross appears. The fixation cross is displayed in a neutral grey and carries no colour connotation — its sole function is to direct spatial attention.
Three input methods are available, selected before starting the test:
The standard report is generated immediately after each session. All calculations use only trials from the current session, with no outlier removal applied.
For each condition, the mean RT is computed across all trials with a valid (non-timeout) response. Timed-out trials are excluded from RT but counted as errors in the accuracy calculation.
Accuracy is the proportion of trials in which the participant selected the correct ink colour, expressed as a percentage. Timeout trials count as errors.
The Overall Accuracy shown in the report is computed from Congruent and Incongruent trials only — the Neutral condition is excluded. Because the Neutral condition serves exclusively as an RT baseline, its accuracy is not clinically meaningful for the interference assessment and is therefore not displayed or included in the overall figure.
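The RT and accuracy rules above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the tuple layout and the function names `mean_rt` and `accuracy` are hypothetical, and it assumes that answered but incorrect trials still contribute to mean RT, since only timeouts are described as excluded.

```python
from statistics import mean

# Hypothetical trial records: (condition, rt_ms or None for a timeout, correct)
trials = [
    ("congruent", 610, True),
    ("congruent", 650, True),
    ("incongruent", 780, True),
    ("incongruent", None, False),  # timeout: no RT, counted as an error
    ("incongruent", 820, False),
    ("neutral", 640, True),
]

def mean_rt(trials, condition):
    """Mean RT over trials of one condition that received a response."""
    rts = [rt for cond, rt, _ in trials if cond == condition and rt is not None]
    return mean(rts) if rts else None

def accuracy(trials, conditions):
    """Percentage correct; timeouts stay in the denominator as errors."""
    subset = [correct for cond, _, correct in trials if cond in conditions]
    return 100.0 * sum(subset) / len(subset) if subset else None

incongruent_rt = mean_rt(trials, "incongruent")
# Overall accuracy uses Congruent and Incongruent trials only (Neutral excluded)
overall_acc = accuracy(trials, {"congruent", "incongruent"})
```

In this made-up data the timeout lowers incongruent accuracy but leaves the incongruent mean RT untouched, which is the behaviour the report describes.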
Interference is the RT difference between the Incongruent and Congruent conditions. It quantifies the cost of resolving word-colour conflict, expressed in milliseconds. A positive value indicates the expected direction (incongruent slower than congruent).
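As a worked sketch of this subtraction, assuming the per-condition mean RTs are already available in milliseconds (the function names are hypothetical; the facilitation formula follows the Neutral minus Congruent gap described for the Neutral condition):

```python
def interference_ms(mean_rt_incongruent, mean_rt_congruent):
    """Cost of word-colour conflict: Incongruent minus Congruent mean RT."""
    return mean_rt_incongruent - mean_rt_congruent

def facilitation_ms(mean_rt_neutral, mean_rt_congruent):
    """Benefit of congruency: Neutral minus Congruent mean RT."""
    return mean_rt_neutral - mean_rt_congruent

# Illustrative values only, not normative data
example_interference = interference_ms(800, 630)
example_facilitation = facilitation_ms(640, 630)
```

With these illustrative numbers the interference is 170 ms and the facilitation is 10 ms; a negative interference would mean the congruent condition was the slower one.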
The Coefficient of Variation (CV) quantifies response consistency relative to the mean reaction time. It expresses the standard deviation as a percentage of the mean, allowing comparison of variability across conditions and across sessions regardless of differences in baseline speed.
A lower CV indicates more consistent, stable responses. A higher CV indicates greater trial-to-trial variability, which may reflect attentional instability, fatigue, or inconsistent strategic control — independent of mean RT. In the Historical Comparison table, CV changes are colour-coded in green (improvement) when the value decreases relative to the historical reference, because lower variability is the desired direction.
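A minimal sketch of the CV calculation is shown here. It assumes the sample standard deviation (n − 1 denominator); the text does not state which SD variant the tool uses, and the function name is hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(rts_ms):
    """CV (%) = standard deviation / mean * 100, for one list of RTs."""
    if len(rts_ms) < 2:
        return None  # SD is undefined for fewer than two trials
    return 100.0 * stdev(rts_ms) / mean(rts_ms)

# Assumption: sample SD is used; a population SD would give slightly lower values
slow_but_steady = coefficient_of_variation([1200, 1300, 1400])
fast_and_steady = coefficient_of_variation([600, 650, 700])
```

Both example lists give the same CV even though one participant is twice as slow, which illustrates why CV allows comparison regardless of baseline speed.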
The Clinical Report is a printable PDF document that consolidates the current session's results and, when historical data is available, a longitudinal summary of all previous sessions. It can be generated in two ways:
The comparison table covers four metrics and shows three values for each: the current session value, the historical reference, and the change between them. The historical reference is computed from all sessions in the CSV excluding the current one. Excluding the current session prevents a circular comparison in which today's result would influence its own reference baseline; this is a standard approach in serial performance monitoring.
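The leave-current-out reference can be sketched as below. Two assumptions are made that the text does not confirm: the reference is taken to be the arithmetic mean of the other sessions, and the change is current minus reference. The function name and data layout are hypothetical.

```python
from statistics import mean

def historical_change(session_values, current_index):
    """Return (reference, change) for one metric across sessions.

    The current session is left out of the reference so that it cannot
    influence its own baseline.
    """
    others = [v for i, v in enumerate(session_values) if i != current_index]
    if not others:
        return None, None  # no history yet
    reference = mean(others)  # assumption: the reference is a mean
    return reference, session_values[current_index] - reference

# Interference (ms) for four sessions; the last one is the current session
reference, change = historical_change([120, 100, 80, 70], current_index=3)
```

Here the reference is 100 ms and the change is −30 ms; had the current session been included, the reference would have been pulled toward today's value.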
The column label adapts to the amount of data available:
The Detailed Report applies a three-pass exclusion pipeline before computing RT-based statistics. This pipeline follows standard practices in RT research to remove trials that are not representative of genuine cognitive processing. Exclusion thresholds are configurable by the user.
| Pass | Criterion | Default | Excludes from | Rationale |
|---|---|---|---|---|
| 1 | Anticipatory response | RT < 200 ms | RT and Accuracy | A response under 200 ms cannot reflect genuine colour identification — it precedes the time required for visual processing and decision-making. It is most likely a pre-motor reflex or key-press accident, not a cognitive response. |
| 2 | Lapse / disengagement | RT > 2000 ms | RT only | Very long RTs typically reflect attentional lapses, momentary disengagement, or distraction rather than genuine processing difficulty. Including them inflates mean RT and increases variance. The response is still counted in accuracy because a decision was eventually made. |
| 3 | Statistical outlier (per condition) | \|RT − M\| > 2.5 × SD | RT only | Trials whose RT deviates more than 2.5 standard deviations from the condition mean are flagged as statistical outliers. This pass operates within each condition separately to preserve genuine between-condition differences. It is applied after Pass 2 to avoid outlier inflation from extreme values. |
The 2.5 SD threshold is widely used in RT research as a balance between preserving statistical power and removing artifactual observations (Ratcliff, 1993). More conservative thresholds (3.0 SD) retain more trials at the cost of greater outlier influence; more liberal thresholds (2.0 SD) remove more data but may exclude legitimate slow responses.
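The three passes can be sketched as a single function. This is an illustration under stated assumptions rather than the tool's implementation: trials are hypothetical dicts with `trial`, `condition`, and `rt` keys, timeouts have already been dropped, Pass 3 uses the sample SD, and Pass 3 runs once rather than iterating.

```python
from collections import defaultdict
from statistics import mean, stdev

def exclusion_pipeline(trials, min_rt=200, max_rt=2000, sd_limit=2.5):
    """Return (rt_kept, excluded), where excluded holds (trial, reason) pairs.

    Only 'anticipatory' trials should also be dropped from accuracy;
    'lapse' and 'outlier' trials are removed from RT statistics only.
    """
    excluded, survivors = [], []
    for t in trials:
        if t["rt"] < min_rt:            # Pass 1: anticipatory response
            excluded.append((t, "anticipatory"))
        elif t["rt"] > max_rt:          # Pass 2: lapse / disengagement
            excluded.append((t, "lapse"))
        else:
            survivors.append(t)

    by_condition = defaultdict(list)
    for t in survivors:
        by_condition[t["condition"]].append(t)

    rt_kept = []
    for cond_trials in by_condition.values():
        rts = [t["rt"] for t in cond_trials]
        if len(rts) < 2:                # SD undefined, nothing to flag
            rt_kept.extend(cond_trials)
            continue
        m, sd = mean(rts), stdev(rts)
        for t in cond_trials:           # Pass 3: per-condition outliers
            if abs(t["rt"] - m) > sd_limit * sd:
                excluded.append((t, "outlier"))
            else:
                rt_kept.append(t)
    return rt_kept, excluded

# Made-up RTs: trial 12 is an outlier, 13 is anticipatory, 14 is a lapse
rts = [590, 600, 610, 600, 590, 610, 600, 605, 595, 600, 600, 1500, 150, 2400]
trials = [{"trial": i + 1, "condition": "congruent", "rt": rt}
          for i, rt in enumerate(rts)]
kept, excluded = exclusion_pipeline(trials)
reasons = {t["trial"]: reason for t, reason in excluded}
```

Because the 2400 ms lapse is removed before Pass 3, it does not inflate the condition SD, which is what lets the 1500 ms trial be flagged in this example.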
The Detailed Report is generated from a Trial Data CSV file and provides six complementary analyses. All analyses operate on the trial-level data after applying the three-pass exclusion pipeline described in Section 4.
A box-and-whisker plot is displayed for each condition, allowing visual comparison of RT spread and central tendency. The implementation uses the Tukey method (Tukey, 1977).
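The quantities behind a Tukey-style box plot can be sketched as follows. Quartile conventions differ between packages; this sketch uses Python's `statistics.quantiles` with the inclusive method and the conventional 1.5 × IQR fences, which may not match the tool's exact quartile rule. The function name is hypothetical.

```python
from statistics import quantiles

def tukey_summary(rts_ms, k=1.5):
    """Box-plot statistics: quartiles, whisker ends, and outlying points."""
    data = sorted(rts_ms)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers extend to the most extreme observations inside the fences
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

summary = tukey_summary([500, 550, 600, 650, 700, 750, 800, 1500])
```

For these illustrative RTs the box spans 587.5 to 762.5 ms, the upper whisker stops at 800 ms, and the 1500 ms trial is drawn as a separate point.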
The learning curve plots individual trial RTs over the sequence of trials, overlaid with a moving average to reveal practice or fatigue effects across the session.
A downward trend in the moving average over time indicates a practice effect (RT improvement). An upward trend at the end of the session may indicate cognitive fatigue. A flat curve suggests stable performance throughout.
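A trailing moving average of the kind overlaid on the learning curve might look like this. The window length and the choice of a trailing (rather than centred) window are assumptions; the text does not specify either, and the function name is hypothetical.

```python
def moving_average(rts_ms, window=5):
    """Trailing moving average; early points use as many trials as exist."""
    smoothed = []
    for i in range(len(rts_ms)):
        chunk = rts_ms[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Steadily improving RTs produce a falling curve (a practice effect)
curve = moving_average([900, 800, 700, 600, 500], window=3)
```

A larger window gives a smoother curve but reacts more slowly to a late-session rise in RT.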
Post-error slowing is a well-established cognitive phenomenon in which participants are slower on the trial immediately following an error compared to trials following a correct response. It is interpreted as reflecting error-monitoring processes and adaptive response adjustment (Rabbitt & Rodgers, 1977).
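One common way to quantify post-error slowing is the difference between mean RT after errors and mean RT after correct responses. The sketch below assumes that definition and includes every following trial regardless of its own correctness; the tool may apply additional restrictions. The function name and tuple layout are hypothetical.

```python
from statistics import mean

def post_error_slowing(trials):
    """trials: ordered list of (rt_ms, correct). Returns PES in ms or None."""
    post_error, post_correct = [], []
    for previous, current in zip(trials, trials[1:]):
        target = post_correct if previous[1] else post_error
        target.append(current[0])
    if not post_error or not post_correct:
        return None  # needs at least one trial of each kind
    return mean(post_error) - mean(post_correct)

sequence = [(600, True), (620, True), (700, False), (760, True), (640, True)]
pes = post_error_slowing(sequence)
```

In this made-up sequence the single post-error trial (760 ms) is about 107 ms slower than the average post-correct trial; a positive value is the expected direction.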
The session is divided into four equally sized temporal blocks, and mean RT and accuracy are computed per block. This allows detection of practice, fatigue, or attentional fluctuation effects across the session.
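The block split can be sketched as below. How the tool handles a trial count that is not divisible by four is not stated; this sketch assigns trials by integer index boundaries so block sizes differ by at most one. The function name and tuple layout are hypothetical.

```python
from statistics import mean

def block_summary(trials, n_blocks=4):
    """trials: ordered list of (rt_ms, correct). Returns [(mean_rt, acc%)]."""
    n = len(trials)
    summary = []
    for b in range(n_blocks):
        block = trials[b * n // n_blocks:(b + 1) * n // n_blocks]
        if not block:
            summary.append((None, None))
            continue
        block_rt = mean(rt for rt, _ in block)
        block_acc = 100.0 * sum(ok for _, ok in block) / len(block)
        summary.append((block_rt, block_acc))
    return summary

session = [(700, True), (680, True), (650, True), (640, False),
           (620, True), (610, True), (660, True), (690, False)]
blocks = block_summary(session)
```

The illustrative session speeds up over the first three blocks and slows again in the fourth, the kind of pattern that would suggest practice followed by fatigue.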
Errors are grouped by word-ink combination. For each unique pair, the error rate is computed and the combinations with the highest rates are reported. This reveals which specific stimuli are most cognitively demanding, which can reflect lexical frequency effects, colour-name similarity, or individual semantic associations.
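The grouping described above can be sketched with a simple counter. The tuple layout `(word, ink, correct)` and the function name are hypothetical.

```python
from collections import defaultdict

def error_rates_by_pair(trials):
    """Return [((word, ink), error_rate%)] sorted from highest to lowest."""
    counts = defaultdict(lambda: [0, 0])  # pair -> [errors, total]
    for word, ink, correct in trials:
        counts[(word, ink)][1] += 1
        if not correct:
            counts[(word, ink)][0] += 1
    rates = {pair: 100.0 * errors / total
             for pair, (errors, total) in counts.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

ranked = error_rates_by_pair([
    ("RED", "blue", False),
    ("RED", "blue", True),
    ("GREEN", "red", True),
    ("BLUE", "green", False),
])
```

With only a few presentations per pair, a single error produces a large rate, so rates from short sessions are best read alongside the raw counts.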
All trials removed by the three-pass exclusion pipeline are listed in full, showing the trial number, condition, RT, and the exclusion reason. This transparency allows the clinician or researcher to review every data-quality decision made before analysis.
There is no universally standardised normative range for Stroop interference, as values vary substantially across age, education, and task design. The following descriptive thresholds are used in this tool as a general orientation for clinical screening, not as diagnostic cutoffs:
When interference is negative (e.g., −42 ms), the congruent condition was slower than the incongruent condition — a reversal of the expected Stroop direction. This is atypical in healthy adults and does not indicate superior cognitive control. Possible explanations include response set effects (a consistent speed-accuracy bias toward the incongruent format), practice or habituation effects, or strategic suppression of word reading. Negative interference should be noted as an unusual finding and interpreted cautiously, particularly if it persists across multiple sessions.
In the historical report, Best Session identifies the session with the smallest absolute interference value — the session whose score is closest to zero. This criterion reflects the session in which word-colour conflict had the least measurable effect on response time, regardless of direction. A session with +20 ms is therefore considered better than one with −60 ms, because its absolute interference (20 ms) is smaller. This approach avoids misclassifying atypical negative values as markers of exceptional performance.
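The Best Session rule reduces to a minimum over absolute values. In this sketch the session labels and the function name are hypothetical, and ties are resolved by taking the first session encountered, which the text does not specify.

```python
def best_session(interference_by_session):
    """Return the session whose interference is closest to zero."""
    return min(interference_by_session,
               key=lambda session: abs(interference_by_session[session]))

# Interference in ms per session (illustrative values)
history = {"session-1": 95, "session-2": -60, "session-3": 20}
best = best_session(history)
```

Here `session-3` (+20 ms) is selected over `session-2` (−60 ms), matching the example in the text.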
The Stroop paradigm has been validated as a sensitive measure across numerous clinical and research contexts (MacLeod, 1991):
MacLeod, C. M. (1991). Half a century of research on the Stroop effect: An integrative review. Psychological Bulletin, 109(2), 163–203. https://doi.org/10.1037/0033-2909.109.2.163
Rabbitt, P. M. A., & Rodgers, B. (1977). What does a man do after he makes an error? An analysis of response programming. Quarterly Journal of Experimental Psychology, 29(4), 727–743. https://doi.org/10.1080/14640747708400645
Ratcliff, R. (1993). Methods for dealing with reaction time outliers. Psychological Bulletin, 114(3), 510–532. https://doi.org/10.1037/0033-2909.114.3.510
Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6), 643–662. https://doi.org/10.1037/h0054651
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.