Thursday, January 6, 2011

Stereotype threat and measurement invariance

Stereotype threat (ST) is a popular explanation for the observed black-white (B-W) IQ and achievement gaps. Although I think the chance of ST explaining a large gap is low, pigeonhole menace advocates have some interesting studies and investigating them is worth the work. This post will discuss the relationship of ST to measurement invariance (MI). Briefly, MI implies that for two groups the functional relationship between a test score (e.g. IQ) and the latent variable the score measures (e.g. intelligence) is the same. It is the most important psychometric property without a Wikipedia page so this is the simplest introduction I can find.
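To make the definition concrete, here is a minimal simulated sketch, assuming a simple linear measurement model (score = intercept + loading * latent + noise). All the numbers are made up for illustration. In a simulation the latent variable is known, so the score-to-latent line can be fit directly in each group; with real data the latent variable is unobserved, which is why a factor model is needed.

```python
import random
import statistics

def simulate_scores(n, latent_mean, intercept, loading, noise_sd, rng):
    """Draw latent ability, then score = intercept + loading * latent + noise."""
    latent = [rng.gauss(latent_mean, 1.0) for _ in range(n)]
    scores = [intercept + loading * t + rng.gauss(0.0, noise_sd) for t in latent]
    return latent, scores

def fit_line(x, y):
    """Ordinary least squares slope and intercept of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(0)

# Hypothetical parameters: group A is the reference group.
lat_a, obs_a = simulate_scores(5000, 0.0, 100.0, 15.0, 5.0, rng)
# Group B, MI holds: lower latent mean but the SAME intercept and loading.
lat_b, obs_b = simulate_scores(5000, -0.5, 100.0, 15.0, 5.0, rng)
# Group C, MI violated: same latent mean as B, but the intercept is 3 points lower.
lat_c, obs_c = simulate_scores(5000, -0.5, 97.0, 15.0, 5.0, rng)

slope_a, int_a = fit_line(lat_a, obs_a)
slope_b, int_b = fit_line(lat_b, obs_b)
slope_c, int_c = fit_line(lat_c, obs_c)

print(f"A: slope={slope_a:.2f} intercept={int_a:.2f}")
print(f"B: slope={slope_b:.2f} intercept={int_b:.2f}  (invariant)")
print(f"C: slope={slope_c:.2f} intercept={int_c:.2f}  (intercept shifted)")
```

Groups A and B differ in mean score, but the same latent ability maps to the same expected score in both, so the gap reflects the latent difference. In group C, part of the score gap comes from the shifted intercept and not from the latent variable.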

ST Studies: The hundreds of ST studies (e.g., Steele and Aronson 1995) test whether black (or female or white) test takers feel more performance pressure than white (or male or Asian) test takers because of anxiety about being pigeonholed. The experiments all have more or less the same fairly rigorous (and absolutely hilarious) design. Black and white subjects are randomly assigned to two test taking groups. The experimental group is subjected to an ST (e.g., the proctor unfurls the Confederate battle flag) that should only affect blacks. The B-W gap in each group is then estimated using standard methods (e.g. ANCOVA), and more often than not the gap is larger in the experimental group.

Errors in Analysis: Wicherts et al. (2005) criticize the use of ANCOVA models that test for the ST effect while controlling for the students' individual characteristics (e.g. SAT scores). Their primary concern is that the effect of ST can vary with the student's ability, so the assumption of equal regression weights may not be met, and it is usually not tested. They then suggest using Multigroup Confirmatory Factor Analysis (MGCFA) to test for MI. Wicherts et al. (2005) applied MGCFA to three ST experiments and found that ST could lead to the loss of MI even when the ANCOVA showed little difference between group averages. These results have been (correctly) interpreted to mean that ST can lead to biased test results. So how much, if any, of the B-W gap observed in non-experimental conditions can they explain?
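The equal regression weights assumption can be checked informally by fitting the outcome-on-covariate slope separately in each condition. The sketch below uses simulated data with made-up effect sizes, where the threat effect grows with prior ability. It is only an illustration of the assumption, not the MGCFA procedure from the paper.

```python
import random
import statistics

def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

rng = random.Random(1)
n = 4000

# Standardized prior score (e.g. SAT) for subjects randomly assigned to each condition.
cov_control = [rng.gauss(0.0, 1.0) for _ in range(n)]
cov_threat = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Control condition: the test score rises 1.0 per SD of the covariate (hypothetical).
y_control = [1.0 * c + rng.gauss(0.0, 1.0) for c in cov_control]
# Threat condition: hypothetical penalty of 0.3 plus 0.4 per SD of ability,
# so the slope on the covariate drops from 1.0 to 0.6.
y_threat = [1.0 * c - (0.3 + 0.4 * c) + rng.gauss(0.0, 1.0) for c in cov_threat]

slope_control = ols_slope(cov_control, y_control)
slope_threat = ols_slope(cov_threat, y_threat)
slope_gap = slope_control - slope_threat

print(f"control slope={slope_control:.2f}, threat slope={slope_threat:.2f}, gap={slope_gap:.2f}")
```

An ANCOVA that forces a single common slope onto these two conditions would misstate the threat effect, because the size of the effect depends on where a subject sits on the covariate.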

Magnitude: In a meta-analysis, Walton and Spencer (2009) found that ST caused a -0.2 SD difference between black and white test scores. However, they used some questionable modeling techniques (e.g. comparing the black experimental subjects to white control subjects). Wicherts and de Haan (2009) presented at a meeting the results of a meta-analysis where they found publication bias and concluded that "Stereotype threat cannot explain the difference in mean cognitive test performance between African Americans and European Americans". This has not yet been published though.

Generalizability: ST study authors extrapolate the results to the general setting (e.g., the SAT). ST critics counter with the general validity of the tests and note that SATs do not underpredict academic performance for blacks (Sackett 2008), implying that latent intelligence is not being underestimated. Wicherts and Millsap (2009) challenge this conclusion and note that an absence of underprediction can still occur even if measurement invariance does not hold.
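One stylized way prediction and measurement bias can come apart is sketched below. It is a simulation under assumed numbers, not the model from either paper. The test is noisy, group B has a lower latent mean, and the test is also biased against group B by a hypothetical 0.3 points. Because of regression to the mean, group A's regression line still does not underpredict group B's later performance.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares slope and intercept of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def simulate_group(n, latent_mean, test_bias, rng):
    """Return (test scores, criterion scores) for one group."""
    test, criterion = [], []
    for _ in range(n):
        t = rng.gauss(latent_mean, 1.0)                          # latent ability
        test.append(t + test_bias + rng.gauss(0.0, 0.5 ** 0.5))  # noisy test, error variance 0.5
        criterion.append(t + rng.gauss(0.0, 0.5))                # later performance, e.g. GPA
    return test, criterion

rng = random.Random(2)
n = 20000

# All parameters are hypothetical.
x_a, y_a = simulate_group(n, 0.0, 0.0, rng)    # group A: unbiased test
x_b, y_b = simulate_group(n, -1.0, -0.3, rng)  # group B: test reads 0.3 too low (MI violated)

# Predict group B's criterion using the line fitted on group A.
slope_a, int_a = fit_line(x_a, y_a)
residuals_b = [y - (int_a + slope_a * x) for x, y in zip(x_b, y_b)]
mean_resid_b = statistics.fmean(residuals_b)

# A negative mean residual means actual performance falls below the prediction.
print(f"group A line: slope={slope_a:.2f}, intercept={int_a:.2f}")
print(f"mean (actual - predicted) for group B: {mean_resid_b:.3f}")
```

Under these assumptions the mean residual for group B is negative (overprediction) even though the test understates group B's latent ability, so the prediction result alone does not settle whether MI holds.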

In order to show that the observed gap is due to stereotype threat you have to show MI does not hold. Wicherts et al. (2005) note, "If a certain test score gap is accompanied by measurement invariance (and power is not an issue), stereotype threat is not likely to play a differential role in those particular group differences." The few tests of MI for blacks and whites (Dolan 2004) have detected no violations. Lack of MI must be observed in an important test of achievement for ST to have any practical importance.


9 comments:

Steve Sailer said...

Stereotype threat is always measured experimentally on low-stakes tests because it would be unethical to try to induce poorer performance on high-stakes tests. But low-stakes tests usually have concerns about motivation, since they are, by definition, low stakes. Stereotype threat tests attempt to induce blacks into working less hard on tests that don't do them any good, personally speaking. Why is it surprising that they sometimes succeed?

Statsquatch said...

Maybe, but ST partisans could argue that you are more likely to feel anxious on a high stakes test (e.g., SAT) so ST will be amplified. Of course, for its partisans not being able to run the experiments on high stakes tests is a feature not a bug. They can impugn the test without any first hand data. Regardless, ST is a moot point unless someone can show a lack of measurement invariance on a high stakes test. I doubt they will on the SATs though, the ETS is smart enough to avoid that.

Steve Sailer said...

Can you do me a huge favor and give me a couple of (stylized) examples of measurement invariance? I looked at the presentation you linked to, but there wasn't enough text about the ice cream sales by temperature in Austria and US graphs for me to understand them.

Statsquatch said...

Sure, check in Sunday. I have to fire up R.

occidentalascent said...

Stats,

What do you make of this paper?
http://www.socsc.smu.edu.sg/events/Paper/sdarticle.pdf

Statsquatch said...

OA, the link got cut off.

Statsquatch said...

OA,

Got it. Thanks. This may deserve a longer look but from the abstract it appears to be a very confusing result and may be reason to be thankful for the sociologist’s fallacy. The widening gap over time is consistent with Jensen's results and the increase of IQ heritability over time. The conclusion that early achievement is predictive of later achievement to some degree is consistent with the robustness of the achievement measured. To conclude from this that early SES affects later achievement is odd. Wouldn’t you just model early SES on later IQ? So this may depend on the details of the analysis.

M said...

Pesta & Poznanski have a paper showing differences on reaction time measures which should be less affected by motivational factors.

http://www.csuohio.edu/business/academics/mlr/documents/Pesta_08_intell_race_iq.pdf

Statsquatch said...

M,

Thanks for the Ref. It would be cool if they did a stereotype threat assessment on ECT.