In The Affirmative Action Hoax, Steven Farron notes the "universal phenomenon" of aptitude tests overpredicting the performance of lower-performing groups. That is, members of higher-scoring groups have better outcomes than members of lower-scoring groups with the same test scores. For example, blacks with a given LSAT score will have higher failure rates on the bar exam than white test takers with the same score. The same phenomenon has been observed when comparing the entrance exams of Arabs and Jews in Israel, and whites and Asians on mathematical aptitude tests. Sailer even found a Federal Reserve study showing that blacks had a higher risk of foreclosure than whites with the same FICO scores.
Borsboom et al. (2008) showed that this "universal phenomenon" is a statistical artifact dependent on two conditions:
- Measurement invariance: the test is measurement invariant, in that the mathematical relationship between the test score and the latent trait is the same for both groups.
- Differential distribution: the mean of the distribution of the latent trait that the test is trying to estimate differs between the two groups.
La Griffe du Lion had a similar post about this, but I recommend Borsboom's more traditional treatment. The paper has a graph similar to the one below, showing a stylized example in which the test is a simple function of the latent trait, a test score of 50 is required for acceptance into a program, and those with a latent trait level above 50 are suitable.
The ellipses represent the joint distribution of the test score and the latent attribute for the low- and high-achieving groups. Because of measurement invariance, the two distributions differ only in their means. Obviously, more members of the high-scoring group will be true positives and more members of the low-scoring group will be true negatives. The two error types matter more, though: candidates who are accepted but not suitable are at risk of failure, and any rational institution tries to minimize this type of error.
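Under the two conditions above, accepted members of the low-achieving group are more likely to be unsuitable than accepted members of the high-achieving group:
P(not suitable | accepted, low-achieving group) > P(not suitable | accepted, high-achieving group).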
So using the same cutoff for both the high- and low-scoring groups leads to a disproportionate rate of failure among accepted members of the low-scoring group. Of course, with affirmative action a lower criterion is used for low-scoring groups, with predictable results.
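
The argument can be checked with a small simulation. This is a minimal sketch and not Borsboom et al.'s own analysis: the group means (45 and 55), the latent standard deviation (10), the error standard deviation (5), and the function name `simulate_group` are all assumptions chosen for illustration. Only the cutoff of 50 comes from the stylized example.

```python
import random


def simulate_group(latent_mean, n=100_000, latent_sd=10.0, error_sd=5.0,
                   cutoff=50.0, seed=0):
    """Simulate one group; return its acceptance rate and the share of
    accepted candidates whose latent trait falls below the cutoff."""
    rng = random.Random(seed)
    accepted = 0
    accepted_unsuitable = 0
    for _ in range(n):
        latent = rng.gauss(latent_mean, latent_sd)
        # Measurement invariance: the score depends on the latent trait
        # in the same way for every group (latent trait plus noise).
        score = latent + rng.gauss(0.0, error_sd)
        if score >= cutoff:
            accepted += 1
            # Suitability uses the same threshold of 50 on the latent trait.
            if latent < cutoff:
                accepted_unsuitable += 1
    return {
        "acceptance_rate": accepted / n,
        "p_unsuitable_given_accepted": accepted_unsuitable / accepted,
    }


# Differential distribution: only the latent means differ (assumed values).
low = simulate_group(latent_mean=45.0, seed=1)
high = simulate_group(latent_mean=55.0, seed=2)

for name, result in (("low", low), ("high", high)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

With these assumed parameters, the share of accepted-but-unsuitable candidates should come out noticeably higher in the low-mean group, even though both groups face the same test and the same cutoff.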

6 comments:
What do you make of this paper?
I just skimmed it. They seem to say that the degree of overprediction (intercept differences) found for minorities has been badly estimated and is probably lower, while slope differences may be undetected and may be higher. I can't tell if they are right, but I hope there is a revival in test bias research.
Can this be thought of in regression toward the mean terms?
Yes, the proof uses regression to the mean, and the amount of overprediction decreases with the reliability of the test. Higher reliability means less regression to the mean and less error.
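
A rough sketch of that point, assuming the classical test theory setup in which the expected latent trait is a linear function of the observed score (Kelley's regressed estimate) and both groups share the same reliability. The function names and the group means of 45 and 55 are hypothetical, chosen only for illustration.

```python
def expected_latent(score, group_mean, reliability):
    """Regressed estimate: E[latent | score] = mean + reliability * (score - mean)."""
    return group_mean + reliability * (score - group_mean)


def overprediction_gap(score, low_mean, high_mean, reliability):
    """Difference in expected latent trait between two groups at the same score."""
    return (expected_latent(score, high_mean, reliability)
            - expected_latent(score, low_mean, reliability))


# The gap equals (1 - reliability) * (high_mean - low_mean), so it shrinks
# as reliability rises and vanishes for a perfectly reliable test.
for reliability in (0.5, 0.8, 0.95, 1.0):
    gap = overprediction_gap(50.0, 45.0, 55.0, reliability)
    print(reliability, round(gap, 2))
```

Under these assumptions the gap does not depend on the score itself, only on the reliability and the difference in group means.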
Jlovborg,
Thanks for the paper. I read it; it is interesting, and I will try to post on it. The author seems to be a big-shot court expert in disparate impact cases, including Ricci.
This is quite intuitive if you think of how spam filters work.
Thank you for providing more evidence that rational Bayesian stereotyping is highly costly to ignore. Perhaps a simulation of the effects of implementing such a procedure instead of the naive test cutoff procedure would be interesting.