Psychological Tests

I. Introduction

Personality assessment is perhaps more an art form than a science. In an attempt to render it as objective and standardized as possible, generations of clinicians came up with psychological tests and structured interviews. These are administered under similar conditions and use identical stimuli to elicit information from respondents. Thus, any disparity in the responses of the subjects can be, and is, attributed to the idiosyncrasies of their personalities.


Moreover, most tests restrict the repertory of permitted answers. “True” or “false” are the only allowed reactions to the questions in the Minnesota Multiphasic Personality Inventory II (MMPI-2), for instance. Scoring or keying the results is also an automatic process wherein each response that matches an item’s keyed answer (“true” for some items, “false” for others) earns one or more points on one or more scales and all other responses earn none.
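The automatic keying described above can be sketched in a few lines of Python. The item numbers, keyed answers, and scale names below are invented for illustration and are not taken from any real MMPI-2 scoring key; the sketch assumes only that a response earns a point on one or more scales when it matches the item's keyed answer.

```python
from collections import Counter

# Hypothetical scoring key: item number -> (keyed answer, scales credited).
# These entries are made up for illustration; they are NOT real MMPI-2 keys.
SCORING_KEY = {
    1: (True, ["scale_A"]),
    2: (True, ["scale_A", "scale_B"]),
    3: (False, ["scale_B"]),
}


def score_responses(responses):
    """Return raw scale scores for a dict of item number -> True/False."""
    scores = Counter()
    for item, answer in responses.items():
        keyed_answer, scales = SCORING_KEY[item]
        # Only a response in the keyed direction earns points.
        if answer == keyed_answer:
            for scale in scales:
                scores[scale] += 1
    return dict(scores)


# Invented answer sheet: the respondent answers "true" to all three items.
raw = score_responses({1: True, 2: True, 3: True})
print(raw)
```

No judgment enters at this stage: the same answer sheet always produces the same raw scores, which is why the diagnostician's involvement begins only with interpretation.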

This limits the involvement of the diagnostician to the interpretation of the test results (the scale scores). Admittedly, interpretation is arguably more important than data gathering. Thus, inevitably biased human input cannot be, and is not, avoided in the process of personality assessment and evaluation. But its pernicious effect is somewhat reined in by the systematic and impartial nature of the underlying instruments (tests).

Still, rather than rely on one questionnaire and its interpretation, most practitioners administer to the same subject a battery of tests and structured interviews. These often vary in important aspects: their response formats, stimuli, procedures of administration, and scoring methodology. Moreover, in order to establish a test’s reliability, many diagnosticians administer it repeatedly over time to the same client. If the interpreted results are more or less the same, the test is said to be reliable.
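One conventional way to quantify "more or less the same" is a test-retest correlation. The sketch below is an illustration only: the scores are invented, and the choice of the Pearson coefficient computed over a small group of clients is an assumption, not something the text prescribes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented scale scores for five clients, tested twice some weeks apart.
first_administration = [12, 18, 25, 9, 30]
second_administration = [14, 17, 27, 10, 28]

r = pearson_r(first_administration, second_administration)
print(f"test-retest r = {r:.2f}")
```

A coefficient close to 1 means that clients keep roughly the same rank order and spacing across administrations; how high it must be before a test counts as reliable is a matter of convention rather than of arithmetic.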

The outcomes of various tests must fit in with each other. Put together, they must provide a consistent and coherent picture. If one test yields readings that are constantly at odds with the conclusions of other questionnaires or interviews, it may not be valid. In other words, it may not be measuring what it claims to be measuring.

Thus, a test quantifying one’s grandiosity must conform to the scores of tests which measure reluctance to admit failings or propensity to present a socially desirable and inflated facade (“False Self”). If a grandiosity test is positively related to irrelevant, conceptually independent traits, such as intelligence or depression, this undermines, rather than supports, its validity.

Most tests are either objective or projective. The psychologist George Kelly offered this tongue-in-cheek definition of both in a 1958 article titled “Man’s construction of his alternatives” (included in the book “The Assessment of Human Motives”, edited by G. Lindzey):

“When the subject is asked to guess what the examiner is thinking, we call it an objective test; when the examiner tries to guess what the subject is thinking, we call it a projective device.”

The scoring of objective tests is computerized (no human input). Examples of such standardized instruments include the MMPI-2, the California Psychological Inventory (CPI), and the Millon Clinical Multiaxial Inventory II. Of course, a human finally gleans the meaning of the data gathered by these questionnaires. Interpretation ultimately depends on the knowledge, training, experience, skills, and natural gifts of the therapist or diagnostician.

Projective tests are far less structured and thus a lot more ambiguous. As L. K. Frank observed in a 1939 article titled “Projective methods for the study of personality”:

“(The patient’s responses to such tests are projections of his) way of seeing life, his meanings, significances, patterns, and especially his feelings.”

In projective tests, the responses are not constrained and scoring is done exclusively by humans and involves judgment (and, thus, a modicum of bias). Clinicians rarely agree on the same interpretation and often use competing methods of scoring, yielding disparate results. The diagnostician’s personality comes into prominent play. The best known of these “tests” is the Rorschach set of inkblots.

II. MMPI-2 Test

The MMPI (Minnesota Multiphasic Personality Inventory), composed by Hathaway (a psychologist) and McKinley (a physician), is the outcome of decades of research into personality disorders. The revised version, the MMPI-2, was published in 1989 but was received cautiously. The MMPI-2 changed the scoring method and some of the normative data. It was, therefore, hard to compare it to its much hallowed (and oft validated) predecessor.

The MMPI-2 is made up of 567 binary (true or false) items (questions). Each item requires the subject to respond: “This is true (or false) as applied to me”. There are no “correct” answers. The test booklet allows the diagnostician to provide a rough assessment of the patient (the “basic scales”) based on the first 370 queries (though it is recommended to administer all 567 of them).

Based on numerous studies, the items are arranged in scales. The responses are compared to answers provided by “control subjects”. The scales allow the diagnostician to identify traits and mental health problems based on these comparisons. In other words, there are no answers that are “typical of paranoid or narcissistic or antisocial patients”. There are only responses that deviate from an overall statistical pattern and conform to the reaction patterns of other patients with similar scores. The nature of the deviation determines the patient’s traits and tendencies – but not his or her diagnosis!

The interpreted outcomes of the MMPI-2 are phrased thus: “The test results place subject X in this group of patients who, statistically-speaking, reacted similarly. The test results also set subject X apart from these groups of people who, statistically-speaking, responded differently”. The test results would never say: “Subject X suffers from (this or that) mental health problem”.

There are three validity scales and ten clinical ones in the original MMPI-2, but other scholars derived hundreds of additional scales. For instance: to help in diagnosing personality disorders, most diagnosticians use either the MMPI-I with the Morey-Waugh-Blashfield scales in conjunction with the Wiggins content scales – or (more rarely) the MMPI-2 updated to include the Colligan-Morey-Offord scales.

The validity scales indicate whether the patient responded truthfully and accurately or was trying to manipulate the test. They pick up patterns. Some patients want to appear normal (or abnormal) and consistently choose what they believe are the “correct” answers. This kind of behavior triggers the validity scales. These are so sensitive that they can indicate whether the subject lost his or her place on the answer sheet and was responding randomly! The validity scales also alert the diagnostician to problems in reading comprehension and other inconsistencies in response patterns.

The clinical scales are dimensional (though not multiphasic as the test’s misleading name implies). They measure hypochondriasis, depression, hysteria, psychopathic deviation, masculinity-femininity, paranoia, psychasthenia, schizophrenia, hypomania, and social introversion. There are also scales for alcoholism, post-traumatic stress disorder, and personality disorders.

The interpretation of the MMPI-2 is now fully computerized. The computer is fed with the patients’ age, sex, educational level, and marital status and does the rest. Still, many scholars have criticized the scoring of the MMPI-2.



III. MCMI-III Test

The third edition of this popular test, the Millon Clinical Multiaxial Inventory (MCMI-III), was published in 1996. With 175 items, it is much shorter and simpler to administer and to interpret than the MMPI-2. The MCMI-III diagnoses personality disorders and Axis I disorders but not other mental health problems. The inventory is based on Millon’s suggested multiaxial model in which long-term characteristics and traits interact with clinical symptoms.

The questions in the MCMI-III reflect the diagnostic criteria of the DSM. Millon himself gives this example (Millon and Davis, Personality Disorders in Modern Life, 2000, pp. 83-84):

“… (T)he first criterion from the DSM-IV dependent personality disorder reads ‘Has difficulty making everyday decisions without an excessive amount of advice and reassurance from others,’ and its parallel MCMI-III item reads ‘People can easily change my ideas, even if I thought my mind was made up.'”

The MCMI-III consists of 24 clinical scales and 3 modifier scales. The modifier scales serve to identify Disclosure (a tendency to hide a pathology or to exaggerate it), Desirability (a bias towards socially desirable responses), and Debasement (endorsing only responses that are highly suggestive of pathology). Next, the Clinical Personality Patterns (scales), which represent mild to moderate pathologies of personality, are: Schizoid, Avoidant, Depressive, Dependent, Histrionic, Narcissistic, Antisocial, Aggressive (Sadistic), Compulsive, Negativistic, and Masochistic. Millon considers only the Schizotypal, Borderline, and Paranoid to be severe personality pathologies and dedicates the next three scales to them.

The last ten scales are dedicated to Axis I and other clinical syndromes: Anxiety Disorder, Somatoform Disorder, Bipolar Manic Disorder, Dysthymic Disorder, Alcohol Dependence, Drug Dependence, Posttraumatic Stress, Thought Disorder, Major Depression, and Delusional Disorder.

Scoring is easy and runs from 0 to 115 for each scale, with 85 and above signifying a pathology. The configuration of the results of all 24 scales provides serious and reliable insights into the tested subject.
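The cut-off rule just described is simple enough to sketch in Python. The 0 to 115 range and the threshold of 85 come from the text; the profile scores below are invented, and the function name is hypothetical rather than part of any official scoring software.

```python
# Range and cut-off as stated in the text.
MIN_SCORE, MAX_SCORE = 0, 115
PATHOLOGY_CUTOFF = 85


def flag_elevated(scale_scores):
    """Return the scales whose score is at or above the cut-off."""
    elevated = []
    for scale, score in scale_scores.items():
        if not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"{scale}: score {score} is outside 0-115")
        if score >= PATHOLOGY_CUTOFF:
            elevated.append(scale)
    return elevated


# Invented profile covering three of the 24 clinical scales.
profile = {"Narcissistic": 91, "Dependent": 40, "Anxiety Disorder": 85}
print(flag_elevated(profile))
```

Flagging individual scales is only the first step; as noted above, it is the configuration of all the scales taken together that the diagnostician interprets.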

Critics of the MCMI-III point to its oversimplification of complex cognitive and emotional processes, its over-reliance on a model of human psychology and behavior that is far from proven and not in the mainstream (Millon’s multiaxial model), and its susceptibility to bias in the interpretative phase.

IV. Rorschach Inkblot Test


The Swiss psychiatrist Hermann Rorschach developed a set of inkblots to test subjects in his clinical research. In a 1921 monograph (published in English in 1942 and 1951), Rorschach postulated that the blots evoke consistent and similar responses in groups of patients. Only ten of the original inkblots are currently in diagnostic use. It was John Exner who systematized the administration and scoring of the test, combining the best of several systems in use at the time (e.g., Beck, Klopfer, Rapaport, Singer).

The Rorschach inkblots are ambiguous forms, printed on 18 x 24 cm cards, in both black and white and color. Their very ambiguity provokes free associations in the test subject. The diagnostician stimulates the formation of these flights of fantasy by asking questions such as “What is this? What might this be?”. S/he then proceeds to record, verbatim, the patient’s responses as well as the inkblot’s spatial position and orientation. An example of such a record would read: “Card V upside down, child sitting on a porch and crying, waiting for his mother to return.”

Having gone through the entire deck, the examiner then proceeds to read aloud the responses while asking the patient to explain, in each and every case, why s/he chose to interpret the card the way s/he did. “What in card V prompted you to think of an abandoned child?”. At this phase, the patient is allowed to add details and expand upon his or her original answer. Again, everything is noted and the subject is asked to explain what in the card or in his or her previous response gave rise to the added details.

Scoring the Rorschach test is a demanding task. Inevitably, due to its “literary” nature, there is no uniform, automated scoring system.

Methodologically, the scorer notes six items for each card:

I. Location – Which parts of the inkblot were singled out or emphasized in the subject’s responses? Did the patient refer to the whole blot, a detail (if so, was it a common or an unusual detail), or the white space?

II. Determinant – Does the blot resemble what the patient saw in it? Which parts of the blot correspond to the subject’s visual fantasy and narrative? Is it the blot’s form, movement, color, texture, dimensionality, shading, or symmetrical pairing?

III. Content – Which of Exner’s 27 content categories was selected by the patient (human figure, animal detail, blood, fire, sex, X-ray, and so on)?

IV. Popularity – The patient’s responses are compared to the overall distribution of answers among people tested hitherto. Statistically, certain cards are linked to specific images and plots. For example: card I often provokes associations of bats or butterflies. The sixth most popular response to card IV is “animal skin or human figure dressed in fur” and so on.

V. Organizational Activity – How coherent and organized is the patient’s narrative and how well does s/he link the various images together?

VI. Form Quality – How well does the patient’s “percept” fit with the blot? There are four grades from superior (+) through ordinary (0) and weak (w) to minus (-). Exner defined minus as:

“(T)he distorted, arbitrary, unrealistic use of form as related to the content offered, where an answer is imposed on the blot area with total, or near total, disregard for the structure of the area.”

The interpretation of the test relies both on the scores obtained and on what we know about mental health disorders. The test teaches the skilled diagnostician how the subject processes information and what the structure and content of his or her internal world are. These provide meaningful insights into the patient’s defenses, reality testing, intelligence, fantasy life, and psychosexual make-up.

Still, the Rorschach test is highly subjective and depends inordinately on the skills and training of the diagnostician. It, therefore, cannot be used to reliably diagnose patients. It merely draws attention to the patients’ defenses and personal style.

V. TAT Diagnostic Test

The Thematic Apperception Test (TAT) is similar to the Rorschach inkblot test. Subjects are shown pictures and asked to tell a story based on what they see. Both these projective assessment tools elicit important information about underlying psychological fears and needs. The TAT was developed in 1935 by Morgan and Murray. Ironically, it was initially used in a study of normal personalities done at the Harvard Psychological Clinic.

The test comprises 31 cards. One card is blank and the other thirty include blurred but emotionally powerful (or even disturbing) photographs and drawings. Originally, Murray came up with only 20 cards which he divided into three groups: B (to be shown to Boys Only), G (Girls Only) and M-or-F (both sexes).

The cards expound on universal themes. Card 2, for instance, depicts a country scene. A man is toiling in the background, tilling the field; a woman partly obscures him, carrying books; an old woman stands idly by and watches them both. Card 3BM is dominated by a couch against which is propped a little boy, his head resting on his right arm, a revolver by his side, on the floor.

Card 6GF again features a sofa. A young woman occupies it. Her attention is riveted by a pipe-smoking older man who is talking to her. She is looking back at him over her shoulder, so we don’t have a clear view of her face. Another generic young woman appears in card 12F. But this time, she is juxtaposed against a mildly menacing, grimacing old woman, whose head is covered with a shawl. Men and boys seem to be permanently stressed and dysphoric in the TAT. Card 13MF, for instance, shows a young lad, his lowered head buried in his arm. A woman is bedridden across the room.

With the advent of objective tests, such as the MMPI and the MCMI, projective tests such as the TAT have lost their clout and luster. Today, the TAT is administered infrequently. Modern examiners use 20 cards or fewer and select them according to their “intuition” as to the patient’s problem areas. In other words, the diagnostician first decides what may be wrong with the patient and only then chooses which cards will be shown in the test! Administered this way, the TAT tends to become a self-fulfilling prophecy and of little diagnostic value.

The patient’s reactions (in the form of brief narratives) are recorded by the tester verbatim. Some examiners prompt the patient to describe the aftermath or outcomes of the stories, but this is a controversial practice.

The TAT is scored and interpreted simultaneously. Murray suggested identifying the hero of each narrative (the figure representing the patient); the inner states and needs of the patient, derived from his or her choices of activities or gratifications; what Murray calls the “press”, the hero’s environment which imposes constraints on the hero’s needs and operations; and the thema, or the motivations developed by the hero in response to all of the above.

Clearly, the TAT is open to almost any interpretative system which emphasizes inner states, motivations, and needs. Indeed, many schools of psychology have their own TAT exegetic schemes. Thus, the TAT may be teaching us more about psychology and psychologists than it does about their patients!

VI. Structured Interviews

The Structured Clinical Interview (SCID-II) was formulated in 1997 by First, Gibbon, Spitzer, Williams, and Benjamin. It closely follows the language of the DSM-IV Axis II Personality Disorders criteria. Consequently, there are 12 groups of questions corresponding to the 12 personality disorders. The scoring is equally simple: the trait is either absent, subthreshold, or true, or there is “inadequate information to code”.

The feature that is unique to the SCID-II is that it can be administered to third parties (a spouse, an informant, a colleague) and still yield a strong diagnostic indication. The test incorporates probes (sort of “control” items) that help verify the presence of certain characteristics and behaviors. Another version of the SCID-II (comprising 119 questions) can also be self-administered. Most practitioners administer both the self-questionnaire and the standard test and use the former to screen for true answers in the latter.

The Structured Interview for Disorders of Personality (SIDP-IV) was composed by Pfohl, Blum and Zimmerman in 1997. Unlike the SCID-II, it also covers the self-defeating personality disorder from the DSM-III. The interview is conversational and the questions are divided into 10 topics such as Emotions or Interests and Activities. Succumbing to “industry” pressure, the authors also came up with a version of the SIDP-IV in which the questions are grouped by personality disorder. Subjects are encouraged to observe the “five year rule”:

“What you are like when you are your usual self … Behaviors, cognitions, and feelings that have predominated for most of the last five years are considered to be representative of your long-term personality functioning …”

The scoring is again simple. Items are either not present, subthreshold, present, or strongly present.

VII. Disorder-specific Tests

There are dozens of psychological tests that are disorder-specific: they aim to diagnose specific personality disorders or relationship problems. Example: the Narcissistic Personality Inventory (NPI) which is used to diagnose the Narcissistic Personality Disorder (NPD).

The Borderline Personality Organization Scale (BPO), designed in 1985, sorts the subject’s responses into 30 relevant scales. These indicate the existence of identity diffusion, primitive defenses, and deficient reality testing.

Other much-used tests include the Personality Diagnostic Questionnaire-IV, the Coolidge Axis II Inventory, the Personality Assessment Inventory (1992), the excellent, literature-based Dimensional Assessment of Personality Pathology, the comprehensive Schedule of Nonadaptive and Adaptive Personality, and the Wisconsin Personality Disorders Inventory.

Having established the existence of a personality disorder, most diagnosticians proceed to administer other tests intended to reveal how the patient functions in relationships, copes with intimacy, and responds to triggers and life stresses.

The Relationship Scales Questionnaire (RSQ) (1994) contains 30 self-reported items and identifies distinct attachment styles (secure, fearful, preoccupied, and dismissing). The Conflict Tactics Scale (CTS) (1979) is a standardized scale of the frequency and intensity of conflict resolution tactics and stratagems (both legitimate and abusive) used by the subject in various settings (usually in a couple).

The Multidimensional Anger Inventory (MAI) (1986) assesses the frequency of angry responses, their duration, magnitude, mode of expression, hostile outlook, and anger-provoking triggers.

Yet, even a complete battery of tests, administered by experienced professionals, sometimes fails to identify abusers with personality disorders. Offenders are uncanny in their ability to deceive their evaluators.

APPENDIX: Common Problems with Psychological Laboratory Tests

Psychological laboratory tests suffer from a series of common philosophical, methodological, and design problems.

A. Philosophical and Design Aspects

  1. Ethical – Experiments involve the patient and others. To achieve results, the subjects have to be ignorant of the reasons for the experiments and their aims. Sometimes even the very performance of an experiment has to remain a secret (double-blind experiments). Some experiments may involve unpleasant or even traumatic experiences. This is ethically unacceptable.
  2. The Psychological Uncertainty Principle – The initial state of a human subject in an experiment is usually fully established. But both treatment and experimentation influence the subject and render this knowledge irrelevant. The very processes of measurement and observation influence the human subject and transform him or her – as do life’s circumstances and vicissitudes.
  3. Uniqueness – Psychological experiments are, therefore, bound to be unique and unrepeatable: they cannot be replicated elsewhere or at other times, even when they are conducted with the SAME subjects. This is because the subjects are never the same due to the aforementioned psychological uncertainty principle. Repeating the experiments with other subjects adversely affects the scientific value of the results.
  4. The undergeneration of testable hypotheses – Psychology does not generate a sufficient number of hypotheses that can be subjected to scientific testing. This has to do with the fabulous (=storytelling) nature of psychology. In a way, psychology has affinity with some private languages. It is a form of art and, as such, is self-sufficient and self-contained. If structural, internal constraints are met – a statement is deemed true even if it does not satisfy external scientific requirements.

B. Methodology

    1. Many psychological lab tests are not blind. The experimenter is fully aware who among his subjects has the traits and behaviors that the test is supposed to identify and predict. This foreknowledge may give rise to experimenter effects and biases. Thus, when testing for the prevalence and intensity of fear conditioning among psychopaths (e.g., Birbaumer, 2005), the subjects were first diagnosed with psychopathy (using the PCL-R questionnaire) and only then underwent the experiment. Thus, we are left in the dark as to whether the test results (deficient fear conditioning) can actually predict or retrodict psychopathy (i.e., high PCL-R scores and typical life histories).

    2. In many cases, the results can be linked to multiple causes. This gives rise to questionable cause fallacies in the interpretation of test outcomes. In the aforementioned example, the vanishingly low pain aversion of psychopaths may have more to do with peer-posturing than with a high tolerance of pain: psychopaths may simply be too embarrassed to “succumb” to pain; any admission of vulnerability is perceived by them as a threat to an omnipotent and grandiose self-image that is sang-froid and, therefore, impervious to pain. It may also be connected to inappropriate affect.

    3. Most psychological lab tests involve tiny samples (as few as 3 subjects!) and interrupted time series. The fewer the subjects, the more random and less significant are the results. Type III errors and issues pertaining to the processing of data garnered in interrupted time series are common.

    4. The interpretation of test results often verges on metaphysics rather than science. Thus, the Birbaumer test established that subjects who scored high on the PCL-R have different patterns of skin conductance (sweating in anticipation of painful stimuli) and brain activity. It did not substantiate, let alone prove, the existence or absence of specific mental states or psychological constructs.

    5. Most lab tests deal with tokens of certain types of phenomena. Again: the fear conditioning (anticipatory aversion) test pertains only to reactions in anticipation of an instance (token) of a certain type of pain. It does not necessarily apply to other types of pain or to other tokens of this type or any other type of pain.

    6. Many psychological lab tests give rise to the petitio principii (begging the question) logical fallacy. Again, let us revisit Birbaumer’s test. It deals with people whose behavior is designated as “antisocial”. But what constitutes antisocial traits and conduct? The answer is culture-bound. Not surprisingly, European psychopaths score far lower on the PCL-R than their American counterparts. The very validity of the construct “psychopath” is, therefore, in question: psychopathy seems to be merely what the PCL-R measures!

    7. Finally, the “Clockwork Orange” objection: psychological lab tests have frequently been abused by reprehensible regimes for purposes of social control and social engineering.
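The point about tiny samples (item 3 above) can be made concrete with the standard error of a sample mean, which shrinks only with the square root of the number of subjects. This is a minimal sketch under stated assumptions: the standard deviation of 10 is an invented value, and the standard error is used here as one common yardstick of how "random" a result is, not as the measure used in any of the studies mentioned.

```python
from math import sqrt


def standard_error(sample_sd, n):
    """Standard error of the mean for a sample of n subjects."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sample_sd / sqrt(n)


# Invented standard deviation; only the trend across sample sizes matters.
ASSUMED_SD = 10.0
for n in (3, 30, 300):
    print(f"n = {n:>3}: standard error = {standard_error(ASSUMED_SD, n):.2f}")
```

Under these assumptions, going from 3 subjects to 300 cuts the standard error by a factor of ten, which is why findings from a handful of subjects are so easily swamped by chance.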
