Are you choosing a Reliable Assessment?

May 5, 2020 Shloka Goyal

Part 2: Reliability: Concept, Types and Applicability 

When looking into the properties of an assessment, reliability is probably the first term that comes to mind for any HR professional. The credibility of an assessment is often judged on the basis of its reliability coefficient. However, reliability is only one of the parameters on which an assessment must be evaluated; it is nonetheless an extremely important one. This blog aims to impart a general understanding of the concept of ‘reliability’, its types and applications, and how to choose a reliable assessment that suits your purpose.


Understanding the Concept

Reliability, in the simplest of terms, means “consistency.” It helps us understand how stable and consistent the results of an assessment are. This stability is achieved when the assessment keeps a check on all the factors that may cause variance, or error, in the measurement. These factors include the length of the assessment, ambiguity in instructions and/or items, difficulty level, administrative procedure, scoring pattern, past exposure to the assessment items, and assessor characteristics such as experience and qualification, to name a few.

One of the unique features of reliability is that it is a precursor to a valid assessment (i.e., an assessment measuring what it purports to measure). This means that an unreliable assessment cannot possibly be valid.

The measurement of reliability is backed by strong statistical logic and is accepted as scientific evidence of the effectiveness of an assessment. Reliability is expressed as a correlation coefficient that ranges from 0 to 1, where 0 indicates no reliability and 1 indicates perfect reliability.
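To make the statistic concrete, here is a minimal sketch (in Python, with entirely hypothetical scores) of how a reliability coefficient can be computed as a correlation: the same ten candidates are scored on two occasions, and the Pearson correlation between the two score sets serves as the estimate.

```python
# Minimal sketch: reliability as a correlation coefficient.
# The scores are hypothetical: ten candidates assessed twice
# on the same instrument (a test-retest design).
import numpy as np

time_1 = np.array([52, 61, 47, 70, 58, 65, 49, 73, 55, 68])  # first administration
time_2 = np.array([50, 63, 45, 72, 60, 64, 51, 70, 57, 66])  # second administration

# Pearson correlation between the two score sets; values near 1 mean
# candidates kept their relative standing across the two occasions.
reliability = np.corrcoef(time_1, time_2)[0, 1]
print(f"Estimated reliability: {reliability:.2f}")
```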

 

Interpreting Reliability Coefficients

Research and deliberation in this area over the years have led to a generally accepted interpretation of reliability coefficients, as represented in Table 1 below.

 

Table 1: General Guidelines for Interpreting Reliability Coefficients

| Reliability Coefficient Value | Interpretation |
| --- | --- |
| 0.90 and above | Excellent |
| 0.80 – 0.89 | Good |
| 0.70 – 0.79 | Adequate |
| Below 0.70 | May have limited application |

Source: U.S. Department of Labor, Employment and Training Administration (1999).

It is important to note that a perfect correlation of 1 is almost impossible to achieve in practice, as some amount of error always exists and no assessment can be perfect.

However, decades of research and accumulated data have produced some extremely interesting findings suggesting that different types of assessments have differing ‘upper limits’ of the reliability coefficient, which changes how reliability coefficients are interpreted in practical scenarios.

 

Table 2: Expected Reliability of Various Types of Assessments

| Type of Assessment | Expected Reliability* |
| --- | --- |
| Ability Assessment (individually administered) | 0.95 |
| Ability Assessment (group administration) | 0.85 |
| Technical/Knowledge-Based Assessment (MCQ) | 0.85 |
| Personality Assessments | 0.75 |
| Rating Scales | 0.50 |
| Creativity Assessments | 0.30 |
| Projective Tests | 0.20 |

*This column represents the expected upper limit of the reliability measure for the respective assessment type, as is practically achievable for a highly scientific and sound assessment.

Source: Research Matters: A Cambridge Assessment publication (2006)

 

Reliability can be measured in different ways

Broadly, there are three different ways of determining the reliability of an assessment, depending on how reliability is viewed in each case. The three common dimensions of reliability are as follows:

Table 3: Classification of Types of Reliability into Three Broad Dimensions

| S. No. | Dimension | Description | Basic Requirement | Sub-types |
| --- | --- | --- | --- | --- |
| 1. | Temporal Stability | Aims to measure how stable the results of an assessment are over time. When the same, or a similar, group of individuals is assessed on a construct at two different points in time, the scores obtained on the two occasions should be consistent. | The participants are required to take the same assessment twice, with a considerable time gap in between. | Test-Retest Reliability; Alternate Form Reliability |
| 2. | Internal Consistency | It is not always possible to administer an assessment multiple times to the same set of participants. This method therefore measures the stability of results in terms of how consistently the items of the assessment assess the construct. | A single administration of the assessment is needed. The number of items needs to be large enough to perform sound statistical calculations. | Split-Half Reliability; Coefficient Alpha; Kuder-Richardson Formula 20 (K-R 20) |
| 3. | Inter-rater Reliability | Views reliability in terms of consistency among different scorers/researchers, i.e., scorer-based consistency. High inter-scorer reliability means that when the responses to an assessment are presented to different researchers or subject-matter experts, their scoring and interpretation should ideally be the same. The upper limit of inter-scorer reliability is around 0.65 for open-ended response patterns, whereas it can be as high as 0.99 for close-ended responses. | Multiple assessors are required to assess the performance of the same participant(s). The criteria and guidelines for assessing the performance ought to be predetermined and well-defined. | - |
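The scorer-based view also lends itself to a quick numerical check. Below is a minimal sketch, with entirely hypothetical ratings, of agreement between two raters scoring the same close-ended (pass/fail) responses; Cohen's kappa, a widely used statistic for this purpose (not named in the table above), corrects raw agreement for the agreement expected by chance.

```python
# Minimal sketch: inter-rater agreement on hypothetical pass/fail ratings.
import numpy as np

rater_a = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # rater A's pass (1) / fail (0) calls
rater_b = np.array([1, 0, 1, 0, 0, 1, 0, 1])  # rater B's calls on the same responses

observed = np.mean(rater_a == rater_b)  # proportion of exact agreement

# Chance agreement: probability that both raters independently
# assign the same category, given each rater's base rates.
categories = np.union1d(rater_a, rater_b)
expected = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories)

kappa = (observed - expected) / (1 - expected)
print(f"Raw agreement: {observed:.2f}, Cohen's kappa: {kappa:.2f}")
```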

Within these dimensions, there are certain sub-categories or types of reliability. These describe the actual methods or processes followed to calculate the reliability of an assessment. The table below summarises the most common of these ‘types’ along with examples of their practical application.

Table 4: Types of Reliability

| Type of Reliability | Description | Applicability Example | Additional Considerations |
| --- | --- | --- | --- |
| Test-Retest Reliability | The same assessment is used to test the same, or a similar, group of participants at two different points in time, with some interim period in between. | Used for constructs that are expected to stay unchanged over time. The personality of an individual tends to be consistent over time; hence, it makes sense to ensure that a personality assessment consistently produces similar results over time. | It may not be an appropriate measure for knowledge-based assessments, as the knowledge of individuals may develop in the interim period due to training or practice; for such assessments, a low test-retest score is not a cause for concern. |
| Alternate Form Reliability (or Parallel Form Reliability) | Similar to the test-retest method in that it involves two occasions of administration; however, instead of using the same assessment on both occasions, a comparable parallel form measuring the same construct is used for the second administration. | While attempting a cognitive assessment for the second time after a 15-day interim period, a participant may remember which response options they chose during the first attempt, making it more likely that they choose the same responses again. If the question items measuring the construct are no longer the same, the participant is compelled to read and think about each item afresh and respond accordingly. | Ensures that the practice effect does not distort the results. Using multiple parallel forms can also reduce the possibility of malpractice, such as cheating, during the assessment. Apart from time-based reliability, administering parallel forms can also help determine the item-based reliability of an assessment. |
| Split-Half Reliability | Assesses the reliability of an assessment using only a single form and a single administration session. The assessment form is carefully divided into two equal halves, and the scores obtained by the participants on the two halves are used to compute the reliability coefficient. | Splitting an assessment down the middle and comparing the scores obtained on the first half with those on the second half; or splitting it by placing all odd-numbered items in one half and all even-numbered items in the other. | The characteristics of the items in the first half should be the same as those in the second half, including the number of questions, difficulty level, and the construct/trait being measured. |
| Coefficient Alpha (or Cronbach's Alpha) | Focuses on how well the question items of an assessment correlate positively with one another. It is an extension of the Kuder-Richardson Formula 20 (K-R 20), which calculates the reliability of unidimensional assessments with homogeneous content and a dichotomous scoring pattern. | Can be calculated wherever measuring temporal consistency is not possible, or where splitting the assessment into two halves would leave too few items and inflate the results. | Can be used for assessments with a ‘multiple correct responses’ scoring pattern, as well as for assessments with heterogeneous item types. |
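To illustrate the single-administration methods above, here is a minimal sketch in Python built on a hypothetical item-response matrix: six candidates answering eight dichotomously scored items. It computes an odd/even split-half coefficient, stepped up to full test length with the Spearman-Brown formula (the standard correction for correlating two half-tests), and coefficient alpha, which for 0/1 items such as these reduces to K-R 20.

```python
# Minimal sketch: internal-consistency estimates from one administration.
# Rows = six hypothetical candidates, columns = eight items scored 0/1.
import numpy as np

responses = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0, 1, 1, 1],
])

# Split-half reliability: correlate scores on odd- and even-numbered items,
# then apply the Spearman-Brown correction, since correlating two
# half-tests understates the reliability of the full-length assessment.
odd_half = responses[:, 0::2].sum(axis=1)
even_half = responses[:, 1::2].sum(axis=1)
r_halves = np.corrcoef(odd_half, even_half)[0, 1]
split_half = 2 * r_halves / (1 + r_halves)

# Coefficient alpha: based on the ratio of summed item variances
# to the variance of total scores across candidates.
k = responses.shape[1]
item_variances = responses.var(axis=0, ddof=1)
total_variance = responses.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(f"Split-half (Spearman-Brown corrected): {split_half:.2f}")
print(f"Coefficient alpha: {alpha:.2f}")
```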

 


What Assessment Users Should Know

Though picking an assessment with a high reliability coefficient may seem logical, it may not necessarily be the right thing to do. Apart from the reliability coefficient, one must look into:

  • The type of reliability used
    • Is it relevant for the assessment in hand?
    • Is it appropriate in terms of measuring consistency?
  • How the reliability studies were conducted
    • Were all the steps followed?
    • What considerations were kept in mind?
    • How large was the sample size?
  • Characteristics of the sample
    • Look into the age group, qualification, gender representation and other such characteristics of the sample
    • The characteristics of the sample group should align with the population for which the assessment is intended to be used

References

  1. Price, P. C., Jhangiani, R., & Chiang, I. C. A. (2015). Research Methods in Psychology. BCcampus.
  2. Anastasi, A., & Urbina, S. (2014). Psychological Testing (7th ed.). New Delhi: Prentice-Hall of India.
  3. Gregory, R. J. (2014). Psychological Testing: History, Principles and Applications (7th ed.). Pearson Education.
  4. Saad, S., Carter, G. W., Rothenberg, M., & Israelson, E. (1999). Testing and Assessment: An Employer’s Guide to Good Practices. Washington, DC: U.S. Department of Labor, Employment and Training Administration.
  5. Rust, J. (2006). Discussion piece: The psychometric principles of assessment. Research Matters: A Cambridge Assessment publication. Available at: http://www.cambridgeassessment.org.uk/research-matters/

Have feedback on the write-up and/or any questions related to the topic?

Feel free to write to us at shloka.goyal@cocubes.com

About the Author

Shloka Goyal

Shloka is an I/O Psychologist working with Aon's Assessment Solutions and holds a master's degree in I/O Psychology. Her expertise lies in the area of psychometric testing and assessments. Her work focuses mainly on knowledge development and providing scientific, research-based assessment solutions.
