
Readings:

Department of Labor (1998). O*NET version 1.0. http://online.onetcenter.org/

Heneman, H. H., Judge, T. A., & Kammeyer-Mueller, J. (2014). Staffing Organizations (8th ed.). New York: McGraw-Hill. Chapter 4.

Sackett, P. R., & Laczo, R. M. (2003). Job and Work Analysis. In W. C. Borman, D. R. Ilgen, R. J. Klimoski, & I. B. Weiner (Eds.), Handbook of Psychology (pp. 21-37). Hoboken, NJ: John Wiley & Sons.

**Jeanneret, P. & Strong, M. (2003). Linking O*NET job analysis information to job requirement predictors: An O*NET application. Personnel Psychology, 56, 465-492.

**Dierdorff, E.C. & Wilson, M.A. (2003). A meta-analysis of job analysis reliability. Journal of Applied Psychology, 88, 635-646.

Overview:

There is one point on which the law, science, and practice of personnel psychology all agree: the need for a thorough job analysis before developing selection and placement applications. In this module, we will review the science and practice of job analysis. In the process, you will learn an essential set of terms as well as something about the approach taken to analyzing jobs.

Additional terms:

Criterion development is the first step in developing and evaluating a selection system. It is the process of establishing conceptual and operational definitions for essential job functions. Conceptual definitions usually include clear verbal descriptions of performance dimensions. Operational definitions consist of the measures used to identify the level of individual performance in a reliable and accurate fashion. For example, for the job of graduate student, an important performance dimension would be completing important assignments thoroughly and accurately in a timely fashion. This is a statement of the conceptual definition of a performance criterion. The operational definition of this criterion might include several indicators, including the professor’s evaluation of the accuracy and thoroughness of the assignment, as well as a computer-based indicator of when the assignment was submitted. It should be noted that, since performance is multi-dimensional, there will usually be multiple criterion measures for any job (and often for a single conceptual criterion).

Criterion development is accomplished as part of the job analysis process. After developing a list of task statements for a job, job analysts usually distribute surveys to job incumbents asking their views about the importance of each task statement to the job. Subject Matter Experts (SMEs) then make judgments about the importance of various Knowledge, Skills, Abilities, and Other characteristics (KSAOs) to the accomplishment of essential job functions. These KSAOs are then used to develop selection devices, and clusters of tasks (duties) and KSAOs are used to develop criterion dimensions. Measures of criterion performance are then used to evaluate the predictive accuracy of selection devices. Thus, criterion development is essential to criterion-based validation of selection devices.
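
To make the link between criterion development and criterion-based validation concrete, here is a minimal computational sketch. It is not drawn from the readings; the scores, variable names, and sample size are invented purely for illustration. The point is simply that a criterion-related validity coefficient is the correlation between predictor scores (built from the KSAOs) and criterion measures (built from the job analysis).

```python
# Minimal sketch (not from the readings): criterion-related validation links a
# selection device (predictor) to a criterion measure derived from job analysis.
# All scores below are invented for demonstration only.
import numpy as np

# Hypothetical predictor scores (e.g., a work-sample test built from KSAOs)
predictor = np.array([72, 85, 90, 60, 78, 95, 66, 88])
# Hypothetical criterion scores (e.g., supervisor ratings of an essential duty)
criterion = np.array([3.1, 4.0, 4.5, 2.8, 3.6, 4.8, 3.0, 4.2])

# The criterion-related validity coefficient is simply the correlation
# between predictor scores and criterion performance.
validity = np.corrcoef(predictor, criterion)[0, 1]
print(f"Criterion-related validity estimate: {validity:.2f}")
```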

Important terms from readings:

Job, position, task, KSAO, task- and KSAO-oriented job analysis, position description, job specifications, task statement, task questionnaire, SME, task-KSAO linkage

———————

QUESTION

For this assignment, I want you to conduct a “mini job analysis” for the job you currently hold or the job you have most recently had. This assignment is worth 15 points: the first section is worth 10 points, and the second section is worth 5 points.

For this assignment, consider yourself a subject matter expert (SME) and do the following:

1. Construct a job requirements matrix that includes:

Specific tasks

You must have at least 10 specific tasks. Make sure your task statements are written correctly. Heneman and Judge provide very nice examples of techniques for writing task statements. O*NET statements do not go into as much depth as do the task statements that Heneman and Judge present. I would prefer that, for this part of the assignment, you model your work after Heneman and Judge’s suggestions rather than after O*NET.

Task dimensions / duties

You must have at least 3 task dimensions / duties that encompass the specific tasks.

Importance (indicate what your scale is, whether it is 0-100, percent of time spent on task, relative amount of time spent rated on a Likert-type format, etc. Just be sure you are clear about how you are operationalizing the scale)
Knowledge (use O*NET knowledge areas – listed in H&J and on O*NET site)
Skills (use O*NET skills – listed in H&J and on O*NET site)
Abilities (use O*NET skills – listed in H&J and on O*NET site)
The importance of each of the KSAs (indicate what your scale is, whether it is 0-100 or whether the KSA is needed at entry versus trainable, an issue raised in the Jones et al. article; just be sure you are clear about how you are operationalizing the scale)
The context in which the job occurs (use O*NET work context descriptors – listed in H&J and on the O*NET site)

Hint: Use Heneman & Judge’s (2009) example as your guide on how to format the matrix, but be sure to look at O*NET because there is a lot of useful information and many examples of task statements and KSAs.

2. Create a job description for your current or most recent job. Although there is no standard format for a job description, for this assignment it must include the following:

Job Title
Job Summary
Essential functions (duties linked with tasks)
Qualifications
Job Context

You can add more to the job description, but you must at least have the above items. Also, the job description should be on its own page, be no more than one page, and be aesthetically pleasing.


Journal of Applied Psychology, 2003, Vol. 88, No. 4, 635–646
Copyright 2003 by the American Psychological Association, Inc.
DOI: 10.1037/0021-9010.88.4.635
A Meta-Analysis of Job Analysis Reliability
Erich C. Dierdorff and Mark A. Wilson
North Carolina State University
Average levels of interrater and intrarater reliability for job analysis data were investigated using
meta-analysis. Forty-six studies and 299 estimates of reliability were cumulated. Data were categorized
by specificity (generalized work activity or task data), source (incumbents, analysts, or technical experts),
and descriptive scale (frequency, importance, difficulty, time-spent, and the Position Analysis Questionnaire). Task data initially produced higher estimates of interrater reliability than generalized work activity
data and lower estimates of intrarater reliability. When estimates were corrected for scale length and
number of raters by using the Spearman-Brown formula, task data had higher interrater and intrarater
reliabilities. Incumbents displayed the lowest reliabilities. Scales of frequency and importance were the
most reliable. Implications of these reliability levels for job analysis practice are discussed.

Since mandating the legal requirements for the use of job analyses (Uniform Guidelines for Employee Selection Procedures, 1978), the importance of obtaining job analysis data and assessing the reliability of such data has become a salient issue to both practitioners and researchers. It has been estimated that large organizations spend between $150,000 and $4,000,000 annually on job analyses (Levine, Sistrunk, McNutt, & Gael, 1988). Furthermore, it appears probable that job analysis data will continue to undergo increasing legal scrutiny regarding issues of quality, similar to the job-relatedness of performance appraisal data, which have already seen a barrage of court decisions during the past decade (Gutman, 1993; Werner & Bolino, 1997). Considering the widespread utility implications, legal issues, and organizational costs associated with conducting a job analysis, it would seem safe to assume that the determination of the general expected level of job analysis data reliability should be of primary importance to any user of this type of work information.

Prior literature has lamented the paucity of systematic research investigating reliability issues in job analysis (Harvey, 1991; Harvey & Wilson, 2000; Morgeson & Campion, 1997; Ployhart, Schmitt, & Rogg, 2000). Most research delving into the reliability and validity of job analysis has been in a search for moderating variables of individual characteristics, such as demographic variables like sex, race, or tenure (e.g., Borman, Dorsey, & Ackerman, 1992; Landy & Vasey, 1991; Richmann & Quinones, 1996) or other variables like performance and cognitive ability (e.g., Aamodt, Kimbrough, Keller, & Crawford, 1982; Harvey, Friedman, Hakel, & Cornelius, 1988; Henry & Morris, 2000). The overall conclusions of these research veins have been mixed, with some showing significant evidence of moderation, and others displaying none. It is interesting to note that only recently have the definitions of reliable and valid job information received directed attention and discourse (Harvey & Wilson, 2000; Morgeson & Campion, 2000; Sanchez & Levine, 2000). More recent research has tended to frame the quality of job analysis data through views ranging from various validity issues (Pine, 1995; Sanchez & Levine, 1994), to potential social and cognitive sources of inaccuracy (Henry & Morris, 2000; Morgeson & Campion, 1997), to the merits of job analysis and consequential validity (Sanchez & Levine, 2000), and to an integrative approach emphasizing both reliability and validity examinations (Harvey & Wilson, 2000; Wilson, 1997). As an important component of data quality, we sought to specifically examine the role of reliability in relation to job analysis data quality.
Purpose
The principal purpose of this study was to provide insight into
the average levels of reliability that one could expect of job
analysis data. Coinciding with this purpose were more specific
examinations of the reliability expectations given different data
specificity, various sources of data, variety of descriptive scales,
and techniques of reliability estimation. The hope embedded in
estimating average levels of reliability was that these data may in
turn inspire greater attention to the reliability of job analysis data,
as well as be used as reference points when examining the reliability of such data. We feel that not enough empirical attention
has been paid to this issue, and that the availability of such
reliability reference points could be of particular importance to
practitioners conducting job analyses. To date, no such estimates
have been available, and practitioners have had no means of
comparison with which to associate the reliability levels they may
have obtained. Moreover, elucidation of the levels of reliability
across varying data specificity, data sources, and descriptive scales
would provide useful information regarding decisions surrounding
the method, sample, format, and overall design of a job analysis
project.
Scope and Classifications
Work information may range from attributes of the work performed to the required attributes of the workers themselves. Unfortunately, this common collective conception of work information (job-oriented vs. worker-oriented) can confound two
distinctive realms of data. Historically, Dunnette (1976) described
these realms as “two worlds of human behavioral taxonomies” (p.
477). Dunnette’s two worlds referred to the activities required by
the job and the characteristics of the worker deemed necessary for
successful performance of the job. More recently, Harvey and
Wilson (2000) contrasted “job analysis” versus “job specification,”
with the former collecting data about work activities and the latter
collecting data describing worker attributes presumably required
for job performance. The present study focused only on reliability
evidence obtained through data that described the activities performed within a given work role (i.e., job analysis). This parameter
allowed the study’s investigations to examine the reliability of data
that carry the feasibility of verification through observation, as
opposed to latent worker attributes typically described by job
specification data.
The primary classification employed by the present study delineated job analysis data by two categories of specificity: task and
general work activity (GWA). These classifications were not
meant to be all-inclusive but rather were meant to capture the
majority of job analysis data. Task-level data were defined as
information that targets the more microdata specificity (e.g.,
“cleans teeth using a water-pick” or “recommends medication
treatment schedule to patient”). In contrast, GWA-level data were
defined similarly to the description offered by Cunningham,
Drewes, and Powell (1995), portraying GWAs as “generic descriptors,” including general activity statements applicable across a
range of jobs and occupations (e.g., “estimating quantity” or “supervising the work of others”). An important caveat to data inclusion was that only GWAs relating to the work performed within a
job were used, thus excluding what have been referred to as
“generalized worker requirements” such as knowledge, skills, and
abilities (KSAs; McCormick, Jeanneret, & Mecham, 1972). By
separately coding tasks and GWAs, the present study allowed an
investigation of job-analysis reliability relative to the specificity
domain of the collected data. Prior literature has suggested that
job-analysis data specificity may affect the reliability of such data
(Harvey & Wilson, 2000; K. F. Murphy & Wilson, 1997), with
more specific data showing higher reliability levels. Moreover,
with the increasing prevalence of “competency” modeling, which
in part incorporates more general levels of behavioral information
(Schippmann, 1999), as well as the recent push to more generic
activities for purposes of job and occupation analysis (Cunningham, 1996), the separate examination of GWAs allowed for interpretative comparisons with increasingly prevalent and contemporary job and occupation analysis approaches.
In addition to data specificity, the present study incorporated a
classification for the source from which the data were generated.
Sources of job-analysis information were classified into three
groupings: (1) incumbents, (2) analysts, and (3) technical experts.
Incumbent sources referred to job information derived from jobholders. These data were usually collected through self-report
surveys and inventories. Analyst derived job information was from
nonjobholder professional job analysts. These data were generally
gathered through methods such as observation and interviewing
and were then used to complete a formal job-analysis instrument
(i.e., Position Analysis Questionnaire [PAQ]). The third source
group, technical experts, captured data obtained through individuals defined specifically as training specialists, supervisors, or
higher level managers (Landy & Vasey, 1991). Because many
technical experts can also be considered job incumbents, this
designation was reserved only for data that were explicitly described as being collected from technical experts, supervisors, or
some other “senior level” source. By source-coding reliability
evidence, analyses could reveal any changes in the magnitude of
reliability estimates in relation to these common sources of job analysis data. Prior empirical investigation has suggested the possibility of differential levels of reliability across various classifications of respondents (Green & Stutzman, 1986; Henry & Morris,
2000; Landy & Vasey, 1991), such as performance level of the
incumbent and various demographic characteristics of subject matter experts. The present research sought to compare the reliability
levels across sources rather than only within a given source as in
previous research.
A third classification was used to categorize the type of descriptive scale upon which a job was analyzed. Some common examples of descriptive scales are time spent on task, task importance,
and task difficulty (Gael, 1983). Past research has suggested that
the variety of scales used in job analysis yield different average
reliability coefficients (Birt, 1968). For instance, scales of frequency of task performance and task duration have displayed
reliabilities ranging from the .50s to the .70s (McCormick &
Ammerman, 1960; Morsh, 1964). Difficulty scales have generally
been found to have lower reliabilities than other descriptive scales,
with estimates ranging from the .30s to the .50s (McCormick &
Ammerman, 1960; McCormick & Tombrink, 1960; Wilson, Harvey, & Macy, 1990). Thus, data were coded for the commonly
used descriptive scales of frequency, importance, difficulty, and
time spent. GWA data derived from the PAQ (McCormick, Jeanneret, & Mecham, 1972), which is arguably the most widely used
and researched generic job analysis instrument, were additionally
coded.
To allow for a comparative analysis of reliability across the
three aforementioned classifications, it was necessary to group
coefficients into appropriate estimation categories. Therefore, reliability estimates were delineated by their computational approach. Two approaches commonly used in job analyses to estimate reliability were chosen as the categories employed by this
study. Both types of reliability estimation are discussed in the
ensuing section.
Types of Reliability Estimates Used in Job Analysis
The two most commonly used forms of reliability estimation are
interrater and intrarater reliability (Viswesvaran, Ones, & Schmidt,
1996). In the context of job analysis practice, interrater reliability
seems to be the more prevalent of the two techniques. Interrater
reliability identifies the degree to which different raters (i.e.,
incumbents) agree on the components of a target work role or job.
Interrater reliability estimations are essentially indices of rater
covariation. This type of estimate can portray the overall level of
consistency among the sample raters involved in the job analysis
effort. Typically, interrater reliability is assessed using either Pearson correlations or intraclass correlations (ICC; see Shrout &
Fleiss, 1979, for a detailed discussion). Most previous empirical
literature has focused on the intrarater reliability of job analysis
data. Two forms of intrarater reliability commonly employed
within job analysis are repeated item and rate–rerate of the same
job at different points in time. Both of these estimates may be
viewed as coefficients of stability (Viswesvaran et al., 1996). The
repeated items approach can display the consistency of a rater
across a particular job analysis instrument (i.e., task inventory),
whereas the rate–rerate technique assesses the extent to which
there is consistency across two administrations. Intrarater reliability is typically assessed using Pearson correlations.
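
As a rough illustration of the two families of estimates just described, the sketch below computes a single-rater interrater reliability as the mean pairwise Pearson correlation among raters, steps it up to the reliability of the k-rater composite with the Spearman-Brown formula, and computes a rate-rerate (intrarater) coefficient as the correlation between two administrations. The rating matrix and the rerating vector are invented, and this is one common operationalization rather than the specific procedure used by the studies cumulated here (ICCs, per Shrout & Fleiss, are another standard choice).

```python
# A rough sketch (hypothetical data): interrater reliability as the mean
# pairwise Pearson correlation among raters, stepped up to the reliability of
# the k-rater composite with the Spearman-Brown formula; intrarater
# (rate-rerate) reliability as the correlation between two administrations.
import numpy as np
from itertools import combinations

# rows = tasks, columns = raters (e.g., incumbents rating task importance)
ratings = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 4],
    [1, 2, 1],
    [3, 4, 4],
], dtype=float)

k = ratings.shape[1]

# Mean pairwise Pearson correlation = single-rater interrater reliability
pairs = [np.corrcoef(ratings[:, i], ratings[:, j])[0, 1]
         for i, j in combinations(range(k), 2)]
r_single = np.mean(pairs)

# Spearman-Brown: reliability of the mean rating across k raters
r_composite = (k * r_single) / (1 + (k - 1) * r_single)

# Intrarater (rate-rerate): correlate one rater's ratings at time 1 and time 2
time1 = ratings[:, 0]
time2 = np.array([4, 2, 5, 2, 3], dtype=float)  # hypothetical rerating
r_stability = np.corrcoef(time1, time2)[0, 1]

print(f"single-rater interrater r = {r_single:.2f}")
print(f"{k}-rater composite r = {r_composite:.2f}")
print(f"rate-rerate r = {r_stability:.2f}")
```
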
Research Questions
We examined reliability from previously conducted job analyses
by using the four aforementioned classifications. To explore this
purpose, we used meta-analytic procedures. The purpose of these
meta-analyses was to estimate the average reliability that one
could expect when gathering work information through a job
analysis at different data specificities from different sources and
when using various descriptive scales. In short, we sought to
investigate the following questions: What are the mean estimates
of reliability for job analysis information, and how do these estimates differ in magnitude across data specificity, data source, and
descriptive scale? Are the levels of interrater reliability higher or
lower than levels of intrarater reliability? Finally, does the source
of the job analysis information or the choice of descriptive scale
affect the magnitudes of reliability estimates?
Method
Database
We conducted a literature search using standard and supplementary
techniques in an attempt to lessen the effect of the “file drawer” problem—
the increased probability of positive findings in published literature
(Rosenthal, 1979). In the case of job analysis research, this could result in
unrealistically high estimations of reliability. In addition, many empirical
studies about or using job analysis data only report reliability estimations
as side bars to the main topic, thus making it more difficult to locate these
sources of reliability data. Using the standard technique, we used the
Internet and other computer-based resources. Some examples of these
sources were PsycINFO, PsychLit, job analysis–related Web sites and
listserves, the National Technical Information Services database, as well as
other online and offline library databases. Within these sources, we used
keyword searches with terms such as “job analysis, job analysis accuracy,
job analysis reliability, work analysis, and job information accuracy.” The
majority of reliability data that we used in this study were gathered with
this method. The supplementary technique, meant to expand the breadth of
the literature search, used both ancestry and descendency approaches
(Cooper, 1984), as well as correspondence with researchers in the field of
job analysis. The supplementary approach produced a substantial amount
of reliability data in the form of technical reports and unpublished job
analyses. Table 1 displays descriptive statistics of the included studies.
Analyses
To be included in the meta-analyses, studies were first required to
describe the approach used to assess reliability of the job data. Those that
did not assess reliability according to the aforementioned estimation types
were excluded. Second, the sample size used in the reliability estimation was required. Third, studies were required to assess the requirements of the job itself, not merely attributes of the workers.

[Table 1. Descriptive Summary of Collected Data: the number of studies and the number of reliability estimates, broken down by specificity (task, GWA), source (incumbent, analyst, technical expert), and scale (frequency, importance, difficulty, time spent, PAQ) for both interrater and intrarater reliability, and by publication type (journal, technical report, book, dissertation). GWA = generalized work activity; PAQ = Position Analysis Questionnaire. Values omitted.]
Once the pool of studies was assembled, we coded the data for the
purposes of a comparative analysis. Coding allowed for us to conduct
separate meta-analyses within each of the study’s classifications, hence
making the average correlation generated within each grouping more
empirically justified. Two raters independently coded the gathered studies
according to the four aforementioned classifications. Interrater agreement
of study coding was 98%. Disagreements were resolved through discussion, and no additional exclusions were necessary.
We conducted a meta-analysis correcting only for sampling error for
each of the distributions gleaned from the study’s classifications. When
cumulating reliability across several past empirical studies, it may be
necessary to determine whether a need to adjust results from various
studies to a common length of items or number of raters (interrater
reliability) or to a common time interval (intrarater reliability) is required.
Two available options were to use the Spearman-Brown formula to bring
all estimates to a common length or to use previous research investigating
the functional relationship between time intervals and job analysis reliability. The present study conducted meta-analyses both with and without the
Spearman-Brown corrections of individual reliability estimates. Without
evidence of the functional relationship affecting intrarater reliability of job
analysis data, the only statement able to be proffered is that as the time
interval increases, reliability generally decreases (Viswesvaran et al.,
1996). Thus, no meta-analytic corrections were made to bring estimates of
intrarater reliability to a common time interval. However, intrarater reliability in job analysis can be derived from either a rate–rerate or a repeated
item approach. Therefore, to display the potential effects of time and allow
comparison between these two common forms of intrarater reliability,
separate meta-analyses for repeated item and rate–rerate reliabilities were
conducted. The mean time interval for rate–rerate data from the gathered
studies was 6.5 weeks and had a range of 1–16 weeks. Rate–rerate data
comprised 84% of the collected intrarater reliability data and repeated item
data made up the remaining 16%.
As for a common length of items or number of raters, the body of
literature on job analysis procedures does not concede a particular recommendation. Suggestions for item length and number of raters vary depending on the organization, project purposes, and the practical limitations
of the project (Levine & Cunningham, 1999). Therefore, to portray the
potential magnitude change in job analysis reliability as the number of
raters fluctuates, we used the Spearman-Brown formula to bring estimates
of reliability to several equal numbers of raters (e.g., 5, 15, and 25 raters).
As for a common length of items, the Spearman-Brown was similarly used
to bring the number of items to several common item lengths (e.g., 100,
200, and 300 items). Because of the smaller number of items typically
duplicated in the repeated item approach as opposed to the rate–rerate
approach to intrarater reliability (i.e., small subset of items vs. an entire
instrument), estimates for these meta-analyses were corrected to the same
equal numbered rater sets, but unlike the previous meta-analyses the
correction for item length was to 25 items only. The rationale for designating each of these particular rater and item sets was to mirror specifications typically found in job analysis projects. However, we do recognize
that these numbered sets are somewhat arbitrary, and others are clearly
possible.
For any meta-analysis using reliability estimates corrected with the
Spearman-Brown formula, all corrections were applied to the individual
reliability estimates. Operationally, individual reliability estimates were
first corrected to bring the estimates to equal numbers of raters. Once the
individual reliability estimates were corrected for number of raters, they
were then corrected for number of items. The individual reliability estimates derived from these various corrections were then used as input for
ensuing meta-analyses.
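
The two-step correction described above can be summarized in a short sketch. The function below is the standard Spearman-Brown prophecy formula; the observed reliability, the original numbers of raters and items, and the target values of 15 raters and 200 items are hypothetical and simply mirror the order of operations reported here (raters first, then items).

```python
# Sketch of the two-step Spearman-Brown adjustment described above: an observed
# reliability is first brought to a common number of raters, then to a common
# number of items. The specific values are hypothetical.

def spearman_brown(r: float, ratio: float) -> float:
    """Project reliability r for an instrument lengthened by `ratio`
    (ratio = new length / original length, in raters or items)."""
    return (ratio * r) / (1 + (ratio - 1) * r)

# Example: an interrater estimate of .55 obtained from 8 raters on a
# 150-item task inventory, corrected to 15 raters and 200 items.
r_obs = 0.55
r_raters = spearman_brown(r_obs, 15 / 8)       # step 1: common number of raters
r_final = spearman_brown(r_raters, 200 / 150)  # step 2: common number of items
print(f"corrected reliability: {r_final:.2f}")
```
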
Similar to past research cumulating reliability estimates (Viswesvaran et
al., 1996), at least four estimates in a given distribution were needed to
perform a meta-analysis. For each meta-analysis conducted, we computed
the sample-size weighted mean, observed standard deviation, and residual
standard deviation. We also computed the unweighted mean and standard
deviation, which do not account for the sample sizes of included estimates.
Because each reliability coefficient was weighted, the sample-size
weighted mean provided the best estimate of the average reliability for a
given distribution, whereas the unweighted mean ensured that the results
were not skewed by a few large sample estimates. It is important to note
that as a general definition, an intrarater reliability estimate is computed as
a sample size of one, and thus sample-size weighted mean intrarater
reliability may seem incorrect. However, all of the collected intrarater
reliability data were in the form of averages of multiple single-rater
reliabilities. Therefore, for intrarater reliability, the sample size of a given
averaged intrarater reliability estimate served as the meta-analytic samplesize weight.
Using the results from the statistics described above, we assessed the
sampling error variance associated with the mean of the reliability by
dividing the variance by the number of estimates averaged (Callender &
Osburn, 1988). An 80% confidence interval was calculated from the
sampling error of the mean around each mean reliability estimate. We also
computed 80% credibility intervals for both interrater and intrarater reliabilities. We calculated these intervals using the sampling error of the mean
correlation as derived from the residual standard deviation. We calculated
the residual standard deviation as the square root of the difference between
observed and sampling error variance. Using the residual standard devia-
tion and the mean reliability correlation to form the 80% credibility
interval, we estimated the reliability below which the population reliability
value is likely to fall with the chance of .90. The credibility interval refers
to the estimated distribution of the population values, not observed values,
which are affected by sampling error (Hunter & Schmidt, 1990).
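
The sketch below illustrates the general shape of these bare-bones, sampling-error-only computations: a sample-size weighted mean, a sampling error variance for that mean, an 80% confidence interval, a residual standard deviation, and an 80% credibility interval. The reliability estimates and sample sizes are invented, and the formulas follow the verbal description above only loosely; the exact procedures in the cited sources (Callender & Osburn, 1988; Hunter & Schmidt, 1990) differ in detail.

```python
# A loose sketch of a sampling-error-only meta-analytic summary of reliability
# estimates. Input: hypothetical reliability estimates (r) with sample sizes (n).
import numpy as np

r = np.array([0.72, 0.81, 0.65, 0.77, 0.58, 0.80])  # hypothetical reliabilities
n = np.array([120, 45, 200, 60, 150, 90])            # hypothetical sample sizes
k = len(r)

r_wt = np.average(r, weights=n)                    # sample-size weighted mean
var_obs = np.average((r - r_wt) ** 2, weights=n)   # weighted observed variance

# Sampling error variance of the mean: observed variance / number of estimates
var_err_mean = var_obs / k
se_mean = np.sqrt(var_err_mean)

# 80% confidence interval around the weighted mean (z = 1.28 for 80%)
ci_80 = (r_wt - 1.28 * se_mean, r_wt + 1.28 * se_mean)

# Residual SD: sqrt(observed variance - sampling error variance), floored at 0,
# then an 80% credibility interval for the distribution of population values
var_res = max(var_obs - var_err_mean, 0.0)
sd_res = np.sqrt(var_res)
cred_80 = (r_wt - 1.28 * sd_res, r_wt + 1.28 * sd_res)

print(f"weighted mean r = {r_wt:.2f}, 80% CI = ({ci_80[0]:.2f}, {ci_80[1]:.2f})")
print(f"residual SD = {sd_res:.2f}, "
      f"80% credibility interval = ({cred_80[0]:.2f}, {cred_80[1]:.2f})")
```
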
Results
Interrater Reliability
The results for the interrater reliability meta-analyses are reported in the left half of Table 2. The sample-size weighted mean reliability estimate for task-level job analysis data was .77 (n = 24,656; k = 119). The sample-size weighted mean reliability estimate for GWA-level job analysis data was .61 (n = 9,999; k = 95). These mean interrater reliability estimates can be seen as the
average values one could expect when collecting job analysis
information at the respective data specificity. Also shown are the
unweighted mean estimates, standard deviations, and the 80%
confidence and credibility intervals. Table 2 also provides the
results of meta-analyses for interrater reliability classified by
source and descriptive scale nested within data specificity. As can
be seen, there were insufficient data to perform meta-analyses for
GWA data from technical experts and on the scales of importance,
difficulty, and time spent. Note that results in Table 2 are not
corrected for item length or number of raters.
Table 3 displays the sample-size weighted mean interrater reliabilities corrected to an equal number of raters and items using the
Spearman-Brown formula. Similar to the uncorrected estimates,
tasks generally have higher interrater reliability than do GWAs.
However, for smaller numbers of raters and items (i.e., 5 raters at
100 and 200 items) the interrater reliability for GWA data is
slightly higher than for task data. As for data source, analysts tend
to show the highest interrater reliability, and incumbents the lowest, regardless of data specificity. Both incumbents and analysts
did display higher interrater reliability for tasks than GWAs,
although the estimates for larger numbers of incumbents and items
were quite comparable across data specificity (e.g., .74 for tasks
vs. .73 for GWAs). For interrater reliability in the category of
descriptive scale, only scales of frequency had sufficient data to
allow comparison across specificity. Here, frequency ratings of
GWAs had higher interrater reliability than ratings for tasks. These
results should be interpreted with caution, however, due to the
small number of reliability estimates (k = 4). Specifically with
task data, scales of importance showed the highest levels of interrater reliability. Interestingly, data from scales of difficulty were
not the lowest in reliability magnitudes as with the uncorrected
estimates. Taken collectively, the evidence from Table 3 generally
supports prior suggestions and findings of differential interrater
reliability of job analysis data across data specificity, data source,
and descriptive scale.
Intrarater Reliability
The right half of Table 2 displays the results of the meta-analyses conducted for intrarater reliabilities of job analysis data. The sample-size weighted mean reliability estimate for task-level job analysis data was .68 (n = 7,392; k = 49). The sample-size weighted mean reliability estimate for GWA-level job analysis data was .73 (n = 3,096; k = 36). Again, these mean intrarater reliability estimates can be seen as the average values one could expect at the respective data specificity.

[Table 2. Meta-Analyses of Job Analysis Interrater and Intrarater Reliabilities. For each data category (specificity, source, and scale), the table reports k (number of reliabilities included in the meta-analysis), n, sample-size weighted and unweighted mean reliabilities and standard deviations, 80% confidence intervals, residual standard deviations, and 80% credibility intervals. GWA = generalized work activity; PAQ = Position Analysis Questionnaire. Values omitted.]

[Table 3. Meta-Analyses of Interrater Reliability Estimates Using Spearman-Brown Corrections. Mean reliabilities and standard deviations are reported after correction to 5, 15, and 25 raters at 100, 200, and 300 items, by data category. Only categories with sufficient data to perform a meta-analysis are shown. GWA = generalized work activity; PAQ = Position Analysis Questionnaire. Values omitted.]

[Table 4 (caption not recovered; apparently the intrarater counterpart to Table 3, reporting intrarater reliability estimates corrected with the Spearman-Brown formula and referenced in the text below). Only categories with sufficient data to perform a meta-analysis are shown. Values omitted.]

From the values given in Tables 3 and 4, intrarater reliabilities
for task data were higher than their interrater reliability counterparts. This suggests that ratings of more specific data may exhibit
higher levels of stability than they will levels of consistency. Thus,
rating tasks may foster information that remains stable for raters
more so than fostering a common rating consensus across raters.
Contrarily, ratings of GWA data had higher interrater reliabilities
than intrarater estimates. This evidence suggests that at the more
general level of activity, ratings are more consistent than they are
stable among raters. Hence, ratings of GWAs appear to promote
consensus more so than stability. When reviewing the reliabilities
across data sources, incumbents seem to provide ra