
Research in Developmental Disabilities 30 (2009) 847–855

Reliability and responsiveness of the Bruininks–Oseretsky Test of Motor Proficiency-Second Edition in
children with intellectual disability
Yee-Pay Wuang, Chwen-Yng Su *
Kaohsiung Medical University, Department of Occupational Therapy, 100 Shih-Chuan 1st Road, Kaohsiung 807, Taiwan

ARTICLE INFO

Article history:
Received 16 December 2008
Accepted 17 December 2008

Keywords:
Intellectual disability
Reliability
Responsiveness

ABSTRACT

We examined the internal consistency, test–retest reliability, and responsiveness of the
Bruininks–Oseretsky Test of Motor Proficiency-Second Edition (BOT-2) for children with intellectual
disabilities (ID). One hundred children with ID aged 4–12 years were tested on 3 separate occasions:
two baseline measurements with a 2-week interval before the intervention, and a follow-up measurement
after a 4-month pediatric rehabilitation program. The test–retest reliability and internal
consistency of the total scale were excellent, with an ICC of 0.99 (95% confidence interval
0.99–1.00) and an α of 0.92. Responsiveness was acceptable for all BOT-2 measures except the balance
subtest. Used as cut-off criteria, the minimal detectable change (MDC) and the minimal important
difference (MID) values yielded lower sensitivity but higher specificity than the ROC-derived
cut-offs. Implications for interpreting these responsiveness indices are discussed.
© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Motor skill deficits are common in children with intellectual disabilities (ID) and affect the child’s
participation or performance in activities in school, at home, and in the community (Dolva, Coster, &
Lilja, 2004). As a result, therapeutic intervention is particularly important to enhance motor function
and promote school success. To monitor the effectiveness of an intervention, it is crucial to employ
reliable and sensitive measures able to yield consistent results across repeated measurements and
detect subtle changes in motor function.

* Corresponding author. Tel.: +886 7 3121101x2650; fax: +886 7 3215845.
E-mail address: cysu@cc.kmu.edu.tw (C.-Y. Su).

0891-4222/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ridd.2008.12.002

The Bruininks–Oseretsky Test of Motor Proficiency (BOTMP) (Bruininks, 1978) is one of the most
widely used measures for evaluating motor deficits in children and adolescents with disabilities
such as cerebral palsy, mental retardation, developmental coordination disorder, attention deficit
hyperactivity disorder, and autism (Cairney et al., 2008; Dewey, Cantell, & Crawford, 2007; Gordon,
Schneider, Chinnan, & Charles, 2007; Wuang, Wang, Huang, & Su, 2008). The recently published second
edition (BOT-2; Bruininks & Bruininks, 2005) differs from the first in several respects: the response
speed subtest was deleted; the visual–motor control subtest was divided into two new subtests, each
lengthened with new items; the upper-limb speed and dexterity subtest was renamed manual dexterity,
with items deleted and revised; items were revised and added in the running speed and agility and
strength subtests; and the subtests were rearranged to ease administration and minimize examinee
fatigue. The BOT-2 is composed of 8 subtests grouped into four broad classes of motor functioning on
the basis of the muscle groups and limbs involved in the movements. The items are scored on ordinal
scales with different numbers of response categories, ranging from four categories (0–3) to 13
categories (0–12). Higher categories indicate more, and lower categories less, of the trait being
measured.
The BOT-2 has moderate to high inter-rater and test–retest reliabilities in healthy children and
evidence for criterion validity between BOT-2 scores and other measures of motor performance has
been reported (Bruininks & Bruininks, 2005). A number of investigators have studied the effectiveness
of different treatment strategies for children with ID using the BOTMP or BOT-2. For example, Wang
and Chang (1997) noted significant pre–post differences in several items of the BOTMP balance
subtest after a 6-week jumping skill training program. Wuang et al. (in press) found statistically
significant improvements in BOTMP scores, brought about by different intervention strategies (i.e.
sensory integration, neuro-developmental treatment, and perceptual motor therapy). However, no study
has addressed the reliability and responsiveness of the BOT-2 in this population. Reliability, the
ratio of true score variance to observed score variance, requires that the test show little
variability between repeated measurements of children whose motor status has not changed.
Responsiveness is viewed as a measure of longitudinal validity and describes the ability of an
instrument to accurately detect important changes in performance over time (Stratford, Binkley, &
Riddle, 1996). Knowledge of these measurement properties is essential in the assessment or follow-up
of children with disabilities. From them one can derive the minimum detectable change (MDC), which
indicates whether an observed change is statistically reliable, and the minimal important difference
(MID), the smallest change between two scores that is considered important from the clients’ or
clinicians’ perspectives (de Vet et al., 2006). The MDC and MID are both important reference points
on the scale of the measurement instrument; they assist in the interpretation of score changes after
treatment and thereby provide relevant information for clinical decision making.
Thus, further investigation of the psychometric qualities of the BOT-2 would seem reasonable and
warranted to support its use in research and clinical settings.
The primary purpose of the present study was to examine the internal consistency, test–retest
reliability, and responsiveness of the BOT-2 in children with ID. A second purpose was to estimate the
MDC and MID for the BOT-2.

2. Method

2.1. Participants

Children with ID were recruited from three public special schools, 2 non-profit agencies serving
disabled citizens, 2 child development centers and 4 hospitals in southern Taiwan. In these settings,
children were selected for participation if they met the following criteria: they (1) were aged between
4 and 12 years; (2) had a diagnosis of ID defined by a full-scale IQ of 70 or less obtained on an
individually administered test of intelligence together with substantial limitations in adaptive
functioning; (3) were without serious emotional or behavioral disturbances; and (4) participated in
physical or occupational therapy programs at the time of research. Excluded were children who
carried coexisting autism, cerebral palsy, blindness, and deafness in an attempt to minimize
confounding of data. Also excluded were children with previous history of neurological disorders such
as traumatic brain injury, muscular dystrophies, and epilepsy.
A total of 133 children were eligible for inclusion and consented to participate. Of these, 33 (24.8%)
children dropped out of the study for various reasons. In the final sample (n = 100), 41% were female,
and the average age was 82.9 months (S.D. = 24.9, range = 48–124 months). Sixty-four children were
classified as having mild ID (IQ = 55–70), and the other 36 children were classified as having moderate
to severe ID (IQ = 25–54). The child’s primary caregivers (i.e. father, mother, or grandparent) had an
average of 12.4 years of education (S.D. = 3.3, range = 6–20 years). The occupations of the primary
caregivers were categorized into four major groups: professional or central administration (n = 10),
semi-professional workers (n = 21), technical workers (n = 42), and semi-technical or non-technical
workers (n = 27) (Wang et al., 1998).

2.2. Measures

2.2.1. BOT-2
The BOT-2 assesses proficiency in four motor-area composites (Bruininks & Bruininks, 2005). Fine
manual control composite is divided into fine motor precision (FMP) (7 items, score range = 0–41) and
fine motor integration (FMI) (8 items, score range = 0–40) subtests that measure the motor skills
involved in writing and drawing tasks requiring precise control of finger and hand movements.
Manual coordination composite is classified into manual dexterity (MD) (5 items, score range = 0–45
points) and upper-limb coordination (ULC) (7 items, score range = 0–39 points) subtests that evaluate
reaching, grasping, and object manipulation, with the emphasis on speed, dexterity, and coordination
of upper extremities. Body coordination composite is grouped into bilateral coordination (BLC) (7
items, score range = 0–24 points) and balance (BAL) (9 items, score range = 0–37 points) subtests that
tap the balance and motor skills required for successful participation in sports and recreational games.
Strength and agility composite is split into running speed and agility (RSA) (5 items, score range = 0–
52 points) and strength (STR) (5 items, score range = 0–42 points) subtests that assess large muscle
strength, running speed, and postural control during walking and running. The four composite scores
are combined to yield a total motor composite score. The average age-adjusted scale scores for
subtests are 15 (S.D. = 5), whereas composites are derived by summing the subtest scale scores and
converting them to a quotient with a mean of 100 and a S.D. of 15.
For the composites, internal consistency reliability coefficients ranged from 0.78 to 0.97, test–
retest coefficients over an interval of 7–42 days ranged from 0.52 to 0.95, and inter-rater reliability
coefficients exceeded 0.92 (Bruininks & Bruininks, 2005). The BOT-2 total composite correlated fairly
well with the Peabody Developmental Motor Scales (Folio & Fewell, 2000) and the Test of Visual Motor
Skills-Revised (Gardner, 1995) (r = 0.62 and 0.73, respectively).

2.2.2. School Function Assessment-Chinese Version (SFA) (Huang, 2008)


The SFA is a judgment-based, criterion-referenced assessment that measures how well students from
kindergarten through grade 6 perform the functional tasks that support their participation in
academic and social school-related activities (Coster, Deeney, Haltiwanger, & Haley, 1998). The SFA
comprises three sections: participation, task supports, and
activity performance. The activity performance section taps performance in school-related functional
activities, such as using school materials, interacting with others, following school rules, and
communicating needs. There are two major categories in the activity performance section: physical
tasks and cognitive/behavioral tasks. The physical tasks performance scale (PTPS) was used as an
external criterion for clinically relevant change in the present study. The PTPS consists of 161 items
(activities) divided into 12 domains, including travel, maintaining and changing positions,
recreational movement, manipulation with movement, using materials, set-up and cleanup, eating
and drinking, hygiene, clothing management, up/down stairs, written work, and computer and
equipment use. The activities are rated on a scale of 1–4, where 1 = does not perform, 2 = partial
performance, 3 = inconsistent performance, and 4 = consistent performance. The reliability and
discrimination accuracy of this Chinese version were similar to those of the original English version
reported by Coster et al. (1998). Internal consistency, measured with coefficient alpha, was
uniformly high (0.94–0.98). Test–retest reliability over intervals of 7–42 days ranged from 0.87 to
0.99 (n = 27). The percentage of correct classification was 100% for learning disabilities, 97.6% for
physical disabilities, and 90.4% for multiple disabilities (Huang, 2005).

2.3. Procedure

This study was approved by the Ethics Committee for Human Experimentation at the Kaohsiung
Medical University Hospital (KMUH). After obtaining informed consent from participants, the
principal investigator invited physical or occupational therapists (n = 6) who treated these children
prior to the study to participate in the intervention stage. All children received a conventional
pediatric rehabilitation program, at least 1 day a week, for 4 months. The conventional program is
client-specific and incorporates four primary sensorimotor training techniques, including neurode-
velopment treatment, sensory integration, motor control, and motor learning.
Children were assessed with the BOT-2 on three occasions. Two baseline measurements (T1 and T2) were
performed 2 weeks apart, with no intervention between them. A third, follow-up measurement (T3) was
made immediately after completion of the 4-month therapeutic program. All BOT-2 testing was performed
by the principal investigator, a certified occupational therapist with 13 years of clinical
experience in pediatric rehabilitation, using the standardized administration procedures specified in
the test manual. To reduce possible experimenter bias, the examiner did not review the child’s scores
from the first assessment before conducting the retests. Testing lasted approximately 1 h, with a
suitable number of breaks to minimize the effects of fatigue and frustration. Children were tested
individually in quiet locations at each child’s school, home, or hospital pediatric occupational
therapy (OT) unit. For each child, the three BOT-2 assessments were made at the same time of day. In
addition, 5 therapists who did not participate in the intervention stage were recruited to complete
the PTPS independently at T1 and T3. These therapists received written and verbal instructions on
how to fill out the PTPS and showed good inter-rater reliability with a senior occupational therapist
(intraclass correlation coefficient [ICC] = 0.97–0.99). Participants were not paid for their
participation in the study.

2.4. Data analysis

2.4.1. Internal consistency


Internal consistency indicates the extent to which items or scales measure the same construct.
Cronbach’s alpha (α) coefficient was calculated for the BOT-2 using the first-assessment (T1) data
from all children. Cronbach’s alphas above 0.7 are generally regarded as acceptable, above 0.8 as
good, and above 0.9 as excellent (Vangeneugden, Laenen, Geys, Renard, & Molenberghs, 2005).
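As an illustration only, and not the authors' analysis code, the coefficient can be sketched in
Python. The function names and the toy item scores are hypothetical; the formula is the standard
α = k/(k − 1) × (1 − sum of item variances / variance of total scores), and the labels follow the
benchmarks quoted above.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row of item scores per child)."""
    k = len(scores[0])                     # number of items
    item_columns = list(zip(*scores))      # transpose: one tuple per item
    sum_item_var = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum_item_var / total_var)


def alpha_label(alpha):
    """Benchmarks cited in the text: >0.7 acceptable, >0.8 good, >0.9 excellent."""
    if alpha > 0.9:
        return "excellent"
    if alpha > 0.8:
        return "good"
    if alpha > 0.7:
        return "acceptable"
    return "below acceptable"  # label chosen here, not taken from the source


# Hypothetical scores for 5 children on 3 items (not study data).
toy_scores = [[2, 3, 3], [1, 1, 2], [3, 3, 3], [0, 1, 1], [2, 2, 3]]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 2), alpha_label(alpha))
```

The sketch uses sample variances (n − 1 in the denominator) throughout; a statistics package may
report slightly different values if it handles missing items or ties differently.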

2.4.2. Test–retest reliability


The stability of the BOT-2 between the first and second measurements was assessed using the
intraclass correlation coefficient (ICC) with a 2-way random effects model, which allows the results
to be generalized to testers not participating in the study. Benchmarks suggested by Shrout and
Fleiss (1979) were used to interpret ICC values (>0.75, excellent reliability; 0.40–0.75,
fair-to-good reliability; <0.40, poor reliability). The standard error of measurement (SEM) was used
to determine the precision of the BOT-2 subtests and composites. The SEM describes the error in
interpreting an individual’s test score and was computed as the baseline standard deviation of the
instrument multiplied by the square root of one minus its reliability coefficient, that is,
SEM = SD × √(1 − ICC), with scale reliability estimated by the ICC. SEM ≤ SD/2 was taken as the
criterion of acceptable precision (Wyrwich, Nienaber, Tierney, & Wolinsky, 1999). The lower the
reliability, the greater the SEM, and the less precise the scale.
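The following Python sketch shows one way these quantities could be computed. It is a hedged
illustration rather than the study's procedure: the text specifies a 2-way random effects model but
not the exact ICC form, so the single-measurement, absolute-agreement form ICC(2,1) of Shrout and
Fleiss (1979) is assumed here, and all function names and data are hypothetical.

```python
from math import sqrt


def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    data: list of rows, one per child, each holding that child's scores on the
    k occasions (k = 2 for a T1/T2 retest design).
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                 # between-children mean square
    jms = ss_cols / (k - 1)                 # between-occasions mean square
    ems = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


def sem(sd_baseline, reliability):
    """Standard error of measurement: SD x sqrt(1 - reliability)."""
    return sd_baseline * sqrt(1 - reliability)


def precision_acceptable(sem_value, sd_baseline):
    """Criterion used in the text: SEM <= SD / 2."""
    return sem_value <= sd_baseline / 2


# Hypothetical T1/T2 scores for 5 children (not study data).
retest = [[6, 7], [9, 9], [4, 5], [8, 8], [5, 5]]
print(round(icc_2_1(retest), 3))

# Hypothetical baseline S.D. of 10 and reliability of 0.96.
print(round(sem(10.0, 0.96), 2), precision_acceptable(sem(10.0, 0.96), 10.0))
```

Because published reliability coefficients are rounded, an SEM recomputed from a rounded ICC will
generally differ somewhat from one computed on the unrounded value.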

2.4.3. Responsiveness
Five different analyses were performed to evaluate the responsiveness of the BOT-2 composites.

2.4.3.1. Effect size (ES). The ES is a standardized measure of change obtained by dividing the mean
change between the initial (T1) and follow-up (T3) measurements by the S.D. of the initial
measurement (Kazis, Anderson, & Meenan, 1989). As a guide to interpreting these values, Cohen (1977)
labeled an effect size ‘small’ if 0.2 ≤ ES < 0.5, ‘moderate’ if 0.5 ≤ ES < 0.8, and ‘large’ if
ES ≥ 0.8.
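A minimal Python sketch of this calculation follows. The function names and toy scores are
hypothetical, and applying Cohen's bands to the magnitude of the ES is a choice made here for
illustration.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change (follow-up minus baseline) divided by the baseline S.D."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def cohen_label(es):
    """Cohen's bands as quoted in the text, applied to the magnitude of ES."""
    size = abs(es)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "below small"  # label chosen here; the text names no band for ES < 0.2


# Hypothetical scores for 5 children (not study data).
t1 = [1, 2, 3, 4, 5]
t3 = [2, 3, 4, 5, 6]
es = effect_size(t1, t3)
print(round(es, 2), cohen_label(es))
```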

2.4.3.2. Standardized response mean (SRM). The SRM was calculated as the average change in scores
between the initial (T1) and follow-up (T3) measurements divided by the S.D. of that change score
(Liang, Fossel, & Larson, 1990). A positive SRM indicates improvement, whereas a negative SRM
indicates deterioration.
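The SRM differs from the ES only in its denominator, as the following Python sketch illustrates. The
function name and toy scores are hypothetical.

```python
from statistics import mean, stdev


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the S.D. of the change scores.

    Raises ZeroDivisionError if every child changes by the same amount,
    because the S.D. of the change scores is then zero.
    """
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical scores for 4 children (not study data).
t1 = [1, 2, 3, 4]
t3 = [2, 2, 5, 5]
print(round(standardized_response_mean(t1, t3), 2))   # positive: improvement
print(round(standardized_response_mean(t3, t1), 2))   # negative: deterioration
```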

2.4.3.3. Minimum detectable change (MDC). The MDC represents the smallest change in score that
reflects real change rather than measurement error (Stratford et al., 1996). The MDC was computed as
1.65 × √2 × SEM. A z-score of 1.65 was chosen to reflect a 90% confidence level, which was considered
acceptable for clinical application to individual children. The proportion of the sample with change
scores exceeding the MDC was calculated for each of the BOT-2 subtest scale scores and composite
standard scores.
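As a worked check of this formula, the short Python sketch below recomputes the MDC from the SEM of
1.79 reported for the total score. The function names and the toy change scores are hypothetical.

```python
from math import sqrt

Z_90 = 1.65  # z-score used in the text for a 90% confidence level


def mdc(sem_value, z=Z_90):
    """Minimum detectable change: z x sqrt(2) x SEM."""
    return z * sqrt(2) * sem_value


def proportion_exceeding(changes, threshold):
    """Share of children whose change score is strictly greater than the threshold."""
    return sum(change > threshold for change in changes) / len(changes)


total_mdc = mdc(1.79)        # SEM of the BOT-2 total score reported in the study
print(round(total_mdc, 2))   # 4.18

# Hypothetical change scores for 4 children (not study data).
print(proportion_exceeding([5, 1, 6, 0], total_mdc))
```

The factor √2 reflects the fact that a change score combines the measurement error of two
assessments.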

2.4.3.4. Minimal important difference (MID). Two methods were used to estimate the MID. In the first,
changes in the BOT-2 scores were related to changes in school-related functional task performance,
using the therapist-assessed mean PTPS score as the anchor. Under this approach, the MID was defined
as the difference in mean BOT-2 change score between children classified as having experienced an
“important improvement” and those classified as “no change” (Sloan, Symonds, Vargas-Chanes, &
Friedly, 2003). The “improved” group was defined as those with a mean change score on the PTPS
greater than or equal to 1, and the “no change” group as those with a change score of at least 0 but
less than 1. Children with a mean change score of less than 0 were not included in the analysis. To
establish the validity of the anchor, a one-way multivariate analysis of variance (MANOVA) was
performed to test for differences in BOT-2 change scores between the two groups. Analogous to the MDC
proportion, the proportion of the sample with change scores exceeding the MID was determined for each
BOT-2 composite. In the second method, receiver operating characteristic (ROC) curves were used to
estimate the MID. ROC curves were constructed by plotting the “true positives” (sensitivity) against
the “false positives” (1 − specificity) for multiple cut-off points. The point closest to the upper
left corner of the ROC curve was assumed to represent the optimal trade-off between sensitivity and
specificity for detecting clinical improvement (Stucki, Liang, Fossel, & Katz, 1995). The area under
the ROC curve represents the probability that an instrument correctly discriminates between
“important improvement” and “no change” children. An area of 0.6 or greater was taken to indicate a
good level of discriminative ability.
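Both MID strategies can be sketched in a few lines of Python. This is an illustration under stated
assumptions, not the authors' procedure: the function names and toy data are hypothetical, a change
score at or above the cut-off is counted as a positive test, and candidate cut-offs are taken midway
between adjacent observed change scores.

```python
from math import hypot
from statistics import mean


def split_by_anchor(bot2_changes, ptps_changes):
    """Group BOT-2 change scores by the PTPS anchor rule described in the text."""
    improved = [b for b, p in zip(bot2_changes, ptps_changes) if p >= 1]
    no_change = [b for b, p in zip(bot2_changes, ptps_changes) if 0 <= p < 1]
    return improved, no_change           # children with p < 0 are left out


def mid_mean_change(improved, no_change):
    """Mean change of the 'improved' group minus that of the 'no change' group."""
    return mean(improved) - mean(no_change)


def sensitivity_specificity(improved, no_change, cutoff):
    """Count a change score >= cutoff as test positive (convention assumed here)."""
    sensitivity = sum(c >= cutoff for c in improved) / len(improved)
    specificity = sum(c < cutoff for c in no_change) / len(no_change)
    return sensitivity, specificity


def roc_cutoff_closest_to_corner(improved, no_change):
    """Cut-off whose ROC point lies nearest to (0, 1), the upper left corner."""
    values = sorted(set(improved) | set(no_change))
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]

    def distance(cutoff):
        sens, spec = sensitivity_specificity(improved, no_change, cutoff)
        return hypot(1 - sens, 1 - spec)

    return min(candidates, key=distance)


def roc_area(improved, no_change):
    """Chance that a random 'improved' child out-scores a random 'no change' child."""
    wins = sum((i > n) + 0.5 * (i == n) for i in improved for n in no_change)
    return wins / (len(improved) * len(no_change))


# Hypothetical change scores for 9 children (not study data).
bot2 = [3, 4, 5, 6, 0, 1, 2, 4, 7]
ptps = [1.2, 1.0, 2.0, 1.5, 0.0, 0.4, 0.9, 0.2, -0.3]
improved, no_change = split_by_anchor(bot2, ptps)
cutoff = roc_cutoff_closest_to_corner(improved, no_change)
print(mid_mean_change(improved, no_change), cutoff,
      sensitivity_specificity(improved, no_change, cutoff),
      roc_area(improved, no_change))
```

When two cut-offs are equally close to the corner, this sketch returns the smaller one; a different
tie-breaking rule would be equally defensible.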

3. Results

3.1. Reliability

Table 1 presents the reliability estimates for the BOT-2. Internal consistency of the BOT-2 total
score was excellent (Cronbach’s α = 0.92). Coefficient alpha ranged from 0.81 to 0.88 for the
subtests and from 0.87 to 0.88 for the composites, indicating sufficient homogeneity of the individual
domains as well as of the total test. In the test–retest analysis, the ICC varied between 0.88 and
0.99 for the subtests and composites; the ICC of the total score was 0.99, indicating excellent
reliability. SEM values for the subtests and composites all met the criterion (SEM ≤ SD/2),
suggesting acceptable measurement precision of the BOT-2.

Table 1
Internal consistency and test–retest reliability of the BOT-2.

| BOT-2 | T1, mean ± S.D. | T2, mean ± S.D. | Cronbach’s α | ICC (95% CI) | SEM |
|---|---|---|---|---|---|
| Subtests | | | | | |
| Fine motor precision | 6.09 ± 2.05 | 6.14 ± 2.11 | 0.81 | 0.96 (0.94–0.97) | 0.42 |
| Fine motor integration | 7.69 ± 2.11 | 7.64 ± 2.08 | 0.83 | 0.98 (0.97–0.99) | 0.39 |
| Manual dexterity | 4.75 ± 1.82 | 4.93 ± 1.86 | 0.83 | 0.92 (0.89–0.95) | 0.51 |
| Upper-limb coordination | 6.32 ± 2.15 | 6.54 ± 2.13 | 0.87 | 0.88 (0.83–0.92) | 0.73 |
| Bilateral coordination | 8.41 ± 3.37 | 8.36 ± 3.34 | 0.87 | 0.96 (0.95–0.98) | 0.65 |
| Balance | 9.48 ± 4.00 | 9.49 ± 4.06 | 0.85 | 0.99 (0.98–0.99) | 0.49 |
| Running speed and agility | 6.40 ± 2.55 | 6.25 ± 2.73 | 0.87 | 0.97 (0.95–0.98) | 0.49 |
| Strength | 8.13 ± 3.23 | 8.02 ± 3.39 | 0.85 | 0.96 (0.95–0.97) | 0.63 |
| Composites | | | | | |
| Fine manual control | 31.47 ± 4.81 | 31.41 ± 4.97 | 0.88 | 0.99 (0.98–0.99) | 0.58 |
| Manual coordination | 27.67 ± 4.55 | 27.81 ± 4.63 | 0.88 | 0.98 (0.97–0.99) | 0.66 |
| Body coordination | 35.43 ± 7.62 | 35.39 ± 7.73 | 0.87 | 0.99 (0.98–0.99) | 0.80 |
| Strength and agility | 32.57 ± 6.72 | 32.30 ± 6.84 | 0.88 | 0.99 (0.97–0.99) | 0.80 |
| Total | 127.14 ± 22.81 | 126.91 ± 23.38 | 0.92 | 0.99 (0.99–1.00) | 1.79 |

3.2. Responsiveness

Table 2 summarizes the ES, SRM, MDC, and MID values for the BOT-2. The BOT-2 total score had an
SRM of 0.54 and an ES of 0.67. The SRM for the subtests ranged from 0.27 to 0.79, with the lowest SRM
for the balance subtest. The SRM for the composites ranged from 0.31 to 0.73, with the lowest SRM for
the body coordination composite. The ESs were of a similar magnitude to the SRMs. The MDC (90% CI)
for the total score was 4.18, implying that, with 90% confidence, a change of less than about 4
points in a child’s total score can be attributed to measurement error. Using the MDC as the
criterion, 50% of the children improved by more than the MDC on the BOT-2 total score. The MDC
values ranged from 0.67 to 1.70 for the subtests and from 1.36 to 1.87 for the composites.
Among the 100 children, 49% and 51% were categorized as “improved” and “no change,” respectively.
None of the children had a change in PTPS score of less than 0. The MID based on the difference in
change scores of the BOT-2 total score between the “improved” and “no change” groups was 6.53, with
45% of children surpassing this MID threshold (Table 2). The MID values ranged from 0.57 to 1.73 for
the subtests and from 0.93 to 2.55 for the composites. A MANOVA applied to the mean change scores for
the four composites revealed significant differences between the two groups [Wilks’ lambda = 0.84,
F(4, 95) = 4.513, p < 0.01]. Subsequent ANOVAs on the individual composite scores were performed. In
light of the number of univariate analyses conducted, the Bonferroni-adjusted α level was set at
0.0125 (0.05/4) for all follow-up analyses to maintain a family-wise error rate of 0.05. Significant
univariate differences were found in the fine manual control (F = 109.121, p = 0.002), manual
coordination (F = 422.695, p < 0.001), body coordination (F = 169.536, p < 0.001), and strength and
agility (F = 184.382, p < 0.001) composites, with the “improved” group scoring higher than the “no
change” group.

Table 2
Responsiveness statistics for the BOT-2.

| BOT-2 | Difference scoreᵃ (T1 and T3) | Effect size | SRM | MDCᵇ | MDC proportion | MIDᶜ | MID proportion |
|---|---|---|---|---|---|---|---|
| Subtests | | | | | | | |
| Fine motor precision | 1.06 ± 1.36 (−2 to 4) | 0.78 | 0.72 | 0.98 | 58 (58%) | 0.72 | 56 (56%) |
| Fine motor integration | 0.94 ± 1.50 (−2 to 6) | 0.63 | 0.65 | 0.67 | 52 (52%) | 0.88 | 50 (50%) |
| Manual dexterity | 1.01 ± 2.10 (−2 to 8) | 0.48 | 0.76 | 1.19 | 37 (37%) | 1.47 | 37 (37%) |
| Upper-limb coordination | 1.69 ± 2.46 (−1 to 9) | 0.69 | 0.79 | 1.70 | 33 (33%) | 1.61 | 35 (35%) |
| Bilateral coordination | 2.29 ± 2.72 (−1 to 11) | 0.84 | 0.78 | 1.52 | 45 (45%) | 1.11 | 50 (54%) |
| Balance | 0.28 ± 1.24 (−2 to 4) | 0.23 | 0.27 | 1.14 | 27 (27%) | 0.57 | 27 (27%) |
| Running speed and agility | 0.88 ± 1.33 (0 to 5) | 0.66 | 0.54 | 1.14 | 41 (41%) | 0.59 | 40 (40%) |
| Strength | 1.79 ± 2.58 (−1 to 9) | 0.70 | 0.65 | 1.47 | 37 (37%) | 1.73 | 42 (42%) |
| Composites | | | | | | | |
| Fine manual control | 2.22 ± 3.41 (−2 to 14) | 0.65 | 0.56 | 1.36 | 59 (59%) | 0.93 | 59 (49%) |
| Manual coordination | 3.31 ± 5.21 (−2 to 19) | 0.64 | 0.73 | 1.54 | 43 (43%) | 2.55 | 37 (37%) |
| Body coordination | 1.59 ± 3.54 (−4 to 12) | 0.45 | 0.31 | 1.87 | 34 (34%) | 1.65 | 34 (34%) |
| Strength and agility | 2.88 ± 3.79 (−1 to 14) | 0.76 | 0.63 | 1.87 | 51 (51%) | 1.39 | 51 (51%) |
| Total | 2.22 ± 3.41 (−2 to 14) | 0.67 | 0.54 | 4.18 | 50 (50%) | 6.53 | 45 (45%) |

ᵃ Difference scores were calculated by subtracting the first baseline score from the third retest
score (i.e. T3 − T1). A positive score means that performance increased at retest and a negative
score means that performance decreased at retest. Values are mean ± S.D., with the lowest to highest
score in parentheses.
ᵇ MDC was calculated as 1.65 × √2 × SEM, at a 10% level of significance (90% confidence interval).
ᶜ MID was defined as the difference between the mean change of the BOT-2 scores for children
classified as “improved” and the mean change score for children classified as “no change.”

At the point on the ROC curve closest to the upper left corner, the sensitivity and specificity of
the BOT-2 total composite were 65% and 65%, respectively. The sensitivity and specificity for the
fine manual control composite were 49% and 61%, respectively. The manual coordination composite
reached 65% and 65%, whereas the body coordination composite yielded a sensitivity of 59% and a
specificity of 61%. The strength and agility composite achieved a sensitivity of only 49% and a
specificity of 71%. The BOT-2 change scores that corresponded to these ROC points were 3.5 for the
total score, 1.5 for the fine manual control composite, 0.5 for the manual coordination composite,
0.5 for the body coordination composite, and 2.5 for the strength and agility composite. The ROC
areas for the changes in total, fine manual control, manual coordination, body coordination, and
strength and agility composite standard scores were 0.66, 0.61, 0.66, 0.56 and 0.61, respectively.
Table 3 displays the sensitivity and specificity for the discrimination between clinical
improvement and clinical stability using the MDC and MID (mean change approach) as cut-off
criteria. Relative to the ROC approach, both the MDC and the MID (mean change approach) had lower
sensitivity but higher specificity for detecting clinically meaningful change in the child’s
performance of specific school-related functional activities.

Table 3
Sensitivity and specificity using MDC and MID cut-points to differentiate between improved and stable groups.

| BOT-2 composites | MDC threshold: sensitivity | MDC threshold: specificity | MID threshold: sensitivity | MID threshold: specificity |
|---|---|---|---|---|
| Fine manual control | 46.94% | 62.75% | 48.98% | 60.78% |
| Manual coordination | 55.10% | 70.59% | 48.98% | 76.47% |
| Body coordination | 32.65% | 84.31% | 32.65% | 84.31% |
| Strength and agility | 51.02% | 68.63% | 51.02% | 68.63% |
| Total | 55.10% | 72.55% | 48.98% | 76.47% |

4. Discussion

This study assessed for the first time the reliability and responsiveness of the BOT-2 in children
with ID and showed that the test is a reliable outcome measure with which to evaluate motor
proficiency. The internal consistency of the BOT-2 total score in our sample (α = 0.92) was somewhat
lower than that reported for children without ID in the same age range (α = 0.95) (Bruininks &
Bruininks, 2005). However, the mean Cronbach’s α values for the subtests and composites were slightly
higher in children with ID (0.85 and 0.88, respectively) than in their typically developing
counterparts (0.80 and 0.87, respectively). Test–retest reliability coefficients (ICC) were excellent
for the total BOT-2 and its subtests and composites in our sample, whereas test–retest reliability
coefficients (Pearson correlation coefficients) were moderate to high in children without ID
(Bruininks & Bruininks, 2005). These results are not directly comparable, however, because of
differences in population, statistical methods, and time intervals (14 days vs 7–42 days).
The reliability coefficients alone do not provide sufficient information about the precision of test
scores of individual children. Therefore, this study provides additional psychometric data regarding
the variability of the errors of measurement based on the SEM. The SEM for the total BOT-2 score was
1.79, for the subtests ranged from 0.39 to 0.73, and for the composites ranged from 0.58 to 0.80.
Compared with the SEMs for the subtests and composites in children without ID (1.66–3.67 for ages 4–
7, and 1.95–3.80 for ages 8–11) (Bruininks & Bruininks, 2005), the SEMs for children with ID in this
study were smaller. The resulting estimates of SEM can be used to guide score interpretation such that
a child’s true total score may lie within ±2.95 points of the observed score at the 90% CI, ±3.5 at
the 95% CI, and ±4.62 at the 99% CI.
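Such bands are multiples of the SEM. The Python sketch below recomputes the half-widths from the
total-score SEM of 1.79, assuming normally distributed measurement error and the conventional
rounded z-values shown; the names and the example observed score are hypothetical.

```python
# Conventional rounded z-values for two-sided confidence levels (assumed here).
Z_SCORES = {"90%": 1.65, "95%": 1.96, "99%": 2.58}

TOTAL_SEM = 1.79  # SEM of the BOT-2 total score reported in the study


def half_width(sem_value, z):
    """Half-width of the band around an observed score: z x SEM."""
    return z * sem_value


def true_score_band(observed, sem_value, z):
    """Lower and upper limits of the band around an observed score."""
    margin = half_width(sem_value, z)
    return observed - margin, observed + margin


for level, z in Z_SCORES.items():
    print(level, round(half_width(TOTAL_SEM, z), 2))

# Band around a hypothetical observed total score of 120.
print(true_score_band(120, TOTAL_SEM, Z_SCORES["95%"]))
```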
There appears to be no consensus yet on how responsiveness should be assessed (Husted, Cook,
Farewell, & Gladman, 2000), and it is recommended that a number of constructs (e.g. clinician or
clients’ perceptions and functional measures) be used to determine responsiveness. In this study, the
responsiveness for the BOT-2 was analyzed at both the group and individual levels, taking into account
that average effects across a group may not be meaningful to the individual child (Schmitt & Di Fabio,
2004). Examples of group-level statistics are the ES and SRM, where the mean score change serves as
the numerators of both parameters and the denominators are the standard deviation at baseline and
standard deviation of change for the sample of interest, respectively. The values of the SRM and the ES
for the BOT-2 scores were moderate, with the exception of the balance subtest and body coordination
composite. A probable reason for these low effect sizes is that most items on the balance subtest
were easy for our sample: over half of the children (59%) had scale scores greater than or equal to
10 at baseline. Further, the balance subtest is one of the two subtests that form the body
coordination composite. Apart from this, the moderate effect sizes across other BOT-2 measures
suggested that children with cognitive impairment may require longer periods of intervention to
achieve a substantial improvement in motor performance. This speculation requires replication of our
findings with a similar sample of children receiving longer intervention and with children of different
diagnoses such as cerebral palsy or brain injury.
The MDC at the 90% CI and the MID were used to define individual-level change. Although the MDC and
MID are different concepts calculated by different methods, the MDC estimates for the BOT-2 subtests
and composites were close to the corresponding MID estimates. For instance, to be considered
clinically relevant in children with ID, a change in the BOT-2 total standard score needs to be
greater than 4 points (55% sensitivity, 73% specificity) using the MDC as a criterion, 3.5 points
(65% sensitivity, 65% specificity) using the ROC approach, or 6.5 points (49% sensitivity, 76%
specificity) using the MID (mean change approach) as a threshold. Although it is up to clinicians to
decide whether the MDC, MID, or ROC cut-off is most suitable as a criterion standard in their
specific circumstances (Kovacs et al., 2008), the sensitivity and specificity associated with these
potential cut-off scores could be valuable in the decision-making process. For example, sensitivity
may be valued more than specificity for identifying important changes when determining treatment
effectiveness. In addition, the MDC allows clinicians to interpret directly whether score changes
over time exceed measurement error (Martin, 2007). Nonetheless, use of MDC values is complicated by
different “beyond error” thresholds (i.e. 90% CI vs 95% CI) (Beaton, Boers, & Wells, 2002; Turner,
2008). In addition, MDC values will vary among samples, since the MDC is based on the data
distribution (Lassere et al., 2001). The MID values, on the
other hand, are also context-specific and may vary by length and type of treatment, definition of
‘minimal important’ on the anchor, type of anchor (client-based vs clinician-based), and score
distribution at baseline which might be an indicator of severity of disability (de Vet et al., 2006, 2007).
Concerning the discriminative validity of the BOT-2, the area under the ROC curve also suggested that,
except for the body coordination composite, the composite scores demonstrated
acceptable discriminative properties for detecting clinical change.
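The individual-level thresholds discussed above can be illustrated with a minimal sketch. It assumes the commonly used distribution-based formula MDC = z × √2 × SEM with SEM = SD × √(1 − ICC), where z = 1.645 for the 90% CI; whether this matches the exact computation used in the study is an assumption. The function names and all numbers in the example are hypothetical and are not the study's data.

```python
import math


def mdc(sd_baseline, icc, z=1.645):
    """Minimal detectable change under the assumed formula z * sqrt(2) * SEM.

    SEM (standard error of measurement) is taken as SD * sqrt(1 - ICC).
    z = 1.645 corresponds to a 90% CI, z = 1.96 to a 95% CI.
    """
    sem = sd_baseline * math.sqrt(1.0 - icc)
    return z * math.sqrt(2.0) * sem


def sensitivity_specificity(changes, improved, cutoff):
    """Sensitivity and specificity of a change-score cut-off.

    changes  : change scores (follow-up minus baseline)
    improved : external-criterion ratings (True = meaningfully improved)
    A child is classified as changed when the change score exceeds the cut-off.
    """
    pairs = list(zip(changes, improved))
    tp = sum(1 for c, imp in pairs if imp and c > cutoff)
    fn = sum(1 for c, imp in pairs if imp and c <= cutoff)
    tn = sum(1 for c, imp in pairs if not imp and c <= cutoff)
    fp = sum(1 for c, imp in pairs if not imp and c > cutoff)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    print("MDC90:", round(mdc(sd_baseline=10.0, icc=0.99), 2))
    print("MDC95:", round(mdc(sd_baseline=10.0, icc=0.99, z=1.96), 2))

    example_changes = [1, 2, 3, 5, 6, 7, 2, 8]
    example_improved = [False, False, False, True, True, True, True, False]
    sens, spec = sensitivity_specificity(example_changes, example_improved, cutoff=4)
    print("sensitivity:", sens, "specificity:", spec)
```

Raising the cut-off in such a sketch generally lowers sensitivity and raises specificity, which mirrors the trade-off among the MDC, ROC, and MID thresholds reported above.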
The study’s strengths are the prospectively examined cohort and the absence of missing data. Retrospective
ratings of change yield little information about the ability of an instrument to detect treatment effects
(Liang, 2000). A reliable and valid multi-item rating scale (PTPS) was used as the criterion to characterize
the clinical significance of BOT-2 change scores. Further, simultaneous use of both group- and individual-level
responsiveness statistics may provide a stronger foundation for understanding meaningful changes
over time (Cella, Eton, Lai, Peterman, & Merkel, 2002; Haley & Fragala-Pinkham, 2006).
In conclusion, the results of this study revealed that the BOT-2 had good internal consistency and
excellent test–retest reliability over a 2-week period. In the assessment of overall responsiveness, the
MDC and MID proportions supported the results of ESs and SRMs that were mainly in the moderate
range using Cohen’s criteria. The MDC and MID values will be useful in interpreting BOT-2 scores,
both in individuals and in groups of children participating in intervention.
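The group-level statistics named above can be sketched as follows, assuming the usual definitions: the effect size (ES) is the mean change divided by the standard deviation of baseline scores, and the standardized response mean (SRM) is the mean change divided by the standard deviation of the change scores. The function names and the scores are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """ES: mean change divided by the sample SD of baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """SRM: mean change divided by the sample SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


if __name__ == "__main__":
    # Hypothetical standard scores for five children, for illustration only.
    baseline_scores = [40, 42, 44, 46, 48]
    follow_up_scores = [43, 44, 48, 49, 54]
    print("ES:", round(effect_size(baseline_scores, follow_up_scores), 2))
    print("SRM:", round(standardized_response_mean(baseline_scores, follow_up_scores), 2))
```

Under Cohen's conventional benchmarks, values near 0.2, 0.5, and 0.8 are read as small, moderate, and large, respectively.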

References

Beaton, D. E., Boers, M., & Wells, G. A. (2002). Many faces of the minimal clinically important difference (MCID): A literature review
and directions for future research. Current Opinion in Rheumatology, 14, 109–114.
Bruininks, R. H. (1978). Bruininks–Oseretsky Test of Motor Proficiency. Circle Pines, MN: American Guidance Service.
Bruininks, R. H., & Bruininks, B. D. (2005). Bruininks–Oseretsky Test of Motor Proficiency (2nd ed.). Minneapolis, MN: Pearson
Assessment.
Cairney, J., Schmidt, L. A., Veldhuizen, S., Kurdyak, P., Hay, J., & Faught, B. E. (2008). Left-handedness and developmental coordination
disorder. Canadian Journal of Psychiatry, 53, 696–699.
Cella, D., Eton, D. T., Lai, J. S., Peterman, A., & Merkel, D. E. (2002). Combining anchor and distribution based methods to derive
minimal clinically important differences on the Functional Assessment of Cancer Therapy (FACT) Anemia and Fatigue Scales.
Journal of Pain and Symptom Management, 24, 547–561.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Coster, W., Deeney, T., Haltiwanger, J., & Haley, S. (1998). School Function Assessment. San Antonio: Therapy Skill Builders.
de Vet, H. C., Ostelo, R. W., Terwee, C. B., van der Roer, N., Knol, D. L., Beckerman, H., et al. (2007). Minimally important
change determined by a visual method integrating an anchor-based and a distribution-based approach. Quality of Life Research,
16, 131–142.
de Vet, H. C., Terwee, C. B., Ostelo, R. W., Beckerman, H., Knol, D. L., & Bouter, L. M. (2006). Minimal changes in health status
questionnaires: Distinction between minimally detectable change and minimally important change. Health and Quality of Life
Outcomes, 4, 54.
Dewey, D., Cantell, M., & Crawford, S. G. (2007). Motor and gestural performance in children with autism spectrum disorders,
developmental coordination disorder, and/or attention deficit hyperactivity disorder. Journal of the International Neuropsychological
Society, 13, 246–256.
Dolva, A. S., Coster, W., & Lilja, M. (2004). Functional performance in children with Down syndrome. American Journal of Occupational
Therapy, 58, 612–629.
Folio, M. R., & Fewell, R. R. (2000). Peabody Developmental Motor Scales (2nd ed.). Austin, TX: PRO-ED.
Gardner, M. F. (1995). Test of Visual–Motor Skills-Revised. Hydesville, CA: Psychological and Educational Publications.
Gordon, A. M., Schneider, J. A., Chinnan, A., & Charles, J. R. (2007). Efficacy of a hand-arm bimanual intensive therapy (HABIT) in
children with hemiplegic cerebral palsy: A randomized control trial. Developmental Medicine and Child Neurology, 49, 830–838.
Haley, S. M., & Fragala-Pinkham, M. A. (2006). Interpreting change scores of tests and measures used in physical therapy. Physical
Therapy, 86, 735–743.
Huang, J. L. (2005). The reliability and validity of the School Function Assessment-Chinese version for cross-cultural use in Taiwan.
Occupational Therapy International, 11, 26–39.
Huang, J. L. (2008). School Function Assessment-Chinese Version. Taipei: The Psychological Corporation.
Husted, J. A., Cook, R. J., Farewell, V. T., & Gladman, D. D. (2000). Methods for assessing responsiveness: A critical review and
recommendations. Journal of Clinical Epidemiology, 53, 459–468.
Kazis, L. E., Anderson, J. J., & Meenan, R. F. (1989). Effect size for interpreting changes in health status. Medical Care, 27, S178–S189.
Kovacs, F. M., Abraira, V., Royuela, A., Corcoll, J., Alegre, L., Tomás, M., et al. (2008). Minimum detectable and minimal clinically
important changes for pain in patients with nonspecific neck pain. BMC Musculoskeletal Disorders, 9, 43.
Lassere, M. N., van der Heijde, D., Johnson, K., Bruynesteyn, K., Molenaar, E., Boonen, A., et al. (2001). Robustness and generalizability
of smallest detectable difference in radiological progression. Journal of Rheumatology, 28, 911–913.
Liang, M. H. (2000). Longitudinal construct validity: Establishment of clinical meaning in patient evaluative instruments. Medical
Care, 38(Suppl.), 84–90.
Liang, M. H., Fossel, A. H., & Larson, M. G. (1990). Comparisons of five health status instruments for orthopedic evaluation. Medical Care,
28, 632–642.
Martin, R. L. (2007). Invited commentary on the movement continuum special series. Physical Therapy, 87, 927–928.
Schmitt, J. S., & Di Fabio, R. P. (2004). Reliable change and minimum important difference (MID) proportions facilitated group
responsiveness comparisons using individual threshold criteria. Journal of Clinical Epidemiology, 57, 1008–1018.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlation: Uses in assessing rater reliability. Psychological Bulletin, 86, 420.
Sloan, J. A., Symonds, T., Vargas-Chanes, D., & Friedly, B. (2003). Practical guidelines for assessing the clinical significance of
health-related quality of life changes within clinical trials. Drug Information Journal, 37, 23–31.
Stratford, P. W., Binkley, J. M., & Riddle, J. L. (1996). Health status measures: Strategies and analytic methods for assessing change
scores. Physical Therapy, 76, 1109–1123.
Stucki, G., Liang, M. H., Fossel, A. H., & Katz, J. N. (1995). Relative responsiveness of condition-specific and generic health status
measures in degenerative lumbar spinal stenosis. Journal of Clinical Epidemiology, 48, 1369–1378.
Turner, D. (2008). Development, evaluation and application of a Pediatric Ulcerative Colitis Activity Index (PUCAI). Unpublished doctoral
dissertation, University of Toronto, Toronto, Ontario, Canada.
Vangeneugden, T., Laenen, A., Geys, H., Renard, D., & Molenberghs, G. (2005). Applying concepts of generalizability theory on clinical
trial data to investigate sources of variation and their impact on reliability. Biometrics, 61, 295–304.
Wang, T. M., Su, C. W., Liao, H. F., Lin, L. Y., Chou, K. S., & Lin, S. H. (1998). The standardization of the Comprehensive Developmental
Inventory for infants and toddlers. Psychological Testing, 45, 19–46.
Wang, W. Y., & Chang, J. J. (1997). Effects of jumping skill training on walking balance for children with mental retardation and
Down’s syndrome. Kaohsiung Journal of Medical Sciences, 13, 487–495.
Wyrwich, K. W., Nienaber, M. A., Tierney, W. M., & Wolinsky, F. D. (1999). Linking clinical relevance and statistical significance in
evaluating intraindividual changes in health-related quality of life. Medical Care, 37, 469–478.
Wuang, Y. P., Wang, C. C., Huang, M. H., & Su, C. Y. (2008). Profiles and cognitive predictors of motor functions among early school-age
children with mild intellectual disabilities. Journal of Intellectual Disability Research, 52, 1048–1060.
