
Optimal Design of Experiments

SIAM's Classics in Applied Mathematics series consists of books that were previously allowed to go out of print. These books are republished by SIAM as a professional service because they continue to be important resources for mathematical scientists. Editor-in-Chief Robert E. O'Malley, jr., University of Washington Editorial Board Richard A. Brualdi, University of Wisconsin-Madison Nicholas J. Higham, University of Manchester Leah Edelstein-Keshet, University of British Columbia Herbert B. Keller, California Institute of Technology Andrzej Z. Manitius, George Mason University Hilary Ockendon, University of Oxford Ingram Olkin, Stanford University Peter Olver, University of Minnesota Ferdinand Verhulst, Mathematisch Instituut, University of Utrecht Classics in Applied Mathematics C. C. Lin and L. A. Segel, Mathematics Applied to Deterministic Problems in the Natural Sciences Johan G. F. Belinfante and Bernard Kolman, A Survey of Lie Groups and Lie Algebras with Applications and Computational Methods James M. Ortega, Numerical Analysis: A Second Course Anthony V. Fiacco and Garth P. McCormick, Nonlinear Programming: Sequential Unconstrained Minimization Techniques F. H. Clarke, Optimization and Nonsmooth Analysis George F. Carrier and Carl E. Pearson, Ordinary Differential Equations Leo Breiman, Probability R. Bellman and G. M. Wing, An Introduction to Invariant Imbedding Abraham Berman and Robert J. Plemmons, Nonnegative Matrices in the Mathematical Sciences Olvi L. Mangasarian, Nonlinear Programming *Carl Friedrich Gauss, Theory of the. Combination of Observations Least Subject to Errors: Part One, Part Two, Supplement. Translated by G. W. Stewart Richard Bellman, Introduction to Matrix Analysis U. M. Ascher, R. M. M. Mattheij, and R. D. Russell, Numerical Solution of Boundary Value Problems for Ordinary Differential Equations K. E. Brenan, S. L. Campbell, and L. R. Pel2old, Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations Charles L. Lawson and Richard J. Hanson, Solving Least Squares Problems J. E. Dennis, Jr. and Robert B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations Richard E. Barlow and Frank Proschan, Mathematical Theory of Reliability Cornelius Lanczos, Linear Differential Operators Richard Bellman, Introduction to Matrix Analysis, Second Edition Beresford N. Parlett, The Symmetric Eigenvalue Problem Richard Haberman, Mathematical Models: Mechanical Vibrations, Population Dynamics, and Traffic Flow *First time in print.


Classics in Applied Mathematics (continued) Peter W. M. John, Statistical Design and Analysis of Experiments Tamer Basar and Geert Jan Olsder, Dynamic Noncooperative Game Theory, Second Edition Emanuel Parzen, Stochastic Processes Petar Kokotovic, Hassan K. Khalil, and John O'Reilly, Singular Perturbation Methods in Control: Analysis and Design Jean Dickinson Gibbons, Ingram Olkin, and Milton Sobel, Selecting and Ordering Populations: A New Statistical Methodology James A. Murdock, Perturbations: Theory and Methods Ivar Ekeland and Roger Temam, Convex Analysis and Variational Problems Ivar Stakgold, Boundary Value Problems of Mathematical Physics, Volumes I and II J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables David Kinderlehrer and Guido Stampacchia, An Introduction to Variational Inequalities and Their Applications F Natterer, The Mathematics of Computerized Tomography Avinash C. Kak and Malcolm Slaney, Principles of Computerized Tomographic Imaging R. Wong, Asymptotic Approximations of Integrals O. Axelsson and V. A. Barker, Finite Element Solution of Boundary Value Problems: Theory and Computation David R. Brillinger, Time Series: Data Analysis and Theory Joel N. Franklin, Methods of Mathematical Economics: Linear and Nonlinear Programming, Fixed-Point Theorems Philip Hartman, Ordinary Differential Equations, Second Edition Michael D. Intriligator, Mathematical Optimization and Economic Theory Philippe G. Ciarlet, The Finite Element Method for Elliptic Problems Jane K. Cullum and Ralph A. Willoughby, Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. I: Theory M. Vidyasagar, Nonlinear Systems Analysis, Second Edition Robert Mattheij and Jaap Molenaar, Ordinary Differential Equations in Theory and Practice Shanti S. Gupta and S. Panchapakesan, Multiple Decision Procedures: Theory and Methodology of Selecting and Ranking Populations Eugene L. Allgower and Kurt Georg, Introduction to Numerical Continuation Methods Leah Edelstein-Keshet, Mathematical Models in Biology Heinz-Otto Kreiss and Jens Lorenz, Initial-Boundary Value Problems and the Navier-Stokes Equations J. L. Hodges, Jr. and E. L. Lehmann, Basic Concepts of Probability and Statistics, Second Edition George F Carrier, Max Krook, and Carl E. Pearson, Functions of a Complex Variable: Theory and Technique Friedrich Pukelsheim, Optimal Design of Experiments



Optimal Design of Experiments


Friedrich Pukelsheim
University of Augsburg Augsburg, Germany

Society for Industrial and Applied Mathematics Philadelphia

Copyright 2006 by the Society for Industrial and Applied Mathematics This SIAM edition is an unabridged republication of the work first published by John Wiley & Sons, Inc., New York, 1993.

10 9 8 7 6 5 4 3 2 1 All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688. Library of Congress Cataloging-in-Publication Data: Pukelsheim, Friedrich, 1948- Optimal design of experiments / Friedrich Pukelsheim. Classic ed. p. cm. (Classics in applied mathematics ; 50) Originally published: New York : J. Wiley, 1993. Includes bibliographical references and index. ISBN 0-89871-604-7 (pbk.) 1. Experimental design. I. Title. II. Series. QA279.P85 2006 519.5'7--dc22

2005056407
Partial royalties from the sale of this book are placed in a fund to help students attend SIAM meetings and other SIAM-related activities. This fund is administered by SIAM, and qualified individuals are encouraged to write directly to SIAM for guidelines.

SIAM is a registered trademark.

Contents
Preface to the Classics Edition, xvii Preface, xix List of Exhibits, xxi Interdependence of Chapters, xxiv Outline of the Book, xxv Errata, xxix 1. Experimental Designs in Linear Models 1.1. 1.2. 1.3. 1.4. 1.5. 1.6. 1.7. 1.8. 1.9. 1.10. 1.11. 1.12. 1.13. 1.14. 1.15. 1.16. 1.17. 1.18. 1.19. 1.20. 1.21. 1.22. Deterministic Linear Models, 1 Statistical Linear Models, 2 Classical Linear Models with Moment Assumptions, 3 Classical Linear Models with Normality Assumption, 4 Two-Way Classification Models, 4 Polynomial Fit Models, 6 Euclidean Matrix Space, 7 Nonnegative Definite Matrices, 9 Geometry of the Cone of Nonnegative Definite Matrices, 10 The Loewner Ordering of Symmetric Matrices, 11 Monotonic Matrix Functions, 12 Range and Nullspace of a Matrix, 13 Transposition and Orthogonality, 14 Square Root Decompositions of a Nonnegative Definite Matrix, 15 Distributional Support of Linear Models, 15 Generalized Matrix Inversion and Projections, 16 Range Inclusion Lemma, 17 General Linear Models, 18 The Gauss-Markov Theorem, 20 The Gauss-Markov Theorem under a Range Inclusion Condition, 21 The Gauss-Markov Theorem for the Full Mean Parameter System, 22 Projectors, Residual Projectors, and Direct Sum Decomposition, 23



1.23. 1.24. 1.25. 1.26. 1.27. 1.28.

Optimal Estimators in Classical Linear Models, 24 Experimental Designs and Moment Matrices, 25 Model Matrix versus Design Matrix, 27 Geometry of the Set of All Moment Matrices, 29 Designs for Two-Way Classification Models, 30 Designs for Polynomial Fit Models, 32 Exercises, 33 35

2. Optimal Designs for Scalar Parameter Systems 2.1. 2.2. 2.3. 2.4. 2.5. 2.6. 2.7. 2.8. 2.9. 2.10. 2.11. 2.12. 2.13. 2.14. 2.15. 2.16. 2.17. 2.18. 2.19. 2.20. 2.21. 2.22. 2.23. Parameter Systems of Interest and Nuisance Parameters, 35 Estimability of a One-Dimensional Subsystem, 36 Range Summation Lemma, 37 Feasibility Cones, 37 The Ice-Cream Cone, 38 Optimal Estimators under a Given Design, 41 The Design Problem for Scalar Parameter Subsystems, 41 Dimensionality of the Regression Range, 42 Elfving Sets, 43 Cylinders that Include the Elfving Set, 44 Mutual Boundedness Theorem for Scalar Optimality, 45 The Elfving Norm, 47 Supporting Hyperplanes to the Elfving Set, 49 The Elfving Theorem, 50 Projectors for Given Subspaces, 52 Equivalence Theorem for Scalar Optimality, 52 Bounds for the Optimal Variance, 54 Eigenvectors of Optimal Moment Matrices, 56 Optimal Coefficient Vectors for Given Moment Matrices, 56 Line Fit Model, 57 Parabola Fit Model, 58 Trigonometric Fit Models, 58 Convexity of the Optimality Criterion, 59 Exercises, 59

3. Information Matrices 3.1. Subsystems of Interest of the Mean Parameters, 61 3.2. Information Matrices for Full Rank Subsystems, 62 3.3. Feasibility Cones, 63

61



3.4. 3.5. 3.6. 3.7. 3.8. 3.9. 3.10. 3.11. 3.12. 3.13. 3.14. 3.15. 3.16. 3.17. 3.18. 3.19. 3.20. 3.21. 3.22. 3.23. 3.24. 3.25.

Estimability, 64 Gauss-Markov Estimators and Predictors, 65 Testability, 67 F-Test of a Linear Hypothesis, 67 ANOVA, 71 Identifiability, 72 Fisher Information, 72 Component Subsets, 73 Schur Complements, 75 Basic Properties of the Information Matrix Mapping, 76 Range Disjointness Lemma, 79 Rank of Information Matrices, 81 Discontinuity of the Information Matrix Mapping, 82 Joint Solvability of Two Matrix Equations, 85 Iterated Parameter Subsystems, 85 Iterated Information Matrices, 86 Rank Deficient Subsystems, 87 Generalized Information Matrices for Rank Deficient Subsystems, 88 Generalized Inverses of Generalized Information Matrices, 90 Equivalence of Information Ordering and Dispersion Ordering, 91 Properties of Generalized Information Matrices, 92 Contrast Information Matrices in Two-Way Classification Models, 93 Exercises, 96 98

4. Loewner Optimality 4.1. 4.2. 4.3. 4.4. 4.5. 4.6. 4.7. 4.8. 4.9. Sets of Competing Moment Matrices, 98 Moment Matrices with Maximum Range and Rank, 99 Maximum Range in Two-Way Classification Models, 99 Loewner Optimality, 101 Dispersion Optimality and Simultaneous Scalar Optimality, 102 General Equivalence Theorem for Loewner Optimality, 103 Nonexistence of Loewner Optimal Designs, 104 Loewner Optimality in Two-Way Classification Models, 105 The Penumbra of the Set of Competing Moment Matrices, 107


4.10. 4.11. 4.12. 4.13.

Geometry of the Penumbra, 108 Existence Theorem for Scalar Optimality, 109 Supporting Hyperplanes to the Penumbra, 110 General Equivalence Theorem for Scalar Optimality, 111 Exercises, 113 114

5. Real Optimality Criteria 5.1. 5.2. 5.3. 5.4. 5.5. 5.6. 5.7. 5.8. 5.9. 5.10. 5.11. 5.12. 5.13. 5.14. 5.15. 5.16. 5.17. Positive Homogeneity, 114 Superadditivity and Concavity, 115 Strict Superadditivity and Strict Concavity, 116 Nonnegativity and Monotonicity, 117 Positivity and Strict Monotonicity, 118 Real Upper Semicontinuity, 118 Semicontinuity and Regularization, 119 Information Functions, 119 Unit Level Sets, 120 Function-Set Correspondence, 122 Functional Operations, 124 Polar Information Functions and Polar Norms, 125 Polarity Theorem, 127 Compositions with the Information Matrix Mapping, 129 The General Design Problem, 131 Feasibility of Formally Optimal Moment Matrices, 132 Scalar Optimality, Revisited, 133 Exercises, 134

6. Matrix Means 6.1. 6.2. 6.3. 6.4. 6.5. 6.6. 6.7. 6.8. 6.9. 6.10. 6.11. Classical Optimality Criteria, 135 D-Criterion, 136 A-Criterion, 137 E-Criterion, 137 T-Criterion, 138 Vector Means, 139 Matrix Means, 140 Diagonality of Symmetric Matrices, 142 Vector Majorization, 144 Inequalities for Vector Majorization, 146 The Holder Inequality, 147

135



6.12. 6.13. 6.14. 6.15. 6.16. 6.17.

Polar Matrix Means, 149 Matrix Means as Information Functions and Norms, 151 The General Design Problem with Matrix Means, 152 Orthogonality of Two Nonnegative Definite Matrices, 153 Polarity Equation, 154 Maximization of Information versus Minimization of Variance, 155 Exercises, 156

7. The General Equivalence Theorem 7.1. 7.2. 7.3. 7.4. 7.5. 7.6. 7.7. 7.8. 7.9. 7.10. 7.11. 7.12. 7.13. 7.14. 7.15. 7.16. 7.17. 7.18. 7.19. 7.20. 7.21. 7.22. 7.23. 7.24.

158

Subgradients and Subdifferentials, 158 Normal Vectors to a Convex Set, 159 Full Rank Reduction, 160 Subgradient Theorem, 162 Subgradients of Isotonic Functions, 163 A Chain Rule Motivation, 164 Decomposition of Subgradients, 165 Decomposition of Subdifferentials, 167 Subgradients of Information Functions, 168 Review of the General Design Problem, 170 Mutual Boundedness Theorem for Information Functions, 171 Duality Theorem, 172 Existence Theorem for Optimal Moment Matrices, 174 The General Equivalence Theorem, 175 General Equivalence Theorem for the Full Parameter Vector, 176 Equivalence Theorem, 176 Equivalence Theorem for the Full Parameter Vector, 177 Merits and Demerits of Equivalence Theorems, 177 General Equivalence Theorem for Matrix Means, 178 Equivalence Theorem for Matrix Means, 180 General Equivalence Theorem for E-Optimality, 180 Equivalence Theorem for E-Optimality, 181 E-Optimality, Scalar Optimality, and Eigenvalue Simplicity, 183 E-Optimality, Scalar Optimality, and Elfving Norm, 183 Exercises, 185



8. Optimal Moment Matrices and Optimal Designs 8.1. 8.2. 8.3. 8.4. 8.5. 8.6. 8.7. 8.8. 8.9. 8.10. 8.11. 8.12. 8.13. 8.14. 8.15. 8.16. 8.17. 8.18. 8.19. From Moment Matrices to Designs, 187 Bound for the Support Size of Feasible Designs, 188 Bound for the Support Size of Optimal Designs, 190 Matrix Convexity of Outer Products, 190 Location of the Support Points of Arbitrary Designs, 191 Optimal Designs for a Linear Fit over the Unit Square, 192 Optimal Weights on Linearly Independent Regression Vectors, 195 A-Optimal Weights on Linearly Independent Regression Vectors, 197 C-Optimal Weights on Linearly Independent Regression Vectors, 197 Nonnegative Definiteness of Hadamard Products, 199 Optimal Weights on Given Support Points, 199 Bound for Determinant Optimal Weights, 201 Multiplicity of Optimal Moment Matrices, 201 Multiplicity of Optimal Moment Matrices under Matrix Means, 202 Simultaneous Optimality under Matrix Means, 203 Matrix Mean Optimality for Component Subsets, 203 Moore-Penrose Matrix Inversion, 204 Matrix Mean Optimality for Rank Deficient Subsystems, 205 Matrix Mean Optimality in Two-Way Classification Models, 206 Exercises, 209

187

9. D-, A-, E-, T-Optimality 9.1. 9.2. 9.3. 9.4. 9.5. 9.6. 9.7. 9.8. 9.9. 9.10. 9.11. D-, A-, E-, T-Optimality, 210 G-Criterion, 210 Bound for Global Optimality, 211 The Kiefer-Wolfowitz Theorem, 212 D-Optimal Designs for Polynomial Fit Models, 213 Arcsin Support Designs, 217 Equivalence Theorem for A-Optimality, 221 L-Criterion, 222 A-Optimal Designs for Polynomial Fit Models, 223 Chebyshev Polynomials, 226 Lagrange Polynomials with Arcsin Support Nodes, 227

210



9.12. 9.13. 9.14. 9.15. 9.16. 9.17.

Scalar Optimality in Polynomial Fit Models, I, 229 E-Optimal Designs for Polynomial Fit Models, 232 Scalar Optimality in Polynomial Fit Models, II, 237 Equivalence Theorem for T-Optimality, 240 Optimal Designs for Trigonometric Fit Models, 241 Optimal Designs under Variation of the Model, 243 Exercises, 245 247

10. Admissibility of Moment and Information Matrices 10.1. 10.2. 10.3. 10.4. 10.5. 10.6. 10.7. 10.8. 10.9. 10.10. 10.11. 10.12. 10.13. 10.14. 10.15.

Admissible Moment Matrices, 247 Support Based Admissibility, 248 Admissibility and Completeness, 248 Positive Polynomials as Quadratic Forms, 249 Loewner Comparison in Polynomial Fit Models, 251 Geometry of the Moment Set, 252 Admissible Designs in Polynomial Fit Models, 253 Strict Monotonicity, Unique Optimality, and Admissibility, 256 E-Optimality and Admissibility, 257 T-Optimality and Admissibility, 258 Matrix Mean Optimality and Admissibility, 260 Admissible Information Matrices, 262 Loewner Comparison of Special C-Matrices, 262 Admissibility of Special C-Matrices, 264 Admissibility, Minimaxity, and Bayes Designs, 265 Exercises, 266 268

11. Bayes Designs and Discrimination Designs 11.1. 11.2. 11.3. Bayes Linear Models with Moment Assumptions, 268 Bayes Estimators, 270 Bayes Linear Models with Normal-Gamma Prior Distributions, 272 11.4. Normal-Gamma Posterior Distributions, 273 11.5. The Bayes Design Problem, 275 11.6. General Equivalence Theorem for Bayes Designs, 276 11.7. Designs with Protected Runs, 277 11.8. General Equivalence Theorem for Designs with Bounded Weights, 278 11.9. Second-Degree versus Third-Degree Polynomial Fit Models, I, 280



11.10. 11.11. 11.12. 11.13. 11.14. 11.15. 11.16. 11.17. 11.18. 11.19. 11.20. 11.21. 11.22.

Mixtures of Models, 283 Mixtures of Information Functions, 285 General Equivalence Theorem for Mixtures of Models, 286 Mixtures of Models Based on Vector Means, 288 Mixtures of Criteria, 289 General Equivalence Theorem for Mixtures of Criteria, 290 Mixtures of Criteria Based on Vector Means, 290 Weightings and Scalings, 292 Second-Degree versus Third-Degree Polynomial Fit Models, II, 293 Designs with Guaranteed Efficiencies, 296 General Equivalence Theorem for Guaranteed Efficiency Designs, 297 Model Discrimination, 298 Second-Degree versus Third-Degree Polynomial Fit Models, III, 299 Exercises, 302 304

12. Efficient Designs for Finite Sample Sizes 12.1. 12.2. 12.3. 12.4. 12.5. 12.6. 12.7. 12.8. 12.9. 12.10. 12.11. 12.12. 12.13. 12.14. 12.15. 12.16. Designs for Finite Sample Sizes, 304 Sample Size Monotonicity, 305 Multiplier Methods of Apportionment, 307 Efficient Rounding Procedure, 307 Efficient Design Apportionment, 308 Pairwise Efficiency Bound, 310 Optimal Efficiency Bound, 311 Uniform Efficiency Bounds, 312 Asymptotic Order O(n- l ), 314 Asymptotic Order O(n- 2 ), 315 Subgradient Efficiency Bounds, 317 Apportionment of D-Optimal Designs in Polynomial Fit Models, 320 Minimal Support and Finite Sample Size Optimality, 322 A Sufficient Condition for Completeness, 324 A Sufficient Condition for Finite Sample Size D-Optimality, 325 Finite Sample Size D-Optimal Designs in Polynomial Fit Models, 328 Exercises, 329



13. Invariant Design Problems 13.1. 13.2. 13.3. 13.4. 13.5. 13.6. 13.7. 13.8. 13.9. 13.10. 13.11. 13.12. Design Problems with Symmetry, 331 Invariance of the Experimental Domain, 335 Induced Matrix Group on the Regression Range, 336 Congruence Transformations of Moment Matrices, 337 Congruence Transformations of Information Matrices, 338 Invariant Design Problems, 342 Invariance of Matrix Means, 343 Invariance of the D-Criterion, 344 Invariant Symmetric Matrices, 345 Subspaces of Invariant Symmetric Matrices, 346 The Balancing Operator, 348 Simultaneous Matrix Improvement, 349 Exercises, 350

331

14. Kiefer Optimality 14.1. 14.2. 14.3. 14.4. 14.5. 14.6. 14.7. 14.8. 14.9. 14.10. Matrix Majorization, 352 The Kiefer Ordering of Symmetric Matrices, 354 Monotonic Matrix Functions, 357 Kiefer Optimality, 357 Heritability of Invariance, 358 Kiefer Optimality and Invariant Loewner Optimality, 360 Optimality under Invariant Information Functions, 361 Kiefer Optimality in Two-Way Classification Models, 362 Balanced Incomplete Block Designs, 366 Optimal Designs for a Linear Fit over the Unit Cube, 372 Exercises, 379

352

15. Rotatability and Response Surface Designs 15.1. 15.2. 15.3. 15.4. 15.5. 15.6. 15.7. 15.8. Response Surface Methodology, 381 Response Surfaces, 382 Information Surfaces and Moment Matrices, 383 Rotatable Information Surfaces and Invariant Moment Matrices, 384 Rotatability in Multiway Polynomial Fit Models, 384 Rotatability Determining Classes of Transformations, 385 First-Degree Rotatability, 386 Rotatable First-Degree Symmetric Matrices, 387

381



15.9. 15.10. 15.11. 15.12. 15.13. 15.14. 15.15. 15.16. 15.17. 15.18. 15.19. 15.20. 15.21.

Rotatable First-Degree Moment Matrices, 388 Kiefer Optimal First-Degree Moment Matrices, 389 Two-Level Factorial Designs, 390 Regular Simplex Designs, 391 Kronecker Products and Vectorization Operator, 392 Second-Degree Rotatability, 394 Rotatable Second-Degree Symmetric Matrices, 396 Rotatable Second-Degree Moment Matrices, 398 Rotatable Second-Degree Information Surfaces, 400 Central Composite Designs, 402 Second-Degree Complete Classes of Designs, 403 Measures of Rotatability, 405 Empirical Model-Building, 406 Exercises, 406 408

Comments and References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. Experimental Designs in Linear Models, 408 Optimal Designs for Scalar Parameter Systems, 410 Information Matrices, 410 Loewner Optimality, 412 Real Optimality Criteria, 412 Matrix Means, 413 The General Equivalence Theorem, 414 Optimal Moment Matrices and Optimal Designs, 417 D-, A-, E-, T-Optimality, 418 Admissibility of Moment and Information Matrices, 421 Bayes Designs and Discrimination Designs, 422 Efficient Designs for Finite Sample Sizes, 424 Invariant Design Problems, 425 Kiefer Optimality, 426 Rotatability and Response Surface Designs, 426

Biographies 1. Charles Loewner 1893-1968, 428 2. Gustav Elfving 1908-1984, 430 3. Jack Kiefer 1924-1981, 430 Bibliography Index

428

432 448

Preface to the Classics Edition

Research into the optimality theory of the design of statistical experiments originated around 1960. The first papers concentrated on one specific optimality criterion or another. Before long, when interrelations between these criteria were observed, the need for a unified approach emerged. Invoking tools from convex optimization theory, the optimal design problem is indeed amenable to a fairly complete solution. This is the topic of Optimal Design of Experiments, and over the years the material developed here has proved comprehensive, useful, and stable. It is a pleasure to see the book reprinted in the SIAM Classics in Applied Mathematics series. Ever since the inception of optimal design theory, the determinant of the moment matrix of a design was recognized as a very specific criterion function. In fact, determinant optimality in polynomial fit models permits an analysis other than the one presented here, based on canonical moments and classical polynomials. This alternate part of the theory is developed by H. DETTE and W.J. STUDDEN in their monograph The Theory of Canonical Moments with Applications in Statistics, Probability, and Analysis, and the references listed there complement and update the bibliography given here. Since the book's initial publication in 1993, its results have been put to good use in deriving optimal designs on the circle, optimal mixture designs, or optimal designs in other linear statistical models. However, many practical design problems of applied statistics are inherently nonlinear. Even then, local linearization may open the way to apply the present results, thus aiding in identifying good, practical designs.

FRIEDRICH PUKELSHEIM

Augsburg, Germany October 2005



Preface

... dans ce meilleur des [modèles] possibles ... tout est au mieux.
Candide (1759), Chapitre I, VOLTAIRE

The working title of the book was a bit long, Optimality Theory of Experimental Designs in Linear Models, but focused on two pertinent points. The setting is the linear model, the simplest statistical model, where the results are strongest. The topic is design optimality, de-emphasizing the issue of design construction. A more detailed Outline of the Book follows the Contents. The design literature is full of fancy nomenclature. In order to circumvent expert jargon I mainly speak of a design being φ-optimal for K'θ in Ξ, that is, being optimal under an information function φ, for a parameter system of interest K'θ, in a class of competing designs. The only genuinely new notions that I introduce are Loewner optimality (because it refers to the Loewner matrix ordering) and Kiefer optimality (because it pays due homage to the man who was a prime contributor to the topic). The design problems originate from statistics, but are solved using special tools from linear algebra and convex analysis, such as the information matrix mapping of Chapter 3 and the information functions of Chapter 5. I have refrained from relegating these tools into a set of appendices, at the expense of some slowing of the development in the first half of the book. Instead, the auxiliary material is developed as needed, and it is hoped that the exposition conveys some of the fascination that grows out of merging three otherwise distinct mathematical disciplines. The result is a unified optimality theory that embraces an amazingly wide variety of design problems. My aim is not encyclopedic coverage, but rather to outline typical settings such as D-, A-, and E-optimal polynomial regression designs, Bayes designs, designs for model discrimination, balanced incomplete block designs, or rotatable response surface designs. Pulling together formerly separate entities to build a greater community will always face opponents who fear an assault on their way of thinking. On the contrary, my intention is constructive, to generate a frame for those design problems that share



a common goal. The goal of investigating optimal, theoretical designs is to provide a gauge for identifying efficient, practical designs. Il meglio è l'inimico del bene.
Dictionnaire Philosophique (1770), Art Dramatique, VOLTAIRE

ACKNOWLEDGMENTS The writing of this book became a pleasure when I began experiencing encouragement from so many friends and colleagues, ranging from good advice of how to survive a book project, to the tedious work of weeding out wrong theorems. Above all I would like to thank my Augsburg colleague Norbert Gaffke who, with his vast knowledge of the subject, helped me several times to overcome paralyzing deadlocks. The material of the book called for a number of research projects which I could only resolve by relying on the competence and energy of my co-authors. It is a privilege to have cooperated with Norman Draper, Sabine Rieder, Jim Rosenberger, Bill Studden, and Ben Torsney, whose joint efforts helped shape Chapters 15, 12, 11, 9, 8, respectively. Over the years, the manuscript has undergone continuous mutations, as a reaction to the suggestions of those who endured the reading of the early drafts. For their constructive criticism I am grateful to Ching-Shui Cheng, Holger Dette, Berthold Heiligers, Harold Henderson, Olaf Krafft, Rudolf Mathar, Wolfgang Nather, Ingram Olkin, Andrej Pazman, Norbert Schmitz, Shayle Searle, and George Styan. The additional chores of locating typos, detecting doubly used notation, and searching for missing definitions was undertaken by Markus Abt, Wolfgang Bischoff, Kenneth Nordstrom, Ingolf Terveer, and the students of various classes I taught from the manuscript. Their labor turned a manuscript that initially was everywhere dense in error into one which I hope is finally everywhere dense in content. Adalbert Wilhelm carried out most of the calculations for the numerical examples; Inge Dotsch so cheerfully kept retyping what seemed in final form. Ingo Eichenseher and Gerhard Wilhelms contributed the public domain postscript driver dvilw to produce the exhibits. Sol Feferman, Timo Makelainen, and Dooley Kiefer kindly provided the photographs of Loewner, Elfving, and Kiefer in the Biographies. To each I owe a debt of gratitude. Finally I wish to acknowledge the support of the Volkswagen-Stiftung, Hannover, for supporting sabbatical leaves with the Departments of Statistics at Stanford University (1987) and at Penn State University (1990), and granting an Akademie-Stipendium to help finish the project. FRIEDRICH PUKELSHEIM
Augsburg, Germany December 1992

List of Exhibits

1.1 1.2 1.3 1.4 1.5 1.6 1.7 2.1 2.2 2.3 2.4 2.5

The statistical linear model, 3 Convex cones in the plane R2, 11 Orthogonal decompositions induced by a linear mapping, 14 Orthogonal and oblique projections, 24 An experimental design worksheet, 28 A worksheet with run order randomized, 28 Experimental domain designs, and regression range designs, 32 The ice-cream cone, 38 Two Elfving sets, 43 Cylinders, 45 Supporting hyperplanes to the Elfving set, 50 Euclidean balls inscribed in and circumscribing the Elfving set, 55

3.1 ANOVA decomposition, 71 3.2 Regularization of the information matrix mapping, 81 3.3 Discontinuity of the information matrix mapping, 84 4.1 Penumbra, 108 5.1 Unit level sets, 121

6.1 Conjugate numbers, p + q = pq, 148 7.1 Subgradients, 159 7.2 Normal vectors to a convex set, 160 7.3 A hierarchy of equivalence theorems, 178



8.1 Support points for a linear fit over the unit square, 194 9.1 The Legendre polynomials up to degree 10, 214 9.2 Polynomial fits over [-1; 1]: -optimal designs for 0 in T, 218 9.3 Polynomial fits over [1;!]: -optimal designs for 6 in 219 9.4 Histogram representation of the design , 220 9.5 Fifth-degree arcsin support, 220 9.6 Polynomial fits over [-1;1]: -optimal designs for 6 in T, 224 9.7 Polynomial fits over [-1;1]: -optimal designs for 6 in 225 9.8 The Chebyshev polynomials up to degree 10, 226 9.9 Lagrange polynomials up to degree 4, 228 9.10 E-optimal moment matrices, 233 9.11 Polynomial fits over [-1;1]: -optimal designs for 8 in T, 236 9.12 Arcsin support efficiencies for individual parameters 240 10.1 10.2 10.3 11.1 Cuts of a convex set, 254 Line projections and admissibility, 259 Cylinders and admissibility, 261 Discrimination between a second- and a third-degree model, 301 Quota method under growing sample size, 306 Efficient design apportionment, 310 Asymptotic order of the E-efficiency loss, 317 Asymptotic order of the D-efficiency loss, 322 Nonoptimality of the efficient design apportionment, 323 Optimality of the efficient design apportionment, 329 Eigenvalues of moment matrices of symmetric three-point designs, 334 The Kiefer ordering, 355 Some 3x6 block designs for 12 observations, 370

12.1 12.2 12.3 12.4 12.5 12.6 13.1

14.1 14.2



14.3 14.4 15.1

Uniform vertex designs, 373 Admissible eigenvalues, 375 Eigenvalues of moment matrices of central composite designs, 405

Interdependence of Chapters
1 Experimental Designs in Linear Models 2 Optimal Designs for Scalar Parameter Systems

3 Information Matrices

4 Loewner Optimality

5 Real Optimality Criteria

6 Matrix Means

7 The General Equivalence Theorem

8 Optimal Moment Matrices and Optimal Designs

12 Efficient Designs for Finite Sample Sizes

9 D-, A-, E-, T-Optimality

13 Invariant Design Problems

10 Admissibility of Moment and Information Matrices

14 Kiefer Optimality

11 Bayes Designs and Discrimination Designs



15 Rotatability and Response Surface Designs

Outline of the Book

CHAPTERS 1, 2, 3, 4: LINEAR MODELS AND INFORMATION MATRICES Chapters 1 and 3 are basic. Chapter 1 centers around the Gauss-Markov Theorem, not only because it justifies the introduction of designs and their moment matrices in Section 1.24. Equally important, it permits us to define in Section 3.2 the information matrix for a parameter system of interest K'θ in a way that best supports the general theory. The definition is extended to rank deficient coefficient matrices K in Section 3.21. Because of the dual purpose the Gauss-Markov Theorem is formulated as a general result of matrix algebra. First results on optimal designs are presented in Chapter 2, for parameter subsystems that are one-dimensional, and in Chapter 4, in the case where optimality can be achieved relative to the Loewner ordering among information matrices. (This is rare, see Section 4.7.) These results also follow from the General Equivalence Theorem in Chapter 7, whence Chapters 2 and 4 are not needed for their technical details.

CHAPTERS 5,6: INFORMATION FUNCTIONS Chapters 5 and 6 are reference chapters, developing the concavity properties of prospective optimality criteria. In Section 5.8, we introduce information functions which by definition are required to be positively homogeneous, superadditive, nonnegative, nonconstant, and upper semicontinuous. Information functions submit themselves to pleasing functional operations (Section 5.11), of which polarity (Section 5.12) is crucial for the sequel. The most important class of information functions are the matrix means φ_p with parameter p ∈ [-∞; 1]. They are the topic of Chapter 6, starting from the classical D-, A-, E-criterion as the special cases p = 0, p = -1, p = -∞, respectively.



CHAPTERS 7, 8,12: OPTIMAL APPROXIMATE DESIGNS AND EFFICIENT DISCRETE DESIGNS The General Equivalence Theorem 7.14 is the key result of optimal design theory, offering necessary and sufficient conditions for a design's moment matrix M to be -optimal for K' in M. The generic result of this type is due to Kiefer and Wolfowitz (1960), concerning D-optimality for 6 in M . The present theorem is more general in three respects, in allowing for the competing moment matrices to form a set M which is compact and convex, rather than restricting attention to the largest possible set M of all moment matrices, in admitting parameter subsystems K' rather than concentrating on the full parameter vector 6, and in permitting as optimality criterion any information function , rather than restricting attention to the classical D-criterion. Specifying these quantitites gives rise to a number of corollaries which are discussed in the second half of Chapter 7. The first half is a self-contained exposition of arguments which lead to a proof of the General Equivalence Theorem, based on subgradients and normal vectors to a convex set. Duality theory of convex analysis might be another starting point; here we obtain a duality theorem as an intermediate step, as Theorem 7.12. Yet another approach would be based on directional derivatives; however, their calculus is quite involved when it comes to handling a composition C like the one underlying the optimal design problem. Chapter 8 deals with the practical consequences which the General Equivalence Theorem implies about the support points xi, and the weights w, of an optimal design The theory permits a weight w, to be any real number between 0 and 1, prescribing the proportion of observations to be drawn under xi. In contrast, a design for sample size n replaces wi by an integer n,-, as the replication number for xi. In Chapter 12 we propose the efficient design apportionment as a systematic and easy way to pass from wi, to ni. This discretization procedure is the most efficient one, in the sense of Theorem 12.7. For growing sample size AX, the efficiency loss relative to the optimal design stays bounded of asymptotic order n - 1 ; in the case of differentiability, the order improves to n- 2 . CHAPTERS 9,10,11: INSTANCES OF DESIGN OPTIMALITY D-, A-, and E-optimal polynomial regression designs over the interval [1; 1] are characterized and exhibited in Chapter 9. Chapter 10 discusses admissibility of the moment matrix of a polynomial regression design, and of the contrast information matrix of a block design in a two-way classification model. Prominent as these examples may be, it is up to Chapter 11 to exploit the power of the General Equivalence Theorem to its fullest. Various sets of competing moment matrices are considered, such as Ma for Bayes designs, M(a[a;b]) for designs with bounded weights, M(m) for mix-



ture model designs, {(M,... ,M): M M} for mixture criteria designs, and for designs with guaranteed efficiencies. And they are evaluated using an information function that is a composition of a set of m information functions, together with an information function on the nonnegative orthant Rm. CHAPTERS 13,14,15: OPTIMAL INVARIANT DESIGNS As with other statistical problems, invariance considerations can be of great help in reducing the dimensionality and complexity of the general design problem, at the expense of handling some additional theoretical concepts. The foundations are laid in Chapter 13, investigating various groups and their actions as they pertain to an experimental domain design r, a regression range design a moment matrix M(), an information matrix C/c(M), or an information function (C). The idea of "increased symmetry" or "greater balancedness" is captured by the matrix majorization ordering of Section 14.1. This concept is brought together with the Loewner matrix ordering to create the Kiefer ordering of Section 14.2: An information matrix C is at least as good as another matrix D, C > D, when relative to the Loewner ordering, C is above some intermediate matrix which is majorized by D, The concept is due to Kiefer (1975) who introduced it in a block design setting and called it universal optimality. We demonstrate its usefulness with balanced incomplete block designs (Section 14.9), optimal designs for a linear fit over the unit cube (Section 14.10), and rotatable designs for response surface methodology (Chapter 15). The final Comments and References include historical remarks and mention the relevant literature. I do not claim to have traced every detail to its first contributor and I must admit that the book makes no mention of many other important design topics, such as numerical algorithms, orthogonal arrays, mixture designs, polynomial regression designs on the cube, sequential and adaptive designs, designs for nonlinear models, robust designs, etc.


Errata
Page Line Text Correction Section 1.25 1/2 and 1/6 B~ B K , B k B ~ \X\ = C j E >0, GKCDCK'G' + F : s x (s k) d s

31 32 91 156 157 169 169 203 217 222

Section 1.26 + 12 Exh. 1.7 lower right: interchange B~B, BB~ -11 -2 X = \C\ i +11 E >0, +13 GKCDCK'G' : -7 sxk -12 _7 i [in denominator] r [in numerator] -8

241 270 330 347 357 361 378 390

_2 +4 +3 -7 + 15 + 11 +9 +13,-3
Exhibit 9.4 K NND(s)
a(jk)
Il m

Exhibit 9.2 Ks NND(k)

b(jk)
lI 1+m



OPTIMAL DESIGN OF EXPERIMENTS


CHAPTER 1

Experimental Designs in Linear Models

This chapter provides an introduction to experimental designs for linear models. Two linear models are presented. The first is classical, having a dispersion structure in which the dispersion matrix is proportional to the identity matrix. The second model is more general, with a dispersion structure that does not impose any rank or range assumptions. The Gauss-Markov Theorem is formulated to cover the general model. The classical model provides the setting to introduce experimental designs and their moment matrices. Matrix algebra is reviewed as needed, with particular emphasis on nonnegative definite matrices, projectors, and generalized inverses. The theory is illustrated with two-way classification models, and models for a line fit, parabola fit, and polynomial fit.

1.1. DETERMINISTIC LINEAR MODELS Many practical and theoretical problems in science treat relationships of the type

y = g(t, θ),

where the observed response or yield, y, is thought of as a particular value of a real-valued model function or response function, g, evaluated at the pair of arguments (t, θ). This decomposition reflects the distinctive role of the two arguments: The experimental conditions t can be freely chosen by the experimenter from a given experimental domain T, prior to running the experiment. The parameter system θ is assumed to lie in a parameter domain Θ, and is not known to the experimenter. This is paraphrased by saying that the experimenter controls t, whereas "nature" determines θ. The choice of the function g is central to the model-building process. One


of the simplest relationships is the deterministic linear model

y = f(t)'θ,

where f(t) = (f_1(t), ..., f_k(t))' and θ = (θ_1, ..., θ_k)' are vectors in k-dimensional Euclidean space R^k. All vectors are taken to be column vectors, a prime indicates transposition. Hence f(t)'θ is the usual Euclidean scalar product, f(t)'θ = Σ_{j≤k} f_j(t) θ_j. Linearity pertains to the parameter system θ, not to the experimental conditions t. Linearity shifts the emphasis from the model function g to the regression function f. Assuming that the experimenter knows both the regression function f and the experimental conditions t, a compact notation results upon introducing the k x 1 regression vector x = f(t), and the regression range X = {f(t) : t ∈ T} ⊆ R^k. From an applied point of view the experimental domain T plays a more primary role than the regression range X, but the latter is expedient for a consistent development. The linear model, in its deterministic form discussed so far, thus takes the simple form y = x'θ.
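As a small numerical aside (not part of the original text), a deterministic line fit with the hypothetical regression function f(t) = (1, t)' can be evaluated directly; the parameter values below are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical line-fit example: regression function f(t) = (1, t)'.
def f(t):
    return np.array([1.0, t])

theta = np.array([2.0, 0.5])    # assumed parameter vector (intercept, slope)

# Deterministic responses y = x'theta with regression vector x = f(t).
for t in [0.0, 1.0, 2.0]:
    x = f(t)
    y = x @ theta
    print(f"t = {t}: y = {y}")
```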

1.2. STATISTICAL LINEAR MODELS In many experiments the response can be observed only up to an additive random error e, distorting the model to become

y = x'θ + e.

In this model, the term e may subsume quite diverse sources of error, ranging from random errors resulting from inaccuracies in the measuring devices, to systematic errors that are due to inappropriateness of a model function g. Because of random error, repeated experimental runs typically lead to different observed responses, even if the regression vector x and the parameter system θ remain identical. Therefore any evaluation of the experiment can involve a statement on the distribution of the response only, rather than on any one of its specific values. A (statistical) linear model thus treats response and error as real-valued random variables Y and E, governed by a probability distribution P and satisfying the relationship

Y = x'θ + E.

A schematic arrangement of these quantities is presented in Exhibit 1.1.


EXHIBIT 1.1 The statistical linear model. The response Y decomposes into the deterministic mean effect x'θ plus the random error E.

1.3. CLASSICAL LINEAR MODELS WITH MOMENT ASSUMPTIONS

To proceed, we need to be more specific about the underlying distributional assumptions. For point estimation, the distributional assumptions solely pertain to expectation and variance relative to the underlying distribution P,

E_P[Y] = x'θ,    D_P[Y] = σ².

For this reason θ is called the mean parameter vector, while the model variance σ² > 0 provides an indication of the variability inherent in the observation Y. Another way of expressing this is to say that the random error E has mean value zero and variance σ², neither of which depends on the regression vector x nor on the parameter vector θ of the mean response. The k x 1 parameter vector θ and the scalar parameter σ² comprise a total of k + 1 unknown parameter components. Clearly, for any reasonable inference, the number n of observations must be at least equal to k + 1. We consider a set of n observations,

Y_i = x_i'θ + E_i    for i = 1, ..., n,

with possibly different regression vectors x_i in experimental run i. The joint distribution of the n responses Y_i is specified by assuming that they are uncorrelated. Considerable simplicity is gained by using vector notation. Let

Y = (Y_1, ..., Y_n)',    X = (x_1, ..., x_n)',    E = (E_1, ..., E_n)'

denote the n x 1 response vector Y, the n x k model matrix X, and the n x 1 error vector E, respectively. (Henceforth the random quantities Y and E are n x 1 vectors rather than scalars!) The (i,j)th entry x_ij of the matrix X is the


same as the jth component of the regression vector x_i, that is, the regression vector x_i appears as the ith row of the model matrix X. The model equation thus becomes

Y = Xθ + E.

With I_n as the n x n identity matrix, the model is succinctly represented by the expectation vector and dispersion matrix of Y,

E_P[Y] = Xθ,    D_P[Y] = σ² I_n,

and is termed the classical linear model with moment assumptions. In other words, the mean vector E_P[Y] is given by the linear relationship Xθ between the regression vectors x_1, ..., x_n and the parameter vector θ, while the dispersion matrix D_P[Y] is in its classical, that is, simplest, form of being proportional to the identity matrix.
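A minimal simulation sketch of the classical moment assumptions (my own illustration; the model matrix, parameter vector, and standard deviation below are assumed values, not taken from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4 x 2 model matrix: the rows are the regression vectors x_i'.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
theta = np.array([2.0, 0.5])    # assumed mean parameter vector
sigma = 0.3                     # assumed model standard deviation

# Classical model: E[Y] = X theta and D[Y] = sigma^2 I_n,
# realized here by adding uncorrelated errors with mean zero.
E = sigma * rng.standard_normal(X.shape[0])
Y = X @ theta + E
print("mean vector X theta:", X @ theta)
print("observed responses Y:", Y)
```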

1.4. CLASSICAL LINEAR MODELS WITH NORMALITY ASSUMPTION For purposes of hypothesis testing and interval estimation, assumptions on the first two moments do not suffice and the entire distribution of Y is required. Hence in these cases there is a need for a classical linear model with normality assumption,

Y ~ N_{Xθ, σ²I_n},

in which Y is assumed to be normally distributed with mean vector Xθ and dispersion matrix σ²I_n. If the model matrix X is known then the normal distribution P = N_{Xθ, σ²I_n} is determined by θ and σ². We display these parameters by writing E_{θ,σ²}[·] in place of E_P[·], etc. Moreover, the letter P soon signifies a projection matrix. 1.5. TWO-WAY CLASSIFICATION MODELS The two-sample problem provides a simple introductory example. Consider two populations with mean responses α_1 and α_2. The observed responses from the two populations are taken to have a common variance σ² and to be uncorrelated. With replications j = 1, ..., n_i for populations i = 1, 2 this yields a linear model

Y_ij = α_i + E_ij.


Assembling the components into n x 1 vectors, with n = n_1 + n_2, we get

Y = (Y_11, ..., Y_1n_1, Y_21, ..., Y_2n_2)' = Xθ + E.

Here the n x 2 model matrix X and the parameter vector θ are given by

X = ( 1_{n_1}   0
      0   1_{n_2} ),    θ = (α_1, α_2)',

where 1_m denotes the m x 1 vector with all entries equal to one,

with regression vectors x_1 = (1, 0)' and x_2 = (0, 1)' repeated n_1 and n_2 times. The experimental design consists of the replication numbers n_1 and n_2, telling the experimenter how many responses are to be observed from which population. It is instructive to identify the quantities of this example with those of the general theory. The experimental domain T is simply the two-element set {1,2} of population labels. The regression function takes values f(1) = (1,0)' and f(2) = (0,1)' in R^2, inducing the set X = {(1,0)', (0,1)'} as the regression range. The generalization from two to a populations leads to the one-way classification model. The model is still Y_ij = α_i + E_ij, but the subscript ranges turn into i = 1, ..., a and j = 1, ..., n_i. The mean parameter vector becomes θ = (α_1, ..., α_a)', and the experimental domain is T = {1, ..., a}. The regression function f maps i into the ith Euclidean unit vector e_i of R^a, with ith entry one and zeros elsewhere. Hence the regression range is X = {e_1, ..., e_a}. Further generalization is aided by a suitable terminology. Rather than speaking of different populations, i = 1, ..., a, we say that the "factor" population takes "levels" i = 1, ..., a. More factors than one occur in multiway classification models. The two-way classification model with no interaction may serve as a prototype. Suppose level i of a first factor "A" has mean effect α_i, while level j of a second factor "B" has mean effect β_j. Assuming that the two effects are additive, the model reads

Y_ijk = α_i + β_j + E_ijk,

with replications k = 1, ..., n_ij, for levels i = 1, ..., a of factor A and levels j = 1, ..., b of factor B. The design problem now consists of choosing the replication numbers n_ij. An extreme, but feasible, choice is n_ij = 0, that is, no observation is made with factor A on level i and factor B on level j. The


parameter vector θ is the k x 1 vector (α_1, ..., α_a, β_1, ..., β_b)', with k = a + b. The experimental domain is the discrete rectangle T = {1, ..., a} x {1, ..., b}. The regression function f maps (i,j) into the stacked vector (e_i', d_j')', where e_i is the ith Euclidean unit vector of R^a and d_j is the jth Euclidean unit vector of R^b. We return to this model in Section 1.27. So far, the experimental domain has been a finite set; next it is going to be an interval of the real line R. 1.6. POLYNOMIAL FIT MODELS Let us first look at a line fit model,

Y_ij = α + β t_i + E_ij.

Intercept α and slope β form the parameter vector θ of interest, whereas the experimental conditions t_i come from an interval T ⊆ R. For the sake of concreteness, we think of t ∈ T as a "dose level". The design problem then consists of determining how many and which dose levels t_1, ..., t_ℓ are to be observed, and how often. If the experiment calls for n_i replications of dose level t_i, the subscript ranges in the model are j = 1, ..., n_i for i = 1, ..., ℓ. Here the regression function has values f(t) = (1, t)', generating a line segment embedded in the plane R^2 as regression range X. The parabola fit model has mean response depending on the dose level quadratically,

Y_ij = α + β t_i + γ t_i² + E_ij.

This changes the regression function to f(t) = (1, t, t²)', and the regression range X turns into the segment of a parabola embedded in the space R^3. These are special instances of polynomial fit models of degree d ≥ 1, the model equation becoming

Y_ij = θ_0 + θ_1 t_i + θ_2 t_i² + ... + θ_d t_i^d + E_ij.

The regression range X is a one-dimensional curve embedded in R^k, with k = d + 1. Often the description of the experiment makes it clear that the experimental condition is a single real variable t; a linear model for a line fit (parabola fit, polynomial fit of degree d) is then referred to as a first-degree model (second-degree model, d th-degree model). This generalizes to the powerful class of m-way d th-degree polynomial fit models. In these models the experimental condition t = (t_1, ..., t_m)' has m components, that is, the experimental domain T is a subset of R^m, and the model function f(t)'θ is a polynomial of degree d in the m variables t_1, ..., t_m.
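As a quick illustration (my own sketch, with assumed dose levels), the regression vector f(t) = (1, t, ..., t^d)' of a d th-degree polynomial fit can be generated as follows:

```python
import numpy as np

def poly_regression_vector(t, d):
    # f(t) = (1, t, t^2, ..., t^d)' for a d th-degree polynomial fit model
    return np.array([t ** j for j in range(d + 1)])

d = 2                              # parabola fit: k = d + 1 = 3 parameters
for t in [-1.0, 0.0, 1.0]:         # assumed dose levels in T = [-1, 1]
    print(t, poly_regression_vector(t, d))
```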


For instance, a two-way third-degree model is given by

Y_ij = θ_0 + θ_1 t_i1 + θ_2 t_i2 + θ_11 t_i1² + θ_12 t_i1 t_i2 + θ_22 t_i2² + θ_111 t_i1³ + θ_112 t_i1² t_i2 + θ_122 t_i1 t_i2² + θ_222 t_i2³ + E_ij,

with ℓ experimental conditions t_i = (t_i1, t_i2)' ∈ T ⊆ R^2, and with subscript ranges j = 1, ..., n_i for i = 1, ..., ℓ. As a second example consider the three-way second-degree model

Y_ij = θ_0 + θ_1 t_i1 + θ_2 t_i2 + θ_3 t_i3 + θ_11 t_i1² + θ_12 t_i1 t_i2 + θ_13 t_i1 t_i3 + θ_22 t_i2² + θ_23 t_i2 t_i3 + θ_33 t_i3² + E_ij,

with ℓ experimental conditions t_i = (t_i1, t_i2, t_i3)' ∈ T ⊆ R^3, and with subscript ranges j = 1, ..., n_i for i = 1, ..., ℓ. Both models have ten mean parameters. The two examples illustrate saturated models because they feature every possible d th-degree power or cross product of the variables t_1, ..., t_m. In general, a saturated m-way d th-degree model has

(m + d choose d) = (m + d)! / (m! d!)

mean parameters. An instance of a nonsaturated two-way second-degree model is

Y_ij = θ_0 + θ_1 t_i1 + θ_2 t_i2 + θ_11 t_i1² + θ_22 t_i2² + E_ij,

with ℓ experimental conditions t_i = (t_i1, t_i2)' ∈ T ⊆ R^2, and with subscript ranges j = 1, ..., n_i for i = 1, ..., ℓ. The discussion of these examples is resumed in Section 1.27, after a proper definition of an experimental design.
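The parameter counts quoted above can be checked numerically; the following sketch (mine, not from the book) evaluates the binomial-coefficient formula for a saturated m-way d th-degree model.

```python
from math import comb

def saturated_parameter_count(m, d):
    # number of mean parameters in a saturated m-way d th-degree model
    return comb(m + d, d)

print(saturated_parameter_count(2, 3))   # two-way third-degree model: 10
print(saturated_parameter_count(3, 2))   # three-way second-degree model: 10
```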

1.7. EUCLIDEAN MATRIX SPACE In a classical linear model, interest concentrates on inference for the mean parameter vector θ. The performance of appropriate statistical procedures tends to be measured by dispersion matrices, moment matrices, or information matrices. This calls for a review of matrix algebra. All matrices used here are real. First let us recall that the trace of a square matrix is the sum of its diagonal entries. Hence a square matrix and its transpose have identical traces. Another important property is that, under the trace operator, matrices commute


provided they are conformable,

trace AB = trace BA    for all A ∈ R^{n x k} and B ∈ R^{k x n}.

We often apply this rule to quadratic forms given by a symmetric matrix A, in using x'Ax = trace Axx' = trace xx'A, as is convenient in a specific context. Let R^{n x k} denote the linear space of real matrices with n rows and k columns. The Euclidean matrix scalar product

(A, B) = trace A'B

turns R^{n x k} into a Euclidean space of dimension nk. For k = 1, we recover the Euclidean scalar product for vectors in R^n. The symmetry of scalar products, trace A'B = (A,B) = (B,A) = trace B'A, reproduces the property that a square matrix and its transpose have identical traces. Commutativity under the trace operator yields (A,B) = trace A'B = trace BA' = (B',A') = (A',B'), that is, transposition preserves the scalar products between the matrix spaces of reversed numbers of rows and columns, R^{n x k} and R^{k x n}. In general, although not always, our matrices have at least as many rows as columns. Since we have to deal with extensive matrix products, this facilitates a quick check that factors properly conform. It is also in accordance with writing vectors of Euclidean space as columns. Notational conventions that are similarly helpful are to choose Greek letters for unknown parameters in a statistical model, and to use uppercase and lowercase letters to discriminate between a random variable and any one of its values, and between a matrix and any one of its entries. Because of their role as dispersion operators, our matrices often are symmetric. We denote by Sym(k) the subspace of symmetric matrices, in the space R^{k x k} of all square, that is, not necessarily symmetric, matrices. Recall from matrix algebra that a symmetric k x k matrix A permits an eigenvalue decomposition

A = Σ_{j≤k} λ_j z_j z_j' = Z' Δ_λ Z.

The real numbers λ_1, ..., λ_k are the eigenvalues of A counted with their respective multiplicities, and the vectors z_1, ..., z_k ∈ R^k form an orthonormal system of eigenvectors. In general, such a decomposition fails to be unique, since if the eigenvalue λ_j has multiplicity greater than one then many choices for the eigenvectors z_j become feasible. The second representation of an eigenvalue decomposition, A = Z'Δ_λ Z, assembles the pertinent quantities in a slightly different way. We define the operator Δ_λ by requiring that it creates a diagonal matrix with the argument vector λ = (λ_1, ..., λ_k)' on the diagonal. The orthonormal vectors z_1, ..., z_k


form the k x k matrix Z' = (z_1, ..., z_k), whence Z' is an orthogonal matrix. The equality with the first representation now follows from

Z' Δ_λ Z = (z_1, ..., z_k) Δ_λ (z_1, ..., z_k)' = Σ_{j≤k} λ_j z_j z_j'.

Matrices that have matrices or vectors as entries, such as Z', are termed block matrices. They provide a convenient technical tool in many areas. The algebra of block matrices parallels the familiar algebra of matrices, and may be verified as needed. In the space Sym(k), the subsets of nonnegative definite matrices, NND(k), and of positive definite matrices, PD(k), are central to the sequel. They are defined through

NND(k) = { A ∈ Sym(k) : x'Ax ≥ 0 for all x ∈ R^k },
PD(k) = { A ∈ Sym(k) : x'Ax > 0 for all x ∈ R^k with x ≠ 0 }.
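As a numerical aside (my own sketch, not part of the original text), nonnegative definiteness of a symmetric matrix can be checked through its smallest eigenvalue:

```python
import numpy as np

def is_nnd(A, tol=1e-10):
    # A symmetric matrix lies in NND(k) exactly when its smallest eigenvalue is >= 0.
    return np.linalg.eigvalsh(A).min() >= -tol

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])    # eigenvalues 1 and 3: positive definite
B = np.array([[1.0, 2.0],
              [2.0, 1.0]])    # eigenvalues -1 and 3: indefinite
print(is_nnd(A), is_nnd(B))   # True False
```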

Of the many ways of characterizing nonnegative definiteness or positive definiteness, frequent use is made of the following. 1.8. NONNEGATIVE DEFINITE MATRICES Lemma. Let A be a symmetric k x k matrix with smallest eigenvalue λ_min(A). Then we have

A ∈ NND(k)  ⟺  λ_min(A) ≥ 0  ⟺  trace AB ≥ 0 for all B ∈ NND(k),
A ∈ PD(k)  ⟺  λ_min(A) > 0  ⟺  trace AB > 0 for all B ∈ NND(k) with B ≠ 0.

Proof. Assume A ∈ NND(k), and choose an eigenvector z ∈ R^k of norm 1 corresponding to λ_min(A). Then we obtain 0 ≤ z'Az = λ_min(A) z'z = λ_min(A). Now assume λ_min(A) ≥ 0, and choose an eigenvalue decomposition

A = Σ_{j≤k} λ_j z_j z_j'.



This yields trace AB = Σ_{j≤k} λ_j trace z_j z_j'B = Σ_{j≤k} λ_j z_j'Bz_j ≥ 0 for all B ∈ NND(k). To complete the circle, we verify x'Ax = trace Axx' ≥ 0 for all x ∈ R^k by choosing B = xx'. For positive definiteness the arguments follow the same lines upon observing that we have trace AB ≥ λ_min(A) trace B > 0 provided B ≠ 0. The set of all nonnegative definite matrices NND(k) has a beautiful geometrical shape, as follows. 1.9. GEOMETRY OF THE CONE OF NONNEGATIVE DEFINITE MATRICES Lemma. The set NND(k) is a cone which is convex, pointed, and closed, and has interior PD(k) relative to the space Sym(k). Proof. The proof is by first principles, recalling the definition of the properties involved. For A ∈ NND(k) and δ ≥ 0 evidently δA ∈ NND(k), thus NND(k) is a cone. Next for A, B ∈ NND(k) we clearly have A + B ∈ NND(k) since

x'(A + B)x = x'Ax + x'Bx ≥ 0    for all x ∈ R^k.

Because NND(k) is a cone, we may replace A by (1 - α)A and B by αB, where α lies in the open interval (0;1). Hence given any two matrices A and B the set NND(k) also includes the straight line (1 - α)A + αB from A to B, and this establishes convexity. If A ∈ NND(k) and also -A ∈ NND(k), then A = 0, whence the cone NND(k) is pointed. The remaining two properties, that NND(k) is closed and has PD(k) for its interior, are topological in nature. Let

ℬ = { B ∈ Sym(k) : trace B² ≤ 1 }

be the closed unit ball in Sym(k) under the Euclidean matrix scalar product. Replacing B ∈ Sym(k) by an eigenvalue decomposition Σ_j λ_j y_j y_j' yields trace B² = Σ_j λ_j²; thus B ∈ ℬ has eigenvalues λ_j satisfying |λ_j| ≤ 1. It follows that B ∈ ℬ fulfills x'Bx ≤ |x'Bx| ≤ Σ_j |λ_j| (x'y_j)² ≤ x'(Σ_j y_j y_j')x = x'x for all x ∈ R^k. A set is closed when its complement is open. Therefore we pick an arbitrary k x k matrix A which is symmetric but fails to be nonnegative definite. By definition, x'Ax < 0 for some vector x ∈ R^k. Define δ = -x'Ax/(2x'x) > 0. For every matrix B ∈ ℬ, we then have

x'(A + δB)x ≤ x'Ax + δ x'x = (1/2) x'Ax < 0.



EXHIBIT 1.2 Convex cones in the plane R^2. Left: the linear subspace generated by x ∈ R^2 is a closed convex cone that is not pointed. Right: the open convex cone generated by x, y ∈ R^2, together with the null vector, forms a pointed cone that is neither open nor closed.

Thus the set A + δℬ is included in the complement of NND(k), and it follows that the cone NND(k) is closed. Interior points are identified similarly. Let A ∈ int NND(k), that is, A + δℬ ⊆ NND(k) for some δ > 0. If x ≠ 0 then the choice B = -xx'/x'x ∈ ℬ leads to

0 ≤ x'(A + δB)x = x'Ax - δ x'x.

Hence every matrix A interior to NND(k) is positive definite, int NND(k) ⊆ PD(k). It remains to establish the converse inclusion. Every matrix A ∈ PD(k) has 0 < λ_min(A) = δ, say. For B ∈ ℬ and x ∈ R^k, we obtain x'Bx ≥ -x'x, and

x'(A + δB)x ≥ x'Ax - δ x'x ≥ λ_min(A) x'x - δ x'x = 0.

Thus A + δℬ ⊆ NND(k) shows that A is interior to NND(k). There are, of course, convex cones that are not pointed but closed, or pointed but not closed, or neither pointed nor closed. Exhibit 1.2 illustrates two such instances in the plane R^2. 1.10. THE LOEWNER ORDERING OF SYMMETRIC MATRICES True beauty shines in many ways, and order is one of them. We prefer to view the closed cone NND(k) of nonnegative definite matrices through the

partial ordering ≥, defined on Sym(k) by

A ≥ B  ⟺  A - B ∈ NND(k),

which has come to be known as the Loewner ordering of symmetric matrices. The notation B ≤ A in place of A ≥ B is self-explanatory. We also define the closely related variant > by

A > B  ⟺  A - B ∈ PD(k),
which is based on the open cone of positive definite matrices. The geometric properties of the set NND(k) of being conic, convex, pointed, and closed, translate into related properties for the Loewner ordering:

The third property in this list says that the Loewner ordering is antisymmetric. In addition, it is reflexive and transitive,

Hence the Loewner ordering enjoys the three properties that constitute a partial ordering. For scalars, that is, k = 1, the Loewner ordering reduces to the familiar total ordering of the real line. Or the other way around, the total ordering ≥ of the real line R is extended to the partial ordering ≥ of the matrix spaces Sym(k), with k > 1. The crucial distinction is that, in general, two matrices may not be comparable. An example is furnished by

for which neither A ≥ B nor B ≥ A holds true. Order relations always call for a study of monotonic functions.
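To make the non-comparability concrete, here is a minimal numerical sketch. The two matrices below are illustrative choices, not necessarily the pair displayed above; the test checks A ≥ B by asking whether the smallest eigenvalue of A - B is nonnegative.

```python
import numpy as np

def loewner_geq(A, B, tol=1e-12):
    """Check A >= B in the Loewner ordering: A - B nonnegative definite."""
    return np.linalg.eigvalsh(A - B).min() >= -tol

# Illustrative pair of symmetric 2x2 matrices that are not comparable.
A = np.array([[1.0, 0.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [0.0, 1.0]])

print(loewner_geq(A, B))  # False: A - B has eigenvalue -1
print(loewner_geq(B, A))  # False: B - A has eigenvalue -1
```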
1.11. MONOTONIC MATRIX FUNCTIONS

We consider functions that have a domain of definition and a range that are equipped with partial orderings. Such functions are called isotonic when they

are order preserving, and antitonic when they are order reversing. A function is called monotonic when it is isotonic or antitonic. Two examples may serve to illustrate these concepts. A first example is supplied by a linear form A ↦ trace AB on Sym(k), determined by a matrix B ∈ Sym(k). If this linear form is isotonic relative to the Loewner ordering, then A ≥ 0 implies trace AB ≥ 0, and Lemma 1.8 proves that the matrix B is nonnegative definite. Conversely, if B is nonnegative definite and A ≥ C, then again Lemma 1.8 yields trace(A - C)B ≥ 0, that is, trace AB ≥ trace CB. Thus a linear form A ↦ trace AB is isotonic relative to the Loewner ordering if and only if B is nonnegative definite. In particular the trace itself is isotonic, A ↦ trace A, as follows with B = I_k. It is an immediate consequence that the Euclidean matrix norm ||A|| = (trace A²)^{1/2} is an isotonic function from the closed cone NND(k) into the real line. For if A ≥ B ≥ 0, then we have

As a second example, matrix inversion A ↦ A^{-1} is claimed to be an antitonic mapping from the open cone PD(k) into itself. For if A ≥ B > 0 then we get

Pre- and postmultiplication by A^{-1} gives A^{-1} ≤ B^{-1}, as claimed. A minimization problem relative to the Loewner ordering is taken up in the Gauss-Markov Theorem 1.19. Before turning to this topic, we review the role of matrices when they are interpreted as linear mappings. 1.12. RANGE AND NULLSPACE OF A MATRIX A rectangular matrix A ∈ R^{n×k} may be identified with a linear mapping carrying x ∈ R^k into Ax ∈ R^n. Its range or column space, and its nullspace or kernel are
range A = {Ax : x ∈ R^k} ⊆ R^n,   nullspace A = {x ∈ R^k : Ax = 0} ⊆ R^k.
The range is a subspace of the image space R^n. The nullspace is a subspace of the domain of definition R^k. The rank and nullity of A are the dimensions of the range of A and of the nullspace of A, respectively. If the matrix A is symmetric, then its rank coincides with the number of nonvanishing eigenvalues, and its nullity is the number of vanishing eigenvalues. Symmetry involves transposition, and transposition indicates the presence of a scalar product (because A' is the unique matrix B that satisfies

EXHIBIT 1.3 Orthogonal decompositions induced by a linear mapping. Range and nullspace of a matrix A ∈ R^{n×k} and of its transpose A' orthogonally decompose the domain of definition R^k and the image space R^n.

(Ax, y) = (x, By) for all x, y). In fact, Euclidean geometry provides the following vital connection: the nullspace of the transpose of a matrix is the orthogonal complement of its range. Let

L^⊥ = {x ∈ R^n : x'y = 0 for all y ∈ L}

denote the orthogonal complement of a subspace L of the linear space R^n.


1.13. TRANSPOSITION AND ORTHOGONALITY

Lemma. Let A be an n x k matrix. Then we have
nullspace A' = (range A)^⊥.
Proof.

A few transcriptions establish the result:

Replacing A' by A yields nullspace A = (range A')^⊥. Thus any n × k matrix A comes with two orthogonal decompositions, of the domain of definition R^k, and of the image space R^n. See Exhibit 1.3.
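As a quick numerical illustration of Lemma 1.13 (the matrix below is an arbitrary example, not one taken from the text), the following sketch computes an orthonormal basis of nullspace A' from the singular value decomposition and checks that it is orthogonal to the range of A:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [0.0, 1.0]])              # an arbitrary 3 x 2 example, rank 2

# Orthonormal basis of nullspace(A') from the SVD of A:
U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-12)                    # rank of A
N = U[:, r:]                             # columns spanning (range A)^perp = nullspace(A')

print(np.allclose(A.T @ N, 0.0))         # True: A'z = 0 for every such z
print(r + N.shape[1] == A.shape[0])      # True: rank plus nullity of A' equals n
```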

1.14. SQUARE ROOT DECOMPOSITIONS OF A NONNEGATIVE DEFINITE MATRIX As a first application of Lemma 1.13 we investigate square root decompositions of nonnegative definite matrices. If V is a nonnegative definite n x n matrix, a representation of the form
V = UU'
is called a square root decomposition of V, and U is called a square root of V. Various such decompositions are easily obtained from an eigenvalue decomposition

For instance, a feasible choice is U = (√λ₁ z₁, ..., √λ_n z_n) ∈ R^{n×n}. If V has nonvanishing eigenvalues λ₁, ..., λ_k, other choices are U = (√λ₁ z₁, ..., √λ_k z_k) ∈ R^{n×k}; here V = UU' is called a full rank decomposition for the reason that the square root U has full column rank. Every square root U of V has the same range as V, that is,
range U = range V.
To prove this, we use Lemma 1.13, in that the ranges of V and U coincide if and only if the nullspaces of V and U' are the same. But U 'z = 0 clearly implies Vz = 0. Conversely, Vz = 0 entails 0 = z'Vz = z'UU'z = (U'z)'(U'z), and thus forces U'z = 0. The range formula, for every n x k matrix X,
range X'VX = range X'V,
is a direct consequence of a square root decomposition V = UU', since range V = range U implies range X'V = range X'U = range X'UU'X = range X'VX. Another application of Lemma 1.13 is to clarify the role of mean vectors and dispersion matrices in linear models. 1.15. DISTRIBUTIONAL SUPPORT OF LINEAR MODELS Lemma. Let Y be an n × 1 random vector with mean vector μ and dispersion matrix V. Then we have with probability 1,
Y ∈ μ + range V,
that is, the distribution of Y is concentrated on the affine subspace that results if the linear subspace range V is shifted by the vector μ. Proof. The assertion is true if V is positive definite. Otherwise we must show that Y - μ lies in the proper subspace range V with probability 1. In view of Lemma 1.13, this is the same as Y - μ ⊥ nullspace V with probability 1. Here nullspace V may be replaced by any finite set {z₁, ..., z_k} of vectors spanning it. For each j = 1, ..., k we obtain

thus Y - μ ⊥ z_j with probability 1. The exceptional nullsets may depend on the subscript j, but their union produces a global nullset outside of which Y - μ is orthogonal to {z₁, ..., z_k}, as claimed. In most applications the mean vector μ is a member of the range of V. Then the affine subspace μ + range V equals range V and is actually a linear subspace, so that Y falls into the range of V with probability 1. In a classical linear model as expounded in Section 1.3, the mean vector μ is of the form Xθ with unknown parameter system θ. Hence the containment μ = Xθ ∈ range V holds true for all vectors θ provided
range X ⊆ range V.
Such range inclusion conditions deserve careful study as they arise in many places. They are best dealt with using projectors, and projectors are natural companions of generalized inverse matrices. 1.16. GENERALIZED MATRIX INVERSION AND PROJECTIONS For a rectangular matrix A ∈ R^{n×k}, any matrix G ∈ R^{k×n} fulfilling AGA = A is called a generalized inverse of A. The set of all generalized inverses of A,
A^- = {G ∈ R^{k×n} : AGA = A},
is an affine subspace of the matrix space R^{k×n}, being the solution set of an inhomogeneous linear matrix equation. If a relation is invariant to the choice of members in A^-, then we often replace the matrix G by the set A^-. For instance, the defining property may be written as AA^-A = A. A square and nonsingular matrix A has its usual inverse A^{-1} for its unique generalized inverse, A^- = {A^{-1}}. In this sense generalized matrix inversion is a generalization of regular matrix inversion. Our explicit convention of treating A^- as a set of matrices is a bit unusual, even though it is implicit in all of the work on generalized matrix

inverses. Namely, often only results that are invariant to the specific choice of a generalized inverse are of interest. For example, in the following lemma, the product X'GX is the same for every generalized inverse G of V. We indicate this by inserting the set V^- in place of the matrix G. However, the central optimality result for experimental designs is of opposite type. The General Equivalence Theorem 7.14 states that a certain property holds true, not for every, but for some generalized inverse. In fact, the theorem becomes false if this point is missed. Our notation helps to alert us to this pitfall. A matrix P ∈ R^{n×n} is called a projector onto a subspace K ⊆ R^n when P is idempotent, that is, P² = P, and has K for its range. Let us verify that the following characterizing interrelation between generalized inverses and projectors holds true:
G ∈ A^-  ⟺  AG is a projector onto range A.
For the direct part, note first that AG is idempotent. Moreover the inclusions

show that the range of AG and the range of A coincide. For the converse part, we use that the projector AG has the same range as the matrix A. Thus every vector Ax with x ∈ R^k has a representation AGy with y ∈ R^n, whence AGAx = AGAGy = AGy = Ax. Since x can be chosen arbitrarily, this establishes AGA = A. The intimate relation between range inclusions and projectors, alluded to in Section 1.15, can now be made more explicit. 1.17. RANGE INCLUSION LEMMA Lemma. Let X be an n × k matrix and V be an n × s matrix. Then we have
range X ⊆ range V  ⟺  VV^-X = X.
If range X ⊆ range V and V is a nonnegative definite n × n matrix, then the product
X'V^-X
does not depend on the choice of generalized inverse for V, is nonnegative definite, and has the same range as X' and the same rank as X. Proof. The range of X is included in the range of V if and only if X = VW for some conformable matrix W. But then VGX = VGVW = VW = X

for all G ∈ V^-. Conversely we may assume the slightly weaker property that VGX = X for at least one generalized inverse G of V. Clearly this is enough to make sure that the range of X is included in the range of V. Now let V be nonnegative definite, in addition to X = VW. Then the matrix X'GX = W'VGVW = W'VW is the same for all choices G ∈ V^-, and is nonnegative definite. Furthermore the ranks of X'V^-X = W'VW and VW = X are equal. In particular, the ranges of X'V^-X and X' have the same dimension. Since the first is included in the second, they must then coincide. We illustrate by example what can go wrong if the range inclusion condition is violated. The set of generalized inverses of

is

This is also the set of possible products X'GX with G ∈ V^- if for X we choose the 2 × 2 identity matrix. Hence the product X'V^-X is truly a set and not a singleton. Among the members

some are not symmetric (β ≠ γ), some are not nonnegative definite (α < 0, β = γ), and some do not have the same range as X' and the same rank as X (α = β = γ = 1). Frequent use of the lemma is made with other matrices in place of X and V. The above presentation is tailored to the linear model context, which we now resume. 1.18. GENERAL LINEAR MODELS A central result in linear model theory is the Gauss-Markov Theorem 1.19. The version below is stated purely in terms of matrices, as a minimization problem relative to the Loewner ordering. However, it is best understood in the setting of a general linear model in which, by definition, the n × 1 response vector Y is assumed to have mean vector and dispersion matrix given by
E[Y] = Xθ,   D[Y] = σ²V.
Here the n × k model matrix X and the nonnegative definite n × n matrix V are assumed known, while the mean parameter vector θ ∈ Θ and the model variance σ² > 0 are taken to be unknown. The dispersion matrix need no longer be proportional to the identity matrix as in the classical linear model discussed in Section 1.3. Indeed, the matrix V may be rank deficient, even admitting the deterministic extreme V = 0. The theorem considers unbiased linear estimators LY for Xθ, that is, n × n matrices L satisfying the unbiasedness requirement

In a general linear model, it is implicitly assumed that the parameter domain is the full space, Θ = R^k. Under this assumption, LY is unbiased for Xθ if and only if LX = X, that is, L is a left identity of X. There always exists a left identity, for instance, L = I_n. Hence the mean vector Xθ always admits an unbiased linear estimator. More generally, we may wish to estimate s linear forms c₁'θ, ..., c_s'θ of θ, with coefficient vectors c_j ∈ R^k. For a concise vector notation, we form the k × s coefficient matrix K = (c₁, ..., c_s). Thus interest is in the parameter subsystem K'θ. A linear estimator LY for K'θ is determined by an s × n matrix L. Unbiasedness holds if and only if
LX = K'.        (1)
There are two important implications. First, K'θ is called estimable when there exists an unbiased linear estimator for K'θ. This happens if and only if there is some matrix L that satisfies (1). In the sequel such a specific solution is represented as L = U', with an n × s matrix U. Therefore estimability means that K' is of the form U'X. Second, if K'θ is estimable, then the set of all matrices L that satisfy (1) determines the set of all unbiased linear estimators LY for K'θ. In other words, in order to study the unbiased linear estimators for K'θ = U'Xθ, we have to run through the solutions L of the matrix equation
LX = U'X.        (2)
It is this equation (2) to which the Gauss-Markov Theorem 1.19 refers. The theorem identifies unbiased linear estimators L̂Y for the mean vector Xθ which among all unbiased linear estimators LY have a smallest dispersion matrix. Thus the quantity to be minimized is σ²LVL', relative to the Loewner ordering. The crucial step in the proof is the computation of the covariance matrix between the optimality candidate L̂Y and a competitor LY,

1.19. THE GAUSS-MARKOV THEOREM Theorem. Let X be an n × k matrix and V be a nonnegative definite n × n matrix. Suppose U is an n × s matrix. A solution L̂ of the equation L̂X = U'X attains the minimum of LVL', relative to the Loewner ordering and over all solutions L of the equation LX = U'X,

if and only if
L̂VR' = 0,
where R is a projector given by R = I_n - XG for some generalized inverse G of X. A minimizing solution L̂ exists; a particular choice is L̂ = U'(I_n - VR'HR), with any generalized inverse H of RVR'. The minimum admits the representation

and does not depend on the choice of the generalized inverses involved. Proof. For a fixed generalized inverse G of X we introduce the projectors P = XG and R = In - P. Every solution L satisfies L - L(P + R) = LXG + LR= U'P + LR. I. First the converse part is proved. Assume the matrix L solves LX = U'X and fulfills LVR' = 0, and let L be any other solution. We get

and symmetry yields (L - L̂)VL̂' = 0. Multiplying out LVL' = (L - L̂ + L̂)V(L - L̂ + L̂)' = (L - L̂)V(L - L̂)' + 0 + 0 + L̂VL̂', we obtain the minimizing property of L̂,

II. Next we tackle existence. Because of RX = 0 the matrix L̂ = U'(I_n - VR'HR) solves L̂X = U'X - U'VR'HRX = U'X. It remains to show that L̂VR' = 0. To this end, we note that range RV = range RVR', by the square root discussion in Section 1.14. Lemma 1.17 says that VR'HRV = VR'(RVR')^-RV, as well as RVR'(RVR')^-RV = RV. This gives

Hence L̂ fulfills the necessary conditions from the converse part, and thus attains the minimum. Furthermore the minimum permits the representation

III. Finally the existence proof and the converse part are jointly put to use in the direct part. Since the matrix L̂ is minimizing, any other minimizing solution L satisfies (1) with equality. This forces (L - L̂)V(L - L̂)' = 0, and further entails (L - L̂)V = 0. Postmultiplication by R' yields LVR' = L̂VR' = 0. The statistic RY is an unbiased estimator for the null vector, E[RY] = RXθ = 0. In the context of a general linear model, the theorem therefore says that an unbiased estimator LY for U'Xθ has a minimum dispersion matrix if and only if LY is uncorrelated with the unbiased null estimator RY, that is, C[LY, RY] = σ²LVR' = 0. Our original problem of estimating Xθ emerges with U = I_n. The solution matrices L of the equation LX = X are the left identities of X. A minimum variance unbiased linear estimator for Xθ is given by L̂Y, with L̂ = I_n - VR'HR. The minimum dispersion matrix takes the form σ²(V - VR'(RVR')^-RV). A simpler formula for the minimum dispersion matrix becomes available under a range inclusion condition as in Section 1.17. 1.20. THE GAUSS-MARKOV THEOREM UNDER A RANGE INCLUSION CONDITION Theorem. Let X be an n × k matrix and V be a nonnegative definite n × n matrix such that the range of V includes the range of X. Suppose U is an n × s matrix. Then the minimum of LVL' over all solutions L of the equation LX = U'X admits the representation
min {LVL' : LX = U'X} = U'X(X'V^-X)^-X'U,
and is attained by any L̂ = U'X(X'V^-X)^-X'H, where H is any generalized inverse of V.

Proof. The matrix X'V^-X = W does not depend on the choice of the generalized inverse for V and has the same range as X', by Lemma 1.17. A second application of Lemma 1.17 shows that a similar statement holds for XW^-X'. Hence the optimality candidate L̂ is well defined. It also satisfies L̂X = U'X since

Furthermore, it fulfills L̂VR' = 0, because of VH'X = VV^-X = X and

By Theorem 1.19, the matrix L̂ is a minimizing solution. From X'V^-VV^-X = X'V^-X, we now obtain the representation

The preceding two theorems investigate linear estimators LY that are unbiased for a parameter system U'Xθ. The third, and last, version concentrates on estimating the parameter vector θ itself. A linear estimator LY, with L ∈ R^{k×n}, is unbiased for θ if and only if
LXθ = θ   for all θ ∈ R^k.
This reduces to LX = I_k, that is, L is a left inverse of X. For a left inverse L of X to exist it is necessary and sufficient that X has full column rank k. 1.21. THE GAUSS-MARKOV THEOREM FOR THE FULL MEAN PARAMETER SYSTEM Theorem. Let X be an n × k matrix with full column rank k and V be a nonnegative definite n × n matrix. A left inverse L̂ of X attains the minimum of LVL' over all left inverses L of X,

if and only if
L̂VR' = 0,
where R is a projector given by R = I_n - XG for some generalized inverse G of X. A minimizing left inverse L̂ exists; a particular choice is L̂ = G - GVR'HR, with any generalized inverse H of RVR'. The minimum admits the representation

and does not depend on the choice of the generalized inverses involved. Moreover, if the range of V includes the range of X then the minimum is
(X'V^-X)^{-1},
and is attained by any matrix L̂ = (X'V^-X)^{-1}X'H, where H is any generalized inverse of V.

Proof. Notice that every generalized inverse G of X is a left inverse of X, since premultiplication of XGX = X by (X'X)^{-1}X' gives GX = I_k. With U' = G, Theorem 1.19 and Theorem 1.20 establish the assertions. The estimators LY that result from the various versions of the Gauss-Markov Theorem are closely related to projecting the response vector Y onto an appropriate subspace of R^n. Therefore we briefly digress and comment on projectors.
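A minimal numerical sketch of Theorem 1.21 may help. It assumes a positive definite dispersion structure, so that the range inclusion holds automatically; the model matrix and V below are arbitrary illustrative choices, not examples from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
X = rng.normal(size=(n, k))                  # full column rank model matrix (illustrative)
A = rng.normal(size=(n, n))
V = A @ A.T + np.eye(n)                      # positive definite dispersion structure

Vinv = np.linalg.inv(V)
M = X.T @ Vinv @ X                           # X'V^-X
L_hat = np.linalg.inv(M) @ X.T @ Vinv        # optimal left inverse (X'V^-X)^{-1} X'V^-

print(np.allclose(L_hat @ X, np.eye(k)))     # L_hat is a left inverse of X
# Its dispersion L_hat V L_hat' attains the minimum (X'V^-X)^{-1}:
print(np.allclose(L_hat @ V @ L_hat.T, np.linalg.inv(M)))
```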
1.22. PROJECTORS, RESIDUAL PROJECTORS, AND DIRECT SUM DECOMPOSITION

Projectors were introduced in Section 1.16. If the matrix P ∈ R^{n×n} is a projector, P = P², then it decomposes the space R^n into a direct sum consisting of the subspaces K = range P and L = nullspace P. To see this, observe that the nullspace of P coincides with the range of the residual projector R = I_n - P. Therefore every vector x ∈ R^n satisfies
x = Px + Rx ∈ K + L.
But then the vector x lies in the intersection K ∩ L if and only if x = Px and x = Rx, or equivalently, x = Rx = RPx = 0. Hence the spaces K and L are disjoint except for the null vector. Symmetry of P adds orthogonality to the picture. Namely, we then have L = nullspace P = nullspace P' = (range P)^⊥ = K^⊥, by Lemma 1.13.

EXHIBIT 1.4 Orthogonal and oblique projections. The projection onto the first component in R² along the second coordinate direction is orthogonal. The (dashed) projection along an oblique direction is nonorthogonal relative to the Euclidean scalar product.

Thus projectors that are symmetric correspond to orthogonal sum decompositions of the space R^n, and are called orthogonal projectors. This translation into geometry often provides a helpful view. In Exhibit 1.4, we sketch a simple illustration. In telling the full story we should speak of P as being the projector "onto K along L", that is, onto its range along its nullspace. But brevity wins over exactness. It remains the fact that projectors in R^n correspond to direct sum decompositions R^n = K ⊕ L, without reference to any scalar product. In Lemma 2.15, we present a method for computing projectors if the subspaces K and L arise as ranges of nonnegative definite matrices A and B. We mention in passing that a symmetric n × n matrix V permits yet another eigenvalue decomposition, of the form V = Σ_{ℓ≤f} λ_ℓ P_ℓ, where λ₁, ..., λ_f are the distinct eigenvalues of V and P₁, ..., P_f are the orthogonal projectors onto the corresponding eigenspaces. In this form, the eigenvalue decomposition is unique up to enumeration, in contrast to the representations given in Section 1.7.
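These facts are easy to verify numerically. The sketch below uses an arbitrary model matrix: it builds the orthogonal projector onto the range of X and checks idempotency, symmetry, and the orthogonal direct sum decomposition.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])                      # arbitrary n x k matrix of full column rank

P = X @ np.linalg.inv(X.T @ X) @ X.T            # orthogonal projector onto range(X)
R = np.eye(3) - P                               # residual projector onto nullspace(P)

print(np.allclose(P @ P, P), np.allclose(P, P.T))    # idempotent and symmetric
x = np.array([2.0, -1.0, 5.0])
print(np.allclose(x, P @ x + R @ x))                 # direct sum decomposition x = Px + Rx
print(np.isclose((P @ x) @ (R @ x), 0.0))            # the two components are orthogonal
```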

1.23. OPTIMAL ESTIMATORS IN CLASSICAL LINEAR MODELS From now on, a minimum variance unbiased linear estimator is called an optimal estimator, for short. Returning to the classical linear model,
E[Y] = Xθ,   D[Y] = σ²I_n,
Theorem 1.20 shows that the optimal estimator for the mean vector Xθ is PY and that it has dispersion matrix σ²P, where P = X(X'X)^-X' is the orthogonal projector onto the range of X. Therefore the estimator PY evidently depends on the model matrix X through its range, only. Representing this subspace as the range of another matrix X̃ still leads to the same estimator PY and the same dispersion matrix σ²P. This outlook changes dramatically as soon as the parameter vector θ itself has a definite physical meaning and is to be investigated with its given components, as is the case in most applications. From Theorem 1.21, the optimal estimator for θ then is (X'X)^{-1}X'Y, and has dispersion matrix σ²(X'X)^{-1}. Hence changing the model matrix X in general affects both the optimal estimator and its dispersion matrix.
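The contrast can be illustrated numerically. Both matrices below are arbitrary choices with the same column span: the projector part PY is unchanged, while the componentwise dispersion (X'X)^{-1} is not.

```python
import numpy as np

X1 = np.array([[1.0, -1.0],
               [1.0,  0.0],
               [1.0,  1.0]])
X2 = X1 @ np.array([[1.0, 1.0],           # same column span, reparametrized
                    [0.0, 2.0]])

proj = lambda X: X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(proj(X1), proj(X2)))    # True: same projector, hence the same estimator PY
D1 = np.linalg.inv(X1.T @ X1)             # dispersion of theta-hat under X1 (up to sigma^2)
D2 = np.linalg.inv(X2.T @ X2)             # dispersion under X2
print(np.allclose(D1, D2))                # False: componentwise precision differs
```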

1.24. EXPERIMENTAL DESIGNS AND MOMENT MATRICES The stage is set now to introduce the notion of experimental designs. In Section 1.3, the model matrix was built up according to X = (x₁, ..., x_n)', starting from the regression vectors x_i. These are at the discretion of the experimenter who can choose them so that in a classical linear model, the optimal estimator (X'X)^{-1}X'Y for the mean parameter vector θ attains a dispersion matrix σ²(X'X)^{-1} as small as possible, relative to the Loewner ordering. Since matrix inversion is antitonic, as seen in Section 1.11, the experimenter may just as well aim to maximize the precision matrix, that is, the inverse dispersion matrix,

The sum repeats the regression vector x_i ∈ X according to how often it occurs in x₁, ..., x_n. Since the order of summation does not matter, we may assume that the distinct regression vectors x₁, ..., x_ℓ, say, are enumerated in the initial section, while replications are accounted for in the final section x_{ℓ+1}, ..., x_n.

We introduce for i ≤ ℓ the frequency counts n_i as the number of times the particular regression vector x_i occurs among the full list x₁, ..., x_n. This motivates the definitions of experimental designs and their moment matrices which are fundamental to the sequel. DEFINITION. An experimental design for sample size n is given by a finite number of regression vectors x₁, ..., x_ℓ in X, and nonzero integers n₁, ..., n_ℓ such that
n₁ + ... + n_ℓ = n.

In other words an experimental design for sample size n, denoted by ξ_n, specifies ℓ ≤ n distinct regression vectors x_i, and assigns to them frequencies n_i that sum to n. It tells the experimenter to observe n_i responses under the experimental conditions that determine the regression vector x_i. The vectors that appear in the design ξ_n are called the support of ξ_n, supp ξ_n = {x₁, ..., x_ℓ}. The matrix Σ_{i≤ℓ} n_i x_i x_i' = X'X is called the moment matrix of ξ_n and is denoted by M(ξ_n). Then the precision matrix of the optimal estimator for θ may be written as
(1/σ²) X'X = (1/σ²) M(ξ_n).
The set of all designs for sample size n is denoted by Ξ_n. Experimental designs for finite sample size lead to, often intractable, integer optimization problems. Much more smoothness evolves if we take a slightly different point of view. Indeed, the last sum in (1) is an average over the regression range X, placing rational weight n_i/n on the regression vector x_i. The clue is to allow the weights to vary continuously in the closed unit interval [0;1]. This emerges as the limiting case for sample size n tending to infinity. DEFINITION. An experimental design for infinite sample size (or a design, for short) is a distribution on the regression range X which assigns all its mass to a finite number of points. A general design, denoted by ξ, specifies ℓ ≥ 1 regression vectors x_i and weights w_i, where w₁, ..., w_ℓ are positive numbers summing to 1. It tells the experimenter to observe a proportion w_i out of all responses under the experimental conditions that come with the regression vector x_i. A vector in the regression range carrying positive weight under the design ξ is called a support point of ξ. The set of all support points is called the support of ξ and is denoted by supp ξ. We use the notation Ξ for the set of all designs. The discussion of precision matrices suggests that any performance measure of a design ought to be based on its moment matrix. DEFINITION. The moment matrix of a design ξ ∈ Ξ is the k × k matrix defined by
M(ξ) = ∫_X xx' dξ = Σ_{i≤ℓ} w_i x_i x_i'.
The representation as an integral is useful since in general the support of a design is not known. It nicely exhibits that the moment matrices depend

linearly on the designs, in the sense that
M((1 - α)ξ + αη) = (1 - α)M(ξ) + αM(η)   for all α ∈ [0;1] and ξ, η ∈ Ξ.
However, the integral notation hides the fact that a moment matrix always reduces to a finite sum. A design for sample size n must be standardized according to ξ_n/n to become a member of the set Ξ. In (1) the precision matrix of the optimal estimator for θ then takes the form
(1/σ²) M(ξ_n) = (n/σ²) M(ξ_n/n).
Thus standardized precision grows directly proportional to the sample size n, and decreases inversely proportional to the model variance σ². For this reason the moment matrices M(ξ) of designs ξ ∈ Ξ are often identified with precision matrices for θ standardized with respect to sample size n and model variance σ². Of course, in order to become realizable, a design for infinite sample size must in general be approximated by a design for sample size n. An efficient apportionment method is proposed in Chapter 12. 1.25. MODEL MATRIX VERSUS DESIGN MATRIX Once a finite sample size design is selected, a worksheet can be set up into which the experimenter enters the observations that are obtained under the appropriate experimental conditions. In practical applications, designs ξ on the regression range X uniquely correspond with designs τ on the experimental domain T. The set of all designs τ on T is denoted by T. (The Greek letters τ, T are in line with t, T, as are ξ, Ξ with x, X.) The correspondence between ξ on X and τ on T is always plainly visible. The formal relation is that ξ on X is the distribution of the vector f relative to the underlying probability measure τ on T, that is, ξ = τ ∘ f^{-1}. From an applied point of view, the designs τ on the experimental domain T play a more primary role than the designs ξ on the regression range X. However, for the mathematical development, we concentrate on ξ. Because of the obvious correspondence between ξ and τ, no ambiguity will arise. Exhibit 1.5 displays a worksheet for a design of sample size n = 12, consisting of ℓ = 6 experimental conditions, each with n_i = 2 replications. When the experiment is carried out, the 12 observations ought to be made in random order, as a safeguard against a systematic bias that might be induced by any "standard order". Hence while such a worksheet is useful for the statistician, the experimenter may be better off with a version such as in Exhibit 1.6. There, the experimental runs are randomized and presentation of the model

[Tabular layout of Exhibit 1.5 not reproduced: for each of the six experimental conditions i = 1, ..., 6 it lists the settings t_i1, t_i2, the corresponding regression vector, and the two replicated observations y_i1, y_i2.]

EXHIBIT 1.5 An experimental design worksheet. The design is for sample size 12, for a nonsaturated two-way second-degree model. Each run i = 1, ..., 6 has two replications, j = 1, 2. Randomization of the run order must be implemented by the experimenter.

[Tabular layout of Exhibit 1.6 not reproduced: for each of the twelve runs it lists the experimental conditions t_i1, t_i2 and the observation y_i, with the run order randomized.]

EXHIBIT 1.6 A worksheet with run order randomized. A worksheet for the same experiment as in Exhibit 1.5, except that now run order is randomized.

matrix X is suppressed. The matrix that appears in Exhibit 1.6 still tells the experimenter which experimental conditions are to be realized in which run, but it does not reflect the underlying statistical model. For this reason it is instructive to differentiate between a design matrix

and a model matrix. A design matrix is any matrix that determines a design τ_n for sample size n on the experimental domain T, while the model matrix X of Section 1.3 also reflects the modeling assumptions. The transition from designs for finite sample size to designs for infinite sample size originates from the usual step towards asymptotic statistics, of letting sample size n tend to infinity in order to obtain procedures that do not depend on n. It also pleases the mathematical mind to formulate a tractable, smooth problem. Lemma 1.26 provides a first indication of the type of smoothness involved. The convex hull, conv S, of a subset S of some linear space consists of all convex combinations Σ_{i≤t} α_i s_i with a finite number t of points s_i from S and with arbitrary positive weights α_i summing to 1. In the space Sym(k) of symmetric matrices, the set of all moment matrices, M(Ξ), is the convex hull of the set S = {xx' : x ∈ X} of the rank one matrices formed from regression vectors x. We now show that the set M(Ξ) is also compact.

1.26. GEOMETRY OF THE SET OF ALL MOMENT MATRICES Lemma. If the regression range X is a compact set in R^k, then the set M(Ξ) of moment matrices of all designs on X is a compact and convex subset of the cone NND(k). Proof. Being a convex hull, the set M(Ξ) is convex. Since the generating set S = {xx' : x ∈ X} is included in NND(k) so is M(Ξ). Moreover, S is compact, being the image of the compact set X under the continuous mapping x ↦ xx' from X to Sym(k). In Euclidean space, the convex hull of a compact set is compact, thus completing the proof. In general, the set M(Ξ_n) of moment matrices obtained from all experimental designs for sample size n fails to be convex, but it is still compact. To see compactness, note that such a design is characterized by n regression vectors x₁, ..., x_n, counting multiplicities. The set of moment matrices M(Ξ_n) thus consists of the continuous images M(ξ_n) = Σ_{j≤n} x_j x_j' of the n-fold Cartesian product X^n of the regression range with itself. If X is compact, so are X^n and the continuous image M(Ξ_n). The regression range X is compact, of course, if the experimental domain T is compact and the regression function f is continuous. The closedness property that is implied by compactness is of a more technical nature. However, boundedness is persuasive also on practical grounds: The fitting of a linear model would appear to adequately approximate the true expected response over a bounded region only. With an unbounded regression range, we would be comparing experiments in rather different environments.
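A small sketch of these definitions may be useful; the support points and weights below are illustrative choices for a line fit model. It computes the moment matrix of a design as a weighted sum of rank one matrices and checks that moment matrices combine linearly in the design.

```python
import numpy as np

# A design: support points in the regression range (line fit model, f(t) = (1, t)')
# with weights summing to one.  Both the points and the weights are illustrative.
support = np.array([[1.0, -1.0],
                    [1.0,  0.0],
                    [1.0,  1.0]])
w = np.array([0.25, 0.5, 0.25])

def moment_matrix(points, weights):
    """M(xi) = sum_i w_i x_i x_i'."""
    return sum(wi * np.outer(xi, xi) for wi, xi in zip(weights, points))

M1 = moment_matrix(support, w)
w2 = np.array([0.5, 0.0, 0.5])                         # a second design on the same support
M2 = moment_matrix(support, w2)
alpha = 0.3
mix = moment_matrix(support, (1 - alpha) * w + alpha * w2)
print(np.allclose(mix, (1 - alpha) * M1 + alpha * M2))  # linearity in the design
```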

1.27. DESIGNS FOR TWO-WAY CLASSIFICATION MODELS We illustrate the basic concepts with the examples of Section 1.5. In the two-sample problem, an experimental design for finite sample size n is of the form ξ_n(e₁) = n₁ and ξ_n(e₂) = n₂, with n₁ + n₂ = n. It directs the experimenter to observe n_i responses from population i. Its moment matrix is
M(ξ_n) = diag(n₁, n₂).
The optimal estimator for θ = (α₁, α₂)' becomes

using the common abbreviations y_{i·} = Σ_{j≤n_i} Y_{ij} and ȳ_{i·} = (1/n_i) y_{i·} for i = 1, 2. This is the familiar result that the optimal estimators for the population means are the sample averages within each population. Generalization to the one-way classification model is straightforward. A finite sample size design is given by ξ_n(e_i) = n_i, calling for n_i observations to be made at level i. Its moment matrix is diagonal with entries n_i. The optimal estimator for the mean effect of level i is the within-group sample average ȳ_{i·}. Two-way classification models lead to something more interesting. Here an experimental design for sample size n is of the form ξ_n(i, j) = n_{ij}, with Σ_{i≤a} Σ_{j≤b} n_{ij} = n. More generally, a design for infinite sample size is best thought of as an a × b weight matrix W, with the (i, j)th entry w_{ij} equal to ξ(i, j). Since the numbers w_{ij} are nonnegative and sum to 1, the rectangular matrix W is a probability distribution on the rectangular domain T. (It is the same as the measure τ from Section 1.25, but is now preferably thought of as a matrix. The design ξ ∈ Ξ differs from its weight matrix W ∈ T only in that ξ lives on the regression range X, while W lives on the experimental domain T.) The weight w_{ij} determines the fraction of responses to be observed with factor A on level i and factor B on level j. We call W an a × b block design. Let 1_a denote the unity vector, that is, the a × 1 vector with all components equal to 1. Given a weight matrix W, its row sum vector r, its column sum vector s, and their entries are given by
r = W1_b,   s = W'1_a,   r_i = Σ_{j≤b} w_{ij},   s_j = Σ_{i≤a} w_{ij}.
The components r_i and s_j give the proportions of observations that use level i of factor A and level j of factor B, respectively.
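For concreteness, a short sketch with an arbitrary 3 × 2 weight matrix: it computes the row and column sum vectors r and s, and the incidence matrix N = nW introduced just below.

```python
import numpy as np

# An arbitrary 3 x 2 block design: weights w_ij are nonnegative and sum to one.
W = np.array([[0.2, 0.1],
              [0.1, 0.2],
              [0.2, 0.2]])

r = W.sum(axis=1)        # row sum vector: replication proportions of the factor A levels
s = W.sum(axis=0)        # column sum vector: blocksize proportions of the factor B levels
print(r, s, W.sum())     # r and s each sum to 1

n = 10
N = n * W                # incidence matrix of the corresponding design for sample size n
print(N)
```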

In many applications, factor A is some sort of "treatment" on which interest concentrates, and r represents the treatment replication vector. Factor B is a "blocking factor" of only secondary interest, but is necessary to represent the experiment by a linear model, and s is then called the blocksize vector. We emphasize that, in our development, the replication number r_i of treatment i and the size s_j of block j are relative frequencies rather than absolute frequencies. If ξ_n is a design for sample size n and the standardized design ξ_n/n ∈ Ξ has weight matrix W ∈ T, then the a × b matrix
N = nW
has integer entries n_{ij} and is called the incidence matrix of the design ξ_n. Either W or N may serve as design matrix, in the sense of Section 1.25. In order to find the moment matrix, we must specify the model. As in Section 1.5, we consider the two-way classification model with no interaction,

We again take Δ_r to be the a × a diagonal matrix with row sum vector r on the diagonal, while Δ_s is the b × b diagonal matrix formed from the column sum vector s. With this notation, the moment matrix M of a design has an appealing representation in terms of its associated weight matrix W,
M = ( Δ_r   W
      W'   Δ_s ).
However, the optimal estimator (nM)^{-1}X'Y fails to exist since M is singular! Indeed, postmultiplying M by (1_a', -1_b')' gives the null vector. This means that only certain subsystems of the mean parameter vector θ are identifiable, as we shall see in Section 3.20.
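A quick numerical check of this singularity, using the block form above and reusing the illustrative 3 × 2 weight matrix from the previous sketch:

```python
import numpy as np

W = np.array([[0.2, 0.1],
              [0.1, 0.2],
              [0.2, 0.2]])                 # arbitrary 3 x 2 weight matrix, entries sum to 1
r, s = W.sum(axis=1), W.sum(axis=0)

M = np.block([[np.diag(r), W],
              [W.T, np.diag(s)]])          # moment matrix of the two-way model without interaction

z = np.concatenate([np.ones(3), -np.ones(2)])    # the vector (1_a', -1_b')'
print(np.allclose(M @ z, 0.0))                    # True: M is singular
print(np.linalg.matrix_rank(M))                   # here the rank is a + b - 1 = 4
```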

EXHIBIT 1.7 Experimental domain designs, and regression range designs. Top: a design τ on the experimental domain T = [-1;1]. Left: the induced design on the regression range X ⊆ R² for the line fit model. Right: the induced design on the regression range X ⊆ R³ for the parabola fit model.

1.28. DESIGNS FOR POLYNOMIAL FIT MODELS In the polynomial fit model of Section 1.6, the compactness assumption of Lemma 1.26 is satisfied if the interval figuring as experimental domain T is compact. Any design τ ∈ T, on the experimental domain T, induces a design ξ ∈ Ξ, on the regression range X, as discussed in Section 1.25. Exhibit 1.7 sketches this relation, for polynomial fit models of degree one and two over the symmetric unit interval T = [-1;1]. While the design ξ on X incorporates the model assumptions through the support points x ∈ X, this is not so for a design τ on T. For this reason, we denote the moment matrix for τ by M_d(τ) = ∫_T f(t)f(t)' dτ, indicating the degree of the model through the subscript d. The resulting matrices are

classical moment matrices in that they involve the moments μ_j of τ, for j = 0, ..., 2d,
μ_j = ∫_T t^j dτ.
In general, closed form inversion of M_d(τ) is no longer possible, and the optimal estimator must be found numerically. It may happen that interest concentrates not on the full mean parameter vector, but on a single component of the mean parameters. For instance, the experimenter may wish to learn more about the coefficient θ_d of the highest term t^d of a dth-degree polynomial regression function. The surprising finding is that for one-dimensional parameter systems an explicit construction of optimal designs becomes feasible, with the help of some convex geometry.
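As a numerical illustration (the design below, equal weight on -1, 0, 1, is an arbitrary choice): the moment matrix of a parabola fit model is assembled from the moments μ_0, ..., μ_4 of τ.

```python
import numpy as np

# A design tau on T = [-1, 1] with three support points and equal weights (illustrative).
points = np.array([-1.0, 0.0, 1.0])
weights = np.array([1/3, 1/3, 1/3])
d = 2                                            # parabola fit model, f(t) = (1, t, t^2)'

mu = [np.sum(weights * points**j) for j in range(2 * d + 1)]   # moments mu_0, ..., mu_4
M = np.array([[mu[i + j] for j in range(d + 1)] for i in range(d + 1)])
print(M)                                         # the 3 x 3 moment matrix M_2(tau)

# Equivalently, integrate f(t) f(t)' directly against tau:
F = np.vander(points, d + 1, increasing=True)    # rows f(t_i)' = (1, t_i, t_i^2)
print(np.allclose(M, F.T @ np.diag(weights) @ F))
```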

EXERCISES

1.1 Show that a saturated m-way dth-degree model has (d+m choose m) mean parameters. This is the number of ways d "exponent units" can be put into m + 1 compartments labeled 1, t₁, ..., t_m.

1.2 Verify that a one-way dth-degree model matrix X = (x₁, ..., x_{d+1})', with rows x_i = (1, t_i, ..., t_i^d)', is a Vandermonde matrix [Horn and Johnson (1985), p. 29]. Discuss its rank.

1.3 Show that the eigenvalues of P are 1 or 0, and rank P = trace P, for every projector P.

1.4 Show that P ≥ Q if and only if range P ⊇ range Q, for all orthogonal projectors P, Q.

1.5 Show that p_{ii} = Σ_j p_{ij}² and |p_{ij}| ≤ max{p_{ii}, p_{jj}} ≤ 1, for every orthogonal projector P.

1.6 Let P = X(X'X)^{-1}X' be the orthogonal projector originating from an n × k model matrix X of rank k, with rows x_i'. Show that (i) p_{ii} = x_i'(X'X)^{-1}x_i, (ii) Σ_i p_{ii} = k, (iii) P ≤ XX'/λ_min(X'X) and max_i p_{ii} ≤ R²/λ_min(X'X) = c/n, with c = R²/λ_min(M) ≥ 1, where R = max_i ||x_i|| and M = (1/n)X'X, (iv) if 1_n ∈ range X then P ≥ (1/n)1_n1_n' and min_i p_{ii} ≥ 1/n.

1.7 Discuss the following three equivalent versions of the Gauss-Markov Theorem:
i. LVL' ≥ X(X'V^-X)^-X' for all L ∈ R^{n×n} with LX = X.
ii. V ≥ X(X'V^-X)^-X'.
iii. trace VW ≥ trace X(X'V^-X)^-X'W for all W ∈ NND(n).

1.8 Let the responses Y₁, ..., Y_n have an exchangeable (or completely symmetric) dispersion structure V = σ²I_n + σ²ρ(1_n1_n' - I_n), with variance σ² > 0 and correlation ρ. Show that V is positive definite if and only if ρ ∈ (-1/(n-1); 1).

1.9 (continued) Let ξ_n be a design for sample size n, with model matrix X, standardized moment matrix M = X'X/n = ∫ xx' dξ_n/n, and standardized mean vector m = X'1_n/n = ∫ x dξ_n/n. Show that lim_{n→∞} (σ²/n) X'V^{-1}X = (M - mm')/(1 - ρ), provided ρ ∈ (0; 1). Discuss M - mm', as a function of ξ = ξ_n/n.

CHAPTER 2

Optimal Designs for Scalar Parameter Systems

In this chapter optimal experimental designs for one-dimensional parameter systems are derived. The optimality criterion is the standardized variance of the minimum variance unbiased linear estimator. A discussion of estimability leads to the introduction of a certain cone of matrices, called the feasibility cone. The design problem is then one of minimizing the standardized variance over all moment matrices that lie in the feasibility cone. The optimal designs are characterized in a geometric way. The approach is based on the set of cylinders that include the regression range, and on the interplay of the design problem and a dual problem. The construction is illustrated with models that have two or three parameters.

2.1. PARAMETER SYSTEMS OF INTEREST AND NUISANCE PARAMETERS Our aim is to characterize and compute optimal experimental designs. Any concept of optimality calls upon the experimenter to specify the goals of the experiment; it is only relative to such goals that optimality properties of a design would make any sense. In the present chapter we presume that the experimenter's goal is point estimation in a classical linear model, as set forth in Section 1.23. There we concentrated on the full mean parameter vector. However, the full parameter system often splits into a subsystem of interest and a complementary subsystem of nuisance parameters. Nuisance parameters assist in formulating a statistical model that adequately describes the experimental reality, but the primary concern of the experiment is to learn more about the subsystem of interest. Therefore the performance of a design is evaluated relative to the subsystem of interest, only. One-dimensional subsystems of interest are treated first, in the present
chapter. They are much simpler and help motivate the general results for multidimensional subsystems. The general discussion is taken up in Chapter 3. 2.2. ESTIMABILITY OF A ONE-DIMENSIONAL SUBSYSTEM As before, let θ be the full k × 1 vector of mean parameters in a classical linear model,

Suppose the system of interest is given by c'θ, where the coefficient vector c ∈ R^k is prescribed prior to experimentation. To avoid trivialities, we assume c ≠ 0. The most important special case is an individual parameter component θ_j, which is obtained from c'θ with c the Euclidean unit vector e_j in the space R^k. Or interest may be in the grand mean θ̄ = Σ_{j≤k} θ_j/k, which is of the form c'θ if c is chosen to be 1_k/k, with the k × 1 unity vector 1_k. An estimator for c'θ is said to be optimal when it minimizes the variance among all unbiased linear estimators for c'θ, as discussed in Section 1.23. Let us first look at the unbiasedness requirement. An estimator u'Y, with u ∈ R^n, is unbiased for c'θ if and only if

This reduces to the relation c = X'u between the vectors c and u which determine the subsystem of interest and the linear estimator. The parameter system c'θ is estimable if and only if there exists at least one vector u ∈ R^n such that the linear estimator u'Y is unbiased for c'θ (see Section 1.18). This entails u'X = c'. Hence c'θ is estimable if and only if the vector c lies in the range of the matrix X'. But this matrix has the same range as X'X, and its associated moment matrix M = (1/n)X'X, by Section 1.14. Therefore a design ξ, as introduced in Section 1.24, renders a subsystem c'θ estimable if and only if the estimability condition c ∈ range M(ξ) is satisfied. It is conceptually helpful to separate the estimability condition from any reference to designs ξ. Upon defining the subset A(c) of nonnegative definite matrices that have a range containing the vector c,
A(c) = {A ∈ NND(k) : c ∈ range A},
the estimability condition is cast into the form M(ξ) ∈ A(c). This form more clearly displays the present state of affairs. The key variables are the moment matrices M(ξ), whereas the coefficient vector c is fixed. The matrix subset A(c) is called the feasibility cone for c'θ. In order to see what it looks like, we need an auxiliary result that, for nonnegative definite

matrices, the range of a sum is the sum of the ranges. The sum K + L of two subspaces K and L comprises all vectors x + y with x ∈ K and y ∈ L. 2.3. RANGE SUMMATION LEMMA Lemma. Let A and B be nonnegative definite k × k matrices. Then we have range(A + B) = (range A) + (range B). In particular, if A ≥ B ≥ 0 then the range of A includes the range of B. Proof. A passage to the orthogonal complement based on Lemma 1.13 transforms the range formula into a nullspace formula,
nullspace(A + B) = (nullspace A) ∩ (nullspace B).

The converse inclusion is obvious. It remains to consider the direct inclusion. If (A + B)z = 0, then also z'Az + z'Bz = 0. Because of nonnegative definiteness, both terms vanish individually. However, for a nonnegative definite matrix A, the scalar z'Az vanishes if and only if the vector Az is null, as utilized in Section 1.14. In particular, if A ≥ B ≥ 0, then A = B + C with C = A - B ≥ 0. Therefore range A = (range B) + (range C) ⊇ range B. The following description shows that every feasibility cone lies somewhere between the open cone PD(k) and the closed cone NND(k). The description lacks an explicit characterization of which parts of the boundary NND(k) \ PD(k) belong to the feasibility cone, and which parts do not. 2.4. FEASIBILITY CONES Theorem. The feasibility cone A(c) for c'θ is a convex subcone of NND(k) which includes PD(k). Proof. By definition, A(c) is a subset of NND(k). Every positive definite k × k matrix lies in A(c). The cone property and the convexity property jointly are the same as
δA(c) ⊆ A(c) for all δ > 0,   and   A(c) + A(c) ⊆ A(c).
The first property is evident, since multiplication by a nonvanishing scalar does not change the range of a matrix. That A(c) contains the sum of any two of its members follows from Lemma 2.3.
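A short sketch of the estimability condition behind the feasibility cone (the design and coefficient vectors are illustrative): it checks c ∈ range M(ξ) by testing whether c can be matched exactly by a vector in the column space of M.

```python
import numpy as np

# Line fit model with all mass at t = 1: the moment matrix is singular.
support = np.array([[1.0, 1.0]])
w = np.array([1.0])
M = sum(wi * np.outer(xi, xi) for wi, xi in zip(w, support))

def feasible(M, c, tol=1e-10):
    """Check c in range(M), i.e. M lies in the feasibility cone A(c)."""
    c_fit, *_ = np.linalg.lstsq(M, c, rcond=None)
    return np.linalg.norm(M @ c_fit - c) < tol

print(feasible(M, np.array([1.0, 1.0])))   # True:  c proportional to the support point
print(feasible(M, np.array([0.0, 1.0])))   # False: the slope alone is not estimable
```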

EXHIBIT 2.1 The ice-cream cone. The cone NND(2) in Sym(2) is isomorphic to the ice-cream cone in R³.

2.5. THE ICE-CREAM CONE Feasibility cones reappear in greater generality in Section 3.3. At this point, we visualize the situation for the smallest nontrivial order, k = 2. The space Sym(2) has typical members
A = ( α  β
      β  γ ),   with α, β, γ ∈ R,
and hence is of dimension 3. We claim that in Sym(2), the cone NND(2) looks like the ice-cream cone of Exhibit 2.1 which is given by
{(x, y, z)' ∈ R³ : z ≥ (x² + y²)^{1/2}}.
This claim is substantiated as follows. The mapping from Sym(2) into R3

which takes

A = ( α  β
      β  γ )

into (α, √2 β, γ)' is linear and preserves scalar products. Linearity is evident. The scalar products coincide since for

we have

Thus the matrix

and the vector (α, √2 β, γ)' enjoy identical geometrical properties. For a more transparent coordinate representation, we apply a further orthogonal transformation into a new coordinate system,

Hence the matrix

is mapped into the vector

while the vector (x, y, z)' is mapped into the matrix

For instance, the identity matrix I₂ corresponds to the vector

The matrix

is nonnegative definite if and only if αγ - β² ≥ 0 as well as α ≥ 0 and γ ≥ 0. In the new coordinate system, this translates into z² ≥ x² + y² as well as z ≥ y and z ≥ -y. These three properties are equivalent to z ≥ (x² + y²)^{1/2}. Therefore the cone NND(2) is isometrically isomorphic to the ice-cream cone, and our claim is proved. The interior of the ice-cream cone is characterized by strict inequality, (x² + y²)^{1/2} < z. This corresponds to the fact that the closed cone NND(2) has the open cone PD(2) for its interior, by Lemma 1.9. This correspondence is a consequence of the isomorphism just established, but is also easily seen directly as follows. A singular 2 × 2 matrix has rank equal to 0 or 1. The null matrix is the tip of the ice-cream cone. Otherwise we have A = dd' for some nonvanishing vector

Such a matrix

is mapped into the vector

which satisfies x² + y² = z² and hence lies on the boundary of the ice-cream cone. What does this geometry mean for the feasibility cone A(c)? In the first place, feasibility cones contain all positive definite matrices, as stated in Theorem 2.4. A nonvanishing singular matrix A = dd' fulfills the defining property of the feasibility cone, c ∈ range dd', if and only if the vectors c and d are proportional. Thus, for dimension k = 2, we have

In geometric terms, the cone A(c) consists of the interior of the ice-cream cone and the ray emanating in the direction (2ab, b² - a², b² + a²)' for c = (a, b)'.
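The isomorphism can be checked numerically. The coordinate map in the sketch below is one choice consistent with the construction above; the particular orthogonal transformation used is an assumption of this sketch, and the two test matrices are arbitrary.

```python
import numpy as np

def to_cone_coords(A):
    """Map a symmetric 2x2 matrix (a b; b g) to (x, y, z) with NND(2) <-> z >= sqrt(x^2+y^2)."""
    a, b, g = A[0, 0], A[0, 1], A[1, 1]
    return np.array([np.sqrt(2) * b, (g - a) / np.sqrt(2), (a + g) / np.sqrt(2)])

def in_ice_cream(v, tol=1e-12):
    x, y, z = v
    return z + tol >= np.hypot(x, y)

A = np.array([[2.0, 1.0], [1.0, 1.0]])     # positive definite
B = np.array([[1.0, 2.0], [2.0, 1.0]])     # indefinite
print(in_ice_cream(to_cone_coords(A)), np.all(np.linalg.eigvalsh(A) >= 0))  # True True
print(in_ice_cream(to_cone_coords(B)), np.all(np.linalg.eigvalsh(B) >= 0))  # False False
```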

2.6. OPTIMAL ESTIMATORS UNDER A GIVEN DESIGN We now return to the task set in Section 2.2 of determining the optimal estimator for c'θ. If estimability holds under the model matrix X, then the parameter system satisfies c'θ = u'Xθ for some vector u ∈ R^n. The Gauss-Markov Theorem 1.20 determines the optimal estimator for c'θ to be

Whereas the estimator involves the model matrix X, its variance depends merely on the associated moment matrix M = (1/n)X'X,

Up to the common factor σ²/n, the optimal estimator has variance c'M(ξ)^-c. It depends on the design ξ only through the moment matrix M(ξ). 2.7. THE DESIGN PROBLEM FOR SCALAR PARAMETER SUBSYSTEMS We recall from Section 1.24 that Ξ is the set of all designs, and that M(Ξ) is the set of all moment matrices. Let c ≠ 0 be a given coefficient vector in R^k. The design problem for a scalar parameter system c'θ can now be stated as follows:
Minimize   c'M^-c   subject to   M ∈ M(Ξ) ∩ A(c).
In short, this means minimizing the variance subject to estimability. Notice that the primary variables are matrices rather than designs! The optimal variance of this problem is, by definition,
inf {c'M^-c : M ∈ M(Ξ) ∩ A(c)}.
A moment matrix M is called optimal for c'θ in M(Ξ) when M lies in the feasibility cone A(c) and c'M^-c attains the optimal variance. A design ξ is called optimal for c'θ in Ξ when its moment matrix M(ξ) is optimal for c'θ in M(Ξ). In the design problem, the quantity c'M^-c does not depend on the choice of generalized inverse for M, and is positive. This has nothing to do with M being a moment matrix but hinges entirely on the feasibility cone A(c).

Namely, if A lies in A(c), then Lemma 1.17 entails that c'A^-c is well defined and positive. Another way of verifying these properties is as follows. Every matrix A ∈ A(c) admits a vector h ∈ R^k solving c = Ah. Therefore we obtain
c'A^-c = h'AA^-Ah = h'Ah,
saying that c'A^-c is well defined. Furthermore h'Ah = 0 if and only if Ah = 0, by our considerations on matrix square roots in Section 1.14. The assumption c ≠ 0 then forces c'A^-c = h'Ah > 0.
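A short sketch of the objective function (the design and the coefficient vector are again illustrative): for a feasible moment matrix, c'M^-c can be evaluated with any generalized inverse, for instance the Moore-Penrose inverse.

```python
import numpy as np

# Parabola fit model on {-1, 0, 1} with weights (1/4, 1/2, 1/4); interest in the
# highest coefficient, c = (0, 0, 1)'.  All choices here are illustrative.
points = np.array([-1.0, 0.0, 1.0])
weights = np.array([0.25, 0.5, 0.25])
F = np.vander(points, 3, increasing=True)        # rows f(t_i)' = (1, t_i, t_i^2)
M = F.T @ np.diag(weights) @ F

c = np.array([0.0, 0.0, 1.0])
variance = c @ np.linalg.pinv(M) @ c             # c'M^-c, the standardized variance
print(variance)                                  # the criterion value to be minimized over designs
```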

2.8. DIMENSIONALITY OF THE REGRESSION RANGE Every optimization problem poses the immediate question whether it is nonvacuous. For the design problem, this means whether the set of moment matrices M(Ξ) and the feasibility cone A(c) intersect. The range of a moment matrix M(ξ) is the same as the subspace of R^k that is spanned by the support points of the design ξ. If ξ is supported by the regression vectors x₁, ..., x_ℓ, say, then Lemma 2.3 and Section 1.14 entail
range M(ξ) = Σ_{i≤ℓ} range x_i x_i' = span{x₁, ..., x_ℓ}.
Therefore the intersection M(Ξ) ∩ A(c) is nonempty if and only if the coefficient vector c lies in the regression space C(X) ⊆ R^k spanned by the regression range X ⊆ R^k. A sufficient condition is that X contains k linearly independent vectors x₁, ..., x_k. Then the moment matrix (1/k) Σ_{i≤k} x_i x_i' is positive definite, whence M(Ξ) meets A(c) for all c ≠ 0.

EXHIBIT 2.2 Two Elfving sets. Left: the Elfving set for the line fit model over [-1;1] is a square. Right: the Elfving set for the parabola fit model has no familiar shape. The rear dashed arc standing up on the (x, y)-plane is X, the front solid arc that is hanging down is -X.

2.9. ELFVING SETS The central tools to solve the optimal design problem are the Elfving set R, and the set N of cylinders including it. First we define the Elfving set R to be the convex hull of the regression range X and its negative image -X = {-x : x ∈ X},
R = conv(X ∪ (-X)).
Two instances of an Elfving set are shown in Exhibit 2.2. In order to develop a better feeling for Elfving sets, we recall that a moment matrix M(ξ) = ∫_X xx' dξ consists of the second order uncentered moments of the design ξ. Hence it is invariant under a change of sign of x. This may serve as a justification to symmetrize the regression range by adjoining to X its negative image -X. Now consider a design η on the symmetrized regression range X ∪ (-X). Suppose η has ℓ support points. These are of the form ε_i x_i, with x_i ∈ X and
ε_i ∈ {±1}, for i = 1, ..., ℓ.
Hence
∫ x dη = Σ_{i≤ℓ} η(ε_i x_i) ε_i x_i
is the mean vector of η. This expression also appears as the generic term in the formation of the convex hull in the definition of R. From this point of view, the Elfving set R consists of the mean vectors of all designs on the symmetrized regression range X ∪ (-X). However, the geometric shape of R turns out to be more important. The Elfving set R is a symmetric compact convex subset of R^k that contains the origin in its relative interior. Indeed, symmetry and convexity are built into the definition. Compactness holds true if the regression range X is compact, with the same argument as in Lemma 1.26. The relative interior of R is the interior relative to the subspace C(X) that is generated by x ∈ X. Specifically, with regression vectors x₁, ..., x_ℓ ∈ X that span C(X), the Elfving set R includes the polytope conv{±x₁, ..., ±x_ℓ} which, in turn, contains a ball open relative to C(X) around the origin. Hence the origin lies in the relative interior of R. 2.10. CYLINDERS THAT INCLUDE THE ELFVING SET If the regression range X consists of a finite number of points, then the Elfving set R is a polytope, that is, the convex hull of a finite set. It then has vertices and faces. This is in contrast to the smooth boundary of the ball {z ∈ R^k : z'Nz ≤ 1} that is obtained from the scalar product (x, y)_N = x'Ny, as given by a positive definite matrix N. Such a scalar product ball has no vertices or faces. Nevertheless Elfving sets and scalar product balls are linked to each other in an intrinsic manner. A scalar product ball given by a positive definite matrix N is an ellipsoid, because of the full rank of N. If we drop the full rank assumption, the ellipsoid may degenerate to a cylinder. For a nonnegative definite matrix N ∈ NND(k), we call the set of vectors
{z ∈ R^k : z'Nz ≤ 1}
the cylinder induced by N. It includes the nullspace of N, or in geometric terminology, it recedes to infinity in all directions of this nullspace (see Exhibit 2.3). Elfving sets allow many shapes other than cylinders. However, we may approximate a given Elfving set R from the outside, by considering all cylinders that include R. Since cylinders are symmetric and convex, inclusion of R is equivalent to inclusion of the regression range X. Identifying a cylinder with the matrix inducing it, we define the set N of cylinders that include R,

EXHIBIT 2.3 Cylinders. Left: the cylinder induced by a singular nonnegative definite matrix N recedes to infinity in the directions of the nullspace of N. Right: the cylinder induced by a positive definite matrix N is a compact ellipsoid, and has no direction of recession.

or X, by
N = {N ∈ NND(k) : x'Nx ≤ 1 for all x ∈ X}.
These geometric considerations may sound appealing or not; we have yet to convince ourselves that they can be put to good use. The key result is that they provide bounds for the design problem, as follows. 2.11. MUTUAL BOUNDEDNESS THEOREM FOR SCALAR OPTIMALITY Theorem. Let M be a moment matrix that is feasible for c'θ, M ∈ M(Ξ) ∩ A(c), and let N be a cylinder that includes the regression range X,

Then we have c'M^-c ≥ c'Nc, with equality if and only if M and N fulfill conditions (1) and (2) given below. More precisely, we have

with respective equality if and only if, for every design ξ ∈ Ξ which has moment matrix M,

Proof. Inequality (i) follows from the assumption that M is the moment matrix of a design ξ ∈ Ξ, say. By definition of N, we have 1 ≥ x'Nx for all x ∈ X, and integration with respect to ξ leads to
1 ≥ ∫_X x'Nx dξ = trace MN.
Moreover the upper bound is attained if and only if x'Nx = 1 for all x in the support of ξ, thereby establishing condition (1). Inequality (ii) relates to the assumption that M lies in the feasibility cone A(c). The fact that c ∈ range M opens the way to applying the Gauss-Markov Theorem 1.20 to obtain M ≥ min_{L ∈ R^{k×k}: Lc = c} LML' = c(c'M^-c)^{-1}c'. Since N is nonnegative definite, the linear form A ↦ trace AN is isotonic, by Section 1.11. This yields
trace MN ≥ trace c(c'M^-c)^{-1}c'N = c'Nc / c'M^-c.
It remains to show that in this display, equality forces condition (2), the converse being obvious. We start from two square root decompositions M = KK' and N = HH', and introduce the matrix

The definition of A does not depend on the choice of generalized inverses for M. We know this already for the expression c'M^-c. Because of M ∈ A(c), we have MM^-c = c. For K'M^-c, invariance then follows from K = MM^-K and K'Gc = K'M^-MGMM^-c = K'M^-c, for G ∈ M^-. Next we compute the squared norm of A:

Thus equality in (ii) implies that A has norm 0 and hence vanishes. Pre- and postmultiplication of A by K and H' give 0 = MN - c(c'M^-c)^{-1}c'N, which is condition (2). Essentially the same result will be met again in Theorem 7.11 where we discuss a multidimensional parameter system. The somewhat strange sequencing of vectors, scalars, and matrices in condition (2) is such as to readily carry over to the more general case.
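A numerical illustration of the bound c'Nc ≤ c'M^-c (all choices below are illustrative): for the line fit model on [-1, 1], any cylinder that contains the regression range gives a lower bound on the attainable variance, and this particular pair happens to attain equality.

```python
import numpy as np

# Line fit model on [-1, 1]; design with equal weight on t = -1 and t = 1.
support = np.array([[1.0, -1.0], [1.0, 1.0]])
w = np.array([0.5, 0.5])
M = sum(wi * np.outer(xi, xi) for wi, xi in zip(w, support))
c = np.array([0.0, 1.0])                      # interest in the slope

# A cylinder N with x'Nx <= 1 on the regression range: N = diag(0, 1) works here,
# since x'Nx = t^2 <= 1 for x = (1, t)' with |t| <= 1.
N = np.diag([0.0, 1.0])

lower = c @ N @ c                             # c'Nc
value = c @ np.linalg.pinv(M) @ c             # c'M^-c
print(lower, value, lower <= value + 1e-12)   # 1.0 1.0 True: equality is attained
```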

The theorem suggests that the design problem is accompanied by the dual problem:
Maximize c'Nc subject to N ∈ 𝒩.

The two problems bound each other in the sense that every feasible value for one problem provides a bound for the other:

c'M⁻c ≥ c'Nc for all M ∈ M(Ξ) ∩ A(c) and all N ∈ 𝒩.

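For a concrete instance of this mutual bound (a sketch assuming NumPy; it anticipates the line fit example of Section 2.20), take the slope coefficient c = (0, 1)' over T = [−1; 1], the design with mass 1/2 at t = ±1, and the cylinder N = diag(0, 1). The primal value c'M⁻c and the dual value c'Nc coincide, which certifies that both the design and the cylinder are optimal.

import numpy as np

X = np.array([[1.0, -1.0],          # regression vectors (1, t)' at the support points t = -1, +1
              [1.0,  1.0]])
w = np.array([0.5, 0.5])
M = X.T @ (w[:, None] * X)          # moment matrix of the two-point design
c = np.array([0.0, 1.0])            # slope coefficient vector
N = np.diag([0.0, 1.0])             # x'Nx = t^2 <= 1 for all t in [-1, 1], so N is an admissible cylinder

print(c @ np.linalg.solve(M, c))    # primal value c'M^-1 c = 1
print(c @ N @ c)                    # dual value c'Nc = 1, so the bound is attained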
Equality holds as soon as we find matrices M and N such that c'M⁻c = c'Nc. Then M is an optimal solution of the design problem, N is an optimal solution of the dual problem, and M and N jointly satisfy conditions (1) and (2) of the theorem. But so far nothing has been said about whether such a pair of optimal matrices actually exists. If the infimum or the supremum were not to be attained, or if they were to be separated by strict inequality (a duality gap), then the theorem would be of limited usefulness. It is at this point that scalar parameter systems permit an argument much briefer than in the general case, in that the optimal matrices submit themselves to an explicit construction.

2.12. THE ELFVING NORM

The design problem for a scalar subsystem c'θ is completely resolved by the Elfving Theorem 2.14. As a preparation, we take another look at the Elfving set ℛ in the regression space C(X) ⊆ ℝ^k that is spanned by the regression range X. For a vector z ∈ C(X) ⊆ ℝ^k, the number

ρ(z) = inf{ δ > 0 : z ∈ δℛ }

is the scale factor needed to blow up or shrink the set ℛ so that z comes to lie on its boundary. It is a standard fact from convex analysis that, on the space C(X), the function ρ is a norm. In our setting, we call ρ the Elfving norm. Moreover, the Elfving set ℛ figures as its unit ball,

ℛ = { z ∈ C(X) : ρ(z) ≤ 1 }.

Boundary points of ℛ are characterized through ρ(z) = 1, and this property of ρ is essentially all we need. Scalar parameter systems c'θ are peculiar in that their coefficient vector c can be embedded in the regression space. This is in contrast to multidimensional parameter systems K'θ with coefficient matrices K ∈ ℝ^{k×s} of rank


s > 1. The relation between the coefficient vector c and the Elfving set ℛ is the key to the solution of the problem. Rescaling 0 ≠ c ∈ C(X) by ρ(c) places c/ρ(c) on the boundary of ℛ. As a member of the convex set ℛ, the vector c/ρ(c) admits a representation

say, with ε_i ∈ {±1} and x_i ∈ X for i = 1, ..., ℓ, and η a design on the symmetrized regression range X ∪ (−X). The fact that c/ρ(c) lies on the boundary of ℛ prevents η from putting mass at opposite points, that is, at points x_1 ∈ X and x_2 ∈ X with x_2 = −x_1. Suppose this happens and, without loss of generality, assume η(x_1) ≥ η(−x_1) > 0. Then the vector z from (1) has norm smaller than 1:

In the inequalities, we have first employed the triangle inequality on ρ, then used ρ(ε_i x_i) ≤ 1 for ε_i x_i ∈ ℛ, and finally added 2η(−x_1) > 0. Hence we get 1 = ρ(c/ρ(c)) = ρ(z) < 1, a contradiction. The bottom line of the discussion is the following. Given a design η on the symmetrized regression range X ∪ (−X) such that η satisfies (1), we define a design ξ on the regression range X through

We have just shown that the two terms η(x) and η(−x) cannot be positive


at the same time. In other words the design ξ satisfies

Thus representation (1) takes the form

with ε a function on X which on the support of ξ takes values ±1. These designs ξ and their moment matrices M(ξ) will be shown to be optimal in the Elfving Theorem 2.14.

2.13. SUPPORTING HYPERPLANES TO THE ELFVING SET

Convex combinations are "interior representations". They find their counterpart in "exterior representations" based on supporting hyperplanes. Namely, since c/ρ(c) is a boundary point of the Elfving set ℛ, there exists a supporting hyperplane to the set ℛ at the point c/ρ(c), that is, there exist a nonvanishing vector h ∈ C(X) ⊆ ℝ^k and a real number γ such that

Examples are shown in Exhibit 2.4. In view of the geometric shape of the Elfving set ℛ, we can simplify this condition as follows. Since ℛ is symmetric we may insert z as well as −z, turning the left inequality into |z'h| ≤ γ. As a consequence, γ is nonnegative. But γ cannot vanish, otherwise ℛ lies in a hyperplane of the regression space C(X) and has empty relative interior. This is contrary to the fact, mentioned in Section 2.9, that the interior of ℛ relative to C(X) is nonempty. Hence γ is positive. Dividing by γ > 0 and replacing h by h/γ ≠ 0, the supporting hyperplane to ℛ at c/ρ(c) is given by

with some vector h ∈ C(X), h ≠ 0. The square of inequality (1) proves that the matrix defined by N = hh' satisfies


EXHIBIT 2.4 Supporting hyperplanes to the Elfving set. The diagram applies to the line fit model. Bottom: at c = (…), the unique supporting hyperplane is the one orthogonal to h = (…). Top: at d = (…), the supporting hyperplanes are those orthogonal to h = (…) for some λ ∈ [0; 1].

Hence N lies in the set 𝒩 of cylinders that include ℛ, introduced in Section 2.10. The equality in (1) determines the particular value

Therefore Theorem 2.11 tells us that (ρ(c))² is a lower bound for the optimal variance of the design problem. In fact, this is the optimal variance.

2.14. THE ELFVING THEOREM

Theorem. Assume that the regression range X ⊆ ℝ^k is compact, and that the coefficient vector c ∈ ℝ^k lies in the regression space C(X) and has Elfving norm ρ(c) > 0. Then a design ξ ∈ Ξ is optimal for c'θ in Ξ if and only if there exists a function ε on X which on the support of ξ takes values ±1 such that

c/ρ(c) = Σ_{x ∈ supp ξ} ε(x) ξ(x) x.

There exists an optimal design for c'θ in Ξ, and the optimal variance is (ρ(c))². Proof. As a preamble, let us review what we have already established. The Elfving set and the Elfving norm, as introduced in the preceding sections,


are

There exists a vector h ∈ C(X) ⊆ ℝ^k that determines a supporting hyperplane to ℛ at c/ρ(c). The matrix N = hh' induces a cylinder that includes ℛ, or X, as discussed in Section 2.13, with c'Nc = (ρ(c))². Now the proof is arranged like that of the Gauss-Markov Theorem 1.19 by verifying, in turn, sufficiency, existence, and necessity. First the converse is proved. Assume there is a representation of the form c/ρ(c) = Σ_{x ∈ supp ξ} ε(x)ξ(x)x. Let M be the moment matrix of ξ. We have ε(x)x'h ≤ 1 for all x ∈ X. In view of

every support point x of ξ satisfies ε(x)x'h = 1. We get x'h = 1/ε(x) = ε(x), and

This shows that M lies in the feasibility cone A(c). Moreover, this yields h'Mh = c'h/ρ(c) = 1 and c'M⁻c = (ρ(c))²h'MM⁻Mh = (ρ(c))². Thus the lower bound (ρ(c))² is attained. Hence the bound is optimal, as are M, ξ, and N. Next we tackle existence. Indeed, we have argued in Section 2.12 that c/ρ(c) does permit a representation of the form ∫_X ε(x)x dξ. The designs ξ leading to such a representation are therefore optimal. Finally this knowledge is put to use in the direct part of the proof. If a design ξ is optimal, then M(ξ) ∈ A(c) and c'M(ξ)⁻c = (ρ(c))². Thus conditions (1) and (2) of Theorem 2.11 hold with M = M(ξ) and N = hh'. Condition (1) yields

Condition (2) is postmultiplied by h/h'h. Insertion of (c'M(ξ)⁻c)⁻¹ = (ρ(c))⁻² and c'h = ρ(c) produces

Upon defining ε(x) = x'h, the quantities ε(x) have values ±1, by condition (i), while condition (ii) takes the desired form c/ρ(c) = Σ_{x ∈ supp ξ} ε(x)ξ(x)x.
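When the regression range is finite (or a finite candidate support has been fixed), the Elfving norm and an optimal design can be computed by linear programming: ρ(c) is the least value of Σ|λ_i| over representations c = Σλ_i x_i, the optimal weights are ξ(x_i) = |λ_i|/ρ(c), and the signs are ε(x_i) = sign λ_i. The sketch below assumes NumPy and SciPy are available; the function name elfving_norm and the restriction of the parabola fit of Section 2.21 to the candidate support {−1, 0, 1} are ours.

import numpy as np
from scipy.optimize import linprog

def elfving_norm(X, c):
    # rho(c) = min sum|lambda_i| subject to X @ lambda = c, via lambda = a - b with a, b >= 0
    k, m = X.shape
    res = linprog(np.ones(2 * m), A_eq=np.hstack([X, -X]), b_eq=c,
                  bounds=[(0, None)] * (2 * m))
    lam = res.x[:m] - res.x[m:]
    return res.fun, np.abs(lam) / res.fun, np.sign(lam)    # norm, design weights xi, signs eps

f = lambda t: np.array([1.0, t, t * t])
X = np.column_stack([f(t) for t in (-1.0, 0.0, 1.0)])
rho, xi, eps = elfving_norm(X, np.array([-1.0, 0.0, 2.0]))
print(rho, xi, eps)     # 5.0, weights (0.2, 0.6, 0.2), signs (+1, -1, +1)

The weights 1/5, 3/5, 1/5 agree with the optimal design quoted in Section 2.21, and the optimal variance is ρ(c)² = 25.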


The theorem gives the solution in terms of designs, even though the formulation of the design problem apparently favors moment matrices. This, too, is peculiar to the case of scalar parameter systems. Next we cast the result into a form that reveals its place in the theory to be developed. To this end we need to take up once more the discussion of projections and direct sum decompositions from Section 1.22.

2.15. PROJECTORS FOR GIVEN SUBSPACES

Lemma. Let A and B be nonnegative definite k × k matrices such that the ranges provide a direct sum decomposition of ℝ^k:

Then (A + B)⁻¹ is a generalized inverse of A, and A(A + B)⁻¹ is the projector onto the range of A along the range of B. Proof. The matrix A + B is nonsingular, by Lemma 2.3. Upon setting G = (A + B)⁻¹, we have Ik = AG + BG. We claim AGB = 0, which evidently follows from nullspace AG = range B. The direct inclusion holds since, if x is a vector such that AGx = 0, then x = AGx + BGx = BGx ∈ range B. Equality of the two subspaces is a consequence of the fact that they have the same dimension, k − rank A. Namely, because of the nonsingularity of G, k − rank A is the nullity common to AG and A. It is also the rank of B, in view of the direct sum assumption. With AGB = 0, postmultiplication of Ik = AG + BG by A produces A = AGA. This verifies G to be a generalized inverse of A for which the projector AG has nullspace equal to the range of B. The Elfving Theorem 2.14 now permits an alternative version that puts all the emphasis on moment matrices, as does the General Equivalence Theorem 7.14.

2.16. EQUIVALENCE THEOREM FOR SCALAR OPTIMALITY

Theorem. A moment matrix M ∈ M(Ξ) is optimal for c'θ in M(Ξ) if and only if M lies in the feasibility cone A(c) and there exists a generalized inverse G of M such that


Proof. First the converse is proved, providing a sufficient condition for optimality. We use the inequality

for matrices A ∈ M(Ξ) ∩ A(c), obtained from the Gauss-Markov Theorem 1.20 with U = Ik, X = c, V = A. This yields the second inequality in the chain

But we have c'G'c = c'Gc = c'M⁻c, since M lies in the feasibility cone A(c). Thus we obtain c'A⁻c ≥ c'M⁻c, and M is optimal. The direct part, the necessity of the condition, needs more attention. From the proof of the Elfving Theorem 2.14, we know that there exists a vector h ∈ ℝ^k such that this vector and an optimal moment matrix M jointly satisfy the three conditions

This means that the equality c'M⁻c = c'Nc holds true with N = hh'; compare the Mutual Boundedness Theorem 2.11. The remainder of the proof, the construction of a suitable generalized inverse G of M, is entirely based on conditions (0), (1), and (2). Given any other moment matrix A ∈ M(Ξ) belonging to a competing design η ∈ Ξ we have, from condition (0),

Conditions (1) and (2) combine into

Next we construct a symmetric generalized inverse G of M with the property that

From (4) we have c'h ≠ 0. This means that c'(αh) = 0 forces α = 0, or formally,


With Lemma 1.13, a passage to the orthogonal complement gives

But the vector c is a member of the range of M, and the left hand sum can only grow bigger if we replace range c by range M. Hence, we get

This puts us in a position where we can find a nonnegative definite k x k matrix H with a range that is included in the nullspace of h' and that is complementary to the range of M:

The first inclusion is equivalent to range h ⊆ nullspace H, that is, Hh = 0. Now we choose for M the generalized inverse G = (M + H)⁻¹ of Lemma 2.15. Postmultiplication of Ik = GM + GH by h yields h = GMh, whence condition (5) is verified. Putting together steps (3), (5), (2), and (4) we finally obtain, for all A ∈ M(Ξ),

2.17. BOUNDS FOR THE OPTIMAL VARIANCE

Bounds for the optimal variance (ρ(c))² with varying coefficient vector c can be obtained in terms of the Euclidean norm ||c|| = √(c'c) of the coefficient vector c. Let r and R be the radii of the Euclidean balls inscribed in and circumscribing the Elfving set ℛ,

The norm ||z||, as a convex function, attains its maximum over all vectors z with ρ(z) = 1 at a particular vector z in the generating set X ∪ (−X). Also the norm is invariant under sign changes, whence maximization can be restricted


EXHIBIT 2.5 Euclidean balls inscribed in and circumscribing the Elfving set, for a three-point regression range X = {(…), (…), (…)}.

to the regression range X. Therefore, we obtain the alternative representation

Exhibit 2.5 illustrates these concepts. By definition we have, for all vectors c ≠ 0 in the space C(X) spanned by X,

If the upper bound is attained, ||c||/ρ(c) = R, then c/ρ(c) or −c/ρ(c) lies in the regression range X. In this case, the one-point design assigning mass 1 to the single point c/ρ(c) and having moment matrix M = cc'/(ρ(c))² is optimal for c'θ in M(Ξ). Clearly, c/ρ(c) is an eigenvector of the optimal moment matrix M, corresponding to the eigenvalue c'c/(ρ(c))² = R². The following corollary shows that the eigenvector property pertains to every optimal moment matrix, not just to those stemming from one-point designs. Attainment of the lower bound, ||c||/ρ(c) = r, does not generally lead to optimal one-point designs. Yet it still embraces a very similar result on the eigenvalue properties of optimal moment matrices.
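Evaluating the ratio ||c||/ρ(c) at the two coefficient vectors named in Section 2.21 reproduces the bound attainments r = 1/√5 and R = √3 for the parabola fit (a sketch assuming NumPy and SciPy; the finite candidate support {−1, 0, 1} is our restriction).

import numpy as np
from scipy.optimize import linprog

def rho(X, c):
    # Elfving norm over the columns of X: minimum of sum|lambda_i| subject to X @ lambda = c
    m = X.shape[1]
    res = linprog(np.ones(2 * m), A_eq=np.hstack([X, -X]), b_eq=c,
                  bounds=[(0, None)] * (2 * m))
    return res.fun

f = lambda t: np.array([1.0, t, t * t])
X = np.column_stack([f(t) for t in (-1.0, 0.0, 1.0)])
for c in (np.array([-1.0, 0.0, 2.0]), np.array([1.0, 1.0, 1.0])):
    print(np.linalg.norm(c) / rho(X, c))    # 0.4472... = 1/sqrt(5) = r, then 1.7320... = sqrt(3) = R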

56

CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS

2.18. EIGENVECTORS OF OPTIMAL MOMENT MATRICES

Corollary. Let the moment matrix M ∈ M(Ξ) be optimal for c'θ in M(Ξ). If 0 ≠ c ∈ C(X) and ||c||/ρ(c) = r, that is, c/ρ(c) determines the Euclidean ball inscribed in ℛ, then c is an eigenvector of M corresponding to the eigenvalue r². If 0 ≠ c ∈ C(X) and ||c||/ρ(c) = R, that is, c/ρ(c) determines the Euclidean ball circumscribing ℛ, then c is an eigenvector of M corresponding to the eigenvalue R². Proof. The hyperplane given by h = c/(ρ(c)r²) or h = c/(ρ(c)R²) supports ℛ at c/ρ(c), since ||c||/ρ(c) is the radius of the ball inscribed in or circumscribing ℛ. The proof of the Elfving Theorem 2.14 shows that c/ρ(c) = Mh, that is, Mc = r²c or Mc = R²c. The eigenvalues of any moment matrix M are bounded from above by R². This is an immediate consequence of the Cauchy inequality,

On the other hand, the in-ball radius r² does not need to bound the eigenvalues from below. Theorem 7.24 embraces situations in which the smallest eigenvalue is r², with rather powerful consequences. In general, however, nothing prevents M from becoming singular. For instance, suppose the regression range is the Euclidean unit ball of the plane, X = {x ∈ ℝ² : ||x|| ≤ 1} = ℛ. The ball circumscribing ℛ coincides with the ball inscribed in ℛ. The only optimal moment matrix for c'θ is the rank one matrix cc'/||c||². Here c is an eigenvector corresponding to the eigenvalue r² = 1, but the smallest eigenvalue is zero. The next corollary answers the question, for a given moment matrix M, which coefficient vectors c are such that M is optimal for c'θ in M(Ξ).

2.19. OPTIMAL COEFFICIENT VECTORS FOR GIVEN MOMENT MATRICES

Corollary. Let M be a moment matrix in M(Ξ). The set of all nonvanishing coefficient vectors c ∈ ℝ^k such that M is optimal for c'θ in M(Ξ) is given by the set of vectors c = δMh where δ > 0 and the vector h ∈ ℝ^k is such that it satisfies (x'h)² ≤ 1 = h'Mh for all x ∈ X. Proof. For the direct inclusion, the formula c = ρ(c)Mh from the proof of the Elfving Theorem 2.14 represents c in the desired form. For the converse inclusion, let ξ be a design with moment matrix M and let x be a support point. From h'Mh = 1, we find that ε(x) = x'h equals ±1. With δ > 0, we get c = δMh = δ Σ_{x ∈ supp ξ} ξ(x)ε(x)x, whence optimality follows.
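To see the corollary at work (a sketch assuming NumPy; the grid scan and the tolerance are ours), take the line fit over [−1; 1] and the design with mass 1/2 at t = ±1, so that M is the 2 × 2 identity. Scanning unit vectors h for the condition (x'h)² ≤ 1 = h'Mh over x = (1, t)' leaves only h proportional to (1, 0)' or (0, 1)'; hence this M is optimal exactly for coefficient vectors proportional to the intercept or to the slope.

import numpy as np

M = np.eye(2)                                    # moment matrix of the design with mass 1/2 at t = -1, +1
ts = np.linspace(-1.0, 1.0, 201)
X = np.column_stack([np.ones_like(ts), ts])      # regression vectors (1, t)'

admissible = []
for phi in np.linspace(0.0, np.pi, 361):         # unit vectors h, up to sign; h'Mh = 1 since M = I
    h = np.array([np.cos(phi), np.sin(phi)])
    if np.all((X @ h) ** 2 <= 1.0 + 1e-9):       # the condition (x'h)^2 <= 1 for all x in X
        admissible.append(np.round(h, 3))
print(admissible)                                 # only (1, 0)', (0, 1)' and (-1, 0)' survive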


Thus, in terms of cylinders that include the Elfving set, a moment matrix M is optimal in M(Ξ) for at least one scalar parameter system if and only if there exists a cylinder N ∈ 𝒩 of rank one such that trace MN = 1. Closely related conditions appear in the admissibility discussion of Corollary 10.10.

2.20. LINE FIT MODEL

We illustrate these results with the line fit model of Section 1.6,

with a compact interval [a; b] as experimental domain T. The vectors (1, t)' with t ∈ [a; b] constitute the regression range X, so that the Elfving set ℛ becomes the parallelogram with vertices ±(1, a)' and ±(1, b)'. Given a coefficient vector c, it is easy to compute its Elfving norm ρ(c) and to depict c/ρ(c) as a convex combination of the four vertices of ℛ. As an example, we consider the symmetric and normalized experimental domain T = [−1; 1]. The intercept α has coefficient vector c = (1, 0)'. A design τ on T is optimal for the intercept if and only if the first design moment vanishes, μ₁ = ∫_T t dτ = 0. This follows from the Elfving Theorem 2.14, since

forces ε(t) = 1 if τ(t) > 0, and μ₁ = 0. The optimal variance for the intercept is 1. The slope β has coefficient vector c = (0, 1)'. The design τ which assigns mass 1/2 to the points ±1 is optimal for the slope, because with ε(1) = 1 and ε(−1) = −1, we get

The optimal design is unique, since there are no alternative convex representations of c. The optimal variance for the slope is 1.

2.21. PARABOLA FIT MODEL

In the model for a parabola fit, the regression space is three-dimensional and an illustration becomes slightly more tedious. The model equation is


with regression function f(t) = (1, t, t²)'. The Elfving set ℛ is the convex hull of the two parabolic arcs ±(1, t, t²)' with t ∈ [a; b], as shown in Exhibit 2.2. If the experimental domain is [a; b] = [−1; 1], then the radius of the ball circumscribing ℛ is R = √3 = ||c||, with c = (1, 1, 1)'. The radius of the ball inscribed in ℛ is r = 1/√5 = ||c||/5, with c = (−1, 0, 2)'. The vector c/5 has Elfving norm 1 since it lies in the face with vertices f(−1), −f(0), and f(1),

Thus an optimal design for c'θ in M(Ξ) is obtained by allocating mass 1/5 to each of the two endpoints ±1 and putting the remaining mass 3/5 on the midpoint 0. Its moment matrix

has eigenvector c corresponding to the eigenvalue r² = 1/5, as stated by Corollary 2.18. The other two eigenvalues are larger, 2/5 and 6/5. We return to this topic in Section 9.12.
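These numbers are easy to reproduce (a sketch assuming NumPy): build the moment matrix of the design with weights 1/5, 3/5, 1/5 on −1, 0, 1, and inspect its eigenvalues, the eigenvector relation Mc = r²c, and the optimal variance c'M⁻¹c = (ρ(c))².

import numpy as np

f = lambda t: np.array([1.0, t, t * t])
pts, w = (-1.0, 0.0, 1.0), (0.2, 0.6, 0.2)
M = sum(wi * np.outer(f(t), f(t)) for wi, t in zip(w, pts))   # moment matrix of the optimal design

c = np.array([-1.0, 0.0, 2.0])
print(np.round(np.linalg.eigvalsh(M), 3))   # 0.2, 0.4, 1.2, that is, 1/5, 2/5, 6/5
print(M @ c)                                 # equals 0.2 * c, so c is an eigenvector for r^2 = 1/5
print(c @ np.linalg.solve(M, c))             # 25.0, the optimal variance (rho(c))^2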

2.22. TRIGONOMETRIC FIT MODELS

An example with much smoothness is the model for a trigonometric fit of degree d,

with k = 2d + 1 mean parameters α, β₁, γ₁, ..., β_d, γ_d. The experimental domain is the "unit circle" T = [0; 2π). The endpoint 2π may be omitted in view of the periodicity of the regression function

Because of this periodicity, we may as well start from the compact interval T = [0; 2π], in order to satisfy the compactness assumption of the Elfving Theorem 2.14. For degree one, d = 1, the Elfving set ℛ is the convex hull of the two Euclidean discs ±conv{(1, cos t, sin t)' : t ∈ [0; 2π)}. The Elfving set is a


truncated tube,

The radii from Section 2.17 are r = 1 and R = √2. Depending on where the ray generated by the coefficient vector c penetrates the boundary of ℛ, an optimal design for c'θ in M(Ξ) can be found with one or two points of support. Other optimal designs for this model are computed in Section 9.16.

2.23. CONVEXITY OF THE OPTIMALITY CRITERION

The elegant result of the Elfving Theorem 2.14 is borne out by the abundance of convexity that is present in the design problem of Section 2.7. For instance, the set of matrices M(Ξ) ∩ A(c) over which minimization takes place is the intersection of two convex sets and hence is itself convex, as follows from Lemma 1.26 and Lemma 2.4. (However, it is neither open nor closed.) Moreover, the optimality criterion c'M⁻c that is to be minimized is a convex function of M. We did not inquire into the convexity and continuity properties of the criterion function since we could well do without these properties. For multidimensional parameter systems, solutions are in no way as explicit as in the Elfving Theorem 2.14. Therefore we have to follow up a series of intermediate steps before reaching results of comparable strength in Chapter 7. The matrix analogue of the scalar term c'M⁻c is studied first.

EXERCISES

2.1 Show that sup_{h ∈ ℝ^k} (2c'h − h'Mh) equals c'M⁻c or ∞ according as M ∈ A(c) or not [Whittle (1973), p. 129].

2.2 Show that sup_{h ∈ ℝ^k : h'Mh ≤ 1} (c'h)² = sup_{h ∈ ℝ^k : Mh ≠ 0} (c'h)²/h'Mh equals c'M⁻c or ∞ according as M ∈ A(c) or not [Studden (1968), p. 1435].

2.3 Show that

[Studden (1968), p. 1437].

2.4 Let the set ℛ ⊆ ℝ^k be symmetric, compact, convex, with zero in its interior. Show that ρ(z) = inf{δ > 0 : z ∈ δℛ} defines a norm on ℝ^k, and that the unit ball {ρ ≤ 1} of ρ is ℛ.


2.5 For the dual problem of Section 2.11, show that if N ∈ 𝒩 is an optimal solution, then so is Nc(c'Nc)⁻c'N.

2.6 On the regression range X = …, show that the unique optimal design for θ₁ in Ξ is … = 1/3 [Silvey (1978), p. 554].

2.7 In the model with regression function f(t) = … over T = [0; 1], show that the unique optimal design for θ₁ in Ξ is … [Atwood (1969), p. 1581].

2.8 In the line fit model over T = [−1; 1], show that the unique optimal design for θ₁ + θ₂ in Ξ is … [Bandemer (1977), p. 217].

2.9 In the line fit model over T = [−1; 0], show that the unique optimal design for θ₁ + θ₂ in Ξ is …

2.10 In the parabola fit model over T = [−1; 1], determine the designs that are optimal for c'θ in Ξ, with (i) c = (1, 1, 1)', (ii) c = (1, 0, −1)', (iii) c = (1, 0, −2)' [Läuter (1976), p. 63].

2.11 In the preceding problems, find the optimal solutions N ∈ 𝒩 of the dual problem of Section 2.11.

CHAPTER 3

Information Matrices

Information matrices for subsystems of mean parameters in a classical linear model are studied. They are motivated through dispersion matrices of Gauss-Markov estimators, by the power of the F-test, and as Fisher information matrices. Their functional properties are derived from a representation as the minimum over a set of linear functions. The information matrix mapping then is upper semicontinuous on the closed cone of nonnegative definite matrices, but it fails to be continuous. The rank of information matrices is shown to reflect identifiability of the parameter system of interest. Most of the development is carried out for parameter systems of full rank, but is seen to generalize to the rank deficient case.

3.1. SUBSYSTEMS OF INTEREST OF THE MEAN PARAMETERS

The full mean parameter vector θ and a scalar subsystem c'θ represent just two distinct and extreme cases of a more general situation. The experimenter may wish to study s out of the total of k components, s < k, rather than being interested in all of them or a single one. This possibility is allowed for by studying linear parameter subsystems that have the form K'θ for some k × s matrix K.

We call K the coefficient matrix of the parameter subsystem K'θ. One-dimensional subsystems are covered as special cases through s = 1 and K = c. The full parameter system is included through K = Ik. The most restrictive aspect about parameter subsystems so defined is that they are linear functions of the full parameter vector θ. Nonlinear functions, such as θ₁/θ₂, or cos(θ₁ + √θ₂), say, are outside the scope of the theory that we develop. If a genuinely nonlinear problem has to be investigated, a linearization using the Taylor theorem may permit a valid analysis of the problem.


3.2. INFORMATION MATRICES FOR FULL RANK SUBSYSTEMS

What is an appropriate way of evaluating the performance of a design if the parameter system of interest is K'θ? In the first place, this depends on the underlying model which again we take to be the classical linear model,

Secondly, it might conceivably be influenced by the type of inference the experimenter has in mind. However, it is our contention that point estimation, hypothesis testing, and general parametric model-building all guide us to the same central notion of an information matrix. Its definition assumes that the k × s coefficient matrix K has full column rank s. DEFINITION. For a design ξ with moment matrix M the information matrix for K'θ, with k × s coefficient matrix K of full column rank s, is defined to be CK(M), where the mapping CK from the cone NND(k) into the space Sym(s) is given by

Here the minimum is taken relative to the Loewner ordering, over all left inverses L of K. The notation L is mnemonic for a left inverse; these matrices are of order s × k and hence may have more columns than rows, deviating from the conventions set out in Section 1.7. We generally call CK(A) an information matrix for K'θ, without regard to whether A is the moment matrix of a design or not. The matrix CK(A) exists as a minimum because it matches the Gauss-Markov Theorem in the form of Theorem 1.21, with X replaced by K and V by A. Moreover, with residual projector R = Ik − KL for some left inverse L of K, Theorem 1.21 offers the representations

whenever the left inverse L of K satisfies LAR' = 0. Such left inverses L of K are said to be minimizing for A. They are obtained as solutions of the linear matrix equation

Occasionally, we make use of the existence (rather than of the form) of left inverses L of K that are minimizing for A.


We provide a detailed study of information matrices, their statistical meaning (Section 3.3 to Section 3.10) and their functional properties (Section 3.11 to Section 3.19). Finally (Section 3.20 to Section 3.25), we introduce generalized information matrices for those cases where the coefficient matrix K is rank deficient.

3.3. FEASIBILITY CONES

Two instances of information matrices have already been dealt with, even though the emphasis at the time was somewhat different. The most important case occurs if the full parameter vector θ is of interest, that is, if K = Ik. Since the unique (left) inverse L of K is then the identity matrix Ik, the information matrix for θ reproduces the moment matrix M,

In other words, for a design ξ the matrix M(ξ) has two meanings. It is the moment matrix of ξ and it is the information matrix for θ. The distinction between these two views is better understood if moment matrices and information matrices differ rather than coincide. This happens with scalar subsystems c'θ, the second special case. If the matrix M lies in the feasibility cone A(c), Theorem 1.21 provides the representation

Here the information for c'θ is the scalar (c'M⁻c)⁻¹, in contrast to the moment matrix M. The design problem of Section 2.7 calls for the minimization of the variance c'M⁻c over all feasible moment matrices M. Clearly this is the same as maximizing (c'M⁻c)⁻¹. The task of maximizing information sounds reasonable. It is this view that carries over to greater generality. The notion of feasibility cones generalizes from the one-dimensional discussion of Section 2.2 to cover an arbitrary subsystem K'θ, by forming the subset of nonnegative definite matrices A such that the range of K is included in the range of A. The definition does not involve the rank of the coefficient matrix K. DEFINITION. For a parameter subsystem K'θ, the feasibility cone A(K) is defined by

A matrix A ∈ Sym(k) is called feasible for K'θ when A ∈ A(K); a design ξ is called feasible for K'θ when M(ξ) ∈ A(K).


Feasibility cones will also be used with other matrices in place of K. Their geometric properties are the same as in the one-dimensional case studied in Theorem 2.4. They are convex subcones of the closed cone NND(k), and always include the open cone PD(k). If the rank of K is as large as can be, k, then its feasibility cone A(K) coincides with PD(k) and is open. But if the rank of K is smaller than k, then singular matrices A may, or may not, lie in A(K), depending on whether their range includes the range of K. Here the feasibility cone is neither open nor closed. In the scalar case, we assumed that the coefficient vector c does not vanish. More generally, suppose that the coefficient matrix K has full column rank s. Then the Gauss-Markov Theorem 1.21 provides the representation

It is in this form that information matrices appear in statistical inference. The abstract definition chosen in Section 3.2 does not exhibit its merits until we turn to their functional properties. Feasibility cones combine various inferential aspects, namely those of estimability, testability, and identifiability. The following sections elaborate on these interrelations.

3.4. ESTIMABILITY

First we address the problem of estimating the parameters of interest, K'θ, for a model with given model matrix X. The subsystem K'θ is estimable if and only if there exists at least one n × s matrix U such that

as pointed out in Section 1.18. This entails K = X'U, or equivalently, range K ⊆ range X' = range X'X. With moment matrix M = (1/n)X'X, we obtain the estimability condition

that is, the moment matrix M must lie in the feasibility cone for K'θ. This visibly generalizes the result of Section 2.2 for scalar parameter systems. It also embraces the case of the full parameter system θ discussed in Section 1.23. There estimability forces the model matrix X and its associated moment matrix M to have full column rank k. In the present terms, this means M ∈ A(Ik) = PD(k).
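Numerically, the range inclusion behind the estimability condition is convenient to test through M M⁻ K = K, which holds for one (hence every) generalized inverse M⁻ exactly when range K is contained in range M. A small sketch assuming NumPy, with example matrices of our own choosing:

import numpy as np

def is_feasible(M, K, tol=1e-9):
    # range(K) is contained in range(M), i.e. M lies in A(K), iff M M^- K = K; use the pseudoinverse as M^-
    return np.allclose(M @ np.linalg.pinv(M) @ K, K, atol=tol)

M = np.diag([1.0, 1.0, 0.0])                               # singular moment matrix, range = span{e1, e2}
print(is_feasible(M, np.array([[1.0], [2.0], [0.0]])))     # True:  the coefficient matrix lies in span{e1, e2}
print(is_feasible(M, np.array([[0.0], [0.0], [1.0]])))     # False: it has a component along e3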


3.5. GAUSS-MARKOV ESTIMATORS AND PREDICTORS

Assuming estimability, the optimal estimator γ̂ for the parameter subsystem γ = K'θ is given by the Gauss-Markov Theorem 1.20. Using U'X = K', the estimator γ̂ for γ is

Upon setting M = (1/n)X'X, the dispersion matrix of the estimator γ̂ is

The dispersion matrix becomes invertible provided the coefficient matrix K has full column rank s, leading to the precision matrix

The common factor n/σ² does not affect the comparison of these matrices in the Loewner ordering. In view of the order reversing property of matrix inversion from Section 1.11, minimization of dispersion matrices is the same as maximization of information matrices. In this sense, the problem of designing the experiment in order to obtain precise estimators for K'θ calls for a maximization of the information matrices CK(M). A closely related task is that of prediction. Suppose we wish to predict additional responses Y_{n+j} = x_{n+j}'θ + E_{n+j}, for j = 1, ..., s. We assemble the prediction sites x_{n+j} in a k × s matrix K = (x_{n+1}, ..., x_{n+s}), and take the random vector γ̂ from (1) as a predictor for the random vector Z = (Y_{n+1}, ..., Y_{n+s})'. Clearly, γ̂ is linear in the available response vector Y = (Y_1, ..., Y_n)', and unbiased for the future response vector Z,

Any other unbiased linear predictor LY for Z has mean squared-error matrix

The first term appears identically in the estimation problem; the second term is constant. Hence γ̂ is an unbiased linear predictor with minimum mean squared-error matrix,


For the purpose of prediction, the design problem thus calls for minimization of K'M⁻K, as it does from the estimation viewpoint. The importance of the information matrix CK(M) is also justified on computational grounds. Numerically, the optimal estimate θ̂ for θ is obtained as the solution of the normal equations

Multiplication by 1/n turns the left hand side into Mθ̂. On the right hand side, the k × 1 vector T = (1/n)X'Y is called the vector of totals. If the regression vectors x_1, ..., x_ℓ are distinct and if Y_{ij} for j = 1, ..., n_i are the replications under x_i, then we get

where Ȳ_i = (1/n_i) Σ_{j ≤ n_i} Y_{ij} is the average over all observations under regression vector x_i. With this notation, the normal equations are Mθ̂ = T. Similarly, if the subsystem γ = K'θ is of interest then the optimal estimate γ̂ = K'θ̂ is the unique solution of the reduced normal equations

where L is a left inverse of K that is minimizing for M. The s × 1 vector LT is called the adjusted vector of totals. In order to verify that γ̂ solves the reduced normal equations, we use the representation γ̂ = (1/n)K'M⁻X'Y. Every left inverse L of K that is minimizing for M satisfies 0 = LMR' = LM − LML'K'. Because of MM⁻X' = X', we get what we want,

Our goal of maximizing information matrices typically forces the eigenvalues to become as equal as possible. For instance, if we maximize the smallest eigenvalue of CK(M) then, since M varies over the compact set ℳ, the largest eigenvalue of CK(M) will be close to a minimum. Thus an optimal information matrix CK(M) usually gives rise to reduced normal equations in which the condition number of the coefficient matrix CK(M) is close to the optimum. It is worthwhile to contemplate the role of the Gauss-Markov Theorem. We are using it twice. The first application is to a single model and a family of estimators, and conveys the result that the optimal estimator for K'θ has


a precision matrix proportional to CK(M) if the given model has moment matrix M. Here the role of the Gauss-Markov Theorem is the standard one, to determine the optimal member in the family of unbiased linear estimators. The second use of the Gauss-Markov Theorem occurs in the context of a family of models and a single estimation procedure, when maximizing CK(M) over a set ℳ of competing moment matrices. Here the role of the Gauss-Markov Theorem is a nonstandard one, to represent the matrix CK(M) in a way that aids the analysis.

3.6. TESTABILITY

To test the linear hypothesis K'θ = 0, we adopt the classical linear model with normality assumption of Section 1.4. Let the response vector Y follow an n-variate normal distribution N_{Xθ; σ²In}, with mean vector of the form μ = Xθ. Speaking of the "hypothesis K'θ = 0" is a somewhat rudimentary way to describe the hypothesis H0 that the mean vector μ is of the form Xθ for some θ ∈ ℝ^k with K'θ = 0. This is to be tested against the alternative H1 that the mean vector μ is of the form Xθ for some θ ∈ ℝ^k with K'θ ≠ 0. If the testing problem is to make any sense, then the hypothesis H0 and the alternative H1 must be disjoint. They intersect if and only if there exist some vectors θ1 and θ2 such that

Hence disjointness holds if and only if Xθ1 = Xθ2 implies K'θ1 = K'θ2. Because of linearity we may further switch to differences, θ0 = θ1 − θ2. Thus Xθ0 = 0 must entail K'θ0 = 0. This is the same as requiring that the nullspace of X is included in the nullspace of K'. Using Lemma 1.13, a passage to orthogonal complements yields

With moment matrix M = (1/n)X'X, we finally obtain the testability condition

pointing yet again to the importance of the feasibility cone A(K).

3.7. F-TEST OF A LINEAR HYPOTHESIS

Assuming testability, the F-test for the hypothesis K'θ = 0 has various desirable optimality properties. Rather than dwelling on those, we outline the


underlying rationale that ties the test statistic F to the optimal estimators for the mean vector under the full model, and under the hypothesis. We proceed in five steps. I. First we find appropriate matrix representations for the full model, and for the hypothetical model. The full model permits any mean vector of the form μ = Xθ for some vector θ, that is, μ can vary over the range of the model matrix X. With projector P = X(X'X)⁻X', we therefore have μ = Pμ in the full model. A similar representation holds in the submodel specified by the hypothesis H0. To this end, we introduce the matrices

From Lemma 1.17, the definition of P1 does not depend on the particular version of the generalized inverses involved. The matrices P0 and P1 are idempotent and symmetric; hence they are orthogonal projectors. They satisfy P1 = PP1 and P0 = P(In − P1). We claim that under the hypothesis H0, the mean vector μ varies over the range of the projector P0. Indeed, for mean vectors of the form μ = Xθ with K'θ = 0 we have P1Xθ = 0 and

Hence μ lies in the range of P0. Conversely, if μ is in the range of P0 then we obtain, upon choosing any vector θ0 = GX'P0μ with G ∈ (X'X)⁻ and using K'(X'X)⁻X'P1 = K'(X'X)⁻X' from Lemma 1.17,

In terms of projectors we are therefore testing the hypothesis H0: μ = P0μ, within the full model μ = Pμ. The alternative H1 thus is μ ≠ P0μ, within μ = Pμ. II. Secondly, we briefly digress and solve the associated estimation problems. The optimal estimator for μ in the grand model is PY. The optimal estimator for μ under the hypothesis H0 is P0Y, the orthogonal projection of the response vector Y onto the range of the matrix P0. The difference between the two optimal estimators,

should be small if μ belongs to the hypothesis. A plausible measure of size of this n × 1 vector is its squared Euclidean length,


(The reduction from the vector P1Y to the scalar Y'P1Y can be rigorously justified by invariance arguments.) A large model variance σ² entails a large variability of the response vector Y, and hence of the quadratic form Y'P1Y. Therefore its size is evaluated, not on an absolute scale, but relative to an estimate of the model variance. An unbiased, in fact optimal, estimator for σ² is

invoking the residual projector R = In − P = In − X(X'X)⁻X'. Indeed, this is the usual estimator for σ², in that Y'RY is the residual sum of squares and n − rank X is the number of associated degrees of freedom. III. In the third step, we return to the testing problem. Now the normality assumption comes to bear. The random vector P1Y is normally distributed with mean vector P1μ and dispersion matrix σ²P1, while the random vector RY follows a centered normal distribution with dispersion matrix σ²R. Moreover, the covariance matrix of the two vectors vanishes,

Hence the two statistics P1Y and RY are independent, and so are the quadratic forms Y'P1Y and Y'RY. Under the hypothesis, the mean vector P1μ = μ − P0μ vanishes. Therefore the statistic

has a central F-distribution with numerator degrees of freedom rank P1 = rank K, and with denominator degrees of freedom rank R = n − rank X. Large values of F go together with the squared norm of the difference vector P1Y outgrowing the estimate of the model variance σ². In summary, then, the F-test uses F as test statistic. Large values of F indicate a significant deviation from the hypothesis K'θ = 0. The critical value for significance level α is F⁻¹_{s, n−k}(1 − α), that is, the 1 − α quantile of the central F-distribution with numerator degrees of freedom s = rank K and denominator degrees of freedom n − k, where k = rank X. Tables of the F-distributions are widely available. For moderate degrees of freedom and α = .05 the critical value lies in the neighborhood of 4. IV. Fourthly, we discuss the global behavior of this test. For a full appreciation of the F-test we need to know how it performs, not just under the hypothesis, but in the entire model μ = Xθ. Indeed, the statistic F quite generally follows an F-distribution, except that the distribution may become


noncentral, with noncentrality parameter

Here we have used the formula X'X(X'X)⁻K = K, and the abbreviation

The noncentrality parameter vanishes, μ'P1μ = 0, if and only if the hypothesis H0: K'θ = 0 is satisfied. The expected value of the test statistic F reflects the noncentrality effect. Namely, it is well known that, for n > 2 + rank X, the statistic F has expectation

The larger the noncentrality parameter, the larger values for F we expect, and the clearer the test detects a significant deviation from the hypothesis H0. V. In the fifth and final step, we study how the F-test responds to a change in the model matrix X. That is, we ask how the test for K'θ = 0 compares in a grand model with a moment matrix M as before, relative to one with an alternate moment matrix A, say. We assume that the testability condition A ∈ A(K) is satisfied. The way the noncentrality parameter determines the expectation of the statistic F involves the matrices

It indicates that the F-test is uniformly better under moment matrix M than under moment matrix A provided

If K is of full column rank, this is the same as requiring CK(M) ≥ CK(A) in the Loewner ordering. Again we end up with the task of maximizing the information matrix CK(M). Some details have been skipped over. It is legitimate to compare two F-tests by their noncentrality parameters only with "everything else being equal". More precisely, model variance σ² and sample size n ought to be the


EXHIBIT 3.1 ANOVA decomposition. An observed yield y decomposes into P0y + P1y + Ry. The term P0y is fully explained by the hypothesis. The sum of squares ||P1y||² is indicative of a deviation from the hypothesis, while ||Ry||² measures the model variance.

same. That the variance per observation, σ²/n, is constant is also called for in the estimation problem. Furthermore, the testing problem even requires equality of the ranks of two competing moment matrices M and A because they affect the denominator degrees of freedom of the F-distribution. Nevertheless the more important aspects seem to be captured by the matrices CK(M) and CK(A). Therefore we keep concentrating on maximizing information matrices.

3.8. ANOVA

The rationale underlying the F-test is often subsumed under the heading analysis of variance. The key connection is the decomposition of the identity matrix into orthogonal projectors,

These projectors correspond to three nested subspaces:
i. ℝⁿ = range(P0 + P1 + R), the sample space;
ii. range(P0 + P1), the mean space under the full model;
iii. range P0, the mean space under the hypothesis H0.
Accordingly the response vector Y decomposes into the three terms P0Y, P1Y, and RY, as indicated in Exhibit 3.1.
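The decomposition is easily illustrated numerically (a sketch assuming NumPy; the line fit model matrix, the hypothesis "slope = 0", and the data are made up for the example, and P1 is formed by the standard projector construction for testing K'θ = 0). The three pieces are orthogonal projectors that add up to the identity, and the F statistic is the ratio of the two quadratic forms, each divided by its degrees of freedom.

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 8)
X = np.column_stack([np.ones_like(t), t])          # line fit, n = 8, k = 2
K = np.array([[0.0], [1.0]])                       # hypothesis: slope = 0
Y = 1.0 + 0.5 * t + 0.1 * rng.standard_normal(8)   # made-up observations

G = np.linalg.inv(X.T @ X)
P = X @ G @ X.T
P1 = X @ G @ K @ np.linalg.inv(K.T @ G @ K) @ K.T @ G @ X.T
P0 = P - P1
R = np.eye(8) - P

assert np.allclose(P0 + P1 + R, np.eye(8))         # the identity decomposes into the three projectors
assert np.allclose(P1 @ P0, 0) and np.allclose(P1 @ R, 0)

s, dof = 1, 8 - 2
F = (Y @ P1 @ Y / s) / (Y @ R @ Y / dof)           # F statistic for the hypothesis K'theta = 0
print(F)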


The term P0Y is entirely explained as a possible mean vector, both under the full model and under the hypothesis. The quadratic forms

are mean sums of squares estimating the deviation of the observed mean vector from the hypothetical model, and estimating the model variance. The sample space ℝⁿ is orthogonally decomposed with respect to the Euclidean scalar product because in the classical linear model, the dispersion matrix is proportional to the identity matrix. For a general linear model with dispersion matrix V > 0, the decomposition ought to be carried out relative to the pseudo scalar product (x, y)_V = x'V⁻¹y.

3.9. IDENTIFIABILITY

Estimability and testability share a common background, identifiability. We assume that the response vector Y follows a normal distribution N_{Xθ; σ²In}. By definition, the subsystem K'θ is called identifiable (by distribution) when all parameter vectors θ1, θ2 ∈ ℝ^k satisfy the implication

The premise means that the parameter vectors θ1 and θ2 specify one and the same underlying distribution. The conclusion demands that the parameters of interest then coincide as well. We have seen in Section 3.6 that, in terms of the moment matrix M = (1/n)X'X, we can transcribe this definition into the identifiability condition

Identifiability of the subsystem of interest is a necessary requirement for parametric model-building, with no regard to the intended statistical inference. Estimability and testability are but two particular instances of intended inference.

3.10. FISHER INFORMATION

A central notion in parametric modeling, not confined to the normal distribution, is the Fisher information matrix for the model parameter θ. There are two alternative definitions. It is the dispersion matrix of the first logarithmic derivative with respect to θ of the model density. Or, up to a change of sign, it is the expected matrix of the second logarithmic derivative. In a classical


linear model with normality assumption, the Fisher information matrix for the full mean parameter system turns out to be (n/σ²)M if the underlying design has moment matrix M. For the first s out of k parameters, θ1, ..., θs, the Fisher information matrix is taken to be (n/σ²)C, where the matrices M and C are related through their inverses,

We refer to this relation as the dispersion formula. The subscripting in (M⁻¹)₁₁ indicates that C⁻¹ is the upper left s × s block in the inverse of M. The formula creates the impression (wrongly, as we see in the next section) that two inversions are necessary, of M and of C, to obtain a reasonably simple relationship for C in terms of M. In our notation, we can write (θ1, ..., θs)' = K'θ by choosing for K the block matrix (Is, 0)'. With this notation we find

Thus calling CK(M) the information matrix for K'θ also ties in with the dispersion formula. We do not embark on the general theory of Fisher information for verifying the dispersion formula, but illustrate it by example. Strictly speaking, the classical linear model with normality assumption has k + 1 parameters, the k components of θ plus the scalar model variance σ². Therefore the Fisher information matrix for this model is of order (k + 1) × (k + 1),

In the present development, the mean parameter system θ is of interest, that is, the first k out of all k + 1 parameters. For its inverse information matrix, the dispersion formula yields (σ²/n)M⁻¹, whence its information matrix becomes (n/σ²)M. Disregarding the common factor n/σ², we are thus right in treating M to be the information matrix for θ. This concludes our overview of the statistical background why CK(M) is rightly called the information matrix for K'θ. Whatever statistical procedure we have in mind, a reasonable design must be such that its moment matrix M lies in the feasibility cone A(K), in which case CK(M) takes the closed form expression (K'M⁻K)⁻¹.

3.11. COMPONENT SUBSETS

The complexity of computing CK(A) is best appreciated if the subsystem of interest consists of s components of the full vector θ, rather than of an arbitrary


linear transformation of rank s. Without loss of generality we consider the first s out of the k mean parameters θ1, ..., θk, that is,

The corresponding block partitioning of a k x k matrix A is indicated through

with s × s block A11, (k − s) × (k − s) block A22, and s × (k − s) block A12 = A21'. The feasibility cone for the first s out of all k mean parameters comprises all nonnegative definite matrices such that the range includes the leading s coordinate subspace. Its members A obey the relation

calling for a generalized inversion of the k × k matrix A, followed by a regular inversion of the upper left s × s subblock. This emphasis on inversions is misleading; the dependence of CK(A) on A is actually less complex. A much simpler formula is provided by the Gauss-Markov Theorem 1.21, and it even applies to all nonnegative definite matrices rather than being restricted to those in the feasibility cone. With any left inverse L of K and residual projector R = Ik − KL, this formula reads

The liberty of choosing the left inverse L can be put to good use in making this expression as simple as possible. For K = (Is, 0)', we take

With this choice, the formula for CK(A) turns into

The matrix A11 − A12A22⁻A21 is called the Schur complement of A22 in A. Our derivation implies that it is well defined and nonnegative definite, but a direct proof of these properties is instructive.
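As a numerical cross-check (a sketch assuming NumPy; the 3 × 3 matrix and the split s = 1 are ours), the Schur complement route can be compared with the dispersion-formula route ((A⁻¹)₁₁)⁻¹ that is available whenever A is positive definite.

import numpy as np

def schur_information(A, s):
    # C_K(A) = A11 - A12 A22^- A21 for K = (I_s, 0)', defined for every A in NND(k)
    A11, A12, A21, A22 = A[:s, :s], A[:s, s:], A[s:, :s], A[s:, s:]
    return A11 - A12 @ np.linalg.pinv(A22) @ A21

A = np.array([[2.0, 1.0, 0.5],
              [1.0, 2.0, 0.0],
              [0.5, 0.0, 1.0]])     # positive definite, so feasibility is automatic
s = 1
print(schur_information(A, s))                        # [[1.25]]
print(np.linalg.inv(np.linalg.inv(A)[:s, :s]))        # [[1.25]] as well: ((A^-1)_11)^-1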


3.12. SCHUR COMPLEMENTS

Lemma. A symmetric block matrix

is nonnegative definite if and only if A22 is nonnegative definite, the range of A22 includes the range of A21, and the Schur complement A11 − A12A22⁻A21 is nonnegative definite. Proof. The direct part is proved first. With A being nonnegative definite, the block A22 is also nonnegative definite. We claim that the range of A22 includes the range of A21. By Lemma 1.13, we need to show that the nullspace of A22 is included in the nullspace of A21' = A12. Hence we take any vector y with A22y = 0. For every δ > 0 and for every vector x, we then get

Letting δ tend to zero, we obtain x'A12y ≥ 0 for all x, hence also for −x. Thus we get x'A12y = 0 for all x, and therefore A12y = 0. It follows from Lemma 1.17 that A22A22⁻A21 = A21 and that A12A22⁻A21 does not depend on the choice of the generalized inverse for A22. Finally nonnegative definiteness of A11 − A12A22⁻A21 is a consequence of the identity

For the converse part of the proof, we need only invert the last identity:

There are several lessons to be learnt from the Schur complement representation of information matrices. Firstly, we can interpret it as consisting of the term A11, the information matrix for (θ1, ..., θs)' in the absence of nuisance parameters, adjusted for the loss of information A12A22⁻A21 caused by entertaining the nuisance parameters (θ_{s+1}, ..., θk)'.


Secondly, Lemma 1.17 says that the penalty term A12A22⁻A21 has the same range as A12. Hence it may vanish, making an inversion of A22 obsolete. It vanishes if and only if A12 = 0, which may be referred to as parameter orthogonality of the parameters of interest and the nuisance parameters. For instance, in a linear model with normality assumption, the mean parameter vector θ and the model variance σ² are parameter orthogonal (see Section 3.10). Thirdly, the complexity of computing CK(A) is determined by the inversion of the (k − s) × (k − s) matrix A22. In general, we must add the complexity of computing a left inverse L of the k × s coefficient matrix K. For a general subsystem of interest K'θ with coefficient matrix K of full column rank s, four formulas for the information matrix CK(A) are now available:

Formula (1) clarifies the role of CK(A) in statistical inference. It is restricted to the feasibility cone A(K) but can be extended to the closed cone NND(k) by semicontinuity (see Section 3.13). Formula (2) utilizes an arbitrary left inverse L of K and the residual projector R = Ik − KL. It is of Schur complement type, focusing on the loss of information due to the presence of nuisance parameters. The left inverse L can be chosen in order to relieve the computational burden as much as possible. Formula (3) is based on a left inverse L of K that is minimizing for A, and shifts the computational task over to determining a solution of the linear matrix equation L(K, AR') = (Is, 0). Formula (4) has been adopted as a definition and is instrumental when we next establish the functional properties of the mapping CK. The formula provides a quasi-linear representation of CK, in the sense that the mappings A ↦ LAL' are linear in A, and that CK is the minimum of a collection of such linear mappings.

3.13. BASIC PROPERTIES OF THE INFORMATION MATRIX MAPPING

Theorem. Let K be a k × s coefficient matrix of full column rank s, with associated information matrix mapping


Then CK is positively homogeneous, matrix superadditive, and nonnegative, as well as matrix concave and matrix isotonic:

Moreover CK enjoys any one of the following three equivalent properties: a. (Upper semicontinuity) The level sets

are closed, for all C ∈ Sym(s). b. (Sequential semicontinuity criterion) For all sequences (Am)m≥1 in NND(k) that converge to a limit A, we have

c. (Regularization) For all A, B ∈ NND(k) we have

Proof. The first three properties are immediate from the definition of CK(A) as the minimum over the matrices LAL'. Superadditivity and homogeneity imply matrix concavity:

Superadditivity and nonnegativity imply matrix monotonicity,

Moreover CK enjoys property b. Suppose the matrices Am > 0 converge


to A such that CK(Am) ≥ CK(A). With a left inverse L of K that is minimizing for A we obtain

Hence CK(Am) converges to the limit CK(A). It remains to establish the equivalence of (a), (b), and (c). For this, we need no more than that the mapping CK : NND(k) → Sym(s) is matrix isotonic. Let ℬ = {B ∈ Sym(k) : trace B² ≤ 1} be the closed unit ball in Sym(k), as encountered in Section 1.9. There we proved that every matrix B ∈ ℬ fulfills |x'Bx| ≤ x'x for all x ∈ ℝ^k. In terms of the Loewner ordering this means −Ik ≤ B ≤ Ik, for all B ∈ ℬ. First we prove that (a) implies (b). Let (Am)m≥1 be a sequence in NND(k) converging to A, and satisfying CK(Am) ≥ CK(A) for all m ≥ 1. For every ε > 0, the sequence (Am)m≥1 eventually stays in the ball A + εℬ. This entails Am ≤ A + εIk and, by monotonicity of CK, also CK(Am) ≤ CK(A + εIk) for eventually all m. Hence the sequence (CK(Am))m≥1 is bounded in Sym(s), and possesses cluster points C. From CK(Am) ≥ CK(A), we obtain C ≥ CK(A). Let (CK(A_{m_ℓ}))_{ℓ≥1} be a subsequence converging to C. For all δ > 0, the sequence (CK(A_{m_ℓ}))_{ℓ≥1} eventually stays in C + δ𝒞, where 𝒞 is the closed unit ball in Sym(s). This implies CK(A_{m_ℓ}) ≥ C − δIs for eventually all ℓ. In other words, A_{m_ℓ} lies in the level set {CK ≥ C − δIs}. This set is closed, by assumption (a), and hence also contains A. Thus we get CK(A) ≥ C − δIs and, with δ tending to zero, CK(A) ≥ C. This entails C = CK(A). Since the cluster point C was arbitrary, (b) follows. Next we show that (b) implies (c). Indeed, the sequence Am = A + (1/m)B converges to A and, by monotonicity of CK, fulfills CK(Am) ≥ CK(A). Now (b) secures convergence, thus establishing part (c). Finally we demonstrate that (c) implies (a). For a fixed matrix C ∈ Sym(s) let (Aℓ)ℓ≥1 be a sequence in the level set {CK ≥ C} converging to a limit A. For every m ≥ 1, the sequence (Aℓ)ℓ≥1 eventually stays in the ball A + (1/m)ℬ. This implies Aℓ ≤ A + (1/m)Ik and, by monotonicity of CK, also CK(Aℓ) ≤ CK(A + (1/m)Ik). The left hand side is bounded from below by C since Aℓ ∈ {CK ≥ C}. The right hand side converges to CK(A) because of regularization. But C ≤ CK(A) means A ∈ {CK ≥ C}, establishing closedness of the level set {CK ≥ C}. The main advantage of defining the mapping CK as the minimum of the linear functions LAL' is the smoothness that it acquires on the boundary of its domain of definition, expressed through upper semicontinuity (a). Our definition requires closedness of the upper level sets {CK ≥ C} of the matrix-valued function CK, and conforms as it should with the usual concept of upper semicontinuity for real-valued functions, compare Lemma 5.7. Property (b)


provides a handy sequential criterion for upper semicontinuity. Part (c) ties in with regular, that is, nonsingular, matrices since A + (1/m)B is positive definite whenever B is positive definite. A prime application of regularization consists in extending the formula CK(A) = (K'A⁻K)⁻¹ from the feasibility cone A(K) to all nonnegative definite matrices,

By homogeneity this is the same as

where the point is that the positive definite matrices

converge to A along the straight line from Ik to A. The formula remains correct with the identity matrix Ik replaced by any positive definite matrix B. In other words, the representation (K'A⁻¹K)⁻¹ permits a continuous extension from the interior cone PD(k) to the closure NND(k), as long as convergence takes place "along a straight line" (see Exhibit 3.2). Section 3.16 illustrates by example that the information matrix mapping CK may well fail to be continuous if the convergence is not along straight lines. Next we show that the rank behavior of CK(A) completely specifies the feasibility cone A(K). The following lemma singles out an intermediate step concerning ranges.

3.14. RANGE DISJOINTNESS LEMMA

Lemma. Let the k × s coefficient matrix K have full column rank s, and let A be a nonnegative definite k × k matrix. Then the matrix

is the unique matrix with the three properties

Proof. First we show that AK enjoys each of the three properties. Nonnegative definiteness of AK is evident. Let L be a left inverse of K that is minimizing for A, and set R = Ik − KL. Then we have CK(A) = LAL', and


EXHIBIT 3.2 Regularization of the information matrix mapping. On the boundary of the cone NND(k), convergence of the information matrix mapping CK holds true provided the sequence A1, A2, ... tends along a straight line from inside PD(k) towards the singular matrix A, but may fail otherwise.

the Gauss-Markov Theorem 1.19 yields

This establishes the first property. The second property, the range inclusion condition, is obvious from the definition of AK. Now we turn to the range disjointness property of A − AK and K. For vectors u ∈ ℝ^k and v ∈ ℝ^s with (A − AK)u = Kv, we have


In the penultimate line, the product LAR' vanishes since the left inverse L of K is minimizing for A, by the Gauss-Markov Theorem 1.19. Hence the ranges of A − AK and K are disjoint except for the null vector. Second, consider any other matrix B with the three properties. Because of B ≥ 0 we can choose a square root decomposition B = UU'. From the second property, we get range U = range B ⊆ range K. Therefore we can write U = KW. Now A ≥ B entails AK ≥ B, by way of

Thus A − B = (A − AK) + (AK − B) is the sum of two nonnegative definite matrices. Invoking the third property and the range summation Lemma 2.3, we obtain

A matrix with range {0} must be the null matrix, forcing B = AK. The rank behavior of information matrices measures the extent of identifiability, indicating how much of the range of the coefficient matrix is captured by the range of the moment matrix.

3.15. RANK OF INFORMATION MATRICES

Theorem. Let the k × s coefficient matrix K have full column rank s, and let A be a nonnegative definite k × k matrix. Then

In particular CK(A) is positive definite if and only if A lies in the feasibility cone A(K). Proof. Define AK = KCK(A)K'. We know that A = (A − AK) + AK is the sum of two nonnegative definite matrices, by the preceding lemma. From Lemma 2.3 we obtain range A ∩ range K = (range(A − AK) ∩ range K) + (range AK ∩ range K) = range AK.


Since K has full column rank this yields

In particular, C_K(A) has rank s if and only if range A ∩ range K = range K, that is, the matrix A lies in the feasibility cone A(K). The somewhat surprising conclusion is that the notion of feasibility cones is formally redundant, even though it is essential to statistical inference. The theorem actually suggests that the natural order of first checking feasibility and then calculating C_K(A) be reversed. First calculate C_K(A) from the Schur complement type formula

with any left inverse L of K and residual projector R = I_k − KL. Then check nonsingularity of C_K(A) to see whether A is feasible. Numerically it is much simpler to find out whether the nonnegative definite matrix C_K(A) is singular or not, rather than to verify a range inclusion condition.

The case of the first s out of all k mean parameters is particularly transparent. As elaborated in Section 3.11, its information matrix is the ordinary Schur complement of A_22 in A. The present theorem states that the rank of the Schur complement determines feasibility of A, in that the Schur complement has rank s if and only if the range of A includes the leading s coordinate subspace.

For scalar subsystems c'θ, the design problem of Section 2.7 only involves those moment matrices that lie in the feasibility cone A(c). It becomes clear now why this restriction does not affect the optimization problem. If the matrix A fails to lie in the cone A(c), then the information for c'θ vanishes, having rank equal to

The formulation of the design problem in Section 2.7 omits only such moment matrices that provide zero information for c'θ. Clearly this is legitimate.

3.16. DISCONTINUITY OF THE INFORMATION MATRIX MAPPING

We now convince ourselves that matrix upper semicontinuity as in Theorem 3.13 is all we can generally hope for. If the coefficient matrix K is square and nonsingular, then we have C_K(A) = K⁻¹A(K')⁻¹ for A ∈ NND(k). Here the mapping C_K is actually continuous on the entire closed cone NND(k). But continuity breaks down as soon as the rank of K falls below k.


This discontinuity is easiest recognized for scalar parameter systems c'θ. We simply choose vectors d_m ≠ 0 orthogonal to c and converging to zero. The matrices A_m = (c + d_m)(c + d_m)' then do not lie in the feasibility cone A(c), whence C_c(A_m) = 0. On the other hand, the limit cc' has information one for c'θ,

The same reasoning applies as long as the range of the coefficient matrix K is not the full space R^k. To see this, let KK' = Σ_{j≤s} λ_j z_j z_j' be an eigenvalue decomposition. We can choose vectors d_m ≠ 0 orthogonal to the range of K and converging to zero. Then the matrices A_m = Σ_{j≤s} λ_j (z_j + d_m)(z_j + d_m)' satisfy

whence C_K(A_m) is singular. On the other hand, the limit matrix KK' comes with the information matrix

The matrices A_m do converge to KK', but convergence is not along a straight line. In this somewhat abstract reasoning, the approximating matrices A_m fail to lie in the feasibility cone, contrary to the limit matrices cc' and KK'. The following example is less artificial and does not suffer from this overemphasis of the feasibility cone.

Consider a line fit model with experimental domain T = [−1; 1]. A design τ on T is optimal for the intercept α if and only if its first moment μ_1 vanishes, as seen in Section 2.20. In particular, the symmetric two-point designs τ_{1,m} that assign equal mass 1/2 to the points ±1/m are optimal. For parameter r ∈ R, we now introduce the two-point designs τ_{r,m} that assign mass 1/2 to the support points −1/m and r/m. If m > |r|, then the support points lie in the experimental domain [−1; 1]. As m tends to infinity, the corresponding moment matrices A_{r,m} converge to a limit that does not depend on r,

The limit matrix diag(1, 0) is an optimal moment matrix for the intercept α, belonging, for example, to the design which assigns all its mass to the single point zero.


EXHIBIT 3.3 Discontinuity of the information matrix mapping. For the intercept in a line fit model, the information of the two-point designs τ_{r,m} exhausts all values between the minimum zero (r = −1) and the maximum one (r = 1). Yet the moment matrices A_{r,m} converge to a limit not depending on r.
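
The behavior summarized in Exhibit 3.3 can be checked with a few lines of code. This is a sketch only; NumPy is assumed, and it uses the two-point designs as reconstructed above, with mass 1/2 at the points −1/m and r/m and the regression function f(x) = (1, x)'.

```python
import numpy as np

def moment_matrix(points, weights):
    """Moment matrix for the line fit model, f(x) = (1, x)'."""
    M = np.zeros((2, 2))
    for x, w in zip(points, weights):
        f = np.array([1.0, x])
        M += w * np.outer(f, f)
    return M

def info_intercept(M):
    """Information for the intercept, c = (1, 0)': the Schur complement
    M[0,0] - M[0,1]^2 / M[1,1]; zero information when c is not in range M."""
    if M[1, 1] == 0.0:
        return M[0, 0] if M[0, 1] == 0.0 else 0.0
    return M[0, 0] - M[0, 1] ** 2 / M[1, 1]

for r in (-1.0, 0.0, 0.5, 1.0):
    values = [info_intercept(moment_matrix([-1.0 / m, r / m], [0.5, 0.5]))
              for m in (2, 10, 100)]
    print(f"r = {r:+.1f}: information {np.round(values, 4)}")   # constant in m

# The limiting moment matrix is the same for every r ...
print(moment_matrix([0.0], [1.0]))
# ... and carries information one for the intercept:
print(info_intercept(moment_matrix([0.0], [1.0])))
```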

For the matrices A_{r,m}, the Schur complement formula of Lemma 3.12 yields the information value for the intercept α,

constant in m. With varying parameter r, the designs τ_{r,m} exhaust all possible information values, between the minimum zero and the maximum one, as shown in Exhibit 3.3. For r ≠ 1, the unsymmetric two-point designs τ_{r,m} provide an instance of discontinuity,

The matrices A_{r,m} are nonsingular. Also the limiting matrix diag(1, 0) is feasible. Hence in the present example convergence takes place entirely within the feasibility cone A(c), but not along straight lines. This illustrates that the lack of continuity pertains not just to the complement of the feasibility cone, but to the boundary at large.

Next we show that the notion of information matrices behaves consistently in the case of iterated parameter systems. As a preparation we prove a result from matrix algebra.


3.17. JOINT SOLVABILITY OF TWO MATRIX EQUATIONS

Lemma. Let the matrices A ∈ R^{t x s}, B ∈ R^{t x k}, C ∈ R^{k x n}, D ∈ R^{s x n} be given. Then the two linear matrix equations AX = B and XC = D are jointly solvable if and only if they are individually solvable and AD = BC.

Proof. The direct part is trivial, AD = AXC = BC. For the converse part, we first note that the single equation AX = B is solvable if and only if range B ⊆ range A. By Lemma 1.17, this is the same as AA⁻B = B. Similarly C'X' = D' is solvable if and only if C'(C')⁻D' = D'. Altogether we now have the three conditions

With generalized inverses G of A and H of C, the matrix X = DH + GB − GADH solves both equations: AX = ADH + AGB − AGADH = B results from (1), and XC = DHC + GBC − GADHC = D + G(BC − AD) = D results from (2) and (3).
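
The construction in this proof is easy to verify numerically. A minimal sketch, not from the text: NumPy is assumed, the dimensions are invented, and the consistent right-hand sides B and D are manufactured from a common solution X0 so that all three solvability conditions hold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Manufacture a consistent instance: choose A, C and a common solution X0,
# then set B = A X0 and D = X0 C, so that AX = B and XC = D are individually
# solvable and AD = A X0 C = BC.
A = rng.standard_normal((4, 3))
C = rng.standard_normal((5, 2))
X0 = rng.standard_normal((3, 5))
B, D = A @ X0, X0 @ C

G = np.linalg.pinv(A)                 # a generalized inverse of A
H = np.linalg.pinv(C)                 # a generalized inverse of C

# The joint solution from the proof: X = D H + G B - G A D H.
X = D @ H + G @ B - G @ A @ D @ H

print(np.allclose(A @ X, B))          # True: AX = B
print(np.allclose(X @ C, D))          # True: XC = D
```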

3.18. ITERATED PARAMETER SUBSYSTEMS

Information matrices for K'θ depend on moment matrices. But moment matrices themselves are information matrices for θ. The idea of "iterating on information" holds true more generally as soon as the subsystems under investigation are appropriately nested. To be precise, let us consider a first subsystem K'θ with k x s coefficient matrix K of rank s. The information matrix mapping for K'θ is given by

Left inverses of K that are minimizing for A are denoted by L_{K,A}. Now we adjoin a sub-subsystem H'K'θ, given by an s x r coefficient matrix H of rank r. The information matrix mapping associated with H is

Left inverses of H that are minimizing for B are denoted by L_{H,B}.


However, the subsystem H'K'θ can also be read as (KH)'θ, with k x r coefficient matrix KH and information matrix mapping

Left inverses of KH that are minimizing for A are denoted by L_{KH,A}. The interrelation between these three information matrix mappings makes the notion of iterated information matrices more precise.

3.19. ITERATED INFORMATION MATRICES

Theorem. With the preceding notation we have

C_{KH}(A) = C_H(C_K(A)).

Moreover, a left inverse L_{KH,A} of KH is minimizing for A if and only if it factorizes into some left inverse L_{K,A} of K that is minimizing for A and some left inverse L_{H,C_K(A)} of H that is minimizing for C_K(A),

Proof. There exist left inverses L_{K,A} of K minimizing for A and L_{H,C_K(A)} of H minimizing for C_K(A). With this, we define the matrix L = L_{H,C_K(A)}L_{K,A}. Obviously LKH = L_{H,C_K(A)}L_{K,A}KH = L_{H,C_K(A)}H = I_r, that is, L is a left inverse of KH. The three associated residual projectors are denoted by

For L to be minimizing for A, it remains to show that we have R_K R_{KH} = R_K and LAR'_{KH} = 0. Thus we get

= 0. But


This establishes the converse part of the factorization formula, that L_{H,C_K(A)}L_{K,A} is a left inverse of KH that is minimizing for A, and it proves the iteration formula,

For the direct part of the factorization formula, let L_{KH,A} be a left inverse of KH that is minimizing for A, with residual projector R_{KH} = I_k − KH L_{KH,A}. Setting R_K = I_k − KL, with an arbitrary left inverse L of K, we have R_K = R_K R_{KH}, and L_{KH,A}AR'_K = L_{KH,A}AR'_{KH}R'_K = 0. We consider the two matrix equations

The first equation has solution X = HL_{KH,A}, for instance. The solutions of the second equation are the left inverses of K that are minimizing for A. The four coefficient matrices satisfy L_{KH,A}K(I_s, 0) = L_{KH,A}(K, AR'_K). In view of Lemma 3.17, the two equations admit a joint solution which we also denote by X. The first equation provides the factorization L_{KH,A} = (L_{KH,A}K)X. The factor X is a left inverse of K that is minimizing for A, by the second equation. The factor L_{KH,A}K is a left inverse of H. Setting R_H = I_s − HL_{KH,A}K, the first equation and XK = I_s yield

Hence L_{KH,A}K is a left inverse of H that is minimizing for C_K(A), and the proof is complete.

This concludes the study of information matrices for subsystems K'θ with coefficient matrix K of full column rank s.
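
A numerical spot check of the iteration formula is straightforward when A is positive definite, and hence feasible for every subsystem. This sketch is an illustration only; NumPy is assumed and the dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

def C(A, K):
    """Information matrix C_K(A) = (K'A^{-1}K)^{-1} for positive definite A."""
    return np.linalg.inv(K.T @ np.linalg.inv(A) @ K)

k, s, r = 5, 3, 2
K = rng.standard_normal((k, s))       # k x s, full column rank (a.s.)
H = rng.standard_normal((s, r))       # s x r, full column rank (a.s.)
X = rng.standard_normal((k, k))
A = X @ X.T + np.eye(k)               # positive definite

# Iteration formula: C_{KH}(A) = C_H(C_K(A)).
print(np.allclose(C(A, K @ H), C(C(A, K), H)))   # True
```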
3.20. RANK DEFICIENT SUBSYSTEMS

Not all parameter systems K'θ that are of statistical interest have a coefficient matrix K that is of full column rank.


For instance, in the two-way classification model of Section 1.27, the mean parameters comprise a subvector α = (α_1, ..., α_a)', where α_i is the mean effect of level i of factor A. Often these effects are of interest (and identifiable) only after being corrected by the mean effect ᾱ = (1/a) Σ_{i≤a} α_i. This gives rise to the special coefficient matrices defined by

The subsystem K_a α consists of the parameter functions (α_1 − ᾱ, ..., α_a − ᾱ)', and is called the system of centered contrasts of factor A. The coefficient matrix K_a fails to have full column rank a. In fact, the matrix J_a is the orthogonal projector onto the one-dimensional subspace of R^a spanned by the unity vector 1_a = (1, ..., 1)'. Its residual projector K_a projects onto the orthogonal complement, and therefore has rank a − 1. Together they provide an orthogonal decomposition of the parameter subspace R^a, corresponding to

The matrix J_a is called the averaging matrix or the diagonal projector and K_a is called the centering matrix or the orthodiagonal projector. We return to this model in Section 3.25.

3.21. GENERALIZED INFORMATION MATRICES FOR RANK DEFICIENT SUBSYSTEMS

It is neither helpful nor wise to remedy rank deficiency through a full rank reparametrization. In most applications, the parameters have a definite meaning and this meaning is destroyed or at least distorted by reparametrization. Instead the notion of information matrices is generalized in order to cover coefficient matrices that do not have full column rank. Which difficulties do we encounter if the k x s coefficient matrix K has rank r < s? In the case of a full column rank, r = s, a feasible moment matrix A ∈ A(K) has information matrix C_K(A) given by the formula

If K is rank deficient, r < s, so is the product K'A⁻K and the matrix K'A⁻K fails to be invertible. The lack of invertibility cannot, at this point, be overcome by generalized inversion. A singular matrix K'A⁻K has many generalized inverses, and we have no clue which one deserves to play the distinctive role of the information matrix for K'θ.


If invertibility breaks down, which properties should carry over to the rank deficient case? Here is a list of desiderata.

i. The extension must not be marred by too much algebra of specific versions of generalized inverse matrices.
ii. A passage to the case of full column rank coefficient matrices must be visibly simple.
iii. The power of the F-test depends on the matrix A_K = K(K'A⁻K)⁻K', as outlined in Section 3.7, and any generalization must be in line with this.
iv. The estimation problem outlined in Section 3.5 leads to dispersion matrices of the form K'A⁻K even if the coefficient matrix K fails to have full column rank. The transition to generalized information matrices ought to be antitonic relative to the Loewner ordering, so that minimization of dispersion matrices still corresponds to maximization of generalized information matrices.
v. The functional properties from the full rank case should continue to hold true.

We contend that these points are met best by the following definition.

DEFINITION. For a design with moment matrix M, the generalized information matrix for K'θ, with a k x s coefficient matrix K that may be rank deficient, is defined to be the k x k matrix M_K where the mapping A ↦ A_K from the cone NND(k) to Sym(k) is given by

The major disadvantage of the definition is that the number of rows and columns of generalized information matrices A_K no longer exhibits the reduced dimensionality of the parameter subsystem of interest. Both A_K and A are k x k matrices, and the notation is meant to indicate this. Rank, not number of rows and columns, measures the extent of identifiability of the subsystem K'θ. Otherwise the definition of a generalized information matrix stands up against the desiderata listed above.

i. The Gauss-Markov Theorem 1.19 provides the representation

where the residual projector R = I_k − KG is formed with an arbitrary generalized inverse G of K. This preserves the full liberty of choosing generalized inverses of K, and of RAR'.
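
A minimal numerical sketch of this representation follows. It is an illustration only: NumPy is assumed, the rank deficient coefficient matrix and the matrix A are invented, and np.linalg.pinv supplies one admissible choice of generalized inverse.

```python
import numpy as np

rng = np.random.default_rng(2)

def generalized_information(A, K):
    """Generalized information matrix A_K = A - A R'(R A R')^- R A,
    with R = I - K G for some generalized inverse G of K."""
    k = A.shape[0]
    G = np.linalg.pinv(K)             # one admissible generalized inverse of K
    R = np.eye(k) - K @ G
    S = R @ A @ R.T
    return A - A @ R.T @ np.linalg.pinv(S) @ R @ A

# A rank deficient coefficient matrix (rank 2 < 3 columns) and a random
# positive definite (hence feasible) A, both invented for illustration.
K = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
X = rng.standard_normal((4, 4))
A = X @ X.T

AK = generalized_information(A, K)
# Rank, not size, reflects the identifiable part of K'theta:
print(np.linalg.matrix_rank(AK), np.linalg.matrix_rank(K))          # 2 2
# Cross-check with A_K = K(K'A^-K)^-K', valid since A here is feasible:
print(np.allclose(AK, K @ np.linalg.pinv(K.T @ np.linalg.inv(A) @ K) @ K.T))
```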


ii. Let us assume that the coefficient matrix K has full column rank s. A passage from A_K to C_K(A) is easily achieved by premultiplication of A_K by L and postmultiplication by L', where L is any left inverse of K,

Conversely we obtain A_K from C_K(A) by premultiplication by K and postmultiplication by K'. Indeed, inserting KL = I_k − R, we find

using RAR'(RAR')⁻RA = RA as in the existence part of the proof of the Gauss-Markov Theorem 1.19.

iii. The testing problem is covered since we have A_K = K(K'A⁻K)⁻K' if A lies in the feasibility cone A(K), by the Gauss-Markov Theorem 1.20.

iv. The estimation problem calls for a comparison of the Loewner orderings among generalized information matrices A_K, and among dispersion matrices K'A⁻K. We establish the desired result in two steps. The first compares the generalized inverses of an arbitrary moment matrix A and its generalized information matrix A_K.

3.22. GENERALIZED INVERSES OF GENERALIZED INFORMATION MATRICES

Lemma. Let A be a nonnegative definite k x k matrix and let A_K be its generalized information matrix for K'θ. Then every generalized inverse of A also is a generalized inverse of A_K.

Proof. Let Q be a left identity of K that is minimizing for A, A_K = QAQ'. We know from the Gauss-Markov Theorem 1.19 that this equality holds if and only if QK = K and QAR' = 0, with R = I_k − KL for some generalized inverse L of K. The latter property means that QA = QAL'K'. Postmultiplication by Q' yields QAQ' = QAL'K'Q' = QAL'K' = QA. We get


the last equation following from symmetry. Now we take a generalized inverse G of A. Then we obtain

and this exhibits G as a generalized inverse of A_K.

In the second step, we verify that the passage between dispersion matrices and generalized information matrices is order reversing, closely following the derivation for positive definite matrices from Section 1.11. There we started from

for positive definite matrices A ≥ B > 0. For nonnegative definite matrices C ≥ D ≥ 0, the basic relation is

which holds if D⁻ is a generalized inverse of D that is nonnegative definite. For instance we may choose D⁻ = GDG' with G ∈ D⁻.

3.23. EQUIVALENCE OF INFORMATION ORDERING AND DISPERSION ORDERING

Lemma. Let A and B be two matrices in the feasibility cone A(K). Then we have

Proof. For the direct part, we apply relation (1) of the preceding section to C = A_K and D = B_K, and insert for D⁻ a nonnegative definite generalized inverse B⁻ of B,

Upon premultiplication by K'A⁻ and postmultiplication by its transpose, all terms simplify using A_KA⁻K = K, because of Lemma 3.22, and since the proof of Theorem 3.15 shows that A_K and K have the same range and thus Lemma 1.17 applies. This leads to 0 ≤ K'B⁻K − K'A⁻K. The converse part of the proof is similar, in that relation (1) of the preceding section is used with C = K'B⁻K and D = K'A⁻K. Again invoking Lemma 1.17 to obtain DD⁻K' = K', we get 0 ≤ CD⁻C − C. Now using CC⁻K' = K', premultiplication by KC⁻ and postmultiplication by its transpose yield 0 ≤ KD⁻K' − KC⁻K'. But we have KD⁻K' = A_K and KC⁻K' = B_K, by Theorem 1.20.
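
The equivalence asserted by the lemma can be spot checked numerically. The sketch below is an illustration only; NumPy is assumed, the dimensions and the random generation of feasible (positive definite) matrices are invented, and K has full column rank so that the simple formulas apply.

```python
import numpy as np

rng = np.random.default_rng(3)

def is_nnd(S, tol=1e-9):
    """Nonnegative definiteness via the smallest eigenvalue of the symmetric part."""
    return np.linalg.eigvalsh((S + S.T) / 2).min() >= -tol

k, s = 4, 2
K = rng.standard_normal((k, s))

def random_pd(k):
    X = rng.standard_normal((k, k))
    return X @ X.T + 0.1 * np.eye(k)

def info(A):
    """Generalized information matrix A_K for positive definite A."""
    return K @ np.linalg.inv(K.T @ np.linalg.inv(A) @ K) @ K.T

def dispersion(A):
    return K.T @ np.linalg.inv(A) @ K

# A comparable pair: A = B + KK' dominates B, so both orderings hold ...
B = random_pd(k)
A = B + K @ K.T
print(is_nnd(info(A) - info(B)), is_nnd(dispersion(B) - dispersion(A)))   # True True

# ... and for arbitrary feasible pairs the two orderings always agree.
for _ in range(5):
    A, B = random_pd(k), random_pd(k)
    print(is_nnd(info(A) - info(B)) == is_nnd(dispersion(B) - dispersion(A)))
```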


The lemma verifies desideratum (iv) of Section 3.21 that, on the feasibility cone A(K), the transition between dispersion matrices K'A⁻K and generalized information matrices A_K is antitonic, relative to the Loewner orderings on the spaces Sym(s) and Sym(k). It remains to resolve desideratum (v) of Section 3.21, that the functional properties of Theorem 3.13 continue to hold for generalized information matrices. Before we state these properties more formally, we summarize the four formulas for the generalized information matrix A_K that parallel those of Section 3.12 for C_K(A):

Formula (1) holds on the feasibility cone A(K), by the Gauss-Markov Theorem 1.20. Formula (2) is of Schur complement type, utilizing the residual projector R = I_k − KG, G ∈ K⁻. Formula (3) involves a left identity Q of K that is minimizing for A, as provided by the Gauss-Markov Theorem 1.19. Formula (4) has been adopted as the definition, and leads to functional properties analogous to those for the mapping C_K, as follows.

3.24. PROPERTIES OF GENERALIZED INFORMATION MATRICES

Theorem. Let K be a nonzero k x s coefficient matrix that may be rank deficient, with associated generalized information matrix mapping

Then the mapping A ↦ A_K is positively homogeneous, matrix superadditive, and nonnegative, as well as matrix concave and matrix isotonic. Moreover it enjoys any one of the three equivalent upper semicontinuity properties (a), (b), (c) of Theorem 3.13. Given a matrix A ∈ NND(k), its generalized information matrix A_K is uniquely characterized by the three properties

It satisfies the range formula range A_K = (range A) ∩ (range K), as well as the iteration formula A_{KH} = (A_K)_{KH}, with any other s x r coefficient matrix H.


Proof. The functional properties follow from the definition as a minimum of the linear functions A ↦ QAQ', as in the full rank case of Theorem 3.13. The three characterizing properties are from Lemma 3.14, and the proof remains unaltered except for choosing for K generalized inverses G rather than left inverses L. The range formula is established in the proof of Theorem 3.15. It remains to verify the iteration formula. Two of the defining relations are

There exist matrices Q and R attaining the respective minima. Obviously RQ satisfies RQKH = RKH = KH, and therefore the definition of A_{KH} yields the first inequality in the chain

The second inequality follows from monotonicity of the generalized information matrix mapping for (KH)'θ, applied to A_K ≤ A.

This concludes the discussion of the desiderata (i)-(v), as stipulated in Section 3.21. We will return to the topic of rank deficient parameter systems in Section 8.18.

3.25. CONTRAST INFORMATION MATRICES IN TWO-WAY CLASSIFICATION MODELS

A rank deficient parameter subsystem occurs with the centered contrasts K_a α of factor A in a two-way classification model, as mentioned in Section 3.20. By Section 1.5 the full parameter system is θ = (α', β')', and includes the effects β of factor B. Hence the present subsystem of interest can be written as


In Section 1.27 we have seen that an a x b block design W has moment matrix

where Δ_r and Δ_s are the diagonal matrices formed from the treatment replication vector r = W1_b and blocksize vector s = W'1_a. We claim that the generalized information matrix for the centered treatment contrasts K_a α is

Except for vanishing subblocks the matrix M_K coincides with the Schur complement of Δ_s in M,

The matrix (2) is called the contrast information matrix; it is often referred to as the C-matrix of the simple block design W. Therefore the definition of generalized information matrices for rank deficient subsystems is in accordance with the classical notion of C-matrices.

Contrast information matrices enjoy the important property of having row sums and column sums equal to zero. This is easy to verify directly, since postmultiplication of Δ_r − WΔ_s⁻W' by the unity vector 1_a produces the null vector. We can deduce this result also from Theorem 3.24. Namely, the matrix M_K has its range included in that of K, whence the range of Δ_r − WΔ_s⁻W' is included in the range of K_a. Thus the nullspace of the matrix Δ_r − WΔ_s⁻W' includes the nullspace of K_a, forcing (Δ_r − WΔ_s⁻W')1_a to be the null vector.

To prove the claim (1), we choose for K the generalized inverse K⁻ = (K_a, 0), with residual projector

Let 1̄_a = (1/a)1_a denote the a x 1 vector with all components equal to 1/a. In other words, this is the row sum vector that corresponds to the uniform distribution on the a levels of factor A. From J_a = 1_a1̄_a' and Δ_r1_a = r, as well as W'1_a = s, we obtain


Using

and

we further get

Hence for RMR we can choose the generalized inverse

Multiplication now yields

In the last line we have used Δ_sΔ_s⁻W' = W'. This follows from Lemma 1.17 since the submatrix Δ_s lies in the feasibility cone A(W), by Lemma 3.12. (Alternatively we may look at the meaning of the terms involved. If the blocksize s_j is zero, then s_js_j⁻ = 0 and the j th column of W vanishes. If s_j is positive, then s_js_j⁻ = s_js_j⁻¹ = 1.) In summary, we obtain

and formula (1) is verified. We return to this model in Section 4.3. The Loewner ordering fails to be a total ordering. Hence a maximization or a minimization problem relative to it may, or may not, have a solution. The Gauss-Markov Theorem 1.19 is a remarkable instance where minimization relative to the Loewner ordering can be carried out. The design problem embraces too general a situation; maximization of information matrices in the Loewner ordering is generally infeasible. However, on occasion Loewner optimality can be achieved and this is explored next.
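
The C-matrix is easily computed from a weight matrix W. The sketch below is an illustration only; NumPy is assumed and the 3 x 2 weight matrix is invented. It also checks the zero row and column sums and the product-design identity used later in Section 4.8.

```python
import numpy as np

def contrast_information(W):
    """C-matrix of a block design: Delta_r - W Delta_s^- W', with
    r = W 1_b the treatment replications and s = W' 1_a the block sizes."""
    r = W.sum(axis=1)
    s = W.sum(axis=0)
    s_inv = np.divide(1.0, s, out=np.zeros_like(s), where=s > 0)   # g-inverse of Delta_s
    return np.diag(r) - W @ np.diag(s_inv) @ W.T

# A 3 x 2 block design (weights sum to one), invented for illustration.
W = np.array([[0.2, 0.1],
              [0.1, 0.2],
              [0.3, 0.1]])

C = contrast_information(W)
print(np.round(C, 4))
print(np.allclose(C.sum(axis=0), 0), np.allclose(C.sum(axis=1), 0))   # zero row/column sums

# A product design W = r s' has C-matrix Delta_r - r r':
r, s = np.array([0.5, 0.3, 0.2]), np.array([0.6, 0.4])
print(np.allclose(contrast_information(np.outer(r, s)), np.diag(r) - np.outer(r, r)))
```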


EXERCISES

3.1 For the compact, convex set

and the matrix

verify

Is C_K(M) compact, convex?

3.2 Let the matrix X = (X_1, X_2) and the vector θ = (θ_1', θ_2')' be partitioned so that Xθ = X_1θ_1 + X_2θ_2. Show that the information matrix for θ_2 is X_{2·1}'X_{2·1}, where X_{2·1} = (I − X_1(X_1'X_1)⁻X_1')X_2.

3.3 In the general linear model E[Y] = Kθ and D[Y] = A, find the joint moments of Y and the zero mean statistic Z = RY, where R = I − KL for some generalized inverse L of K. What are the properties of the covariance adjusted estimator Y − AR'(RAR')⁻Z of E[Y]? [Rao (1967), p. 359].

3.4 In the normal linear model Y ~ N(Xθ, σ²I_n), show that a 1 − α confidence ellipsoid for γ = K'θ is given by

where F_{s,n−k}(1 − α) is the 1 − α quantile of the F-distribution with numerator degrees of freedom s and denominator degrees of freedom n − k.

3.5 Let K ∈ R^{k x s} be of full column rank s. For all A ∈ NND(k) and z ∈ R^s, show that (i) K'A⁻K ∈ (C_K(A))⁻, (ii) z ∈ range C_K(A) if and only if Kz ∈ range A [Heiligers (1991), p. 26].


3.6 Show that K(K'K)⁻²K' ∈ (KK')⁻ and K'(KK')⁻K = I_s, for all K ∈ R^{k x s} with rank K = s.

3.7 Show that (K'A⁻K)⁻¹ ≤ (K'K)⁻¹K'AK(K'K)⁻¹ for all A ∈ A(K), with equality if and only if range AK ⊆ range K.

3.8 In Section 3.16, show that the path

is tangent to NND(2) at δ = 0.

3.9 Let K ∈ R^{k x s} be of arbitrary rank. For all A, B ∈ NND(k), show that A_K ≥ B_K if and only if range A_K ⊇ range B_K and c'A⁻c ≤ c'B⁻c for all c ∈ range B_K.

3.10 For all A ∈ NND(k), show that (i) (A_K)_K = A_K, (ii) if range K = range H then A_K = A_H, (iii) if range A ⊆ range K then A_K = A.

3.11 Show that (xx')_K equals xx' or 0 according as x ∈ range K or not.

3.12 Show that A_c equals c(c'A⁻c)⁻¹c' or 0 according as c ∈ range A or not.

3.13 In the two-way classification model of Section 3.25, show that L_M = (I_a, −WΔ_s⁻) is a left identity of K = (K_a, 0)' that is minimizing for

3.14 (continued) Show that the Moore-Penrose inverse of K'M⁻K is Δ_r − WΔ_s⁻W'.

3.15 (continued) Show that Δ_r⁻ + Δ_r⁻W D⁻ W'Δ_r⁻ ∈ (Δ_r − WΔ_s⁻W')⁻, where D = Δ_s − W'Δ_r⁻W is the information matrix for the centered block contrasts [John (1987), p. 39].

CHAPTER 4

Loewner Optimality

Optimality of information matrices relative to the Loewner ordering is discussed. This concept coincides with dispersion optimality, and simultaneous optimality for all scalar subsystems. An equivalence theorem is presented. A nonexistence theorem shows that the set of competing moment matrices must not be too large for Loewner optimal designs to exist. As an illustration, product designs for a two-way classification model are shown to be Loewner optimal if the treatment replication vector is fixed. The proof of the equivalence theorem for Loewner optimality requires a refinement of the equivalence theorem for scalar optimality. This refinement is presented in the final sections of the chapter.

4.1. SETS OF COMPETING MOMENT MATRICES

The general problem we consider is to find an optimal design in a specified subclass of the set of all designs. The optimality criteria to be discussed in this book depend on the design only through the moment matrix M(ξ). In terms of moment matrices, such a subclass of designs leads to a subset of the set M(Ξ) of all moment matrices. Therefore we study a subset of moment matrices,

which we call the set of competing moment matrices. Throughout this book we make the grand assumption that there exists at least one competing moment matrix that is feasible for the parameter subsystem K'θ of interest,

Otherwise there is no design under which K'θ is identifiable, and our optimization problem would be statistically meaningless.


The subset M of competing moment matrices is often convex. Simple consequences of convexity are the following.

4.2. MOMENT MATRICES WITH MAXIMUM RANGE AND RANK

Lemma. Let the set M of competing moment matrices be convex. Then there exists a matrix M ∈ M which has a maximum range and rank, that is, for all A ∈ M we have

Moreover the information matrix mapping C_K permits regularization along straight lines in M,

Proof. We can choose a matrix M ∈ M with rank as large as possible, rank M = max{rank A : A ∈ M}. This matrix also has a range as large as possible. To show this, let A ∈ M be any competing moment matrix. Then the set M contains B = ½A + ½M, by convexity. From Lemma 2.3, we obtain range A ⊆ range B and range M ⊆ range B. The latter inclusion holds with equality, since the rank of M is assumed maximal. Thus range A ⊆ range B = range M, and the first assertion is proved. Moreover the set M contains the path (1 − α)A + αB running within M from A to B. Positive homogeneity of the information matrix mapping C_K, established in Theorem 3.13, permits us to extract the factor 1 − α. This gives

This expression has limit C_K(A) as α tends to zero, by matrix upper semicontinuity.

Often there exist positive definite competing moment matrices, and the maximum rank in M then simply equals k.

4.3. MAXIMUM RANGE IN TWO-WAY CLASSIFICATION MODELS

The maximum rank of moment matrices in a two-way classification model is now shown to be a + b − 1. As in Section 1.27, we identify designs ξ ∈ Ξ with


their a x b weight matrices W ∈ T. An a x b block design W has a degenerate moment matrix,

Hence the rank is at most equal to a + b − 1. This value is actually attained for the uniform design, assigning uniform mass 1/(ab) to every point (i, j) in the experimental domain T = {1, ..., a} x {1, ..., b}. The uniform design has weight matrix 1̄_a1̄_b'. We show that

occurs only if x = δ1_a and y = −δ1_b, for some δ ∈ R. But

entails x = −y.1_a, while

gives y = −x.1_b. Premultiplication of x = −y.1_a by 1̄_a' yields x. = −y. = δ, say. Therefore the uniform design has a moment matrix of maximum rank a + b − 1.

An important class of designs for the two-way classification model is the product designs. By definition, their weights are the products of row sums r_i and column sums s_j, w_ij = r_i s_j. In matrix terms, this means that the weight matrix W is of the form

with row sum vector r and column sum vector s. In measure theory terms, the joint distribution W is the product of the marginal distributions r and s. An exact design for sample size n determines frequencies n_ij that sum to n. With marginal frequencies n_i. = Σ_{j≤b} n_ij and n_.j = Σ_{i≤a} n_ij, the product property turns into n_ij = n_i.n_.j/n. For this reason a product design is sometimes called a proportional frequency design. We call a vector r positive and write r > 0 when all its entries are positive. A product design with positive row and column sum vectors again has


maximal rank,

The proof of rank maximality for the uniform design carries over, since

implies x = −(s'y)1_a and y = −(r'x)1_b with r'x = −s'y. The present model also serves as an example that subsets M of the full set M(Ξ) of all moment matrices are indeed of interest. For instance, fixing the row sum vector r means fixing the replication numbers r_i for levels i = 1, ..., a of factor A. The weight matrices that achieve the given row sum vector r lead to the subsets T(r) = {W ∈ T : W1_b = r} and M(r) = M(T(r)). The subset T(r) of block designs with given treatment replication vector r is convex, as is the subset M(r) of corresponding moment matrices. The set M(r) is also compact, being a closed subset of the compact set M(Ξ), see Lemma 1.26. If the row sum vector r is positive, then the maximum rank in M(r) is a + b − 1, and is attained by product designs rs' with s > 0. We return to this model in Section 4.8.

4.4. LOEWNER OPTIMALITY

It is time now to be more specific about the meanings of optimality. The most gratifying criterion refers to the Loewner comparison of information matrices. It goes hand in hand with estimation problems, testing hypotheses, and general parametric model-building, as outlined in Section 3.4 to Section 3.10.

DEFINITION. A moment matrix M ∈ M is called Loewner optimal for K'θ in M when the information matrix for K'θ satisfies

provided the k x s coefficient matrix K has full column rank s. If K is rank deficient, we take recourse to the generalized information matrices for K'θ and demand

The optimality properties of designs are determined by their moment matrices M(ξ). Given a subclass of designs, a design ξ in this subclass is called Loewner optimal for K'θ in the subclass when its moment matrix M(ξ) is Loewner optimal for K'θ in the corresponding set of moment matrices.
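
In computations, the Loewner comparison underlying this definition amounts to checking that a difference of matrices has no negative eigenvalues. A minimal helper, a sketch only (NumPy assumed; the tolerance is an arbitrary choice, not from the text):

```python
import numpy as np

def loewner_geq(C, D, tol=1e-10):
    """Return True if C >= D in the Loewner ordering, i.e. C - D is
    nonnegative definite (smallest eigenvalue of the symmetric part >= -tol)."""
    S = C - D
    return np.linalg.eigvalsh((S + S.T) / 2).min() >= -tol

# The ordering is only partial: for these two matrices neither dominates the other.
C = np.diag([2.0, 1.0])
D = np.diag([1.0, 2.0])
print(loewner_geq(C, D), loewner_geq(D, C))   # False False
```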


The following theorem summarizes the various aspects of Loewner optimality, emphasizing in turn (a) information matrices, (b) dispersion matrices, and (c) simultaneous scalar optimality. We do not require the coefficient matrix K to be of full column rank s. From (b) every Loewner optimal moment matrix M is feasible, that is, M ∈ A(K). Part (c) comes closest to the idea of uniform optimality, a notion that some authors prefer.

4.5. DISPERSION OPTIMALITY AND SIMULTANEOUS SCALAR OPTIMALITY

Theorem. Let the set M of competing moment matrices be convex. Then for every moment matrix M ∈ M, the following statements are equivalent:

a. (Information optimality) M is Loewner optimal for K'θ in M.
b. (Dispersion optimality) M ∈ A(K) and K'M⁻K ≤ K'A⁻K for all A ∈ M ∩ A(K).

c. (Simultaneous scalar optimality) M is optimal for c'θ in M, for all vectors c in the range of K.

Proof. The proof that (a) implies (c) is a case of iterated information since for some vector z ∈ R^s, the coefficient vector c satisfies c = Kz. Monotonicity yields M_c = (M_K)_c ≥ (A_K)_c = A_c, for all A ∈ M, by Theorem 3.24. Hence M is optimal for c'θ in M.

Next we establish that (c) implies (b). Firstly, being optimal for c'θ, the matrix M must lie in the feasibility cone A(c). Since this is true for all vectors c in the range of K, the matrix M actually lies in the cone A(K). Secondly the scalar inequalities z'K'M⁻Kz ≤ z'K'A⁻Kz, with arbitrary vector z ∈ R^s, evidently prove (b).

Finally we show that (b) implies (a). Let B ∈ M ∩ A(K) be a feasible competing moment matrix that satisfies K'M⁻K ≤ K'B⁻K. Lemma 3.23 gives M_K ≥ B_K. Hence M is Loewner optimal in M ∩ A(K). This extends to all matrices A ∈ M, feasible or not, by regularization. For α ∈ (0; 1), feasibility of B entails feasibility of (1 − α)A + αB, whence M_K ≥ ((1 − α)A + αB)_K. A passage to the limit as α tends to zero yields M_K ≥ A_K. Thus M is Loewner optimal in M.

Part (c) enables us to deduce an equivalence theorem for Loewner optimality that is similar in nature to the Equivalence Theorem 2.16 for scalar optimality. There we concentrated on the set M(Ξ) of all moment matrices. We now use (and prove from Section 4.9 on) that Theorem 2.16 remains true with the set M(Ξ) replaced by a set M of competing matrices that is compact and convex.


4.6. GENERAL EQUIVALENCE THEOREM FOR LOEWNER OPTIMALITY

Theorem. Assume the set M of competing moment matrices is compact and convex. Let the competing moment matrix M ∈ M have maximum rank. Then M is Loewner optimal for K'θ in M if and only if

Proof. As a preamble, we verify that the product AGK is invariant to the choice of generalized inverse G ∈ M⁻. In view of the maximum range assumption, the range of M includes the range of K, as well as the range of every other competing moment matrix A. Thus from K = MM⁻K and A = MM⁻A, we get AGK = AM⁻MGMM⁻K = AM⁻K for all G ∈ M⁻. It follows that K'M⁻AM⁻K is well defined, as is K'M⁻K.

The converse provides a sufficient condition for optimality. We present an argument that does not involve the rank of K, utilizing the generalized information matrices A_K from Section 3.21. Let G be a generalized inverse of M and introduce Q = M_KG' = K(K'M⁻K)⁻K'G'. From Lemma 1.17, we obtain QK = K. This yields

The second inequality uses the inequality from the theorem. The direct part, necessity of the condition, invokes Theorem 2.16 with the subset M in place of the full set M(Ξ), postponing a rigorous proof of this result to Theorem 4.13. Given any vector c = Kz in the range of K, the matrix M is optimal for c'θ in M, by part (c) of Theorem 4.5. Because of Theorem 2.16, there exists a generalized inverse G ∈ M⁻, possibly dependent on the vector c and hence on z, such that

As verified in the preamble, the product AGK is invariant to the choice of the generalized inverse G. Hence the left hand side becomes z'K'M⁻AM⁻Kz. Since z is arbitrary the desired matrix inequality follows.

The theorem allows for arbitrary subsets of all moment matrices, M ⊆ M(Ξ), as long as they are compact and convex. We repeat that the necessity


part of the proof invokes Theorem 2.16 which covers the case M = M(Ξ) only. The present theorem, with M = M(Ξ), is vacuous.

4.7. NONEXISTENCE OF LOEWNER OPTIMAL DESIGNS

Corollary. No moment matrix is Loewner optimal for K'θ in M(Ξ) unless the coefficient matrix K has rank 1.

Proof. Assume that ξ is a design that is Loewner optimal for K'θ in Ξ, with moment matrix M and with support points x_1, ..., x_t. If t = 1, then ξ is a one-point design on x_1 and range K ⊆ range x_1x_1' forces K to have rank 1. Otherwise t ≥ 2. Applying Theorem 4.6 to A = x_ix_i', we find

Here equality must hold since a single strict inequality leads to the contradiction

because of t ≥ 2. Now Lemma 1.17 yields the assertion by way of comparing ranks, rank K = rank K'M⁻K = rank K'M⁻x_ix_i'M⁻K = 1.

The destructive nature of this corollary is deceiving. Firstly, and above all, an equivalence theorem gives necessary and sufficient conditions for a design or a moment matrix to be optimal, and this is genuinely distinct from an existence statement. Indeed, the statements "If a design is optimal then it must look like this" and "If it looks like this then it must be optimal" in no way assert that an optimal design exists. If existence fails to hold, then the statements are vacuous, but logically true. Secondly, the nonexistence result is based on the Equivalence Theorem 4.6. Equivalence theorems provide an indispensable tool to study the existence problem. The present corollary ought to be taken as a manifestation of the constructive contribution that an equivalence theorem adds to the theory. In Chapter 8, we deduce from the General Equivalence Theorem insights about the number of support points of an optimal design, their location, and their weights. Thirdly, the corollary stresses the role of the subset M of competing moment matrices, pointing out that the full set M(Ξ) is too large to permit Loewner optimal moment matrices. The opposite extreme occurs if the subset consists of a single moment matrix, M_0 = {M_0}. Of course, M_0 is Loewner


optimal for K'θ in M_0. The relevance of Loewner optimality lies somewhere between these two extremes. Subsets M of competing moment matrices tend to be of interest for the reason that they show more structure than the full set M(Ξ). The special structure of M often permits a direct derivation of Loewner optimality, circumventing the Equivalence Theorem 4.6.

4.8. LOEWNER OPTIMALITY IN TWO-WAY CLASSIFICATION MODELS

This section continues the discussion of Section 4.3 for the two-way classification model. We are interested in the centered contrasts of factor A,

By Section 3.25, the contrast information matrix of a block design W is Δ_r − WΔ_s⁻W'. If a moment matrix is feasible for the centered contrasts K_a α, then W must have a positive row sum vector. For if r_i vanishes, then the i th row and column of the contrast information matrix Δ_r − WΔ_s⁻W' are zero. Then its nullity is larger than 1, and its range cannot include the range of K_a.

Any product design has weight matrix W = rs', and fulfills WΔ_s⁻W' = rs'Δ_s⁻sr' = rr'. This is so since Δ_s⁻s is a vector with the j th entry equal to 1 or 0 according as s_j is positive or vanishes, entailing s'Δ_s⁻s = Σ_{j: s_j > 0} s_j = 1. Therefore all product designs with row sum vector r share the same contrast information matrix Δ_r − rr'. They are feasible for the treatment contrasts if and only if r is positive.

Let r be a fixed row sum vector that is positive. We claim the following.

Claim. The product designs rs' with arbitrary column sum vector s are the only Loewner optimal designs for the centered contrasts of factor A in the set T(r) of block designs with row sum vector equal to r, with contrast information matrix Δ_r − rr'.

Proof. There is no loss in generality assuming the column sum vector s to be positive. This secures a maximum rank for the moment matrix M of the product design rs'. A generalized inverse G for M and the matrix GK are given by


Now we take a competing moment matrix A, with top left block A_11. In Theorem 4.6, the left hand side of the inequality turns into

The right hand side equals K'M⁻K = K_aΔ_r⁻¹K_a. Hence the two sides coincide if A_11 = Δ_r, that is, if A lies in the subset M(T(r)). Thus Theorem 4.6 proves that the product design rs' is Loewner optimal, Δ_r − rr' ≥ Δ_r − WΔ_s⁻W' for all W ∈ T(r). But then every weight matrix W ∈ T(r) that is optimal must satisfy WΔ_s⁻W' = rr', forcing W to have rank 1 and hence to be of the form W = rs'. Thus our claim is verified.

Brief contemplation opens a more direct route to this result. For an arbitrary block design W ∈ T(r) with column sum vector s, we have WΔ_s⁻s = r as well as s'Δ_s⁻s = 1. Therefore, we obtain the inequality

Equality holds if and only if W = rs'. This verifies the assertion without any reference to Theorem 4.6.

We emphasize that, in contrast to the majority of the block design literature, we here study designs for infinite sample size. For a given sample size n, designs such as W = rs' need not be realizable, in that the numbers nr_is_j may well fail to be integers. Yet, as pointed out in Section 1.24, the general theory leads to general principles which are instructive. For example, if we can choose a design for sample size n that is a proportional frequency design, then this product structure guarantees a desirable optimality property.

It is noteworthy that arbitrary column sum vectors s are admitted for the product designs rs'. Specifically all observations can be taken in the first block, s = (1, 0, ..., 0)', whence the corresponding moment matrix has rank a rather than maximum rank a + b − 1. Therefore Loewner optimality may hold true even if the maximum rank hypothesis of Theorem 4.6 fails to hold.

The class of designs with moment matrices that are feasible for the centered contrasts decomposes into the cross sections T(r) of designs with positive row sum vector r. Within one cross section, the information matrix Δ_r − rr' is Loewner optimal. Between all cross sections, Loewner optimality is ruled out by Corollary 4.7. We return to this model in Section 8.19.
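
The claim admits a quick numerical spot check, not a proof: within the cross section T(r), every block design's C-matrix is dominated by Δ_r − rr'. NumPy is assumed; the dimensions and the random mechanism for generating competing designs with prescribed row sums are invented for illustration, and the two helpers restate earlier sketches so that the snippet is self-contained.

```python
import numpy as np

rng = np.random.default_rng(4)

def contrast_information(W):
    r = W.sum(axis=1)
    s = W.sum(axis=0)
    s_inv = np.divide(1.0, s, out=np.zeros_like(s), where=s > 0)
    return np.diag(r) - W @ np.diag(s_inv) @ W.T

def loewner_geq(C, D, tol=1e-10):
    S = (C - D + (C - D).T) / 2
    return np.linalg.eigvalsh(S).min() >= -tol

r = np.array([0.5, 0.3, 0.2])                 # fixed treatment replications
C_opt = np.diag(r) - np.outer(r, r)           # C-matrix of any product design r s'

for _ in range(5):
    # a random block design with the prescribed row sums r
    rows = rng.dirichlet(np.ones(4), size=3)  # each row sums to one
    W = rows * r[:, None]                     # now W 1_b = r
    print(loewner_geq(C_opt, contrast_information(W)))   # True every time
```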


4.9. THE PENUMBRA OF THE SET OF COMPETING MOMENT MATRICES

Before pushing on into the more general theory, we pause and generalize the Equivalence Theorem 2.16 for scalar optimality, as is needed by the General Equivalence Theorem 4.6 for Loewner optimality. This makes for a self-contained exposition of the present chapter. But our main motivation is to acquire a feeling of where the general development is headed. Indeed, the present Theorem 4.13 is but a special case of the General Equivalence Theorem 7.14.

As in Section 2.13, we base our argument on the supporting hyperplane theorem. However, we leave the vector space R^k which includes the regression range X, and move to the matrix space Sym(k) which includes the set of competing moment matrices M. The reason is that, unlike the full set M(Ξ), a general convex subset M cannot be generated as the convex hull of a set of rank 1 matrices. Even though our reference space changes from R^k to Sym(k), our argument is geometric. Whereas before, points in the reference space were column vectors, they now turn into matrices. And whereas before, we referred to the Euclidean scalar product of vectors, we now utilize the Euclidean matrix scalar product ⟨A, B⟩ = trace AB on Sym(k). A novel feature which we exploit instantly is the Loewner ordering A ≤ B that is available in the matrix space Sym(k).

In attacking the present problem along the same lines as the Elfving Theorem 2.14, we need a convex set in the space Sym(k) which takes the place of the Elfving set R of Section 2.9. This set of matrices is given by

We call P the penumbra of M since it may be interpreted as a union of shadow lines for a light source located in M + B, where B ∈ NND(k). The point M ∈ M then casts the shadow half line {M − δB : δ ≥ 0}, generating the shadow cone M − NND(k) as B varies over NND(k). This is a translation of the cone −NND(k) so that its tip comes to lie in M. The union over M ∈ M of these shadow cones is the penumbra P (see Exhibit 4.1). There is an important alternative way of expressing that a matrix A ∈ Sym(k) belongs to the penumbra P,

This is to say that P collects all matrices A that in the Loewner ordering lie below some moment matrix M ∈ M. An immediate consequence is the following.


EXHIBIT 4.1 Penumbra. The penumbra P originates from the set M ⊆ NND(k) and recedes in all directions of −NND(k). By definition of ρ²(c), the rank 1 matrix cc'/ρ²(c) is a boundary point of P, with supporting hyperplane determined by N ≥ 0. The picture shows the equivalent geometry in the plane R².

4.10. GEOMETRY OF THE PENUMBRA

Lemma. Let the set M of competing moment matrices be compact and convex. Then the penumbra P is a closed convex set in the space Sym(k).

Proof. Given two matrices A = M − B and Ã = M̃ − B̃ with M, M̃ ∈ M and B, B̃ ≥ 0, convexity follows with α ∈ (0; 1) from

In order to show closedness, suppose (A_m)_{m≥1} is a sequence of matrices in P converging to A in Sym(k). For appropriate matrices M_m ∈ M we have A_m ≤ M_m. Because of compactness of the set M, the sequence (M_m)_{m≥1} has a cluster point M ∈ M, say. Thus A ≤ M and A lies in P. The proof is complete.

Let c ∈ R^k be a nonvanishing coefficient vector of a scalar subsystem c'θ. Recall the grand assumption of Section 4.1 that there exists at least one competing moment matrix that is feasible, M ∩ A(c) ≠ ∅. The generalization of the design problem of Section 2.7 is


In other words, we wish to determine a competing moment matrix M ∈ M that is feasible for c'θ, and that leads to a variance c'M⁻c of the estimate for c'θ which is an optimum compared to the variance under all other feasible competing moment matrices. In the Elfving Theorem 2.14, we identified the optimal variance as the square of the Elfving norm, (ρ(c))². Now the corresponding quantity turns out to be the nonnegative number

For δ > 0, we have (δM) − NND(k) = δP. Hence a positive value ρ²(c) > 0 provides the scale factor needed to blow up or shrink the penumbra P so that the rank one matrix cc' comes to lie on its boundary. The scale factor ρ²(c) has the following properties.
4.11. EXISTENCE THEOREM FOR SCALAR OPTIMALITY

Theorem. Let the set M of competing moment matrices be compact and convex. There exists a competing moment matrix that is feasible for c'θ, M ∈ M ∩ A(c), such that

Every such matrix is optimal for c'θ in M, and the optimal variance is ρ²(c).

Proof. For δ > 0, we have cc' ∈ (δM) − NND(k) if and only if cc' ≤ δM for some matrix M ∈ M. Hence there exists a sequence of scalars δ_m > 0 and a sequence of moment matrices M_m ∈ M such that cc' ≤ δ_mM_m and ρ²(c) = lim_{m→∞} δ_m. Because of compactness, the sequence (M_m)_{m≥1} has a cluster point M ∈ M, say. Hence we have cc' ≤ ρ²(c)M. This forces ρ²(c) to be positive. Otherwise cc' ≤ 0 implies c = 0, contrary to the assumption c ≠ 0 from Section 2.7. With ρ²(c) > 0, Lemma 2.3 yields c ∈ range cc' ⊆ range M. Hence M is feasible, M ∈ A(c). The inequality c'M⁻c ≤ ρ²(c) follows from c'M⁻(cc')M⁻c ≤ ρ²(c)c'M⁻MM⁻c = ρ²(c)c'M⁻c. The converse inequality, ρ²(c) = inf{δ > 0 : cc' ∈ δP} ≤ c'M⁻c, is a consequence of Theorem 1.20, since c(c'M⁻c)⁻¹c' ≤ M means cc' ≤ (c'M⁻c)M ∈ (c'M⁻c)P. Finally we take any other competing moment matrix that is feasible, A ∈ M ∩ A(c). It satisfies cc' ≤ (c'A⁻c)A ∈ (c'A⁻c)P, and therefore ρ²(c) ≤ c'A⁻c. Hence ρ²(c) = c'M⁻c is the optimal variance, and M is optimal for c'θ in M.

Geometrically, the theorem tells us that the penumbra P includes the segment {αcc' : α ≤ 1/ρ²(c)} of the line {αcc' : α ∈ R}. Moreover


the matrix cc'/ρ²(c) lies on the boundary of the penumbra P. Whereas in Section 2.13 we used a hyperplane in R^k supporting the Elfving set R at its boundary point c/ρ(c), we now take a hyperplane in Sym(k) supporting the penumbra P at cc'/ρ²(c). Since the set P has been defined so as to recede in all directions of −NND(k),

the supporting hyperplane permits a matrix N normal to it that is nonnegative definite.

4.12. SUPPORTING HYPERPLANES TO THE PENUMBRA

Lemma. Let the set M of competing moment matrices be compact and convex. There exists a nonnegative definite k x k matrix N such that

Proof. In the first part of the proof, we make the preliminary assumption that the set M of competing moment matrices intersects the open cone PD(k). Our arguments parallel those of Section 2.13. Namely, the matrix cc'/ρ²(c) lies on the boundary of the penumbra P. Thus there exists a supporting hyperplane to the set P at the point cc'/ρ²(c), that is, there exist a nonvanishing matrix N ∈ Sym(k) and a real number γ such that

The penumbra P includes −NND(k), the negative of the cone NND(k). Hence if A is a nonnegative definite matrix, then −δA lies in P for all δ ≥ 0, giving

This entails trace AN ≥ 0 for all A ≥ 0. By Lemma 1.8, the matrix N is nonnegative definite. Owing to the preliminary assumption, there exists a positive definite moment matrix B ∈ M. With 0 ≠ N ≥ 0, we get 0 < trace BN ≤ γ. Dividing by γ > 0 and replacing N by N/γ ≠ 0, we obtain a matrix N with the desired properties.

In the second part of the proof, we treat the general case that the set M intersects just the feasibility cone A(c), and not necessarily the open cone PD(k). By Lemma 4.2, we can find a moment matrix M ∈ M with maximum range, range A ⊆ range M for all A ∈ M. Let r be the rank of M and choose


an orthonormal basis u_1, ..., u_r ∈ R^k for its range. Then the k x r matrix U = (u_1, ..., u_r) satisfies range M = range U and U'U = I_r. Thus the matrix UU' projects onto the range of M. From Lemma 1.17, we get UU'c = c and UU'A = A for all A ∈ M. Now we return to the discussion of the scale factor ρ²(c). The preceding properties imply that for all δ > 0 and all matrices M ∈ M, we have

Therefore the scale factor ρ²(c) permits the alternative representation

The set U'MU of reduced moment matrices contains the positive definite matrix U'MU. With c̃ = U'c, the first part of the proof supplies a nonnegative definite r x r matrix Ñ such that trace ÃÑ ≤ 1 = c̃'Ñc̃/ρ²(c) for all Ã ∈ U'MU. Since Ã is of the form U'AU, we have trace ÃÑ = trace AUÑU'. On the other side, we get c̃'Ñc̃ = c'UÑU'c. Therefore the k x k matrix N = UÑU' satisfies the assertion, and the proof is complete.

The supporting hyperplane inequality in the lemma is phrased with reference to the set M, although the proof extends its validity to the penumbra P. But the extended validity is beside the point. The point is that P is instrumental to secure nonnegative definiteness of the matrix N. This done, we dismiss the penumbra P. Nonnegative definiteness of N becomes essential in the proof of the following general result.
4.13. GENERAL EQUIVALENCE THEOREM FOR SCALAR OPTIMALITY

Theorem. Assume the set M of competing moment matrices is compact and convex, and intersects the feasibility cone A(c). Then a competing moment matrix M ∈ M is optimal for c'θ in M if and only if M lies in the feasibility cone A(c) and there exists a generalized inverse G of M such that

Proof. The converse is established as in the proof of Theorem 2.16. For the direct part we choose a nonvanishing matrix N ≥ 0 from Lemma 4.12 and define the vector h = Nc/ρ(c), with ρ(c) = √(ρ²(c)). From Lemma 4.12,


we have ρ²(c) = c'Nc, and c'h = ρ(c). We claim that the vector h satisfies the three conditions

By definition of h, we get h'Ah = c'NANc/ρ²(c) = trace ANc(c'Nc)⁻¹c'N. For nonnegative definite matrices V, Theorem 1.20 provides the general inequality X(X'V⁻X)⁻X' ≤ V, provided range X ⊆ range V. Since N is nonnegative definite, the inequality applies with V = N and X = Nc, giving Nc(c'NN⁻Nc)⁻c'N ≤ N. This and the supporting hyperplane inequality yield

Thus condition (0) is established. For the optimal moment matrix M, the variance c'M⁻c coincides with the optimal value ρ²(c) from Theorem 4.11. Together with ρ²(c) = c'Nc, we obtain

The converse inequality, h'Mh ≤ 1, appears in condition (0). This proves condition (1). Now we choose a square root decomposition M = KK', and introduce the vector a = K'h − K'M⁻c(c'M⁻c)⁻¹c'h (compare the proof of Theorem 2.11). The squared norm of a vanishes,

by condition (1), and because of c'M⁻c = ρ²(c) = (c'h)². This establishes condition (2). The remainder of the proof, construction of a suitable generalized inverse G of M, is duplicated from the proof of Theorem 2.16.

The theorem may also be derived as a corollary to the General Equivalence Theorem 7.14. Nevertheless, the present derivation points to a further way of proceeding. Sets M of competing moment matrices that are genuine subsets of the set M(Ξ) of all moment matrices force us to properly deal with matrices. They preclude shortcut arguments based on regression vectors that provide such an elegant approach to the Elfving Theorem 2.14.


Hence our development revolves around moment matrices and information matrices, and matrix problems in matrix space. It is then by means of information functions, to be introduced in the next chapter, that the matrices of interest are mapped into the real line.

EXERCISES

4.1 What is wrong with the following argument:

4.2 In the two-way classification model, use the iteration formula to calculate the generalized information matrix for K_a α from the generalized information matrix

[Krafft (1990), p. 461].

4.3 (continued) Is there a Loewner optimal design for K_a α in the class {W ∈ T : W'1_a = s} of designs with given blocksize vector s?

4.4 (continued) Consider the design

for sample size n = a + b − 1. Show that the information matrix of W/n for the centered treatment contrasts is K_a/n. Find the φ-efficiency relative to an equireplicated product design 1̄_a s'.

CHAPTER 5

Real Optimality Criteria

On the closed cone of nonnegative definite matrices, real optimality criteria are introduced as functions with such properties as are appropriate to measure largeness of information matrices. It is argued that they are positively homogeneous, superadditive, nonnegative, nonconstant, and upper semicontinuous. Such criteria are called information functions. Information functions conform with the Loewner ordering, in that they are matrix isotonic and matrix concave. The concept of polar information functions is discussed in detail, providing the basis for the subsequent duality investigations. Finally, a result is proved listing three sufficient conditions for the general design problem so that all optimal moment matrices lie in the feasibility cone for the parameter system of interest.

5.1. POSITIVE HOMOGENEITY

An optimality criterion is a function φ from the closed cone of nonnegative definite s x s matrices into the real line,

with properties that capture the idea of whether an information matrix is large or small. A transformation from a high-dimensional matrix cone to the one-dimensional real line can retain only partial aspects and the question is, which. No criterion suits in all respects and pleases every mind.

An essential aspect of an optimality criterion is the ordering that it induces among information matrices. Relative to the criterion φ an information matrix C is at least as good as another information matrix D when φ(C) ≥ φ(D). With our understanding of information matrices it is essential that a reasonable criterion be isotonic relative to the Loewner ordering,


compare Chapter 4. We use the same sign ≥ to indicate the Loewner ordering among symmetric matrices and the usual ordering of the real line. A second property, similarly compelling, is concavity,

In other words, information cannot be increased by interpolation, otherwise the situation φ((1 − α)C + αD) < (1 − α)φ(C) + αφ(D) will occur. Rather than carrying out the experiment belonging to (1 − α)C + αD, we achieve more information through interpolation of the two experiments associated with C and D. This is absurd. A third property is positive homogeneity,

A criterion φ that is positively homogeneous satisfies φ((n/σ²)C) = (n/σ²)φ(C). Indeed, Section 3.5 has shown that the true information matrix is not C_K(M), but (n/σ²)C_K(M). It is directly proportional to the number of observations n, and inversely proportional to the model variance σ². If the optimality criterion φ is positively homogeneous, then we can omit the common factor n/σ² and concentrate on the matrix C = C_K(M).

For a closer study, it is preferable to divide the properties of concavity and monotonicity into more primitive ones. A real-valued function φ on the closed cone NND(s) is called superadditive when

The following lemma tells us that, in the presence of positive homogeneity, superadditivity is just another view of concavity.

5.2. SUPERADDITIVITY AND CONCAVITY

Lemma. For every positively homogeneous function φ : NND(s) → R, the following two statements are equivalent:

a. (Superadditivity) φ is superadditive.
b. (Concavity) φ is concave.

Proof. In both directions, we make use of positive homogeneity. Assuming (a), we get


Thus superadditivity implies concavity. Conversely, (b) entails superadditivity,

An analysis of the proof shows that strict superadditivity is the same as strict concavity, in the following sense. The strict versions of these properties cannot hold if D is positively proportional to C, denoted by D ∝ C. Because then we have D = δC with δ > 0, and positive homogeneity yields

Furthermore, we apply the strict versions only in cases where at least one term, C or D, is positive definite. Hence we call a function φ strictly superadditive on PD(s) when

A function <f) is said to be strictly concave on PD(s) when

5.3. STRICT SUPERADDITrVITY AND STRICT CONCAVITY Corollary. For every positively homogeneous function <f> : NND(s) IR, the following two statements are equivalent: a. (Strict superadditivity) </> is strictly superadditive on PD(.s). b. (Strict concavity) <f> is strictly concave on PD(s). Proof. The proof is a refinement of the proof of Lemma 5.2.

Next we show that, given homogeneity and concavity, monotonicity reduces to nonnegativity. A function <f> on the closed cone NND(s) is said to be nonnegative when


it is called positive on PD(s) when

    φ(C) > 0   for all C ∈ PD(s).

Notice that a positively homogeneous function φ vanishes for the null matrix, φ(0) = 0, because of φ(0) = φ(2·0) = 2φ(0). In particular, φ is constant only if it vanishes identically.

5.4. NONNEGATIVITY AND MONOTONICITY

Lemma. For every positively homogeneous and superadditive function φ : NND(s) → ℝ, the following three statements are equivalent:
a. (Nonnegativity) φ is nonnegative.
b. (Monotonicity) φ is isotonic.
c. (Positivity on PD(s)) Either φ is nonnegative and φ is positive on PD(s), or else φ is identically zero.

Proof. First we show that (a) implies (b). If φ is nonnegative, then we apply superadditivity to obtain φ(C) = φ(C − D + D) ≥ φ(C − D) + φ(D) ≥ φ(D), for all C ≥ D ≥ 0. Next we show that (b) implies (c). Here C ≥ 0 forces φ(C) ≥ φ(0) = 0, so that φ is nonnegative. Because of homogeneity, φ is constant only if it vanishes identically. Otherwise there exists a matrix D ≥ 0 with φ(D) > 0. We need to show that φ(C) is positive for all matrices C ∈ PD(s). Since, by Lemma 1.9, the open cone PD(s) is the interior of NND(s), there exists some ε > 0 such that C − εD ∈ NND(s), that is, C ≥ εD. Monotonicity and homogeneity entail φ(C) ≥ φ(εD) = εφ(D) > 0. This proves (c). The properties called for by (c) obviously encompass (a).

Of course, we do not want to deal with the constant function φ = 0. All other functions φ that are positively homogeneous, superadditive, and nonnegative are then positive on the open cone PD(s), by part (c). Often, although not always, we take φ to be standardized, φ(I_s) = 1. Every homogeneous function φ can be standardized according to (1/φ(I_s))φ, without changing the preordering that φ induces among information matrices. Every homogeneous function φ on NND(s) vanishes at the null matrix. When it is positive otherwise, we say φ is positive,

    φ(C) > 0   for all nonzero C ∈ NND(s).

Such functions φ characterize the null matrix, in the sense that φ(C) = 0 holds if and only if C = 0. A function φ on NND(s) is called strictly isotonic


when

    C ≥ D ≥ 0 and C ≠ D  ⟹  φ(C) > φ(D).

When this condition holds just for positive definite matrices, we say that φ is strictly isotonic on PD(s).

5.5. POSITIVITY AND STRICT MONOTONICITY

Corollary. For every positively homogeneous and superadditive function <f> : NND(s) -> R, the following two statements are equivalent: a. (Positivity) <f> is positive. b. (Strict monotonicity) <j> is strictly isotonic. Moreover, if (f> is strictly superadditive on PD(s), then <f> is strictly isotonic on PD(s). Proof. That (a) implies (b) follows as in the proof of Lemma 5.4. Conversely, (b) with D = 0 covers (a) as a special case. The statement on the strict versions is a direct consequence from the definitions. We illustrate these properties using the trace as criterion function, 4>(C) = trace C. This function is linear even on the full space Sym(.s), hence it is positively homogeneous and superadditive. Restricted to NND(s), it is also strictly isotonic and positive. It fails to be strictly superadditive. Standardization requires a transition to trace C/s. Being linear, this criterion is also continuous on Sym (.$) 5.6. REAL UPPER SEMICONTINUITY In general, our optimality criteria <p are defined on the closed cone NND(s) rather than on the linear space Sym(s). Therefore we replace continuity by semicontinuity. A real-valued function <f> on NND(s) is called upper semicontinuous when the upper level sets {(f> > a} {C > 0 : $(C) > a} are closed, for all a G R. This definition conforms with the one for the matrix-valued information matrix mapping CK, in part (a) of Theorem 3.13. There we met an alternative sequential criterion (b) which here takes the form liiriw--^ <f>(Cm) = 4>(C), for all sequences (Cm)m>\ in NND(s) that converge to a limit C and that satisfy <(>(Cm) > </>(C) for all m > 1. This secures a "regular" behavior at the boundary, in the same sense as in part (c) of Theorem 3.13.


5.7. SEMICONTINUITY AND REGULARIZATION Lemma. For every isotonic function $ : NND(s) > R, the following three statements are equivalent: a. (Upper semicontinuity) The level sets

are closed, for all a e R. b. (Sequential semicontinuity criterion) For all sequences (Cm}m>\ in NND(^) that converge to a limit C we have

c. (Regularization) For all C,D e NND(s), we have

Proof. This is a special case of the proof of Theorem 3.13, with s = 1 and C_K = φ.

5.8. INFORMATION FUNCTIONS

Criteria that enjoy all the properties discussed so far are called information functions.

DEFINITION. An information function φ on NND(s) is a function φ : NND(s) → ℝ that is positively homogeneous, superadditive, nonnegative, nonconstant, and upper semicontinuous.

The most prominent information functions are the matrix means φ_p, for p ∈ [−∞; 1], to be discussed in detail in Chapter 6. They comprise the classical D-, A-, E-, and T-criteria as special cases. The list of defining properties of information functions can be rearranged, in view of the preceding lemmas and in view of the general discussion in Section 5.1, by requiring that an information function be isotonic, concave, and positively homogeneous, as well as enjoying the trivial property of being nonconstant and the more technical property of being upper semicontinuous. It is our experience that the properties as listed in the definition are more convenient to work with. Information functions enjoy many pleasant properties to which we will turn in the sequel. We characterize an information function by its unit level


set, thus visualizing it geometrically (Section 5.10). We establish that the set of all information functions is closed under appropriate functional operations, providing some kind of reassurance that we have picked a reasonable class of criteria (Section 5.11). We introduce polar information functions (Section 5.12) which provide the basis for the duality discussion of the optimal design problem. We study the composition of an information function with the information matrix mapping (Section 5.14), serving as the objective function for the optimal design problem (Section 5.15). 5.9. UNIT LEVEL SETS There exists a bewildering multitude of information functions. We depict this multitude more visibly by associating with each information function <f> its unit level set

The following Theorem 5.10 singles out the characteristic properties of such sets. In general, we say that a closed convex subset 𝒞 ⊆ NND(s) is bounded away from the origin when it does not contain the null matrix, and that it recedes in all directions of NND(s) when

    C ∈ 𝒞 and A ∈ NND(s)  ⟹  C + A ∈ 𝒞.

The latter property means that C + NND(s) ⊆ 𝒞 for all C ∈ 𝒞; that is, if the cone NND(s) is translated so that its tip comes to lie in 𝒞, then all of the translate is included in the set 𝒞. Exhibit 5.1 shows some unit level sets 𝒞 = {φ ≥ 1} in ℝ²₊. Given a unit level set 𝒞 ⊆ NND(s), we reconstruct the corresponding information function as follows. The reconstruction formula for positive definite matrices is

    φ(C) = sup{δ > 0 : C ∈ δ𝒞}.    (1)

Thus φ(C) is the scale factor that pulls the set 𝒞 towards the null matrix, or pushes it away to infinity, so that C comes to lie on its boundary. However, for rank deficient matrices C, it may happen that none of the sets δ𝒞 with δ > 0 contains the matrix C, whence the supremum in (1) is over the empty set and falls down to −∞. We avoid this pitfall by the general definition

    φ(C) = sup{δ ≥ 0 : C ∈ (δ𝒞) + NND(s)}.    (2)

For δ > 0, nothing has changed as compared to (1), since then (δ𝒞) + NND(s) = δ(𝒞 + (1/δ)NND(s)) = δ𝒞.


EXHIBIT 5.1 Unit level sets. Unit contour lines of the vector means Φ_p of Section 6.6 in ℝ²₊, for p = −∞, −1, 0, 1/2, 1. The corresponding unit level sets recede in all directions of ℝ²₊, as indicated by the dashed lines.

But for δ = 0, condition (2) turns into C ∈ (0·𝒞) + NND(s) = NND(s), and holds for every matrix C ≥ 0. Hence the supremum in (2) is always nonnegative. The proof of the correspondence between information functions and unit level sets is tedious though not difficult. We ran into similar considerations while discussing the Elfving norm ρ(c) in Section 2.12, and the scale factor ρ²(c) in Section 4.10. The passage from functions to sets and back to functions must appear at some point in the development, either implicitly or explicitly. We prefer to make it explicit, now.
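As a concrete illustration of formula (2) (a sketch of my own, not from the text), take the smallest-eigenvalue criterion: its unit level set is 𝒞 = {C ∈ NND(s) : C ≥ I_s}, membership of C in (δ𝒞) + NND(s) reduces to C − δI_s ∈ NND(s), and bisection over δ recovers φ(C) = λ_min(C).

    import numpy as np

    def reconstruct(C, tol=1e-10):
        """sup{delta >= 0 : C - delta*I is nonnegative definite}, by bisection."""
        s = C.shape[0]
        lo, hi = 0.0, float(np.trace(C)) + 1.0        # safe upper cap for this criterion
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if np.linalg.eigvalsh(C - mid * np.eye(s)).min() >= 0.0:
                lo = mid
            else:
                hi = mid
        return lo

    C = np.array([[2.0, 0.3], [0.3, 1.0]])
    print(reconstruct(C))                     # approx 0.9169
    print(np.linalg.eigvalsh(C).min())        # the same value, lambda_min(C)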


5.10. FUNCTION-SET CORRESPONDENCE Theorem. The relations

define a one-to-one correspondence between the information functions <f> on NND(s) and the nonempty closed convex subsets C of NND(s) that are bounded away from the origin and recede in all directions of NND(s). Proof. We denote by 4> the set of all information functions, and by F the collection of subsets C of NND(s) which are nonempty, closed, convex, bounded away from the origin, and recede in all directions of NND(j). The defining properties of information functions </> e 3> parallel the properties enjoyed by the sets C e F. An information function </> e 4> is
o finite i positively homogeneous ii superadditive iii nonnegative A unit level set C e F is 0 I II III IV V

iv v

nonconstant upper semicontinuous

a subset of NND(s) bounded away from the origin convex receding in all directions of NND(s) nonempty closed

In the first part of the proof, we assume that an information function <f> is given and take the set C to be defined through the first equation in the theorem. We establish for C the properties 0-V, and verify the second equality in the theorem. This part demonstrates that the mapping $ i - > { < > l } o n < f > has its range included in F, and is injective. 0. Evidently C is a subset of NND(s). 1. Because of homogeneity 0(0) is zero, and the null matrix is not a member of C. II. Concavity of < implies for all C, D 6 C and a e (0; 1). Hence (1 - a)C + aD is a member of C, and the set C is convex. III. Fix a matrix C e C. Then superadditivity and nonnegativity of < entail for all D > 0. Thus C + NND(s) is included in C, or in other words, C recedes in all directions of NND(s). IV. Because of (f>(Is) > 0, we have V. Closedness of C is an immediate consequence of the upper semicontinuity of <f>.


In order to verify the second equality in the theorem we fix C NND(s) and set a = sup{8 > 0 : C (8C) + NND(^)}. It is not hard to show that <I>(C) = 0 if and only if a = 0. Otherwise <f>(C) is positive, as is ex. From

we learn that <f>(C) < a. Conversely we know for S < a that (1/S)C e C. But we have just seen that C is closed. Letting 5 tend to a, we get (l/a)C G C, and 0((l/a)C) > 1. This yields <j>(C) > a. In summary, e(C) = a and the first part of the proof is complete. In the second part of the proof we assume that a set C with the properties 0-V is given, and take the function < to be defined through the second formula in the theorem. We show that $ satisfies properties o-v, and verify the first equality in the theorem. This part demonstrates that the mapping <f> - {$ > 1} from 4> to F is surjective. 0. We need to show that < is finite. Otherwise we have <f>(C) oo for some matrix C. This entails C e 8C for all 8 > 0. Closedness of C yields 0 = limg^oo C/8 G C, contrary to the assumption that C is bounded away from the origin. 1. Next comes positive homogeneity. For C > 0 and 8 > 0, we obtain

iii. Since 8 = 0 is in the set over which the defining supremum is formed, we get 4>(C) >0. ii. In order to establish superadditivity of $, we distinguish three cases. In case <f>(C) = 0 = <(/)), nonnegativity (iii) yields <f>(D). In case <j>(C) > 0 = <HD), we choose any 5 > 0 such that We obtain

and <f>(C + D) > 8. A passage to the supremum gives In case </>(C) > 0 and <(/)) > 0, we choose any y, 8 > 0 such that C 6 yC and D 8C. Then convexity of C yields

Thus we have

and therefore


iv. Since C is nonempty there exists a matrix C C. For any such matrix we have <f>(C] > 1. By Lemma 5.4, < is nonconstant. v. In case a < 0, the level set is closed. Thus upper semicontinuity of <f> follows from closedness of C via equality of the sets

To prove the direct inclusion, we take any matrix C e {< > a} and choose numbers Sm > 0 converging to <f>(C) such that C e (8mC) + NND(s) = 8mC for all m > 1. The matrices Dm = C/8m are members of C. This sequence is bounded, because of \\Dm\\ \\C\\/8m > \\C\\/<f>(C). Along a convergent subsequence, we obtain by closedness. This yields

whence entails <;

Converselv. if C G aC then the definition of <6 immediatelv for all This proves

In particular, a = 1 yields the first formula in the theorem, Altogether the mapping <j> H- {< > 1} from < onto F is bijective, with inverse mapping given by the second formula in the theorem. The cone NND(^) includes plenty of nonempty closed convex subsets that are bounded away from the origin and recede in all directions of NND(s), and so there are plenty of information functions. The correspondence between information functions and their unit level sets matches the well-known correspondence between norms and their unit balls. However, the geometric orientation is inverted. Unit balls of norms include the origin and exclude infinity, while unit level sets of information functions exclude the origin and include infinity, in a somewhat loose terminology. The difference in orientation also becomes manifest when we introduce polar information functions in Section 5.12. First we overview some functional operations for constructing new information functions from old ones.

5.11. FUNCTIONAL OPERATIONS New information functions can be constructed from given ones by elementary operations. We show that the class of all information functions is closed under formation of nonnegative combinations and least upper bounds of fi-


nite families of information functions, and under pointwise infima of arbitrary families. Given a finite family of information functions, fa,..., <f>m, every nonnegative combination produces a new information function,

if at least one of the coefficients 5, > 0 is positive. In particular, sums and averages of finitely many information functions are information functions. The pointwise minimum is also an information function,

Moreover, the pointwise infimum of a family </>/ with / ranging over an arbitrary index set J is an information function,

unless it degenerates to the constant zero. Upper semicontinuity follows since the level sets (inf/ e j <fo > a} = Cli&iifa > } are intersections of closed sets, and hence closed. This applies to the least upper bound of a finite family

where <i> denotes the class of all information functions. The set over which the infimum is sought is nonempty, containing for instance the sum ^,-<i fall is also bounded from below, for instance by fa. Therefore the infimum cannot degenerate, and lub,<m <fc is an information function. These structural properties of information functions suggest that our definition of information functions is not only statistically reasonable, but also mathematically sound. The second half of Chapter 11, from Section 11.10 onwards, makes extensive use of compositions of information functions. However, the functional operation of greatest import is to come next: polarity. 5.12. POLAR INFORMATION FUNCTIONS AND POLAR NORMS Polarity is a special case of a duality relationship, and as such based on the scalar product of the underlying linear space, (C,D) = trace CD for all C,> G Sym(.s). The polar function <f> of a given information function </> is


best thought of as the largest function satisfying the (concave version of the generalized) Hölder inequality,

    ⟨C, D⟩ ≥ φ(C) φ^∞(D)   for all C, D ∈ NND(s).

For the definition, it suffices that φ is defined and positive on the open cone PD(s).

DEFINITION. For a function φ : PD(s) → (0; ∞), the polar function φ^∞ : NND(s) → [0; ∞) is defined by

    φ^∞(D) = inf_{C ∈ PD(s)} ⟨C, D⟩ / φ(C)   for all D ∈ NND(s).

That the function φ^∞ so defined satisfies the Hölder inequality is evident from the definition for C > 0. For singular C ≥ 0, it follows through regularization provided φ is isotonic, by Lemma 5.7. In contrast, a real-valued function φ on the space Sym(s) is called a norm when φ is

    absolutely homogeneous:  φ(δC) = |δ| φ(C)   for all δ ∈ ℝ and C ∈ Sym(s),
    subadditive:             φ(C + D) ≤ φ(C) + φ(D)   for all C, D ∈ Sym(s), and
    positive:                φ(C) > 0   for all nonzero C ∈ Sym(s).

A norm φ has polar function φ^0 defined by

    φ^0(D) = sup_{C ∈ Sym(s), C ≠ 0} ⟨C, D⟩ / φ(C)   for all D ∈ Sym(s).

This leads to the (convex version of the generalized) Hölder inequality,

    ⟨C, D⟩ ≤ φ(C) φ^0(D)   for all C, D ∈ Sym(s).

Alternatively, the polar function φ^0 is the smallest function to satisfy this inequality. The principal distinction between a norm and an information function is, of course, that the first is convex and the second is concave. Another difference, more subtle but of no less importance, is that norms are defined everywhere in the underlying space, while information functions have a proper subset for their domain of definition. One consequence concerns continuity. A norm is always continuous, being a convex function on a linear space which is everywhere finite. An information function is required, by definition, to be upper semicontinuous.
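As a rough numerical illustration (my own sketch, with an arbitrarily chosen matrix D), the concave polar φ^∞(D) = inf over C > 0 of ⟨C, D⟩/φ(C) can be approximated by sampling positive definite matrices C. For φ = φ_0 = det(·)^(1/s) the infimum equals s·(det D)^(1/s), a fact established in Section 6.12, and it is attained at C proportional to D^(-1).

    import numpy as np

    rng = np.random.default_rng(0)
    s = 3
    D = np.diag([1.0, 2.0, 4.0])

    def phi_det(C):
        return np.linalg.det(C) ** (1.0 / s)

    best = np.inf
    for _ in range(5000):
        A = rng.normal(size=(s, s))
        C = A @ A.T + 1e-6 * np.eye(s)          # random positive definite matrix
        best = min(best, np.trace(C @ D) / phi_det(C))

    print(best)                                  # sample infimum, slightly above 6
    print(s * np.linalg.det(D) ** (1.0 / s))     # exact value of the polar: 6.0

    C_star = np.linalg.inv(D)                    # the minimizing direction C ~ inv(D)
    print(np.trace(C_star @ D) / phi_det(C_star))   # attains the infimum exactly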


Semicontinuity is not an automatic consequence of the other four properties that constitute an information function. On the other hand, an information function is necessarily isotonic, by Lemma 5.4, while a norm is not. Another feature emerges from the function-set correspondence of Section 5.10. An information function </> is characterized by its unit level set {< > 1}. A norm < corresponds to its unit ball {< < 1}. The distinct orientation towards infinity or towards the origin determines which version of the Holder inequality is appropriate, as well as the choice of an infimum or a supremum in the definitions of the polars. In order that our notation indicates this orientation, we have chosen to denote polars of concave and convex functions by a superscript oo or 0, respectively. Some matrix norms < have the property that

In this case, the similarity between the polars of information functions and the polars of norms becomes even more apparent. The next theorem transfers the well-known polarity relationship from norms to information functions. Polars of information functions are themselves information functions. And polarity (nomen est omen} is an idempotent operation. The second polar recovers the original information function. 5.13. POLARITY THEOREM Theorem. For every function < : NND(s) R that is positively homogeneous, positive on PD(s), and upper semicontinuous, the polar function <f> is an information function on NND(s). For every information function 4> on NND(s) the polar function of </> recovers </>,

Proof. The definition of the polar function is < mfoo'/'c* where the functions ^c(I>) - (C,D)/<f>(C) are information functions on NND(^). Hence <t> is positively homogeneous, superadditive, nonnegative, and upper semicontinuous, as mentioned in Section 5.11. It remains to show that < is nonconstant. We exploit positive homogeneity to obtain the estimate

where the set T> = {C > 0 : trace C 1} is compact. Owing to upper semicontinuity, the supremum of </> over T> is attained and finite. Therefore the value <f>(Is) is positive, whence the function 0 is nonconstant.


In the second part of the proof, we take <f> to be an information function on NND(s), and derive the relation f>. The argument is based on and the unit level sets associated with

It suffices to show that hen the function-set correspondence of Theorem 5.10 tells us that < and ( coincide. We claim that the two sets C and C satisfy the relation

For the direct inclusion, take a matrix D C, that is, D > 0 and 4>(D) > 1. For C <E C, the Holder inequality then yields (C, D) > <t>(C)<t>(D) > 1. The converse inclusion holds since every matrix D > 0 satisfying (C, D) > I for all C e C fulfills <f>(D) - inf c>0 {C/<HC),>) > inf c > 0:<MC )>i{C,D) > 1. Thus (1) is established. Applying formula ((1) to and we get

Formulae (1) and (2) entail the direct inclusion The converse inclusion is proved by showing that the complement of C is included in the complement of C0000, based on a separating hyperplane argument. We choose a matrix E e NND(s) \ C. Then there exists a matrix D e Sym(s) such that the linear form {-,>) strongly separates the matrix E and the set C. That is, for some y e R, we have

Since the set C recedes in all directions of NND(.s), the inequality shows in particular that for all matrices C e C and A > 0 we have {, D) < (C+8A, D) for all 8 > 0. This forces (A,D) > 0 for all A > 0, whence Lemma 1.8 necessitates D > 0. By the same token, 0 < (E,D) < y. Upon setting D = D/v the. sfrnna sp.naration of F. and C takes the form

Now (1) gives D C, whence (2) yields and the prooi is complete.

Therefore


Thus a function $ on NND(s) that is positively homogeneous, positive on PD(s), and upper semicontinous is an information function if and only if it coincides with its second polar. This suggests a method of checking whether a given candidate function <f> is an information function, namely, by verifying <f> = 00000. The method is called quasi-linearization, since it amounts to representing <b as an infimum over the family of linear functions

This looks like being a rather roundabout way to identify an information function. However, for our purposes we need to find the polar function anyway. Hence, the quasi-linearization method of computing polar functions and to obtain, as a side product, the information function properties is as efficient as can be. Another instance of quasi-linearization is the definition of information matrices in Section 3.2, CK(A) = minLKix*. LK=is LAL1. The functional properties in Theorem 3.13 follow immediately from the definition, underlining the power of the quasi-linearization method. The theory centers around moment matrices A e NND(/c). Let K'6 be a parameter system of interest with a coefficient matrix K that has full column rank s. We wish to study A H-> < o CK(A), the composition of the information matrix mapping CK : NND(/c) > NND(s) of Section 3.13 with an information function < on NND(s) of Section 5.8. This composition turns out to be an information function on the cone NND(fc) of k x k matrices. 5.14. COMPOSITIONS WITH THE INFORMATION MATRIX MAPPING Theorem. Let the k x s coefficient matrix K have full column rank s. For every information function $ on NND(s), the composition with the information matrix mapping, <j> o CK, is an information function on NND(fc). Its polar is given by

Proof. In the first part of the proof, we verify the properties of an information function as enumerated in the proof of Theorem 5.10. 0. Clearly <f> o CK is finite. 1. Since CK and <f> are positively homogeneous, so is the composition <f> o CK. ii. Superadditivity of CK and $, and monotonicity of <f> imply superadditivity of <f> o CK-


iii. Nonnegativity of CK and of <f> entail nonnegativity of $ o CK. iv. Positive definiteness of CK(Ik), from Theorem 3.15, and positivity of <f> on PD(s), from Lemma 5.4, necessitate < o #(4) > 0. Hence <f> o CK is nonconstant. v. In order to establish upper semicontinuity of <f> o CK we use regularization,

for all
From Theorem 3.13. we know that the matrices satisfy and converge Monotonicity of <f> yields Therefore part (b) of Lemma 5.7 ascertains the convergence of The second part of the proof relies on a useful representation of the polar information function <f> that is based on the unit level set of <f>,

To see this, we write the definition i Making use of regularization, we obtain the inequality chain

Hence equality holds throughout this chain, thus proving (1). The unit level set of the composition < o CK is

Indeed, for such that For all

with Conversely, if for. then monotomcitv and

we get there exists a matrix imolies we get

with


To see this, we estimate KCK' (see Section 3.21). Monotonicitv leads to the lower bound (A.B) Now (KCK'B}. The lower bound is attained at KCK' proves (3). Finally we apply (1) to <f> o CK and then to <f> to obtain from (2) and (3), for all B > 0,

A first use of the polar function of the composition <f> o CK is made in Lemma 5.16 to discuss whether optimal moment matrices are necessarily feasible. First we define the design problem in its full generality. 5.15. THE GENERAL DESIGN PROBLEM Let K'0 be a parameter subsystem with a coefficient matrix K of full column rank s. We recall the grand assumption of Section 4.1 that M is a set of competing moment matrices that intersects the feasibility cone A(K). Given an information function <f> on NND(s) the general design problem then reads

This calls for maximizing information as measured by the information function $, in the set M of competing moment matrices. The optimal value of this problem is, by definition,

A moment matrix M e M is said to be formally <j>-optimal for K'B in M when </> (CK(M)) attains the optimal value v (<). If, in addition, the matrix M lies in the feasibility cone A(K), then M is called <f>-optimal for K'6 in M. The optimality properties of designs are determined by their moment matrices M(). Given a subclass H, a design e H is called (f>-optimalfor K'6 in H when its moment matrix M() is </>-optimal for K'6 in M(E).


However, an optimal design is not an end in itself, but an aid to identifying efficient practical designs. The appropriate notion of efficiency is the following.

DEFINITION. The φ-efficiency of a design ξ ∈ Ξ is defined by

    eff_φ(ξ) = φ(C_K(M(ξ))) / v(φ).

It is a number between 0 and 1, and gives the extent (often quoted in percent) to which the design exhausts the maximum information v(φ) for K'θ in M. A formally optimal moment matrix M that fails to be feasible for K'θ is statistically useless, even though it solves a well-defined mathematical optimization problem. However, pathological instances do occur wherein formally optimal moment matrices are not feasible! An example is given in Section 6.5. The appropriate tool to check feasibility of M is given in Section 3.15: the information matrix C_K(M) must have rank s. The following lemma singles out three conditions under which every formally optimal matrix is feasible. When an information function is zero for all singular nonnegative definite matrices, we briefly say that it vanishes for singular matrices.
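For concreteness, here is a toy computation of the φ-efficiency just defined (my own illustration): a straight-line fit E(Y) = θ_1 + θ_2·t on [−1, 1], the full parameter vector of interest (K = I_2), and the determinant criterion φ_0. Taking the design with weight 1/2 on each of ±1 as the φ_0-optimal reference is an assumption of this illustration; it agrees with the classical result for this model.

    import numpy as np

    def moment_matrix(points, weights):
        # M(xi) = sum of w * x x' for regression vectors x = (1, t)'
        M = np.zeros((2, 2))
        for t, w in zip(points, weights):
            x = np.array([1.0, t])
            M += w * np.outer(x, x)
        return M

    def phi0(C):
        return np.linalg.det(C) ** (1.0 / C.shape[0])

    M_opt = moment_matrix([-1.0, 1.0], [0.5, 0.5])     # reference design (assumed optimal)
    M_xi  = moment_matrix([0.0, 1.0], [0.5, 0.5])      # candidate design
    print(phi0(M_xi) / phi0(M_opt))                    # phi-efficiency = 0.5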
5.16. FEASIBILITY OF FORMALLY OPTIMAL MOMENT MATRICES

Lemma. If the set M of competing moment matrices is compact, then there exists a moment matrix M G M that is formally <-optimal for K' 0 in M, and the optimal value v(<f>) is positive. In order that every formally <-optimal moment matrix for K'6 in M lies in the feasibility cone A(K), and thus is <-optimal for K'Q in M, any one of the following conditions is sufficient: a. (Condition on M) The set M is included in the feasibility cone A(K). b. (Condition on </>) The information function <f> vanishes for singular matrices. c. (Condition on<j>) The polar information function < vanishes for singular matrices and is strictly isotonic on PD(s), and for every formally optimal moment matrix M e M there exists a matrix D e NND(*) that solves the polarity equation


Proof. By Theorem 5.14, the composition <f> o CK is upper semicontinuous, and thus attains its supremum over the compact set M. Hence a formally optimal matrix for the design problem exists. Because of our grand assumption in Section 4.1, the intersection MnA(K) contains at least one matrix B, say. Its information matrix C^(B) is positive definite and has a positive information value < (CK(B)}, by Theorem 3.15 and Lemma 5.4. Therefore the optimal value is positive, v(<f>) > <fr(CK(B)) > 0. Under condition (a), all competing moment matrices are members of A(K), including those that are formally optimal. Under condition (b), the criterion </> vanishes for singular information matrices CK(A). As v(</>) > 0, any formally optimal moment matrix M has a nonsingular information matrix CK(M). Then M must lie in A(K), by Theorem 3.15. Finally we turn to condition (c). Let z e Rs be a vector such that z'CK(M)z = 0. We show that z vanishes. Since <f> is zero for singular matrices, <(>(CK(M)) <f>(D) = 1 forces D to be positive definite. If z ^ 0, we obtain the contradiction

as follows from the Holder inequality, the property z'CK(M)z = 0, the polarity equation, and strict monotonicity of <f> on PD(.$i). Hence z = 0. This entails positive definiteness of C#(M), and feasibility of M. In the Duality Theorem 7.12, we find that for a formally optimal moment matrix M e M., there always exists a matrix D NND(s) satisfying the polarity equation of part (c). Theorem 7.13 will cast the present lemma into its final form, just demanding in part (c) that the polar function <f> vanishes for singular matrices and is strictly isotonic for positive definite matrices. 5.17. SCALAR OPTIMALITY, REVISITED Loewner optimality and information functions come to bear only if the coefficient matrix K has a rank larger than 1. We briefly digress to see how these concepts are simplified if the parameter system of interest is one-dimensional. For a scalar system c'O, the information "matrix" mapping A H- CC(A) is actually real-valued. It is positive on the feasibility cone A(c) where CC(A) (c'A~cYl > 0, and zero outside. The Loewner ordering among information numbers reverses the ordering among the variances c'A'c. Hence Loewner optimality for c'O is equivalent to the variance optimality criterion of Section 2.7.


The concept of information functions becomes trivial, for scalar optimality. They are functions < on NND(l) = [0;oo) satisfying 0(y) = <f>(l)y for all y > 0, by homogeneity. Thus all they achieve is to contribute a scaling by the constant <(!) > 0. The composition <f> o Cc orders any pair of moment matrices in the same way as does Cc alone. Therefore Cc is the only information function on the cone NND(fc) that is of interest. It is standardized if and only if c has norm 1. The polar function of Cc is Q(jB) = c'Bc. This follows from Theorem 5.14, since the identity mapping $(7) = y has polar 0(S) = inf y> o yS/y = 8. The function B i-> c'Bc is the criterion function of the dual problem of Section 2.11. In summary, information functions play a role only if the dimensionality s of the parameter subsystem of interest is larger than one, s > 1. The most prominent information functions are matrix means, to be discussed next. EXERCISES 5.1 Show that <J> > & if and only if {</> > 1} D (<A > 1), for all information functions 5.2 In Section 5.11, what are the unit level sets of 5.3 Discuss the behavior of the sets SC as S tends to zero or infinity, with C being (i) the unit level set {<f> > 1} of an information function $ as in Section 5.9, (ii) the penumbra M NND(fc) of a set of competing moment matrices M as in Section 4.9, (iii) the Elfving set conv(A? u ()) of a regression range X as in Section 2.9. 5.4 Show that a linear function <f> : Sym(.s) * IR is an information function on NND(s) if and only if for some D e NND(s) and for all C e Sym($) one has <f>(C) = trace CD.

for 5.5 Is an information function?

for

5.6 Show that neither of the matrix norms are isotonic on NND(s) [Marshall and olkin (1969), p. 170]. 5.7 Which properties must a function < on NND(s) have so that </>(|C|) is a matrix norm on Sym(.s), where |C| = C+ + C_ is the matrix modulus of Section 6.7?

CHAPTER 6

Matrix Means

The classical criteria are introduced: the determinant criterion, the average-variance criterion, the smallest-eigenvalue criterion, and the trace criterion. They are just four particular cases of the matrix means φ_p, with parameter p ∈ [−∞; 1]. The matrix mean of a given matrix is the same as the vector mean of the eigenvalue vector of the matrix. This and a majorization inequality show the matrix mean φ_p to be an information function. Its polar is proportional to the matrix mean φ_q where the numbers p and q are conjugate in the interval [−∞; 1].

6.1. CLASSICAL OPTIMALITY CRITERIA

The ultimate purpose of any optimality criterion is to measure "largeness" of a nonnegative definite s × s matrix C. In the preceding chapter, we studied the implications of general principles that a reasonable criterion must meet. We now list specific criteria which submit themselves to these principles, and which enjoy a great popularity in practice. The most prominent criteria are

    the determinant criterion,          φ_0(C) = (det C)^(1/s),
    the average-variance criterion,     φ_-1(C) = ((1/s) trace C^(-1))^(-1)   (C positive definite),
    the smallest-eigenvalue criterion,  φ_-∞(C) = λ_min(C),   and
    the trace criterion,                φ_1(C) = (1/s) trace C.

Each of these criteria reflects particular statistical aspects, to be discussed in Section 6.2 to Section 6.5. Furthermore, they form but four particular members of the one-dimensional family of matrix means φ_p, as defined in Section 6.7. In the remainder of the chapter, we convince ourselves that the matrix mean φ_p qualifies as an information function if the parameter p lies in the interval [−∞; 1].
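A short computational sketch (my own, not from the text) evaluates the four classical criteria on a sample information matrix; the average-variance value uses the positive definite form s/trace(C^(-1)).

    import numpy as np

    def d_crit(C):                 # determinant criterion phi_0
        return np.linalg.det(C) ** (1.0 / C.shape[0])

    def a_crit(C):                 # average-variance criterion phi_-1 (C positive definite)
        return C.shape[0] / np.trace(np.linalg.inv(C))

    def e_crit(C):                 # smallest-eigenvalue criterion phi_-inf
        return np.linalg.eigvalsh(C).min()

    def t_crit(C):                 # trace criterion phi_1
        return np.trace(C) / C.shape[0]

    C = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 0.5],
                  [0.0, 0.5, 2.0]])
    for name, crit in [("D", d_crit), ("A", a_crit), ("E", e_crit), ("T", t_crit)]:
        print(name, round(crit(C), 4))
    # the values satisfy phi_1 >= phi_0 >= phi_-1 >= phi_-inf, since phi_p increases in p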


6.2. D-CRITERION

The determinant criterion φ_0(C) differs from the determinant det C by taking the s-th root, whence both functions induce the same preordering among information matrices. From a practical point of view, one may therefore dispense with the s-th root and consider the determinant directly. However, the determinant is positively homogeneous of degree s, rather than 1. For comparing different criteria, and for applying the theory of information functions, the version φ_0(C) = (det C)^(1/s) is appropriate. Maximizing the determinant of information matrices is the same as minimizing the determinant of dispersion matrices, because of the formula

    (det C)^(1/s) = (det C^(-1))^(-1/s).
Indeed, in Section 3.5 the inverse C l of an information matrix was identified to be the standardized dispersion matrix of the optimal estimator for the parameter system of interest. Its determinant is called the generalized variance, and is a familiar way in multivariate analysis to measure the size of a dispersion matrix. This is the origin of the great popularity that the determinant criterion enjoys in applications. In a linear model with normality assumption, the optimal estimator K'0 for an estimable parameter system K'6 has distribution N^/ fl .( ff 2/ n ) C -i, with C = (K'M'KY1 and M = (\/ri)X'X. It turns out that the confidence ellipsoid for K'0 has volume inversely proportional to (det C)1/2. Hence a large value of det C secures a small volume of the confidence ellipsoid. This is also true for the ellipsoid of concentration which, by definition, is such that on it the uniform distribution has the same mean vector K'6 and dispersion matrix (ar2/ri)C~l as has K'0. For testing the linear hypothesis K'0 0, a uniform comparison of power leads to the Loewner ordering, as expounded in Section 3.7. Instead we may evaluate the Gaussian curvature of the power function, to find a design so that the F-test has good power uniformly over all local alternatives close to the hypothesis. Maximization of the Gaussian curvature again amounts to maximizing det C. Another pleasing property is based on the formula det(H'CH) = (det //2)(det C), with a nonsingular 5 x 5 matrix H. Suppose the parameter system K'0 is reparametrized according to H'K'6. This is a special case of iterated information matrices, for which Theorem 3.19 provides the identities

Thus the determinant assigns proportional values to Cx(A) and CKu(A), and


the two function induced orderings of information matrices are identical. In other words, the determinant induced ordering is invariant under reparametrization. It can be shown that the determinant is the only criterion for which the function induced ordering has this invariance property. Yet another invariance property pertains to the determinant function itself, rather than to its induced ordering. The criterion is invariant under reparametrizations with matrices H that fulfill det H = 1, since then we have det CKH(^) = det CK(A), We verify in Section 13.8 that this invariance property is characteristic for the determinant criterion. 6.3. A-CRITERION Invariance under reparametrization loses its appeal if the parameters of interest have a definite physical meaning. Then the average-variance criterion provides a reasonable alternative. If the coefficient matrix is partitioned into its columns, K = (c\,... ,cs), then the inverse l/<_i can be represented as

This is the average of the standardized variances of the optimal estimators for the scalar parameter systems c[6,...,c'sd formed from the columns of K. From the point of view of computational complexity, the criterion </>_! is particularly simple to evaluate since it only requires the computation of the s diagonal entries of the dispersion matrix K'A~K. Again we can pass back and forth between the information point of view and the dispersion point of view. Maximizing the average-variance criterion among information matrices is the same as minimizing the average of the variances given above. 6.4. E-CRITERION The criterion $-00, evaluation of the smallest eigenvalue, also gains in understanding by a passage to variances. It is the same as minimizing the largest eigenvalue of the dispersion matrix,

Minimizing this expression guards against the worst possible variance among all one-dimensional subsystems z'K'O, with a vector z of norm 1. In terms of variance, it is a minimax approach, in terms of information a maximin approach. This criterion plays a special role in the admissibility investigations of Section 10.9.


The eigenvalue criterion <_oo is one extreme member of the matrix mean family <p, corresponding to the parameter 6.5. T-CRITERION The other extreme member of the <(>p family is the trace criterion <j>\. By itself the trace criterion is rather meaningless. We have made a point in Section 5.1 that a criterion ought to be concave so that information cannot be increased by interpolation. The trace criterion is linear, and this is so weak that interpolation becomes legitimate. Yet trace optimality has its place in the theory, mostly accompanied by further conditions that prevent it from going astray. An example is Kiefer optimality of balanced incomplete block designs (see Section 14.9). The trace criterion is useless if the regression vectors x e X have a constant squared length c, say. Then the moment matrix Af () of any design H satisfies

whence the criterion <fo is constant. For instance, in the trigonometric fit model of Section 2,22, it assigns the value d + 1 to all moment matrices,

Similarly <fo is constant in the two-way classification model of Section 1.5, since

In these situations the criterion <fo provides no distinction whatsoever. The trace criterion also exemplifies the pathologies discussed in Section 5.15, that a formally optimal moment matrix may fail to be feasible. As an example, we take the parabola fit model of Section 1.6, with all three parameters being of interest. We assume that the experimental domain is the symmetric unit interval T = [!;!]. The <fo information of a design r on T is

The moments /u; = $<\.\\ t> dr for j = 2,4 attain the maximum value 1 if and only if the design is concentrated on the points 1. Thus every <ft-optimal


design T has at most two support points, 1, whence a formally optimal moment matrix has rank at most equal to 2. No such moment matrix can be feasible for a three-dimensional parameter system. The weaknesses of the trace criterion are an exception in the matrix mean family <p, with p e [-00; 1]. Theorem 6.13 shows that the other matrix means are concave without being linear. Furthermore they fulfill at least one of the conditions (b) or (c) of Lemma 5.16 for every formally optimal moment matrix to be feasible (see Theorem 7.13). 6.6. VECTOR MEANS Before turning to matrix means, we review the vector means 4>p on the space Rs. The nonnegative orthant W+ = [0;oo)* is a closed convex cone in R5. Its interior is formed by those vectors A e Rs that are positive, A > 0. It is convenient to (1) define the means <J>P for positive vectors, and (2) extend the definition to the closed cone Us+ by continuity, and (3) cover all of the space Us by a modulus reduction. For positive vectors, A = ( A i , . . . , A s )' > 0, the vector mean 4>p is defined by

For vectors A e Us+ with at least one component 0, continuous extension yields

For arbitrary vectors, A e Rs, we finally define

The definition of <J>P extends from positive vectors A > 0 to all vectors A RJ in just the same way for both p > 1 and for p < 1. It is not the definition, but the functional properties that lead to a striking distinction. For p > 1, the vector mean 4>p is convex on all of the space R*; for p < 1, it is concave but only if restricted to the cone IR+. We find it instructive to follow up this distinction, for the vector means as well as for the matrix means, even


though for the purposes of optimal designs it suffices to investigate the means of order p e [-00; 1], only. If A = als is positively proportional to the unity vector ls then the means <&p do not depend on p, that is, 4>p(a75) = a for all a > 0. They are standardized in such way that <t>PC/5) = 1. For p ^ 00, the dependence of <p on a single component of A > 0 is strictly isotonic provided the other components are fixed. With the full argument vector A held fixed, the dependence on the parameter/? [00; oo] is continuous. Verification is straightforward for p tending to 00. For/? tending to 0 it follows by applying the 1'Hospital rule to log ^(A). If the vector A has at least two distinct components then the means ^(A) are strictly increasing in /?. The most prominent members of this family are the arithmetic mean 4>!, the geometric mean <J>0, and the harmonic mean <!>_!. Our notation suggests that they correspond to the trace criterion <fo, the determinant criterion <fo, and the average-variance criterion <_i. The precise relationship is as follows. 6.7. MATRIX MEANS Again we find it instructive to define the matrix means <j>p for every parameter p e [-00; oo], on all of the space Sym^). Later, we contrast the convex behavior for /? > 1 with the concave behavior for p < 1. For p > 1, the matrix mean <j>p is a norm on the space Sym(.s); for p < 1, it is an information function on the cone NND(s). For a matrix C e Sym(5i), we let A(C) = ( A i , . . . , A,)' be the vector consisting of the eigenvalues A; of C, in no particular order but repeated according to their multiplicities. The matrix mean <f>p is defined through the vector mean <>p,

Since the vector means 3>p are invariant under permutations of their argument vector A, the order in which A(C) assembles the eigenvalues of C does not matter. Hence <f>p(C) is well defined. An alternative representation of <f>p(C) avoids the explicit use of eigenvalues, but instead uses real powers of nonnegative definite matrices. This representation enters into the General Equivalence Theorem 7.14 for <f>poptimality. There the conditions are stated in terms of not eigenvalues, but matrices and powers of matrices. To this end let us review the definition of the powers Cp for arbitrary real parameters n nrnvide.H C is nnsitive definite. Using an eigenvalue decomposition the definition is


For integer values p, the meaning of Cp is the usual one. In general, we obtain trace Cp trace For positive definite matrices, C 6 PD(s), the matrix mean is represented by

For singular nonnegative definite matrices, C e NND(s) with rank C < s, we have

For arbitrary symmetric matrices, C e Sym(s), we finally get

where |C|, the modulus of C, is defined as follows. With an eigenvalue decomposition C = Σ_{j≤s} λ_j z_j z_j', the positive part C_+ and the negative part C_− are given by

    C_+ = Σ_{j: λ_j > 0} λ_j z_j z_j',    C_− = Σ_{j: λ_j < 0} (−λ_j) z_j z_j'.

They are nonnegative definite matrices, and fulfill C = C_+ − C_−. The modulus of C is defined by |C| = C_+ + C_−. It is nonnegative definite, and its eigenvalues are the absolute values of the eigenvalues of C, thus substantiating (3). It is an immediate consequence of the definition that the matrix means φ_p on the space Sym(s) are absolutely homogeneous, nonnegative (even positive if p ∈ (0; ∞]), standardized, and continuous. This provides all the properties that constitute a norm on the space Sym(s), or an information function on the cone NND(s), except for subadditivity or superadditivity. This is where the domain of definition of φ_p must be restricted for sub- and superadditivity to hold true, from Sym(s) for p ≥ 1, to NND(s) for p ≤ 1. We base our derivation on the well-known polarity relation for the vector means Φ_p and the associated Hölder inequality in ℝ^s, using the quasi-linearization technique of Section 5.13. The key difficulty is the transition from the Euclidean vector scalar product ⟨λ, μ⟩ = λ'μ on ℝ^s, to the Euclidean matrix scalar product ⟨C, D⟩ = trace CD on Sym(s).
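To keep the computations concrete, the following sketch (my own illustration) evaluates φ_p(C) for a nonnegative definite matrix C directly from its eigenvalue vector, covering the limiting case p = −∞, the geometric-mean case p = 0, and the continuous extension to singular matrices.

    import numpy as np

    def phi_p(C, p):
        """Matrix mean phi_p(C) of a symmetric nonnegative definite matrix C."""
        lam = np.linalg.eigvalsh(C)          # eigenvalues, repeated by multiplicity
        s = len(lam)
        if p == -np.inf:
            return lam.min()                 # smallest-eigenvalue criterion
        if p == 0:
            # geometric mean of the eigenvalues; zero as soon as C is singular
            return float(np.prod(lam)) ** (1.0 / s) if lam.min() > 0 else 0.0
        if p < 0 and lam.min() <= 0:
            return 0.0                       # continuous extension for singular C
        return float(np.mean(lam ** p)) ** (1.0 / p)

    C = np.array([[2.0, 0.5], [0.5, 1.0]])
    for p in (-np.inf, -1, 0, 0.5, 1):
        print(p, phi_p(C, p))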


Our approach uses a few properties of vector majorization that are rigorously derived in Section 6.9. We begin with a lemma that provides a tool to recognize diagonal matrices.

6.8. DIAGONALITY OF SYMMETRIC MATRICES

Lemma. Let C be a symmetric s × s matrix with vector δ(C) = (c_11, ..., c_ss)' of diagonal elements and vector λ(C) = (λ_1, ..., λ_s)' of eigenvalues. Then the matrix C is diagonal if and only if the vector δ(C) is a permutation of the vector λ(C).

Proof. We write c = δ(C) and λ = λ(C), for short. If C is diagonal, C = Δ_c, then the components c_jj are the eigenvalues of C. Hence the vector c is a permutation of the eigenvalue vector λ. This proves the direct part. For the converse, we first show that Δ_c may be obtained from C through an averaging process. We denote by Sign(s) the subset of all diagonal s × s matrices Q with entries q_jj ∈ {±1} for j ≤ s, and call it the sign-change group. This is a group of order 2^s. We claim that the average over the matrices QCQ with Q ∈ Sign(s) is the diagonal matrix Δ_c,

    (1/2^s) Σ_{Q ∈ Sign(s)} Q C Q = Δ_c.    (1)

To this end, let e_i be the i-th Euclidean unit vector in ℝ^s. We have e_i' Q C Q e_j = q_ii q_jj c_ij. The diagonal elements of the matrix average in (1) are then

    (1/2^s) Σ_{Q ∈ Sign(s)} q_ii² c_ii = c_ii.

The off-diagonal elements are, with i ≠ j,

    (1/2^s) Σ_{Q ∈ Sign(s)} q_ii q_jj c_ij = 0,

since for half of the matrices Q ∈ Sign(s) the product q_ii q_jj equals +1, and for the other half it equals −1. Hence (1) is established.
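The averaging identity (1) is easy to confirm numerically; the sketch below (my own check) averages QCQ over all 2^s sign-change matrices and compares the result with the diagonal part of C.

    import itertools
    import numpy as np

    def sign_change_average(C):
        s = C.shape[0]
        total = np.zeros_like(C)
        for signs in itertools.product([-1.0, 1.0], repeat=s):
            Q = np.diag(signs)               # one element of the sign-change group
            total += Q @ C @ Q
        return total / 2 ** s

    C = np.array([[2.0, 0.7, -0.3], [0.7, 1.0, 0.4], [-0.3, 0.4, 3.0]])
    print(np.allclose(sign_change_average(C), np.diag(np.diag(C))))   # True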


Secondly, for the squared matrix norm ‖C‖² = trace C², we verify the invariance property

    ‖Q C Q‖ = ‖C‖   for all Q ∈ Sign(s).    (2)

On the one hand, we have ‖QCQ‖² = trace QCQQCQ = trace C² = Σ_{j≤s} λ_j². On the other hand, we have ‖Δ_c‖² = trace Δ_c Δ_c = Σ_{j≤s} c_jj² = Σ_{j≤s} λ_j², where the last equality follows from the assumption that c is a permutation of λ. Finally we compute the squared norm of the convex combination (1) and utilize the invariance property (2) to obtain, because of convexity of the squared norm,

    ‖Δ_c‖² = ‖(1/2^s) Σ_{Q ∈ Sign(s)} Q C Q‖² ≤ (1/2^s) Σ_{Q ∈ Sign(s)} ‖Q C Q‖² = ‖C‖² = ‖Δ_c‖².    (3)

Hence (3) holds with equality. Since the convexity of the squared norm is strict and every weight 1/2* is positive, equality in (3) forces all matrices in the sum to be the same, QCQ = C for all Q Sign(s). Going back to (1) we find that Ac = 1/25 GSign(*) C = C, that is, C is diagonal. D The proof relies on equation (1), to transform C into QCQ, then take the uniform average as Q varies over the sign-change group Sign(s), and thereby reproduce Ac. The vectors A(C) and 5(C) are related to each other in a similar fashion; to transform A(C) into Q\(C), then take some average as Q varies over a finite group Q, and finally reproduce 8(C). The relation is called vector majorization. It is entirely a property of column vectors, and a change of notation may underline this. Instead of S(C) and A(C), we choose two arbitrary vectors x and y in IR*. The group Q in question is the permutation group Perm(/c), that is, the subset of all permutation matrices in the space Rkxk. A permutation IT of the numbers ! , . . . , & induces the permutation matrix

where j is the yth Euclidean unit vector of Uk. It is readily verified that the mapping IT i- Q^ is a group isomorphism between permutations, and permutation matrices. Hence Perm(/c) is a group of order k\. A permutation matrix Q^ acts on a vector y e Uk by permuting its entries according to TT~I,


We are interested in averages of the type

The with weights aQ that satisfy mmQePeTm(k) aQ>Q vector x has its components less spread out than those of y in the sense that x results from averaging over all possible permutations Qy of y. This averaging property is also reflected by the fact that the matrix 5 is doubly stochastic. A matrix S R /cx * is called doubly stochastic when all elements are nonnegative, and all row sums as well as all column sums are 1,

The set of all doubly stochastic k x k matrices is a compact and convex subset of the matrix space Rkxk, closed under transposition and matrix multiplication. Every permutation matrix is doubly stochastic, as is every average Y^QePerm(k) aQQ- Conversely, the Birkhoff theorem states that every doubly stochastic matrix is an average of permutation matrices. We circumvent the Birkhoff theorem in our exposition of vector majorization, by basing its definition on the property of 5 being doubly stochastic. 6.9. VECTOR MAJORIZATION The majorization ordering compares vectors of the same dimension, expressing that one vector has its entries more balanced, or less spread out, than another vector. For two vectors x,y e Rk, the relation x -< y holds when x can be obtained from y by a doubly stochastic transformation, for some doubly stochastic k x k matrix 5. In this case, the vector x is said to be majorized by the vector y. Not all pairs of vectors are comparable under majorization. A necessary reauirement is that the comnonent sums of the two vectors are the same: if then The relation is reflexive and transitive,

Hence the majorization ordering constitutes a preordering, For this preordering to be a partial ordering, antisymmetry is missing. We claim that vector majorization satisfies a weaker version which we call


antisymmetry modulo Perm(/c),

To prove the direct part in (1), we consider decreasing rearrangements and partial sum sequences. Given a vector x its decreasing rearrangement, jq, is defined to be the vector that has the same components as x, but in decreasing order. For x to be a permutation of another vector y, it is evidently necessary and sufficient that the two vectors share the same decreasing rearrangements, JC| = yi. This equality holds if and only if the partial sums over the consecutive initial sections of jq = (x^,..., x^)' and y^ (y^,..., y i/t )' coincide,

If jc -x y, then the two sums are the same for h k, since jc,y,jtj,y| share one and the same component sum. We show that majorization implies an ordering of the partial sums,

To this end, let x be majorized by y, that is, x = Sy for some doubly stochastic matrix S. We can choose two permutation matrices Q and R that map x and y into their decreasing rearrangements, x↓ = Qx and y↓ = Ry. Since R' is the inverse of R, we obtain y = R'y↓. It follows that x↓ = Qx = QSy = QSR'y↓ = Py↓, where the matrix P = QSR' = (p_ij) is doubly stochastic. For h < k, we get

    Σ_{i≤h} x↓_i = Σ_{i≤h} Σ_{j≤k} p_ij y↓_j = Σ_{j≤k} c_j y↓_j,

with coefficients c_j = Σ_{i≤h} p_ij that satisfy 0 ≤ c_j ≤ 1 and Σ_{j≤k} c_j = h. Since the components of y↓ are in decreasing order, this finally yields

    Σ_{i≤h} x↓_i = Σ_{j≤k} c_j y↓_j ≤ Σ_{j≤h} y↓_j.

We have verified property (3). If x and y majorize each other, then the two inequalities from (3) yield the equality in (2). Hence the direct part in (1) is established. For the converse, we only need to apply the majorization definition to x = Qy and y = Q'x. The proof of our claim (1) is complete.
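Property (3), combined with equal component sums, is the familiar partial-sum test for majorization (by the classical Hardy–Littlewood–Pólya characterization it is in fact equivalent to x ≺ y). The sketch below (my own illustration) implements this test on two small examples.

    import numpy as np

    def majorized_by(x, y, tol=1e-12):
        """Partial-sum test: x is majorized by y."""
        x_dec = np.sort(np.asarray(x, dtype=float))[::-1]
        y_dec = np.sort(np.asarray(y, dtype=float))[::-1]
        if abs(x_dec.sum() - y_dec.sum()) > tol:
            return False                               # component sums must agree
        return bool(np.all(np.cumsum(x_dec) <= np.cumsum(y_dec) + tol))

    print(majorized_by([2.0, 2.0, 2.0], [3.0, 2.0, 1.0]))   # True: more balanced vector
    print(majorized_by([3.5, 1.5, 1.0], [3.0, 2.0, 1.0]))   # False: first partial sum too large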


We are now in a position to resume the discussion of Section 6.8, and derive the majorization relation between the diagonal vector 8(C) and the eigenvalue vector A(C) of a positive definite matrix C. 6.10. INEQUALITIES FOR VECTOR MAJORIZATION Lemma. Let C be a positive definite 5 x 5 matrix with vector 8(C) = (en,...,c s s )' of diagonal elements and vector A(C) = ( A i , . . . , A,)' of eigenvalues. Then we have: a. (Schur inequality) 8(C) is majorized by A(C). b. (Monotonicity of concave or convex functions) For every strictly concave function g : (0; oo) > R, we have the inequality

while for strictly convex functions g the inequality is reversed. In either case equality holds if and only if the matrix C is diagonal. c. (Monotonicity of vector means) For parameter p (-00; 1) the vector means <&p obey the inequality

while for parameter p e (l;oo), the inequality is reversed. In either case equality holds if and only if the matrix C is diagonal. Proof. We write and for short. For oart fa), we choose an eigenvalue decomposition and define the s x s matrix S with entries Since Z' = (zi,...,zs) is an orthogonal sxs matrix, the rows and columns of S sum to 1. Thus 5 is doubly stochastic, and we have, for all i < s.

This yields c = S\, that is, c is majorized by A. To prove part (b), let g be a strictly concave function. For a single subscript / < s, equalitv (1) vields


Summation over / gives

as claimed in part (b). Equality holds in (3) if and only if, for all / < s, equality holds in (2). Then strict concavity necessitates that in (1) positive coefficients s/y come with an identical eigenvalue Ay = c,,, that is, stj > 0 implies c(/ = A y for all i,j < s. Therefore A; = Y^i-s >os'i^i lL,i<ssijc"> ^or all 7 < 5, that is, A = S'c. Thus c and A are majorized by each other. From the antisymmetry property (1) in Section 6.9, the vector c is a permutation of the vector A. Hence C is diagonal by Lemma 6.8. The case of a strictly convex function g is proved similarly. The proof of part (c) reduces to an application of part (b). First consider the case p e (oo;0). Since the function g(x) = xp is strictly convex, part (b) yields Ej<s8(cjj) < ;<,?(*/), that is, g(*p(c)) < g(*p(X)). But g is also strictly antitonic, whence we obtain $>p(c} > $ P (A). For p 0, we use the strictly concave and strictly isotonic function g(x) log x, for p e (0; 1) the strictly concave and strictly isotonic function g(x) = xp, and for p (l;oo) the strictly convex and strictly isotonic function g(x) X?. LJ We next approach the Holder inequality for the matrix means <j>p. Two numbers p,q e (-00; oo) are called conjugate when p + q pq. Except for p q = 0, the defining relation permits the familiar form l/p + l/q = 1. For finite numbers p and q, conjugacy implies that both lie either in the interval (oo;l), or in (l;oo). The limiting case, as p tends to 00, has q = 1. Therefore we extend the notion of conjugacy to the closed real line [-00; oo] by saying that the two numbers -oo and 1 are conjugate in the interval [00; 1], while the two numbers 1 and oo are conjugate in [1; oo]. The only self-conjugate numbers are p = q = 0, and p = q = 2 (see Exhibit 6.1). 6.11. THE HOLDER INEQUALITY Theorem. Let p and q be conjugate numbers in [oo; 1], and let p and q be conjugate numbers in [l;oo]. Then any two nonnegative definite 5 x 5 matrices C and D satisfy

Assume C to be positive definite, C > 0. In the case p,q (oo;l), equality holds in the left inequality if and only if D is positively proportional to Cp~l, or equivalently, C is positively proportional to Dq~^. In the case


EXHIBIT 6.1 Conjugate numbers, p + q = pq. In the interval [-00; 1] one has p < 0 if and only if q > 0; in the interval [l;oo] one has p > 2 if and only if <? < 2. The only self-conjugate numbers are p = q = 0 and p q = 2.

p,q (l;oo), the equality condition for the right inequality is the same with p in place of p, and q in place of q. Proof. First we investigate the cases p,q (oo; 1), for positive definite matrices C and D. We choose an eigenvalue decomposition Z 'A A Z. With the orthogonal 5 x 5 matrix Z' = (z\,..., zs) that comes with D, we define C = ZCZ'. Then the scalar product (C, D) of the matrices C and /) is the same as the scalar product (S(C),A(Z))) of the diagonal vector 5(C) of C and the eigenvalue vector \(D} of D.

The Holder inequality for vector means provides the lower bound 0>p (8(C))<bq (A(/))). Lemma 6.10 bounds 4>p (S(C)) from below by <t>p (A(C)). By construction, C has the same eigenvalues as C, that is, A(C) = A(C). Thus


we obtain the inequality chain

We denote the components of A(D) by A y , while 8(C) has components Equality holds in the first inequality of (1) if and only if, for some a > 0 and for all j < s, we have A in case p,q ^ 0, and in case p = q 0. In the second inequality of (1), equality holds if and only if the matrix C is diagonal. Hence c;/ are the eigenvalues of C as well as of C, and we have Altogether equality in (1) entails

That this condition is also sufficient for equality is seen by straightforward verification. Evidently, positive proportionality of D and C^{p−1} is equivalent with positive proportionality of C and D^{q−1}. The case p̄, q̄ ∈ (1; ∞) uses similar arguments, with reversed inequalities in (1). The extension from positive definite matrices to nonnegative definite matrices C and D follows by continuity. The extension to the parameter values p, q ∈ {−∞, 1} and p̄, q̄ ∈ {1, ∞} follows by a continuous passage to the limit. □

If p, q ≠ 0 and C, D > 0, then D is proportional to C^{p−1} if and only if C^p and D^q are proportional. The latter condition looks more symmetric in C and D.

The polar function of a matrix mean φ_p is defined by

    φ_p^∞(D) = inf_{C > 0} ⟨C, D⟩ / φ_p(C)   for p ∈ [−∞; 1],        φ_p̄^∞(D) = sup_{C > 0} ⟨C, D⟩ / φ_p̄(C)   for p̄ ∈ [1; ∞].

In either case the denominators are positive, and so the definition makes sense. The formulae are the same as those in Section 5.12. It turns out that the polar functions are again matrix means, up to a scale factor.

6.12. POLAR MATRIX MEANS

Theorem. Let p and q be conjugate numbers in [−∞; 1], and let p̄ and q̄ be conjugate numbers in [1; ∞]. Then the polar functions of the matrix means


are

    φ_p^∞ = s φ_q   and   φ_p̄^∞ = s φ_q̄.

Proof. We again write ⟨C, D⟩ = trace CD. In the first part, we concentrate on the case p, q ≤ 1. Fix a positive definite matrix D. The Hölder inequality proves one half of the polarity formula,

The other half follows from inserting the particular choice C = D^{q−1}, provided p, q ∈ (−∞; 1). This choice fulfills the equality condition of the Hölder inequality. Hence we obtain

For p = −∞, we insert for C the choice I_s. For p = 1, we choose the sequence C_m = zz' + (1/m)I_s > 0 and let m tend to infinity, where z is any eigenvector of D corresponding to the smallest eigenvalue λ_min(D). This shows that the functions φ_p^∞ and sφ_q coincide on the open cone PD(s). Because of continuity they also coincide on the closed cone NND(s). The first polarity formula is established.

In the second part of the proof, we turn to the case p̄, q̄ ≥ 1. For D = 0, we have φ_p̄^∞(0) = 0 = sφ_q̄(0). In order to handle symmetric matrices D ≠ 0, we utilize, from Section 6.7, the positive and negative parts D_+ and D_− and the modulus |D| = D_+ + D_−. Let C be another nonnegative definite matrix. Monotonicity of the trace, as discussed in Section 1.11, entails −⟨C, D_−⟩ ≤ 0 ≤ ⟨C, D_+⟩. Hence we obtain a bound for the trace scalar product,

This bound and representation (3) from Section 6.7 yield

(This estimate has no counterpart for polars of information functions.) We restrict ourselves to the case p̄, q̄ ∈ (1; ∞) and a nonsingular matrix D. The other cases then follow by continuity. Nonsingularity of D forces |D| to


be positive definite. Hence the Hölder inequality applies:

Together with (2), this establishes the first half of the polarity formula. The other half follows if we insert for C a suitable choice built from the positive part D_+. Positive and negative parts fulfill the orthogonality relation D_+ D_− = 0; together with the equality condition of the Hölder inequality, this choice attains the bound, and we obtain the reverse inequality.

Hence equality holds, and the proof is complete. □

The functional properties of the matrix means may be summarized as follows.

6.13. MATRIX MEANS AS INFORMATION FUNCTIONS AND NORMS

Theorem. Let p and q be conjugate numbers in [−∞; 1]. Then the matrix mean φ_p is an information function on NND(s), with sφ_q as its polar function; φ_p is strictly concave on PD(s) if p ∈ (−∞; 1), φ_p is strictly isotonic on NND(s) if p ∈ (0; 1], and φ_p is strictly isotonic on PD(s) if p ∈ (−∞; 0].

Let p̄ and q̄ be conjugate numbers in [1; ∞]. Then the matrix mean φ_p̄ is a norm on Sym(s), with sφ_q̄ as its polar function; φ_p̄ is strictly convex if p̄ ∈ (1; ∞), and φ_p̄ is strictly isotonic on NND(s) if p̄ ∈ [1; ∞).

Proof. With p and q interchanged, Theorem 6.12 gives φ_p = (1/s)φ_q^∞. Hence φ_p is an information function, by Theorem 5.13. Theorem 6.12 also comprises the polarity formula φ_p^∞ = sφ_q. It remains to investigate strict concavity and strict monotonicity.


Strict concavity on PD(s) follows from Corollary 5.3, provided φ_p is strictly superadditive on PD(s). To this end we show that, for C > 0, D ≠ 0 and p ∈ (−∞; 1), the equality φ_p(C + D) = φ_p(C) + φ_p(D) forces D to be positively proportional to C. Indeed, upon introducing E = (C + D)^{p−1}, we have equality in the Hölder inequality, whence

Now we assume that equality holds. Then E minimizes ⟨C, F⟩/φ_p^∞(F) over F > 0. The Hölder inequality states that this occurs only if E is positively proportional to C^{p−1}. But E = (C + D)^{p−1} and E = αC^{p−1} with α > 0 imply C + D = α^{1/(p−1)}C, that is, D = (α^{1/(p−1)} − 1)C. Since 0 ≠ D ≥ 0, we must have α^{1/(p−1)} > 1, and D is positively proportional to C. This establishes strict concavity on PD(s), for p ∈ (−∞; 1).

From Corollary 5.5, if p ∈ (0; 1], then φ_p is positive and hence strictly isotonic. Moreover, if p ∈ (−∞; 0], then φ_p is strictly superadditive on PD(s) and hence strictly isotonic on PD(s). The proof for the norms φ_p̄ is similar and hence omitted. □

Having established the functional properties of the matrix means φ_p, we return to the design problem.

6.14. THE GENERAL DESIGN PROBLEM WITH MATRIX MEANS

In Section 5.15, we introduced the general design problem for a parameter system K'θ, with coefficient matrix K of full column rank s. If the optimality criterion is a matrix mean φ_p, with parameter p ∈ [−∞; 1], the problem takes the form

For C > 0 fixed, φ_p(C) is isotonic in p. Hence the optimal value v(φ_p) = sup_{M ∈ M} φ_p(C_K(M)) is isotonic in p as well. In particular, the optimal values v(φ_p) are bounded from above by v(φ_1).
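A minimal numerical sketch of this monotonicity, assuming the definition φ_p(C) = ((1/s) trace C^p)^{1/p} for finite nonzero p, with the limiting cases φ_0(C) = (det C)^{1/s} and φ_{−∞}(C) = λ_min(C); the helper name phi_p and the example matrix are illustrative choices only:

```python
import numpy as np

def phi_p(C, p):
    """Matrix mean of a positive definite matrix C: the vector mean of its eigenvalues."""
    lam = np.linalg.eigvalsh(C)
    if p == -np.inf:
        return lam.min()                        # smallest-eigenvalue (E-) criterion
    if p == np.inf:
        return lam.max()
    if p == 0:
        return np.exp(np.log(lam).mean())       # geometric mean = (det C)^(1/s), D-criterion
    return np.mean(lam ** p) ** (1.0 / p)

# isotonicity of phi_p(C) in p, for one fixed C > 0
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
ps = [-np.inf, -2.0, -1.0, 0.0, 0.5, 1.0]
vals = [phi_p(C, p) for p in ps]
assert all(vals[i] <= vals[i + 1] + 1e-12 for i in range(len(vals) - 1))
print(dict(zip(map(str, ps), np.round(vals, 4))))
```

The assertion reflects the power-mean inequality applied to the eigenvalues of C.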


The matrix means of greatest importance are singled out in Section 6.1 to Section 6.5: the D-, A-, E-, and T-criteria φ_0, φ_{−1}, φ_{−∞}, and φ_1. The determinant criterion φ_0 is self-polar. The average-variance criterion φ_{−1} has the matrix mean sφ_{1/2} for its polar function. The smallest-eigenvalue criterion φ_{−∞} and the trace criterion φ_1 form a polar pair. In summary, for the classical criteria, the polarity relations are

    φ_0^∞ = s φ_0,    φ_{−1}^∞ = s φ_{1/2},    φ_{−∞}^∞ = s φ_1,    φ_1^∞ = s φ_{−∞}.

In Section 6.5, we illustrated by example that there need not exist a φ_1-optimal moment matrix for K'θ in M. However, this anomaly can happen only for the trace criterion φ_1. Indeed, if p ∈ [−∞; 0], then the matrix mean φ_p vanishes for singular matrices, and part (b) of Lemma 5.16 applies. If p ∈ (0; 1), then the mean φ_p has polar function sφ_q which vanishes for singular matrices and is strictly isotonic on PD(s). This verifies the first half of part (c) in Lemma 5.16. The second half, solvability of the polarity equation φ(C)φ^∞(D) = trace CD = 1, is established later, in Theorem 7.13. At any rate, the matrices solving the polarity equation in part (c) of Lemma 5.16 play a vital role. In Theorem 7.9, we show on an abstract level that the polarity equation is solvable provided C is positive definite. For the matrix means φ_p we can determine these solutions quite explicitly. An auxiliary matrix result is established first.

6.15. ORTHOGONALITY OF TWO NONNEGATIVE DEFINITE MATRICES

Let A and B be two n × k matrices. We call A orthogonal to B when the matrix scalar product ⟨A, B⟩ = trace A'B vanishes,

This notion of orthogonality refers to the scalar product in the space R^{n×k}. It has nothing to do with an individual k × k matrix A being orthogonal; the latter means A'A = I_k and n = k. If A and B are square and symmetric, then for the scalar product to vanish, it is clearly sufficient that the matrix product AB is null. For nonnegative definite matrices A and B the converse also holds true:

In order to establish the direct part, we choose two square root decompositions, A = UU' and B = VV' (see Section 1.14). We get ⟨A, B⟩ = trace UU'VV' = trace (U'V)'U'V = ||U'V||². Hence if A is orthogonal


to B, then U'V is null. Premultiplication by U and postmultiplication by V' leads to 0 = UU'VV' = AB.

Moreover, the equation AB = 0 permits a geometrical interpretation in terms of the ranges of A and B. The equation means that the range of B is included in the nullspace of A. By Lemma 1.13, the latter is the orthogonal complement of the range of A. Hence AB is the null matrix if and only if the ranges of A and B are orthogonal subspaces. For nonnegative definite matrices A and B we thus have

Notice the distinct meanings of orthogonality. The left hand side refers to orthogonality of the points A and B in the space Sym(k), while the right hand side refers to orthogonality of the two subspaces range A and range B in the space R^k. Lemma 2.3 is a similar juxtaposition for sums of matrices and sums of subspaces. The equivalences (1), (2), and (3) are used repeatedly, and with other nonnegative definite matrices in place of A and B.
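A minimal numerical sketch of the equivalence ⟨A, B⟩ = 0 if and only if AB = 0 for nonnegative definite matrices; the example matrices are illustrative choices, and the final two lines show that the implication fails for merely symmetric matrices:

```python
import numpy as np

u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
w = np.array([0.0, 0.0, 2.0])

A = np.outer(u, u) + np.outer(v, v)   # nonnegative definite, range = span{e1, e2}
B = np.outer(w, w)                    # nonnegative definite, range = span{e3}

# scalar product vanishes  <=>  matrix product vanishes  (ranges are orthogonal)
assert np.isclose(np.trace(A @ B), 0.0)
assert np.allclose(A @ B, 0.0)

# for general symmetric matrices the implication can fail:
S = np.diag([1.0, -1.0])
T = np.eye(2)
assert np.isclose(np.trace(S @ T), 0.0) and not np.allclose(S @ T, 0.0)
```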

6.16. POLARITY EQUATION

Lemma. Let C be a positive definite s × s matrix, and let p be a number in [−∞; 1]. Then D ∈ NND(s) solves the polarity equation

if and only if

    D = C^{p−1} / trace C^p   in case p ∈ (−∞; 1],        D ∈ (1/λ_min(C)) conv S   in case p = −∞,

where the set S consists of all rank 1 matrices zz' such that z is a norm 1 eigenvector of C corresponding to its smallest eigenvalue λ_min(C).

Proof. For the direct part, let D ≥ 0 be a solution of the polarity equation. In the case p ∈ (−∞; 1), the polarity equation holds true if and only if D = αC^{p−1} for some α > 0, by Theorem 6.11. From trace C(αC^{p−1}) = α trace C^p = 1, we find α = 1/trace C^p.

In the case p = 1, trace monotonicity of Section 1.11 and the polarity


equation yield

where the penultimate line employs the definitions of φ_1 and φ_{−∞}, and the last line uses the polarity formula sφ_{−∞} = φ_1^∞. Thus we have trace C(D − λ_min(D)I_s) = 0, with C positive definite. This entails D = λ_min(D)I_s, that is, D is positively proportional to I_s = C^{1−1}. From trace C(αI_s) = 1, we get α = 1/trace C.

The last case, p = −∞, is more delicate. With C ≥ λ_min(C)I_s, we infer as before

This entails trace (C − λ_min(C)I_s)D = 0, but here D may be singular. However, C − λ_min(C)I_s is orthogonal to D in the sense of Section 6.15, that is, CD = λ_min(C)D. With eigenvalue decomposition D = Σ_{j ≤ s} λ_j z_j z_j', postmultiplication by z_j yields λ_j Cz_j = λ_min(C)λ_j z_j, for all j ≤ s. If λ_j ≠ 0, then z_j is a norm 1 eigenvector of C corresponding to λ_min(C), and z_j z_j' ∈ S. Furthermore, the polarity equation implies that the nonnegative numbers λ_min(C)λ_j sum to 1, Σ_{j ≤ s} λ_min(C)λ_j = trace λ_min(C)D = trace CD = 1. Hence λ_min(C)D = Σ_{j: λ_j > 0} (λ_min(C)λ_j) z_j z_j' is a convex combination of rank 1 matrices from S. The proof of the direct part is complete. The converse follows by straightforward verification. □

6.17. MAXIMIZATION OF INFORMATION VERSUS MINIMIZATION OF VARIANCE

Every information function φ is log concave, that is, log φ is concave. To see this, we need only apply the strictly isotonic and strictly concave logarithm function,


In view of log φ = −log(1/φ), log concavity of φ is the same as log convexity of 1/φ. Log convexity always implies convexity, as is immediate from appealing to the strictly isotonic and strictly convex exponential function. Hence the design problem, a maximization problem with concave objective function φ, is paired with a minimization problem in which the objective function M ↦ 1/φ(C_K(M)) is actually log convex, not just convex.

The relation between φ and 1/φ has the effect that some criteria which at first glance look different are covered by the information function concept. For instance, suppose the objective is to minimize a convex matrix mean φ_p̄ of the dispersion matrix C^{−1}, in order to make the dispersion matrix as small as possible. A moment's reflection yields, with p = −p̄,

Hence the task of minimizing the norm φ_p̄ of dispersion matrices C^{−1} is actually included in our approach, in the form of maximizing the information function φ_p of information matrices C, with p = −p̄. By the same token, minimization of linear functions of the dispersion matrix is equivalent to maximization of a suitable information function (compare the notion of linear optimality in Section 9.8). Maximization of information matrices appears to be the appropriate optimality concept for experimental designs, much more so than minimization of dispersion matrices. The necessary and sufficient conditions for a design to have maximum information come under the heading "General Equivalence Theorem".

EXERCISES

6.1 Show that the projection of C ∈ Sym(s) onto the cone NND(s) is the positive part C_+.

6.2 Disprove that if C = A − B and A, B ≥ 0, then A ≥ C_+ and B ≥ C_−, for C ∈ Sym(s).

6.3 Show that the matrix modulus satisfies …, for all C ∈ Sym(s) [Mathias (1990), p. 129].

6.4 For a given matrix C ∈ NND(s), how many solutions does the equation … have in Sym(s)?

6.5 Are (1/3)(0, 1, 1)' and (1/3)(3, 1, 1)' comparable by vector majorization?


6.6 Verify that, for p < 1, the matrix mean φ_p fails to be concave on Sym(s).

6.7 Prove the polarity equation for vector means: for y ∈ (0; ∞)^s, a vector x ≥ 0 solves it if and only if x = y^{p−1}/Σ_{i ≤ s} y_i^p in case p ∈ (−∞; 1], and x ∈ (1/min_{i ≤ s} y_i) conv S in case p = −∞, where the set S consists of all Euclidean unit vectors e_i such that y_i is the smallest component of y.

6.8 Let w_1, ..., w_s be nonnegative numbers. For p ≠ 0, ±∞ and y ∈ (0; ∞)^s, the weighted vector mean is defined by Φ_p^w(y) = (Σ_{i ≤ s} w_i y_i^p)^{1/p}. What is the appropriate definition for p = 0, ±∞? Extend the domain of definition of Φ_p^w to the space R^s.

6.9 (continued) Show that if w_1, ..., w_s are positive and sum to 1, then the polar of the weighted vector mean Φ_p^w is the corresponding weighted mean …, where p and q are conjugate over [−∞; 1] [Gutmair (1990), p. 29].

6.10 Let W ≥ 0 be a nonnegative definite matrix. For p ≠ 0, ±∞ and C > 0, the weighted matrix mean φ_p^W(C) is defined by …. What is the appropriate definition for p = 0, ±∞? Extend the domain of definition of φ_p^W to the space Sym(s).

6.11 Show that the polar of the weighted matrix mean φ_1^W(C) = trace WC equals 1/λ_max(WD^−) or 0, according as D ∈ A(W) or not [Pukelsheim (1980), p. 359].

CHAPTER 7

The General Equivalence Theorem

The General Equivalence Theorem provides necessary and sufficient conditions for a moment matrix to be φ-optimal for the parameter system of interest, in a compact and convex set of competing moment matrices, where φ is an information function. The theorem covers nondifferentiable information functions, and singular moment matrices. The proof is based on convex analysis, the tools being subgradients and normal vectors to a convex set. Particular versions of the theorem are given that are applicable to the matrix means φ_p.

7.1. SUBGRADIENTS AND SUBDIFFERENTIALS

The topic of this chapter is the General Equivalence Theorem 7.14, which provides necessary and sufficient conditions for a moment matrix to solve the design problem. The main results, from Section 7.10 onwards, pertain to moment matrices M. In Chapter 8 we turn to designs proper. The auxiliary results, up to Section 7.9, develop tools from convex analysis. This approach avoids an overemphasis on differentiability. A differentiability assumption precludes such objective functions as the smallest-eigenvalue criterion φ_{−∞}, and it does not permit moment matrices to become singular. In the latter case, we even have to face a lack of continuity, as met in Section 3.16.

We make extensive use of two notions of convex analysis: subgradients and normal vectors to a convex set. To begin with, we discuss subgradients and subdifferentials of concave functions, in the linear space of symmetric matrices Sym(k) with Euclidean scalar product ⟨A, B⟩ = trace AB. For a concave function g : NND(k) → R and for a nonnegative definite matrix M ∈ NND(k), a symmetric matrix B ∈ Sym(k) is called a subgradient of g at M when it satisfies the subgradient inequality

    g(A) ≤ g(M) + ⟨A − M, B⟩   for all A ∈ NND(k).


EXHIBIT 7.1 Subgradients. Each subgradient B_1, B_2, B_3, ... of a concave function g at a point M determines an affine function A ↦ g(M) + ⟨A − M, B_i⟩ that bounds g from above and coincides with g at M.

The set of all subgradients of g at M is called the subdifferential of g at M, and is denoted by ∂g(M). When the set ∂g(M) is nonempty, the function g is said to be subdifferentiable at M. In the design problem, g is a composition φ ∘ C_K.

The subgradient inequality has a pleasing geometrical interpretation. Let us consider the linear function T(A) = ⟨A, B⟩. Then the right hand side of the subgradient inequality is the affine function A ↦ g(M) + T(A − M), with value g(M) at A = M. Hence any subgradient of g at M gives rise to an affine function that globally bounds g from above and, locally at M, coincides with g (see Exhibit 7.1).

We rigorously prove whatever properties of subgradients we need. However, we briefly digress and mention some more general aspects. The subdifferential ∂g(M) is a closed and convex set. If M is singular then the subdifferential ∂g(M) may be empty or not. If M is positive definite then ∂g(M) is nonempty. The function g is differentiable at M if and only if the subdifferential ∂g(M) is a one-point set. In this case the unique subgradient is the gradient, ∂g(M) = {∇g(M)}. A similar relation holds for generalized matrix inversion (compare Section 1.16). The matrix A is invertible if and only if the set A^− of generalized inverses is a one-point set. In this case the unique generalized inverse is the inverse, A^− = {A^{−1}}.
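A minimal numerical sketch of the subgradient inequality, using the smooth concave example g(M) = log det M on PD(k), whose unique subgradient (the gradient) at M is M^{−1}; the example matrices and tolerances are illustrative choices:

```python
import numpy as np

def logdet(M):
    sign, val = np.linalg.slogdet(M)
    return val if sign > 0 else -np.inf      # -inf for singular nonnegative definite A

rng = np.random.default_rng(1)
k = 3
M = np.eye(k)                                # point at which the subgradient is taken
B = np.linalg.inv(M)                         # gradient of log det at M

for _ in range(100):
    X = rng.standard_normal((k, k))
    A = X @ X.T                              # a nonnegative definite competitor
    lhs = logdet(A)
    rhs = logdet(M) + np.trace((A - M) @ B)  # affine upper bound from the subgradient
    assert lhs <= rhs + 1e-9
```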

7.2. NORMAL VECTORS TO A CONVEX SET

The other central notion, normal vectors to a convex set, generalizes the concept of orthogonal vectors to a linear subspace. Let M be a convex set of symmetric k × k matrices, and let M be a member of M. A matrix B ∈ Sym(k)


EXHIBIT 7.2 Normal vectors to a convex set. Any matrix B normal to a set M at a point M is such that the angle with the directions A − M is greater than or equal to a right angle, for all A ∈ M.

is said to be normal to M at M when it satisfies the normality inequality

    ⟨A − M, B⟩ ≤ 0   for all A ∈ M.

Geometrically, this means that the angle between B and all the directions A − M from M to A ∈ M is greater than or equal to a right angle, as shown in Exhibit 7.2. If M is a subspace, then every matrix B normal to M at M is actually orthogonal to M. Indeed, if M is a subspace containing the matrix A, then it also contains δA for δ ≥ 0. The inequality ⟨δA, B⟩ ≤ ⟨M, B⟩, for all δ ≥ 0, forces ⟨A, B⟩ to be nonpositive, ⟨A, B⟩ ≤ 0. Since the same reasoning applies to −A ∈ M, we obtain ⟨A, B⟩ = 0, for all A ∈ M. This shows that the matrix B is orthogonal to the subspace M.

The following lemma states that, under our grand assumption that the set of competing moment matrices M intersects the feasibility cone A(K), we can formulate an "equivalent problem" in which the set M intersects the open cone of positive definite matrices. The "equivalent problem" is obtained by reparametrizing the original quantities. This is the only place in our development, and a technical one, where we make use of a reparametrization technique. A similar argument was used in the second part of the proof of Lemma 4.12.
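A minimal numerical sketch of the normality inequality over a convex set spanned by finitely many matrices; since ⟨A − M, B⟩ is linear in A, it suffices to test the generators of the convex hull. The matrices A1, A2, A3 and the trial matrix B are illustrative choices:

```python
import numpy as np

A1 = np.diag([2.0, 1.0])
A2 = np.diag([1.0, 2.0])
A3 = np.array([[1.5, 0.5],
               [0.5, 1.5]])
M = A1                          # candidate point of the set conv{A1, A2, A3}
B = np.diag([1.0, -1.0])        # trial matrix

def is_normal(M, B, generators, tol=1e-12):
    # <A - M, B> <= 0 for all A in the convex hull  <=>  it holds at every generator
    return all(np.trace((A - M) @ B) <= tol for A in generators)

print(is_normal(M, B, [A1, A2, A3]))    # True for this example
```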

7.3. FULL RANK REDUCTION

Lemma. Let the set M ⊆ NND(k) be convex. Then there exist a number r ≤ k and a k × r matrix U with U'U = I_r such that the set M̃ = U'MU of reduced r × r moment matrices satisfies M̃ ∩ PD(r) ≠ ∅, and the reduced


information matrix mapping C_{U'K} fulfills

Proof. By Lemma 4.2, the set M contains a moment matrix M with maximum rank r, say. Let M = Σ_{i ≤ r} λ_i u_i u_i' be an eigenvalue decomposition of M. Then the k × r matrix U = (u_1, ..., u_r) satisfies U'U = I_r, and UU' projects onto the range of M. The set M̃ = U'MU contains the r × r matrix U'MU = U'UΔ_λU'U = Δ_λ, which is positive definite. Hence the intersection of M̃ and PD(r) is nonempty. Since the rank of M is maximal, the range of M includes the range of A, for all A ∈ M. By our grand assumption of Section 4.1, there is a matrix in M whose range includes the range of K; hence so does M. Recalling that UU' projects onto the range of M, we get

Now we verify the equality string displayed in the lemma, upon setting K̃ = U'K.

The first and the last equation hold by definition. The two minima are equal because the two sets of matrices over which the minima are formed are the same. Namely, given GU'MUG' with GU'K = I_s, we see that L = GU' is a left inverse of K with LML' = GU'MUG'. Conversely, given LML' with LK = I_s, we see that G = LU satisfies GU'K = LUU'K = LK = I_s and GU'MUG' = LUU'MUU'L' = LML'. This establishes the first equation of the lemma. For the second equation, we use C_{U'K}(M̃) = C_K(UM̃U'), and conclude with C_K(M) = C_K(U(U'MU)U') = C_K(UM̃U') for M̃ = U'MU. □

The lemma says something about the reparametrized system K'θ = (UU'K)'θ = (U'K)'U'θ = K̃'θ̃, where K̃ = U'K and θ̃ = U'θ. The set of information matrices C_K(M) for K'θ, as M varies over the set M of competing moment matrices, coincides with the set of reduced information matrices C_{K̃}(M̃) for K̃'θ̃, as M̃ varies over the set M̃ of reduced moment


matrices. Therefore, the moment matrix M is optimal for K'θ in M if and only if U'MU is optimal for K̃'θ̃ in M̃. We employ the lemma in the Duality Theorem 7.12. Otherwise it allows us to concentrate on the assumption that the set M contains a positive definite matrix, that is, M intersects the interior of the cone NND(k).

The following theorem provides the essential tool for our optimality investigations.

7.4. SUBGRADIENT THEOREM

Theorem. Let g : NND(k) → R be a concave function, and let the set M ⊆ NND(k) be convex and intersect the cone PD(k). Then a moment matrix M ∈ M maximizes g over M if and only if there exists a subgradient of g at M that is normal to M at M, that is, there exists a matrix B ∈ ∂g(M) such that

    ⟨A − M, B⟩ ≤ 0   for all A ∈ M.

Proof. The converse part of the proof is a plain consequence of the notions of subdifferentiability and normality: g(A) ≤ g(M) + ⟨A − M, B⟩ ≤ g(M) for all A ∈ M. The direct part is more challenging in that we need to exhibit the existence of a subgradient B that is normal to M at M. We invoke the separating hyperplane theorem in the space Sym(k) × R with scalar product ⟨(A, α), (B, β)⟩ = ⟨A, B⟩ + αβ = (trace AB) + αβ. In this space, we introduce the two sets

It is easily verified that both sets are convex. Optimality of M forces any point (A, α) in the intersection K ∩ L to fulfill α < g(A) ≤ g(M) ≤ α. This is impossible. Hence the sets K and L are disjoint. Therefore there exists a hyperplane separating K and L, that is, there exist a pair (0, 0) ≠ (B̄, β) ∈ Sym(k) × R and a real number γ such that

In other words, the hyperplane H = {(A, α) ∈ Sym(k) × R : ⟨A, B̄⟩ + αβ = γ} is such that the set K is included in one closed half space associated with H, while L is included in the opposite closed half space. In (1), we now insert (M, g(M)) ∈ K for (A, α), and (M, g(M) + ε) ∈ L for (A, α):

7.5. SUBGRAD1ENTS OF ISOTONIC FUNCTIONS

163

This yields εβ ≥ 0 for all ε > 0, whence β is nonnegative. We exclude the case β = 0: otherwise (1) turns into sup_{A ≥ 0} ⟨A, B̄⟩ ≤ inf_{A ∈ M} ⟨A, B̄⟩, with B̄ ≠ 0. That is, in Sym(k) there exists a hyperplane separating the sets NND(k) and M. But by assumption, the set M contains a matrix that is positive definite. Therefore NND(k) and M cannot be separated, and β is positive.

Knowing β > 0, we define B = −B̄/β. Replacement of γ by g(M)β + ⟨M, B̄⟩ from (2) turns the first inequality in (1), with α = g(A), into

    g(A) ≤ g(M) + ⟨A − M, B⟩   for all A ∈ NND(k).

Therefore B is a subgradient of g at M. The second inequality in (1) becomes

Letting α tend to g(M), we see that the subgradient B is normal to M at M. □

It remains to compute subdifferentials, and to this end we exploit the properties that are enjoyed by an information function. First we make use of concavity and monotonicity. Again we have in mind compositions g = φ ∘ C_K on NND(k), given by information functions φ on NND(s).

7.5. SUBGRADIENTS OF ISOTONIC FUNCTIONS

Lemma. Let the function g : NND(k) → R be isotonic and concave. Then for all M ∈ NND(k), every subgradient of g at M is nonnegative definite. Let the function φ : NND(s) → R be strictly isotonic on PD(s) and concave. Then for all C ∈ PD(s), every subgradient of φ at C is positive definite.

Proof. Let B be a subgradient of g at M ≥ 0. In the subgradient inequality g(A) ≤ g(M) + ⟨A − M, B⟩, we insert A = M + zz', with z ∈ R^k. Monotonicity yields 0 ≤ g(M + zz') − g(M), whence we get

Hence the matrix B is nonnegative definite. Let E be a subgradient of φ at C > 0. Then strict monotonicity on PD(s) entails

Therefore the matrix E is positive definite.


We now study the subdifferential of compositions of φ with the information matrix mapping C_K, where the k × s coefficient matrix K has full column rank s.

7.6. A CHAIN RULE MOTIVATION

The tool for computing derivatives of compositions is the chain rule. The definition of subgradients in Section 7.1 applies to functions with values in the real line R, with its usual (total) ordering. We extend the idea to functions with values in the linear space Sym(s), with its Loewner (partial) ordering. A subgradient mapping T of the information matrix mapping C_K at M is defined to be a linear mapping T from the space Sym(k) into the space Sym(s) that satisfies the subgradient inequality

where the inequality sign refers to the Loewner ordering in the space Sym(s). Can we find such subgradient mappings T? The answer is in the affirmative. The information matrix C_K(M) is a minimum relative to the Loewner ordering,

where L is a left inverse of K that is minimizing for M. We define a linear mapping T from Sym(k) to Sym(s) by T(A) = LAL', and claim that T is a subgradient mapping of C_K at M. Indeed, for all A ≥ 0 we have

It remains open whether all subgradient mappings of C_K at M arise in this way.

Now let φ be an isotonic and concave optimality criterion. We turn to the composition of a subgradient D of φ at C_K(M), with a subgradient mapping T(A) = LAL' of C_K at M. Thus we want to merge two subgradient inequalities,

The subgradient D is nonnegative definite, by the first part of Lemma 7.5. As seen in Section 1.11, the linear form C ↦ ⟨C, D⟩ is then isotonic relative


to the Loewner ordering. Now we apply the first inequality to E = C_K(A). Making C_K(A) larger according to the second inequality, we get, for all A ≥ 0,

Therefore the matrix T'(D) = L'DL is a subgradient of the composition φ ∘ C_K at M. However, this argument leaves open the question whether all subgradients of φ ∘ C_K at M are obtained this way. The answer is in the affirmative if M is positive definite, and in the negative if M is singular. Indeed, let F be a nonnegative definite matrix orthogonal to M, that is, F ≥ 0 and ⟨F, M⟩ = 0. Then we get 0 ≤ ⟨A, F⟩ = ⟨A − M, F⟩, for all A ≥ 0. Hence the matrix L'DL + F

is also a subgradient of φ ∘ C_K at M. It emerges as Corollary 7.8 to the (somewhat technical) next theorem that all subgradients are of this form if M is feasible for K'θ.
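A minimal numerical sketch of the chain rule construction, taking the isotonic and concave criterion φ = trace, for which D = I_s is a subgradient at every C, so that L'L is a subgradient of trace ∘ C_K at a positive definite M; the matrices K and M and the random competitors are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
k, s = 4, 2
K = rng.standard_normal((k, s))                  # full column rank almost surely

def C_K(A):
    # information matrix mapping for positive definite A
    return np.linalg.inv(K.T @ np.linalg.solve(A, K))

M = np.eye(k) + 0.1 * np.ones((k, k))            # positive definite candidate
C = C_K(M)
L = C @ K.T @ np.linalg.inv(M)                   # minimizing left inverse: L K = I_s, L M L' = C
assert np.allclose(L @ K, np.eye(s))

B = L.T @ L                                      # candidate subgradient of (trace o C_K) at M
for _ in range(200):
    X = rng.standard_normal((k, k))
    A = X @ X.T + 1e-3 * np.eye(k)               # positive definite competitor
    lhs = np.trace(C_K(A))
    rhs = np.trace(C) + np.trace((A - M) @ B)
    assert lhs <= rhs + 1e-8
```

The assertion reflects the Loewner-minimum property C_K(A) ≤ LAL' used in the text.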

7.7. DECOMPOSITION OF SUBGRADIENTS

Theorem. Let the function φ : NND(s) → R be isotonic and concave, and let M be a nonnegative definite k × k matrix with generalized information matrix M_K = KC_K(M)K' for K'θ, where the k × s matrix K has full column rank s. Then for every symmetric k × k matrix B, the following three statements are equivalent:

a. B is a subgradient of φ ∘ C_K at M.
b. K'BK is a subgradient of φ at C_K(M), and B is nonnegative definite and orthogonal to M − M_K.
c. K'BK is a subgradient of φ at C_K(M), and there exist a generalized inverse G of M and a nonnegative definite k × k matrix F orthogonal to M such that B = GM_KBM_KG' + F.

Proof. The proof builds on the properties of generalized information matrices as summarized in Theorem 3.24.


I. First we show that (a) implies (b). Let B be a subgradient of φ ∘ C_K at M. Nonnegative definiteness of B is established in Lemma 7.5. We have

With A = M_K, the subgradient inequality gives 0 ≤ ⟨M_K − M, B⟩ = −⟨M − M_K, B⟩. But M − M_K is nonnegative definite, by Lemma 3.14, and so Lemma 1.8 provides the converse inequality ⟨M − M_K, B⟩ ≥ 0. This proves B to be orthogonal to M − M_K. Setting C = C_K(M), we get ⟨M, B⟩ = ⟨M_K, B⟩ = ⟨KCK', B⟩. For every nonnegative definite s × s matrix E, we have C_K(KEK') = min_{L ∈ R^{s×k}: LK = I_s} LKEK'L' = E, giving

Thus K'BK is a subgradient of φ at C, and (b) is established.

II. Secondly we assume (b) and construct matrices G and F that fulfill (c). With R being a projector onto the nullspace of M, we introduce the k × k matrix

This definition is legitimate: F is the generalized information matrix of B for R'θ, in the terminology of Section 3.21. By Theorem 3.24, the matrix F enjoys the three properties

By (1), F is nonnegative definite. From (2), we have MF = 0, that is, F is orthogonal to M. By assumption, the matrix B is orthogonal to M − M_K, whence we have MB = M_KB and BM = BM_K. This yields

A passage to the complement turns (3) into (nullspace(B − F)) + (range M) = R^k. As in the proof of Theorem 2.16, this puts us in a position where we can find a nonnegative definite k × k matrix H with a range that is included in


the nullspace of B - F and that is complementary to the range of M,

The first inclusion means (B − F)H = 0, whence (4) gives (M + H)(B − F)(M + H) = M_KBM_K. Choosing for M the generalized inverse G = (M + H)^{−1} from Lemma 2.15, we obtain B − F = GM_KBM_KG'. This provides the representation required in part (c).

III. Thirdly we verify that (c) implies (a). For A ≥ 0, we argue that

The first inequality holds because K'BK is a subgradient of φ at C_K(M). The equality uses the representation A_K = KC_K(A)K'. For the second inequality we observe that B inherits nonnegative definiteness via M_KBM_K from K'BK, by Lemma 7.5. Thus monotonicity yields ⟨A_K, B⟩ ≤ ⟨A, B⟩. Since G'MG is a generalized inverse of M, it is also a generalized inverse of M_K, by Lemma 3.22. Inserting the assumed form of B we obtain ⟨M, B⟩ = ⟨M_KG'MGM_K, B⟩ + ⟨M, F⟩ = ⟨M_K, B⟩. □

Except for the full column rank assumption on the coefficient matrix K, the theorem is as general as can be. The matrix M need neither be positive definite nor lie in the feasibility cone A(K), and the optimality criterion φ is required to be isotonic and concave only. Part (c) tells us how to decompose a subgradient B of φ ∘ C_K at M into various bits and pieces that are simple to handle individually.

For the purpose of construction, we reverse the issue. Given a subgradient D of φ at C, a generalized inverse G of M, and a nonnegative definite matrix F orthogonal to M, is GKCDCK'G' + F a subgradient of φ ∘ C_K at M? This is indeed true provided M is feasible.

7.8. DECOMPOSITION OF SUBDIFFERENTIALS

Corollary. For every nonnegative definite k × k matrix M, the subdifferentials of φ ∘ C_K at M and of φ at C = C_K(M) fulfill the inclusion


relations

Moreover, if M lies in the feasibility cone A(K), then the three sets are equal.

Proof. The first inclusion is verified in Section 7.6. The second follows from part (c) in Theorem 7.7, upon replacing M_K by KCK' and K'BK by D. Moreover, let M be feasible. For any member B = GKCDCK'G' + F of the last set, we define L = CK'G'. Because of feasibility, Theorem 3.15 yields LK = CK'G'K = (K'M^−K)^{−1}K'M^−K = I_s and LML' = CK'G'MGKC = C. Thus B = L'DL + F is a member of the first set. □

Lemma 7.5 provides some partial insight into the subgradients D of φ at C = C_K(M), for isotonic and concave criteria φ. The following theorem says much more for information functions φ. It places all the emphasis on the polarity equation

which we have already encountered earlier, in part (c) of Lemma 5.16, and which for the matrix means φ_p is solved explicitly in Lemma 6.16.

7.9. SUBGRADIENTS OF INFORMATION FUNCTIONS

Theorem. Let φ be an information function on NND(s). Then for every pair C and D of nonnegative definite s × s matrices, the following three statements are equivalent:

a. (Subdifferential of φ)  φ(C)D ∈ ∂φ(C),
b. (Polarity equation)  φ(C)φ^∞(D) = 1 = ⟨C, D⟩,
c. (Subdifferential of φ^∞)  φ^∞(D)C ∈ ∂φ^∞(D).

In particular, the subdifferential of φ at C is

    ∂φ(C) = {φ(C)D : D ∈ NND(s) solves the polarity equation  φ(C)φ^∞(D) = 1 = ⟨C, D⟩},

provided φ(C) is positive. Moreover, if C is positive definite, then φ is subdifferentiable at C, ∂φ(C) ≠ ∅.
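A minimal numerical sketch of statement (a) for a matrix mean, assuming the solution D = C^{p−1}/trace C^p of the polarity equation from Lemma 6.16, so that φ_p(C)D should satisfy the subgradient inequality at C; the helpers mat_pow and phi_p and the example matrix are illustrative choices:

```python
import numpy as np

def mat_pow(C, a):
    lam, V = np.linalg.eigh(C)
    return (V * lam ** a) @ V.T                  # V diag(lam^a) V'

def phi_p(C, p):
    lam = np.linalg.eigvalsh(C)
    return np.exp(np.log(lam).mean()) if p == 0 else np.mean(lam ** p) ** (1.0 / p)

rng = np.random.default_rng(3)
s, p = 3, -1.0                                   # A-criterion as an example
C = np.diag([1.0, 2.0, 4.0])

D = mat_pow(C, p - 1.0) / np.trace(mat_pow(C, p))   # polarity-equation solution
assert np.isclose(np.trace(C @ D), 1.0)

B = phi_p(C, p) * D                              # claimed subgradient of phi_p at C
for _ in range(200):
    X = rng.standard_normal((s, s))
    E = X @ X.T + 1e-6 * np.eye(s)               # positive definite competitor
    assert phi_p(E, p) <= phi_p(C, p) + np.trace((E - C) @ B) + 1e-8
```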


Proof. It suffices to establish the equivalence of (a) and (b). The equivalence of (b) and (c) then follows from the polarity correspondence φ = (φ^∞)^∞ of Theorem 5.13.

First we derive (b) from (a). We insert δC with δ > 0 into the subgradient inequality, δφ(C) ≤ φ(C) + ⟨δC − C, φ(C)D⟩. Cancelling φ(C) > 0, we get δ − 1 ≤ (δ − 1)⟨C, D⟩. The values δ = 2 and δ = 1/2 yield ⟨C, D⟩ = 1. With this, the subgradient inequality simplifies to φ(E) ≤ φ(C)⟨E, D⟩ for all E ≥ 0. For positive definite matrices E, we divide by φ(E) > 0 to obtain 1 ≤ φ(C) inf_{E > 0} ⟨E, D⟩/φ(E) = φ(C)φ^∞(D). The Hölder inequality contributes the converse direction, φ(C)φ^∞(D) ≤ ⟨C, D⟩ = 1. Thus equality holds, and (b) is established.

Next we prove (a) from (b). In view of φ(C)φ^∞(D) = 1, the factor φ(C) must be positive. For E > 0, the definition of the polar function yields 1 = φ(C)φ^∞(D) ≤ φ(C)⟨E, D⟩/φ(E), and hence φ(E) ≤ φ(C) + ⟨E − C, φ(C)D⟩. Regularization extends this inequality to all matrices E ≥ 0. Hence φ(C)D is a subgradient of φ at C.

Moreover, if C is positive definite, then ∂φ(C) is nonempty. To see this, we notice that positive definiteness of C makes the set D = {D ≥ 0 : ⟨C, D⟩ ≤ 1} compact. Let D ∈ D be such that the upper semicontinuous function φ^∞ attains its maximum over D. The double polarity relation of Theorem 5.13 entails

Hence D solves the polarity equation, φ(C)φ^∞(D) = 1 = ⟨C, D⟩. A similar argument was employed in the proof of Theorem 5.13. □

For a composition of the form φ ∘ C_K, we now have a couple of ways of representing the subdifferential. If M ∈ A(K), then C = C_K(M) is positive definite, ∂(φ ∘ C_K)(M) is nonempty, and it fulfills

The first two equalities follow from Corollary 7.8. The last equality applies the present theorem to the information function φ ∘ C_K on NND(k) and the polar function N ↦ φ^∞(K'NK) from Theorem 5.14. For the matrix means φ_p, the polarity equation submits itself to the explicit solution of Lemma 6.16, thus also giving us complete command over all subgradients.


If a scalar parameter system c'θ is of interest, then φ is the identity on [0; ∞), with derivative one on (0; ∞), as remarked in Section 5.17. Hence for M ∈ A(c), we get

These expressions are familiar to us from the proof of the Equivalence Theorem 2.16 for scalar optimality, except for the matrix F. The General Equivalence Theorem 7.14 follows just the same lines. If subdifferentiability of φ ∘ C_K at M fails to hold, that is, ∂(φ ∘ C_K)(M) = ∅, then M is neither feasible for K'θ nor formally φ-optimal for K'θ in M. In this case M is of no interest for the general design problem of Section 5.15.

7.10. REVIEW OF THE GENERAL DESIGN PROBLEM

With these results from convex analysis, we now attack the design problem in its full generality. The objective is to maximize the information as measured by some information function φ on NND(s):

The set M ⊆ NND(k) of competing moment matrices is assumed to be compact and convex. The k × s coefficient matrix K is assumed to be of full column rank s, with information matrix mapping C_K. We avoid trivialities by the grand assumption of Section 4.1 that there exists at least one competing moment matrix that is feasible, M ∩ A(K) ≠ ∅.

Although our prime interest is in the design problem, we attack it indirectly by first discussing its dual problem, for the following reason. Theorem 7.4 tells us that a moment matrix M_1 is optimal if and only if there exists a subgradient B_1 with certain properties. Similarly, another moment matrix M_2 will be optimal if and only if there exists a subgradient B_2 with similar properties. In order to discuss multiplicity of optimal moment matrices, we need to know how M_1 relates to B_2. The answer is given by the dual problem, which represents B (more precisely, its scaled version N = B/φ(C)) in a context of its own.

We start with a family of upper bounds for the optimal value v(φ). Derivation of these bounds is based on the polar function φ^∞ of Section 5.12, but elementary otherwise.


7.11. MUTUAL BOUNDEDNESS THEOREM FOR INFORMATION FUNCTIONS

Theorem. Let M ∈ M be a competing moment matrix, and let N be a matrix in the set

    N = {N ∈ NND(k) : trace AN ≤ 1 for all A ∈ M}.

Then we have φ(C_K(M)) ≤ 1/φ^∞(K'NK), with equality if and only if M and N fulfill conditions (1), (2), and (3) given below. More precisely, upon setting C = C_K(M), we have

with respective equality if and only if

Proof. Inequality (i) and equality condition (1) are an immediate consequence of how the set N is defined. Inequality (ii) uses M ≥ M_K = KCK' and monotonicity of the linear form A ↦ trace AN, from Theorem 3.24 and Section 1.11. Equality in (ii) means that N is orthogonal to M − M_K ≥ 0. By Section 6.15, this is the same as condition (2). Inequality (iii) is the Hölder inequality from Section 5.12, leading to condition (3). □

The theorem generalizes the results for scalar optimality of Theorem 2.11, suggesting that the general design problem is accompanied by the dual problem:

The design problem and the dual problem bound each other in the sense that every value of one problem provides a bound for the other, φ(C_K(M)) ≤ 1/φ^∞(K'NK). Another way of expressing this relation is to write


For a moment matrix M to attain the supremum and therefore to be formally optimal for K'θ, that is, disregarding the identifiability condition M ∈ A(K), it is sufficient that there exists a matrix N ∈ N such that φ(C_K(M)) = 1/φ^∞(K'NK). We now appeal to the Subgradient Theorem 7.4 and the Full Rank Reduction Lemma 7.3 to show that the condition is also necessary.

7.12. DUALITY THEOREM

Theorem. We have

In particular, a moment matrix M ∈ M is formally φ-optimal for K'θ in M if and only if there exists a matrix N ∈ N such that

Moreover, any two matrices M ∈ M and N ∈ N satisfy φ(C_K(M)) = 1/φ^∞(K'NK) if and only if they jointly satisfy the three conditions (1), (2), and (3) of Theorem 7.11.

Proof. There exists a moment matrix M ∈ M that is formally optimal for K'θ in M, and the optimal value v(φ) is positive, by Lemma 5.16. Thus C = C_K(M) satisfies φ(C) = v(φ) > 0. In the first part of the proof, we assume that the set M contains a positive definite matrix, M ∩ PD(k) ≠ ∅. Then Theorem 7.4 secures the existence of a subgradient B of φ ∘ C_K at M that is normal to M at M. From Theorem 7.9, the matrix N = B/φ(C) satisfies the polarity equation φ(C)φ^∞(K'NK) = trace MN = 1. Since B is normal to M at M, so is N:

Thus the matrix N is a member of the set N, and it satisfies φ(C) = 1/φ^∞(K'NK). Hence M is an optimal solution of the design problem, N is an optimal solution of the dual problem, and the two problems share a common optimal value.

In the second part, we make do with the assumption that the set M meets the feasibility cone A(K). From Lemma 7.3, we first carry through a full rank reduction, based on the k × r matrix U and the set M̃ = U'MU that appear there. Because of M̃ ∩ PD(r) ≠ ∅, the first part of the present proof


is applicable and yields

Here we have set Ñ = {Ñ ∈ NND(r) : trace ÃÑ ≤ 1 for all Ã ∈ M̃}. We claim that this set is related to the set N ⊆ NND(k) of Theorem 7.11 through

For the direct inclusion, we take a matrix Ñ ∈ Ñ and define N = UÑU'. Then M̃ = U'MU implies N ∈ N, since trace MN = trace MUÑU' = trace M̃Ñ ≤ 1 for all M ∈ M. Furthermore, U'U = I_r entails U'NU = Ñ, whence Ñ is a member of U'NU. For the converse inclusion, we are given a matrix N ∈ N. From the proof of Lemma 7.3, we borrow

Thus Ñ = U'NU lies in Ñ, since trace M̃Ñ = trace U'MUU'NU = trace MN ≤ 1 for all M ∈ M. This proves (2). Now we may continue in (1),

and observing UU'K = K concludes the proof. □

We illustrate the present theorem with the pathological example of the parabola fit model of Section 6.5. We consider the design which places equal mass 1/2 on the two points ±1. Its information matrix for the full parameter vector θ is singular,

Under the trace criterion, the information for θ is φ_1(M) = (1/3) trace M = 1. Now we define the matrix N = I_3/3 > 0. For the moment matrix M_2(τ) of an arbitrary design τ on T = [−1; 1], we compute


Thus the matrix N lies in the set N, providing for the design problem the upper bound

Since φ_1(M) attains this bound, the moment matrix M is formally optimal. However, M provides only two degrees of freedom for three unknown parameters, whence feasibility fails to hold, as pointed out in Section 6.5. The counterpart of this nonexistence example are the following three sufficient conditions which secure that every formally optimal moment matrix is automatically feasible.
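A minimal numerical sketch of this parabola fit example, assuming the regression vector f(t) = (1, t, t²)' on T = [−1; 1]; the grid check of the dual constraint is illustrative:

```python
import numpy as np

f = lambda t: np.array([1.0, t, t * t])       # regression vector for the parabola fit

M = 0.5 * np.outer(f(1.0), f(1.0)) + 0.5 * np.outer(f(-1.0), f(-1.0))
N = np.eye(3) / 3.0

# N lies in the dual set:  x'Nx <= 1 for every regression vector on T = [-1, 1]
ts = np.linspace(-1.0, 1.0, 2001)
assert max(f(t) @ N @ f(t) for t in ts) <= 1.0 + 1e-12

# the trace criterion attains the dual bound ...
assert np.isclose(np.trace(M) / 3.0, 1.0) and np.isclose(np.trace(M @ N), 1.0)
# ... yet M is singular, hence only formally optimal (not feasible for the full vector theta)
assert np.linalg.matrix_rank(M) == 2
```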
7.13. EXISTENCE THEOREM FOR OPTIMAL MOMENT MATRICES

Theorem. There exists a moment matrix M ∈ M that is formally φ-optimal for K'θ in M, and the optimal value v(φ) is positive. In order that every formally φ-optimal moment matrix for K'θ in M lies in the feasibility cone A(K), and thus is φ-optimal for K'θ in M, any one of the following conditions is sufficient:

a. (Condition on M) The set M is included in the feasibility cone A(K).
b. (Condition on φ) The information function φ vanishes for singular matrices.
c. (Condition on φ^∞) The polar information function φ^∞ vanishes for singular matrices and is strictly isotonic on PD(s).

Specifically, for the matrix means φ_p with parameter p ∈ [−∞; 1), there exists a φ_p-optimal moment matrix for K'θ in M.

Proof. Parts (a) and (b) are copied from Lemma 5.16, and part (c) is the first half of condition (c) of that lemma. Hence it suffices to show that the second half is always true. But for every formally optimal matrix M, there exists a solution N ∈ N of the dual problem, and M and N jointly satisfy conditions (1), (2), (3) of Theorem 7.11. Hence C = C_K(M) and D = K'NK solve the polarity equation of Lemma 5.16.

The arguments concerning the matrix means φ_p are compiled in Theorem 6.13. For p ∈ [−∞; 0], they vanish for singular matrices C (see (2) in Section 6.7). For p ∈ (0; 1), the polar function sφ_q vanishes for singular matrices and is strictly isotonic on PD(s). □

For the design problem, we are interested in feasible moment matrices only. Therefore we recast the Duality Theorem 7.12 into the following form, which is the key result of optimal design theory. The assumptions underlying the design problem are those from Section 7.10.


7.14. THE GENERAL EQUIVALENCE THEOREM

Theorem. Let M ∈ M be a competing moment matrix that is feasible for K'θ, with information matrix C = C_K(M). Then M is φ-optimal for K'θ in M if and only if there exists a nonnegative definite s × s matrix D that solves the polarity equation

and there exists a generalized inverse G of M such that the matrix N = GKCDCK'G' satisfies the normality inequality

In case of optimality, equality obtains in the normality inequality if for A we insert M, or any other matrix M̃ ∈ M that is φ-optimal for K'θ in M.

Proof. For the direct part, we do not need the feasibility assumption. Let M be a formally φ-optimal moment matrix for K'θ in M. We cannot appeal directly to the Subgradient Theorem 7.4, since there we require M ∩ PD(k) ≠ ∅ while here we only assume M ∩ A(K) ≠ ∅. Instead we piece things together as follows. The Duality Theorem 7.12 provides a matrix N ∈ N with φ(C) = 1/φ^∞(K'NK). Conditions (3), (2), and (1) of Theorem 7.11 yield

Theorem 7.9 states that φ(C)N is a subgradient of φ ∘ C_K at M. From Corollary 7.8, it has a representation

with D̄ ∈ ∂φ(C), G ∈ M^−, and 0 ≤ F ⊥ M. It follows from Lemma 7.5 that the matrix D = D̄/φ(C) is nonnegative definite. From part (b) of Theorem 7.9, it also solves the polarity equation. The matrix N̄ = GKCDCK'G' ≤ GKCDCK'G' + F/φ(C) = N satisfies trace AN̄ ≤ trace AN ≤ 1, by the trace monotonicity of Section 1.11 and because of N ∈ N.

The converse refers to the more elementary Theorem 7.11 only. The normality inequality shows that the matrix N = GKCDCK'G' lies in the set N. By feasibility of M, we have K'GK = C^{−1} and K'NK = D. Now the polarity equation gives φ(C) = 1/φ^∞(K'NK). Hence M and N are optimal solutions of the primal problem and of the dual problem, respectively.

In case of optimality, M and any other optimal matrix M̃ ∈ M satisfy trace M̃N = 1 = trace MN, by condition (1) of Theorem 7.11. □


We record the variants that emerge if the full parameter vector θ is of interest, and if maximization takes place over the full set M(Ξ) of all moment matrices.

7.15. GENERAL EQUIVALENCE THEOREM FOR THE FULL PARAMETER VECTOR


Theorem. Let M ∈ M be a competing moment matrix that is positive definite. Then M is φ-optimal for θ in M if and only if there exists a nonnegative definite k × k matrix N that solves the polarity equation

and that satisfies the normality inequality

In case of optimality, equality obtains in the normality inequality if for A we insert M, or any other matrix M̃ ∈ M that is φ-optimal for θ in M.

Proof. For the full parameter vector θ, any feasible moment matrix M ∈ M is positive definite, by Theorem 3.15. In the General Equivalence Theorem 7.14, we then have G = M^{−1}, K = I_k, and C_K(M) = M. The polarity equation and the normality inequality simplify accordingly. □

7.16. EQUIVALENCE THEOREM

Theorem. Let M ∈ M(Ξ) be a moment matrix that is feasible for K'θ, with information matrix C = C_K(M). Then M is φ-optimal for K'θ in M(Ξ) if and only if there exists a nonnegative definite s × s matrix D that solves the polarity equation

and there exists a generalized inverse G of M such that the matrix N = GKCDCK'G' satisfies the normality inequality

In case of optimality, equality obtains in the normality inequality if for x we insert any support point x_i of any design ξ ∈ Ξ that is φ-optimal for K'θ in Ξ.


Proof. Since M(Ξ) contains the rank 1 moment matrices A = xx', the normality inequality of Theorem 7.14 implies the present normality inequality. Conversely, the present inequality implies that of Theorem 7.14, for if the moment matrix A ∈ M(Ξ) belongs to the design η ∈ Ξ, then we get trace AN = Σ_{x ∈ supp η} η(x) x'Nx ≤ 1.

In the case of optimality of M, let ξ ∈ Ξ be a design that is φ-optimal for K'θ in Ξ, with moment matrix M̃ = M(ξ) and with support points x_1, ..., x_ℓ ∈ X. Then x_i'Nx_i < 1 for some i entails the contradiction trace M̃N = Σ_{i ≤ ℓ} ξ(x_i) x_i'Nx_i < 1 = trace M̃N. □

7.17. EQUIVALENCE THEOREM FOR THE FULL PARAMETER VECTOR

Theorem. Let M ∈ M(Ξ) be a moment matrix that is positive definite. Then M is φ-optimal for θ in M(Ξ) if and only if there exists a nonnegative definite k × k matrix N that solves the polarity equation

and that satisfies the normality inequality

In case of optimality, equality obtains in the normality inequality if for x we insert any support point x_i of any design ξ ∈ Ξ that is φ-optimal for θ in Ξ.

Proof. The proof parallels that of the previous two theorems. □

7.18. MERITS AND DEMERITS OF EQUIVALENCE THEOREMS

Equivalence theorems entail an amazing range of consequences. It is worthwhile to pause and reflect on what we have achieved so far. Theorem 7.14 allows for a fairly general set M of competing moment matrices, requiring only that M is a compact and convex subset of the cone NND(k), and that it contains at least one feasible matrix for K'θ. It is for this generality of M that we term it a General Equivalence Theorem. Most practical applications restrict attention to the set M(Ξ) of all moment matrices, and in such cases we simply speak of an Equivalence Theorem. Exhibit 7.3 provides an overview.

In any case the optimality characterization comes in two parts, the polarity equation and the normality inequality. The polarity equation refers to s × s matrices, as does the information function φ. The normality inequality applies to k × k matrices, as does the information matrix mapping C_K. Thus


EXHIBIT 7.3 A hierarchy of equivalence theorems. A General Equivalence Theorem (GET) allows for a compact and convex subset M of the set M(Ξ) of all moment matrices. Over the full set M(Ξ) we speak of an Equivalence Theorem (ET). Either theorem simplifies if interest is in the full parameter vector θ.

the two parts parallel the two components φ and C_K of the composition φ ∘ C_K.

In order to check the optimality of a candidate matrix M, we must in some way or other compare M with the competing matrices A. Equivalence theorems contribute a way of comparison that is linear in A, and hence simple to evaluate. More involved computations do appear, but need only be done once, for the optimality candidate M, such as determining the information matrix C = C_K(M), a solution D of the polarity equation, and an appropriate generalized inverse G of M. The latter poses no problem if M is positive definite; then G is the regular inverse, G = M^{−1}. In general, however, the choice of the generalized inverse G does matter. Some versions of G satisfy the normality inequality, others do not.

Yet more important, for the matrix means φ_p, reference to the polarity equation disappears. Lemma 6.16 exhibits all of its solutions. If p is finite then the solution is unique.
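A minimal numerical sketch of this unique solution and of the polarity formula φ_p^∞ = sφ_q, assuming D = C^{p−1}/trace C^p from Lemma 6.16 and the conjugacy q = p/(p − 1); the helpers and the example matrix are illustrative choices:

```python
import numpy as np

def mat_pow(C, a):
    lam, V = np.linalg.eigh(C)
    return (V * lam ** a) @ V.T

def phi_p(C, p):
    lam = np.linalg.eigvalsh(C)
    return np.exp(np.log(lam).mean()) if p == 0 else np.mean(lam ** p) ** (1.0 / p)

C = np.array([[3.0, 1.0],
              [1.0, 2.0]])
s = C.shape[0]

for p in (-3.0, -1.0, 0.5):                      # finite p, p != 0
    q = p / (p - 1.0)                            # conjugate number: 1/p + 1/q = 1
    D = mat_pow(C, p - 1.0) / np.trace(mat_pow(C, p))   # unique polarity-equation solution
    assert np.isclose(np.trace(C @ D), 1.0)
    assert np.isclose(phi_p(C, p) * s * phi_p(D, q), 1.0)   # phi_p(C) * (s phi_q)(D) = 1
```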

7.19. GENERAL EQUIVALENCE THEOREM FOR MATRIX MEANS

Theorem. Consider a matrix mean φ_p with parameter p finite, p ∈ (−∞; 1]. Let M ∈ M be a competing moment matrix that is feasible for K'θ, with information matrix C = C_K(M). Then M is φ_p-optimal for K'θ in M if and only if there exists a generalized inverse G of M that satisfies the normality inequality


In case of optimality, equality obtains in the normality inequality if for A we insert M, or any other matrix M̃ ∈ M that is φ_p-optimal for K'θ in M. Specifically, let M ∈ M be a competing moment matrix that is positive definite. Then M is φ_p-optimal for θ in M if and only if

Proof. The polarity equation has the solution D = C^{p−1}/trace C^p, by Lemma 6.16. Hence the General Equivalence Theorem 7.14 specializes to the present one. □

ALTERNATE PROOF. If the optimality candidate M is positive definite, then the theorem permits an alternate proof based on differential calculus, provided we are ready to use the facts that the functions f_p(X) = trace X^p, for p ≠ 0, and f_0(X) = det X are differentiable at a positive definite matrix C, with gradients

    ∇f_p(X) = pX^{p−1},        ∇f_0(X) = (det X) X^{−1}.

The chain rule then yields the gradient of the matrix mean φ_p at C,

    ∇φ_p(C) = (φ_p(C)^{1−p}/s) C^{p−1}   for p ≠ 0,        ∇φ_0(C) = (φ_0(C)/s) C^{−1}.

Because of positive definiteness of M, the information matrix mapping C_K becomes differentiable at M. Namely, utilizing twice that matrix inversion X^{−1} has at M the differential X ↦ −M^{−1}XM^{−1}, the differential of C_K(X) = (K'X^{−1}K)^{−1} at M is found to be

    X ↦ C K'M^{−1}XM^{−1}K C,   with C = C_K(M).

Composition with the gradient of φ_p at C yields the gradient of φ_p ∘ C_K at M,

    ∇(φ_p ∘ C_K)(M) = (φ_p(C)^{1−p}/s) M^{−1}K C^{p+1} K'M^{−1}.

That is, the gradient is proportional to M^{−1}KC^{p+1}K'M^{−1}. Hence it is normal to M at M if and only if the normality inequality of Theorem 7.19 holds true. □
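A minimal numerical sketch comparing this gradient formula with central finite differences; the matrices K and M, the values of p, and the tolerances are illustrative choices:

```python
import numpy as np

def mat_pow(C, a):
    lam, V = np.linalg.eigh(C)
    return (V * lam ** a) @ V.T

def phi_p(C, p):
    lam = np.linalg.eigvalsh(C)
    return np.mean(lam ** p) ** (1.0 / p)

rng = np.random.default_rng(4)
k, s = 4, 2
K = rng.standard_normal((k, s))
M = np.eye(k) + 0.2 * np.ones((k, k))
Minv = np.linalg.inv(M)
C = np.linalg.inv(K.T @ Minv @ K)

def g(X, p):
    return phi_p(np.linalg.inv(K.T @ np.linalg.inv(X) @ K), p)

for p in (-1.0, 0.5):
    grad = phi_p(C, p) ** (1 - p) / s * (Minv @ K @ mat_pow(C, p + 1) @ K.T @ Minv)
    for _ in range(5):
        H = rng.standard_normal((k, k)); H = (H + H.T) / 2     # symmetric direction
        eps = 1e-6
        fd = (g(M + eps * H, p) - g(M - eps * H, p)) / (2 * eps)
        assert np.isclose(fd, np.trace(grad @ H), rtol=1e-4)
```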


7.20. EQUIVALENCE THEOREM FOR MATRIX MEANS

Theorem. Consider a matrix mean φ_p with parameter p finite, p ∈ (−∞; 1]. Let M be a moment matrix that is feasible for K'θ, with information matrix C = C_K(M). Then M is φ_p-optimal for K'θ in M(Ξ) if and only if there exists a generalized inverse G of M that satisfies the normality inequality

In case of optimality, equality obtains in the normality inequality if for x we insert any support point x_i of any design ξ ∈ Ξ that is φ_p-optimal for K'θ in Ξ. Specifically, let M ∈ M(Ξ) be a moment matrix that is positive definite. Then M is φ_p-optimal for θ in M(Ξ) if and only if

    x'M^{p−1}x ≤ trace M^p   for all x ∈ X.

Proof.

The results are special cases of Theorem 7.19.
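A minimal numerical sketch of the specialization for the full parameter vector, taking the D-criterion p = 0, for which the condition reads x'M^{−1}x ≤ k; it is checked for the classical quadratic fit on [−1; 1] with the equal-weight three-point design, a standard D-optimal example. The grid resolution is an illustrative choice:

```python
import numpy as np

f = lambda t: np.array([1.0, t, t * t])          # quadratic fit on T = [-1, 1]

support = [-1.0, 0.0, 1.0]                        # equal-weight three-point design
M = sum(np.outer(f(t), f(t)) for t in support) / 3.0
Minv = np.linalg.inv(M)

# D-criterion (p = 0): normality inequality  x' M^{-1} x <= trace M^0 = k  on the regression range
k = 3
ts = np.linspace(-1.0, 1.0, 4001)
vals = np.array([f(t) @ Minv @ f(t) for t in ts])
assert vals.max() <= k + 1e-9

# equality is attained at the support points of the optimal design
assert all(np.isclose(f(t) @ Minv @ f(t), k) for t in support)
```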

For the smallest-eigenvalue criterion φ_{−∞}, Lemma 6.16 provides many solutions of the polarity equation, unless the smallest eigenvalue of C has multiplicity 1. Some care is needed to appropriately accommodate this situation.

7.21. GENERAL EQUIVALENCE THEOREM FOR E-OPTIMALITY

Theorem. Let M ∈ M be a competing moment matrix that is feasible for K'θ, with information matrix C = C_K(M). Then M is φ_{−∞}-optimal for K'θ in M if and only if there exist a nonnegative definite s × s matrix E with trace equal to 1 and a generalized inverse G of M that satisfy the normality inequality

In case of optimality, equality obtains in the normality inequality if for A we insert M, or any other matrix M̃ ∈ M that is φ_{−∞}-optimal for K'θ in M; and any matrix E appearing in the normality inequality and any moment matrix M̃ ∈ M that is φ_{−∞}-optimal for K'θ in M, with information matrix C̃ = C_K(M̃), fulfill

where the set S consists of all rank 1 matrices zz' such that z ∈ R^s is a norm 1 eigenvector of C corresponding to its smallest eigenvalue.


Specifically, let M ∈ M be a competing moment matrix that is positive definite. Then M is φ_{−∞}-optimal for θ in M if and only if there exists a nonnegative definite k × k matrix E with trace equal to 1 such that

Proof. We write λ = λ_min(C), for short. For the direct part, we solve the polarity equation in the General Equivalence Theorem 7.14. Lemma 6.16 states that the solutions are D = E/λ with E ∈ conv S, where S comprises the rank 1 matrices zz' formed from the norm 1 eigenvectors z of C that correspond to λ. Hence if M is φ_{−∞}-optimal, then the normality inequality of Theorem 7.14 implies the present one.

For the converse, the fact that trace E = 1 entails that every spectral decomposition E = Σ_{j ≤ r} α_j z_j z_j' represents E as a convex combination of the rank 1 matrices z_1z_1', ..., z_rz_r'. The eigenvectors z_j that come with E are eigenvectors of C corresponding to its smallest eigenvalue, λ. To see this we notice that the nonnegative definite s × s matrix F = C − λI_s and the matrix E satisfy

The first inequality exploits the monotonicity behavior discussed in Section 1.11. The middle equality expands C into CK'G'MGKC. The last inequality uses the special case trace MGKCECK'G' ≤ λ of the normality inequality. Thus we have trace FE = 0. Section 6.15 now yields FE = 0, that is, CE = λE. Postmultiplication by z_j gives Cz_j = λz_j, for all j = 1, ..., r, giving E ∈ conv S. It follows that the matrix D = E/λ solves the polarity equation and satisfies the normality inequality of the General Equivalence Theorem 7.14, thus establishing φ_{−∞}-optimality of M.

Now let M̃ ∈ M be another moment matrix that is φ_{−∞}-optimal for K'θ in M, with information matrix C̃. Then M and M̃ share the same optimal value, λ_min(C̃) = λ. Since the dual problem has optimal solution N = GKCECK'G'/λ, condition (1) of Theorem 7.11 yields trace M̃N = 1. Hence the normality inequality is an equality for A = M̃. Moreover, with condition (2) of Theorem 7.11, we may continue 1 = trace M̃N = trace C̃E/λ. Therefore the nonnegative definite s × s matrix F̃ = C̃ − λI_s is orthogonal to E. Again we get F̃E = 0, thus establishing property (1). Postmultiplication of (1) by z_j shows that z_j is an eigenvector of C̃, whence follows (2). □

7.22. EQUIVALENCE THEOREM FOR E-OPTIMALITY

Theorem. Let M be a moment matrix that is feasible for K'θ, with information matrix C = C_K(M). Then M is φ_{−∞}-optimal for K'θ in M(Ξ) if


and only if there exist a nonnegative definite s × s matrix E with trace equal to 1 and a generalized inverse G of M that satisfy the normality inequality

In case of optimality, equality obtains in the normality inequality if for x we insert any support point x_i of any design ξ ∈ Ξ that is φ_{−∞}-optimal for K'θ in Ξ; and any matrix E appearing in the normality inequality and any moment matrix M̃ ∈ M(Ξ) that is φ_{−∞}-optimal for K'θ in M(Ξ), with information matrix C̃ = C_K(M̃), fulfill conditions (1) and (2) of Theorem 7.21.

Specifically, let M ∈ M(Ξ) be a moment matrix that is positive definite. Then M is φ_{−∞}-optimal for θ in M(Ξ) if and only if there exists a nonnegative definite k × k matrix E with trace equal to 1 such that

Proof.

The theorem is an immediate consequence of Theorem 7.21.

If M is φ_{−∞}-optimal for K'θ in M and the smallest eigenvalue λ of C_K(M) has multiplicity 1, then the matrix E in the two preceding theorems is uniquely given by zz', where z ∈ R^s is a norm 1 eigenvector of C_K(M) corresponding to λ. If the smallest eigenvalue has multiplicity greater than 1, then little or nothing can be said about which matrices E ≥ 0 with trace E = 1 satisfy the normality inequality. It may happen that all rank 1 matrices E = zz' that originate with eigenvectors z fail, and that any such matrix E must be positive definite.

As an illustration, we elaborate on the example of Section 2.18, with regression range X = {x ∈ R² : ||x|| ≤ 1}. The design ξ with ξ(e_1) = 1/2 = ξ(e_2) has moment matrix M = I_2/2 and is φ_{−∞}-optimal for θ in Ξ. Indeed, the positive definite matrix E = I_2/2 satisfies x'Ex = ||x||²/2 ≤ 1/2 = λ_min(M), for all x ∈ X. On the other hand, the norm 1 eigenvectors of M corresponding to 1/2 are the vectors z ∈ R² with ||z|| = 1. For x = z, we get x'zz'x = 1 > λ_min(M). Hence no rank 1 matrix E = zz' fulfills the normality inequality.

The same example illustrates that φ_{−∞}-optimality may obtain without any scalar optimality property. For every coefficient vector c ≠ 0, the Elfving Theorem 2.14 shows that the unique optimal design for c'θ is the one-point design in c/||c|| ∈ X, with optimal variance (ρ(c))² = ||c||², where ρ is the Elfving norm of Section 2.12. Indeed, the variance incurred by the φ_{−∞}-optimal moment matrix I_2/2 is twice as large, c'(I_2/2)^{−1}c = 2||c||².

Nevertheless, there are some intriguing interrelations with scalar optimality which may help to isolate a φ_{−∞}-optimal design. Scalar optimality always comes into play if the smallest eigenvalue of the φ_{−∞}-optimal information matrix has multiplicity 1.
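A minimal numerical sketch of this unit-disc example, assuming the design with mass 1/2 at each of the unit vectors e_1 and e_2; since x'Ex is a nonnegative quadratic form, checking the boundary of the disc suffices, and the grid is an illustrative choice:

```python
import numpy as np

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
M = 0.5 * np.outer(e1, e1) + 0.5 * np.outer(e2, e2)     # = I_2 / 2
lam_min = np.linalg.eigvalsh(M).min()                    # = 1/2

thetas = np.linspace(0.0, 2 * np.pi, 2000)
X = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # boundary of the unit disc

E = np.eye(2) / 2.0                                      # positive definite E with trace 1
assert max(x @ E @ x for x in X) <= lam_min + 1e-12      # normality inequality holds

z = np.array([1.0, 0.0])                                 # a norm-1 eigenvector of M
E_rank1 = np.outer(z, z)
assert max(x @ E_rank1 @ x for x in X) > lam_min         # every rank-1 choice fails here
```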

7.24. E-OPTIMALITY, SCALAR OPTIMALITY, AND ELFVING NORM

183

7.23. E-OPTIMALITY, SCALAR OPTIMALITY, AND EIGENVALUE SIMPLICITY Theorem. Let M M. be a competing moment matrix that is feasible for K'O, and let z e Rs be an eigenvector corresponding to the smallest eigenvalue of the information matrix Cx(M). Then M is <^_oo-optimal for K'B in M and the matrix E = zz'/\\z\\2 satisfies the normality inequality of Theorem 7.21 if and only if M is optimal for z'K'O in M. If the smallest eigenvalue of C#(M) has multiplicity 1, then M is </>-oooptimal for K'B in M. if and only if M is optimal for z'K'O in M.. Proof. We show that the normality inequality of Theorem 7.21 for <_oooptimality coincides with that of Theorem 4.13 for scalar optimality. With E = zz'/\\z\\2, the normality inequality in Theorem 7.21 reads

The normality inequality of Theorem 4.13 is

With c = Kz the two left hand sides are the same. So are the right hand sides, because of c'M~c = z'K'M~Kz = z'C~lz = \\z\\2/\mia(CK(M)). If the smallest eigenvalue of CK(M} has multiplicity 1 then the only choice for the matrix E in Theorem 7.21 is E = zz'/\\z\\2. Under the assumption of the theorem, the right hand sides of the two normality inequalities in the proof are the same, ||z|| 2 /A m i n (C/f(M)) = z'K'M~Kz. The following theorem treats this equality in greater detail. It becomes particularly pleasing when the optimal solution is sought in the set of all moment matrices, Af (E). Then the ElfVing Theorem 2.14 represents the optimal variance for z'K'O as (p(Kz))2, where p is the Elfving norm of Section 2.12. 7.24. E-OPTIMALITY, SCALAR OPTIMALITY, AND ELFVING NORM Theorem. fulfill Every moment matrix M e M(H) and every vector 0 ^ z R5

184

CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

If equality holds, then M is </>_oo-optimal for K'O in M(E) and any moment matrix M e M(E) that is </>_oo-optimal for K'O in M(E) is also optimal for z'K'e inM(H). Proof. Let M 6 M(H) be a moment matrix that is feasible for K'6. Then we get 0 7^ ATz range /C C range M C (#), where C(X} is the regression range introduced in Section 2.12. It follows that Kz has a positive ElfVing norm, p(Kz) > 0. If M is not feasible for K'O, then the information matrix C = CK(M) is singular, by Theorem 3.15, and \min(CK(M)) = 0 < (\\z\\/p(Kz))2. Otherwise maximization of the smallest eigenvalue of C > 0 is the same as minimization of the largest eigenvalue of C'1 = K'M'K. Using AmaxCC" 1 ) = maxjiju^i z'C~lz, we have for all 0 / z IR5, the trivial inequalities

If equality holds for M and z, then attaining the upper bound (\\z\\/ p(Kz))2, the moment matrix M is 4>-oo-optimal for K'O in A/(S). For every (^-oo-optimal moment matrix M for K'O in A/(E), we get

Now (p(tfz)) 2 = z 'K'M'Kz shows that M is optimal for z 'K'O in M(E). For ||z || = 1, the inverse optimal variance l/(p(Kz))2 is the maximum information for z'K'O in M(S). Therefore the theorem establishes the mutual boundedness of the problems of maximizing the smallest eigenvalue A m i n (M) as M varies over M(S), and of minimizing the largest information for z'K'O as z varies over the unit sphere in Rs. The two problems are dual to each other, but other than in Theorem 7.12 here duality gaps do occur, as is evidenced by the example at the end of Section 7.22. Duality does hold in the parabola fit model of Section 2.21. Indeed, the vector c = (1,0,2)' fulfills ||c||/p(c) = 1/%/S. The optimal moment matrix M for c'B has smallest eigenvalue 1/5. The present theorem proves that M is $-00-optimal for 0 in A/(S). This will guide us to determine the </>_oo-optimal design for polynomial fit models of arbitrary degree d > 1, in Section 9.13.

EXERCISES

185

In the present chapter, our concern has been to establish a set of necessary and sufficient conditions for a moment matrix to be optimal. We have concentrated on moment matrices M rather than on designs . In the next chapter we explore what consequences these results imply for the optimality of a design , in terms of the support points and weights of .

EXERCISES 7.1 Let T be a linear mapping from a scalar product space C into a scalar product space /C. The transposed operator T' is defined to be the unique linear mapping from /C into C that satisfies (Tx,y) = (x,T'y) for all jt e , y e /C. Find the transposed operator of (i) T(x) = Ax : Uk -> R", where A e R"x*, (ii) T(A) = LAL' : Sym(k) - Sym(s), where L e Rs*k. 7.2 For p G (-00; 1], prove that <f>p is differentiable in C > 0 by showing that the subdifferential is a singleton,

7.3 Prove that </>_oo is differentiable in C > 0 if and only if the smallest eigenvalue of C has multiplicity 1, by discussing whether d<f>-oo(C) = conv 5 is a singleton or not. 7.4 For p e (0; 1), show that (f>p is not subdifferentiable at singular matrices

7.5 True or false: if <(C) > 0, then 6<f>(Q ^ 0? 7.6 Let </> be an information function, and let C > 0 be such that <f> (C) = 0. Show that D is a subgradient of <f> at C if and only if D is orthogonal to C and 0(D) > 1. 7.7 For the dual problem of Section 7.11, show that if N M is an optimal solution, then so is NK(K'NK)-K'N. 7.8 Show that a moment matrix M e M n A(K) is </>-optimal for K'O in .M if and only if there exists a solution D e NND(s) of the polarity equation and there exists a left inverse L e R ixAr of K that is minimizing for M such that N = L 'DL satisfies the normality inequality.

186

CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

7.9 Demonstrate by example that in the Equivalence Theorem 7.16, the generalized inverse G cannot generally be taken to be the MoorePenrose inverse M+ [Pukelsheim (1981), p. 17]. 7.10 Demonstrate by example that in the Equivalence Theorem 7.16, the domain of the normality inequality cannot generally be reduced from x X to x X n range M [Pukelsheim (1981), p. 17]. 7.11 Use the General Equivalence Theorem to deduce Theorem 4.13. 7.12 Show that any matrix E e NND(s) with trace E = 1 satisfies (i) E < Is, and (ii) c'Ec = c'c if and only if E - cc'/\\c\\2, where 0 ^ c e R5. 7.13 For the two-point regression range X = < (]), (Jj i, show that 72 is the unique 4>-<x-optimal matrix for 0 in Af(E). Which matrices E of

O. O- !(?)>!(!!). i( -r.1) satisfyTheorem 7-22?

7.14 Show that ^(^oo) < r2, where f(^_oo) is the <_oo-optimal value for 6 in E and r is the Euclidean in-ball radius of the Elfving set Tl =

conv(;ru (-#)).

7.15

(continued) Demonstrate by example that f(</>_oo) < r2 is possible.

CHAPTER 8

Optimal Moment Matrices and Optimal Designs

The interrelation between optimal moment matrices and optimal designs is studied. Necessary conditions on optimal support points are derived in terms of their number, their location, and their weights. The optimal weights on linearly independent regression vectors are given as a fixed point of a nonlinear equation. Multiplicity of optimal moment matrices is characterized by a linear relationship provided the criterion is strictly concave.

8.1. FROM MOMENT MATRICES TO DESIGNS The General Equivalence Theorem 7.14 concentrates on moment matrices. This has the advantage of placing the optimization problem in a linear space of finite dimensions. However, the statistical interest is in the designs themselves. Occasionally we experience a seamless transition between optimal moment matrices and optimal designs, such as in the Elfving Theorem 2.14 on scalar optimality. However, the general rule is that the passage from moment matrices to designs is difficult, and limited in its extent. All we can aim at are necessary conditions that aid in identifying the support points and the weights of optimal designs. This is the major theme of the present chapter. First we establish an upper bound on the number of support points. The bound applies to all designs, not just to optimal ones. The theorem states that every k x k moment matrix A e M (H) is achieved by a design 17 which has at most \k(fc + 1) + 1 support points, and that for a feasible moment matrix A, the 5 x 5 information matrix CK(A) (possibly scaled by some 8 > 1) is realized by a design with a support size that obeys the tighter bound ^s(s + 1) + s(rank A - s).

187

188

CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

8.2. BOUND FOR THE SUPPORT SIZE OF FEASIBLE DESIGNS Theorem. Assume that the k x s coefficient matrix K of the parameter system K'O is of full column rank s. Then for every moment matrix A e M(H), there exists a design 17 e H such that

For every design 17 H that is feasible for K'O, there exists a desig such that

Proof. From Lemma 1.26, the set of all moment matrices M(H) admits a representation as a convex hull,

The set M(H) is a convex and compact subset of the linear space Sym(&) of symmetric matrices. The latter has dimension \k(k + 1), whence (1) and (2) are immediate consequences of the Caratheodory theorem. For the tighter bound (5), we replace the linear space Sym(A:) by a hyperplane in the range of an appropriate linear transformation T on Sym(A;). The construction of T is reminiscent of the full rank reduction in Lemma 7.3. Let the moment matrix A/ of 17 have rank r. Feasibility entails s < r < k. We choose some k x r matrix U with U'U = lr such that UU' projects onto the range of M. Then the matrix U'M~K is well defined, as follows from the preamble in the proof of Theorem 4.6. We claim that U'M'K has rank s. To this end we write UU' = MG where G is a generalized inverse of M (see Section 1.16). From MG (UU1)1 = G'M and M2G2M2 = MG'MGM2 = M2, we infer G2 e (M2)~. Now we get K'G'UU'GK = K'G'MG2K = K'(M2)~K. By Lemma 1.17, the last matrix has rank s, and so has U'GK = U'M~K. With a left inverse L of U'M~K, we define the projector R lr U'M'KL. Now we introduce the linear transformation T through

In view of T(A) = T(UU'AUU'}, the range of T is spanned by T(UBU') = B - R 'BR with B e Sym(r). For counting dimensions, we may choose R to

8.2. BOUND FOR THE SUPPORT SIZE OF FEASIBLE DESIGNS

189

be in its simplest form

This induces the block partitioning

The symmetric sxs matrix B\\ contributes ^s(s+\} dimensions, while another s(r - s) dimensions are added by the rectangular s x (r - s) matrix 512 = B2'r Therefore the dimension of the range of T is equal to |s(.s + l) + ,s(r-s). In the range of T, we consider the convex set M generated by the finitely many matrices T(xx') which arise from the support points x of 17,

Since M. contains T(M}, the scale factor e = inf{5 > 0 : T(M) 8M} satisfies e < 1. If e = 0 then T(M) = 0, and RU'M'K = 0 yields the contradiction

Hence e > 0, and we may define 8 = l/e > 1. Our construction ensures that 5T(M) lies in the boundary of M.. Let T~t be the hyperplane in the range of T that supports M in 8T(M). Thus we obtain 8T(M) H n M = conv(H n 5(17)). By the Caratheodory theorem, 8T(M) is a convex combination of at most

members of H n 5(rj). Hence there exists a design such that T(M()) = dT(M), the support points of are support points of 17, and the support size of is bounded by (5). It remains to establish formula (3). We have rangeM() C range M. This entails UU'M()UU' = M(). Because of RU'M'K = 0, we obtain

Evidently the range of K is included in the range of M() whence M() is feasible for K'B. Premultiplication by K'M(g)~ and inversion finally yield In the case of optimality, we obtain the following corollary.

190

CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

8.3. BOUND FOR THE SUPPORT SIZE OF OPTIMAL DESIGNS Corollary- Let $ be an information function. If there exists a </> -optimal moment matrix M for K'O in M(H), then there exists a </>-optimal design for K'B in H such that its support size is bounded according to

Proof. Every design with a feasible moment matrix for K'O needs at least s support points. This is so because the summation in range M () ZLcesupp f (range **') must extend over at least s terms in order to include the range of K, by Lemma 2.3. For a design 17 that has M for its moment matrix, we choose an improved design as in Theorem 8.2, with support size bounded from above by (5) of Theorem 8.2. Positive homogeneity and nonnegativity of <f> yield

Optimality of M forces the design to be optimal as well. If there are many optimal designs then all obey the lower bound, but some may violate the upper bound. The corollary only ascertains that at least one optimal design respects both bounds simultaneously. For scalar optimality, we have 5 = 1. Thus if the moment matrix M is optimal for c'd in A/(H), then there exists an optimal design for c'O in H such that

The upper bound k also emerges when the Caratheodory theorem is applied in the context of the Elfving Theorem 2.14. For the full parameter vector 6, we have s = k, and the bounds become

In Section 8.6, we provide examples of optimal designs that require \k(k + \} support points. On the other hand, polynomial fit models illustrate attainment of the lower bound k in the very strict sense that otherwise a design becomes inadmissible (see Theorem 10.7). Before studying the location of the support points we single out a matrix lemma. 8.4. MATRIX CONVEXITY OF OUTER PRODUCTS Lemma. Let Y and Z be two k x s matrices. Then we have, relative to the Loewner ordering,

8.5. LOCATION OF THE SUPPORT POINTS OF ARBITRARY DESIGNS

191

for all a (0; 1), with equality if and only if Y Z. Proof. With X (1 - a)Y + Z, the assertion follows from

The following theorem states that in order to find optimal support points, we need to search the "extreme points" X of the regression range X only. Taken literally, this does not make sense. The notion of an extreme point applies to convex sets and X generally fails to be convex. Therefore we first pass to the Elfving set 72. = conv(#u (-<)) of Section 2.9. Since 72. is convex, it has points which are extreme, that is, which do not lie on a straight line connecting any other two distinct points of 72.. Convex analysis tells us that every extreme point of 72, is a member of the generating set X U (-X). We define those extreme points of 72. which lie in X to form the subset X.

8.5. LOCATION OF THE SUPPORT POINTS OF ARBITRARY DESIGNS Theorem. Let X be the set of those regression vectors in X that are extreme points of the Elfving set 72. = conv(^ U ( X ) ] . Then for every design rj 6 H with support not included in X, there exists a design e H with support included in X such that

Proof. Let x\,...,xt e X be the support points of 17. Being closed and convex, the set 72. is the convex hull of its extreme points. Hence for every i = 1,...,^, there exist extreme points V n , . . . , V j r t . of 72. such that Jt, = ]/<M( otijyij, with min;<M( ,; > 0 and )y-<fl. a,; = 1. Since at least one support point jc, of TJ is assumed not to Be an extreme point of 7, Lemma 8.4 yields *,-*/ ^ ,<, a^iyy/y Thus we get M(TJ) =

192

CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

where the design is denned to allocate weight 17 (*/),; to the point y<;, for / = 1,... ,^ and ;' = 1,... ,n,. The points yij are extreme in 72, and hence contained in X or in X. If y/; e X is extreme in 72. then, because of point symmetry of 71, -y/y- e ^ is also extreme in 71. From the fact that (yi/)(y/;)' -ft;)'/;' we may replace Vij X \yy yij X. This proves the existence of a design with supp C #, and with moment matrix Af () improving upon M(TJ). The preceding result relates to the interior representation of convex sets by sweeping mass allocated in the interior of the ElfVing set into the extreme points out on the boundary. This improvement applies to arbitrary designs 77 e 5. For designs that are optimal in H the support points */ must attain equality in the normality inequality of the Equivalence Theorem 7.16, x-Nxi = 1. This property relates to the exterior representation of convex sets, in that the matrix N induces a cylinder which includes the Elfving set 71. Altogether the selection of support points of designs optimal in H is narrowed down in two different ways, to search the extreme points x of 72. that lie in X, or to solve the quadratic equation x'Nx = 1 that arises with an optimal solution N of the dual problem. Polynomial fit models profit from the second tool, as detailed in Section 9.5. The following example has a regression range X so simple that plain geometry suggests concentrating on the extreme points X.

8.6. OPTIMAL DESIGNS FOR A LINEAR FIT OVER THE UNIT SQUARE Over the unit square as the regression range, X = [0;1]2, we consider a two-way first-degree model without constant term,

also called a multiple linear regression model. We claim the following. Claim. For p [-00; 1), the unique design p H that is ^-optimal for 6 in H, assigns weights w(p) to (J) and \(l - w(p)) to (J) and (J), where

Proof. The proof of our claim is arranged in three steps, for p > -oo. The case p = oo is treated separately, in a fourth step.

8.6. OPTIMAL DESIGNS FOR A LINEAR FIT OVER THE UNIT SQUARE

193

I. The regression vectors x X that are extreme points of 7 visibly are

For p (oo; 1), the matrix mean (f>p is strictly isotonic on PD(2), by Theorem 6.13. Therefore Theorem 8.5 forces the ^-optimal design p to have its support included in X, whence its moment matrix takes the form

II. By interchanging the weights W2 and H>3 we create the design p, with moment matrix

If p and p are distinct, w>2 / n>3, then their mixture r\ = \(p + p) improves upon p. Namely, the matrix mean <f>p is strictly concave on PD(s), by Theorem 6.13, and the matrices M(p) and M(p} share the same eigenvalues. This entails

contradicting the optimality of p. Thus gp and p are equal, and we have W2 = H>3 = i(l Wi). Upon setting w(p) = w\, the moment matrix of gp becomes

III. It now suffices to maximize cf>p over the one-parameter family

This maximization is carried out by straightforward calculus. The matrix Mw has eigenvalues \(\ + 3w) and |(1 - H>). Hence the objective function fp = <f>p o M is

194

CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

EXHIBIT 8.1 Support points for a linear fit over the unit square. For p e (-00; 1), the support of the (ftp-optimal design p for d comprises the three extreme points (Q)>(?))(}) in 72. n X. The 0-oo-optimal design -00 is supported by the two points (Q), ("), and has optimal information one for c'6 with c = \(\}-

The unique critical point of fp is w(p) as given above. It lies in the interior of the interval [0; 1], and hence cannot but maximize the concave function f IV. For ^-oo-optimality, the limiting value lim/,_>_00 w(p) = 0 suggests the candidate matrix M0 = \h- Indeed, the smallest eigenvalue ^(1 - w) of Mw is clearly maximized at w = 0. Thus our claim is proved. See also Exhibit 8.1. The argument carries over to the linear fit model over the A>dimensional cube [0; \}k (see Section 14.10). The example is instructive from various perspectives. The smallest-eigenvalue optimal design _oo makes do with a minimum number of support points (two). Moreover, there is an intriguing relationship with scalar optimality. The eigenvector corresponding to the smallest eigenvalue of the matrices Mw is c = (_?j). It is not hard to see that -< is uniquely optimal for c'O in H. This is akin to Theorem 7.23. For p > oo, the designs gp require all three points in the set X for their support, and hence attain the upper bound |/c(A:+l) = 3 of Theorem 8.3. The

8.7. OPTIMAL WEIGHTS ON LINEARLY INDEPENDENT REGRESSION VECTORS

195

average-variance optimal design _j has weight vv(-l) = (\/3-l)/(A/3+3) = 0.1547 which, very roughly, picks up half the variation between w(-oo) = 0 and H-(O) = 1/3. The determinant optimal design & distributes the weight 1/3 uniformly over the three points in X, even though the support size exceeds the minimum two. (In Corollary 8.12 we see that a determinant optimal design always has uniform weights l//c if the support size is a minimum, k.) The design with u>(l) = 1 is formally (fo-optimal for 6 in H. This is a onepoint design in (|), and fails to be feasible for 0. Theorem 9.15 proves this to be true in greater generality. This concludes our discussion of the number and the location of the support points of optimal designs. Next we turn to the computation of the optimal weights. Theorem 8.7 and its corollaries apply to the situation where there are not very many support points: x\,..., xt are assumed to be linearly independent. 8.7. OPTIMAL WEIGHTS ON LINEARLY INDEPENDENT REGRESSION VECTORS Theorem. Assume that the i regression vectors Xi,...,xi X C Rk are linearly independent, and form the rows of the matrix X Uexk, that is, X' = (*i,.. .,*). Let H be the set of designs for which the support is included in {*],... ,xe}. Let 6 H be a design that is feasible for K'O, with information matrix C = CK(M(g)}. Then the design is $-optimal for K'6 in H if and only if there exists a nonnegative definite s x s matrix D solving the polarity equation ^(C)^(D) = trace CD = 1 such that the weights w,- = (*,-) satisfy

where a\\,... ^a^ are the diagonal elements of the nonnegative definite i x t matrix A = UCDCU' with U = (XX')'1XK. Proof. If Xi is not a support point of , then we quite generally get a,-,- = 0. To see this, we may assume that the initial r weights are positive while the others vanish, w i , . . . , wr > 0 = wr+i = wf. Then x\,...,xr span the range of Af, by Lemma 2.3. Because of feasibility, there exists some r x s matrix H such that

By assumption, X

has full row rank , whence


For i>r, we get

(H',G)ei = 0 and an = 0, by the definitions of A and U.

196

CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

Let Aw be the diagonal matrix with weight vector w = ( w j , . . . , we)' e Ue on the diagonal. The moment matrix M of may then be represented as M =X'bwX. Hence

is a specific generalized inverse of M, where A+ is the diagonal matrix with diagonal entries 1/vv, or 0 according as tv, > 0 or w, = 0. With the Euclidean unit vector et of Uk we have x, X'ei. Thus we obtain

which equals a/y/w? or 0 according as w, > 0 or H>, = 0. For the direct part of the proof, we assume to be <f> -optimal in E. The Equivalence Theorem 7.16 provides a solution D > 0 of the polarity equation and some generalized inverse G e M~ such that

If xi is a support point of , then (2) holds with equality, while the left hand side is au/wj. This yields w, = ^/a^. If *, is not a support point of , then Wii = 0 = an, as shown in the beginning of the proof. For the converse, (2) is satisfied with G GQ since the left hand side only takes the values 1 or 0. By Theorem 7.16, the design is optimal For the matrix means </>p with parameter p 6 (-00; 1], the polarity equation has solution D = Cp~1/^ trace Cp, by Lemma 6.16. Hence the matrix A = UCDCU' is proportional to B = UCf>+lU'. With diagonal elements b\\,...,bw of B, the optimal weights satisfy

From the optimal value becomes

we obtain

Hence

Application to average-variance optimality is most satisfactory. Namely, if p = -1, then B = UU' only involves the given regression vectors, and no weights.

8.9. C-OPTIMAL WEIGHTS ON LINEARLY INDEPENDENT REGRESSION VECTORS

197

8.8. A-OPTEVfAL WEIGHTS ON LINEARLY INDEPENDENT REGRESSION VECTORS Corollary. A design is <_i -optimal for K' 9 in H if and only if the weights w, = (*,) satisfy

where 6 n , . . . , f t ^ are the diagonal elements of the matrix B = UU' with U = (XX'Y^XK. The optimal value is Proof. cussion. The corollary is the special case p = I from the preceding dis-

If X is square and nonsingular, t = k, then we obtain U = X' lK.li the full parameter B is of interest, then K = Ik and B = (XX1)'1. The averagevariance optimal weights for the arcsin support designs in Section 9.10 are computed in this way. If a scalar parameter system c'O is of interest, s = I, then (XX1 )~lXc is a vector, and matters simplify further. We can even add a statement guaranteeing the existence of linearly independent regression vectors that support an optimal design.

8.9. C-OPTIMAL WEIGHTS ON LINEARLY INDEPENDENT REGRESSION VECTORS Corollary. There exist linearly independent regression vectors x\,..., Xi in X that support an optimal design for c'O in H. The weights w, = (*,) satisfy

where u\,...,ut are the components of the vector u (XX')~lXc. optimal variance is

The

Proof. An optimal design 17 for c'd in H exists, according to the Elfving Theorem 2.14, or the Existence Theorem 7.13. From Section 8.3, we know that k support points suffice, but there nothing is said about linear independence.

198

CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

If the optimal design TJ has linearly dependent support points x\,... , x f , then the support size can be reduced while still supporting an optimal design , as follows. The ElfVing Theorem 2.14 provides the representation

where yt = (*/)*/. Because of linear dependence there are scalars / A I , . . . , /A, not all zerOj such that 0 = ^2i<e /A,V,. We set /AO = T^,i<t M/- If Po is negative, then we replace the scalars /A, by /A,-. Hence t*Q>0 and at least one of the scalars /A, is positive. Now we investigate the "likelihood ratios" i7(jc/)//A/ and introduce

say. Because of

we may rewrite (1) as

with Wf = rj(xi) a/A/. The numbers w, are nonnegative, for if /A, < 0, then this is obvious, and if tt, > 0, then ^(jc/)//^, > a by definition of a. We apply the ElfVing norm p to (2) and use the triangular inequality to obtain By the ElfVing Theorem 2.14, the design with weights (*,) = w/ is optimal for c'Q. It has a smaller support size than 77, because o We may continue this reduction process until it leaves us with support points that are linearly independent. Finally Corollary 8.8 applies, with As an example we consider the line fit model, with experimental domain T = [-1; 1). We claim the following. Claim. The design that assigns to the points jci = (\), x2 = (j), the weights

is optimal for c'6 in H, with optimal variance Proof. The set of regression vectors that are extreme points of the ElfVing set is X = {*i,Jt2}. By Theorem 8.5, there exists an optimal design in H

8.11. OPTIMAL WEIGHTS ON GIVEN SUPPORT POINTS

199

with support included in X. The present corollary then proves our claim, by way of

Without the assumption of linear independence, we are left with a more abstract result. First we single out a lemma on the Hadamard product, that is, entrywise multiplication, of vectors and matrices. 8.10. NONNEGATIVE DEFINITENESS OF HADAMARD PRODUCTS Lemma. Let A and B be nonnegative definite i x i matrices. Then their Hadamard product A * B = ((,,>,,)) is nonnegative definite. Proof. Being nonnegative definite, B admits a square root representation B = UU' with U = (MI, ... ,ue) R / x / . For jc e R , we then get

8.11. OPTIMAL WEIGHTS ON GIVEN SUPPORT POINTS Theorem. Assume that the design E is </>-optimal for K'O in H. Let E e Rsxs be a square root of the information matrix, CK(M(g)) C EE'. Let D NND(s') and G M()~ satisfy the polarity equation and the normality inequality of the Equivalence Theorem 7.16. Let the support points *!,...,* of form the rows of the matrix X e IR* x *,thatis,^' = (jci,... ,jCf), and let the corresponding weights H>, = (*,) form the vector w e IR. Then, with A = XGKE(E'DE}1I2E'K'G'X' e NND(^), the weight vector w solves

and the individual weights w, satisfy H>, < I/a? < A ma x(CD), for all / = 1,..., i.

200

CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

Proof. The proof builds on expanding the quadratic form x/GKCDCK'G'xf from Theorem 7.16 until the matrix M() appears in the middle. We set N = GKCDCK'G', and v/ = E'K'G'x; for i < L Theorem 7.16 states that

We expand E'DE into (E'DE)ll2ls(E'DE)1/2. Since C is nonsingular, so is E. From E'-lE~l = C~l = K'M(^)~K = K'G'M()GK, we get 7S = E'K'G'M(C)GKE. Inserting M() = << *>;*;*/> we continue in (1),

This proves (A*A)w = l f . From the lower bound a?M>, for (2), we obtair wi < 1/fl2/. On the other hand, we may bound (1) from above by

Since E'DE and EE'D = CD share the same eigenvalues, this yields By Lemma 8.10, the coefficient matrix A *A of the equation (A*A)w = le is nonnegative definite. The equation is nonlinear in w since A depends on w, through G, E, and D. Generally, the fixed points w of the equation (A * A)w = lt appear hard to come by. For the matrix means <f>p with parameter p e (-00; 1], the solution of the polarity equation is D = Cp~l/lrace Cp, by Lemma 6.16. With E = C1/2 we find that A is proportional to XGKCP/^K'G'X1. For p = -2 the power p/2 + 1 vanishes, but the optimal moment matrix still enters through a generalized inverse G. For the full parameter vector 6, we get G = M"1 and A oc XMpl2'lX'. For a scalar system c'O we obtain ,4 oc XGcc'G'X. The polarity equation of the General Equivalence Theorem 7.14 entails Amax(CZ)) < trace CD 1. Hence the upper bound A ma x(CD) on the weights H>, is not unreasonable. For the matrix means <f>p with p e (oo;l], we get A m ax(CD) = A max (C p )/trace Cp. While in general this bound depends on

8.13. MULTIPLICITY OF OPTIMAL MOMENT MATRICES

201

the optimal information matrix C, an exception emerges for determinant optimality.

8.12. BOUND FOR DETERMINANT OPTIMAL WEIGHTS Corollary. Every $o-optimal design for K'd in 5 has all its weights bounded from above by 1/5. Moreover, every (^-optimal design for 6 in H that is supported by k regression vectors has uniform weights l/k. Proof. With D = C~l/s, we get wt < A max (C>) = A max (/,)/5 = 1/5. Moreover, if the full parameter vector 8 is of interest, then w, < l/k. If there are no more than k support points, then each weight must attain the upper bound l/k in order to sum to 1 Next we turn to multiplicity of optimal designs. We assume that the information function 4> is strictly concave, forcing uniqueness of the optimal information matrix C. Optimal moment matrices M need not be unique, but are characterized as solutions of a linear matrix equation.

8.13. MULTIPLICITY OF OPTIMAL MOMENT MATRICES Theorem. Assume that the information function </> is strictly concave on PD(5). Let the moment matrix M At be </>-optimal for K'd in M, with generalized inverse G of M that satisfies the normality inequality of the General Equivalence Theorem 7.14. Then any other moment matrix M e M is also </>-optimal for K'0 in M if and only if

Proof. For the direct part, we relate the given optimal solution M of the primal problem to the optimal solution N = GKCK(M}DCK(M}K'G' of the dual problem of Section 7.11. We obtain NK = GKCK(M)D and K'NK = D. Any other optimal moment matrix M, and the dual optimal solution N that comes with M, jointly fulfill equation (2) of Theorem 7.11. Postmultiplication of this equation by K gives

At this point, we make use of the strict concavity of <f> on PD(5), twice. Firstly, Corollary 5.5 states that < is strictly isotonic on PD(5). This forces D

202

CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

to be positive definite, for otherwise z 'Dz = 0 and z ^ 0 lead to the contradiction

much as in the proof of Theorem 5.16. Therefore we may cancel D in (1). Secondly, strict concavity entails uniqueness of the optimal information matrix. Cancelling CK(M) CK(M) in (1), we now obtain the equation MGK = K. For the converse part we note that M is feasible for K'9. Premultiplying MGK = K^by K'M~, we obtain K'M'K = K'M~MGK = K'GK = K'M~K. Thus MJias the same information matrix for K'O as has M. Since M is optimal, so is M. The theorem may be reformulated in terms of designs. Given a $-optimal moment matrix M for K'6 in M and a generalized inverse G of M from the General Equivalence Theorem 7.14, a design is 0-optimal for K'9 in H if and only if M()GK = K.

For general reference, we single out what happens with the matrix means

8.14. MULTIPLICITY OF OPTIMAL MOMENT MATRICES UNDER MATRIX MEANS

Corollary. If p e (-00; 1), then given a moment matrix M e M that is <p-optimal for K'O in M, any other moment matrix M e M is also <f>poptimal for K'6 in M if and only if MGK - K, with G e M~ satisfying the normality inequality of Theorem 7.19. Given a moment matrix M e M that is </>_oo-optimal for K'6 in M, then for any other moment matrix M M. to be also <_oo-optimal for K'Q in Ai, it is sufficient that MGK = K, and it is necessary that MGKE = KE, with G e M~ and E satisfying the normality inequality of Theorem 7.21. Proof. For p e (-oo;l) the matrix means <f>p are strictly concave on PD(s), by Theorem 6.13. Hence Theorem 8.13 applies. For p = oo, we continue equation (1) in the proof of Theorem 8.13 using D = E/A m i n (C), from Lemma 6.16, and CE = A min (C) = C, from (1) in Theorem 7.21.

8.16. MATRIX MEAN OPTIMALITY FOR COMPONENT SUBSETS

203

In the remainder of this chapter we treat more specialized topics: simultaneous optimality relative to all matrix means, and matrix mean optimality if the parameter system of interest K'd comprises the first s out of all k components of 0, or if it is rank deficient. If a moment matrix remains ^-optimal while the parameter p of the matrix means runs through an interval (p\,p2), then optimality extends to the end points p\ and p2, by continuity. The extreme cases p\ ~ -oo and p2 ~ 0 are of particular interest since <_oo and <fo are continuous extensions of the matrix means <f)p with p (-00;0) (see Section 6.7).

8.15. SIMULTANEOUS OPTIMALITY UNDER MATRIX MEANS Lemma. If M is <f>p-optimal for K'6 in M, for all p e (-00;0), then M is also <_oo-optimal and (^-optimal for K'6 in M. Proof. Fix a competing matrix A e M. For every p (-00; 0) we have <!>P(CK(M}} > ({>P(CK(A)). By continuity the inequality extends top = -oo,0. Hence M is ^-optimal also for For the first 5 out of the k components Q\,..., Bk, we partition any moment matrix according to

with s x s block A/n, (k - s) x (k - s) block A/22, and s x k block M\i = A/2'r The vectors x = (xi,...,xk)' e Uk such that jcs+1 = = Jt* = 0 form a subspace which we denote by IRS x {0}. Theorem 7.20 and Theorem 7.22 specialize as follows.

8.16. MATRIX MEAN OPTIMALITY FOR COMPONENT SUBSETS Corollary. Let M e M (E) be a moment matrix such that its range includes the leading s coordinate subspace, Rs x {0}, with information matrix If p (oo; 1], then M is ^-optimal for (6\,..., 8s)f in M (S) if and only if there exists some s x (k - s) matrix B such that

c = MH M^M^MII.

In case of optimality, B satisfies BA/22 = ^tu-

204

CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

The matrix M is <_oo-optimal for (#1,..., 6S)' in M(H) if and only if there exist a nonnegative definite s x s matrix E with trace equal to 1 and some s x (k s) matrix B such that

In case of optimality, B satisfies EBM-& = -EM\2Proof. We set K = (/s,0)' and choose a generalized inverse G of M according to Theorem 7.20. Partitioning the 5 x k matrix CK'G' = (A,B) into a left s xs block A and a right s x(k-s) block B, we postmultiply by K to get /I - CK'G'K = 7S. This yields GKCDCK'G' = (IS,B)'D(IS,B). The normality inequality now follows with Z) = Cp~1/ trace C^ and Z) = /Amin(C') from Lemma 6.16. In the case of optimality, we have, upon setting N = (1S,B)'D(IS,B),

Because of condition (2) of Theorem 7.11, we may equate the bottom blocks to obtain DBM22 -DM\2. If p > -oo, then D cancels; if p = -oo, then AminCC") cancels. In order to discuss parameter systems K'0 with a k x s coefficient matrix of less than full column rank, we briefly digress to discuss the basic properties of Moore-Penrose matrix inverses.
8.17. MOORE-PENROSE MATRIX INVERSION

Given a rectangular matrix A e Ukxs, its Moore-Penrose inverse A+ is defined to be the unique s x k matrix that solves the four Moore-Penrose equations

The Moore-Penrose inverse of A obeys the formulae

Therefore, in order to compute A+, it suffices to find the Moore-Penrose inverse of the nonnegative definite matrix AA', or of A 'A. If a nonnegative definite matrix C has r positive eigenvalues A l 5 . . . , A r , counted with

8.18. MATRIX MEAN OPTIMALITY FOR RANK DEFICIENT SUBSYSTEMS

205

their respective multiplicities, then any eigenvalue decomposition permits a straightforward transition back and forth between C and C + ,

It is easily verified that the matrix A+ obtained by (1) and (2) solves the Moore-Penrose equations. One can also show that this solution is the only one. 8.18. MATRIX MEAN OPTIMALITY FOR RANK DEFICIENT SUBSYSTEMS If K is rank deficient and a moment matrix M is feasible for K'O, M e A(K), then the dispersion matrix K'M~K is well defined, but has at least one vanishing eigenvalue, by Lemma 1.17. Hence K'M~K is singular and regular inversion to obtain the information matrix C fails. Instead one may take recourse to generalized information matrices, to a reparametrization argument, or to Moore-Penrose inverses. Indeed, the positive eigenvalues of K'M~K and (K'M~K}+ are inverses of each other, as are all eigenvalues of K'M~K and (K'M~K)~l if K has full column rank s. On the other hand, the matrix mean <j>p(C) depends on C only through its eigenvalues. Therefore we introduce a rank deficient matrix mean <p', by requiring it to be the vector mean of the positive eigenvalues \l,...,\r of C = (K'M~K)+:

The definition of the rank deficient matrix mean <p is not consistent with how the matrix mean </>p is defined in Section 6.7. If p e [-00; 0], then the ordinary matrix mean 4>p vanishes identically for singular matrices C, whereas </>p' does not. Nevertheless, with the rank deficient matrix means </>p', the equivalence theorems remain valid as stated, except that in Theorem 7.19, negative powers of C = (K'M'KY must be interpreted as positive powers of K'M'K. This leads to the normality inequality trace AGKC(K'M~K)l-pCK'G' < trace C(K'M-K)l-p for all A e M.

In Theorem 7.21, A m i n (C) becomes the smallest among the positive eigenvalues of C, entailing l/A min (C) - \max(K'M~K). An alternate approach leading to the same answers is based on reparametrization, K'6 UH'd, with some k x r matrix H which has full column

206

CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

rank r and with some s x r matrix U which satisfies U'U = lr. Then we get

Respecting multiplicities, the positive eigenvalues of (K 'M K)+ are the same as all eigenvalues of (H'M~H)~l. Therefore, by adopting (H'M~H)~l as the information matrix for K'O, the resulting optimality criterion coincides with the one that is built on Moore-Penrose inversion, <f>p((H'M~H)~l) = *p(\l,...,\,) = 4>;((K'M-Kr). Furthermore, if the coefficient matrix K is a k x k orthogonal projector, then Moore-Penrose inversion and the generalized information matrices of Section 3.21 lead to the same answer. Namely, for a feasible moment matrix M its generalized information matrix for K'O is MK = K(K'M-K)-K'. With K = K2 = K', it is easy to verify that (K'M~K)+ = MK. Such coefficient matrices arise with the system of centered contrasts in twoway classification models. There the coefficient matrix is the ax a centering matrix Ka. Equivalently, by filling up with zero matrices to obtain a k x k matrix, k = a + b, the coefficient matrix may be taken to be the orthogonal k x k projector

8.19. MATRIX MEAN OPTIMALITY IN TWO-WAY CLASSIFICATION MODELS For the centered contrasts of the factor A in a two-way classification model, Section 4.8 establishes Loewner optimality of the product designs rs' in the set T(r) of designs with given treatment marginals r. Relative to the full set T of all a x b block designs we now claim the following. Claim. The equireplicated product designs las' with arbitrary column sum vector s are the unique ^'-optimal designs for the centered contrasts of factor A in the set T of all block designs, for every p e [-00; 1]; the optimal contrast information matrix is Ka/a, and the optimal value is I/a. Proof. The centered contrasts of factor A have a rank deficient coefficient matrix

An equireplicated product design 1as' has row sum vector 1a = la/a, that is, the levels i = 1,..., a of factor A are replicated an equal number of times. Its

8.19. MATRIX MEAN OPTIMALITY IN TWO-WAY CLASSIFICATION MODELS

207

moment matrix Af, a generalized inverse G, and the product GK are given by

(see Section 4.8). Hence the standardized dispersion matrix is K'GK aKa, and contrast information matrix is C = Ka/a. The powers of C are Cp+l = Ka/ap+l. Let A e M(T) be a competing moment matrix,

If the row sum vector of W coincides with one given before, r = r, then we get K'GAGK = a2Ka&rKa. The left hand side of the normality inequality of Section 8.18 becomes

The right hand side takes on the same values, trace Cp = (a - l)/ap. This proves ^'-optimality of 1as' for Kaa in T for every p [-oo;l], by Theorem 7.19 and Lemma 8.15. Uniqueness follows from Corollary 8.14, since by equating the bottom blocks in

we obtain aW 'Ka = 0 and W 1as'. The optimal value is found to be

In other words, the optimal information for the contrasts is inversely proportional to the number of levels. This completes the proof of our claim. A maximal parameter system may also be of interest. The expected response decomposes into a mean effect, the centered contrasts of factor A, and the centered contrasts of factor B,

208 tem

CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

This suggests an investigation of the parameter sys-

Here we claim that simultaneous <p'-optimality pertains to the uniform design 1al b. This is the unique product design that is equireplicated and equiblocksized. It assigns uniform weight \/(ab) to each combination (/,;') of factor A andB. Claim. The uniform design 1al b is the unique ^'-optimal design for the maximal parameter system (1) in the set T of all block designs, for every p [-00; 1]; the dispersion matrix D = K'(M(1 alb)]~K and the </>p'-optimal value are

Proof, With the generalized inverse G of M = M(l al b ) as in Section 4.8, we get MGK K for the coefficient matrix K from (1). Hence M is feasible. Furthermore C = D+ is readily calculated from D, and

say. With regression vectors

from Section 1.5, the normality inequality of Theorem 7.19 becomes (e/,d/)Af(e/,<*/)' = 1 + (a - l)/aP + (b - l)/bP = trace CP. Therefore the uniform design is (^'-optimal, for every p e [-00; 1]. Uniqueness follows from Corollary 8.14 since

forces r = s = and W = This proves the claim.

EXERCISES

209

In the following chapter we continue our discussion of designs that are optimal under the matrix means <f>p, with a particular emphasis on 0o-, $_i-, $_oo-, and <fo-optimality in polynomial fit models. EXERCISES 8.1 In the third-degree polynomial fit model over T = [-1; 1], consider the scalar parameter system c'S with c (1,2,4,8)'. Show that the optimal design for c'O in T assigns weights 5/52, 12/52, 20/52, 15/52 to the points -1,-1/2,1/2,1, and has information 26~2 [Hoel and Levine (1964), p. 1557]. 8.2 (continued) Show that the optimal design for c'O on the equispaced support points -1,-1/3,1/3,1 assigns weights 35/464, 135/464, 189/ 464, 105/464, and has information 29~2 and efficiency 0.8. Show that the uniform design on l,l/3 is 65% efficient for c'O. 8.3 In the proof of Corollary 8.16, show that equality of the top blocks in MNK = KCK'NK yields BM2i = -M12Af22M2i, and that this identity is implied by BM22 = ^128.4 Show that the solution to the Moore-Penrose equations is unique. 8.5 Show that (PA)+ = (PA)+P and (AQ)+ = Q(AQ)+, for all A e R BX *, orthogonal n x n projectors P, and orthogonal k x k projectors Q. 8.6 Show that A+ > B+ if and only if rank A = rank B, for all B > A > 0 [Milliken and Akdeniz (1977)]. 8.7 Show that K+AK+t > (K'A~K)+, for all A <E A(K) [Zyskind (1967), Gaffke and Krafft (1977)]. 8.8 Show that if U 6 R^ xr has rankr and $ is an information function on NND(r), then A i> tf>(U'AU) is an information function on NND(fc). 8.9 (continued) Furthermore assume range U = range K and U'U = Ir. Show that the matrix means satisfy <f>p(U'AKU) <&p(\i(AK),..., \r(AK)) for all A e NND(fc). 8.10 (continued) Show that if K' = K+, then <}>p(U'AKU) = <}>p((K'A-K)+) for all A A(K).

CHAPTER 9

D-, A-, E-, T-Optimality

Optimality of moment matrices and designs for the full parameter vector in the set of all designs is discussed, with respect to the determinant criterion, the average-variance criterion, the smallest-eigenvalue criterion, and the trace criterion. Optimal designs are computed for polynomial fit models, where designs with arcsin support are seen to provide an efficient alternative. In trigonometric fit models, optimal designs are obtained not only for infinite sample size, but even for every finite sample size. Finally we illustrate by example that the same design may be optimal in models with different regression functions.

9.1. D-, A-, E-, T-OPTIMALITY The most popular optimality criteria in the design of experiments are the determinant criterion fa, the average-variance criterion </>_i, the smallesteigenvalue criterion <_oo, and the trace criterion <fo, introduced in Section 6.2 to Section 6.5. The equivalence theorems for these criteria take a particularly simple form, and even more so if the optimum is sought in the set of all moment matrices, M = A/(E), and if the parameter system of interest is the full mean parameter vector 0 (see Theorem 7.20 and Theorem 7.22). In the present chapter, we discuss these criteria in greater detail. To begin with, we introduce yet another, global criterion. It turns out to be equivalent to the determinant criterion even though it looks entirely unrelated.

9.2. G-CRITERION A special case of scalar optimality arises if the experimenter wishes to investigate x '0, the mean value for the response that comes with a particular regression vector x /(/). The performance of a design with a feasible moment matrix M is then measured by the information value CX(M) = (x'M~x)~l
210

9.3. BOUND FOR GLOBAL OPTIMALITY

211

for x'6, or equivalently, by the standardized variance x'M x of the optimal estimator x'O. However, if the experimenter is interested, not just in a single point x'6, but in the regression surface x H+ x'O as x varies over the regression range X, then a global performance measure is called for. The following criterion g concentrates on the smallest possible information and provides a natural choice for a global criterion. It is defined through

A design E H is called globally optimal in M(H) when its moment matrix M satisfies g(M) = sapAM^g(A). Thus we guard ourselves against the worst case, by maximizing the smallest information over the entire regression range X. Traditionally one prefers to think in terms of variance rather than information, as pointed out in Section 6.17. The largest variance over X is

The global criterion thus calls for maximization of g(A), the smallest possible information over X, or minimization of d(A\ the largest possible variance over X, as A varies over the set M(H) of all moment matrices. A bound on the optimal value of the global criterion is easily obtained as follows. 9.3. BOUND FOR GLOBAL OPTIMALITY Lemma. Assume that the regression range X C IR* contains k linearly independent vectors. Then every moment matrix M e M (H) satisfies

Proof. If M is singular, then there exists a regression vector x e X which is not a member of the range of M. Hence we obtain d(M) = oo and g(M) = 0, and the bounds are correct. If M is nonsingular and belongs to the design e H, then the bounds follow from

Indeed, the upper bound I/k for the minimum information g, and the lower bound k for the maximum variance d are the optimal values. This

212

CHAPTER 9: D-, A-, E-, T-OPTIMALITY

comes out of the famous Kiefer-Wolfowitz Theorem. Furthermore, this theorem establishes the equivalence of determinant optimality and global optimality if the set of competing moment matrices is as large as can be, M = M (E). The moment matrices and designs that are determinant optimal are the same as those that are globally optimal, and valuable knowledge about the location and weights of optimal support points is implied. 9.4. THE KIEFER-WOLFOWITZ THEOREM Theorem. Assume that the regression range X C Rk contains k linearly independent vectors. Then for every moment matrix M e M(E) that is positive definite, the following four statements are equivalent: a. b. c. d. (Determinant optimality) M is <fr)-optimal for 6 in Af (E). (Normality inequality) x'M~lx < k for all x X. (Minimax variance) d(M) = k. (Global optimality) M is globally optimal in A/(E).

In the case of optimality, any support point *,- of any design e E that is <fo-optimal for 0 in a satisfies

Proof. Theorem 7.20, with p = 0, covers the equivalence of parts (a) and (b), as well as property (1). From (b), we obtain d(M) < k. By Lemma 9.3, we then have d(M) = k, that is, (c). Conversely, (c) plainly entails (b). Condition (c) says that M attains the global optimality bound of Lemma 9.3, whence (c) implies part (d). It remains to show that part (d) implies (c). By the Existence Theorem 7.13, there is a moment matrix M e M(S) which is <fo-optimal for 6 in M(a). We have proved so far that, because of optimality, this matrix M satisfies d(M) k and is globally optimal. Now we assume part (d), that is, M is another globally optimal matrix in M(E), besides M. The two matrices must lead to the same optimal value, d(M) = d(M) = k. Hence part (d) implies part (c). Property (2) simply reiterates Corollary 8.12. The argument in the proof that leads back from (d) to (c) is of the sort: sufficiency of an optimality condition together with the existence of an optimal solution implies necessity. The proofs of the Gauss-Markov Theorem 1.19 and the Elfving Theorem 2.14 are arranged along the same lines.

9.5. D-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS

213

Property (2) entails that if a <ft)-optimal design has a minimum number of support points, k, then it distributes the weight l/k uniformly to each of them, as stated in Corollary 8.12. This phenomenon occurs in polynomial fit models.

9.5. D-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS The Kiefer-Wolfowitz Theorem 9.4 is now used to determine the <fo-optimal designs for the full parameter vector in the class of all designs, for polynomial fit models on the symmetric unit interval [-!;!]. By Section 1.6, the model equation for degree d > I is

with tf e [!;!] The regression function / maps t e [1;1] into the power vector x = f ( t ) = (1, r , . . . , td}' e M.d+l. The full parameter vector 6 comprises the k d + 1 components BQ, &\,..., 6d. Rather than working with designs e a on the regression range X C Ud+l, we concentrate on the set T of all designs T on the experimental domain T [-!;!]. In the dth-degree model, the moment matrix of a design T on T is given in Section 1.28:

A design r is feasible for the full parameter vector 0 if and only if its moment matrix Md(r) is positive definite. The minimum support size of r then is d+l, because r must have at least d + 1 support points r0, f i , . . . , td e [-1; 1] such that the corresponding regression vectors jc/ = /(f,-) are linearly independent. If the design T is <f> -optimal for 0 in T, for an information function <f> on NND(&) which is strictly isotonic on PD(s), then the support size is actually equal to d + I. This is so because the Equivalence Theorem 7.17 states that every support point r,- of T maximizes the function P ( t ) f(t)'Nf(t) over / e [-1; 1] where (f> (Md(r))N is a subgradient of $ at Md(r), by Theorem 7.9. Lemma 7.5 then shows that the matrix N is positive definite. Thus the bottom right entry of N must be positive, whence P ( t ) is a polynomial of degree 2d. Therefore P has at most d - 1 local maxima on the real line R, attained at points ? i , . . . ,fd-i. saY- In order to achieve the minimum support size d + 1, these must be distinct points in the interior of the interval [1;1], and the boundary points t$ 1, and td = 1 must also attain the maximum value.

214

CHAPTER 9: D-, A-, E-, T-OPTIMALITY

Degree
2 3 4 5 6 7 8 9 10
2

Legendre Polynomial Pd on [ 1, 1] (-l + 3r )/2 (-3r + 5r3)/2 (3-30?2 + 35*4)/8 (15/-70f 3 +63f s )/8 (-5 + 105f2 - 315f4 + 231f6)/16 (-35/ + 315f3 - 693/5 + 429f7)/16 (35 - 1260/2 + 6930/4 - 12012/6 + 6435f8)/128 (315* - 4620f3 + 18018*5 - 25740?7 + 12155f9)/128 (-63 + 3465r2 - 30030/4 + 90090/6 - 109395/8 + 46189/10)/256

EXHIBIT 9.1 The Legendre polynomials up to degree 10. We have PQ(t) = 1 and /^(r) = t, the next nine polynomials are shown in the exhibit.

This yields k = d +1 support points of the form

for any design T e T that in the dth-degree model is <-optimal for 0 in T. We have constructed the matrix N as the subgradient of <f> at Af d (r). However, we may also view N as an optimal solution of the dual problem of Section 7.11. Hence any one such N works for all <-optimal design T, as does the polynomial P(t) f(t}'Nf(t). Thus the support points fo>'i>*</ are common to all <-optimal designs for 6 in T. This applies to the determinant criterion <fo, the theme of this section. Therefore the <fo -optimal support for d in T is of the form -1 = /0 < h < < td_\ < id = 1, where we claim that the interior points ?i,...,/d_i are the local extrema of the Legendre polynomial Pd. For the various ways to characterize these classical polynomials, we refer to an appropriate differential equation. Alternatively they are obtained by orthogonalizing the powers t,...,td relative to Lebesgue measure on [-!;!]. The first ten polynomials Pd are listed in Exhibit 9.1. Claim. The unique design TQ that in the d th-degree model is <fo-optimal for 6 in T assigns equal weight \/(d + 1) to the d + 1 points f, that solve the equation

where Pd is the derivative of the d th-degree Legendre polynomial Pd. Proof. Uniformity of the weights follows from the Kiefer-Wolfowitz The-

9.5. D-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS

215

orem 9.4. The location of the support is found by studying the normality inequality f(t}'(Md(T*)}-lf(t] < d+l. Introducing the (d+l) x (d+l) matrix X through

we get Md(r*) = X'X/(d + 1). Since Md(r*} is nonsingular so is X. This reduces the normality inequality to IJA""1/!')!!2 < 1 fr au" f [-!;!] The inverse V of X' admits an explicit representation once we change from the power basis l,t,...,td to the basis provided by the dth-degree Lagrange polynomials L/ with nodes fo, h > > t<i->

where in the products, k ranges from 0 to d. Evidently L/(f ; ) equals 1 or 0 according as / = j or / ^ j. The same is true of the dth-degree polynomial P(t) = e!Vf(t) since we have P(tj) = e!Vf(tj) - efVX'ej = etc,. Thus the two polynomials are identical,

In other words, the matrix V, which has the power basis coefficient vectors of the Lagrange polynomials as rows, is the inverse of the matrix X', which has the power vectors jc, = /(/,) that come with the nodes ?, as columns. This yields \\X'~lf(t)\\2 = ||V/(r)||2 = ?=oL?(0- In order to compare this sum of squares with the constant 1, we use the identity

Indeed, on either side the polynomials are at most of degree 2d+l. At tj, they share the same value 1 and the same derivative 2L;-(fy), for j = 0,1,...,d. Hence the two sides are identical. The polynomial Q(t) = n*=o(' ~'*) nas degree d+\ and satisfies Q(r,-) = 0. Around f,-, it has Taylor expansion

216

CHAPTER 9: D-, A-, E-, T-OPTIMALITY

With this, the Lagrange polynomials and their derivatives at r, become

We let c e R be the coefficient of the highest term td+l in (1 -t2)Pd(t). By assumption this polynomial has zeros tk, giving cQ(t) = (1 t2)Pd(t). Only now do we make use of the particulars of the Legendre polynomial Pd that it is characterized through the differential equation

In terms of Q this means cQ(t) = -d(d+l)Pd(t), and cQ(t) = -d(d+l)Pd(t). Insertion into (2) yields

For i = 1,... ,d 1, the points /, are the zeros of Pd, and so the derivatives L,-(fj) vanish. This aids in evaluating (1), since it means that the only nonvanishing terms occur for / = 0 and / = d. The left hand side in (3) is (l-t2)Pd(t)-2tPd(t). Thus t = 1 in (3) implies L 0 (-l) = -d(d+l)/4 and L rf (l) = d(d+l)/4 in (4). With Pd(-l) = (-\)d and Pd(l) = 1, we insert L0 and Ld from (4) into (1) to obtain

This verifies the normality inequality. The proof of our claim is complete. For d > 1, the optimal value i^(<fo) = <o(To) obeys the recursion relation

with starting value vi(<fo) = 1. Exhibit 9.2 in Section 9.6 provides a list up to degree d = 10, of the <fo-optimal designs rfi for 0 in T and of their optimal values vd(<f)o). The initial cases d = 1,2 are easy to verify directly. The line fit model has (fo-optimal design TQ supported by 1 with uni-

9.6. ARCSIN SUPPORT DESIGNS

217

form weight 1/2, and optimal value ui(<fo) = 1. The parabola fit model has ^-optimal design TQ supported by 1,0,1 with uniform weight 1/3, and optimal value i/ 2 (<fo) = 41/3/3 = 0.52913. Although the optimal support points have an explicit representation, as zeros of derivatives of Legendre polynomials, they are not easy to compute numerically. However, in Section 5.15 we argued that an optimal design is not an end in itself, but helps to identify good practical designs. To this end we introduce designs with arcsin support (see Exhibit 9.3). Their support is constructed easily, and they are nearly optimal in many polynomial fit problems. 9.6. ARCSIN SUPPORT DESIGNS The theory of classical polynomials says that the designs TQ which are <fooptimal for 6 in T converge to the arcsin distribution, as the degree d tends to infinity. The arcsin distribution on [-1;1] has distribution function A and Lebesgue density a, given by A(t) = | + (1/w) arcsin(f) and a(t) = l/(ir(l - r2)1/2). The convergence becomes plainly visible through a histogram representation of the designs TQ, as in Exhibit 9.4. Because of the limiting behavior, it seems natural to approximate the optimal support points r, by the d th-degree quantiles $/ of the arcsin distribution,

for i = 0,1,...,d. Symmetry of the arcsin distribution entails symmetry of the quantiles, s_{d−i} = −s_i for all i = 0,1,...,d, with s₀ = −1 and s_d = 1. Exhibit 9.5 illustrates the construction. Designs with arcsin support are very efficient in many polynomial fit problems. They deserve a definition of their own. DEFINITION. In a polynomial fit model of degree d over the experimental domain [−1;1], an arcsin support design σ^d is defined by having for its support the dth-degree quantiles

of the arcsin distribution on [−1;1]. The set of all arcsin support designs for degree d is denoted by Σ_d. The member with uniform weights 1/(d+1) is designated by σ₀^d. It is φ₀-optimal for θ if the set of competing designs is restricted to the arcsin support designs Σ_d. This follows from applying the Kiefer-Wolfowitz Theorem 9.4 to the finite regression range X_d = {f(s₀), f(s₁),..., f(s_d)}.
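A minimal sketch of the arcsin support construction; the quantile formula s_i = A^{-1}(i/d) = −cos(iπ/d) follows from A(t) = 1/2 + (1/π) arcsin(t):

    import numpy as np

    def arcsin_support(d):
        """dth-degree arcsin quantiles s_i = A^{-1}(i/d) on [-1, 1]."""
        i = np.arange(d + 1)
        return -np.cos(np.pi * i / d)          # equals sin(pi*(i/d - 1/2))

    s = arcsin_support(5)
    print(np.round(s, 4))                      # [-1. -0.809 -0.309 0.309 0.809 1.]
    A = 0.5 + np.arcsin(s) / np.pi
    print(np.allclose(A, np.arange(6) / 5))    # True: s_i are the i/d quantiles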


EXHIBIT 9.2 Polynomial fits over [−1;1]: φ₀-optimal designs τ₀^d for θ in T. Left: degree d of the fitted polynomial. Middle: support points and weights of τ₀^d, and a histogram representation. Right: optimal value v_d(φ₀) of the determinant criterion.


EXHIBIT 9.3 Polynomial fits over [−1;1]: φ₀-optimal designs σ₀^d for θ in Σ_d. Left: degree d of the fitted polynomial. Middle: arcsin support points, weights of σ₀^d, and a histogram representation. Right: efficiency of σ₀^d relative to the optimal value v_d(φ₀).


EXHIBIT 9.4 Histogram representation of the design τ₀^d. Superimposed is the arcsin density a.

EXHIBIT 9.5 Fifth-degree arcsin support. Top: as quantiles s_i = A^{-1}(i/5). Bottom: as projections s_i = cos((1 − i/5)π) of equispaced points on the half circle.


Exhibit 9.2 presents the overall optimal designs τ₀^d. In contrast, the optimal arcsin support designs σ₀^d and their φ₀-efficiencies are shown in Exhibit 9.3. For degrees d = 1,2, the arcsin support designs σ₀^d coincide with the optimal designs τ₀^d and hence have efficiency 1. Thereafter the efficiency falls down to 97.902% for degree 9. Numerical evidence suggests that the efficiency increases from degree 10 on. The designs are rounded to three digits using the efficient design apportionment for sample size n = 1000 of Section 12.12. The determinant criterion φ₀ is peculiar in that optimal designs with a minimum support size d+1 must assign uniform weights to their support points. For other criteria, the optimal weights tend to vary. We next turn to the matrix mean φ₋₁, that is, the average-variance criterion. The quantity that takes the place of the largest variance d of Section 9.2 now becomes

for positive definite k × k matrices A, and d₋₁(A) = ∞ for nonnegative definite k × k matrices A that are singular. Because of

the function d₋₁ is on M(Ξ) bounded from below by 1.

9.7. EQUIVALENCE THEOREM FOR A-OPTIMALITY

Theorem. Assume that the regression range X ⊆ ℝ^k contains k linearly independent vectors. Then for every moment matrix M ∈ M(Ξ) that is positive definite the following four statements are equivalent:

a. (Average-variance optimality) M is φ₋₁-optimal for θ in M(Ξ).
b. (Normality inequality) x'M^{-2}x ≤ trace M^{-1} for all x ∈ X.
c. (Minimax property) d₋₁(M) = 1.
d. (d₋₁-optimality) M minimizes d₋₁ in M(Ξ).

In the case of optimality, any support point x_i of any design ξ ∈ Ξ that is φ₋₁-optimal for θ in Ξ satisfies

Proof.

The proof parallels that of the Kiefer-Wolfowitz Theorem 9.4.
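As a numerical illustration of statement (b), the sketch below checks the normality inequality x'M^{-2}x ≤ trace M^{-1} for the parabola design with weights 1/4, 1/2, 1/4 on −1, 0, 1, which Section 9.9 identifies as φ₋₁-optimal; equality is attained at the support points:

    import numpy as np

    f = lambda t: np.array([1.0, t, t * t])    # parabola fit regression function
    support, weights = np.array([-1.0, 0.0, 1.0]), np.array([0.25, 0.5, 0.25])

    M = sum(w * np.outer(f(t), f(t)) for w, t in zip(weights, support))
    Minv = np.linalg.inv(M)
    bound = np.trace(Minv)                     # right hand side of (b)

    grid = np.linspace(-1, 1, 2001)
    lhs = np.array([f(t) @ Minv @ Minv @ f(t) for t in grid])
    print(bound, lhs.max())                    # both 8; maximum attained at -1, 0, 1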


The present theorem teaches us a major lesson about the frame that the Kiefer-Wolfowitz Theorem 9.4 provides for determinant optimality. It is false pretense to maintain the frame also for other criteria, in order to force a structural unity where there is none. No interpretation is available that would promise any statistical interest in the function d₋₁. Therefore, the frame of the Kiefer-Wolfowitz Theorem 9.4 is superseded by the one provided by the General Equivalence Theorem 7.14.

9.8. L-CRITERION

An optimality concept not mentioned so far aims at minimizing linear criteria of the dispersion matrix. It arises when a design ξ is evaluated with a view towards the average variance on the regression surface x ↦ x'θ over X. Assuming M(ξ) to be positive definite, the evaluation is based on the criterion

where the distribution λ on X reflects the experimenter's weighting of the regression vectors x ∈ X. Upon setting W = M(λ), this approach calls for the minimization of trace W M(ξ)^{-1}. The generalization to parameter subsystems K'θ is as follows. Let W be a fixed positive definite s × s matrix. Under a moment matrix M that is positive definite, the optimal estimator for K'θ has a dispersion matrix proportional to K'M^{-1}K (see Section 3.5). The notion of linear optimality for K'θ calls for the minimization of

This is a linear function of the standardized dispersion matrix K'M^{-1}K. Linear optimality poses no new challenge, being nothing but a particular case of average-variance optimality. Let H ∈ ℝ^{s×s} be a square root of W, that is, W = HH'. Then the criterion takes the form

Hence minimization of φ_W is the same as maximization of φ₋₁ ∘ C_{KH}. Furthermore, the latter formulation extends to all moment matrices M ∈ M(Ξ), whether they are positive definite or merely nonnegative definite. In summary, with weight matrix W = HH' > 0, linear optimality for K'θ is the same as φ₋₁-optimality for H'K'θ. The optimality results for the average-variance criterion carry over, and also characterize linear optimality.


9.9. A-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS

Continuing the discussion of polynomial fit models, we now study designs that for degree d are φ₋₁-optimal for the full parameter vector θ in the set T of all designs on the experimental domain T = [−1;1]. Let M_d be a moment matrix that is φ₋₁-optimal for θ in M(T). The criterion φ₋₁ is strictly isotonic on PD(s), owing to Theorem 6.13. By the same reasoning as in Section 9.5, this determines the optimal support to consist of d+1 points of the form −1 = t₀ < t₁ < ··· < t_{d−1} < t_d = 1. The φ₋₁-optimal weights w₀, w₁,..., w_d for the support points t₀, t₁,..., t_d are unique, and are obtained using Corollary 8.8. Namely, let b_ii be the ith diagonal element of the matrix B = (XX')^{-1}, where the regression vectors x_i = f(t_i) are the rows of the square and nonsingular matrix X = (x₀, x₁,..., x_d)'. Then the optimal weights w_i and the optimal value are

In summary, there is a unique design τ₋₁^d that is φ₋₁-optimal for θ in T. Once its support points are found, the weights (1) and the optimal value (2) are easy to compute. Numerical computation of the φ₋₁-optimal designs τ₋₁^d produces the results shown in Exhibit 9.6. The weights are rounded to three digits using the efficient design apportionment for sample size n = 1000 of Section 12.5. The line fit model has φ₋₁-optimal design τ₋₁¹ supported by ±1 with uniform weight 0.5 and optimal value v₁(φ₋₁) = 1. The parabola fit model has φ₋₁-optimal design τ₋₁² supported by −1, 0, 1 with weights 0.25, 0.5, 0.25, and optimal value v₂(φ₋₁) = 0.375. The class Σ_d of designs with a dth-degree arcsin support provides an efficient alternative. The design σ₋₁^d that is φ₋₁-optimal for θ in Σ_d has weights also given by (1), except that the matrix B = (XX')^{-1} involves the arcsin support points s_i through

Exhibit 9.7 shows that the φ₋₁-efficiency of σ₋₁^d is remarkably high. Next we turn to the left extreme matrix mean, the smallest-eigenvalue criterion φ₋∞. In order to determine the φ₋∞-optimal designs for polynomial fit models we need to refer to Chebyshev polynomials. We review their pertinent properties first.
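A sketch of the computation behind (1) and (2) for a fixed support; the square-root rule w_i ∝ √b_ii is assumed here as the content of Corollary 8.8, since the displayed formulas are not reproduced in this excerpt. With support −1, 0, 1 it recovers the parabola values 0.25, 0.5, 0.25 and v₂(φ₋₁) = 0.375:

    import numpy as np

    def a_optimal_weights(support):
        """phi_{-1}-optimal weights on a fixed support, assuming w_i ∝ sqrt(b_ii)
        with b_ii the diagonal of B = (XX')^{-1} (reading of Corollary 8.8)."""
        d = len(support) - 1
        X = np.vander(support, d + 1, increasing=True)   # rows x_i' = f(t_i)'
        b = np.diag(np.linalg.inv(X @ X.T))
        w = np.sqrt(b) / np.sqrt(b).sum()
        M = X.T @ (w[:, None] * X)                        # sum_i w_i f(t_i)f(t_i)'
        value = (d + 1) / np.trace(np.linalg.inv(M))      # phi_{-1}(M)
        return w, value

    w, v = a_optimal_weights(np.array([-1.0, 0.0, 1.0]))
    print(np.round(w, 3), round(v, 3))                    # [0.25 0.5 0.25] 0.375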


EXHIBIT 9.6 Polynomial fits over [−1;1]: φ₋₁-optimal designs τ₋₁^d for θ in T. Left: degree d of the fitted polynomial. Middle: support points and weights of τ₋₁^d, and a histogram representation. Right: optimal value v_d(φ₋₁) of the average-variance criterion.


EXHIBIT 9.7 Polynomial fits over [−1;1]: φ₋₁-optimal designs σ₋₁^d for θ in Σ_d. Left: degree d of the fitted polynomial. Middle: arcsin support points, weights of σ₋₁^d, and a histogram representation. Right: efficiency of σ₋₁^d relative to the optimal value v_d(φ₋₁).


EXHIBIT 9.8 The Chebyshev polynomials up to degree 10. We have T₀(t) = 1 and T₁(t) = t; the next nine polynomials are:

Degree d    Chebyshev polynomial T_d on [−1;1]
   2        −1 + 2t^2
   3        −3t + 4t^3
   4        1 − 8t^2 + 8t^4
   5        5t − 20t^3 + 16t^5
   6        −1 + 18t^2 − 48t^4 + 32t^6
   7        −7t + 56t^3 − 112t^5 + 64t^7
   8        1 − 32t^2 + 160t^4 − 256t^6 + 128t^8
   9        9t − 120t^3 + 432t^5 − 576t^7 + 256t^9
  10        −1 + 50t^2 − 400t^4 + 1120t^6 − 1280t^8 + 512t^10

9.10. CHEBYSHEV POLYNOMIALS

The Chebyshev polynomial T_d of degree d is defined by T_d(t) = cos(d arccos t), for t ∈ [−1;1].

The Chebyshev polynomials for degree d ≤ 10 are shown in Exhibit 9.8. It is immediate from the definition that the function T_d is bounded by 1,

All extrema of T_d have absolute value 1. They are attained at the arcsin support points

from Section 9.6:

In order to see that T_d(t) is a polynomial in t, substitute cos φ for t and compare the real parts in the binomial expansion of the complex exponential function, cos(dφ) + i sin(dφ) = e^{idφ} = (cos φ + i sin φ)^d. This leads to the polynomial representation T_d(t) = Σ_{j=0}^{d} c_j t^j, for which only the highest


coefficient and then every second are nonzero,

where ⌊d/2⌋ is the integer part of d/2. We call c = (c₀, c₁,..., c_d)' ∈ ℝ^{d+1} the Chebyshev coefficient vector. With power vector f(t) = (1, t,..., t^d)', we can thus write T_d(t) = c'f(t). Whereas the vector c pertains to the power basis 1, t,..., t^d, we also need to refer to the basis provided by the Lagrange interpolating polynomials, with nodes given by the arcsin support points s_j.
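A sketch computing the Chebyshev coefficient vector c in the power basis via the three-term recurrence T_d = 2tT_{d−1} − T_{d−2}, and checking the bound |T_d| ≤ 1 together with the extrema at the arcsin support points (the recurrence is a standard route; the degree is an illustrative choice):

    import numpy as np

    def chebyshev_coefficients(d):
        """Power basis coefficients c with T_d(t) = c'f(t), f(t) = (1, t, ..., t^d)'."""
        T = [np.array([1.0]), np.array([0.0, 1.0])]          # T_0, T_1
        for _ in range(2, d + 1):
            T.append(np.append(0.0, 2 * T[-1]) - np.append(T[-2], [0.0, 0.0]))
        return T[d]

    d = 4
    c = chebyshev_coefficients(d)
    print(c)                                                  # [ 1.  0. -8.  0.  8.]
    s = -np.cos(np.pi * np.arange(d + 1) / d)                 # arcsin support points
    print(np.round(np.polynomial.polynomial.polyval(s, c), 6))  # alternating +-1 extrema
    grid = np.linspace(-1, 1, 2001)
    print(np.abs(np.polynomial.polynomial.polyval(grid, c)).max() <= 1 + 1e-9)  # True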

9.11. LAGRANGE POLYNOMIALS WITH ARCSIN SUPPORT NODES

The Lagrange polynomials with nodes s₀, s₁,..., s_d are

where in the products, k ranges from 0 to d. We find it unambiguous to use the same symbol L_i as in Section 9.5 even though the present node set is distinct from the one used there, and so are the Lagrange polynomials and all associated quantities. Again we introduce the (d+1) × (d+1) matrix V that comprises, as rows, the coefficient vectors v_i from the power basis representation L_i(t) = v_i'f(t). For degrees d = 1,2,3,4, the coefficient matrices V and the sign pattern (−1)^{d−i+j} of v_{i,d−2j} are shown in Exhibit 9.9. More precisely, the entries of V satisfy the following. Claim. For all i = 0,1,...,d and j = 0,1,...,⌊d/2⌋, we have

Proof. The proof of (1) is elementary, resorting to the definition of L_i and exploiting the arcsin support points −1 = s₀ < s₁ < ··· < s_{d−1} < s_d = 1, solely their symmetry, s_{d−i} = −s_i. The denominator satisfies


EXHIBIT 9.9 Lagrange polynomials up to degree 4. For degrees d = 1,2,3,4, the (d+1) × (d+1) coefficient matrix V is shown that determines the Lagrange polynomials L_i(t) = Σ_{j=0}^{d} v_{ij} t^j, with nodes given by the arcsin support points s_i. The bordering signs indicate the pattern (−1)^{d−i+j}.

It remains to multiply out the numerator polynomial and to find the coefficient belonging to t^{d−2j}. We distinguish three cases. I. Case d odd. Except for t − s_{d−i} = t + s_i, the factors in P come in pairs, t − s_k and t + s_k. Hence we have P(t) = (t + s_i)Q(t), where the polynomial

involves only even powers of t. In order to find the coefficient of the odd power t^{d−2j} in P, we must use the term t of the factor t + s_i and ½(d−1) − j terms t^2 in the factors of Q. This results in the coefficient

With ½(d−1) = ⌊½(d−1)⌋, we see that (3) and (2) establish (1). II. Case d even and i = ½d. The definition of P misses out on the factor t − 0. Since the remaining factors come in pairs, we get


The even power t^{d−2j} uses ½d − j terms t^2, and thus has coefficient

With ½d − 1 = ⌊½(d−1)⌋, now (4) and (2) imply (1). III. Case d even and i ≠ ½d. Here P comprises the factors t − 0 = t and t − s_{d−i} = t + s_i. We obtain P(t) = t(t + s_i)Q(t) with

The even power t^{d−2j} is achieved in P only as the product of the leading factor t, the term t in the factor t + s_i, and ½d − 1 − j terms t^2 in the factors of Q. The associated coefficient is

This and (2) prove (1). Moreover, (5) is nonzero unless the summation is empty. This occurs only if j = ½d; then the set contains ½d − 1 numbers and does not include a ½d-element subset. As depicted in Exhibit 9.9, we refer to (1) through the sign pattern (−1)^{d−i+j}. Furthermore both numerator and denominator in (1) are symmetrical in i and d−i. The discussion of (5), together with (3) and (4), shows that the numerator vanishes if and only if d is even and j = ½d ≠ i. All these properties are summarized in

This concludes our preparatory remarks on Chebyshev polynomials and Lagrange polynomials with nodes s_i. In order to apply Theorem 7.24, we need the Elfving norm of the vector Kz. For the Chebyshev coefficient vector c and z = K'c, we compute the norm ρ(KK'c) as the optimal value of the design problem for c'KK'θ.

9.12. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, I

In a dth-degree polynomial fit model, we consider parameter subsystems of the form θ_I = (θ_i)_{i∈I}, with an s-element index set I = {i₁,...,i_s}.


Using the Euclidean unit vectors e₀, e₁,..., e_d of ℝ^{d+1}, we introduce the (d+1) × s matrix K = (e_{i₁},..., e_{i_s}) and represent the parameter system of interest by K'θ = (θ_{i₁},..., θ_{i_s})'. The matrix K fulfills K'K = I_s, and KK' is a diagonal matrix D where d_ii is 1 or 0 according as i belongs to the set I or not. Because of the zero pattern of the Chebyshev coefficient vector c, the vector KK'c depends on I only through the set

More precisely, we have KK'c = Σ_{j∈J} c_{d−2j} e_{d−2j}. We assume the index set I to be such that J is nonempty, so that KK'c ≠ 0. In this section, we consider scalar optimality for c'KK'θ and find, as a side result, the Elfving norm ρ(KK'c) to be ||K'c||^2. It turns out that the optimal design is an arcsin support design. Claim. Let c ∈ ℝ^{d+1} be the coefficient vector of the Chebyshev polynomial T_d(t) = c'f(t) on [−1;1]. Then the unique design τ_J that is optimal for c'KK'θ in T has support points

and weights

and optimal variance (ρ(KK'c))^2 = ||K'c||^4, where the coefficients u₀, u₁,..., u_d are determined from

If d is odd or J ≠ {½d}, then all weights are positive. If d is even and J = {½d}, then we have c'KK'θ = c₀θ₀ and w_{d/2} = 1; that is, if d is even, then the one-point design in zero is uniquely optimal for θ₀ in T. Proof. The proof is an assembly of earlier results. Since the vector u = (u₀, u₁,..., u_d)' solves X'u = KK'c, the identity X'V = I_{d+1} from Section 9.5 yields u = VKK'c, that is, u_i = Σ_{j∈J} v_{i,d−2j} c_{d−2j}. An exceptional case occurs for even degree d and one-point set J = {½d}; then we have w_{d/2} = 1 and all other weights are 0.


Otherwise we use the sign patterns to show that the weights are positive,

and

and

The normalization follows from

upon using

From (6) in Section 9.11, we see that the weights are symmetric, w_i = w_{d−i}. Let M be the moment matrix of the design τ_J. The key relation is the following:

Premultiplication by c'KK'M⁻ gives ||K'c||^4 = c'KK'M⁻KK'c. That is, for the design τ_J the optimality criterion for c'KK'θ takes on the value ||K'c||^4. Now we switch to the dual problem. The matrix N = cc' satisfies f(t)'Nf(t) = (T_d(t))^2 ≤ 1, for all t ∈ [−1;1]. The dual objective function has value c'KK'NKK'c = ||K'c||^4. Therefore the Mutual Boundedness Theorem 2.11 proves the matrices M and N to be optimal solutions of the design problem and its dual. The theorem also stipulates that any support point t of an optimal design satisfies the equation f(t)'Nf(t) = (T_d(t))^2 = 1. By Section 9.10, this singles out the arcsin quantiles s₀, s₁,..., s_d as the only possible support points. The matrix identity X'V = I_{d+1} shows that X is nonsingular, whence the regression vectors f(s₀), f(s₁),..., f(s_d) are linearly independent. Hence Corollary 8.9 applies and provides the unique optimal weights. These are the weights given above. Our claim is established. As a corollary we obtain a representation as in the Elfving Theorem 2.14,

Thus the coefficient vector KK'c penetrates the Elfving set ℛ through the d-dimensional face generated by (−1)^d f(s₀), (−1)^{d−1} f(s₁),..., −f(s_{d−1}), f(s_d).
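A numerical sketch for the full index set I = {0,1,...,d}: solving X'u = c on the arcsin support gives weights w_i = |u_i|/Σ_k|u_k|; this weight rule is assumed here, consistent with the Elfving representation above, and the check that Σ_k|u_k| = ||K'c||^2 and that the optimal variance c'M^{-1}c equals ||c||^4 is printed by the code:

    import numpy as np

    d = 2
    c = np.array([-1.0, 0.0, 2.0])              # Chebyshev coefficient vector of T_2
    s = -np.cos(np.pi * np.arange(d + 1) / d)   # arcsin support points -1, 0, 1

    X = np.vander(s, d + 1, increasing=True)    # rows f(s_i)'
    u = np.linalg.solve(X.T, c)                 # X'u = c
    w = np.abs(u) / np.abs(u).sum()             # assumed weight rule w_i = |u_i| / sum_k |u_k|
    print(np.round(w, 3), np.isclose(np.abs(u).sum(), c @ c))   # [0.2 0.6 0.2] True

    M = X.T @ (w[:, None] * X)                  # moment matrix of tau_c
    variance = c @ np.linalg.solve(M, c)
    print(round(variance, 6), (c @ c) ** 2)     # 25.0  25.0 = ||c||^4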


The full index set I = {0,1,...,d} gives the unique optimal design for c'θ, denoted by τ_c, with optimal variance ||c||^4. The one-point set I = {d−2j} yields the optimal design τ_{d−2j} for the scaled individual parameter c_{d−2j}θ_{d−2j}, with optimal variance c_{d−2j}^4. The same design is optimal also for the unscaled component θ_{d−2j}, with optimal variance c_{d−2j}^2. These designs satisfy the relation

Therefore τ_c is a mixture of the designs τ_{d−2j}, with mixing weights c_{d−2j}^2/||c||^2. We now leave scalar optimality and approach our initial quest, of discussing optimality with respect to the smallest-eigenvalue criterion φ₋∞. We claim the following. Claim. The design τ_J has an information matrix C = C_K(M_d(τ_J)) for K'θ such that K'c is an eigenvector corresponding to the eigenvalue ||K'c||^{-2}. Proof. We choose a left inverse L of K which satisfies LM(I_{d+1} − L'K') = 0, where M = M_d(τ_J) is the moment matrix of τ_J. As in Section 3.2, we then have C = LML', in addition to LML'K' = LM. Insertion of Mc from the key relation (2) yields

For instance, in a fourth-degree model, the sets {0,1,3,4}, {0,1,4}, {0,3,4}, {0,4} all share the same set J = {0,2}, the same vector KK'c = c₀e₀ + c₄e₄ = (1,0,0,0,8)' ∈ ℝ^5, and the same optimal design τ_{{0,2}}. The information matrices for K'θ are of size 4×4, 3×3, 3×3, 2×2, with respective eigenvectors (1,0,0,8)', (1,0,8)', (1,0,8)', (1,8)', and common eigenvalue 1/(c₀^2 + c₄^2) = 1/65. While it is thus fairly easy to see that ||K'c||^{-2} is some eigenvalue, φ₋∞-optimality boils down to showing that it is the smallest eigenvalue of C.

9.13. E-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS

The designs τ_J of the previous section are symmetric, whence the odd moments vanish. Therefore their moment matrices decompose into two interlacing blocks (see Exhibit 9.10). In investigating the subsystem K'θ = θ_I, we place a richness assumption


EXHIBIT 9.10 E-optimal moment matrices. Because of symmetry of the design τ_c = τ₋∞^d, its moment matrix splits into two interlacing blocks, as shown for degrees d = 1,2,3,4. Dots indicate zeros.

on the index set I which ensures that the smallest eigenvalue of the information matrix comes from the block that is associated with the Chebyshev indices d−2j. More precisely, we demand that every non-Chebyshev index d−1−2j in I is accompanied by its immediate successor d−2j,

for all j = 0,1,...,⌊½d⌋. Since the scalar case is taken care of by the previous section, we further assume that the index set I contains at least two indices. This prevents the set

from degenerating to the one-point set {½d}. We claim the following for a dth-degree polynomial fit model with experimental domain [−1;1]. Claim. Let c ∈ ℝ^{d+1} be the coefficient vector of the Chebyshev polynomial T_d(t) = c'f(t) on [−1;1], and assume that the index set I satisfies assumption (1). Then the design τ_J of Section 9.12 is the unique φ₋∞-optimal design for K'θ = θ_I in T, with optimal value ||K'c||^{-2}. If d > 1, then the smallest eigenvalue of the information matrix C = C_K(M_d(τ_J)) has multiplicity 1. Proof. From Section 9.12 we have ρ(KK'c) = ||K'c||^2. The proof rests on Theorem 7.24, in verifying the equality


That is, we wish to show that

with equality if and only if z is proportional to K'c. I. Let M be the moment matrix of τ_J. With a left inverse L of K that is minimizing for M, we can write C = LML'. Hence the assertion is

Because of the interlacing block structure of M, the quadratic form on the left hand side is

where P and Q are polynomials of degree d and d−1 associated with the Chebyshev index set and its complement,

The contributions from P^2 and Q^2 are discussed separately. II. In (1) of Section 9.12, we had

Any dth-degree polynomial satisfies

A comparison of coefficients with

yields

Observing sign patterns, we apply this to the Chebyshev polynomial T_d:


We also apply it to the dth-degree polynomial P. Now, for each j ∈ J, the Cauchy inequality yields

If equality holds in all Cauchy inequalities, then we need exploit only any one index j ∈ J with j ≠ ½d to obtain proportionality, for i = 0,1,...,d, of P(s_i)|v_{i,d−2j}|^{1/2} and (−1)^{d−i+j}|v_{i,d−2j}|^{1/2}. Because of v_{i,d−2j} ≠ 0, this entails P(s_i) = α(−1)^{d−i}, for some α ∈ ℝ. Hence equality holds if and only if P = αT_d, that is, a_{d−2j} = αc_{d−2j} for all j = 0,1,...,⌊½d⌋. III. The argument for Q^2 is reduced to that in part II, by introducing the dth-degree polynomial

From s_j^2 ≤ 1 and part II, we get the two estimates

If d is even and ½d ∈ J, then the last sum involves the coefficient of t in P, which is taken to be a₁ = 0.


EXHIBIT 9.11 Polynomial fits over [−1;1]: φ₋∞-optimal designs τ₋∞^d for θ in T. Left: degree d of the fitted polynomial. Middle: support points and weights of τ₋∞^d, and a histogram representation. Right: optimal value v_d(φ₋∞) of the smallest-eigenvalue criterion.


Equality holds in (3) only if, for some β ∈ ℝ, we have P = βT_d. Equality holds in (2) only if Q^2(s_i) = s_i^2 Q^2(s_i); in case d > 1, any one index i = 1,...,d−1 has s_i^2 < 1 and hence entails Q(s_i) = 0 = P(s_i) = (−1)^{d−i}β. Thus equality obtains in both (2) and (3) if and only if a_{d−1−2j} = 0 for all j = 0,1,...,⌊½(d−1)⌋. IV. Parts II and III, and the assumption that an index d−1−2j occurs in I only in the presence of d−2j, yield

With a = L'z and LK = I_s, we get a'KK'a = ||z||^2. Therefore ||K'c||^{-2} is the smallest eigenvalue of C. If d > 1, then a'Ma = ||z||^2/||K'c||^2 holds if and only if a = αc, that is, z = αK'c; hence the eigenvalue ||K'c||^{-2} has multiplicity 1. V. If τ is another φ₋∞-optimal design for K'θ in T, then τ is also optimal for c'KK'θ, by Theorem 7.24. Hence the uniqueness statement of that theorem carries over to the present situation. This completes the proof of our claim. For instance, in a fourth-degree model only the last two of the four sets {0,1,3,4}, {0,1,4}, {0,3,4}, {0,4} meet our assumption (1). Hence τ_{{0,2}} is φ₋∞-optimal in T for (θ₀,θ₃,θ₄)', as well as for (θ₀,θ₄)', with common optimal value 1/65. The present result states that many φ₋∞-optimal designs are arcsin support designs. The case of greatest import is the full index set I = {0,1,...,d}, for which the design τ_c of the previous section is now seen to be also φ₋∞-optimal for θ in T. It is in line with our earlier conventions that we employ the alternate notation τ₋∞^d for τ_c. Exhibit 9.11 lists the φ₋∞-optimal designs τ₋∞^d up to degree 10, with weights rounded using the efficient design apportionment of Section 12.5. The line fit model has φ₋∞-optimal design supported by ±1 with uniform weight 0.5 and optimal value v₁(φ₋∞) = 1; the φ₋∞-optimal moment matrix for θ is I₂, and its smallest eigenvalue, 1, has multiplicity 2. The parabola fit model has φ₋∞-optimal design τ₋∞² supported by −1, 0, 1 with weights 0.2, 0.6, 0.2, and optimal value v₂(φ₋∞) = 0.2. From Theorem 7.24, we also know now that the Chebyshev coefficient vector c determines the in-ball radius of the polynomial Elfving set ℛ, r^2 = 1/||c||^2. The in-ball touches the boundary of ℛ only at

9.14. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, II

In Section 9.12, we derived the optimal designs τ_{d−2j} for those individual parameters θ_{d−2j} that are an even number apart from the top coefficient θ_d.
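A quick check of the parabola values quoted above (weights 0.2, 0.6, 0.2 on −1, 0, 1, smallest eigenvalue 0.2 = 1/||c||^2 with c = (−1, 0, 2)'):

    import numpy as np

    f = lambda t: np.array([1.0, t, t * t])
    support, weights = [-1.0, 0.0, 1.0], [0.2, 0.6, 0.2]

    M = sum(w * np.outer(f(t), f(t)) for w, t in zip(weights, support))
    eigvals, eigvecs = np.linalg.eigh(M)
    print(np.round(eigvals, 6))            # [0.2 0.4 1.2]: smallest eigenvalue 0.2, multiplicity 1

    c = np.array([-1.0, 0.0, 2.0])         # Chebyshev coefficient vector of T_2
    print(1.0 / (c @ c))                   # 0.2, the optimal value v_2(phi_{-inf})
    print(np.round(eigvecs[:, 0], 6))      # eigenvector proportional to c (up to sign)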


A similar argument leads to the optimal designs τ_{d−1−2j} for the coefficients θ_{d−1−2j} that are an odd number away from the top. Their support falls down on the arcsin support of one degree lower, d−1. In order that the coefficient vector of the (d−1)st-degree Chebyshev polynomial T_{d−1} fits into the discussion of the dth-degree model, we use the representation T_{d−1}(t) = c'f(t), with dth-degree power vector f(t) = (1, t,..., t^d)' as before. That is, while ordinarily T_{d−1} comes with d coefficients c₀, c₁,..., c_{d−1}, it is convenient here to append the entry c_d = 0. Let again θ_I = (θ_i)_{i∈I} be the parameter system of interest. Because of the zero pattern of the Chebyshev coefficient vector c introduced just now, the vector KK'c depends on I only through the set

Assuming the index set I to be such that J is nonempty, we claim the following. Claim. For degree d > 1, let c ∈ ℝ^{d+1} be the coefficient vector of the Chebyshev polynomial T_{d−1}(t) = c'f(t) on [−1;1]. Then the unique design τ_J⁻ that is optimal for c'KK'θ in T has support points

and weights

and optimal variance (ρ(KK'c))^2 = ||K'c||^4, where the coefficients u₀, u₁,..., u_{d−1} are determined from


If d is even or J ≠ {½(d−1)}, then all weights are positive. If d is odd and J = {½(d−1)}, then we have c'KK'θ = c₀θ₀ and w_{(d−1)/2} = 1; that is, if d is odd, then the one-point design in zero is uniquely optimal for θ₀ in T. Proof. The support of the design τ_J⁻ gives rise to only d linearly independent regression vectors f(s₀), f(s₁),..., f(s_{d−1}) in the (d+1)-dimensional space ℝ^{d+1}. Hence the moment matrix M_d(τ_J⁻) is singular, and feasibility of τ_J⁻ for c'KK'θ requires proof.


Using the (d−1)st-degree power vector (1, t,..., t^{d−1})', equation (1) without the last line reads

This yields

where the entries are the power basis coefficients of the Lagrange polynomials with nodes given by the (d−1)st-degree arcsin support points. The last line in (1) is

Using the symmetries, we get

Thus the last line in (1) is fulfilled and τ_J⁻ is feasible for c'KK'θ. The rest of the proof duplicates the arguments from Section 9.12 and is therefore omitted. Again we deduce a representation as in the Elfving Theorem 2.14:

Here the coefficient vector KK'c penetrates the Elfving set ℛ through the

(d−1)-dimensional face that is generated by

The one-point set I = {d−1−2j} yields the optimal design τ_{d−1−2j} for the component θ_{d−1−2j}, with optimal variance c_{d−1−2j}^2. As a comparison, on the dth-degree arcsin support points s₀, s₁,..., s_d, the design σ_{d−1−2j} that is optimal for θ_{d−1−2j} has weights


EXHIBIT 9.12 Arcsin support efficiencies for individual parameters θ_j. For degree d ≤ 10 and j = 0,1,...,d, the table lists the efficiencies of the optimal arcsin support design σ_j^d for θ_j in Σ_d, relative to the optimal design τ_j for θ_j in T.

by Corollary 8.9. This design has efficiency

The efficiency is 1 for degrees 1 and 2. From degree 3 on, the discrepancy between the arcsin support sets of degree d and d−1 entails a drastic jump in the efficiencies, as shown in Exhibit 9.12. Finally, we treat the right extreme matrix mean, the trace criterion φ₁. Although there is little practical interest in this criterion, its discussion is instructive. Just as the Elfving Theorem 2.14 addresses design optimality directly without a detour via moment matrices, so does Theorem 9.15. Furthermore, we are thrown back to the concept of formal optimality of Section 5.15, that a moment matrix can have maximum φ₁-information for the full parameter vector θ without being feasible for θ.


From Section 2.17, the maximum length R of the regression vectors is the radius of the Euclidean ball circumscribing the regression range X. Usually this makes the situation easy to analyse. For polynomial fit models over the experimental domain T = [−1;1], the squared length of a regression vector is ||x||^2 = ||f(t)||^2 = 1 + t^2 + ··· + t^{2d}. The maximum R^2 = d+1 is attained only at t = ±1. Hence any φ₁-optimal design on [−1;1] has a moment matrix of rank at most 2 and cannot be feasible for θ, except for the line fit model. The optimal value is v_d(φ₁) = R^2/(d+1) = 1. In our discussions of the polynomial fit models we have managed to study the optimality properties of, not just moment matrices, but designs proper. The following example, trigonometric fit models, even leads to optimal designs for finite sample sizes.
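A small numerical check of the squared-length computation (the degree is an illustrative choice): ||f(t)||^2 on [−1;1] is maximized only at t = ±1, where it equals d+1, and a design supported on ±1 attains the trace bound, so φ₁(M) = 1:

    import numpy as np

    d = 4
    f = lambda t: np.array([t ** k for k in range(d + 1)])
    grid = np.linspace(-1, 1, 2001)
    sqlen = np.array([f(t) @ f(t) for t in grid])
    print(sqlen.max())                          # d + 1 = 5, attained only at t = +-1

    M = 0.5 * np.outer(f(-1), f(-1)) + 0.5 * np.outer(f(1), f(1))
    print(np.trace(M) / (d + 1))                # 1.0 = v_d(phi_1)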

9.16. OPTIMAL DESIGNS FOR TRIGONOMETRIC FIT MODELS

The trigonometric fit model of degree d > 1 has regression function

and a total of k = 2d+1 components for the parameter vector θ (see Section 2.22). The experimental domain is the "unit circle" T = [0;2π). We call a design τⁿ an equispaced support design when τⁿ assigns uniform weight 1/n to each of n equispaced support points on the unit circle [0;2π). The support points of an equispaced support design τⁿ are of the form

where α ∈ [0;2π) is a constant displacement of the unit roots 2πj/n. For these designs we claim the following optimality properties. Claim. Every equispaced support design τⁿ with n ≥ 2d+1 is φ_p-optimal, for all p ∈ [−∞;1], for the full parameter system θ in the set T of all designs on the experimental domain T = [0;2π). The optimal value function strictly increases in p,

from

over


Proof. First we show that the designs τⁿ all share the same moment matrix

The diagonal entries and the off-diagonal entries of M are, with a,b = 1,...,d,

Evaluation of the five integrals (2), (3) and (4)-(6) is based on the two formulas

where m ∈ {1,...,n−1} determines some multiple of 2π. In order to establish (7), we set β = 2πm/n ∈ (0;2π). The complex exponential function provides the relation e^{it} = cos t + i sin t, and we get

Thus both sums in (7) are linear combinations of finite geometric series with quotients q = e^{iβ} and q = e^{−iβ}. The two series sum to (1 − q^n)/(1 − q), and hence vanish, because of q^n = e^{±2πim} = 1. This proves (7). We now return to the integrals (2)-(6). For cos^2(at) in (2) we set cos^2(at) = ½(1 + cos(2at)), and apply (7) to obtain ∫cos^2(at) dτⁿ = ½.


Then sin^2 = 1 − cos^2 gives ∫sin^2(bt) dτⁿ = ½ in (3). The integrals (4), (5) evidently are of the form (7) and vanish. In the integral (6), the sin-cos addition theorem transforms the integrand into cos(at)sin(bt) = ½(sin(at+bt) − sin(at−bt)). Again the integral vanishes because of (7). Therefore every equispaced support design τⁿ has moment matrix M as given by (1). Optimality of the designs τⁿ and the moment matrix M is approached through the normality inequalities. With parameter p ∈ (−∞;1] we have, for all t ∈ [0;2π),

Theorem 7.20 proves φ_p-optimality of the designs τⁿ for p ∈ (−∞;1]; Lemma 8.15 extends optimality to the φ₋∞-criterion. What do we learn from this example? Firstly, there are many designs that achieve the optimal moment matrix M. Hence uniqueness may fail for designs although uniqueness holds true for moment matrices. Secondly, the theory of designs for infinite sample size occasionally leads to a complete solution for the discrete problem of finding optimal designs for sample size n ≥ k = 2d+1. Thirdly, every regression vector f(t) has squared length d+1. Therefore every design is φ₁-optimal for θ in T, illustrating yet another time the poor performance of φ₁. Finally, a single design may be optimal under the matrix means φ_p, for all parameters p ∈ [−∞;1]. This also follows from the symmetry properties that the model enjoys under rotations (compare Section 14.5). The concluding example of this chapter illustrates yet another feature that may occur, namely, that one and the same design remains optimal even under variation of the underlying model.

9.17. OPTIMAL DESIGNS UNDER VARIATION OF THE MODEL

Speaking of designs τ for the full parameter system θ on an experimental domain T, we suppress any explicit reference to the regression function f that determines the statistical model. Of course, any notion of optimality is meaningless unless the model is specified. Optimality of a design usually holds in a specific model only. In most cases the underlying model is clearly understood and no ambiguity arises. There are rare instances where a design remains optimal under a variety of models. For example, on the experimental domain T = [−1;1], consider the design τ which assigns uniform weight 1/3 to the support points −1, 0, 1. This design is φ₀-optimal for θ in the class T of all designs on [−1;1], with respect to each of the following three models I, II, and III.
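A quick numerical check of the claims stated for model I below (the parabola case): the design with weight 1/3 on −1, 0, 1 satisfies the determinant normality inequality f(t)'M^{-1}f(t) ≤ 3 on [−1;1], with φ₀-value 4^{1/3}/3:

    import numpy as np

    f = lambda t: np.array([1.0, t, t * t])     # model I: parabola fit
    support, w = [-1.0, 0.0, 1.0], 1.0 / 3.0

    M = sum(w * np.outer(f(t), f(t)) for t in support)
    Minv = np.linalg.inv(M)

    grid = np.linspace(-1, 1, 2001)
    variance = np.array([f(t) @ Minv @ f(t) for t in grid])
    print(variance.max() <= 3 + 1e-9)           # True: normality inequality for phi_0
    print(np.linalg.det(M) ** (1 / 3), 4 ** (1 / 3) / 3)   # both 0.52913...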


I. The first model has regression function f(t) = (1, t, t^2)' and hence fits a parabola. The design on the regression range X which is induced by τ has support points, moment matrix, inverse moment matrix, and normality inequality given by

Hence the design τ is φ₀-optimal for θ in T. The φ₀-optimal value is 4^{1/3}/3 = 0.52913 (see also Section 9.5). II. The next model has regression function f(t) = (1, sin(½πt), cos(½πt))' and fits a trigonometric polynomial of first degree over the half circle. The induced design has support points, moment matrix, inverse moment matrix, and normality inequality given by

Again τ is φ₀-optimal for θ in T, with value 4^{1/3}/3 = 0.52913. III. The third model has regression function f(t) = (1, e^t, e^{−t})'. The design has support points

With a = 1 + e + e^{−1} = 4.08616 and b = 1 + e^2 + e^{−2} = 8.52439, the moment


matrix of the induced design and its inverse are

where c = 6a^2 + 3b^2 − 2a^2b − 27 = 6.51738 is the determinant of 3M. The normality inequality turns out to hold on [−1;1]. Hence τ is φ₀-optimal for θ in T in the present model as well, and has optimal value c^{1/3}/3 = 0.62264. This chapter has focused on the important special cases of φ_p-optimality for all parameters in the class of all designs. We now return to the Loewner criterion, and investigate a kind of minimum performance requirement that any reasonable design should satisfy, that is, admissibility. While Loewner optimality of a design ξ means that ξ beats every competitor, admissibility prohibits any competitor from being better than ξ. The two notions are distinct because the Loewner ordering is only a partial ordering.

EXERCISES

9.1 Show that the global criterion g of Section 9.2 is an information function on NND(k).

9.2 The following line of reasoning of Guest (1958) provides an alternative to our exposition in Section 9.5 from display (1) onwards.
i. The normality inequality, with equality at the optimal support points t₀, t₁,..., t_d, entails L̇_i(t_i) = 0 for all i = 1,...,d−1.
ii. The polynomial Q(t) = ∏_{k=0}^{d}(t − t_k) satisfies L̇_i(t_i) = Q̈(t_i)/(2Q̇(t_i)). Hence t₁,..., t_{d−1} are zeros of Q̈.
iii. The polynomials Q(t) and (1 − t^2)Q̈(t) have the same zeros. Therefore there exists some c ≠ 0 so that Q solves the polynomial identity

on ℝ.


iv. Represent Q in the power basis, obtain and compare coefficients in (1 − t^2)Q̈(t) with cQ(t). In other words, the polynomial Q which solves the differential equation (*) is unique.
v. Show that (1 − t^2)Ṗ_d(t) solves (*), where P_d is the Legendre polynomial.

9.3 Verify that 2.1e^{−d} is a reasonable fit to the determinant optimal value v_d(φ₀).

9.4 How does the linear dispersion criterion φ_W of Section 9.8 relate to the weighted matrix mean φ₋₁?

9.5 Show that relative to the arcsin distribution A of Section 9.6, the Chebyshev polynomials satisfy ∫T_d T_m dA = 0, 1/2, 1 according as d ≠ m, d = m ≠ 0, d = m = 0.

9.6 In Section 9.11, show that v_{i,d−1−2j} = s_i v_{i,d−2j} and |v_{i,d−1−2j}| = |v_{d−i,d−1−2j}|.

9.7 In the line fit model over the interval [−b;b] of radius b > 0, show that the uniform design on ±b is φ₋∞-optimal for θ in T, with information min{1, b^2}.

9.8 In the quadratic fit model over T = [−√2;√2], show that (i) the unique φ₋∞-optimal design for θ in T assigns weights 1/8, 3/4, 1/8 to −√2, 0, √2, (ii) its moment matrix has smallest eigenvalue 1/2 with multiplicity 2, and eigenvectors z = (−1/√2, 0, 1/√2)' and z = (0, 1, 0)', (iii) the only nonnegative definite trace 1 matrix E satisfying Theorem 7.22 is zz' [Heiligers (1992), p. 33].

9.9 In the dth-degree polynomial fit model over [−1;1], show that the arcsin support design with weights 1/(2d), 1/d,..., 1/d, 1/(2d) is the unique design that is optimal for the highest coefficient θ_d [Kiefer and Wolfowitz (1959), p. 282].

9.10 In the dth-degree polynomial fit model over [−1;1], show that (i) λ_max(M_d(τ)) ≤ d+1 for all τ ∈ T, (ii) sup_{τ∈T} φ_p(M_d(τ)) = d+1 for all p ∈ [1;∞].

9.11 In the trigonometric fit model of Section 9.16, are the equispaced support designs the only φ_p-optimal designs for θ in T?

C H A P T E R 10

Admissibility of Moment and Information Matrices

Admissibility of a moment matrix is an intrinsic property of the support set of the associated design. Polynomial fit models serve as an example. Nevertheless there are various interrelations between admissibility and optimality, a property relating to support points and weights of the design. The notion of admissibility is then extended to information matrices. In this more general meaning, admissibility does involve the design weights, as is illustrated with special contrast information matrices in two-way classification models.

10.1. ADMISSIBLE MOMENT MATRICES

A kind of weakest requirement for a moment matrix M to be worthy of consideration is that M is maximal in the Loewner ordering, that is, that M cannot be improved upon by another moment matrix A. In statistical terminology, M has to be admissible. Let M ⊆ NND(k) be a set of competing moment matrices, and Ξ be a set of designs on the regression range X ⊆ ℝ^k.

DEFINITION. A moment matrix M ∈ M is called admissible in M when every competing moment matrix A ∈ M with A ≥ M is actually equal to M. A design ξ ∈ Ξ is called admissible in Ξ when its moment matrix M(ξ) is admissible in M(Ξ). In the sequel, we assume the set M to be compact and convex, as in the general design problem reviewed in Section 7.10. A first result on admissibility, in the set Ξ of all designs, is Theorem 8.5. If a design η ∈ Ξ has a support point that is not an extreme point of the Elfving set ℛ = conv(X ∪ (−X)), then η is not admissible in Ξ. Or equivalently, if ξ ∈ Ξ is admissible in Ξ, then its support points are extreme points of ℛ. Here is another result that emphasizes the role of the support when it comes to discussing admissibility.


10.2. SUPPORT BASED ADMISSIBILITY

Theorem. Let η and ξ be designs in Ξ. If the support of ξ is a subset of the support of η and η is admissible in Ξ, then ξ is also admissible in Ξ. Proof. The proof is akin to that of Corollary 8.9, by introducing the minimum likelihood ratio α = min{η(x)/ξ(x) : x ∈ supp ξ}. Because of the support inclusion, α is positive. It satisfies η(x) − αξ(x) ≥ 0 for all x ∈ X. Let ζ ∈ Ξ be any competing design satisfying M(ζ) ≥ M(ξ). Then αζ + η − αξ is a design in Ξ, with moment matrix

Here equality must hold because of admissibility of η. Since α is positive, we get M(ζ) = M(ξ). This proves admissibility of ξ. In measure theoretic terms the inclusion of the supports, supp ξ ⊆ supp η, means that ξ is absolutely continuous relative to η. A matrix M ∈ M is said to be inadmissible in M when it is not admissible in M, that is, if there exists another competing moment matrix B ∈ M such that B ≥ M and B ≠ M. In this case, B performs at least as well as M, relative to every isotonic function φ on NND(k). This calls for substituting B in place of M. The improvement can always be achieved with a matrix A ∈ M which is possibly distinct from B but which cannot be improved any further, that is, which is admissible.

10.3. ADMISSIBILITY AND COMPLETENESS

Lemma. Let M ∈ M be a competing moment matrix. If M is inadmissible in M, then there exists an admissible matrix A ∈ M which improves upon M, that is, A ≥ M and A ≠ M. Proof. The subset {B ∈ M : B ≥ M} of M is compact. Hence the trace function attains its maximum over this subset at some A, say. The matrix A is admissible in M. For if C ≥ A, then C lies in the subset. Now trace C ≥ trace A ≥ trace C forces trace(C − A) = 0, and C = A. By construction we have A ≥ M. Since M is inadmissible there exists a matrix B ∈ M with B ≥ M and B ≠ M. Strict monotonicity of the trace yields trace A ≥ trace B > trace M, entailing A ≠ M. In decision theoretic terms, the lemma says that the admissible designs form a complete class, that is, every inadmissible moment matrix in M may


be improved upon, A ≥ M with A ≠ M, where the moment matrix A ∈ M is admissible. Theoretically, the design problem is simplified by investigating the "smaller" subset Ξ_adm of admissible designs, rather than the "larger" set Ξ of all designs. Practically, the design set Ξ_adm and the moment matrix set M(Ξ_adm) may well lack the desirable property of being convex, besides the fact that the meanings of "smaller" and "larger" very much depend on the model. Here are two examples in which every design is admissible, the trigonometric fit model and the two-way classification model. In both models the regression vectors have constant length,

as mentioned in Section 6.5. But if x'x is constant over X, then any two designs ξ and η that are comparable, M(η) ≥ M(ξ), have identical moment matrices, M(η) = M(ξ). This follows from

and the strict monotonicity of the trace function. Thus every moment matrix M ∈ M(Ξ) is admissible in M(Ξ), and admissibility leads to no simplification at all. In plain words, admissibility may hold for lack of comparable competitors. In this respect, polynomial fit models show a more sophisticated structure. We take the space to explicitly characterize all admissible designs in this model. The derivation very much relies on the peculiarities of the model. Only fragments of the arguments prevail on a more general level. We begin with an auxiliary result from calculus.

10.4. POSITIVE POLYNOMIALS AS QUADRATIC FORMS

Lemma. Let P be a polynomial defined on ℝ, of even degree 2d. Then P is positive, P(t) > 0 for all t ∈ ℝ, if and only if there exists a positive definite (d+1) × (d+1) matrix A such that, with power vector f(t) = (1, t,..., t^d)',

Proof. For the direct part, we extract the coefficient c ∈ ℝ of the highest power t^{2d} in P. Because of P > 0, we have c > 0. We proceed by induction on d ≥ 1. If d = 1, then P is a parabola,


with α, β ∈ ℝ and γ = α^2 + β. Because of P > 0, we have β > 0. Thus we get the desired representation P(t) = (1, t)A(1, t)', with

Assuming the representation holds for all positive polynomials R of degree 2(d−1), we deduce the result for an arbitrary positive polynomial P of degree 2d. Because of P > 0, the roots of P are complex and come in conjugate pairs, z_j = α_j + iβ_j and z̄_j = α_j − iβ_j. Each pair contributes a parabola to the factorization of P,

with α_j, β_j ∈ ℝ. Because of P > 0 we have β_j^2 > 0. Thus the factorized form of P is

say. The first polynomial in (1) is Q(t) = b₀ + b₁t + ··· + b_{d−1}t^{d−1} + t^d = (b', 1)f(t), with b = (b₀, b₁,..., b_{d−1})' ∈ ℝ^d. Hence we can write

The second polynomial in (1) is R, say. In this sum, each term is nonnegative, the constant term is positive, and the highest power t^{2(d−1)} has a positive coefficient. Hence R is a positive polynomial of degree 2(d−1). By assumption there is a positive definite d × d matrix B such that

Altogether, (1), (2), and (3) provide the representation P(t) = f(t)'Af(t), with the positive definite matrix

This completes the direct part of the proof. The converse part holds because of f(t) ≠ 0 for all t ∈ ℝ.
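A sketch of the d = 1 step of the construction: for a positive parabola P(t) = c(t^2 + 2αt + γ) with γ = α^2 + β and β > 0, the matrix A = c·[[γ, α], [α, 1]] is positive definite and represents P as f(t)'Af(t); the particular numbers below are illustrative:

    import numpy as np

    c, alpha, beta = 2.0, -0.5, 1.5                  # illustrative values; beta > 0
    gamma = alpha ** 2 + beta
    A = c * np.array([[gamma, alpha], [alpha, 1.0]])
    print(np.all(np.linalg.eigvalsh(A) > 0))         # True: A is positive definite

    grid = np.linspace(-3, 3, 13)
    P = c * (grid ** 2 + 2 * alpha * grid + gamma)   # the positive parabola
    Q = np.array([np.array([1.0, t]) @ A @ np.array([1.0, t]) for t in grid])
    print(np.allclose(P, Q))                         # True: P(t) = f(t)'Af(t)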


10.5. LOEWNER COMPARISON IN POLYNOMIAL FIT MODELS

Again we denote by T the set of all designs on the experimental domain T = [−1;1]. The model for a polynomial fit of degree d ≥ 1 has regression function f(t) = (1, t,..., t^d)'. A design τ ∈ T has moment matrix

Here the Loewner comparison of two moment matrices reduces to a comparison of moments. We claim that only the highest moments can differ, the others must coincide. To this end, we introduce a notation for the initial section of the first 2d−1 moments,

We can now give our claim a succinct form.

Claim. Two designs σ, τ ∈ T satisfy

if and only if the moments of σ and τ fulfill

and

Proof. For the proof of the direct part, we evaluate the quadratic form

With vectors z = (z₀, z₁,..., z_d)' ≠ 0 that have just two nonzero entries, z_i = 1 and z_j = a, we obtain, for i = 0,1,...,d−1,

By induction on i, we show that (1) implies

that is, the consecutive initial sections of the moments of σ and τ coincide. For i = 0 we have μ₀(σ) = 1 = μ₀(τ). Hence (1) forces μ_j(σ) = μ_j(τ) for all


j = 1,...,d, that is, μ_(d)(σ) = μ_(d)(τ). Assuming μ_(d+i−1)(σ) = μ_(d+i−1)(τ), we deduce μ_(d+i)(σ) = μ_(d+i)(τ). The restriction i ≤ d−1 leads to 2i ≤ d+i−1. Hence μ_{2i}(σ) = μ_{2i}(τ), and (1) yields μ_{i+j}(σ) = μ_{i+j}(τ) for all j = i+1,...,d. Adjoining μ_{i+d}(σ) = μ_{i+d}(τ) to the assumed identity of the initial sections of length d+i−1, we obtain μ_(d+i)(σ) = μ_(d+i)(τ). Thus (2) is established. The case i = d−1 in (2) gives μ_(2d−1)(σ) = μ_(2d−1)(τ). Finally (1) yields μ_{2d}(σ) ≥ μ_{2d}(τ), where equality is ruled out because of M_d(σ) ≠ M_d(τ). The proof of the converse is evident since M_d(σ) − M_d(τ) has entries 0 except for the bottom right entry μ_{2d}(σ) − μ_{2d}(τ), which is positive. Thus the proof of our claim is complete. A first admissibility characterization is now available. In a dth-degree model, a design τ ∈ T is admissible in T if and only if it maximizes the moment of highest order, 2d, among those designs that have the same lower order moments as has τ,

This characterization is unwieldy and needs to be worked on. The crucial question is how much the 2dth moment μ_{2d}(σ) varies subject to the restriction μ_(2d−1)(σ) = μ_(2d−1)(τ). If there is no variation, then μ_{2d}(τ) is the unique moment that goes along with μ_(2d−1)(τ), and again admissibility holds for lack of comparable competitors. Otherwise μ_{2d}, given the initial section μ_(2d−1)(τ), is nonconstant, and admissibility is a true achievement.
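A numerical illustration of the claim, in a parabola fit: two designs with identical moments up to order 3 but different fourth moments yield Loewner-comparable moment matrices, the difference sitting only in the bottom-right entry (the design choices here are illustrative):

    import numpy as np

    f = lambda t: np.array([1.0, t, t * t])          # d = 2: moments up to order 4 enter M

    def moment_matrix(support, weights):
        return sum(w * np.outer(f(t), f(t)) for t, w in zip(support, weights))

    M_tau   = moment_matrix([-1.0, 0.0, 1.0], [0.25, 0.5, 0.25])        # mu_2 = 0.5, mu_4 = 0.5
    M_sigma = moment_matrix([-np.sqrt(0.5), np.sqrt(0.5)], [0.5, 0.5])  # mu_2 = 0.5, mu_4 = 0.25

    D = M_tau - M_sigma
    print(np.round(D, 6))                            # only the mu_4 difference is nonzero
    print(np.all(np.linalg.eigvalsh(D) >= -1e-12))   # True: M_tau >= M_sigma, so sigma is inadmissible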

10.6. GEOMETRY OF THE MOMENT SET


The existence of distinct, but comparable moments depends on how the given initial section relates to the set of all possible initial sections,

We call μ_(2d−1)(T) the moment set up to order 2d−1. Its members are integrals,

just as the members of the set M(Ξ) of moment matrices are the integrals ∫_X xx' dξ, with ξ ∈ Ξ. For this reason the arguments of Lemma 1.26 carry over. The moment set μ_(2d−1)(T) is a compact and convex subset of ℝ^{2d−1},


being the convex hull of the power vectors g(t) = (t,..., t^{2d−1})' with t ∈ [−1;1]. The set μ_(2d−1)(T) includes all polytopes of the form conv{0, g(t₁),..., g(t_{2d−1})}. If the points 0 ≠ t₁,..., t_{2d−1} ∈ [−1;1] are pairwise distinct, then the Vandermonde determinant proves the vectors g(t₁),..., g(t_{2d−1}) to be linearly independent,

In this case, the above polytopes are of full dimension 2d−1. Therefore the moment set μ_(2d−1)(T) has a nonempty interior. We now claim the following, for a given design τ ∈ T. Claim. If μ_(2d−1)(τ) lies in the interior of the moment set μ_(2d−1)(T), then there exists a design σ ∈ T satisfying μ_(2d−1)(σ) = μ_(2d−1)(τ) and μ_{2d}(σ) ≠ μ_{2d}(τ).

Proof. The statement has nothing to do with moments, but follows entirely from convex analysis. A change in notation may underline this. Let C = μ_(2d)(T) be the moment set up to order 2d, a convex set with nonempty interior in ℝ^{m+1}, with m = 2d−1. Let D ⊆ ℝ^m be its image under the projection (y,z) ↦ y. We assume that y = μ_(2d−1)(τ) lies in the interior of D. The assertion is that the cut C_y = {z ∈ ℝ : (y,z) ∈ C} contains at least two points. Convex analysis provides the fact that the number z lies in the interior of C_y if and only if the vector (y,z) lies in the interior of C. The latter is nonempty, and hence so is C_y. Thus C_y is a nondegenerate interval and contains another point μ_{2d}(σ), say, besides μ_{2d}(τ). This proves the claim (see also Exhibit 10.1).

10.7. ADMISSIBLE DESIGNS IN POLYNOMIAL FIT MODELS

We are now in a position to describe the admissible designs in polynomial fit models. Claim. For a design τ ∈ T in a dth-degree polynomial fit model on the experimental domain [−1;1], the following three statements are equivalent:

a. (Admissibility) τ is admissible in T.
b. (Support condition) τ has at most d−1 support points in the open interval (−1;1).


EXHIBIT 10.1 Cuts of a convex set. If C ⊆ ℝ^{m+1} is a convex set with nonempty interior and y ∈ ℝ^m is an interior point of the projection D of C on ℝ^m, then the cut C_y = {z ∈ ℝ : (y,z) ∈ C} is a nondegenerate interval.

c. (Normality condition) There exists a positive definite (d+1) × (d+1) matrix N that satisfies

with equality for the support points of τ. Proof. First we prove that part (a) implies part (b). Let τ be admissible. We distinguish two cases. In the first case, we assume that the vector μ_(2d−1)(τ) lies on the boundary of the moment set μ_(2d−1)(T). Then there exists a supporting hyperplane in ℝ^{2d−1}, that is, there exists a vector 0 ≠ h ∈ ℝ^{2d−1} such that

with power vector g(t) = (t,..., t^{2d−1})'. Therefore the polynomial

is nonpositive on [−1;1], and satisfies ∫P(t) dτ = 0. The support points of τ are then necessarily zeros of P. Because P ≤ 0, they actually determine local maxima of P in [−1;1]. The degree of P is at most 2d−1. As h ≠ 0, the


polynomial P is nonconstant. Hence P possesses at most d−1 local maxima on ℝ, and τ has at most d−1 support points in (−1;1). In the second case we assume that μ_(2d−1)(τ) lies in the interior of the moment set μ_(2d−1)(T). Because of admissibility, the moment μ_{2d}(τ) maximizes μ_{2d} over the designs that have initial section μ_(2d−1)(τ), by Section 10.5. Stepping up to order 2d, this puts the vector μ_(2d)(τ) on the boundary of the moment set μ_(2d)(T). Hence there exists a supporting hyperplane in ℝ^{2d}, that is, there exists a vector 0 ≠ h ∈ ℝ^{2d} such that

with enlarged power vector g(t) = (t,..., t^{2d−1}, t^{2d})'. Therefore the polynomial

is nonpositive on [−1;1], and again the support points of τ are local maxima of P in [−1;1]. In order to determine the degree of P, we choose a design σ with distinct but comparable moments. Section 10.6 secures the existence of such designs σ, for the present case. Moreover τ is assumed to be admissible, whence μ_{2d}(σ) < μ_{2d}(τ), by Section 10.5. From

we obtain h_{2d} ≥ 0. If h_{2d} = 0, then 0 ≠ (h₁,..., h_{2d−1})' ∈ ℝ^{2d−1} defines a supporting hyperplane to μ_(2d−1)(T) at μ_(2d−1)(τ), contradicting the present case that μ_(2d−1)(τ) lies in the interior of the moment set μ_(2d−1)(T). This leaves us with h_{2d} > 0. Any polynomial P of degree 2d with highest coefficient positive has at most d−1 local maxima on ℝ. Thus τ has at most d−1 support points in (−1;1). Next we show that part (b) implies part (c). Let t₁,..., t_ℓ be the support points of τ in (−1;1). By assumption, we have ℓ ≤ d−1. If ℓ < d−1, then we add further distinct points t_{ℓ+1},..., t_{d−1} in (−1;1). Now the polynomial

is nonnegative inside [−1;1], nonpositive on the outside, and vanishes at the support points of τ. Therefore the polynomial


is positive on ℝ and of degree 2d. From Lemma 10.4, there exists a matrix N ∈ PD(d+1) such that P(t) = f(t)'Nf(t). In summary, we obtain

with equality for the support points of τ. Finally, we establish part (a) from part (c). Let σ ∈ T be a competing design with M_d(σ) ≥ M_d(τ). Since the support points t of τ satisfy f(t)'Nf(t) = 1, the normality condition (c) yields

Strict monotonicity of the linear form A ↦ trace AN forces M_d(σ) = M_d(τ). Hence τ is admissible, and the proof of the claim is complete. This concludes our discussion of admissible designs in polynomial fit models. The result that carries over to greater generality is the interplay of parts (a) and (c), albeit in the weaker version of Corollary 10.10 only. We may view part (c) as a special instance of maximizing a strictly isotonic optimality criterion φ, defined through φ(A) = trace AN. This points to the broader issue of how optimality theory interacts with admissibility. Generally there are three instances when φ-optimality entails admissibility, depending on whether φ is strictly isotonic, or φ is strictly isotonic only on PD(k), or φ is merely isotonic. The increasing generality of the criteria φ is compensated by appropriate conditions on the admissibility candidate M.

10.8. STRICT MONOTONICITY, UNIQUE OPTIMALITY, AND ADMISSIBILITY

Lemma. Let M ∈ M be a competing moment matrix. In order that M is admissible in M, any one of the following conditions is sufficient:

a. (Strict monotonicity) M is φ-optimal for θ in M, for some strictly isotonic optimality criterion φ.
b. (Nonsingularity and strict monotonicity on PD(k)) M is positive definite and M is φ-optimal for θ in M, for some optimality criterion φ that is strictly isotonic on PD(k).
c. (Unique optimality) M is uniquely φ-optimal for θ in M, for some isotonic optimality criterion φ.


Proof. The argument for part (a) is indirect. If there exists a competitor B ∈ M with B ≥ M and B ≠ M, then strict monotonicity of φ on NND(k) implies φ(B) > φ(M), whence M is not φ-optimal. For part (b), the same reasoning applies to B ≥ M > 0, appealing to the strict monotonicity of φ on PD(k) only. In part (c), every competitor A ∈ M with A ≥ M satisfies φ(A) ≥ φ(M), by monotonicity. Optimality of M entails φ(A) = φ(M), and uniqueness forces A = M. Part (c) leads to the most comprehensive results, with criteria of the form φ = φ₋∞ ∘ C_K. We can even exhibit the coefficient matrices K that best serve this purpose.

10.9. E-OPTIMALITY AND ADMISSIBILITY

Theorem. Let M ∈ M be a competing moment matrix, with a full rank decomposition M = KK'. Then M is admissible in M if and only if M is uniquely φ₋∞-optimal for K'θ in M. Proof. Let r be the rank of M, so that K ∈ ℝ^{k×r}. Then the information matrix of M for K'θ is the identity matrix,

For the direct part, we assume admissibility. Let A ∈ M be φ₋∞-optimal for K'θ in M; such matrices exist by the Existence Theorem 7.13. Optimality of A yields

This means I_r ≤ C_K(A). Pre- and postmultiplication by K and K' give

where A_K is the generalized information matrix for K'θ from Section 3.21. Admissibility forces A = M, showing that M is uniquely φ₋∞-optimal for K'θ in M. The converse follows from part (c) of Lemma 10.8, with φ = φ₋∞ ∘ C_K. As an application, we consider one-point designs ξ(x) = 1. Their moment matrices are of the form M(ξ) = xx'. Such a design is admissible in Ξ if and only if it is uniquely optimal for x'θ in Ξ. The Elfving Theorem 2.14 permits us to check both optimality and uniqueness. Thus the one-point design ξ(x) = 1 is admissible in Ξ if and only if x is an extreme point of the Elfving set ℛ = conv(X ∪ (−X)).
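A minimal sketch of this one-point application in the line fit model: the point (1, 0)' is not an extreme point of the Elfving set, and the one-point design at it is indeed beaten in the Loewner ordering by the uniform design on the corners (1, ±1)', the same competing design used again in Section 10.10 below:

    import numpy as np

    x0 = np.array([1.0, 0.0])                        # one-point design at (1, 0)'
    M_xi  = np.outer(x0, x0)
    M_eta = 0.5 * np.outer([1.0, -1.0], [1.0, -1.0]) + 0.5 * np.outer([1.0, 1.0], [1.0, 1.0])

    D = M_eta - M_xi
    print(np.round(D, 6))                            # diag(0, 1): nonnegative definite and nonzero
    print(np.all(np.linalg.eigvalsh(D) >= 0))        # True: the one-point design at (1,0)' is inadmissible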


Another application occurs in the proof of the following theorem, on optimality relative to the information functions φ_N(A) = trace AN. These trace criteria are isotonic if N ≥ 0, and strictly isotonic if N > 0 (see Section 1.11). Without loss of generality, we take N ≠ 0 to be scaled in order to satisfy trace MN = 1.

10.10. T-OPTIMALITY AND ADMISSIBILITY

Corollary. Let M ∈ M be a competing moment matrix. For N ∈ NND(k), let φ_N be the optimality criterion given by φ_N(A) = trace AN.

a. (Necessity) If M is admissible in M, then M is φ_N-optimal for θ in M for some nonnegative definite k × k matrix N with trace MN = 1.
b. (Sufficiency) If M is φ_N-optimal for θ in M for some positive definite k × k matrix N with trace MN = 1, then M is admissible in M.

Proof. We deduce part (a) from Theorem 10.9. For any full rank decomposition M = KK', the matrix M is φ₋∞-optimal for K'θ in M, with optimal value λ_min(C_K(M)) = 1. By Theorem 7.21, there exist a matrix E ≥ 0 with trace E = 1 and a generalized inverse G of M, such that N = GKEK'G' ≥ 0 satisfies

In particular, M is <f>N-optimal for 6 in M. Part (b) follows from part (a) of Lemma 10.8. The merits of the theorem lie in its transparent geometric meaning. Given the matrix 0 ^ N e Sym(/c), the projection of a matrix A E Sym(fc) onto the one-dimensional subspace (N) = {aN : a e R} is

Hence if M maximizes A H- trace AN, then M has a longest projection onto (N) among all competitors A e M. The necessary condition (a) and the sufficient condition (b) differ in whether the matrix N points in any direction of the cone NND(fc), N > 0, or whether it points into the interior, N > Q (see Exhibit 10.2). Except for the standardization trace M N = 1, the normality inequality of Section 7.2 offers yet another disguise of <fov-optimality:

In the terminology of that section, the matrix N is normal to M at M.

10.10. T-OPTIMALITY AND ADMISSIBILITY

259

EXHIBIT 10.2 Line projections and admissibility. The matrices M\ and MI have a longest projection in the direction N^ > 0, but only MI is admissible. The matrices A/2 and MI are admissible, but only M2 has a longest projection in a direction N > 0.

Relative to the full set M(H), we need to refer to the regression vectors jc e X only. The design is (j>N -optimal for 6 in H if and only if

with equality for the support points of . This is of the same form as the normality condition (c) of Section 10.7. The geometry simplifies since we need not argue in the space Sym(fc) of symmetric matrices, but can make do with the space Rk of column vectors. Indeed, N induces a cylinder that includes the regression range X, or equivalently, the Elfving set Tl (see Section 2.10). Neither condition (a) nor (b) can generally be reversed. Here are two examples to this effect. Let HI be the set of all designs in the line fit model of Section 2.20, with regression range X\ {1} x [-1; 1]. Choosing N = (QQ), we find x'Nx = 1 for all jc e X\. Hence the one-point design (Q) =1 is <f>Noptimal, even though it is inadmissible. For instance, the design 17 (_\) = rj(!) = \ has a better moment matrix,

Therefore the converse of condition (a) fails to hold in this example. A variant of this model is appropriate for condition (b). We deform the regression range X\ by rounding off the second and fourth quadrant,

260

CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

In the set Ha of all designs on X2, the design (Q) = 1 is uniquely optimal for

in Ha and hence admissible, by the Elfving Theorem 2.14 and by part (c) of Theorem 10.7. But the only matrix

satisfying

is the singular matrix N = (J|j). Indeed, we have a = (1,0)N(J) = 1. Nonnegative definiteness yields y > /32. Inserting x = (j), we get 1 + 2j8 + y < 1, that is, y < -2/3. Together this means

From Now we exploit the curved boundary of and Siihriivisinn hv we obtain eives '. As Jt? tends 1 tends giving 2 From (1), we get Thus N is necessarily singular. The and converse of condition (b) is false, in this setting (see Exhibit 10.3). The results of the present corollary extend to all matrix means <f>p with

10.11. MATRIX MEAN OPTIMALrTY AND ADMISSIBILITY Corollary. Let M e M be a competing moment matrix and let 0 ^ p

a. (Necessity) If M is admissible in M and N > 0 satisfies trace AN < 1 = trace MN, for all A e M, then M is </>p-optimal for HK'B in M where K is obtained from a full rank decomposition MNM = KK' and where H = (C*(Af ))<1+^/Grt. b. (Sufficiency) If M is uniquely <jHp-optimal for K'O in M., for some full column rank coefficient matrix K, then M is admissible in M.

10.11. MATRIX MEAN OPTIMALITY AND ADMISSIBILITY

261

EXHIBIT 10J Cylinders and admissibility. Left: the one-point design (J) = 1 is ^-optimal, but inadmissible over the line fit regression range X\. Right: the same design is admissible over the deformed regression range X^, but is ^-optimal for the singular matrix N = (^Q), only.

Proof. We recall that in part (a), admissibility of M entails the existence of a suitable matrix N', from part (a) of Corollary 10.10. The point is that N is instrumental in exhibiting the parameter system HK'6 for which the matrix M is <p-optimal. Suppose, then, that the k x r matrix K satisfies MNM = KK', where r is the rank of MNM. From range AT = range MNM C range M, we see that M is feasible for K'6. Hence C = CK(M) is a positive definite r x r matrix, and we have

For the positive definite matrix H = C(1+p)/(2p), we obtain

with q conjugate to p. Therefore the primal and dual objective functions take the value

The Duality Theorem 7.12 shows that M is ^-optimal for HK'6 in M, and

262

CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

that N is an optimal solution of the dual problem. Part (b) follows from part (c) of Lemma 10.8. If the hypothesis in part (a) is satisfied by a rank 1 matrix N = hh', then MNM = cc', with c = Mh, and M is optimal for c'6 in M. Closely related conditions appear in Section 2.19 when collecting, for a given moment matrix M, the coefficient vectors c so that M is optimal for c'6. 10.12. ADMISSIBLE INFORMATION MATRICES The general optimality theory allows for parameter subsystems K'6, beyond the full parameter vector 0. Similarly admissibility may concentrate on information matrices, rather than on moment matrices. The requirement is that the information matrix C^(Af) is maximal, in the Loewner ordering, among whichever competing information matrices are admitted. To this end, we consider a subset C C C/^(M(H)) of information matrices for K'6 where, as usual, the coefficient matrix K is taken to be of full column rank. An information matrix C#(M) e C is called admissible in C when every competing information matrix CK(A) C with CK(A) > C#(A/) is actually equal to C/^(M). A design e H is said to be admissible for K'6 in H C H when its information matrix C#(M()) is admissible in C/c(A/(E)). Some of the admissibility results for moment matrices carry over to information matrices, such as Section 10.8 to Section 10.11, others do not (Theorem 10.2). We do not take the space to elaborate on these distinctions in greater detail. Instead we discuss the specific case of the two-way classification model where admissibility of contrast information matrices submits itself to a direct investigation. 10.13. LOEWNER COMPARISON OF SPECIAL C-MATRICES In the two-way classification model, let the centered contrasts of the first factor be the parameter system of interest. For the centered contrasts to be identifiable under an a x b block design W, the row sum vector r = Wlb must be positive. Then the special contrast information matrix Ar rr' is Loewner optimal in the set T(r) of designs with row sum vector equal to r, by Section 4,8. The first step is to transcribe the Loewner comparison of two special contrast information matrices into a comparison of the generating row sum vectors. We claim the following. Claim. In dimension a > 2, two positive stochastic vectors f, r e Ra satisfy

10.13. LOEWNER COMPARISON OF SPECIAL C-MATRICES

263

if and only if the components fulfill, for some / < a,

Proof. The major portion of the proof is devoted to showing that, under the assumption of (1), conditions (2) and (3) are equivalent to

Indeed, with K' = (Ka,Q) and utilizing generalized information matrices as in (1) of Section 3.25, condition (4) means (M(ts'))K > (M(rs'))K. By Lemma 3.23, this is the same as K'M(ts')~K < K'M(rs')~K. Insertion of the generalized inverse G of Section 4.8 yields an equivalent form of (4),

The a x (a - 1) matrix //, with iih row -l^-i while the other rows form the identity matrix 7 fl _i, satisfies KaH = H and H(H'H)~1H' = Ka. This turns (5) into //'A'1// < H'k~lH. Upon removing the ith component from t to obtain the shorted vector 7= ( f i , . . . , r,-_i, tM,..., ta)', we compute //'A,"1// = A f ~ ] -i- (l/ti)la-.\lg__r With f defined similarly, we may rearrange terms to obtain

Because of (1) the factor 1/f, - 1/r, is positive. It is a lower bound to the elements l/r; l/r; of the diagonal matrix A^1 A f ~\ whence (6) entails Ar 1 > A^1. Moreover, in view of the Schur complement Lemma 3.12, we see that (6) is equivalent to

Another appeal to Lemma 3.12, with the roles of the blocks in (7) interchanged, yields

264

CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

This is merely another way of expressing conditions (2) and (3). Hence (4) is equivalent to (2) and (3). Only a few more arguments are needed to establish the claim. For the direct part we conclude from that t ^ r, whence follows (1), and then (4) leads to (2) and (3). For the converse, (1), (2), and (3) imply (4). Equality in (4) is ruled out by (1). If we assume A,-f?' = A r rr', then we trivially ge Since (1) secures / ^ r, there exists some k < a such that tk > rk and

We have more than two subscripts, a > 2, and so conditions (2) and (8) cannot hold simultaneously. Hence A, - tt' ^ Ar - rr', and the proof is complete. Thus we have A, tt' ^ Ar - rr' if and only if, with a single exception (1), the components of t are larger than those of r (2), and the discrepancies jointly obey the quantitative restriction (3). This makes it easy to compare two special contrast information matrices through their generating row s,um vectors, and to attack the problem whether comparable row sum vectors exist. 10.14. ADMISSIBILITY OF SPECIAL C-MATRICES For the set of special contrast information matrices

the question is one of admissibility of a particular member Ar - rr' in the class C. Claim. For a positive stochastic vector r Rfl in dimension a > 2, the matrix Ar - rr' is admissible in C if and only if r, < \ for all / = 1,..., a. 'Proof. We prove the negated statement, that ' is inadmissible in C if and only if r, > \ for some /. If Ar - rr' is inadmissible, then there exists a positive stochastic vector t e Ra such that . Consideration of the diagonal elements yields t} - tj > r; - r?, for all j. From condition (1) of Section 10.13, we get r, < r/, for some /. This is possible only if r, > \ and >/[l-r,,r(-). For the converse, we choose /, e [1 - r,,r/) and define

Then t (t\,...,ta)' is a positive stochastic vector satisfying (1) and (2) of

10.15. ADMISSIBILITY, MINIMAXITY, AND BAYES DESIGNS

265

Section 10.13. It also fulfills (3) since f, > 1 r, yields

By Section 10.13 then This completes the proof.h

whence Ar - rr' is inadmissible.

In view of the results on Loewner optimality of Section 4.8, we may summarize as follows. A block design W e T is admissible for the centered contrasts of factor A if and only if it is a product design, W = rs', and at most half of the observations are made on any given level / of factor A, r, < \ for all i < a. It is remarkable that the bound \ does not depend on a, the number of levels of factor A. The mere presence of the bound \ is even more surprising. Admissibility of a design for a parameter subsystem K '6 involves the design weights. This is in contrast to Theorem 10.2 on admissibility of a design for the full parameter vector 6 which exclusively concentrates on the design support. 10.15. ADMISSIBILITY, MINIMAXITY, AND BAYES DESIGNS The notion of admissibility has its origin in statistical decision theory. There, a parameter 6 e @ specifies the underlying model, and the performance of a statistical procedure T is evaluated through a real-valued function 6 H- R(6,T], called risk function. The terminology suggests that the smaller the function R(-,T), the better. This idea is captured by the partial ordering of smaller risk which, for two procedures T\ and T2, is defined through pointwise comparison of the risk functions,

It is this partial ordering to which admissibility refers. A procedure T\ is called admissible when every competing procedure T2 with T2 < T\ has actually the same risk as T\. Otherwise, T\ is inadmissible. Decision theory then usually relates admissibility to minimax procedures, that is, procedures that minimize the maximum risk,

Alternatively, we study Bayes procedures, that is, procedures that minimize some average risk,

266

CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

where TT is a probability measure on (an appropriate sigmafield of subsets of) 6. In experimental design theory, the approach is essentially the same except that the goal of minimizing risk is replaced by one of maximizing information. A design is evaluated through its moment matrix M. The larger the moment matrix in the Loewner ordering, the better. The Loewner ordering is a partial ordering that originates from a pointwise comparison of quadratic forms,

where Sk = {x Rk : \\x\\ = 1} is the unit sphere in IR*. Therefore, the difference with decision theoretic admissibility is merely one of orientation, of whether an improvement corresponds to the ordering relation <, or the reverse ordering >. With this distinct orientation in mind, the usual decision theoretic approach calls for relating admissibility to maximin designs, that is, designs that maximize the minimum information,

This is achieved by Theorem 10.9. Alternatively, we may inquire into some average information.

where TT is a probability measure on 5^. This is dealt with in Corollary 10.10, with a particular scaling of N. From the analogy with decision theory, a design may hence be called a Bayes design when it maximizes a linear criterion (j>N(A) = trace AN. However, there is more to the Bayes approach than its use as a tool of decision theory. In the arguments above, TT conveys the experimenter's weighting of the various directions x e Sk. More generally, the essence is to bring to bear any prior information that may be available. Other ways of how prior information may permeate a design problem are conceivable, and each of them may legitimately be called Bayes, EXERCISES 10.1 10.2 Show that M is admissible in M if and only if HMH' is admissible in HMH1, for all H e GL(fc). In the parabola fit model over [-1; 1], a design r is optimal for BQ+ fa in T if and only if ftdr = 0 and /1 2 dr = | Which of the optimal .

EXERCISES

267

designs 1652].

with such that are admissible in T? [Kiefer and Wolfowitz (1965), p.

10.3 Show that M is admissible in M if and only if M is uniquely optimal for (CK(M})l'2K'9 in M, where K provides a full rank decomposition MNM = KK' for some N > 0 such that trace AN < 1 = trace MN for all A e M. 10.4 Show that is admissible for K'9 in H if and onlv if is admissible for for all
and

10.5 Verify that violate 10.6

fulfill (;i)-(3) of Section 10.13, but

(continued) Show that H (Ia-\,-la-i)' satisfies KaH = H and H(H'H)~1H' = Ka.

C H A P T E R 11

Bayes Designs and Discrimination Designs

The chapter deals with how to account for prior knowledge in a design problem. These situations still submit themselves to the General Equivalence Theorem which, in each particular case, takes on a specific form. In the Bayes setting, prior knowledge is available on the distribution of the parameters. This leads to a design problem with a shifted set of competing moment matrices. The same problem emerges for designs with bounded weights. In a second setting, the experimenter allows for a set of m different models to describe the data, and considers mixtures of the m information functions from each model. The mixing is carried out using the vector means 3>p on the nonnegative orthant R. This embraces as a special case the mixing ofm parameter subsets or ofm optimality criteria, in a single model. A third approach is to maximize the information in one model, subject to guaranteeing a prescribed efficiency level in a second model. Examples illustrate the results, with special emphasis on designs to discriminate between a second-degree and a third-degree polynomial fit model.

11.1. BAYES LINEAR MODELS WITH MOMENT ASSUMPTIONS An instance of prior information arises in the case where the mean parameter vector 6 and the model variance a2 are not just unknown, but some values for them are deemed more likely than others. This is taken into account by assuming that 6 and o2 follow a distribution, termed the prior distribution, which must be specified by the experimenter at the very beginning of the modeling phase. Thus the parameters 6 and a2 become random variables, in addition to the response vector Y. The underlying distribution P now determines the joint distribution of Y, 6, and a2. Of this joint distribution, we utilize in the present section some conditional moments only, as a counterpart to the classical linear model with moment assumptions of Section 1.3.

268

11.1. BAYES LINEAR MODELS WITH MOMENT ASSUMPTIONS

269

The expectation vector and the dispersion matrix of the response vector Y, conditionally with 6 and or2 given, are as in Section 1.3:

The expectation vector and the dispersion matrix of the mean parameter vector 0, conditionally with a2 given, are determined by a prior mean vector do 6 Rk, a prior dispersion matrix RQ NND(A:), and a prior sample size n0 > 1, through

Finally, a prior model variance a^ > 0 is the expected value of the model variance cr2,

Assumptions (1), (2), and (3) are called the Bayes linear model with moment assumptions. Assumption (2) is the critical one. It says that the prior estimate for 6, before any sampling evidence is available, is OQ and has uncertainty (o-2/n0)/?oOn the other hand, with the n x k model matrix X being of full column rank k, the Gauss-Markov Theorem 1.21 yields the sampling estimate 9 = (X'X)~1X'Y, with dispersion matrix (a2/n)M~l, where as usual M = X'X/n. Thus, uncertainty of the prior estimate and variability of the sampling estimate are made comparable in that the experimenter, through specifying nQ, assesses the weight of the prior information on the same per observation basis that applies to the sampling information. The scaling issue is of paramount importance because it touches on the essence of the Bayes goal, to combine prior information and sampling information in an optimal way. If the prior information and the sampling information are measured in comparable units, then the optimal estimator is easily understood and looks good. Otherwise, it looks bad. With a full column rank k x s coefficient matrix K, let us turn to a parameter system of interest K'6, to be estimated by T(Y) where T maps from R" into Rs. We choose a matrix-valued risk function, called mean squared-error matrix,

Two such matrices are compared in the Loewner ordering, if possible. It is possible, remarkably enough, to minimize the mean squared-error matrix among all affine estimators AY + b, where the matrix A e Rsxn and the shift

270

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

b e Rs vary freely. The shift b is needed since, even prior to sampling, there is a bias

unless BQ = 0. Any affine estimator that achieves the minimum mean squarederror matrix is called a Bayes estimator for K'B. 11.2. BAYES ESTIMATORS Lemma. In the Bayes linear model with moment assumptions, let the prior dispersion matrix jR0 be positive definite. Then the unique Bayes estimator for K'6 is K'8, where

and the minimum mean squared-error matrix is

Proof. I. We begin by evaluating the mean squared-error matrix for an arbitrary affine estimator T(Y) = AY + b. We only need to apply twice the fact that the matrix of the uncentered second moments, provided it exists, decomposes into dispersion matrix plus squared-bias matrix. That is, for a general random vector Z we have E/>[ZZ'] = D/>[Z] + (E/>[Z])(E/>[Z])'. Firstly, with Z = AY + b K'B, we determine the conditional expectation of ZZ' given B and a2,

where B AX-K'. Secondly with Z = BB + b, the conditional expectation given a2 is

Integrating over a2, we obtain the mean squared-error matrix of AY + b,

11.2. BAYES ESTIMATORS

271

K'OQ-AXBQ.

II. The third term, (BBo + b)(B8Q + b)', is minimized by b = -B00 =

III. The key point is to minimize the sum of the first two terms which, except for oj, is S(A) = AA1 + (l/no)(AX -K')R0(AX -K')'. With A = K'(n0RQl +X'X)-1X', we expand S(A + (A-A)). Among the resulting eight terms, there are two pairs summing to 0. The other four terms are rearranged loyiGldS(A) = S(A) + (A-A)(In+nQlXR0X')(A-AY > 5(1), with equality only for A A. IV. In summary, with & = K'^-AXOo = K'(nQR-1 + X'X^noR^Oo, the unique Bayes estimator for K'6 is AY + b = K'O, with 6 as in (1) and mean squared-error matrix as in (2). The Bayes estimator B does not yet look good because the terms which appear in (1) are too inhomogeneous. However, we know that the sampling estimate 6 satisfies the normal equations, X'Xd = X'Y, and so we replace X'Y by X'X9. This exhibits 9 as a combination of the prior estimate &Q and the sampling estimate 6. In this combination, the weight matrices can also be brought closer together in appearance. We define

to be the prior moment matrix which, because of (2) of Section 11.1, is properly scaled. The corresponding scaling for the sampling portion is X'X = nM. With all terms on an equal footing the Bayes estimator looks good:

It is an average of the prior estimate 6fo and the sampling estimate B, weighted by prior and sampling per observation moment matrices M0 and A/, and by prior and experimental samples sizes n0 and n. The weights sum to the identity, (/ioM) + nAf )~1n0A/o + (o^o + nM)~lnM = Ik. The mean squarederror matrix of the Bayes estimate K'd is

It depends on the Bayes moment matrix Ma which is defined to be a convex combination of prior and sampling moment matrices,

272

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

with a specifying the sampling weight that the experimenter ascribes to the observational evidence that is to complement the prior information. The Bayes design problem calls for finding a moment matrix M such that the mean squared-error matrix (4) is further minimized. As in the general design problem of Section 5,15, we switch to maximizing the inverse mean squared-error matrix, (K'M~lK)~l = CK(Ma), and recover the familiar information matrix mapping CK. This mapping is defined on the closed cone NND(&), whence we can dispense with the full rank assumption on MO = RQ l . As a final step we let a vary continuously in the interval [0; 1]. 11.3. BAYES LINEAR MODELS WITH NORMAL-GAMMA PRIOR DISTRIBUTIONS In this section we specify the joint distribution of Y, 6, and o2 completely, as a counterpart to the classical linear model with normality assumption of Section 1.4. The joint distribution is built up from conditional distributions in three steps. The distribution of Y, conditionally with 6 and tr2 given, is normal as in Section 1.4:

The distribution of 0, conditionally with a2 given, is also normal:

where the prior parameters are identical to those in (2) of Section 11.1. The distribution of the inverse of a2 is a gamma distribution:

with prior form parameter o > 0 and prior precision parameter /3o > 0. Assumptions (1), (2), and (3) are called a Bayes linear model with a normalgamma prior distribution. Assumption (2) respects the scaling issue, discussed at some length in Section 11.1. In (3), we parametrize the gamma distribution F^^, in order to have Lebesgue density

with expectation oo/A) and variance 2a0//3o- These two moments may guide the experimenter to select the prior distribution (3). For OQ > 2, the moment

11.4. NORMAL-GAMMA POSTERIOR DISTRIBUTIONS

273

of order -1 exists:

say. Solving for OQ we obtain 0 = 2 + /3/0-Q. The parametrization F2+ft)/(72.A) is more in line with assumption (3) of Section 11.1, in that the parameters are prior model variance ofi and prior precision j3o- Note that Fy;1 is the ^-distribution with / degrees of freedom. The statistician cannot but hope that the normal-gamma family given by (2) and (3) allows the experimenter to model the available prior information. If so, they profit from the good fortune that the posterior distribution, the conditional distribution of 6 and a2 given the observations Y, is from the same family.

11.4. NORMAL-GAMMA POSTERIOR DISTRIBUTIONS Lemma. In the Bayes linear model with a normal-gamma prior distribution, let the prior dispersion matrix RQ be positive definite. Then the posterior distribution is the normal-gamma distribution given by

where 6 is the Bayes estimator (1) of Section 11.2, and where the posterior precision increase is

with mean Proof. With z = l/o-2, the posterior distribution has a Lebesgue density proportional to the product of the densities from assumptions (1), (2), and (3) of Section 11.3,

274

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

where the quadratic form Q(B) involves )8i from (3),

With gamma density 'Xa0+/j;#)+0i(z) as given in Section 11.3, (1) and (2) follow from

Lemma 3.12 entails nonnegativity of ft, as a Schur complement in the nonnegative definite matrix

The determination of the expected value of fii parallels the computation of the mean squared-error matrix in part I of the proof of Lemma 11.2,

The posterior distribution testifies to what extent the experimental data alter the prior opinion as laid down in the prior distribution. More evidence should lead to a firmer opinion. Indeed, for the mean parameter vector 0, the posterior distribution (1) is less dispersed than the prior distribution (2) from Section 11.3,

and for the inverse model variance l/<r2, the precision parameter in the posterior distribution (2) exceeds that of the prior distribution (3) in Section 11.3, A) + ft > A)-

11.5. THE BAYES DESIGN PROBLEM

275

The prior and posterior conditional distributions of 6 are so easily compared because the common conditioning variable a2 appears in both dispersion matrices in the same fashion and cancels, and because (4) is free of the conditioning variable Y. The situation is somewhat more involved when it comes to comparing the prior and posterior distributions of 1 /a2. The posterior precision increase Pi depends on the observations y, whence no deterministic statement on the behavior of ft is possible other than that it is nonnegative. As a substitute, we consider its expected value, E/>[ft] = na^, called preposterior precision increase. This quantity is proportional to the sample size , but is otherwise unaffected by the design and hence provides no guidance towards its choice. For the design of experiments, (1) suggests minimizing (no/?^1 + X'X)~l. With a transition to moment matrices, R^1 = MQ and X'X = nM, this calls for maximization of the Bayes moment matrices

In summary, the Bayes design problem remains the same whether we adopt the moment assumptions of Section 11.1, or the normal-gamma prior distributions of Section 11.3. 11.5. THE BAYES DESIGN PROBLEM Given a prior moment matrix MQ NND(/c) and a sampling weight a G [0; 1], the matrix Ma(g) = (1 - a)MQ + aM(g) is called the Bayes moment matrix of the design e H. The Bayes design problem is the following:

A moment matrix M that attains the maximum is called Bayes <j> -optimal for K'O in M. If there is no sampling evidence, a = 0, then there is only one competitor, MQ, and the optimization problem is trivial. If all the emphasis is on the sampling portion, a = 1, then we recover the general design problem of Section 5.15. Hence we assume a e (0; 1). The criterion function M \-> <f> o CK((l a)MQ + aM) fails to be an information function for lack of homogeneity. As a remedy, we transform the set M. of competing moment matrices into

The set Ma inherits compactness and convexity from M. If M intersects the feasibility cone A(K), then so does Ma, by Lemma 2.3. Now, in the

276

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

formulation

the Bayes design problem is just one manifestation of the general design problem of Section 5.15. That is, the general design problem is general enough to comprise the Bayes design problem as a special case. As an illustration, we consider the question of whether optimal Bayes moment matrices for K'0 are necessarily feasible. From a Bayes point of view, the prior moment matrix MQ often is positive definite; then the set Ma is included in the open cone PD(/c). Hence by part (a) of the Existence Theorem 7.13, all formally ^-optimal moment matrices for K'B in Aia are automatically feasible for K'B. Thus, for the Bayes design problem, the existence issue is much less pronounced. As another example, we present the Bayes version of the General Equivalence Theorem 7.14. 11.6. GENERAL EQUIVALENCE THEOREM FOR BAYES DESIGNS Theorem. Assume that a prior moment matrix M0 NND(fc) and a sampling weight a e (0; 1) are given. Let the matrix M e M be such that the Bayes moment matrix B = (1 a)M0 + aM is feasible for K'0, with information matrix C - CK(B). Then M is Bayes ^-optimal for K'0 in M if and only if there exists a nonnegative definite s x s matrix D that solves the polarity equation

and there exists a generalized inverse G of B such that the matrix N = GKCDCK'G' satisfies the normality inequality

In the case of optimality, equality obtains in the normality inequality if for A we insert any matrix M e M that is Bayes $-optimal for K'0 in M. Proof. Evidently M is Bayes <-optimal for K '0 in M if and only if B is </>-optimal for K'0 in Ma = {(1 - a)MQ + aA: Ae M}. To the latter problem, we apply the General Equivalence Theorem 7.14. There, the normality inequality is

11.7. DESIGNS WITH PROTECTED RUNS

277

This is equivalent to a trace AN < 1 - (1 - a) trace MQN, for all A e M. Because of trace BN 1, the right hand side is trace (B - (1 - a)Mo)N a trace MN.lna trace AN < a trace MN we cancel a to complete the proof.

Three instances may illustrate the theorem. For a matrix mean <j>p with p e (-00; 1), the matrix M e M is Bayes <f>p-optimal for K'B in M if and only if some generalized inverse G of B = (1 a)Mo + aM satisfies

Indeed, in the theorem we have N = GKCp+lK'G'/irace Cp, and the common factor trace Cp cancels on both sides of the inequality. Second, for scalar optimality, M e M is Bayes optimal for c'B in M if and only if there exists a generalized inverse G ot B = (l-a)A/o + aM that satisfies

Third, we consider the situation that the experimenter wants to minimize a mixture of the standardized variances c'Ma(g)~c where c is averaged relative to some distribution /A. With W JR* cc'rf/x e NND(fc), this calls for the minimization of a weighted trace, trace WM a ()~. An optimal solution is called Bayes linearly optimal for 6 in M. This is the Bayes version of the criterion that we discussed in Section 9.8. Again it is easy to see that the problem is equivalent to finding the <_i-optimal moment matrix for H'6 in Ma, where W = HH' is a full rank decomposition. Therefore M e M is Bayes linearly optimal for B in M if and only if some generalized inverse G of B = (1 - a)M0 + aM fulfills

The optimization problem for Bayes designs also occurs in a non-Bayes setting.

11.7. DESIGNS WITH PROTECTED RUNS Planned experimentation often proceeds in consecutive steps. It is then desirable to select the next, new stage by taking into account which design 0 was used in the previous, old stage. Let us presume that n0 observations were taken under the old design 0- If the new sample size is n and the new design is , then joint evaluation of all n0 + n observations gives rise to the

278

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

per observation moment matrix

where a = n/(n$ + n] designates the fraction of observations of the new experiment. The complementary portion 1 - a consists of the "protected" experimental runs of the old design & Therefore an optimal augmentation of the old design 0 is achieved by maximizing a criterion of the form (f> o C/i:(Af a ()). This optimization problem coincides with that motivated by the Bayes approach and Theorem 11.6 applies. Another way of viewing this problem is that the weights of the old design provide a lower bound a(x) = &(*) fr the weights of the new design . There may also be problems which, in addition, dictate an upper bound b(x). For instance, it may be costly to make too many observations under the regression vector x. For the designs with weights bounded by a and b, the General Equivalence Theorem 7.14 specializes as follows. 11.8. GENERAL EQUIVALENCE THEOREM FOR DESIGNS WITH BOUNDED WEIGHTS Theorem. Assume a[a\b] = { e H : a(x) < (x) < b(x) for all x e X} is the set of designs with weights bounded by the functions a, b : X > [0; 1]. Let e H[a;6] be a design that is feasible for K'B, with information matrix C = C/c(A/()). Then is ^-optimal for K'O in E[;fc] if and only if there exists a nonnegative definite s x s matrix D that solves the polarity equation

and there exists a generalized inverse G of M(g) such that the matrix N = GKCDCK'G' satisfies the normality inequality

Proof. The set M(H [;&]) of moment matrices is compact and convex, and intersects A(K) by the assumption on . Hence the General Equivalence Theorem 7.14 applies and yields the present theorem but for the normality inequality

We prove that (1) and (2) are equivalent. First we show that (1) entails (2). For any two vectors y, z 6 X with g(y) and (z) lying between both bounds, the inequalities in (1) go either way and

11.8. GENERAL EQUIVALENCE THEOREM FOR DESIGNS WITH BOUNDED WEIGHTS

279

become an equality. Hence there exists a number c > 0 such that we have, for all x X,

For a competing design 17 e E[a;b], let x\,... ,xe be the combined support points of 17 and . We have the two identities

the latter invoking 1 = trace M()N. Splitting the sum into three terms and observing a(x) < 7j(jt) < b(x), we obtain (2) from

Conversely, that (2) implies (1), is proved indirectly. Assuming (1) is false there are regression vectors y,z 6 X with (v) < b(y) and (z) > a(z) fulfilling e = y 'Ny -z'Nz > 0. We choose 8 > 0 and define 17 in such a way that

Therefore 17 is a design in H[; b], with moment matrix M(17) = M()+d(yy 'zz'). But trace M(rj)N = l + 8e > 1 violates the normality inequality (2). D As an example we consider a constraint design for a parabola fit model, in part II of the following section. We place the example into the broader context of model discrimination, to be tackled by> different means in Section 11.18 and Section 11.22.

280

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

11.9. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, I With experimental domain [-1;1], suppose the experimenter hopes that a second-degree polynomial fit model will adequately describe the data. Yet it deems desirable to guard against the occurrence of a third-degree term. This calls for a test, in a third-degree model, of 03 = 0 versus #3 ^ 0. If there is significant evidence for 03 not to vanish, then the third-degree model is adopted, with parameter vector 0(3) = (0o> #i ? #2> 03)'- Otherwise a seconddegree model will do, with parameter vector 0(2) = (0o, 0i> flz)'The experimenter proposes that the determinant criterion fe is appropriate for evaluating the designs, in the present situation. The statistician advises the experimenter that there are many designs that provide information simultaneously for 03, 0(3), and 0(2). The efficiencies of the following designs are tabulated in Exhibit 11.1 in Section 11.22, along with others to be discussed later. The efficiencies are computed relative to the optimal designs, so these are reviewed first. i. In a third-degree model, the optimal design for the individual component 03 = e3'0(3) minimizes T ^ e^M^(r)e^. The solution is the arcsin support design r from Section 9.12, with weights, third-degree moment matrix, and inverse given by

Dots indicate zeros. The optimal information for 03 is 1/16 = 0.0625. ii. In a third-degree model, the ^-optimal design for the full vector 0(3) maximizes T i-+ (det M3(T))1/4. The solution T is shown in Exhibit 9.4 in Section 9.6,

11.9. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, I

281

The <fo-optimal information for 0(3) is 0.26750. Hi. In a second-degree model, the <o-optimal design r for the vector 0(2) maximizes r H- (det M^r))1/3, and is mentioned in Section 9.5, r(l) = T(0) = 1/3. The (^-optimal information for 0(2) is 0.52913. Under this design, in a third-degree model, neither the vector 0(3) nor the component 03 are identifiable. I. An allocation with some appeal of symmetry and balancedness is the uniform design T on five equispaced points,

It has respective efficiencies of 72%, 94%, and 84% for 03, 0(3), and 0(2). Of course, there is no direct merit in the constant spacing. The high efficiencies of r are explained by the fact that the support set T = (1, 1/2,0} is the union of the second-degree and third-degree arcsin support, as introduced in Section 9.6. We call T the arcsin support set for the discrimination between a second-degree and a third-degree model. II. As an alternative, half of the observations are drawn according to the old design T from the previous paragraph while the other half r is adjoined in a <fo-optimal way for 0(2), that is, f solves the maximization problem r H- (det M2(\r+ |r))1/3. The resulting design r = \(r + r) is

282

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

This design has respective efficiencies of 42%, 89%, and 94% for #3, 0(3), and 0(2). To see that r is the <fo-optimal augmentation of the old design T, we represent it as TW = |(T + T) with r(l) = w, ?(0) = l-2w. That is, the new part T is a symmetric design supported by the second-degree arcsin support points 1,0, and placing weight 2w on the boundary {1}. In the one-parameter subclass {TW : w (0; 1/2)} we might use differential calculus to maximize the objective function w ^ <^(M2(TW,)) = \<fa(M2(r} + M2(T)). However, reverting to calculus means ignoring the achievements of the General Equivalence Theorem which, after all, is derived from nothing but (sub)differential calculus. The design TW has second-degree moment matrix M(w) and its inverse are given by

where d = w + 17/80 - (w + 1/4)2. It is a member of the class T[a; 1] of designs with bounded weights, with lower bound function a(t) = 1/10 for t = 1,1/2,0 and a(t) = 0 elsewhere. Theorem 11.8 states that z'M(w)~lz is constant for those z = f(t} for which the new weight r(t) is positive. Our conjecture that one of the designs rw solves the problem requires us to consider positivity of r(t) for the points t 1,0. That is, the sum of all entries of (M(vf))~ 1 must be equal to the top left component. This leads to the equation w 2 -H>/6 = 11/120. The relevant solution is w = (1+^/71/5)712 = 0.3974. For the design T = T#, the moment matrix M(w] and its inverse are

The lower bound a(t) is exceeded only at / = 1,0 and, by construction of iv, we have f(t) 'M(w)~lf(t) 3.2. All weights r(t) stay below the constant

11.10. MIXTURES OF MODELS

283

upper bound 1. Therefore the normality inequality of Theorem 11.8 becomes

The inequality holds true by the very contruction of w. Namely, the polynomial P ( t ) = f ( t ) ' M ( w ) - l f ( t ) = 3.2-5.3f2+5.3f4 satisfies P(0) == P(l) = 3.2 and has a local maximum at 0. Thus P is on [-1; 1] bounded by 3.2, and optimality of the design r is established. Alternatively we could have utilized Theorem 11.6. There are other ways to find designs that perform equally well across various models. We explore two of them, to evaluate a mixture of the information obtained from each model and, from Section 11.19 on, to maximize the information in a first model, within a subclass of designs that achieve a prescribed efficiency in a second model. 11.10. MIXTURES OF MODELS The problem of maximizing mixed information from different models is relevant when the experimenter considers a number of potential models to describe the data, and wants to design the experiment to embrace each model in an efficient manner. Hence suppose that on a common experimental domain T, we are given m different regression functions,

Eventually the task is one of finding a design on T, but again we first concentrate on the optimization problem as it pertains to the moment matrices. With model dimensions equal to k\,... ,km, we introduce the cone

in the Euclidean space Sym(A:i x x km} Sym(A:i) x x Sym(km). The scalar product of two members A = (Ai,...,Am) and B = (Bi,...,Bm) in Sym(A;1 x x km} is the sum of the trace scalar products, (A, B) = trace A\B\ + + trace AmBm. In the space Sym(A:i x x km), the cone NND(i x x km) is closed and convex; for short we write A > 0 when A e NND(*i x - . x km). In the i th model, the moment matrix M, is evaluated through an information function //,. Compositions of the form <fr o CK. are but special manifestations of \l/i. This gives rise to the compound function

284

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

Thus ^(M) = (tj/i(Mi),..., i]/m(Mm))' is the vector of information numbers from each of the m models, and takes its values in the nonnegative orthant ra? = [0;ooT. To compress these m numbers into a single one, we apply an information function <J> on R, that is, a function

which is positively homogeneous, superadditive, nonnnegative, nonconstant, and upper semicontinuous. The prime choices are the vector means 4>p with p e [-00; 1], from Section 6.6. Thus the design problem is the following,

where the subset M^ C NND(fci x x km) is taken to be compact and convex. Increasing complexity to formulate the design problem does not necessarily mean that the problem itself is any more difficult than before. Here is a striking example. Suppose the regression function / = (/i, ...,/*)' has k components, and the experimenter is uncertain how many of them ought to be incorporated into the model. Thus the rival models have moment matrices MI arising from the initial sections /(,) of /,

In the /th model, the coefficient vector of interest is e, = (0,... ,0,1)' G R', in order to find out whether this model is in fact of degree /. The criterion becomes a Schur complement in the matrix A A/, by blocking off AH = Mi-i,

see Lemma 3.12 and its proof. If the information from the k models is averaged with the geometric mean 3>0 n ^+then tne criterion turns into

11.11. MIXTURES OF INFORMATION FUNCTIONS

285

Thus we simply recover determinant optimality in the model with regression function /, despite the challenging problem formulation. Generally, for maximizing 4> o tf/ over M^m\ little is changed compared to the general design problem of Section 5.15. The underlying space has become a Cartesian product and the optimality criterion is a composition of a slightly different fashion. Other than that, its properties are just the same as those that we encountered in the general design problem. The mapping ty is an Um-valued information function on NND(&i x x km), in that it is positively homogeneous, superadditive, nonnegative, nonconstant, and upper semicontinuous, relative to the componentwise partial ordering A > 0 that the cone IR induces for vectors A in the space Rm. Therefore its properties are exactly like those that the mapping CK enjoys relative to the Loewner ordering C > 0 on the space Sym(s') (see Theorem 3.13). With the present, extended terminology, the information matrix mapping CK is an instance of a Sym(s)-valued information function on NND(/c). The following lemma achieves for the composition <I> o if/ what Theorem 5.14 does for the composition </> o CK, in showing that $ o ty enjoys all the properties that constitute an information function on NND(A;1 x x km), and in computing its polar.

11.11. MIXTURES OF INFORMATION FUNCTIONS Lemma. Let <J> be an information function on IR, and let if/i,...,if/m be information functions on NND^),... ,NND(/:m), respectively. Set ty ( A i , . . . , \ltm)' and tf/ = (i/^ 0 0 ,..., ty)'. Then 4> o if/ is an information function on NND(A:1 x x km), with polar function (4> o if/)00 4> o i^00. Proof. The same steps o-v as in the proof of Theorem 5.14 show that < = <J> o i/r is an information function. Also the polarity relation is established in a quite similar fashion, based on the representation

With level sets {tf/ > A} - {A > 0 : ^r(A) > A}, for all A R, the unit level set of the composition < o tf/ is

For all B e KNO^ x - x km) and A e R?, we get

286

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

The last line uses inf /1 . e ( l/ , i > A .}( J 4,,fi,) = A,-^/30 (/?/). This is so since for A, > 0 we have {^/ > A,} = A,{^ > 1}, while for A, = 0 both sides are 0. Finally we apply (1) to $ o tfr and then to 4> to obtain from (2) and (3), for all B > 0,

The grand assumption of Section 4.1, Mr\A(K) / 0, ensures that for some moment matrix M e A4, the information matrix mapping CK(M) falls into PD(s), the interior of the cone NND(s) where an information function <f> is defined. For the present criterion 4> o &, the analoguous grand assumption requires that for some tuple M M^m\ we have i/f,(A/,) > 0 for all / < m, that is, at least one setting leads to positive information in each model. For such a tuple M the vector ^(M) is positive, and hence lies in the interior of the nonnegative orthant IR where <E> is defined.

11.12. GENERAL EQUIVALENCE THEOREM FOR MIXTURES OF MODELS Theorem. For the problem of Section 11.10, let M = (M 1 ,...,M m ) e M^ be a tuple of moment matrices with fc(M/) > 0 for all / < m. Then M maximizes <I> o ^(A) over A e M^ if and only if for all i <m there exist a number a, > 0 and a matrix Af, NND(&,) that solve the polarity equations

11.12. GENERAL EQUIVALENCE THEOREM FOR MIXTURES OF MODELS

287

and that jointly satisfy the normality inequality

Proof. Just as in the Subgradient Theorem 7.4, we find that M e M^ maximizes the function $ o ^r : NND(/C! x x km) * IR over .M(m) if and only if there is a subgradient B of 4> o t[/ at M that is normal to M^ at M. Theorem 7.9, adapted to the present setting, states that a subgradient B exists if and only if the tuple N = B/(3> o ^r)(M) satisfies the polarity equation for <f> o if/,

where we have set ities for 4> and for /*-, yield

The polarity inequal-

Thus (4) splits into the conditions

The numbers a, = trace A/,Af, > 0 sum to 1, because of (5). If a, > 0, then we rescale (7) by I/a, to obtain (2); conversely, (2) implies (7) with Ni = a,-W/. If a, = 0, then ^(Af/) > 0 forces ^(ty) = 0 in (7), and AT/ = 0 is (another) particular solution to (5), (6), and (7). With AT,- = a,-Af,-, (5), (6), and (7) are now seen to be equivalent to (1) and (2). The normality of B to M^ at M then translates into condition (3). The result features all the characteristics of a General Equivalence Theorem. The polarity equation (1) corresponds to the component function 4>; it serves to compute the weighting a\,..., am. The set of polarity equations in (2) come with the individual criteria ^,, they determine the matrices Ni. The normality inequality (3) carries through a comparison that is linear in the competing tuples A e M^m\ It evaluates not the m normality inequalities

288

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

trace AiNi < 1 individually, but their average with respect to the weighting a,. More steps are required than before, but each one alone is conceptually no more involved than those met earlier. The vector means <J>P and the compositions fa = 4>i CKi let the result appear yet closer to the General Equivalence Theorem 7.14.

11.13. MIXTURES OF MODELS BASED ON VECTOR MEANS

Theorem. Consider a vector mean 4>p of finite order, p e (-oo;l], and compositions of the form fa <& o C#., with information functions <fr on NND(s/), for all i < m. Let the tuple (Mi,..., Af m ) e M(m) be such that for all i < m, the moment matrix M, is feasible for K/6, with information matrix Q^C^Mt). Then (Mi,... ,M) maximizes 4>p(<h o C/^i), ...,<f>mo CKm(Am)) over (Ai,... ,Am) e A1(w) if and only if for every / < m there exists a matrix D/ e NND(s/) that solves the polarity equation

and there exists a generalized inverse G, of Af, such that

jointly satisfy the normality inequality

Proof. For a vector mean 4>p, the solution of the polarity equation is given by the obvious analogue of Lemma 6.16. With this, property (1) of Theorem 11.12 requires a/i/^Af,) to be positive and positively proportional to (fa(Mi)Y~l. This determines the weighting a,. The polarity equations (2) of Theorem 11.12 reduce to the present ones, just as in the proof of the General Equivalence Theorem 7.14. We demonstrate the theorem with a mixture of scalar criteria, across m different models. Let us assume that in the ith model, a scalar parameter system given by a coefficient vector c, e IR*( is of interest. The design T in T is taken to be such that the moment matrix A/, = JT fj(t)fj(t)' dr is feasible, MI A(ci). For the application of the theorem, we get

11.14. MIXTURES OF CRITERIA

289

Therefore (Mlt...tMm) 6 X (w) makes ^((c^fci)- 1 ,...,^^-^)- 1 ) a maximum over M^ if and only if for all / < m there exists a generalized inverse G, of M,- such that, for all (>4i,... ,Am) <E .M(w),

The discussion of Bayes designs in Section 11.6 leads to the average of various criteria, not in a set of models, but in one and the same model. This is but a special instance of the present problem.

11.14. MIXTURES OF CRITERIA We now consider the situation that the experimenter models the experiment with a single regression function

but wishes to implement a design in order to take into account m parameter systems of interest AT/0, of dimension st, evaluating the information matrix CK.(M) by an information function fa on NND(s,). This comprises the case that a single parameter system K'B is judged on the grounds of m different criteria <f>i,..., <f>m. With the composite information functions fa = fa o CK. on NND(fc), we again form the compound function

Carrying out the joint evaluation with an information function 4> on R, the problem now is one of studying a mixture of information functions, utilizing an information function <I> on R:

where the set M C NND(A;) of competing moment matrices is compact and convex. Upon defining the set M(m) C NND(A: x x A;) to be M(m) = { ( M , . . . , M ) : M M}, the problem submits itself to the General Equivalence Theorem 11.12.

290

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

11.15. GENERAL EQUIVALENCE THEOREM FOR MIXTURES OF CRITERIA Theorem. For the problem of Section 11.14, let M G M be a moment matrix with ^(M) > 0 for all / < m. Then M maximizes <1> o i f / ( A , . . . ,A) over A e M if and only if for all / < m there exist a number a, > 0 and a matrix Af, e NND() that solve the polarity equations

such that the matrix N

satisfies the normality inequality

Proof.

The result is a direct application of Theorem 11.12.

Again the result becomes more streamlined for the vector means <&p, and compositions ifa = <, o CK,. In particular, it suffices to search for a single generalized inverse G of M to satisfy the normality inequality. 11.16. MIXTURES OF CRITERIA BASED ON VECTOR MEANS Theorem. Consider a vector mean 4>p of finite order, p (-00; 1], and compositions of the form fa = fa o #., with information functions fa on NND(s/), for all i < m. Let M e M be a moment matrix that is feasible for K/6, with information matrix C, = C/^.(M). Then M maximizes 4>p(<fo o CKl(A),...,<j>m o CKm(A)] over A 6 M if and only if for all / < m there exists a matrix >,- e NND(s,) that solves the polarity equation

and there exists a generalized inverse G of M such that the matrix

satisfies the normality inequality

11.16. MIXTURES OF CRITERIA BASED ON VECTOR MEANS

291

Proof. Theorem 11.13 provides most of the present statement, except for a set of generalized inverses G l 5 . . . , Gm of M. We need to prove that there is a single generalized inverse G of M that works for all / < ra. We define the matrix N = ,.<, ,-#,-, with Nf = G^C/D/C/f/G/ as in the proof of Theorem 11.13. Our argument is akin to that of establishing part (a) of Corollary 10.11. Let r be the rank of MNM, and choose a full rank decomposition MNM = KK'. We have trace K'M~K = trace MN 1. We show that M is </>_!-optimal for K'6 in M, with optimal value

Optimality follows from the Mutual Boundedness Theorem 7.11. The matrix N is feasible for the dual problem. From (K'M-K)2 = K'M'MNMM'K = K'NK, we get (K'NK}1'2 = K'M~K. Hence the dual criterion takes the value

Now </>_!(C K (M)) = 1/4>(K'NK) shows that M is <_roptimal for K'B in M. From Theorem 7.19, there exists a generalized inverse G of M such that

This is the normality inequality we wish to establish. On the left hand side we obtain

The right hand side is trace AT'M K = trace MN = 1. As in Section 11.13, we demonstrate the theorem with a mixture of scalar criteria, this time in one and the same model. Again let the design T in T be feasible for the scalar parameter systems c-6 for all / < m. Then the maximum of <bp((c{A~ci)~l,...,(c!nA~cm)~l) over A e M is attained at M e M if and only if there exists a generalized inverse G of M such that

292

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

where W = For the harmonic mean 4>_i, we get W = where p is the uniform weighting on the points C i , . . . ,cm; this is inequality (3) of Section 11.6 with a = 1. 11.17. WEIGHTINGS AND SCALINGS In general we admit an arbitrary information function <f> on IR, to average the information from m distinct sources. Specifically, with p = 1,0, -1, the vector means 4>p comprise the arithmetic mean, the geometric mean, and the harmonic mean. However, it is always the arithmetic mean that is utilized to average the m individual normality inequalities trace AjNi < 1, albeit with a weighting a, which is generally not uniform. The reason is that the normality inequality, pertaining to (sub)gradients and derived from (sub)differential calculus, is intrinsically linear. For averaging subgradients, a method other than a linear one has no place in the theory. For a vector mean 4>p and compositions fa = <, o CK., the weights are

Only the geometric mean <S>0 leads to a constant and uniform weight a, = 1/m. Otherwise, the weights a/ vary with the value of the criterion function. For example, in the case of the harmonic mean, p = -1, two models with information <f>\(Ci) = 1 and <k.(C2) = 9 yield a\ = 0.9 and a2 0.1. That is, the emphasis is to improve upon models with comparatively little information. The weighting a/ reflects a relative scaling among the optimality criteria fa = Sensible scaling is a challenging issue when it comes to combining information that originates from different sources. We have emphasized this in Section 11.1, in the context of Bayes modeling, and it is also relevant for mixtures of models. The matrix means <j>p on NND(s) are standardized, (f>p(Is) = 1. This is convenient, and unambiguous, as long as a single model is considered. For the purpose of mixing information across various models, this standardization is purely coincidental and likely to be meaningless. It is up to the experimenter to tell which scaling is appropriate for the situation. One general method is to select the geometric mean <J>0. This is the only vector mean which is positively homogeneous (of degree 1/m) separately in each variable. Hence a comparison based on <J>0 is unaffected by scaling. As seen above, the averaging of the normality inequalities is then uniform, expressing scale invariance in another, dual way. An alternate solution is to scale each model by its optimal value. In other words, the information criterion <fo o CKi(Mi) is substituted by the efficiency
<t>i CK..

11.18. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, II

293

criterion fa o C/c;,(M/)/u,, where u, = ma\MiM. fa o CK.(Mi). This method is computationally expensive as it requires the m optimal values v, to be calculated first. 11.18. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, II We continue the discussion of Section 11.9 of finding efficient designs for discriminating between a second-degree and a third-degree model, for a polynomial fit on [!;!]. I. The geometric mean <0 of the ^-criterion for 0(2) m tne second-degree model and the (fo-criterion for 0(3) in the third-degree model is

The design r which maximizes this criterion is

Again dots indicate zeros. The respective efficiencies for 63, 0(3), and 0(2) are 66%, 98%, and 91%. The value of the optimality criterion is 0.35553. In order to verify optimality, we refer to Theorem 11.13. It is convenient to work with indices / = 2,3, for the second-degree and third-degree models. For j = 2, we get G2 - M2(r)-\ K2 = 73, C2 = M 2 (r), D2 = M2(r)/3, and a2 1/2. Similar quantities pertain to / = 3. Hence the normality inequality turns into P(t) < 1 for all t [-1; 1], where P is given by

294

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

Since the polynomial P attains the value 1 in 1 and in A/17/117, and has local maxima at -v/17/117, it is bounded by 1 on [-1; 1]. Therefore the design r maximizes the geometric mean of <^(M2(r}) and <fo(M 3 (T)), on the experimental domain T = [-!;!]. II. As an alternative, we propose the design that maximizes the same criterion but on the five-point arcsin support set T = {1, 1/2,0} of Section 11.9 (I):

Its respective efficiencies for #3, #(3), and 0(2) are 64%, 96%, and 90%. The criterion takes the value 0.34974 which is 98% of the maximum value of part (I). The efficiencies are excellent even though the design is inadmissible, as seen in part (b) of Section 10.7. To compute the design and verify its optimality, we again utilize Theorem 11.13. We guess the optimality candidate to be symmetric, r(l) = w, r(l/2) = M, r(0) = 1 - 2w - 2u. The inverse moment matrices of second-degree and third-degree are

with j th moment ftj 2w + 21 ', for / = 2,4,6, and with subblock determinants d = 1*4 - /t| and D = n,2^(> - A^- These matrices define the polynomial P on the left hand side of the normality inequality. If t = 0 rightly belongs to the optimal support, then we must have P(0) = 1, providing a relation for u in terms of w. If t = 1 is another optimal support point, then P(\) = I leads to an equation that implicitly determines w. In summary, we obtain

11.18. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, II

295

From this, the weights w = 0.279 and u = 0.164 are computed numerically. The polynomial P becomes P(t) = I + 0.7&2 - 3.89r4 + 3.11f6. Now P(0) = P(l) = P(l/2) = 1 proves optimality, on the arcsin support set T {1, 1/2,0}. III. Another option is to stay in the third-degree model and consider there the two parameter systems #3 = e^O^) and 0(2) = K'B^. The geometric mean of the information for 63 and of the (^-information for 0(2) is

The design that maximizes this criterion over the five-point support set T {1, 1/2,0} is

The respective efficiencies for #3, 6^, and 0(2) are 100%, 94%, and 75%. The full efficiency for #3 indicates that the present design is practically the same as the optimal design from part (i) in Section 11.9. The derivation of the design follows the pattern of part (II), except for employing Theorem 11.16. Again we conjecture the optimal design to be symmetric. Let its moments be ja 2 ,/Lt 4 ,/x6, and again set d = 1*4 /i| and D n2fi6 - JJL^ The information for 63 is C\ = D/^, while the information matrix C2 for 0(2) and the matrix N of Theorem 11.16 are

Let the associated polynomial be P(t) = ( l , f , f 2 , f 3 ) W ( l , f , f 2 , f 3 ) ' . If t 0 is an optimal support point, then we have P(0) = 1, entailing a relation to express u in terms of w. On the other hand, P(l) = 1 yields an

296

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

equation that implicitly determines w. Thus we get

The resulting values w = 0.19 and u 0.44 are not feasible because the sum 2w + 2u exceeds 1. Hence t = 0 cannot be an optimal support point. This leaves us with the relation u = \ - w. From P(l) = 1, we determine w = 0.168, and hence the design T. Its moments yield the polynomial P(t) = 0.50 + 5.04r2 - I4.73t4 + 10.19r6. Now P(0) = 0.50 < 1 = P(l/2) = P(l) establishes the optimality of r, on the arcsin support set

r={i, 1/2,0}.

A different approach is not to average the criteria from the m models, but to optimize only a few, subject to side conditions on the others. A reasonable side condition is to secure some prescribed levels of efficiency. 11.19. DESIGNS WITH GUARANTEED EFFICIENCIES Finally we maximize information in one model subject to securing some prescribed efficiencies in other models. With the notation of Section 11.10, let vi maxMA/((m) ^i(Mi) be the largest information in the ith member of M = (Mi,...,M;,..., Mm) e M(m) under criterion //,, and let ,- (0; 1) be the efficiencies that the experimenter wants to see guaranteed. Thus interest is in designs with moment matrices M,- fulfilling ift(Afj) > /i, for all / < m. With A, = fiji/i, we define the vector A = ( A i , . . . , \m)' e IR, and introduce the level set of the function ^ = (^,..., $ m )' : NND(fci x - x km) - R m ,

Thus the set of competing moment matrices has shrunk to M^m) n {^r > A}. Of course, we presume that the set still contains some moment matrices that are of statistical interest. We wish to maximize the information in the m th model while at the same time observing the efficiency bounds in the other models:

Since the criterion if/m is maximized, we take its efficiency bound to be 0, \m = 0. For / < m, the bounds only contribute to the problem provided they are positive and stay below the optimum, A, 6 (0;i>,). Furthermore we

11.20. GENERAL EQUIVALENCE THEOREM FOR GUARANTEED EFFICIENCY DESIGNS

297

assume that there exists a moment matrix A G M^ that satisfies the strict inequalities tf/i(Ai) > A,- for all / < m. This opens the way to redrafting the General Equivalence Theorem 7.14 using Lagrange multipliers. The ensuing normality inequality refers to the set M.^ itself, rather than to the unwieldy intersection with the level set {tf/ > A}. 11.20. GENERAL EQUIVALENCE THEOREM FOR GUARANTEED EFFICIENCY DESIGNS Theorem. For the problem of Section 11.19, let M = (Mi,...,Mm) G M^ n {& > A} be a tuple of moment matrices that satisfies the efficiency constraints. Then M maximizes \lim(Am) over A e M^ n {tf/ > A} if and only if for all i < m, there exist a number a/ > 0 and a matrix N, G NND(A:,), with a,-0i(A//) = a/A, for / < m and am 1 - ]T,<W a, > 0, that solve the polarity equations

and that jointly satisfy the normality inequality

Proof. The General Equivalence Theorem 7.14 carries over to the compact and convex set M^ n [tj/ > A} and the criterion function </>(A) Aw(^w)- It then states that M is optimal if and only if there exists a tuple N G NND(fci x x km} that solves the polarity equation ^(M)^(N) = X)/<m trace M,yv, = 1 and that satisfies the normality inequality Y,i<mtrace ^i^i < 1 for all A G X (w) n {^f > A}. This happens if and only if there is a matrix Nm G NND(fcOT) with ^m(Mm)^(Nm) - trace MmNm = 1 that satisfies

That (2) implies (3) is seen as follows. For

we have

Then am > 0 yield

trace AiNi =

trace AmNm +

trace AiNi < 1 and

298

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

Conversely, (3) leads to the additional polarity equations in (1) by viewing (3) as an optimization problem in its own right. For / < m, we introduce for the ith level set {i/r, > A,}, the concave indicator function g,(A), with values 0 or oo according as ^i(Ai) > A, or not. Furthermore we define gm(A) = trace AmNm. Then (3) means that M maximizes the objective function g(A) = ^< OT gj(A) over the full set M^m\ Convex analysis teaches us that this happens only if there exists a subgradient B of g at M that is normal to M^ at M. The assumptions set out in Section 11.19 ascertain that the subgradient has the form B = (u\B\,..., w m _iJ3 m _i,N m ), with Lagrange multipliers (Kuhn-Tucker coefficients) M, > 0 satisfying H/j/^Af,) = w/A,, and where /?, 6 Sym(A:,) is a subgradient of </f, at A// provided M, is positive. In the latter case, we have <A,(A//) = A, > 0 and Theorem 7.9 provides the representation 5, = /f,(M/)Af, with Nj solving (1). Finally the Lagrange multipliers w, are transformed to yield the coefficients a,. With (M,B) = 1 + ]C l<OT Mi'/'/(Af/) > 0, this is achieved by setting a,- = iil-0l-(MJ-)/(M,B> for / < m and am = 1/{M,B). The result is close to Theorem 11.12 both in content and format, for the reason that there exist Lagrange multipliers MI, ... , OT _i that make the present problem equivalent to maximizing ^m(Am) + )/ <m w, (^/(-A,) A,) = <I> o ^r(A) ]/ <m MjAj, sav over l^e unrestricted set M^m\ A prime application is the discrimination between two rival models. For this task, the theorem specializes as follows. 11.21. MODEL DISCRIMINATION Theorem. Consider two models i 1,2 with criteria of the form </>, o CK., where Kf is a A:, x s, matrix of full column rank and </>, is an information function on NND(s,). Assume that every member (Ai,A2) Ai(2) for which AI maximizes <fc o CK2 violates the given efficiency bound A for the first model, <fa(CKl(Ai)) < A. Let the pair (A/!,M2) Ai(2) be such that Af, is a member of the feasibility cone A(Ki), with information matrix C, = CK.(Mi), for i = 1,2, of which Q fulfills ^(Q) > A. Then (Mi,Af 2 ) maximizes <fc o CK2(A2) over those (A\,A2) M with 0i CKI(A\) > A if and only if <fo(Ci) = A and for / = 1,2 there exists a matrix DI e NND(5) that solves the polarity equation

and there exists a generalized inverse G, of A// such that, for some a G (0; 1), the matrices Nt = G/ATjCjD/Cj/f/G/ jointly satisfy the normality inequality

11.22. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, III

299

Proof. In Theorem 11.20 we have a} > 0. Otherwise the problem is solved by some pair (A\,A2) for which A2 maximizes <fo C%2 and, by assumption, this is not so. With a = a\ > 0 and 1 a = a2 > 0 the theorem follows. While these concepts are appealing they are available only at an additional computational expense. The optimality characterization involves a novel equation, <j>\(Ci) A, but there is also a further parameter, a, that enters into the normality inequality. 11.22. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, III For the last time, we turn to the setting discussed in Section 11.9 of discriminating between a second-degree and a third-degree polynomial model, for which in Section 11.18 we have found some optimal mixture designs. I. In order to find the design with largest (fo-information for 0(2) in the second-degree model, among the designs that guarantee 50% efficiency for 03 in the third-degree model, we must maximize <fo(M 2 (T)) subject to e3'M3~(T)e3 < 32. This is so since 50% of the optimal information for 63 is 1/32, by Section 11.9 (i). The solution is the design

As before, dots indicate zeros. The respective efficiencies for 63, 0(3), and 0(2) are 50%, 93%, and 94%. We use Theorem 11.21 to verify optimality. The two matrices

enter into the definition of the polynomial P in the normality inequality.

300

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

From P(l) = 1, we obtain a = 0.074, giving P(t) = 0.98 + 0.52f2 - 2.88f4 + 2.38f6. Now P(l) = P(0.3236) = 1 and ^(0.3236) = 0 imply that on the interval [1;1] the polynomial P is bounded by 1. This establishes the desired optimality property of r. II. As a last example, we take the same criterion as in part (I), but again restrict attention to the arcsin support set T. We obtain the design

The respective efficiencies for 63, 0(3), and 0(2) are 50%, 92%, and 93%. In order to prove optimality, we apply Theorem 11.21, essentially repeating the steps of part (II) in Section 11.18. The efficiency constraint, /x2/Z> = 32, gives u in terms of w. The matrices

together with a e (0;1), determine the polynomial P(t) on the left hand side of the normality inequality. From P(0) = 1, we get a formula for a. In summary, we obtain

where w is given implicitly through P(l) = 1. With the resulting weights w = 0.292 and u = 0.123, and a = 0.086, we get P(t) = 1.00+0.69/2-3.43/4+2.75f6: Thus J(l) = P(l/2) = P(0) = 1 establishes optimality on T, on the arcsin support set T = {1, 1/2,0}. The designs we have encountered are tabulated in Exhibit 11.1. The apparent multitude of reasonable designs for the purpose of model discrimination again reflects the fact that there is no unique answer when

11.22. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, III

301

Section

Design for polynomial fit on [1, 1]

(^-Efficiencies for
#3 %) %)

11.9 (i) 11.9 (ii) 11.9 (iii) 11.9 (I) 11.9 (II) 11.18 (I) 11.18 (II) 11.18 (III) 11.22 (I) 11.22 (II)

Optimal for fa, (value 0.0625) <fo-optimal for 0(3) (value 0.26750) 0o-optimal for 0(2) (value 0.52913) Uniform, on the arcsin support set T = {1, \,Q} Half (^-optimally augmented for 0(2) <I>o-optimal for 0(2) and <i>o-optimal for 0(2) and <t>o-optimal for 0(2) and in a single third-degree 0(3) 0(3), on T 03, on T, model

0.85 0 0.72 0.42 0.66 0.64 1.00

0.93 1 0 0.94 0.89 0.98 0.96 0.94 0.93 0.92

0.75 0.87 1 0.84 0.94 0.91 0.90 0.75 0.94 0.93

(fo-optimal for 0(2), 50% efficient for 03 <fo-optimal for 0(2), 50% efficient for 03, on T

0.5 0.5

EXHIBIT 11.1 Discrimination between a second- and a third-degree model. The design efficiency for the individual component 63 is evaluated in the third-degree model, as is the (^-efficiency for the full parameter vector 6^ = (00, 6\, #2,63)' The ^-efficiency for 0(2) = (Oo,Qi,&2.)' is computed in the second-degree model.

the question is to combine information that arises from various sources. We have restricted our attention to the criteria that permit an evident statistical interpretation, that is, which are mixtures of the form ^ o iff. The scope of the optimality theory extends far beyond these specific compositions. It would cover all information functions on the product cone 1C = NND(fci x - x km). Theorem 5.10 depicts this class geometrically, through a one-to-one correspondence with the nonempty closed convex subsets that are bounded away from the origin and recede in all directions of 1C. Clearly, 1C carries many more information functions than any one of its component cones NND(fc,). Little use seems to flow from this generality. In contrast, the generality of the General Equivalence Theorem 7.14 is useful and called for. It covers mixtures <1> o if/, just as it applies to any other information function <f>. It allows for quite arbitrary sets of competing moment matrices A4, as long as they are compact and convex. We have met the sets Ma for Bayes designs, M(E[0;b]) for designs with bounded weights, M(m) for mixtures of models, {(M,..., A f ) : M e M] for mixtures of criteria, and M^ n [iff > A} for designs with guaranteed efficiencies. Of course, growing problem complexity entails increased labor to find the optimal designs. Even when the optimal design is obtained, it is usually valid only for infinite sample size, as pointed out in Section 1.24. That is, its weights

302

CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

need to be rounded off to conform to a finite sample size n. This is the topic of the following chapter. EXERCISES 11.1 Fill in the details of the proof of />[&] = ncr^ in Section 11.4. 11.2 Show that the one-point design in x = S((l - a)M0 + alk)~lc, with 8 such that ||jc|| 1, is Bayes optimal for c'0 in H, over the unit ball X = {x R* : ||*|| < 1} [Chaidner (1984), p. 293]. 11.3 (continued] Show that if

then the one-point design in x = (0.98,0.14,0.14)' is Bayes optimal for 61 in H, with minimum mean squared-error 1.04. In comparison, the one-point design in c = (1,0,0)' has mean squared-error 1.09 and efficiency 0.95. 11.4 (continued) Show that, with W > 0 and 8 = ( + (!-) trace M)/ trace Wl/2 > 0, the matrix M - (l/a)(6W1/2 - (1 - a)MQ) is positive definite provided a is close to 1 or Af0 is close to 0. Use a spectral decomposition M Y%=\ wixix! to define a design (*,) = w, in H. Show that is Bayes linearly optimal for 6 in H, with minimum average mean squared-error equal to the trace of Wl/2/8.

11.5 (continued) Show that if

then the design which assigns weights 0.16, 0.44, 0.40 to

is Bayes linearly optimal for 0 in E, with minimum average mean squared-error 21.7. In comparison, the design which assigns to the Eu-

EXERCISES

303

clidean unit vectors e\,e^ei the weights 0.2,0.4,0.4 has average mean squared-error 21.9 and efficiency 0.99. 11.6 Use del M3 = C\ del M2 in the examples (I) and (II) of Section 11.18 to verify <J>o(<fo(M2), <fo(M3)) = Ci1/8<fo(M2)7/8. Comment on how the resulting designs reflect the heavy weighting of <fo(Af2).

C H A P T E R 12

Efficient Designs for Finite Sample Sizes

Methods are discussed to round the weights of a design for infinite sample size in order to obtain designs for finite sample size n. We concentrate on the efficient apportionment method since it has the best bound on the efficiency loss that is due to the discretization. Asymptotically the efficiency loss is seen to be bounded of order n~l; in the case of differentiability the order is n~2. In polynomial fit models, the efficient apportionment method even yields optimal designs in the discrete class of designs for sample size n, under the determinant criterion fo and provided the sample size n is large enough. 12.1. DESIGNS FOR FINITE SAMPLE SIZES A design for sample size n, e Ew, specifies for its support points jc,- X frequencies / = &,(*,-) e {l,2,...,n l,n}, and is standardized through 5^/< HI n. This inherent discreteness distinguishes the design problem for sample size n:

where H C H is a subset of designs for sample size n which compete for optimality. The optimal value of this problem is denoted by v(<f>,n). The discretization does not affect the class of optimality criteria to be considered, information functions. Indeed, the unstandardized moment matrix of , is

304

12.2. SAMPLE SIZE MONOTONICITY

305

(see Section 1.24). Therefore the optimal design for sample size n generally depends on the unknown model variance a-2, unless the optimality criterion (/> is homogeneous. This again singles out the information functions </> on NND(fc) as the only reasonable criteria, the other defining properties being called for by the same arguments as in the general design problem of Section 5.15. The information function to be used may well be a composition of the form <f> o CK, but at this point this is of no importance. The set of designs for sample size n is embedded in the set a of all designs by a transition from to its standardized version, M /n. This gives rise to the set of moment matrices

This set is still compact, being the image of the compact subset Xn of (Rk)n under the continuous mapping (*!,...,*) i- Y^j<nxixj/n' Hence the existence of optimal designs for sample size n poses no problem and parts (a) and (b) of Lemma 5.16 carry over. The striking distinction is that discreteness prevents the set M(E/i/) from being convex. Since almost everything in our development is built on convexity properties, none of those results apply. Thus it is generally beyond our approach to find the optimal value v(<f>,n) of the design problem for sample size n, or the $-optimal designs in subsets Ew of S. Instead we propose a specific apportionment method which takes any design for infinite sample size (preferably an optimal one) and efficiently rounds it to a design for sample size n. The rounding is carried out irrespective of the particular criterion </>, but with due attention to the general principles that underly the design of experiments. Section 12.2 to Section 12.5 introduce the efficient design apportionment. Its optimality property of enjoying the best efficiency bound among all designs for sample size n is studied in Section 12.6 to Section 12.11. From Section 12.13 on, we present a particular setting where rounding leads to designs that actually are $o-optimal in the discrete set of designs for sample size n. 12.2. SAMPLE SIZE MONOTONICITY Suppose a given design is rounded to a design for sample size n. The least complications arise if the support sets are the same (whence and n/n are mutually absolutely continuous),

in that the two designs then share identical identifiability properties for whatever parameter system K'O is under investigation. We pursue this case only.

306

CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

Support point Weight Quota at 299 Apportionment Quota at 300 Apportionment

1 0.02557

2 0.03224

3 0.06234

4 0.87985

7.65 8
7.671 7

9.6398 10 9.672 10

18.6397 18

263.1 263
263.96 264

18.7 19

EXHIBIT 12.1 Quota method under growing sample size. The first support point loses an observation as the sample size increases from 299 to 300.

The rationale is that if is optimal, then it is presumably wise to maintain its support points when rounding. This requires at least as many observations as has support points, n > #supp = t, say. Thus the problem is the following. For a fixed design that is supported by the regression vectors jci,... ,xi in /f, we wish to discretize the weights w, = (*,) into frequencies n, summing to n. Then (*/) = H, defines a design for sample size n. We want to find a procedure for which the standardized and discretized design gn/n is "close" to the original design . The question is, what does "close" mean for the design of experiments. A naive approach is to focus on the quota n\\>i, as the fair share for the support point */. Of course, the numbers nw\,.. .,nwe generally fail to be integers. The usual numerical rounding, of rounding nwi up or down to the nearest integer n;5 has little chance of preserving the side condition that the frequencies n/ sum to n. A less naiVe, but equally futile approach is to minimize the total variation distance max/<^ \n-Jn wi\. This results in the quota method which operates in two phases. First it assigns to jc of the quota AZH>/ the integer part [nwi\. This leaves n ),< j/WiJ observations to be taken care of in the second phase. They are allocated, one by one, to those support points that happen to possess the largest fractional part nwf - [nwf\. The total variation distance is rarely a good statistical measure of closeness of two distributions. For design purposes, it suffices to note that the quota method is not sample size monotone. An apportionment method is called sample size monotone when for growing sample size n, the frequencies n, do not decrease for all i < I. If a method is not sample size monotone, then a sequential application may lead to the fatal situation that for sample size n + 1 an observation must be removed which was already realized as part of the apportionment of sample size n. That the quota method suffers from this deficiency is shown in Exhibit 12.1.

12.4. EFFICIENT ROUNDING PROCEDURE

307

12.3. MULTIPLIER METHODS OF APPORTIONMENT An excessive concern with the quotas nw, begs the question of treating each support point equally. The true problems are caused by the procedure used to round nvv, to n,, and to do so for each / = 1,... ,t in the same manner. Multiplier methods of apportionment pay due credit to these complications and reverse the emphasis. Multiplier methods are in a one-to-one correspondence with rounding procedures. A rounding function R is an isotonic function on R, z > z => R(z) > R(z), which maps a real number into one of the neighboring integers. The corresponding multiplier method seeks a multiplier v e [0; oo) to create pseudoquotas vwi, so that the rounded numbers , = R(vW{) sum to n. The point is that every pseudoquota vwt gets rounded using the same function R. The multiplier v takes the place of the sample size n, but is degraded to be nothing but a technical tool. For growing sample size n = Y^i<e^(vwi}-> tne multiplier v cannot decrease. This proves that every multiplier method is sample size montone. There is another advantage. On a computer, the weights w, may not sum to 1, because of limited machine precision. This calls for a standardization W///A, with /x = Yli<e w, ^ 1. But a multiplier method yields the same result, whether based on w, (with multiplier v) or on w,//u, (with multiplier IJLV). Therefore multiplier methods are stable against machine inaccuracies. In the ensemble of all multiplier methods, the task becomes one of finding a rounding function R which is appropriate for the design of an experiment. However, we need to accommodate the following type of situation. Consider the function R(z) \z~\ which rounds z to the smallest integer greater than or equal to z. The (^-optimal designs of Section 9.5 have uniform weights, H>, = I/k for i < k. Any multiplier i > e ( ( m l)fc; mk\ yields the apportionment \vwi\ m which sums to mk. Hence only the sample sizes n = /c,2fc,... are realizable, the others are not. For this reason the definition of a rounding function R is changed at those points z where R has jumps, and offers the choice of the two limits of R(z 5) as 8 tends to 0. 12.4. EFFICIENT ROUNDING PROCEDURE The apportionment of designs will be based on the rounding where fractional numbers z always get rounded up to the next integer, while integers z may be rounded up or not. Because of the latter ambiguity, the rounding procedure |[-| maps a number z into a one-element set or a two-element set,

for all integers m. In particular, we have \z] G |[z]l for all z R.

308

CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

For a design with weights w\,...,wt, the corresponding multiplier method results in an apportionment set (, n) consisting of those discretizations /ii,...,/ii summing to n such that, for some multiplier v > 0 and for all / < t, the frequency /i, lies in flW/U. We call \-\ the efficient rounding procedure, and E(,n) the efficient design apportionment, for reasons to be seen in Theorem 12.7. The fact that the efficient apportionment may yield a set E(g,n) rather than a singleton is again illustrated by the <fo-optimal designs of Section 9.5. For sample size n = mk + r with remainder r e (0,1,..., k - 1}, there are (*) ways to discretize the uniform weights w,- = I/A:, by assigning frequency m+l to r support points and frequency m to the other k r points. All of these assignments appear equally persuasive. The reason is that equality of the weights prevents the apportionment method from discriminating between the (*) possible discretizations. The same phenomenon occurs when two or more weights are too close together rather than being equal. The efficient design apportionment achieves the goal of Section 12.2 that the discretizations E($,n) are designs for sample size n that have the same support as , as soon as n > #supp = t. Namely, for n = t the unique member of E(g,l) is the uniform design (!,...,!) for sample size t. For n > i, the frequencies then satisfy n, > 1, by sample size monotonicity. Given the sample size n and the support size I of the design , rough bounds for a multiplier v are n -1 < v < n. The multiplier n -1 is generally too small since / e J[(n )w,-| implies The multiplier n is generally too large since n, [FWM;ill entails = n. The average of the two extremes is a good first choice, v =

n-\l.
We concentrate on the specific frequencies and introduce the discrepancy

The design +</(.*,) = \(n-\t}w^\ lies in the efficient design apportionment E(,n+d) for sample size n + d. If d = 0, then e E(,n) is a solution to the problem. If d is negative or positive then \d\ observations need to be added or removed. This is achieved by the following theorem which characterizes the efficient design apportionment without taking recourse to multipliers. 12.5. EFFICIENT DESIGN APPORTIONMENT Theorem. Let 6 H be a design with weights v v l 5 . . . , vtv > 0, and let !,...,/! > 0 be integers which sum to n > 1.

12.5. EFFICIENT DESIGN APPORTIONMENT

309

a. (Characterization) Then (/i 1 } ...,/if) lies in the efficient design apportionment (,) if and only if

b. (Augmentation, reduction) Suppose (HI, ... ,ne) e (,/*). If; and k are such that rijlwj = min^n//^ and (nk - l)/wk max,-<(n/ - l)/w/, then

Proof. By definition, any member (n\,..., ne) of (, n) is obtained from a multiplier v > 0 such that for all i < , we have ,- e flVw,]]. This entails n, - 1 < f w, < , and division by w, establishes the inequality in part (a). Conversely, for v e [max,-<(w,--l)/w/; min l <rt//w,], we get /-! < vw, < , for all / < ^. This yields n, e (fvv/]|, and proves (i,... ,n f ) e (,). For part (b), we define n^ n, for / / /' and t = ny + 1, and nr = , for / ^ fc and n~ = nfc - 1. Hence (/if,... ,n) and (n^,... ,n~) are designs for sample size n + 1 and n 1, respectively. Furthermore we introduce

Part (a) states that M < m. In the case of (nj,... ,), we wish to establish M+ m+. From n,* n^,' we infer m < m+. Either M+ is attained by some < < l J i ^ J; then we obtain M+ = (, l)/w, < M < m < m+. Or otherwise M+ = (n| l)/vv/ = nj/Wj = m gives M+ = m < m+. Thus part (a) yields (n|,...,/i|) e E(g,n + 1). In the case of (n^,...,n^), we obtain similarly m" = rij/Wi nj/Wj > m > M > M~ for some / ^ A:, or m~ = n^/wk (nk - l)/wk = M > M~. Thus a fast implementation of the efficient design apportionment has two phases. First use the multiplier n - \i to calculate the frequencies n, = \(n- ^)wi]. The second phase loops until the discrepancy (),<) -n is 0, either increasing a frequency n;- which attains H//VV, = min,< n,/H>, to n;- + 1, or decreasing some nk with (nfc - l)/wk max,<(/ - l)/w, to nk 1. An example is shown in Exhibit 12.2. Next we turn to the excellent global efficiency properties of the efficient design apportionment, thereby justifying the name. The approach is based on

310

CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

3 4 5 6

9 10 11 12

13

14 15 16 17 18

i 6 100 i 010
3 1 2

1 1 1 1 101 1 1 2 2 QQ1 on 1 2 2 3

no

211 221 2 2 2 2 232 323 3 3 4 4 334 344 4 5 5 6

322 332 3 3 3 3 454 545 5 5 6 6 667 677 7 8 8 9

EXHIBIT 12.2 Efficient design apportionment. For a design with weights wj = 1/6, w2 = 1/3, w>3 = 1/2, the efficient apportionment (,) offers three discretizations for n = 1,2,7,8,13,14,... and is a singleton otherwise. Underlined numbers determine the minimum of rij/Wi and receive the next observations.

the likelihood ratio rf(x)/g(x) of any two designs 17, e H, as are the proofs of Theorem 10.2 and Corollary 8.9. We introduce the minimum likelihood ratio of 17 relative to :

The definition entails e^^ [0; 1]. The following lemma explicates the role of s^i as an efficiency bound of 17 relative to . 12.6. PAIRWISE EFFICIENCY BOUND Lemma. Any two designs 17, e H satisfy, for all information functions < on NND(Jfe),

Proof. Monotonicity and homogeneity of an information function </> make (2) an immediate consequence of (1). Inequality (1) follows from

The merits of the efficiency bound e^^ are that it is easy to evaluate, and that it holds simultaneously over the set 4> of all information functions on NND(*):

12.7. OPTIMAL EFFICIENCY BOUND

311

Also it helps to formulate the apportionment issue as a succint optimization problem. Given a design e H, achieve the largest possible efficiency bound e^ among all standardized designs TJ H n / for sample size n. The problem is solved by the efficient design apportionment E(,n). 12.7. OPTIMAL EFFICIENCY BOUND Theorem. For every design e H, the best efficiency bound among all standardized designs for sample size n,

is attained by any member of the efficient design apportionment (,). Proof. Let assign weights w, > 0 to i support points *,- e X, and let 17 En/n be an arbitrary standardized design for sample size n. If for some i we have 17 (jc,) = 0, then e^/f vanishes and 17 is of no interest for the maximization problem. If the support of 17 contains additional points Xi+i,..., Jtjt, then the efficiency bound e^/g does not decrease by shifting the mass )f=+i TJ(JC,) mto tne point *i, say. Hence we need only consider designs 17 of the form 17(*,) = n,-/n for all / < ^, with ]T]K n, = n. In the main part of the proof, we start from a seF(1 , . . . , ) with Y^i<i ni n which attains the best efficiency bound,

This set need not belong to the efficient design apportionment E(,n), so we modify it into another optimal set which lies in (, w). Again we put

While m < M, we loop through a construction based on some subscripts ;' and k satisfying m = rij/Wj and M = (nk - l)/wk. The assumption m < M forces ;' ^ k and nk > 2. We transfer an observation from the k th to the ;' th point,

This yields ;/w; > m and nk/wk = M > m. But m = min,-^ n,7v,- satisfies m < ne^(n) = m. Hence there exists a third subscript / ^ j,k where / = ,, such that n/w, = m < m < n,/Wi. Therefore in = m, and (i,... ,f) is just another optimal frequency set, but the optimal efficiency bound is attained by

312

CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

fewer n; than before. So we copy (n\,... ,nt) into (ni,...,ne) and compare the new values of m and M. The loop terminates with M < m. Part (a) of Theorem 12.5 proves the terminal frequency set to be a member of (,n). Finally we prove that any two members (n\,... ,n f ) ^ (n^,... ,ne) in the efficient design apportionment E(g,n) have the same efficiency bound. To this end it suffices to establish m = m where

Let v and v be two multipliers that produce (n\,..., nf) and (n\,..., n f ) . We have v = v, otherwise v < v entails n, < / for all i <, and Y^i<e nt = n Y^i<e^i forces equality, n, = n,, contrary to the assumption. Also we have v < m, otherwise we get n, < vwj and / 0 (T^w/H for some i. The same argument yields v < m. Thus there is a common multiplier v satisfying ,,, e IF*/H'il fr all * < ^Since the two sets of frequencies are distinct, there exists a subscript ;' with rij = vwj and Hy = vw}, + I. As both sum to n there is another subscript k with nk vwk + I and nk = nv^. We obtain rt//M>7 = v < m < rij/Wj and nk/\vk = v < m < nk/wk. This proves m = v = m, whence the efficiency bound is constant over (,). The construction in the proof is called for since designs other than those in the efficient apportionment (,) may also achieve the best efficiency bound f(n). For instance, in Exhibit 12.2, the optimal bound ^(8) = 6/8 is attained by the apportionment (3,2,3) which fails to belong to (,8) = {(2,3,3), (2,2,4), (1,3,4)}. The optimal bound e^(n) varies with underlying design . There are coarser efficiency bounds which hold uniformly over sets of designs, H C H.

12.8. UNIFORM EFFICIENCY BOUNDS Lemma. For a design e H and sample size n, let e^(n) be the optimal efficiency bound of Theorem 12.7. a. (Support size) For all designs H that have support size t, we have

b. (Rational weights) For all designs G H that have rational weights with

12.8. UNIFORM EFFICIENCY BOUNDS

313

common denominator N, we have

Proof. Again we set (*,) = w, for the i support points jc, of . For all /,;' = !,...,, part (a) of Theorem 12.5 entails (y - l)/w;- < n//w,, that is, w,-(ny - 1) < w,wy. Summation over / yields w,-(n - ) < n, and (n,7n)/Wj > 1 ^/n for all' i = 1,..., L This proves part (a). For part (b), let N be a common denominator, whence Nwj are integers. If the sample size is an integer multiple mN of N, then the efficient design apportionment is uniquely given by mNwj. Generally, for sample size n = mN + r with a remainder r e { 0 , . . . , N - l } , sample size monotonicity yields

This gives e^(n) > mN/n =

[n/N\/(n/N).

The first bound increases with n. The second bound periodically jumps to 1 and decreases in between. We use the example of Exhibit 12.2 in Section 12.6 to illustrate these bounds:
n
ef(n)

1 2 3 4 5 6 1 0 0 3 0 0 0 4 6 0 0 0 0 0 1
2 3 3 4 4 5 2 5

7 8 9
6 7 4 7 6 7 6 8 8 9 6 8 9 6 6 8 9

10

11 12
10

13 14 15 16
T5
10 13 12 13 12 12 14 11 14 12 14 14 15 12 15 12 15 15 16 13 16 12 16

17

18

9 10

1 -3//I
L/6J/(/6)

11

To
TO
6

8 11 6 11

9 12 1

T7 12 T7

n 14

16

1
15 T8 1

The asymptotic order n } cannot be improved without further assumptions on the optimality criterion </>. This is demonstrated by a line fit model with experimental domain [1;1] using the global criterion of Section 9.2,

The globally optimal design for 0 is the (fo-optimal design r(l) = |, with moment matrix M\(T) = I2. This follows from the Kiefer-Wolfowitz Theorem 9.4 and the discussion in Section 9.5. Hence the optimal value is v(g) = g(h) = l Now let us turn to designs for sample size n. For even sample size n = 2m, the obvious apportionment r n (l) = ra achieves efficiency 1. For an odd sample size, n 2m + 1, the efficient apportionment assigns m observations to -1 and m + 1 observations to +1, or vice versa. The two corresponding moment matrices are

314

CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

The common criterion value is found to be |(1 - 1/n). For either design rn in E(r,n), the g-efficiency is

for odd sample size n. Therefore a tighter efficiency bound than er(n) is generally not available. As n tends to co, either bound of the lemma converges to 1. The asymptotic statements come in a somewhat smoother form by switching to the complementary bound on the efficiency loss, 1 - e$(n). Part (a) states that the loss is bounded by l/n, for every sample size n. De-emphasizing the constant I, this is paraphrased by saying that the efficiency loss is asymptotically bounded of order n~l. Inherent in any rounding method is the idea that for growing sample size n the discretized weights rii/n converge to the true weight w/. This is established in the following theorem. Furthermore, it is shown that with (,), the criterion values <t>(M(gn/n)) converge to <f>(M(g)). This is not entirely trivial since we need to rule out discontinuities such as in Section 3.16.

12.9. ASYMPTOTIC ORDER O(n~l) Theorem. Let e H be an arbitrary design. a. (Efficiency) The efficiency loss 1 - e^(n) that is associated with the efficient design apportionment for rounding into a design for sample size n is asymptotically bounded of order n~l. b. (Weights) The Euclidean distance between and the standardized members of the efficient apportionment E(g,n) is asymptotically bounded of order n~l. c. (Convergence) For every design e (, n) for sample size n, we have, for all information functions <f> on NND(fc),

Proof. Part (a) is a consequence of Lemma 12.8, 1 - e%(n) < l/n = 0(n-*). For part (b), let assign weight w/ to its support points jc/. Any member (!,...,/!{) in E(g,n) satisfies w,(n; 1) < n,-ivy, as in the proof of Lemma 12.8. Summation over / yields / 1 < nwj. In the case ny > nw y , we get |n, nvvy) = nj nwj < 1. In the case ny < nwy, we employ the efficiency

12.10. ASYMPTOTIC ORDER O(tl

315

bound ((n) and part (a) to obtain

Altogether this yields

and

In part (c), convergence of M(n/n) to M() follows from the convergence of the weights, since the support is common to all designs , and . Then the matrices An = (l/^(n))M(/) converge to M () and fulfill An > M() for all n, by Lemma 12.6. For an information function <f> on NND(/c), Lemma 5.7 now entails lim,,^ 4>(An) = <t> (M()) and lim,,^ < (M(&/n)) = <t> (M()).

In the general design problem the optimality criterion is given by an information function <f> on NND(/c). Then the asymptotic order O(n~l) carries over to the <f> -efficiency when discretizing a < -optimal design in some subclass H C H. For a design in the efficient apportionment (,), we get

for every sample size n, by Lemma 12.6. The asymptotic order becomes O(n~2) as soon as the optimality criterion </> is twice continuously differentiate in the optimal design . The differentiability requirement pertains to the weight vector (wi,..., wf)' e Ue of . 12.10. ASYMPTOTIC ORDER O(n~2) Theorem. Suppose the set H C H of competing designs is convex. For an information function < on NND(fc), let the design 6 H be <-optimal for d in H, assigning weight w/ to i support points Xi e X. Assume that the mapping (MI, ... ,M/) -* <f> (Y^i<e "ii JC/JC/) K twice continuously differentiable at ( w 1 ; . . . , we)' e Re. Then, for some c > 0 and for all sample sizes n, the designs in the efficient design apportionments "(, n) satisfy

316

CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

Proof. Since the efficient design apportionment leaves the support of fixed, we need only consider designs on the finite regression range X supp where, of course, continues to be tf> -optimal. The set of weights W = {(i7(jci),...,i7(^))' e Re : i) H, supp 17 C X} inherits convexity from H. The vector w = (wi,...,wf)' W maximizes the concave function /(MI, . . . , u f ) = <f> ($2i<e ui */*/) ver W. In the presence of differentiability, this happens if and only if the gradient V/(w) is normal to W at w,

If for any one Euclidean unit vector et G W strict inequality holds in (1), then this entails the contradiction w 'V/(iv) = V/(w>) = w'Vf(w). But equality for all Euclidean unit vectors extends to equality throughout (1),

The Taylor theorem states that for every u in a compact ball /C around w there exists some a e [0; 1] such that

where ///() denotes the Hesse matrix of / at u. For u e W, the gradient term vanishes because of (2). Hence for u e 1C n W, we obtain the estimate

where c = maxa6;c \max(Hf(u)}/f(w) is finite since the Hesse matrix Hf(u) depends continuously on u and AC is compact. We have c > 0 since otherwise for u ^ iv, we get 1 - f(u)/f(w) < 0, contradicting the optimality of w. Any assignment (!,...,) e E(g,n) leads to a vector u = (!/,..., ne/n)' with squared Euclidean distance to w bounded by \\u - w\\2 < 2i2/n2 (see the proof of Theorem 12.9). Hence there exists some n0 beyond which u lies in /C where (3) applies, 1 - <^-eff(^/) < c2/n2 = O(n~2}. Now the assertion follows where c is the maximum of c and the numbers (n2/2)(l <-eff(&AO) for & e (>) and " < An example is shown in Exhibit 12.3, for the efficient apportionment of the 0_oo-optimal design rl^ for the full parameter vector 6 in a tenth-degree polynomial fit model over [!;!]. In Section 9.13, it is shown that the small-

12.11. SUBGRADIENT EFFICIENCY BOUNDS

317

EXHIBIT 12.3 Asymptotic order of the E-efficiency loss. The scaled efficiency loss A_OO(/J) = (n2/121)(l - ^_oo-eff(T n /n)) has maximum c = 0.422 at n = 257, where rn is the efficient apportionment of the <_oo-optimal design T!^, for 0 in a polynomial fit model of degree 10 on [-!;!].

est eigenvalue of the associated moment matrix has multiplicity 1, whence the criterion function is differentiable at Af^rl^) and the present theorem applies. The exhibit displays the scaled efficiency loss

which for sample size n < 1000 stays below c = 0.422. Exhibit 12.4 in Section 12.12 illlustrates the situation for the determinant criterion (fo.

12.11. SUBGRADIENT EFFICIENCY BOUNDS On our way to exploit the assumption that the criterion is twice differentiable in the optimum, we may wonder what we can infer from simple differentiability. We briefly digress to provide some evidence that no new insight is obtained. Similar to Lemma 12.6, another pairwise comparison of two designs 17 and in H is afforded by the subgradient inequality of Section 7.1,

As before, we assume to be <-optimal, with moment matrix M = M() and with i support points jc, and weights w,. The design 17 = i~n/n is taken

318

CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

to be close to , with weights ,/ on the same support points *,-. Dividing by the optimal value v(<j>) = <f>(M), we obtain a bound for the efficiency loss,

The first factor is close to 1 provided <f> is continuously differentiable in M. Then gn/n eventually lies in the neighborhood of where differentiability obtains, whence the subdifferential of <f> at Mn = A/(,/n) uniquely consists of the gradient V<j>(Mn). These gradients converge to V<(Af) because of continuous differentiability. By Theorem 7.9, the matrix N = V</(A/)/<(M) solves the polarity equation <t>(M)<t>(N) = (M,N) = 1. The General Equivalence Theorem 7.14 now translates the optimality of into the normality (in)equality max/<f x-Nxj = 1. Therefore, given 8 > 0, there exists some n0 such that for n > no, we have maxi<iXi'V<f>(Mn)xi/(f>(M) < 1 + S. w n n The second factor in (1) is bounded by )/<f \ i ~ i/ \ < ZLx^l + Wit)/n = 21/n, for the members (n\,...,ni) in the efficient design apportionment (,/i). To rid us of the factor 2, we may use the quota method instead and obtain X)/<* |vv, - n,/n| < l/n. In summary, (1) leads at best to the bound

This result is inferior to Theorem 12.9 in every detail. It excludes an initial section up to some unknown sample size o and it features a constant 1 + 5 larger than 1. It demands continuous differentiability, and it guides us towards the quota method with its appalling monotonicity behavior. However, an objection is in order. Theorem 7.9 says, amongst others, that every subgradient B e d^(M(t])) satisfies (M(T/),5)/0(M(^)) = 1. Thus hidden in the subgradient inequality is the term <(Af(i7)) - (M(i)),B} = 0, and estimating a term that is 0 anyway is unlikely to produce a tight bound. A more promising starting point is

With 17 = gn/n close to , again B becomes the gradient

(Mn) at Mn =

12.11. SUBGRADIENT EFFICIENCY BOUNDS

319

And again continuous differentiability is generally needed to make (2) converge to 0. Hence against initial hope, the objection does not lead to weaker assumptions than those called for by (1). The merits of (2) are that it comes for free with the General Equivalence Theorem 7.14. By Theorem 7.9, the matrix Nn = V(f>(Mn)/<J>(Mn) solves the polarity equation <f>(Mn)<J>(Nn) = (Mn,Nn) = 1. In order to check the optimality of TJ = &/, the General Equivalence Theorem 7.14 directs us to invest some effort to compute max,< */Wn*,-. Although optimality fails to hold if the maximum exceeds 1, inequality (2) still rewards us the efficiency bound

In the absence of differentiability, the bounds (2) and (3) may be quite bad. The following example builds on Section 3.16 and uses the singular moment matrix

where the criterion function is not continuous, let alone differentiable. The matrix belongs to the one-point design in zero, TO, which is optimal for the intercept in the line fit model with experimental domain [-!;!]. That is, the parameter of interest is

We converge to TO along a straight line, ra = (1 - )TO + TI, as a tends to 0. Our starting point is the design the moment matrix Ma of T and the matrix for the normality inequality are

320

CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

Hence the maximum of (l,f)W 0 (l,r)' over t = 1,0, or even over t e [1;1], is 9/(4 - a) and converges to 9/4 as a tends to 0. The right hand sides in (2) and (3) converge 5/9 and 4/9, and leave a gap to the left hand sides 0 and 1, respectively.

12.12. APPORTIONMENT OF D-OPTIMAL DESIGNS EN POLYNOMIAL FIT MODELS In the remainder of the chapter, we highlight a rare case where efficient design apportionment preserves optimality: if is < -optimal in E, then E(g,n) is (f> -optimal in En. Thus simple rounding reconciles two quite distinct optimization problems, a convex one and a discrete one. We illustrate this very pleasing property with a polynomial fit model of degree d. As in Section 1.28, we shift attention from the designs e E on the regression range -X to designs T T on the experimental domain T [!;!]. On T, the set of all designs for sample size n is denoted by Tw. In a d th-degree model, the parameter vector 6 has d + 1 components. It is convenient to use the abbreviation

At least k observations are needed to secure nonsingularity of the moment matrix. The sample sizes n > k are represented as an integer multiple m of k plus a remainder r e {0,..., k 1},

In terms of n we have m = \n/k\ and r = n mk. The <fo-optimal design for 0 is TQ and has a support of smallest possible size, k, derived in detail in Section 9.5. Moreover TQ assigns uniform weights 11k to its k support points //. The efficient design apportionment (, ri) salvages from this uniformity as much as possible. Every reasonable apportionment method does the same, whence the following results are not so much indicative of any one apportionment method. Rather, they underline the peculiar properties of the determinant criterion <fo, and of designs which have a minimal support. For sample size n = mk+r > k, the members of the efficient design apportionment (, n) have in common that they assign at least m observations to each of the k support points f,-. They differ in where the remaining r observations are placed. Hence E(,ri) contains (*) designs for sample size n. It follows from Theorem 12.7 that all of them share the same efficiency bound. In the present setting, we also find that the ^-efficiencies themselves are constant.

12.12. APPORTIONMENT OF D-OPTIMAL DESIGNS IN POLYNOMIAL FIT MODELS

321

Claim. In a dth-degree polynomial fit model and for sample size n > k, the criterion value <f>o(Md(Tn/n)) is constant over the efficient design apportionment E(TQ,ri). With n = mk + r as above, the efficiency loss is bounded by

Proof. To see this, we represent the moment matrix as Md(Tn/n) = X'kuX, where the model matrix X is as in Section 9.5 and Au is the diagonal matrix with weights M, = rn(ti)/n on the diagonal. That is, r weights are equal to (m + l)/n, while the remaining k - r weights are m/n. Hence the optimality criterion takes the value

and does not depend on the specific assignments of the frequencies m + I and m. The overall optimal value is f</((&)) = (te\.X)2/k/k, whence follows the equality in (1). In order to establish the bound, we introduce the relative remainder a = r/k e [0;1] and multiply (1) by n2/k2 to obtain the scaled efficiency loss

The biggest efficiency loss is encountered in the first period, m = 1, when the discretization effect is felt most. Hence we get a) (1 + a - 2a) < 0.135. Thus the proof is complete.h For later periods, m oo, the functions (m + a) (m + a - m (1 + l/m) a ) converge uniformly for a e [0; 1] to the parabola |a(l-a), which at a = \ has maximum 1/8. Therefore the limiting bound for large sample size n tightens only slightly, from 0.135 to 0.125. Note that the same bounds c = 0.135 and 0.125 hold for all degrees d > I . For degree 10, the scaled efficiency loss is shown in Exhibit 12.4. In the first period, n = 11,..., 22, the maximum is 0.134 while we use the bound 0.135. The fourth period, n = 44,... ,55, has maximum 0.126 which is getting close to the limiting maximum 0.125.

322

CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

EXHIBIT 12.4 Asymptotic order of the D-efficiency loss. The scaled efficiency loss AQ() = (n2/121)(l - <^o-eff(Tn/n)) is bounded by c = 0.135, where rn is an efficient apportionment of the ^o-optimal design T* for 0 in a polynomial fit model of degree 10 on [-1; Ij.

12.13. MINIMAL SUPPORT AND FINITE SAMPLE SIZE OPTIMALFTY With little effort, we can establish a first discrete optimality property of the efficient design apportionment E(TQ,II), in a subclass Tn of designs for sample size n. Let Tn be the set of such designs for sample size n that have a smallest possible support size for investigating the full parameter vector 0,

In this class, the best way to allocate the k support points is the one realized by the efficient design apportionment (T^,/I), according to the following. Claim. In the d th-degree polynomial fit model, the designs in the efficient design apportionment (TQ,H) are the only (fo-optimal designs for 6 in Tw. Proof. To prove this, let rn 6 Tn be a competing design for sample size , with k = d + 1 support points 7t and with frequencies n,-. We denote the associated model matrix by X = (/(/o),. -. ,/(/)) ' The optimality criterion happens to separate the contribution of the support points from that of the weights,

12.13. MINIMAL SUPPORT AND FINITE SAMPLE SIZE OPTIMALITY

323

Degree

Support points of r*+2

<fo-eff(T,*+2) 0.9585 0.9637 0.9718 0.9752 0.9795

<fo-eff(Td+2) 0.9572 0.9620 0.9662 0.9691 0.9720

4 5 6 7 8

1, 1, 1, 1, 1,

0.6629, 0.7722, 0.8328, 0.8734, 0.9006,

0.1154 0.3541, 0.4920, 0.6053, 0.6837,

0 0.1239 0.2739, 0 0.3919, 0.1083

EXHIBIT 12.5 Nonoptimality of the efficient design apportionment. In a d th-degree polynomial fit model over [1; 1], the designs r*+2 for sample size n = d+2 have larger ^-efficiency for d than the designs rd+2 m the efficient design apportionment E(rft,d + 2).

In order to estimate the first factor, we introduce the design T T which assigns uniform weight l//c to the support points f, of rn. As before, we let X be the matrix that belongs to the support points r, of TQ . The unique <o-optimality of TQ entails the determinant inequality

with equality if and only if the two support sets match, {f tf}. The second factor only depends on the weights. If there are frequencies nt other than m or m + 1, then some pair ; and nk, say, satisfies We transfer an observation from / to k to define the competitor

Then the optimality criterion is increased because Qiiijnk < njnk+fij~nk-l = rijnk. In summary, every design rn e E(^,n) and any competing design
Tn G f fulfill

with equality if and only if ? e E(r^n). Thus the proof is complete. From Theorem 10.7, designs with more than k support points are inadmissible in T. But there is nothing in the development of Chapter 10 to imply that this will succeed for the discrete design set Tn. Indeed, there are designs for sample size n with more than k support points that are superior to the efficient design apportionment (T^,/I), if only slightly so. Examples are given in Exhibit 12.5. This appears to be an exception rather than the rule. In fact, the exceptional cases only start with degree d > 4 and, for a fixed degree d, are restrained to finitely many sample sizes n < nd. In all other cases, n > nd, it

324

CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

is the efficient design apportionment E(TQ , n) which yields the <fo-optimal designs for 6 in TM. To deduce this result and to calculate nd (in Section 12.16), we start with a lemma which permits a nonconvex set of competing moment matrices M. Based on the subgradient inequality, it provides a sufficient criterion for a finite subset C M to be a complete class relative to the optimality criterion <f>, in the sense that every moment matrix A M is matched by a matrix M e which under <j> is at least as good as A.
12.14. A SUFFICIENT CONDITION FOR COMPLETENESS

Lemma. Assume that the optimality criterion <f> is an information function on NND(&). Let M. C M(3) be an arbitrary subset of moment matrices which includes a finite subset of positive definite matrices. If for every moment matrix A e M there exists a matrix M e , such that some nonnegative definite k x k matrix N solves the polarity equation 0(M)<^(N) = trace MN 1 and satisfies trace AN < 1, then at least one of the moment matrices in 8 is <-optimal for 6 in M. If, in addition, </> is strictly isotonic on PD() and the polar function < is differentiable on PD(), then every < -optimal moment matrix for 0 in M is a member of . Proof. Let the moment matrices A .M and M be such that the equation <f>(M)<f>(N) = trace MN 1 has a nonnegative definite solution N satisfying trace AN < 1. Then <f>(M)N is a subgradient of < at M, by Theorem 7.9. Because of <(M) - (M,4>(M)N} = 0, the subgradient inequality simplifies

where the second inequality exploits the assumption trace AN < 1. From the finiteness of , we get supAM 4>(A) = maxWf <(M), thus establishing the first part of the assertion. If, in addition, $ is strictly isotonic on PD(/c), then the matrix N is positive definite, by Lemma 7.5. Now let A e M be <f>-optimal for 6 in M. Then equality holds throughout (1), entailing <f>(A)<f>(N) = trace AN = 1. Part (c) of Theorem 7.9 yields that <t>(N)A is a subgradient of the polar function 0 at N, as is (f>(N)M. Differentiability of $ leaves the gradient as the only possible subgradient, Hence we obtain A = M e . For the determinant criterion <fo, we apply the lemma to the moment matrices of the designs in the efficient apportionment (,) that comes with a 0o-optimal design for 0 in H. The Kiefer-Wolfowitz Theorem 9.4 provides the necessary and sufficient normality inequality,

12.15. A SUFFICIENT CONDITION FOR FINITE SAMPLE SIZE D-OPTIMALITY

325

Let us assume that the optimal design has a smallest possible support size, k. Then the weights of are uniform, I/A;, and the support points x\,..., xk e X of assemble to form the nonsingular k x k matrix X1 = (x\,... ,xk). With M() = X'X/k, the left hand side in (1) becomes kx'X'lX'~lx . Upon defining g(jc) = X'~lx, the normality inequality (1) turns into

The following theorem proposes a strengthened version of (2) by replacing the constant upper bound 1 by a variable upper bound. The tighter, and hence only sufficient, new bound is a convex combination of 1 and of the function max(</tgf(*) < 1, the weighting depending on the sample size n through the multiple m = [n/k\. The goal is to start out from optimality statements relative to the class H of all designs, and to deduce optimality properties relative to the discrete class En of designs for sample size n. 12.15. A SUFFICIENT CONDITION FOR FINITE SAMPLE SIZE D-OPTIMALITY Theorem. Let the design e H be (^-optimal for 6 in H, and have k support points j t i , . . . , x^ e X. With the nonsingular k x k matrix X' (jcj,... ,xk), we define the function g(x) = X'~lx : Uk > Rk. If, for some m, we have

then, for all n > mk, the designs , for sample size n that constitute the efficient design apportionment E(,ri) are <fo-optimal for 6 in E. If, in addition, Y^i<kS2i(x} < 1 for all * e #\ {*!,...,jty} then every (^-optimal design for 6 in EM is a member of E(,n). Proof. The proof is accomplished in three steps, showing that condition (1) entails the sufficient condition of Lemma 12.14, with M = M(s,n/n} and = {M(&/n): , e E(,n)}. I. First we identify a design in the efficient design apportionment (, n) with an /--element subset I of {!,...,}, through

To indicate this one-to-one correspondence, we denote the members of E(,ri) by &. Thus gi/n assigns to :c, the weight *v = (ra+l)/n for / e J, and

326

CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

w_i = m/n for i ∉ J. Upon introducing the vector w(J) = (w₁,…,w_k)', the solution to the polarity equation for the moment matrix M(ξ_J/n) becomes N_J = X⁻¹Δ_{w(J)}⁻¹X'⁻¹/k, by Lemma 6.16. Therefore Lemma 12.14 demands verification of

We represent the moment matrices in Ξ_n through regression vectors y₁,…,y_n in X. Setting

with

we obtain

Thus the hypothesis of Lemma 12.14 turns into

II. In the second step, we prove that assumption (1) implies (2). For n = mk + r with remainder r ∈ {0,…,k−1}, we choose arbitrary regression vectors y₁,…,y_n in X. For each j = 1,…,n, there exists some index i_j ∈ {1,…,k} such that g²_{i_j}(y_j) = max_{i≤k} g_i²(y_j). Let J be some r-element subset of {1,…,k}. In any case, we have

In the particular case that the maximum is attained over J, that is, i_j ∈ J, optimality of ξ entails a sharper estimate, since the normality inequality (2) applies.


This and assumption (1) yield an estimate that improves upon (3):

Introducing the counts c_i = #{j ≤ n : i_j = i}, we find the number of particular cases with the improved estimate (4). Summation of (3) and (4) over j ≤ n leads to a combined bound. Now we form the minimum over the r-element subsets J,

where the minimizing quantity is the sum of the r largest counts among c₁,…,c_k. This sum attains its smallest value, r(m+1), if the sum of the k−r smallest counts is largest, (k−r)m. Insertion yields (2). Hence (1) implies (2). Lemma 12.14 now states that the efficient design apportionment E(ξ,n) contains a design for sample size n which is φ₀-optimal for θ in Ξ_n. Using the same arguments as in Section 12.12, we see that the determinant criterion φ₀ is constant over E(ξ,n). Thus optimality of one member of E(ξ,n) entails optimality of all the others.

III. Finally, let η_n be any other design for sample size n which is φ₀-optimal for θ in Ξ_n. From the second part of Lemma 12.14, the moment matrix M(η_n/n) coincides with one of the matrices M(ξ_J/n). With this, we find

for all x ∈ X. Now if, in addition, Σ_{i≤k} g_i²(x) < 1 for all x ∈ X ∖ {x₁,…,x_k}, then η_n must be supported by x₁,…,x_k. Denoting the weights by η_n(x_i)/n = u_i, we obtain the equality u = w(J) from the coincidence of the moment matrices. Therefore η_n is a member of E(ξ,n), and the proof is complete.


12.16. FINITE SAMPLE SIZE D-OPTIMAL DESIGNS IN POLYNOMIAL FIT MODELS

We now return to the assertion of Section 12.13 that the efficient design apportionment E(τ₀,n) yields φ₀-optimal designs in the discrete class T_n, for polynomial fit models on the experimental domain T = [−1;1], for large enough sample size n. We claim the following.

Claim. For a dth-degree polynomial fit model there exists an integer n_d such that for all n ≥ n_d, the designs for sample size n in the efficient design apportionment E(τ₀,n) are the only φ₀-optimal designs for θ in T_n.

Proof. From Section 9.5, we recall that the support points of the φ₀-optimal design τ₀ for θ in T are t₀, t₁, …, t_d, and that the matrix V = X'⁻¹ is the coefficient matrix of the Lagrange polynomials L_i with nodes t_i. In the notation of the previous section, we have k = d+1 and L_i(t) = g_{i+1}(f(t)). We write n = m(d+1) + r as in Section 12.12. Condition (1) of Theorem 12.15 requires, for all t ∈ [−1;1],

To isolate m we rearrange terms. From this, we see that the key object is the function

where R_i(t) = (1 − L_i²(t))/P(t) are rational functions, for all i = 0,…,d, with common denominator P(t) = 1 − Σ_{j=0}^{d} L_j²(t). The behavior of the polynomial P is studied in Section 9.5. It is of degree 2d, with a negative leading coefficient, and it has d−1 local minima at t₁,…,t_{d−1}. Thus at t₀ and t_d the first derivative is nonzero, while at t₁,…,t_{d−1} the first derivative vanishes but the second derivative is nonzero. At t_i, the singularity of R_i is removable and R_i is continuous, as follows by applying l'Hospital's rule once at the endpoints i = 0, d, and twice in the interior i = 1,…,d−1. For j ≠ i, the function R_j has a pole at t_i. As a consequence R, being the minimum of R₀, R₁, …, R_d, coincides around t_i with R_i. Therefore R is continuous at t₀, t₁, …, t_d. Of course, R is continuous anywhere else. Thus it has a finite maximum over [−1;1],


d      1     2     3     4     5     6     7     8     9     10
μ_d    1.5   1.87  3.25  3.93  5.12  5.88  6.99  7.79  8.87  9.69
n_d    2     3     12    15    30    35    48    63    80    99

EXHIBIT 12.6 Optimality of the efficient design apportionment. For sample size n ≥ n_d = (d+1)⌈μ_d − 1⌉, the efficient design apportionment E(τ₀,n) yields φ₀-optimal designs for θ in the discrete design set T_n, in a dth-degree polynomial fit model. The bounds n_d are not best possible.

We define m_d to be the smallest integer m fulfilling μ_d ≤ m + 1, that is, m_d = ⌈μ_d − 1⌉. Now Theorem 12.15 applies for n ≥ n_d, with n_d = (d+1)m_d. Thus our claim is established.
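The maxima μ_d can be approximated numerically. The following Python sketch is one way to do so, under the assumptions spelled out in the comments: the support of τ₀ is taken to be ±1 together with the roots of the derivative of the dth-degree Legendre polynomial (Section 9.5), and the maximum of R is approximated on a grid; the output should roughly reproduce the μ_d row of Exhibit 12.6.

    import numpy as np
    from numpy.polynomial import legendre

    def d_optimal_nodes(d):
        # Support of the D-optimal design for a d-th degree polynomial fit on [-1, 1]:
        # the endpoints +-1 and the roots of the derivative of the Legendre polynomial
        # of degree d (assumption taken from Section 9.5).
        c = np.zeros(d + 1); c[d] = 1.0
        interior = legendre.legroots(legendre.legder(c))
        return np.sort(np.concatenate(([-1.0], interior, [1.0])))

    def mu_d(d, grid=20001):
        nodes = d_optimal_nodes(d)
        ts = np.linspace(-1.0, 1.0, grid)
        # Lagrange polynomials L_i with nodes t_0, ..., t_d, evaluated on the grid.
        L = np.empty((d + 1, grid))
        for i, ti in enumerate(nodes):
            others = np.delete(nodes, i)
            L[i] = np.prod((ts[None, :] - others[:, None]) / (ti - others)[:, None], axis=0)
        P = 1.0 - np.sum(L**2, axis=0)                       # common denominator P(t)
        R = np.min(1.0 - L**2, axis=0) / np.where(P > 1e-12, P, np.nan)  # R(t) = min_i R_i(t)
        return np.nanmax(R)                                   # mu_d = max over [-1, 1]

    if __name__ == "__main__":
        for d in range(1, 6):
            mu = mu_d(d)
            m = int(np.ceil(mu - 1))
            print(d, round(mu, 2), (d + 1) * m)               # compare with Exhibit 12.6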

The lower bounds n_d that emerge during the proof are tabulated in Exhibit 12.6. These numbers are not best possible. For instance, for degree d = 3, we obtain n ≥ 12 while it is known that for every n ≥ 4 the designs in E(τ₀,n) are optimal in T_n. Preservation of φ₀-optimality under the efficient apportionment method cannot be expected to hold in general. Surprisingly enough it breaks the symmetry that carries much of the intuitive appeal of the design τ₀. For example, in a third-degree polynomial fit model the φ₀-optimal design τ₀ for θ = (θ₀, θ₁, θ₂, θ₃)' places uniform weights 1/4 on the four points −1, −1/√5, 1/√5, 1. For sample size five, the efficient apportionment E(τ₀,5) consists of the four permutations of the assignment 2,1,1,1. Each of these leads to the same φ₀-efficiency 0.951 for θ. The apportionment 2,1,1,1 and its permutations violate our intuitive feeling that symmetry is a necessity for optimality in polynomial fit models on the experimental domain [−1;1]. However, intuition errs for finite sample sizes n. The φ₀-optimal symmetric design for sample size five assigns one observation to each of the five points −1, −0.511, 0, 0.511, 1, but has for θ an inferior φ₀-efficiency of 0.937. This provides yet further evidence that discrete design problems are very peculiar, in that often a reduction by symmetry fails to go through. On the other hand, our main problem of interest, of finding optimal designs for infinite sample size, may very well afford a reduction by invariance. This is the topic of the last three chapters.
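The two efficiencies just quoted can be checked with a few lines of Python; the sketch below assumes the cubic-fit setup described above (support ±1 and ±1/√5, D-efficiency taken as the ratio of determinants raised to the power 1/4) and is meant only as an illustration.

    import numpy as np

    def moment_matrix(points, weights, degree=3):
        # M = sum_i w_i f(t_i) f(t_i)' with f(t) = (1, t, ..., t^degree)'.
        F = np.vander(np.asarray(points, float), degree + 1, increasing=True)
        return F.T @ (np.asarray(weights, float)[:, None] * F)

    a = 1 / np.sqrt(5)
    support = [-1.0, -a, a, 1.0]

    M_opt = moment_matrix(support, [0.25] * 4)               # phi_0-optimal design tau_0
    M_app = moment_matrix(support, [2/5, 1/5, 1/5, 1/5])     # apportionment 2,1,1,1 for n = 5
    M_sym = moment_matrix([-1.0, -0.511, 0.0, 0.511, 1.0], [0.2] * 5)  # symmetric competitor

    def d_efficiency(M, M_ref):
        k = M.shape[0]
        return (np.linalg.det(M) / np.linalg.det(M_ref)) ** (1 / k)

    print(round(d_efficiency(M_app, M_opt), 3))   # the text quotes 0.951
    print(round(d_efficiency(M_sym, M_opt), 3))   # the text quotes 0.937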

EXERCISES

12.1 Show that the efficient design apportionment turns a design with weights 0.27744, 0.25178, 0.19951, 0.14610, 0.09225, 0.03292 into a design for sample size 36 with frequencies 10, 9, 7, 5, 3, 2 [Balinski and Young (1982), p. 96]. (A numerical sketch of this rounding follows the exercise list.)

12.2 Which of the weights of the φ₀-optimal designs τ₀ in Exhibit 9.4 got rounded, and how?

12.3 In a polynomial fit model of degree 8, the φ₋₁-optimal design τ₋₁ for θ in T assigns weights 0.045253, 0.098316, 0.121461, 0.151682 to points t ≠ 0 and 0.166576 to 0. Show that the quota method is not sample size monotone from n = 1005 to n = 1006.

12.4 In a polynomial fit model of degree 10, the φ₋₁-optimal design τ₋₁ for θ in T assigns weights 0.037259, 0.078409, 0.089871, 0.106787, 0.122870 to points t ≠ 0 and 0.129609 to 0. Compare the efficient apportionment for sample size n = 1000 with the numerical rounding.

12.5 In a polynomial fit model of degree 10, the φ₋∞-optimal design τ₋∞ for θ in T assigns weights 0.036022, 0.075929, 0.087833, 0.106714, 0.126145 to points t ≠ 0 and 0.134714 to 0. Use the efficient design apportionment to obtain a design for sample sizes n = 11,…,1000. Compare the results with Exhibit 9.11 and Exhibit 12.3.

12.6 Show that if the frequencies n_i = ⌈(n − ℓ/2)w_i⌉ sum to n, then E(ξ,n) is the singleton {(n₁,…,n_ℓ)}, where w₁,…,w_ℓ are the weights of ξ.

12.7 Show that the designs ξ_n ∈ E(ξ,n) satisfy lim_{n→∞} C_K(M(ξ_n/n)) = C_K(M(ξ)), for every ξ ∈ Ξ and for every full column rank matrix K ∈ R^{k×s}.

12.8 Show that the total variation distance between ξ and ξ_n/n is bounded, for all ξ_n ∈ E(ξ,n) and all n, by

12.9 In a line fit model over [−1;1], show that the globally optimal design for θ among the designs for odd sample size n = 2m+1 is τ*_n(±1) = m, τ*_n(0) = 1. Compare the efficiency loss with that of the efficient apportionment of the globally optimal design τ(±1) = ½ in T, and with the uniform bound of Theorem 12.8.
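For readers who want to experiment with Exercise 12.1, the following Python sketch implements one plausible reading of the efficient design apportionment: start from the frequencies ⌈(n − ℓ/2)w_i⌉ of Exercise 12.6 and, while the total exceeds n, decrement a frequency n_j for which (n_j − 1)/w_j is largest (symmetrically, increment one with smallest n_j/w_j if the total falls short). The adjustment rule is an assumption, not a quotation of the text; with the weights of Exercise 12.1 and n = 36 it reproduces the frequencies 10, 9, 7, 5, 3, 2.

    import math

    def efficient_apportionment(weights, n):
        """Round design weights w_i to integer frequencies summing to n.

        Sketch of a divisor-type rounding (assumed adjustment rule, see the
        remarks above): start from ceil((n - l/2) * w_i) and repair the total.
        """
        l = len(weights)
        freq = [math.ceil((n - l / 2) * w) for w in weights]
        while sum(freq) > n:                       # too many observations: take one away
            j = max(range(l), key=lambda i: (freq[i] - 1) / weights[i])
            freq[j] -= 1
        while sum(freq) < n:                       # too few observations: add one
            j = min(range(l), key=lambda i: freq[i] / weights[i])
            freq[j] += 1
        return freq

    w = [0.27744, 0.25178, 0.19951, 0.14610, 0.09225, 0.03292]
    print(efficient_apportionment(w, 36))          # expected: [10, 9, 7, 5, 3, 2]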

C H A P T E R 13

Invariant Design Problems

Many design problems enjoy symmetry properties, in that they remain invariant under a group of linear transformations. The different levels of the problem actually have different groups associated with them, and the interrelation of these groups is developed. This leads to invariant design problems and invariant information functions. The ensuing optimality concept studies simultaneous optimality relative to all invariant information functions.

13.1. DESIGN PROBLEMS WITH SYMMETRY

The general design problem, as set out in Section 5.15 and reviewed in Section 7.10, is an intrinsically multivariate problem. It is formulated in terms of k×k moment matrices, whence we may have to handle up to ½k(k+1) real variables. Relying on an optimality criterion such as an information function is just one way to reduce the dimensionality of the problem. Another approach, with much intuitive appeal and at times very effective, is to exploit symmetries that may be inherent in a problem. The appropriate frame is to study invariance properties which come with groups of transformations that act "on the problem". However, we need to be more specific about what "on the problem" means. The design problem has various levels, such as moment matrices, information matrices, or optimality criteria, and each level has its particular manifestation of symmetry when it comes to transformation groups and invariance properties. In the present chapter we single out these particulars, a task that is sometimes laborious but clarifies the technicalities of invariant design problems.

We find it instructive to first work an example. Let us consider a parabola fit model, with experimental variables t₁,…,t_ℓ in the symmetric unit interval T = [−1;1]. The underlying design τ lives on the experimental domain [−1;1], and assigns weight τ(t_i) ∈ (0;1) to the support point t_i.


The transformation to be considered is the reflection t ↦ R(t) = −t. Whether this transformation is practically meaningful depends on the experiment under study. If the variable t stands for a development over time, from yesterday (t = −1) over today (t = 0) to tomorrow (t = 1), then reflection means time reversal and would hardly be applicable. But if t is a technical variable, indicating deviation to either side (t = ±1) of standard operating conditions (t = 0), then −t may describe the real experiment just as well as does t. If this is so, then reflection invariance is a symmetry requirement that is practically meaningful. Then the first step is to consider the reflected design τ^R, given by τ^R(t) = τ(−t) for all t ∈ [−1;1]. The designs τ and τ^R share the same even moments μ₂ and μ₄, while the odd moments of τ^R carry a reversed sign,

That is, the second-degree moment matrix M₂(τ^R) is obtained from M₂(τ) by a congruence transformation,

Hence the symmetrized design τ̄ = ½(τ + τ^R) has moment matrix

This averaging operation evidently simplifies the moment matrices, by letting all odd moments vanish. Furthermore, let us consider an information function φ on NND(3) which is invariant under the action of Q, that is, which fulfills φ(QAQ') = φ(A) for all A ∈ NND(3). Concavity and invariance of φ imply


Thus the transition from τ to the symmetrized design τ̄ improves the criterion φ, or at least maintains the same value. In a second step, we maximize the fourth moment as a function of the second moment. On [−1;1], we have μ₄ = ∫ t⁴ dτ ≤ ∫ t² dτ = μ₂, with equality if and only if the only support points of τ are ±1 or 0. Hence we introduce the symmetric three-point design τ_α which, by definition, distributes mass α ∈ [0;1] uniformly on the boundary {±1},

A Loewner comparison now improves upon (1) for the choice α = μ₂,

By monotonicity this carries over to an information function

No further improvement in the Loewner ordering is possible since the three-point design τ_α is admissible, by Section 10.7. In a third step, we check the eigenvalues of the moment matrix M₂(τ_α):

All of them are increasing for α ∈ [0;2/5] (see Exhibit 13.1). Thus the eigenvalue vector λ(2/5) is a componentwise improvement over any other eigenvalue vector λ(α) with α < 2/5,

This eigenvalue monotonicity motivates a restriction to orthogonally invariant information functions φ, that is, those information functions that satisfy φ(QAQ') = φ(A) for all A ∈ NND(3) and Q ∈ Orth(3). This and Loewner monotonicity for the diagonal matrices Δ_{λ(α)} yield a final improvement,


EXHIBIT 13.1 Eigenvalues of moment matrices of symmetric three-point designs. The eigenvalues λ_i(α) of the moment matrices of the symmetric three-point designs τ_α are increasing for α ∈ [0;2/5], for i = 1,2,3.

In summary, we have achieved a substantial reduction. For every orthogonally invariant information function φ on NND(3) and for every design τ ∈ T, there exists a weight α ∈ [2/5;1] such that under φ the symmetric three-point design τ_α is no worse than τ,

Instead of seeking the optimum as τ varies over T, we may therefore solve a problem of calculus, of maximizing the real function g(α) = φ(M₂(τ_α)) as the real variable α varies in the interval [2/5;1]. In other words, the symmetric three-point designs τ_α, with α ∈ [2/5;1], form a complete class relative to any orthogonally invariant information function φ on NND(3). The class is as small as possible since its members are uniquely optimal under the matrix means φ_p, from α = 2/5 for p = −∞ (Section 9.13), over α = 1/2 for p = −1 (Section 9.9) and α = 2/3 for p = 0 (Section 9.5), to α = 1 for p = 1 (Section 9.15). This type of reasoning generalizes to quadratic fit models with m > 1 factors (see Section 15.20). In Section 8.6, we put the same approach to work in a linear fit model over the unit square, to be extended to unit cubes in k ≥ 2 dimensions in Section 14.10. Three steps come together to culminate in the final reduction (4). Eigenvalue monotonicity (3) is a technical postlude which helps to elucidate some, but not all, models. The averaging operation from (1) and the Loewner comparison in (2) are the conceptual building blocks, and are the theme of this chapter. We begin our study with an inquiry into invariance properties relative to appropriate transformation groups.
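As a small illustration of this reduction to a one-dimensional calculus problem, the following Python sketch evaluates the matrix means φ_p on M₂(τ_α) over a grid of α values; the moment matrix of τ_α is built from the moments μ₂ = μ₄ = α with vanishing odd moments, as described above, and the reported maximizers should come out close to the values α = 2/5, 1/2, 2/3, 1 quoted in the text.

    import numpy as np

    def M2(alpha):
        # Moment matrix of the symmetric three-point design tau_alpha on {-1, 0, 1}:
        # mass alpha/2 at each of +-1 and mass 1 - alpha at 0, so mu_2 = mu_4 = alpha.
        return np.array([[1.0, 0.0, alpha],
                         [0.0, alpha, 0.0],
                         [alpha, 0.0, alpha]])

    def phi_p(C, p):
        lam = np.maximum(np.linalg.eigvalsh(C), 1e-12)      # guard against round-off
        if p == 0:
            return float(np.prod(lam) ** (1 / len(lam)))    # determinant criterion
        if p == -np.inf:
            return float(lam.min())                          # smallest-eigenvalue criterion
        return float(np.mean(lam ** p) ** (1 / p))           # general matrix mean

    alphas = np.linspace(0.01, 1.0, 991)
    for p in (-np.inf, -1, 0, 1):
        best = max(alphas, key=lambda a: phi_p(M2(a), p))
        print(p, round(best, 2))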


13.2. INVARIANCE OF THE EXPERIMENTAL DOMAIN

The starting point for invariant design problems is a group R of transformations (that is, bijections) of the experimental domain T. In other words, the set R consists of mappings R : T → T which are one-to-one (that is, injective) and onto (that is, surjective), such that any composition R₁R₂⁻¹ is a member of R for all R₁, R₂ ∈ R. In particular, the image of T under each transformation R is again equal to T,

The precise statement, that R is a transformation group on T, is often loosely paraphrased by (1), that the experimental domain T is invariant under each transformation R in R. In the parabola fit example of the previous section, the experimental domain is T = [−1;1], and the group R considered there contains two transformations only, the identity t ↦ t and the reflection t ↦ R(t) = −t. More generally, let us take a look at the m-way polynomial fit models introduced in Section 1.6. Then the experimental domain T is a subset of m-dimensional Euclidean space R^m. In the case where T is a Euclidean ball of radius r > 0, T_r = {t ∈ R^m : ||t|| ≤ r}, the transformation group R may be the full orthogonal group Orth(m) or any subgroup thereof. Indeed, if R is an orthogonal m×m matrix, R'R = I_m, then the linear transformation t ↦ Rt is well known to preserve length, ||Rt||² = t'R'Rt = t't = ||t||². Hence any Euclidean ball is mapped into itself. In other words, Euclidean balls are orthogonally invariant or, as we prefer to say, rotatable. Another case of interest emerges when T is an m-dimensional cube of sidelength r > 0, T = [0;r]^m. This particular domain is mapped into itself provided R is a permutation matrix. Hence the transformation group R could be any subgroup of the group of permutation matrices Perm(m). In other words, cubes are permutationally invariant. Yet other cases are easy to conceive. If T is a symmetric rectangle with half sidelengths r₁,…,r_m > 0, T = [−r₁;r₁] × ⋯ × [−r_m;r_m], then any subgroup R of the sign-change group Sign(m) keeps T invariant. Or else, if T is a symmetric cube of half sidelength r > 0, T = [−r;r]^m, then R might be a permutation matrix or a sign-change matrix, to leave T invariant. Hence R could be any subgroup of the group that is generated jointly by the permutations Perm(m) and the sign-changes Sign(m). Exactly which transformations R of the experimental domain T make up the group R depends on which implications the transformation R has on the practical problem under study, and cannot be determined on an abstract level.
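A quick numerical sanity check of these invariance statements, as a Python sketch (the random rotation is produced by a QR factorization; the points and dimensions are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    m, r = 3, 1.0

    # A random rotation maps the ball {||t|| <= r} into itself: lengths are preserved.
    R_orth, _ = np.linalg.qr(rng.standard_normal((m, m)))
    t = rng.standard_normal(m); t *= r / (2 * np.linalg.norm(t))      # a point inside the ball
    print(np.isclose(np.linalg.norm(R_orth @ t), np.linalg.norm(t)))   # True

    # A permutation matrix maps the cube [0, r]^m into itself: coordinates are only reordered.
    P = np.eye(m)[[2, 0, 1]]
    u = rng.uniform(0.0, r, size=m)
    print(np.all((P @ u >= 0) & (P @ u <= r)))                         # True

    # A sign-change matrix maps the symmetric cube [-r, r]^m into itself.
    S = np.diag([1.0, -1.0, 1.0])
    v = rng.uniform(-r, r, size=m)
    print(np.all(np.abs(S @ v) <= r))                                  # True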


13.3. INDUCED MATRIX GROUP ON THE REGRESSION RANGE

The most essential assumption is that the group R which acts on the experimental domain T induces, on the regression range X ⊆ R^k, transformations that are linear. Linear mappings on R^k are given by k×k matrices, as pointed out in Section 1.12. Any such mapping is one-to-one and onto if and only if the associated matrix is nonsingular. The largest group of linear transformations on R^k is the general linear group GL(k) which, by definition, comprises all nonsingular k×k matrices. Just as well we may therefore demand that R induces a matrix group Q, a subgroup of GL(k). Effectively we think of Q as a construction so that any transformation R ∈ R and the regression function f : T → R^k "commute", in the very specific sense that given R ∈ R, there exists a matrix Q ∈ Q fulfilling

The function f is then said to be R-Q-equivariant. If the regression range X = f(T) contains k vectors f(t₁),…,f(t_k) that are linearly independent, then (1) uniquely determines Q through Q = (f(R(t₁)),…,f(R(t_k)))(f(t₁),…,f(t_k))⁻¹. However, we generally do not require the regression range X to span the full space R^k. In order to circumvent any ambiguity in the rank deficient case, the precise definition of equivariance requires the existence of a group homomorphism h from R into GL(k) so that (1) holds for the matrix Q = h(R) in the image group Q = h(R). Then f is termed equivariant under h. Of course, the identity transformation in R always generates the identity matrix I_k in Q. In Section 13.1, the reflection R(t) = −t leads to the 3×3 matrix Q which reverses the sign of the linear component t in the power vector f(t) = (1, t, t²)'. Generally, determination of the matrix group Q is a specific task in any given setting. Many results on optimal experimental designs do not explicitly refer to the experimental domain T and the regression function f. When we only rely on the regression range X ⊆ R^k, property (1) calls for a matrix group Q ⊆ GL(k) with transformations Q fulfilling

The same is stipulated by the set inclusion Q(X) ⊆ X. Since the inclusion holds for every matrix Q in the group Q it also applies to Q⁻¹, that is, X ⊆ Q(X). So yet another way of expressing (2) is to demand

In other words, the group Q ⊆ GL(k) is such that all of its transformations Q leave the regression range X invariant. Invariance property (3) very


much resembles property (1) of Section 13.2. However, here we admit linear transformations, x ↦ Qx, only.

13.4. CONGRUENCE TRANSFORMATIONS OF MOMENT MATRICES

The linear action x ↦ Qx on the regression range X induces a congruence action on the moment matrix of a design ξ on X,

The congruence transformation A ↦ QAQ' is linear on Sym(k), as is the action x ↦ Qx on R^k. The sole distinction is that the way Q enters into Qx looks simpler than with QAQ', if at all. As a consequence of the invariance of X under Q and (1), a transformed moment matrix again lies in M(Ξ), the set of all moment matrices. That is, for every transformation Q ∈ Q, we have

As above, this is equivalent to the set M(Ξ) being invariant under the congruence transformation given by each matrix Q ∈ Q. In the setting of the General Equivalence Theorem we start out from an arbitrary compact and convex set M ⊆ NND(k). The analogue of (2) then amounts to the requirement that M is invariant under each transformation Q ∈ Q,

This is the invariance property to be repeatedly met in the sequel. The two previous sections demonstrate that it emerges quite naturally from the assumptions that underlie the general design problem. Invariance property (3) entails that the transformation group Q ⊆ GL(k) usually is going to be compact. Namely, if the set of competing moment matrices M contains the identity matrix, then (3) gives QQ' ∈ M, and boundedness of M implies boundedness of Q,

If the set M contains a positive definite matrix N other than the identity matrix, then the same argument applies to the norm ||Q||²_N = trace QNQ' ≤ sup_{M∈M} trace M < ∞. Closedness of Q is added as a more technical supplement.


For a parameter system of interest K'θ, with a k×s coefficient matrix K of full column rank s, the performance of a design is evaluated through the information matrix mapping C_K of Section 3.13. In an invariant design problem, this mapping is taken to be equivariant. In Section 13.3 equivariance is required of the regression function f, and the arbitrariness of f precludes more special conclusions. However, for the well-studied information matrix mapping C_K, a necessary and sufficient condition for equivariance is that the range of K is invariant under each transformation Q ∈ Q.
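Lemma 13.5 below makes this precise. As a numerical preview, here is a Python sketch for the parabola-fit example of Section 13.1: Q reverses the sign of the linear component of f(t) = (1, t, t²)', K picks out the intercept and the linear coefficient (an illustrative choice), the induced matrix is taken as H = LQK with L a left inverse of K, and the equivariance C_K(QAQ') = H·C_K(A)·H' is checked on a random positive definite matrix; for positive definite A the information matrix is computed as C_K(A) = (K'A⁻¹K)⁻¹.

    import numpy as np

    rng = np.random.default_rng(1)

    Q = np.diag([1.0, -1.0, 1.0])       # reflection-induced transformation on f(t) = (1, t, t^2)'
    K = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])          # parameter system (theta_0, theta_1)' -- illustrative choice
    L = K.T                             # a left inverse of K, since K'K = I_2

    # Range invariance: Q maps the column span of K onto itself, so H = LQK is nonsingular.
    H = L @ Q @ K
    print(H)                            # expected: diag(1, -1)

    def C_K(A, K):
        # Information matrix mapping for full column rank K and positive definite A.
        return np.linalg.inv(K.T @ np.linalg.inv(A) @ K)

    B = rng.standard_normal((3, 3))
    A = B @ B.T + 3 * np.eye(3)         # a generic positive definite "moment matrix"
    print(np.allclose(C_K(Q @ A @ Q.T, K), H @ C_K(A, K) @ H.T))   # True: Q-H-equivariance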

13.5. CONGRUENCE TRANSFORMATIONS OF INFORMATION MATRICES

Lemma. Let C_K : NND(k) → Sym(s) be the information matrix mapping corresponding to a k×s coefficient matrix K that has full column rank s. Assume Q to be a subgroup of the general linear group GL(k).

a. (Equivariance) There exists a group homomorphism H : Q → GL(s) such that C_K is equivariant under H,

if and only if the range of K is invariant under each transformation Q ∈ Q,

b. (Uniqueness) Suppose C_K is equivariant under the group homomorphism H : Q → GL(s). Then H(Q) or −H(Q) is the unique nonsingular s×s matrix H that satisfies QK = KH, for all Q ∈ Q.

c. (Orthogonal transformations) Suppose C_K is equivariant under the group homomorphism H : Q → GL(s). If the matrix K fulfills K'K = I_s and Q ∈ Q is an orthogonal k×k matrix, then H(Q) = K'QK is an orthogonal s×s matrix.

Proof. For the direct part of (a), we assume (1) for a given homomorphism H. Choosing A = QKK'Q' and using C_K(KK') = I_s, we get KC_K(QKK'Q')K' = KH(Q)C_K(KK')H(Q)'K' = KH(Q)H(Q)'K'.

From the range formula of Theorem 3.24 and the discussion of square roots


in Section 1.14, nonsingularity of H(Q) leads to

From range QK ∩ range K = range K, we infer that the range of QK includes that of K. Because of nonsingularity of Q both ranges have the same dimension, s. Hence they are equal, Q(range K) = range K. For the converse of (a), we assume (2) and construct a suitable homomorphism H. From Lemma 1.17, the range equality (2) entails QK = KLQK for every left inverse L of K. Any other left inverse L̃ of K then satisfies

Therefore the definition H(Q) = LQK does not depend on the choice of L ∈ K⁻. The s×s matrix H(Q) is nonsingular, because of rank H(Q) ≥ rank KLQK = rank QK = s, where the last equality follows from (2). For any two transformations Q₁, Q₂ ∈ Q, we obtain

This proves that the mapping Q ↦ H(Q) is a group homomorphism from Q into GL(s). It remains to show the equivariance of C_K under H. By definition of H, we have KH(Q) = QK, and Q⁻¹K = KH(Q)⁻¹. For positive definite k×k matrices A equivariance is now deduced by straightforward calculation,

By part (c) of Theorem 3.13, regularization extends this to any nonnegative


definite k x k matrix A, singular or not,

This establishes the equivalence of (1) and (2). Part (b) claims uniqueness (up to the sign) of the homomorphism Q ↦ LQK. To verify uniqueness, we start out from an arbitrary group homomorphism H : Q → GL(s) which fulfills (1), and fix a transformation Q ∈ Q. The matrix QKH(Q)⁻¹ then has the same range as has QK. Property (2) implies a representation QKH(Q)⁻¹ = KS, for some s×s matrix S. For any matrix C ∈ NND(s), we have C_K(KSCS'K') = SCS'. On the other hand, we can write KSCS'K' = QAQ' with A = KH(Q)⁻¹CH(Q)⁻¹'K'. Hence (1) and C_K(A) = H(Q)⁻¹CH(Q)⁻¹' lead to C_K(QAQ') = H(Q)C_K(A)H(Q)' = C. Thus the matrix S satisfies SCS' = C for all C ∈ NND(s). This forces S = I_s or S = −I_s. Indeed, insertion of C = I_s shows that S is an orthogonal matrix. Denoting the ith row of S by z_i', insertion of C = z_iz_i' gives e_ie_i' = z_iz_i'. That is, we have z_i = ε_ie_i with ε_i ∈ {±1}, whence S = Δ_ε is a sign-change matrix. Insertion of C = 1_s1_s' entails ε₁ = ⋯ = ε_s = 1 or ε₁ = ⋯ = ε_s = −1. This yields QKH(Q)⁻¹ = K or QKH(Q)⁻¹ = −K, that is, QK = KH(Q) or QK = −KH(Q). The proof of uniqueness is complete. Part (c) follows from part (b). If K' is a left inverse of K, then orthogonality of Q implies orthogonality of H(Q), through H(Q)'H(Q) = K'Q'KK'QK = K'Q'QK = I_s.

The invariance property of the range of K also appears in hypothesis testing. Because the expected yield is x'θ, the action x ↦ Qx on the regression vectors induces the transformed expected yield x'Q'θ, whence the group action on the parameter vectors is θ ↦ Q'θ. Therefore the linear hypothesis K'θ = 0 is invariant under the transformation Q, {θ ∈ R^k : K'θ = 0} = {θ ∈ R^k : K'Q'θ = 0}, if and only if the nullspaces of K' and K'Q' are equal. By Lemma 1.13, this means that K and QK have the same range. Since the homomorphism H is (up to the signs) unique if it exists, all emphasis is on the transformations H(Q) that it determines. In order to force uniqueness of the signs, we adhere to the choice H(Q) = LQK, and define the image group H = {LQK : Q ∈ Q} ⊆ GL(s), relying on the fact that the choice of the left inverse L of K does not matter provided the range invariance property (2) holds true. Accordingly, the information matrix mapping C_K is then said to be Q-H-equivariant. A simple example is provided by the parabola fit model of Section 13.1


where the group Q has order 2:

Let K'θ designate a subset of the three individual parameters θ₀, θ₁, θ₂. There are two distinct classes of coefficient matrices K, depending on whether K'θ includes the coefficient θ₁ of the linear term, or not. In the first case, K is one of the 3×s matrices

with s = 3, 2, 2, 1, respectively. Here H is of order 2 as is Q, containing the identity matrix as well as, respectively,

In the second case, K is one of the matrices

whence K'θ misses out on the linear term θ₁. Here H degenerates to the trivial group which only consists of the identity transformation. In either case, the information matrix mapping C_K is equivariant. The example also illustrates the implications of our convention to eliminate the sign ambiguity, by adhering to the choice H = LQK. If the linear coefficient is of interest, K' = (0,1,0), then our convention induces the subgroup H = {1,−1} of GL(1) = R∖{0}, of order 2. However, H₁ = 1 and H₂ = −1 induce the same congruence transformation on Sym(1) = R, namely the identity. Evidently, the trivial homomorphism Q ↦ 1 serves the same purpose. The same feature occurs in higher dimensions, s > 1. For instance, let Q ⊆ GL(3) be the group that reflects the x-axis and the y-axis in three-dimensional space. This group is of order 4, generated by


and also containing the identity and the product Q₁Q₂. Let us consider the full parameter vector θ, that is, K = I₃. According to our convention, we compute H₁ = Q₁ and H₂ = Q₂, and hence work with the group H₁ = Q. The sign patterns H₁ = −Q₁ and H₂ = Q₂, or H₁ = Q₁ and H₂ = −Q₂, or H₁ = −Q₁ and H₂ = −Q₂ define alternative groups H₂, H₃, H₄. The four groups H_i, for i = 1,2,3,4, are distinct (as are the associated homomorphisms), but the congruence transformations that they induce are identical. Hence the identity C_{I₃}(A) = A is Q-H_i-equivariant, for all i = 1,2,3,4. Our convention singles out the group H₁ as the canonical choice. Thus, for all practical purposes, we concentrate on s×s matrices H, rather than on a homomorphism H. We must check whether, for any given transformation Q ∈ Q, there is a nonsingular s×s matrix H satisfying QK = KH. This secures equivariance of the information matrix mapping C_K,

The matrices H in question form the subgroup H of GL(s),

where the specific choice of the left inverse L of K does not matter. Since H is a homomorphic image of the group Q, the order of H does not exceed the order of Q. Moreover, if the coefficient matrix K satisfies K'K = I_s and Q is a subgroup of orthogonal k×k matrices, then H is a subgroup of orthogonal s×s matrices. Of course, for the full parameter vector θ, the transformations in Q and H coincide, Q = H, just as there is no need to distinguish between moment matrices and information matrices, A = C_{I_k}(A).

13.6. INVARIANT DESIGN PROBLEMS

We are now in a position to precisely specify when a design problem is called invariant. While C_K is the information matrix mapping for a full column rank k×s coefficient matrix K, no assumption needs to be placed on the set M ⊆ NND(k) of competing moment matrices.

DEFINITION. We say that the design problem for K'θ in M is Q-invariant when Q is a subgroup of the general linear group GL(k) and all transformations Q ∈ Q fulfill


From the range invariance (2), there exists a nonsingular s×s matrix H satisfying QK = KH, for any given transformation Q ∈ Q. In Section 13.5, we have seen that these matrices H compose the set

which is a subgroup of GL(s) such that the information matrix mapping C_K is Q-H-equivariant. We call H in (3) the equivariance group that is induced by the Q-invariance of the design problem for K'θ in M. There are two distinct ways to exploit invariance when it comes to discussing optimality in invariant design problems. One approach builds on the Kiefer ordering, an order relation that captures increasing information in terms of information matrices and of moment matrices, to be introduced in Section 14.2. The other approach studies simultaneous optimality relative to a wide class of information functions delimited by invariance.

DEFINITION. An information function φ on NND(s) is called H-invariant when H is a subgroup of the general linear group GL(s) and all transformations H ∈ H fulfill

The set of all H-invariant information functions on NND(s) is denoted by Φ(H). Before showing that the two resulting optimality concepts coincide, in Section 14.6, we study the basics of the two approaches. We start out by developing some feeling for invariant information functions. The most prominent criteria are the matrix means φ_p. There is a surprising discontinuity in their behavior, depending on whether the parameter p vanishes or not.

13.7. INVARIANCE OF MATRIX MEANS

Lemma. Let H be a subgroup of GL(s). For p ∈ [−∞;0) ∪ (0;∞], the matrix mean φ_p is H-invariant if and only if H is a subgroup of orthogonal matrices, H ⊆ Orth(s).

Proof. For the direct part, we first treat p = −∞. For a fixed transformation H ∈ H, invariance applies to C = I_s and yields φ₋∞(HH') = φ₋∞(I_s), that is, λ_min(HH') = 1. Invariance also applies to C = (H'H)⁻¹ and gives φ₋∞(I_s) = φ₋∞((H'H)⁻¹), that is, 1 = λ_min((H'H)⁻¹). Together we get λ_min(H'H) = 1 = λ_max(H'H). Hence H is orthogonal, H'H = I_s. An analogous argument holds for p = ∞.


Secondly, we treat p ∈ (−∞;0) ∪ (0;∞). With C = I_s, invariance yields φ_p(HH') = φ_p(I_s), that is, trace(HH')^p = s. The choice C = (H'H)⁻¹ gives trace(H'H)^{−p} = s. Thus for D = (H'H)^p, we get

Hence D = I_s and H is orthogonal. For the converse, we assume H to be a subgroup of orthogonal matrices. But φ_p depends on C only through its eigenvalues (see Section 6.7). Since the eigenvalues of C and of HCH' are identical, invariance follows.

Invariance of the determinant criterion φ₀ holds relative to the group of unimodular transformations,

Unimodular transformations preserve Lebesgue volume, while orthogonal transformations preserve the Euclidean scalar product. The group of unimodular transformations is unbounded and hence of little interest in design problems, except for actually characterizing the determinant criterion φ₀.

13.8. INVARIANCE OF THE D-CRITERION

Lemma. Let H be a subgroup of GL(s), and let φ be an information function on NND(s).

a. (Invariance) The determinant criterion φ₀ is H-invariant if and only if H is a subgroup of unimodular matrices, H ⊆ Unim(s).

b. (Uniqueness) The information function φ is Unim(s)-invariant if and only if φ is positively proportional to the determinant criterion φ₀.

Proof. For the direct part in (a), invariance implies φ₀(HH') = 1 for every transformation H ∈ H. This is the same as |det H| = 1. Conversely, for H ∈ Unim(s), we have invariance, φ₀(HCH') = ((det H)² det C)^{1/s} = φ₀(C). In order to establish part (b), we take an arbitrary information function φ on NND(s). We fix a positive definite s×s matrix C = HΔ_λH' where the matrix H is orthogonal and Δ_λ is the diagonal matrix with the eigenvalue vector λ = (λ₁,…,λ_s)' of C on the diagonal. Since unimodular invariance embraces orthogonal invariance, we initially get φ(C) = φ(Δ_λ). Defining the numbers μ_i = λ_i^{−1/2} ∏_{j≤s} λ_j^{1/(2s)} for i = 1,…,s, we find that Δ_μΔ_λΔ_μ = φ₀(C)I_s and ∏_{i≤s} μ_i = 1. Hence the diagonal matrix Δ_μ with the vector μ = (μ₁,…,μ_s)' on the diagonal is unimodular. From invariance and homogeneity, we finally conclude φ(C) = φ(Δ_λ) = φ(Δ_μΔ_λΔ_μ) = φ(φ₀(C)I_s) = φ₀(C)φ(I_s).


Semicontinuity extends the identity φ = φ(I_s)φ₀ from the open cone PD(s) to its closure NND(s). The present result emphasizes the prominent role that the determinant criterion φ₀ plays in the design of experiments. Another outstanding property is the self-polarity of φ₀, mentioned in Section 6.14. An important tool for studying invariant information functions and invariant design problems are invariant symmetric matrices, and the subspaces that they form. To this we turn next.

13.9. INVARIANT SYMMETRIC MATRICES

For an arbitrary nonempty subset H of s×s matrices, we define a symmetric s×s matrix C to be H-invariant when

The set of all H-invariant symmetric s×s matrices is denoted by Sym(s,H). Given a particular set H, we usually face the task of computing Sym(s,H). Here are three important examples,

In other words, the H-invariant matrices are the diagonal matrices if H is the sign-change group Sign(s). They are the completely symmetric matrices, that is, they have identical on-diagonal entries and identical off-diagonal entries, for the permutation group Perm(s). They are multiples of the identity matrix under the full orthogonal group Orth(s). For verification of these examples, we plainly evaluate C = HCH'. In the first case, let H be the diagonal matrix with ith entry −1 and all other entries 1, that is, H reverses the sign of the ith coordinate. Then the off-diagonal elements of HCH' in the ith row and the ith column are of opposite sign as in C, whence they vanish. It follows that C is diagonal, C = Δ_{δ(C)}, with δ(C) = (c₁₁,…,c_ss)'. Conversely, every diagonal matrix is sign-change invariant.

In the second case, we take any permutation matrix H. We get


whence the entry c_ij is moved from the ith row and jth column into the π(i)th row and π(j)th column. As π varies, the on-diagonal entries become the same, as do the off-diagonal entries. Therefore permutational invariance of C implies complete symmetry. Conversely, every completely symmetric matrix is permutationally invariant. The third case profits from the previous two since Sign(s) and Perm(s) are subgroups of Orth(s). Hence an orthogonally invariant matrix is diagonal, and completely symmetric. This leaves only multiples of the identity matrix. Conversely, every such multiple is orthogonally invariant. In each case, Sym(s,H) is a linear space. Also, invariance of C relative to the infinite set Orth(s) is determined through finitely many of the equations C = HCH', namely where H is a sign-change matrix or a permutation matrix. Both features carry over to greater generality.

13.10. SUBSPACES OF INVARIANT SYMMETRIC MATRICES

Lemma. Let Sym(s,H) ⊆ Sym(s) be the subset of H-invariant symmetric s×s matrices, under a subset H of s×s matrices. Then we have:

a. (Subspace) Sym(s,H) is a subspace of Sym(s).

b. (Finite generator) There exists a finite subset H̃ ⊆ H that determines the H-invariant matrices, that is, Sym(s,H) = Sym(s,H̃).

c. (Powers, inverses) If H consists of orthogonal matrices, H ⊆ Orth(s), then the subspace Sym(s,H) is closed under formation of any power C^p with p = 2,3,… and the Moore–Penrose inverse C⁺.

Proof. For H ∈ H, we define the linear operator T_H(C) = C − HCH' from Sym(s) into Sym(s). Then Sym(s,H) = ∩_{H∈H} nullspace T_H proves the subspace property in part (a). Now we form the linear space L(T_H : H ∈ H) that contains all linear combinations of the operators T_H with H ∈ H. Since Sym(s) is of finite dimension, so is the space of all linear operators from Sym(s) into Sym(s). Therefore, its subspace L(T_H : H ∈ H) has a finite dimension d, say, and the generator {T_H : H ∈ H} contains a basis {T_{H_i} : i ≤ d}, where {H₁,…,H_d} is a subset of H. Given a matrix C ∈ Sym(s), the following lines are equivalent:

This proves part (b), with H̃ = {H₁,…,H_d}. As for part (c), we first note that for C ∈ Sym(s,H), we get HC²H' = HCH'HCH' = (HCH')² = C², using the orthogonality of H. Hence C²,


C³, … are also invariant and lie in Sym(s,H). Now we consider an eigenvalue decomposition C = Σ_{i≤r} λ_iP_i with r distinct nonzero eigenvalues λ_i. The r×r matrix A with entries a_ij = λ_i^j then has a Vandermonde-type determinant and is invertible. For fixed k ≤ r, the corresponding coefficient vector hence represents the projector P_k as a linear combination of the powers C, C², …, C^r. Therefore the projectors

all lie in Sym(s,H). This yields C⁺ = Σ_{i≤r} P_i/λ_i ∈ Sym(s,H). It also embraces nonsingular inverses C⁻¹ and fractional powers C^p where applicable.

The larger the set H, the smaller is the subspace Sym(s,H). In the three examples of the previous section, the groups Sign(s), Perm(s), Orth(s) have orders 2^s, s!, and ∞, while the invariant matrices form subspaces of dimensions s, 2, and 1, respectively. An orthonormal basis for the diagonal matrices Sym(s,Sign(s)) is given by the rank 1 matrices e₁e₁',…,e_se_s', where e_i is the ith Euclidean unit vector in R^s. Hence for an arbitrary matrix C ∈ Sym(s), its projection onto the subspace Sym(s,Sign(s)) is

where δ(C) = (c₁₁,…,c_ss)' is the vector of diagonal elements of C. An orthogonal basis for the completely symmetric matrices Sym(s,Perm(s)) is formed by the averaging matrix J_s = 1_s1_s'/s and the centering matrix K_s = I_s − J_s from Section 3.20. Any matrix C ∈ Sym(s) thus has projection

with coefficients trace(CJ_s) and trace(CK_s)/(s−1). The one-dimensional space Sym(s,Orth(s)) is spanned by I_s, and C has projection (trace C/s)·I_s.

There is an alternative way to deal with the projection C̄ of a matrix C onto a subspace of H-invariant matrices, as the outcome of averaging HCH'


over the group H. It is at this point that the group structure of H gains importance.

13.11. THE BALANCING OPERATOR

The set Φ(H) of H-invariant information functions on NND(s) has the same structure as the set Φ of all information functions on NND(s). It is closed under the formation of nonnegative combinations, pointwise infima, and least upper bounds, as discussed in Section 5.11. The set Φ(H) is not affected by the sign ambiguity which surfaces in the uniqueness part (b) of Lemma 13.8, since we evidently have φ(HCH') = φ((−H)C(−H)'). The larger the group H, the smaller is the set Φ(H). The largest subgroup for which the class Φ(H) is nonempty are the unimodular transformations, H = Unim(s). By Lemma 13.8, the set Φ(Unim(s)) consists of the determinant criterion φ₀ and positive multiples thereof. Of course, the trivial group H = {I_s} is smallest and makes every information function invariant. Under an orthogonal subgroup, H ⊆ Orth(s), all matrix means are invariant, φ_p ∈ Φ(H) for all p ∈ [−∞;1], owing to Lemma 13.7 and Lemma 13.8. Furthermore, Φ(H) then contains a sizeable subset Φ̄(H) of H-invariant information functions that are linear. While the matrix means are prime criteria for practical applications, the linear invariant criteria help in understanding the implications of invariance. Theorem 13.12 shows that the subset Φ̄(H) has the same descriptional power as has the full set Φ(H). To this end let us first consider a finite group H ⊆ GL(s) of order #H. We define an averaging operation C ↦ C̄ for symmetric s×s matrices C by

If the group H is compact, then the definition of C̄ is

where the integral is taken with respect to the unique invariant probability measure on the group H. In any case the mapping C ↦ C̄ : Sym(s) → Sym(s) is linear, and we call it the balancing operator. The balancing operator results in matrices C̄ that are H-invariant, C̄ = HC̄H' for all H ∈ H. Namely, since H is a group, the set {HG : G ∈ H} coincides with H and the average HC̄H' = Σ_{G∈H} HGCG'H'/#H is the same as C̄ from (1). Thus the balancing operator is idempotent, (C̄)‾ = C̄, and its image space is the subspace Sym(s,H) of H-invariant symmetric matri-


ces studied in Section 13.10. In other words, the balancing operator projects the space Sym(s) onto the subspace Sym(s,H). For a closed subgroup of orthogonal matrices, H ⊆ Orth(s), we have H' = H⁻¹ ∈ H and {H' : H ∈ H} = H. It follows that the projector is orthogonal,

for all C, D ∈ Sym(s). Given an orthonormal basis V₁,…,V_d of Sym(s,H), we may then calculate the projection C̄ from

For an infinite compact group H this is a great improvement over the definition C̄ = ∫_H HCH' dH from (2), since the invariant probability measure dH is usually difficult to handle. The example of Section 13.10, for H = Orth(s), illustrates the method. The cases H = Sign(s) or H = Perm(s) show that formula (3) is useful even if the group H is finite. The balancing operator serves two distinct and important purposes. First it leads to an improvement of a given information matrix C,

utilizing concavity and H-invariance of an information function φ ∈ Φ(H). In search of optimality we may therefore restrict attention to the information matrices which are H-invariant, C_K(M) ∩ Sym(s,H). Secondly the balancing operator gives rise to a wide class of H-invariant information functions that are linear,

The following theorem shows that a simultaneous comparison over the large criterion class Φ(H) may be achieved by considering the smaller and more transparent subclass Φ̄(H).

13.12. SIMULTANEOUS MATRIX IMPROVEMENT

Theorem. Assume that H ⊆ Orth(s) is a closed subgroup of orthogonal matrices. Then for every H-invariant matrix C ∈ Sym(s,H) and for every matrix D ∈ Sym(s), the following four statements are equivalent:


a. (Simultaneous improvement) φ(C) ≥ φ(D) for all H-invariant information functions φ.

b. (Linear criteria) φ(C) ≥ φ(D) for all H-invariant information functions φ that are linear.

c. (Invariant Loewner comparison) C ≥ D̄.

d. (Kiefer ordering) There exists a matrix E ∈ Sym(s) satisfying C ≥ E ∈ conv{HDH' : H ∈ H}.

Proof. Part (a) implies (b) since the latter comprises fewer criteria. Insertion of the specific functions from Φ̄(H) = {C ↦ z'C̄z : z ∈ R^s} in part (b) and invariance of C entail z'Cz = z'C̄z ≥ z'D̄z for all z ∈ R^s. Hence (c) follows. Next we assume part (c). Since H is compact and the mapping H ↦ HDH' is continuous, the orbit {HDH' : H ∈ H} is compact and so is its convex hull. Therefore the average D̄ = ∫_H HDH' dH, as the limit of finite averages, lies in conv{HDH' : H ∈ H}. This is part (d), with the particular choice E = D̄. That (d) leads back to (a) is a consequence of the basic properties of H-invariant information functions. For since E can be written as a convex combination Σ_{i≤ℓ} α_iH_iDH_i', say, we obtain the inequalities

The simplest case is the trivial group H = {I_s}. Then part (a) refers to all information functions and part (b) to those that are linear, while parts (c) and (d) both reduce to the Loewner ordering, C ≥ D. However, for nontrivial compact subgroups H ⊆ Orth(s), parts (c) and (d) concentrate on distinct aspects. Part (c) says that a simultaneous comparison relative to all H-invariant information functions is equivalently achieved by restricting the Loewner ordering ≥ to the H-invariant subspace Sym(s,H). In contrast, part (d) augments the Loewner ordering ≥ with another order relation, matrix majorization. The resulting new order relation, the Kiefer ordering, is the topic of the next chapter.
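To make the balancing operator of Section 13.11 and the projection formulas of Section 13.10 concrete, here is a small Python sketch: it averages an arbitrary symmetric matrix over the sign-change group and over the permutation group, and compares the results with the diagonal part and with the completely symmetric projection built from J_s and K_s.

    import itertools
    import numpy as np

    s = 3
    rng = np.random.default_rng(2)
    A = rng.standard_normal((s, s)); C = (A + A.T) / 2          # an arbitrary symmetric matrix

    def balance(C, group):
        # Balancing operator: average of H C H' over a finite group of matrices.
        return sum(H @ C @ H.T for H in group) / len(group)

    # Sign-change group Sign(s): all diagonal matrices with entries +-1.
    sign_group = [np.diag(e) for e in itertools.product([1.0, -1.0], repeat=s)]
    print(np.allclose(balance(C, sign_group), np.diag(np.diag(C))))        # True: diagonal part

    # Permutation group Perm(s): all s! permutation matrices.
    perm_group = [np.eye(s)[list(p)] for p in itertools.permutations(range(s))]
    J = np.ones((s, s)) / s
    K = np.eye(s) - J
    proj = np.trace(C @ J) * J + (np.trace(C @ K) / (s - 1)) * K            # projection onto span{J, K}
    print(np.allclose(balance(C, perm_group), proj))                        # True: completely symmetric part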

EXERCISES

13.1 In an m-way dth-degree polynomial fit model, let R(t) = At + b be a bijective affine transformation of T. Show that there exists a matrix Q ∈ GL(k) which, for all designs τ ∈ T, satisfies M_d(τ^R) = QM_d(τ)Q' [Heiligers (1988), p. 82].


13.2 Show that a Q-invariant design problem for K'θ has a Q-invariant feasibility cone, QA(K)Q' = A(K) for all Q ∈ Q.

13.3 Show that the information function φ(C) = min_{i≤s} c_ii is invariant under permutations and sign changes, but not under all orthogonal transformations [Kiefer (1960), p. 383].

13.4 For a subgroup H ⊆ GL(s), define the transposed subgroup H' = {H' : H ∈ H} ⊆ GL(s). Show that an information function φ is H-invariant if and only if the polar function φ^∞ is H'-invariant. Find some subgroups with H = H'.

C H A P T E R 14

Kiefer Optimality

A powerful concept is the Kiefer ordering of moment matrices, combining an improvement in the Loewner ordering with increasing symmetry relative to the group involved. This is illustrated with two-way classification models, by establishing Kiefer optimality, for the centered treatment contrasts as well as for a maximal parameter system, of the uniform design and of balanced incomplete block designs. As another example, uniform vertex designs in a first-degree model over the multidimensional cube are shown to be Kiefer optimal.

14.1. MATRIX MAJORIZATION

Suppose we are given a matrix group H ⊆ GL(s). For any two matrices C, D ∈ Sym(s), we say that C is majorized by D, denoted by C ≺ D, when C lies in the convex hull of the orbit of D under the congruence action of the group H,

This terminology is somewhat negligent of the group H, and of the fact that it acts on the underlying space Sym(s) by congruence. Both are essential and must be understood from the context. Other specifications reproduce the vector majorization of Section 6.9. There the underlying space is R^k, on which the permutation group Perm(k) acts by left multiplication, x ↦ Qx. In this setting majorization means, for any two vectors x, y ∈ R^k,

In other words, we have x = Sy, for a matrix S = Σ_{Q∈Perm(k)} α_QQ which is a convex combination of permutation matrices. Any such matrix S is doubly


stochastic, and vice versa. This is the content of the Birkhoff theorem (which we have circumvented in Section 6.9). Therefore vector majorization and matrix majorization are close relatives of each other, except for referring to distinct groups and different actions. Matrix majorization ≺ as in (1) defines a preordering on Sym(s), in that it is a reflexive and transitive relation. Indeed, reflexivity C ≺ C is evident. Transitivity follows since, if C₁ ≺ C₂ ≺ C₃, then C₁ is a convex combination of matrices (HG)C₃(HG)', and each product HG stays in H because of the group property. For an orthogonal subgroup H ⊆ Orth(s), the preordering ≺ on Sym(s) is antisymmetric modulo H,

This follows from the strict convexity and the orthogonal invariance of the matrix norm ||C|| = (trace C²)^{1/2}. If we have C = Σ_i α_iH_iDH_i' ≺ D, then we get

Hence C ≺ D ≺ C entails equality in (2), and strict convexity of the norm forces C = H_iDH_i' for some i. In any case, only subgroups of the group of unimodular transformations are of interest, H ⊆ Unim(s), by Lemma 13.8. Then antisymmetry modulo H prevails provided we restrict the preordering ≺ to the open cone PD(s). This follows from replacing the matrix norm ||·|| by the determinant criterion φ₀ and reversing the inequality in (2), since φ₀ is strictly concave on PD(s) besides being unimodularly invariant. For any two matrices C, D ∈ NND(s), matrix majorization C ≺ D relative to a group H implies an improvement in terms of every H-invariant information function φ,

Unfortunately the terminology that C is majorized by D is not indicative of C being an improvement over D. For the purposes of the design of experiments, it would be more telling to call C more balanced than D and to use the reversed symbol ≻, but we refrain from this new terminology. The suggestive orientation comes to bear as soon as we combine matrix majorization with the Loewner ordering, to generate the Kiefer ordering.
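This improvement can be watched numerically. The following Python sketch takes the group generated by permutations and sign changes in dimension two, forms the orbit average of a diagonal matrix D (a particular convex combination of the orbit, hence majorized by D), and checks that three familiar invariant criteria do not decrease; the choice of D is arbitrary.

    import itertools
    import numpy as np

    # Group generated by permutations and sign changes of the coordinates of R^2.
    group = []
    for perm in itertools.permutations(range(2)):
        P = np.eye(2)[list(perm)]
        for signs in itertools.product([1.0, -1.0], repeat=2):
            group.append(np.diag(signs) @ P)

    D = np.diag([3.0, 1.0])                                   # an arbitrary nonnegative definite matrix
    D_bar = sum(H @ D @ H.T for H in group) / len(group)      # orbit average: D_bar is majorized by D

    def criteria(C):
        lam = np.linalg.eigvalsh(C)
        return {"lambda_min": lam.min(),                       # smallest eigenvalue
                "det^(1/2)": np.prod(lam) ** 0.5,              # determinant criterion
                "trace/2": lam.mean()}                         # average eigenvalue (linear criterion)

    for name in ("lambda_min", "det^(1/2)", "trace/2"):
        print(name, criteria(D)[name], "->", criteria(D_bar)[name])
    # Each invariant criterion is at least as large at D_bar as at D (the trace stays equal).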

354

CHAPTER 14: KIEFER OPTIMALITY

14.2. THE KIEFER ORDERING OF SYMMETRIC MATRICES

All of the earlier chapters stress the importance of the Loewner ordering ≥ of symmetric matrices for the design of experiments. The previous section has shown that matrix majorization ≺ provides another meaningful comparison, in the presence of a group H ⊆ GL(s) acting by congruence on Sym(s). The two order relations capture complementary properties, at least for orthogonal subgroups H ⊆ Orth(s). In this case C ≥ D and C ≺ D imply C = D, for then the matrix C − D is nonnegative definite with trace(C − D) = trace(Σ_i α_iH_iDH_i' − D) = 0. The Loewner ordering C ≥ D fails to pick up the improvement that is captured by matrix majorization, C ≺ D. Part (d) of Theorem 13.12 suggests a way of merging the Loewner ordering and matrix majorization into a single relation. Given two matrices C, D ∈ Sym(s), we call C more informative than D and write C ≫ D when C is better in the Loewner ordering than some intermediate matrix E which is majorized by D,

We call the relation ≫ the Kiefer ordering on Sym(s) relative to the group H. The notation ≫ is reminiscent of the fact that two stages C ≥ E ≺ D enter into definition (1): the Loewner improvement ≥, and the improved balancedness as expressed by the majorization relation ≺ (see Exhibit 14.1). Exhibit 14.1 utilizes the isometry

from the cone NND(2) to the ice-cream cone from Section 2.5. The group H consists of permutations and sign-changes,

Then the matrix


EXHIBIT 14.1 The Kiefer ordering. The Kiefer ordering is a two-stage ordering, C ≫ D ⟺ C ≥ E ≺ D for some E, combining a Loewner improvement C ≥ E with matrix majorization E ∈ conv{HDH' : H ∈ H}, relative to a group H acting by congruence on the matrix space Sym(s).

travels through the orbit


These matrices together with D generate the convex hull, a quadrilateral, which contains the points E that are majorized by D. Any matrix C ≥ E then improves upon D in the Kiefer ordering, C ≫ D.

The Kiefer ordering on Sym(s) is reflexive, C ≫ C. It is also transitive: if we have C₁ ≫ C₂ and C₂ ≫ C₃, then we may choose suitable intermediate matrices to obtain C₁ ≫ C₃. That is, ≫ is a preordering. The Kiefer ordering is antisymmetric modulo H, under the same provisos that are required for matrix majorization. First we consider an orthogonal subgroup H ⊆ Orth(s). Then C ≫ D and D ≫ C entail

Equality holds since the final sum has the same trace as C. Invoking strict convexity of the norm ||·|| as in (2) of Section 14.1, we obtain C = H_iDH_i' for some i. Secondly we admit larger groups H ⊆ Unim(s), but restrict the antisymmetry property to the open cone PD(s). Here it suffices to appeal to the determinant criterion φ₀, since for C ≫ D monotonicity, concavity, and unimodular invariance of φ₀ imply

In the case of C ≫ D ≫ C, we get equality throughout. Then strict concavity of φ₀ on PD(s) establishes antisymmetry modulo H of the relation ≫ on the cone PD(s). The latter argument involves a monotonicity property of φ₀ relative to the Kiefer ordering which actually extends to every H-invariant information function. Also the information matrix mapping C_K turns out to be monotonic if we equip the space Sym(k) with its Kiefer ordering ≫ relative to the underlying group Q ⊆ GL(k),

for all A, B ∈ Sym(k). We leave the notation the same, even though something like A ≫_Q B and C ≫_H D would better highlight the particulars of the


Kiefer ordering on Sym(k) relative to the group Q, and on Sym(s) relative to the group H. The following theorem details the implications between these order relations, and thus expands on the implication from part (d) to part (a) of Theorem 13.12. The underlying assumptions are formulated with a view towards invariant design problems as outlined in Section 13.6.

14.3. MONOTONIC MATRIX FUNCTIONS

Theorem. Let the k×s matrix K be of rank s. Assume that the transformations in the matrix group Q ⊆ GL(k) leave the range of K invariant, Q(range K) = range K for all Q ∈ Q, with equivariance group H ⊆ GL(s). Then the information matrix mapping C_K is isotonic relative to the Kiefer orderings on Sym(k) and on Sym(s), as is every H-invariant information function φ on NND(s):

for all A, B ∈ NND(k).

Proof. The mapping C_K is isotonic relative to the Loewner ordering and matrix concave, by Theorem 3.13. Hence A ≥ E implies C_K(A) ≥ C_K(E), and concavity applied to E ∈ conv{QBQ' : Q ∈ Q} gives C_K(E) ≥ Σ_i α_iC_K(Q_iBQ_i'). Equivariance yields C_K(Q_iBQ_i') = H_iC_K(B)H_i', say, by Lemma 13.5. Together we get C_K(A) ≥ Σ_i α_iH_iC_K(B)H_i' ∈ conv{HC_K(B)H' : H ∈ H}, that is, C_K(A) ≫ C_K(B). Similarly we obtain φ(C_K(A)) ≥ Σ_i α_iφ(H_iC_K(B)H_i'). In the presence of H-invariance of φ the last sum becomes φ(C_K(B)).

For the trivial group, H = {I_s}, the Kiefer ordering ≫ coincides with the Loewner ordering ≥, and Kiefer optimality is the same as Loewner optimality (compare Section 4.4). Otherwise the two orderings are distinct, with the Kiefer ordering having the potential of comparing more pairs C, D ∈ Sym(s) than the Loewner ordering, C ≥ D ⟹ C ≫ D. Therefore there is a greater chance of finding a matrix C that is optimal relative to the Kiefer ordering ≫.

14.4. KIEFER OPTIMALITY

Let H be a subgroup of nonsingular s×s matrices. No assumption is placed on the set M ⊆ NND(k) of competing moment matrices.

DEFINITION. A moment matrix M ∈ M is called Kiefer optimal for K'θ in M relative to the group H ⊆ GL(s) when the information matrix C_K(M)


is H-invariant and satisfies

where ≫ is the Kiefer ordering on Sym(s) relative to the group H. Given a subclass of designs, Ξ̃ ⊆ Ξ, a design ξ ∈ Ξ̃ is called Kiefer optimal for K'θ in Ξ̃ when its moment matrix M(ξ) is Kiefer optimal for K'θ in M(Ξ̃).

We can now be more specific about the two distinct ways to discuss optimality in invariant design problems, as announced earlier in Section 13.6. There we built on simultaneous optimality under all invariant information functions; now we refer to the Kiefer ordering. In all practical applications the equivariance group is a closed subgroup of orthogonal matrices, H ⊆ Orth(s). It is an immediate consequence of Theorem 13.12 that a moment matrix M ∈ M is then Kiefer optimal for K'θ in M if and only if, for all H-invariant information functions φ, the matrix M is φ-optimal for K'θ in M. Therefore the two different avenues towards optimality in invariant design problems lead to the same goal, as anticipated in Section 13.6. An effective method of finding Kiefer optimal moment matrices rests on part (c) of Theorem 13.12, in achieving a smaller problem dimensionality by a transition from the space of symmetric matrices to the subspace of invariant symmetric matrices. However, for invariant design problems, we have meanwhile created more than a single invariance concept: H-invariance of information matrices in Sym(s), and Q-invariance of moment matrices in Sym(k); we may also consider R-invariance of designs in T. It is a manifestation of the coherence of the approach that invariance is handed down from one level to the next, as we work through the design problem.

14.5. HERITABILITY OF INVARIANCE

The full ramifications of an invariant design problem have a group R acting on the experimental domain T (Section 13.2), a matrix group Q acting by left multiplication on the regression range X (Section 13.3) as well as acting by congruence on the set of moment matrices M (Section 13.4), and a matrix group H acting by congruence on the set of information matrices C_K(M). The assumptions are such that the regression function f is R-Q-equivariant (Section 13.3), and that the information matrix mapping C_K is Q-H-equivariant (Section 13.5). On the design level, a transformation R ∈ R "rotates" a design τ ∈ T into the design τ^R defined by


A design τ ∈ T is called R-invariant when it satisfies τ = τ^R for all R ∈ R. We claim that every R-invariant design τ has a Q-invariant moment matrix M(τ),

Indeed, since Q is the homomorphic image of R that makes f equivariant, every matrix Q ∈ Q originates from a transformation R ∈ R such that Qf(t) = f(R(t)) for all t ∈ T. We obtain

For an R-invariant design τ, the latter is equal to M(τ), and this proves (1). The converse implication in (1) is generally false. For instance, in the trigonometric fit model of Section 9.16, the experimental domain T = [0;2π) is invariant under rotations by a fixed angle t. That is, the group is R = [0;2π) and the action is addition, (r,t) ↦ r + t. The sin-cos addition theorem gives

Hence we get f(r + t) = Qf(t), where the (2d+1) x (2d+1) matrix Q is block diagonal, with top left entry 1 followed by the 2 x 2 rotation matrices S_α, for α = 1,...,d. In this setting, the equispaced support designs from Section 9.16 fail to be R-invariant, while their common moment matrix

is evidently Q-invariant. In fact, the only R-invariant probability measure is proportional to Lebesgue measure on the circle. Since in our terminology of Section 1.24 a design requires a finite support, we can make the stronger statement that no design is R-invariant, in this example.
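The trigonometric example is easy to check numerically. The following sketch (Python with NumPy; the helper names f and Q, the degree d = 2, and the choice of a 7-point equispaced design are illustrative and not taken from the text) confirms that the equispaced design has a Q-invariant moment matrix even though, having finite support, it cannot itself be R-invariant.

```python
import numpy as np

d = 2
def f(t):                               # trigonometric regression vector of degree d
    comps = [1.0]
    for a in range(1, d + 1):
        comps += [np.cos(a * t), np.sin(a * t)]
    return np.array(comps)

def Q(r):                               # induced block-diagonal rotation matrix
    m = np.zeros((2 * d + 1, 2 * d + 1)); m[0, 0] = 1.0
    for a in range(1, d + 1):
        c, s = np.cos(a * r), np.sin(a * r)
        m[2*a-1:2*a+1, 2*a-1:2*a+1] = [[c, -s], [s, c]]
    return m

n = 7                                   # equispaced support design, weight 1/n each
supp = 2 * np.pi * np.arange(n) / n
M = sum(np.outer(f(t), f(t)) for t in supp) / n

print(np.allclose(M, np.diag([1, .5, .5, .5, .5])))     # True
r = 0.37                                # an arbitrary rotation angle
print(np.allclose(Q(r) @ M @ Q(r).T, M))                # True: M is Q-invariant
print(np.allclose(f(r + supp[1]), Q(r) @ f(supp[1])))   # True: equivariance of f
```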

On the moment matrix level, we claim that every Q-invariant matrix A ∈ Sym(k, Q) leads to an H-invariant information matrix C_K(A) ∈ Sym(s, H). Again we appeal to the fact that H is a homomorphic image of Q which makes C_K equivariant, by Lemma 13.5. Hence if H ∈ H stems from Q ∈ Q, then equivariance yields HC_K(A)H' = C_K(QAQ'). For a Q-invariant matrix A the latter is equal to C_K(A), and this proves (2). The converse implication in (2) is generally false. For instance, in the parabola fit model of Section 13.5, the equivariance group becomes trivial if


the parameters K'θ = (θ_0, θ_2)' are of interest, H = {I_2}. Hence all information matrices for K'θ are H-invariant. On the other hand, Q-invariance of the moment matrix requires the odd moments to vanish. Hence for a nonsymmetric design τ on [-1;1], the moment matrix M(τ) is not Q-invariant, but the information matrix C_K(M(τ)) is H-invariant, in this example. In summary, this establishes a persuasive heritability chain,

of which neither implication can generally be reversed. Heritability of invariance has repercussions on Kiefer optimality, as follows. The definition of Kiefer optimality refers to the subgroup H ⊆ GL(s) only. Usually H is the equivariance group that arises from the Q-invariance of the design problem for K'θ in M. If we add our prevalent earlier assumption that the set M is compact and convex, then it suffices to seek the optimum in the subset of Q-invariant moment matrices.

14.6. KIEFER OPTIMALITY AND INVARIANT LOEWNER OPTIMALITY

Theorem. Assume that Q ⊆ Orth(k) is a closed subgroup of orthogonal matrices such that the design problem for K'θ in M is Q-invariant, with equivariance group H ⊆ GL(s). Let the set M ⊆ NND(k) of competing moment matrices be compact and convex. Then for every Q-invariant moment matrix M ∈ M ∩ Sym(k, Q) the following statements are equivalent:
a. (Kiefer optimality) M is Kiefer optimal for K'θ in M.
b. (Invariant Loewner optimality) M is Loewner optimal for K'θ in M ∩ Sym(k, Q).

Proof. The balancing operator maps M into M, that is, Ā = ∫_Q QAQ' dQ ∈ M for all A ∈ M, because of convexity and closedness of M. The image of M under the balancing operator is the set of invariant moment matrices, M̄ = M ∩ Sym(k, Q).
First we assume (a). For a matrix A ∈ M ∩ Sym(k, Q), Kiefer optimality of M in M yields C_K(M) ≫ C_K(A). But Q-invariance of A entails H-invariance of C_K(A). Hence we get C_K(M) ≥ C_K(A), that is, Loewner optimality of M for K'θ in M ∩ Sym(k, Q).
Now we assume (b). Given a matrix A ∈ M, we use the Caratheodory theorem to represent the average Ā ∈ conv{QAQ' : Q ∈ Q} as a finite convex combination, Ā = Σ_i α_i Q_i A Q_i'. Concavity of C_K gives C_K(Ā) ≥ Σ_i α_i C_K(Q_i A Q_i') = Σ_i α_i H_i C_K(A) H_i' = E, say, where E lies in the convex hull of the orbit of C_K(A) under H. By (b) we have C_K(M) ≥ C_K(Ā). Altogether we get C_K(M) ≥ E, that is, Kiefer optimality of M for K'θ in M.
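A small numerical sketch of the balancing operator for a finite group, here Perm(3) acting by congruence (Python/NumPy; the matrix A, the criterion phi, and the random seed are illustrative, not from the text). It only illustrates the mechanism behind the proof: the average is invariant, and no invariant concave criterion is decreased by balancing.

```python
import numpy as np
from itertools import permutations

k = 3
perm_mats = [np.eye(k)[list(p)] for p in permutations(range(k))]    # the group Perm(3)

rng = np.random.default_rng(0)
B = rng.standard_normal((k, k)); A = B @ B.T                         # some A in NND(3)

A_bar = sum(P @ A @ P.T for P in perm_mats) / len(perm_mats)         # balancing operator

print(all(np.allclose(P @ A_bar @ P.T, A_bar) for P in perm_mats))   # True: A_bar is invariant
phi = lambda C: np.linalg.eigvalsh(C).min()                          # an invariant, concave criterion
print(phi(A_bar) >= phi(A) - 1e-12)                                  # True: balancing never hurts
```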


The theorem has a simple analogue for φ-optimality, assuming that φ is an H-invariant information function on NND(s).

14.7. OPTIMALITY UNDER INVARIANT INFORMATION FUNCTIONS

Theorem. Under the assumptions of Theorem 14.6, the following statements are equivalent for every Q-invariant moment matrix M ∈ M ∩ Sym(k, Q) and every H-invariant information function φ on NND(s):
a. (All moment matrices) M is φ-optimal for K'θ in M.
b. (Invariant moment matrices) M is φ-optimal for K'θ in M ∩ Sym(k, Q).

Proof. The direct part is immediate. The converse exploits the monotonicity established in Section 14.3, φ(C_K(Ā)) ≥ φ(C_K(A)) for all A ∈ M.
The point that comes to bear is that the set of invariant moment matrices, M̄ = M ∩ Sym(k, Q), lies in the space Sym(k, Q) which usually is of much lower dimension than Sym(k). The reduction in dimensionality is demonstrated in Section 13.10,

as opposed to the dimension k(k+1)/2 of Sym(k) itself. The dimensionality reduction by invariance also has repercussions on the General Equivalence Theorem. In Chapter 7 we place the optimal design problem in the "large" Euclidean space Sym(k), with inner product ⟨A,B⟩ = trace AB, and with dual space Sym(k). For invariant design problems, the present theorem embeds the optimization problem in the "small" Euclidean space Sym(k, Q) equipped with the restriction of the inner product ⟨·,·⟩, and with dual space Sym(k, Q') where Q' consists of the transposed matrices from Q. Therefore all the tools from duality theory (polar information functions, subgradients, normal vectors, and the like) carry a dual invariance structure of reduced dimensionality. The greater achievement, however, is the concept of Kiefer optimality. In terms of simultaneous optimality relative to classes of information functions, it mediates between all information functions and the determinant criterion, Φ ⊇ Φ(H) ⊇ {δφ_0 : δ > 0}, reflecting the inclusions {I_s} ⊆ H ⊆ Unim(s). More important, it gives rise to the Kiefer ordering ≫ which focuses on the comparison of any two moment matrices, including nonoptimal ones, rather than being overwhelmed by a narrow desire to achieve optimality.


If a noninvariant moment matrix M e M \ M. is Kiefer optimal then so is its projection M e M, because M ; M is an improvement in the Kiefer ordering. On the other hand, the potential of noninvariant moment matrices and designs to be Kiefer optimal is a source of great economy. Invariant designs tend to evenly spread out the weights over many, if not all, support points. A noninvariant design that has an invariant information matrix offers the chance to reduce the support size while maintaining optimality. This is precisely what balanced incomplete block designs are about, see Section 14.9.

14.8. KIEFER OPTIMALITY IN TWO-WAY CLASSIFICATION MODELS

In two-way classification models, the concept of Kiefer optimality unifies and extends the results on Loewner optimality from Section 4.8, and on rank deficient matrix mean optimality from Section 8.19. The experimental domain T = {1,...,a} x {1,...,b} comprises all treatment-block combinations (i,j) (see Section 1.5). The labeling of the treatments and the labeling of the blocks should not influence the design and its analysis. Hence we are aiming at invariance relative to the transformations (i,j) ↦ (ρ(i), σ(j)), where ρ is a permutation of the rows {1,...,a} and σ is a permutation of the columns {1,...,b}. In a more descriptive terminology, ρ represents a relabeling of treatments and σ a relabeling of blocks. The regression function f maps (i,j) into the vector

With the permutation matrices from Section 6.8, we find

Therefore the regression function / is equivariant under the treatment-block relabeling group Q which is defined by

With the conventions of Section 1.27, a design on T is an a x b block design W, with row sum vector r = Wlb and column sum vector s = Wla. The


congruence action on the moment matrices

turns out to be

(Evidently there is also an action on the block designs W themselves, namely left multiplication by R and right multiplication by S'.) In this setting we discuss the design problem (I) for the centered treatment contrasts in the class T(r) of designs with given row sum vector r, (II) for the same parameters in the full class T, and (III) for a maximal parameter system in T. Problem I is invariant under the block relabeling subgroup of Q; problems II and III are invariant under the group Q itself.
I. The results on Loewner optimality in Section 4.8 pertain to the set T(r) of block designs with a fixed positive row sum vector r. The set M = {M(W) : W ∈ T(r)} of competing moment matrices is compact and convex, and invariant under each transformation in the block relabeling group

The parameter system of interest is the centered treatment contrasts AT'(0) = Kaa, with coefficient matrix K (*a) and with centering matrix Ka = Ia - Ja as defined in Section 3.20. The transformations in Q fulfill

whence the equivariance group becomes trivial, H = {Ia}. We may summarize as follows: The design problem for the centered treatment contrasts in the set T(r) of designs with fixed row sum vector r is invariant under relabeling of the blocks; the equivariance group is trivial, and the new concept of Kiefer optimality falls back on the old concept of Loewner optimality. To find an optimal block design we may start from any weight matrix W e T(r) and need only average it over Q,


We have used the Euclidean unit vectors d_j of R^b. This also yields the average of the moment matrix. The improvement of the moment matrices entails an improvement of the contrast information matrices,

by Theorem 14.3. In fact, since the equivariance group is trivial, we experience the remarkable constellation that pure matrix majorization of the moment matrices, M(W̄) ≺ M(W), translates into a pure Loewner comparison of the information matrices, C_K(M(W̄)) ≥ C_K(M(W)). Equality holds if and only if W is of the form rs', using the same argument as in Section 4.8. Hence we have rederived the result from that section: The product designs rs' with arbitrary column sum vector s are the only Loewner optimal designs for the centered contrasts of factor A in the set T(r) of designs with row sum vector equal to r; the optimal contrast information matrix is Δ_r − rr'. The invariance approach would seem to reveal more of the structure of the present design problem, whereas in Section 4.8 less theory was available and we had to rely on more technicalities.
II. In Section 8.19, we refer optimality to the full set T rather than to some subset. Now we not only rederive that result, on simultaneous optimality relative to all rank deficient matrix means, but strengthen it to Kiefer optimality. The set of competing moment matrices is maximal, M = M(T). It is invariant under each transformation in the treatment-block relabeling group Q introduced in the preamble to this section. For the centered treatment contrasts K'θ = K_a α, the transformations in Q fulfill

Hence here the equivariance group is the treatment relabeling group, H Perm(fl). We may summarize as follows: The design problem for the centered treatment contrasts in the set of all designs is invariant under relabeling of treatments and relabeling of blocks; the equivariance group is the permutation group Perm(fl), and Kiefer optimality is the same as simultaneous optimality relative to the permutationally invariant information functions on NND(a). This class of criteria contains the rank deficient matrix means that appeared in Section 8.19, as well as many other criteria.


Centering of an arbitrary block design W e T leads to the uniform design,

The contrast information matrix becomes C_K(M(W̄)) = K_a/a. A weight matrix W achieves the contrast information matrix K_a/a if and only if it is an equireplicated product design, W = (1_a/a)s'. This improves upon the first result of Section 8.19: The equireplicated product designs (1_a/a)s' with arbitrary column sum vector s are the unique Kiefer optimal designs for the centered treatment contrasts in the set T of all block designs; the optimal contrast information matrix is K_a/a.
III. In just the same fashion, we treat the maximal parameter system

relative to the full moment matrix set M = M(T) and the full treatmentblock relabeling group Q. From

the equivariance group

is seen to be isomorphic to the treatment-block relabeling group Q. This generalizes the second result of Section 8.19: The uniform design assigning weight 1/(ab) to every treatment-block combination is the unique Kiefer optimal design for the maximal parameter system (1) in the set T of all block designs. These examples have in common that the design problems are invariant on each of the levels discussed in Section 14.5. The balancing operation may be carried out on the top level of moment matrices, or even for the weight matrices. This is not so when restrictions are placed on the support. Yet we profit from invariance by switching from the top-down approach to a bottom-up approach, backtracking from the anticipated "balanced solution" and exploiting invariance as long as is feasible.
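The two summary statements of this section are easy to check numerically, using the representation of the contrast information matrix C_K(M(W)) = Δ_r − W Δ_s^− W' as in Section 4.8. The sketch below (Python/NumPy) uses an arbitrary 3 x 4 weight matrix; the dimensions, the random W, and all variable names are illustrative assumptions.

```python
import numpy as np

a, b = 3, 4
rng = np.random.default_rng(1)
W = rng.random((a, b)); W /= W.sum()                 # an arbitrary a x b block design (weight matrix)
r, s = W.sum(axis=1), W.sum(axis=0)                  # row and column sum vectors

def C_K(W):                                          # contrast information matrix Delta_r - W Delta_s^- W'
    r, s = W.sum(axis=1), W.sum(axis=0)
    return np.diag(r) - W @ np.diag(1.0 / s) @ W.T

K_a = np.eye(a) - np.ones((a, a)) / a                # centering matrix

# Product design r s': contrast information Delta_r - r r'
print(np.allclose(C_K(np.outer(r, s)), np.diag(r) - np.outer(r, r)))    # True

# Equireplicated product design (1_a/a) s': contrast information K_a / a
W_eq = np.outer(np.ones(a) / a, s)
print(np.allclose(C_K(W_eq), K_a / a))                                   # True
```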


14.9. BALANCED INCOMPLETE BLOCK DESIGNS

In a two-way classification model the best block design is the uniform design assigning weight 1/(ab) to every combination. It is optimal in the three situations of the previous section, and also in others. For it to be realizable, the sample size n must be a multiple of ab, that is, fairly large. For small sample size, n < ab, a block design for sample size n necessarily has an incomplete support, that is, some of the treatment-block combinations (i,j) are not observed. Other than that, it seems persuasive to salvage as many properties of the uniform design as possible. This is achieved by a balanced incomplete block design.

DEFINITION. An a x b weight matrix W is called a balanced incomplete block design when (a) W is an incomplete uniform design, that is, it has n < ab support points and assigns uniform weight 1/n to each of them, (b) W is balanced for the centered treatment contrasts, in the sense that its contrast information matrix is completely symmetric and nonzero, and (c) W is equiblocksized, W'1_a = (1/b)1_b.

With this definition a balanced incomplete block design W is in fact an a x b block design, W ∈ T, and it is realizable for sample size n = #supp W and multiples thereof. The limiting case n = ab would reproduce the uniform design, but is excluded by the incompleteness requirement (a). For a balanced incomplete block design W, the incidence matrix N = nW has entries 0 or 1. It may be interpreted as the indicator function of the support of W, or as the frequency matrix of a balanced incomplete block design for sample size n. The focus on support set and sample size comes to bear better by quoting the weight matrix in the form N/n, with n = 1_a'N1_b. The balancedness property (b) can be specified further. We have

since (c) yields column sum vector s = (1/b)1_b, while (a) implies n_ij ∈ {0,1}. From Section 13.10, complete symmetry means that C_K(N/n) is a linear combination of the averaging matrix J_a and the centering matrix K_a. We get

since C_K(N/n) has row sums 0, whence the coefficient of J_a vanishes, 1_a'C1_a =


0. Therefore the balancedness requirement (b) is equivalently expressed through the formula C_K(N/n) = ((n−b)/(n(a−1)))K_a, with n > b. In part I, we list more properties that the parameters of a balanced incomplete block design necessarily satisfy. Then we establish (II) Kiefer optimality for the centered treatment contrasts, and (III) matrix mean optimality for a maximal parameter system.
I. Whether a balanced incomplete block design exists is a combinatorial problem, of (a) arranging n ones and ab − n zeros into an a x b incidence matrix N such that N is (b) balanced and (c) equiblocksized. This implies a few necessary conditions on the parameters a, b, and n, as follows.
Claim. If an a x b balanced incomplete block design for sample size n exists, then its incidence matrix N has constant column sums as well as constant row sums, and the treatment concurrence matrix NN' fulfills

with positive integers n/a (the common treatment replication number), n/b (the common blocksize), and the common concurrence number λ of any two treatments,

This necessitates a ≤ b, and n ≥ a + b − 1.
Proof. That N has constant column sums follows from the equiblocksize property (c). The balancedness property (b) yields constant row sums r_i = 1/a of N/n,

observing n > b. Thus (b) becomes

from which we calculate NN' in (1). The off-diagonal element λ is the inner product of two rows of N, and hence counts how often two treatments concur in the same blocks. From n > b we get λ ≠ 0. Thus λ is a positive integer, as are the common treatment replication number n/a and the common blocksize n/b of N. The incompleteness condition n < ab secures n/a > λ,


whence NN' is positive definite. This implies a ≤ b, and

The latter entails n > a + b - 1. The proof is complete. The optimality statements refer to the set T(N) of designs with support included in the support of N (and which are thus absolutely continuous relative to N/n),

This set is convex and also closed, since the support condition is an inclusion and not an equality. The design set T(N) is not invariant under relabeling treatments or blocks, nor is the corresponding moment matrix set M(T(N)) invariant. Hence we can exploit invariance only on the lowest level, that of information matrices.
II. For the centered treatment contrasts K_a α, a balanced incomplete block design is Kiefer optimal relative to the treatment relabeling group H = Perm(a).
Claim. A balanced incomplete block design N/n is Kiefer optimal for the centered treatment contrasts in the set T(N) of designs for which the support is included in the support of N; their contrast information matrix is

Proof. To see this, let W ∈ T(N) be a competing block design. Balancing the information matrix relative to Perm(a) entails matrix majorization, C̄_K(W) ≺ C_K(W), with

where s_j are the column sums of W. Hence the Loewner comparison of C_K(N/n) and C̄_K(W) boils down to comparing traces. Because of the support inclusion, any competitor W ∈ T(N) satisfies w_ij = n_ij w_ij. The Cauchy inequality now yields


In terms of traces this means trace C̄_K(W) ≤ trace C_K(N/n). Altogether we obtain C_K(N/n) ≥ C̄_K(W) ≺ C_K(W), that is, C_K(N/n) ≫ C_K(W). Therefore N/n is Kiefer optimal for K_a α in T(N), and the proof is complete.
In retrospect, the proof has two steps. The first is symmetrization using the balancing operator, without asking whether C̄_K(W) is an information matrix originating from some design in T or not. The second is maximization of the trace, which enters because the trace determines the projection onto the invariant subspace formed by the completely symmetric matrices. The trace is not primarily used as the optimality criterion. Nevertheless the situation could be paraphrased by saying that N/n is trace optimal for K_a α in T(N). This is what we have alluded to in Section 6.5, that trace optimality has its place in the theory provided it is accompanied by some other strong property such as invariance. For fixed support size n, any balanced incomplete block design Ñ/n, with a possibly different support than N/n, has the same contrast information matrix

Hence the optimal contrast information matrix is the same for the design sets T(W) and T(N). As a common superset we define

This set comprises those block designs for which each block contains at most n/b support points. For competing designs from this set, W T(n/b), the Cauchy inequality continues to yield

Thus optimality of any balanced incomplete block design N/n extends from the support restricted set T(N) to the larger, parameter dependent set T(n/b). This set may well fail to be convex. Another example of a nonconvex set over which optimality extends is the discrete set T_{n,b} of those block designs for sample size n for which all b blocksizes are positive. This is established by a direct argument. If Ñ ∈ T_{n,b} assigns frequency ñ_ij ∈ {0,1,2,...} to treatment-block combination (i,j), then ñ_ij² ≥ ñ_ij yields the estimate


EXHIBIT 14.2 Some 3 x 6 block designs for n = 12 observations. The equireplicated product design N_1/n is Kiefer optimal in T. The balanced incomplete block design N_2/n is Kiefer optimal in T(n/b), and in T_{12,6}. The designs N_3/n ∈ T(n/b) and N_4 ∈ T_{12,6} perform just as well, as does N_5.

with equality if and only if all frequencies ñ_ij are 0 or 1. Hence the incidence matrix N of a balanced incomplete block design is Kiefer optimal for K_a α in the discrete set T_{n,b}. Thus we may summarize as follows: An a x b balanced incomplete block design for sample size n is Kiefer optimal relative to the treatment permutation group Perm(a) for the centered treatment contrasts, in the set T(n/b) of those block designs for which each block contains at most n/b support points, as well as in the set T_{n,b}/n, where T_{n,b} are the designs for sample size n with at least one observation in each of the b blocks. Exhibit 14.2 illustrates that the dominating role that balanced incomplete block designs play in the discrete theory is not supported by their optimality properties for the centered treatment contrasts. Design N_1 says that it is better not to make observations in more than one block. This is plainly intelligible but somewhat beside the point. Two-way classification models are employed because it is thought that blocking cannot be avoided. Hence the class T_{n,b}


with all b blocksizes positive is of much greater practical interest. But even then there are designs such as N4 which fail to be balanced incomplete block designs but perform just as well. III. A balanced incomplete block design is distinguished by its optimality properties for an appropriate maximal parameter system. In Section 14.8 (III) we studied a maximal parameter set which treats the treatment effects a and the block effects j8 in an entirely symmetric fashion, E[Y,;] = (a. + j3.) + (a, a.)+ (07-0.). However, a balanced incomplete block design concentrates on treatment contrasts and handles the set of all parameters in an unsymmetric way,

In other words, expected yield decomposes according to E[F,-y] = (a, - a.) + (j3y-+a.). This system yields K = MG, where M is the moment matrix of a balanced incomplete block design N/n, while the specific generalized inverse G of M is given by

In particular, M is feasible for K'θ since MGK = MGMG = MG = K shows that the range of M includes the range of K. The dispersion matrix D = K'GK becomes D = GMGMG = GMG = G, as M is readily verified to be a generalized inverse of G. Optimality then holds with respect to the rank deficient matrix means of Section 8.18.
Claim. A balanced incomplete block design N/n is the unique φ_p-optimal design for the maximal parameter system (2) in the set T(N) of designs for which the support is included in the support of N, for every p ∈ [−∞;1].
Proof. The proof does not use invariance directly since we do not have a group acting appropriately on the problem. Yet the techniques of Lemma 13.10 prove useful. In the present setting the normality inequality of Section 8.18 becomes

in case p (-00; 0). For p e (0; 1], we replace G~p by (G+y. The task is one of handling the powers G~p and (G+)p. To this end we introduce the four-dimensional matrix space C Sym(a+fe)


generated by the symmetric matrices

The space L contains the squares (V_i + V_j)², for all i, j = 1,2,3,4. Hence L is a quadratic subspace of symmetric matrices, that is, for any member C ∈ L also the powers C², C³, ... lie in L. Arguing as in the proof of part (c) of Lemma 13.10, we find that L also contains the Moore-Penrose inverse C⁺, as well as C^p and (C⁺)^p for C ∈ L with C ≥ 0. This applies to G, which is such a nonnegative definite linear combination of V_1, V_2, V_3, V_4. For p ∈ (−∞;0) therefore G^{−p} is a linear combination of the form

for some α_p, β_p, γ_p, δ_p ∈ R. For p ∈ (0;1] the same is true of (G⁺)^p. Now we insert (4) into (3), and use w_ij = n_ij w_ij for W ∈ T(N). Some computation proves both sides in (3) to be equal. This establishes optimality for p ∈ (−∞;0) ∪ (0;1], and Lemma 8.15 extends optimality to p = −∞ and p = 0. Uniqueness follows from Corollary 8.14: in M(W)G = K, the left bottom block determines W in terms of N and the blocksizes, while the right bottom block forces all blocksizes to equal 1/b; together this yields W = N/n, thus completing the proof.
Results of a similar nature hold for multiway classification models, using essentially the same tools. Instead we find it instructive to discuss an example of a different type.
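Before leaving this section, the defining properties of a balanced incomplete block design and the concurrence structure (1) can be verified on a concrete instance. The sketch below (Python/NumPy) uses the classical 7 x 7 design generated by the Fano plane; the block list, the variable names, and the formula for the contrast information matrix C_K = Δ_r − WΔ_s^−W' are assumptions of this illustration, not quotations from the text.

```python
import numpy as np

# Fano-plane balanced incomplete block design: a = b = 7, blocksize n/b = 3, n = 21
blocks = [(0,1,2), (0,3,4), (0,5,6), (1,3,5), (1,4,6), (2,3,6), (2,4,5)]
a, b = 7, 7
N = np.zeros((a, b))
for j, blk in enumerate(blocks):
    N[list(blk), j] = 1
n = int(N.sum())                                     # 21 observations, replication n/a = 3

lam = 1                                              # every pair of treatments concurs once
print(np.allclose(N @ N.T, (n/a - lam) * np.eye(a) + lam * np.ones((a, a))))   # True: structure (1)

W = N / n                                            # weight matrix of the BIBD
r, s = W.sum(axis=1), W.sum(axis=0)
C = np.diag(r) - W @ np.diag(1.0 / s) @ W.T          # contrast information matrix
K_a = np.eye(a) - np.ones((a, a)) / a
print(np.allclose(C, (n - b) / (n * (a - 1)) * K_a)) # True: completely symmetric and nonzero
```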

14.10. OPTIMAL DESIGNS FOR A LINEAR FIT OVER THE UNIT CUBE

A reduction by invariance was tacitly executed in Section 8.6, where we computed φ_p-optimal designs for a linear fit over the unit square X = [0;1]^k, with k = 2. We now tackle the general dimension k, as an invariant design problem. We start out from a k-way first-degree model without a constant term,

also called a multiple linear regression model. The t regression vectors *, are assumed to lie in the unit cube X = [0; 1]*. From Theorem 8.5, we concentrate


EXHIBIT 14.3 Uniform vertex designs. A j-vertex has j entries equal to 1 and k−j entries equal to 0, shown for k = 3 and j = 0,1,2,3. The j-vertex design ξ_j assigns uniform weight 1/(k choose j) to all j-vertices of [0;1]^k.

on the extreme points of the Elfving set R = conv(X ∪ (−X)). We define x ∈ [0;1]^k to be a j-vertex when j entries of x are 1 while the other k−j entries are 0, for j = 0,...,k. The set of extreme points of R comprises those j-vertices with j ≥ ⌈(k−1)/2⌉. A further reduction exploits invariance. The regression range X is permutationally invariant, Qx ∈ [0;1]^k for all x ∈ [0;1]^k and Q ∈ Perm(k). Also the design problem for the full parameter vector θ in the set of all moment matrices M(Ξ) is permutationally invariant, QM(ξ)Q' ∈ M(Ξ) for all ξ ∈ Ξ and Q ∈ Perm(k). The j-vertex design ξ_j is defined to be the design which assigns uniform weight 1/(k choose j) to the (k choose j) j-vertices of the cube [0;1]^k. The j-vertex designs ξ_j for j = 0,1,2,3 in dimension k = 3 are shown in Exhibit 14.3. The j-vertex design ξ_j has a completely symmetric moment matrix, with

If k is even, then we have \\(k - 1)] = \\(k + 1)J. If k is odd, then we get M ([(*-i)/2i) = M (|_(/fc+i)/2j)- Hence we need to study the /-vertex designs with/ > l.5(fc + l)J only. It turns out that mixtures of neighboring /-vertex designs suffice. To this


end we introduce a continuous one-parameter design family,

with integer j running from ⌊(k+1)/2⌋ to k−1. The support parameter s varies through the closed interval from ⌊(k+1)/2⌋ to k. If s = j is an integer, then ξ_s is the j-vertex design. Otherwise the two integers j and j+1 closest to s specify the vertices supporting ξ_s, and the fractional part s − j determines the weighting of ξ_j and ξ_{j+1}. As s grows, the weights shift towards the furthest point, the k-vertex. We call the designs in the class

the neighbor-vertex designs. For example, we have ξ_{3.11} = 0.89 ξ_3 + 0.11 ξ_4, or ξ_{3.4} = 0.6 ξ_3 + 0.4 ξ_4. We show that the neighbor-vertex designs (I) form an essentially complete class under the Kiefer ordering relative to the permutation group Perm(k), (II) contain φ_p-optimal designs for θ in Ξ, and (III) have an excessive support size that is improved upon by other optimal designs such as balanced incomplete block designs.
I. Our first result pertains to the Kiefer ordering of moment matrices.
Claim. For every design η ∈ Ξ, there exists a neighbor-vertex design ξ_s ∈ Ξ so that M(ξ_s) is more informative, M(ξ_s) ≫ M(η), in the Kiefer ordering relative to the permutation group Perm(k).
Proof. Let η ∈ Ξ be any competing design. From Theorem 8.5, there exists a design ξ ∈ Ξ which has only j-vertices for its support points such that M(ξ) ≥ M(η). A transformation Q carries ξ into ξ^Q, with ξ^Q(x) = ξ(Q^{−1}x) and M(ξ^Q) = QM(ξ)Q' (compare Section 14.5). The average designs ξ̄ = Σ_{Q ∈ Perm(k)} ξ^Q/k! and η̄ have moment matrices

Being invariant and supported by /-vertices, the design is a mixture of /vertex designs, ;^, with min, yy > 0 and ); y, 1. Its moment matrix is


EXHIBIT 14.4 Admissible eigenvalues. In a linear fit model over [0;1]^k, the eigenvalues of invariant moment matrices correspond to the points (α, β) in a polytope with vertices on the curve x(t) = t², y(t) = t(1−t). The points on the solid line are the admissible ones. Left: for k = 5; right: for k = 6.

The coefficients α and β fulfill (α, β) ∈ conv{(α_0, β_0), ..., (α_k, β_k)},

where the convex hull is a polytope formed from the k+1 points (α_j, β_j) which lie on the curve with coordinates x(t) = t² and y(t) = t(1−t) for t ∈ [0;1] (see Exhibit 14.4). The geometry of this curve visibly exhibits that we can enlarge (α, β) in the componentwise ordering to a point on the solid boundary. That is, for some j ∈ {⌊(k+1)/2⌋, ..., k} and some δ ∈ [0;1] we have

With s = j + δ, the neighbor-vertex design ξ_s then fulfills M(ξ_s) ≥ M(η̄). Thus we get M(ξ_s) ≫ M(η), and the proof is complete.
We may reinterpret the result in terms of moment matrices. Let M̄ be the set of moment matrices obtained from invariant designs that are supported by j-vertices, with j ≥ ⌊(k+1)/2⌋. Then M̄ consists of the completely symmetric matrices that are analysed in the above proof:

That is, M is a two-dimensional polytope in matrix space. The Loewner


ordering of the matrices

in M̄ coincides with the componentwise ordering of the coefficient pair (α, β). In this ordering, the maximal elements are those which lie on the solid boundary, on the far side from the origin. This leads to the moment matrices M(ξ_s) of the neighbor-vertex designs ξ_s, with s ∈ [⌊(k+1)/2⌋; k]. It follows from Theorem 14.3 that the neighbor-vertex designs also perform better under every information function φ on NND(k) that is permutationally invariant,

Therefore some member of the one-parameter family of neighbor-vertex designs is φ-optimal for θ in Ξ.
II. For the matrix means φ_p with parameter p ∈ [−∞;1], the interval [−∞;1] is subdivided using two interlacing sequences a(j) and b(j), defined by

The φ_p-optimal support parameter turns out to be s_k(p) = j in case p ∈ [a(j); b(j)], and is otherwise given as follows.
Claim. The unique φ_p-optimal design for θ in Ξ is the j-vertex design ξ_j in case p ∈ [a(j); b(j)]. In case p ∈ (b(j); a(j+1)), it is the neighbor-vertex design ξ_{s_k(p)} with s_k(p) given by

Proof. The proof identifies the optimal support points by maximizing the quadratic form Qp(x) = x'Mp~lx which appears as the left hand side in the


normality inequality of Theorem 7.20. The evaluation simplifies drastically since we know that the optimality candidate

is completely symmetric. Furthermore Qp(x) is convex in x e [0; 1]*, whence it suffices to maximize over the /-vertices x only. With this, the left hand side of the normality inequality depends on jc only through /,

say. For j-vertex designs and neighbor-vertex designs, we have β/(k−1) < α, while p − 1 < 0. Therefore the parabola h_p opens downwards, and attains its maximum over the entire real line at some point j_p ∈ R. Let j = ⌊j_p⌋ be the integer with j_p ∈ [j; j+1). The maximum of h_p over integer arguments is attained either at j, or else at j+1, or else at j and j+1 simultaneously. In case j_p < j + 1/2 the integer maximum of h_p occurs at j. The φ_p-optimal support points are the j-vertices, and the j-vertex design ξ_j is φ_p-optimal. To see when this happens we notice that the double inequality h_p(j−1) ≤ h_p(j) ≥ h_p(j+1) is equivalent to p ∈ [a(j); b(j)]. In case j_p > j + 1/2, the φ_p-optimal design is ξ_{j+1}. In case j_p = j + 1/2, the integer maximum of h_p occurs at j and at j+1. A neighbor-vertex design ξ_s is φ_p-optimal, with s ∈ [j; j+1]. This happens if and only if h_p(j) = h_p(j+1), or equivalently, s = s_k(p). The property s_k(p) ∈ (j; j+1) translates into p ∈ (b(j); a(j+1)). The proof is complete.
As a function of p, the support parameter s_k is continuous, with constant value j on [a(j); b(j)], and strictly increasing from j to j+1 on (b(j); a(j+1)). The j-vertex design for j = ⌊(k+1)/2⌋ is φ_p-optimal over an unbounded interval p ∈ [−∞; c_k], with c_k = b(⌊(k+1)/2⌋). Some values of c_k are as follows.

k      3      4      5      6      7      8      9      10
c_k    0.34  -0.46   0.30  -0.21   0.28  -0.13   0.26  -0.09

For odd k the (k+1)/2-vertex design is φ_p-optimal for all p ∈ [−∞;0]. For even k the k/2-vertex design is φ_p-optimal for all p ∈ [−∞;−1/2]. For even k, the


φ_0-optimal design ξ_{s_k(0)} turns out to place uniform weight

on all of the k/2-vertices and (k/2+1)-vertices. Thus for most parameters p ∈ [−∞;0], the φ_p-optimal design ξ_{s_k(p)} is supported by those vertices which have half of their entries equal to 1 and the other half equal to 0. This is also the limiting design for large dimension k,

To see this, we introduce j_k = ⌊s_k(p)⌋ ≤ k. We have s_k(p) ∈ [j_k; j_k+1) or, equivalently, p ∈ [a(j_k); a(j_k+1)). It suffices to show that j_k/k tends to 1/2. From j_k ≥ ⌊(k+1)/2⌋, we get liminf_{k→∞} j_k/k ≥ 1/2. The assumption limsup_{k→∞} j_k/k = a ∈ (1/2;1] leads to

if necessary along a subsequence (k_m)_{m≥1}, and contradicts p ≤ 1.
III. While the φ_p-optimal moment matrices are unique, the designs are not. In fact, the j-vertex design with j = ⌊(k+1)/2⌋ has a support size (k choose j) that grows exponentially with k, and violates the quadratic bound k(k+1)/2 from Section 8.3. We cannot reduce the number of candidate support points since equality in the normality inequality holds for all j-vertices. But we are not forced to use all of them, nor must we weight them uniformly. More economical designs exist, and may be obtained from balanced incomplete block designs. The relation is easily recognized once we contemplate the model matrix X that belongs to a j-vertex design ξ_j. Its ℓ = (k choose j) rows consist of the j-vertices x ∈ [0;1]^k, and X'X is completely symmetric. In other words, the transpose N = X' is the incidence matrix of a k x ℓ balanced incomplete block design for jℓ observations. It is readily checked that the identity

is valid for every b as long as N is a k x b balanced incomplete block design for sample size jb. Therefore we can replace the j-vertex design ξ_j by any design that places uniform weight on the columns of the incidence matrix of an arbitrary balanced incomplete block design for k treatments and blocksize j.
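The identity is readily checked numerically. The sketch below (Python/NumPy; the choice k = 7, j = 3 and the Fano-plane incidence matrix are illustrative assumptions) compares the 35-point 3-vertex design in dimension 7 with the 7-point design supported on the columns of the Fano-plane incidence matrix: both have the same completely symmetric moment matrix.

```python
import numpy as np
from itertools import combinations

k, j = 7, 3
# j-vertex design: uniform weight on all C(k, j) vertices with exactly j entries equal to 1
verts = np.array([[1.0 if i in c else 0.0 for i in range(k)]
                  for c in combinations(range(k), j)])
M_vertex = verts.T @ verts / len(verts)        # diagonal j/k, off-diagonal j(j-1)/(k(k-1))

# Fano-plane incidence matrix: a 7 x 7 balanced incomplete block design with blocksize 3
blocks = [(0,1,2), (0,3,4), (0,5,6), (1,3,5), (1,4,6), (2,3,6), (2,4,5)]
N = np.zeros((k, len(blocks)))
for col, blk in enumerate(blocks):
    N[list(blk), col] = 1

M_bibd = N @ N.T / N.shape[1]                  # uniform weight on the b = 7 columns of N
print(np.allclose(M_vertex, M_bibd))           # True: same moment matrix, only 7 support points
```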


The example conveys some impression of the power of exploiting invariance considerations in a design problem, and of the labor to do so. Another class of models that submit themselves to a reduction by invariance are polynomial fit models, under the generic heading of rotatability.

EXERCISES

14.1 In a quadratic fit model over [-1;1], show that

14.2 Show that the support of τ^R is R(supp τ), for all τ ∈ T and all transformations R of T.

14.3 Assume that the group R acting on T is finite, and induces the transformations Q ∈ Q ⊆ GL(k) such that Qf(t) = f(R(t)) for all t ∈ T. Show that a moment matrix M ∈ M is Q-invariant if and only if there exists an R-invariant design τ ∈ T with M = M(τ) [Gaffke (1987a), p. 946].
14.4 An a x b block design for sample size n may be balanced, binary, and equireplicated, without being equiblocksized [Rao (1958), p. 294].
14.5 An a x b block design for sample size n may be balanced and binary, without being equireplicated nor equiblocksized [John (1964), p. 899; Tyagi (1979), p. 335].
14.6 An a x b block design for sample size n may be balanced and equiblocksized, without being equireplicated nor binary [John (1964), p. 898].
14.7 Consider an m-way classification model with additive main effects and no interaction, treatments i = 1,...,a and blocking factors k = 1,...,m with levels j_k = 1,...,b_k. Show that the moment matrix of a design τ on the experimental domain T = {1,...,a} x {1,...,b_1} x ... x {1,...,b_m} is


say, where W_k and W_{kk̃} are the two-dimensional marginals of τ between treatments and blocking factor k, and blocking factors k and k̃, respectively, with one-dimensional marginals r = W_k 1_{b_k} and s_k = W_k'1_a [Pukelsheim (1986), p. 340].
14.8 (continued) Show that the centered treatment contrast information matrix is C(τ) = Δ_r − W E^− W'.
14.9 (continued) A design τ is called a treatment-factor product design when W_k = r s_k' for all k ≤ m. For such designs τ, show that C(τ) = Δ_r − rr'.
14.10 (continued) A design τ is called a factor-factor product design when W_{kk̃} = s_k s_k̃' for all k ≠ k̃ ≤ m. For such designs τ, show that C(τ) =

CHAPTER 15

Rotatability and Response Surface Designs

In m-way d th-degree polynomial fit models, invariance comes under the heading of rotatability. The object of study is the information surface, that is, the information that a design contains for the estimated response surface. Orthogonal invariance is then called rotatability. The implications on moment matrices and design construction are discussed, in first-degree and second-degree models. For practical purposes, near rotatability is often sufficient and leaves room to accommodate other, operational aspects of empirical model-building. 15.1. RESPONSE SURFACE METHODOLOGY Response surface methodology concentrates on the relationship between the experimental conditions t as input variable, and the expected response E P [Y] as output variable. The precise functional dependence of E/>[y] on t is generally unknown. Thus the issue becomes one of selecting an approximating model, in agreement with the response data that are obtained in an experiment or that are available otherwise. The model must be sufficiently complex to approximate the true, though unknown, relationship, while at the same time being simple enough to be well understood. To strike the right balance between model complexity and inferential simplicity is a delicate decision that usually evolves by repeatedly looping through a learning process:
CONJECTURE → DESIGN → EXPERIMENT → ANALYSIS → CONJECTURE → ...

There is plenty of empirical evidence that a valuable tool in this process is the classical linear model of Section 1.3. In the class of classical linear models, the task is one of finding a regression function / to approximate the expected response, Ep[F] = f(t)'0. For the


most part, we study the repercussions of the choice of / on an experimental design T, in the class T of all designs on the experimental domain T. In Section 15.21 we comment on the ultimate challenge of model-building.
15.2. RESPONSE SURFACES

We assume the classical linear model of Section 1.3,

with regression function / : T > !R* on some experimental domain T. The model response surface is defined to be a function on T,

and depicts the dependence of the expected yield f(t) '0 on the experimental conditions t e T. The parameter vector 0 is unknown, and so is the model response surface. Therefore the object of study becomes the estimated response surface,

based on an experiment with n x k model matrix X and n x 1 observation vector Y. If f(t)'θ is estimable, that is, if f(t) lies in the range of X', then it is estimated by f(t)'θ̂ = f(t)'(X'X)^−X'Y (see Section 3.5). The variance of the estimate is σ²f(t)'(X'X)^−f(t) in case f(t) ∈ range X', and otherwise it is taken to be ∞. The statistical properties of the estimated response surface are determined by the moment matrix M = X'X/n, and are captured by the standardized information surface i_M : T → R which for t ∈ T is given by

and i_M(t) = 0 otherwise. In terms of the information matrices C_K(M) of Section 3.2, we have i_M(t) = C_{f(t)}(M). That is, i_M(t) represents the information that a design with moment matrix M contains for the model response surface f(t)'θ, and emphasis is on the dependence on t. The domain where the experimenter is interested in the behavior of i_M need not actually coincide with the experimental domain T, but could be a subset or a superset of T. Equivalently, we may study the standardized variance surface v_M = 1/i_M. But lack of identifiability of f(t)'θ entails v_M(t) = ∞, which makes the study of v_M less convenient than that of i_M. Anyway, the notion of an information surface is more in line with our information oriented approach.
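A minimal sketch of computing an information surface for a one-factor quadratic fit (Python/NumPy); the design with weights 1/4, 1/2, 1/4 on −1, 0, 1 and the function names are illustrative assumptions.

```python
import numpy as np

f = lambda t: np.array([1.0, t, t * t])                 # quadratic fit, f(t) = (1, t, t^2)'
supp, w = np.array([-1.0, 0.0, 1.0]), np.array([0.25, 0.5, 0.25])
M = sum(wi * np.outer(f(t), f(t)) for t, wi in zip(supp, w))

def i_M(t):
    x = f(t)
    # here M is nonsingular, so f(t) always lies in range(M); pinv covers the general case
    return 1.0 / float(x @ np.linalg.pinv(M) @ x)

for t in (0.0, 0.5, 1.0):
    print(t, round(i_M(t), 4))                          # the standardized information surface
```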


The following lemma states that the moment matrix M is uniquely determined by its information surface IM- In the earlier chapters, we have concentrated on designs G E over the regression range X. Since interest now shifts to designs T e T over the experimental domain T, we change the notation and denote the set of all moment matrices by

This is in line with the notation M_d(T) for dth-degree polynomial fit models in Section 1.28.

15.3. INFORMATION SURFACES AND MOMENT MATRICES

Lemma. Let M, A ∈ M_f(T) be two moment matrices. Then we have i_M = i_A if and only if M = A.
Proof. Only the direct part needs to be proved. The two information surfaces i_M and i_A vanish simultaneously or not. Hence the matrices M and A have the same range, of dimension r, say. We choose a full rank decomposition A = KK', with K ∈ R^{k×r}, and some left inverse L of K. Then M and K have the same range and fulfill KLM = M and MM^−K = K, by Lemma 1.17. The positive definite r x r matrix D = K'M^−K has the same trace as AM^−. With a design τ ∈ T that has moment matrix A, equality of the information surfaces implies

This yields trace D = r. With a design that achieves M, we similarly see that the inverse D~l = LML' satisfies trace D"1 = trace MA" = trace MM~ = r. From

we get D^{−1} = I_r, and conclude that M = KLML'K' = KK' = A.
In other words, there is a unique correspondence between information surfaces i_M and moment matrices M. While it is instructive to visualize an information surface i_M, the technical discussion usually takes recourse to the corresponding moment matrix M. This becomes immediately apparent when studying invariant design problems as introduced in Section 13.6. To this end, let R be a group that acts on the experimental domain T in such a way that the regression function f is equivariant, f(R(t)) = Q_R f(t) for all t ∈ T and R ∈ R, where Q_R is a member of an appropriate matrix group


Q C GL(&). We show that in variance of an information surface iM under ft is the same as invariance of the moment matrix M under Q. By definition, an information surface i\f is called H-invariant when

In the sequel, T is a subset of R^m and R is the group Orth(m) of orthogonal m x m matrices; we then call i_M rotatable. The precise relation between R-invariant information surfaces and Q-invariant moment matrices is the following.

15.4. ROTATABLE INFORMATION SURFACES AND INVARIANT MOMENT MATRICES

Theorem. Assume that R is a group acting on the experimental domain T, and that Q ⊆ GL(k) is a homomorphic image of R such that the regression function f : T → R^k is R-Q-equivariant. Let M ∈ M_f(T) be a moment matrix. Then the information surface i_M is R-invariant if and only if M is Q-invariant.
Proof. Given a transformation R ∈ R, let Q ∈ Q be such that f(R^{−1}(t)) = Q^{−1}f(t) for all t ∈ T. We have f(R^{−1}(t)) ∈ range M if and only if f(t) ∈ range QMQ'. Since Q^{−1}'M^−Q^{−1} is a generalized inverse of QMQ', we get

and i_M(R^{−1}(t)) = i_{QMQ'}(t) for all t ∈ T. In the direct part, we have i_M(t) = i_M(R^{−1}(t)) = i_{QMQ'}(t), and Lemma 15.3 yields M = QMQ'. Varying R through R, we reach every Q in the image group Q. Hence M is Q-invariant. The converse follows similarly.
Thus rotatable information surfaces come with moment matrices that, due to their invariance properties, enjoy a specific structure. Some examples of invariant subspaces of symmetric matrices are given in Section 13.9. Others now emerge when we discuss rotatable designs for polynomial fit models.

15.5. ROTATABILITY IN MULTIWAY POLYNOMIAL FIT MODELS

The m-way dth-degree polynomial fit model, as introduced in Section 1.6, has as experimental conditions an m x 1 vector t = (t_1,...,t_m)', with entry t_i


representing the level of the i th out of m factors. The case of a single factor, m 1, is extensively discussed in Chapter 9. In the sequel, we assume

The regression function /(/) consists of the (d^") distinct monomials of degree 0,..., d in the variables t\,..., tm. For the rotatability discussion, we assume the experimental domain to be a Euclidean ball of radius r > 0, Tr = {t e Rm : \\t\\ < r}. Any such experimental domain Tr is invariant under the orthogonal group 72. = Orth(m), acting by left multiplication t H- Rt as mentioned in Section 13.2. We follow standard customs in preferring the more vivid notion of rotatability over the more systematic terminology of orthogonal invariance. From Section 14.5, the strongest statements emerge if invariance were to pertain to a design T itself. Rotatability of a measure r means that it distributes its mass uniformly over spheres. However, our definition of a design in Section 1.24 stipulates finiteness of the support. Therefore, in our terminology, if m > 2 then no design T on Tr is rotatable. Nevertheless, for a given degree d there are plenty of designs on Tr that have a rotatable information surface. These designs are often called rotatable d th-degree designs. We refrain from this irritating terminology since, as just pointed out, our notion of a design precludes it to be rotatable. Rotatability is a property pertaining to the d th-degree information surface, or moment matrix. In the sequel, we only discuss the special radius r ^lfm,

With this choice the vertices of the symmetrized unit cube, [-l;l]m, come to lie on the boundary sphere of T^. This is the appropriate generalization of the experimental domain [1;1] for the situation of a single factor. For a given degree d > 1, the program thus is the following. We compute a matrix group Qd C GL(/c) under which the regression function / : T/^ > IR* is equivariant, as discussed in Section 13.3. We then strive to single out a finite subset Qd C Qd which determines invariance of symmetric matrices. Of course, the associated set K C Orth(m) is also finite. Invariance under selected transformations in 7 then points to designs T which have a rotatable d th-degree moment matrix. 15.6. ROTATABILITY DETERMINING CLASSES OF TRANSFORMATIONS Part (b) of Lemma 13.10 secures the existence of a finite subset 7 C Orth(m) of orthogonal matrices such that for any moment matrix, invariance relative


to the set Qd which is induced by 7 implies invariance relative to the underlying group Qd. The issue is to find a set 72. which is small, and easy to handle. Our choice is based on the permutation group Perm(m). Beyond this it suffices to include the rotation Rv/4, defined by

which rotates the (t\, ^-plane by 45 and leaves the other coordinates fixed. This plane exists because of our general dimensionality assumption m >2. We show that the (ml + 1)-element set

is a rotatability determining class for both the first-degree model (Lemma 15.8) and for the second-degree model (Lemma 15.15). Invariance relative to 7 entails invariance under all finite products PQR with factors P,Q,/?,... in U. This covers the rotation by 45 of any other (fj,r,)-plane, / ^ j, since this transformation can be written in the form P'Rir^P with P e Perm(m). Furthermore every sign-change matrix can be written as a finite product of permutations and 45 rotations of the coordinate axes. Therefore we would not create any additional invariance conditions by adjoining the sign-change group Sign(m) to Tl. 15.7. FIRST-DEGREE ROTATABILITY In an m-way first-degree model, the regression function is

with 7^ the ball of radius \fm in IRm and k = 1 + m. The moment matrix of a design T e T is denoted by MI(T). For a rotation R e Orth(m), the identity


suggests the definition of the (1 + m) x (1 + m) matrix group Q\,

With this, the regression function f becomes R-Q_1-equivariant. The rotatability determining class R̄ of Section 15.6 then induces the subset

The rotatable symmetric matrices have a simple pattern, as follows.

15.8. ROTATABLE FIRST-DEGREE SYMMETRIC MATRICES

Lemma. For every matrix A ∈ Sym(1+m), the following three statements are equivalent:
a. (Q_1-invariance) A is Q_1-invariant.
b. (Q̄_1-invariance) A is Q̄_1-invariant.
c. (Parametrization) For some α, β ∈ R we have

Proof. Partitioning A to conform with Q ∈ Q_1, the invariance A = QAQ' means

That (c) implies (a) is plainly verified by inserting a = 0 and B = βI_m. It is clear that (a) implies (b) since the latter involves fewer transformations. Now assume (b). First we average over permutations and obtain


for some δ, β, γ ∈ R (see Section 13.10). Then, with vector s = R_{π/4}1_m = (0, √2, 1, ..., 1)', invariance under the rotation R_{π/4} yields

It follows that δ = 0 and γ = 0, whence (c) is established.
Part (c) says that the subspace Sym(1+m, Q_1) of invariant symmetric matrices has dimension 2, whatever the value of m. An orthogonal basis is given by

The projection of A ∈ Sym(1+m) onto Sym(1+m, Q_1) is Ā = ⟨A, V_1⟩V_1 + (⟨A, V_2⟩/m)V_2. We call a first-degree moment matrix rotatable when it is Q_1-invariant. The set of rotatable first-degree moment matrices,

is compact and convex. It turns out to be a line segment in the two-dimensional space Sym(1+m, Q_1).

15.9. ROTATABLE FIRST-DEGREE MOMENT MATRICES

Theorem. Let M be a symmetric (1+m) x (1+m) matrix. Then M is a rotatable first-degree moment matrix on the experimental domain T_√m if and only if for some μ_2 ∈ [0;1], we have

The moment matrix in (1) is attained by a design τ ∈ T if and only if τ has all moments of order 2 equal to μ_2,

for all i ≤ m, while the other moments up to order 2 vanish.


Proof. For the direct part, let T be a design on ^/m] with a rotatable first-degree moment matrix M. Lemma 15.8 entails

Calculating the moments of T, we find

That β is the second moment under τ common to the components t_i is expressed through a change in notation, β = μ_2. Hence M has form (1), that is, the moments of τ fulfill (2). Clearly μ_2 ≥ 0. As an upper bound, we obtain

For the converse, we notice that the rotatable matrix

is achieved by the one-point design in 0, while

is attained by the uniform distribution on the sphere of radius √m, or by the designs to be discussed in Section 15.11. Then every matrix on the line connecting A and B is also a moment matrix, and is rotatable.
A design τ with a rotatable first-degree moment matrix induces the rotatable first-degree information surface i_τ(t) = 1/(1 + t't/μ_2) for all t ∈ R^m, provided the common second moment μ_2 of τ is positive. If by choice of τ we enlarge μ_2, then the surface i_τ is uniformly raised. This type of improvement leads to Kiefer optimality relative to the group Q_1, as discussed in Theorem 14.6.

15.10. KIEFER OPTIMAL FIRST-DEGREE MOMENT MATRICES

Corollary. The unique Kiefer optimal moment matrix for θ in M_1(T) is


with associated information surface

Proof. Among all Q\ -invariant moment matrices, M is Loewner optimal, by Theorem 15.9. Theorem 14.6 then yields Kiefer optimality of M in A/i(T). How do we find designs that have a Kiefer optimal moment matrix? Consider a design r which assigns uniform weight \/i to (. vectors f, e Um of length \/m, with model matrix

If the moment matrix of r is Kiefer optimal,

then X has orthogonal columns of squared lengths ℓ. Such designs are called orthogonal because of the orthogonality structure of X. A model matrix X with X'X = ℓI_{1+m} is achieved by certain two-level factorial designs.

15.11. TWO-LEVEL FACTORIAL DESIGNS

One way to construct a Kiefer optimal first-degree design on the experimental domain T_√m is to vary each of the m factors on the two levels ±1 only. Hence the vectors of experimental conditions are t_i ∈ {±1}^m, the 2^m vertices of the symmetrized cube [−1;1]^m included in the rotatable experimental domain T_√m. The design that assigns uniform weight 1/ℓ to each of the ℓ = 2^m vertices of [−1;1]^m is called the complete factorial design 2^m. It has a model matrix X ∈ R^{ℓ×(1+m)} satisfying X'X = ℓI_{1+m}. For instance, in the two-way or three-way first-degree model, X is given by


up to permutation of rows. The support size 2^m of the complete factorial design quickly outgrows the quadratic bound k(k+1)/2 = (m+1)(m+2)/2 of Corollary 8.3. The support size is somewhat less excessive for a 2^{m−p} fractional factorial design which, by definition, comprises a 2^{−p} fraction of the complete factorial design 2^m in such a way that the associated model matrix X has orthogonal columns. For instance, the following 4 x 4 model matrix X belongs to a one-half fraction of the complete factorial design 2^3,

It satisfies X'X = 4I_4 and hence is Kiefer optimal, with only 4 = (1/2)2^3 runs. There are also optimal designs with a minimum support size 1+m = k, conveniently characterized by their geometric shape as a regular simplex.

15.12. REGULAR SIMPLEX DESIGNS

Since in an m-way first-degree model the mean parameter vector θ = (θ_0, θ_1,...,θ_m)' has 1+m components, the smallest possible support size is 1+m if a design is to be feasible for θ. Indeed, there exist Kiefer optimal designs with this minimal support size. To prove this for any number m ≥ 2 of factors, we need to go beyond two-level designs and permit support points t_i ∈ T_√m other than the vertices {±1}^m. For ℓ = 1+m runs the model matrix X is square. Hence X'X = (m+1)I_{m+1} is the same as XX' = (m+1)I_{m+1}. Thus the vectors t_i in the rows of X fulfill, for all i ≠ j ≤ m+1,

In other words, the convex hull of the vectors t_1,...,t_{m+1} in R^m is a polytope which has edges t_i − t_j of common squared length 2(m+1). Such a convex body is called a regular simplex. A design that assigns uniform weight 1/(m+1) to the vertices t_1,...,t_{m+1} of a regular simplex in R^m is called a regular simplex design. For two factors, the three support points span an equilateral triangle in R². For three factors, the four support points generate an equilateral tetrahedron in R³, and so on. The diameter of the simplex is such that the vertices t_i belong to the boundary sphere of the ball T_√m which figures as experimental domain. If a regular simplex design can be realized using the two levels ±1 only, t_i ∈ {±1}^m, then the model matrix X has entries ±1 besides satisfying X'X = (m+1)I_{m+1}. Such matrices X are known as Hadamard matrices.
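The constructions of this and the previous section are simple to verify numerically. The sketch below (Python/NumPy; the defining relation t3 = t1·t2 for the half fraction and the particular 4 x 4 Hadamard matrix are illustrative choices) checks the orthogonality X'X = ℓI and the regular-simplex geometry for m = 3.

```python
import numpy as np
from itertools import product

m = 3
# Complete factorial 2^m: all sign vectors, with a leading column of ones
T = np.array(list(product([-1.0, 1.0], repeat=m)))
X_full = np.hstack([np.ones((2**m, 1)), T])
print(np.allclose(X_full.T @ X_full, 2**m * np.eye(1 + m)))       # True

# One-half fraction of 2^3 via the defining relation t3 = t1 * t2
T_half = np.array([[t1, t2, t1 * t2] for t1, t2 in product([-1.0, 1.0], repeat=2)])
X_half = np.hstack([np.ones((4, 1)), T_half])
print(np.allclose(X_half.T @ X_half, 4 * np.eye(4)))              # True: Kiefer optimal with 4 runs

# Two-level regular simplex for m = 3 from a 4 x 4 Hadamard matrix
H = np.array([[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]], dtype=float)
print(np.allclose(H.T @ H, 4 * np.eye(4)))                        # True: Hadamard
t = H[:, 1:]                                                      # the m+1 = 4 support points
print({round(float(np.sum((t[i] - t[j])**2)), 6)
       for i in range(4) for j in range(4) if i < j})             # {8.0} = 2(m+1): regular simplex
```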


We may summarize as follows, for an m-way first-degree model on the experimental domain T_√m = {t ∈ R^m : ||t|| ≤ √m}. A regular simplex design and any rotation thereof is Kiefer optimal for θ in T; it has smallest possible support size m+1 and always exists. A two-level regular simplex design exists if and only if there is a Hadamard matrix of order m+1. The complete factorial design 2^m is a two-level Kiefer optimal design for θ in T with a large support size 2^m. Any 2^{m−p} fractional factorial design reduces the support size while maintaining optimality.
These results rely on the moment conditions (2) in Theorem 15.9. For second-degree models, moments up to order 4 are needed. An efficient bookkeeping of higher order moments relies on Kronecker products and vectorization of matrices.

15.13. KRONECKER PRODUCTS AND VECTORIZATION OPERATOR

The Kronecker product of two matrices A ∈ R^{k×m} and B ∈ R^{ℓ×n} is defined as the kℓ x mn block matrix

For two vectors s Rm and t e R", this simplifies to the mn x 1 block vector

The key property of the Kronecker product is how it conforms with matrix multiplication,

This is easily seen as follows. For Euclidean unit vectors et e Um and dj e IR", the identity (A <8> #)(e, <g> dj) = (Aei) <8> (Bdj) is verified from the definition. For arbitrary vectors s e Um and t R", bilinearity of the Kronecker product entails (1),


Formula (1) extends to matrices C e Rmxp and D R"x<?,
    (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).    (2)
Again this is first verified for Euclidean unit matrices E_ij = e_i d_j' ∈ R^{m×p} and their counterparts in R^{n×q}, by referring to (1) three times,

Then an appeal to bilinearity justifies the extension to arbitrary C = Σ_{i,j} c_ij E_ij and D = Σ_{k,l} d_kl E_kl.
As a consequence of (2), a generalized inverse, the Moore-Penrose inverse, the inverse, and the transpose of a Kronecker product is equal to the Kronecker product of the generalized inverses, Moore-Penrose inverses, inverses, and transposes, respectively. The grand vector s ⊗ t assembles the same cross products s_i t_j that appear in the rank 1 matrix st'. For an easy transition between the two arrangements, we define
    vec(st') = s ⊗ t.    (3)
In other words, the matrix st' is converted into a column vector by a concatenation of rows, (s_1 t', ..., s_m t'), followed by a transposition. This is extended to matrices A = Σ_{i,j} a_ij e_i d_j' by linearity, vec A = Σ_{i,j} a_ij vec(e_i d_j') = Σ_{i,j} a_ij (e_i ⊗ d_j). The construction results in a linear mapping called the vectorization operator on R^{m×n},
    vec : R^{m×n} → R^{mn},
which maps any rectangular m × n matrix A into a column vector vec A based on the lexicographic order of the subscripts, vec A = (a_11, ..., a_1n, a_21, ..., a_2n, ..., a_m1, ..., a_mn)'.

The matrix version of formula (3) is
    vec(ABC) = (A ⊗ C') vec B,    (4)
provided dimensions match appropriately. This follows from vec(A e_i d_j' C) = (Ae_i) ⊗ (C'd_j) = (A ⊗ C')(e_i ⊗ d_j) = (A ⊗ C') vec(e_i d_j'), and linearity. The
vectorization operator is also scalar product preserving. That is, for all A, B ∈ R^{m×n}, we have
    trace A'B = (vec A)'(vec B).
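The following is a minimal numerical sketch (not from the text; it assumes NumPy, and the helper vec and the random test matrices are ours) checking formulas (1) through (4) and the scalar product identity. Note that the row-stacking vec used above coincides with NumPy's C-order ravel, and np.kron is the Kronecker product.

```python
import numpy as np
rng = np.random.default_rng(0)

vec = lambda M: np.asarray(M).ravel()          # row-stacking vec, as defined above (C order)

A, B = rng.normal(size=(3, 4)), rng.normal(size=(2, 5))
C, D = rng.normal(size=(4, 3)), rng.normal(size=(5, 2))

# (1)/(2): the Kronecker product conforms with matrix multiplication
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))

# (3): vec(st') = s (x) t
s, t = rng.normal(size=4), rng.normal(size=5)
print(np.allclose(vec(np.outer(s, t)), np.kron(s, t)))

# (4): vec(ABC) = (A (x) C') vec B
P, Q, R = rng.normal(size=(3, 4)), rng.normal(size=(4, 5)), rng.normal(size=(5, 2))
print(np.allclose(vec(P @ Q @ R), np.kron(P, R.T) @ vec(Q)))

# Scalar product preservation: trace A'B = (vec A)'(vec B)
A2, B2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(np.allclose(np.trace(A2.T @ B2), vec(A2) @ vec(B2)))
```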
In second-degree moment matrices, the representation of moments of order 4 is based on the m² × m² matrices
    I_m ⊗ I_m = Σ_{i,j} E_ii ⊗ E_jj,    (5)
    I_{m,m} = Σ_{i,j} E_ij ⊗ E_ji,    (6)
    (vec I_m)(vec I_m)' = Σ_{i,j} E_ij ⊗ E_ij,    (7)
where E_ij = e_i e_j' are the Euclidean unit matrices in R^{m×m}. Identity (5) is immediate from I_m = Σ_i E_ii and bilinearity of the Kronecker product. Identity (7) additionally uses (vec E_ii)(vec E_jj)' = (e_i ⊗ e_i)(e_j ⊗ e_j)' = E_ij ⊗ E_ij. Identity (6) serves as the definition of the matrix I_{m,m}, called the vec-permutation matrix. The reason is that I_{m,m} is understood best through its operation on vectorized matrices,
    I_{m,m} vec A = vec A'  for all A ∈ R^{m×m}.    (8)
This follows from I_{m,m}(vec E_ki) = vec E_ik, extending to (8) by linearity. This provides the technical basis to discuss rotatability in second-degree models.

15.14. SECOND-DEGREE ROTATABILITY

In an m-way second-degree model, m ≥ 2, we take the regression function to be
    f(t) = (1, t', (t ⊗ t)')',
with T_√m the ball of radius √m in R^m and k = 1 + m + m². The moment matrix of a design τ ∈ T is denoted by M_2(τ). The m² × 1 bottom portion t ⊗ t represents the mixed products for i ≠ j twice, as t_i t_j and as t_j t_i. Thus the representation of second-degree terms in f(t)
is redundant. Yet the powerful properties of the Kronecker product make f superior to any other form of parametrizing the second-degree model. To appreciate the issue, we refer to the vectorization operator, t ⊗ t = vec(tt'), and reiterate our point in terms of the rank 1 matrix tt'. This matrix is symmetric whence, of the m² entries, only ½m(m + 1) are functionally independent. Nevertheless the powerful rules of matrix algebra make the arrangement as an m × m matrix superior to any other form of handling the ½m(m + 1) different components. The very same point is familiar from treating dispersion matrices, moment matrices, and information matrices as matrices, and not as arrays of a minimal number of functionally independent terms. Because of the redundancy in t ⊗ t, the function f satisfies the side conditions
    (0, 0', (e_i ⊗ e_j − e_j ⊗ e_i)') f(t) = 0  for all t ∈ T_√m,
for the ½m(m − 1) choices of distinct subscripts i and j. Therefore any moment matrix M_2(τ) = ∫ f(t)f(t)' dτ has nullity at least equal to ½m(m − 1), or equivalently,
    rank M_2(τ) ≤ ½(m + 1)(m + 2)
for all τ ∈ T. Rank deficiency of moment matrices poses no obstacle since our development does not presuppose nonsingularity. The introduction of the generalized information matrices in Section 3.21 was guided by a similar motivation. A rotation R ∈ Orth(m) leaves the experimental domain T_√m invariant, and commutes with the regression function f according to
    f(Rt) = (1, (Rt)', ((R ⊗ R)(t ⊗ t))')' = block-diag(1, R, R ⊗ R) f(t).
Therefore f is Orth(m)-Q_2-equivariant relative to the (1 + m + m²) × (1 + m + m²) matrix group
    Q_2 = { block-diag(1, R, R ⊗ R) : R ∈ Orth(m) }.
The rotatability determining class R of Section 15.6 induces the subset
    Q̃_2 = { block-diag(1, R, R ⊗ R) : R ∈ Perm(m) ∪ {R_{π/4}} } ⊆ Q_2.
Rotatable symmetric matrices now achieve a slightly more sophisticated pattern than those of Section 15.8.

15.15. ROTATABLE SECOND-DEGREE SYMMETRIC MATRICES

Lemma. For every matrix A ∈ Sym(1 + m + m²), the following three statements are equivalent:
a. (Q_2-invariance) A is Q_2-invariant.
b. (Q̃_2-invariance) A is Q̃_2-invariant.
c. (Parametrization) For some α, β, γ, δ_1, δ_2, δ_3 ∈ R we have
    A = [ α            0'        γ (vec I_m)'
          0            β I_m     0
          γ vec I_m    0         F(δ_1, δ_2, δ_3) ],
where F(δ_1, δ_2, δ_3) = δ_1 I_m ⊗ I_m + δ_2 I_{m,m} + δ_3 (vec I_m)(vec I_m)'.
Proof. We partition A to conform with Q ∈ Q_2. Invariance then means
    Ra = a,   (R ⊗ R)b = b,   RBR' = B,   (R ⊗ R)CR' = C,   (R ⊗ R)D(R' ⊗ R') = D,
    where A is partitioned into the top left scalar α, the blocks a ∈ R^m and b ∈ R^{m²}, B ∈ Sym(m), C ∈ R^{m²×m}, and D ∈ Sym(m²),
for all R ∈ Orth(m). That (c) implies (a) follows from R(βI_m)R' = βI_m, and
    (R ⊗ R)(γ vec I_m) = γ vec(R I_m R') = γ vec I_m,
    (R ⊗ R)(I_m ⊗ I_m)(R' ⊗ R') = (RR') ⊗ (RR') = I_m ⊗ I_m,
    (R ⊗ R)(vec I_m)(vec I_m)'(R' ⊗ R') = (vec RR')(vec RR')' = (vec I_m)(vec I_m)',
    (R ⊗ R) I_{m,m} (R' ⊗ R') = I_{m,m}.
The last line claims that the left hand side is the vec-permutation matrix of Section 15.13. It suffices to verify (8) of that section for all A ∈ R^{m×m}:
    (R ⊗ R) I_{m,m} (R' ⊗ R') vec A = (R ⊗ R) I_{m,m} vec(R'AR) = (R ⊗ R) vec(R'A'R) = vec(RR'A'RR') = vec A'.
Hence F(δ_1, δ_2, δ_3) is invariant under R ⊗ R, and A in part (c) is Q_2-invariant. Clearly part (a) implies (b) since the latter comprises fewer transformations. It remains to establish the implication from part (b) to (c). We recall from Section 15.6 that invariance relative to Perm(m) ∪ {R_{π/4}} implies invariance also relative to the sign-change group Sign(m). In a sequence of steps, we use sign-changes, permutations, and the 45° rotation to disclose the pattern of A as asserted in part (c).
I. A feasible transformation is the reflection R = −I_m ∈ Sign(m). With this, we obtain a = −a and C = −C, whence a = 0 and C = 0.
II. From B = RBR' for R ∈ Perm(m) ∪ {R_{π/4}}, we infer B = βI_m for some β ∈ R, as in the proof of Lemma 15.8. The vector b has m² entries and hence may be written as b = vec E for some square matrix E ∈ R^{m×m}. Then b = (R ⊗ R)b translates into E = RER'. Permutational invariance of E implies complete symmetry even though E may not be symmetric (compare Section 13.9). Invariance under R_{π/4} then necessitates E = γI_m. This yields b = γ vec I_m.
III. Now we investigate invariance of the bottom right block, D = (R ⊗ R)D(R' ⊗ R'). It is convenient to display the entries of D according to d_{ij,kl} = (e_i ⊗ e_j)'D(e_k ⊗ e_l).
a. For sign-changes R = diag(ε_1, ..., ε_m) with ε_i ∈ {−1, +1} we get d_{ij,kl} = ε_i ε_j ε_k ε_l d_{ij,kl} for all i, j, k, l ≤ m. Hence d_{ij,kl} vanishes provided four of the subscripts i, j, k, l are distinct, or three are distinct, or two are distinct with multiplicities 1 and 3. This leaves two identical subscripts of multiplicity 2 each, or four identical subscripts,

with 3m(m − 1) + m = 3m² − 2m coefficients to be investigated.
b. A permutation matrix R_σ = Σ_i e_{σ(i)} e_i' ∈ Perm(m) yields d_{ij,kl} = d_{σ(i)σ(j),σ(k)σ(l)}. Whatever the value of m, this reduces the number of coefficients to four, δ_1, δ_2, δ_3, δ_4 ∈ R, say, with
    d_{ij,ij} = δ_1,   d_{ij,ji} = δ_2,   d_{ii,jj} = δ_3,   d_{ii,ii} = δ_4,
for all i ≠ j ≤ m. Hence D attains the form
    D = δ_1 I_m ⊗ I_m + δ_2 I_{m,m} + δ_3 (vec I_m)(vec I_m)' + (δ_4 − δ_1 − δ_2 − δ_3) Σ_i E_ii ⊗ E_ii.
This is part (c) provided we show that the last term vanishes.
c. This is achieved by invoking the 45° rotation R_{π/4} from Section 15.6, giving

Thus we obtain

and the theorem is proved.
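As a quick numerical illustration (not from the text; a NumPy sketch with helper names of our own, and it assumes the block pattern reconstructed in part (c) above), one may generate a matrix with that pattern and verify its invariance under an arbitrary rotation:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3
k = 1 + m + m * m

E = lambda i, j: np.outer(np.eye(m)[i], np.eye(m)[j])                       # Euclidean unit matrices E_ij
I_mm = sum(np.kron(E(i, j), E(j, i)) for i in range(m) for j in range(m))   # vec-permutation matrix, identity (6)
vecI = np.eye(m).ravel()                                                    # vec I_m (row-stacking vec)

def F(d1, d2, d3):
    return d1 * np.eye(m * m) + d2 * I_mm + d3 * np.outer(vecI, vecI)

def pattern(alpha, beta, gamma, d1, d2, d3):
    A = np.zeros((k, k))
    A[0, 0] = alpha
    A[1:1 + m, 1:1 + m] = beta * np.eye(m)
    A[0, 1 + m:] = A[1 + m:, 0] = gamma * vecI
    A[1 + m:, 1 + m:] = F(d1, d2, d3)
    return A

A = pattern(*rng.normal(size=6))

R, _ = np.linalg.qr(rng.normal(size=(m, m)))        # a random orthogonal matrix
Q = np.zeros((k, k))
Q[0, 0] = 1.0
Q[1:1 + m, 1:1 + m] = R
Q[1 + m:, 1 + m:] = np.kron(R, R)

print(np.allclose(Q @ A @ Q.T, A))                  # True: the pattern is invariant under every rotation
```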

For second-degree rotatability, the subspace Sym(1 + m + m², Q_2) of invariant symmetric matrices has dimension 6, whatever the value of m. We call a second-degree moment matrix M_2(τ) rotatable when it is Q_2-invariant. The set of rotatable second-degree moment matrices,

is parametrized by the moment of order 2, μ_2(τ) = ∫ (e_i't)² dτ, and the moment of order 2-2, μ_22(τ) = ∫ (e_i't)² (t'e_j)² dτ for i ≠ j, as follows.

15.16. ROTATABLE SECOND-DEGREE MOMENT MATRICES

Theorem. Let M be a symmetric (1 + m + m²) × (1 + m + m²) matrix. Then M is a rotatable second-degree moment matrix on the experimental domain T_√m if and only if for some
    0 < μ_2 ≤ 1   and   (m/(m + 2)) μ_2² ≤ μ_22 ≤ (m/(m + 2)) μ_2
we have
    M = [ 1              0'          μ_2 (vec I_m)'
          0              μ_2 I_m     0
          μ_2 vec I_m    0           μ_22 V ],    (1)
where
    V = I_m ⊗ I_m + I_{m,m} + (vec I_m)(vec I_m)' = F(1, 1, 1).
The moment matrix in (1) is attained by a design τ ∈ T if and only if τ has all moments of order 2 equal to μ_2, all moments of order 2-2 equal to μ_22, and all moments of order 4 equal to 3μ_22,
    ∫ t_i² dτ = μ_2,    ∫ t_i² t_j² dτ = μ_22,    ∫ t_i⁴ dτ = 3 μ_22,    (2)
for all i ≠ j ≤ m, while the other moments up to order 4 vanish.
Proof. For the direct part, let τ be a design on T_√m = {t ∈ R^m : ||t|| ≤ √m} with a rotatable second-degree moment matrix M. Then M is of the form given in Lemma 15.15. Calculating the moments of τ, we find α = 1 and
    β = γ = μ_2,    δ_1 = δ_2 = δ_3 = μ_22.
Hence M has form (1), that is, the moments of τ fulfill (2). The bounds on μ_2 are copied from Theorem 15.9. For μ_22, the upper bound
    μ_22 ≤ (m/(m + 2)) μ_2
follows from fixing j = m and summing over i ≠ m,
    (m − 1) μ_22 = Σ_{i ≠ m} ∫ t_i² t_m² dτ ≤ ∫ (m − t_m²) t_m² dτ = m μ_2 − 3 μ_22.
The lower bound,
    μ_22 ≥ (m/(m + 2)) μ_2²,
is obtained from the variance of t't = (vec I_m)'(t ⊗ t) under τ,
    0 ≤ ∫ (t't)² dτ − ( ∫ t't dτ )² = m(m + 2) μ_22 − (m μ_2)².
For the converse, we need to construct a design τ which for given parameters μ_2 and μ_22 has M of (1) for its moment matrix. We utilize the uniform distribution τ_r on the sphere of radius r, {t ∈ R^m : ||t|| = r}; the central composite designs in Section 15.18 have the same moments up to order 4 and could also be used. Clearly the measure τ_r is rotatable, with moments
    ∫ t_i² dτ_r = r²/m,    ∫ t_i² t_j² dτ_r = r⁴/(m(m + 2))  for i ≠ j,    ∫ t_i⁴ dτ_r = 3 r⁴/(m(m + 2)),
that is, μ_22(τ_r) = r⁴/(m(m + 2)). Now let the numbers
    0 < μ_2 ≤ 1   and   (m/(m + 2)) μ_2² ≤ μ_22 ≤ (m/(m + 2)) μ_2
be given. We define
    r² = (m + 2) μ_22 / μ_2 ≤ m   and   α = m μ_2² / ((m + 2) μ_22) ≤ 1.
The measure (1 − α)τ_0 + ατ_r then places mass α on the sphere of radius r and puts the remaining mass 1 − α into the point 0. It attains the given moments,
    μ_2((1 − α)τ_0 + ατ_r) = α r²/m = μ_2,    μ_22((1 − α)τ_0 + ατ_r) = α r⁴/(m(m + 2)) = μ_22.
Thus there exists a design of which the matrix M in (1) is the moment matrix.
Rotatable second-degree information surfaces take an explicit form.

15.17. ROTATABLE SECOND-DEGREE INFORMATION SURFACES

Corollary. If the moments μ_2 and μ_22 satisfy

then the moment matrix M in (1) of Theorem 15.16 has maximal rank ½(m + 1)(m + 2), and induces the rotatable information surface, which for t ∈ R^m is given by

Proof. The rotatable moment matrix M in (1) of Theorem 15.16 has positive eigenvalues μ_2 and 2μ_22, with associated projectors

where G_m is the projector appearing in P_2. Hence μ_2 has multiplicity m, while 2μ_22 has multiplicity trace G_m = ½m(m + 1) − 1. This accounts for all but two degrees of freedom, leaving M − μ_2 P_1 − 2μ_22 P_2 = STS' with

The nonvanishing eigenvalues of STS' are the same as those of the 2 x 2 matrix S'ST. The latter has determinant

say. By our moment assumption, d is positive. Hence the rank of M is maximal. The Moore-Penrose inverse then is

From part (c) of Lemma 13.10, this matrix is necessarily invariant. Indeed, a parametric representation as in Lemma 15.15 holds, with

Finally, straightforward evaluation of f(t)'M+f(t)

yields the asserted information surface.

15.18. CENTRAL COMPOSITE DESIGNS

In the proof of Theorem 15.16, we established the existence of a design with prescribed moments μ_2 and μ_22 by taking recourse to the uniform distribution τ_r on the sphere of radius r. For lack of a finite support, this is not a design in the terminology of Section 1.24. Designs with the same lower order moments as the uniform distribution are those that place equal weight on the vertices of a regular polyhedron. In the first-degree model of Section 15.12, moments up to order 2 matter and call for regular simplices. For a second-degree model, moments up to order 4 must be matched. With m = 3 factors, at least ½(m + 1)(m + 2) = 10 support points are needed to achieve a maximal rank. The twelve vertices of an icosahedron can be used, or the twenty vertices of a dodecahedron. The class of central composite designs serves the same purpose, as well as being quite versatile in many other respects. These designs are mixtures of three building blocks: cubes, stars, and center points. The cube portion τ_c is a 2^{m−p} fractional factorial design. If it is replicated n_c times, then n_c τ_c is a design for sample size 2^{m−p} n_c. The star portion τ_s takes one observation at each of the vectors ±r e_i for i ≤ m, for some star radius r > 0. With n_s replications, n_s τ_s is a design for sample size 2m n_s. The center point portion τ_0 is the one-point design in 0; it is replicated n_0 times. The design n_c τ_c + n_s τ_s + n_0 τ_0 is then a central composite design for sample size n = 2^{m−p} n_c + 2m n_s + n_0, with star radius r > 0. The only nonvanishing moments of the standardized design τ = (n_c τ_c + n_s τ_s + n_0 τ_0)/n ∈ T are
    μ_2(τ) = (2^{m−p} n_c + 2 n_s r²)/n,    μ_22(τ) = 2^{m−p} n_c/n,    μ_4(τ) = (2^{m−p} n_c + 2 n_s r⁴)/n.
Rotatability imposes the condition μ_4(τ) = 3μ_22(τ), that is, r⁴ = 2^{m−p} n_c/n_s. Specifically, we choose n_c = m², n_s = 2^{m−p}, and n_0 = 0 to obtain a central composite design with no center points, for sample size n = 2^{m−p} m(m + 2). The rotatability condition r⁴ = 2^{m−p} n_c/n_s forces r = √m. Hence the star points ±√m e_i lie on the sphere of radius √m, as do the cube points {±1}^m. The resulting design is
    τ̄_√m = (m/(m + 2)) τ_c + (2/(m + 2)) τ_s,  with star radius r = √m.
Up to order 4, it has the same moments as has the uniform distribution τ_√m,
    μ_2(τ̄_√m) = 1,    μ_22(τ̄_√m) = m/(m + 2),    μ_4(τ̄_√m) = 3m/(m + 2).
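A small numerical check (not from the text; a NumPy sketch using the mixture weights just described, for m = 3 and p = 0) confirms these moments:

```python
import numpy as np
from itertools import product

m = 3
cube = np.array(list(product([1.0, -1.0], repeat=m)))                       # 2^m cube points
star = np.vstack([np.sqrt(m) * np.eye(m), -np.sqrt(m) * np.eye(m)])         # 2m star points, radius sqrt(m)
points = np.vstack([cube, star])
weights = np.concatenate([np.full(len(cube), (m / (m + 2)) / len(cube)),
                          np.full(len(star), (2 / (m + 2)) / len(star))])

mu2 = weights @ points[:, 0] ** 2                                           # moment of order 2
mu22 = weights @ (points[:, 0] ** 2 * points[:, 1] ** 2)                    # moment of order 2-2
mu4 = weights @ points[:, 0] ** 4                                           # moment of order 4
print(mu2, mu22, mu4)    # 1.0, m/(m+2) = 0.6, 3m/(m+2) = 1.8
```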
The corresponding design with cube and star points scaled to lie on the sphere of radius r > 0 is denoted by τ̄_r. Its moments up to order 4 match
those of the uniform distribution τ_r. In retrospect, we can achieve any given moments μ_2 and μ_22 in Theorem 15.16 by the central composite design ατ̄_r + (1 − α)τ_0, with
    r² = (m + 2) μ_22 / μ_2   and   α = m μ_2² / ((m + 2) μ_22).
This design has a finite support size, at most equal to 2^{m−p} + 2m + 1, and is a legitimate member of the design set T. Of the two remaining parameters μ_2 and μ_22, we now eliminate μ_22 by a Loewner improvement, and calculate a threshold for μ_2 by studying the eigenvalues of M. This leads to admissibility results and complete classes in second-degree models.

15.19. SECOND-DEGREE COMPLETE CLASSES OF DESIGNS

Theorem. For α ∈ [0; 1], let τ_α be the central composite design which places mass α on the cube-plus-star design τ̄_√m from the previous section while putting weight 1 − α into 0.
a. (Kiefer completeness) For every design τ ∈ T, there is some α ∈ [0; 1] such that the central composite design τ_α improves upon τ in the Kiefer ordering, M_2(τ_α) ≫ M_2(τ), relative to the group Q_2 of Section 15.14.
b. (Q_2-invariant φ) Let φ be a Q_2-invariant information function on NND(1 + m + m²). Then for some α ∈ [0; 1], the central composite design τ_α is φ-optimal for θ in T.
c. (Orthogonally invariant φ) Let φ be an orthogonally invariant information function on NND(1 + m + m²). Then for some α ∈ [2/(m + 4); 1], the central composite design τ_α is φ-optimal for θ in T.
Proof. In part (a), we use for Sym(1 + m + m², Q_2) the orthogonal basis

where F is as in Lemma 15.15. For τ ∈ T, we calculate ⟨M_2(τ), V_2⟩/⟨V_2, V_2⟩ = μ_2(τ), and ⟨M_2(τ), V_4⟩/⟨V_4, V_4⟩ = μ_22(τ). Hence the projection of M_2(τ) on Sym(1 + m + m², Q_2) is determined by the moments of τ,

Since this is a rotatable second-degree moment matrix, the coefficients fulfill μ_22(τ) ≤ (m/(m + 2)) μ_2(τ), by Theorem 15.16. Because of nonnegative definiteness of V_4, this permits the estimate

With α = μ_2(τ), the latter coincides with the moment matrix of the central composite design τ_α = (1 − α)τ_0 + ατ̄_√m, as introduced in the preceding section. In summary we have, with α = μ_2(τ),

Therefore τ_α is an improvement over τ in the Kiefer ordering, M_2(τ_α) ≫ M_2(τ). For part (b), we use the monotonicity of Q_2-invariant information functions φ from Theorem 14.3 to obtain max_{τ ∈ T} φ(M_2(τ)) ≤ max_{α ∈ [0;1]} φ(M_2(τ_α)). In part (c), orthogonal invariance implies that φ depends on M_2(τ_α) only through the eigenvalues,

The respective multiplicities are 1, ½m(m + 1) − 1, m, and 1. All eigenvalues are increasing for α ∈ [0; 2/(m + 4)]. Hence for α < 2/(m + 4) the eigenvalue vector admits a componentwise improvement, λ(α) ≤ λ(2/(m + 4)). Thus we get φ(M_2(τ_α)) ≤ φ(M_2(τ_{2/(m+4)})) for all α ∈ [0; 2/(m + 4)]. Hence there exists a φ-optimal design τ_α with α ∈ [2/(m + 4); 1].
The theorem generalizes the special case of a parabola fit model with a single factor of Section 13.1, to an arbitrary number of factors m ≥ 2. For the trace criterion φ_1, again the largest value of the parameter is optimal, α = 1, at the expense of giving away feasibility for θ, rank M_2(τ_1) = ½(m + 1)(m + 2) − 1. However, if m ≥ 2, then for the rank deficient smallest-eigenvalue criterion φ_{−∞}' it is not the smallest value α = 2/(m + 4) which is optimal, but α = (m − 1)/(2m − 1). This is a consequence of the ordering of the eigenvalues, in that the smallest positive eigenvalue is λ_3(α) for α ∈ [0; (m − 1)/(2m − 1)],

EXHIBIT 15.1  Eigenvalues of moment matrices of central composite designs. For m = 4 factors, the positive eigenvalues λ_i(α) of the moment matrices of the central composite designs τ_α = (1 − α)τ_0 + ατ̄_2 are increasing on [0; 1/4], for i = 1, 2, 3, 4. The minimum of the eigenvalues is maximized at α = 3/7.

and λ_4(α) for α ∈ [(m − 1)/(2m − 1); 1] (see Exhibit 15.1). As m tends to ∞ we get 0 ← 2/(m + 4) < (m − 1)/(2m − 1) → 1/2. The bottom line is that rotatability generates a complete class of designs with a single parameter α no matter how many factors m are being investigated.
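Exhibit 15.1 can be reproduced numerically. The following sketch (not from the text; it assumes NumPy, with helper names of our own) builds M_2(τ_α) for m = 4 directly from the support points and weights, and locates the maximin of the smallest positive eigenvalue near α = 3/7:

```python
import numpy as np
from itertools import product

m = 4
f = lambda t: np.concatenate([[1.0], t, np.kron(t, t)])     # second-degree Kronecker regression function

def moment_matrix(alpha):
    """M_2 of tau_alpha = (1-alpha)*tau_0 + alpha*(cube-plus-star design with radius 2), m = 4."""
    cube = np.array(list(product([1.0, -1.0], repeat=m)))
    star = np.vstack([2.0 * np.eye(m), -2.0 * np.eye(m)])
    pts = np.vstack([cube, star, np.zeros((1, m))])
    w = np.concatenate([np.full(len(cube), alpha * (m / (m + 2)) / len(cube)),
                        np.full(len(star), alpha * (2 / (m + 2)) / len(star)),
                        [1.0 - alpha]])
    return sum(wi * np.outer(f(x), f(x)) for wi, x in zip(w, pts))

def smallest_positive_eigenvalue(alpha, tol=1e-9):
    eig = np.linalg.eigvalsh(moment_matrix(alpha))
    return eig[eig > tol].min()

alphas = np.linspace(0.01, 0.99, 197)
best = max(alphas, key=smallest_positive_eigenvalue)
print(best, (m - 1) / (2 * m - 1))   # the maximin lies close to 3/7, as in Exhibit 15.1
```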
15.20. MEASURES OF ROTATABILITY

In Section 5.15, we made a point that design optimality opens the way to the practically more relevant class of efficient designs, substituting optimality by near-optimality. In the same vein, we may replace rotatability by near-rotatability. In fact, given an arbitrary moment matrix M, the question is how it relates to M̂, the orthogonal projection onto the subspace Sym(1 + m + m², Q_2) of rotatable matrices. However, an orthogonal projection is nothing but a least-squares fit for regressing M on the matrices V_0, V_2, and V_4 which span Sym(1 + m + m², Q_2). Hence relative to the squared matrix norm ||A||² = trace A'A the following statements are equivalent: M is second-degree rotatable,

Deviations from rotatability can thus be measured by the distance between M and M̂, ||M − M̂||² ≥ 0. However, it is difficult to assess the numerical value of this measure in order to decide which numbers are too big to be acceptable. Instead we recommend the R²-type measure ||M̂ − V_0||²/||M − V_0||², which lies between 0 and 1 besides being a familiar diagnostic number in linear model inference.

15.21. EMPIRICAL MODEL-BUILDING

Empirical model-building deals with the issue of finding a design that is able to resolve the fine structure in a model sequence of growing sophistication. The option to pick from a class of reasonable, efficient, near-rotatable designs a good one rather than the best is of great value when it comes to deciding which of various models to fit. Optimality in a single model aids the experimenter in assessing the performance of a design. If more than one model is contemplated, then the results of Chapter 11 on discrimination designs become relevant, even though they are computationally expensive. Whatever tools are being considered, they should not obscure the insight that a detailed understanding of the practical problem is needed. It may lead the experimenter to a design choice which, although founded on pragmatic grounds, is of high efficiency. Generally, the theory of optimal experimental designs provides but one guide for a rational design selection, albeit an important one.

EXERCISES

15.1 Show that the central composite design τ_α with α = m(m + 3)/((m + 1)(m + 2)) is φ_0'-optimal for θ in T, where φ_0' is the rank deficient determinant criterion of Section 8.18.

15.2 Show that for four factors, m = 4, the central composite designs τ_α of Section 15.19 have rank deficient matrix mean information for p ≠ 0, ∞.

15.3 (continued) Show that φ_{−∞}'-, φ_{−1}'-, and φ_0'-optimal designs for θ in T are obtained with α = 3/7, 0.748, 14/15, respectively.

15.4 Let the information functions φ on NND(s_1 s_2) and φ_i on NND(s_i) for i = 1, 2 satisfy φ(C_1 ⊗ C_2) = φ_1(C_1) φ_2(C_2). Show that if M_i ∈ M_i is φ_i-optimal for K_i'θ_i, for i = 1, 2, then M_1 ⊗ M_2 is φ-optimal for (K_1 ⊗ K_2)'(θ_1 ⊗ θ_2) in M = conv{A_1 ⊗ A_2 : A_1 ∈ M_1, A_2 ∈ M_2} [Hoel (1965), p. 1099; Krafft (1978), p. 286; Pukelsheim (1983a), p. 196].

15.5 (continued) Show that the matrix means on NND(s_1 s_2), NND(s_1), and NND(s_2) satisfy φ_p(C_1 ⊗ C_2) = φ_p(C_1) φ_p(C_2) for all C_1 ∈ NND(s_1) and C_2 ∈ NND(s_2). Deduce formulas for trace C_1 ⊗ C_2 and for det C_1 ⊗ C_2.

15.6 (continued) The two-way second-degree unsaturated regression function f(t_1, t_2) = (1, t_1, t_2, t_1 t_2)' is the Kronecker product of f_i(t_i) = (1, t_i)' for i = 1, 2. Find optimal designs by separately considering the one-way first-degree models for i = 1, 2.

Comments and References

The pertinent literature is discussed chapter by chapter and further developments are mentioned.

1. EXPERIMENTAL DESIGNS IN LINEAR MODELS The material in Chapter 1 is standard. Our linear model is often called a linear regression model in the wide sense, whereas a linear regression model in the narrow sense is our multiple line fit model. Multiway classification models and polynomial fit models are discussed in such textbooks as Searle (1971), and Box and Draper (1987). The study of monotonic matrix functions is initiated by Loewner (1934), and is developed in detail by Donoghue (1974). The terminology of a Loewner ordering follows Marshall and Olkin (1979). The Gauss-Markov Theorem is the key result of the theory of estimation in linear models. In terms of estimation theory, the minimum representation in the GaussMarkov Theorem 1.19 is interpreted as a covariance adjustment by Rao (1967). Optimal design theory calls for another, equally important application of the Gauss-Markov Theorem when, in Section 3.2, the information matrix for a parameter subsystem is defined. In order that the Gauss-Markov Theorem provides a basis for the definition of information matrices, it is imperative to present it without any assumptions on ranks and ranges, as does Theorem 1.19. This is closely related to general linear models and generalized inverses of matrices, see the monograph by Rao and Mitra (1971). Of the various classes of generalized inverses, we make do with the simplest one, that of Section 1.16. Our derivation of the Gauss-Markov Theorem emphasizes matrix algebra rather than the method of least squares. That the Gauss-Markov Theorem and the method of least squares are synonymous is discussed in many statistical textbooks. Krafft (1983) derives the Gauss-Markov Theorem by means

of a duality approach, exhibiting an appropriate dual problem. The history of the method of least squares, and hence of the Gauss-Markov Theorem, is intriguing, see Plackett (1949, 1972), Farebrother (1985), and Stigler (1986). A brief review follows. Legendre (1806) was the first to publish the Methode des Moindres Quarres, proposed the name and recommended it because of its simplicity. Gauss, according to a 1812 letter to Laplace [see Gauss (Werke), Band X l, p. 373], had known the method already in 1795, and had found a maximum likelihood type justification in 1798. However, he did not publish his findings until 1809, in Section 179 of his work Theoria Motus Corporum Coelestium, [see Gauss (Werke), Band VII, p. 245]. A minimum variance argument that comes closest to what today we call the Gauss-Markov Theorem appeared in 1823, as Section 21 of the paper Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, Pars Prior, [see Gauss (Werke), Band IV, p. 24]. Markov (1912) contributed to the dissemination of the method, by including it as the seventh chapter in his textbook Wahrscheinlichkeitsrechnung; see also Section 4.1 of Sheynin (1989). The standardization of a design for finite sample size n to a discrete probability distribution n/n anticipates the generalization to infinite sample size. From a practical point of view, a design for finite sample size n is a list (*i,... ,*) of regression vectors jc, which determine experimental run i. The transition to designs for infinite sample size goes back to Elfving (1952), see also Kiefer (1959, p. 281). Ever since, there has been an attempt to discriminate between designs for finite sample size and designs for infinite sample size by appropriate terminology: discrete versus continuous designs; exact versus approximate designs; concrete designs versus design measures. We believe that the inherent distinction is best brought to bear by speaking of designs for finite sample size, and designs for infinite sample size. In doing so, we adopt the philosophy of Pazman (1986), with a slightly different wording. In the same vein we find it useful to distinguish between moment matrices of a design and information matrices for a parameter subsystem. Many authors simply speak in both cases of information matrices. Kempthorne (1980) makes a point of distinguishing between a design matrix and a model matrix; Box and Hunter (1957) use the terms design matrix and matrix of independent variables. We reserve the term precision matrix for the inverse of a dispersion matrix, because it is our understanding that precision ought to be maximized while variability ought to be minimized. In contrast, the precision matrix of Box and Hunter (1957, p. 199), is the inverse moment matrix n(X'X)~l. In Lemma 1.26, we use the fact that in Euclidean space, the convex hull of a compact set is compact; this result can be found in most books on convex analysis, see for instance Rockafellar (1970, p. 158). An important implication of Lemma 1.26 is that no more moment matrices beyond the set M(H) evolve if the notion of a design is extended to mean an arbitrary probability distribution P on the Borel sigmaalgebra B of the compact regression range X. Namely, every such distribution P is the limit of a sequence (m)m>\

of probability measures with finite support, see Korollar 30.5 in Bauer (1990). The limit is in the sense of vague convergence and entails convergence of the moment matrices,

In our terminology, ξ_m are designs in the set Ξ, whence the moment matrices M(ξ_m) lie in M(Ξ). Since the set M(Ξ) is closed, it contains the limit matrix, ∫ xx' dP ∈ M(Ξ). Hence the moment matrix of P is attained also by some design ξ ∈ Ξ, M(P) = M(ξ).

2. OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS

In the literature, designs that are optimal for c'θ are mostly called c-optimal. The main result of the chapter, Theorem 2.14, is due to Elfving (1952). Studden (1971) generalizes the result to average-variance optimal designs for K'θ, with applications to polynomial extrapolation and linear spline fitting. Other important elaborations of Elfving's result are given in Chernoff (1972), and Fellman (1974). Our presentation is from Pukelsheim (1981) and is set up in order to ease the passage to multidimensional parameter subsystems. That the cone of nonnegative definite matrices NND(k) is the same as the ice-cream cone, as established in Section 2.5, holds true for dimensionality k = 2 only. A general classification of self-dual cones is given by Bellissard, Iochum and Lima (1978). The separating hyperplane theorem is a standard result from convex analysis; see for instance Rockafellar (1970, p. 95), Bazaraa and Shetty (1979, p. 45), and Witting (1985, p. 71). The Equivalence Theorem 2.16 on scalar optimality of moment matrices is due to Pukelsheim (1980) where it is obtained as a corollary from the General Equivalence Theorem. The necessity part of the proof puts early emphasis on a type of argument that is needed again in Section 7.7. Following work of Kiefer and Wolfowitz (1959), Hoel and Levine (1964) reduce scalar optimality to Chebyshev approximation problems; see also Kiefer and Wolfowitz (1965), and Karlin and Studden (1966b). For a dth-degree polynomial fit model on [−1;1], Studden (1968) determines optimal designs for individual coefficients θ_j. We include these results in Section 9.12 and Section 9.14. The geometry that underlies the Elfving Theorem is exploited for more sophisticated criteria than scalar optimality by Dette (1991b).

3. INFORMATION MATRICES

Throughout the book, we determine the s × 1 parameter system of interest K'θ through the choice of the k × s coefficient matrix K, not through the

underlying model parametrization that determines the k × 1 parameter vector θ. For instance, the two-way classification model is always parametrized as E[Y_ij] = α_i + β_j, with parameters α_i and β_j forming the components of the vector θ. The grand mean, or the contrasts (α_1 − ᾱ, ..., α_a − ᾱ)', or some other parameter subsystem are then extracted by a proper choice of the coefficient matrix K. There is no need to reparametrize the model! The central role of information matrices for the design of experiments is evident from work as early as Chernoff (1953). The definition of the information matrix mapping as a Loewner minimum of linear functions is due to Gaffke (1987a), see also Gaffke (1985a, p. 378), Fedorov and Khabarov (1986, p. 185), and Pukelsheim (1990). It is made possible through the Gauss-Markov Theorem in the form of Theorem 1.21. The close relation with the Gauss-Markov Theorem is anticipated by Pukelsheim and Styan (1983). Feasibility cones, as domains of optimization for the design problem, were first singled out by Pukelsheim (1980, 1981). That the range inclusion condition which appears in the definition of feasibility cones is essential for estimability, testability, and identifiability is folklore of statistical theory, compare Bunke and Bunke (1974). Alternative characterizations based on rank are available in Alalouf and Styan (1979). Fisher information matrices are discussed in many textbooks on mathematical statistics. The dispersion formula of Section 3.10 also appears in the differential geometrical analysis of more general models, as in Barndorff-Nielsen and Jupp (1988). The role of the moment matrix in general parametric modeling is emphasized by Pazman (1990). Parameter orthogonality is investigated in greater generality by Cox and Reid (1987). Lemma 3.12 is due to Albert (1969). For a history of Schur complements and their diverse uses in statistics, see Ouellette (1981), Styan (1985), and Carlson (1986). The derivation of the properties of information matrices C_K(A) or of generalized information matrices A_K is based on Anderson (1971), and also appears in Krein (1947, p. 492). Anderson (1971) calls A_K a shorted operator, for the reason that
    A_K = max { B ∈ NND(k) : B ≤ A, range B ⊆ range K },  the maximum being taken relative to the Loewner ordering.
To see this, we recall that by Lemma 3.14, the matrix A_K satisfies A_K ≤ A and range A_K ⊆ range K. Now let B ∈ NND(k) be another matrix with B ≤ A and range B ⊆ range K. From B ≤ A, we get range B ⊆ range A, whence range B ⊆ range A_K by the proof of Theorem 3.15. From Lemma 3.22 and Lemma 1.17, we obtain B = A_K A⁻B. Symmetry of B and the assumption B ≤ A yield B = A_K A⁻B = A_K A⁻BA⁻A_K ≤ A_K A⁻AA⁻A_K = A_K, since A⁻AA⁻ is a generalized inverse of A and hence of A_K, again by Lemma 3.22. Thus A_K is a maximum relative to the Loewner ordering, among all nonnegative definite k × k matrices B ≤ A of which the range is included in the range of K. Anderson and Trapp (1975, p. 65) establish the minimum property of

shorted operators which we have chosen as the definition. It emphasizes the fundamental role that the Gauss-Markov Theorem plays in linear model theory. Mitra and Puri (1979) and Golier (1986) develop a wealth of alternative representations of shorted operators, based on various types of generalized inverse matrices and generalized Schur complements. Alternative ways to establish the functional properties of the information matrix mapping are offered by Silvey (1980, p. 69), Pukelsheim and Styan (1983), and Hedayat and Majumdar (1985). The method of regularization has more general applications in statistics, see Cox (1988). The line fit example illustrating the discontinuity behavior of the information matrix mapping is adapted from Pazman (1986, p. 67). The C-matrix in a two-way classification model is a well-understood object, but its origin is uncertain. Reference to a C-matrix is made implicitly by Bose (1948, p. (12)), and explicitly by Chakrabarti (1963). Anyway, the name is convenient in that the C-matrix is the coefficient matrix of the reduced system of normal equations, as well as the contrast information matrix. Christof (1987) applies iterated information matrices to simple block designs in two-way classification models. Theorem 3.19 is from Gaffke and Pukelsheim (1988). For the occurrence of the C-matrix in models with correlated observations see, for instance, Kunert and Martin (1987), and Kunert (1991). 4. LOEWNER OPTIMALITY Many authors speak of uniform optimality rather than Loewner optimality. We believe that the latter makes the reference to the Loewner ordering more visible. The Loewner ordering is the same as the uniform ordering of Pazman (1986, p. 48), in view of Lemma 2 of Ste.pniak, Wang and Wu (1984). It fits in with other desirable notions of information oriented orderings of experiments, see Kiefer (1959, Theorem 3.1), Hansen and Torgersen (1974), and Ste.pniak (1989). The first result on Loewner optimal designs appears in Kurotschka (1971), in the setting of the two-way classification models of Section 4.8; see also Gaffke and Krafft (1977), Kurotschka (1978), Giovagnoli and Wynn (1981, 1985a), and Pukelsheim (1983a,c). The first part of Lemma 4.2 is due to LaMotte (1977). Section 4.5 to Section 4.7 follow Pukelsheim (1980). The nonexistence of Loewner optimal designs discussed in Corollary 4.7 is implicit in the paper of Wald (1943, p. 136). The derivation of the General Equivalence Theorem for scalar optimality from Section 4.9 onwards is new. 5. REAL OPTIMALITY CRITERIA The first systematic attempt to cover more general optimality criteria than the classical ones is put forward by Kiefer (1974a). From the broad class

of criteria discussed in that paper, Pukelsheim (1980) singles out information functions as an appropriate subclass for a general duality theory. Closely related classes of functions are investigated by Rockafellar (1967), and McFadden (1978). Other classes may be of interest; for instance, Cheng (1978a) distinguishes between what he calls type I criteria and type II criteria. Nalimov (1974) and Hedayat (1981) provide an overview of various common and some not so common optimality criteria. Shewry and Wynn (1987) propose a criterion based on entropy. Related design aspects in dynamic systems are reviewed by Titterington (1980b). It is worth emphasizing that the defining propertiesmonotonicity, concavity, and homogeneityof information functions are motivated, not by technical convenience, but from statistical aspects; see also Pukelsheim (1987a). The analysis of information functions parallels to a considerable extend the general discussion of norms, compare Rockafellar (1970, p. 131), and Pukelsheim (1983b). The Holder inequality dates back to Holder (1889) and Rogers (1888), as mentioned in Hardy, Littlewood and Polya (1934, p. 25). Beckenbach and Bellman (1965, p. 28), call it the Minkowski-Mahler inequality, and emphasize the method of quasilinearization in defining polar functions. Our formulation of the general design problem in Section 5.15 is anticipated in essence by Elfving (1959). The Existence Lemma 5.16 is a combination of Theorem 1 in Pukelsheim (1980), and Corollary 5.1 in Muller-Funk, Pukelsheim and Witting (1985). 6. MATRIX MEANS The concepts of determinant optimality, average-variance optimality, and smallest-eigenvalue optimality are classical, to which we add trace optimality. Determinant optimality and smallest-eigenvalue optimality were first proposed by Wald (1943). Krafft (1978) discusses the relation of the determinant criterion with the Gauss curvature of the power function of the F-test and with the concentration ellipsoid, see also Nordstrom (1991). Gaffke (1981, p. 894) proves that the determinant is the unique criterion which induces an ordering that is invariant under nonsingular reparametrization. That trace optimality acquires its importance only in combination with some other properties has already been pointed out by Kiefer (1960, p. 385). Our notion of trace optimality is distinct from the T-optimality concept of Atkinson and Fedorov (1975a,b) who use their criterion in order to test which of several models is the true one. Vector means are discussed, for example, by Beckenbach and Bellman (1965). An in-depth study of majorization is given by Hardy, Littlewood and Polya (1934), and Marshall and Olkin (1979). Two alternative proofs of the Birkhoff (1946) theorem and refinements are given in Marshall and Olkin (1979, pp. 34-38). Our method of proof in Section 6.8, of projecting onto diagonal matrices through averaging with respect to the sign-change group,

is adapted from Andersson and Perlman (1988). The proof of the Holder inequality follows Beckenbach and Bellman (1965, p. 70), Gaffke and Krafft (1979a), and Magnus (1987). Matrix norms of the form <(C) = <I>(A(C)), with <I> a symmetric gauge function, are studied by von Neumann (1937). Marshall and Olkin (1969), based on a result of Schatten (1950, p. 85), show that these norms are monotone. For a general exposition of matrix norms see Horn and Johnson (1985). According to Wussing and Arnold (1975, p. 230), THospital communicated his rule in 1696 in the textbook Analyse des Infiniment Petits. Under the seal of secrecy, 1'Hospital had bought much of the contents of the book from his private tutor, Johann Bernoulli. Insistence in formulating a convex minimization problemonly because this is the preferred problem type in optimization theoryis detrimental to the general design problem, compare Hoang and Seeger (1991). Theoretically, there is a deep and perfect analogy between convexity and concavity, see Part VII of Rockafellar (1970). Practically, the optimal design problem is an instance testifying to the difference of the two concepts, in that maximization of a concave information function describes the problem much more comprehensively than does minimization of a convex risk function. Logarithmic concavity is an accidental byproduct of the theory, with no intrinsic value. We know of no instance where the logarithm is of any necessity in solving a design problem. Rather, use of the logarithm signals that the problem under study is not yet fully understood. Lindley (1956) and Stone (1959) access the design problem through entropy which, of course, involves the logarithm. But before long, when it comes to the optimization problem, the logarithm disappears and the determinant criterion takes over.

7. THE GENERAL EQUIVALENCE THEOREM


In the literature there is only a sporadic discussion of the existence of optimal designs, as settled by Theorem 7.13. That the existence issue may be crucial is evidenced by Theorem 1 of Whittle (1973, p. 125), whose proof as it stands is incomplete. Existence of determinant optimal designs has never been doubted; see Kiefer (1961, p. 306) and Atwood (1973, p. 343). For scalar optimality, existence follows from the geometric approach of Chapter 2 which is due to Elfving (1952), see also Chernoff (1972, p. 12). For the average-variance criterion, existence of an optimal design is established by Fellman (1974, Theorem 4.1.3). Pázman (1980, Propositions 2 and 4) solves the problem for the sequence of matrix means φ_{−1}, φ_{−2}, .... The term subgradient originates with convex functions g since there any subgradient defines an affine function that bounds g from below. For a concave function g, a subgradient yields an affine function that bounds g from above. Hence the term supergradient would be more appropriate in the concave case.

Our formulation of the General Equivalence Theorem 7.14 appears to come as close as possible to the classical results of Kiefer and Wolfowitz (1960), and Kiefer (1974a). The original proof of Pukelsheim (1980) is based on Fenchel duality, Pukelsheim and Titterington (1983) outline the approach based on subgradients. The computation of the set of all subgradients is carried out by Gaffke (1985a,b). It is worthwhile contemplating the key role of the subgradient inequality. It demands the existence of a generalized inverse G of M such that

The prevailing property is linearity in the competing moment matrices A ∈ M. It is due to this linearity that, for the full set M = M(Ξ) = conv{xx' : x ∈ X}, we need to verify (1) only for the generating rank 1 matrices A = xx', with x ∈ X. There are alternatives to the subgradient inequality that lack this linearity property. For instance, a derivation based on directional derivatives leads to the requirement

where A_M = min_{Q ∈ R^{k×k}: QM = M} QAQ' is the generalized information matrix from Section 3.21. The advantage of (2) is that the product K'M⁻A_M M⁻K is invariant to the choice of a generalized inverse of M. This follows from Theorem 3.24 and Lemma 1.17, see also the preamble in the proof of Theorem 4.6. The disadvantage of (2) is that the dependence of A_M on A is in general nonlinear, and for M = M(Ξ) it does not suffice to verify (2) for A = xx', with x ∈ X, only. That (1) implies (2) is a direct consequence of A_M ≤ A and the monotonicity properties from Section 1.11. The gist of the equivalence comes to bear in the converse implication. To see that (2) implies (1), we use the alternative representation

following from A_M ≤ min_{G ∈ M⁻} MG'AGM ≤ MG̃'AG̃M = A_M. The first inequality holds because there are more matrices Q satisfying QM = M than those of the form Q = MG' with G ∈ M⁻. The second inequality holds for any particular member of M⁻. With an arbitrary matrix G ∈ M⁻, residual projector R = I_k − MG, and generalized inverse H ∈ (RAR')⁻, we pick the particular version G̃ = G − R'HRAG to obtain MG̃'M = M and MG̃'AG̃M = A − AR'HRA. The latter coincides with A_M, see (2) in

Section 3.23. Now the left hand side in (2) turns into

As a whole, (2) becomes max_{A ∈ M} min_{G ∈ M⁻} trace K'G'AGKCDC ≤ 1. Appealing to one of the standard minimax theorems such as Corollary 37.3.2 in Rockafellar (1970), we finally obtain

This is the same as (1). The proof of the equivalence of (1) and (2) is complete. A closely related question is which generalized inverses G of M are such that they appear in the General Equivalence Theorem. It is not always the Moore-Penrose inverse M⁺, as asserted by Fedorov and Malyutov (1972), nor is it an arbitrary generalized inverse, as claimed by Bandemer (1977, Section 5.6.3). Counterexamples are given by Pukelsheim (1981). Silvey (1978, p. 557) proposes a construction of permissible generalized inverses based on subspaces that are complementary to the range of M, see also Pukelsheim (1980, p. 348), and Pukelsheim and Titterington (1983, p. 1064). However, the choice of the complementary subspace depends on the optimal solution of the dual problem. Hence the savings in computational (and theoretical) complexity seem to be small. There are also various versions of the General Equivalence Theorem which emphasize the geometry in the space L_2(ξ), see Kiefer and Wolfowitz (1960, p. 364), Kiefer (1962), Karlin and Studden (1966a, Theorem 6.2), and Pukelsheim (1980, p. 354). In our terminology, an Equivalence Theorem for a general information function φ seeks to exhibit necessary and sufficient conditions for φ-optimality which are easy to verify. The original intent, of reconciling two independent criteria, no longer prevails. We use the qualifying attribute General Equivalence Theorems to indicate such results that allow for arbitrary convex compact sets M of competing moment matrices, rather than insisting on the largest possible set M(Ξ). If the set of competing moment matrices is maximal, M = M(Ξ), then for the full parameter vector θ the variables of the dual problem lend themselves to an appealing interpretation. In fact, for every matrix N > 0, we have that

In other words, N induces a cylinder that includes the regression range X, as described in Section 2.10. This dual analysis dates back to Elfving (1952,

p. 260); see also Wynn (1972, p. 174), Silvey and Titterington (1973, p. 25), and Sibson (1974, p. 684). The dual problem then calls for minimizing the "size" of the cylinder N as measured by the polar function. For the determinant criterion, the topic is known among convex geometers and is associated with the name of Loewner. The ellipsoid of smallest volume that includes a compact set of points is called the Loewner ellipsoid, see Busemann (1955, p. 414), Danzer, Laugwitz and Lenz (1957), Danzer, Grünbaum and Klee (1963, p. 139), Krafft (1981, p. 101), and Gruber (1988); Loewner (1939) investigates closely related problems. Silverman and Titterington (1980) develop an exact terminating algorithm for finding the ellipsoid of smallest area covering a plane set of regression vectors. The results in Section 7.23 and Section 7.24 on the relation between smallest-eigenvalue optimality and scalar optimality are from Pukelsheim and Studden (1993).

8. OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

The upper bound (2) on the support size in Theorem 8.2 is the Carathéodory theorem which says that for a bounded set S ⊆ R^n, every point in its convex hull can be represented as a convex combination of n + 1 points in S. For a proof see, for instance, Rockafellar (1970, p. 155), and Bazaraa and Shetty (1979, p. 37). Bound (5) of Theorem 8.2 originates with Fellman (1974, p. 62) for scalar optimality, and generalizes earlier results of Elfving (1952, p. 260) and Chernoff (1953, p. 590). It is shown to extend to the general design problem by Pukelsheim (1980, p. 351) and Chaloner (1984). Theoretically, in a Baire category sense, most optimal designs are supported by at most ½k(k + 1) many points, as established by Gruber (1988, p. 58). Practically, designs with more support points may be preferable because of their greater balancedness; see Section 14.9.

nant optimality first appears in Atwood (1973, Theorem 4). The auxiliary Lemma 8.10 on Hadamard products dates back to Schur (1911, p. 14); see also Styan (1973). The proof of Theorem 8.7 employs an idea of Sibson and Kenny (1975), to expand the quadratic form until the matrix M(g) appears in the middle. The material in Section 8.11 to Section 8.16 is taken from Pukelsheim (1980). Section 8.19 follows Pukelsheim (1983a). Corollary 8.16 has a history of its own, see Kiefer (1961, Theorem 2), Karlin and Studden (1966a, Theorem 6.1), Atwood (1969, Theorem 3.2), Silvey and Titterington (1973, p. 25), and Sibson (1974, p. 685). 9. D-, A-, E-, T-OPTIMALITY Farrell, Kiefer and Walbran (1967, p. 113) introduce the name global criterion for the G-criterion. The first reference to work on globally optimal designs is Smith (1918). The Equivalence Theorem 9.4 establishes the equivalence of determinant optimality and global optimality, whence its name. This is a famous result due to Kiefer and Wolfowitz (1960), announced as a footnote in Kiefer and Wolfowitz (1959, p. 292). Earlier Guest (1958) had found the globally optimal designs for polynomial fit models, and Hoel (1958) the determinant optimal designs. That the two, apparently distinct criteria lead to the same class of optimal designs came as a surprise to the people working in the field. As Kiefer (1974b) informally reports:
In fact the startling coincidence is that these two people have the same first two initials (P.G.) and you can compute the odds of that!!

That the determinant optimal support points are the local extrema of the Legendre polynomials follows from a result of calculus of Schur (1918), see Szego (1939, Section VI.6.7), and Karlin and Studden (1966b, p. 330). In verifying optimality by solving the dual problem, we follow Fejer (1932). The equivalence of the two problems is also stressed in Schoenberg (1959, p. 289) who, on p. 284, has a formula for the optimal determinant value which yields the recursion formula for u</+i($o) in Section 9.5. Bandemer and Nather (1980, p. 299), list the determinant optimal designs for polynomial fit models up to degree d = 6. That monograph also contains a wealth of tabulated designs for other criteria and for other models. Karlin and Studden (1966a) allow for different variance functions; the determinant optimal support points are then associated with other classical polynomials; see also, for instance, Fedorov (1972, p. 88), Humak (1977, p. 457) Krafft (1978, p. 282), Ermakov (1983), and Pazman (1986, p. 176). St.John and Draper (1975) provide a review of early work on determinant optimality and a bibliography. Applications to multivariate problems are studied by Krafft and Schaefer (1992). Bischoff (1992, 1993) gives conditions on the dispersion structure so

that the determinant optimal design in the homoscedastic model remains optimal in the presence of correlated observations. Kiefer and Studden (1976) study in greater detail the consequences of the fact that for increasing degree d, the determinant optimal designs converge to a limiting arcsin distribution. Dette and Studden (1992) establish the limiting arcsin distribution using canonical moments. Arcsin support designs are also emphasized by Fedorov (1972, p. 91). Arcsin support points, under the heading of Chebyshev points, first appeared while treating scalar optimality problems as a Chebyshev approximation scheme; see Kiefer and Wolfowitz (1959, 1965), Hoel and Levine (1964), and Studden (1968). As the degree d tends to oo, limiting distributions other than the arcsin distribution do occur. Studden (1978) shows that the optimal designs for the quantile component 0\dq\, with q G (0;1), have limiting Lebesgue density

Kiefer and Studden (1976) provide the limiting Lebesgue density of the optimal designs for f(t_0)'θ with |t_0| > 1 (extrapolation design),

The classical Kiefer and Wolfowitz Theorem 9.4 investigates determinant optimality of a design in the set Ξ of all designs, or equivalently, of a moment matrix M in the set M(Ξ) of all moment matrices. Our theory is more general, by admitting any subset M ⊆ M(Ξ) of competing moment matrices, as long as M is convex and compact. For instance, if the regression range is a Cartesian product, X = X_1 × X_2, and τ is a given design on the marginal set X_1, then the set of moment matrices originating from designs with first marginal distribution equal to τ,

is convex and compact, and Theorem 9.4 applies. For this situation, determinant optimality is characterized for the full parameter system by Cook and Thibodeau (1980), and for parameter subsystems by Nachtsheim (1989). The concept of average-variance optimality is the prime alternative to determinant optimality. Fedorov (1972, Section 2.9), discusses linear optimality criteria, in the sense of Section 9.8. Studden (1977) arrives at the criterion by integrating the variance surface x'M~lx, and calls it I-optimality. The average-variance criterion also arises as a natural choice from the Bayes point of view as proposed, for instance, by Chaloner (1984). An example where the

experimenters favor weighted average-variance optimality over determinant optimality is Conlisk and Watts (1979, p. 37). Ehrenfeld (1955) introduces the smallest-eigenvalue criterion. The criterion becomes differentiable at a matrix M > 0 for which the smallest eigenvalue has multiplicity 1; see Kiefer (1974a, Section 4E). The results in Section 9.13 on smallest-eigenvalue optimality are drawn from Pukelsheim and Studden (1993), and are obtained independently by Heiligers (1991c). Dette and Studden (1993) present a detailed inquiry into the geometric aspects of smallest-eigenvalue optimality, interwoven with other results from classical analysis. If the full parameter 6 is of interest, K = Ik, then the eigenvalue property in part I of the proof in Section 9.13 is

In terms of the polynomials P(t) = a'f(t)/||a||, standardized to have coefficient vector a/||a|| of Euclidean norm 1, we obtain a least squares property of the standardized Chebyshev polynomial T_d(t) = c'f(t)/||c|| relative to the smallest-eigenvalue optimal design τ_c for θ,

This complements the usual least squares property of the Chebyshev polynomials which pertains to the arcsin distribution, see Rivlin (1990, p. 42). The eigenvalue property (1) can also be written as 1 ≥ ||a||²/||c||² for all vectors a ∈ R^{d+1} satisfying a'Ma ≤ 1. That is, for every vector a ∈ R^{d+1}, we have

A weaker statement, with max_{i=0,1,...,d} (a'f(t_i))² ≤ 1 in place of the integral, is a corollary to results of Erdős (1947, pp. 1175-1176). Either way, we may deduce that the Elfving set R for a polynomial fit over [−1;1] has in-ball radius ||c||⁻¹. Indeed, the supporting hyperplanes to R are given by the vectors 0 ≠ a ∈ R^{d+1} such that max_{t ∈ [−1;1]} |a'f(t)| = 1. As mentioned in the previous paragraph, this entails ||a||² ≤ ||c||². The hyperplane {v ∈ R^{d+1} : a'v = 1} has distance 1/||a|| to the origin. Therefore the supporting hyperplane closest to the origin is given by the Chebyshev coefficient vector c, and has distance r = 1/||c||.

and extend the classical extremum property of Chebyshev polynomials. With C_{KK'c}(M) = min_{a ∈ R^{d+1}: a'KK'c = 1} a'Ma as in Section 3.2, we have

Hence among all polynomials P(t) = a'f(t) that satisfy a'KK'c = 1, the sup-norm ||P|| = max_{t ∈ [−1;1]} |P(t)| has minimum value ||K'c||⁻², and this minimum is attained only by the standardized Chebyshev polynomial T_d/||K'c||². Section 9.14 provides the generalization that stems from the coefficients θ_{d−1−2j}. Among all polynomials P(t) = a'f(t) that satisfy a'K̄K̄'c = 1, the sup-norm ||P|| has minimum value ||K̄'c||⁻², and this minimum is attained only by the standardized Chebyshev polynomial T_{d−1}/||K̄'c||². For the highest index d, this result is due to Chebyshev (1859), for the other indices to Markoff (1916), see Natanson (1955, pp. 36, 50) and Rivlin (1990, pp. 67, 112). Our formulation allows for combinations of the individual components, as given by KK'c and K̄K̄'c. Trace optimality is discussed implicitly in Silvey and Titterington (1974, p. 301), Kiefer (1975, p. 338), and Titterington (1975, 1980a). Trace optimality often appears in conjunction with Kiefer optimality, for reasons discussed in Section 14.9 (II). Optimal designs for the trigonometric fit model are given by Hoel (1965, p. 1100) and Karlin and Studden (1966b, p. 347); see also Fedorov (1972, Section 2.4) and Krafft (1978, Section 19(c)). The example in Section 9.17 of a design that remains optimal under variation of the model is from Hill (1978a). The Legendre polynomials in Exhibit 9.1 and the Chebyshev polynomials in Exhibit 9.8 are taken from Abramowitz and Stegun (1970, pp. 798, 795). The numerical results in Exhibit 9.2 to Exhibit 9.11 were obtained with the Fortran program PolyPlan of Preitschopf (1989).

10. ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES The notion of design admissibility is introduced by Ehrenfeld (1956) in the context of complete class theorems. Theorem 10.2 is due to Elfving (1959, p. 71); see also Karlin and Studden (1966a, p. 809). Pilz (1979) develops a decision theoretic framework. Lemma 10.3 is from Gaffke (1982, p. 9). Design admissibility in polynomial fit models is resolved rather early by de la Garza (1954) and Kiefer (1959, p. 291). Our development in Section 10.4

to Section 10.7 follows the presentations in Gaffke (1982, pp. 90-92) and Heiligers (1988, pp. 84-86). The result from convex analysis that is pictured in Exhibit 10.1 is Theorem 6.8 in Rockafellar (1970). Extensions to r-systems and polynomial spline regression are put forward by Karlin and Studden (1966a, 1966b), and Studden and VanArman (1969). Wierich (1986) obtains complete class theorems for product designs in models with continuous and discrete factors. The way admissibility relates to trace optimality is outlined by Elfving (1959), Karlin and Studden (1966a), Gaffke (1987a), and Heiligers (1991a). That it is preferable to first study its relations to smallest-eigenvalue optimality is pointed out in Gaffke and Pukelsheim (1988); see also Pukelsheim (1980, p. 359). Elfving (1959) calls admissibility of moment matrices total admissibility, in contrast to partial admissibility of information matrices. He makes a point that the latter is not only a property of the support points but also involves the design weights. Admissibility of special C-matrices is tackled by Christof and Pukelsheim (1985), and Baksalary and Pukelsheim (1985). Constantine, Lim and Studden (1987) discuss admissibility of designs for finite sample sizes in polynomial fit models.

11. BAYES DESIGNS AND DISCRIMINATION DESIGNS
There is a rich and diverse literature on incorporating prior knowledge into a design problem. Our development emphasizes that the General Equivalence Theorem still provides the basis for most of the results. Bayes designs have found the broadest coverage. The underlying distributional assumptions are discussed in detail by Sinha (1970), Lindley and Smith (1972), and Guttman (1971); see also the overviews of Herzberg and Cox (1969), Atkinson (1982), Steinberg and Hunter (1984), and Bandemer, Näther and Pilz (1986). The monograph of Pilz (1991) provides a comprehensive presentation of the subject and gives many references. Another detailed review of the literature is included in the authoritative exposition of Chaloner (1984). El-Krunz and Studden (1991) analyse Bayes scalar optimal designs by geometric arguments akin to the Elfving Theorem 2.14. Mixtures of Bayes models are studied by DasGupta and Studden (1991). Chaloner and Larntz (1989) apply the Bayes approach to logistic regression experiments. That Bayes designs and designs with protected experimental runs lead to the same optimization problem has already been pointed out by Covey-Crump and Silvey (1970). Theorem 11.8 on designs with bounded weights generalizes the result of Wynn (1977, p. 474). Wynn (1982, p. 494) applies the approach to finite population sampling, and Welch (1982) uses it as a basis to derive a branch-and-bound algorithm.


The setting in Section 11.10, of designing the experiment simultaneously for several models, comes under various headings such as model robust design, designs for model discrimination, or multipurpose designs. The issue arises naturally from applications as in Hunter and Reiner (1965), or Cook and Nachtsheim (1982). Läuter (1974, 1976) formulates and solves the problem as an optimization problem; see also Humak (1977, Kapitel 8), and Bunke and Bunke (1986). Atkinson and Cox (1974) use the approach to guard against misspecification of the degree of a polynomial fit model. Atkinson and Fedorov (1975a,b) treat the testing problem of whether one of two or more models is the correct one, with a special view of nonlinear models and sequential experimental designs. Fedorov and Malyutov (1972) emphasize the aspects of model discrimination. Hill (1978b) reviews procedures and methods for model discrimination designs. Our presentation follows Pukelsheim and Rosenberger (1993). Most of the literature deals with geometric means of determinant criteria. In this case canonical moments provide another powerful tool for the analysis. For mixture designs, deep results are presented by Studden (1980), Lau and Studden (1985), Lim and Studden (1988), and Dette (1990, 1993a,b). Dette (1991a) shows that in a polynomial fit model every admissible symmetric design on [-1;1] becomes optimal relative to a weighted geometric mean of determinant criteria. Example I of Section 11.18 is due to Dette (1990, p. 1791). Other examples of mixture determinant optimal designs for weighted polynomial fit models are given by Dette (1992a,b). Lemma 11.11 is taken from Gutmair (1990) who calculates polars and subgradients for mixtures of information functions. A related calculus is in use in electric network theory, as in Anderson and Trapp (1976). The weighting and scaling issue which we address in Section 11.17 is immanent in all of the work on the subject. It can be further elucidated by considering weighted vector means φ_p(A) = (Σ_{i≤m} w_i A_i^p)^{1/p}, with arbitrary weights w_i > 0, rather than restricting attention to the uniform weighting w_i = 1/m as we do. Interest in designs with guaranteed efficiencies was sparked off by the seminal paper of Stigler (1971). Atkinson (1972) broadens the setting by testing the adequacy of an extended model which reduces to the given model for particular parameter values. For the determinant criteria, again the canonical moment technique proves to be very powerful, leading to the detailed results of Studden (1982) and Lau (1988). DasGupta, Mukhopadhyay and Studden (1992) call designs with guaranteed efficiencies compromise designs, and study classical and Bayes settings in heteroscedastic linear models. Our Theorem 11.20 is a consequence of the Kuhn-Tucker theorem; see, for instance, Rockafellar (1970, p. 283). Example I in Section 11.22 is taken from Studden (1982). Lee (1987, 1988) considers the same problem when the criteria are differentiable, using directional derivatives. Similarly, results from constrained optimization theory are used by Vila (1991) to investigate determinant optimality of designs for finite sample size n.
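As a small illustration of the weighted geometric means of determinant criteria discussed above, the following lines are a minimal sketch of our own (the function name is illustrative, and the competing moment matrices are assumed positive definite); it simply evaluates the log of such a mixture criterion for given moment matrices and weights:

    import numpy as np

    def log_weighted_det_mixture(moment_matrices, weights):
        # Log of a weighted geometric mean of determinant criteria:
        # sum_j w_j * log det M_j; maximizing this over designs is the
        # mixture problem discussed above.
        return sum(w * np.linalg.slogdet(M)[1]
                   for M, w in zip(moment_matrices, weights))

    # Example: two rival moment matrices M1, M2 with equal weights would be
    # scored as log_weighted_det_mixture([M1, M2], [0.5, 0.5]).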


12. EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES
A proper treatment of the discrete optimization problems that come with the set Ξ_n of designs for sample size n requires a combinatorial theory, as in the monographs of Raghavarao (1971), Raktoe, Hedayat and Federer (1981), Constantine (1987), John (1987), and Shah and Sinha (1989). The usual numerical rounding of the quotas nw_i has little chance of preserving the side condition that the rounded numbers sum to n. For instance, the probability for rounded percentages to add to 100% is √(6/(πℓ)) as the support size ℓ becomes large; see Mosteller, Youtz and Zahn (1967), and Diaconis and Freedman (1979). A systematic apportionment method provides an efficient alternative to convert an optimal design ξ into a design ξ_n for sample size n; see Pukelsheim and Rieder (1992). The discretization problem for experimental designs has much in common with the apportionment methods for electoral bodies, as studied in the lively treatise of Balinski and Young (1982). The monotonicity properties that have shaped the political debate have their counterparts in the design of experiments. Of these, sample size monotonicity is the simplest. Exhibit 12.1 shows the data of the historical Alabama paradox; see Balinski and Young (1982, p. 39). Those authors prove that the monotonicity requirements automatically lead to divisor methods, that is, multiplier methods in our terminology. Other than in the political sciences, the design of experiments provides numerical criteria to compare various apportionment methods, resulting in the efficient apportionment method of Section 12.6. Lemma 12.5 is in the spirit of Fedorov (1972, Chapter 3.1). Theorem 12.7 is from Balinski and Young (1982, p. 105). Exhibit 12.2 takes up an example of Bandemer and Näther (1980, p. 267). The efficient design apportionment is called the method of John Quincy Adams by Balinski and Young (1982, p. 28), and is also known as the method of smallest divisors. For us, the latter translates into the method of largest multipliers since it uses multipliers ν as large as possible such that the frequencies ⌈νw_i⌉ sum to n. Kiefer (1971, p. 116) advocates the goal to minimize the total variation distance max_{i≤ℓ} |n_i/n − w_i|. This is achieved uniquely by the method of Hamilton; see Balinski and Young (1982, pp. 17, 104). The asymptotic order of the efficiency loss due to rounding has already been mentioned by Kiefer (1959, p. 281) and Kiefer (1960, p. 383). Fellman (1980) and Rieder (1990) address the differentiability assumptions of Theorem 12.10. The subgradient efficiency bound (3) in Section 12.11 is from Pukelsheim and Titterington (1983, p. 1067). It generalizes an earlier result of Atwood (1969, p. 1596), and is related to the approach of Gribik and Kortanek (1977). From Section 12.12 on, the exposition closely follows the work of Gaffke (1987b), building on Gaffke and Krafft (1982). It greatly improves upon the original paper by Salaevskii (1966) who conjectured that in polynomial fit models rounding always leads to determinant optimal designs for sample size n. Gaffke (1987b) reviews the successes and failures in settling this conjecture, and proposes a necessary condition for discrete optimality which disproves the conjecture for degree d ≥ 4 and small sample sizes n. The explicit counterexamples in Exhibit 12.5 are taken from Constantine, Lim and Studden (1987, p. 25). Those authors also prove that the Salaevskii conjecture holds true for degree d = 3. This provides some indication that the sample sizes n_d of Exhibit 12.6 are very much on the safe side, since there we find n_3 = 12 rather than n_3 = 4. Further examples are discussed by Gaffke (1987b). A numerical study is reported in Cook and Nachtsheim (1980). The role of symmetry is an ongoing challenge in the design of experiments. The example at the end of Section 12.16 is taken from Kiefer (1959, p. 281), who refers back to it in Kiefer (1971, p. 117).
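The multiplier description quoted above suggests the following minimal sketch of our own (not the apportionment code of Pukelsheim and Rieder (1992), only an illustration of the idea); it assumes the weights are nonnegative and sum to one, and that n is at least the number of support points. Frequencies are grown one observation at a time, which mimics rounding up the quantities νw_i for an increasing multiplier ν:

    def largest_multiplier_rounding(weights, n):
        # Apportion n observations to design weights w_i in the spirit of the
        # method of largest multipliers (smallest divisors): every support point
        # starts with one observation, and each further observation goes to the
        # point maximizing w_i / n_i, i.e. the point whose ceiling ceil(nu * w_i)
        # would increase first as the multiplier nu grows.
        freq = [1 if w > 0 else 0 for w in weights]
        assert sum(freq) <= n, "n must be at least the number of support points"
        while sum(freq) < n:
            i = max(range(len(weights)),
                    key=lambda j: weights[j] / freq[j] if freq[j] > 0 else 0.0)
            freq[i] += 1
        return freq

    # Example: weights (1/2, 1/4, 1/4) and n = 7 give the frequencies [3, 2, 2].
    print(largest_multiplier_rounding([0.5, 0.25, 0.25], 7))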

13. INVARIANT DESIGN PROBLEMS Many classical designs were first proposed because of their intuitive appeal, on the grounds of symmetry properties such as some sort of "balancedness", as pointed out by Elfving (1959) and Kiefer (1981). The mathematical expression of symmetry is invariance, relative to a group of transformations which suitably acts on the problem. Invariance considerations have permeated the design literature from the very beginning; see Kiefer (1958), and Kiefer and Wolfowitz (1959, p. 279). The material in Section 13.1 to Section 13.4 is basic. Nevertheless there is an occasional irritation as if the matrix transformations were to act on the moment matrices by similarity, A -> QAQ~l, rather than by congruence, A i-> QAQ'. Lemma 13.5, on the homomorphisms under which the information matrix mapping CK is equivariant, is new. Sinha (1982) provides a similar discussion of invariance, restricted to the settings of block designs. Lemma 13.7 on invariance of the matrix means extends a technique of Lemma 7.4 in Draper, Gaffke and Pukelsheim (1991); Lemma 13.10 compiles the results of Section 2 of that paper. Part (c) of Lemma 13.10 says that Sym(s, H) forms a quadratic subspace of symmetric matrices as introduced by Seely (1971). This property is also central to multivariate statistical analysis, compare Andersson (1975) and Jensen (1988). For simultaneous optimality under permutationally invariant criteria in the context of block designs, Kiefer (1975, p. 336) coined the notion of universal optimality. This was followed by a series of papers elaborating on the interplay of averaging over the transformation group (Section 13.11), simultaneous optimality relative to sets of invariant optimality criteria (Section 13.12), and matrix majorization (Section 14.1), see Giovagnoli and Wynn (1981,1985a,b), Bondar (1983), and Giovagnoli, Pukelsheim and Wynn (1987).
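The congruence action and the averaging device of Section 13.11 combine in the standard symmetrization argument, recorded here in compact form (our own summary, not a quotation from the book):

$$
\bar M \;=\; \frac{1}{|G|}\sum_{Q\in G} QMQ', \qquad
\phi(\bar M)\ \ge\ \frac{1}{|G|}\sum_{Q\in G}\phi(QMQ')\ =\ \phi(M),
$$

for every concave criterion $\phi$ that is invariant under the congruences $A \mapsto QAQ'$, $Q \in G$; the search for $\phi$-optimal moment matrices may therefore be restricted to invariant ones.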


14. KIEFER OPTIMALITY
Universal optimality generalizes to the Kiefer ordering as introduced in Section 14.2. The semantic distinction between uniform optimality and universal optimality is weak, which is why we prefer to speak of Loewner optimality and Kiefer optimality, and of the Loewner ordering and the Kiefer ordering. It is the same as the upper weak majorization ordering of Giovagnoli and Wynn (1981, 1985b), and the information increasing ordering of Pukelsheim (1987a,b,c). Shah and Sinha (1989) provide a fairly complete overview of results on universally optimal block designs. In the discrete theory, feasibility of a block design for the centered treatment contrasts comes under the heading of connectedness of the incidence matrix N, see Krafft (1978, p. 195) and Heiligers (1991b). Eccleston and Hedayat (1974) distinguish various notions of connectedness and relate them to optimality properties of block designs. Our notion of balancedness is often called variance-balancedness, see, for example, Kageyama and Tsuji (1980). Various other meanings of balancedness are discussed by Calinski (1977). An a × b balanced incomplete block design for sample size n is known in expert jargon as a BIBD(r, k, λ), with treatment replication number r = n/a, blocksize k = n/b, and treatment concurrence number λ = r(k − 1)/(a − 1). Our terminology is more verbose for the reason that it treats block designs as just one instance of the general design problem. Our Section 14.8 follows Giovagnoli and Wynn (1981, p. 414) and Pukelsheim (1983a, 1987a). Section 14.9 on balanced incomplete block designs is mostly standard, compare Raghavarao (1971, Section 4.3) and Krafft (1978, Section 21). Rasch and Herrendorfer (1982) and Nigam, Puri and Gupta (1988) present an exposition with a view towards applications. Beth, Jungnickel and Lenz (1985) investigate for which parameter settings block designs exist and how they are constructed. The inequality a ≤ b, calling for at least as many blocks as there are treatments, is due to Fisher (1940, p. 54); see also Bose (1949). A particular parameter system comprises the contrasts between a treatment and a control; see the review by Hedayat, Jacroux and Majumdar (1988). Feasibility and optimality of multiway block designs are treated in Raghavarao and Federer (1975), Cheng (1978b, 1981), Gaffke and Krafft (1979b), Eccleston and Kiefer (1981), Pukelsheim (1983c, 1986), and Pukelsheim and Titterington (1986, 1987). In Section 14.10, the discussion of designs for linear regression on the cube follows Cheng (1987) and Pukelsheim (1989), but see also Kiefer (1960, p. 402). An application of the Kiefer ordering to experiments with mixtures is given by Mikaeili (1988).
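The parameter relations for balanced incomplete block designs quoted above are easy to check numerically. The following lines are a small sketch of our own (the helper name is hypothetical); the conditions tested are necessary only, and existence of an actual design is a separate combinatorial question, as the references above make clear:

    def bibd_parameters(a, b, n):
        # a treatments, b blocks, n observations; derived parameters are
        # r = n/a (replication number), k = n/b (block size), and
        # lambda = r(k - 1)/(a - 1) (concurrence number).  The boolean flags
        # the obvious necessary conditions, including Fisher's inequality a <= b.
        r = n / a
        k = n / b
        lam = r * (k - 1) / (a - 1)
        ok = r.is_integer() and k.is_integer() and lam.is_integer() and a <= b
        return r, k, lam, ok

    # Example: a = b = 7 and n = 21 give r = k = 3 and lambda = 1 (the Fano plane).
    print(bibd_parameters(7, 7, 21))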

15. ROTATABILITY AND RESPONSE SURFACE DESIGNS
Response surface methodology originates with Box and Wilson (1951), Box and Hunter (1957), and Box and Draper (1959). The monograph of Box and
Draper (1987) is the core text on the subject, and includes a large bibliography. Other expositions of the subject are Khuri and Cornell (1987), and Myers (1971); see also Myers, Khuri and Carter (1989). The definition of an information surface i_M makes sense for an arbitrary nonnegative definite matrix M, but the conclusion of Lemma 15.3 that i_M = i_A implies M = A then ceases to hold true, see Draper, Gaffke and Pukelsheim (1991, p. 154; 1993). This is in contrast to the information matrix mapping C_K for which it is quite legitimate to extend the domain of definition from the set of moment matrices M(Ξ) to the entire cone NND(k). First-degree rotatability and regular simplex designs are studied by Box (1952). Two-level fractional factorial designs are investigated in detail by Box and Hunter (1961). Their usefulness was recognized already in the seminal paper of Plackett and Burman (1946). Those authors also point out that there are close interrelations with balanced incomplete block designs. The use of Hadamard matrices for the design of weighing experiments is worked out by Hotelling (1944); the same result is given much earlier in the textbook by Helmert (1872, pp. 48-49) who attributes it to Gauss. See also the review paper by Hedayat and Wallis (1978). The origin of the Kronecker product dates back beyond Kronecker to Zehfuss (1858), according to Henderson, Pukelsheim and Searle (1983). In modern algebra, the Kronecker product is an instance of a tensor product, as is the outer product (s, t) → st', see Greub (1967). Henderson and Searle (1981) review the use of the vectorization operator in statistics and elsewhere. These tools are introduced to second-degree models by Draper, Gaffke and Pukelsheim (1991). Central composite designs are proposed in Box and Wilson (1951, p. 16), and Box and Hunter (1957, p. 224). Rotatable second-degree moment matrices also arise with the simplex-sum designs of Box and Behnken (1960) whose construction is based on the regular simplex designs for first-degree models. Optimality properties of rotatable second-degree designs are derived by Kiefer (1960, p. 398), and Galil and Kiefer (1977). Other ways to adjust the single parameter that remains in the complete class of Theorem 15.18 are dictated by more practical considerations, see Box (1982), Box and Draper (1987, p. 486), and Myers, Vining, Giovannitti-Jensen and Myers (1992). Draper and Pukelsheim (1990) study measures of rotatability which can be used to indicate a deviation from rotatability, following work of Draper and Guttman (1988), and Khuri (1988).
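The vectorization operator and the Kronecker product mentioned above obey the familiar identity vec(ABC) = (C' ⊗ A) vec(B). The following lines, a small numerical check of our own rather than anything from the book, verify it with column-major vectorization:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 4))
    B = rng.normal(size=(4, 5))
    C = rng.normal(size=(5, 2))

    def vec(X):
        # Column-stacking vectorization operator.
        return X.reshape(-1, order="F")

    # vec(A B C) = (C' kron A) vec(B)
    assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))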

Biographies

1. CHARLES LOEWNER 1893-1968
Karl Löwner was born on May 29, 1893, near Prague, into a large Jewish family. He wrote his dissertation under Georg Pick at the Charles University in Prague. After some years as an Assistent at the German Technical University in Prague, Privatdozent at the Friedrich-Wilhelm-Universität in Berlin, and außerordentlicher Professor at Cologne University, he returned to the Charles University where he was promoted to an ordentlicher Professor. The German occupation of Czechoslovakia in 1939 caused him to emigrate to the United States and to change his name to Charles Loewner. Loewner taught at Louisville University, Brown University, and Syracuse University prior to his appointment in 1951 as Professor of Mathematics at Stanford University, where he remained beyond his retirement in 1963 until his death on January 8, 1968. Loewner's success as a teacher was outstanding. Even during the last year of his life he directed more doctoral dissertations than any other department member. Volume 14 (1965) of the Journal d'Analyse Mathématique is dedicated to him, Stefan Bergman, and Gábor Szegő. Lipman Bers tells of the man and scientist in the Introduction of the Collected Papers Loewner (CP). Loewner's work covers wide areas of complex analysis and differential geometry. The research on conformal mappings and their iterations led Loewner to the general study of semi-groups of transformations. In this vein, he axiomatized and characterized monotone matrix functions. There is a large body of Loewner's work which will not be found in his formal publications. One example is what is now called the Loewner ellipsoid, the ellipsoid of smallest volume circumscribing a compact set in Euclidean space.


Top: Loewner. Middle: Elfving. Bottom: Kiefer.


2. GUSTAV ELFVING 1908-1984
Erik Gustav Elfving was born on June 25, 1908, in Helsinki. His father was a Professor of Botany at the University of Helsinki. Elfving learnt his calculus of probability from J.W. Lindeberg, but wrote his doctoral thesis (1934) under Rolf Nevanlinna. As a mathematician member of a 1935 Danish cartographic expedition to Western Greenland, when incessant rain forced the group to stay in their tents for three solid days, Elfving began to think about least squares problems and thereafter turned to statistics and probability theory. He started his academic career in 1932 as a lecturer at the Åbo Akademi in Turku, and six years later became a docent in the Mathematics Department at the University of Helsinki. In 1948 Elfving was appointed to the chair that became vacant after Lars Ahlfors moved to Harvard University. He retired from this position in 1975, and died on March 25, 1984. An appreciation of the life and work of Gustav Elfving is given by Mäkeläinen (1990). Elfving's publications cover, amongst others, continuous time Markov chains, Markovian two-person nonzero-sum games, decision theory, and counting processes. The 1952 paper on Optimum allocation in linear regression theory marks the beginning of the optimality theory of experimental design. Elfving's work in this area is reviewed by Fellman (1991). During his retirement, Elfving wrote a history of mathematics in Finland 1828-1918, including Lindelöf, Mellin, and Lindeberg; see also Elfving (1985). The photograph shows Elfving giving a Bayes talk at his seminar in Helsinki right after his 1966 visit to Stanford University.

3. JACK KIEFER 1924-1981
Jack Carl Kiefer was born on January 25, 1924, in Cincinnati, Ohio. He attended the Massachusetts Institute of Technology and, interrupted by military service in World War II, earned a master's degree in electrical engineering and economics. In 1948 he enrolled in the Department of Mathematical Statistics at Columbia University and wrote a doctoral thesis (1952) on decision theory under Jacob Wolfowitz. In 1951, Wolfowitz left Columbia and joined the Department of Mathematics at Cornell University. Kiefer went along as an instructor in the same department, and eventually became a full professor in 1959. After 28 years at Cornell, which showed Kiefer as a prolific researcher in almost all parts of modern statistics, he took early retirement in 1979 only to join the Statistics Department of the University of California at Berkeley. He died in Berkeley on August 10, 1981. Kiefer was one of the leading statisticians of the century. Circling around the decision theoretic foundations laid down by the work of Abraham Wald at Columbia, Kiefer's more than 100 publications span an amazing range of interests: stochastic approximations, sequential procedures, nonparametric statistics, multivariate analysis, inventory models, stochastic processes, design of experiments. More than 45 papers are on optimal design, highlights being the 1959 discussion paper before the Royal Statistical Society (met by some discussants with rejection and ridicule), the 1974a review paper on the approximate design theory in the Annals of Statistics, and his last paper (1984, with H.P. Wynn) on optimal designs in the presence of correlated errors. A set of commentaries in the Collected Papers Kiefer (CP) aids in focusing on the contributions of the scientist and elucidating the warm personality of the man Jack Kiefer; see also Bechhofer (1982).

Bibliography

Numbers in square brackets refer to the page on which the reference is quoted.

Abramowitz, M. and Stegun, I.A. (1970). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York. [421] Alalouf, I.S. and Styan, G.P.H. (1979). "Characterizations of estimability in the general linear model." Annals of Statistics 7, 194-200. [411] Albert, A. (1969). "Conditions for positive and nonnegative definiteness in terms of pseudoinverses." SIAM Journal on Applied Mathematics 17, 434-440. [411] Anderson, W.N., Jr. (1971). "Shorted operators." SIAM Journal on Applied Mathematics 20, 520-525. [411] Anderson, W.N., Jr. and Trapp, G.E. (1975). "Shorted operators II." SIAM Journal on Applied Mathematics 28, 60-71. [411] Anderson, W.N., Jr. and Trapp, G.E. (1976). "A class of monotone operator functions related to electrical network theory." Linear Algebra and Its Applications 15, 53-67. [423] Andersson, S. (1975). "Invariant normal models." Annals of Statistics 3, 132-154. [425] Andersson, S.A. and Perlman, M.D. (1988). "Group-invariant analogues of Hadamard's inequality." Linear Algebra and Its Applications 110, 91-116. [414] Atkinson, A.C. (1972). "Planning experiments to detect inadequate regression models." Biometrika 59, 275-293. [423]

Atkinson, A.C. (1982). "Developments in the design of experiments." International Statistical Review 50, 161-177. [422] Atkinson, A.C. and Cox, D.R. (1974). "Planning experiments for discriminating between models." Journal of the Royal Statistical Society Series B 36, 321-334. "Discussion on the paper by Dr Atkinson and Professor Cox." Ibidem, 335-348. [423] Atkinson, A.C. and Fedorov, V.V. (1975a). "The design of experiments for discriminating between two rival models." Biometrika 62, 57-70. [413, 423] Atkinson, A.C. and Fedorov, V.V. (1975b). "Optimal design: Experiments for discriminating between several models." Biometrika 62, 289-303. [413, 423] Atwood, C.L. (1969). "Optimal and efficient designs of experiments." Annals of Mathematical Statistics 40, 1570-1602. [60, 415, 424]


Atwood, C.L. (1973). "Sequences converging to D-optimal designs of experiments." Annals of Statistics 1, 342-352. (4u, 4is] Baksalary, J.K. and Pukelsheim, F. (1985). "A note on the matrix ordering of special C-matrices." Linear Algebra and Its Applications 70, 263-267. [422) Balinski, M.L. and Young, H.P. (1982). Fair Representation. Meeting the Ideal of One Man, One Vote. Yale University Press, New Haven, CT. (330,424] Bandemer, H. (Ed.) (1977). Theorie and Anwendung der optimalen Versuchsplanung I. Handbuch zur Theorie. Akademie-Verlag, Berlin. 160,416] Bandemer, H. and Nather, W. (1980). Theorie und Anwendung der optimalen Versuchsplanung II. Handbuch zur Anwendung. Akademie-Verlag, Berlin. HIS. 424] Bandemer, H., Nather, W. and Pilz, J. (1986). "Once more: Optimal experimental design for regression models." Statistics 18, 171-198. "Discussion." Ibidem, 199-217. (422] Barndorff-Nielsen, O.E. and Jupp, P.E. (1988). "Differential geometry, profile likelihood, Lsufficiency and composite transformation models." Annals of Statistics 16, 1009-1043. |4ii] Bauer, H. (1990). Mass- und Integrationstheorie. De Gruyter, Berlin. [410] Bazaraa, M.S. and Shetty, C.M. (1979). Nonlinear Programming. Theory and Algorithms. Wiley, New York. HIO. 417] Bechhofer, R. (1982). "Jack Carl Kiefer 1924-1981." American Statistician 36, 356-357. Beckenbach, E.F. and Bellman, R. (1965). Inequalities. Springer, Berlin. [431] [413,4141

Bellissard, J., lochum, B. and Lima, R. (1978). "Homogeneous and facially homogeneous selfdual cones." Linear Algebra and Its Applications 19, 1-16. [410] Beth, T, Jungnickel, D. and Lenz, H. (1985). Design Theory. Bibliographisches Institut, Mannheim. (426] Birkhoff, G. (1946). "Tres observaciones sobre el algebra lineal." Universidad Nacional de Tucumdn, Facultad de Ciencias Exactas, Puras y Aplicadas, Revista, Serie A, Matemdticas y Fisica Teorica 5, 147-151. [413] Bischoff, W. (1992). "On exact D-optimal designs for regression models with correlated observations." Annals of the Institute of Statistical Mathematics 44, 229-238. [418] Bischoff, W. (1993). "On D-optimal designs for linear models under correlated observations with an application to a linear model with multiple response." Journal of Statistical Planning and Inference 37, 69-80. HIK] Bondar, J.V. (1983). "Universal optimality of experimental designs: definitions and a criterion." Canadian Journal of Statistics 11, 325-331. [425] Bose, R.C. (1948). "The design of experiments." In Proceedings of the Thirty-Fourth Indian Science Congress, Delhi 1947. Indian Science Congress Association, Calcutta, (l)-(25). |4i2] Bose, R.C. (1949). "A note on Fisher's inequality for balanced incomplete block designs." Annals of Mathematical Statistics 20, 619-620. [426] Box, G.E.P. (1952). "Multi-factor designs of first order." Biometrika 39, 49-57. 1427] Box, G.E.P. (1982). "Choice of response surface design and alphabetic optimality." Utilitas Mathematica21B, 11-55. 1427] Box, G.E.P. (CW). The Collected Works of George E.P. Box (Eds. G.C. Tiao, C.W.J. Granger, I. Guttman, B.H. Margolin, R.D. Snee, S.M. Stigler). Wadsworth, Belmont, CA 1985.

Box, G.E.P. and Behnken, D.W. (1960). "Simplex-sum designs: A class of second order rotatable designs derivable from those of first order." Annals of Mathematical Statistics 31, 838-864. [427]
Box, G.E.P. and Draper, N.R. (1959). "A basis for the selection of a response surface design." Journal of the American Statistical Association 54, 622-654. [426]


Box, G.E.P. and Draper, N.R. (1987), Empirical Model-Building and Response Surfaces. Wiley, New York. [4os, 426,427] Box, G.E.P. and Hunter, J.S. (1957). "Multi-factor experimental designs for exploring response surfaces." Annals of Mathematical Statistics 28, 195-241. [409,426,42?] Box, G.E.P. and Hunter, J.S. (1961). "The 2k~p fractional factorial designs. Part I." Technometrics 3, 311-351. "Part II." Ibidem, 449-458. \/ta\ Box, G.E.P. and Wilson, K.B. (1951). "On the experimental attainment of optimum conditions." Journal of the Royal Statistical Society Series B 13, 1-38. "Discussion on paper by Mr. Box and Dr. Wilson." Ibidem, 38-45. [426,42?] Bunke, H. and Bunke, O. (1974). "Identifiability and estimability." Mathematische Operationsforschung und Statistik 5, 223-233. HII] Bunke, H. and Bunke, O. (Eds.) (1986). Statistical Inference in Linear Models. Statistical Methods of Model Building, Volume 1. Wiley, Chichester. [423] Busemann, H. (1955). The Geometry of Geodesies. Academic Press, New York. \4iT\ Calinski, T. (1977). "On the notion of balance in block designs." In Recent Developments in Statistics. Ptrtceedings of the European Meeting of Statisticians, Grenoble 1976 (Eds. J.R. Barra, F. Brodeau, G. Romier, B. van Cutsem). North-Holland, Amsterdam, 365-374. |426] Carlson, D. (1986). "What are Schur complements, anyway?" Linear Algebra and Its Applications 74, 257-275. pu] Chakrabarti, M.C. (1963). "On the C-matrix in design of experiments." Journal of the Indian Statistical Association 1, 8-23. [412] Chaloner, K. (1984). "Optimal Bayesian experimental design for linear models." Annals of Statistics 12, 283-300. "Correction." Ibidem 13, 836. 1302,417,419,422] Chaloner, K. and Larntz, K. (1989). "Optimal Bayesian design applied to logistic regression experiments." Journal of Statistical Planning and Inference 21, 191-208. [4221 Chebyshev, PL. [Tchebychef, PL.] (1859). "Sur les questions de minima qui se rattachent a la representation approximative des fonctions." Memoires de VAcademic Imperiale des Sciences de St.-Petersbourg. Sixieme Serie. Sciences Mathemathiques et Physiques 7, 199-291. (421] Chebyshev, P.JU [Tchebychef, PL.] ((Euvres). (Euvres de PL. Tchebychef. Reprint, Chelsea, New York 1961. Cheng, C.-S. (1978a). "Optimality of certain asymmetrical experimental designs." Annals of Statistics 6, 1239-1261. [4i3] Cheng, C.-S. (1978b). "Optimal designs for the elimination of multi-way heterogeneity." Annals of Statistics 6, 1262-1212. [426] Cheng, C.-S. (1981). "Optimality and construction of pseudo-Youden designs." Annals of Statistics 9, 201-205. [426] Cheng, C.-S. (1987). "An application of the Kiefer-Wolfowitz equivalence theorem to a problem in Hadamard transform optics." Annals of Statistics 15, 1593-1603. [426j Chernoff, H. (1953). "Locally optimal designs for estimating parameters." Annals of Mathematical Statistics 24, 586-602. [411, ui\ Chernoff, H. (1972). Sequential Analysis and Optimal Design. Society for Industrial and Applied Mathematics, Philadelphia, PA. (410,414] Christof, K. (1987), Optimale Blockpldne turn Vergleich von Kontroll- und Testbehandlungen. Dissertation, Universitat Augsburg, 99 pages. [412] Christof, K. and Pukelsheim, F. (1985). "Approximate design theory for a simple block design with random block effects." In Linear Statistical Inference. Proceedings of the International Conference, Poznan 1984 (Eds. T. Calirtski, W. Klonecki). Lecture Notes in Statistics 35, Springer, Berlin, 20-28. [422]


Conlisk, J. and Watts, H. (1979). "A model for optimizing experimental designs for estimating response surfaces." Journal of Econometrics 11, 27-42. (4201 Constantine, G.M. (1987). Combinatorial Theory and Statistical Design. Wiley, New York. [424] Constantine, K.B., Lim, Y.B. and Studden, W.J. (1987). "Admissible and optimal exact designs for polynomial regression." Journal of Statistical Planning and Inference 16, 15-32. 1422,425) Cook, R.D. and Nachtsheim, C.J. (1980). "A comparison of algorithms for constructing exact D-optimal designs." Technometrics 22, 315-324. |425i Cook, R.D. and Nachtsheim, C.J. (1982). "Model robust, linear-optimal designs." Technometrics 24, 49-54. (423] Cook, R.D. and Thibodeau, L.A. (1980). "Marginally restricted D-optimal designs." Journal of the American Statistical Association 75, 366-371. [4i9] Covey-Crump, P.A.K. and Silvey, S.D. (1970). "Optimal regression designs with previous observations." Biometrika 57, 551-566. (422] Cox, D.D. (1988). "Approximation of method of regularization estimators." Annals of Statistics 16, 694-712. [4i2] Cox, D.R. and Reid, N. (1987). "Parameter orthogonality and approximate conditional inference." Journal of the Royal Statistical Society Series B 49, 1-18. |4ii] Danzer, L., Griinbaum, B. and Klee, V. (1963). "Helly's theorem and its relatives." In Convexity. Proceedings of Symposia in Pure Mathematics 7 (Ed. V. Klee). American Mathematical Society, Providence, RI, 101-180. |4n] Danzer, L., Laugwitz, D. and Lenz, H. (1957). "Uber das LOWNERsche Ellipsoid und sein Analogon unter den einem Eikorper einbeschriebenen Ellipsoiden." Archiv der Mathematik 7, 214-219. (4.71 DasGupta, A. and Studden, W.J. (1991). "Robust Bayesian experimental designs in normal linear models." Annals of Statistics 19, 1244-1256. [4221 DasGupta, A., Mukhopadhyay, S. and Studden, W.J. (1992). "Compromise designs in heteroscedastic linear models." Journal of Statistical Planning and Inference 32, 363-384. [423] de la Garza, A. (1954). "Spacing of information in polynomial regression." Annals of Mathematical Statistics 25, 123-130. [4211 Dette, H. (1990). "A generalization of D- and D\ -optimal designs in polynomial regression." Annals of Statistics 18, 1784-1804. 1423] Dette, H. (1991a). "A note on robust designs for polynomial regression." Journal of Statistical Planning and Inference 28, 223-232. [423] Dette, H. (1991b). Geometric Characterizations of Model Robust Designs. Habilitationsschrift, Georg-August-Universitat Gottingen, 111 pages. [4io] Dette, H. (1992a). "Experimental designs for a class of weighted polynomial regression models." Computational Statistics and Data Analysis 14, 359-373. (423] Dette, H. (1992b). "Optimal designs for a class of polynomials of odd or even degree." Annals of Statistics 20, 238-259. (423) Dette, H. (1993a). "Elfving's theorem for D-optimality." Annals of Statistics 21, 753-766. 1423] Dette, H. (1993b). "A mixture of the D- and D(-optimally criterion in polynomial regression." Journal of Statistical Planning and Inference 35, 233-249. 1423] Dette, H. and Studden, W.J. (1992). "On a new characterization of the classical orthogonal polynomials." Journal of Approximation Theory 71, 3-17. (419] Dette, H. and Studden, W.J. (1993). "Geometry of E-optimality." Annals of Statistics 21, 416433. |420]


Diaconis, P. and Freedman, D. (1979). "On rounding percentages." Journal of the American Statistical Association 74, 359-364. [424] Donoghue, W.F., Jr. (1974). Monotone Matrix Functions and Analytic Continuation. Springer, Berlin. [408] Draper, N.R. and Guttman, I. (1988). "An index of rotatability." Technometrics 30, 105-111. [427] Draper, N.R. and Pukelsheim, F. (1990). "Another look at rotatability." Technometrics 32, 195-202. [427]
Draper, N.R., Gaffke, N. and Pukelsheim, F. (1991). "First and second order rotatability of experimental designs, moment matrices, and information surfaces." Metrika 38, 129-161. [425, 427]

Draper, N.R., Gaffke, N. and Pukelsheim, F. (1993). "Rotatability of variance surfaces and moment matrices." Journal of Statistical Planning and Inference 36, 347-356. (427] Eccleston, J.A. and Hedayat, A. (1974). "On the theory of connected designs: Characterization and optimality." Annals of Statistics 2, 1238-1255. |4zei Eccleston, J.A. and Kiefer, J. (1981). "Relationships of optimality for individual factors of a design." Journal of Statistical Planning and Inference 5, 213-219. 1426] Ehrenfeld, S, (1955). "On the efficiency of experimental designs." Annals of Mathematical Statistics 26, 247-255. H20] Ehrenfeld, S. (1956). "Complete class theorems in experimental designs." In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA 1954 and 1955, Volume 1 (Ed. J. Neyman). University of California, Berkeley, CA, 57-67. 1417.421) El-Krunz, S.M. and Studden, W.J. (1991). "Bayesian optimal designs for linear regression models." Annals of Statistics 19, 2183-2208. (422) Elfving, G. (1952). "Optimum allocation in linear regression theory." Annals of Mathematical Statistics 23, 255-262. (409,4io. 4i4,4i6,417.430] Elfving, G. (1959). "Design of linear experiments." In Probability and Statistics. The Harold Cramer Volume (Ed. U. Grenander). Almquist and Wiksell, Stockholm, 58-74. (413,421,422.425) Elfving, G. (1985). "Finnish Mathematical Statistics in the past." In Proceedings of the First International Tampere Seminar on Linear Statistical Models and their Applications, Tampere 1983 (Eds. T. Pukkiia, S. Puntanen). University of Tampere, Tampere, 3-8. (4301 Erdos, P. (1947). "Some remarks on polynomials." Bulletin of the American Mathematical Society 53, 1169-1176. 1420] Ermakov, S.M. (Ed.) (1983). Mathematical Theory of Experimental Planning. Nauka, Moscow (in Russian). [41 g| Farebrother, R.W. (1985). "The statistical estimation of the standard linear model, 1756-1853." In Proceedings of the First International Tampere Seminar on Linear Statistical Models and their Applications, Tampere 1983 (Eds. T. Pukkiia, S. Puntanen). University of Tampere, Tampere, 77-99. poo] Farrell, R.H., Kiefer, J. and Walbran, A. (1967). "Optimum multivariate designs." In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA 1965 and 1966, Volume 1 (Eds. L.M. Le Cam, J. Neyman). University of California, Berkeley, CA, 113-138. |4i8] Fedorov, V.V. (1972). Theory of Optimal Experiments. Academic Press, New York. (4is. 419,421.424] Fedorov, V. and Khabarov, V. (1986). "Duality of optimal designs for model discrimination and parameter estimation." Biometrika 73, 183-190. |4ii| Fedorov, V.V. and Malyutov, M.B. (1972). "Optimal designs in regression problems." Mathematische Operationsforschung und Statistik 3, 281-308. H16.423]


Fejer, L. (1932). "Bestimmung derjenigen Abszissen eines Intervalles, fur welche die Quadratsumme der Grundfunktionen der Lagrangeschen Interpolation im Intervalle ein moglichst kleines Maximum besitzt." Annali delta R. Scuola Normale Superiore di Pisa Serie II, Scienze Fisiche e Matematiche 1, 263-276. |4i8] Fellman, J. (1974). "On the allocation of linear observations." Societas Scientiarum Fennica, Commentationes Physico-Mathematicae 44, 27-78. [410.414,41?] Fellman, J. (1980). "On the behavior of the optimality criterion in the neighborhood of the optimal point." Working Paper 49, Swedish School of Economics and Business Administration, Helsinki, 15 pages. [424] Fellman, J. (1991). "Gustav Elfving and the emergence of the optimal design theory." Working Paper 218, Swedish School of Economics and Business Administration, Helsinki, 7 pages.
[430]

Fisher, R.A. (1940). "An examination of the different possible solutions of a problem in incomplete blocks." Annals of Eugenics 10, 52-75. \nk\ Gaffke, N. (1981). "Some classes of optimality criteria and optimal designs for complete two-way layouts." Annals of Statistics 9, 893-898. [413] Gaffke, N. (1982). Optimalitatskriterien und optimale Versuchspldne fur lineare Regressionsmodelle. Habilitationsschrift, Rheinisch-Westfalische Technische Hochschule Aachen, 127 pages.
(421,422]

Gaffke, N. (1985a). "Directional derivatives of optimality criteria at singular matrices in convex design theory." Statistics 16, 373-388. [411,415] Gaffke, N. (1985b). "Singular information matrices, directional derivatives, and subgradients in optimal design theory." In Linear Statistical Inference. Proceedings of the International Conference on Linear Inference, Poznan 1984 (Eds. T. Calinski, W. Klonecki). Lecture Notes in Statistics 35, Springer, Berlin, 61-77. [415] Gaffke, N. (1987a). "Further characterizations of design optimality and admissibility for partial parameter estimation in linear regression." Annals of Statistics 15, 942-957. [379,411,417,422] Gaffke, N. (1987b). "On D-optimality of exact linear regression designs with minimum support." Journal of Statistical Planning and Inference 15, 189-204. (424,425) Gaffke, N. and Krafft, O. (1977). "Optimum properties of Latin square designs and a matrix inequality." Mathematische Operationsforschung und Statistik Series Statistics 8, 345-350.
[209, 412[

Gaffke, N. and Krafft, O. (1979a). "Matrix inequalities in the Lowner ordering." In Modern Applied Mathematics: Optimization and Operations Research (Ed. B. Korte). North-Holland, Amsterdam, 595-622. [4i4] Gaffke, N. and Krafft, O. (1979b). "Optimum designs in complete two-way layouts." Journal of Statistical Planning and Inference 3, 119-126. [426] Gaffke, N. and Krafft, O. (1982). "Exact D-optimum designs for quadratic regression." Journal of the Royal Statistical Society Series B 44, 394-397. (424] Gaffke, N. and Pukelsheim, F. (1988). "Admissibility and optimality of experimental designs." In Model-Oriented Data Analysis. Proceedings of an International Institute for Applied Systems Analysis Workshop on Data Analysis, Eisenach 1987 (Eds. V Fedorov, H. Lauter). Lecture Notes in Economics and Mathematical Systems 297, Springer, Berlin, 37-43. [412,422] Galil, Z. and Kiefer, J. (1977). "Comparison of rotatable designs for regression on balls, I (quadratic)." Journal of Statistical Planning and Inference 1, 27-40. [427] Gauss, C.F. (Werke). Werke (Ed. Konigliche Gesellschaft der Wissenschaften zu Gottingen). Band IV (Gottingen 1873), Band VII (Leipzig 1906), Band X 1 (Leipzig 1917). [409] Giovagnoli, A. and Wynn, H.P. (1981). "Optimum continuous block designs." Proceedings of the Royal Society London Series A 377, 405-416. [412,425,426]


Giovagnoli, A. and Wynn, H.P. (1985a). "Schur-optimal continuous block designs for treatments with a control." In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Volume 2 (Eds. L.M. Le Cam, R.A. Olshen). Wadsworth, Belmont, CA, 651-666.
[412, 425)

Giovagnoli, A. and Wynn, H.P. (1985b). "G-majorization with applications to matrix orderings." Linear Algebra and Its Applications 67, 111-135. [425,426] Giovagnoli, A., Pukelsheim, F. and Wynn, H.P. (1987). "Group invariant orderings and experimental designs." Journal of Statistical Planning and Inference 17, 159-171. [425] Goller, H. (1986). "Shorted operators and rank decomposition matrices." Linear Algebra and Its Applications 81, 207-236. [412] Greub, W.H. (1967). Multilinear Algebra. Springer, Berlin. \m\ Gribik, PR. and Kortanek, K.O. (1977). "Equivalence theorems and cutting plane algorithms for a class of experimental design problems." SIAM Journal on Applied Mathematics 32, 232-259. I424J Gruber, P.M. (1988). "Minimal ellipsoids and their duals." Rendiconti del Circolo Matematico di Palermo Serie H 37, 35-64. |4n] Guest, P.G. (1958). "The spacing of observations in polynomial regression." Annals of Mathematical Statistics 29, 294-299. [245,418] Gutmair, S. (1990). Mischungen von Informationsfunktionen: Optimalitatstheorie undAnwendungen in der klassischen and Bayes'schen Versuchsplanung. Dissertation, Universitat Augsburg, 82 pages. [157,423) Guttman, I. (1971). "A remark on the optimal regression designs with previous observations of Covey-Crump and Silvey." Biometrika 58, 683-685. (4221 Hansen, O.H. and Torgersen, E.N. (1974). "Comparison of linear normal experiments." Annals of Statistics 2, 367-373. [412] Hardy, G.H., Littlewood, I.E. and P61ya, G. (1934). Inequalities. Cambridge University Press, Cambridge, UK. [413] Hedayat, A. (1981). "Study of optimality criteria in design of experiments." In Statistics and Related Topics. Proceedings of the Symposium, Ottawa 1980 (Eds. M. Csorgo, D.A. Dawson, J.N.K. Rao, A.K.Md.E. Saleh). North-Holland, Amsterdam, 39-56. [413] Hedayat, A.S. and Majumdar, D. (1985). "Combining experiments under Gauss-Markov models." Journal of the American Statistical Association 80, 698-703. [412] Hedayat, A. and Wallis, W.D. (1978). "Hadamard matrices and their application." Annals of Statistics 6, 1184-1238. [427] Hedayat, A.S., Jacroux, M. and Majumdar, D. (1988). "Optimal designs for comparing test treatments with a control." Statistical Science 3, 462-476. "Discussion." Ibidem, 477-491. [426] Heiligers, B. (1.988). Zulassige Versuchspldne in linearen Regressionsmodellen. Dissertation, Rheinisch-Westfalische Technische Hochschule Aachen, 194 pages. [350,422] Heiligers, B. (1991a). "Admissibility of experimental designs in linear regression with constant term." Journal of Statistical Planning and Inference 28, 107-123. 1422] Heiligers, B. (1991b). "A note on connectedness of block designs." Metrika 38, 377-381. [426] Heiligers, B. (1991c). E-optimal Polynomial Regression Designs. Habilitationsschrift, RheinischWestfalische Technische Hochschule Aachen, 88 pages. 1%, 246,420) Helmert, F.R. (1872). Die Ausgleichungsrechnung nach der Methode der Kleinsten Quadrate, mit Anwendungen auf die Geoddsie und die Theorie der Messinstrumente. Teubner, Leipzig. [427] Henderson, H.V. and Searle, S.R. (1981). "The vec-permutation matrix, the vec operator and Kronecker products: A review." Linear and Multilinear Algebra 9, 271-288. [427]


Henderson, H.V., Pukelsheim, F. and Searle, S.R. (1983). "On the history of the Kronecker product." Linear and Multilinear Algebra 14, 113-120. [42?] Herzberg, A.M. and Cox, D.R. (1969). "Recent work on the design of experiments: A bibliography and a review." Journal of the Royal Statistical Society Series A 132, 29-61. (4221 Hill, P.D.H. (1978a). "A note on the equivalence of D-optimal design measures for three rival linear models." Biometrika 65, 666-667. [421] Hill, P.D.H. (1978b). "A review of experimental design procedures for regression model discrimination." Technometrics 20, 15-21. [423) Hoang, T. and Seeger, A. (1991). "On conjugate functions, subgradients, and directional derivatives of a class of optimality criteria in experimental design." Statistics 22, 349-368. |4U] Hoel, P.O. (1958). "Efficiency problems in polynomial estimation." Annals of Mathematical Statistics 29, 1134-1145. [4i8| Hoel, P.O. (1965). "Minimax designs in two dimensional regression." Annals of Mathematical Statistics 36, 1097-1106. (406,421) Hoel, P.O. and Levine, A. (1964). "Optimal spacing and weighting in polynomial prediction." Annals of Mathematical Statistics 35, 1553-1560. [209,410,4i9| Holder, O. (1889). "Ueber einen Mittelwerthssatz." Nachrichten von der Koniglichen Gesellschaft der Wissenschaften und der Georg-Augusts-Universitat zu Gottingen 2, 38-47. [4U] Horn, R.A. and Johnson, C.R. (1985). Matrix Analysis. Cambridge University Press, Cambridge,
UK. [133, 414]

Hotelling, H. (1944). "Some improvements in weighing and other experimental techniques." Annals of Mathematical Statistics 15, 297-306. [42?] Humak, K.M.S. (1977). Statistische Methoden der Modellbildung, Band I. Statistische Inferenz fur lineare Parameter. Akademie-Verlag, Berlin. HIS. 4231 Hunter, W.G. and Reiner, A.M. (1965). "Designs for discriminating between two rival models." Technometrics 7, 307-323. H23) Jensen, S.T. (1988). "Covariance hypotheses which are linear in both the covariance and the inverse covariance." Annals of Statistics 16, 302-322. 1425] John, J.A. (1987). Cyclic Designs. Chapman and Hall, London. 197,424] John, P.W.M. (1964). "Balanced designs with unequal numbers of replicates." Annals of Mathematical Statistics 35, 897-899. [379] Kageyama, S. and Tsuji, T. (1980). "Characterization of equireplicated variance-balanced block designs." Annals of the Institute of Statistical Mathematics 32, 263-273. [426] Karlin, S. and Studden, W.J. (1966a). "Optimal experimental designs." Annals of Mathematical Statistics 37, 783-815. [4i6,4i8,421,422) Karlin, S. and Studden, W.J. (1966b). Tchebycheff Systems: With Applications in Analysis and Statistics. Interscience, New York. [4io, 4is, 421,422] Kempthorne, O. (1980). "The term design matrix." American Statistician 34, 249.
[409]

Khuri, A.I. (1988). "A measure of rotatability for response-surface designs." Technometrics 30, 95-104. [427] Khuri, A.I. and Cornell, J.A. (1987). Response Surfaces. Designs and Analyses. Dekker, New York. [427] Kiefer, J.C. (1958). "On the nonrandomized optimality and randomized nonoptimality of symmetrical designs." Annals of Mathematical Statistics 29, 675-699. [425] Kiefer, J.C. (1959). "Optimum experimental designs." Journal of the Royal Statistical Society Series B 21, 272-304. "Discussion on Dr Kiefer's paper." Ibidem, 304-319. [409, 412, 417, 421, 424, 425, 431]


Kiefer, J.C. (1960). "Optimum experimental designs V, with applications to systematic and rotatable designs." In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA 1960, Volume 1 (Ed. J. Neyman). University of California, Berkeley, CA, 381-405. psi. 413,424,426.427] Kiefer, J.C. (1961). "Optimum designs in regression problems, II." Annals of Mathematical Statistics 32, 298-325. piMiq Kiefer, J.C. (1962). "An extremum result." Canadian Journal of Mathematics 14, 597-601. [416] Kiefer, J.C. (1971). "The role of symmetry and approximation in exact design optimality." In Statistical Decision Theory and Related Topics. Proceedings of a Symposium, Purdue University 1970 (Eds. S.S. Gupta, J. Yackel). Academic Press, New York, 109-118. [424.425] Kiefer, J.C. (1974a). "General equivalence theory for optimum designs (approximate theory)." Annals of Statistics 2, 849-879. [4:2.4is, 420.43ij Kiefer, J.C. (1974b). "Lectures on design theory." Mimeograph Series No. 397, Department of Statistics, Purdue University, 52 pages. pis) Kiefer, J.C. (1975). "Construction and optimality of generalized Youden designs." In A Survey of Statistical Design and Linear Models (Ed. J.N. Srivastava). North-Holland, Amsterdam,
333-353. (421,425]

Kiefer, J.C. (1981). "The interplay of optimality and combinatorics in experimental design." Canadian Journal of Statistics 9, 1-10. 1425] Kiefer, J.C. (CP). Collected Papers (Eds. L.D. Brown, I. Olkin, J. Sacks, H.P. Wynn). Springer, New York 1985. [431] Kiefer, J.C. and Studden, W.J. (1976). "Optimal designs for large degree polynomial regression." Annals of Statistics 4, 1113-1123. Hi9] Kiefer, J.C. and Wolfowitz, J. (1959). "Optimum designs in regression problems." Annals of Mathematical Statistics 30, 271-294. p46,410,4is. 419,425) Kiefer, J.C. and Wolfowitz, J. (1960). "The equivalence of two extremum problems." Canadian Journal of Mathematics 12, 363-366. [4i5.4i6.4i8] Kiefer, J.C. and Wolfowitz, J. (1965). "On a theorem of Hoel and Levine on extrapolation designs." Annals of Mathematical Statistics 36, 1627-1655. [267.410, 9] Kiefer, J.C. and Wyflfi, H.P. (1984). "Optimum and minimax exact treatment designs for onedimensional autoregressive error processes." Annals of Statistics 12, 431-450. [431] Kitsos, C.P., Titterington, D.M. and Torsney, B. (1988). "An optimal design problem in rhythmometry." Biometrics 44, 657-671. [417] Krafft, O. (1978). Lineare statistische Modelle und optimale Versuchsplane. Vandenhoeck und Ruprecht, Gottingen. [406,413.418,421,426] Krafft, O. (1981). "Dual optimization problems in stochastics." Jahresbericht der Deutschen Mathematiker-Vereinigung 83, 97-105. HI?) Krafft, O. (1983)* "A matrix optimization problem." Linear Algebra and Its Applications 51, 137-142. [408] Krafft, O. (1990). "Some matrix representations occurring in linear two-factor models." In Probability, Statistics and Design of Experiments. Proceedings of the R.C. Base Memorial Conference, Delhi 1988 (Ed. R.R. Bahadur). Wiley Eastern, New Delhi, 461-470. [113] Krafft, O. and Schaefer, M. (1992). "D-optimal designs for a multivariate regression model." Journal of Multivariate Analysis 42, 130-140. pig] Krein, M. (1947); "The theory of self-adjoint extensions of semibounded Hermitian transformations and its applications. I." Matematicheskii Sbornik 20(62), 431-495 (in Russian). |4iij Kunert, J. (1991). "Cross-over designs for two treatments and correlated errors." Biometrika 78, 315-324. [4i2]


Kunert, J. and Martin, R.J. (1987). "On the optimality of finite Williams II(a) designs." Annals of Statistics 15, 1604-1628. [412] Kurotschka, V. (1971). "Optimale Versuchsplane bei zweifach klassifizierten Beobachtungsmodellen." Metrika 17, 215-232. [412] Kurotschka, V. (1978). "Optimal design of complex experiments with qualitative factors of influence." Communications in Statistics, Theory and Methods A7, 1363-1378. [412] LaMotte, L.R. (1977). "A canonical form for the general linear model." Annals of Statistics 5, 787-789. [412] Lau, T.-S. (1988). "D-optimal designs on the unit g-ball." Journal of Statistical Planning and Inference 19, 299-315. [423] Lau, T.-S. and Studden, W.J. (1985). "Optimal designs for trigonometric and polynomial regression using canonical moments." Annals of Statistics 13, 383-394. [423] Lauter, E. (1974). "Experimental design in a class of models." Mathematische Operationsforschung und Statistik 5, 379-398. [423] Lauter, E. (1976). "Optimal multipurpose designs for regression models." Mathematische Operationsforschung und Statistik 7, 51-68. [w, 423] Lee, C.M.-S. (1987). "Constrained optimal designs for regression models." Communications in Statistics, Theory and Methods 16, 765-783. [423] Lee, C.M.-S. (1988). "Constrained optimal designs." Journal of Statistical Planning and Inference 18, 377-389. [423] Legendre, A.M. (1806). Nouvelles Methodes pour la Determination des Orbites des Cometes. Courcier, Paris. English translation of the appendix "Sur la methode des moindres quarres" in: D.E. Smith, A Source Book in Mathematics, Volume 2. Dover, New York 1959, 576-579.
[409]

Lim, Y.B. and Studden, W.J. (1988). "Efficient Ds-optimal design for multivariate polynomial regression on the q-cube." Annals of Statistics 16, 1225-1240. [423] Lindley, D.V. (1956). "On a measure of the information provided by an experiment." Annals of Mathematical Statistics 27, 986-1005. [414] Lindley, D.V. and Smith, A.F.M. (1972). "Bayes estimates for the linear model." Journal of the Royal Statistical Society Series B 34, 1-18. "Discussion on the paper by Professor Lindley and Dr Smith." Ibidem, 18-41. [422] Loewner, C. [Löwner, K.] (1934). "Über monotone Matrixfunktionen." Mathematische Zeitschrift 38, 177-216. [408] Loewner, C. [Löwner, K.] (1939). "Grundzüge einer Inhaltslehre im Hilbertschen Raume." Annals of Mathematics 40, 816-833. [417] Loewner, C. [Löwner, K.] (CP). Collected Papers (Ed. L. Bers). Birkhäuser, Basel 1988. [428]

Magnus, J.R. (1987). "A representation theorem for (tr A^p)^{1/p}." Linear Algebra and Its Applications 95, 127-134. [414] Mäkeläinen, T. (1990). "Gustav Elfving 1908-1984." Address presented to the International Workshop on Linear Models, Experimental Design, and Related Matrix Theory, Tampere 1990. [430]

Markoff, W. (1916). "Über Polynome, die in einem gegebenen Intervalle möglichst wenig von Null abweichen." Vorwort von Serge Bernstein in Charkow. Mathematische Annalen 77, 213-258. [421] Markov, A.A. (1912). Wahrscheinlichkeitsrechnung. Teubner, Leipzig. [409] Marshall, A.W. and Olkin, I. (1969). "Norms and inequalities for condition numbers, II." Linear Algebra and Its Applications 2, 167-172. [134, 414]


Marshall, A.W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Academic Press, New York, [4os,4i3] Mathias, R. (1990). "Concavity of monotone matrix functions of finite order." Linear and Multilinear Algebra 27, 129-138. Ii56] McFadden, D. (1978). "Cost, revenue, and profit functions." In Production Economics: A Dual Approach to Theory and Applications, Volume I (Eds. M. Fuss, D. McFadden). NorthHolland, Amsterdam, 3-109. [413] Mikaeili, F. (1988). "Allocation of measurements in experiments with mixtures." Keio Science and Technology Reports 41, 25-37. 1426) Milliken, G.A. and Akdeniz, F. (1977). "A theorem on the difference of the generalized inverses of two nonnegative matrices." Communications in Statistics, Theory and Methods A6, 73-79.
[209]

Mitra, S.K. and Puri, M.L. (1979). "Shorted operators and generalized inverses of matrices." Linear Algebra and Its Applications 25, 45-56. 1412] Mosteller, F, Youtz, C. and Zahn, D. (1967). "The distribution of sums of rounded percentages." Demography 4, 850-858. [424] Miiller-Funk, U., Pukelsheim, F. and Witting, H. (1985). "On the duality between locally optimal tests and optimal experimental designs." Linear Algebra and Its Applications 67, 19-34. (4i3j Murty, V.N. (1971), "Optimal designs with a Tchebycheffian spline regression function." Annals of Mathematical Statistics 42, 643-649. (420) Myers, R.H. (1971). Response Surface Methodology. Allyn and Bacon, Boston, MA. [42?] Myers, R.H., Khuri, A.I. and Carter, W.H., Jr. (1989). "Response surface methodology: 19661988." Technometrics 31, 137-157. [42?] Myers, R.H., Vining, G.G., Giovannitti-Jensen, A. and Myers, S.L. (1992). "Variance dispersion properties of second-order response surface designs." Journal of Quality Technology 24, I'll. |427] Nachtsheim, C.J. (1989). "On the design of experiments in the presence of fixed covariates." Journal of Statistical Planning and Inference 22, 203-212. [419] Nalimov, V.V. (1974). "Systematization and codification of the experimental designsThe survey of the works of Soviet statisticians." In Progress in Statistics. European Meeting of Statisticians, Budapest 1972, Volume 2 (Eds. J. Gani, K. Sarkadi, I. Vincze). Colloquia Mathematica Societatis Janos Bolyai 9, North-Holland, Amsterdam, 565-581. [4n] Natanson, I.P. (1955). Konstruktive Funktionentheorie. Akademie-Verlag, Berlin. [421] Nigam, A.K., Puri, P.D. and Gupta, V.K. (1988). Characterizations and Analysis of Block Designs. Wiley Eastern, New Delhi. [426] Nordstrom, K. (1991). "The concentration ellipsoid of a random vector revisited." Econometric Theory 7, 397-403. HB] Ouellette, D.V. (1981). "Schur complements and statistics." Linear Algebra and Its Applications 36, 187-295. Hm Pazman, A. (1980). "Singular experimental design (standard and Hilbert-space approaches)." Mathematischs Operationsforschung und Statistik Series Statistics 11, 137-149. HH] Pazman, A. (1986). Foundations of Optimum Experimental Design. Reidel, Dordrecht. [409.412,418] Pazman, A. (1990). "Small-Sample distributional properties of nonlinear regression estimators (a geometric approach)." Statistics 21, 323-346. "Discussion." Ibidem, 346-367. |4ii] Pilz, J. (1979). "Optimalitatskriterien, Zulassigkeit und Vollstandigkeit im Planungsproblem fur eine bayessche Schatzung im linearen Regressionsmodell." Freiberger Forschungshefte D117, 67-94. [42i]


Pilz, J. (1991). Bayesian Estimation and Experimental Design in Linear Regression Models. Wiley, New York. [422]
Plackett, R.L. (1949). "A historical note on the method of least squares." Biometrika 36, 458-460. [409]

Plackett, R.L. (1972). "Studies in the history of probability and statistics XXIX: The discovery of the method of least squares." Biometrika 59, 239-251. [409]
Plackett, R.L. and Burman, J.P. (1946). "The design of optimum multifactorial experiments." Biometrika 33, 305-325. [427]
Preece, D.A. (1982). "Balance and designs: Another terminological tangle." Utilitas Mathematica 21C, 85-186. [426]
Preitschopf, F. (1989). Bestimmung optimaler Versuchspläne in der polynomialen Regression. Dissertation, Universität Augsburg, 152 pages. [421]
Preitschopf, F. and Pukelsheim, F. (1987). "Optimal designs for quadratic regression." Journal of Statistical Planning and Inference 16, 213-218. [417]
Pukelsheim, F. (1980). "On linear regression designs which maximize information." Journal of Statistical Planning and Inference 4, 339-364. [157, 410, 411, 415-418, 422]
Pukelsheim, F. (1981). "On c-optimal design measures." Mathematische Operationsforschung und Statistik Series Statistics 12, 13-20. [186, 410, 411, 416]
Pukelsheim, F. (1983a). "On optimality properties of simple block designs in the approximate design theory." Journal of Statistical Planning and Inference 8, 193-208. [406, 412, 418, 426]
Pukelsheim, F. (1983b). "On information functions and their polars." Journal of Optimization Theory and Applications 41, 533-546. [413]
Pukelsheim, F. (1983c). "Optimal designs for linear regression." In Recent Trends in Statistics. Proceedings of the Anglo-German Statistical Meeting, Dortmund 1982 (Ed. S. Heiler). Allgemeines Statistisches Archiv, Sonderheft 21, Vandenhoeck und Ruprecht, Göttingen, 32-39. [412, 426]

Pukelsheim, F. (1986). "Approximate theory of multiway block designs." Canadian Journal of Statistics 14, 339-346. [380, 426]
Pukelsheim, F. (1987a). "Information increasing orderings in experimental design theory." International Statistical Review 55, 203-219. [413, 426]
Pukelsheim, F. (1987b). "Ordering experimental designs." In Proceedings of the First World Congress of the Bernoulli Society, Tashkent 1986, Volume 2 (Eds. Yu.A. Prohorov, V.V. Sazonov). VNU Science Press, Utrecht, 157-165. [426]
Pukelsheim, F. (1987c). "Majorization orderings for linear regression designs." In Proceedings of the Second International Tampere Conference in Statistics, Tampere 1987 (Eds. T. Pukkila, S. Puntanen). Department of Mathematical Sciences, Tampere, 261-274. [426]
Pukelsheim, F. (1989). "Complete class results for linear regression designs over the multidimensional cube." In Contributions to Probability and Statistics. Essays in Honor of Ingram Olkin (Eds. L.J. Gleser, M.D. Perlman, S.J. Press, A.R. Sampson). Springer, New York, 349-356. [426]

Pukelsheim, F. (1990). "Information matrices in experimental design theory." In Probability, Statistics and Design of Experiments. Proceedings of the R.C. Bose Memorial Conference, Delhi 1988 (Ed. R.R. Bahadur). Wiley Eastern, New Delhi, 607-618. [411]
Pukelsheim, F. and Rieder, S. (1992). "Efficient rounding of approximate designs." Biometrika 79, 763-770. [424]
Pukelsheim, F. and Rosenberger, J.L. (1993). "Experimental designs for model discrimination." Journal of the American Statistical Association 88, 642-649. [423]
Pukelsheim, F. and Studden, W.J. (1993). "E-optimal designs for polynomial regression." Annals of Statistics 21, 402-415. [417, 420]


Pukelsheim, F. and Styan, G.P.H. (1983). "Convexity and monotonicity properties of dispersion matrices of estimators in linear models." Scandinavian Journal of Statistics 10, 145-149. [411, 412]

Pukelsheim, F. and Titterington, D.M. (1983). "General differential and Lagrangian theory for optimal experimental design." Annals of Statistics 11, 1060-1068. [415, 416, 424]
Pukelsheim, F. and Titterington, D.M. (1986). "Improving multi-way block designs at the cost of nuisance parameters." Statistics & Probability Letters 4, 261-264. [426]
Pukelsheim, F. and Titterington, D.M. (1987). "On the construction of approximate multi-factor designs from given marginals using the Iterative Proportional Fitting Procedure." Metrika 34, 201-210. [426]
Pukelsheim, F. and Torsney, B. (1991). "Optimal weights for experimental designs on linearly independent support points." Annals of Statistics 19, 1614-1625. [417]
Raghavarao, D. (1971). Constructions and Combinatorial Problems in Design of Experiments. Wiley, New York. [424, 426]
Raghavarao, D. and Federer, W.T. (1975). "On connectedness in two-way elimination of heterogeneity designs." Annals of Statistics 3, 730-735. [426]
Raktoe, B.L., Hedayat, A. and Federer, W.T. (1981). Factorial Designs. Wiley, New York. [424]
Rao, C.R. (1967). "Least squares theory using an estimated dispersion matrix and its application to measurement of signals." In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA 1965 and 1966, Volume 1 (Eds. L.M. Le Cam, J. Neyman). University of California, Berkeley, CA, 355-372. [96, 408]
Rao, C.R. and Mitra, S.K. (1971). Generalized Inverse of Matrices and Its Applications. Wiley, New York. [408]
Rao, V.R. (1958). "A note on balanced designs." Annals of Mathematical Statistics 29, 290-294. [379]

Rasch, D. and Herrendörfer, G. (1982). Statistische Versuchsplanung. VEB Deutscher Verlag der Wissenschaften, Berlin. [426]
Rieder, S. (1990). Versuchspläne mit vorgegebenen Trägerpunkten. Diplomarbeit, Universität Augsburg, 73 pages. [424]
Rivlin, T.J. (1990). Chebyshev Polynomials. From Approximation Theory to Algebra and Number Theory, Second Edition. Wiley-Interscience, New York. [420, 421]
Rockafellar, R.T. (1967). "Monotone processes of convex and concave type." Memoirs of the American Mathematical Society 77, 1-74. [413]
Rockafellar, R.T. (1970). Convex Analysis. Princeton University Press, Princeton, NJ. [409, 410, 413, 414, 416, 417, 422, 423]

Rogers, L.J. (1888). "An extension of a certain theorem in inequalities." Messenger of Mathematics 17, 145-150. [413]
Salaevskii, O.V. (1966). "The problem of the distribution of observations in polynomial regression." Proceedings of the Steklov Institute of Mathematics 79, 147-166. [424]
Schatten, R. (1950). A Theory of Cross-Spaces. Princeton University Press, Princeton, NJ. [414]
Schoenberg, I.J. (1959). "On the maxima of certain Hankel determinants and the zeros of the classical orthogonal polynomials." Indagationes Mathematicae 21, 282-290. [418]
Schur, I. (1911). "Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen." Journal für die reine und angewandte Mathematik 140, 1-28. [418]
Schur, I. (1918). "Über die Verteilung der Wurzeln bei gewissen algebraischen Gleichungen mit ganzzahligen Koeffizienten." Mathematische Zeitschrift 1, 377-402. [418]


Schur, I. (GA). Issai Schur Gesammelte Abhandlungen (Eds. A. Brauer, H. Rohrbach). Springer, Berlin 1973.
Searle, S.R. (1971). Linear Models. Wiley, New York. [408]
Seely, J.F. (1971). "Quadratic subspaces and completeness." Annals of Mathematical Statistics 42, 710-721. [425]
Shah, K.R. and Sinha, B[ikas].K. (1989). Theory of Optimal Designs. Lecture Notes in Statistics 54, Springer, New York. [424, 426]
Shewry, M.C. and Wynn, H.P. (1987). "Maximum entropy sampling." Journal of Applied Statistics 14, 165-170. [415]
Sheynin, O.B. (1989). "A.A. Markov's work on probability." Archive for History of Exact Sciences 39, 337-377. [409]
Sibson, R. (1974). "DA-optimality and duality." In Progress in Statistics. European Meeting of Statisticians, Budapest 1972, Volume 2 (Eds. J. Gani, K. Sarkadi, I. Vincze). Colloquia Mathematica Societatis Janos Bolyai 9, North-Holland, Amsterdam, 677-692. [413, 415]
Sibson, R. and Kenny, A. (1975). "Coefficients in D-optimal experimental design." Journal of the Royal Statistical Society Series B 37, 288-292. [415]
Silverman, B.W. and Titterington, D.M. (1980). "Minimum covering ellipses." SIAM Journal on Scientific and Statistical Computing 1, 401-409. [417]
Silvey, S.D. (1978). "Optimal design measures with singular information matrices." Biometrika 65, 553-559. [411]
Silvey, S.D. (1980). Optimal Design. Chapman and Hall, London. [411]
Silvey, S.D. and Titterington, D.M. (1973). "A geometric approach to optimal design theory." Biometrika 60, 21-32. [417, 418]
Silvey, S.D. and Titterington, D.M. (1974). "A Lagrangian approach to optimal design." Biometrika 61, 299-302. [421]
Sinha, B[ikas].K. (1982). "On complete classes of experiments for certain invariant problems of linear inference." Journal of Statistical Planning and Inference 7, 171-180. [425]
Sinha, B[imal].K. (1970). "A Bayesian approach to optimum allocation in regression problems." Calcutta Statistical Association Bulletin 19, 45-52. [422]
Smith, K. (1918). "On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations." Biometrika 12, 1-85. [418]
St. John, R.C. and Draper, N.R. (1975). "D-optimality for regression designs: A review." Technometrics 17, 15-23. [418]
Steinberg, D.M. and Hunter, W.G. (1984). "Experimental design: Review and comment." Technometrics 26, 71-97. [422]
Stępniak, C. (1989). "Stochastic ordering and Schur-convex functions in comparison of linear experiments." Metrika 36, 291-298. [412]
Stępniak, C., Wang, S.-G. and Wu, C.F.J. (1984). "Comparison of linear experiments with known covariances." Annals of Statistics 12, 358-365. [412]
Stigler, S.M. (1971). "Optimal experimental design for polynomial regression." Journal of the American Statistical Association 66, 311-318. [423]
Stigler, S.M. (1986). The History of Statistics. The Measurement of Uncertainty before 1900. Belknap Press, Cambridge, MA. [409]
Stone, M. (1959). "Application of a measure of information to the design and comparison of regression experiments." Annals of Mathematical Statistics 30, 55-70. [411]


Studden, W.J. (1968). "Optimal designs on Tchebycheff points." Annals of Mathematical Statistics 39, 1435-1447. [59, 410, 419, 420]
Studden, W.J. (1971). "Elfving's Theorem and optimal designs for quadratic loss." Annals of Mathematical Statistics 42, 1613-1621. [410, 417]
Studden, W.J. (1977). "Optimal designs for integrated variance in polynomial regression." In Statistical Decision Theory and Related Topics II. Proceedings of a Symposium, Purdue University 1976 (Eds. S.S. Gupta, D.S. Moore). Academic Press, New York, 411-420. [419]
Studden, W.J. (1978). "Designs for large degree polynomial regression." Communications in Statistics, Theory and Methods A7, 1391-1397. [9]
Studden, W.J. (1980). "Ds-optimal designs for polynomial regression using continued fractions." Annals of Statistics 8, 1132-1141. [423]
Studden, W.J. (1982). "Some robust-type D-optimal designs in polynomial regression." Journal of the American Statistical Association 77, 916-921. [423]
Studden, W.J. (1989). "Note on some φp-optimal designs for polynomial regression." Annals of Statistics 17, 618-623. [417]
Studden, W.J. and VanArman, D.J. (1969). "Admissible designs for polynomial spline regression." Annals of Mathematical Statistics 40, 1557-1569. [422]
Styan, G.P.H. (1973). "Hadamard products and multivariate statistical analysis." Linear Algebra and Its Applications 6, 217-240. [418]
Styan, G.P.H. (1985). "Schur complements and linear statistical models." In Proceedings of the First International Tampere Seminar on Linear Statistical Models and their Applications, Tampere 1983 (Eds. T. Pukkila, S. Puntanen). University of Tampere, Tampere, 37-75. [411]
Szegő, G. (1939). Orthogonal Polynomials, Fourth Edition 1975. American Mathematical Society, Providence, RI. [418]
Titterington, D.M. (1975). "Optimal design: Some geometrical aspects of D-optimality." Biometrika 62, 313-320. [421]
Titterington, D.M. (1980a). "Geometric approaches to design of experiment." Mathematische Operationsforschung und Statistik Series Statistics 11, 151-163. [421]
Titterington, D.M. (1980b). "Aspects of optimal design in dynamic systems." Technometrics 22, 287-299. [413]
Torsney, B. (1981). Algorithms for a Constrained Optimization Problem with Applications in Statistics and Optimal Design. Dissertation, University of Glasgow, 336 pages. [417]
Tyagi, B.N. (1979). "On a class of variance balanced block designs." Journal of Statistical Planning and Inference 3, 333-336. [379]
Vila, J.P. (1991). "Local optimality of replications from a minimal D-optimal design in regression: A sufficient and a quasi-necessary condition." Journal of Statistical Planning and Inference 29, 261-277. [423]
Voltaire [Arouet, F.M.] (Œuvres). Œuvres Complètes de Voltaire. Tome 17, Dictionnaire Philosophique I (Paris 1878). Tome 21, Romans (Paris 1879). Reprint, Kraus, Nendeln, Liechtenstein 1967.
von Neumann, J. (1937). "Some matrix-inequalities and metrization of matric-space." Mitteilungen des Forschungsinstituts für Mathematik und Mechanik an der Kujbyschew-Universität Tomsk 1, 286-300. [414]
von Neumann, J. (CW). John von Neumann Collected Works (Ed. A.H. Taub). Pergamon Press, Oxford 1962.
Wald, A. (1943). "On the efficient design of statistical investigations." Annals of Mathematical Statistics 14, 134-140. [412, 413]


Welch, W.J. (1982). "Branch-and-bound search for experimental designs based on D-optimality and other criteria." Technometrics 24, 41-48. [422]
Whittle, P. (1973). "Some general points in the theory of optimal experimental design." Journal of the Royal Statistical Society Series B 35, 123-130. [59, 414]
Wierich, W. (1986). "On optimal designs and complete class theorems for experiments with continuous and discrete factors of influence." Journal of Statistical Planning and Inference 15, 19-27. [422]
Witting, H. (1985). Mathematische Statistik I. Parametrische Verfahren bei festem Stichprobenumfang. Teubner, Stuttgart. [410]
Wussing, H. and Arnold, W. (1975). Biographien bedeutender Mathematiker. Volk und Wissen, Berlin. [414]
Wynn, H.P. (1972). "Results in the theory and construction of D-optimum experimental designs." Journal of the Royal Statistical Society Series B 34, 133-147. "Discussion of Dr Wynn's and of Dr Laycock's papers." Ibidem, 170-186. [417]
Wynn, H.P. (1977). "Optimum designs for finite population sampling." In Statistical Decision Theory and Related Topics II. Proceedings of a Symposium, Purdue University 1976 (Eds. S.S. Gupta, D.S. Moore). Academic Press, New York, 471-478. [422]
Wynn, H.P. (1982). "Optimum submeasures with application to finite population sampling." In Statistical Decision Theory and Related Topics III. Proceedings of the Third Purdue Symposium, Purdue University 1981, Volume 2 (Eds. S.S. Gupta, J.O. Berger). Academic Press, New York, 485-495. [422]
Zehfuss, G. (1858). "Über eine gewisse Determinante." Zeitschrift für Mathematik und Physik 3, 298-301. [427]
Zyskind, G. (1967). "On canonical forms, non-negative covariance matrices and best and simple least squares linear estimators in linear models." Annals of Mathematical Statistics 38, 1092-1109. [209]

Subject Index

A bold face number refers to the page where the item is introduced.

A-criterion, see trace criterion
absolute continuity, 248, 305, 368
admissibility, 57, 252, 262, 265, 403, 417; of a design, 247, 253, 421; of a moment matrix, 247, 256, 422; of an information matrix, 262, 264, 422
antisymmetry, 12, 145, 353, 356
antitonicity, 13, 89
arcsin distribution on [-1;1], 217, 246, 419
arcsin support design, 209, 217, 223, 230, 238, 241, 281, 419
arcsin support set, 281, 294, 300
arithmetic mean φ1, 140, 292
average-variance criterion φ-1, 135, 137, 140, 153, 197, 221, 223, 241, 413, 419
averaging matrix Ja = 1a1a'/a, 88, 347, 366
balanced incomplete block design, 138, 366, 378, 426
balancedness, 353, 417, 425
balancing operator A ↦ Ā, 193, 348, 369
Bayes design problem, 275, 278, 422
Bayes estimator, 270, 273
Bayes linear model, 269, 272
Bayes moment matrix Mα = (1 - α)M0 + αM, 271, 275
bijective (i.e. one-to-one and onto), 124, 335
Birkhoff theorem, 144, 413
block design W, 30, 94, 100, 362, 365
block matrix, 9, 75, 392

blocksize vector s, 31, 94, 426
bounded away from the origin, 120
C-matrix, see contrast information matrix
canonical moments, 417, 419, 423
Carathéodory theorem, 188, 360, 417
Cauchy inequality, 56, 235
centered contrasts Kaα, 88, 93, 97, 105, 113, 206, 363, 366, 411
centering matrix Ka, 88, 347, 366
central composite design, 400, 402, 427
Chebyshev coefficient vector c, 227, 233, 237, 420
Chebyshev indices d, d-2, ..., d-2⌊d/2⌋, 233
Chebyshev points, see arcsin support design
Chebyshev polynomial Td(t), 226, 238, 246, 420
classical linear model E[Y] = Xθ, D[Y] = σ²In, 4, 16, 24, 36, 62, 72, 382; with normality assumption Y ~ Nn(Xθ, σ²In), 67, 72
coefficient matrix K, 19, 47, 61, 206, 410; rank deficient, 88, 205, 364, 371, 404
coefficient vector c, 36, 47, 54
column sum vector, see blocksize vector
complete class, 324, 334, 374, 403, 417
completely symmetric matrix, 34, 345, 347
complexity, of computing an information matrix, 73, 76, 82; of model versus inferential simplicity, 381; of the design problem, 284



componentwise partial ordering ≥, 285, 333, 375, 404
concavity, 115, 119
concentration ellipsoid, 413
concurrence matrix of treatments NN', 367, 426
confidence ellipsoid, 96
congruence action (Q, A) ↦ QAQ', 337, 425
conjugate numbers p + q = pq, 147, 261
connectedness, 426
contrast information matrix C, 94, 105, 262, 412, 422
convex hull conv S, 29, 43, 253, 352
convex set, 44, 191, 252
covariance adjustment, 96, 408
cylinder, 44, 50, 57, 259, 417
D-criterion, see determinant criterion
design ξ ∈ Ξ, τ ∈ T, 5, 26, 304, 390; for extrapolation, 410, 419; for model discrimination, 279; for sample size n, 25, 304; optimal for c'θ in Ξ, 50, 197; φ-optimal for K'θ in Ξ, 131, 187; standardized by sample size, 305; with a guaranteed efficiency, 296, 423; with bounded weights, 278, 422; with protected runs, 277, 422
design matrix, 29, 31, 409
design problem, 131, 152, 170, 275, 284, 331, 342, 413, 425; for a scalar parameter system c'θ, 41, 63, 82, 108; for sample size n, 25, 304
design sequence, 409, 419
design set, T, on the experimental domain T, 27, 32, 213; T(τ) ⊂ T, cross section, 105, 363; Ξ, on the regression range X, 26, 32, 410; Ξn, for sample size n, 26, 304, 311; Ξn/n ⊂ Ξ, standardized by sample size n, 305, 311
determinant criterion φ0, 119, 136, 140, 153, 195, 213, 217, 285, 293, 320, 344, 353, 356, 413, 418, 423, 425; finite sample size optimality, 322, 325, 328
diagonal operator Δ, 8, 31, 94, 142, 146, 345, 413
diagonality of a square matrix, 142
direct sum decomposition, 23, 52
directional derivative, 415, 423
discrepancy, 308
discrete optimization problem, 26, 320
dispersion formula, 73, 411
dispersion matrix, 25, 102, 205, 395
dispersion maximum d(A), 211
dispersion ordering and information ordering, 91
doubly stochastic matrix, 144
dual problem, 47, 171, 231
duality gap, 47, 184
duality theorem, 172, 415, 417
E-criterion, see smallest-eigenvalue criterion
efficiency, 113, 132, 221, 223, 240, 292, 296
efficiency bound, 310, 312, 320, 424; optimal, 311
efficiency loss, 314, 317, 321
efficient design apportionment, 27, 221, 223, 237, 308, 312, 320, 328, 424
efficient rounding, 308
eigenvalue, 8, 24, 56, 140, 146, 182, 204, 333, 375, 405
eigenvalue decomposition, 8, 24, 375, 401
Elfving norm ρ, 47, 59, 121, 183
Elfving set conv(X ∪ (-X)), 43, 107, 134, 191, 231, 420
Elfving theorem, 50, 107, 182, 190, 198, 212, 231, 239, 259, 410, 422
equispaced support, 217, 223, 230, 238, 241, 281, 419
equivalence theorem [over M(Ξ)], 176; for the parameter vector θ, 177; of Kiefer-Wolfowitz, 212, 222, 313, 324, 418; under a matrix mean φp, 180; under a scalar criterion c'θ, 52; under the smallest-eigenvalue criterion φ-∞, 181

equivariance, 336, 340, 387, 395, 425
equivariance group {LQK : Q ∈ Q}, 343, 357, 363
estimability, 19, 36, 41, 64, 72
estimated response surface t ↦ f(t)'θ̂, 382
estimation and testing problems, 4, 340
Euclidean scalar product, 107, 394; of column vectors x'y, 2, 72, 141, 344

Euclidean scalar product (continued): of rectangular matrices trace A'B, 8, 125, 141
Euclidean space, of column vectors Rk, 2; of rectangular matrices Rn×k, 8; of symmetric matrices Sym(k), 8
exchangeable dispersion structure, 34
existence theorem, 104, 109, 174
experimental conditions t ∈ T, 1
experimental domain T, 1, 335
experimental run, 3, 27, 278
extrapolation design, 410, 419
F-test, 67, 89, 96, 413
factorial design, 390, 402
feasibility and formal optimality, 132, 138
feasibility cone A(c), A(K), 36, 63, 67, 82, 351, 411; and rank of information matrices, 81
Fenchel duality, 415
first-degree model, 6, 192
Fisher inequality a ≤ b, 367, 426
Fisher information, 72, 411
formal optimality, 131, 174
frequency count, 25, 304
full rank reduction, 160, 188
G-criterion, see global criterion
Gauss-Markov theorem, 13, 20, 34, 51, 62, 66, 89, 408; for the parameter vector θ, 22; under a range inclusion condition, 21
general equivalence theorem [over M], 17, 52, 140, 175, 177, 282, 415; differentiability proof, 179; for a mixture of models or criteria, 286, 290; for Bayes designs, 276; for designs with bounded weights, 278; for guaranteed efficiency designs, 297; for the parameter vector θ, 176; under a matrix mean φp, 178; under a scalar criterion c'θ, 111, 412; under the Loewner ordering ≥, 103; under the smallest-eigenvalue criterion φ-∞, 180
general linear group GL(k), 336
general linear model E[Y] = Xθ, D[Y] = σ²V, 18, 72
generalized information matrix AK, 89, 94, 165, 411, 415; four formulas, 92


generalized information matrix mapping A ↦ AK, 92
generalized matrix inverse AGA = A, 16, 88, 408; set A⁻ = {G : AGA = A}, 16, 159
geometric mean φ0, 140, 292
geometry, of a feasibility cone A(K), 40; of the closed cone of nonnegative definite matrices NND(k), 10; of the moment set M2d-1(T), 252; of the penumbra P, 108; of the set M(Ξ) of all moment matrices, 29; of the set of admissible moment matrices, 258
global criterion g, 211, 245, 313, 418
grand assumption M ∩ PD(k) ≠ ∅, 98, 108, 160, 170; for mixture models, 286
group of unimodular matrices Unim(s), 344, 353, 361
Hadamard matrix, 391, 427
Hadamard product A ∗ B, 199, 418
harmonic mean φ-1, 140, 292
heritability of invariance, 358
homomorphism h : Q → GL(s), 338, 342
Hölder inequality, 126, 147, 413
I-criterion, 419
ice-cream cone NND(2), 38, 354, 410
idempotency, 17, 127
identifiability, 72, 81, 89, 305
inadmissibility, 190, 248, 294
incidence matrix N, 366, 426
information for c'θ, 63
information function, φ : NND(s) → R, 119, 126, 134, 209, 413; ψ : NND(k1) × ··· × NND(km) → Rm, 285; Ψ : Rm → R, 284; functional operations, 124; matrix means φp with p ∈ (-∞; 1], 140; reconstruction from unit level set {φ ≥ 1}, 122
information increasing ordering, 426
information matrix CK(A) for K'θ, 62, 86, 411; four formulas, 76
information matrix mapping CK, 76, 92, 129; discontinuity, 79, 82, 319

information matrix mapping CK (continued): upper semicontinuity and regularization, 77, 99
information ordering and dispersion ordering, 91
information surface t ↦ Cf(t)(M), 382, 389, 427
injective (i.e. one-to-one), 122, 335
integer part ⌊z⌋, 227, 306
invariance, heritability, 358; of a design problem, 331; of a matrix mean φp, 343; of a symmetric matrix, 345, 388, 398; of an information function φ, 343, 349; of the determinant criterion φ0, 137, 344; of the experimental domain T, 335; of the regression range X, 336; of the set of competing matrices M, 337; under choice of generalized inverses G ∈ A⁻, 16; under orthogonal transformations, 335, 351, 359; under permutation matrices, 335, 351, 373; under reparametrization, 137; under rotations, 359; under sign-changes, 351
isotonicity, 12, 114, 119
iterated information matrix CKH(A) = CH(CK(A)), 412
Kiefer optimality, 357, 360, 364, 368, 389, 421, 426
Kiefer ordering ≥, 354, 374
Kiefer-Wolfowitz theorem, 212, 214, 313, 324, 418; inappropriate frame for generalization, 222
Kronecker product s ⊗ t, 392, 427
Kuhn-Tucker coefficients, 298, 423

Legendre polynomial Pd(t), 214, 246, 418, 421
level set {φ ≥ 1} = {C ≥ 0 : φ(C) ≥ 1}, 77, 118, 120, 296
l'Hospital rule, 140, 328, 414
likelihood ratio, 198, 248, 310
limit of a design sequence, 409, 419
line fit model, 6, 32, 57, 83, 198, 246, 259
linear dispersion criterion, 222, 246, 277, 419
linear matrix equation, 16, 85, 201
linearly independent regression vectors, 42, 195, 213, 223, 417
Loewner comparison, 101, 251, 262, 350
Loewner ellipsoid, 417
Loewner optimality, 101, 206, 262, 357, 360, 363, 412, 426; nonexistence, 104
Loewner ordering ≥, 12, 19, 25, 62, 90, 107, 114, 190, 262, 269, 354, 375, 408, 412
logarithmic concavity, 155, 414
majorization ordering ≺, of matrices, 352, 425; of vectors, 144, 352
matrix inverse, 13, 16, 159
matrix mean φp, 135, 140, 178, 200, 206, 246, 343, 413; as information function or norm, 151; gradient ∇φp(C), 179; polar function sφq, 149; simultaneous optimality, 203; weighted, 157
matrix mean optimality, 178, 196, 241, 260, 376; for a component subset (θ1, ..., θs)', 203; for a rank deficient subsystem, 88, 205, 364, 371, 404
matrix modulus |C|, 134, 141, 150, 156
matrix outer product, 190
matrix power, 140, 371
maximization of information, 63, 131; versus minimization of variance, 41, 155
mean squared-error matrix, 65, 269
measures of rotatability, 405, 427
minimum variance unbiased linear estimator, see optimal estimator
mixture, of information functions, 285, 289; of models, 283, 288
model-building, 62, 72, 382, 406

L-criterion, see linear dispersion criterion
Lagrange polynomials L0(t), ..., Ld(t), for Chebyshev polynomial extrema, 227; for Legendre polynomial extrema, 215, 328
Lagrange multipliers, 297
left identity QA = A, 19
left inverse LK = Is, 22, 62; minimizing for A, CK(A) = LAL', 62

452
model discrimination, 279, 423 model matrix X, 3, 27, 409 model response surface / H- f ( t ) ' 9 , 382 model robust design, 423 model variance a2, 3, 69 moment matrix M (f), 26, 131, 232, 383, 395, 409 as information matrix for 0, 63 classical m.m. Md(r) for dth-degree model, 32, 213, 251 eigenvectors of an optimal m.m., 56 formal </>-optimality for K'6 in M, 131, 174 Loewner comparison, 101, 251 Loewner optimality for K'6 in M, 101 maximum range and rank, 99, 105 optimality for c'0 in A/(H), 41 reduced by reparametrization, 111 moment matrix set, A/(H), of all designs, 29, 102, 177, 252, 383 M(H n ), of designs for sample size n, 29 M, of competing moment matrices, 98, 102, 107, 177 A/y(T), induced by regression function /, 383 moment set /i2d-i(T), 251 monotonicity, 13, 357 Moore-Penrose inverse /4 + , 186, 204, 372, 401, 416 multiple linear regression over [0;!]*, 192, 372, 426 multiplicity of optimal moment matrices, 201, 207, 372, 378 multiplier method of apportionment, 307, 424 multipurpose design, 423 multiway classification model, 5, 372, 379, 426 mutual boundedness theorem, for a scalar criterion, 45, 231 for an information function, 171, 174, 261 negative part of a symmetric matrix C_, 141, 150 nonexistence examples, 104, 174, 412 nonnegative definite matrix A > 0, 9, 52, 141 closed cone NND(fc), 10 nonnegativity, 116 norm, 13, 54, 124, 126, 134, 151, 353, 356

SUBJECT INDEX

normal equations, 66, 412 normal vector to a convex set, 159, 258 normal-gamma distribution, 272 normality inequality, 160, 175, 192, 258 under a rank deficient matrix mean <', 205 notational conventions, 0,<r 2 ; Y,y;A,aij, 8 t,T; r,T,x,X; ,H, 27 nullspace and range, 13 numerical rounding, 306 one-point design, 55, 257 one-way classification model, 5, 30 optimal estimator, 24, 36, 41, 65 optimal variance, 41, 131, 233, 241 optimal weights, on arbitrary support points, 199 on linearly independent regression vectors, 195 optimality criterion <f>, 114, 256, 292, 412 order preservation or reversal, 13, 65, 91 orthodiagonal projector, see centering matrix orthogonal design, 390 orthogonal group Orth(m), 335, 342, 351, 384 orthogonal projector P = P2 P', 24, 71, 349 orthogonality, of two nonnegative definite matrices, 153 of two subspaces, 14, 154 parabola fit model, 6, 32, 138, 173, 184, 246, 266, 331, 340, 359, 417 parameter domain 0, 1, 19 parameter orthogonality, 76, 411 parameter system of interest K'O, 19, 35, 61 component subset (0j,..., Os)', 36, 73, 82, 203 maximal, 207, 371 rank deficient, 88, 205, 364, 371, 404 scalar c'0, 47, 137, 170, 410, 419, 420 parameter vector 0 6 0 , 1, 3 partial ordering, 12, 144, 265 penumbra P, 107, 134 permutation group Perm(A:), 143, 335, 351, 352, 362, 373 polar function <f>, 126, 149, 285, 351, 413

SUBJECT INDEX polarity equation, 132, 168, 175, 276, 324 for a matrix mean $p, 154 for a vector mean $p, 157 polarity theorem, 127 polynomial, 213, 249, 328 polynomial fit model, 6, 32, 251 average-variance optimal design, 223, 241 determinant optimal design, 213, 217, 243, 320 formal trace optimal design, 174 scalar optimal design, 229, 237 smallest-eigenvalue optimal design, 232, 316 positive definite matrix A > 0, 9 open cone PD(k), 9 positive homogeneity, 115, 119, 292 positive part of a symmetric matrix C+, 141, 150 positive vector A > 0, 139, 262 positivity, 100, 117 posterior precision increase )3i(y), 273 power vector /(/) = (l,t,...,td)', 213, 249, 336 precision matrix, 25, 65, 409 preordering, 144, 353, 356 criterion induced, 117, 136 prior distribution, 268, 272 prior moment matrix A/0, 271, 275 prior precision /3o, 272 prior sample size n0, 269 product design rs', 100, 105, 206, 364 projector P = P2, 17, 23, 52, 156 onto invariant symmetric matrices, 349 orthogonal P = P2 = P', 33, 71 proof arrangement for the theorems of Elfving, Gauss-Markov, Kiefer-Wolfowitz, 51, 212 proportional frequency design, see product design protected runs, 277, 422 pseudoquotas vwt, 307 quadratic subspace of symmetric matrices, 372, 425 quasi-linearization, 76, 129, 413 quota nwj, 306 quota method of apportionment, 306, 318, 424 range and nullspace, 13, 15 range inclusion condition, 16, 17, 21, 411 range summation lemma, 37

453
rank and nullity, 13 rank deficient matrix mean <', 88, 205, 364, 371, 404 rational weights w, = //, 312 rearrangement jq, 145 recession to infinity, 44, 110, 120 reduced normal equations C^(A)y T, 66, 412 reflection, 332, 341, 397 reflexivity, 12, 144, 353, 356 regression function / : T X, 2, 381 regression range X C Rk, 2 symmetrized r.r. X U (-X), 43, 191 regression space C(X) span(A') C Rk, 42,47 regression surface x i- x'O, 211, 222 regression vectors x 6 X, 2, 25 linearly independent, 195, 213, 223, 417 regular simplex design, 391, 402 regularization, 99, 119, 412 relative interior, 44 reparametrization, 88, 137, 160, 205 inappropriateness, 411 residual projector R = In - P, 23 residual sum of squares, 69 response surface, 382 response vector Y, 3 risk function, 265 rotatability, 335, 384, 394, 426 bad terminology for designs, 385 determining class, 386, 395 measures of r., 405, 427 of first-degree model, 386 of second-degree model, 394, 400 rounding function R(z), 307 row sum vector, see treatment replication vector sample size monotonicity, 306, 424 sampling weight a = n/(n0 + n), 272, 275 saturated models, 7 scalar criterion c'0, 133, 170, 182, 410, 420 in polynomial fit models, 229, 237 on linearly independent regression vectors, 197, 230 Schur complement AH - Ai2A22A2\, 75, 82, 92, 263, 274, 284, 411 Schur inequality S(C) -< A(C), 146 second-degree model, 280, 293, 299, 395, 403 separating hyperplane theorem, 128, 162, 410

454
shorted operator, 411 sign-change group Sign(s), 142, 335, 345, 351, 386, 413 simplex design, 391, 402 simultaneous optimality, under all invariant information functions, 349 under all matrix means 4>p,P G [-00; 1], 203 under some scalar criteria, 102 smallest-eigenvalue criterion <-oo, 119, 135, 153, 158, 180, 183, 232, 257, 316, 404, 413, 420 square root decomposition of a nonnegative definite matrix V UU', 15, 153 standardization, of a design for sample size n, 6,/n, 305 of an optimality criterion, <j(ls) = 1, 117 stochastic vector, 262 strict concavity, 116, 201, 353, strict isotonicity, 117 strict superadditivity, 116 subdifferential dtf>(M), 159, 167 subgradient, 158, 163, 287 subgradient inequality, 158, 164, 317, 414 subgradient theorem, 162, 268 superadditivity, 77, 115 support point x e supp , 26, 191 support set supp, 26, 265, 312 bound for the size, 188, 417 excessive size, 378, 391 minimal size, 195, 322, 391 supporting hyperplane theorem, 49, 107, 110 surjective (i.e. onto), 123, 335 symmetric design problem, 331 symmetric matrix C e Sym(s), 13, 345, 372 diagonality, 142 modulus |C|, 150, 156 positive and negative parts C+,C_, 141, 150 symmetric three-point design, 334

SUBJECT INDEX

T-criterion, see trace criterion, 119 Taylor theorem, 61, 215 testability, 67, 72 testing and estimation problems, 340 third-degree model, 7, 209, 280, 293, 299 total variation distance, 306, 424 trace criterion fa, 118, 135, 138, 140, 153, 173, 240, 258, 369, 404, 413, 422 geometric meaning, 258 trace operator, 7 transitivity, 12, 144, 353, 356 transposition ', 13, 185, 351 treatment concurrence matrix AW, 367, 426 treatment relabeling group, 362, 364, 368 treatment replication vector r, 31, 94, 264, 426 trigonometric fit model, 58, 138, 241, 249, 359 trivial group {Is}, 350, 357 two-point design, 83 two-sample problem, 4, 30 two-way classification model, 5, 30, 88, 93, 97, 138, 206, 249, 262, 362, 411 unbiasedness, 19, 36 uniform distribution on a sphere, 389, 400 uniform optimality, see Loewner optimality uniform weights w, = l/, 94, 100, 201, 213, 320 unit level set, 120 unity vector la, 30, 36, 88, 94, 140 universal optimality, see Kiefer optimality upper level set, 78, 118 upper semicontinuity, 77, 118, 412 Vandermonde matrix, 33, 253, 347 variance surface t *-> f(t)'M~f(t), 382 vec-permutation matrix lmfn, 394 vector mean <$>p, 121, 139, 146, 157, 284, 413, 423 vector of totals T, 66 vectorization operator vec(^r') = st, 393 vertex design ,, 373
