Matrix Methods in Data Mining and Pattern Recognition
Fundamentals of Algorithms
Editor-in-Chief: Nicholas J. Higham, University of Manchester
The SIAM series on Fundamentals of Algorithms is a collection of short user-oriented books on state-
of-the-art numerical methods. Written by experts, the books provide readers with sufficient knowledge
to choose an appropriate method for an application and to understand the method’s strengths and
limitations. The books cover a range of topics drawn from numerical analysis and scientific computing.
The intended audiences are researchers and practitioners using the methods and upper level
undergraduates in mathematics, engineering, and computational science.
Books in this series not only provide the mathematical background for a method or class of methods
used in solving a specific problem but also explain how the method can be developed into an
algorithm and translated into software. The books describe the range of applicability of a method and
give guidance on troubleshooting solvers and interpreting results. The theory is presented at a level
accessible to the practitioner. MATLAB® software is the preferred language for codes presented since it
can be used across a wide variety of platforms and is an excellent environment for prototyping,
testing, and problem solving.
The series is intended to provide guides to numerical algorithms that are readily accessible, contain
practical advice not easily found elsewhere, and include understandable codes that implement the
algorithms.
Series Volumes
Eldén, L., Matrix Methods in Data Mining and Pattern Recognition
Hansen, P. C., Nagy, J. G., and O’Leary, D. P., Deblurring Images: Matrices, Spectra, and Filtering
Davis, T. A., Direct Methods for Sparse Linear Systems
Kelley, C. T., Solving Nonlinear Equations with Newton’s Method
Editorial Board
Peter Benner
Technische Universität Chemnitz
John R. Gilbert
University of California, Santa Barbara
Michael T. Heath
University of Illinois, Urbana-Champaign
C. T. Kelley
North Carolina State University
Cleve Moler
The MathWorks
James G. Nagy
Emory University
Dianne P. O’Leary
University of Maryland
Robert D. Russell
Simon Fraser University
Robert D. Skeel
Purdue University
Danny Sorensen
Rice University
Andrew J. Wathen
Oxford University
Henry Wolkowicz
University of Waterloo
Lars Eldén
Linköping University
Linköping, Sweden
Society for Industrial and Applied Mathematics
Philadelphia
Copyright © 2007 by the Society for Industrial and Applied Mathematics.
All rights reserved. Printed in the United States of America. No part of this book may be reproduced,
stored, or transmitted in any manner without the written permission of the publisher. For information,
write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center,
Philadelphia, PA 19104-2688.
Trademarked names may be used in this book without the inclusion of a trademark symbol. These
names are used in an editorial context only; no infringement of trademark is intended.
Google is a trademark of Google, Inc.
MATLAB is a registered trademark of The MathWorks, Inc. For MATLAB product information, please
contact The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA 01760-2098 USA, 508-647-7000,
Fax: 508-647-7101, info@mathworks.com, www.mathworks.com
Figures 6.2, 10.1, 10.7, 10.9, 10.11, 11.1, and 11.3 are from L. Eldén, Numerical linear algebra in
data mining, Acta Numer., 15:327–384, 2006. Reprinted with the permission of Cambridge University
Press.
Figures 14.1, 14.3, and 14.4 were constructed by the author from images appearing in P. N.
Belhumeur, J. P. Hespanha, and D. J. Kriegman, Eigenfaces vs. fisherfaces: Recognition using class
specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell., 19:711–720, 1997.
Library of Congress Cataloging-in-Publication Data
Eldén, Lars, 1944-
Matrix methods in data mining and pattern recognition / Lars Eldén.
p. cm. — (Fundamentals of algorithms ; 04)
Includes bibliographical references and index.
ISBN 978-0-898716-26-9 (pbk. : alk. paper)
1. Data mining. 2. Pattern recognition systems—Mathematical models. 3. Algebras,
Linear. I. Title.
QA76.9.D343E52 2007
05.74—dc20 2006041348
SIAM is a registered trademark.
Contents
Preface
I Linear Algebra Concepts and Matrix Decompositions
1 Vectors and Matrices in Data Mining and Pattern Recognition
1.1 Data Mining and Pattern Recognition
1.2 Vectors and Matrices
1.3 Purpose of the Book
1.4 Programming Environments
1.5 Floating Point Computations
1.6 Notation and Conventions
2 Vectors and Matrices
2.1 Matrix-Vector Multiplication
2.2 Matrix-Matrix Multiplication
2.3 Inner Product and Vector Norms
2.4 Matrix Norms
2.5 Linear Independence: Bases
2.6 The Rank of a Matrix
3 Linear Systems and Least Squares
3.1 LU Decomposition
3.2 Symmetric, Positive Definite Matrices
3.3 Perturbation Theory and Condition Number
3.4 Rounding Errors in Gaussian Elimination
3.5 Banded Matrices
3.6 The Least Squares Problem
4 Orthogonality
4.1 Orthogonal Vectors and Matrices
4.2 Elementary Orthogonal Matrices
4.3 Number of Floating Point Operations
4.4 Orthogonal Transformations in Floating Point Arithmetic
5 QR Decomposition
5.1 Orthogonal Transformation to Triangular Form
5.2 Solving the Least Squares Problem
5.3 Computing or Not Computing Q
5.4 Flop Count for QR Factorization
5.5 Error in the Solution of the Least Squares Problem
5.6 Updating the Solution of a Least Squares Problem
6 Singular Value Decomposition
6.1 The Decomposition
6.2 Fundamental Subspaces
6.3 Matrix Approximation
6.4 Principal Component Analysis
6.5 Solving Least Squares Problems
6.6 Condition Number and Perturbation Theory for the Least Squares Problem
6.7 Rank-Deficient and Underdetermined Systems
6.8 Computing the SVD
6.9 Complete Orthogonal Decomposition
7 Reduced-Rank Least Squares Models
7.1 Truncated SVD: Principal Component Regression
7.2 A Krylov Subspace Method
8 Tensor Decomposition
8.1 Introduction
8.2 Basic Tensor Concepts
8.3 A Tensor SVD
8.4 Approximating a Tensor by HOSVD
9 Clustering and Nonnegative Matrix Factorization
9.1 The k-Means Algorithm
9.2 Nonnegative Matrix Factorization
II Data Mining Applications
10 Classification of Handwritten Digits
10.1 Handwritten Digits and a Simple Algorithm
10.2 Classification Using SVD Bases
10.3 Tangent Distance
11 Text Mining
11.1 Preprocessing the Documents and Queries
11.2 The Vector Space Model
11.3 Latent Semantic Indexing
11.4 Clustering
11.5 Nonnegative Matrix Factorization
11.6 LGK Bidiagonalization
11.7 Average Performance
12 Page Ranking for a Web Search Engine
12.1 Pagerank
12.2 Random Walk and Markov Chains
12.3 The Power Method for Pagerank Computation
12.4 HITS
13 Automatic Key Word and Key Sentence Extraction
13.1 Saliency Score
13.2 Key Sentence Extraction from a Rank-k Approximation
14 Face Recognition Using Tensor SVD
14.1 Tensor Representation
14.2 Face Recognition
14.3 Face Recognition with HOSVD Compression
III Computing the Matrix Decompositions
15 Computing Eigenvalues and Singular Values
15.1 Perturbation Theory
15.2 The Power Method and Inverse Iteration
15.3 Similarity Reduction to Tridiagonal Form
15.4 The QR Algorithm for a Symmetric Tridiagonal Matrix
15.5 Computing the SVD
15.6 The Nonsymmetric Eigenvalue Problem
15.7 Sparse Matrices
15.8 The Arnoldi and Lanczos Methods
15.9 Software
Bibliography
Index
Preface
The first version of this book was a set of lecture notes for a graduate course
on data mining and applications in science and technology organized by the Swedish
National Graduate School in Scientific Computing (NGSSC). Since then the mate-
rial has been used and further developed for an undergraduate course on numerical
algorithms for data mining and IT at Linköping University. This is a second course
in scientific computing for computer science students.
The book is intended primarily for undergraduate students who have pre-
viously taken an introductory scientific computing/numerical analysis course. It
may also be useful for early graduate students in various data mining and pattern
recognition areas who need an introduction to linear algebra techniques.
The purpose of the book is to demonstrate that there are several very powerful
numerical linear algebra techniques for solving problems in different areas of data
mining and pattern recognition. To achieve this goal, it is necessary to present
material that goes beyond what is normally covered in a first course in scientific
computing (numerical analysis) at a Swedish university. On the other hand, since
the book is application oriented, it is not possible to give a comprehensive treatment
of the mathematical and numerical aspects of the linear algebra algorithms used.
The book has three parts. After a short introduction to a couple of areas of
data mining and pattern recognition, linear algebra concepts and matrix decompositions
are presented. I hope that this is enough for the student to use matrix
decompositions in problem-solving environments such as MATLAB®. Some mathematical
proofs are given, but the emphasis is on the existence and properties of
the matrix decompositions rather than on how they are computed. In Part II, the
linear algebra techniques are applied to data mining problems. Naturally, the data
mining and pattern recognition repertoire is quite limited: I have chosen problem
areas that are well suited for linear algebra techniques. In order to use intelligently
the powerful software for computing matrix decompositions available in MATLAB,
etc., some understanding of the underlying algorithms is necessary. A very short
introduction to eigenvalue and singular value algorithms is given in Part III.
I have not had the ambition to write a book of recipes: “given a certain
problem, here is an algorithm for its solution.” That would be difficult, as the area
is far too diverse to give clear-cut and simple solutions. Instead, my intention has
been to give the student a set of tools that may be tried as they are but, more
likely, that will need to be modified to be useful for a particular application. Some
of the methods in the book are described using MATLAB scripts. They should not
be considered as serious algorithms but rather as pseudocodes given for illustration
purposes.
A collection of exercises and computer assignments is available at the book’s
Web page: www.siam.org/books/fa04.
The support from NGSSC for producing the original lecture notes is gratefully
acknowledged. The lecture notes have been used by a couple of colleagues. Thanks
are due to Gene Golub and Saara Hyvönen for helpful comments. Several of my own
students have helped me to improve the presentation by pointing out inconsistencies
and asking questions. I am indebted to Berkant Savas for letting me use results from
his master’s thesis in Chapter 10. Three anonymous referees read earlier versions of
the book and made suggestions for improvements. Finally, I would like to thank Nick
Higham, series editor at SIAM, for carefully reading the manuscript. His thoughtful
advice helped me improve the contents and the presentation considerably.
Lars Eldén
Linköping, October 2006
Part I
Linear Algebra Concepts and Matrix Decompositions
Chapter 1
Vectors and Matrices in Data Mining and Pattern Recognition
1.1 Data Mining and Pattern Recognition
In modern society, huge amounts of data are collected and stored in computers so
that useful information can later be extracted. Often it is not known at the time
of collection what data will later be requested, and therefore the database is not
designed to distill any particular information, but rather it is, to a large extent,
unstructured. The science of extracting useful information from large data sets is
usually referred to as “data mining,” sometimes with the addition of “knowledge
discovery.”
Pattern recognition is often considered to be a technique separate from data
mining, but its definition is related: “the act of taking in raw data and making
an action based on the ‘category’ of the pattern” [31]. In this book we will not
emphasize the differences between the concepts.
There are numerous application areas for data mining, ranging from e-business
[10, 69] to bioinformatics [6], from scientific applications such as the classification of
volcanos on Venus [21] to information retrieval [3] and Internet search engines [11].
Data mining is a truly interdisciplinary science, in which techniques from
computer science, statistics and data analysis, linear algebra, and optimization are
used, often in a rather eclectic manner. Due to the practical importance of the
applications, there are now numerous books and surveys in the area [24, 25, 31, 35,
45, 46, 47, 49, 108].
It is not an exaggeration to state that everyday life is filled with situations in
which we depend, often unknowingly, on advanced mathematical methods for data
mining. Methods such as linear algebra and data analysis are basic ingredients in
many data mining techniques. This book gives an introduction to the mathematical
and numerical methods and their use in data mining and pattern recognition.
1.2 Vectors and Matrices
The following examples illustrate the use of vectors and matrices in data mining.
These examples present the main data mining areas discussed in the book, and they
will be described in more detail in Part II.
In many applications a matrix is just a rectangular array of data, and the
elements are scalar, real numbers:
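\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots &        & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix} \in \mathbb{R}^{m \times n}.
\]
To treat the data by mathematical methods, some mathematical structure must be
added. In the simplest case, the columns of the matrix are considered as vectors
in $\mathbb{R}^m$.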
Example 1.1. Term-document matrices are used in information retrieval. Consider
the following selection of five documents.¹ Key words, which we call terms,
are marked in boldface.²
Document 1: The Google™ matrix P is a model of the Internet.
Document 2: $P_{ij}$ is nonzero if there is a link from Web page j to i.
Document 3: The Google matrix is used to rank all Web pages.
Document 4: The ranking is done by solving a matrix eigenvalue problem.
Document 5: England dropped out of the top 10 in the FIFA ranking.
If we count the frequency of terms in each document we get the following result:
Term         Doc 1  Doc 2  Doc 3  Doc 4  Doc 5
eigenvalue     0      0      0      1      0
England        0      0      0      0      1
FIFA           0      0      0      0      1
Google         1      0      1      0      0
Internet       1      0      0      0      0
link           0      1      0      0      0
matrix         1      0      1      1      0
page           0      1      1      0      0
rank           0      0      1      1      1
Web            0      1      1      0      0
¹ In Document 5, FIFA is the Fédération Internationale de Football Association. This document
is clearly concerned with football (soccer). The document is a newspaper headline from 2005. After
the 2006 World Cup, England came back into the top 10.
² To avoid making the example too large, we have ignored some words that would normally be
considered as terms (key words). Note also that only the stem of the word is significant: “ranking”
is considered the same as “rank.”
Thus each document is represented by a vector, or a point, in $\mathbb{R}^{10}$, and we can
organize all documents into a term-document matrix:
\[
A = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 \\
0 & 1 & 1 & 0 & 0
\end{pmatrix}.
\]
Now assume that we want to find all documents that are relevant to the query
“ranking of Web pages.” This is represented by a query vector, constructed in a
way analogous to the term-document matrix:
\[
q = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 1 \end{pmatrix} \in \mathbb{R}^{10}.
\]
Thus the query itself is considered as a document. The information retrieval task
can now be formulated as a mathematical problem: find the columns of A that are
close to the vector q. To solve this problem we must use some distance measure
in $\mathbb{R}^{10}$.
In the information retrieval application it is common that the dimension m is
large, of the order $10^6$, say. Also, as most of the documents contain only a small
fraction of the terms, most of the elements in the matrix are equal to zero. Such a
matrix is called sparse.
Some methods for information retrieval use linear algebra techniques (e.g., sin-
gular value decomposition (SVD)) for data compression and retrieval enhancement.
Vector space methods for information retrieval are presented in Chapter 11.
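As a small illustration of the query matching in Example 1.1, the following MATLAB sketch ranks the five documents against the query vector q. The cosine of the angle between q and each column of A is used as the measure of closeness; this choice is an assumption made here for illustration (the example only states that some distance measure is needed), and the variable names are hypothetical.

```matlab
% Term-document matrix from Example 1.1.
% Rows: eigenvalue, England, FIFA, Google, Internet, link, matrix, page, rank, Web.
A = [0 0 0 1 0
     0 0 0 0 1
     0 0 0 0 1
     1 0 1 0 0
     1 0 0 0 0
     0 1 0 0 0
     1 0 1 1 0
     0 1 1 0 0
     0 0 1 1 1
     0 1 1 0 0];

% Query "ranking of Web pages": terms page, rank, Web.
q = [0 0 0 0 0 0 0 1 1 1]';

% Cosine of the angle between q and each document column (assumed measure).
colnorms = sqrt(sum(A.^2, 1))';              % Euclidean norm of every column
cosines  = (A' * q) ./ (colnorms * norm(q));

% Index of the document whose column is closest in angle to the query.
[~, best] = max(cosines);
disp(cosines');
fprintf('Closest document: %d\n', best);
```

With this measure Document 3, which contains all three query terms, receives the largest cosine, and Document 1, which shares no term with the query, receives zero.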
Often it is useful to consider the matrix not just as an array of numbers, or
as a set of vectors, but also as a linear operator. Denote the columns of A
\[
a_{\cdot j} = \begin{pmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{mj} \end{pmatrix}, \qquad j = 1, 2, \ldots, n,
\]
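and write
\[
A = \begin{pmatrix} a_{\cdot 1} & a_{\cdot 2} & \cdots & a_{\cdot n} \end{pmatrix}.
\]
Then the linear transformation is defined
\[
y = Ax = \begin{pmatrix} a_{\cdot 1} & a_{\cdot 2} & \cdots & a_{\cdot n} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
= \sum_{j=1}^{n} x_j a_{\cdot j}.
\]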
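The view of y = Ax as a linear combination of the columns of A can be checked with a few lines of MATLAB. The matrix and vector below are hypothetical toy data, chosen only so that the result is easy to verify by hand.

```matlab
% Small hypothetical 3-by-2 example.
A = [1 2
     3 4
     5 6];
x = [10; -1];

% Accumulate y as a linear combination of the columns of A.
y = zeros(size(A, 1), 1);
for j = 1:size(A, 2)
    y = y + x(j) * A(:, j);
end

disp([y, A * x]);   % the two columns agree
```

Here y equals 10 times the first column minus the second column, which is the same vector that the built-in product A*x returns.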
Example 1.2. The classification of handwritten digits is a model problem in
pattern recognition. Here vectors are used to represent digits. The image of one digit
is a 16 × 16 matrix of numbers, representing gray scale. It can also be represented
as a vector in $\mathbb{R}^{256}$, by stacking the columns of the matrix. A set of n digits
(handwritten 3’s, say) can then be represented by a matrix $A \in \mathbb{R}^{256 \times n}$, and the
columns of A span a subspace of $\mathbb{R}^{256}$. We can compute an approximate basis of
this subspace using the SVD $A = U \Sigma V^T$. Three basis vectors of the “3-subspace”
are illustrated in Figure 1.1.
Figure 1.1. Handwritten digits from the U.S. Postal Service database [47],
and basis vectors for 3’s (bottom).
Let b be a vector representing an unknown digit. We now want to classify
(automatically, by computer) the unknown digit as one of the digits 0–9. Given a
set of approximate basis vectors for 3’s, $u_1, u_2, \ldots, u_k$, we can determine whether b
is a 3 by checking if there is a linear combination of the basis vectors, $\sum_{j=1}^{k} x_j u_j$,
such that
\[
b - \sum_{j=1}^{k} x_j u_j
\]
is small. Thus, here we compute the coordinates of b in the basis $\{u_j\}_{j=1}^{k}$.
In Chapter 10 we discuss methods for classification of handwritten digits.
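The idea of Example 1.2 can be sketched in MATLAB on toy data. The 4-by-2 matrix below is a hypothetical stand-in for a matrix of training digits (real digit images would give a 256-by-n matrix), and the relative residual is used here as the measure of how well an unknown vector b is described by the basis; this is an illustration under these assumptions and not the classification algorithm of Chapter 10.

```matlab
% Toy stand-in for a matrix of training digits: 4 "pixels", 2 samples.
A = [1  1
     1 -1
     0  0
     0  0];

[U, S, V] = svd(A, 'econ');   % economy-size SVD, A = U*S*V'
k  = 2;                       % number of basis vectors kept
Uk = U(:, 1:k);

b_in  = [2; 3; 0; 0];         % lies in the span of the columns of A
b_out = [0; 0; 1; 0];         % orthogonal to that span

% Since Uk has orthonormal columns, the least squares coordinates are x = Uk'*b.
res_in  = norm(b_in  - Uk * (Uk' * b_in))  / norm(b_in);
res_out = norm(b_out - Uk * (Uk' * b_out)) / norm(b_out);

fprintf('relative residuals: %.2e (in span), %.2e (outside)\n', res_in, res_out);
```

A vector in the subspace gives a residual at the level of rounding errors, while a vector orthogonal to the subspace gives a relative residual of 1.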
The very idea of data mining is to extract useful information from large,
often unstructured, sets of data. Therefore it is necessary that the methods used
are efficient and often specially designed for large problems. In some data mining
applications huge matrices occur.
Example 1.3. The task of extracting information from all Web pages available
on the Internet is done by search engines. The core of the Google search engine is
a matrix computation, probably the largest that is performed routinely [71]. The
Google matrix P is of the order billions, i.e., close to the total number of Web pages
on the Internet. The matrix is constructed based on the link structure of the Web,
and element $P_{ij}$ is nonzero if there is a link from Web page j to i.
The following small link graph illustrates a set of Web pages with outlinks
and inlinks:
[Figure: link graph of six Web pages, numbered 1–6, with arrows marking the links between them.]
A corresponding link graph matrix is constructed so that the columns and
rows represent Web pages and the nonzero elements in column j denote outlinks
from Web page j. Here the matrix becomes
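\[
P = \begin{pmatrix}
0 & \frac{1}{3} & 0 & 0 & 0 & 0 \\
\frac{1}{3} & 0 & 0 & 0 & 0 & 0 \\
0 & \frac{1}{3} & 0 & 0 & \frac{1}{3} & \frac{1}{2} \\
\frac{1}{3} & 0 & 0 & 0 & \frac{1}{3} & 0 \\
\frac{1}{3} & \frac{1}{3} & 0 & 0 & 0 & \frac{1}{2} \\
0 & 0 & 1 & 0 & \frac{1}{3} & 0
\end{pmatrix}.
\]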
For a search engine to be useful, it must use a measure of quality of the Web pages.
The Google matrix is used to rank all the pages. The ranking is done by solving an
eigenvalue problem for P ; see Chapter 12.
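The construction of the link graph matrix in Example 1.3 can be sketched in MATLAB as follows. The outlink lists are read off from the nonzero pattern of the columns of P above, and the scaling of column j by one over the number of outlinks of page j is inferred from the displayed entries; the variable names are hypothetical.

```matlab
n = 6;

% outlinks{j} lists the pages that page j links to (page 4 has no outlinks).
outlinks = {[2 4 5], [1 3 5], 6, [], [3 4 6], [3 5]};

P = zeros(n);
for j = 1:n
    targets = outlinks{j};
    if ~isempty(targets)
        % Each outlink of page j gets the same weight in column j.
        P(targets, j) = 1 / numel(targets);
    end
end

disp(P);
disp(sum(P, 1));   % column sums: 1 for pages with outlinks, 0 for page 4
```

Every column that belongs to a page with outlinks sums to 1, whereas the column of page 4 is zero.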
1.3 Purpose of the Book
The present book is meant to be not primarily a textbook in numerical linear alge-
bra but rather an application-oriented introduction to some techniques in modern
linear algebra, with the emphasis on data mining and pattern recognition. It de-
pends heavily on the availability of an easy-to-use programming environment that
implements the algorithms that we will present. Thus, instead of describing in detail
the algorithms, we will give enough mathematical theory and numerical background
information so that a reader can understand and use the powerful software that is
embedded in a package like MATLAB [68].
For a more comprehensive presentation of numerical and algorithmic aspects
of the matrix decompositions used in this book, see any of the recent textbooks
[29, 42, 50, 92, 93, 97]. The solution of linear systems and eigenvalue problems for
large and sparse systems is discussed at length in [4, 5]. For those who want to
study the detailed implementation of numerical linear algebra algorithms, software
in Fortran, C, and C++ is available for free via the Internet [1].
It will be assumed that the reader has studied introductory courses in linear
algebra and scientific computing (numerical analysis). Familiarity with the basics
of a matrix-oriented programming language like MATLAB should help one to follow
the presentation.
1.4 Programming Environments
In this book we use MATLAB [68] to demonstrate the concepts and the algorithms.
Our codes are not to be considered as software; instead they are intended to demon-
strate the basic principles, and we have emphasized simplicity rather than efficiency
and robustness. The codes should be used only for small experiments and never for
production computations.
Even if we are using MATLAB, we want to emphasize that any program-
ming environment that implements modern matrix computations can be used, e.g.,
Mathematica® [112] or a statistics package.
1.5 Floating Point Computations
1.5.1 Flop Counts
The execution times of different algorithms can sometimes be compared by counting
the number of floating point operations, i.e., arithmetic operations with floating
point numbers. In this book we follow the standard procedure [42] and count
each operation separately, and we use the term flop for one operation. Thus the
statement y=y+a*x, where the variables are scalars, counts as two flops.
It is customary to count only the highest-order term(s). We emphasize that
flop counts are often very crude measures of efficiency and computing time and
can even be misleading under certain circumstances. On modern computers, which
invariably have memory hierarchies, the data access patterns are very important.
Thus there are situations in which the execution times of algorithms with the same
flop counts can vary by an order of magnitude.
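As a small illustration of this counting convention, the following sketch counts the flops of a matrix-vector product computed with two nested loops. The dimensions and data are hypothetical, and the counter is maintained by hand.

```matlab
m = 4; n = 3;                 % hypothetical small dimensions
A = ones(m, n);
x = ones(n, 1);

y = zeros(m, 1);
flops = 0;                    % manual flop counter
for i = 1:m
    for j = 1:n
        y(i) = y(i) + A(i, j) * x(j);   % one multiplication and one addition
        flops = flops + 2;
    end
end

fprintf('counted %d flops, 2*m*n = %d\n', flops, 2 * m * n);
```

The statement in the inner loop has the same form as y=y+a*x and is executed mn times, so the count is 2mn flops.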
1.5.2 Floating Point Rounding Errors
Error analysis of the algorithms will not be a major part of the book, but we will cite
a few results without proofs. We will assume that the computations are done under
the IEEE floating point standard [2] and, accordingly, that the following model is
valid.
A real number x, in general, cannot be represented exactly in a floating point
system. Let fl[x] be the floating point number representing x. Then
\[
fl[x] = x(1 + \epsilon) \tag{1.1}
\]
for some $\epsilon$, satisfying $|\epsilon| \le \mu$, where $\mu$ is the unit round-off of the floating point
system. From (1.1) we see that the relative error in the floating point representation
of any real number x satisfies
\[
\left| \frac{fl[x] - x}{x} \right| \le \mu.
\]
In IEEE double precision arithmetic (which is the standard floating point format
in MATLAB), the unit round-off satisfies $\mu \approx 10^{-16}$. In IEEE single precision we
have $\mu \approx 10^{-7}$.
Let $fl[x \odot y]$ be the result of a floating point arithmetic operation, where $\odot$
denotes any of +, −, ∗, and /. Then, provided that $x \odot y \ne 0$,
\[
\left| \frac{x \odot y - fl[x \odot y]}{x \odot y} \right| \le \mu \tag{1.2}
\]
or, equivalently,
\[
fl[x \odot y] = (x \odot y)(1 + \epsilon) \tag{1.3}
\]
for some $\epsilon$, satisfying $|\epsilon| \le \mu$, where $\mu$ is the unit round-off of the floating point
system.
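The orders of magnitude of the unit round-off can be inspected in MATLAB. The sketch below assumes rounding to nearest, in which case the unit round-off is half of the spacing eps between 1 and the next larger floating point number; the variable names are hypothetical.

```matlab
% eps is the spacing between 1 and the next larger double (2^-52).
% With rounding to nearest, the unit round-off is half of that spacing.
mu_double = eps / 2;
mu_single = double(eps('single')) / 2;

fprintf('double precision: mu = %.3e\n', mu_double);
fprintf('single precision: mu = %.3e\n', mu_single);

% A relative perturbation well below mu is lost when added to 1,
% whereas a perturbation of size 2*mu is retained.
lost     = (1 + mu_double / 2 == 1);
retained = (1 + 2 * mu_double > 1);
disp([lost, retained]);
```

Both values agree in order of magnitude with the approximate figures quoted above for double and single precision.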
When we estimate the error in the result of a computation in floating point
arithmetic as in (1.2) we can think of it as a forward error. Alternatively, we can
rewrite (1.3) as
\[
fl[x \odot y] = (x + e) \odot (y + f)
\]
for some numbers e and f that satisfy
\[
|e| \le \mu |x|, \qquad |f| \le \mu |y|.
\]
In other words, $fl[x \odot y]$ is the exact result of the operation on slightly perturbed
data. This is an example of backward error analysis.
Figure 1.2. Vectors in the GJK algorithm.
The smallest and largest positive real numbers that can be represented in IEEE
double precision are $10^{-308}$ and $10^{308}$, approximately (and correspondingly for negative
numbers). If a computation gives as a result a floating point number of magnitude
smaller than $10^{-308}$, then a floating point exception called underflow occurs. Similarly,
the computation of a floating point number of magnitude larger than $10^{308}$
results in overflow.
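Overflow and underflow are easy to provoke in MATLAB. The sketch below uses the built-in constants realmax and realmin; the factors 10 and 1e-20 are arbitrary choices made for illustration.

```matlab
big   = realmax;          % largest finite double, about 1.8e308
small = realmin;          % smallest normalized positive double, about 2.2e-308

over  = 10 * big;         % exceeds the representable range: overflow
under = small * 1e-20;    % far below the representable range: underflow

disp(over);               % displayed as Inf
disp(under);              % displayed as 0
```

Note that IEEE arithmetic also represents some numbers slightly below realmin (so-called subnormal numbers) with reduced precision, which is why the factor in the underflow case is chosen so small.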
Example 1.4 (floating point computations in computer graphics). The de-
tection of a collision between two three-dimensional objects is a standard problem
in the application of graphics to computer games, animation, and simulation [101].
Earlier fixed point arithmetic was used for computer graphics, but such computa-
tions now are routinely done in floating point arithmetic. An important subproblem
in this area is the computation of the point on a convex body that is closest to the
origin. This problem can be solved by the Gilbert–Johnson–Keerthi (GJK) algo-
rithm, which is iterative. The algorithm uses the stopping criterion
\[
S(v, w) = v^T v - v^T w \le \epsilon^2
\]
for the iterations, where the vectors are illustrated in Figure 1.2. As the solution is
approached the vectors are very close. In [101, pp. 142–145] there is a description
of the numerical difficulties that can occur when the computation of S(v, w) is done
in floating point arithmetic. Here we give a short explanation of the computation
in the case when v and w are scalar, $s = v^2 - vw$, which exhibits exactly the same
problems as in the case of vectors.
Assume that the data are inexact (they are the results of previous computations;
in any case they suffer from representation errors (1.1)),
\[
\bar{v} = v(1 + \epsilon_v), \qquad \bar{w} = w(1 + \epsilon_w),
\]
where $\epsilon_v$ and $\epsilon_w$ are relatively small, often of the order of magnitude of $\mu$. From
(1.2) we see that each arithmetic operation incurs a relative error (1.3), so that
\[
\begin{aligned}
fl[v^2 - vw] &= \bigl(v^2(1 + \epsilon_v)^2(1 + \epsilon_1) - vw(1 + \epsilon_v)(1 + \epsilon_w)(1 + \epsilon_2)\bigr)(1 + \epsilon_3) \\
&= (v^2 - vw) + v^2(2\epsilon_v + \epsilon_1 + \epsilon_3) - vw(\epsilon_v + \epsilon_w + \epsilon_2 + \epsilon_3) + O(\mu^2),
\end{aligned}
\]
where we have assumed that $|\epsilon_i| \le \mu$. The relative error in the computed quantity
can be estimated by
\[
\left| \frac{fl[v^2 - vw] - (v^2 - vw)}{(v^2 - vw)} \right|
\le \frac{v^2(2|\epsilon_v| + 2\mu) + |vw|(|\epsilon_v| + |\epsilon_w| + 2\mu) + O(\mu^2)}{|v^2 - vw|}.
\]
We see that if v and w are large, and close, then the relative error may be large.
For instance, with v = 100 and w = 99.999 we get
\[
\left| \frac{fl[v^2 - vw] - (v^2 - vw)}{(v^2 - vw)} \right|
\le 10^5 \bigl((2|\epsilon_v| + 2\mu) + (|\epsilon_v| + |\epsilon_w| + 2\mu) + O(\mu^2)\bigr).
\]
If the computations are performed in IEEE single precision, which is common in
computer graphics applications, then the relative error in $fl[v^2 - vw]$ may be so large
that the termination criterion is never satisfied, and the iteration will never stop. In
the GJK algorithm there are also other cases, besides that described above, when
floating point rounding errors can cause the termination criterion to be unreliable,
and special care must be taken; see [101].
The problem that occurs in the preceding example is called cancellation: when
we subtract two almost equal numbers with errors, the result has fewer significant
digits, and the relative error is larger. For more details on the IEEE standard and
rounding errors in floating point computations, see, e.g., [34, Chapter 2]. Extensive
rounding error analyses of linear algebra algorithms are given in [50].
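The cancellation in Example 1.4 can be observed directly in MATLAB by evaluating $s = v^2 - vw$ for v = 100 and w = 99.999 in double and in single precision. The sketch below takes 0.1 as the reference value of s and uses hypothetical variable names.

```matlab
s_exact = 0.1;                         % v^2 - v*w for v = 100, w = 99.999

% Double precision.
v = 100;
w = 99.999;
s_double = v * v - v * w;

% Single precision: all operands and operations are single.
vs = single(100);
ws = single(99.999);
s_single = vs * vs - vs * ws;

rel_double = abs(s_double - s_exact) / s_exact;
rel_single = abs(double(s_single) - s_exact) / s_exact;

fprintf('relative error, double precision: %.1e\n', rel_double);
fprintf('relative error, single precision: %.1e\n', rel_single);
```

In single precision the relative error of the computed s is several orders of magnitude larger than the single precision unit round-off, in line with the estimate in the example.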
1.6 Notation and Conventions
We will consider vectors and matrices with real components. Usually vectors will be
denoted by lowercase italic Roman letters and matrices by uppercase italic Roman
or Greek letters:
\[ x \in \mathbb{R}^n, \qquad A = (a_{ij}) \in \mathbb{R}^{m \times n}. \]
Tensors, i.e., arrays of real numbers with three or more indices, will be denoted by
a calligraphic font. For example,
\[ \mathcal{S} = (s_{ijk}) \in \mathbb{R}^{n_1 \times n_2 \times n_3}. \]
We will use $\mathbb{R}^m$ to denote the vector space of dimension $m$ over the real field and
$\mathbb{R}^{m \times n}$ for the space of $m \times n$ matrices.
The notation
\[
e_i = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix},
\]
where the 1 is in position i, is used for the “canonical” unit vectors. Often the
dimension is apparent from the context.
The identity matrix is denoted $I$. Sometimes we emphasize the dimension
and use $I_k$ for the $k \times k$ identity matrix. The notation $\operatorname{diag}(d_1, \dots, d_n)$ denotes a
diagonal matrix. For instance, $I = \operatorname{diag}(1, 1, \dots, 1)$.
Chapter 2
Vectors and Matrices
We will assume that the basic notions of linear algebra are known to the reader.
For completeness, some will be recapitulated here.
2.1 Matrix-Vector Multiplication
How basic operations in linear algebra are defined is important, since it influences
one’s mental images of the abstract notions. Sometimes one is led to thinking that
the operations should be done in a certain order, when instead the definition as
such imposes no ordering.³ Let $A$ be an $m \times n$ matrix. Consider the definition of
matrix-vector multiplication:
\[ y = Ax, \qquad y_i = \sum_{j=1}^{n} a_{ij} x_j, \quad i = 1, \dots, m. \tag{2.1} \]
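Symbolically one can illustrate the definition
\[
\begin{pmatrix} \times \\ \times \\ \times \\ \times \end{pmatrix}
=
\begin{pmatrix}
\leftarrow & - & - & \rightarrow \\
\leftarrow & - & - & \rightarrow \\
\leftarrow & - & - & \rightarrow \\
\leftarrow & - & - & \rightarrow
\end{pmatrix}
\begin{pmatrix} \uparrow \\ | \\ | \\ \downarrow \end{pmatrix}. \tag{2.2}
\]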
It is obvious that the computations of the different components of the vector y are
completely independent of each other and can be done in any order. However, the
definition may lead one to think that the matrix should be accessed rowwise, as
illustrated in (2.2) and in the following MATLAB code:
    for i=1:m
        y(i)=0;
        for j=1:n
            y(i)=y(i)+A(i,j)*x(j);
        end
    end
³ It is important to be aware that on modern computers, which invariably have memory hierarchies, the order in which operations are performed is often critical for the performance. However, we will not pursue this aspect here.
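Alternatively, we can write the operation in the following way. Let $a_{\cdot j}$ be a column
vector of $A$. Then we can write
\[
y = Ax =
\begin{pmatrix} a_{\cdot 1} & a_{\cdot 2} & \cdots & a_{\cdot n} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
= \sum_{j=1}^{n} x_j a_{\cdot j}.
\]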
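This can be illustrated symbolically:
\[
\begin{pmatrix} \uparrow \\ | \\ | \\ \downarrow \end{pmatrix}
=
\begin{pmatrix}
\uparrow & \uparrow & \uparrow & \uparrow \\
| & | & | & | \\
| & | & | & | \\
\downarrow & \downarrow & \downarrow & \downarrow
\end{pmatrix}
\begin{pmatrix} \times \\ \times \\ \times \\ \times \end{pmatrix}. \tag{2.3}
\]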
Here the vectors are accessed columnwise. In MATLAB, this version can be written⁴
    for i=1:m
        y(i)=0;
    end
    for j=1:n
        for i=1:m
            y(i)=y(i)+A(i,j)*x(j);
        end
    end
or, equivalently, using the vector operations of MATLAB,
    y=zeros(m,1);    % column vector, so that it matches A(1:m,j)
    for j=1:n
        y(1:m)=y(1:m)+A(1:m,j)*x(j);
    end
Thus the two ways of performing the matrix-vector multiplication correspond to
changing the order of the loops in the code. This way of writing also emphasizes
the view of the column vectors of A as basis vectors and the components of x as
coordinates with respect to the basis.
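That the two loop orderings give the same vector can be checked on a small example. The sketch below uses an arbitrary integer test matrix and vector (our own choice, not taken from the text) so that all arithmetic is exact, and it compares both versions with the built-in product A*x.

```matlab
% Row-oriented versus column-oriented (SAXPY) matrix-vector multiplication.
m = 4;  n = 3;
A = reshape(1:m*n, m, n);      % arbitrary integer test matrix (assumption)
x = (1:n)';                    % arbitrary integer test vector (assumption)

% Row-oriented: y(i) is the inner product of row i of A with x.
y_row = zeros(m,1);
for i = 1:m
    for j = 1:n
        y_row(i) = y_row(i) + A(i,j)*x(j);
    end
end

% Column-oriented: y is a linear combination of the columns of A,
% with the components of x as coefficients.
y_col = zeros(m,1);
for j = 1:n
    y_col = y_col + A(:,j)*x(j);
end

disp([y_row, y_col, A*x]);     % the three columns should coincide
```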
⁴ In the terminology of LAPACK [1] this is the SAXPY version of matrix-vector multiplication.
SAXPY is an acronym from the Basic Linear Algebra Subprograms (BLAS) library.
2.2 Matrix-Matrix Multiplication
Matrix multiplication can be done in several ways, each representing a different
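access pattern for the matrices. Let $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$. The definition of
matrix multiplication is
\[
\mathbb{R}^{m \times n} \ni C = AB = (c_{ij}), \qquad
c_{ij} = \sum_{s=1}^{k} a_{is} b_{sj}, \quad i = 1, \dots, m, \quad j = 1, \dots, n. \tag{2.4}
\]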
Comparing with the definition of matrix-vector multiplication (2.1), we see that
in matrix multiplication each column vector in B is multiplied by A.
We can formulate (2.4) as a matrix multiplication code:
    C=zeros(m,n);    % C must be initialized before accumulation
    for i=1:m
        for j=1:n
            for s=1:k
                C(i,j)=C(i,j)+A(i,s)*B(s,j);
            end
        end
    end
This is an inner product version of matrix multiplication, which is emphasized in
the following equivalent code:
    for i=1:m
        for j=1:n
            C(i,j)=A(i,1:k)*B(1:k,j);
        end
    end
It is immediately seen that the loop variables can be permuted in 3! = 6 different
ways, and we can write a generic matrix multiplication code:
for …
for …
for …
C(i,j)=C(i,j)+A(i,s)*B(s,j)
end
end
end
A column-oriented (or SAXPY) version is given in
for j=1:n
for s=1:k
C(1:m,j)=C(1:m,j)+A(1:m,s)*B(s,j)
end
end
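The matrix A is accessed by columns and B by scalars. This access pattern can be
illustrated as
\[
\begin{pmatrix} \uparrow \\ | \\ | \\ \downarrow \end{pmatrix}
=
\begin{pmatrix}
\uparrow & \uparrow & \uparrow & \uparrow \\
| & | & | & | \\
| & | & | & | \\
\downarrow & \downarrow & \downarrow & \downarrow
\end{pmatrix}
\begin{pmatrix} \times \\ \times \\ \times \\ \times \end{pmatrix}.
\]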
In another permutation we let the s-loop be the outermost:
for s=1:k
for j=1:n
C(1:m,j)=C(1:m,j)+A(1:m,s)*B(s,j)
end
end
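This can be illustrated as follows. Let $a_{\cdot k}$ denote the column vectors of $A$ and let
$b_{k \cdot}^T$ denote the row vectors of $B$. Then matrix multiplication can be written as
\[
C = AB =
\begin{pmatrix} a_{\cdot 1} & a_{\cdot 2} & \dots & a_{\cdot k} \end{pmatrix}
\begin{pmatrix} b_{1 \cdot}^T \\ b_{2 \cdot}^T \\ \vdots \\ b_{k \cdot}^T \end{pmatrix}
= \sum_{s=1}^{k} a_{\cdot s} b_{s \cdot}^T. \tag{2.5}
\]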
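This is the outer product form of matrix multiplication. Remember that the outer
product follows the standard definition of matrix multiplication: let $x$ and $y$ be
column vectors in $\mathbb{R}^m$ and $\mathbb{R}^n$, respectively; then
\[
xy^T =
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}
\begin{pmatrix} y_1 & y_2 & \cdots & y_n \end{pmatrix}
=
\begin{pmatrix}
x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\
x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\
\vdots & \vdots & & \vdots \\
x_m y_1 & x_m y_2 & \cdots & x_m y_n
\end{pmatrix}
=
\begin{pmatrix} y_1 x & y_2 x & \cdots & y_n x \end{pmatrix}
=
\begin{pmatrix} x_1 y^T \\ x_2 y^T \\ \vdots \\ x_m y^T \end{pmatrix}.
\]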
Writing the matrix $C = AB$ in the outer product form (2.5) can be considered as
an expansion of $C$ in terms of simple matrices $a_{\cdot s} b_{s \cdot}^T$. We will later see that such
matrices have rank equal to 1.
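The inner product, column-oriented (SAXPY), and outer product forms are different orderings of the same three loops and must give the same matrix. A small sketch, with arbitrary integer test matrices of our own choosing so that the comparison is exact:

```matlab
% Three loop orderings for C = A*B.
m = 3;  k = 2;  n = 4;
A = reshape(1:m*k, m, k);      % arbitrary integer test data (assumption)
B = reshape(1:k*n, k, n);

% Inner product form: C(i,j) = (row i of A) * (column j of B).
C_inner = zeros(m,n);
for i = 1:m
    for j = 1:n
        C_inner(i,j) = A(i,:)*B(:,j);
    end
end

% Column-oriented (SAXPY) form: column j of C is a linear
% combination of the columns of A.
C_saxpy = zeros(m,n);
for j = 1:n
    for s = 1:k
        C_saxpy(:,j) = C_saxpy(:,j) + A(:,s)*B(s,j);
    end
end

% Outer product form (2.5): C is a sum of k rank-1 matrices.
C_outer = zeros(m,n);
for s = 1:k
    C_outer = C_outer + A(:,s)*B(s,:);
end

disp(C_outer);
```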
2.3 Inner Product and Vector Norms
In this section we will discuss briefly how to measure the “size” of a vector. The
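most common vector norms are
\[
\begin{aligned}
\| x \|_1 &= \sum_{i=1}^{n} |x_i|, && \text{1-norm}, \\
\| x \|_2 &= \sqrt{\sum_{i=1}^{n} x_i^2}, && \text{Euclidean norm (2-norm)}, \\
\| x \|_\infty &= \max_{1 \le i \le n} |x_i|, && \text{max-norm}.
\end{aligned}
\]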
The Euclidean vector norm is the generalization of the standard Euclidean distance
in $\mathbb{R}^3$ to $\mathbb{R}^n$. All three norms defined here are special cases of the p-norm:
\[ \| x \|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}. \]
Associated with the Euclidean vector norm is the inner product between two vectors
$x$ and $y$ in $\mathbb{R}^n$, which is defined by
\[ (x, y) = x^T y. \]
Generally, a vector norm is a mapping Rn → R with the properties
‖x ‖ ≥ 0 for all x,
‖x ‖ = 0 if and only if x = 0,
‖αx ‖ = |α| ‖x ‖, α ∈ R,
‖x + y ‖ ≤ ‖x ‖ + ‖ y ‖, the triangle inequality.
With norms we can introduce the concepts of continuity and error in approximations of vectors. Let $\bar{x}$ be an approximation of the vector $x$. Then for any given
vector norm, we define the absolute error
\[ \| \delta x \| = \| \bar{x} - x \| \]
and the relative error (assuming that $x \ne 0$)
\[ \frac{\| \delta x \|}{\| x \|} = \frac{\| \bar{x} - x \|}{\| x \|}. \]
In a finite dimensional vector space all vector norms are equivalent in the sense that
for any two norms $\| \cdot \|_\alpha$ and $\| \cdot \|_\beta$ there exist constants $m$ and $M$ such that
\[ m \| x \|_\alpha \le \| x \|_\beta \le M \| x \|_\alpha, \tag{2.6} \]
where $m$ and $M$ do not depend on $x$. For example, with $x \in \mathbb{R}^n$,
\[ \| x \|_2 \le \| x \|_1 \le \sqrt{n} \, \| x \|_2. \]
This equivalence implies that if a sequence of vectors $(x_i)_{i=1}^{\infty}$ converges to $x^*$ in one
norm,
\[ \lim_{i \to \infty} \| x_i - x^* \| = 0, \]
then it converges to the same limit in all norms.
In data mining applications it is common to use the cosine of the angle between
two vectors as a distance measure:
\[ \cos \theta(x, y) = \frac{x^T y}{\| x \|_2 \, \| y \|_2}. \]
With this measure two vectors are close if the cosine is close to one. Similarly, $x$
and $y$ are orthogonal if the angle between them is $\pi/2$, i.e., $x^T y = 0$.
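The norms and the cosine measure are easily evaluated with the MATLAB function norm. The sketch below uses two small vectors chosen by us for illustration; it also checks the equivalence inequality between the 1-norm and the 2-norm for this particular x.

```matlab
% Vector norms and the cosine measure for two illustrative vectors.
x = [3; -4; 0];
y = [4;  3; 0];

n1   = norm(x, 1);        % 1-norm: sum of absolute values
n2   = norm(x, 2);        % Euclidean norm
ninf = norm(x, inf);      % max-norm

% Equivalence of norms: ||x||_2 <= ||x||_1 <= sqrt(n)*||x||_2.
n = length(x);
equiv_holds = (n2 <= n1) && (n1 <= sqrt(n)*n2);

% Cosine of the angle between x and y; zero means orthogonal.
cos_xy = (x'*y) / (norm(x, 2)*norm(y, 2));

fprintf('%g %g %g  cos = %g\n', n1, n2, ninf, cos_xy);
```

Here x and y are orthogonal, so the cosine is zero, while the cosine between x and itself would be one.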
2.4 Matrix Norms
For any vector norm we can define a corresponding operator norm. Let $\| \cdot \|$ be a
vector norm. The corresponding matrix norm is defined as
\[ \| A \| = \sup_{x \ne 0} \frac{\| Ax \|}{\| x \|}. \]
One can show that such a matrix norm satisfies (for α ∈ R)
‖A ‖ ≥ 0 for all A,
‖A ‖ = 0 if and only if A = 0,
‖αA ‖ = |α| ‖A ‖, α ∈ R,
‖A + B ‖ ≤ ‖A ‖ + ‖B ‖, the triangle inequality.
For a matrix norm defined as above the following fundamental inequalities hold.
Proposition 2.1. Let ‖ · ‖ denote a vector norm and the corresponding matrix
norm. Then
‖Ax ‖ ≤ ‖A ‖ ‖x ‖,
‖AB ‖ ≤ ‖A ‖ ‖B ‖.
Proof. From the definition we have
\[ \frac{\| Ax \|}{\| x \|} \le \| A \| \]
for all $x \ne 0$, which gives the first inequality. The second is proved by using the
first twice for $\| ABx \|$.
One can show that the 2-norm satisfies
\[ \| A \|_2 = \left( \max_{1 \le i \le n} \lambda_i(A^T A) \right)^{1/2}, \]
i.e., the square root of the largest eigenvalue of the matrix $A^T A$. Thus it is a
comparatively heavy computation to obtain $\| A \|_2$ for a given matrix (of medium
or large dimensions). It is considerably easier to compute the matrix infinity norm
(for $A \in \mathbb{R}^{m \times n}$),
\[ \| A \|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{ij}|, \]
and the matrix 1-norm
\[ \| A \|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}|. \]
In Section 6.1 we will see that the 2-norm of a matrix has an explicit expression in
terms of the singular values of $A$.
Let $A \in \mathbb{R}^{m \times n}$. In some cases we will treat the matrix not as a linear operator
but rather as a point in a space of dimension $mn$, i.e., $\mathbb{R}^{mn}$. Then we can use the
Frobenius matrix norm, which is defined by
\[ \| A \|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}. \tag{2.7} \]
Sometimes it is practical to write the Frobenius norm in the equivalent form
\[ \| A \|_F^2 = \operatorname{tr}(A^T A), \tag{2.8} \]
where the trace of a matrix $B \in \mathbb{R}^{n \times n}$ is the sum of its diagonal elements,
\[ \operatorname{tr}(B) = \sum_{i=1}^{n} b_{ii}. \]
The Frobenius norm does not correspond to a vector norm, so it is not an operator
norm in that sense. This norm has the advantage that it is easier to compute than
the 2-norm. The Frobenius matrix norm is actually closely related to the Euclidean
vector norm in the sense that it is the Euclidean vector norm on the linear space
of matrices $\mathbb{R}^{m \times n}$, when the matrices are identified with elements in $\mathbb{R}^{mn}$.
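The expressions for the matrix norms can be compared with the values returned by the MATLAB function norm. The following sketch uses a small test matrix of our own choosing; the variable names are illustrative only.

```matlab
% Matrix norms: built-in values versus the formulas in the text.
A = [1 2; 3 4; 5 6];                 % illustrative 3-by-2 matrix

norm2     = norm(A, 2);
norm2_eig = sqrt(max(eig(A'*A)));    % square root of largest eigenvalue of A'*A

normInf     = norm(A, inf);
normInf_row = max(sum(abs(A), 2));   % largest absolute row sum

norm1     = norm(A, 1);
norm1_col = max(sum(abs(A), 1));     % largest absolute column sum

normF    = norm(A, 'fro');
normF_tr = sqrt(trace(A'*A));        % equation (2.8)

fprintf('%.4f %.4f %.4f %.4f\n', norm2, normInf, norm1, normF);
```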
2.5 Linear Independence: Bases
Given a set of vectors $(v_j)_{j=1}^{n}$ in $\mathbb{R}^m$, $m \ge n$, consider the set of linear combinations
\[ \operatorname{span}(v_1, v_2, \dots, v_n) = \left\{ y \;\Big|\; y = \sum_{j=1}^{n} \alpha_j v_j \right\} \]
for arbitrary coefficients $\alpha_j$. The vectors $(v_j)_{j=1}^{n}$ are called linearly independent
when
\[ \sum_{j=1}^{n} \alpha_j v_j = 0 \quad \text{if and only if} \quad \alpha_j = 0 \text{ for } j = 1, 2, \dots, n. \]
A set of $m$ linearly independent vectors in $\mathbb{R}^m$ is called a basis in $\mathbb{R}^m$: any vector
in $\mathbb{R}^m$ can be expressed as a linear combination of the basis vectors.
Proposition 2.2. Assume that the vectors $(v_j)_{j=1}^{n}$ are linearly dependent. Then
some $v_k$ can be written as a linear combination of the rest, $v_k = \sum_{j \ne k} \beta_j v_j$.
Proof. There exist coefficients $\alpha_j$ with some $\alpha_k \ne 0$ such that
\[ \sum_{j=1}^{n} \alpha_j v_j = 0. \]
Take an $\alpha_k \ne 0$ and write
\[ \alpha_k v_k = \sum_{j \ne k} -\alpha_j v_j, \]
which is the same as
\[ v_k = \sum_{j \ne k} \beta_j v_j \]
with $\beta_j = -\alpha_j / \alpha_k$.
If we have a set of linearly dependent vectors, then we can keep a linearly
independent subset and express the rest in terms of the linearly independent ones.
Thus we can consider the number of linearly independent vectors as a measure of the
information content of the set and compress the set accordingly: take the linearly
independent vectors as representatives (basis vectors) for the set, and compute the
coordinates of the rest in terms of the basis. However, in real applications we
seldom have exactly linearly dependent vectors but rather almost linearly dependent
vectors. It turns out that for such a data reduction procedure to be practical and
numerically stable, we need the basis vectors to be not only linearly independent
but orthogonal. We will come back to this in Chapter 4.
2.6 The Rank of a Matrix
The rank of a matrix is defined as the maximum number of linearly independent
column vectors. It is a standard result in linear algebra that the number of linearly
independent column vectors is equal to the number of linearly independent row
vectors.
We will see later that any matrix can be represented as an expansion of rank-1
matrices.
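Proposition 2.3. An outer product matrix $xy^T$, where $x$ and $y$ are nonzero vectors in $\mathbb{R}^n$,
has rank 1.
Proof.
\[
xy^T =
\begin{pmatrix} y_1 x & y_2 x & \cdots & y_n x \end{pmatrix}
=
\begin{pmatrix} x_1 y^T \\ x_2 y^T \\ \vdots \\ x_n y^T \end{pmatrix}.
\]
Thus all the columns of $xy^T$ are multiples of $x$ (and all the rows are multiples of $y^T$), so the columns (rows) are linearly dependent.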
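Proposition 2.3 can be illustrated numerically. In the sketch below the vectors are arbitrary nonzero test data of our own choosing; the MATLAB function rank estimates the rank from the singular values, so the check is numerical rather than exact.

```matlab
% An outer product of two nonzero vectors has rank 1.
x = [1; 2; 3];
y = [4; 5; 6];

X = x*y';                    % 3-by-3 outer product matrix
r = rank(X);                 % numerical rank

% Column j of X is y(j)*x, and row i of X is x(i)*y'.
col_ok = isequal(X(:,2), y(2)*x);
row_ok = isequal(X(3,:), x(3)*y');

fprintf('rank = %d\n', r);
```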
A square matrix A ∈ Rn×n with rank n is called nonsingular and has an
inverse A−1 satisfying
AA−1 = A−1A = I.
If we multiply linearly independent vectors by a nonsingular matrix, then the
vectors remain linearly independent.
Proposition 2.4. Assume that the vectors $v_1, \dots, v_p$ are linearly independent.
Then for any nonsingular matrix $T$, the vectors $Tv_1, \dots, Tv_p$ are linearly independent.
Proof. Obviously $\sum_{j=1}^{p} \alpha_j v_j = 0$ if and only if $\sum_{j=1}^{p} \alpha_j T v_j = 0$ (since we can
multiply any of the equations by $T$ or $T^{-1}$). Therefore the statement follows.
Chapter 3
Linear Systems and Least Squares
In this chapter we briefly review some facts about the solution of linear systems of
equations,
Ax = b, (3.1)
where A ∈ Rn×n is square and nonsingular. The linear system (3.1) can be solved
using Gaussian elimination with partial pivoting, which is equivalent to factorizing
the matrix as a product of triangular matrices.
We will also consider overdetermined linear systems, where the matrix $A \in \mathbb{R}^{m \times n}$
is rectangular with $m > n$, and their solution using the least squares method.
As we are giving the results only as background, we mostly state them without
proofs. For thorough presentations of the theory of matrix decompositions for
solving linear systems of equations, see, e.g., [42, 92].
Before discussing matrix decompositions, we state the basic result concerning
conditions for the existence of a unique solution of (3.1).
Proposition 3.1. Let A ∈ Rn×n and assume that A is nonsingular. Then for any
right-hand-side b, the linear system Ax = b has a unique solution.
Proof. The result is an immediate consequence of the fact that the column vectors
of a nonsingular matrix are linearly independent.
3.1 LU Decomposition
Gaussian elimination can be conveniently described using Gauss transformations,
and these transformations are the key elements in the equivalence between Gaussian
elimination and LU decomposition. More details on Gauss transformations can be
found in any textbook in numerical linear algebra; see, e.g., [42, p. 94]. In Gaussian
elimination with partial pivoting, the reordering of the rows is accomplished by
permutation matrices, which are identity matrices with the rows reordered; see,
e.g., [42, Section 3.4.1].
Consider an $n \times n$ matrix $A$. In the first step of Gaussian elimination with
partial pivoting, we reorder the rows of the matrix so that the element of largest
magnitude in the first column is moved to the (1, 1) position. This is equivalent to
multiplying $A$ from the left by a permutation matrix $P_1$. The elimination, i.e., the
zeroing of the elements in the first column below the diagonal, is then performed
by multiplying
\[ A^{(1)} := L_1^{-1} P_1 A, \tag{3.2} \]
where $L_1$ is a Gauss transformation
\[
L_1 = \begin{pmatrix} 1 & 0 \\ m_1 & I \end{pmatrix}, \qquad
m_1 = \begin{pmatrix} m_{21} \\ m_{31} \\ \vdots \\ m_{n1} \end{pmatrix}.
\]
The result of the first step of Gaussian elimination with partial pivoting is
\[
A^{(1)} =
\begin{pmatrix}
a'_{11} & a'_{12} & \dots & a'_{1n} \\
0 & a^{(1)}_{22} & \dots & a^{(1)}_{2n} \\
\vdots & \vdots & & \vdots \\
0 & a^{(1)}_{n2} & \dots & a^{(1)}_{nn}
\end{pmatrix}.
\]
The Gaussian elimination algorithm then proceeds by zeroing the elements of the
second column below the main diagonal (after moving the largest element to the
diagonal position), and so on.
From (3.2) we see that the first step of Gaussian elimination with partial
pivoting can be expressed as a matrix factorization. This is also true of the complete
procedure.
Theorem 3.2 (LU decomposition). Any nonsingular n × n matrix A can be
decomposed into
PA = LU,
where P is a permutation matrix, L is a lower triangular matrix with ones on the
main diagonal, and U is an upper triangular matrix.
Proof (sketch). The theorem can be proved by induction. From (3.2) we have
\[ P_1 A = L_1 A^{(1)}. \]
Define the $(n-1) \times (n-1)$ matrix
\[
B =
\begin{pmatrix}
a^{(1)}_{22} & \dots & a^{(1)}_{2n} \\
\vdots & & \vdots \\
a^{(1)}_{n2} & \dots & a^{(1)}_{nn}
\end{pmatrix}.
\]
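By an induction assumption, $B$ can be decomposed into
\[ P_B B = L_B U_B, \]
and we then see that $PA = LU$, where
\[
U = \begin{pmatrix} a'_{11} & a_2^T \\ 0 & U_B \end{pmatrix}, \qquad
L = \begin{pmatrix} 1 & 0 \\ P_B m_1 & L_B \end{pmatrix}, \qquad
P = \begin{pmatrix} 1 & 0 \\ 0 & P_B \end{pmatrix} P_1,
\]
and $a_2^T = (a'_{12} \; a'_{13} \; \dots \; a'_{1n})$.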
It is easy to show that the amount of work for computing the LU decomposition
is approximately $2n^3/3$ flops. In the kth step of Gaussian elimination, one operates
on an $(n-k+1) \times (n-k+1)$ submatrix, and for each element in that submatrix
one multiplication and one addition are performed. Thus the total number of flops
is
\[ 2 \sum_{k=1}^{n-1} (n-k+1)^2 \approx \frac{2n^3}{3}. \]
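In MATLAB the decomposition of Theorem 3.2 is available through the function lu. The following sketch, with a small nonsingular test matrix chosen by us, checks that the three-output form returns factors satisfying PA = LU, with L unit lower triangular and U upper triangular. With partial pivoting all multipliers, i.e., the elements of L, have magnitude at most one.

```matlab
% LU decomposition with partial pivoting: P*A = L*U.
A = [1 2 3; 4 5 6; 7 8 10];          % nonsingular test matrix (assumption)

[L, U, P] = lu(A);

residual  = norm(P*A - L*U, inf);    % should be of the order of round-off
L_is_unit = isequal(L, tril(L)) && all(diag(L) == 1);
U_is_up   = isequal(U, triu(U));
max_mult  = max(abs(L(:)));          % largest multiplier in magnitude

fprintf('residual = %.2e, max multiplier = %.3f\n', residual, max_mult);
```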
3.2 Symmetric, Positive Definite Matrices
The LU decomposition of a symmetric, positive definite matrix A can always be
computed without pivoting. In addition, it is possible to take advantage of symme-
try so that the decomposition becomes symmetric, too, and requires half as much
work as in the general case.
Theorem 3.3 ($LDL^T$ decomposition). Any symmetric, positive definite matrix
$A$ has a decomposition
\[ A = LDL^T, \]
where $L$ is lower triangular with ones on the main diagonal and $D$ is a diagonal
matrix with positive diagonal elements.
Example 3.4. The positive definite matrix
\[ A = \begin{pmatrix} 8 & 4 & 2 \\ 4 & 6 & 0 \\ 2 & 0 & 3 \end{pmatrix} \]
has the LU decomposition
\[
A = LU =
\begin{pmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ 0.25 & -0.25 & 1 \end{pmatrix}
\begin{pmatrix} 8 & 4 & 2 \\ 0 & 4 & -1 \\ 0 & 0 & 2.25 \end{pmatrix}
\]
and the $LDL^T$ decomposition
\[ A = LDL^T, \qquad D = \begin{pmatrix} 8 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 2.25 \end{pmatrix}. \]
The diagonal elements in $D$ are positive, and therefore we can put
\[
D^{1/2} =
\begin{pmatrix}
\sqrt{d_1} & & & \\
& \sqrt{d_2} & & \\
& & \ddots & \\
& & & \sqrt{d_n}
\end{pmatrix},
\]
and then we get
\[ A = LDL^T = (LD^{1/2})(D^{1/2}L^T) = U^T U, \]
where $U$ is an upper triangular matrix. This variant of the $LDL^T$ decomposition is
called the Cholesky decomposition.
Since A is symmetric, it is only necessary to store the main diagonal and the
elements above it, n(n + 1)/2 matrix elements in all. Exactly the same amount of
storage is needed for the LDLT and the Cholesky decompositions. It is also seen
that since only half as many elements as in the ordinary LU decomposition need to
be computed, the amount of work is also halved—approximately n3/3 flops. When
the LDLT decomposition is computed, it is not necessary to first compute the LU
decomposition, but the elements in L and D can be computed directly.
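The relation between the Cholesky factor and the factors L and D can be checked on the matrix of Example 3.4. The sketch below relies on the MATLAB function chol, which returns an upper triangular U with A = U'U; recovering L and D from U by scaling with the diagonal of U is our own illustration of the identity U = D^{1/2}L^T, not a recommended way of computing the LDL^T decomposition.

```matlab
% Cholesky and LDL^T decompositions of the matrix in Example 3.4.
A = [8 4 2; 4 6 0; 2 0 3];

U = chol(A);                  % upper triangular, A = U'*U

d = diag(U);                  % square roots of the diagonal elements of D
D = diag(d.^2);               % D = diag(8, 4, 2.25)
L = U' / diag(d);             % unit lower triangular: L = U' * D^(-1/2)

err_chol = norm(U'*U - A, inf);
err_ldl  = norm(L*D*L' - A, inf);

disp(L);  disp(D);
```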
3.3 Perturbation Theory and Condition Number
The condition number of a nonsingular matrix $A$ is defined as
\[ \kappa(A) = \| A \| \, \| A^{-1} \|, \]
where $\| \cdot \|$ denotes any operator norm. If we use a particular matrix norm, e.g., the
2-norm, then we write
\[ \kappa_2(A) = \| A \|_2 \, \| A^{-1} \|_2. \tag{3.3} \]
The condition number is used to quantify how much the solution of a linear system
Ax = b can change, when the matrix and the right-hand side are perturbed by a
small amount.
Theorem 3.5. Assume that A is nonsingular and that
\[ \| \delta A \| \, \| A^{-1} \| = r < 1. \]
Then the matrix $A + \delta A$ is nonsingular, and
\[ \| (A + \delta A)^{-1} \| \le \frac{\| A^{-1} \|}{1 - r}. \]
The solution of the perturbed system
\[ (A + \delta A) y = b + \delta b \]
satisfies
\[ \frac{\| y - x \|}{\| x \|} \le \frac{\kappa(A)}{1 - r} \left( \frac{\| \delta A \|}{\| A \|} + \frac{\| \delta b \|}{\| b \|} \right). \]
For a proof, see, for instance, [42, Theorem 2.7.2] or [50, Theorem 7.2].
A matrix with a large condition number is said to be ill-conditioned. Theorem 3.5 shows that a linear system with an ill-conditioned matrix is sensitive to perturbations in the data (i.e., the matrix and the right-hand side).
3.4 Rounding Errors in Gaussian Elimination
From Section 1.5.2, on rounding errors in floating point arithmetic, we know that any real number (representable in the floating point system) is represented with a relative error not exceeding the unit round-off $\mu$. This fact can also be stated
\[ fl[x] = x(1 + \epsilon), \qquad |\epsilon| \le \mu. \]
When representing the elements of a matrix $A$ and a vector $b$ in the floating point system, there arise errors:
\[ fl[a_{ij}] = a_{ij}(1 + \epsilon_{ij}), \qquad |\epsilon_{ij}| \le \mu, \]
and analogously for $b$. Therefore, we can write
\[ fl[A] = A + \delta A, \qquad fl[b] = b + \delta b, \]
where
\[ \| \delta A \|_\infty \le \mu \| A \|_\infty, \qquad \| \delta b \|_\infty \le \mu \| b \|_\infty. \]
If, for the moment, we assume that no further rounding errors arise during the solution of the system $Ax = b$, we see that $\hat{x}$ satisfies
\[ (A + \delta A) \hat{x} = b + \delta b. \]
This is an example of backward error analysis: the computed solution $\hat{x}$ is the exact solution of a perturbed problem.
Using perturbation theory, we can estimate the error in $\hat{x}$. From Theorem 3.5 we get
\[ \frac{\| \hat{x} - x \|_\infty}{\| x \|_\infty} \le \frac{\kappa_\infty(A)}{1 - r} \, 2\mu \]
(provided that $r = \mu \kappa_\infty(A) < 1$).
We can also analyze how rounding errors in Gaussian elimination affect the result. The following theorem holds. (For detailed error analyses of Gaussian elimination, see [50, Chapter 9] or [42, Chapters 3.3, 3.4].)
Theorem 3.6. Assume that we use a floating point system with unit round-off $\mu$. Let $\hat{L}$ and $\hat{R}$ be the triangular factors obtained from Gaussian elimination with partial pivoting, applied to the matrix $A$. Further, assume that $\hat{x}$ is computed using forward and back substitution:
\[ \hat{L} \hat{y} = Pb, \qquad \hat{R} \hat{x} = \hat{y}. \]
Then $\hat{x}$ is the exact solution of a system
\[ (A + \delta A) \hat{x} = b, \]
where
\[ \| \delta A \|_\infty \le k(n) \, g_n \, \mu \, \| A \|_\infty, \qquad g_n = \frac{\max_{i,j,k} \bigl| \hat{a}^{(k)}_{ij} \bigr|}{\max_{i,j} |a_{ij}|}, \]
$k(n)$ is a third-degree polynomial in $n$, and $\hat{a}^{(k)}_{ij}$ are the elements computed in step $k - 1$ of the elimination procedure.
We observe that $g_n$ depends on the growth of the matrix elements during the Gaussian elimination and not explicitly on the magnitude of the multipliers. $g_n$ can be computed, and in this way an a posteriori estimate of the rounding errors can be obtained.
A priori (in advance), one can show that $g_n \le 2^{n-1}$, and matrices can be constructed where in fact the element growth is that serious (note that $g_{31} = 2^{30} \approx 10^9$). In practice, however, $g_n$ is seldom larger than 8 in Gaussian elimination with partial pivoting.
It is important to note that there are classes of matrices for which there is no element growth during Gaussian elimination, i.e., $g_n = 1$, even if no pivoting is done. This is true, e.g., if $A$ is symmetric and positive definite.
In almost all cases, the estimate in the theorem is much too pessimistic with regard to the third-degree polynomial $k(n)$. In order to have equality, all rounding errors must be maximally large, and their accumulated effect must be maximally unfavorable.
We want to emphasize that the main objective of this type of a priori error analysis is not to give error estimates for the solution of linear systems but rather to expose potential instabilities of algorithms and provide a basis for comparing different algorithms. Thus, Theorem 3.6 demonstrates the main weakness of Gauss transformations as compared to the orthogonal transformations that we will introduce in Chapter 4: they can cause a large growth of the matrix elements, which, in turn, induces rounding errors.
3.5 Banded Matrices
In many situations, e.g., boundary value problems for ordinary and partial differential equations, matrices arise where a large proportion of the elements are equal to zero. If the nonzero elements are concentrated around the main diagonal, then the matrix is called a band matrix. More precisely, a matrix $A$ is said to be a band matrix if there are natural numbers $p$ and $q$ such that
\[ a_{ij} = 0 \quad \text{if } j - i > p \text{ or } i - j > q. \]
Example 3.7. Let $q = 2$, $p = 1$. Let $A$ be a band matrix of dimension 6:
\[
A =
\begin{pmatrix}
a_{11} & a_{12} & 0 & 0 & 0 & 0 \\
a_{21} & a_{22} & a_{23} & 0 & 0 & 0 \\
a_{31} & a_{32} & a_{33} & a_{34} & 0 & 0 \\
0 & a_{42} & a_{43} & a_{44} & a_{45} & 0 \\
0 & 0 & a_{53} & a_{54} & a_{55} & a_{56} \\
0 & 0 & 0 & a_{64} & a_{65} & a_{66}
\end{pmatrix}.
\]
$w = q + p + 1$ is called the bandwidth of the matrix. From the example, we
see that $w$ is the maximal number of nonzero elements in any row of $A$.
When storing a band matrix, we do not store the elements outside the band.
Likewise, when linear systems of equations are solved, one can take advantage of
the band structure to reduce the number of operations.
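We first consider the case $p = q = 1$. Such a band matrix is called tridiagonal.
Let
\[
A =
\begin{pmatrix}
\alpha_1 & \beta_1 & & & & \\
\gamma_2 & \alpha_2 & \beta_2 & & & \\
& \gamma_3 & \alpha_3 & \beta_3 & & \\
& & \ddots & \ddots & \ddots & \\
& & & \gamma_{n-1} & \alpha_{n-1} & \beta_{n-1} \\
& & & & \gamma_n & \alpha_n
\end{pmatrix}.
\]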
The matrix can be stored in three vectors. In the solution of a tridiagonal system
Ax = b, it is easy to utilize the structure; we first assume that A is diagonally
dominant, so that no pivoting is needed.
    % LU Decomposition of a Tridiagonal Matrix.
    % gamma(k+1) is overwritten by the multiplier, alpha by the diagonal of U.
    for k=1:n-1
        gamma(k+1)=gamma(k+1)/alpha(k);
        alpha(k+1)=alpha(k+1)-gamma(k+1)*beta(k);
    end
    % Forward Substitution for the Solution of Ly = b.
    y(1)=b(1);
    for k=2:n
        y(k)=b(k)-gamma(k)*y(k-1);
    end
    % Back Substitution for the Solution of Ux = y.
    x(n)=y(n)/alpha(n);
    for k=n-1:-1:1
        x(k)=(y(k)-beta(k)*x(k+1))/alpha(k);
    end
The number of operations (multiplications and additions) is approximately 3n,
and the number of divisions is 2n.
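The tridiagonal algorithm can be tested on a small diagonally dominant system. The sketch below is self-contained: the test data (diagonal 4, off-diagonals 1, right-hand side 1, ..., n) are our own choice, the elimination step uses the update alpha(k+1) − gamma(k+1)*beta(k), and the result is compared with the solution from the backslash operator applied to the full matrix.

```matlab
% Solve a diagonally dominant tridiagonal system stored in three vectors.
n = 5;
alpha = 4*ones(n,1);                 % main diagonal
beta  = [ones(n-1,1); 0];            % superdiagonal, beta(1:n-1) is used
gamma = [0; ones(n-1,1)];            % subdiagonal, gamma(2:n) is used
b     = (1:n)';                      % right-hand side (assumption)

% Full matrix, built only for checking the result.
A = diag(alpha) + diag(beta(1:n-1), 1) + diag(gamma(2:n), -1);

% LU decomposition without pivoting.
for k = 1:n-1
    gamma(k+1) = gamma(k+1)/alpha(k);             % multiplier
    alpha(k+1) = alpha(k+1) - gamma(k+1)*beta(k); % updated diagonal element
end

% Forward substitution, L*y = b.
y = zeros(n,1);
y(1) = b(1);
for k = 2:n
    y(k) = b(k) - gamma(k)*y(k-1);
end

% Back substitution, U*x = y.
x = zeros(n,1);
x(n) = y(n)/alpha(n);
for k = n-1:-1:1
    x(k) = (y(k) - beta(k)*x(k+1))/alpha(k);
end

res = norm(A*x - b, inf);
fprintf('residual = %.2e\n', res);
```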
In Gaussian elimination with partial pivoting, the bandwidth of the upper
triangular matrix increases. If A has bandwidth w = q + p + 1 (q diagonals under
the main diagonal and p over), then, with partial pivoting, the factor U will have
bandwidth wU = p + q + 1, i.e., up to p + q diagonals over the main diagonal
instead of p. It is easy to see that no new nonzero elements will be created in L.
The factors L and U in the LU decomposition of a band matrix A are band
matrices.
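Example 3.8. Let
\[
A =
\begin{pmatrix}
4 & 2 & & & \\
2 & 5 & 2 & & \\
& 2 & 5 & 2 & \\
& & 2 & 5 & 2 \\
& & & 2 & 5
\end{pmatrix}.
\]
$A$ has the Cholesky decomposition $A = U^T U$, where
\[
U =
\begin{pmatrix}
2 & 1 & & & \\
& 2 & 1 & & \\
& & 2 & 1 & \\
& & & 2 & 1 \\
& & & & 2
\end{pmatrix}.
\]
The inverse is
\[
A^{-1} = \frac{1}{2^{10}}
\begin{pmatrix}
341 & -170 & 84 & -40 & 16 \\
-170 & 340 & -168 & 80 & -32 \\
84 & -168 & 336 & -160 & 64 \\
-40 & 80 & -160 & 320 & -128 \\
16 & -32 & 64 & -128 & 256
\end{pmatrix},
\]
which is dense.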
It turns out that the inverse of a band matrix is usually a dense matrix.
Therefore, in most cases the inverse of a band matrix should not be computed
explicitly.
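The statements in Example 3.8 can be reproduced with a few lines of MATLAB. The sketch below builds the matrix of the example, computes its Cholesky factor with chol and its inverse with inv, and counts the nonzero elements; inv is used here only for illustration.

```matlab
% Example 3.8: banded Cholesky factor, dense inverse.
n = 5;
A = 5*eye(n) + 2*diag(ones(n-1,1), 1) + 2*diag(ones(n-1,1), -1);
A(1,1) = 4;

U = chol(A);                 % upper bidiagonal: 2 on the diagonal, 1 above it
Ainv = inv(A);               % computed only to show that it is dense

nnz_A    = nnz(A);           % 13 nonzero elements in the band
nnz_U    = nnz(U);           % 9 nonzero elements
nnz_Ainv = nnz(Ainv);        % all 25 elements are nonzero

S = round(2^10 * Ainv);      % integer matrix displayed in the example
disp(S);
```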
3.6 The Least Squares Problem
In this section we will introduce the least squares method and the solution of the
linear least squares problem using the normal equations. Other methods for solving
the least squares problem, based on orthogonal transformations, will be presented
in Chapters 5 and 6. We will also give a perturbation result for the least squares
problem in Section 6.6. For an extensive treatment of modern numerical methods
for linear least squares problems, see [14].
Example 3.9. Assume that we want to determine the elasticity properties of a
spring by attaching different weights to it and measuring its length. From Hooke’s
law we know that the length l depends on the force F according to
e + κF = l,
where e and κ are constants to be determined.⁵ Assume that we have performed
an experiment and obtained the following data:

    F    1      2      3      4      5
    l    7.97   10.2   14.2   16.0   21.2

The data are illustrated in Figure 3.1. As the measurements are subject to error,
we want to use all the data in order to minimize the influence of the errors. Thus
we are led to a system with more data than unknowns, an overdetermined system,
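\[
\begin{aligned}
e + \kappa \cdot 1 &= 7.97, \\
e + \kappa \cdot 2 &= 10.2, \\
e + \kappa \cdot 3 &= 14.2, \\
e + \kappa \cdot 4 &= 16.0, \\
e + \kappa \cdot 5 &= 21.2,
\end{aligned}
\]
or, in matrix form,
\[
\begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{pmatrix}
\begin{pmatrix} e \\ \kappa \end{pmatrix}
=
\begin{pmatrix} 7.97 \\ 10.2 \\ 14.2 \\ 16.0 \\ 21.2 \end{pmatrix}.
\]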
We will determine an approximation of the elasticity constant of the spring using
the least squares method.
⁵ In Hooke's law the spring constant is 1/κ.
Figure 3.1. Measured data in spring experiment (length l plotted against force F).
Let $A \in \mathbb{R}^{m \times n}$, $m > n$. The system
\[ Ax = b \]
is called overdetermined: it has more equations than unknowns. In general such a
system has no solution. This can be seen geometrically by letting $m = 3$ and $n = 2$,
i.e., we consider two vectors $a_{\cdot 1}$ and $a_{\cdot 2}$ in $\mathbb{R}^3$. We want to find a linear combination
of the vectors such that
\[ x_1 a_{\cdot 1} + x_2 a_{\cdot 2} = b. \]
In Figure 3.2 we see that usually such a problem has no solution. The two vectors
span a plane, and if the right-hand side $b$ is not in the plane, then there is no linear
combination of $a_{\cdot 1}$ and $a_{\cdot 2}$ such that $x_1 a_{\cdot 1} + x_2 a_{\cdot 2} = b$.
In this situation one obvious alternative to “solving the linear system” is to
make the vector r = b−x1a·1−x2a·2 = b−Ax as small as possible. b−Ax is called
the residual vector and is illustrated in Figure 3.2.
The solution of the problem depends on how we measure the length of the
residual vector. In the least squares method we use the standard Euclidean distance.
Thus we want to find a vector $x \in \mathbb{R}^n$ that solves the minimization problem
\[ \min_x \| b - Ax \|_2. \tag{3.4} \]
As the unknown x occurs linearly in (3.4), this is also referred to as the linear least
squares problem.
Figure 3.2. The least squares problem, m = 3 and n = 2. The residual vector b − Ax is dotted.
In the example we know immediately from our knowledge of distances in $\mathbb{R}^3$
that the distance between the tip of the vector $b$ and the plane is minimized if we
choose the linear combination of vectors in the plane in such a way that the residual
vector is orthogonal to the plane. Since the columns of the matrix $A$ span the plane,
we see that we get the solution by making $r$ orthogonal to the columns of $A$. This
geometric intuition is valid also in the general case:
\[ r^T a_{\cdot j} = 0, \quad j = 1, 2, \dots, n. \]
(See the definition of orthogonality in Section 2.3.) Equivalently, we can write
\[ r^T \begin{pmatrix} a_{\cdot 1} & a_{\cdot 2} & \cdots & a_{\cdot n} \end{pmatrix} = r^T A = 0. \]
Then, using $r = b - Ax$, we get the normal equations (the name is now obvious)
\[ A^T A x = A^T b \]
for determining the coefficients in $x$.
Theorem 3.10. If the column vectors of $A$ are linearly independent, then the
normal equations
\[ A^T A x = A^T b \]
are nonsingular and have a unique solution.
Proof. We first show that $A^T A$ is positive definite. Let $x$ be an arbitrary nonzero
vector. Then, from the definition of linear independence, we have $Ax \ne 0$. With
$y = Ax \in \mathbb{R}^m$, we then have
\[ x^T A^T A x = y^T y = \sum_{i=1}^{m} y_i^2 > 0, \]
which is equivalent to $A^T A$ being positive definite. Therefore, $A^T A$ is nonsingular,
and the normal equations have a unique solution, which we denote $\hat{x}$.
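Then, we show that $\hat{x}$ is the solution of the least squares problem, i.e., $\| \hat{r} \|_2 \le \| r \|_2$
for all $r = b - Ax$. We can write
\[ r = b - A\hat{x} + A(\hat{x} - x) = \hat{r} + A(\hat{x} - x) \]
and
\[
\begin{aligned}
\| r \|_2^2 = r^T r &= (\hat{r} + A(\hat{x} - x))^T (\hat{r} + A(\hat{x} - x)) \\
&= \hat{r}^T \hat{r} + \hat{r}^T A(\hat{x} - x) + (\hat{x} - x)^T A^T \hat{r} + (\hat{x} - x)^T A^T A (\hat{x} - x).
\end{aligned}
\]
Since $A^T \hat{r} = 0$, the two terms in the middle are equal to zero, and we get
\[ \| r \|_2^2 = \hat{r}^T \hat{r} + (\hat{x} - x)^T A^T A (\hat{x} - x) = \| \hat{r} \|_2^2 + \| A(\hat{x} - x) \|_2^2 \ge \| \hat{r} \|_2^2, \]
which was to be proved.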
Example 3.11. We can now solve the example given at the beginning of the
section (Example 3.9). We have
\[
A = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{pmatrix}, \qquad
b = \begin{pmatrix} 7.97 \\ 10.2 \\ 14.2 \\ 16.0 \\ 21.2 \end{pmatrix}.
\]
Using MATLAB we then get
    >> C=A'*A          % Normal equations
    C =  5   15
        15   55
    >> x=C\(A'*b)
    x = 4.2360
        3.2260
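The computation in Example 3.11 can be repeated as a script. In the sketch below the variable names are our own; besides the normal equations we also use the backslash operator applied directly to the rectangular matrix, which in MATLAB returns a least squares solution computed with an orthogonal factorization rather than via A'*A. We also check that the residual is orthogonal to the columns of A.

```matlab
% Least squares fit of the spring data in Example 3.9.
F = (1:5)';
l = [7.97; 10.2; 14.2; 16.0; 21.2];
A = [ones(5,1) F];

x_ne = (A'*A) \ (A'*l);       % solution of the normal equations
x_bs = A \ l;                 % least squares solution via backslash

r = l - A*x_ne;               % residual vector
orth = norm(A'*r, inf);       % should vanish up to round-off

fprintf('e = %.4f, kappa = %.4f\n', x_ne(1), x_ne(2));
```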
Solving the linear least squares problem using the normal equations has two
significant drawbacks:
1. Forming $A^T A$ can lead to loss of information.
2. The condition number of $A^T A$ is the square of that of $A$:
\[ \kappa(A^T A) = (\kappa(A))^2. \]
We illustrate these points in a couple of examples.
Example 3.12. Let $\epsilon$ be small, and define the matrix
\[ A = \begin{pmatrix} 1 & 1 \\ \epsilon & 0 \\ 0 & \epsilon \end{pmatrix}. \]
It follows that
\[ A^T A = \begin{pmatrix} 1 + \epsilon^2 & 1 \\ 1 & 1 + \epsilon^2 \end{pmatrix}. \]
If $\epsilon$ is so small that the floating point representation of $1 + \epsilon^2$ satisfies $fl[1 + \epsilon^2] = 1$,
then in floating point arithmetic the normal equations become singular. Thus vital
information that is present in $A$ is lost in forming $A^T A$.
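The loss of information can be observed in IEEE double precision. In the following sketch the value 1e-9 for ε is our own choice; its square, 1e-18, is below the unit round-off, so the computed matrix A'*A consists of ones only, although A itself has two clearly independent columns. The MATLAB function rank is used as a numerical indicator.

```matlab
% Example 3.12 in double precision: A has rank 2, the computed A'*A has rank 1.
ep = 1e-9;                    % assumed value; ep^2 is below the unit round-off

A = [1 1; ep 0; 0 ep];
C = A'*A;                     % computed normal equations matrix

lost  = (1 + ep^2 == 1);      % true: fl[1 + ep^2] = 1
rankA = rank(A);              % numerical rank of A
rankC = rank(C);              % numerical rank of the computed A'*A

disp(C);
fprintf('rank(A) = %d, rank(A''*A) = %d\n', rankA, rankC);
```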
The condition number of a rectangular matrix A is defined using the singular
value decomposition of A. We will state a result on the conditioning of the least
squares problem in Section 6.6.
Example 3.13. We compute the condition number of the matrix in Example 3.9
using MATLAB:
A = 1 1
1 2
1 3
1 4
1 5
cond(A) = 8.3657
cond(A'*A) = 69.9857
Then we assume that we have a linear model
l(x) = c0 + c1x
with data vector $x = (101 \; 102 \; 103 \; 104 \; 105)^T$. This gives a data matrix with large
condition number:
A = 1 101
1 102
1 103
1 104
1 105
cond(A) = 7.5038e+03
cond(A'*A) = 5.6307e+07
If instead we use the model
l(x) = b0 + b1(x− 103),
the corresponding normal equations become diagonal and much better conditioned
(demonstrate this).
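The demonstration asked for can be sketched as follows, using the data vector from the example. Shifting the data by 103, their mean, makes the second column of the data matrix orthogonal to the column of ones, so the normal equations matrix becomes diagonal.

```matlab
% Condition numbers for the original and the centered linear model.
x = (101:105)';

A1 = [ones(5,1) x];           % model l(x) = c0 + c1*x
A2 = [ones(5,1) x-103];       % model l(x) = b0 + b1*(x - 103)

C1 = A1'*A1;
C2 = A2'*A2;                  % diagonal, since sum(x - 103) = 0

cond1 = cond(C1);
cond2 = cond(C2);

disp(C2);
fprintf('cond(A1''*A1) = %.4e, cond(A2''*A2) = %.4f\n', cond1, cond2);
```

For these data the centered normal equations matrix is diag(5, 10), with condition number 2.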
It occurs quite often that one has a sequence of least squares problems with
the same matrix,
$$\min_{x_i} \| Ax_i - b_i \|_2, \qquad i = 1, 2, \ldots, p,$$
with solutions
$$x_i = (A^TA)^{-1}A^Tb_i, \qquad i = 1, 2, \ldots, p.$$
Defining $X = (x_1 \;\; x_2 \;\; \ldots \;\; x_p)$ and $B = (b_1 \;\; b_2 \;\; \ldots \;\; b_p)$ we can write this in
matrix form
$$\min_X \| AX - B \|_F \tag{3.5}$$
with the solution
$$X = (A^TA)^{-1}A^TB.$$
This follows from the identity
$$\| AX - B \|_F^2 = \sum_{i=1}^{p} \| Ax_i - b_i \|_2^2$$
and the fact that the p subproblems in (3.5) are independent.
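The matrix formulation can be tried out numerically. In the sketch below the second right-hand side b2 is a hypothetical vector made up for illustration, and the backslash operator is used instead of the normal equations.

```matlab
A = [ones(5,1) (1:5)'];
b1 = [7.97; 10.2; 14.2; 16.0; 21.2];
b2 = 2*b1 + 1;                 % hypothetical second right-hand side
B = [b1 b2];
X = A\B;                       % all p = 2 problems in one call
X1 = [A\b1, A\b2];             % the same problems solved one by one
err = norm(X - X1, 'fro');     % expected to be of the order of round-off
resF2 = norm(A*X - B, 'fro')^2;
res2 = norm(A*X(:,1) - b1)^2 + norm(A*X(:,2) - b2)^2;
disp([err abs(resF2 - res2)])
```

The last two quantities illustrate the identity between the squared Frobenius norm and the sum of the squared residual norms of the subproblems.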
Chapter 4
Orthogonality
Even if the Gaussian elimination procedure for solving linear systems of equations
and normal equations is a standard algorithm with widespread use in numerous
applications, it is not sufficient in situations when one needs to separate the most
important information from less important information (“noise”). The typical linear
algebra formulation of “data quality” is to quantify the concept of “good and bad
basis vectors”; loosely speaking, good basis vectors are those that are “very linearly
independent,” i.e., close to orthogonal. In the same vein, vectors that are almost
linearly dependent are bad basis vectors. In this chapter we will introduce some
theory and algorithms for computations with orthogonal vectors. A more complete
quantification of the “quality” of a set of vectors is given in Chapter 6.
Example 4.1. In Example 3.13 we saw that an unsuitable choice of basis vectors
in the least squares problem led to ill-conditioned normal equations. Along similar
lines, define the two matrices
$$
A = \begin{pmatrix} 1 & 1.05 \\ 1 & 1 \\ 1 & 0.95 \end{pmatrix}, \qquad
B = \begin{pmatrix} 1 & 1/\sqrt{2} \\ 1 & 0 \\ 1 & -1/\sqrt{2} \end{pmatrix},
$$
whose columns are plotted in Figure 4.1. It can be shown that the column vectors
of the two matrices span the same plane in R3. From the figure it is clear that
the columns of B, which are orthogonal, determine the plane much better than the
columns of A, which are quite close.
From several points of view, it is advantageous to use orthogonal vectors as
basis vectors in a vector space. In this chapter we will list some important properties
of orthogonal sets of vectors and orthogonal matrices. We assume that the vectors
are in Rm with m ≥ n.
Figure 4.1. Three vectors spanning a plane in R3.
4.1 Orthogonal Vectors and Matrices
We first recall that two nonzero vectors $x$ and $y$ are called orthogonal if $x^Ty = 0$
(i.e., $\cos\theta(x, y) = 0$).
Proposition 4.2. Let $q_j$, $j = 1, 2, \ldots, n$, be orthogonal, i.e., $q_i^Tq_j = 0$, $i \neq j$. Then
they are linearly independent.
Proof. Assume they are linearly dependent. Then from Proposition 2.2 there exists
a $q_k$ such that
$$q_k = \sum_{j \neq k} \alpha_j q_j.$$
Multiplying this equation by $q_k^T$ we get
$$q_k^Tq_k = \sum_{j \neq k} \alpha_j q_k^Tq_j = 0,$$
since the vectors are orthogonal. This is a contradiction, since $q_k \neq 0$ implies $q_k^Tq_k > 0$.
Let the set of orthogonal vectors $q_j$, $j = 1, 2, \ldots, m$, in $\mathbb{R}^m$ be normalized,
$$\| q_j \|_2 = 1.$$
Then they are called orthonormal, and they constitute an orthonormal basis in $\mathbb{R}^m$.
A square matrix
$$Q = \begin{pmatrix} q_1 & q_2 & \cdots & q_m \end{pmatrix} \in \mathbb{R}^{m \times m}$$
whose columns are orthonormal is called an orthogonal matrix. Orthogonal matrices
satisfy a number of important properties that we list in a sequence of propositions.
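Proposition 4.3. An orthogonal matrix $Q$ satisfies $Q^TQ = I$.
Proof.
$$
Q^TQ = \begin{pmatrix} q_1 & q_2 & \cdots & q_m \end{pmatrix}^T \begin{pmatrix} q_1 & q_2 & \cdots & q_m \end{pmatrix}
= \begin{pmatrix} q_1^T \\ q_2^T \\ \vdots \\ q_m^T \end{pmatrix} \begin{pmatrix} q_1 & q_2 & \cdots & q_m \end{pmatrix}
$$
$$
= \begin{pmatrix}
q_1^Tq_1 & q_1^Tq_2 & \cdots & q_1^Tq_m \\
q_2^Tq_1 & q_2^Tq_2 & \cdots & q_2^Tq_m \\
\vdots & \vdots & & \vdots \\
q_m^Tq_1 & q_m^Tq_2 & \cdots & q_m^Tq_m
\end{pmatrix}
= \begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix},
$$
due to orthonormality.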
The orthogonality of its columns implies that an orthogonal matrix has full
rank, and it is trivial to find the inverse.
Proposition 4.4. An orthogonal matrix $Q \in \mathbb{R}^{m \times m}$ has rank $m$, and, since $Q^TQ = I$,
its inverse is equal to $Q^{-1} = Q^T$.
Proposition 4.5. The rows of an orthogonal matrix are orthogonal, i.e., $QQ^T = I$.
Proof. Let $x$ be an arbitrary vector. We shall show that $QQ^Tx = x$. Given $x$
there is a uniquely determined vector $y$, such that $Qy = x$, since $Q^{-1}$ exists. Then
$$QQ^Tx = QQ^TQy = Qy = x.$$
Since $x$ is arbitrary, it follows that $QQ^T = I$.
Proposition 4.6. The product of two orthogonal matrices is orthogonal.
Proof. Let $Q$ and $P$ be orthogonal, and put $X = PQ$. Then
$$X^TX = (PQ)^TPQ = Q^TP^TPQ = Q^TQ = I.$$
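Propositions 4.3 to 4.6 can be checked numerically. In the following sketch the orthogonal matrices are obtained from the built-in qr function applied to two arbitrarily chosen nonsingular matrices; this choice of test matrices is an assumption made for illustration.

```matlab
[Q,~] = qr([1 2 3; 4 5 6; 7 8 10]);   % an orthogonal 3 x 3 matrix
[P,~] = qr([2 0 1; 1 3 0; 0 1 4]);    % another orthogonal 3 x 3 matrix
I3 = eye(3);
e1 = norm(Q'*Q - I3);      % Proposition 4.3
e2 = norm(inv(Q) - Q');    % Proposition 4.4
e3 = norm(Q*Q' - I3);      % Proposition 4.5
X = P*Q;
e4 = norm(X'*X - I3);      % Proposition 4.6
disp([e1 e2 e3 e4])        % expected to be of the order of the unit round-off
```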
Any orthonormal basis of a subspace of Rm can be enlarged to an orthonormal
basis of the whole space. The next proposition shows this in matrix terms.
Proposition 4.7. Given a matrix $Q_1 \in \mathbb{R}^{m \times k}$, with orthonormal columns, there
exists a matrix $Q_2 \in \mathbb{R}^{m \times (m-k)}$ such that $Q = (Q_1 \;\; Q_2)$ is an orthogonal matrix.
This proposition is a standard result in linear algebra. We will later demon-
strate how Q can be computed.
One of the most important properties of orthogonal matrices is that they
preserve the length of a vector.
Proposition 4.8. The Euclidean length of a vector is invariant under an orthogonal
transformation $Q$.
Proof. $\| Qx \|_2^2 = (Qx)^TQx = x^TQ^TQx = x^Tx = \| x \|_2^2$.
Also the corresponding matrix norm and the Frobenius norm are invariant
under orthogonal transformations.
Proposition 4.9. Let $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ be orthogonal. Then for any
$A \in \mathbb{R}^{m \times n}$,
$$\| UAV \|_2 = \| A \|_2,$$
$$\| UAV \|_F = \| A \|_F.$$
Proof. The first equality is easily proved using Proposition 4.8. The second is
proved using the alternative expression (2.8) for the Frobenius norm and the identity
$\mathrm{tr}(BC) = \mathrm{tr}(CB)$.
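The norm invariance stated in Propositions 4.8 and 4.9 can be illustrated by a small computation. The test matrices below are arbitrary choices made for this sketch.

```matlab
A = [1 2; 3 4; 5 6];
[U,~] = qr([1 2 3; 4 5 6; 7 8 10]);   % orthogonal 3 x 3 matrix
[V,~] = qr([2 1; 1 3]);               % orthogonal 2 x 2 matrix
x = [1; -2; 2];
dx = abs(norm(U*x) - norm(x));                  % Proposition 4.8
d2 = abs(norm(U*A*V) - norm(A));                % 2-norm, Proposition 4.9
dF = abs(norm(U*A*V,'fro') - norm(A,'fro'));    % Frobenius norm
disp([dx d2 dF])    % expected to be of the order of the unit round-off
```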
4.2 Elementary Orthogonal Matrices
We will use elementary orthogonal matrices to reduce matrices to compact form.
For instance, we will transform a matrix $A \in \mathbb{R}^{m \times n}$, $m \ge n$, to triangular form.
4.2.1 Plane Rotations
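A 2 × 2 plane rotation matrix⁶
$$G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \qquad c^2 + s^2 = 1,$$
is orthogonal. Multiplication of a vector $x$ by $G$ rotates the vector in a clockwise
direction by an angle $\theta$, where $c = \cos\theta$ and $s = \sin\theta$. A plane rotation can be used to zero the
second element of a vector $x$ by choosing $c = x_1/\sqrt{x_1^2 + x_2^2}$ and $s = x_2/\sqrt{x_1^2 + x_2^2}$:
$$
\frac{1}{\sqrt{x_1^2 + x_2^2}}
\begin{pmatrix} x_1 & x_2 \\ -x_2 & x_1 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} \sqrt{x_1^2 + x_2^2} \\ 0 \end{pmatrix}.
$$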
By embedding a two-dimensional rotation in a larger unit matrix, one can manip-
ulate vectors and matrices of arbitrary dimension.
⁶ In the numerical literature, plane rotations are often called Givens rotations, after Wallace
Givens, who used them for eigenvalue computations around 1960. However, they had been used
long before that by Jacobi, also for eigenvalue computations.
Example 4.10. We can choose $c$ and $s$ in
$$
G = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & c & 0 & s \\
0 & 0 & 1 & 0 \\
0 & -s & 0 & c
\end{pmatrix}
$$
so that we zero element 4 in a vector $x \in \mathbb{R}^4$ by a rotation in plane (2, 4). Execution
of the MATLAB script
x=[1;2;3;4];
sq=sqrt(x(2)^2+x(4)^2);
c=x(2)/sq; s=x(4)/sq;
G=[1 0 0 0; 0 c 0 s; 0 0 1 0; 0 -s 0 c];
y=G*x
gives the result
y = 1.0000
4.4721
3.0000
0
Using a sequence of plane rotations, we can now transform an arbitrary vector
to a multiple of a unit vector. This can be done in several ways. We demonstrate
one in the following example.
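Example 4.11. Given a vector $x \in \mathbb{R}^4$, we transform it to $\kappa e_1$. First, by a rotation
$G_3$ in the plane (3, 4) we zero the last element:
$$
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & c_1 & s_1 \\
0 & 0 & -s_1 & c_1
\end{pmatrix}
\begin{pmatrix} \times \\ \times \\ \times \\ \times \end{pmatrix}
= \begin{pmatrix} \times \\ \times \\ \ast \\ 0 \end{pmatrix}.
$$
Then, by a rotation $G_2$ in the plane (2, 3) we zero the element in position 3:
$$
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & c_2 & s_2 & 0 \\
0 & -s_2 & c_2 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \times \\ \times \\ \times \\ 0 \end{pmatrix}
= \begin{pmatrix} \times \\ \ast \\ 0 \\ 0 \end{pmatrix}.
$$
Finally, the second element is annihilated by a rotation $G_1$:
$$
\begin{pmatrix}
c_3 & s_3 & 0 & 0 \\
-s_3 & c_3 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \times \\ \times \\ 0 \\ 0 \end{pmatrix}
= \begin{pmatrix} \kappa \\ 0 \\ 0 \\ 0 \end{pmatrix}.
$$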
According to Proposition 4.8 the Euclidean length is preserved, and therefore we
know that $\kappa = \| x \|_2$.
We summarize the transformations. We have
$$\kappa e_1 = G_1(G_2(G_3x)) = (G_1G_2G_3)x.$$
Since the product of orthogonal matrices is orthogonal (Proposition 4.6) the matrix
$P = G_1G_2G_3$ is orthogonal, and the overall result is $Px = \kappa e_1$.
Plane rotations are very flexible and can be used efficiently for problems with
a sparsity structure, e.g., band matrices. On the other hand, for dense matrices
they require more flops than Householder transformations; see Section 4.3.
Example 4.12. In the MATLAB example earlier in this section we explicitly
embedded the 2 × 2 rotation in a matrix of larger dimension. This is a waste of operations,
since the computer execution of the code does not take into account the fact that
only two rows of the matrix are changed. Instead the whole matrix multiplication
is performed, which requires $2n^3$ flops in the case of matrices of dimension $n$. The
following two MATLAB functions illustrate how the rotation should be implemented
to save operations (and storage):
function [c,s]=rot(x,y);
% Construct a plane rotation that zeros the second
% component in the vector [x;y] (x and y are scalars)
sq=sqrt(x^2 + y^2);
c=x/sq; s=y/sq;
function X=approt(c,s,i,j,X);
% Apply a plane rotation in plane (i,j)
% to a matrix X
X([i,j],:)=[c s; -s c]*X([i,j],:);
The following script reduces the vector x to a multiple of the standard basis vec-
tor e1:
x=[1;2;3;4];
for i=3:-1:1
[c,s]=rot(x(i),x(i+1));
x=approt(c,s,i,i+1,x);
end
>> x = 5.4772
0
0
0
After the reduction the first component of x is equal to ‖x ‖2.
4.2.2 Householder Transformations
Let $v \neq 0$ be an arbitrary vector, and put
$$P = I - \frac{2}{v^Tv} vv^T;$$
$P$ is symmetric and orthogonal (verify this by a simple computation!). Such matrices
are called reflection matrices or Householder transformations. Let $x$ and $y$ be given
vectors of the same length, $\| x \|_2 = \| y \|_2$, and ask the question, "Can we determine
a Householder transformation $P$ such that $Px = y$?"
The equation $Px = y$ can be written
$$x - \frac{2v^Tx}{v^Tv} v = y,$$
which is of the form $\beta v = x - y$. Since $v$ enters $P$ in such a way that a factor $\beta$
cancels, we can choose $\beta = 1$. With $v = x - y$ we get
$$v^Tv = x^Tx + y^Ty - 2x^Ty = 2(x^Tx - x^Ty),$$
since $x^Tx = y^Ty$. Further,
$$v^Tx = x^Tx - y^Tx = \frac{1}{2} v^Tv.$$
Therefore we have
$$Px = x - \frac{2v^Tx}{v^Tv} v = x - v = y,$$
as we wanted. In matrix computations we often want to zero elements in a vector
and we now choose $y = \kappa e_1$, where $\kappa = \pm \| x \|_2$, and $e_1^T = (1 \;\; 0 \;\; \cdots \;\; 0)$. The
vector $v$ should be taken equal to
$$v = x - \kappa e_1.$$
In order to avoid cancellation (i.e., the subtraction of two close floating point numbers),
we choose $\mathrm{sign}(\kappa) = -\mathrm{sign}(x_1)$. Now that we have computed $v$, we can
simplify and write
$$P = I - \frac{2}{v^Tv} vv^T = I - 2uu^T, \qquad u = \frac{1}{\| v \|_2} v.$$
Thus the Householder vector $u$ has length 1. The computation of the Householder
vector can be implemented in the following MATLAB code:
function u=househ(x)
% Compute the Householder vector u such that
% (I - 2 u * u')x = k*e_1, where
% |k| is equal to the euclidean norm of x
% and e_1 is the first unit vector
n=length(x); % Number of components in x
kap=norm(x); v=zeros(n,1);
s=sign(x(1)); if s==0, s=1; end % sign(0)=0 would give a wrong v(1)
v(1)=x(1)+s*kap;
v(2:n)=x(2:n);
u=(1/norm(v))*v;
In most cases one should avoid forming the Householder matrix $P$ explicitly,
since it can be represented much more compactly by the vector $u$. Multiplication
by $P$ should be done according to $Px = x - (2u^Tx)u$, where the matrix-vector
multiplication requires $4n$ flops (instead of $O(n^2)$ if $P$ were formed explicitly). The
matrix multiplication $PX$ is done
$$PX = X - 2u(u^TX). \tag{4.1}$$
Multiplication by a Householder transformation is implemented in the following
code:
function Y=apphouse(u,X);
% Multiply the matrix X by a Householder matrix
% Y = (I - 2 * u * u') * X
Y=X-2*u*(u'*X);
Example 4.13. The last three elements of the vector $x = (1 \;\; 2 \;\; 3 \;\; 4)^T$ are
zeroed by the following sequence of MATLAB statements:
>> x=[1; 2; 3; 4];
>> u=househ(x);
>> y=apphouse(u,x)
y = -5.4772
0
0
0
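The claims that P is symmetric and orthogonal, and that the implicit multiplication agrees with the explicit one, can be verified with a short script. This is only a sketch: the Householder vector is computed inline with the same sign choice as in househ, and the matrix P is formed explicitly for checking purposes, which one would avoid in practice.

```matlab
x = [1; 2; 3; 4];
n = length(x);
v = x;
v(1) = x(1) + sign(x(1))*norm(x);   % same sign choice as in househ (x(1) is nonzero here)
u = v/norm(v);
P = eye(n) - 2*(u*u');              % explicit matrix, for checking only
symErr = norm(P - P');              % symmetry
ortErr = norm(P'*P - eye(n));       % orthogonality
y1 = P*x;                           % explicit multiplication
y2 = x - 2*u*(u'*x);                % implicit multiplication, cf. (4.1)
disp([symErr ortErr norm(y1 - y2)])
```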
As plane rotations can be embedded in unit matrices, in order to apply the
transformation in a structured way, we similarly can embed Householder transformations.
Assume, for instance, that we have transformed the first column in a
matrix to a multiple of a unit vector and that we then want to zero all the elements in the second
column below the main diagonal. Thus, in an example with a 5 × 4 matrix, we want
to compute the transformation
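$$
P_2A^{(1)} = P_2
\begin{pmatrix}
\times & \times & \times & \times \\
0 & \times & \times & \times \\
0 & \times & \times & \times \\
0 & \times & \times & \times \\
0 & \times & \times & \times
\end{pmatrix}
= \begin{pmatrix}
\times & \times & \times & \times \\
0 & \times & \times & \times \\
0 & 0 & \times & \times \\
0 & 0 & \times & \times \\
0 & 0 & \times & \times
\end{pmatrix}
=: A^{(2)}. \tag{4.2}
$$
Partition the second column of $A^{(1)}$ as follows:
$$\begin{pmatrix} a_{12}^{(1)} \\ a_{\cdot 2}^{(1)} \end{pmatrix},$$
where $a_{12}^{(1)}$ is a scalar. We know how to transform $a_{\cdot 2}^{(1)}$ to a unit vector; let $\hat{P}_2$ be a
Householder transformation that does this. Then the transformation (4.2) can be
implemented by embedding $\hat{P}_2$ in a unit matrix:
$$P_2 = \begin{pmatrix} 1 & 0 \\ 0 & \hat{P}_2 \end{pmatrix}. \tag{4.3}$$
It is obvious that $P_2$ leaves the first row of $A^{(1)}$ unchanged and computes the
transformation (4.2). Also, it is easy to see that the newly created zeros in the first
column are not destroyed.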
Similar to the case of plane rotations, one should not explicitly embed a House-
holder transformation in an identity matrix of larger dimension. Instead one should
apply it to the rows (in the case of multiplication from the left) that are affected in
the transformation.
Example 4.14. The transformation in (4.2) is done by the following statements.
u=househ(A(2:m,2)); A(2:m,2:n)=apphouse(u,A(2:m,2:n));
>> A = -0.8992 -0.6708 -0.7788 -0.9400
-0.0000 0.3299 0.7400 0.3891
-0.0000 0.0000 -0.1422 -0.6159
-0.0000 -0.0000 0.7576 0.1632
-0.0000 -0.0000 0.3053 0.4680
4.3 Number of Floating Point Operations
We shall compare the number of flops to transform the first column of an $m \times n$
matrix $A$ to a multiple of a unit vector $\kappa e_1$ using plane and Householder transformations.
Consider first plane rotations. Obviously, the computation of
$$
\begin{pmatrix} c & s \\ -s & c \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} cx + sy \\ -sx + cy \end{pmatrix}
$$
requires four multiplications and two additions, i.e., six flops. Applying such a
transformation to an $m \times n$ matrix requires $6n$ flops. In order to zero all elements
but one in the first column of the matrix, we apply $m - 1$ rotations. Thus the overall
flop count is $6(m - 1)n \approx 6mn$.
If the corresponding operation is performed by a Householder transformation
as in (4.1), then only $4mn$ flops are needed. (Note that multiplication by 2 is not
a flop, as it is implemented by the compiler as a shift, which is much faster than a
flop; alternatively one can scale the vector $u$ by $\sqrt{2}$.)
4.4 Orthogonal Transformations in Floating Point
Arithmetic
Orthogonal transformations are very stable in floating point arithmetic. For instance,
it can be shown [50, p. 367] that a computed Householder transformation
in floating point $\hat{P}$ that approximates an exact $P$ satisfies
$$\| P - \hat{P} \|_2 = O(\mu),$$
where $\mu$ is the unit round-off of the floating point system. We also have the backward
error result
$$\mathrm{fl}(\hat{P}A) = P(A + E), \qquad \| E \|_2 = O(\mu \| A \|_2).$$
Thus the floating point result is equal to the product of the exact orthogonal matrix
and a data matrix that has been perturbed by a very small amount. Analogous
results hold for plane rotations.
Chapter 5
QR Decomposition
One of the main themes of this book is decomposition of matrices to compact (e.g.,
triangular or diagonal) form by orthogonal transformations. We will now introduce
the first such decomposition, the QR decomposition, which is a factorization of a
matrix A in a product of an orthogonal matrix and a triangular matrix. This is
more ambitious than computing the LU decomposition, where the two factors are
both required only to be triangular.
5.1 Orthogonal Transformation to Triangular Form
By a sequence of Householder transformations⁷ we can transform any matrix
$A \in \mathbb{R}^{m \times n}$, $m \ge n$,
$$A \longrightarrow Q^TA = \begin{pmatrix} R \\ 0 \end{pmatrix}, \qquad R \in \mathbb{R}^{n \times n},$$
where $R$ is upper triangular and $Q \in \mathbb{R}^{m \times m}$ is orthogonal. The procedure can be
conveniently illustrated using a matrix of small dimension. Let $A \in \mathbb{R}^{5 \times 4}$. In the
first step we zero the elements below the main diagonal in the first column,
$$
H_1A = H_1
\begin{pmatrix}
\times & \times & \times & \times \\
\times & \times & \times & \times \\
\times & \times & \times & \times \\
\times & \times & \times & \times \\
\times & \times & \times & \times
\end{pmatrix}
= \begin{pmatrix}
+ & + & + & + \\
0 & + & + & + \\
0 & + & + & + \\
0 & + & + & + \\
0 & + & + & +
\end{pmatrix},
$$
where +'s denote elements that have changed in the transformation. The orthogonal
matrix $H_1$ can be taken equal to a Householder transformation. In the second step
we use an embedded Householder transformation as in (4.3) to zero the elements
below the main diagonal in the second column:
$$
H_2
\begin{pmatrix}
\times & \times & \times & \times \\
0 & \times & \times & \times \\
0 & \times & \times & \times \\
0 & \times & \times & \times \\
0 & \times & \times & \times
\end{pmatrix}
= \begin{pmatrix}
\times & \times & \times & \times \\
0 & + & + & + \\
0 & 0 & + & + \\
0 & 0 & + & + \\
0 & 0 & + & +
\end{pmatrix}.
$$
Again, on the right-hand side, +'s denote elements that have been changed in
the transformation, and ×'s denote elements that are unchanged in the present
transformation.
In the third step we annihilate elements below the diagonal in the third column:
$$
H_3
\begin{pmatrix}
\times & \times & \times & \times \\
0 & \times & \times & \times \\
0 & 0 & \times & \times \\
0 & 0 & \times & \times \\
0 & 0 & \times & \times
\end{pmatrix}
= \begin{pmatrix}
\times & \times & \times & \times \\
0 & \times & \times & \times \\
0 & 0 & + & + \\
0 & 0 & 0 & + \\
0 & 0 & 0 & +
\end{pmatrix}.
$$
After the fourth step we have computed the upper triangular matrix $R$. The sequence
of transformations is summarized
$$Q^TA = \begin{pmatrix} R \\ 0 \end{pmatrix}, \qquad Q^T = H_4H_3H_2H_1.$$
⁷ In this chapter we use Householder transformations, but analogous algorithms can be formulated
in terms of plane rotations.
Note that the matrices $H_i$ have the following structure (here we assume that
$A \in \mathbb{R}^{m \times n}$):
$$H_1 = I - 2u_1u_1^T, \qquad u_1 \in \mathbb{R}^m,$$
$$H_2 = \begin{pmatrix} 1 & 0 \\ 0 & P_2 \end{pmatrix}, \qquad P_2 = I - 2u_2u_2^T, \qquad u_2 \in \mathbb{R}^{m-1}, \tag{5.1}$$
$$H_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & P_3 \end{pmatrix}, \qquad P_3 = I - 2u_3u_3^T, \qquad u_3 \in \mathbb{R}^{m-2},$$
etc. Thus we embed the Householder transformations of successively smaller dimension
in identity matrices, and the vectors $u_i$ become shorter in each step. It is easy
to see that the matrices $H_i$ are also Householder transformations. For instance,
$$H_3 = I - 2u^{(3)}u^{(3)T}, \qquad u^{(3)} = \begin{pmatrix} 0 \\ 0 \\ u_3 \end{pmatrix}.$$
The transformation to triangular form is equivalent to a decomposition of the
matrix $A$.
Theorem 5.1 (QR decomposition). Any matrix $A \in \mathbb{R}^{m \times n}$, $m \ge n$, can be
transformed to upper triangular form by an orthogonal matrix. The transformation
is equivalent to a decomposition
$$A = Q \begin{pmatrix} R \\ 0 \end{pmatrix},$$
where $Q \in \mathbb{R}^{m \times m}$ is orthogonal and $R \in \mathbb{R}^{n \times n}$ is upper triangular. If the columns
of $A$ are linearly independent, then $R$ is nonsingular.
Proof. The constructive procedure outlined in the preceding example can easily be
adapted to the general case, under the provision that if the vector to be transformed
to a unit vector is a zero vector, then the orthogonal transformation is chosen equal
to the identity.
The linear independence of the columns of
$$\begin{pmatrix} R \\ 0 \end{pmatrix}$$
follows from Proposition 2.4. Since $R$ is upper triangular, the linear independence
implies that its diagonal elements are nonzero. (If one column had a zero on the
diagonal, then it would be a linear combination of those to the left of it.) Thus, the
determinant of $R$ is nonzero, which means that $R$ is nonsingular.
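The constructive procedure can be sketched in a few lines of MATLAB. This is an illustrative, unoptimized version and not the implementation behind the built-in qr function; the Householder vectors are computed inline (with the same sign choice as in househ), and the explicit accumulation of Q is included only so that the result can be checked.

```matlab
A = [1 1 1; 1 2 4; 1 3 9; 1 4 16];
[m,n] = size(A);
R = A; Q = eye(m);
for k = 1:min(n, m-1)
    x = R(k:m,k);
    if norm(x) == 0, continue; end         % zero vector: use the identity
    s = sign(x(1)); if s == 0, s = 1; end
    v = x; v(1) = x(1) + s*norm(x);
    u = v/norm(v);                         % Householder vector of length m-k+1
    R(k:m,k:n) = R(k:m,k:n) - 2*u*(u'*R(k:m,k:n));   % apply H_k from the left
    Q(:,k:m) = Q(:,k:m) - 2*(Q(:,k:m)*u)*u';         % accumulate Q = H_1*H_2*...*H_k
end
R = triu(R);                               % remove round-off below the diagonal
disp(norm(Q*R - A))                        % expected to be of the order of round-off
```

Only the rows and columns that are affected by each embedded transformation are updated, in line with the recommendation in Section 4.2.2.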
We illustrate the QR decomposition symbolically in Figure 5.1.
Figure 5.1. Symbolic illustration of the QR decomposition: $A$ ($m \times n$) is the product of $Q$ ($m \times m$) and the $m \times n$ matrix with $R$ on top of a zero block.
Quite often it is convenient to write the decomposition in an alternative way,
where the only part of Q that is kept corresponds to an orthogonalization of the
columns of A; see Figure 5.2.
This thin QR decomposition can be derived by partitioning $Q = (Q_1 \;\; Q_2)$,
where $Q_1 \in \mathbb{R}^{m \times n}$, and noting that in the multiplication the block $Q_2$ is multiplied
by zero:
$$A = (Q_1 \;\; Q_2) \begin{pmatrix} R \\ 0 \end{pmatrix} = Q_1R. \tag{5.2}$$
Figure 5.2. Thin QR decomposition $A = Q_1R$, with $A$ ($m \times n$), $Q_1$ ($m \times n$), and $R$ ($n \times n$).
It is seen from this equation that $\mathcal{R}(A) = \mathcal{R}(Q_1)$; thus we have now computed an
orthogonal basis of the range space $\mathcal{R}(A)$. Furthermore, if we write out column $j$
in (5.2),
$$a_j = Q_1r_j = \sum_{i=1}^{j} r_{ij}q_i,$$
we see that column $j$ in $R$ holds the coordinates of $a_j$ in the orthogonal basis.
Example 5.2. We give a simple numerical illustration of the computation of a QR
decomposition in MATLAB:
A = 1 1 1
1 2 4
1 3 9
1 4 16
>> [Q,R]=qr(A)
Q =-0.5000 0.6708 0.5000 0.2236
-0.5000 0.2236 -0.5000 -0.6708
-0.5000 -0.2236 -0.5000 0.6708
-0.5000 -0.6708 0.5000 -0.2236
R =-2.0000 -5.0000 -15.0000
0 -2.2361 -11.1803
0 0 2.0000
0 0 0
The thin QR decomposition is obtained by the command qr(A,0):
>> [Q,R]=qr(A,0)
Q =-0.5000 0.6708 0.5000
-0.5000 0.2236 -0.5000
-0.5000 -0.2236 -0.5000
-0.5000 -0.6708 0.5000
R =-2.0000 -5.0000 -15.0000
0 -2.2361 -11.1803
0 0 2.0000
5.2 Solving the Least Squares Problem
Using the QR decomposition, we can solve the least squares problem
$$\min_x \| b - Ax \|_2, \tag{5.3}$$
where $A \in \mathbb{R}^{m \times n}$, $m \ge n$, without forming the normal equations. To do this we use
the fact that the Euclidean vector norm is invariant under orthogonal transformations
(Proposition 4.8):
$$\| Qy \|_2 = \| y \|_2.$$
Introducing the QR decomposition of $A$ in the residual vector, we get
$$
\| r \|_2^2 = \| b - Ax \|_2^2
= \left\| b - Q \begin{pmatrix} R \\ 0 \end{pmatrix} x \right\|_2^2
= \left\| Q \left( Q^Tb - \begin{pmatrix} R \\ 0 \end{pmatrix} x \right) \right\|_2^2
= \left\| Q^Tb - \begin{pmatrix} R \\ 0 \end{pmatrix} x \right\|_2^2.
$$
Then we partition $Q = (Q_1 \;\; Q_2)$, where $Q_1 \in \mathbb{R}^{m \times n}$, and denote
$$Q^Tb = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} := \begin{pmatrix} Q_1^Tb \\ Q_2^Tb \end{pmatrix}.$$
Now we can write
$$
\| r \|_2^2 = \left\| \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} - \begin{pmatrix} Rx \\ 0 \end{pmatrix} \right\|_2^2
= \| b_1 - Rx \|_2^2 + \| b_2 \|_2^2. \tag{5.4}
$$
Under the assumption that the columns of $A$ are linearly independent, we can solve
$$Rx = b_1$$
and minimize $\| r \|_2$ by making the first term in (5.4) equal to zero. We now have
proved the following theorem.
Theorem 5.3 (least squares solution by QR decomposition). Let the matrix
$A \in \mathbb{R}^{m \times n}$ have full column rank and thin QR decomposition $A = Q_1R$. Then the
least squares problem $\min_x \| Ax - b \|_2$ has the unique solution
$$x = R^{-1}Q_1^Tb.$$
Example 5.4. As an example we solve the least squares problem from the begin-
ning of Section 3.6. The matrix and right-hand side are
A = 1 1 b = 7.9700
1 2 10.2000
1 3 14.2000
1 4 16.0000
1 5 21.2000
with thin QR decomposition and least squares solution
>> [Q1,R]=qr(A,0) % thin QR
Q1 = -0.4472 -0.6325
-0.4472 -0.3162
-0.4472 0.0000
-0.4472 0.3162
-0.4472 0.6325
R = -2.2361 -6.7082
0 3.1623
>> x=R\(Q1'*b)
x = 4.2360
    3.2260
Note that the MATLAB statement x=A\b gives the same result, using exactly the
same algorithm.
5.3 Computing or Not Computing Q
The orthogonal matrix $Q$ can be computed at the same time as $R$ by applying the
transformations to the identity matrix. Similarly, $Q_1$ in the thin QR decomposition
can be computed by applying the transformations to the partial identity matrix
$$\begin{pmatrix} I_n \\ 0 \end{pmatrix}.$$
However, in many situations we do not need $Q$ explicitly. Instead, it may be sufficient
to apply the same sequence of Householder transformations. Due to the
structure of the embedded Householder matrices exhibited in (5.1), the vectors that
are used to construct the Householder transformations for reducing the matrix $A$ to
upper triangular form can be stored below the main diagonal in $A$, in the positions
that were made equal to zero. An extra vector is then needed to store the elements
on the diagonal of $R$.
In the solution of the least squares problem (5.3) there is no need to compute
$Q$ at all. By adjoining the right-hand side to the matrix, we compute
$$
\begin{pmatrix} A & b \end{pmatrix} \to Q^T \begin{pmatrix} A & b \end{pmatrix}
= \begin{pmatrix} Q_1^T \\ Q_2^T \end{pmatrix} \begin{pmatrix} A & b \end{pmatrix}
= \begin{pmatrix} R & Q_1^Tb \\ 0 & Q_2^Tb \end{pmatrix},
$$
and the least squares solution is obtained by solving $Rx = Q_1^Tb$. We also see that
$$
\min_x \| Ax - b \|_2 = \min_x \left\| \begin{pmatrix} Rx - Q_1^Tb \\ Q_2^Tb \end{pmatrix} \right\|_2 = \| Q_2^Tb \|_2.
$$
Thus the norm of the optimal residual is obtained as a by-product of the triangularization
procedure.
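A possible way to try this in MATLAB is sketched below, using the data of the introductory example. Note one deviation from the formula above, made here as an assumption for convenience: the built-in qr function is applied to the adjoined matrix (A b), so one extra Householder step is performed on the last column, which compresses the block $Q_2^Tb$ into a single element whose absolute value equals $\| Q_2^Tb \|_2$.

```matlab
A = [ones(5,1) (1:5)'];
b = [7.97; 10.2; 14.2; 16.0; 21.2];
n = size(A,2);
[~,T] = qr([A b]);               % triangular factor of the adjoined matrix
x = T(1:n,1:n)\T(1:n,n+1);       % solve R*x = Q1'*b
rho = abs(T(n+1,n+1));           % norm of the optimal residual
check = norm(A*x - b);           % residual norm computed directly
disp([x' rho check])
```

The orthogonal factor is never used in this computation; it is discarded by the `~` output placeholder.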
5.4 Flop Count for QR Factorization
As shown in Section 4.3, applying a Householder transformation to an $m \times n$ matrix
to zero the elements of the first column below the diagonal requires approximately
$4mn$ flops. In the following transformation, only rows 2 to $m$ and columns 2 to $n$
are changed (see (5.1)), and the dimension of the submatrix that is affected by the
transformation is reduced by one in each step. Therefore the number of flops for
computing $R$ is approximately
$$4 \sum_{k=0}^{n-1} (m - k)(n - k) \approx 2mn^2 - \frac{2n^3}{3}.$$
Then the matrix $Q$ is available in factored form, as a product of Householder transformations.
If we compute explicitly the full matrix $Q$, then in step $k + 1$ we need
$4(m - k)m$ flops, which leads to a total of
$$4 \sum_{k=0}^{n-1} (m - k)m \approx 4mn \left( m - \frac{n}{2} \right).$$
It is possible to take advantage of structure in the accumulation of $Q$ to reduce the
flop count somewhat [42, Section 5.1.6].
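The quality of the two approximations can be checked by evaluating the sums directly. The dimensions m = 1000 and n = 100 in the following sketch are hypothetical values chosen for illustration.

```matlab
m = 1000; n = 100;               % hypothetical dimensions
k = 0:n-1;
fR = 4*sum((m-k).*(n-k));        % exact sum for computing R
fRa = 2*m*n^2 - 2*n^3/3;         % approximation from the text
fQ = 4*sum((m-k)*m);             % exact sum for accumulating the full Q
fQa = 4*m*n*(m - n/2);           % approximation from the text
relR = abs(fR - fRa)/fR;
relQ = abs(fQ - fQa)/fQ;
disp([relR relQ])                % relative deviations of the approximations
```

For these dimensions both approximations agree with the sums to within about one percent; the neglected terms are of lower order in m and n.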
5.5 Error in the Solution of the Least Squares
Problem
As we stated in Section 4.4, Householder transformations and plane rotations have
excellent properties with respect to floating point rounding errors. Here we give a
theorem, the proof of which can be found in [50, Theorem 19.3].
Theorem 5.5. Assume that $A \in \mathbb{R}^{m \times n}$, $m \ge n$, has full column rank and that the
least squares problem $\min_x \| Ax - b \|_2$ is solved using QR factorization by Householder
transformations. Then the computed solution $\hat{x}$ is the exact least squares
solution of
$$\min_x \| (A + \Delta A)x - (b + \delta b) \|_2,$$
where
$$\| \Delta A \|_F \le c_1 mn\mu \| A \|_F + O(\mu^2), \qquad \| \delta b \|_2 \le c_2 mn\mu \| b \|_2 + O(\mu^2),$$
and $c_1$ and $c_2$ are small constants.
It is seen that, in the sense of backward errors, the solution is as good as can
be hoped for (i.e., the method is backward stable). Using the perturbation theory
in Section 6.6 one can estimate the forward error in the computed solution. In
Section 3.6 we suggested that the normal equations method for solving the least
squares problem has less satisfactory properties in floating point arithmetic. It can
be shown that that method is not backward stable, unless the matrix A is well-
conditioned. The pros and cons of the two methods are nicely summarized in [50,
p. 399]. Here we give an example that, although somewhat extreme, demonstrates
that for certain least squares problems the solution given by the method of normal
equations can be much less accurate than that produced using a QR decomposition;
cf. Example 3.12.
Example 5.6. Let $\epsilon = 10^{-7}$, and consider the matrix
$$A = \begin{pmatrix} 1 & 1 \\ \epsilon & 0 \\ 0 & \epsilon \end{pmatrix}.$$
The condition number of $A$ is of the order $10^7$. The following MATLAB script
x=[1;1]; b=A*x;
xq=A\b; % QR decomposition
xn=(A'*A)\(A'*b); % Normal equations
[xq xn]
gave the result
1.00000000000000 1.01123595505618
1.00000000000000 0.98876404494382
which shows that the normal equations method is suffering from the fact that the
condition number of the matrix $A^TA$ is the square of that of $A$.
5.6 Updating the Solution of a Least Squares
Problem
In some applications the rows of A and the corresponding elements of b are measured
in real time. Let us call one row and the element of b an observation. Every
time an observation arrives, a new least squares solution is to be computed. If
we were to recompute the solution from scratch, it would cost O(mn2) flops for
each new observation. This is too costly in most situations and is an unnecessarily
heavy computation, since the least squares solution can be computed by updating
the QR decomposition in O(n2) flops every time a new observation is available.
Furthermore, the updating algorithm does not require that we save the orthogonal
matrix Q!
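Assume that we have reduced the matrix and the right-hand side
$$
\begin{pmatrix} A & b \end{pmatrix} \to Q^T \begin{pmatrix} A & b \end{pmatrix}
= \begin{pmatrix} R & Q_1^Tb \\ 0 & Q_2^Tb \end{pmatrix}, \tag{5.5}
$$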
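from which the least squares solution is readily available. Assume that we have not
saved $Q$. Then let a new observation be denoted $(a^T \;\; \beta)$, where $a \in \mathbb{R}^n$ and $\beta$ is a
scalar. We then want to find the solution of the augmented least squares problem
$$
\min_x \left\| \begin{pmatrix} A \\ a^T \end{pmatrix} x - \begin{pmatrix} b \\ \beta \end{pmatrix} \right\|. \tag{5.6}
$$
In terms of the new matrix, we can write the reduction (5.5) in the form
$$
\begin{pmatrix} A & b \\ a^T & \beta \end{pmatrix} \to
\begin{pmatrix} Q^T & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} A & b \\ a^T & \beta \end{pmatrix}
= \begin{pmatrix} R & Q_1^Tb \\ 0 & Q_2^Tb \\ a^T & \beta \end{pmatrix}.
$$
Therefore, we can find the solution of the augmented least squares problem (5.6) if
we reduce
$$\begin{pmatrix} R & Q_1^Tb \\ 0 & Q_2^Tb \\ a^T & \beta \end{pmatrix}$$
to triangular form by a sequence of orthogonal transformations. The vector $Q_2^Tb$
will play no part in this reduction, and therefore we exclude it from the derivation.
We will now show how to perform the reduction to triangular form using a
sequence of plane rotations. The ideas of the algorithm will be illustrated using a
small example with $n = 4$. We start with
$$
\begin{pmatrix} R & b_1 \\ a^T & \beta \end{pmatrix} =
\begin{pmatrix}
\times & \times & \times & \times & \times \\
 & \times & \times & \times & \times \\
 & & \times & \times & \times \\
 & & & \times & \times \\
 & & & & \times \\
\times & \times & \times & \times & \times
\end{pmatrix}.
$$
By a rotation in the $(1, n + 1)$ plane, we zero the first element of the bottom row
vector. The result is
$$
\begin{pmatrix}
+ & + & + & + & + \\
 & \times & \times & \times & \times \\
 & & \times & \times & \times \\
 & & & \times & \times \\
 & & & & \times \\
0 & + & + & + & +
\end{pmatrix},
$$
where +'s denote elements that have been changed in the present transformation.
Then the second element of the bottom vector is zeroed by a rotation in $(2, n + 1)$:
$$
\begin{pmatrix}
\times & \times & \times & \times & \times \\
 & + & + & + & + \\
 & & \times & \times & \times \\
 & & & \times & \times \\
 & & & & \times \\
0 & 0 & + & + & +
\end{pmatrix}.
$$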
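Note that the zero that was introduced in the previous step is not destroyed. After
three more analogous steps, the final result is achieved:
$$
\begin{pmatrix} \tilde{R} & \tilde{b}_1 \\ 0 & \tilde{\beta} \end{pmatrix} =
\begin{pmatrix}
\times & \times & \times & \times & \times \\
 & \times & \times & \times & \times \\
 & & \times & \times & \times \\
 & & & \times & \times \\
 & & & & \times
\end{pmatrix}.
$$
A total of $n$ rotations are needed to compute the reduction. The least squares
solution of (5.6) is now obtained by solving $\tilde{R}x = \tilde{b}_1$.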
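The updating procedure can be sketched as follows. The new observation (a, beta) is a hypothetical data point made up for illustration, the rotations are computed inline in the same way as in the function rot of Example 4.12, and the solution recomputed from scratch is included only for comparison.

```matlab
A = [ones(5,1) (1:5)'];
b = [7.97; 10.2; 14.2; 16.0; 21.2];
n = size(A,2);
[Q1,R] = qr(A,0); b1 = Q1'*b;      % starting point: R and Q1'*b (Q is not kept)
a = [1; 6]; beta = 23.5;           % hypothetical new observation
T = [R b1; a' beta];               % (n+1) x (n+1) array to be triangularized
for k = 1:n
    sq = sqrt(T(k,k)^2 + T(n+1,k)^2);
    if sq == 0, continue; end      % nothing to rotate
    c = T(k,k)/sq; s = T(n+1,k)/sq;
    T([k,n+1],:) = [c s; -s c]*T([k,n+1],:);   % rotation in plane (k,n+1)
end
x = T(1:n,1:n)\T(1:n,n+1);         % solve the updated triangular system
xref = [A; a']\[b; beta];          % recomputed from scratch, for comparison
disp([x xref])
```

Each rotation touches only two rows of length at most n + 1, so the n rotations cost O(n^2) flops in total, in agreement with the count stated at the beginning of this section.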
Chapter 6
Singular Value Decomposition
Even if the QR decomposition is very useful for solving least squares problems
and has excellent stability properties, it has the drawback that it treats the rows
and columns of the matrix differently: it gives a basis only for the column space.
The singular value decomposition (SVD) deals with the rows and columns in a
symmetric fashion, and therefore it supplies more information about the matrix. It
also “orders” the information contained in the matrix so that, loosely speaking, the
“dominating part” becomes visible. This is the property that makes the SVD so
useful in data mining and many other areas.
6.1 The Decomposition
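Theorem 6.1 (SVD). Any $m \times n$ matrix $A$, with $m \ge n$, can be factorized
$$A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T, \tag{6.1}$$
where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal, and $\Sigma \in \mathbb{R}^{n \times n}$ is diagonal,
$$\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n),$$
$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0.$$
Proof. The assumption $m \ge n$ is no restriction: in the other case, just apply
the theorem to $A^T$. We give a proof along the lines of that in [42]. Consider the
maximization problem
$$\sup_{\| x \|_2 = 1} \| Ax \|_2.$$
Since we are seeking the supremum of a continuous function over a closed and bounded set, the
supremum is attained for some vector $x$. Put $Ax = \sigma_1 y$, where $\| y \|_2 = 1$ and
$\sigma_1 = \| A \|_2$ (by definition). Using Proposition 4.7 we can construct orthogonal
matrices
$$Z_1 = (y \;\; \bar{Z}_2) \in \mathbb{R}^{m \times m}, \qquad W_1 = (x \;\; \bar{W}_2) \in \mathbb{R}^{n \times n}.$$
Then
$$
Z_1^TAW_1 = \begin{pmatrix} \sigma_1 & y^TA\bar{W}_2 \\ 0 & \bar{Z}_2^TA\bar{W}_2 \end{pmatrix},
$$
since $y^TAx = \sigma_1$, and $\bar{Z}_2^TAx = \sigma_1 \bar{Z}_2^Ty = 0$. Put
$$A_1 = Z_1^TAW_1 = \begin{pmatrix} \sigma_1 & w^T \\ 0 & B \end{pmatrix}.$$
Then
$$
\frac{1}{\sigma_1^2 + w^Tw} \left\| A_1 \begin{pmatrix} \sigma_1 \\ w \end{pmatrix} \right\|_2^2
= \frac{1}{\sigma_1^2 + w^Tw} \left\| \begin{pmatrix} \sigma_1^2 + w^Tw \\ Bw \end{pmatrix} \right\|_2^2
\ge \sigma_1^2 + w^Tw.
$$
But $\| A_1 \|_2^2 = \| Z_1^TAW_1 \|_2^2 = \sigma_1^2$; therefore $w = 0$ must hold. Thus we have taken
one step toward a diagonalization of $A$. The proof is now completed by induction.
Assume that
$$B = Z_2 \begin{pmatrix} \Sigma_2 \\ 0 \end{pmatrix} W_2^T, \qquad \Sigma_2 = \mathrm{diag}(\sigma_2, \ldots, \sigma_n).$$
Then we have
$$
A = Z_1 \begin{pmatrix} \sigma_1 & 0 \\ 0 & B \end{pmatrix} W_1^T
= Z_1 \begin{pmatrix} 1 & 0 \\ 0 & Z_2 \end{pmatrix}
\begin{pmatrix} \sigma_1 & 0 \\ 0 & \Sigma_2 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & W_2^T \end{pmatrix} W_1^T.
$$
Thus, by defining
$$
U = Z_1 \begin{pmatrix} 1 & 0 \\ 0 & Z_2 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix}, \qquad
V = W_1 \begin{pmatrix} 1 & 0 \\ 0 & W_2 \end{pmatrix},
$$
the theorem is proved.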
The columns of U and V are called singular vectors and the diagonal elements
σi singular values.
We emphasize at this point that not only is this an important theoretical
result, but also there are very efficient and accurate algorithms for computing the
SVD; see Section 6.8.
The SVD appears in other scientific areas under different names. In statistics
and data analysis, the singular vectors are closely related to principal components
(see Section 6.4), and in image processing the SVD goes by the name Karhunen–
Loewe expansion.
We illustrate the SVD symbolically:
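[Symbolic illustration: $A$ ($m \times n$) is the product of $U$ ($m \times m$), the $m \times n$ matrix with $\Sigma$ on top of a zero block, and $V^T$.]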
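With the partitioning $U = (U_1 \;\; U_2)$, where $U_1 \in \mathbb{R}^{m \times n}$, we get the thin SVD,
$$A = U_1 \Sigma V^T,$$
illustrated symbolically,
[Symbolic illustration: $A$ ($m \times n$) is the product of $U_1$ ($m \times n$), the diagonal matrix $\Sigma$ ($n \times n$), and $V^T$.]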
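If we write out the matrix equations
$$AV = U_1\Sigma, \qquad A^TU_1 = V\Sigma$$
column by column, we get the equivalent equations
$$Av_i = \sigma_i u_i, \qquad A^Tu_i = \sigma_i v_i, \qquad i = 1, 2, \ldots, n.$$
The SVD can also be written as an expansion of the matrix:
$$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T. \tag{6.2}$$
This is usually called the outer product form, and it is derived by starting from the
thin version:
$$
A = U_1 \Sigma V^T =
\begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix}
\begin{pmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_n \end{pmatrix}
\begin{pmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_n^T \end{pmatrix}
$$
$$
= \begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix}
\begin{pmatrix} \sigma_1 v_1^T \\ \sigma_2 v_2^T \\ \vdots \\ \sigma_n v_n^T \end{pmatrix}
= \sum_{i=1}^{n} \sigma_i u_i v_i^T.
$$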
The outer product form of the SVD is illustrated as
$$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots.$$
Example 6.2. We compute the SVD of a matrix with full column rank:
A = 1 1
1 2
1 3
1 4
>> [U,S,V]=svd(A)
U = 0.2195 -0.8073 0.0236 0.5472
0.3833 -0.3912 -0.4393 -0.7120
0.5472 0.0249 0.8079 -0.2176
0.7110 0.4410 -0.3921 0.3824
S = 5.7794 0
0 0.7738
0 0
0 0
V = 0.3220 -0.9467
0.9467 0.3220
The thin version of the SVD is
>> [U,S,V]=svd(A,0)
U = 0.2195 -0.8073
0.3833 -0.3912
0.5472 0.0249
0.7110 0.4410
S = 5.7794 0
0 0.7738
V = 0.3220 -0.9467
0.9467 0.3220
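The relations derived above can be checked numerically on the matrix of Example 6.2. The following is a minimal sketch, not part of the original example; the variable names err1 to err4 and Asum are introduced here only for illustration.

```matlab
% Matrix from Example 6.2
A = [1 1; 1 2; 1 3; 1 4];
[U1, S, V] = svd(A, 0);          % thin SVD: U1 is 4x2, S and V are 2x2

% Column-wise relations A*v_i = sigma_i*u_i and A'*u_i = sigma_i*v_i
err1 = norm(A*V - U1*S);
err2 = norm(A'*U1 - V*S);

% Outer product form (6.2): accumulate the rank-one terms
Asum = zeros(size(A));
for i = 1:size(S, 1)
    Asum = Asum + S(i,i) * U1(:,i) * V(:,i)';
end
err3 = norm(A - Asum);

% The matrix 2-norm equals the largest singular value
err4 = abs(norm(A) - S(1,1));

disp([err1 err2 err3 err4])      % all four should be at roundoff level
```

Since the signs of singular vectors are not unique, the columns of U1 and V may differ in sign from the printed output, but the four residuals do not depend on that choice.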
The matrix 2-norm was defined in Section 2.4. From the proof of Theorem 6.1
we know already that ‖A ‖2 = σ1. This is such an important fact that it is worth
a separate proposition.
Proposition 6.3. The 2-norm of a matrix is given by
‖A ‖2 = σ1.
Proof. The following is an alternative proof. Without loss of generality, assume
that A ∈ Rm×n with m ≥ n, and let the SVD of A be A = UΣV T . The norm is
invariant under orthogonal transformations, and therefore
‖A ‖2 = ‖Σ ‖2.
The result now follows, since the 2-norm of a diagonal matrix is equal to the absolute
value of the largest diagonal element:
$$\|\Sigma\|_2^2 = \sup_{\|y\|_2 = 1} \|\Sigma y\|_2^2 = \sup_{\|y\|_2 = 1} \sum_{i=1}^{n} \sigma_i^2 y_i^2 \le \sigma_1^2 \sum_{i=1}^{n} y_i^2 = \sigma_1^2,$$
with equality for $y = e_1$.
6.2 Fundamental Subspaces
The SVD gives orthogonal bases of the four fundamental subspaces of a matrix.
The range of the matrix A is the linear subspace
R(A) = {y | y = Ax, for arbitrary x}.
Assume that A has rank r:
σ1 ≥ · · · ≥ σr > σr+1 = · · · = σn = 0.
Then, using the outer product form, we have
$$y = Ax = \sum_{i=1}^{r} \sigma_i u_i v_i^T x = \sum_{i=1}^{r} (\sigma_i v_i^T x)\, u_i = \sum_{i=1}^{r} \alpha_i u_i.$$
The null-space of the matrix A is the linear subspace
N (A) = {x | Ax = 0}.
Since $Ax = \sum_{i=1}^{r} \sigma_i u_i v_i^T x$, we see that any vector $z = \sum_{i=r+1}^{n} \beta_i v_i$ is in the null-space:
$$Az = \left( \sum_{i=1}^{r} \sigma_i u_i v_i^T \right) \left( \sum_{i=r+1}^{n} \beta_i v_i \right) = 0.$$
After a similar demonstration for AT we have the following theorem.
Theorem 6.4 (fundamental subspaces).
1. The singular vectors u1, u2, . . . , ur are an orthonormal basis in R(A) and
rank(A) = dim(R(A)) = r.
2. The singular vectors vr+1, vr+2, . . . , vn are an orthonormal basis in N (A) and
dim(N (A)) = n− r.
3. The singular vectors v1, v2, . . . , vr are an orthonormal basis in R(AT ).
4. The singular vectors ur+1, ur+2, . . . , um are an orthonormal basis in N (AT ).
Example 6.5. We create a rank deficient matrix by constructing a third column
in the previous example as a linear combination of columns 1 and 2:
>> A(:,3)=A(:,1)+0.5*A(:,2)
A = 1.0000 1.0000 1.5000
1.0000 2.0000 2.0000
1.0000 3.0000 2.5000
1.0000 4.0000 3.0000
>> [U,S,V]=svd(A,0)
U = 0.2612 -0.7948 -0.5000
0.4032 -0.3708 0.8333
0.5451 0.0533 -0.1667
0.6871 0.4774 -0.1667
S = 7.3944 0 0
0 0.9072 0
0 0 0
V = 0.2565 -0.6998 0.6667
0.7372 0.5877 0.3333
0.6251 -0.4060 -0.6667
The third singular value is equal to zero and the matrix is rank deficient. Obviously,
the third column of V is a basis vector in N (A):
>> A*V(:,3)
ans =
1.0e-15 *
0
-0.2220
-0.2220
0
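Theorem 6.4 translates directly into a way of extracting bases for all four subspaces from a full SVD. The sketch below uses the matrix of Example 6.5; the rank tolerance tol is an assumption (it has the same form as the default used by MATLAB's rank), and the names rangeA, nullA, rangeAT, and nullAT are introduced here for illustration.

```matlab
A = [1 1; 1 2; 1 3; 1 4];
A(:,3) = A(:,1) + 0.5*A(:,2);      % rank-deficient matrix from Example 6.5

[U, S, V] = svd(A);                % full SVD: U is 4x4, V is 3x3
s = diag(S);
tol = max(size(A)) * eps(s(1));    % assumed tolerance for "zero" singular values
r = sum(s > tol);                  % numerical rank

rangeA  = U(:, 1:r);               % orthonormal basis of R(A)
nullA   = V(:, r+1:end);           % orthonormal basis of N(A)
rangeAT = V(:, 1:r);               % orthonormal basis of R(A')
nullAT  = U(:, r+1:end);           % orthonormal basis of N(A')

disp(r)
disp(norm(A*nullA))                % roundoff level
disp(norm(A'*nullAT))              % roundoff level
```

In floating point the third singular value is not exactly zero, so some tolerance is always needed; the choice above is only one common convention.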
[Plot: singular values (logarithmic scale, 10^-1 to 10^1) versus index (0 to 20).]
Figure 6.1. Singular values of a matrix of rank 10 plus noise.
6.3 Matrix Approximation
Assume that A is a low-rank matrix plus noise: A = A0 + N , where the noise N is
small compared with A0. Then typically the singular values of A have the behavior
illustrated in Figure 6.1. In such a situation, if the noise is sufficiently small in
magnitude, the number of large singular values is often referred to as the numerical
rank of the matrix. If we know the correct rank of A0, or can estimate it, e.g., by
inspecting the singular values, then we can “remove the noise” by approximating A
by a matrix of the correct rank. The obvious way to do this is simply to truncate
the singular value expansion (6.2). Assume that the numerical rank is equal to k.
Then we approximate
$$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T \approx \sum_{i=1}^{k} \sigma_i u_i v_i^T =: A_k.$$
The truncated SVD is very important, not only for removing noise but also for
compressing data (see Chapter 11) and for stabilizing the solution of problems that
are extremely ill-conditioned.
It turns out that the truncated SVD is the solution of approximation problems
where one wants to approximate a given matrix by one of lower rank. We will
consider low-rank approximation of a matrix A in two norms. First we give the
theorem for the matrix 2-norm.
Theorem 6.6. Assume that the matrix $A \in \mathbb{R}^{m\times n}$ has rank r > k. The matrix
approximation problem
$$\min_{\mathrm{rank}(Z) = k} \|A - Z\|_2$$
has the solution
$$Z = A_k := U_k \Sigma_k V_k^T,$$
where $U_k = (u_1, \ldots, u_k)$, $V_k = (v_1, \ldots, v_k)$, and $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$. The minimum is
$$\|A - A_k\|_2 = \sigma_{k+1}.$$
A proof of this theorem can be found, e.g., in [42, Section 2.5.5]. Next recall
the definition of the Frobenius matrix norm (2.7)
$$\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}.$$
It turns out that the approximation result is the same for this case.
Theorem 6.7. Assume that the matrix $A \in \mathbb{R}^{m\times n}$ has rank r > k. The Frobenius
norm matrix approximation problem
$$\min_{\mathrm{rank}(Z) = k} \|A - Z\|_F$$
has the solution
$$Z = A_k = U_k \Sigma_k V_k^T,$$
where $U_k = (u_1, \ldots, u_k)$, $V_k = (v_1, \ldots, v_k)$, and $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$. The minimum is
$$\|A - A_k\|_F = \left( \sum_{i=k+1}^{p} \sigma_i^2 \right)^{1/2},$$
where p = min(m, n).
For the proof of this theorem we need a lemma.
Lemma 6.8. Consider the mn-dimensional vector space $\mathbb{R}^{m\times n}$ with inner product
$$\langle A, B \rangle = \mathrm{tr}(A^T B) = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij} b_{ij} \tag{6.3}$$
and norm
$$\|A\|_F = \langle A, A \rangle^{1/2}.$$
Let $A \in \mathbb{R}^{m\times n}$ with SVD $A = U \Sigma V^T$. Then the matrices
$$u_i v_j^T, \qquad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, n, \tag{6.4}$$
are an orthonormal basis in $\mathbb{R}^{m\times n}$.
Proof. Using the identities $\langle A, B \rangle = \mathrm{tr}(A^T B) = \mathrm{tr}(B A^T)$ we get
$$\langle u_i v_j^T, u_k v_l^T \rangle = \mathrm{tr}(v_j u_i^T u_k v_l^T) = \mathrm{tr}(v_l^T v_j \, u_i^T u_k) = (v_l^T v_j)\,(u_i^T u_k),$$
which shows that the matrices are orthonormal. Since there are mn such matrices,
they constitute a basis in $\mathbb{R}^{m\times n}$.
Proof (Theorem 6.7). This proof is based on that in [41]. Write the matrix
$Z \in \mathbb{R}^{m\times n}$ in terms of the basis (6.4),
$$Z = \sum_{i,j} \zeta_{ij} u_i v_j^T,$$
where the coefficients are to be chosen. For the purpose of this proof we denote the
elements of Σ by $\sigma_{ij}$. Due to the orthonormality of the basis, we have
$$\|A - Z\|_F^2 = \sum_{i,j} (\sigma_{ij} - \zeta_{ij})^2 = \sum_{i} (\sigma_{ii} - \zeta_{ii})^2 + \sum_{i \ne j} \zeta_{ij}^2.$$
Obviously, we can choose the second term as equal to zero. We then have the
following expression for Z:
$$Z = \sum_{i} \zeta_{ii} u_i v_i^T.$$
Since the rank of Z is equal to the number of terms in this sum, we see that the
constraint rank(Z) = k implies that we should have exactly k nonzero terms in the
sum. To minimize the objective function, we then choose
$$\zeta_{ii} = \sigma_{ii}, \qquad i = 1, 2, \ldots, k,$$
which gives the desired result.
The low-rank approximation of a matrix is illustrated as
$$A \approx A_k = U_k \Sigma_k V_k^T.$$
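The error formulas in Theorems 6.6 and 6.7 are easy to confirm numerically. The following sketch builds a rank-2 matrix plus a small deterministic perturbation; the sizes, the construction of A0, and the stand-in "noise" N are assumptions made only for this illustration.

```matlab
m = 8; n = 6; k = 2;
A0 = (1:m)' * (1:n) + sin((1:m)') * cos(1:n);   % rank 2 by construction
N  = 1e-3 * cos((1:m)' * (1:n).^2);             % small deterministic perturbation
A  = A0 + N;

[U, S, V] = svd(A);
s = diag(S);

% Truncated SVD of rank k
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';

err2 = norm(A - Ak);           % Theorem 6.6: equals sigma_{k+1}
errF = norm(A - Ak, 'fro');    % Theorem 6.7: sqrt of the sum of the remaining sigma_i^2

disp([err2, s(k+1)])
disp([errF, norm(s(k+1:end))])
```

The two printed pairs should agree to roundoff, and the gap between s(k) and s(k+1) shows the numerical rank in the sense of Figure 6.1.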
6.4 Principal Component Analysis
The approximation properties of the SVD can be used to elucidate the equivalence
between the SVD and principal component analysis (PCA). Assume that $X \in \mathbb{R}^{m\times n}$
is a data matrix, where each column is an observation of a real-valued random
vector with mean zero. The matrix is assumed to be centered, i.e., the mean of each
column is equal to zero. Let the SVD of X be $X = U \Sigma V^T$. The right singular
vectors $v_i$ are called principal components directions of X [47, p. 62]. The vector
$$z_1 = X v_1 = \sigma_1 u_1$$
has the largest sample variance among all normalized linear combinations of the
columns of X:
$$\mathrm{Var}(z_1) = \mathrm{Var}(X v_1) = \frac{\sigma_1^2}{m}.$$
Finding the vector of maximal variance is equivalent, using linear algebra terminology,
to maximizing the Rayleigh quotient:
$$\sigma_1^2 = \max_{v \ne 0} \frac{v^T X^T X v}{v^T v}, \qquad v_1 = \arg\max_{v \ne 0} \frac{v^T X^T X v}{v^T v}.$$
The normalized variable u1 = (1/σ1)Xv1 is called the normalized first principal
component of X.
Having determined the vector of largest sample variance, we usually want to
go on and find the vector of second largest sample variance that is orthogonal to
the first. This is done by computing the vector of largest sample variance of the
deflated data matrix $X - \sigma_1 u_1 v_1^T$. Continuing this process we can determine all the
principal components in order, i.e., we compute the singular vectors. In the general
step of the procedure, the subsequent principal component is defined as the vector
of maximal variance subject to the constraint that it is orthogonal to the previous
ones.
Example 6.9. PCA is illustrated in Figure 6.2. Five hundred data points from
a correlated normal distribution were generated and collected in a data matrix
$X \in \mathbb{R}^{3\times 500}$. The data points and the principal components are illustrated in the
top plot of the figure. We then deflated the data matrix, $X_1 := X - \sigma_1 u_1 v_1^T$; the
data points corresponding to X1 are given in the bottom plot.
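The following sketch illustrates the PCA relations on a small synthetic data set. It is not the data of Example 6.9: the matrix X below is a deterministic construction chosen for reproducibility, and it stores one observation per row with column-centered variables, which is the layout assumed here so that the formula Var(z1) = σ1²/m applies with m observations.

```matlab
m = 200;
t = linspace(-1, 1, m)';
X = [t, 0.5*t + 0.1*sin(7*t), 0.2*cos(5*t)];   % assumed m x 3 data matrix
X = X - ones(m,1) * mean(X);                    % center each column

[U, S, V] = svd(X, 0);

z1   = X * V(:,1);                  % first principal component, equals S(1,1)*U(:,1)
var1 = sum(z1.^2) / m;              % sample variance, 1/m convention as in the text

% Deflation: remove the contribution along the first principal component
X1 = X - S(1,1) * U(:,1) * V(:,1)';

disp([var1, S(1,1)^2/m])            % should agree
disp([norm(X1), S(2,2)])            % largest singular value of X1 is sigma_2
```

After deflation the dominant direction of X1 is v2, which is how repeating the procedure recovers the principal components in order.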
[Figure 6.2 consists of two three-dimensional scatter plots; all axes range from −4 to 4.]
Figure 6.2. Top: Cluster of points in R3 with (scaled) principal components.
Bottom: same data with the contributions along the first principal component
deflated.
6.5 Solving Least Squares Problems
The least squares problem can be solved using the SVD. Assume that we have an
overdetermined system Ax ∼ b, where the matrix A has full column rank. Write
the SVD
$$A = (U_1 \; U_2) \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T,$$
where $U_1 \in \mathbb{R}^{m\times n}$. Using the SVD and the fact that the norm is invariant under
orthogonal transformations, we have
$$\|r\|_2^2 = \|b - Ax\|_2^2 = \left\| b - U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T x \right\|_2^2 = \left\| \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} - \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} y \right\|_2^2,$$
where $b_i = U_i^T b$ and $y = V^T x$. Thus
$$\|r\|_2^2 = \|b_1 - \Sigma y\|_2^2 + \|b_2\|_2^2.$$
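We can now minimize $\|r\|_2$ by putting $y = \Sigma^{-1} b_1$. The least squares solution is
given by
$$x = V y = V \Sigma^{-1} b_1 = V \Sigma^{-1} U_1^T b. \tag{6.5}$$
Recall that Σ is diagonal,
$$\Sigma^{-1} = \mathrm{diag}\left( \frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_n} \right),$$
so the solution can also be written
$$x = \sum_{i=1}^{n} \frac{u_i^T b}{\sigma_i} v_i.$$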
The assumption that A has full column rank implies that all the singular values
are nonzero: σi > 0, i = 1, 2, . . . , n. We also see that in this case, the solution is
unique.
Theorem 6.10 (least squares solution by SVD). Let the matrix $A \in \mathbb{R}^{m\times n}$
have full column rank and thin SVD $A = U_1 \Sigma V^T$. Then the least squares problem
$\min_x \|Ax - b\|_2$ has the unique solution
$$x = V \Sigma^{-1} U_1^T b = \sum_{i=1}^{n} \frac{u_i^T b}{\sigma_i} v_i.$$
Example 6.11. As an example, we solve the least squares problem given at the
beginning of Section 3.6. The matrix and right-hand side are
A = 1 1 b = 7.9700
1 2 10.2000
1 3 14.2000
1 4 16.0000
1 5 21.2000
>> [U1,S,V]=svd(A,0)
U1 =0.1600 -0.7579
0.2853 -0.4675
0.4106 -0.1772
0.5359 0.1131
0.6612 0.4035
S = 7.6912 0
0 0.9194
V = 0.2669 -0.9637
0.9637 0.2669
The two column vectors in A are linearly independent since the singular values are
both nonzero. The least squares problem is solved using (6.5):
>> x=V*(S\(U1'*b))
x = 4.2360
3.2260
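As a cross-check, the SVD-based solution can be compared with the expansion form of Theorem 6.10 and with MATLAB's backslash operator. This is a minimal sketch using the data of Example 6.11; the names x_svd, x_sum, and x_bs are introduced here for illustration.

```matlab
A = [ones(5,1), (1:5)'];
b = [7.97; 10.2; 14.2; 16; 21.2];

[U1, S, V] = svd(A, 0);              % thin SVD

x_svd = V * (S \ (U1' * b));         % formula (6.5)

% Same solution written as a sum over singular triplets
x_sum = zeros(2, 1);
for i = 1:2
    x_sum = x_sum + (U1(:,i)' * b) / S(i,i) * V(:,i);
end

x_bs = A \ b;                        % MATLAB's least squares solve, for comparison

disp([x_svd, x_sum, x_bs])
```

All three columns should agree with the values 4.2360 and 3.2260 printed in the example.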
6.6 Condition Number and Perturbation Theory for the Least Squares Problem
The condition number of a rectangular matrix is defined in terms of the SVD. Let
A have rank r, i.e., its singular values satisfy
$$\sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_p = 0,$$
where p = min(m, n). Then the condition number is defined as
$$\kappa(A) = \frac{\sigma_1}{\sigma_r}.$$
Note that in the case of a square, nonsingular matrix, this reduces to the definition
(3.3).
The following perturbation theorem was proved by Wedin [106].
Theorem 6.12. Assume that the matrix $A \in \mathbb{R}^{m\times n}$, where m ≥ n, has full column
rank, and let x be the solution of the least squares problem $\min_x \|Ax - b\|_2$. Let δA
and δb be perturbations such that
$$\eta = \frac{\|\delta A\|_2}{\sigma_n} = \kappa \epsilon_A < 1, \qquad \epsilon_A = \frac{\|\delta A\|_2}{\|A\|_2}.$$
Then the perturbed matrix A + δA has full rank, and the perturbation of the solution
δx satisfies
$$\|\delta x\|_2 \le \frac{\kappa}{1 - \eta} \left( \epsilon_A \|x\|_2 + \frac{\|\delta b\|_2}{\|A\|_2} + \epsilon_A \kappa \frac{\|r\|_2}{\|A\|_2} \right),$$
where r is the residual r = b − Ax.
There are at least two important observations to make here:
1. The number κ determines the condition of the least squares problem, and if
m = n, then the residual r is equal to zero and the inequality becomes a
perturbation result for a linear system of equations; cf. Theorem 3.5.
2. In the overdetermined case the residual is usually not equal to zero. Then the
conditioning depends on $\kappa^2$. This dependence may be significant if the norm
of the residual is large.
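The quantities in Theorem 6.12 can be evaluated for a concrete problem. The sketch below reuses the data of Example 6.11 and applies small perturbations dA and db whose values are hypothetical, chosen only to make the bound observable; it then compares the actual change in the solution with the right-hand side of the inequality.

```matlab
A = [ones(5,1), (1:5)'];
b = [7.97; 10.2; 14.2; 16; 21.2];

s = svd(A);
kappa = s(1) / s(end);               % sigma_1/sigma_n, full column rank assumed
x = A \ b;
r = b - A*x;

% Hypothetical perturbations (illustration only)
dA = 1e-6 * [1 -1; 0 2; -1 1; 2 0; 1 1];
db = 1e-6 * [1; -2; 0; 1; -1];

epsA = norm(dA) / norm(A);
eta  = kappa * epsA;                 % the theorem requires eta < 1

dx  = (A + dA) \ (b + db) - x;       % actual perturbation of the solution
lhs = norm(dx);
rhs = kappa/(1 - eta) * (epsA*norm(x) + norm(db)/norm(A) ...
                         + epsA*kappa*norm(r)/norm(A));

disp([kappa, eta])
disp([lhs, rhs])                     % lhs should not exceed rhs
```

The term containing kappa squared is multiplied by the residual norm, which is why it matters only when the model does not fit the data well.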
6.7 Rank-Deficient and Underdetermined Systems
Assume that A is rank-deficient, i.e., rank(A) = r < min(m,n). The least squares
problem can still be solved, but the solution is no longer unique. In this case we
write the SVD
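$$A = (U_1 \; U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}, \tag{6.6}$$
where
$$U_1 \in \mathbb{R}^{m\times r}, \qquad \Sigma_1 \in \mathbb{R}^{r\times r}, \qquad V_1 \in \mathbb{R}^{n\times r}, \tag{6.7}$$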
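and the diagonal elements of $\Sigma_1$ are all nonzero. The norm of the residual can now
be written
$$\|r\|_2^2 = \|Ax - b\|_2^2 = \left\| (U_1 \; U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} x - b \right\|_2^2.$$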
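Putting
$$y = V^T x = \begin{pmatrix} V_1^T x \\ V_2^T x \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \qquad \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} U_1^T b \\ U_2^T b \end{pmatrix}$$
and using the invariance of the norm under orthogonal transformations, the residual
becomes
$$\|r\|_2^2 = \left\| \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} - \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \right\|_2^2 = \|\Sigma_1 y_1 - b_1\|_2^2 + \|b_2\|_2^2.$$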
Thus, we can minimize the residual by choosing $y_1 = \Sigma_1^{-1} b_1$. In fact,
$$y = \begin{pmatrix} \Sigma_1^{-1} b_1 \\ y_2 \end{pmatrix},$$
where $y_2$ is arbitrary, solves the least squares problem. Therefore, the solution of the
least squares problem is not unique, and, since the columns of $V_2$ span the null-space
of A, the indeterminacy lies in this null-space. We can write
$$\|x\|_2^2 = \|y\|_2^2 = \|y_1\|_2^2 + \|y_2\|_2^2,$$
and therefore we obtain the solution of minimum norm by choosing $y_2 = 0$.
We summarize the derivation in a theorem.
Theorem 6.13 (minimum norm solution). Assume that the matrix A is rank
deficient with SVD (6.6), (6.7). Then the least squares problem $\min_x \|Ax - b\|_2$
does not have a unique solution. However, the problem
$$\min_{x \in L} \|x\|_2, \qquad L = \{ x \mid \|Ax - b\|_2 = \min \},$$
has the unique solution
$$x = V \begin{pmatrix} \Sigma_1^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^T b = V_1 \Sigma_1^{-1} U_1^T b.$$
The matrix
$$A^\dagger = V \begin{pmatrix} \Sigma_1^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^T$$
is called the pseudoinverse of A. It is defined for any nonzero matrix of arbitrary
dimensions.
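A small sketch of Theorem 6.13, assuming the rank-deficient matrix of Example 6.5 and a hypothetical right-hand side b. The pseudoinverse is assembled from the SVD and compared with MATLAB's pinv; the rank tolerance is an assumption of the same form as MATLAB's default.

```matlab
A = [1 1; 1 2; 1 3; 1 4];
A(:,3) = A(:,1) + 0.5*A(:,2);          % rank 2
b = [1; 2; 2; 4];                      % hypothetical right-hand side

[U, S, V] = svd(A);
s = diag(S);
r = sum(s > max(size(A)) * eps(s(1))); % numerical rank (assumed tolerance)

U1 = U(:,1:r);  V1 = V(:,1:r);  S1 = S(1:r,1:r);
Apinv = V1 * (S1 \ U1');               % pseudoinverse V1*inv(S1)*U1'
x_min = Apinv * b;                     % minimum norm least squares solution

% Any vector in N(A) can be added without changing the residual
z = V(:, r+1:end);
x_other = x_min + 0.3*z;

disp(norm(Apinv - pinv(A)))
disp([norm(A*x_min - b), norm(A*x_other - b)])   % equal residuals
disp([norm(x_min), norm(x_other)])               % x_min has the smaller norm
```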
The SVD can also be used to solve underdetermined linear systems, i.e., systems
with more unknowns than equations. Let $A \in \mathbb{R}^{m\times n}$, with m < n, be given.
The SVD of A is
$$A = U \begin{pmatrix} \Sigma & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}, \qquad V_1 \in \mathbb{R}^{n\times m}. \tag{6.8}$$
Obviously A has full row rank if and only if Σ is nonsingular.
We state a theorem concerning the solution of a linear system
Ax = b (6.9)
for the case when A has full row rank.
Theorem 6.14 (solution of an underdetermined linear system). Let
$A \in \mathbb{R}^{m\times n}$ have full row rank with SVD (6.8). Then the linear system (6.9) always has
a solution, which, however, is nonunique. The problem
$$\min_{x \in K} \|x\|_2, \qquad K = \{ x \mid Ax = b \}, \tag{6.10}$$
has the unique solution
$$x = V_1 \Sigma^{-1} U^T b. \tag{6.11}$$
Proof. Using the SVD (6.8) we can write
$$Ax = U \begin{pmatrix} \Sigma & 0 \end{pmatrix} \begin{pmatrix} V_1^T x \\ V_2^T x \end{pmatrix} =: U \begin{pmatrix} \Sigma & 0 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = U \Sigma y_1.$$
Since Σ is nonsingular, we see that for any right-hand side, (6.11) is a solution of the
linear system. However, we can add an arbitrary solution component in the null-space
of A, $y_2 = V_2^T x$, and we still have a solution. The minimum norm solution,
i.e., the solution of (6.10), is given by (6.11).
The rank-deficient case may or may not have a solution depending on the
right-hand side, and that case can be easily treated as in Theorem 6.13.
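Formula (6.11) can be tried on a small underdetermined system. In the sketch below the 2 x 4 matrix A and the right-hand side b are hypothetical values with full row rank; the second solution x2 is formed by adding an arbitrary null-space component.

```matlab
A = [1 2 3 4; 0 1 1 2];        % hypothetical 2x4 matrix with full row rank
b = [1; 2];

[U, S, V] = svd(A);            % S is 2x4, V is 4x4
m = size(A, 1);
Sig = S(:, 1:m);               % the nonsingular m x m block Sigma
V1  = V(:, 1:m);
V2  = V(:, m+1:end);           % orthonormal basis of N(A)

x  = V1 * (Sig \ (U' * b));    % minimum norm solution (6.11)
x2 = x + V2 * [1; -2];         % another solution of A*x = b

disp([norm(A*x - b), norm(A*x2 - b)])   % both at roundoff level
disp([norm(x), norm(x2)])               % x has the smaller norm
```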
6.8 Computing the SVD
The SVD is computed in MATLAB by the statement [U,S,V]=svd(A). This state-
ment is an implementation of algorithms from LAPACK [1]. (The double precision
high-level driver algorithm for SVD is called DGESVD.) In the algorithm the matrix
is first reduced to bidiagonal form by a series of Householder transformations from
the left and right. Then the bidiagonal matrix is iteratively reduced to diagonal
form using a variant of the QR algorithm; see Chapter 15.
The SVD of a dense (full) matrix can be computed in $O(mn^2)$ flops. Depending
on how much is computed, the constant is of the order 5–25.
The computation of a partial SVD of a large, sparse matrix is done in MATLAB
by the statement [U,S,V]=svds(A,k). This statement is based on Lanczos
methods from ARPACK. We give a brief description of Lanczos algorithms in
Chapter 15. For a more comprehensive treatment, see [4].
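A brief sketch of the two calls side by side. The sparse tridiagonal test matrix is a hypothetical example chosen so that its largest singular values are well separated; the dense svd is used only as a reference, which would not be feasible for a truly large matrix.

```matlab
n = 200;
e = ones(n, 1);
A = spdiags([e, (1:n)', e], -1:1, n, n);   % hypothetical sparse test matrix

k = 3;
[U, S, V] = svds(A, k);                    % k largest singular triplets
s_full = svd(full(A));                     % dense reference computation

disp([diag(S), s_full(1:k)])               % the two columns should agree
disp(norm(A*V - U*S))                      % residual of the partial SVD
```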
6.9 Complete Orthogonal Decomposition
In the case when the matrix is rank deficient, computing the SVD is the most
reliable method for determining the rank. However, it has the drawbacks that it is
comparatively expensive to compute, and it is expensive to update (when new rows
and/or columns are added). Both these issues may be critical, e.g., in a real-time
application. Therefore, methods have been developed that approximate the SVD,
so-called complete orthogonal decompositions, which in the noise-free case and in
exact arithmetic can be written
$$A = Q \begin{pmatrix} T & 0 \\ 0 & 0 \end{pmatrix} Z^T$$
for orthogonal Q and Z and triangular $T \in \mathbb{R}^{r\times r}$ when A has rank r. Obviously,
the SVD is a special case of a complete orthogonal decomposition.
In this section we will assume that the matrix A ∈ Rm×n, m ≥ n, has exact
or numerical rank r. (Recall the definition of numerical rank on p. 63.)
6.9.1 QR with Column Pivoting
The first step toward obtaining a complete orthogonal decomposition is to perform
column pivoting in the computation of a QR decomposition [22]. Consider the
matrix before the algorithm has started: compute the 2-norm of each column, and
move the column with largest norm to the leftmost position. This is equivalent to
multiplying the matrix by a permutation matrix P1 from the right. In the first step
of the reduction to triangular form, a Householder transformation is applied that
annihilates elements in the first column:
$$A \longrightarrow A P_1 \longrightarrow Q_1^T A P_1 = \begin{pmatrix} r_{11} & r_1^T \\ 0 & B \end{pmatrix}.$$
Then in the next step, find the column with largest norm in B, permute the columns
so that the column with largest norm is moved to position 2 in A (this involves only
columns 2 to n, of course), and reduce the first column of B:
$$Q_1^T A P_1 \longrightarrow Q_2^T Q_1^T A P_1 P_2 = \begin{pmatrix} r_{11} & r_{12} & \bar r_1^T \\ 0 & r_{22} & \bar r_2^T \\ 0 & 0 & C \end{pmatrix}.$$
It is seen that $|r_{11}| \ge |r_{22}|$.
After n steps we have computed
$$AP = Q \begin{pmatrix} R \\ 0 \end{pmatrix}, \qquad Q = Q_1 Q_2 \cdots Q_n, \qquad P = P_1 P_2 \cdots P_{n-1}.$$
The product of permutation matrices is itself a permutation matrix.
Proposition 6.15. Assume that A has rank r. Then, in exact arithmetic, the QR
decomposition with column pivoting is given by
$$AP = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & 0 \end{pmatrix}, \qquad R_{11} \in \mathbb{R}^{r\times r},$$
and the diagonal elements of $R_{11}$ are nonzero ($R_{11}$ is nonsingular).
Proof. Obviously the diagonal elements occurring in the process are nonincreasing:
$$|r_{11}| \ge |r_{22}| \ge \cdots.$$
Assume that $|r_{r+1,r+1}| > 0$. That would imply that the rank of R in the QR
decomposition is larger than r, which is a contradiction, since R and A must have the
same rank.
Example 6.16. The following MATLAB script performs QR decomposition with
column pivoting on a matrix that is constructed to be rank deficient:
[U,ru]=qr(randn(3)); [V,rv]=qr(randn(3));
D=diag([1 0.5 0]); A=U*D*V';
[Q,R,P]=qr(A); % QR with column pivoting
R = -0.8540 0.4311 -0.0642
0 0.4961 -0.2910
0 0 0.0000
>> R(3,3) = 2.7756e-17
In many cases QR decomposition with pivoting gives reasonably accurate informa-
tion about the numerical rank of a matrix. We modify the matrix in the previous
script by adding noise:
[U,ru]=qr(randn(3)); [V,rv]=qr(randn(3));
D=diag([1 0.5 0]); A=U*D*V'+1e-4*randn(3);
[Q,R,P]=qr(A);
>> R = 0.8172 -0.4698 -0.1018
0 -0.5758 0.1400
0 0 0.0001
The smallest diagonal element is of the same order of magnitude as the smallest
singular value:
>> svd(A) = 1.0000
0.4999
0.0001
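The comparison in Example 6.16 can be turned into a simple rank estimate. The sketch below is a deterministic variant: the orthogonal factors come from QR decompositions of fixed matrices, the perturbation E replaces the random noise, and the relative threshold tol is an assumed value separating "signal" from "noise". None of these choices come from the original example.

```matlab
[U, ru] = qr([1 2 3; 4 5 6; 7 8 10]);      % fixed orthogonal factor (assumption)
[V, rv] = qr([2 -1 0; -1 2 -1; 0 -1 2]);   % fixed orthogonal factor (assumption)
D = diag([1 0.5 0]);
E = 1e-4 * [1 -2 1; 0 1 -1; 2 0 1];        % deterministic stand-in for the noise
A = U*D*V' + E;

[Q, R, P] = qr(A);                         % QR with column pivoting, A*P = Q*R
d = abs(diag(R));
s = svd(A);

tol = 1e-2;                                % assumed relative threshold
rank_qr  = sum(d > tol * d(1));            % estimate from the diagonal of R
rank_svd = sum(s > tol * s(1));            % estimate from the singular values

disp([d, s])
disp([rank_qr, rank_svd])
```

For this matrix both estimates agree, but as the next example shows, the diagonal of R is not always a reliable indicator.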
It turns out, however, that one cannot rely completely on this procedure to
give correct information about possible rank deficiency. We give an example due to
Kahan; see [50, Section 8.3].
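Example 6.17. Let $c^2 + s^2 = 1$. For n large enough, the triangular matrix
$$T_n(c) = \mathrm{diag}(1, s, s^2, \ldots, s^{n-1}) \begin{pmatrix} 1 & -c & -c & \cdots & -c \\ & 1 & -c & \cdots & -c \\ & & \ddots & & \vdots \\ & & & 1 & -c \\ & & & & 1 \end{pmatrix}$$
is very ill-conditioned. For n = 200 and c = 0.2, we have
$$\kappa_2(T_n(c)) = \frac{\sigma_1}{\sigma_n} \approx \frac{12.7}{5.7 \cdot 10^{-18}}.$$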
Thus, in IEEE double precision, the matrix is singular. The columns of the trian-
gular matrix all have length 1. Therefore, because the elements in each row to the
right of the diagonal are equal, QR decomposition with column pivoting will not
introduce any column interchanges, and the upper triangular matrix R is equal to
$T_n(c)$. However, the bottom diagonal element is equal to $s^{199} \approx 0.0172$, so for this
matrix QR with column pivoting does not give any information whatsoever about
the ill-conditioning.
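The Kahan matrix is easy to construct, and its properties can be inspected directly. The sketch below deliberately does not call qr: in floating point arithmetic, rounding errors in the column norms may cause interchanges that do not occur in exact arithmetic, so the triangular matrix itself is examined instead. The variable names are introduced here for illustration.

```matlab
n = 200; c = 0.2;
s = sqrt(1 - c^2);

% T_n(c) = diag(1, s, ..., s^(n-1)) * (unit upper triangular with -c above the diagonal)
T = diag(s.^(0:n-1)) * (eye(n) - c * triu(ones(n), 1));

colnorms  = sqrt(sum(T.^2, 1));   % every column has length 1
last_diag = T(n, n);              % s^(n-1), about 0.0172
sv = svd(T);

disp([min(colnorms), max(colnorms)])
disp(last_diag)
disp([sv(1), sv(end)])            % sv(end) is far below last_diag
```

The smallest computed singular value is at roundoff level relative to the largest one, while the last diagonal element of the triangular factor gives no hint of this.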
Chapter 7
Reduced-Rank Least Squares Models
Consider a linear model
$$b = Ax + \eta, \qquad A \in \mathbb{R}^{m\times n},$$
where η is random noise and A and b are given. If one chooses to determine x by
minimizing the Euclidean norm of the residual b − Ax, then one has a linear least
squares problem
$$\min_x \|Ax - b\|_2. \tag{7.1}$$
In some situations the actual solution x itself is not the primary object of interest,
but rather it is an auxiliary, intermediate variable. This is the case, e.g., in certain
classification methods, when the norm of the residual is the interesting quantity;
see Chapter 10.
Least squares prediction is another area in which the solution x is an interme-
diate quantity and is not interesting in itself (except that it should be robust and
reliable in a numerical sense). The columns of A consist of observations of explana-
tory variables, which are used to explain the variation of a dependent variable b. In
this context it is essential that the variation of b is well explained by an approxi-
mate solution x̂, in the sense that the relative residual ‖Ax̂ − b‖2/‖b‖2 should be
rather small. Given x̂ and a new row vector $a_{\mathrm{new}}^T$ of observations of the explanatory
variables, one can predict the corresponding value of the dependent variable:
$$b_{\mathrm{predicted}} = a_{\mathrm{new}}^T \hat x. \tag{7.2}$$
Often in this context it is not necessary, or even desirable, to find the solution that
actually minimizes the residual in (7.1). For instance, in prediction it is common
that several of the explanatory variables are (almost) linearly dependent. Therefore
the matrix A is often very ill-conditioned, and the least squares solution is highly
influenced by measurement errors and floating point roundoff errors.
Example 7.1. The MATLAB script
A=[1 0
1 1
1 1];
B=[A A*[1;0.5]+1e-7*randn(3,1)];
b=B*[1;1;1]+1e-4*randn(3,1);
x=B\b
creates a matrix B, whose third column is almost a linear combination of the other
two. The matrix is quite ill-conditioned: the condition number is $\kappa_2(B) \approx 5.969 \cdot 10^7$.
The script gives the least squares solution
x = -805.95
-402.47
807.95
This approximate solution explains the variation of the dependent variable very
well, as the residual is small:
resn = norm(B*x-b)/norm(b) = 1.9725e-14
However, because the components of the solution are large, there will be consider-
able cancellation (see Section 1.5) in the evaluation of (7.2), which leads to numerical
errors. Furthermore, in an application it may be very difficult to interpret such a
solution.
The large deviation of the least squares solution from the vector that was used
to construct the right-hand side is due to the fact that the numerical rank of the
matrix is two. (The singular values are 3.1705, 0.6691, and $8.4425 \cdot 10^{-8}$.) In view
of this, it is reasonable to accept the approximate solution
xt = 0.7776
     0.8891
     1.2222
obtained using a truncated SVD
$$x_{\mathrm{tsvd}} = \sum_{i=1}^{2} \frac{u_i^T b}{\sigma_i} v_i$$
(cf. Section 7.1). This solution candidate has residual norm $1.2 \cdot 10^{-5}$, which is
of the same order of magnitude as the perturbation of the right-hand side. Such
a solution vector may be much better for prediction, as the cancellation in the
evaluation of (7.2) is much smaller or is eliminated completely (depending on the
vector $a_{\mathrm{new}}$).
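The truncated SVD solution of Example 7.1 can be computed as follows. To keep the sketch reproducible, the random perturbations of the example are replaced by fixed vectors of the same magnitude; these values are assumptions, so the numbers will differ slightly from those printed above.

```matlab
A = [1 0; 1 1; 1 1];
% Fixed stand-ins for the randn perturbations of Example 7.1 (assumed values)
B = [A, A*[1; 0.5] + 1e-7*[1; -1; 0.5]];
b = B*[1; 1; 1] + 1e-4*[0.3; -0.7; 0.5];

x = B \ b;                       % full least squares solution, large components

[U, S, V] = svd(B);
k = 2;                           % numerical rank of B
xt = zeros(3, 1);
for i = 1:k
    xt = xt + (U(:,i)' * b) / S(i,i) * V(:,i);   % truncated SVD solution
end

res_t = norm(B*xt - b) / norm(b);

disp(diag(S)')
disp([x, xt])
disp(res_t)
```

The truncated solution has components of moderate size and a relative residual of the same order as the perturbation of b, in line with the discussion above.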
To reduce the ill-conditioning of the problem, and thus make the solution less
sensitive to perturbations of the data, one sometimes introduces an approximate
orthogonal basis of low dimension in $\mathbb{R}^n$, where the solution x lives. Let the basis
vectors be $(z_1 \; z_2 \; \ldots \; z_k) =: Z_k$, for some (small) value of k. Then to determine
the coordinates of the solution in terms of the approximate basis, we make the
ansatz $x = Z_k y$ in the least squares problem and solve
$$\min_y \|A Z_k y - b\|_2. \tag{7.3}$$
This is a least squares problem corresponding to a reduced-rank model. In the
following two sections we describe two methods for determining such a matrix of
basis vectors. The first is based on the SVD of the data matrix A. The second
method is a Krylov subspace method, in which the right-hand side influences the
choice of basis.
7.1 Truncated SVD: Principal Component Regression
Assume that the data matrix has the SVD
$$A = U \Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T,$$
where r is the rank of A (note that we allow m ≥ n or m ≤ n). The minimum norm
solution (see Section 6.7) of the least squares problem (7.1) is
$$x = \sum_{i=1}^{r} \frac{u_i^T b}{\sigma_i} v_i. \tag{7.4}$$
As the SVD “orders the variation” of the data matrix A starting with the dominating
direction, we see that the terms in the sum (7.4) are also organized in this way:
the first term is the solution component along the dominating direction of the
data matrix, the second term is the component along the second most dominating
direction, and so forth.⁸
⁸ However, this does not mean that the terms in the sum are ordered by magnitude.
Thus, if we prefer to use the ordering induced by the SVD, then we should
choose the matrix $Z_k$ in (7.3) equal to the first k right singular vectors of A (we
assume that k ≤ r):
$$Z_k = V_k = (v_1 \; v_2 \; \ldots \; v_k).$$
Using the fact that
$$V^T V_k = \begin{pmatrix} I_k \\ 0 \end{pmatrix},$$
where $I_k \in \mathbb{R}^{k\times k}$, we get
$$\|A V_k y - b\|_2^2 = \|U \Sigma V^T V_k y - b\|_2^2 = \left\| U \left( \Sigma \begin{pmatrix} I_k \\ 0 \end{pmatrix} y - U^T b \right) \right\|_2^2 = \left\| \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_k \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_k \end{pmatrix} - \begin{pmatrix} u_1^T b \\ \vdots \\ u_k^T b \end{pmatrix} \right\|_2^2 + \sum_{i=k+1}^{r} (u_i^T b)^2.$$
We see that the least squares problem $\min_y \|A V_k y - b\|_2$ has the solution
$$y = \begin{pmatrix} u_1^T b / \sigma_1 \\ \vdots \\ u_k^T b / \sigma_k \end{pmatrix},$$
which is equivalent to taking
$$x_k := \sum_{i=1}^{k} \frac{u_i^T b}{\sigma_i} v_i$$
as an approximate solution of (7.1). This is often referred to as the truncated SVD
solution.
Often one wants to find the smallest value of k for which the reduction of the
residual is substantial enough. The procedure can be formulated as an algorithm,
which is sometimes referred to as principal component regression.
Principal component regression (truncated SVD)
1. Find the smallest value of k such that $\sum_{i=k+1}^{r} (u_i^T b)^2 < \mathrm{tol}\, \|b\|_2^2$.
2. Put
$$x_k := \sum_{i=1}^{k} \frac{u_i^T b}{\sigma_i} v_i.$$
The parameter tol is a predefined tolerance.
Example 7.2. We use the matrix from Example 1.1. Let
$$A = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 \end{pmatrix}.$$
[Plot: relative residual (0.4 to 1) versus truncation index (0 to 5).]
Figure 7.1. The relative norm of the residuals for the query vectors q1 (solid line) and q2 (dashed) as functions of the truncation index k.
Recall that each column corresponds to a document (here a sentence). We want
to see how well two query vectors q1 and q2 can be represented in terms of the first
few terms of the singular value expansion (7.4) of the solution, i.e., we will solve the
least squares problems
$$\min_y \|A V_k y - q_i\|_2, \qquad i = 1, 2,$$
for different values of k. The two vectors are
$$q_1 = (0, 0, 0, 0, 0, 0, 0, 1, 1, 1)^T, \qquad q_2 = (0, 1, 1, 0, 0, 0, 0, 0, 0, 0)^T,$$
corresponding to the words rank, page, Web and England, FIFA, respectively. In
Figure 7.1 we plot the relative norm of residuals, $\|A x_k - b\|_2 / \|b\|_2$, for the two
vectors as functions of k. From Example 1.1 we see that the main contents of the
documents are related to the ranking of Web pages using the Google matrix, and
this is reflected in the dominant singular vectors.⁹ Since q1 “contains Google
terms,” it can be well represented in terms of the first few singular vectors. On the
other hand, the q2 terms are related only to the “football document.” Therefore, it
is to be expected that the residual for q1 decays faster than that of q2 as a function
of k.
⁹ See also Example 11.8 in Chapter 11.
The coordinates of q1 and q2 in terms of the first five left singular vectors are
U'*[q1 q2] = 1.2132 0.1574
            -0.5474 0.5215
             0.7698 0.7698
            -0.1817 0.7839
             0.3981 -0.3352
The vector q1 has a substantial component in the first left singular vector u1, and
therefore the residual is reduced substantially for k = 1. Since q2 has a small
component in terms of u1, there is only a marginal reduction of the residual in the
first step.
If we want to reduce the relative residual to under 0.7 in this example, then
we should choose k = 2 for q1 and k = 4 for q2.
7.2 A Krylov Subspace Method
When we use the truncated SVD (principal component regression) for a reduced-rank
model, the right-hand side does not influence the choice of basis vectors $z_i$ at
all. The effect of this is apparent in Example 7.2, where the rate of decay of the
residual is considerably slower for the vector q2 than for q1.
In many situations one would like to have a fast decay of the residual as a
function of the number of basis vectors for any right-hand side. Then it is necessary
to let the right-hand side influence the choice of basis vectors. This is done in
an algorithm called Lanczos–Golub–Kahan (LGK) bidiagonalization, in the field
of numerical linear algebra.¹⁰ A closely related method is known in chemometrics
and other areas as partial least squares or projection to latent structures (PLS). It
is an algorithm out of a large class of Krylov subspace methods, often used for the
solution of sparse linear systems; see, e.g., [42, Chapters 9–10], [80] or, for
eigenvalue–singular value computations, see Section 15.8.
Krylov subspace methods are recursive, but in our derivation we will start with
the reduction of a matrix to bidiagonal form using Householder transformations.
The presentation in this section is largely influenced by [15].
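The residual decay for the two query vectors of Example 7.2 can be reproduced with a few lines of MATLAB. This is a minimal sketch of principal component regression applied to both right-hand sides; the names coef, relres, k1, and k2 are introduced here, and the threshold 0.7 is the one used in the example.

```matlab
A = [0 0 0 1 0; 0 0 0 0 1; 0 0 0 0 1; 1 0 1 0 0; 1 0 0 0 0; ...
     0 1 0 0 0; 1 0 1 1 0; 0 1 1 0 0; 0 0 1 1 1; 0 1 1 0 0];
q1 = [0 0 0 0 0 0 0 1 1 1]';
q2 = [0 1 1 0 0 0 0 0 0 0]';
Q  = [q1 q2];

[U, S, V] = svd(A, 0);            % thin SVD, five singular triplets
coef = U' * Q;                    % coordinates in the left singular vectors
r = size(S, 1);

relres = ones(r+1, 2);            % row k+1 holds the relative residual for index k
for j = 1:2
    xk = zeros(size(A, 2), 1);    % k = 0: zero solution, relative residual 1
    for k = 1:r
        xk = xk + coef(k,j) / S(k,k) * V(:,k);          % add the k-th term of (7.4)
        relres(k+1,j) = norm(A*xk - Q(:,j)) / norm(Q(:,j));
    end
end

% Smallest truncation index with relative residual below 0.7
% (the text reports k = 2 for q1 and k = 4 for q2)
k1 = find(relres(:,1) < 0.7, 1) - 1;
k2 = find(relres(:,2) < 0.7, 1) - 1;

disp(relres)
disp([k1 k2])
```

Because the basis vectors are fixed by A alone, the same singular vectors serve both right-hand sides, which is exactly why the decay for q2 is slow.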
¹⁰ The algorithm is often called Lanczos bidiagonalization, but it was first described by Golub and Kahan in [41].
7.2.1 Bidiagonalization Using Householder Transformations
The first step in the algorithm for computing the SVD of a dense matrix
$C \in \mathbb{R}^{m\times(n+1)}$ is to reduce it to upper bidiagonal form by Householder transformations
from the left and right. (We choose these particular dimensions here because later
in this chapter we will have C = (b A).) We assume that m > n. The result is
$$C = P \begin{pmatrix} \hat B \\ 0 \end{pmatrix} W^T, \tag{7.5}$$
where P and W are orthogonal and B̂ is upper bidiagonal. The decomposition
in itself is useful also for other purposes. For instance, it is often used for the
approximate solution of least squares problems, both dense and sparse.
We illustrate the Householder bidiagonalization procedure with a small ex-
ample, where C ∈ R6×5. First, all subdiagonal elements in the first column are
zeroed by a transformation PT1 from the left (the elements that are changed in the
transformation are denoted by ∗):
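$$P_1^T C = P_1^T \begin{pmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{pmatrix} = \begin{pmatrix} * & * & * & * & * \\ 0 & * & * & * & * \\ 0 & * & * & * & * \\ 0 & * & * & * & * \\ 0 & * & * & * & * \\ 0 & * & * & * & * \end{pmatrix}.$$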
Then, by a different Householder transformation $W_1$ from the right, we zero elements
in the first row, from position 3 to n. To achieve this we choose
$$\mathbb{R}^{5\times 5} \ni W_1 = \begin{pmatrix} 1 & 0 \\ 0 & Z_1 \end{pmatrix},$$
where Z1 is a Householder transformation. Since this transformation does not
change the elements in the first column, the zeros that we just introduced in the
first column remain. The result of the first step is
PT1 CW1 =
⎛
⎜⎜⎜⎜⎜⎜⎝
× × × × ×
0 × × × ×
0 × × × ×
0 × × × ×
0 × × × ×
0 × × × ×
⎞
⎟⎟⎟⎟⎟⎟⎠W1 =
⎛
⎜⎜⎜⎜⎜⎜⎝
× ∗ 0 0 0
0 ∗ ∗ ∗ ∗
0 ∗ ∗ ∗ ∗
0 ∗ ∗ ∗ ∗
0 ∗ ∗ ∗ ∗
0 ∗ ∗ ∗ ∗
⎞
⎟⎟⎟⎟⎟⎟⎠ =: C1.
We now continue in an analogous way and zero all elements below the diagonal in the second column by a transformation from the left. The matrix $P_2$ is constructed so that it does not change the elements in the first row of $C_1$, i.e., $P_2$ has the structure
$$\mathbb{R}^{6 \times 6} \ni P_2 = \begin{pmatrix} 1 & 0 \\ 0 & \tilde{P}_2 \end{pmatrix},$$
where $\tilde{P}_2 \in \mathbb{R}^{5 \times 5}$ is a Householder transformation. We get
$$P_2^T C_1 =
\begin{pmatrix}
\times & \times & 0 & 0 & 0 \\
0 & \ast & \ast & \ast & \ast \\
0 & 0 & \ast & \ast & \ast \\
0 & 0 & \ast & \ast & \ast \\
0 & 0 & \ast & \ast & \ast \\
0 & 0 & \ast & \ast & \ast
\end{pmatrix}.$$
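Then, by a transformation from the right,
$$W_2 = \begin{pmatrix} I_2 & 0 \\ 0 & Z_2 \end{pmatrix}, \qquad I_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$
we annihilate elements in the second row without destroying the newly introduced zeros:
$$P_2^T C_1 W_2 =
\begin{pmatrix}
\times & \times & 0 & 0 & 0 \\
0 & \times & \ast & 0 & 0 \\
0 & 0 & \ast & \ast & \ast \\
0 & 0 & \ast & \ast & \ast \\
0 & 0 & \ast & \ast & \ast \\
0 & 0 & \ast & \ast & \ast
\end{pmatrix} =: C_2.$$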
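We continue in an analogous manner and finally obtain
$$P^T C W =
\begin{pmatrix}
\times & \times & & & \\
 & \times & \times & & \\
 & & \times & \times & \\
 & & & \times & \times \\
 & & & & \times \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}
= \begin{pmatrix} \hat{B} \\ 0 \end{pmatrix}. \tag{7.6}$$
In the general case,
$$P = P_1 P_2 \cdots P_n \in \mathbb{R}^{m \times m}, \qquad W = W_1 W_2 \cdots W_{n-2} \in \mathbb{R}^{(n+1) \times (n+1)}$$
are products of Householder transformations, and
$$\hat{B} =
\begin{pmatrix}
\beta_1 & \alpha_1 & & & \\
 & \beta_2 & \alpha_2 & & \\
 & & \ddots & \ddots & \\
 & & & \beta_n & \alpha_n \\
 & & & & \beta_{n+1}
\end{pmatrix} \in \mathbb{R}^{(n+1) \times (n+1)}$$
is upper bidiagonal.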
Due to the way the orthogonal matrices were constructed, they have a partic-
ular structure that will be used in the rest of this chapter.
Proposition 7.3. Denote the columns of P in the bidiagonal decomposition (7.6) by $p_i$, $i = 1, 2, \ldots, m$. Then
$$\beta_1 p_1 = c_1, \qquad W = \begin{pmatrix} 1 & 0 \\ 0 & Z \end{pmatrix},$$
where $c_1$ is the first column of C and $Z \in \mathbb{R}^{n \times n}$ is orthogonal.
Proof. The first relation follows immediately from $P^T c_1 = \beta_1 e_1$. The second follows from the fact that all $W_i$ have the structure
$$W_i = \begin{pmatrix} I_i & 0 \\ 0 & Z_i \end{pmatrix},$$
where $I_i \in \mathbb{R}^{i \times i}$ are identity matrices and $Z_i$ are orthogonal.
The reduction to bidiagonal form by Householder transformations requires $4mn^2 - 4n^3/3$ flops. If $m \gg n$, then it is more efficient to first reduce A to upper triangular form and then bidiagonalize the R factor.
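The alternating left and right Householder steps illustrated above can be sketched in a few lines. The following is a minimal Python illustration (the book's own codes are in MATLAB), using only the standard library; the names `householder_vector` and `bidiagonalize` are hypothetical, the matrix is stored as a list of rows, it is assumed that the matrix has at least as many rows as columns, and the orthogonal factors P and W are applied but not accumulated.

```python
import math

def householder_vector(x):
    """Unit vector v such that (I - 2 v v^T) x is a multiple of e1; None if x = 0."""
    norm_x = math.sqrt(sum(t * t for t in x))
    if norm_x == 0.0:
        return None
    v = list(x)
    v[0] += math.copysign(norm_x, x[0])   # sign chosen to avoid cancellation
    norm_v = math.sqrt(sum(t * t for t in v))
    return [t / norm_v for t in v]

def bidiagonalize(C):
    """Return P^T C W (upper bidiagonal) as a list of rows; assumes rows >= columns."""
    B = [list(map(float, row)) for row in C]
    m, n = len(B), len(B[0])
    for j in range(n):
        # Left transformation: zero the elements below the diagonal in column j.
        v = householder_vector([B[i][j] for i in range(j, m)])
        if v is not None:
            for col in range(j, n):
                d = 2.0 * sum(v[i - j] * B[i][col] for i in range(j, m))
                for i in range(j, m):
                    B[i][col] -= d * v[i - j]
        # Right transformation: zero row j to the right of the superdiagonal.
        if j < n - 2:
            v = householder_vector(B[j][j + 1:])
            if v is not None:
                for row in range(j, m):
                    d = 2.0 * sum(v[c - j - 1] * B[row][c] for c in range(j + 1, n))
                    for c in range(j + 1, n):
                        B[row][c] -= d * v[c - j - 1]
    return B

# Example use on a small 4 x 3 matrix (values chosen arbitrarily).
for row in bidiagonalize([[1, 2, 3], [4, 5, 6], [7, 8, 10], [2, 0, 1]]):
    print(["%8.4f" % x for x in row])
```

Because only orthogonal transformations are applied, the Frobenius norm of the result equals that of the input, which gives a cheap sanity check.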
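Assume now that we want to solve the least squares problem $\min_x \|b - Ax\|_2$, where $A \in \mathbb{R}^{m \times n}$. If we choose $C = \begin{pmatrix} b & A \end{pmatrix}$ in the bidiagonalization procedure, then we get an equivalent bidiagonal least squares problem. Using (7.6) and Proposition 7.3 we obtain
$$P^T C W = P^T \begin{pmatrix} b & A \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & Z \end{pmatrix} = \begin{pmatrix} P^T b & P^T A Z \end{pmatrix} = \begin{pmatrix} \beta_1 e_1 & B \\ 0 & 0 \end{pmatrix}, \tag{7.7}$$
where
$$B =
\begin{pmatrix}
\alpha_1 & & & \\
\beta_2 & \alpha_2 & & \\
 & \ddots & \ddots & \\
 & & \beta_n & \alpha_n \\
 & & & \beta_{n+1}
\end{pmatrix} \in \mathbb{R}^{(n+1) \times n}.$$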
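Then, defining $y = Z^T x$ we can write the norm of the residual,
$$\begin{aligned}
\|b - Ax\|_2 &= \left\| \begin{pmatrix} b & A \end{pmatrix} \begin{pmatrix} 1 \\ -x \end{pmatrix} \right\|_2
= \left\| P^T \begin{pmatrix} b & A \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & Z \end{pmatrix} \begin{pmatrix} 1 \\ -y \end{pmatrix} \right\|_2 \\
&= \left\| \begin{pmatrix} P^T b & P^T A Z \end{pmatrix} \begin{pmatrix} 1 \\ -y \end{pmatrix} \right\|_2
= \|\beta_1 e_1 - By\|_2.
\end{aligned} \tag{7.8}$$
The bidiagonal least squares problem $\min_y \|\beta_1 e_1 - By\|_2$ can be solved in O(n) flops, if we reduce B to upper bidiagonal form using a sequence of plane rotations (see below).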
7.2.2 LGK Bidiagonalization
We will now give an alternative description of the bidiagonalization procedure of the
preceding section that allows us to compute the decomposition (7.7) in a recursive
manner. This is the LGK bidiagonalization. Part of the last equation of (7.7) can
be written
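$$P^T A = \begin{pmatrix} B Z^T \\ 0 \end{pmatrix}, \qquad B Z^T \in \mathbb{R}^{(n+1) \times n},$$
which implies
$$A^T \begin{pmatrix} p_1 & p_2 & \cdots & p_{n+1} \end{pmatrix} = Z B^T = \begin{pmatrix} z_1 & z_2 & \cdots & z_n \end{pmatrix}
\begin{pmatrix}
\alpha_1 & \beta_2 & & & \\
 & \alpha_2 & \beta_3 & & \\
 & & \ddots & \ddots & \\
 & & & \alpha_n & \beta_{n+1}
\end{pmatrix}.$$
Equating column i ($i \ge 2$) on both sides, we get
$$A^T p_i = \beta_i z_{i-1} + \alpha_i z_i,$$
which can be written
$$\alpha_i z_i = A^T p_i - \beta_i z_{i-1}. \tag{7.9}$$
Similarly, by equating column i in
$$AZ = A \begin{pmatrix} z_1 & z_2 & \cdots & z_n \end{pmatrix} = PB = \begin{pmatrix} p_1 & p_2 & \cdots & p_{n+1} \end{pmatrix}
\begin{pmatrix}
\alpha_1 & & & \\
\beta_2 & \alpha_2 & & \\
 & \ddots & \ddots & \\
 & & \beta_n & \alpha_n \\
 & & & \beta_{n+1}
\end{pmatrix},$$
we get
$$A z_i = \alpha_i p_i + \beta_{i+1} p_{i+1},$$
which can be written
$$\beta_{i+1} p_{i+1} = A z_i - \alpha_i p_i. \tag{7.10}$$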
Now, by compiling the starting equation β1p1 = b from Proposition 7.3, equations
(7.9) and (7.10), we have derived a recursion:
LGK Bidiagonalization
1. $\beta_1 p_1 = b$, $z_0 = 0$
2. for $i = 1 : n$
       $\alpha_i z_i = A^T p_i - \beta_i z_{i-1}$,
       $\beta_{i+1} p_{i+1} = A z_i - \alpha_i p_i$
3. end
The coefficients $\alpha_{i-1}$ and $\beta_i$ are determined so that $\|p_i\| = \|z_i\| = 1$.
The recursion breaks down if any αi or βi becomes equal to zero. It can be
shown (see, e.g., [15, Section 7.2]) that in the solution of least squares problems,
these occurrences are harmless in the sense that they correspond to well-defined
special cases.
The recursive bidiagonalization procedure gives, in exact arithmetic, the same result as the Householder bidiagonalization of $\begin{pmatrix} b & A \end{pmatrix}$, and thus the generated vectors $(p_i)_{i=1}^n$ and $(z_i)_{i=1}^n$ satisfy $p_i^T p_j = 0$ and $z_i^T z_j = 0$ if $i \neq j$. However, in floating point arithmetic, the vectors lose orthogonality as the recursion proceeds; see Section 7.2.7.
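The LGK recursion of this section can be written down almost line by line. The sketch below is a minimal Python illustration (standard library only) under stated assumptions: the function name `lgk` and the helper names are hypothetical, matrices are lists of rows, no reorthogonalization is done, and a breakdown (a zero $\alpha_i$ or $\beta_i$) is not handled and would raise a division error.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def matvec(A, x):
    """A x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, y):
    """A^T y for a matrix stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def lgk(A, b, k):
    """k steps of LGK bidiagonalization.

    Returns P = [p_1..p_{k+1}], Z = [z_1..z_k], alpha = [alpha_1..alpha_k],
    beta = [beta_1..beta_{k+1}].
    """
    beta = [norm(b)]
    P = [[x / beta[0] for x in b]]           # beta_1 p_1 = b
    Z, alpha = [], []
    z_prev = [0.0] * len(A[0])               # z_0 = 0
    for i in range(k):
        # alpha_i z_i = A^T p_i - beta_i z_{i-1}
        w = [u - beta[i] * v for u, v in zip(rmatvec(A, P[i]), z_prev)]
        a = norm(w)
        z = [x / a for x in w]
        # beta_{i+1} p_{i+1} = A z_i - alpha_i p_i
        u = [s - a * t for s, t in zip(matvec(A, z), P[i])]
        b_next = norm(u)
        P.append([x / b_next for x in u])
        Z.append(z)
        alpha.append(a)
        beta.append(b_next)
        z_prev = z
    return P, Z, alpha, beta

# Example use with arbitrary small data.
A = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0], [2.0, 1.0, 1.0]]
b = [1.0, 2.0, 3.0, 4.0]
P, Z, alpha, beta = lgk(A, b, 2)
print("alpha:", alpha)
print("beta: ", beta)
```

For a few steps on a small, well-conditioned problem the computed vectors are orthogonal to working precision; for many steps one would expect the loss of orthogonality discussed in Section 7.2.7.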
7.2.3 Approximate Solution of a Least Squares Problem
Define the matrices $P_k = \begin{pmatrix} p_1 & p_2 & \cdots & p_k \end{pmatrix}$, $Z_k = \begin{pmatrix} z_1 & z_2 & \cdots & z_k \end{pmatrix}$, and
$$B_k =
\begin{pmatrix}
\alpha_1 & & & \\
\beta_2 & \alpha_2 & & \\
 & \ddots & \ddots & \\
 & & \beta_{k-1} & \alpha_{k-1} \\
 & & & \beta_k
\end{pmatrix} \in \mathbb{R}^{k \times (k-1)}.$$
Just as we could write the relations $AZ = PB$ and $A^T P = Z B^T$ as a recursion, we can now write the first k steps of the recursion as a matrix equation
$$A Z_k = P_{k+1} B_{k+1}. \tag{7.11}$$
Consider the least squares problem $\min_x \|Ax - b\|_2$. Note that the column vectors $z_i$ are orthogonal vectors in $\mathbb{R}^n$, where the solution x lives. Assume that we want to find the best approximate solution in the subspace spanned by the vectors $z_1, z_2, \ldots, z_k$. That is equivalent to solving the least squares problem
$$\min_y \|A Z_k y - b\|_2,$$
where $y \in \mathbb{R}^k$. From (7.11) we see that this is the same as solving
$$\min_y \|P_{k+1} B_{k+1} y - b\|_2,$$
which, using the orthogonality of $P = \begin{pmatrix} P_{k+1} & P_\perp \end{pmatrix}$, we can rewrite
$$\begin{aligned}
\|P_{k+1} B_{k+1} y - b\|_2 &= \|P^T (P_{k+1} B_{k+1} y - b)\|_2 \\
&= \left\| \begin{pmatrix} P_{k+1}^T \\ P_\perp^T \end{pmatrix} (P_{k+1} B_{k+1} y - b) \right\|_2
= \left\| \begin{pmatrix} B_{k+1} y \\ 0 \end{pmatrix} - \begin{pmatrix} \beta_1 e_1 \\ 0 \end{pmatrix} \right\|_2,
\end{aligned}$$
since $b = \beta_1 p_1$. It follows that
$$\min_y \|A Z_k y - b\|_2 = \min_y \|B_{k+1} y - \beta_1 e_1\|_2, \tag{7.12}$$
which, due to the bidiagonal structure, we can solve in O(n) flops; see below.

Figure 7.2. The relative norm of the residuals for the query vectors q1 (left) and q2 (right) as a function of subspace dimension k. The residual curves for the truncated SVD solutions are solid and for the bidiagonalization solutions are dash-dotted.
Example 7.4. Using bidiagonalization, we compute approximate solutions to the same least squares problems as in Example 7.2. The relative norm of the residual, $\|A Z_k y - b\|_2 / \|b\|_2$, is plotted as a function of k in Figure 7.2 for the truncated SVD solution and the bidiagonalization procedure. It is seen that in both cases, the bidiagonalization-based method gives a faster decay of the residual than the truncated SVD solutions. Thus in this example, the fact that we let the basis vectors $z_i$ be influenced by the right-hand sides q1 and q2 leads to reduced-rank models of smaller dimensions. If we want to reduce the relative residual to below 0.7, then in both cases we can choose k = 1 with the bidiagonalization method.
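The least squares problem (7.12) with bidiagonal structure can be solved using a sequence of plane rotations. Consider the reduction of
$$\begin{pmatrix} B_{k+1} & \beta_1 e_1 \end{pmatrix} =
\begin{pmatrix}
\alpha_1 & & & & & \beta_1 \\
\beta_2 & \alpha_2 & & & & 0 \\
 & \beta_3 & \alpha_3 & & & 0 \\
 & & \ddots & \ddots & & \vdots \\
 & & & \beta_k & \alpha_k & 0 \\
 & & & & \beta_{k+1} & 0
\end{pmatrix}$$
to upper triangular form. We will now demonstrate that the norm of the residual can be easily computed. In the first step we zero $\beta_2$ by a rotation in the (1, 2) plane, with cosine and sine $c_1$ and $s_1$. The result is
$$\begin{pmatrix}
\hat{\alpha}_1 & + & & & & \beta_1 \\
0 & \hat{\alpha}_2 & & & & -\beta_1 s_1 \\
 & \beta_3 & \alpha_3 & & & 0 \\
 & & \ddots & \ddots & & \vdots \\
 & & & \beta_k & \alpha_k & 0 \\
 & & & & \beta_{k+1} & 0
\end{pmatrix},$$
where matrix elements that have changed are marked with a hat, and the new nonzero element is marked with a +. In the next step, we zero $\beta_3$ by a rotation with cosine and sine $c_2$ and $s_2$:
$$\begin{pmatrix}
\hat{\alpha}_1 & + & & & & \beta_1 \\
0 & \hat{\alpha}_2 & + & & & -\beta_1 s_1 \\
 & 0 & \hat{\alpha}_3 & & & \beta_1 s_1 s_2 \\
 & & \ddots & \ddots & & \vdots \\
 & & & \beta_k & \alpha_k & 0 \\
 & & & & \beta_{k+1} & 0
\end{pmatrix}.$$
The final result after k steps is
$$\begin{pmatrix}
\hat{\alpha}_1 & + & & & & \gamma_0 \\
 & \hat{\alpha}_2 & + & & & \gamma_1 \\
 & & \hat{\alpha}_3 & + & & \gamma_2 \\
 & & & \ddots & \ddots & \vdots \\
 & & & & \hat{\alpha}_k & \gamma_{k-1} \\
 & & & & & \gamma_k
\end{pmatrix}
=: \begin{pmatrix} \hat{B}_k & \gamma^{(k)} \\ 0 & \gamma_k \end{pmatrix},$$
where $\gamma_i = (-1)^i \beta_1 s_1 s_2 \cdots s_i$ and $\gamma^{(k)} = \begin{pmatrix} \gamma_0 & \gamma_1 & \cdots & \gamma_{k-1} \end{pmatrix}^T$. If we define the product of plane rotations to be the orthogonal matrix $Q_{k+1} \in \mathbb{R}^{(k+1) \times (k+1)}$, we have the QR decomposition
$$B_{k+1} = Q_{k+1} \begin{pmatrix} \hat{B}_k \\ 0 \end{pmatrix} \tag{7.13}$$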
and
$$\begin{pmatrix} \gamma^{(k)} \\ \gamma_k \end{pmatrix} =
\begin{pmatrix} \gamma_0 \\ \gamma_1 \\ \vdots \\ \gamma_k \end{pmatrix}
= Q_{k+1}^T \begin{pmatrix} \beta_1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}. \tag{7.14}$$
Using the QR decomposition we can write
$$\|B_{k+1} y - \beta_1 e_1\|_2^2 = \|\hat{B}_k y - \gamma^{(k)}\|_2^2 + |\gamma_k|^2,$$
and the norm of the residual in the least squares problem is equal to $|\gamma_k| = |\beta_1 s_1 \cdots s_k|$. It follows that the norm of the residual can be computed recursively as we generate the scalar coefficients $\alpha_i$ and $\beta_i$, and thus it is possible to monitor the decay of the residual.
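The recursive residual formula $|\gamma_k| = |\beta_1 s_1 \cdots s_k|$ needs only the sines of the plane rotations. The following Python sketch (standard library only; the function name `residual_norms` is hypothetical, and nonzero coefficients are assumed) returns the residual norm of $\min_y \|B_{j+1} y - \beta_1 e_1\|_2$ for every $j = 1, \ldots, k$ from the coefficients alone.

```python
import math

def residual_norms(alpha, beta):
    """alpha = [alpha_1..alpha_k], beta = [beta_1..beta_{k+1}].

    Returns [|gamma_1|, ..., |gamma_k|], where |gamma_j| is the residual norm
    of the bidiagonal least squares problem after j steps.
    """
    norms = []
    gamma = beta[0]        # last component of the rotated right-hand side
    d = alpha[0]           # current diagonal element (modified by earlier rotations)
    for i in range(len(alpha)):
        r = math.hypot(d, beta[i + 1])
        c, s = d / r, beta[i + 1] / r      # rotation that zeros beta_{i+2} (1-based)
        gamma = -s * gamma                 # gamma_j = (-1)^j beta_1 s_1 ... s_j
        norms.append(abs(gamma))
        if i + 1 < len(alpha):
            d = c * alpha[i + 1]           # the next diagonal element is changed
    return norms

# Example use: B_2 = (3, 4)^T and beta_1 = 5.
print(residual_norms([3.0], [5.0, 4.0]))
```

In combination with a bidiagonalization routine, this makes it possible to stop the recursion as soon as the relative residual falls below a chosen threshold, without forming any solution vector.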
7.2.4 Matrix Approximation
The bidiagonalization procedure also gives a low-rank approximation of the matrix
A. Here it is slightly more convenient to consider the matrix AT for the derivation.
Assume that we want to use the columns of $Z_k$ as approximate basis vectors in $\mathbb{R}^n$. Then we can determine the coordinates of the columns of $A^T$ in terms of this basis by solving the least squares problem
$$\min_{S_k \in \mathbb{R}^{m \times k}} \|A^T - Z_k S_k^T\|_F. \tag{7.15}$$
Lemma 7.5. Given the matrix $A \in \mathbb{R}^{m \times n}$ and the matrix $Z_k \in \mathbb{R}^{n \times k}$ with orthonormal columns, the least squares problem (7.15) has the solution
$$S_k = P_{k+1} B_{k+1}.$$
Proof. Since the columns of $Z_k$ are orthonormal, the least squares problem has the solution
$$S_k^T = Z_k^T A^T,$$
which by (7.11) is the same as $S_k = P_{k+1} B_{k+1}$.
From the lemma we see that we have a least squares approximation $A^T \approx Z_k (P_{k+1} B_{k+1})^T$ or, equivalently,
$$A \approx P_{k+1} B_{k+1} Z_k^T.$$
However, this is not a "proper" rank-k approximation, since $P_{k+1} \in \mathbb{R}^{m \times (k+1)}$ and $B_{k+1} \in \mathbb{R}^{(k+1) \times k}$. Now, with the QR decomposition (7.13) of $B_{k+1}$ we have
$$P_{k+1} B_{k+1} = (P_{k+1} Q_{k+1})(Q_{k+1}^T B_{k+1}) = (P_{k+1} Q_{k+1}) \begin{pmatrix} \hat{B}_k \\ 0 \end{pmatrix} = W_k \hat{B}_k,$$
where $W_k$ is defined to be the first k columns of $P_{k+1} Q_{k+1}$. With $Y_k^T = \hat{B}_k Z_k^T$ we now have a proper rank-k approximation of A:
$$A \approx P_{k+1} B_{k+1} Z_k^T = W_k Y_k^T, \qquad W_k \in \mathbb{R}^{m \times k}, \quad Y_k \in \mathbb{R}^{n \times k}. \tag{7.16}$$
The low-rank approximation of A is illustrated as
$$A \approx W_k Y_k^T.$$
As before, we can interpret the low-rank approximation as follows. The columns of $W_k$ are a basis in a subspace of $\mathbb{R}^m$. The coordinates of column j of A in this basis are given in column j of $Y_k^T$.
7.2.5 Krylov Subspaces
In the LGK bidiagonalization, we create two sets of basis vectors: the $p_i$ and the $z_i$. It remains to demonstrate what subspaces they span. From the recursion we see that $z_1$ is a multiple of $A^T b$ and that $p_2$ is a linear combination of $b$ and $A A^T b$. By an easy induction proof one can show that
$$p_k \in \operatorname{span}\{b, A A^T b, (A A^T)^2 b, \ldots, (A A^T)^{k-1} b\},$$
$$z_k \in \operatorname{span}\{A^T b, (A^T A) A^T b, \ldots, (A^T A)^{k-1} A^T b\}$$
for $k = 1, 2, \ldots$. Denote
$$\mathcal{K}_k(C, b) = \operatorname{span}\{b, Cb, C^2 b, \ldots, C^{k-1} b\}.$$
This is called a Krylov subspace. We have the following result.
Proposition 7.6. The columns of $P_k$ are an orthonormal basis of $\mathcal{K}_k(A A^T, b)$, and the columns of $Z_k$ are an orthonormal basis of $\mathcal{K}_k(A^T A, A^T b)$.
7.2.6 Partial Least Squares
Partial least squares (PLS) [109, 111] is a recursive algorithm for computing approx-
imate least squares solutions and is often used in chemometrics. Different variants
of the algorithm exist, of which perhaps the most common is the so-called NIPALS
formulation.
The NIPALS PLS algorithm
1. $A_0 = A$
2. for $i = 1, 2, \ldots, k$
   (a) $w_i = \dfrac{1}{\|A_{i-1}^T b\|} A_{i-1}^T b$
   (b) $\tilde{u}_i = \dfrac{1}{\|A_{i-1} w_i\|} A_{i-1} w_i$
   (c) $\tilde{v}_i = A_{i-1}^T \tilde{u}_i$
   (d) $A_i = A_{i-1} - \tilde{u}_i \tilde{v}_i^T$
This algorithm differs from LGK bidiagonalization in a few significant ways, the most important being that the data matrix is deflated as soon as a new pair of vectors $(\tilde{u}_i, \tilde{v}_i)$ has been computed. However, it turns out [32, 110] that the PLS algorithm is mathematically equivalent to a variant of LGK bidiagonalization that is started by choosing not $p_1$ but instead $\alpha_1 z_1 = A^T b$. This implies that the vectors $(w_i)_{i=1}^k$ form an orthonormal basis in $\mathcal{K}_k(A^T A, A^T b)$, and $(\tilde{u}_i)_{i=1}^k$ form an orthonormal basis in $\mathcal{K}_k(A A^T, A A^T b)$.
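The four NIPALS steps translate directly into code. The following Python sketch (standard library only; the function name `nipals_pls` is hypothetical, the matrix is a list of rows, and it is assumed that neither normalization divides by zero) returns the vectors $w_i$ and $\tilde{u}_i$, so that their orthonormality can be checked numerically.

```python
import math

def nipals_pls(A, b, k):
    """Run k NIPALS PLS steps; returns (W, U) with W = [w_1..w_k], U = [u_1..u_k]."""
    A = [list(map(float, row)) for row in A]      # A_0 = A, copied because it is deflated
    m, n = len(A), len(A[0])
    W, U = [], []
    for _ in range(k):
        # (a) w_i = A_{i-1}^T b, normalized
        w = [sum(A[i][j] * b[i] for i in range(m)) for j in range(n)]
        nw = math.sqrt(sum(t * t for t in w))
        w = [t / nw for t in w]
        # (b) u_i = A_{i-1} w_i, normalized
        u = [sum(A[i][j] * w[j] for j in range(n)) for i in range(m)]
        nu = math.sqrt(sum(t * t for t in u))
        u = [t / nu for t in u]
        # (c) v_i = A_{i-1}^T u_i
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        # (d) deflation: A_i = A_{i-1} - u_i v_i^T
        for i in range(m):
            for j in range(n):
                A[i][j] -= u[i] * v[j]
        W.append(w)
        U.append(u)
    return W, U

# Example use with arbitrary small data.
W, U = nipals_pls([[1, 2, 0], [0, 1, 1], [1, 0, 3], [2, 1, 1]], [1, 2, 3, 4], 2)
print(sum(x * y for x, y in zip(W[0], W[1])))   # close to zero
```

Note that step (d) overwrites a full copy of the matrix in every iteration, which is the deflation cost referred to in Section 7.2.7.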
7.2.7 Computing the Bidiagonalization
The recursive version of the bidiagonalization suffers from the weakness that the generated vectors lose orthogonality. This can be remedied by reorthogonalizing the vectors, using a Gram–Schmidt process. Householder bidiagonalization, on the other hand, generates vectors that are as orthogonal as can be expected in floating point arithmetic; cf. Section 4.4. Therefore, for dense matrices A of moderate dimensions, one should use this variant.^12
For large and sparse or otherwise structured matrices, it is usually necessary
to use the recursive variant. This is because the Householder algorithm modifies
the matrix by orthogonal transformations and thus destroys the structure. Note
that for such problems, the PLS algorithm has the same disadvantage because it
deflates the matrix (step (d) in the algorithm above).
A version of LGK bidiagonalization that avoids storing all the vectors pi and
zi has been developed [75].
^12 However, if there are missing entries in the matrix, which is often the case in certain applications, then the PLS algorithm can be modified to estimate those; see, e.g., [111].
Chapter 8
Tensor Decomposition
8.1 Introduction
So far in this book we have considered linear algebra, where the main objects
are vectors and matrices. These can be thought of as one-dimensional and two-
dimensional arrays of data, respectively. For instance, in a term-document matrix,
each element is associated with one term and one document. In many applications,
data commonly are organized according to more than two categories. The corre-
sponding mathematical objects are usually referred to as tensors, and the area of
mathematics dealing with tensors is multilinear algebra. Here, for simplicity, we
restrict ourselves to tensors $A = (a_{ijk}) \in \mathbb{R}^{l \times m \times n}$ that are arrays of data with three subscripts; such a tensor can be illustrated symbolically as a three-dimensional block of numbers.
Example 8.1. In the classification of handwritten digits, the training set is a
collection of images, manually classified into 10 classes. Each such class is a set of
digits of one kind, which can be considered as a tensor; see Figure 8.1. If each digit
is represented as a 16 × 16 matrix of numbers representing gray scale, then a set of n digits can be organized as a tensor $A \in \mathbb{R}^{16 \times 16 \times n}$.
We will use the terminology of [60] and refer to a tensor $A \in \mathbb{R}^{l \times m \times n}$ as a 3-mode array,^13 i.e., the different "dimensions" of the array are called modes. The dimensions of a tensor $A \in \mathbb{R}^{l \times m \times n}$ are l, m, and n. In this terminology, a matrix is a 2-mode array.
^13 In some literature, the terminology 3-way and, in the general case, n-way, is used.

Figure 8.1. The image of one digit is a 16 × 16 matrix, and a collection of digits is a tensor.
In this chapter we present a generalization of the matrix SVD to 3-mode
tensors, and then, in Chapter 14, we describe how it can be used for face recognition.
The further generalization to n-mode tensors is easy and can be found, e.g., in [60].
In fact, the face recognition application requires 5-mode arrays.
The use of tensors in data analysis applications was pioneered by researchers
in psychometrics and chemometrics in the 1960s; see, e.g., [91].
8.2 Basic Tensor Concepts
First define the inner product of two tensors:
$$\langle A, B \rangle = \sum_{i,j,k} a_{ijk} b_{ijk}. \tag{8.1}$$
The corresponding norm is
$$\|A\|_F = \langle A, A \rangle^{1/2} = \Bigl( \sum_{i,j,k} a_{ijk}^2 \Bigr)^{1/2}. \tag{8.2}$$
If we specialize the definition to matrices (2-mode tensors), we see that this is
equivalent to the matrix Frobenius norm; see Section 2.4.
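The definitions (8.1) and (8.2) can be illustrated with a short Python sketch (standard library only). The storage convention is an assumption made for the illustration: a 3-mode tensor is a nested list with element $a_{ijk}$ stored as `A[i][j][k]`, and the function names `inner` and `fro_norm` are hypothetical.

```python
import math

def inner(A, B):
    """Inner product (8.1) of two tensors of equal dimensions, stored as nested lists."""
    return sum(a * b
               for Ai, Bi in zip(A, B)
               for Aij, Bij in zip(Ai, Bi)
               for a, b in zip(Aij, Bij))

def fro_norm(A):
    """Norm (8.2) induced by the inner product."""
    return math.sqrt(inner(A, A))

# Example use: a 2 x 2 x 2 tensor with entries 1, ..., 8.
A = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(inner(A, A), fro_norm(A))
```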
Next we define i-mode multiplication of a tensor by a matrix. The 1-mode product of a tensor $A \in \mathbb{R}^{l \times m \times n}$ by a matrix $U \in \mathbb{R}^{l_0 \times l}$, denoted by $A \times_1 U$, is an $l_0 \times m \times n$ tensor in which the entries are given by
$$(A \times_1 U)(j, i_2, i_3) = \sum_{k=1}^{l} u_{j,k}\, a_{k,i_2,i_3}. \tag{8.3}$$
For comparison, consider the matrix multiplication
$$A \times_1 U = UA, \qquad (UA)(i, j) = \sum_{k=1}^{l} u_{i,k}\, a_{k,j}. \tag{8.4}$$
We recall from Section 2.2 that matrix multiplication is equivalent to multiplying each column in A by the matrix U. Comparing (8.3) and (8.4) we see that the corresponding property is true for tensor-matrix multiplication: in the 1-mode product, all column vectors in the 3-mode array are multiplied by the matrix U.
Similarly, 2-mode multiplication of a tensor by a matrix V
$$(A \times_2 V)(i_1, j, i_3) = \sum_{k=1}^{m} v_{j,k}\, a_{i_1,k,i_3}$$
means that all row vectors of the tensor are multiplied by V. Note that 2-mode multiplication of a matrix by V is equivalent to matrix multiplication by $V^T$ from the right,
$$A \times_2 V = A V^T;$$
3-mode multiplication is analogous.
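The i-mode product can be sketched directly from the summation formula (8.3) and its analogues for the other modes. The Python illustration below uses only the standard library; the function name `mode_product` and the nested-list storage `A[i][j][k]` are assumptions made for the example, and no attempt is made at efficiency.

```python
def mode_product(A, U, mode):
    """Return the mode product of the 3-mode tensor A (nested lists) with the matrix U.

    mode is 1, 2, or 3; U must have as many columns as A has entries in that mode.
    """
    dims = [len(A), len(A[0]), len(A[0][0])]
    old = dims[mode - 1]          # dimension that is summed over
    dims[mode - 1] = len(U)       # it is replaced by the number of rows of U

    def entry(idx):
        total = 0
        for t in range(old):
            src = list(idx)
            src[mode - 1] = t     # run over the index of the chosen mode
            total += U[idx[mode - 1]][t] * A[src[0]][src[1]][src[2]]
        return total

    return [[[entry((i, j, k)) for k in range(dims[2])]
             for j in range(dims[1])]
            for i in range(dims[0])]

# Example use: a 2 x 2 matrix viewed as a 2 x 2 x 1 tensor.
A = [[[1], [2]], [[3], [4]]]
print(mode_product(A, [[0, 1], [1, 0]], 1))   # same as U A (rows swapped)
print(mode_product(A, [[1, 1]], 2))           # same as A V^T (row sums)
```

For a tensor with a single slice in the third mode, the two calls reproduce the matrix identities $A \times_1 U = UA$ and $A \times_2 V = AV^T$ stated above.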
It is sometimes convenient to unfold a tensor into a matrix. The unfolding of a tensor A along the three modes is defined (using (semi-)MATLAB notation; for a general definition,^14 see [60]) as
$$\mathbb{R}^{l \times mn} \ni \operatorname{unfold}_1(A) := A_{(1)} := \begin{pmatrix} A(:,1,:) & A(:,2,:) & \cdots & A(:,m,:) \end{pmatrix},$$
$$\mathbb{R}^{m \times ln} \ni \operatorname{unfold}_2(A) := A_{(2)} := \begin{pmatrix} A(:,:,1)^T & A(:,:,2)^T & \cdots & A(:,:,n)^T \end{pmatrix},$$
$$\mathbb{R}^{n \times lm} \ni \operatorname{unfold}_3(A) := A_{(3)} := \begin{pmatrix} A(1,:,:)^T & A(2,:,:)^T & \cdots & A(l,:,:)^T \end{pmatrix}.$$
It is seen that the unfolding along mode i makes that mode the first mode of the matrix $A_{(i)}$, and the other modes are handled cyclically. For instance, row i of $A_{(j)}$ contains all the elements of A, which have the jth index equal to i. The following is another way of putting it.
1. The column vectors of A are column vectors of $A_{(1)}$.
2. The row vectors of A are column vectors of $A_{(2)}$.
3. The 3-mode vectors of A are column vectors of $A_{(3)}$.
The 1-unfolding of A is equivalent to dividing the tensor into slices A(:, i, :) (which are matrices) and arranging the slices in a long matrix $A_{(1)}$.
^14 For the matrix case, $\operatorname{unfold}_1(A) = A$, and $\operatorname{unfold}_2(A) = A^T$.
Example 8.2. Let $B \in \mathbb{R}^{3 \times 3 \times 3}$ be a tensor, defined in MATLAB as

B(:,:,1) =        B(:,:,2) =        B(:,:,3) =
   1   2   3        11  12  13        21  22  23
   4   5   6        14  15  16        24  25  26
   7   8   9        17  18  19        27  28  29

Then unfolding along the third mode gives

>> B3 = unfold(B,3)
B3 =
    1   2   3   4   5   6   7   8   9
   11  12  13  14  15  16  17  18  19
   21  22  23  24  25  26  27  28  29
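The three unfoldings can be sketched in Python as follows (standard library only). This is an illustration under stated assumptions, not the `unfold` function used in the MATLAB session above: the name `unfold` is reused for convenience, the tensor is a nested list with element $a_{ijk}$ stored as `A[i][j][k]`, and the column ordering follows the three definitions given in the text.

```python
def unfold(A, mode):
    """Unfold the 3-mode tensor A (nested lists A[i][j][k]) along mode 1, 2, or 3."""
    l, m, n = len(A), len(A[0]), len(A[0][0])
    if mode == 1:   # (A(:,1,:) A(:,2,:) ... A(:,m,:)), an l x mn matrix
        return [[A[i][j][k] for j in range(m) for k in range(n)] for i in range(l)]
    if mode == 2:   # (A(:,:,1)^T ... A(:,:,n)^T), an m x ln matrix
        return [[A[i][j][k] for k in range(n) for i in range(l)] for j in range(m)]
    if mode == 3:   # (A(1,:,:)^T ... A(l,:,:)^T), an n x lm matrix
        return [[A[i][j][k] for i in range(l) for j in range(m)] for k in range(n)]
    raise ValueError("mode must be 1, 2, or 3")

# The tensor of Example 8.2: B(i,j,k) = 10(k-1) + 3(i-1) + j in 1-based indexing.
B = [[[10 * k + 3 * i + j + 1 for k in range(3)] for j in range(3)] for i in range(3)]
for row in unfold(B, 3):
    print(row)
```

With this storage convention the mode-3 unfolding reproduces the three rows shown in Example 8.2.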
The inverse of the unfolding operation is written
$$\operatorname{fold}_i(\operatorname{unfold}_i(A)) = A.$$
For the folding operation to be well defined, information about the target tensor must be supplied. In our somewhat informal presentation we suppress this.
Using the unfolding-folding operations, we can now formulate a matrix multiplication equivalent of i-mode tensor multiplication:
$$A \times_i U = \operatorname{fold}_i(U \operatorname{unfold}_i(A)) = \operatorname{fold}_i(U A_{(i)}). \tag{8.5}$$
It follows immediately from the definition that i-mode and j-mode multiplication commute if $i \neq j$:
$$(A \times_i F) \times_j G = (A \times_j G) \times_i F = A \times_i F \times_j G.$$
Two i-mode multiplications satisfy the identity
$$(A \times_i F) \times_i G = A \times_i (GF).$$
This is easily proved using (8.5):
$$\begin{aligned}
(A \times_i F) \times_i G &= (\operatorname{fold}_i(F (\operatorname{unfold}_i(A)))) \times_i G \\
&= \operatorname{fold}_i(G (\operatorname{unfold}_i(\operatorname{fold}_i(F (\operatorname{unfold}_i(A)))))) \\
&= \operatorname{fold}_i(G F \operatorname{unfold}_i(A)) = A \times_i (GF).
\end{aligned}$$
8.3 A Tensor SVD
The matrix SVD can be generalized to tensors in different ways. We present one such
generalization that is analogous to an approximate principal component analysis. It
is often referred to as the higher order SVD (HOSVD)^15 [60].
^15 HOSVD is related to the Tucker model in psychometrics and chemometrics [98, 99].
Theorem 8.3 (HOSVD). The tensor $A \in \mathbb{R}^{l \times m \times n}$ can be written as
$$A = S \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)}, \tag{8.6}$$
where $U^{(1)} \in \mathbb{R}^{l \times l}$, $U^{(2)} \in \mathbb{R}^{m \times m}$, and $U^{(3)} \in \mathbb{R}^{n \times n}$ are orthogonal matrices. S is a tensor of the same dimensions as A; it has the property of all-orthogonality: any two slices of S are orthogonal in the sense of the scalar product (8.1):
$$\langle S(i,:,:), S(j,:,:) \rangle = \langle S(:,i,:), S(:,j,:) \rangle = \langle S(:,:,i), S(:,:,j) \rangle = 0$$
for $i \neq j$. The 1-mode singular values are defined by
$$\sigma_j^{(1)} = \|S(j,:,:)\|_F, \qquad j = 1, \ldots, l,$$
and they are ordered
$$\sigma_1^{(1)} \ge \sigma_2^{(1)} \ge \cdots \ge \sigma_l^{(1)}. \tag{8.7}$$
The singular values in other modes and their ordering are analogous.
Proof. We give only the recipe for computing the orthogonal factors and the tensor S; for a full proof, see [60]. Compute the SVDs,
$$A_{(i)} = U^{(i)} \Sigma^{(i)} (V^{(i)})^T, \qquad i = 1, 2, 3, \tag{8.8}$$
and put
$$S = A \times_1 (U^{(1)})^T \times_2 (U^{(2)})^T \times_3 (U^{(3)})^T.$$
It remains to show that the slices of S are orthogonal and that the i-mode singular values are decreasingly ordered.
The all-orthogonal tensor S is usually referred to as the core tensor. The HOSVD is visualized in Figure 8.2.

Figure 8.2. Visualization of the HOSVD.
Equation (8.6) can also be written
$$A_{ijk} = \sum_{p=1}^{l} \sum_{q=1}^{m} \sum_{s=1}^{n} u_{ip}^{(1)} u_{jq}^{(2)} u_{ks}^{(3)} S_{pqs},$$
which has the following interpretation: the element $S_{pqs}$ reflects the variation by the combination of the singular vectors $u_p^{(1)}$, $u_q^{(2)}$, and $u_s^{(3)}$.
The computation of the HOSVD is straightforward and is implemented by the following MATLAB code, although somewhat inefficiently:^16

function [U1,U2,U3,S,s1,s2,s3]=svd3(A);
% Compute the HOSVD of a 3-way tensor A
[U1,s1,v]=svd(unfold(A,1));
[U2,s2,v]=svd(unfold(A,2));
[U3,s3,v]=svd(unfold(A,3));
S=tmul(tmul(tmul(A,U1',1),U2',2),U3',3);

The function tmul(A,X,i) is assumed to multiply the tensor A by the matrix X in mode i, $A \times_i X$.
Let V be orthogonal of the same dimension as $U^{(i)}$; then from the identities [60]
$$S \times_i U^{(i)} = S \times_i (U^{(i)} V V^T) = (S \times_i V^T) \times_i (U^{(i)} V),$$
it may appear that the HOSVD is not unique. However, the property that the i-mode singular values are ordered is destroyed by such transformations. Thus, the HOSVD is essentially unique; the exception is when there are equal singular values along any mode. (This is the same type of nonuniqueness that occurs with the matrix SVD.)
In some applications it happens that the dimension of one mode is larger than the product of the dimensions of the other modes. Assume, for instance, that $A \in \mathbb{R}^{l \times m \times n}$ with $l > mn$. Then it can be shown that the core tensor S satisfies
$$S(i,:,:) = 0, \qquad i > mn,$$
and we can omit the zero part of the core and rewrite (8.6) as a thin HOSVD,
$$A = \hat{S} \times_1 \hat{U}^{(1)} \times_2 U^{(2)} \times_3 U^{(3)}, \tag{8.9}$$
where $\hat{S} \in \mathbb{R}^{mn \times m \times n}$ and $\hat{U}^{(1)} \in \mathbb{R}^{l \times mn}$.
^16 Exercise: In what sense is the computation inefficient?

8.4 Approximating a Tensor by HOSVD
A matrix can be written in terms of the SVD as a sum of rank-1 terms; see (6.2). An analogous expansion of a tensor can be derived using the definition of tensor-matrix multiplication: a tensor $A \in \mathbb{R}^{l \times m \times n}$ can be expressed as a sum of matrices times singular vectors:
$$A = \sum_{i=1}^{n} A_i \times_3 u_i^{(3)}, \qquad A_i = S(:,:,i) \times_1 U^{(1)} \times_2 U^{(2)}, \tag{8.10}$$
where $u_i^{(3)}$ are column vectors in $U^{(3)}$. The $A_i$ are to be identified as both matrices in $\mathbb{R}^{l \times m}$ and tensors in $\mathbb{R}^{l \times m \times 1}$. The expansion (8.10) is illustrated as
$$A = A_1 + A_2 + \cdots.$$
This expansion is analogous along the other modes.
It is easy to show that the $A_i$ matrices are orthogonal in the sense of the scalar product (8.1):
$$\begin{aligned}
\langle A_i, A_j \rangle &= \operatorname{tr}[U^{(2)} S(:,:,i)^T (U^{(1)})^T U^{(1)} S(:,:,j) (U^{(2)})^T] \\
&= \operatorname{tr}[S(:,:,i)^T S(:,:,j)] = 0.
\end{aligned}$$
(Here we have identified the slices S(:, :, i) with matrices and used the identity tr(AB) = tr(BA).)
It is now seen that the expansion (8.10) can be interpreted as follows. Each slice along the third mode of the tensor A can be written (exactly) in terms of the orthogonal basis $(A_i)_{i=1}^{r_3}$, where $r_3$ is the number of positive 3-mode singular values of A:
$$A(:,:,j) = \sum_{i=1}^{r_3} z_i^{(j)} A_i, \tag{8.11}$$
where $z_i^{(j)}$ is the jth component of $u_i^{(3)}$. In addition, we have a simultaneous orthogonal factorization of the $A_i$,
$$A_i = S(:,:,i) \times_1 U^{(1)} \times_2 U^{(2)},$$
which, due to the ordering (8.7) of all the j-mode singular values for different j, has the property that the "mass" of each S(:, :, i) is concentrated at the upper left corner.
We illustrate the HOSVD in the following example.
Example 8.4. Given 131 handwritten digits,^17 where each digit is a 16 × 16 matrix, we computed the HOSVD of the 16 × 16 × 131 tensor. In Figure 8.3 we plot the singular values along the third mode (different digits); it is seen that quite a large percentage of the variation of the digits is accounted for by the first 20 singular values (note the logarithmic scale). In fact,
$$\frac{\sum_{i=1}^{20} (\sigma_i^{(3)})^2}{\sum_{i=1}^{131} (\sigma_i^{(3)})^2} \approx 0.91.$$
The first three matrices $A_1$, $A_2$, and $A_3$ are illustrated in Figure 8.4. It is seen that the first matrix looks like a mean value of different 3's; that is the dominating "direction" of the 131 digits, when considered as points in $\mathbb{R}^{256}$. The next two images represent the dominating directions of variation from the "mean value" among the different digits.
In the bottom row of Figure 8.4, we plot the absolute values of the three slices of the core tensor, S(:, :, 1), S(:, :, 2), and S(:, :, 3). It is seen that the mass of these matrices is concentrated at the upper left corner.
^17 From a U.S. Postal Service database, downloaded from the Web page of [47].

Figure 8.3. The singular values in the digit (third) mode.
Figure 8.4. The top row shows the three matrices $A_1$, $A_2$, and $A_3$, and the bottom row shows the three slices of the core tensor, S(:, :, 1), S(:, :, 2), and S(:, :, 3) (absolute values of the components).
If we truncate the expansion (8.10),
$$A \approx \sum_{i=1}^{k} A_i \times_3 u_i^{(3)}, \qquad A_i = S(:,:,i) \times_1 U^{(1)} \times_2 U^{(2)},$$
for some k, then we have an approximation of the tensor (here in the third mode) in terms of an orthogonal basis. We saw in (8.11) that each 3-mode slice A(:, :, j) of A can be written as a linear combination of the orthogonal basis matrices $A_i$. In the classification of handwritten digits (cf. Chapter 10), one may want to compute the coordinates of an unknown digit in terms of the orthogonal basis. This is easily done due to the orthogonality of the basis.
Example 8.5. Let Z denote an unknown digit. For classification purposes we want to compute the coordinates of Z in terms of the basis of 3's from the previous example. This is done by solving the least squares problem
$$\min_z \Bigl\| Z - \sum_j z_j A_j \Bigr\|_F,$$
where the norm is the matrix Frobenius norm. Put
$$G(z) = \frac{1}{2} \Bigl\| Z - \sum_j z_j A_j \Bigr\|_F^2 = \frac{1}{2} \Bigl\langle Z - \sum_j z_j A_j,\; Z - \sum_j z_j A_j \Bigr\rangle.$$
Since the basis is orthogonal with respect to the scalar product,
$$\langle A_i, A_j \rangle = 0 \quad \text{for } i \neq j,$$
we can rewrite
$$G(z) = \frac{1}{2} \langle Z, Z \rangle - \sum_j z_j \langle Z, A_j \rangle + \frac{1}{2} \sum_j z_j^2 \langle A_j, A_j \rangle.$$
To find the minimum, we compute the partial derivatives with respect to the $z_j$ and put them equal to zero,
$$\frac{\partial G}{\partial z_j} = -\langle Z, A_j \rangle + z_j \langle A_j, A_j \rangle = 0,$$
which gives the solution of the least squares problem as
$$z_j = \frac{\langle Z, A_j \rangle}{\langle A_j, A_j \rangle}, \qquad j = 1, 2, \ldots.$$
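The closed-form solution derived in Example 8.5 amounts to one inner product per basis matrix. A minimal Python sketch follows (standard library only); the function names `inner` and `coordinates` are hypothetical, matrices are lists of rows, and the two small basis matrices are made-up illustrative data chosen to be orthogonal in the sense of (8.1), not digits from the example.

```python
def inner(X, Y):
    """Matrix inner product <X, Y> = sum of elementwise products."""
    return sum(x * y for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

def coordinates(Z, basis):
    """Coordinates z_j = <Z, A_j> / <A_j, A_j> of Z in an orthogonal basis."""
    return [inner(Z, A) / inner(A, A) for A in basis]

# Illustrative data: two mutually orthogonal 2 x 2 basis matrices.
A1 = [[1, 0], [0, 1]]
A2 = [[0, 2], [-2, 0]]
Z = [[3, 1], [-1, 3]]          # equals 3*A1 + 0.5*A2
print(coordinates(Z, [A1, A2]))
```

The formula is valid only because the basis is orthogonal; for a nonorthogonal basis, such as the compressed basis matrices of Example 8.6, a general least squares solver would be needed instead.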
Figure 8.5. Compressed basis matrices of handwritten 3's.
Because the mass of the core tensor is concentrated for small values of the three indices, it is possible to perform a simultaneous data compression in all three modes by the HOSVD. Here we assume that we compress to $k_i$ columns in mode i. Let $U_{k_i}^{(i)} = U^{(i)}(:, 1:k_i)$ and $\hat{S} = S(1:k_1, 1:k_2, 1:k_3)$. Then consider the approximation
$$A \approx \hat{A} = \hat{S} \times_1 U_{k_1}^{(1)} \times_2 U_{k_2}^{(2)} \times_3 U_{k_3}^{(3)}.$$
[Illustration: the tensor A approximated by the smaller core $\hat{S}$ multiplied by $U_{k_1}^{(1)}$, $U_{k_2}^{(2)}$, and $U_{k_3}^{(3)}$ along the three modes.]
Example 8.6. We compressed the basis matrices $A_j$ of 3's from Example 8.4. In Figure 8.5 we illustrate the compressed basis matrices
$$\hat{A}_j = S(1:8, 1:8, j) \times_1 U_8^{(1)} \times_2 U_8^{(2)}.$$
See the corresponding full-basis matrices in Figure 8.4. Note that the new basis matrices $\hat{A}_j$ are no longer orthogonal.
Chapter 9
Clustering and
Nonnegative Matrix
Factorization
An important method for data compression and classification is to organize data
points in clusters. A cluster is a subset of the set of data points that are close
together in some distance measure. One can compute the mean value of each cluster
separately and use the means as representatives of the clusters. Equivalently, the
means can be used as basis vectors, and all the data points represented by their
coordinates with respect to this basis.
Figure 9.1. Two clusters in $\mathbb{R}^3$.
Example 9.1. In Figure 9.1 we illustrate a set of data points in $\mathbb{R}^3$, generated
from two correlated normal distributions. Assuming that we know that we have
two clusters, we can easily determine visually which points belong to which class.
A clustering algorithm takes the complete set of points and classifies them using
some distance measure.
There are several methods for computing a clustering. One of the most im-
portant is the k-means algorithm. We describe it in Section 9.1.
In data mining applications, the matrix is often nonnegative. If we compute a
low-rank approximation of the matrix using the SVD, then, due to the orthogonality
of the singular vectors, we are very likely to obtain factors with negative elements.
It may seem somewhat unnatural to approximate a nonnegative matrix by a low-
rank approximation with negative elements. Instead one often wants to compute a
low-rank approximation with nonnegative factors:
A ≈ WH, W,H ≥ 0. (9.1)
In many applications, a nonnegative factorization facilitates the interpretation of
the low-rank approximation in terms of the concepts of the application.
In Chapter 11 we will apply a clustering algorithm to a nonnegative matrix
and use the cluster centers as basis vectors, i.e., as columns in the matrix W in (9.1).
However, this does not guarantee that H also is nonnegative. Recently several algo-
rithms for computing such nonnegative matrix factorizations have been proposed,
and they have been used successfully in different applications. We describe such
algorithms in Section 9.2.
9.1 The k-Means Algorithm
We assume that we have n data points $(a_j)_{j=1}^n \in \mathbb{R}^m$, which we organize as columns in a matrix $A \in \mathbb{R}^{m \times n}$. Let $\Pi = (\pi_i)_{i=1}^k$ denote a partitioning of the vectors $a_1, a_2, \ldots, a_n$ into k clusters:
$$\pi_j = \{\nu \mid a_\nu \text{ belongs to cluster } j\}.$$
Let the mean, or the centroid, of the cluster be
$$m_j = \frac{1}{n_j} \sum_{\nu \in \pi_j} a_\nu,$$
where $n_j$ is the number of elements in $\pi_j$. We will describe a k-means algorithm, based on the Euclidean distance measure.
The tightness or coherence of cluster $\pi_j$ can be measured as the sum
$$q_j = \sum_{\nu \in \pi_j} \|a_\nu - m_j\|_2^2.$$
The closer the vectors are to the centroid, the smaller the value of $q_j$. The quality of a clustering can be measured as the overall coherence:
$$Q(\Pi) = \sum_{j=1}^{k} q_j = \sum_{j=1}^{k} \sum_{\nu \in \pi_j} \|a_\nu - m_j\|_2^2.$$
In the k-means algorithm we seek a partitioning that has optimal coherence, in the sense that it is the solution of the minimization problem
$$\min_\Pi Q(\Pi).$$
The basic idea of the algorithm is straightforward: given a provisional partitioning,
one computes the centroids. Then for each data point in a particular cluster, one
checks whether there is another centroid that is closer than the present cluster
centroid. If that is the case, then a redistribution is made.
The k-means algorithm
1. Start with an initial partitioning Π(0) and compute the corresponding centroid
vectors, (m
(0)
j )
k
j=1. Compute Q(Π
(0)). Put t = 1.
2. For each vector ai, find the closest centroid. If the closest vector is m
(t−1)
p ,
assign ai to π
(t)
p .
3. Compute the centroids (m
(t)
j )
k
j=1 of the new partitioning Π
(t).
4. if |Q(Π(t−1)) − Q(Π(t))| < tol, then stop; otherwise increment t by 1 and go to step 2. The initial partitioning is often chosen randomly. The algorithm usually has rather fast convergence, but one cannot guarantee that the algorithm finds the global minimum. Example 9.2. A standard example in clustering is taken from a breast cancer diagnosis study [66].18 The matrix A ∈ R9×683 contains data from breast cytology tests. Out of the 683 tests, 444 represent a diagnosis of benign and 239 a diagnosis of malignant. We iterated with k = 2 in the k-means algorithm until the relative difference in the function Q(Π) was less than 10−10. With a random initial parti- tioning the iteration converged in six steps (see Figure 9.2), where we give the values of the objective function. Note, however, that the convergence is not monotone: the objective function was smaller after step 3 than after step 6. It turns out that in many cases the algorithm gives only a local minimum. As the test data have been manually classified, it is known which patients had the benign and which the malignant cancer, and we can check the clustering given by the algorithm. The results are given in Table 9.1. Of the 239 patients with malignant cancer, the k-means algorithm classified 222 correctly but 17 incorrectly. In Chapter 11 and in the following example, we use clustering for information retrieval or text mining. The centroid vectors are used as basis vectors, and the documents are represented by their coordinates in terms of the basis vectors. 18See http://www.radwin.org/michael/projects/learning/about-breast-cancer-wisconsin.html. book 2007/2/23 page 104 104 Chapter 9. Clustering and Nonnegative Matrix Factorization 1 2 3 4 5 6 100 120 140 160 180 200 220 240 260 Iteration Q (Π ) Figure 9.2. The objective function in the k-means algorithm for the breast cancer data. Table 9.1. Classification of cancer data with the k-means algorithm. B stands for benign and M for malignant cancer. 
            k-means
            M      B
    M     222     17
    B       9    435

Example 9.3. Consider the term-document matrix in Example 1.1,
\[
A = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 \\
0 & 1 & 1 & 0 & 0
\end{pmatrix},
\]
and recall that the first four documents deal with Google and the ranking of Web
pages, while the fifth is about football. With this knowledge, we can take the
average of the first four column vectors as the centroid of that cluster and the
fifth as the second centroid, i.e., we use the normalized basis vectors
\[
C = \begin{pmatrix}
0.1443 & 0 \\
0 & 0.5774 \\
0 & 0.5774 \\
0.2561 & 0 \\
0.1443 & 0 \\
0.1443 & 0 \\
0.4005 & 0 \\
0.2561 & 0 \\
0.2561 & 0.5774 \\
0.2561 & 0
\end{pmatrix}.
\]
The coordinates of the columns of A in terms of this approximate basis are
computed by solving
\[ \min_{H} \| A - CH \|_F . \]
Given the thin QR decomposition C = QR, this least squares problem has the
solution $H = R^{-1} Q^T A$ with
\[
H = \begin{pmatrix}
1.7283 & 1.4168 & 2.8907 & 1.5440 & 0.0000 \\
-0.2556 & -0.2095 & 0.1499 & 0.3490 & 1.7321
\end{pmatrix}.
\]
We see that the first two columns have negative coordinates in terms of the
second basis vector. This is rather difficult to interpret in the term-document
setting. For instance, it means that the first column $a_1$ is approximated by
\[
a_1 \approx C h_1 = \begin{pmatrix}
0.2495 \\ -0.1476 \\ -0.1476 \\ 0.4427 \\ 0.2495 \\ 0.2495 \\ 0.6921 \\ 0.4427 \\ 0.2951 \\ 0.4427
\end{pmatrix}.
\]
It is unclear what it may signify that this “approximate document” has negative
entries for the words England and FIFA. Finally we note, for later reference,
that the relative approximation error is rather high:
\[ \frac{\| A - CH \|_F}{\| A \|_F} \approx 0.596. \tag{9.2} \]
In the next section we will approximate the matrix in the preceding example,
making sure that both the basis vectors and the coordinates are nonnegative.
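The computation of C and H in Example 9.3 can be sketched in MATLAB as follows.
This is a minimal sketch, not the code used for the book. It assumes that the
columns of A are scaled to unit Euclidean length before the first four are
averaged, and that the first centroid is not rescaled afterwards; under these
assumptions the entries of C and H should agree with the printed values to four
digits. The names An and relerr are illustrative.

```matlab
% Term-document matrix of Example 9.3 (10 terms, 5 documents).
A = [0 0 0 1 0; 0 0 0 0 1; 0 0 0 0 1; 1 0 1 0 0; 1 0 0 0 0;
     0 1 0 0 0; 1 0 1 1 0; 0 1 1 0 0; 0 0 1 1 1; 0 1 1 0 0];

% Assumption: columns are scaled to unit length before averaging.
An = A*diag(1./sqrt(sum(A.^2,1)));
C  = [mean(An(:,1:4),2), An(:,5)];   % Google centroid, football centroid

% Solve min_H ||A - C*H||_F using the thin QR decomposition C = Q*R.
[Q,R] = qr(C,0);
H = R\(Q'*A);                        % coordinates of the columns of A

% Relative approximation error; compare with (9.2).
relerr = norm(A-C*H,'fro')/norm(A,'fro');
```

Since the fifth column of A is a multiple of the second centroid, its
coordinates are exactly (0, sqrt(3)), which matches the last column of H above.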
9.2 Nonnegative Matrix Factorization
Given a data matrix $A \in \mathbb{R}^{m \times n}$, we want to compute a rank-k approximation that
is constrained to have nonnegative factors. Thus, assuming that
$W \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{k \times n}$, we want to solve
\[ \min_{W \ge 0,\; H \ge 0} \| A - WH \|_F . \tag{9.3} \]
Considered as an optimization problem for W and H at the same time, this problem
is nonlinear. However, if one of the unknown matrices were known, W, say, then
the problem of computing H would be a standard, nonnegatively constrained, least
squares problem with a matrix right-hand side. Therefore the most common way of
solving (9.3) is to use an alternating least squares (ALS) procedure [73]:

Alternating nonnegative least squares algorithm
1. Guess an initial value $W^{(1)}$.
2. for k = 1, 2, . . . until convergence
   (a) Solve $\min_{H \ge 0} \| A - W^{(k)} H \|_F$, giving $H^{(k)}$.
   (b) Solve $\min_{W \ge 0} \| A - W H^{(k)} \|_F$, giving $W^{(k+1)}$.

However, the factorization WH is not unique: we can introduce any diagonal
matrix D with positive diagonal elements and its inverse between the factors,
\[ WH = (WD)(D^{-1}H). \]
To avoid growth of one factor and decay of the other, we need to normalize one
of them in every iteration. A common normalization is to scale the columns of W
so that the largest element in each column becomes equal to 1.

Let $a_j$ and $h_j$ be the columns of A and H. Writing out the columns one by one,
we see that the matrix least squares problem $\min_{H \ge 0} \| A - W^{(k)} H \|_F$ is
equivalent to n independent vector least squares problems:
\[ \min_{h_j \ge 0} \| a_j - W^{(k)} h_j \|_2 , \qquad j = 1, 2, \ldots, n. \]
These can be solved by an active-set algorithm¹⁹ from [61, Chapter 23]. By
transposing the matrices, the least squares problem for determining W can be
reformulated as m independent vector least squares problems.

¹⁹ The algorithm is implemented in MATLAB as a function lsqnonneg.

Thus the core of the ALS algorithm can be written in pseudo-MATLAB:
    while (not converged)
        [W]=normalize(W);
        for i=1:n
            H(:,i)=lsqnonneg(W,A(:,i));
        end
        for i=1:m
            w=lsqnonneg(H',A(i,:)');
            W(i,:)=w';
        end
    end

There are many variants of algorithms for nonnegative matrix factorization. The
above algorithm has the drawback that the active set algorithm for nonnegative
least squares is rather time-consuming. As a cheaper alternative, given the thin
QR decomposition W = QR, one can take the unconstrained least squares solution,
\[ H = R^{-1} Q^T A, \]
and then set all negative elements in H equal to zero, and similarly in the
other step of the algorithm. Improvements that accentuate sparsity are described
in [13].

A multiplicative algorithm was given in [63]:

    while (not converged)
        W=W.*(W>=0);
        H=H.*(W'*A)./((W'*W)*H+epsilon);
        H=H.*(H>=0);
        W=W.*(A*H')./(W*(H*H')+epsilon);
        [W,H]=normalize(W,H);
    end

(The variable epsilon should be given a small value and is used to avoid division
by zero.) The matrix operations with the operators .* and ./ are equivalent to the
componentwise statements
\[
H_{ij} := H_{ij} \frac{(W^T A)_{ij}}{(W^T W H)_{ij} + \epsilon}, \qquad
W_{ij} := W_{ij} \frac{(A H^T)_{ij}}{(W H H^T)_{ij} + \epsilon}.
\]
The algorithm can be considered as a gradient descent method.
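The multiplicative iteration can be turned into a runnable script along the
following lines. This is a sketch under stated assumptions, not a reference
implementation: the stopping rule (a fixed iteration budget maxit and a tolerance
tol on the change of the relative error) is an assumption made here, the
normalization is written out inline instead of calling a normalize function, and
the small term-document matrix of Example 9.3 serves as test data.

```matlab
% Multiplicative updates for A ~ W*H with nonnegative factors (sketch).
A = [0 0 0 1 0; 0 0 0 0 1; 0 0 0 0 1; 1 0 1 0 0; 1 0 0 0 0;
     0 1 0 0 0; 1 0 1 1 0; 0 1 1 0 0; 0 0 1 1 1; 0 1 1 0 0];
[m,n] = size(A);
k = 2;                  % rank of the approximation
epsilon = 1e-9;         % avoids division by zero
maxit = 500;            % assumed iteration budget
tol = 1e-8;             % assumed tolerance on the change in the error

W = rand(m,k);          % random nonnegative starting guess
H = rand(k,n);
err0 = norm(A-W*H,'fro')/norm(A,'fro');
errold = err0;
for it = 1:maxit
    H = H.*(W'*A)./((W'*W)*H+epsilon);
    W = W.*(A*H')./(W*(H*H')+epsilon);
    % Normalize: the largest element in each column of W becomes 1;
    % H is rescaled so that the product W*H is unchanged.
    d = max(W,[],1);
    d(d==0) = 1;
    W = W*diag(1./d);
    H = diag(d)*H;
    err = norm(A-W*H,'fro')/norm(A,'fro');
    if abs(errold-err) < tol, break, end
    errold = err;
end
```

Because the start is random, the final value of err can differ between runs; the
factors stay nonnegative by construction, since every update multiplies
nonnegative numbers.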
Since there are so many important applications of nonnegative matrix fac-
torizations, algorithm development is an active research area. For instance, the
problem of finding a termination criterion for the iterations does not seem to have
found a good solution. A survey of different algorithms is given in [13].
A nonnegative factorization A ≈ WH can be used for clustering: the data
vector aj is assigned to cluster i if hij is the largest element in column j of H
[20, 37].
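As a small illustration of this clustering rule, the following sketch applies it
to the matrix H of the rank-2 factorization printed in Example 9.4. Only the
printed numbers are used; the variable names are illustrative.

```matlab
% H from the rank-2 factorization quoted in Example 9.4 (2 x 5).
H = [0.7740 0      0.9687 0.9120 0.5251;
     0      1.0863 0.8214 0      0     ];

% For each column j, the row index of the largest element is the cluster.
[hmax,cluster] = max(H,[],1);
disp(cluster)            % cluster is [1 2 1 1 1]
```

With these numbers only the second document is assigned to cluster 2.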
Nonnegative matrix factorization is used in a large variety of applications:
document clustering and email surveillance [85, 8], music transcription [90], bioin-
formatics [20, 37], and spectral analysis [78], to mention a few.
9.2.1 Initialization
A problem with several of the algorithms for nonnegative matrix factorization is
that convergence to a global minimum is not guaranteed. It often happens that
convergence is slow and that a suboptimal approximation is reached. An efficient
procedure for computing a good initial approximation can be based on the SVD
of A [18]. We know that the first k singular triplets $(\sigma_i, u_i, v_i)_{i=1}^{k}$ give the best
rank-k approximation of A in the Frobenius norm. It is easy to see that if A is a
nonnegative matrix, then $u_1$ and $v_1$ are nonnegative (cf. Section 6.4). Therefore, if
$A = U \Sigma V^T$ is the SVD of A, we can take the first singular vector $u_1$ as the first
column in $W^{(1)}$ (and $v_1^T$ as the first row in an initial approximation $H^{(1)}$, if that
is needed in the algorithm; we will treat only the approximation of $W^{(1)}$ in the
following).
The next best vector, $u_2$, is very likely to have negative components, due to
orthogonality. But if we compute the matrix $C^{(2)} = u_2 v_2^T$ and replace all negative
elements with zero, giving the nonnegative matrix $C_+^{(2)}$, then we know that the first
singular vector of this matrix is nonnegative. Furthermore, we can hope that it is
a reasonably good approximation of $u_2$, so we can take it as the second column
of $W^{(1)}$.
The procedure can be implemented by the following, somewhat simplified,
MATLAB script:
[U,S,V]=svds(A,k); % Compute only the k largest singular
% values and the corresponding vectors
W(:,1)=U(:,1);
for j=2:k
C=U(:,j)*V(:,j)';
C=C.*(C>=0);
[u,s,v]=svds(C,1);
W(:,j)=u;
end
The MATLAB call [U,S,V]=svds(A,k) computes only the k largest singular values
and the corresponding singular vectors using a Lanczos method; see Section
15.8.3. The standard SVD function svd(A) computes the full decomposition and is
usually considerably slower, especially when the matrix is large and sparse.
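Singular vectors are determined only up to sign, so a computed $u_1$ may come out
as a nonpositive vector even when A is nonnegative. The following variant of the
script is a sketch that guards against this by taking absolute values. The abs
calls are an addition made here, not part of the original script, and
svd(A,'econ') replaces svds because the example matrix (from Example 9.3) is
small and dense.

```matlab
A = [0 0 0 1 0; 0 0 0 0 1; 0 0 0 0 1; 1 0 1 0 0; 1 0 0 0 0;
     0 1 0 0 0; 1 0 1 1 0; 0 1 1 0 0; 0 0 1 1 1; 0 1 1 0 0];
k = 2;

[U,S,V] = svd(A,'econ');         % use svds(A,k) for large sparse matrices
W = zeros(size(A,1),k);
W(:,1) = abs(U(:,1));            % first singular vector with the sign fixed
for j = 2:k
    C = U(:,j)*V(:,j)';          % rank-1 matrix u_j*v_j'
    C = C.*(C>=0);               % replace negative elements with zero
    [u,s,v] = svd(C,'econ');
    W(:,j) = abs(u(:,1));        % leading singular vector, sign fixed
end
```

The absolute value only removes the sign ambiguity; it is harmless when the
computed vector is already nonnegative.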
Example 9.4. We computed a rank-2 nonnegative factorization of the matrix
A in Example 9.3, using a random initialization and the SVD-based initialization.
With the random initialization, convergence was slower (see Figure 9.3), and after
10 iterations it had not converged. The relative approximation error of the algorithm
with SVD initialization was 0.574 (cf. the error 0.596 with the k-means algorithm
(9.2)). In some runs, the algorithm with the random initialization converged to a
local, suboptimal minimum.

Figure 9.3. Relative approximation error in the nonnegative matrix
factorization as a function of the iteration number. The upper curve is with random
initialization and the lower with the SVD-based initialization.

The factorization with SVD initialization was
\[
WH = \begin{pmatrix}
0.3450 & 0 \\
0.1986 & 0 \\
0.1986 & 0 \\
0.6039 & 0.1838 \\
0.2928 & 0 \\
0 & 0.5854 \\
1.0000 & 0.0141 \\
0.0653 & 1.0000 \\
0.8919 & 0.0604 \\
0.0653 & 1.0000
\end{pmatrix}
\begin{pmatrix}
0.7740 & 0 & 0.9687 & 0.9120 & 0.5251 \\
0 & 1.0863 & 0.8214 & 0 & 0
\end{pmatrix}.
\]
It is now possible to interpret the decomposition. The first four documents are well
represented by the basis vectors, which have large components for Google-related
keywords. In contrast, the fifth document is represented by the first basis vector
only, but its coordinates are smaller than those of the first four Google-oriented
documents. In this way, the rank-2 approximation accentuates the Google-related
contents, while the “football-document” is de-emphasized. In Chapter 11 we will
see that other low-rank approximations, e.g., those based on SVD, have a similar
effect.
On the other hand, if we compute a rank-3 approximation, then we get
\[
WH = \begin{pmatrix}
0.2516 & 0 & 0.1633 \\
0 & 0 & 0.7942 \\
0 & 0 & 0.7942 \\
0.6924 & 0.1298 & 0 \\
0.3786 & 0 & 0 \\
0 & 0.5806 & 0 \\
1.0000 & 0 & 0.0444 \\
0.0589 & 1.0000 & 0.0007 \\
0.4237 & 0.1809 & 1.0000 \\
0.0589 & 1.0000 & 0.0007
\end{pmatrix}
\begin{pmatrix}
1.1023 & 0 & 1.0244 & 0.8045 & 0 \\
0 & 1.0815 & 0.8314 & 0 & 0 \\
0 & 0 & 0.1600 & 0.3422 & 1.1271
\end{pmatrix}.
\]
We see that now the third vector in W is essentially a “football” basis vector, while
the other two represent the Google-related documents.
Part II
Data Mining Applications
Chapter 10
Classification of Handwritten Digits
Classification by computer of handwritten digits is a standard problem in pattern
recognition. The typical application is automatic reading of zip codes on envelopes.
A comprehensive review of different algorithms is given in [62].
10.1 Handwritten Digits and a Simple Algorithm
In Figure 10.1 we illustrate handwritten digits that we will use in the examples in
this chapter.
Figure 10.1. Handwritten digits from the U.S. Postal Service database;
see, e.g., [47].
We will treat the digits in three different but equivalent formats:
1. As 16 × 16 gray scale images, as in Figure 10.1;
2. As functions of two variables, s = s(x, y), as in Figure 10.2; and
3. As vectors in $\mathbb{R}^{256}$.

Figure 10.2. A digit considered as a function of two variables.

In the classification of an unknown digit we need to compute the distance
to known digits. Different distance measures can be used, and perhaps the most
natural one to use is the Euclidean distance: stack the columns of the image in a
vector and identify each digit as a vector in $\mathbb{R}^{256}$. Then define the distance function
\[ d(x, y) = \| x - y \|_2 . \]
An alternative distance function is the cosine between two vectors.
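The two measures can be compared on a toy pair of vectors. The sketch below uses
three-dimensional vectors in place of digits in $\mathbb{R}^{256}$; the vectors and variable
names are made up for illustration.

```matlab
x = [1; 2; 2];                          % toy vector
y = 2*x;                                % same direction, twice the length

dist_euclid = norm(x-y);                % || x - y ||_2, here equal to 3
cosine = (x'*y)/(norm(x)*norm(y));      % cosine of the angle, here equal to 1
```

The example shows how the measures differ: the Euclidean distance is sensitive to
the overall scaling of a vector, whereas the cosine depends only on its direction.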
In a real application of handwritten digit classification, e.g., zip code reading,
there are hardware and real-time factors that must be taken into account. In this
chapter we describe an idealized setting. The problem is as follows:
Given a set of manually classified digits (the training set), classify a set
of unknown digits (the test set).
In the U.S. Postal Service database, the training set contains 7291 handwritten
digits. Here we will use a subset of 1707 digits, relatively equally distributed between
0 and 9. The test set has 2007 digits.
If we consider the training set digits as vectors or points, then it is reasonable
to assume that all digits of one kind form a cluster of points in a Euclidean 256-
dimensional vector space. Ideally the clusters are well separated (otherwise the task
of classifying unknown digits will be very difficult), and the separation between the
clusters depends on how well written the training digits are.
Figure 10.3. The means (centroids) of all digits in the training set.
In Figure 10.3 we illustrate the means (centroids) of the digits in the training
set. From the figure we get the impression that a majority of the digits are well
written. (If there were many badly written digits, the means would be very diffuse.)
This indicates that the clusters are rather well separated. Therefore, it seems likely
that a simple classification algorithm that computes the distance from each unknown
digit to the means may be reasonably accurate.
A simple classification algorithm
Training: Given the manually classified training set, compute the means (cen-
troids) mi, i = 0, . . . , 9, of all the 10 classes.
Classification: For each digit in the test set, classify it as k if mk is the closest
mean.
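The two phases of this simple algorithm can be sketched as follows. The data are
a made-up toy set with two classes in the plane, standing in for the ten digit
classes in $\mathbb{R}^{256}$; all variable names are illustrative.

```matlab
% Toy stand-in for the digit data: columns are training vectors.
Xtrain = [1.0 1.2  0.9 5.0 5.1 4.8;
          0.0 0.1 -0.1 5.0 4.9 5.2];
ytrain = [0 0 0 1 1 1];                  % manually assigned class labels
classes = unique(ytrain);
nc = numel(classes);

% Training: compute one mean (centroid) per class.
M = zeros(size(Xtrain,1),nc);
for c = 1:nc
    M(:,c) = mean(Xtrain(:,ytrain==classes(c)),2);
end

% Classification: assign each test vector to the closest mean.
Xtest = [1.1 4.7;
         0.2 5.3];
ypred = zeros(1,size(Xtest,2));
for j = 1:size(Xtest,2)
    D = M - repmat(Xtest(:,j),1,nc);     % differences to all means
    [dmin,idx] = min(sqrt(sum(D.^2,1))); % Euclidean distances
    ypred(j) = classes(idx);
end
% ypred is [0 1]
```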
It turns out that for our test set, the success rate of this algorithm is around
75%, which is not good enough. The reason for this relatively bad performance is
that the algorithm does not use any information about the variation within each
class of digits.
10.2 Classification Using SVD Bases
We will now describe a classification algorithm that is based on the modeling of the
variation within each digit class using orthogonal basis vectors computed using the
SVD. This can be seen as a least squares algorithm based on a reduced rank model;
cf. Chapter 7.
If we consider the images as 16 × 16 matrices, then the data are multidimensional;
see Figure 10.4. Stacking the columns of each image on top of one another
gives a vector, and collecting these vectors as columns gives a matrix. Let
$A \in \mathbb{R}^{m \times n}$, with m = 256, be the matrix consisting of all the
training digits of one kind, the 3’s, say. The columns of A span a linear subspace
of $\mathbb{R}^m$. However, this subspace cannot be expected to have a large dimension,
because if it did, then the subspaces of the different kinds of digits would intersect
(remember that we are considering subspaces of $\mathbb{R}^{256}$).
Figure 10.4. The image of one digit is a matrix, and the set of images of
one kind form a tensor. In the lower part of the figure, each digit (of one kind) is
represented by a column in the matrix.

Now the idea is to “model” the variation within the set of training (and test)
digits of one kind using an orthogonal basis of the subspace. An orthogonal basis
can be computed using the SVD, and any matrix A is a sum of rank 1 matrices:
\[ A = \sum_{i=1}^{\min(m,n)} \sigma_i u_i v_i^T = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots . \tag{10.1} \]
Each column in A represents an image of a digit 3, and therefore the left singular
vectors $u_i$ are an orthogonal basis in the “image space of 3’s.” We will refer to the
left singular vectors as “singular images.” From (10.1) the jth column of A is equal
to
\[ a_j = \sum_{i=1}^{\min(m,n)} (\sigma_i v_{ij}) u_i , \]
where $v_{ij}$ denotes the jth component of $v_i$, and we see that the coordinates of
image j in A in terms of this basis are $\sigma_i v_{ij}$.
From the matrix approximation properties of the SVD (Theorems 6.6 and 6.7), we
know that the first singular vector represents the “dominating” direction of the data
matrix. Therefore, if we fold the vectors ui back into images, we expect the first
singular vector to look like a 3, and the following singular images should represent
the dominating variations of the training set around the first singular image. In
Figure 10.5 we illustrate the singular values and the first three singular images for
the training set 3’s. In the middle graph we plot the coordinates of each of the
131 digits in terms of the first three singular vectors. We see that all the digits have
a large portion (between 0.05 and 0.1) of the first singular image, which, in fact,
looks very much like the mean of 3’s in Figure 10.3. We then see that there is a
rather large variation in the coordinates in terms of the second and third singular
images.
The SVD basis classification algorithm will be based on the following assump-
tions:
1. Each digit (in the training set and the test set) is well characterized by a
few of the first singular images of its own kind. The more precise meaning of
“few” should be investigated in experiments.
2. An expansion in terms of the first few singular images discriminates well
between the different classes of digits.
3. If an unknown digit can be better approximated in one particular basis of
singular images, the basis of 3’s say, than in the bases of the other classes,
then it is likely that the unknown digit is a 3.

Figure 10.5. Singular values (top), coordinates of the 131 test digits in
terms of the first three right singular vectors $v_i$ (middle), and the first three singular
images (bottom).
Thus we should compute how well an unknown digit can be represented in
the 10 different bases. This can be done by computing the residual vector in least
squares problems of the type
\[ \min_{\alpha_i} \Bigl\| z - \sum_{i=1}^{k} \alpha_i u_i \Bigr\| , \]
where z represents an unknown digit and $u_i$ represents the singular images. We can
write this problem in the form
\[ \min_{\alpha} \| z - U_k \alpha \|_2 , \]
where $U_k = ( u_1 \;\; u_2 \;\; \cdots \;\; u_k )$. Since the columns of $U_k$ are orthonormal, the
solution of this problem is given by $\alpha = U_k^T z$, and the norm of the residual vector of
the least squares problems is
\[ \| (I - U_k U_k^T) z \|_2 , \tag{10.2} \]
i.e., the norm of the projection of the unknown digit onto the subspace orthogonal
to span($U_k$).

Figure 10.6. Relative residuals of all test 3’s (top) and 7’s (bottom) in
terms of all bases. Ten basis vectors were used for each class.
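The residual (10.2) is cheap to evaluate once the singular images are available.
The following sketch uses a small made-up matrix in place of the 256 × n matrix
of training digits of one kind; the data and names are illustrative only.

```matlab
% Toy "class" matrix: columns play the role of training digits of one kind.
A = [1 1 0; 1 0 1; 0 1 1; 1 1 1];
[U,S,V] = svd(A,0);              % thin SVD
k = 2;
Uk = U(:,1:k);                   % the first k singular "images"

z = [1; 2; 3; 4];                % an "unknown digit"
alpha = Uk'*z;                   % least squares solution (orthonormal columns)
r = z - Uk*alpha;                % residual vector (I - Uk*Uk')*z
relres = norm(r)/norm(z);        % relative residual
```

Note that the m × m matrix $I - U_k U_k^T$ is never formed; two matrix-vector
products with $U_k$ suffice.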
To demonstrate that the assumptions above are reasonable, we illustrate in
Figure 10.6 the relative residual norm for all test 3’s and 7’s in terms of all 10 bases.
Figure 10.7. Unknown digit (nice 3) and approximations using 1, 3, 5, 7,
and 9 terms in the 3-basis (top). Relative residual $\| (I - U_k U_k^T) z \|_2 / \| z \|_2$ in least
squares problem (bottom).
In the two figures, there is one curve for each unknown digit, and naturally it is
not possible to see the individual curves. However, one can see that most of the
test 3’s and 7’s are best approximated in terms of their own basis. The graphs also
give information about which classification errors are more likely than others. (For
example, 3’s and 5’s are similar, whereas 3’s and 4’s are quite different; of course
this only confirms what we already know.)
It is also interesting to see how the residual depends on the number of terms
in the basis. In Figure 10.7 we illustrate the approximation of a nicely written 3 in
terms of the 3-basis with different numbers of basis images. In Figures 10.8 and 10.9
we show the approximation of an ugly 3 in the 3-basis and a nice 3 in the 5-basis.
From Figures 10.7 and 10.9 we see that the relative residual is considerably
smaller for the nice 3 in the 3-basis than in the 5-basis. We also see from Figure 10.8
that the ugly 3 is not well represented in terms of the 3-basis. Therefore, naturally,
if the digits are very badly drawn, then we cannot expect to get a clear classification
based on the SVD bases.
It is possible to devise several classification algorithms based on the model of
expanding in terms of SVD bases. Below we give a simple variant.
Figure 10.8. Unknown digit (ugly 3) and approximations using 1, 3, 5,
7, and 9 terms in the 3-basis (top). Relative residual in least squares problem
(bottom).
An SVD basis classification algorithm
Training: For the training set of known digits, compute the SVD of each set of
digits of one kind.
Classification: For a given test digit, compute its relative residual in all 10 bases.
If one residual is significantly smaller than all the others, classify as that.
Otherwise give up.
The work in this algorithm can be summarized as follows:
Training: Compute SVDs of 10 matrices of dimension $m^2 \times n_i$.
Each digit is an m × m digitized image.
$n_i$: the number of training digits i.
Test: Compute 10 least squares residuals (10.2).
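The training and test phases can be sketched as below. The sketch uses two
made-up classes in $\mathbb{R}^4$ instead of ten digit classes in $\mathbb{R}^{256}$, and the threshold
margin that decides whether one residual is “significantly smaller” is an
assumption made here; the text leaves that choice open.

```matlab
% Toy data: two classes in R^4; columns are training vectors.
train{1} = [1 0.9 1.1; 0.1 1 0.5; 0 0.05 0; 0 0 0.05];   % class labelled 0
train{2} = [0 0.05 0; 0.05 0 0; 1 0.2 0.8; 0.3 1 0.9];   % class labelled 1
labels = [0 1];
k = 2;                           % number of basis vectors per class

% Training: the k first left singular vectors of each class matrix.
for c = 1:numel(train)
    [U,S,V] = svd(train{c},0);
    basis{c} = U(:,1:k);
end

% Classification of a test vector z by its relative residuals (10.2).
z = [0.05; 0; 0.7; 0.6];
res = zeros(1,numel(basis));
for c = 1:numel(basis)
    Uk = basis{c};
    res(c) = norm(z - Uk*(Uk'*z))/norm(z);
end
[rs,order] = sort(res);
margin = 0.1;                    % assumed threshold for "significantly smaller"
if rs(2) - rs(1) > margin
    predicted = labels(order(1));
else
    predicted = NaN;             % give up
end
```

Here z lies close to the subspace of the second class, so its residual in that
basis is much the smaller one and z is classified as 1.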
Figure 10.9. Unknown digit (nice 3) and approximations using 1, 3, 5,
7, and 9 terms in the 5-basis (top). Relative residual in least squares problem
(bottom).
Table 10.1. Correct classifications as a function of the number of basis
images (for each class).

    # basis images     1     2     4     6      8     10
    correct (%)       80    86    90    90.5   92     93
Thus the test phase is quite fast, and this algorithm should be suitable for
real-time computations. The algorithm is related to the SIMCA method [89].
We next give some test results (from [82]) for the U.S. Postal Service database,
here with 7291 training digits and 2007 test digits [47]. In Table 10.1 we give
classification results as a function of the number of basis images for each class.
Although there is a very significant improvement in performance compared to
the method in which one used only the centroid images, the results are not good
enough, as the best algorithms reach about 97% correct classifications. The training
and test sets contain some digits that are very difficult to classify; we give a few examples
in Figure 10.10. Such badly written digits are very difficult to handle automatically.
Figure 10.10. Ugly digits in the U.S. Postal Service database.
Figure 10.11. A digit (left) and acceptable transformations (right).
Columnwise from left to right the digit has been (1) written with a thinner and
a thicker pen, (2) stretched diagonally, (3) compressed and elongated vertically, and
(4) rotated.
10.3 Tangent Distance
A good classification algorithm should be able to classify unknown digits that are
rather well written but still deviate considerably in Euclidean distance from the
ideal digit. There are some deviations that humans can easily handle and which are
quite common and acceptable. We illustrate a few such variations20 in Figure 10.11.
Such transformations constitute no difficulties for a human reader, and ideally they
should be very easy to deal with in automatic digit recognition. A distance measure,
tangent distance, that is invariant under small such transformations is described in
[86, 87].
16 × 16 images can be interpreted as points in $\mathbb{R}^{256}$. Let p be a fixed pattern
in an image. We shall first consider the case of only one allowed transformation,
translation of the pattern (digit) in the x-direction, say. This translation can be
thought of as moving the pattern along a curve in $\mathbb{R}^{256}$. Let the curve be
parameterized by a real parameter α so that the curve is given by s(p, α) and in such a
way that s(p, 0) = p. In general, the curve is nonlinear and can be approximated
by the first two terms in the Taylor expansion,
\[ s(p, \alpha) = s(p, 0) + \frac{ds}{d\alpha}(p, 0)\,\alpha + O(\alpha^2) \approx p + t_p \alpha , \]
where $t_p = \frac{ds}{d\alpha}(p, 0)$ is a vector in $\mathbb{R}^{256}$. By varying α slightly around 0, we make
a small movement of the pattern along the tangent at the point p on the curve.
Assume that we have another pattern e that is approximated similarly:
\[ s(e, \alpha) \approx e + t_e \alpha . \]
20Note that the transformed digits have been written not manually but by using the techniques
described later in this section. The presentation in this section is based on the papers [86, 87, 82].
Figure 10.12. The distance between the points p and e, and the tangent distance.
(The figure shows the curves s(p, α) and s(e, α), the distance between p and e,
the distance between s(p, α) and s(e, α), and the tangent distance.)
Since we consider small movements along the curves as allowed, such small move-
ments should not influence the distance function. Therefore, ideally we would like
to define our measure of closeness between p and e as the closest distance between
the two curves; see Figure 10.12 (cf. [87]).
However, since in general we cannot compute the distance between the curves,
we can use the first order approximations and compute the closest distance between
the two tangents in the points p and e. Thus we shall move the patterns
independently along their respective tangents, until we find the smallest distance. If we
measure this distance in the usual Euclidean norm, we solve the least squares
problem,
\[
\min_{\alpha_p, \alpha_e} \| p + t_p \alpha_p - e - t_e \alpha_e \|_2
= \min_{\alpha_p, \alpha_e} \left\| (p - e) - \begin{pmatrix} -t_p & t_e \end{pmatrix}
\begin{pmatrix} \alpha_p \\ \alpha_e \end{pmatrix} \right\|_2 .
\]
Now consider the case when we are allowed to move the pattern p along l
different curves in $\mathbb{R}^{256}$, parameterized by $\alpha = (\alpha_1 \; \cdots \; \alpha_l)^T$. This is equivalent to
moving the pattern on an l-dimensional surface (manifold) in $\mathbb{R}^{256}$. Assume that
we have two patterns, p and e, each of which is allowed to move on its surface of
allowed transformations. Ideally we would like to find the closest distance between
the surfaces, but instead, since that is impossible to compute, we now define a
distance measure, where we compute the distance between the two tangent planes
of the surfaces in the points p and e.
As before, the tangent plane is given by the first two terms in the Taylor
expansion of the function s(p, α):
\[
s(p, \alpha) = s(p, 0) + \sum_{i=1}^{l} \frac{ds}{d\alpha_i}(p, 0)\,\alpha_i + O(\| \alpha \|_2^2) \approx p + T_p \alpha ,
\]
where $T_p$ is the matrix
\[
T_p = \begin{pmatrix} \dfrac{ds}{d\alpha_1} & \dfrac{ds}{d\alpha_2} & \cdots & \dfrac{ds}{d\alpha_l} \end{pmatrix},
\]
and the derivatives are all evaluated in the point (p, 0).
Thus the tangent distance between the points p and e is defined as the smallest
possible residual in the least squares problem,
\[
\min_{\alpha_p, \alpha_e} \| p + T_p \alpha_p - e - T_e \alpha_e \|_2
= \min_{\alpha_p, \alpha_e} \left\| (p - e) - \begin{pmatrix} -T_p & T_e \end{pmatrix}
\begin{pmatrix} \alpha_p \\ \alpha_e \end{pmatrix} \right\|_2 .
\]
The least squares problem can be solved, e.g., using the SVD of $A = \begin{pmatrix} -T_p & T_e \end{pmatrix}$.
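Note that we are interested not in the solution itself but only in the norm of the
residual. Write the least squares problem in the form
\[
\min_{\alpha} \| b - A\alpha \|_2 , \qquad b = p - e, \quad
\alpha = \begin{pmatrix} \alpha_p \\ \alpha_e \end{pmatrix}.
\]
If we use the QR decomposition²¹
\[
A = Q \begin{pmatrix} R \\ 0 \end{pmatrix}
= (Q_1 \;\; Q_2) \begin{pmatrix} R \\ 0 \end{pmatrix} = Q_1 R ,
\]
the norm of the residual is given by
\[
\min_{\alpha} \| b - A\alpha \|_2^2
= \min_{\alpha} \left\| \begin{pmatrix} Q_1^T b - R\alpha \\ Q_2^T b \end{pmatrix} \right\|_2^2
= \min_{\alpha} \left[ \| Q_1^T b - R\alpha \|_2^2 + \| Q_2^T b \|_2^2 \right]
= \| Q_2^T b \|_2^2 .
\]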
The case when the matrix A should happen to not have full column rank is easily
dealt with using the SVD; see Section 6.7. The probability is high that the columns
of the tangent matrix are almost linearly dependent when the two patterns are close.
The most important property of this distance function is that it is invariant
under movements of the patterns on the tangent planes. For instance, if we make
a small translation in the x-direction of a pattern, then with this measure, the
distance it has been moved is equal to zero.
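The QR-based formula and the invariance property can be checked on a toy
example. In the sketch below the patterns live in $\mathbb{R}^4$ and each has a single tangent
direction (l = 1); the vectors are made up, and e is constructed by moving p
along its own tangent, so that the tangent distance should vanish.

```matlab
% Toy patterns in R^4 with one tangent direction each.
p  = [1; 2; 0; 1];
Tp = [1; 0; 1; 0];             % assumed tangent matrix at p
e  = p + 0.3*Tp;               % e is p moved along its tangent
Te = [0; 1; 0; 1];             % assumed tangent matrix at e

A = [-Tp, Te];
b = p - e;
[Q,R] = qr(A);                 % full QR decomposition, Q = (Q1 Q2)
n = size(A,2);
Q2 = Q(:,n+1:end);

td = norm(Q2'*b);              % tangent distance ||Q2'*b||_2, zero up to rounding
d_euclid = norm(b);            % Euclidean distance, 0.3*sqrt(2)
```

The Euclidean distance between p and e is clearly nonzero, while the tangent
distance is zero to rounding error. This sketch assumes that A has full column
rank; otherwise the SVD should be used, as noted above.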
10.3.1 Transformations
Here we consider the image pattern as a function of two variables, p = p(x, y), and
we demonstrate that the derivative of each transformation can be expressed as a
differentiation operator that is a linear combination of the derivatives
$p_x = \frac{dp}{dx}$ and $p_y = \frac{dp}{dy}$.
21A has dimension 256 × 2l; since the number of transformations is usually less than 10, the
linear system is overdetermined.
Figure 10.13. A pattern, its x-derivative, and x-translations of the pattern.
Translation. The simplest transformation is the one where the pattern is
translated by $\alpha_x$ in the x-direction, i.e.,
\[ s(p, \alpha_x)(x, y) = p(x + \alpha_x, y). \]
Obviously, using the chain rule,
\[
\frac{d}{d\alpha_x} \bigl( s(p, \alpha_x)(x, y) \bigr) \Big|_{\alpha_x = 0}
= \frac{d}{d\alpha_x}\, p(x + \alpha_x, y) \Big|_{\alpha_x = 0} = p_x(x, y).
\]
In Figure 10.13 we give a pattern and its x-derivative. Then we demonstrate that
by adding a small multiple of the derivative, the pattern can be translated to the
left and to the right.
Analogously, for y-translation we get
\[ \frac{d}{d\alpha_y} \bigl( s(p, \alpha_y)(x, y) \bigr) \Big|_{\alpha_y = 0} = p_y(x, y). \]
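That adding a small multiple of the derivative approximately translates the
pattern can be seen in one dimension. The sketch below uses a Gaussian bump as a
made-up pattern, with its exact derivative, and compares the exactly translated
pattern with the first-order approximation.

```matlab
x = linspace(-3,3,61);               % sample points
p  = exp(-x.^2);                     % a smooth one-dimensional "pattern"
px = -2*x.*exp(-x.^2);               % its exact x-derivative

alpha = 0.1;                         % small translation
shifted = exp(-(x+alpha).^2);        % exactly translated pattern p(x+alpha)
approx  = p + alpha*px;              % first-order approximation p + alpha*p_x

err_first = max(abs(shifted-approx));   % error of order alpha^2
err_zero  = max(abs(shifted-p));        % error of order alpha without the term
```

For this choice of α the first-order error is bounded by about 0.01 and is well
below the error made when the derivative term is omitted.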
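Rotation. A rotation of the pattern by an angle $\alpha_r$ is made by replacing the value
of p in the point (x, y) with the value in the point
\[
\begin{pmatrix} \cos\alpha_r & \sin\alpha_r \\ -\sin\alpha_r & \cos\alpha_r \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}.
\]
Thus we define the function
\[ s(p, \alpha_r)(x, y) = p(x \cos\alpha_r + y \sin\alpha_r,\; -x \sin\alpha_r + y \cos\alpha_r), \]
and we get the derivative
\[
\frac{d}{d\alpha_r} \bigl( s(p, \alpha_r)(x, y) \bigr)
= (-x \sin\alpha_r + y \cos\alpha_r)\, p_x + (-x \cos\alpha_r - y \sin\alpha_r)\, p_y .
\]
Setting $\alpha_r = 0$, we have
\[ \frac{d}{d\alpha_r} \bigl( s(p, \alpha_r)(x, y) \bigr) \Big|_{\alpha_r = 0} = y p_x - x p_y , \]
where the derivatives are evaluated at (x, y).
An example of a rotation transformation is given in Figure 10.14.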
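The formula for the rotational derivative can be checked numerically. The sketch
below uses a made-up smooth pattern with known partial derivatives and compares
a central difference in the angle with $y p_x - x p_y$ at one point; the function,
the point, and the step h are arbitrary choices.

```matlab
p  = @(x,y) exp(-(x-1).^2 - 2*y.^2);       % smooth test pattern (assumed)
px = @(x,y) -2*(x-1).*p(x,y);              % exact dp/dx
py = @(x,y) -4*y.*p(x,y);                  % exact dp/dy

% Rotated pattern s(p,a)(x,y) as defined in the text.
s = @(a,x,y) p(x*cos(a)+y*sin(a), -x*sin(a)+y*cos(a));

x0 = 0.7; y0 = -0.4; h = 1e-5;
fd = (s(h,x0,y0) - s(-h,x0,y0))/(2*h);     % central difference in the angle
formula = y0*px(x0,y0) - x0*py(x0,y0);     % y*p_x - x*p_y
```

The two values agree up to the discretization and rounding error of the
difference quotient. The other transformations in this section can be checked in
the same way by changing the definition of s.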
Figure 10.14. A pattern, its rotational derivative, and a rotation of the pattern.
Figure 10.15. A pattern, its scaling derivative, and an “up-scaling” of the
pattern.
Scaling. A scaling of the pattern is achieved by defining
\[ s(p, \alpha_s)(x, y) = p((1 + \alpha_s)x,\; (1 + \alpha_s)y), \]
and we get the derivative
\[ \frac{d}{d\alpha_s} \bigl( s(p, \alpha_s)(x, y) \bigr) \Big|_{\alpha_s = 0} = x p_x + y p_y . \]
The scaling transformation is illustrated in Figure 10.15.
Parallel Hyperbolic Transformation. By defining
\[ s(p, \alpha_p)(x, y) = p((1 + \alpha_p)x,\; (1 - \alpha_p)y), \]
we can stretch the pattern parallel to the axis. The derivative is
\[ \frac{d}{d\alpha_p} \bigl( s(p, \alpha_p)(x, y) \bigr) \Big|_{\alpha_p = 0} = x p_x - y p_y . \]
In Figure 10.16 we illustrate the parallel hyperbolic transformation.
Diagonal Hyperbolic Transformation. By defining
\[ s(p, \alpha_h)(x, y) = p(x + \alpha_h y,\; y + \alpha_h x), \]
we can stretch the pattern along diagonals. The derivative is
\[ \frac{d}{d\alpha_h} \bigl( s(p, \alpha_h)(x, y) \bigr) \Big|_{\alpha_h = 0} = y p_x + x p_y . \]
In Figure 10.17 we illustrate the diagonal hyperbolic transformation.
Figure 10.16. A pattern, its parallel hyperbolic derivative, and two
stretched patterns.
Figure 10.17. A pattern, its diagonal hyperbolic derivative, and two
stretched patterns.
Figure 10.18. A pattern, its thickening derivative, a thinner pattern, and
a thicker pattern.
Thickening. The pattern can be made thinner or thicker using similar techniques;
for details, see [87]. The “thickening” derivative is
$$(p_x)^2 + (p_y)^2.$$
Thickening and thinning are illustrated in Figure 10.18.
A tangent distance classification algorithm
Training: For each digit in the training set, compute its tangent matrix Tp.
Classification: For each test digit,
• compute its tangent matrix;
• compute the tangent distance to all training digits and classify the test
digit as the closest training digit.
Although this algorithm is quite good in terms of classification performance
(96.9% correct classification for the U.S. Postal Service database [82]), it is very
expensive, since each test digit is compared to all the training digits. To be com-
petitive, it must be combined with some other algorithm that reduces the number
of tangent distance comparisons.
We end this chapter by remarking that it is necessary to preprocess the digits
in different ways in order to enhance the classification; see [62]. For instance,
performance is improved if the images are smoothed (convolved with a Gaussian
kernel) [87]. In [82] the derivatives px and py are computed numerically by finite
differences.
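As an illustration of computing the derivatives by finite differences, the following minimal sketch approximates px and py on a pixel grid and forms the rotation derivative y px − x py. It is not taken from [82]; the grid conventions (unit spacing, rows as y, columns as x, origin at the image center, one-sided differences at the border) and the function names are assumptions made for the example.

```python
def finite_differences(p):
    """Approximate p_x and p_y for an image p given as a list of rows.

    Assumptions: unit grid spacing, p[i][j] is the value at y = i, x = j,
    central differences in the interior and one-sided differences at the
    border. The image needs at least two rows and two columns.
    """
    rows, cols = len(p), len(p[0])
    px = [[0.0] * cols for _ in range(rows)]
    py = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            left, right = max(j - 1, 0), min(j + 1, cols - 1)
            up, down = max(i - 1, 0), min(i + 1, rows - 1)
            px[i][j] = (p[i][right] - p[i][left]) / (right - left)
            py[i][j] = (p[down][j] - p[up][j]) / (down - up)
    return px, py


def rotation_derivative(p):
    """Evaluate y*p_x - x*p_y with (x, y) measured from the image center."""
    px, py = finite_differences(p)
    rows, cols = len(p), len(p[0])
    cx, cy = (cols - 1) / 2, (rows - 1) / 2
    return [[(i - cy) * px[i][j] - (j - cx) * py[i][j] for j in range(cols)]
            for i in range(rows)]


# A ramp p(x, y) = x has p_x = 1 and p_y = 0, so the rotation derivative is y.
ramp = [[float(j) for j in range(4)] for _ in range(3)]
px, py = finite_differences(ramp)
rot = rotation_derivative(ramp)
```

The other tangent directions (scaling, hyperbolic, thickening) follow by combining px and py with x and y in the same way.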
Chapter 11
Text Mining
By text mining we mean methods for extracting useful information from large and
often unstructured collections of texts. Another, closely related term is information
retrieval. A typical application is searching databases of abstracts of scientific pa-
pers. For instance, in a medical application one may want to find all the abstracts
in the database that deal with a particular syndrome. So one puts together a search
phrase, a query, with keywords that are relevant to the syndrome. Then the
retrieval system matches the query to the documents in the database and
presents to the user the documents that are relevant, ranked according to relevance.
Example 11.1. The following is a typical query for search in a collection of medical
abstracts:
9. the use of induced hypothermia in heart surgery, neurosurgery, head
injuries, and infectious diseases.
The query is taken from a test collection for information retrieval, called Medline.22
We will refer to this query as Q9.
Library catalogues are another example of text mining applications.
Example 11.2. To illustrate one issue in information retrieval, we performed a
search in the Linköping University library journal catalogue:
Search phrases                   Results
computer science engineering     Nothing found
computing science engineering    IEEE: Computing in Science and Engineering
Naturally we would like the system to be insensitive to small errors on the
part of the user. Anyone can see that the IEEE journal is close to the query. From
this example we conclude that in many cases straightforward word matching is not
good enough.
22 See, e.g., http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/.
A very well known area of text mining is Web search engines, where the search
phrase is usually very short, and often there are so many relevant documents that
it is out of the question to present them all to the user. In that application the
ranking of the search result is critical for the efficiency of the search engine. We
will come back to this problem in Chapter 12.
For an overview of information retrieval, see, e.g., [43]. In this chapter we
briefly describe one of the most common methods for text mining, the vector
space model [81], and some variants: latent semantic indexing (LSI), which uses
the SVD of the term-document matrix; a clustering-based method; nonnegative
matrix factorization; and LGK bidiagonalization. For a more detailed account of
the different techniques used in connection with the vector space model, see [12].
11.1 Preprocessing the Documents and Queries
In this section we discuss the preprocessing that is done to the texts before the
vector space model of a particular collection of documents is set up.
In information retrieval, keywords that carry information about the contents
of a document are called terms. A basic step in information retrieval is to create a
list of all the terms in a document collection, a so-called index. For each term, a
list is stored of all the documents that contain that particular term. This is called
an inverted index.
But before the index is made, two preprocessing steps must be done: elimina-
tion of all stop words and stemming.
Stop words are words that one can find in virtually any document. Therefore,
the occurrence of such a word in a document does not distinguish this document
from other documents. The following is the beginning of one particular stop list:23
a, a’s, able, about, above, according, accordingly, across, actually, after,
afterwards, again, against, ain’t, all, allow, allows, almost, alone, along,
already, also, although, always, am, among, amongst, an, and, another,
any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere,
apart, appear, appreciate, appropriate, are, aren’t, around, as, aside,
ask, . . . .
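The two ideas introduced so far, an inverted index and the removal of stop words, can be sketched in a few lines. The stop list below is a small hypothetical subset chosen for the example (not the SMART list quoted above), and the two documents and the tokenization rule are likewise assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list (an assumption, not the list cited in the text).
STOP_WORDS = {"a", "all", "an", "and", "is", "of", "the", "to"}


def build_inverted_index(documents):
    """Map each non-stop term to the sorted list of documents containing it.

    Documents are numbered from 1; words are lowercased and split on
    anything that is not a letter or an apostrophe.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(documents, start=1):
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = [
    "The matrix is a model of the Internet",
    "A matrix is used to rank all pages",
]
index = build_inverted_index(docs)
```

Here `index["matrix"]` lists both documents, while stop words such as "the" do not appear as keys at all.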
Stemming is the process of reducing each word that is conjugated or has a
suffix to its stem. Clearly, from the point of view of information retrieval, no
information is lost in the following reduction:
$$\left.\begin{array}{l} \text{computable} \\ \text{computation} \\ \text{computing} \\ \text{computed} \\ \text{computational} \end{array}\right\} \longrightarrow \text{comput}$$
23 ftp://ftp.cs.cornell.edu/pub/smart/english.stop.
Public domain stemming algorithms are available on the Internet.24
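To make the idea of stemming concrete, here is a deliberately naive suffix stripper. It is not the Porter stemmer and should not be used as one; the suffix list and the minimum stem length are assumptions chosen only so that the five words above reduce to the same stem.

```python
# Suffixes are tried in this order, so longer ones must come first.
SUFFIXES = ["ational", "ation", "able", "ing", "ed"]


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["computable", "computation", "computing", "computed", "computational"]
stems = {strip_suffix(w) for w in words}
```

A real stemmer applies many more rules and exceptions; the sketch only shows why several index entries collapse into one.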
Table 11.1. The beginning of the index for the Medline collection. The
Porter stemmer and the GTP parser [38] were used.
without stemming    with stemming
action              action
actions
activation          activ
active
actively
activities
activity
acts
actual              actual
actually
acuity              acuiti
acute               acut
ad                  ad
adaptation          adapt
adaptations
adaptive
add                 add
added
addition            addit
additional
Example 11.3. We parsed the 1063 documents (actually 30 queries and 1033
documents) in the Medline collection, with and without stemming, in both cases
removing stop words. For consistency it is necessary to apply the same stemming
to the stop list. Without stemming the number of terms was 5839, and with
stemming it was 4281. We show partial lists of terms in Table 11.1.
11.2 The Vector Space Model
The main idea in the vector space model is to create a term-document matrix,
where each document is represented by a column vector. The column has nonzero
entries in the positions that correspond to terms that can be found in the document.
Consequently, each row represents a term and has nonzero entries in those positions
that correspond to the documents where the term can be found; cf. the inverted
index in Section 11.1.
24 http://www.comp.lancs.ac.uk/computing/research/stemming/ and
http://www.tartarus.org/~martin/PorterStemmer/.
A simplified example of a term-document matrix is given in Chapter 1. There
we manually counted the frequency of the terms. For realistic problems one uses
a text parser to create the term-document matrix. Two public-domain parsers are
described in [38, 113]. Unless otherwise stated, we have used the one from [113] for
the larger examples in this chapter. Text parsers for information retrieval usually
include both a stemmer and an option to remove stop words. In addition there are
filters for removing formatting code in the documents, such as HTML or XML.
It is common not only to count the occurrence of terms in documents but also
to apply a term weighting scheme, where the elements of A are weighted depending
on the characteristics of the document collection. Similarly, document weighting
is usually done. A number of schemes are described in [12, Section 3.2.1]. For
example, one can define the elements in A by
$$a_{ij} = f_{ij} \log(n/n_i), \tag{11.1}$$
where fij is term frequency, the number of times term i appears in document j,
n is the total number of documents, and ni is the number of documents that contain
term i (the factor log(n/ni) is the inverse document frequency).
If a term occurs frequently in only a few documents, then both factors are large.
In this case the term discriminates well between different groups of documents, and
the log-factor in (11.1) gives it a large weight in the documents where it appears.
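The weighting (11.1) can be sketched as follows. The natural logarithm is used here, which is an assumption since the base is not fixed in the text, and the small count matrix is hypothetical.

```python
import math


def tfidf(counts):
    """Return a_ij = f_ij * log(n / n_i) for a term-by-document count matrix.

    counts[i][j] is f_ij; n is the number of documents (columns) and n_i the
    number of documents containing term i. Terms that occur nowhere get
    weight zero.
    """
    n = len(counts[0])
    weighted = []
    for row in counts:
        n_i = sum(1 for f in row if f > 0)
        idf = math.log(n / n_i) if n_i else 0.0
        weighted.append([f * idf for f in row])
    return weighted


counts = [
    [3, 0, 0, 0],  # frequent in a single document: large weight
    [1, 1, 1, 1],  # present everywhere: log(4/4) = 0, so weight zero
    [0, 2, 1, 0],
]
A = tfidf(counts)
```

The second row shows the effect of the log-factor: a term that appears in every document receives weight zero and no longer influences the matching.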
Normally, the term-document matrix is sparse: most of the matrix elements
are equal to zero. Then, of course, one avoids storing all the zeros and uses instead
a sparse matrix storage scheme; see Section 15.7.
Example 11.4. For the stemmed Medline collection, parsed using GTP [38], the
matrix (including 30 query columns) is 4163 × 1063 with 48263 nonzero elements,
i.e., approximately 1%. The first 500 rows and columns of the matrix are illustrated
in Figure 11.1.
11.2.1 Query Matching and Performance Modeling
Query matching is the process of finding the documents that are relevant to a
particular query q. This is often done using the cosine distance measure: a document
aj is deemed relevant if the angle between the query q and aj is small enough.
Equivalently, aj is retrieved if
$$\cos(\theta(q, a_j)) = \frac{q^T a_j}{\|q\|_2 \, \|a_j\|_2} > \mathrm{tol},$$
where tol is a predefined tolerance. If the tolerance is lowered, then more documents
are returned, and it is likely that many of those are relevant to the query. But at
the same time there is a risk that when the tolerance is lowered, more and more
documents that are not relevant are also returned.
Figure 11.1. The first 500 rows and columns of the Medline matrix. Each
dot represents a nonzero element.
Example 11.5. We did query matching for query Q9 in the stemmed Medline col-
lection. With tol = 0.19 for the cosine measure, only document 409 was considered
relevant. When the tolerance was lowered to 0.17, documents 415 and 467 also were
retrieved.
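Query matching with a cosine tolerance can be sketched as below. The 3 × 3 matrix and the query are hypothetical and much smaller than the Medline data; the function names are ours.

```python
import math


def cosines(A, q):
    """Cosine of the angle between q and each column a_j of A (list of rows)."""
    m, n = len(A), len(A[0])
    q_norm = math.sqrt(sum(x * x for x in q))
    result = []
    for j in range(n):
        col = [A[i][j] for i in range(m)]
        dot = sum(q[i] * col[i] for i in range(m))
        norm = math.sqrt(sum(x * x for x in col))
        result.append(dot / (q_norm * norm) if q_norm and norm else 0.0)
    return result


def retrieve(A, q, tol):
    """Document numbers (from 1) whose cosine with the query exceeds tol."""
    return [j + 1 for j, c in enumerate(cosines(A, q)) if c > tol]


A = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
q = [1, 1, 0]
high = retrieve(A, q, 0.6)  # strict tolerance
low = retrieve(A, q, 0.4)   # relaxed tolerance
```

With the strict tolerance only the first document is returned; lowering the tolerance to 0.4 returns all three, mirroring the behavior described in Example 11.5.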
We illustrate the different categories of documents in a query matching for two
values of the tolerance in Figure 11.2. The query matching produces a good result
when the intersection between the two sets of returned and relevant documents is
as large as possible and the number of returned irrelevant documents is small. For
a high value of the tolerance, the retrieved documents are likely to be relevant (the
small circle in Figure 11.2). When the cosine tolerance is lowered, the intersection
is increased, but at the same time, more irrelevant documents are returned.
Figure 11.2. Retrieved and relevant documents for two values of the tolerance.
The dashed circle represents the retrieved documents for a high value of the
cosine tolerance.
In performance modeling for information retrieval we define the following measures:
$$\text{Precision:} \quad P = \frac{D_r}{D_t}, \tag{11.2}$$
where Dr is the number of relevant documents retrieved and Dt the total number
of documents retrieved, and
$$\text{Recall:} \quad R = \frac{D_r}{N_r}, \tag{11.3}$$
where Nr is the total number of relevant documents in the database. With a large
value of tol for the cosine measure, we expect to have high precision but low recall.
For a small value of tol, we will have high recall but low precision.
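The trade-off between the two measures can be seen in a small sketch that evaluates (11.2) and (11.3) for a few tolerances. The cosine values and the set of relevant documents are hypothetical, and returning 1.0 for an empty set is a convention we assume for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (P, R) as defined in (11.2) and (11.3)."""
    retrieved, relevant = set(retrieved), set(relevant)
    d_r = len(retrieved & relevant)  # relevant documents retrieved
    precision = d_r / len(retrieved) if retrieved else 1.0  # assumed convention
    recall = d_r / len(relevant) if relevant else 1.0       # assumed convention
    return precision, recall


# Hypothetical cosines for five documents, three of which are relevant.
cos = {1: 0.9, 2: 0.7, 3: 0.4, 4: 0.3, 5: 0.1}
relevant = {1, 3, 4}

curve = []
for tol in (0.8, 0.35, 0.0):
    retrieved = [doc for doc, c in cos.items() if c > tol]
    curve.append(precision_recall(retrieved, relevant))
```

For the three tolerances the pairs (P, R) are (1, 1/3), (2/3, 2/3), and (3/5, 1): precision falls and recall rises as the tolerance is lowered.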
In the evaluation of different methods and models for information retrieval,
usually a number of queries are used. For testing purposes, all documents have
been read by a human, and those that are relevant for a certain query are marked.
This makes it possible to draw diagrams of recall versus precision that illustrate the
performance of a certain method for information retrieval.
Example 11.6. We did query matching for query Q9 in the Medline collection
(stemmed) using the cosine measure. For a specific value of the tolerance, we
computed the corresponding recall and precision from (11.2) and (11.3). By varying
the tolerance from close to 1 down to zero, we obtained vectors of recall and precision
that gave information about the quality of the retrieval method for this query. In the
comparison of different methods it is illustrative to draw the recall versus precision
diagram as in Figure 11.3. Ideally a method has high recall at the same time as the
precision is high. Thus, the closer the curve is to the upper right corner, the higher
the retrieval quality.
In this example and the following examples, the matrix elements were com-
puted using term frequency and inverse document frequency weighting (11.1).
Figure 11.3. Query matching for Q9 using the vector space method. Recall
versus precision.
11.3 Latent Semantic Indexing
Latent semantic indexing25 (LSI) [28, 9] “is based on the assumption that there is
some underlying latent semantic structure in the data . . . that is corrupted by the
wide variety of words used” [76] and that this semantic structure can be discovered
and enhanced by projecting the data (the term-document matrix and the queries)
onto a lower-dimensional space using the SVD.
Let A = UΣV T be the SVD of the term-document matrix and approximate
it by a matrix of rank k:
$$A \approx A_k = U_k \Sigma_k V_k^T =: U_k H_k.$$
The columns of Uk live in the document space and are an orthogonal basis that
we use to approximate the documents. Write Hk in terms of its column vectors,
Hk = (h1, h2, . . . , hn). From A ≈ UkHk we have aj ≈ Ukhj, which means that
column j of Hk holds the coordinates of document j in terms of the orthogonal
basis. With this rank-k approximation the term-document matrix is represented
by Ak = UkHk and in query matching we compute
$$q^T A_k = q^T U_k H_k = (U_k^T q)^T H_k.$$
Thus, we compute the coordinates of the query in terms of the new document basis
and compute the cosines from
$$\cos\theta_j = \frac{q_k^T h_j}{\|q_k\|_2 \, \|h_j\|_2}, \qquad q_k = U_k^T q. \tag{11.4}$$
This means that the query matching is performed in a k-dimensional space.
25 Sometimes also called latent semantic analysis (LSA) [52].
Example 11.7. We did query matching for Q9 in the Medline collection, approx-
imating the matrix using the truncated SVD of rank 100.
Figure 11.4. Query matching for Q9. Recall versus precision for the full
vector space model (solid line) and the rank 100 approximation (dashed line and
diamonds).
The recall precision curve is given in Figure 11.4. It is seen that for this query,
LSI improves the retrieval performance. In Figure 11.5 we also demonstrate a fact
that is common to many term-document matrices: the matrix is rather well-conditioned,
and there is no gap in the sequence of singular values. Therefore, we cannot find a
suitable rank of the LSI approximation by inspecting the singular values; it must
be determined by retrieval experiments.
Figure 11.5. First 100 singular values of the Medline (stemmed) matrix.
The matrix columns are scaled to unit Euclidean length.
Another remarkable fact is that with k = 100 the approximation error in the
matrix approximation,
$$\frac{\|A - A_k\|_F}{\|A\|_F} \approx 0.8,$$
is large, and we still get improved retrieval performance. In view of the large
approximation error in the truncated SVD approximation of the term-document matrix,
one may question whether the “optimal” singular vectors constitute the best basis
for representing the term-document matrix. On the other hand, since we get such
good results, perhaps a more natural conclusion may be that the Frobenius norm
is not a good measure of the information contents in the term-document matrix.
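The size of this error is tied directly to the slow decay of the singular values. As a standard property of the truncated SVD (stated here as background, not as part of the experiments), the relative Frobenius error depends only on the discarded singular values:

$$\frac{\|A - A_k\|_F}{\|A\|_F} = \left( \frac{\sum_{i>k} \sigma_i^2}{\sum_{i \ge 1} \sigma_i^2} \right)^{1/2}.$$

For instance, with the singular values 2.8546, 1.8823, 1.7321, 1.2603, 0.8483 of the small matrix in Example 11.8 and k = 2, the numerator is about 5.308 and the denominator about 17.000, which gives a relative error of roughly 0.56. When the singular values decay slowly and there is no gap, the ratio stays close to 1 unless k is large.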
It is also interesting to see what are the most important “directions” in the
data. From Theorem 6.6 we know that the first few left singular vectors are the dom-
inant directions in the document space, and their largest components should indicate
what these directions are. The MATLAB statement find(abs(U(:,k))>0.13),
combined with look-up in the index of terms, gave the following results for k=1,2:
U(:,1)     U(:,2)
cell       case
growth     cell
hormone    children
patient    defect
           dna
           growth
           patient
           ventricular
In Chapter 13 we will come back to the problem of extracting the keywords from
texts.
It should be said that LSI does not give significantly better results for all
queries in the Medline collection: there are some in which it gives results comparable
to the full vector model and some in which it gives worse performance. However, it
is often the average performance that matters.
A systematic study of different aspects of LSI was done in [52]. It was shown
that LSI improves retrieval performance for surprisingly small values of the reduced
rank k. At the same time, the relative matrix approximation errors are large. It
is probably not possible to prove any general results that explain in what way and
for which data LSI can improve retrieval performance. Instead we give an artificial
example (constructed using similar ideas as a corresponding example in [12]) that
gives a partial explanation.
Example 11.8. Consider the term-document matrix from Example 1.1 and the
query “ranking of Web pages.” Obviously, Documents 1–4 are relevant with
respect to the query, while Document 5 is totally irrelevant. However, we obtain
cosines for the query and the original data as
$$\begin{pmatrix} 0 & 0.6667 & 0.7746 & 0.3333 & 0.3333 \end{pmatrix},$$
which indicates that Document 5 (the football document) is as relevant to the query
as Document 4. Further, since none of the words of the query occurs in Document 1,
this document is orthogonal to the query.
We then compute the SVD of the term-document matrix and use a rank-2
approximation. After projection to the two-dimensional subspace, the cosines,
computed according to (11.4), are
$$\begin{pmatrix} 0.7857 & 0.8332 & 0.9670 & 0.4873 & 0.1819 \end{pmatrix}.$$
It turns out that Document 1, which was deemed totally irrelevant for the query
in the original representation, is now highly relevant. In addition, the cosines for
the relevant Documents 2–4 have been reinforced. At the same time, the cosine for
Document 5 has been significantly reduced. Thus, in this artificial example, the
dimension reduction enhanced the retrieval performance.
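The computation in this example can be sketched without a linear algebra library. The matrix below is our reconstruction of the term-document matrix of Example 1.1 (rows: eigenvalue, England, FIFA, Google, Internet, link, matrix, page, rank, Web); it is an assumption in the sense that it is not printed in this chapter, although it does reproduce the full-space cosines quoted above. The dominant two-dimensional subspace is found by orthogonal iteration with A A^T, which is a substitute for a proper SVD routine, not the method used in the text.

```python
import math

A = [
    [0, 0, 0, 1, 0],  # eigenvalue
    [0, 0, 0, 0, 1],  # England
    [0, 0, 0, 0, 1],  # FIFA
    [1, 0, 1, 0, 0],  # Google
    [1, 0, 0, 0, 0],  # Internet
    [0, 1, 0, 0, 0],  # link
    [1, 0, 1, 1, 0],  # matrix
    [0, 1, 1, 0, 0],  # page
    [0, 0, 1, 1, 1],  # rank
    [0, 1, 1, 0, 0],  # Web
]
q = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]  # "ranking of Web pages"


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def column(M, j):
    return [row[j] for row in M]


def orthonormalize(vectors):
    """Modified Gram-Schmidt; assumes the vectors are linearly independent."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(b, w)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        basis.append([wi / norm for wi in w])
    return basis


def dominant_left_subspace(A, k, iterations=500):
    """Orthonormal basis for the span of the first k left singular vectors."""
    m, n = len(A), len(A[0])
    cols = [column(A, j) for j in range(n)]
    # Deterministic, linearly independent starting vectors (arbitrary choice).
    U = orthonormalize([[float((i + 1) ** (s + 1)) for i in range(m)]
                        for s in range(k)])
    for _ in range(iterations):
        new = []
        for u in U:
            t = [dot(c, u) for c in cols]           # A^T u
            new.append([dot(row, t) for row in A])  # A (A^T u)
        U = orthonormalize(new)
    return U


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


U2 = dominant_left_subspace(A, 2)
qk = [dot(u, q) for u in U2]                        # q_k = U_k^T q
H = [[dot(u, column(A, j)) for u in U2] for j in range(5)]  # h_j = U_k^T a_j
lsi_cosines = [cosine(qk, h) for h in H]
print([round(c, 4) for c in lsi_cosines])
```

The cosines in (11.4) depend only on the subspace spanned by the columns of Uk, not on the particular orthonormal basis, so the result should agree with the values quoted in the example up to rounding.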
In Figure 11.6 we plot the five documents and the query in the coordinate
system of the first two left singular vectors. Obviously, in this representation, the
first document is closer to the query than Document 5. The first two left singular
vectors are
$$u_1 = \begin{pmatrix} 0.1425 \\ 0.0787 \\ 0.0787 \\ 0.3924 \\ 0.1297 \\ 0.1020 \\ 0.5348 \\ 0.3647 \\ 0.4838 \\ 0.3647 \end{pmatrix}, \qquad u_2 = \begin{pmatrix} 0.2430 \\ 0.2607 \\ 0.2607 \\ -0.0274 \\ 0.0740 \\ -0.3735 \\ 0.2156 \\ -0.4749 \\ 0.4023 \\ -0.4749 \end{pmatrix},$$
and the singular values are Σ = diag(2.8546, 1.8823, 1.7321, 1.2603, 0.8483). The
first four columns in A are strongly coupled via the words Google, matrix, etc., and
those words are the dominating contents of the document collection (cf. the singular
values). This shows in the composition of u1. So even if none of the words in the
query are matched by Document 1, that document is so strongly correlated to the
dominating direction that it becomes relevant in the reduced representation.
Figure 11.6. The five documents and the query projected to the coordinate
system of the first two left singular vectors.
11.4 Clustering
In the case of document collections, it is natural to assume that there are groups
of documents with similar contents. If we think of the documents as points in $\mathbb{R}^m$,
we may be able to visualize the groups as clusters. Representing each cluster by
its mean value, the centroid,26 we can compress the data in terms of the centroids.
Thus clustering, using the k-means algorithm, for instance, is another method for
low-rank approximation of the term-document matrix. The application of clustering
to information retrieval is described in [30, 76, 77].
In analogy to LSI, the matrix $C_k \in \mathbb{R}^{m \times k}$ of (normalized but not orthogonal)
centroids can be used as an approximate basis in the “document space.” For query
matching we then need to determine the coordinates of all the documents in this
basis. This can be done by solving the matrix least squares problem,
$$\min_{\hat{G}_k} \|A - C_k \hat{G}_k\|_F.$$
However, it is more convenient first to orthogonalize the columns of Ck, i.e., compute
its thin QR decomposition,
$$C_k = P_k R, \qquad P_k \in \mathbb{R}^{m \times k}, \quad R \in \mathbb{R}^{k \times k},$$
and solve
$$\min_{G_k} \|A - P_k G_k\|_F. \tag{11.5}$$
Writing each column of A − PkGk separately, we see that this matrix least squares
problem is equivalent to n independent standard least squares problems
$$\min_{g_j} \|a_j - P_k g_j\|_2, \qquad j = 1, \ldots, n,$$
where gj is column j in Gk. Since Pk has orthonormal columns, we get $g_j = P_k^T a_j$,
and the solution of (11.5) becomes
$$G_k = P_k^T A.$$
For matching of a query q we compute the product
$$q^T A \approx q^T P_k G_k = (P_k^T q)^T G_k = q_k^T G_k,$$
where $q_k = P_k^T q$. Thus, the cosines in the low-dimensional approximation are
$$\frac{q_k^T g_j}{\|q_k\|_2 \, \|g_j\|_2}.$$
26 Closely related to the concept vector [30].
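The steps above can be sketched as follows, assuming that a clustering is already available (no k-means step is run here). The four documents, the cluster assignment, and the query are hypothetical; Gram-Schmidt stands in for a library QR routine.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def gram_schmidt(columns):
    """Orthonormal columns P_k spanning the same space as the centroids."""
    basis = []
    for v in columns:
        w = list(v)
        for b in basis:
            c = dot(b, w)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        basis.append(normalize(w))
    return basis


def centroid_matching(doc_columns, clusters, q):
    """Return (P, cosines) for the centroid method with given clusters."""
    docs = [normalize(d) for d in doc_columns]  # columns of equal length
    m = len(docs[0])
    centroids = []
    for members in clusters:
        mean = [sum(docs[j][i] for j in members) / len(members)
                for i in range(m)]
        centroids.append(normalize(mean))
    P = gram_schmidt(centroids)
    G = [[dot(p, d) for p in P] for d in docs]  # g_j = P_k^T a_j
    qk = [dot(p, q) for p in P]                 # q_k = P_k^T q
    cosines = [dot(qk, g) / math.sqrt(dot(qk, qk) * dot(g, g)) for g in G]
    return P, cosines


docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 1, 2, 1]]
clusters = [[0, 1], [2, 3]]  # assumed cluster membership (0-based)
q = [1, 1, 0, 0]
P, cos = centroid_matching(docs, clusters, q)
```

In this example the query is parallel to the first centroid, so the two documents of the first cluster obtain cosines close to 1, while the third document, which shares no term with the query, obtains a cosine of essentially zero.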
Example 11.9. We did query matching for Q9 of the Medline collection. Before
computing the clustering we normalized the columns to equal Euclidean length. We
approximated the matrix using the orthonormalized centroids from a clustering into
50 clusters. The recall-precision diagram is given in Figure 11.7. We see that for
high values of recall, the centroid method is as good as the LSI method with double
the rank; see Figure 11.4.
For rank 50 the approximation error in the centroid method,
$$\frac{\|A - P_k G_k\|_F}{\|A\|_F} \approx 0.9,$$
is even higher than for LSI of rank 100.
The improved performance can be explained in a similar way as for LSI. Being
the “average document” of a cluster, the centroid captures the main links between
the dominant documents in the cluster. By expressing all documents in terms of
the centroids, the dominant links are emphasized.
When we tested all 30 queries in the Medline collection, we found that the cen-
troid method with rank equal to 50 has a similar performance as LSI with rank 100:
there are some queries where the full vector space model is considerably better, but
there are also some for which the centroid method is much better.
Figure 11.7. Query matching for Q9. Recall versus precision for the full
vector space model (solid line) and the rank 50 centroid approximation (solid line
and circles).
11.5 Nonnegative Matrix Factorization
Assume that we have computed an approximate nonnegative matrix factorization
of the term-document matrix,
$$A \approx WH, \qquad W \ge 0, \quad H \ge 0,$$
where $W \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{k \times n}$. Column j of H holds the coordinates of document
j in the approximate, nonorthogonal basis consisting of the columns of W. We want
to first determine the representation of the query vector q in the same basis by
solving the least squares problem $\min_{\hat{q}} \|q - W\hat{q}\|_2$. Then, in this basis, we compute
the angles between the query and all the document vectors. Given the thin QR
decomposition of W,
$$W = QR, \qquad Q \in \mathbb{R}^{m \times k}, \quad R \in \mathbb{R}^{k \times k},$$
the query in the reduced basis is
$$\hat{q} = R^{-1} Q^T q,$$
and the cosine for document j is
$$\frac{\hat{q}^T h_j}{\|\hat{q}\|_2 \, \|h_j\|_2}.$$
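A minimal sketch of this query step is given below, assuming the factors W and H have already been computed. The small W, H, and q are hypothetical, the QR decomposition is done by Gram-Schmidt (W is assumed to have full column rank), and the triangular system is solved by back substitution instead of forming the inverse of R.

```python
import math


def thin_qr(W):
    """Thin QR of W (list of rows) by modified Gram-Schmidt.

    Returns Q as a list of k orthonormal columns and R as a k-by-k upper
    triangular matrix (list of rows).
    """
    m, k = len(W), len(W[0])
    Q, R = [], [[0.0] * k for _ in range(k)]
    for j in range(k):
        v = [W[i][j] for i in range(m)]
        for l, ql in enumerate(Q):
            R[l][j] = sum(a * b for a, b in zip(ql, v))
            v = [vi - R[l][j] * qi for vi, qi in zip(v, ql)]
        R[j][j] = math.sqrt(sum(vi * vi for vi in v))
        Q.append([vi / R[j][j] for vi in v])
    return Q, R


def query_coordinates(W, q):
    """Solve min ||q - W q_hat||_2 via R q_hat = Q^T q."""
    Q, R = thin_qr(W)
    k = len(R)
    rhs = [sum(a * b for a, b in zip(col, q)) for col in Q]
    q_hat = [0.0] * k
    for i in reversed(range(k)):
        s = rhs[i] - sum(R[i][j] * q_hat[j] for j in range(i + 1, k))
        q_hat[i] = s / R[i][i]
    return q_hat


def nmf_cosines(W, H, q):
    """Cosines between q_hat and the columns h_j of H (list of rows)."""
    q_hat = query_coordinates(W, q)
    q_norm = math.sqrt(sum(x * x for x in q_hat))
    result = []
    for j in range(len(H[0])):
        h = [row[j] for row in H]
        h_norm = math.sqrt(sum(x * x for x in h))
        result.append(sum(a * b for a, b in zip(q_hat, h)) / (q_norm * h_norm))
    return result


W = [[1, 0], [1, 1], [0, 2]]
H = [[1, 0, 2], [0, 1, 2]]
q = [1, 2, 2]  # equals W times (1, 1), so q_hat should be (1, 1)
q_hat = query_coordinates(W, q)
cos = nmf_cosines(W, H, q)
```

Because the basis W is not orthogonal, the coordinates of the query come from a least squares solve and not from a simple product with the transpose of W, which is the main difference from the LSI and centroid variants.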
Figure 11.8. Query matching for Q9. Recall versus precision for the full
vector space model (solid line) and the rank 50 nonnegative matrix approximation
(dashed line and ×’s).
Example 11.10. We computed a rank-50 approximation of the Medline term-
document matrix using 100 iterations of the multiplicative algorithm described in
Section 9.2. The relative approximation error ‖A − WH‖F /‖A‖F was approxi-
mately 0.89. The recall-precision curve for query Q9 is given in Figure 11.8.
11.6 LGK Bidiagonalization
So far in this chapter we have described three methods for improving the vector
space method for information retrieval by representing the term-document matrix
A by a low-rank approximation, based on SVD, clustering, and nonnegative
matrix factorization. These three methods have a common weakness: it is costly to
update the low-rank approximation when documents are added or deleted. In
Chapter 7 we described a method for computing a low-rank approximation of A
in connection with a least squares problem, using the right-hand side as a start-
ing vector in an LGK bidiagonalization. Here we will apply this methodology to
the text mining problem in such a way that a new low-rank approximation will be
computed for each query. Therefore there is no extra computational cost when the
document collection is changed. On the other hand, the amount of work for each
query matching becomes higher. This section is inspired by [16].
Given a query vector q we apply the recursive LGK bidiagonalization algo-
rithm (or PLS).
LGK bidiagonalization for a query q
1. β1 p1 = q, z0 = 0
2. for i = 1 : k
       αi zi = A^T pi − βi zi−1
       βi+1 pi+1 = A zi − αi pi
3. end
The coefficients αi−1 and βi are determined so that ‖pi‖ = ‖zi‖ = 1.
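The recursion can be sketched in plain Python as follows. The small matrix and query are hypothetical, the function and variable names are ours, and no safeguard against breakdown is included (a zero α or β would cause a division by zero).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """A x for A stored as a list of rows."""
    return [dot(row, x) for row in A]


def rmatvec(A, y):
    """A^T y for A stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A)))
            for j in range(len(A[0]))]


def lgk_bidiagonalization(A, q, k):
    """Run k steps of the LGK recursion started from the query q.

    Returns P (k+1 vectors), Z (k vectors), alpha (k values) and
    beta (k+1 values); index 0 corresponds to subscript 1 in the text.
    """
    beta = [math.sqrt(dot(q, q))]
    P = [[x / beta[0] for x in q]]          # beta_1 p_1 = q
    Z, alpha = [], []
    z_prev = [0.0] * len(A[0])              # z_0 = 0
    for i in range(k):
        w = rmatvec(A, P[i])                # A^T p_i - beta_i z_{i-1}
        w = [wj - beta[i] * zj for wj, zj in zip(w, z_prev)]
        alpha.append(math.sqrt(dot(w, w)))
        z = [wj / alpha[i] for wj in w]
        Z.append(z)
        v = matvec(A, z)                    # A z_i - alpha_i p_i
        v = [vj - alpha[i] * pj for vj, pj in zip(v, P[i])]
        beta.append(math.sqrt(dot(v, v)))
        P.append([vj / beta[i + 1] for vj in v])
        z_prev = z
    return P, Z, alpha, beta


A = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
q = [1, 1, 0, 0]
P, Z, alpha, beta = lgk_bidiagonalization(A, q, 2)
```

In exact arithmetic the vectors in P are mutually orthonormal, and so are those in Z; for the few steps used in query matching this also holds to good accuracy in floating point, so no reorthogonalization is attempted in the sketch.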
The vectors pi and zi are collected in the matrices $P_{k+1} = (p_1 \; \cdots \; p_{k+1})$ and
$Z_k = (z_1 \; \cdots \; z_k)$. After k steps of this procedure we have generated a rank-k
approximation of A; see (7.16). We summarize the derivation in the following
proposition.
Proposition 11.11. Let AZk = Pk+1Bk+1 be the result of k steps of the LGK
recursion, and let
$$B_{k+1} = Q_{k+1} \begin{pmatrix} \hat{B}_k \\ 0 \end{pmatrix}$$
be the QR decomposition of the bidiagonal matrix Bk+1. Then we have a rank-k
approximation
$$A \approx W_k Y_k^T, \tag{11.6}$$
where
$$W_k = P_{k+1} Q_{k+1} \begin{pmatrix} I_k \\ 0 \end{pmatrix}, \qquad Y_k = Z_k \hat{B}_k^T.$$
It is possible to use the low-rank approximation (11.6) for query matching in
much the same way as the SVD low-rank approximation is used in LSI. However, it
turns out [16] that better performance is obtained if the following method is used.
The column vectors of Wk are an orthogonal, approximate basis for documents
that are close to the query q. Instead of computing the coordinates of the columns
of A in terms of this basis, we now choose to compute the projection of the query
in terms of this basis:
$$\tilde{q} = W_k W_k^T q \in \mathbb{R}^m.$$
We then use the cosine measure
$$\cos(\tilde{\theta}_j) = \frac{\tilde{q}^T a_j}{\|\tilde{q}\|_2 \, \|a_j\|_2}, \tag{11.7}$$
i.e., we compute the cosines of the angles between the projected query and the
original documents.
Figure 11.9. Query matching for Q9. The top graph shows the relative
residual as a function of the number of steps in the LGK bidiagonalization (solid line
and ×’s). As a comparison, the residual in terms of the basis of the first singular
vectors (principal component regression) is given (dashed line). In the bottom graph
we give recall versus precision for the full vector space model (solid line) and for
bidiagonalization with two steps (solid line and +’s), and eight steps (solid line
and ∗’s).
Example 11.12. We ran LGK bidiagonalization with Q9 as a starting vector. It
turns out that the relative residual decreases rather slowly to slightly below 0.8;
see Figure 11.9. Still, after two steps the method already gives results that are
much better than the full vector space model. It is also seen that eight steps of
bidiagonalization give worse results.
Example 11.12 indicates that the low-rank approximation obtained by LGK
bidiagonalization has similar properties in terms of noise reduction as do LSI and
the centroid-based method. It is striking that when the query vector is used to
influence the first basis vectors, a much lower rank (in this case, 2) gives retrieval
results that are about as good. On the other hand, when the number of steps
increases, the precision becomes worse and approaches that of the full vector space
model. This is natural, since gradually the low-rank approximation becomes better,
and after around eight steps it represents almost all the information in the term-
document matrix that is relevant to the query, in the sense of the full vector space
model.
To determine how many steps of the recursion should be performed, one can
monitor the least squares residual:
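$$\min_y \|A Z_k y - q\|_2^2 = \min_y \|B_{k+1} y - \beta_1 e_1\|_2^2 = \min_y \left\| \begin{pmatrix} \hat{B}_k \\ 0 \end{pmatrix} y - Q_{k+1}^T \beta_1 e_1 \right\|_2^2$$
$$= \min_y \|\hat{B}_k y - \gamma^{(k)}\|_2^2 + |\gamma_k|^2 = |\gamma_k|^2,$$
where
$$\begin{pmatrix} \gamma^{(k)} \\ \gamma_k \end{pmatrix} = \begin{pmatrix} \gamma_0 \\ \gamma_1 \\ \vdots \\ \gamma_k \end{pmatrix} = Q_{k+1}^T \begin{pmatrix} \beta_1 \\ 0 \\ \vdots \\ 0 \end{pmatrix};$$
cf. (7.14). When the norm of the residual stops decreasing substantially, we can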
assume that the query is represented as well as is possible by a linear combination
of documents. Then we can expect the performance to be about the same as that
of the full vector space model. Consequently, to have better performance than the
full vector space model, one should stop well before the residual curve starts to level
off.
11.7 Average Performance
Experiments to compare different methods for information retrieval should be per-
formed on several test collections.27 In addition, one should use not only one single
query but a sequence of queries.
27 Test collections and other useful information about text mining can be found at
http://trec.nist.gov/, which is the Web site of the Text Retrieval Conference (TREC). See also
http://www.cs.utk.edu/~lsi/corpa.html.
Example 11.13. We tested the 30 queries in the Medline collection and computed
average precision-recall curves for the five methods presented in this chapter. To
compute the average precision over the methods to be compared, it is necessary to
evaluate it at specified values of the recall, 5, 10, 15, . . . , 90%, say. We obtained
these by linear interpolation.
Figure 11.10. Query matching for all 30 queries in the Medline collection.
The methods used are the full vector space method (solid line), LSI of rank 100
(dashed line and diamonds), centroid approximation of rank 50 (solid line and
circles), nonnegative matrix factorization of rank 50 (dashed line and ×’s), and two
steps of LGK bidiagonalization (solid line and +’s).
The results are illustrated in Figure 11.10. It is seen that LSI of rank 100,
centroid-based approximation of rank 50, nonnegative matrix factorization, and two
steps of LGK bidiagonalization all give considerably better average precision than
the full vector space model.
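The averaging at fixed recall levels described in Example 11.13 can be sketched as follows. The two recall-precision curves are hypothetical, and clamping outside the measured recall range is an assumption made for the example.

```python
def interpolate(x, xs, ys):
    """Piecewise-linear interpolation of the points (xs, ys) at x.

    xs must be strictly increasing; values outside the range are clamped
    to the nearest end point (an assumed convention).
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            return ys[i] + (ys[i + 1] - ys[i]) * (x - x0) / (x1 - x0)


def average_precision_curve(curves, levels):
    """Average the interpolated precision of several queries at each recall level."""
    averaged = []
    for r in levels:
        values = [interpolate(r, recall, precision)
                  for recall, precision in curves]
        averaged.append(sum(values) / len(values))
    return averaged


levels = list(range(5, 95, 5))  # recall 5, 10, ..., 90 (%)
curves = [
    ([0, 50, 100], [100, 60, 20]),  # (recall %, precision %) for one query
    ([0, 100], [80, 40]),           # and for a second query
]
avg = average_precision_curve(curves, levels)
```

Evaluating every query at the same recall levels is what makes the curves comparable: the raw recall values differ from query to query because they depend on the number of relevant documents.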
From Example 11.13 we see that for the Medline test collection, the meth-
ods based on low-rank approximation of the term-document matrix all perform
better than the full vector space model. Naturally, the price to be paid is more
computation. In the case of LSI, centroid approximation, and nonnegative matrix
factorization, the extra computations can be made offline, i.e., separate from the
query matching. If documents are added to the collection, then the approximation
must be recomputed, which may be costly. The method based on LGK bidiago-
nalization, on the other hand, performs the extra computation in connection with
the query matching. Therefore, it can be used efficiently in situations where the
term-document matrix is subject to frequent changes.
Similar results are obtained for other test collections; see, e.g., [11]. However,
the structure of the text documents plays a role. For instance, in [52] it is shown
that the performance of LSI is considerably better for medical abstracts than for
articles from TIME magazine.
Chapter 12
Page Ranking for a Web
Search Engine
When a search is made on the Internet using a search engine, there is first a tradi-
tional text processing part, where the aim is to find all the Web pages containing
the words of the query. Due to the massive size of the Web, the number of hits
is likely to be much too large to be of use. Therefore, some measure of quality is
needed to filter out pages that are assumed to be less interesting.
When one uses a Web search engine it is typical that the search phrase is
underspecified.
Example 12.1. A Google (footnote 28) search conducted on September 29, 2005, using the
search phrase university, gave as a result links to the following well-known univer-
sities: Harvard, Stanford, Cambridge, Yale, Cornell, Oxford. The total number of
Web pages relevant to the search phrase was more than 2 billion.
Obviously Google uses an algorithm for ranking all the Web pages that agrees
rather well with a common-sense quality measure. Somewhat surprisingly, the rank-
ing procedure is based not on human judgment but on the link structure of the Web.
Loosely speaking, Google assigns a high rank to a Web page if it has inlinks from
other pages that have a high rank. We will see that this self-referencing statement
can be formulated mathematically as an eigenvalue equation for a certain matrix.
12.1 Pagerank
It is of course impossible to define a generally valid measure of relevance that would
be acceptable for all users of a search engine. Google uses the concept of pagerank
as a quality measure of Web pages. It is based on the assumption that the number
of links to and from a page gives information about the importance of a page. We
will give a description of pagerank based primarily on [74] and [33]. Concerning
Google, see [19].
28. http://www.google.com/.
Let all Web pages be ordered from 1 to n, and let i be a particular Web page.
Then Oi will denote the set of pages that i is linked to, the outlinks. The number
of outlinks is denoted Ni = |Oi|. The set of inlinks, denoted Ii, consists of the pages that
have an outlink to i.
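[Diagram: page i, with arrows entering from the pages in the inlink set Ii and arrows leaving to the pages in the outlink set Oi.]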
In general, a page i can be considered as more important the more inlinks
it has. However, a ranking system based only on the number of inlinks is easy to
manipulate (footnote 29): when you design a Web page i that (e.g., for commercial reasons)
you would like to be seen by as many users as possible, you could simply create a
large number of (informationless and unimportant) pages that have outlinks to i.
To discourage this, one defines the rank of i so that if a highly ranked page j has an
outlink to i, this adds to the importance of i in the following way: the rank of page
i is a weighted sum of the ranks of the pages that have outlinks to i. The weighting
is such that the rank of a page j is divided evenly among its outlinks. Translating
this into mathematics, we get
$$
r_i = \sum_{j \in I_i} \frac{r_j}{N_j}. \tag{12.1}
$$
This preliminary definition is recursive, so pageranks cannot be computed directly.
Instead a fixed-point iteration might be used. Guess an initial ranking vector r(0).
Then iterate
$$
r_i^{(k+1)} = \sum_{j \in I_i} \frac{r_j^{(k)}}{N_j}, \qquad k = 0, 1, \ldots. \tag{12.2}
$$
There are a few problems with such an iteration: if a page has no outlinks, then
in the iteration process it accumulates rank only via its inlinks, but this rank is
never distributed further. Therefore it is not clear if the iteration converges. We
will come back to this problem later.
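The difficulty with pages that lack outlinks can be seen on a very small example. The following is only an illustrative sketch: the three-page graph (1 → 2, 1 → 3, 2 → 3, with page 3 having no outlinks) and all variable names are assumptions made here, not taken from the text. It applies the iteration (12.2) three times and records the total rank after each step.

```matlab
% Hypothetical 3-page graph: 1 -> 2, 1 -> 3, 2 -> 3; page 3 has no outlinks.
inlinks = {[], 1, [1 2]};      % inlinks{i} is the set I_i
N = [2 1 0];                   % N(j) is the number of outlinks of page j
r = ones(3,1)/3;               % initial ranking vector r^(0)
total = zeros(1,4);            % total rank before and after each step
total(1) = sum(r);
for k = 1:3
    rnew = zeros(3,1);
    for i = 1:3
        for j = inlinks{i}     % loop over the inlinks of page i
            rnew(i) = rnew(i) + r(j)/N(j);   % one term of (12.2)
        end
    end
    r = rnew;
    total(k+1) = sum(r);
end
disp(total)
```

In this example the total rank shrinks from 1 to 2/3, then to 1/6, and finally to 0: the rank that reaches page 3 is never passed on, so the iterates tend to the zero vector rather than to a useful ranking.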
More insight is gained if we reformulate (12.1) as an eigenvalue problem for
a matrix representing the graph of the Internet. Let Q be a square matrix of
dimension n. Define
$$
Q_{ij} =
\begin{cases}
1/N_j & \text{if there is a link from } j \text{ to } i, \\
0 & \text{otherwise.}
\end{cases}
$$
29. For an example of attempts to fool a search engine, see [96] and [59, Chapter 5].
This means that row i has nonzero elements in the positions that correspond to
inlinks of i. Similarly, column j has nonzero elements equal to 1/Nj in the positions
that correspond to the outlinks of j, and, provided that the page has outlinks, the
sum of all the elements in column j is equal to one. In the following symbolic
picture of the matrix Q, nonzero elements are denoted ∗:
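[Symbolic picture of Q: row i has nonzero elements ∗ in the positions of the inlinks of i; column j has nonzero elements ∗ in the positions of the outlinks of j; all other elements are zero.]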
Example 12.2. The following link graph illustrates a set of Web pages with
outlinks and inlinks:
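[Link graph on the pages 1 to 6. The links, as they can be read off from the columns of the matrix Q below: 1 → 2, 4, 5; 2 → 1, 3, 5; 3 → 6; 5 → 3, 4, 6; 6 → 3, 5; page 4 has no outlinks.]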
The corresponding matrix becomes
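$$
Q =
\begin{pmatrix}
0 & \tfrac{1}{3} & 0 & 0 & 0 & 0 \\
\tfrac{1}{3} & 0 & 0 & 0 & 0 & 0 \\
0 & \tfrac{1}{3} & 0 & 0 & \tfrac{1}{3} & \tfrac{1}{2} \\
\tfrac{1}{3} & 0 & 0 & 0 & \tfrac{1}{3} & 0 \\
\tfrac{1}{3} & \tfrac{1}{3} & 0 & 0 & 0 & \tfrac{1}{2} \\
0 & 0 & 1 & 0 & \tfrac{1}{3} & 0
\end{pmatrix}.
$$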
Since page 4 has no outlinks, the corresponding column is equal to zero.
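The construction of Q can be sketched in a few lines of MATLAB. The outlink lists below are read off from the columns of the matrix in Example 12.2; the variable names are assumptions made for this illustration. A dense matrix is used only because the example is tiny; for a real link graph one would use a sparse matrix.

```matlab
% outlinks{j} lists the pages that page j links to (page 4 has none).
outlinks = {[2 4 5], [1 3 5], 6, [], [3 4 6], [3 5]};
n = numel(outlinks);
Q = zeros(n);
for j = 1:n
    Nj = numel(outlinks{j});          % N_j = |O_j|
    if Nj > 0
        Q(outlinks{j}, j) = 1/Nj;     % column j: 1/N_j at the outlinks of j
    end
end
colsum = sum(Q, 1);                   % column sums of Q
disp(Q)
disp(colsum)
```

The column sums are 1 for every page that has outlinks and 0 for page 4, in agreement with the description of Q above.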
Obviously, the definition (12.1) is equivalent to the scalar product of row i
and the vector r, which holds the ranks of all pages. We can write the equation in
matrix form,
λr = Qr, λ = 1, (12.3)
i.e., r is an eigenvector of Q with eigenvalue λ = 1. It is now easily seen that the
iteration (12.2) is equivalent to
r(k+1) = Qr(k), k = 0, 1, . . . ,
which is the power method for computing the eigenvector. However, at this point
it is not clear that pagerank is well defined, as we do not know if there exists an
eigenvalue equal to 1. It turns out that the theory of Markov chains is useful in the
analysis.
12.2 Random Walk and Markov Chains
There is a random walk interpretation of the pagerank concept. Assume that a
surfer visiting a Web page chooses the next page among the outlinks with equal
probability. Then the random walk induces a Markov chain (see, e.g., [70]). A
Markov chain is a random process in which the next state is determined completely
from the present state; the process has no memory. The transition matrix of the
Markov chain is Q^T. (Note that we use a slightly different notation than is common
in the theory of stochastic processes.)
The random surfer should never get stuck. In other words, our random walk
model should have no pages without outlinks. (Such a page corresponds to a zero
column in Q.) Therefore the model is modified so that zero columns are replaced
with a constant value in all positions. This means that there is equal probability to
go to any other Internet page. Define the vectors
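$$
d_j =
\begin{cases}
1 & \text{if } N_j = 0, \\
0 & \text{otherwise,}
\end{cases}
$$
for j = 1, . . . , n, and
$$
e = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \in \mathbb{R}^n. \tag{12.4}
$$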
The modified matrix is defined
$$
P = Q + \frac{1}{n} e d^T. \tag{12.5}
$$
With this modification the matrix P is a proper column-stochastic matrix : It has
nonnegative elements, and the elements of each column sum up to 1. The preceding
statement can be reformulated as follows.
Proposition 12.3. A column-stochastic matrix P satisfies
$$
e^T P = e^T, \tag{12.6}
$$
where e is defined by (12.4).
Example 12.4. The matrix in the previous example is modified to
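$$
P =
\begin{pmatrix}
0 & \tfrac{1}{3} & 0 & \tfrac{1}{6} & 0 & 0 \\
\tfrac{1}{3} & 0 & 0 & \tfrac{1}{6} & 0 & 0 \\
0 & \tfrac{1}{3} & 0 & \tfrac{1}{6} & \tfrac{1}{3} & \tfrac{1}{2} \\
\tfrac{1}{3} & 0 & 0 & \tfrac{1}{6} & \tfrac{1}{3} & 0 \\
\tfrac{1}{3} & \tfrac{1}{3} & 0 & \tfrac{1}{6} & 0 & \tfrac{1}{2} \\
0 & 0 & 1 & \tfrac{1}{6} & \tfrac{1}{3} & 0
\end{pmatrix}.
$$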
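The modification (12.5) and the property (12.6) can be checked numerically for this example. The sketch below is a minimal illustration under the assumption that the link graph is given as outlink lists (read off from Example 12.2); the variable names are hypothetical.

```matlab
% Link graph of Example 12.2 as outlink lists (page 4 has no outlinks).
outlinks = {[2 4 5], [1 3 5], 6, [], [3 4 6], [3 5]};
n = numel(outlinks);
Q = zeros(n);
for j = 1:n
    if ~isempty(outlinks{j})
        Q(outlinks{j}, j) = 1/numel(outlinks{j});
    end
end
d = double(cellfun(@isempty, outlinks))';   % d_j = 1 if N_j = 0, else 0
e = ones(n, 1);                             % the vector (12.4)
P = Q + (1/n)*e*d';                         % the modified matrix (12.5)
check = norm(e'*P - e', 1);                 % deviation from e^T P = e^T, (12.6)
disp(P)
disp(check)
```

Only column 4 changes: it becomes the constant vector with elements 1/6, and every column of P now sums to one.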
In analogy to (12.3), we would like to define the pagerank vector as a unique
eigenvector of P with eigenvalue 1,
Pr = r.
The eigenvector of the transition matrix corresponds to a stationary probability
distribution for the Markov chain. The element in position i, ri, is the probability
that after a large number of steps, the random walker is at Web page i. However,
the existence of a unique eigenvector with eigenvalue 1 is still not guaranteed. To
ensure uniqueness, the matrix must be irreducible; cf. [53].
Definition 12.5. A square matrix A is called reducible if there is a permutation
matrix P such that
$$
P A P^T = \begin{pmatrix} X & Y \\ 0 & Z \end{pmatrix}, \tag{12.7}
$$
where X and Z are both square. Otherwise the matrix is called irreducible.
Example 12.6. To illustrate the concept of reducibility, we give an example of a
link graph that corresponds to a reducible matrix:
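[Link graph on the pages 1 to 6, with pages 1, 2, 3 in the left part and pages 5, 6 in the right part. The links, as they can be read off from the columns of the matrix (12.8): 1 → 2, 3; 2 → 1, 3; 3 → 1, 2; 4 → 1, 5; 5 → 6; 6 → 5.]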
A random walker who has entered the left part of the link graph will never get
out of it, and similarly will get stuck in the right part. The corresponding matrix is
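$$
P =
\begin{pmatrix}
0 & \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \\
\tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 & 0 & 0 \\
\tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \tfrac{1}{2} & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}, \tag{12.8}
$$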
which is of the form (12.7). Actually, this matrix has two eigenvalues equal to 1
and one equal to −1; see Example 12.10.
The directed graph corresponding to an irreducible matrix is strongly connected: given any two nodes (Ni, Nj) in the graph, there exists a path leading
from Ni to Nj.
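For small matrices, irreducibility can be tested with a standard result on nonnegative matrices: an n × n nonnegative matrix A is irreducible exactly when (I + A)^(n−1) has only positive elements. The sketch below applies this test to the matrix (12.8) and to a three-page cycle. The helper name is_irreducible and the cycle example are assumptions made for illustration; the test forms a dense matrix power and is therefore not usable for matrices of Web size.

```matlab
% Hypothetical helper: only the zero pattern of A matters for the test.
is_irreducible = @(A) all(all((eye(size(A,1)) + double(A ~= 0))^(size(A,1)-1) > 0));

% The matrix (12.8) of Example 12.6.
P = [0   1/2 1/2 1/2 0 0;
     1/2 0   1/2 0   0 0;
     1/2 1/2 0   0   0 0;
     0   0   0   0   0 0;
     0   0   0   1/2 0 1;
     0   0   0   0   1 0];

% A cycle 1 -> 2 -> 3 -> 1 (column j holds the outlinks of page j).
C = [0 0 1;
     1 0 0;
     0 1 0];

reducible_P   = ~is_irreducible(P);   % true: (12.8) is reducible
irreducible_C = is_irreducible(C);    % true: the cycle is strongly connected
disp([reducible_P irreducible_C])
```

For (12.8) the test fails already in row 4: page 4 has no inlinks, so no path leads to it from any other page.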
The uniqueness of the largest eigenvalue of an irreducible, positive matrix is
guaranteed by the Perron–Frobenius theorem; we state it for the special case treated
here. The inequality A > 0 is understood as all the elements of A being strictly
positive. By dominant eigenvalue we mean the largest eigenvalue in magnitude,
which we denote λ1.
Theorem 12.7. Let A be an irreducible column-stochastic matrix. The dominant
eigenvalue λ1 is equal to 1. There is a unique corresponding eigenvector r satisfying
r > 0, and ‖r‖1 = 1; this is the only eigenvector that is nonnegative. If A > 0, then
|λi| < 1, i = 2, 3, . . . , n.
Proof. Because A is column stochastic, we have eTA = eT , which means that 1
is an eigenvalue of A. The rest of the statement can be proved using the Perron–
Frobenius theory [70, Chapter 8].
Given the size of the Internet, we can be sure that the link matrix P is re-
ducible, which means that the pagerank eigenvector of P is not well defined. To
ensure irreducibility, i.e., to make it impossible for the random walker to get trapped
in a subgraph, one adds, artificially, a link from every Web page to all the others. In
matrix terms, this can be done by taking a convex combination of P and a rank-1
matrix,
$$
A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T, \tag{12.9}
$$
for some α satisfying 0 ≤ α ≤ 1. It is easy to see that the matrix A is column-
stochastic:
$$
e^T A = \alpha e^T P + (1 - \alpha) \frac{1}{n} e^T e e^T = \alpha e^T + (1 - \alpha) e^T = e^T.
$$
The random walk interpretation of the additional rank-1 term is that in each time
step the surfer visiting a page will jump to a random page with probability 1 − α
(sometimes referred to as teleportation).
We now see that the pagerank vector for the matrix A is well defined.
Proposition 12.8. The column-stochastic matrix A defined in (12.9) is irreducible
(since A > 0) and has the dominant eigenvalue λ1 = 1. The corresponding eigen-
vector r satisfies r > 0.
For the convergence of the numerical eigenvalue algorithm, it is essential to
know how the eigenvalues of P are changed by the rank-1 modification (12.9).
Theorem 12.9. Assume that the eigenvalues of the column-stochastic matrix
P are {1, λ2, λ3, . . . , λn}. Then the eigenvalues of A = αP + (1 − α)(1/n)ee^T are
{1, αλ2, αλ3, . . . , αλn}.
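Proof. Define ê to be e normalized to Euclidean length 1, and let U_1 ∈ R^{n×(n−1)} be such that U = (ê U_1) is orthogonal. Then, since ê^T P = ê^T,
$$
\begin{aligned}
U^T P U
&= \begin{pmatrix} \hat{e}^T P \\ U_1^T P \end{pmatrix}
   \begin{pmatrix} \hat{e} & U_1 \end{pmatrix}
 = \begin{pmatrix} \hat{e}^T \\ U_1^T P \end{pmatrix}
   \begin{pmatrix} \hat{e} & U_1 \end{pmatrix} \\
&= \begin{pmatrix} \hat{e}^T \hat{e} & \hat{e}^T U_1 \\ U_1^T P \hat{e} & U_1^T P U_1 \end{pmatrix}
 = \begin{pmatrix} 1 & 0 \\ w & T \end{pmatrix},
\end{aligned} \tag{12.10}
$$
where w = U_1^T P ê and T = U_1^T P U_1. Since we have made a similarity transformation, the matrix T has the eigenvalues λ2, λ3, . . . , λn. Now write A = αP + (1 − α)ve^T, where v = (1/n)e for the matrix (12.9); the only property of v used below is e^T v = 1. We further have
$$
U^T v = \begin{pmatrix} \frac{1}{\sqrt{n}} e^T v \\ U_1^T v \end{pmatrix}
      = \begin{pmatrix} \frac{1}{\sqrt{n}} \\ U_1^T v \end{pmatrix}.
$$
Therefore, since e^T U = (√n 0),
$$
\begin{aligned}
U^T A U &= U^T (\alpha P + (1 - \alpha) v e^T) U
 = \alpha \begin{pmatrix} 1 & 0 \\ w & T \end{pmatrix}
 + (1 - \alpha) \begin{pmatrix} \frac{1}{\sqrt{n}} \\ U_1^T v \end{pmatrix}
   \begin{pmatrix} \sqrt{n} & 0 \end{pmatrix} \\
&= \alpha \begin{pmatrix} 1 & 0 \\ w & T \end{pmatrix}
 + (1 - \alpha) \begin{pmatrix} 1 & 0 \\ \sqrt{n}\, U_1^T v & 0 \end{pmatrix}
 =: \begin{pmatrix} 1 & 0 \\ w_1 & \alpha T \end{pmatrix}.
\end{aligned}
$$
The statement now follows immediately.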
Theorem 12.9 implies that the second largest eigenvalue in magnitude of A is
at most α, and that it is equal to α when P has a multiple eigenvalue equal to 1,
which is actually the case for the Google matrix.
Example 12.10. We compute the eigenvalues and eigenvectors of the matrix
A = αP + (1 − α)(1/n)ee^T with P from (12.8) and α = 0.85. The MATLAB code
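LP=eig(P)';               % eigenvalues of P, as a row vector
e=ones(6,1);
A=0.85*P + 0.15/6*e*e';   % the matrix (12.9) with alpha = 0.85 and n = 6
[R,L]=eig(A)              % columns of R: eigenvectors; diag(L): eigenvalues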
gives the following result:
LP = -0.5 1.0 -0.5 1.0 -1.0 0
R = 0.447 -0.365 -0.354 0.000 0.817 0.101
0.430 -0.365 0.354 -0.000 -0.408 -0.752
0.430 -0.365 0.354 0.000 -0.408 0.651
0.057 -0.000 -0.707 0.000 0.000 -0.000
0.469 0.548 -0.000 -0.707 0.000 0.000
0.456 0.548 0.354 0.707 -0.000 -0.000
diag(L) = 1.0 0.85 -0.0 -0.85 -0.425 -0.425
It is seen that the first eigenvector (which corresponds to the eigenvalue 1) is
the only nonnegative one, as stated in Theorem 12.7.
Instead of the modification (12.9) we can define
A = αP + (1 − α)ve^T,
where v is a nonnegative vector with ‖ v ‖1 = 1 that can be chosen to make the search
biased toward certain kinds of Web pages. Therefore, it is sometimes referred to
as a personalization vector [74, 48]. The vector v can also be used for avoiding
manipulation by so-called link farms [57, 59].
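The proof of Theorem 12.9 uses only the property e^T v = 1, so the statement about the eigenvalues can also be checked for a nonuniform v. The sketch below is an illustration under assumptions: P is the matrix (12.8), and the particular vector v, which favors page 1, is hypothetical.

```matlab
% The matrix (12.8); its eigenvalues are 1, 1, -1, -0.5, -0.5, 0.
P = [0   1/2 1/2 1/2 0 0;
     1/2 0   1/2 0   0 0;
     1/2 1/2 0   0   0 0;
     0   0   0   0   0 0;
     0   0   0   1/2 0 1;
     0   0   0   0   1 0];
n = size(P, 1);
alpha = 0.85;
e = ones(n, 1);
v = [0.5 0.1 0.1 0.1 0.1 0.1]';        % hypothetical personalization vector, sums to 1
A = alpha*P + (1 - alpha)*v*e';        % personalized version of (12.9)
colsum = sum(A, 1);                    % should be all ones
mags = sort(abs(eig(A)), 'descend');   % eigenvalue magnitudes, largest first
disp(colsum)
disp(mags')
```

By Theorem 12.9 the sorted magnitudes should be 1, 0.85, 0.85, 0.425, 0.425, 0 up to rounding, the same as for the uniform choice v = (1/n)e in Example 12.10. The eigenvectors, and hence the ranking, do depend on v.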
12.3 The Power Method for Pagerank Computation
We want to solve the eigenvalue problem
Ar = r,
where r is normalized ‖ r ‖1 = 1. In this section we denote the sought eigenvector by
t1. Dealing with stochastic matrices and vectors that are probability distributions,
it is natural to use the 1-norm for vectors (Section 2.3). Due to the sparsity and
the dimension of A (of the order of billions), it is out of the question to compute the
eigenvector using any of the standard methods described in Chapter 15 for dense
matrices, as those methods are based on applying orthogonal transformations to
the matrix. The only viable method so far is the power method.
Assume that an initial approximation r(0) is given. The power method is given
in the following algorithm.
The power method for Ar = λr
for k = 1, 2, . . . until convergence
q(k) = Ar(k−1)
r(k) = q(k)/‖ q(k) ‖1
The purpose of normalizing the vector (making it have 1-norm equal to 1) is
to avoid having the vector become either very large or very small and thus unrep-
resentable in the floating point system. We will see later that normalization is not
necessary in the pagerank computation. In this context there is no need to compute
an eigenvalue approximation, as the sought eigenvalue is known to be equal to one.
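A direct transcription of the algorithm into MATLAB might look as follows. This is a sketch under assumptions, not the implementation used for large problems: it forms the small matrix A of Example 12.10 explicitly, starts from the uniform vector, and stops when the 1-norm of the change between two iterates falls below a tolerance chosen here for illustration.

```matlab
% The matrix (12.8) and the Google matrix (12.9) with alpha = 0.85.
P = [0   1/2 1/2 1/2 0 0;
     1/2 0   1/2 0   0 0;
     1/2 1/2 0   0   0 0;
     0   0   0   0   0 0;
     0   0   0   1/2 0 1;
     0   0   0   0   1 0];
n = size(P, 1);
alpha = 0.85;
A = alpha*P + (1 - alpha)/n*ones(n);

r = ones(n, 1)/n;      % initial approximation r^(0)
tol = 1e-12;           % assumed tolerance on the change between iterates
maxit = 1000;          % assumed cap on the number of iterations
for k = 1:maxit
    q = A*r;                       % q^(k) = A r^(k-1)
    rnew = q/norm(q, 1);           % r^(k) = q^(k) / || q^(k) ||_1
    change = norm(rnew - r, 1);
    r = rnew;
    if change < tol, break; end
end
residual = norm(A*r - r, 1);       % residual of A r = r in the 1-norm
disp(k)
disp(r')
disp(residual)
```

Since row 4 of P is zero (page 4 has no inlinks), the fourth component of the computed vector equals (1 − α)/n = 0.025, the rank that every page receives through the rank-1 term alone.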
The convergence of the power method depends on the distribution of eigenvalues. To make the presentation simpler, we assume that A is diagonalizable, i.e.,
there exists a nonsingular matrix T of eigenvectors, T^{-1}AT = diag(λ1, . . . , λn).
The eigenvalues λi are ordered 1 = λ1 > |λ2| ≥ · · · ≥ |λn|. Expand the initial
approximation r(0) in terms of the eigenvectors,
r(0) = c1t1 + c2t2 + · · · + cntn,
where c1 ≠ 0 is assumed (footnote 30) and r = t1 is the sought eigenvector. Then we have
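$$
\begin{aligned}
A^k r^{(0)} &= c_1 A^k t_1 + c_2 A^k t_2 + \cdots + c_n A^k t_n \\
&= c_1 \lambda_1^k t_1 + c_2 \lambda_2^k t_2 + \cdots + c_n \lambda_n^k t_n
 = c_1 t_1 + \sum_{j=2}^{n} c_j \lambda_j^k t_j .
\end{aligned}
$$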
Obviously, since for j = 2, 3, . . . we have |λj| < 1, the second term tends to zero and the power method converges to the eigenvector r = t1. The rate of convergence is determined by |λ2|. If this is close to 1, then the iteration is very slow. Fortunately this is not the case for the Google matrix; see Theorem 12.9 and below.

30. This assumption can be expected to be satisfied in floating point arithmetic, if not at the first iteration, then after the second, due to round-off.

A stopping criterion for the power iteration can be formulated in terms of the residual vector for the eigenvalue problem. Let λ̂ be the computed approximation of the eigenvalue and r̂ the corresponding approximate eigenvector. Then it can be shown [94], [4, p. 229] that the optimal error matrix E, for which
$$
(A + E)\hat{r} = \hat{\lambda}\hat{r}
$$
exactly, satisfies
$$
\| E \|_2 = \| s \|_2,
$$
where s = Ar̂ − λ̂r̂. This means that if the residual ‖s‖2 is small, then the computed approximate eigenvector r̂ is the exact eigenvector of a matrix A + E that is close to A. Since in the pagerank computations we are dealing with a positive matrix, whose columns all add up to one, it is natural to use the 1-norm instead [55]. As the 1-norm and the Euclidean norm are equivalent (cf. (2.6)), this does not make much difference.

In the usual formulation of the power method the vector is normalized to avoid underflow or overflow. We now show that this is not necessary when the matrix is column stochastic.

Proposition 12.11. Assume that the vector z satisfies ‖z‖1 = e^T z = 1 and that the matrix A is column stochastic. Then
$$
\| Az \|_1 = 1. \tag{12.11}
$$
Proof. Put y = Az. Then
$$
\| y \|_1 = e^T y = e^T A z = e^T z = 1
$$
since A is column stochastic (e^T A = e^T).

In view of the huge dimensions of the Google matrix, it is nontrivial to compute the matrix-vector product y = Az, where A = αP + (1 − α)(1/n)ee^T.
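What is at stake can be illustrated on a small example. The sketch below assumes the link graph of Example 12.2 and hypothetical variable names. It forms the product in two ways: from the sparse matrix Q alone, using the rewriting (12.12) with β = 1 − ‖αQz‖1 that is derived in this section, and from the dense matrix A built explicitly, which is affordable only because n = 6 here.

```matlab
% Sparse link matrix Q for the graph of Example 12.2.
outlinks = {[2 4 5], [1 3 5], 6, [], [3 4 6], [3 5]};
n = numel(outlinks);
Q = sparse(n, n);
for j = 1:n
    if ~isempty(outlinks{j})
        Q(outlinks{j}, j) = 1/numel(outlinks{j});
    end
end
alpha = 0.85;
e = ones(n, 1);
v = e/n;                 % uniform teleportation vector
z = v;                   % any nonnegative vector with 1-norm equal to 1

% Product using only Q: neither P nor A is formed.
yhat = alpha*(Q*z);
beta = 1 - norm(yhat, 1);
y = yhat + beta*v;

% Dense reference: A = alpha*(Q + (1/n) e d^T) + (1 - alpha)(1/n) e e^T.
d = double(cellfun(@isempty, outlinks))';
A = alpha*(full(Q) + e*d'/n) + (1 - alpha)/n*(e*e');
difference = norm(y - A*z, 1);
disp(difference)
```

The sparse form needs one multiplication by Q and a few vector operations, whereas the dense reference stores n^2 numbers; the vector d appears only in the reference computation.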
Recall that P was constructed from the actual link matrix Q as
$$
P = Q + \frac{1}{n} e d^T,
$$
where the row vector d^T has an element 1 in all those positions that correspond to Web pages with no outlinks (see (12.5)). This means that to form P, we insert a large number of full vectors into Q, each of the same dimension as the total number of Web pages. Consequently, we cannot afford to store P explicitly. Let us look at the multiplication y = Az in more detail:
$$
y = \alpha \left( Q + \frac{1}{n} e d^T \right) z + \frac{(1 - \alpha)}{n} e (e^T z) = \alpha Q z + \beta \frac{1}{n} e, \tag{12.12}
$$
where
$$
\beta = \alpha d^T z + (1 - \alpha) e^T z.
$$
We do not need to compute β from this equation. Instead we can use (12.11) in combination with (12.12):
$$
1 = e^T (\alpha Q z) + \beta e^T \left( \frac{1}{n} e \right) = e^T (\alpha Q z) + \beta.
$$
Thus, we have β = 1 − ‖αQz‖1. An extra bonus is that we do not use the vector d at all, i.e., we need not know which pages lack outlinks.

The following MATLAB code implements the matrix vector multiplication:
yhat=alpha*Q*z;
beta=1-norm(yhat,1);
y=yhat+beta*v;
residual=norm(y-z,1);
Here v = (1/n) e, or a personalized teleportation vector; see p. 154. To save memory, we should even avoid using the extra vector yhat and replace it with y.

From Theorem 12.9 we know that the second eigenvalue of the Google matrix satisfies λ2 = α. A typical value of α is 0.85. Approximately k = 57 iterations are needed to make the factor 0.85^k equal to 10^−4. This is reported [57] to be close to the number of iterations used by Google.

[Sparsity plot of the submatrix, both axes running from 0 to 2 × 10^4; nz = 16283.]
Figure 12.1. A 20000 × 20000 submatrix of the stanford.edu matrix.

31. http://www.stanford.edu/~sdkamvar/research.html.

Example 12.12. As an example we used the matrix P obtained from the domain stanford.edu (footnote 31). The number of pages is 281903, and the total number of links is 2312497. Part of the matrix is displayed in Figure 12.1. We computed the pagerank vector using the power method with α = 0.85 and iterated 63 times until the 1-norm of the residual was smaller than 10^−6. The residual and the final pagerank vector are illustrated in Figure 12.2.

[Top plot: residual (logarithmic scale, 10^−7 to 10^0) versus iterations (0 to 70). Bottom plot: components of the pagerank vector, horizontal axis 0 to 3 × 10^5, vertical axis 0 to 0.012.]
Figure 12.2. The residual in the power iterations (top) and the pagerank vector (bottom) for the stanford.edu matrix.

Because one pagerank calculation can take several days, several enhancements of the iteration procedure have been proposed. In [53] an adaptive method is described that checks the convergence of the components of the pagerank vector and avoids performing the power iteration for the components that have already converged. Up to 30% speed-up has been reported. The block structure of the Web is used in [54], and speed-ups of a factor of 2 have been reported. An acceleration method based on Aitken extrapolation is described in [55]. Aggregation methods are discussed in several papers by Langville and Meyer and in [51].

When computing the pagerank for a subset of the Internet, say, one particular domain, the matrix P may be of a dimension for which one can use methods other than the power method, e.g., the Arnoldi method; see [40] and Section 15.8.3. It may even be sufficient to use the MATLAB function eigs, which computes a small number of eigenvalues and the corresponding eigenvectors of a sparse matrix using an Arnoldi method with restarts. A variant of pagerank is proposed in [44]. Further properties of the pagerank matrix are given in [84].

12.4 HITS
Another method based on the link structure of the Web was introduced at the same time as pagerank [56]. It is called HITS (Hypertext Induced Topic Search) and is based on the concepts of authorities and hubs. An authority is a Web page with many inlinks, and a hub has many outlinks.
The basic idea is that good hubs point to good authorities and good authorities are pointed to by good hubs. Each Web page is assigned both a hub score y and an authority score x.

Let L be the adjacency matrix of the directed Web graph. Then two equations are given that mathematically define the relation between the two scores, based on the basic idea:
$$
x = L^T y, \qquad y = L x. \tag{12.13}
$$
The algorithm for computing the scores is the power method, which converges to the left and right singular vectors corresponding to the largest singular value of L. In the implementation of HITS, the adjacency matrix used is not that of the whole Web but that of all the pages relevant to the query.

There is now an extensive literature on pagerank, HITS, and other ranking methods. For overviews, see [7, 58, 59]. A combination of HITS and pagerank has been proposed in [65].

Obviously, the ideas underlying pagerank and HITS are not restricted to Web applications but can be applied to other network analyses. A variant of the HITS method was recently used in a study of Supreme Court precedent [36]. HITS is generalized in [17], which also treats synonym extraction. In [72], generank, which is based on the pagerank concept, is used for the analysis of microarray experiments.

Chapter 13
Automatic Key Word and Key Sentence Extraction

Due to the explosion of the amount of textual information available, there is a need to develop automatic procedures for text summarization. A typical situation is when a Web search engine presents a small amount of text from each document that matches a certain query. Another relevant area is the summarization of news articles. Automatic text summarization is an active research field with connections to several other areas, such as information retrieval, natural language processing, and machine learning.
Informally, the goal of text summarization is to extract content from a text document and present the most important content to the user in a condensed form and in a manner sensitive to the user's or application's need [67]. In this chapter we will have a considerably less ambitious goal: we will present a method for automatically extracting key words and key sentences from a text. There will be connections to the vector space model in information retrieval and to the concept of pagerank. We will also use nonnegative matrix factorization. The presentation is based on [114]. Text summarization using QR decomposition is described in [26, 83].

13.1 Saliency Score
Consider a text from which we want to extract key words and key sentences. As an example we will take Chapter 12 from this book. As one of the preprocessing steps, one should perform stemming so that the same word stem with different endings is represented by one token only. Stop words (cf. Chapter 11) occur frequently in texts, but since they do not distinguish between different sentences, they should be removed. Similarly, if the text carries special symbols, e.g., mathematics or mark-up language tags (HTML, LaTeX), it may be necessary to remove those.

Since we want to compare word frequencies in different sentences, we must consider each sentence as a separate document (in the terminology of information retrieval). After the preprocessing has been done, we parse the text, using the same type of parser as in information retrieval. This way a term-document matrix is prepared, which in this chapter we will refer to as a term-sentence matrix. Thus we have a matrix A ∈ R^{m×n}, where m denotes the number of different terms and n the number of sentences. The element aij is defined as the frequency (footnote 32) of term i in sentence j. The column vector (a1j a2j . . . amj)^T is nonzero in the positions corresponding to the terms occurring in sentence j. Similarly, the row vector (ai1 ai2 . . . ain) is nonzero in the positions corresponding to sentences containing term i.

32. Naturally, a term and document weighting scheme (see [12]) should be used.

The basis of the procedure in [114] is the simultaneous but separate ranking of the terms and the sentences. Thus, term i is given a nonnegative saliency score, denoted ui. The higher the saliency score, the more important the term. The saliency score of sentence j is denoted vj.

The assignment of saliency scores is made based on the mutual reinforcement principle [114]:

    A term should have a high saliency score if it appears in many sentences with high saliency scores. A sentence should have a high saliency score if it contains many words with high saliency scores.

More precisely, we assert that the saliency score of term i is proportional to the sum of the scores of the sentences where it appears; in addition, each term is weighted by the corresponding matrix element,
$$
u_i \propto \sum_{j=1}^{n} a_{ij} v_j, \qquad i = 1, 2, \ldots, m.
$$
Similarly, the saliency score of sentence j is defined to be proportional to the scores of its words, weighted by the corresponding aij,
$$
v_j \propto \sum_{i=1}^{m} a_{ij} u_i, \qquad j = 1, 2, \ldots, n.
$$
Collecting the saliency scores in two vectors u ∈ R^m and v ∈ R^n, these two equations can be written as
$$
\sigma_u u = A v, \tag{13.1}
$$
$$
\sigma_v v = A^T u, \tag{13.2}
$$
where σu and σv are proportionality constants. In fact, the constants must be equal. Inserting one equation into the other, we get
$$
\sigma_u u = \frac{1}{\sigma_v} A A^T u, \qquad \sigma_v v = \frac{1}{\sigma_u} A^T A v,
$$
which shows that u and v are eigenvectors of AA^T and A^TA, respectively, with the same eigenvalue.
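These relations can be checked numerically on a toy example. The sketch below is an illustration under assumptions: the 4 × 3 term-sentence matrix is hypothetical data, and the leading singular pair from MATLAB's svd is used as the score vectors, with the sign chosen so that the scores are nonnegative.

```matlab
% Hypothetical term-sentence matrix: 4 terms (rows), 3 sentences (columns).
A = [2 0 1;
     1 1 0;
     0 3 1;
     0 1 2];
[U, S, V] = svd(A);
u1 = U(:,1);  v1 = V(:,1);  sigma1 = S(1,1);   % leading singular triplet
if sum(u1) < 0                                  % svd fixes vectors only up to sign
    u1 = -u1;  v1 = -v1;
end
res_13_1 = norm(A*v1  - sigma1*u1);             % (13.1) with sigma_u = sigma1
res_13_2 = norm(A'*u1 - sigma1*v1);             % (13.2) with sigma_v = sigma1
res_eig  = norm(A*A'*u1 - sigma1^2*u1);         % u1 is an eigenvector of A*A'
disp([res_13_1 res_13_2 res_eig])
disp(u1')                                       % term scores
disp(v1')                                       % sentence scores
```

All three residuals are at rounding level, and both score vectors come out with positive components for this matrix.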
It follows that u and v are singular vectors corresponding to the same singular value (footnote 33).

If we choose the largest singular value, then we are guaranteed that the components of u and v are nonnegative (footnote 34).

In summary, the saliency scores of the terms are defined as the components of u1, and the saliency scores of the sentences are the components of v1.

[Two plots: term scores, horizontal axis 0 to 400 (top), and sentence scores, horizontal axis 0 to 200 (bottom).]
Figure 13.1. Saliency scores for Chapter 12: term scores (top) and sentence scores (bottom).

Example 13.1. We created a term-sentence matrix based on Chapter 12. Since the text is written using LaTeX, we first had to remove all LaTeX typesetting commands. This was done using a lexical scanner called detex (footnote 35). Then the text was stemmed and stop words were removed. A term-sentence matrix A was constructed using the text parser TMG [113]: there turned out to be 388 terms in 183 sentences. The first singular vectors were computed in MATLAB, [u,s,v]=svds(A,1). (The matrix is sparse, so we used the SVD function for sparse matrices.) The singular vectors are plotted in Figure 13.1.

By locating the 10 largest components of u1 and using the dictionary produced by the text parser, we found that the following words, ordered by importance, are the most important in the chapter:

page, search, university, web, Google, rank, outlink, link, number, equal

33. This can be demonstrated easily using the SVD of A.
34. The matrix A has nonnegative elements; therefore the first principal component u1 (see Section 6.4) must have nonnegative components. The same holds for v1.
35. http://www.cs.purdue.edu/homes/trinkle/detex/.

[Diagram: the matrix A approximated by the product of a column vector and a row vector.]
Figure 13.2. Symbolic illustration of a rank-1 approximation: A ≈ σ1 u1 v1^T.

The following are the six most important sentences, in order:

1. A Google search conducted on September 29, 2005, using the search phrase university, gave as a result links to the following well-known universities: Harvard, Stanford, Cambridge, Yale, Cornell, Oxford.
2. When a search is made on the Internet using a search engine, there is first a traditional text processing part, where the aim is to find all the Web pages containing the words of the query.
3. Loosely speaking, Google assigns a high rank to a Web page if it has inlinks from other pages that have a high rank.
4. Assume that a surfer visiting a Web page chooses the next page from among the outlinks with equal probability.
5. Similarly, column j has nonzero elements equal to 1/Nj in those positions that correspond to the outlinks of j, and, provided that the page has outlinks, the sum of all the elements in column j is equal to one.
6. The random walk interpretation of the additional rank-1 term is that in each time step the surfer visiting a page will jump to a random page with probability 1 − α (sometimes referred to as teleportation).

It is apparent that this method prefers long sentences. On the other hand, these sentences are undeniably key sentences for the text.

The method described above can also be thought of as a rank-1 approximation of the term-sentence matrix A, illustrated symbolically in Figure 13.2. In this interpretation, the vector u1 is a basis vector for the subspace spanned by the columns of A, and the row vector σ1 v1^T holds the coordinates of the columns of A in terms of this basis. Now we see that the method based on saliency scores has a drawback: if there are, say, two "top sentences" that contain the same high-saliency terms, then their coordinates will be approximately the same, and both sentences will be extracted as key sentences. This is unnecessary, since they are very similar. We will next see that this can be avoided if we base the key sentence extraction on a rank-k approximation.

13.2 Key Sentence Extraction from a Rank-k Approximation

[Diagram: the matrix A approximated by the product of a tall matrix C and a wide matrix D.]
Figure 13.3. Symbolic illustration of low-rank approximation: A ≈ CD.

Assume that we have computed a good rank-k approximation of the term-sentence matrix,
$$
A \approx CD, \qquad C \in \mathbb{R}^{m \times k}, \qquad D \in \mathbb{R}^{k \times n}, \tag{13.3}
$$
illustrated in Figure 13.3. This approximation can be based on the SVD, clustering [114], or nonnegative matrix factorization. The dimension k is chosen greater than or equal to the number of key sentences that we want to extract. C is a rank-k matrix of basis vectors, and each column of D holds the coordinates of the corresponding column in A in terms of the basis vectors.

Now recall that the basis vectors in C represent the most important directions in the "sentence space," the column space. However, the low-rank approximation does not immediately give an indication of which are the most important sentences. Those sentences can be found if we first determine the column of A that is the "heaviest" in terms of the basis, i.e., the column in D with the largest 2-norm. This defines one new basis vector. Then we proceed by determining the column of D that is the heaviest in terms of the remaining k − 1 basis vectors, and so on.

To derive the method, we note that in the approximate equality (13.3) we may introduce any nonsingular matrix T and its inverse between C and D, and we may multiply the relation with any permutation P from the right, without changing the relation:
$$
AP \approx CDP = (CT)(T^{-1}DP),
$$
where T ∈ R^{k×k}.

Starting from the approximation (13.3), we first find the column of largest norm in D and permute it by P1 to the first column; at the same time we move the corresponding column of A to the first position. Then we determine a Householder transformation Q1 that zeros the elements in the first column below the element in position (1, 1) and apply the transformation to both C and D:
$$
AP_1 \approx (CQ_1)(Q_1^T D P_1).
$$
In fact, this is the first step in the QR decomposition with column pivoting of D. We continue the discussion in terms of an example with m = 6, n = 5, and k = 3. To illustrate that the procedure deals with the problem of having two or more very similar important sentences (cf. p. 164), we have also assumed that column 4 of D had almost the same coordinates as the column that was moved to the first position.

After the first step the matrices have the structure

$$(CQ_1)(Q_1^T D P_1) =
\begin{pmatrix}
\times & \times & \times \\
\times & \times & \times \\
\times & \times & \times \\
\times & \times & \times \\
\times & \times & \times \\
\times & \times & \times \\
\times & \times & \times
\end{pmatrix}
\begin{pmatrix}
\kappa_1 & \times & \times & \times & \times \\
0 & \times & \times & \epsilon_1 & \times \\
0 & \times & \times & \epsilon_2 & \times
\end{pmatrix},$$

where $\kappa_1$ is the Euclidean length of the first column of $DP_1$. Since column 4 was similar to the one that is now in position 1, it has small entries in rows 2 and 3. Then we introduce the diagonal matrix

$$T_1 = \begin{pmatrix} \kappa_1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

between the factors:

$$C_1 D_1 := (CQ_1T_1)(T_1^{-1} Q_1^T D P_1) =
\begin{pmatrix}
* & \times & \times \\
* & \times & \times \\
* & \times & \times \\
* & \times & \times \\
* & \times & \times \\
* & \times & \times \\
* & \times & \times
\end{pmatrix}
\begin{pmatrix}
1 & * & * & * & * \\
0 & \times & \times & \epsilon_1 & \times \\
0 & \times & \times & \epsilon_2 & \times
\end{pmatrix}.$$

This changes only column 1 of the left factor and row 1 of the right factor (marked with $*$). From the relation

$$AP_1 \approx C_1 D_1,$$

we now see that the first column in $AP_1$ is approximately equal to the first column in $C_1$. Remembering that the columns of the original matrix C are the dominating directions in the matrix A, we have now identified the "dominating column" of A.

Before continuing, we make the following observation. If one column of D is similar to the first one (column 4 in the example), then it will now have small elements below the first row, and it will not play a role in the selection of the second most dominating document. Therefore, if there are two or more important sentences with more or less the same key words, only one of them will be selected.

Next we determine the second most dominating column of A.
To this end we compute the norms of the columns of $D_1$, excluding the first row (because that row holds the coordinates in terms of the first column of $C_1$). The column with the largest norm is moved to position 2 and reduced by a Householder transformation in a similar manner as above. After this step we have

$$C_2 D_2 := (C_1Q_2T_2)(T_2^{-1} Q_2^T D_1 P_2) =
\begin{pmatrix}
* & * & \times \\
* & * & \times \\
* & * & \times \\
* & * & \times \\
* & * & \times \\
* & * & \times \\
* & * & \times
\end{pmatrix}
\begin{pmatrix}
1 & * & * & * & * \\
0 & 1 & * & \epsilon_1 & * \\
0 & 0 & \times & \epsilon_2 & \times
\end{pmatrix}.$$

Therefore the second column of

$$AP_1P_2 \approx C_2 D_2$$

holds the second most dominating column.

Continuing the process, the final result is

$$AP \approx C_k D_k, \qquad D_k = \begin{pmatrix} R & S \end{pmatrix},$$

where R is upper triangular and P is a product of permutations. Now the first k columns of AP hold the dominating columns of the matrix, and the rank-k approximations of these columns are in $C_k$. This becomes even clearer if we write

$$AP \approx C_k R R^{-1} D_k = \hat{C} \begin{pmatrix} I & \hat{S} \end{pmatrix}, \tag{13.4}$$

where $\hat{C} = C_k R$ and $\hat{S} = R^{-1} S$. Since

$$\hat{C} = C Q_1 T_1 Q_2 T_2 \cdots Q_k T_k R$$

is a rotated and scaled version of the original C (i.e., the columns of C and $\hat{C}$ span the same subspace in $\mathbb{R}^m$), it still holds the dominating directions of A. Assume that $a_{i_1}, a_{i_2}, \ldots, a_{i_k}$ are the first k columns of AP. Then (13.4) is equivalent to

$$a_{i_j} \approx \hat{c}_j, \qquad j = 1, 2, \ldots, k.$$

This means that the dominating directions, which are given by the columns of $\hat{C}$, have been directly associated with k columns in A.

The algorithm described above is equivalent to computing the QR decomposition with column pivoting (see Section 6.9.1),

$$DP = Q \begin{pmatrix} R & S \end{pmatrix},$$

where Q is orthogonal and R is upper triangular. Note that if we are interested only in finding the top k sentences, we need not apply any transformations to the matrix of basis vectors, and the algorithm for finding the top k sentences can be implemented in MATLAB as follows:
% C * D is a rank k approximation of A
[Q,RS,P]=qr(D);
p=[1:n]*P;
pk=p(1:k); % Indices of the first k columns of AP

Example 13.2. We computed a nonnegative matrix factorization of the term-sentence matrix of Example 13.1 using the multiplicative algorithm of Section 9.2. Then we determined the six top sentences using the method described above. The sentences 1, 2, 3, and 5 in Example 13.1 were selected, and in addition the following two:
1. Due to the sparsity and the dimension of A (of the order billions), it is out of the question to compute the eigenvector using any of the standard methods described in Chapter 15 for dense matrices, as those methods are based on applying orthogonal transformations to the matrix.
2. In [53] an adaptive method is described that checks the convergence of the components of the pagerank vector and avoids performing the power iteration for those components.

The same results were obtained when the low-rank approximation was computed using the SVD.

Chapter 14 Face Recognition Using Tensor SVD

Human beings are very skillful at recognizing faces even when the facial expression, the illumination, the viewing angle, etc., vary. To develop automatic procedures for face recognition that are robust with respect to varying conditions is a challenging research problem that has been investigated using several different approaches. Principal component analysis (i.e., SVD) is a popular technique that often goes by the name "eigenfaces" [23, 88, 100]. However, this method is best when all pictures are taken under similar conditions, and it does not perform well when several environment factors are varied. More general bilinear models also have been investigated; see, e.g., [95].

Recently [102, 103, 104, 105], methods for multilinear analysis of image ensembles were studied.
In particular, the face recognition problem was considered using a tensor model, the TensorFaces approach. By letting each mode of the tensor represent a different viewing condition, e.g., illumination or facial expression, it became possible to improve the precision of the recognition algorithm compared to the PCA method.

In this chapter we will describe a tensor method for face recognition, related to TensorFaces. Since we are dealing with images, which are often stored as $m \times n$ arrays, with m and n of the order 100–500, the computations for each face to be identified are quite heavy. We will discuss how the tensor SVD (HOSVD) can also be used for dimensionality reduction to reduce the flop count.

14.1 Tensor Representation

Assume that we have a collection of images of $n_p$ persons, where each image is an $m_{i_1} \times m_{i_2}$ array with $m_{i_1} m_{i_2} = n_i$. We will assume that the columns of the images are stacked so that each image is represented by a vector in $\mathbb{R}^{n_i}$. Further assume that each person has been photographed with $n_e$ different facial expressions.$^{36}$ Often one can have $n_i \ge 5000$, and usually $n_i$ is considerably larger than $n_e$ and $n_p$. The collection of images is stored as a tensor,

$$\mathcal{A} \in \mathbb{R}^{n_i \times n_e \times n_p}. \tag{14.1}$$

We refer to the different modes as the image mode, the expression mode, and the person mode, respectively.

36 For simplicity here we refer to different illuminations, etc., as expressions.

If, for instance, we also had photos of each person with different illumination, viewing angles, etc., then we could represent the image collection by a tensor of higher degree [104]. For simplicity, here we consider only the case of a 3-mode tensor. The generalization to higher order tensors is straightforward.

Example 14.1. We preprocessed images of 10 persons from the Yale Face Database by cropping and decimating each image to 112 × 78 pixels stored in a vector of length 8736.
Five images are illustrated in Figure 14.1.

Figure 14.1. Person 1 with five different expressions (from the Yale Face Database).

Each person is photographed with a total of 11 different expressions.

The ordering of the modes is arbitrary, of course; for definiteness and for illustration purposes we will assume the ordering of (14.1). However, to (somewhat) emphasize the ordering arbitrariness, we will use the notation $\times_e$ for multiplication of the tensor by a matrix along the expression mode, and similarly for the other modes. We now assume that $n_i \gg n_e n_p$ and write the thin HOSVD (see Theorem 8.3 and (8.9)),

$$\mathcal{A} = \mathcal{S} \times_i F \times_e G \times_p H, \tag{14.2}$$

where $\mathcal{S} \in \mathbb{R}^{n_e n_p \times n_e \times n_p}$ is the core tensor, $F \in \mathbb{R}^{n_i \times n_e n_p}$ has orthonormal columns, and $G \in \mathbb{R}^{n_e \times n_e}$ and $H \in \mathbb{R}^{n_p \times n_p}$ are orthogonal.

Example 14.2. We computed the HOSVD of the tensor of face images of 10 persons, each with 10 different expressions. The singular values are plotted in Figure 14.2. All 10 singular values in the expression and person modes are significant, which means that it should be relatively easy to distinguish between expressions and persons.

Figure 14.2. The singular values in the image mode (left), the expression mode (right, +), and the person mode (right, circles).

The HOSVD can be interpreted in different ways depending on what it is to be used for. We first illustrate the relation

$$\mathcal{A} = \mathcal{D} \times_e G \times_p H,$$

where $\mathcal{D} = \mathcal{S} \times_i F$:

[Figure: symbolic illustration of $\mathcal{A} = \mathcal{D} \times_e G \times_p H$.]

At this point, let us recapitulate the definition of tensor-matrix multiplication (Section 8.2). For definiteness we consider 2-mode, i.e., here e-mode, multiplication:

$$(\mathcal{D} \times_e G)(i_1, j, i_3) = \sum_{k=1}^{n_e} g_{j,k}\, d_{i_1,k,i_3}.$$

We see that fixing a particular value of the expression parameter, i.e., putting $j = e_0$, say, corresponds to using only the $e_0$th row of G.
By making the analogous choice in the person mode, we get

$$\mathcal{A}(:, e_0, p_0) = \mathcal{D} \times_e g_{e_0} \times_p h_{p_0}, \tag{14.3}$$

where $g_{e_0}$ denotes the $e_0$th row vector of G and $h_{p_0}$ the $p_0$th row vector of H. We illustrate (14.3) in the following figure:

[Figure: symbolic illustration of $\mathcal{A}(:, e_0, p_0) = \mathcal{D} \times_e g_{e_0} \times_p h_{p_0}$.]

We summarize this in words:

The image of person $p_0$ in expression $e_0$ can be synthesized by multiplication of the tensor $\mathcal{D}$ by $h_{p_0}$ and $g_{e_0}$ in their respective modes. Thus person $p_0$ is uniquely characterized by the row vector $h_{p_0}$ and expression $e_0$ is uniquely characterized by $g_{e_0}$, via the bilinear form

$$\mathcal{D} \times_e g \times_p h.$$

Example 14.3. The MATLAB code

a=tmul(tmul(D,Ue(4,:),2),Up(6,:),3);

gives person 6 in expression 4 (happy); see Figure 14.3. Recall that the function tmul(A,X,i) multiplies the tensor A by the matrix X in mode i.

Figure 14.3. Person 6 in expression 4 (happy).

14.2 Face Recognition

We will now consider the classification problem as follows:

Given an image of an unknown person, represented by a vector in $\mathbb{R}^{n_i}$, determine which of the $n_p$ persons it represents, or decide that the unknown person is not in the database.

For the classification we write the HOSVD (14.2) in the following form:

$$\mathcal{A} = \mathcal{C} \times_p H, \qquad \mathcal{C} = \mathcal{S} \times_i F \times_e G. \tag{14.4}$$

For a particular expression e we have

$$\mathcal{A}(:, e, :) = \mathcal{C}(:, e, :) \times_p H. \tag{14.5}$$

Obviously we can identify the tensors $\mathcal{A}(:, e, :)$ and $\mathcal{C}(:, e, :)$ with matrices, which we denote $A_e$ and $C_e$. Therefore, for all the expressions, we have linear relations

$$A_e = C_e H^T, \qquad e = 1, 2, \ldots, n_e. \tag{14.6}$$

Note that the same (orthogonal) matrix H occurs in all $n_e$ relations. With $H^T = \begin{pmatrix} h_1 & \ldots & h_{n_p} \end{pmatrix}$, column p of (14.6) can be written

$$a_p^{(e)} = C_e h_p. \tag{14.7}$$

We can interpret (14.6) and (14.7) as follows:

Column p of $A_e$ contains the image of person p in expression e. The columns of $C_e$ are basis vectors for expression e, and row p of H, i.e., $h_p$, holds the coordinates of the image of person p in this basis.
Furthermore, the same $h_p$ holds the coordinates of the images of person p in all expression bases.

Next assume that $z \in \mathbb{R}^{n_i}$ is an image of an unknown person in an unknown expression (out of the $n_e$) and that we want to classify it. We refer to z as a test image. Obviously, if it is an image of person p in expression e, then the coordinates of z in that basis are equal to $h_p$. Thus we can classify z by computing its coordinates in all the expression bases and checking, for each expression, whether the coordinates of z coincide (or almost coincide) with the elements of any row of H.

The coordinates of z in expression basis e can be found by solving the least squares problem

$$\min_{\alpha_e} \| C_e \alpha_e - z \|_2. \tag{14.8}$$

The algorithm is summarized below:

Classification algorithm (preliminary version)

% z is a test image.
for e = 1, 2, . . . , ne
    Solve $\min_{\alpha_e} \| C_e \alpha_e - z \|_2$.
    for p = 1, 2, . . . , np
        If $\| \alpha_e - h_p \|_2 <$ tol, then classify as person p and stop.
    end
end

The amount of work in this algorithm is high: for each test image z we must solve $n_e$ least squares problems (14.8) with $C_e \in \mathbb{R}^{n_i \times n_p}$.

However, recall from (14.4) that $\mathcal{C} = \mathcal{S} \times_i F \times_e G$, which implies

$$C_e = F B_e,$$

where $B_e \in \mathbb{R}^{n_e n_p \times n_p}$ is the matrix identified with $(\mathcal{S} \times_e G)(:, e, :)$. Note that $F \in \mathbb{R}^{n_i \times n_e n_p}$; we assume that $n_i$ is considerably larger than $n_e n_p$. Then, for the analysis only, enlarge the matrix so that it becomes square and orthogonal:

$$\hat{F} = \begin{pmatrix} F & F^\perp \end{pmatrix}, \qquad \hat{F}^T \hat{F} = I.$$

Now insert $\hat{F}^T$ inside the norm:

$$\| C_e \alpha_e - z \|_2^2 = \| \hat{F}^T (F B_e \alpha_e - z) \|_2^2
= \left\| \begin{pmatrix} B_e \alpha_e - F^T z \\ -(F^\perp)^T z \end{pmatrix} \right\|_2^2
= \| B_e \alpha_e - F^T z \|_2^2 + \| (F^\perp)^T z \|_2^2.$$

It follows that we can solve the $n_e$ least squares problems by first computing $F^T z$ and then solving

$$\min_{\alpha_e} \| B_e \alpha_e - F^T z \|_2, \qquad e = 1, 2, \ldots, n_e. \tag{14.9}$$

The matrix $B_e$ has dimension $n_e n_p \times n_p$, so it is much cheaper to solve (14.9) than (14.8). It is also possible to precompute a QR decomposition of each matrix $B_e$ to further reduce the work.
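The reduction from (14.8) to (14.9) can be checked numerically. The following is a minimal sketch in which small random matrices stand in for $F$, $B_e$, and the test image; the sizes and all variable names are illustrative assumptions and are not taken from the face database.

```matlab
% Sketch: solving (14.8) and (14.9) gives the same coordinates alpha.
% All sizes and the random data are illustrative assumptions.
ni = 50; m = 12; np = 4;          % m plays the role of ne*np
[Fhat,Rtmp] = qr(randn(ni));      % square orthogonal matrix (F  Fperp)
F     = Fhat(:,1:m);              % orthonormal columns
Fperp = Fhat(:,m+1:end);
B = randn(m,np);                  % stands in for B_e
C = F*B;                          % stands in for C_e = F*B_e
z = randn(ni,1);                  % stands in for a test image

alpha_big   = C\z;                % least squares problem (14.8)
alpha_small = B\(F'*z);           % least squares problem (14.9)

% The squared residual norms obey the splitting derived above.
lhs = norm(C*alpha_big - z)^2;
rhs = norm(B*alpha_small - F'*z)^2 + norm(Fperp'*z)^2;
fprintf('difference in alpha: %g\n', norm(alpha_big - alpha_small));
fprintf('difference in residual identity: %g\n', abs(lhs - rhs));
```

Both printed differences should be at the level of rounding errors. The small problem involves only an m-by-np matrix, which is where the savings come from when $n_i$ is much larger than $n_e n_p$.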
Thus we arrive at the following algorithm.

Classification algorithm

Preprocessing step. Compute and save the thin QR decompositions of all the $B_e$ matrices, $B_e = Q_e R_e$, $e = 1, 2, \ldots, n_e$.

% z is a test image.
Compute $\hat{z} = F^T z$.
for e = 1, 2, . . . , ne
    Solve $R_e \alpha_e = Q_e^T \hat{z}$ for $\alpha_e$.
    for p = 1, 2, . . . , np
        If $\| \alpha_e - h_p \|_2 <$ tol, then classify as person p and stop.
    end
end

In a typical application it is likely that even if the test image is an image of a person in the database, it is taken with another expression that is not represented in the database. However, the above algorithm works well in such cases, as reported in [104].

Example 14.4. For each of the 10 persons in the Yale database, there is an image of the person winking. We took these as test images and computed the closest image in the database, essentially by using the algorithm above. In all cases the correct person was identified; see Figure 14.4.

Figure 14.4. The upper row shows the images to be classified, the bottom row the corresponding closest image in the database.

14.3 Face Recognition with HOSVD Compression

Due to the ordering properties of the core, with respect to the different modes (Theorem 8.3), we may be able to truncate the core in such a way that the truncated HOSVD is still a good approximation of $\mathcal{A}$. Define $F_k = F(:, 1:k)$ for some value of k that we assume is much smaller than $n_i$ but larger than $n_p$. Then, for the analysis only, enlarge the matrix so that it becomes square and orthogonal:

$$\hat{F} = \begin{pmatrix} F_k & \tilde{F}_\perp \end{pmatrix}, \qquad \hat{F}^T \hat{F} = I.$$

Then truncate the core tensor similarly, i.e., put

$$\hat{\mathcal{C}} = (\mathcal{S} \times_e G)(1:k, :, :) \times_i F_k. \tag{14.10}$$

It follows from Theorem 8.3, and the fact that the multiplication by G in the e-mode does not affect the HOSVD ordering properties in the i-mode, that

$$\| \hat{\mathcal{C}} - \mathcal{C} \|_F^2 = \sum_{\nu=k+1}^{n_i} \left( \sigma_\nu^{(i)} \right)^2.$$
Therefore, if the rate of decay of the image mode singular values is fast enough, it should be possible to obtain good recognition precision, despite the compression. So if we use $\hat{\mathcal{C}}$ in the algorithm of the preceding section, we will have to solve least squares problems

$$\min_{\alpha_e} \| \hat{C}_e \alpha_e - z \|_2$$

with the obvious definition of $\hat{C}_e$. Now, from (14.10) we have $\hat{C}_e = F_k \hat{B}_e$, where $\hat{B}_e \in \mathbb{R}^{k \times n_p}$. Multiplying by $\hat{F}^T$ inside the norm sign we get

$$\| \hat{C}_e \alpha_e - z \|_2^2 = \| \hat{B}_e \alpha_e - F_k^T z \|_2^2 + \| \tilde{F}_\perp^T z \|_2^2.$$

In this "compressed" variant of the recognition algorithm, the operation $\hat{z} = F^T z$ is replaced with $\hat{z}_k = F_k^T z$, and also the least squares problems in the loop are smaller.

Example 14.5. We used the same data as in the previous example but truncated the orthogonal basis in the image mode to rank k. With k = 10, all the test images were correctly classified, but with k = 5, 2 of 10 images were incorrectly classified. Thus a substantial rank reduction (from 100 to 10) was possible in this example without sacrificing classification accuracy.

In our illustrating example, the numbers of persons and different expressions are so small that it is not necessary to further compress the data. However, in a realistic application, to classify images in a reasonable time, one can truncate the core tensor in the expression and person modes and thus solve much smaller least squares problems than in the uncompressed case.

Part III Computing the Matrix Decompositions

Chapter 15 Computing Eigenvalues and Singular Values

In MATLAB and other modern programming environments, eigenvalues and singular values are obtained using high-level functions, e.g., eig(A) and svd(A). These functions implement algorithms from the LAPACK subroutine library [1].
To give an orientation about what is behind such high-level functions, in this chapter we briefly describe some methods for computing eigenvalues and singular values of dense matrices and large sparse matrices. For more extensive treatments, see, e.g., [4, 42, 79].

The functions eig and svd are used for dense matrices, i.e., matrices where most of the elements are nonzero. Eigenvalue algorithms for a dense matrix have two phases:
1. Reduction of the matrix to compact form: tridiagonal in the symmetric case and Hessenberg in the nonsymmetric case. This phase consists of a finite sequence of orthogonal transformations.
2. Iterative reduction to diagonal form (symmetric case) or triangular form (nonsymmetric case). This is done using the QR algorithm.

For large, sparse matrices it is usually not possible (or even interesting) to compute all the eigenvalues. Here there are special methods that take advantage of the sparsity. Singular values are computed using variations of the eigenvalue algorithms.

As background material we give some theoretical results concerning perturbation theory for the eigenvalue problem. In addition, we briefly describe the power method for computing eigenvalues and its cousin inverse iteration.

In linear algebra textbooks the eigenvalue problem for the matrix $A \in \mathbb{R}^{n \times n}$ is often introduced as the solution of the polynomial equation

$$\det(A - \lambda I) = 0.$$

In the computational solution of general problems, this approach is useless for two reasons: (1) for matrices of interesting dimensions it is too costly to compute the determinant, and (2) even if the determinant and the polynomial could be computed, the eigenvalues are extremely sensitive to perturbations in the coefficients of the polynomial. Instead, the basic tools in the numerical computation of eigenvalues are orthogonal similarity transformations. Let V be an orthogonal matrix.
Then make the transformation (which corresponds to a change of basis)

$$A \longrightarrow V^T A V. \tag{15.1}$$

It is obvious that the eigenvalues are preserved under this transformation:

$$Ax = \lambda x \quad \Leftrightarrow \quad V^T A V y = \lambda y, \tag{15.2}$$

where $y = V^T x$.

15.1 Perturbation Theory

The QR algorithm for computing eigenvalues is based on orthogonal similarity transformations (15.1), and it computes a sequence of transformations such that the final result is diagonal (in the case of symmetric A) or triangular (for nonsymmetric A). Since the algorithm is iterative, it is necessary to decide when a floating point number is small enough to be considered as zero numerically. To have a sound theoretical basis for this decision, one must know how sensitive the eigenvalues and eigenvectors are to small perturbations of the data, i.e., the coefficients of the matrix.

Knowledge about the sensitivity of eigenvalues and singular values is useful also for a more fundamental reason: often matrix elements are measured values and subject to errors. Sensitivity theory gives information about how much we can trust eigenvalues, etc., in such situations.

In this section we give a couple of perturbation results, without proofs,$^{37}$ first for a symmetric matrix $A \in \mathbb{R}^{n \times n}$. Assume that eigenvalues of $n \times n$ matrices are ordered

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n.$$

We consider a perturbed matrix A + E and ask how far the eigenvalues and eigenvectors of A + E are from those of A.

Example 15.1. Let

$$A = \begin{pmatrix} 2 & 1 & 0 & 0 \\ 1 & 2 & 0.5 & 0 \\ 0 & 0.5 & 2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad
A + E = \begin{pmatrix} 2 & 1 & 0 & 0 \\ 1 & 2 & 0.5 & 0 \\ 0 & 0.5 & 2 & 10^{-15} \\ 0 & 0 & 10^{-15} & 1 \end{pmatrix}.$$

This is a typical situation in the QR algorithm for tridiagonal matrices: by a sequence of orthogonal similarity transformations, a tridiagonal matrix is made to converge toward a diagonal matrix. When are we then allowed to consider a small off-diagonal floating point number as zero? How much can the eigenvalues of A and A + E deviate?

37 For proofs, see, e.g., [42, Chapters 7, 8].
Theorem 15.2. Let $A \in \mathbb{R}^{n \times n}$ and A + E be symmetric matrices. Then

$$\lambda_k(A) + \lambda_n(E) \le \lambda_k(A + E) \le \lambda_k(A) + \lambda_1(E), \qquad k = 1, 2, \ldots, n,$$

and

$$|\lambda_k(A + E) - \lambda_k(A)| \le \| E \|_2, \qquad k = 1, 2, \ldots, n.$$

From the theorem we see that, loosely speaking, if we perturb the matrix elements by $\epsilon$, then the eigenvalues are also perturbed by $O(\epsilon)$. For instance, in Example 15.1 the matrix E has the eigenvalues $\pm 10^{-15}$ and $\| E \|_2 = 10^{-15}$. Therefore the eigenvalues of the two matrices differ by $10^{-15}$ at the most.

The sensitivity of the eigenvectors depends on the separation of eigenvalues.

Theorem 15.3. Let $[\lambda, q]$ be an eigenvalue-eigenvector pair of the symmetric matrix A, and assume that the eigenvalue is simple. Form the orthogonal matrix $Q = \begin{pmatrix} q & Q_1 \end{pmatrix}$ and partition the matrices $Q^T A Q$ and $Q^T E Q$,

$$Q^T A Q = \begin{pmatrix} \lambda & 0 \\ 0 & A_2 \end{pmatrix}, \qquad
Q^T E Q = \begin{pmatrix} \epsilon & e^T \\ e & E_2 \end{pmatrix}.$$

Define

$$d = \min_{\lambda_i(A) \ne \lambda} |\lambda - \lambda_i(A)|$$

and assume that $\| E \|_2 \le d/4$. Then there exists an eigenvector $\hat{q}$ of A + E such that the distance between q and $\hat{q}$, measured as the sine of the angle between the vectors, is bounded:

$$\sin(\theta(q, \hat{q})) \le \frac{4 \| e \|_2}{d}.$$

The theorem is meaningful only if the eigenvalue is simple. It shows that eigenvectors corresponding to close eigenvalues can be sensitive to perturbations and are therefore more difficult to compute to high accuracy.

Example 15.4. The eigenvalues of the matrix A in Example 15.1 are

0.8820, 1.0000, 2.0000, 3.1180.

The deviation between the eigenvectors of A and A + E corresponding to the smallest eigenvalue can be estimated by

$$\frac{4 \| e \|_2}{|0.8820 - 1|} \approx 1.07 \cdot 10^{-14}.$$

Since the eigenvalues are well separated, the eigenvectors of this matrix are rather insensitive to perturbations in the data.
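The numbers in Examples 15.1 and 15.4 can be reproduced with a few lines of MATLAB. The script below is a sketch written for this purpose; the variable names and the way the partitioning of Theorem 15.3 is formed (through a QR decomposition of the eigenvector) are assumptions made for illustration.

```matlab
% Sketch: the bounds of Theorems 15.2 and 15.3 for Examples 15.1 and 15.4.
A = [2 1 0 0; 1 2 0.5 0; 0 0.5 2 0; 0 0 0 1];
E = zeros(4); E(3,4) = 1e-15; E(4,3) = 1e-15;

lam = eig(A);                     % 0.8820, 1, 2, 3.1180 in some order
normE = norm(E);                  % bound on every eigenvalue change

% Eigenvector belonging to the smallest eigenvalue
[X,D] = eig(A);
[lmin,idx] = min(diag(D));
q = X(:,idx);

% Partition Q'*E*Q as in Theorem 15.3; e is the first column below (1,1)
[Q,Rtmp] = qr(q);                 % first column of Q is parallel to q
QEQ = Q'*E*Q;
e = QEQ(2:end,1);

others = lam(abs(lam - lmin) > 1e-8);   % eigenvalues different from lmin
d = min(abs(others - lmin));            % separation of the eigenvalue
bound = 4*norm(e)/d;                    % bound on sin(theta(q,qhat))
fprintf('norm(E) = %g, d = %g, bound = %g\n', normE, d, bound);
```

The computed differences between eig(A) and eig(A+E) are themselves at the level of rounding errors, so the script evaluates the bounds rather than the differences.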
To formulate perturbation results for nonsymmetric matrices, we first introduce the concept of an upper quasi-triangular matrix: $R \in \mathbb{R}^{n \times n}$ is called upper quasi-triangular if it has the form

$$R = \begin{pmatrix}
R_{11} & R_{12} & \cdots & R_{1m} \\
0 & R_{22} & \cdots & R_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & R_{mm}
\end{pmatrix},$$

where each $R_{ii}$ is either a scalar or a $2 \times 2$ matrix having complex conjugate eigenvalues. The eigenvalues of R are equal to the eigenvalues of the diagonal blocks $R_{ii}$ (which means that if $R_{ii}$ is a scalar, then it is an eigenvalue of R).

Theorem 15.5 (real Schur decomposition$^{38}$). For any (symmetric or nonsymmetric) matrix $A \in \mathbb{R}^{n \times n}$ there exists an orthogonal matrix U such that

$$U^T A U = R, \tag{15.3}$$

where R is upper quasi-triangular.

Partition U and R:

$$U = \begin{pmatrix} U_k & \hat{U} \end{pmatrix}, \qquad R = \begin{pmatrix} R_k & S \\ 0 & \hat{R} \end{pmatrix},$$

where $U_k \in \mathbb{R}^{n \times k}$ and $R_k \in \mathbb{R}^{k \times k}$. Then from (15.3) we get

$$A U_k = U_k R_k, \tag{15.4}$$

which implies $\mathcal{R}(A U_k) \subset \mathcal{R}(U_k)$, where $\mathcal{R}(U_k)$ denotes the range of $U_k$. Therefore $U_k$ is called an invariant subspace or an eigenspace of A, and the decomposition (15.4) is called a partial Schur decomposition.

If A is symmetric, then R is diagonal, and the Schur decomposition is the same as the eigenvalue decomposition $U^T A U = D$, where D is diagonal. If A is nonsymmetric, then some or all of its eigenvalues may be complex.

Example 15.6. The Schur decomposition is a standard function in MATLAB. If the matrix is real, then R is upper quasi-triangular:

>> A=randn(3)
A = -0.4326 0.2877 1.1892
-1.6656 -1.1465 -0.0376
0.1253 1.1909 0.3273
>> [U,R]=schur(A)
U = 0.2827 0.2924 0.9136
0.8191 -0.5691 -0.0713
-0.4991 -0.7685 0.4004
38 There is a complex version of the decomposition, where U is unitary and R is complex and
upper triangular.
R = -1.6984 0.2644 -1.2548
0 0.2233 0.7223
0 -1.4713 0.2233
If we compute the eigenvalue decomposition, we get
>> [X,D]=eig(A)
X = 0.2827 0.4094 - 0.3992i 0.4094 + 0.3992i
0.8191 -0.0950 + 0.5569i -0.0950 - 0.5569i
-0.4991 0.5948 0.5948
D = -1.6984 0 0
0 0.2233+1.0309i 0
0 0 0.2233-1.0309i
The eigenvectors of a nonsymmetric matrix are in general not orthogonal.
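The connection between the two outputs in Example 15.6 can be made explicit: the complex conjugate pair in D consists of the eigenvalues of the 2-by-2 diagonal block of R. The following sketch, added for illustration, uses the four-digit values printed above.

```matlab
% Sketch: eigenvalues of the 2-by-2 diagonal block of R in Example 15.6.
R22 = [ 0.2233 0.7223
       -1.4713 0.2233];
mu = eig(R22);                    % complex conjugate pair
disp(mu)

% With equal diagonal entries a and off-diagonal entries b and c, b*c < 0,
% the eigenvalues are a +/- i*sqrt(-b*c).
im = sqrt(0.7223*1.4713);
fprintf('imaginary part from the formula: %.4f\n', im);
```

Up to the rounding of the printed entries, this gives 0.2233 ± 1.0309i, in agreement with the output of eig(A).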
The sensitivity of the eigenvalues of a nonsymmetric matrix depends on the
norm of the strictly upper triangular part of R in the Schur decomposition. For
convenience we here formulate the result using the complex version of the
decomposition.$^{39}$
Theorem 15.7. Let $U^H A U = R = D + N$ be the complex Schur decomposition of
A, where U is unitary, R is upper triangular, and D is diagonal, and let $\tau$ denote
an eigenvalue of a perturbed matrix A + E. Further, let p be the smallest integer
such that $N^p = 0$. Then

$$\min_{\lambda_i(A)} |\lambda_i(A) - \tau| \le \max(\eta, \eta^{1/p}),$$

where

$$\eta = \| E \|_2 \sum_{k=0}^{p-1} \| N \|_2^k.$$
The theorem shows that the eigenvalues of a highly nonsymmetric matrix can
be considerably more sensitive to perturbations than the eigenvalues of a symmetric
matrix; cf. Theorem 15.2.
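As a consistency check, Theorem 15.7 can be specialized to a symmetric (more generally, normal) matrix, for which the Schur form is diagonal. The short derivation below is an added illustration and not part of the original statement.

```latex
N = 0 \;\Longrightarrow\; p = 1, \qquad
\eta = \| E \|_2 \sum_{k=0}^{0} \| N \|_2^{k} = \| E \|_2, \qquad
\min_{\lambda_i(A)} |\lambda_i(A) - \tau| \le \max(\eta, \eta^{1/1}) = \| E \|_2 .
```

This is the same $O(\| E \|_2)$ behavior as in Theorem 15.2. For $p > 1$ and small $\eta$, on the other hand, the term $\eta^{1/p}$ dominates, which is the source of the increased sensitivity.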
39 The notation $U^H$ means transposed and conjugated.

Example 15.8. The matrices

$$A = \begin{pmatrix} 2 & 0 & 10^3 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}, \qquad
B = A + \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 10^{-10} & 0 & 0 \end{pmatrix}$$

have the eigenvalues

2, 2, 2,

and

2.00031622776602, 1.99968377223398, 2.00000000000000,

respectively. The relevant quantity for the perturbation is $\eta^{1/2} \approx 3.164 \cdot 10^{-4}$.
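The quantities in Example 15.8 can be recomputed as follows. This is a sketch under the observation that A is already upper triangular, so its Schur form can be taken as A itself with U = I; the variable names are illustrative.

```matlab
% Sketch: Example 15.8 and the bound of Theorem 15.7.
A = [2 0 1e3; 0 2 0; 0 0 2];
E = zeros(3); E(3,1) = 1e-10;
B = A + E;

% A is already upper triangular, so U = I, D = 2*I, and N = A - D.
N = A - 2*eye(3);
p = 2;                                  % N^2 = 0 but N is nonzero
eta = norm(E)*(1 + norm(N));            % sum over k = 0,...,p-1
bound = max(eta, eta^(1/p));            % approximately 3.164e-4

dev = max(abs(eig(B) - 2));             % largest eigenvalue perturbation
fprintf('deviation %.6e, bound %.6e\n', dev, bound);
```

A perturbation of size $10^{-10}$ moves the eigenvalues by about $3.16 \cdot 10^{-4}$, just below the bound, which shows that the bound of the theorem is nearly attained for this matrix.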
The nonsymmetric version of Theorem 15.3 is similar: again the angle between
the eigenvectors depends on the separation of the eigenvalues. We give a simplified
statement below, where we disregard the possibility of a complex eigenvalue.
Theorem 15.9. Let $[\lambda, q]$ be an eigenvalue-eigenvector pair of A, and assume that
the eigenvalue is simple. Form the orthogonal matrix $Q = \begin{pmatrix} q & Q_1 \end{pmatrix}$ and partition
the matrices $Q^T A Q$ and $Q^T E Q$:

$$Q^T A Q = \begin{pmatrix} \lambda & v^T \\ 0 & A_2 \end{pmatrix}, \qquad
Q^T E Q = \begin{pmatrix} \epsilon & e^T \\ \delta & E_2 \end{pmatrix}.$$

Define

$$d = \sigma_{\min}(A_2 - \lambda I),$$

and assume d > 0. If the perturbation E is small enough, then there exists an
eigenvector $\hat{q}$ of A + E such that the distance between q and $\hat{q}$ measured as the sine
of the angle between the vectors is bounded by

$$\sin(\theta(q, \hat{q})) \le \frac{4 \| \delta \|_2}{d}.$$

The theorem says essentially that if we perturb A by $\epsilon$, then the eigenvector
is perturbed by $\epsilon/d$.
Example 15.10. Let the tridiagonal matrix be defined as

$$A_n = \begin{pmatrix}
2 & -1.1 & & & \\
-0.9 & 2 & -1.1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -0.9 & 2 & -1.1 \\
 & & & -0.9 & 2
\end{pmatrix} \in \mathbb{R}^{n \times n}.$$

For n = 100, its smallest eigenvalue is 0.01098771, approximately. The following
MATLAB script computes the quantity d in Theorem 15.9:
% xn is the eigenvector corresponding to
% the smallest eigenvalue
[Q,r]=qr(xn); % first column of Q is parallel to xn
H=Q'*A*Q; lam=H(1,1); % H has the block form of Theorem 15.9
A2=H(2:n,2:n);
d=min(svd(A2-lam*eye(size(A2))));
We get $d = 1.6207 \cdot 10^{-4}$. Therefore, if we perturb the matrix by $10^{-10}$, say, this
may change the eigenvector by a factor $4 \cdot 10^{-6}$, approximately.
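The script in Example 15.10 presupposes that A, n, and xn are available. A self-contained version could look as follows; the construction of A with diag and the use of eig to obtain xn are assumptions made for this sketch, not necessarily the way the original computation was set up.

```matlab
% Sketch: a complete version of the computation in Example 15.10.
n = 100;
A = 2*eye(n) - 1.1*diag(ones(n-1,1),1) - 0.9*diag(ones(n-1,1),-1);

[X,D] = eig(A);
[lam_min,idx] = min(diag(D));     % smallest eigenvalue, approx 0.01098771
xn = X(:,idx);                    % corresponding eigenvector

[Q,r] = qr(xn);                   % first column of Q is parallel to xn
H = Q'*A*Q; lam = H(1,1);         % H has the block form of Theorem 15.9
A2 = H(2:n,2:n);
d = min(svd(A2 - lam*eye(size(A2))));
fprintf('smallest eigenvalue %.8f, d = %.4e\n', lam_min, d);
```

Note that d is considerably smaller than the distance from the smallest eigenvalue to its nearest neighbor, which reflects the nonsymmetry of the matrix.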
15.2 The Power Method and Inverse Iteration
The power method is a classical iterative method for computing the largest (in
magnitude) eigenvalue and the corresponding eigenvector. Its convergence can be
very slow, depending on the distribution of eigenvalues. Therefore it should never
be used for dense matrices. Usually for sparse matrices one should use a variant
of the Lanczos method or the Jacobi–Davidson method; see [4] and Section 15.8.3.
However, in some applications the dimension of the problem is so huge that no other
method is viable; see Chapter 12.
Despite its limited usefulness for practical problems, the power method is
important from a theoretical point of view. In addition, there is a variation of the
power method, inverse iteration, that is of great practical importance.
In this section we give a slightly more general formulation of the power method
than in Chapter 12 and recall a few of its properties.
The power method for computing the largest eigenvalue
% Initial approximation x (assumed to satisfy norm(x) = 1)
for k=1:maxit
y=A*x;
lambda=y'*x; % eigenvalue approximation
if norm(y-lambda*x) < tol*abs(lambda)
break % stop the iterations
end
x=1/norm(y)*y; % normalize the new iterate
end
The convergence of the power method depends on the distribution of eigenval-
ues of the matrix A. Assume that the largest eigenvalue in magnitude is simple and
that λi are ordered |λ1| > |λ2| ≥ · · · ≥ |λn|. The rate of convergence is determined
by the ratio |λ2/λ1|. If this ratio is close to 1, then the iteration is very slow.
A stopping criterion for the power iteration can be formulated in terms of the
residual vector for the eigenvalue problem: if the norm of the residual r = Ax̂− λ̂x̂
is small, then the eigenvalue approximation is good.
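The effect of the ratio on the iteration count can be estimated with a rough calculation. The sketch below assumes that the error decreases by the factor $|\lambda_2/\lambda_1|$ in every step and uses the illustrative values 0.98 for the ratio and $\delta = 10^{-6}$ for the desired error reduction; neither number is taken from the text.

```latex
\left| \frac{\lambda_2}{\lambda_1} \right|^{k} \le \delta
\quad\Longleftrightarrow\quad
k \ge \frac{\ln \delta}{\ln |\lambda_2/\lambda_1|},
\qquad \text{e.g.} \quad
\frac{\ln 10^{-6}}{\ln 0.98} \approx \frac{-13.82}{-0.0202} \approx 684 .
```

So even a moderately unfavorable ratio leads to several hundred iterations for six digits of error reduction.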
Example 15.11. Consider again the tridiagonal matrix

$$A_n = \begin{pmatrix}
2 & -1.1 & & & \\
-0.9 & 2 & -1.1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -0.9 & 2 & -1.1 \\
 & & & -0.9 & 2
\end{pmatrix} \in \mathbb{R}^{n \times n}.$$
The two largest eigenvalues of $A_{20}$ are 3.9677 and 3.9016, approximately. As initial
approximation we chose a random vector. In Figure 15.1 we plot different error
measures during the iterations: the relative residual $\| A x^{(k)} - \lambda^{(k)} x^{(k)} \| / \lambda_1$
($\lambda^{(k)}$ denotes the approximation of $\lambda_1$ in the kth iteration), the error in the
eigenvalue approximation, and the angle between the exact and the approximate
eigenvector. After 150 iterations the relative error in the computed approximation
of the eigenvalue is 0.0032.

Figure 15.1. Power iterations for $A_{20}$. The relative residual
$\| A x^{(k)} - \lambda^{(k)} x^{(k)} \| / \lambda_1$ (solid line), the absolute error in the eigenvalue
approximation (dash-dotted line), and the angle (in radians) between the exact
eigenvector and the approximation (dashed line). [Horizontal axis: iteration, 0 to 150;
vertical axis: logarithmic, $10^{-6}$ to $10^{0}$.]
We have $\lambda_2(A_{20})/\lambda_1(A_{20}) = 0.9833$. It follows that

$$0.9833^{150} \approx 0.0802,$$

which indicates that the convergence is quite slow, as seen in Figure 15.1. This is
comparable to the reduction of the angle between the exact and the approximate
eigenvector during 150 iterations: from 1.2847 radians to 0.0306.
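The experiment of Example 15.11 can be repeated with the following sketch. It differs from the text in two labeled assumptions: the starting vector is a fixed, deterministic vector rather than a random one (so that the run is reproducible), and the exact largest eigenvalue is taken from the standard closed-form expression for the eigenvalues of a tridiagonal Toeplitz matrix, which agrees with the value 3.9677 quoted above.

```matlab
% Sketch: power method on A_20 from Example 15.11.
n = 20;
A = 2*eye(n) - 1.1*diag(ones(n-1,1),1) - 0.9*diag(ones(n-1,1),-1);
lam1 = 2 + 2*sqrt(0.99)*cos(pi/(n+1));  % largest eigenvalue, approx 3.9677

x = cos((1:n)'); x = x/norm(x);         % deterministic start (assumption)
maxit = 5000; tol = 1e-10;
for k = 1:maxit
    y = A*x;
    lambda = y'*x;
    if k == 150
        err150 = abs(lambda - lam1)/lam1;   % relative error after 150 steps
    end
    if norm(y - lambda*x) < tol*abs(lambda)
        break                           % stop the iterations
    end
    x = y/norm(y);
end
fprintf('%d iterations, lambda = %.6f\n', k, lambda);
fprintf('relative error after 150 iterations: %.2e\n', err150);
```

With a ratio of about 0.9833, well over a thousand iterations are needed before the residual criterion is satisfied with this tolerance, which illustrates why the method is rarely used when alternatives exist.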
If we iterate with $A^{-1}$ in the power method,

$$x^{(k)} = A^{-1} x^{(k-1)},$$

then, since the eigenvalues of $A^{-1}$ are $1/\lambda_i$, the sequence of eigenvalue
approximations converges toward $1/\lambda_{\min}$, where $\lambda_{\min}$ is the eigenvalue of smallest absolute
value. Even better, if we have a good enough approximation of one of the
eigenvalues, $\tau \approx \lambda_k$, then the shifted matrix $A - \tau I$ has the smallest eigenvalue $\lambda_k - \tau$.
Thus, we can expect very fast convergence in the "inverse power method." This
method is called inverse iteration.
Inverse iteration
% Initial approximation x and eigenvalue approximation tau
[L,U]=lu(A - tau*eye(size(A)));
for k=1:maxit
y=U\(L\x); % solve (A - tau*I)*y = x
theta=y'*x;
if norm(y-theta*x) < tol*abs(theta)
break % stop the iteration
end
x=1/norm(y)*y;
end
lambda=tau+1/theta; x=1/norm(y)*y;
Example 15.12. The smallest eigenvalue of the matrix $A_{100}$ from Example 15.10
is $\lambda_{100} = 0.01098771187192$ to 14 decimals accuracy. If we use the
approximation $\lambda_{100} \approx \tau = 0.011$ and apply inverse iteration, we get fast convergence; see
Figure 15.2. In this example the convergence factor is

$$\left| \frac{\lambda_{100} - \tau}{\lambda_{99} - \tau} \right| \approx 0.0042748,$$

which means that after four iterations, the error is reduced by a factor of the order
$3 \cdot 10^{-10}$.
To be efficient, inverse iteration requires that we have a good approximation of the eigenvalue. In addition, we must be able to solve linear systems $(A - \tau I)y = x$ (for $y$) cheaply. If $A$ is a band matrix, then the LU decomposition can be obtained easily and in each iteration the system can be solved by forward and back substitution (as in the code above). The same method can be used for other sparse matrices if a sparse LU decomposition can be computed without too much fill-in.
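The band-matrix case can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the text: the test matrix is the second-difference matrix (whose eigenvalues $2 - 2\cos(j\pi/(n+1))$ are known in closed form), and the shift tau, the tolerance, and the starting vector are chosen for the example only. The three-output form of lu is used so that row pivoting is explicit for the sparse matrix.

```matlab
% Inverse iteration for a banded matrix stored in sparse format.
% ASSUMED test problem: n-by-n second-difference matrix (not from the text).
n = 100;
e = ones(n,1);
A = spdiags([-e 2*e -e], -1:1, n, n);
lam_exact = 2 - 2*cos(pi/(n+1));      % smallest eigenvalue, known in closed form
tau = 1e-3;                           % assumed rough approximation of lam_exact
tol = 1e-8; maxit = 50;
[L,U,P] = lu(A - tau*speye(n));       % sparse LU with pivoting: P*(A-tau*I) = L*U
x = e/sqrt(n);                        % starting vector
for k = 1:maxit
    y = U\(L\(P*x));                  % solve (A - tau*I)*y = x by two triangular solves
    theta = y'*x;
    if norm(y - theta*x) < tol*abs(theta)
        break                         % stop the iteration
    end
    x = y/norm(y);
end
lambda = tau + 1/theta;
fprintf('%d iterations, error in eigenvalue %.2e\n', k, abs(lambda - lam_exact));
```

The factorization is computed once, outside the loop; each iteration then costs only two banded triangular solves.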
15.3 Similarity Reduction to Tridiagonal Form
The QR algorithm that we will introduce in Section 15.4 is an iterative algorithm, where in each step a QR decomposition is computed. If it is applied to a dense matrix $A \in \mathbb{R}^{n \times n}$, then the cost of a step is $O(n^3)$. This prohibitively high cost can be reduced substantially by first transforming the matrix to compact form, by an orthogonal similarity transformation (15.1),
\[ A \longrightarrow V^T A V, \]
for an orthogonal matrix $V$. We have already seen in (15.2) that the eigenvalues are preserved under this transformation,
\[ Ax = \lambda x \quad \Leftrightarrow \quad V^T A V y = \lambda y, \]
where $y = V^T x$.
Figure 15.2. Inverse iterations for $A_{100}$ with $\tau = 0.011$. The relative residual $\|Ax^{(k)} - \lambda^{(k)}x^{(k)}\|/\lambda^{(k)}$ (solid line), the absolute error in the eigenvalue approximation (dash-dotted line), and the angle (in radians) between the exact eigenvector and the approximation (dashed line). (Horizontal axis: iteration, 1 to 7; logarithmic vertical axis, $10^{-14}$ to $10^{-2}$.)
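Let $A \in \mathbb{R}^{n \times n}$ be symmetric. By a sequence of Householder transformations it can be reduced to tridiagonal form. We illustrate the procedure using an example with $n = 6$. First we construct a transformation that zeros the elements in positions 3 through $n$ in the first column when we multiply $A$ from the left:
\[
H_1 A = H_1
\begin{pmatrix}
\times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times
\end{pmatrix}
=
\begin{pmatrix}
\times & \times & \times & \times & \times & \times \\
* & * & * & * & * & * \\
0 & * & * & * & * & * \\
0 & * & * & * & * & * \\
0 & * & * & * & * & * \\
0 & * & * & * & * & *
\end{pmatrix}.
\]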
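Elements that are changed in the transformation are denoted by $*$. Note that the elements of the first row are not changed. In an orthogonal similarity transformation we shall multiply by the same matrix transposed from the right. Since in the left multiplication the first row was not touched, the first column will remain unchanged:
\[
H_1 A H_1^T =
\begin{pmatrix}
\times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times \\
0 & \times & \times & \times & \times & \times \\
0 & \times & \times & \times & \times & \times \\
0 & \times & \times & \times & \times & \times \\
0 & \times & \times & \times & \times & \times
\end{pmatrix} H_1^T
=
\begin{pmatrix}
\times & * & 0 & 0 & 0 & 0 \\
\times & * & * & * & * & * \\
0 & * & * & * & * & * \\
0 & * & * & * & * & * \\
0 & * & * & * & * & * \\
0 & * & * & * & * & *
\end{pmatrix}.
\]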
Due to symmetry, elements 3 through $n$ in the first row will be equal to zero.
In the next step we zero the elements in the second column in positions 4 through $n$. Since this affects only rows 3 through $n$ and the corresponding columns, this does not destroy the zeros that we created in the first step. The result is
\[
H_2 H_1 A H_1^T H_2^T =
\begin{pmatrix}
\times & \times & 0 & 0 & 0 & 0 \\
\times & \times & * & 0 & 0 & 0 \\
0 & * & * & * & * & * \\
0 & 0 & * & * & * & * \\
0 & 0 & * & * & * & * \\
0 & 0 & * & * & * & *
\end{pmatrix}.
\]
After $n - 2$ such similarity transformations the matrix is in tridiagonal form:
\[
V^T A V =
\begin{pmatrix}
\times & \times & 0 & 0 & 0 & 0 \\
\times & \times & \times & 0 & 0 & 0 \\
0 & \times & \times & \times & 0 & 0 \\
0 & 0 & \times & \times & \times & 0 \\
0 & 0 & 0 & \times & \times & \times \\
0 & 0 & 0 & 0 & \times & \times
\end{pmatrix},
\]
where $V = H_1^T H_2^T \cdots H_{n-2}^T = H_1 H_2 \cdots H_{n-2}$.
In summary, we have demonstrated how a symmetric matrix can be reduced to tridiagonal form by a sequence of $n - 2$ Householder transformations:
\[ A \longrightarrow T = V^T A V, \qquad V = H_1 H_2 \cdots H_{n-2}. \tag{15.5} \]
Since the reduction is done by similarity transformations, the tridiagonal matrix $T$ has the same eigenvalues as $A$.
The reduction to tridiagonal form requires $4n^3/3$ flops if one takes advantage of symmetry. As in the case of QR decomposition, the Householder transformations can be stored in the subdiagonal part of $A$. If $V$ is computed explicitly, this takes $4n^3/3$ additional flops.
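The reduction (15.5) can be sketched in a few lines of MATLAB. The code below is an illustration only: the symmetric test matrix is an arbitrary assumption, and each Householder matrix is formed explicitly as a full $n \times n$ matrix for clarity, so the sketch does not attain the flop counts quoted above and does not exploit symmetry.

```matlab
% Householder reduction of a symmetric matrix to tridiagonal form (sketch).
% The test matrix is an arbitrary ASSUMPTION; H_k is formed explicitly for clarity.
n = 6;
B = magic(n);  A = (B + B')/2;        % some symmetric matrix
T = A;  V = eye(n);
for k = 1:n-2
    x = T(k+1:n,k);                   % part of column k to be reduced to a multiple of e_1
    s = norm(x);
    if s == 0, continue, end          % nothing to zero in this column
    u = x;
    if x(1) >= 0, u(1) = x(1) + s; else, u(1) = x(1) - s; end   % avoids cancellation
    u = u/norm(u);
    H = eye(n);
    H(k+1:n,k+1:n) = eye(n-k) - 2*(u*u');   % H_k, symmetric and orthogonal
    T = H*T*H';                       % similarity transformation
    V = V*H;                          % accumulate V = H_1*H_2*...*H_{n-2}
end
fprintf('||V''*A*V - T|| = %.2e\n', norm(V'*A*V - T));
fprintf('size of entries outside the three diagonals: %.2e\n', norm(tril(T,-2)));
```

Because $H_k$ acts only on rows and columns $k+1$ through $n$, the zeros created in earlier columns are preserved, exactly as in the $6 \times 6$ illustration.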
15.4 The QR Algorithm for a Symmetric Tridiagonal Matrix
We will now give a sketch of the QR algorithm for a symmetric, tridiagonal matrix. We emphasize that our MATLAB codes are greatly simplified and are intended only to demonstrate the basic ideas of the algorithm. The actual software (in LAPACK) contains numerous features for efficiency, robustness, and numerical stability.
The procedure that we describe can be considered as a continuation of the similarity reduction (15.5), but now we reduce the matrix $T$ to diagonal form:
\[ T \longrightarrow \Lambda = Q^T T Q, \qquad Q = Q_1 Q_2 \cdots, \tag{15.6} \]
where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$. The matrices $Q_i$ will be orthogonal, but here they will be constructed using plane rotations. However, the most important difference
between (15.5) and (15.6) is that there does not exist a finite algorithm⁴⁰ for computing $\Lambda$. We compute a sequence of matrices,
\[ T_0 := T, \qquad T_i = Q_i^T T_{i-1} Q_i, \quad i = 1, 2, \ldots, \tag{15.7} \]
such that it converges to a diagonal matrix,
\[ \lim_{i \to \infty} T_i = \Lambda. \]
We will demonstrate in numerical examples that the convergence is very rapid, so that in floating point arithmetic the algorithm can actually be considered as finite. Since all the transformations in (15.7) are similarity transformations, the diagonal elements of $\Lambda$ are the eigenvalues of $T$.
We now give a first version of the QR algorithm for a symmetric tridiagonal matrix $T \in \mathbb{R}^{n \times n}$.
QR iteration for symmetric T: Bottom eigenvalue
    for i=1:maxit % Provisional simplification
        mu=wilkshift(T(n-1:n,n-1:n));
        [Q,R]=qr(T-mu*eye(n));
        T=R*Q+mu*eye(n)
    end
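    function mu=wilkshift(T);
    % Compute the Wilkinson shift: the eigenvalue of the 2-by-2
    % matrix T that is closest to T(2,2)
    l=eig(T);
    if abs(l(1)-T(2,2))<abs(l(2)-T(2,2))
        mu=l(1);
    else
        mu=l(2);
    end

    % QR iteration with deflation: when the off-diagonal element T(i-1,i)
    % is negligible, T(i,i) is accepted as an eigenvalue and the iteration
    % continues with the leading submatrix. C is a small constant.
    n=size(T,1); it=0;
    for i=n:-1:3
        while abs(T(i-1,i)) > ...
                (abs(T(i,i))+abs(T(i-1,i-1)))*C*eps
            it=it+1;
            mu=wilkshift(T(i-1:i,i-1:i));
            [Q,R]=qr(T(1:i,1:i)-mu*eye(i));
            T=R*Q+mu*eye(i);
        end
        D(i)=T(i,i);
    end
    D(1:2)=eig(T(1:2,1:2))';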
For a given submatrix T(1:i,1:i) the QR steps are iterated until the stopping criterion
\[ \frac{|t_{i-1,i}|}{|t_{i-1,i-1}| + |t_{i,i}|} < C\mu \]
is satisfied, where $C$ is a small constant and $\mu$ is the unit round-off. From Theorem 15.2 we see that considering such a tiny element as a numerical zero leads to a very small (and acceptable) perturbation of the eigenvalues. In actual software, a slightly more complicated stopping criterion is used.
When applied to the matrix $T_{100}$ (cf. (15.8)) with the value $C = 5$, 204 QR steps were taken, i.e., approximately 2 steps per eigenvalue. The maximum deviation between the computed eigenvalues and those computed by the MATLAB eig function was $2.9 \cdot 10^{-15}$.
It is of course inefficient to compute the QR decomposition of a tridiagonal matrix using the MATLAB function qr, which is a Householder-based algorithm. Instead the decomposition should be computed using $n - 1$ plane rotations in $O(n)$ flops. We illustrate the procedure with a small example, where the tridiagonal matrix $T = T^{(0)}$ is $6 \times 6$. The first subdiagonal element (from the top) is zeroed by a rotation from the left in the (1, 2) plane, $G_1^T (T^{(0)} - \tau I)$, and then the second subdiagonal is zeroed by a rotation in (2, 3), $G_2^T G_1^T (T^{(0)} - \tau I)$. Symbolically,
\[
\begin{pmatrix}
\times & \times & & & & \\
\times & \times & \times & & & \\
 & \times & \times & \times & & \\
 & & \times & \times & \times & \\
 & & & \times & \times & \times \\
 & & & & \times & \times
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
\times & \times & + & & & \\
0 & \times & \times & + & & \\
 & 0 & \times & \times & & \\
 & & \times & \times & \times & \\
 & & & \times & \times & \times \\
 & & & & \times & \times
\end{pmatrix}.
\]
Note the fill-in (new nonzero elements, denoted $+$) that is created. After $n - 1$ steps we have an upper triangular matrix with three nonzero diagonals:
\[
R = G_{n-1}^T \cdots G_1^T (T^{(0)} - \tau I) =
\begin{pmatrix}
\times & \times & + & & & \\
0 & \times & \times & + & & \\
 & 0 & \times & \times & + & \\
 & & 0 & \times & \times & + \\
 & & & 0 & \times & \times \\
 & & & & 0 & \times
\end{pmatrix}.
\]
We then apply the rotations from the right, $R G_1 \cdots G_{n-1}$, i.e., we start with a transformation involving the first two columns. Then follows a rotation involving the second and third columns. The result after two steps is
\[
\begin{pmatrix}
\times & \times & \times & & & \\
+ & \times & \times & \times & & \\
 & + & \times & \times & \times & \\
 & & & \times & \times & \times \\
 & & & & \times & \times \\
 & & & & & \times
\end{pmatrix}.
\]
We see that the zeroes that we introduced below the diagonal in the transformations from the left are systematically filled in.
After $n - 1$ steps we have
\[
T^{(1)} = R G_1 G_2 \cdots G_{n-1} + \tau I =
\begin{pmatrix}
\times & \times & \times & & & \\
+ & \times & \times & \times & & \\
 & + & \times & \times & \times & \\
 & & + & \times & \times & \times \\
 & & & + & \times & \times \\
 & & & & + & \times
\end{pmatrix}.
\]
But we have made a similarity transformation: with $Q = G_1 G_2 \cdots G_{n-1}$ and using $R = Q^T (T^{(0)} - \tau I)$, we can write
\[ T^{(1)} = RQ + \tau I = Q^T (T^{(0)} - \tau I) Q + \tau I = Q^T T^{(0)} Q, \tag{15.9} \]
so we know that $T^{(1)}$ is symmetric,
\[
T^{(1)} =
\begin{pmatrix}
\times & \times & & & & \\
\times & \times & \times & & & \\
 & \times & \times & \times & & \\
 & & \times & \times & \times & \\
 & & & \times & \times & \times \\
 & & & & \times & \times
\end{pmatrix}.
\]
Thus we have shown the following result.
Proposition 15.13. The QR step for a tridiagonal matrix
\[ QR = T^{(k)} - \tau_k I, \qquad T^{(k+1)} = RQ + \tau_k I, \]
is a similarity transformation
\[ T^{(k+1)} = Q^T T^{(k)} Q, \tag{15.10} \]
and the tridiagonal structure is preserved. The transformation can be computed with plane rotations in $O(n)$ flops.
From (15.9) it may appear as if the shift plays no significant role. However, it determines the value of the orthogonal transformation in the QR step. Actually, the shift strategy is absolutely necessary for the algorithm to be efficient: if no shifts are performed, then the QR algorithm usually converges very slowly, in fact as slowly as the power method; cf. Section 15.2. On the other hand, it can be proved [107] (see, e.g., [93, Chapter 3]) that the shifted QR algorithm has very fast convergence.
Proposition 15.14. The symmetric QR algorithm with Wilkinson shifts converges cubically toward the eigenvalue decomposition.
In actual software for the QR algorithm, there are several enhancements of the algorithm that we outlined above. For instance, the algorithm checks whether any of the off-diagonal elements are small: when a negligible off-diagonal element is found, then the problem can be split in two. There is also a divide-and-conquer variant of the QR algorithm. For an extensive treatment, see [42, Chapter 8].
15.4.1 Implicit shifts
One important aspect of the QR algorithm is that the shifts can be performed implicitly.
This is especially useful for the application of the algorithm to the SVD and the nonsymmetric eigenproblem. This variant is based on the implicit Q theorem, which we here give in slightly simplified form.
Theorem 15.15. Let $A$ be symmetric, and assume that $Q$ and $V$ are orthogonal matrices such that $Q^T A Q$ and $V^T A V$ are both tridiagonal. Then, if the first columns of $Q$ and $V$ are equal, $q_1 = v_1$, then $Q$ and $V$ are essentially equal: $q_i = \pm v_i$, $i = 2, 3, \ldots, n$.
For a proof, see [42, Chapter 8].
A consequence of this theorem is that if we determine and apply the first transformation in the QR decomposition of $T - \tau I$, and if we construct the rest of the transformations in such a way that we finally arrive at a tridiagonal matrix, then we have performed a shifted QR step as in Proposition 15.13. This procedure is implemented as follows.
Let the first plane rotation be determined such that
\[
\begin{pmatrix} c & s \\ -s & c \end{pmatrix}
\begin{pmatrix} \alpha_1 - \tau \\ \beta_1 \end{pmatrix}
=
\begin{pmatrix} \times \\ 0 \end{pmatrix},
\tag{15.11}
\]
where $\alpha_1$ and $\beta_1$ are the top diagonal and subdiagonal elements of $T$. Define
\[
G_1^T =
\begin{pmatrix}
c & s & & & \\
-s & c & & & \\
 & & 1 & & \\
 & & & \ddots & \\
 & & & & 1
\end{pmatrix},
\]
and apply the rotation to $T$. The multiplication from the left introduces a new nonzero element in the first row, and, correspondingly, a new nonzero is introduced in the first column by the multiplication from the right:
\[
G_1^T T G_1 =
\begin{pmatrix}
\times & \times & + & & & \\
\times & \times & \times & & & \\
+ & \times & \times & \times & & \\
 & & \times & \times & \times & \\
 & & & \times & \times & \times \\
 & & & & \times & \times
\end{pmatrix},
\]
where $+$ denotes a new nonzero element. We next determine a rotation in the (2, 3)-plane that annihilates the new nonzero and at the same time introduces a new nonzero further down:
\[
G_2^T G_1^T T G_1 G_2 =
\begin{pmatrix}
\times & \times & 0 & & & \\
\times & \times & \times & + & & \\
0 & \times & \times & \times & & \\
 & + & \times & \times & \times & \\
 & & & \times & \times & \times \\
 & & & & \times & \times
\end{pmatrix}.
\]
In an analogous manner we “chase the bulge” downward until we have
\[
\begin{pmatrix}
\times & \times & & & & \\
\times & \times & \times & & & \\
 & \times & \times & \times & & \\
 & & \times & \times & \times & + \\
 & & & \times & \times & \times \\
 & & & + & \times & \times
\end{pmatrix},
\]
where by a final rotation we can zero the bulge and at the same time restore the tridiagonal form.
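The bulge-chasing procedure can be sketched in MATLAB and compared with an explicit shifted QR step. This is a minimal illustration, not production code: the tridiagonal test matrix is an assumption, each rotation is stored as a full $n \times n$ matrix for readability (so the cost is not $O(n)$), and the two results are compared entrywise in absolute value because, by the implicit Q theorem, they can differ in the signs of off-diagonal elements.

```matlab
% One implicit-shift QR step ("bulge chasing") on a symmetric tridiagonal T,
% compared with the explicit step QR = T - tau*I, T1 = R*Q + tau*I.
% The test matrix is an ASSUMPTION; rotations are full matrices for clarity.
n = 6;
T = diag(1:n) + diag(ones(n-1,1),1) + diag(ones(n-1,1),-1);
l = eig(T(n-1:n,n-1:n));                  % Wilkinson shift: eigenvalue of the
[~,j] = min(abs(l - T(n,n)));  tau = l(j);  % trailing 2x2 block closest to T(n,n)

Ti = T;
a = T(1,1) - tau;  b = T(2,1);            % first rotation determined as in (15.11)
for k = 1:n-1
    r = hypot(a,b);  c = a/r;  s = b/r;
    G = eye(n);  G(k:k+1,k:k+1) = [c -s; s c];   % G_k, so that G_k' = [c s; -s c]
    Ti = G'*Ti*G;                         % applied to the UNSHIFTED matrix
    if k < n-1
        a = Ti(k+1,k);  b = Ti(k+2,k);    % b is the bulge to be annihilated next
    end
end

[Q,R] = qr(T - tau*eye(n));               % explicit shifted QR step
Te = R*Q + tau*eye(n);
fprintf('difference between |implicit| and |explicit|: %.2e\n', ...
        norm(abs(Ti) - abs(Te)));
```

Only the first rotation uses the shift; all later rotations are determined by the bulge alone.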
Note that it was only in the determination of the first rotation (15.11) that the shift was used. The rotations were applied only to the unshifted tridiagonal matrix. Due to the implicit Q theorem, Theorem 15.15, this is equivalent to a shifted QR step as given in Proposition 15.13.
15.4.2 Eigenvectors
The QR algorithm for computing the eigenvalues of a symmetric matrix (including the reduction to tridiagonal form) requires about $4n^3/3$ flops if only the eigenvalues are computed. Accumulation of the orthogonal transformations to compute the matrix of eigenvectors takes another $9n^3$ flops approximately.
If all $n$ eigenvalues are needed but only a few of the eigenvectors are, then it is cheaper to use inverse iteration (Section 15.2) to compute these eigenvectors, with the computed eigenvalues $\hat{\lambda}_i$ as shifts:
\[ (A - \hat{\lambda}_i I) x^{(k)} = x^{(k-1)}, \qquad k = 1, 2, \ldots. \]
The eigenvalues produced by the QR algorithm are so close to the exact eigenvalues (see below) that usually only one step of inverse iteration is needed to get a very good eigenvector, even if the initial guess for the eigenvector is random.
The QR algorithm is ideal from the point of view of numerical stability. There exist an exactly orthogonal matrix $Q$ and a perturbation $E$ such that the computed diagonal matrix of eigenvalues $\hat{D}$ satisfies exactly
\[ Q^T (A + E) Q = \hat{D} \]
with $\|E\|_2 \approx \mu \|A\|_2$, where $\mu$ is the unit round-off of the floating point system. Then, from Theorem 15.2 we know that a computed eigenvalue $\hat{\lambda}_i$ differs from the exact eigenvalue by a small amount: $|\hat{\lambda}_i - \lambda_i| \le \mu \|A\|_2$.
15.5 Computing the SVD
Since the singular values of a matrix $A$ are the square roots of the eigenvalues of $A^T A$ (and of the nonzero eigenvalues of $A A^T$), it is clear that the problem of computing the SVD can be solved using algorithms similar to those of the symmetric eigenvalue problem. However, it is important to avoid forming the matrices $A^T A$ and $A A^T$, since that would lead to loss of information (cf.
the least squares example on p. 54).
Assume that $A$ is $m \times n$ with $m \ge n$. The first step in computing the SVD of a dense matrix $A$ is to reduce it to upper bidiagonal form by Householder transformations from the left and right,
\[
A = H \begin{pmatrix} B \\ 0 \end{pmatrix} W^T, \qquad
B =
\begin{pmatrix}
\alpha_1 & \beta_1 & & & \\
 & \alpha_2 & \beta_2 & & \\
 & & \ddots & \ddots & \\
 & & & \alpha_{n-1} & \beta_{n-1} \\
 & & & & \alpha_n
\end{pmatrix}.
\tag{15.12}
\]
For a description of this reduction, see Section 7.2.1. Since we use orthogonal transformations in this reduction, the matrix $B$ has the same singular values as $A$. Let $\sigma$ be a singular value of $A$ with singular vectors $u$ and $v$. Then $Av = \sigma u$ is equivalent to
\[
\begin{pmatrix} B \\ 0 \end{pmatrix} \tilde{v} = \sigma \tilde{u}, \qquad \tilde{v} = W^T v, \quad \tilde{u} = H^T u,
\]
from (15.12).
It is easy to see that the matrix $B^T B$ is tridiagonal. The method of choice for computing the singular values of $B$ is the tridiagonal QR algorithm with implicit shifts applied to the matrix $B^T B$, without forming it explicitly.
Let $A \in \mathbb{R}^{m \times n}$, where $m \ge n$. The thin SVD $A = U_1 \Sigma V^T$ (cf. Section 6.1) can be computed in $6mn^2 + 20n^3$ flops.
15.6 The Nonsymmetric Eigenvalue Problem
If we apply the same procedure as in Section 15.3 to a nonsymmetric matrix, then due to nonsymmetry, no elements above the diagonal are zeroed. Thus the final result is a Hessenberg matrix:
\[
V^T A V =
\begin{pmatrix}
\times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times \\
0 & \times & \times & \times & \times & \times \\
0 & 0 & \times & \times & \times & \times \\
0 & 0 & 0 & \times & \times & \times \\
0 & 0 & 0 & 0 & \times & \times
\end{pmatrix}.
\]
The reduction to Hessenberg form using Householder transformations requires $10n^3/3$ flops.
15.6.1 The QR Algorithm for Nonsymmetric Matrices
The “unrefined” QR algorithm for tridiagonal matrices given in Section 15.4 works equally well for a Hessenberg matrix, and the result is an upper triangular matrix, i.e., the $R$ factor in the Schur decomposition. For efficiency, as in the symmetric case, the QR decomposition in each step of the algorithm is computed using plane rotations, but here the transformation is applied to more elements.
We illustrate the procedure with a small example.
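Let the matrix $H \in \mathbb{R}^{6 \times 6}$ be upper Hessenberg, and assume that a Wilkinson shift $\tau$ has been computed from the bottom right $2 \times 2$ matrix. For simplicity we assume that the shift is real. Denote $H^{(0)} := H$. The first subdiagonal element (from the top) in $H - \tau I$ is zeroed by a rotation from the left in the (1, 2) plane, $G_1^T (H^{(0)} - \tau I)$, and then the second subdiagonal is zeroed by a rotation in (2, 3), $G_2^T G_1^T (H^{(0)} - \tau I)$. Symbolically,
\[
\begin{pmatrix}
\times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times \\
 & \times & \times & \times & \times & \times \\
 & & \times & \times & \times & \times \\
 & & & \times & \times & \times \\
 & & & & \times & \times
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
\times & \times & \times & \times & \times & \times \\
0 & \times & \times & \times & \times & \times \\
 & 0 & \times & \times & \times & \times \\
 & & \times & \times & \times & \times \\
 & & & \times & \times & \times \\
 & & & & \times & \times
\end{pmatrix}.
\]
After $n - 1$ steps we have an upper triangular matrix:
\[
R = G_{n-1}^T \cdots G_1^T (H^{(0)} - \tau I) =
\begin{pmatrix}
\times & \times & \times & \times & \times & \times \\
0 & \times & \times & \times & \times & \times \\
 & 0 & \times & \times & \times & \times \\
 & & 0 & \times & \times & \times \\
 & & & 0 & \times & \times \\
 & & & & 0 & \times
\end{pmatrix}.
\]
We then apply the rotations from the right, $R G_1 \cdots G_{n-1}$, i.e., we start with a transformation involving the first two columns. Then follows a rotation involving the second and third columns. The result after two steps is
\[
\begin{pmatrix}
\times & \times & \times & \times & \times & \times \\
+ & \times & \times & \times & \times & \times \\
 & + & \times & \times & \times & \times \\
 & & & \times & \times & \times \\
 & & & & \times & \times \\
 & & & & & \times
\end{pmatrix}.
\]
We see that the zeroes that we introduced in the transformations from the left are systematically filled in. After $n - 1$ steps we have
\[
H^{(1)} = R G_1 G_2 \cdots G_{n-1} + \tau I =
\begin{pmatrix}
\times & \times & \times & \times & \times & \times \\
+ & \times & \times & \times & \times & \times \\
 & + & \times & \times & \times & \times \\
 & & + & \times & \times & \times \\
 & & & + & \times & \times \\
 & & & & + & \times
\end{pmatrix}.
\]
But we have made a similarity transformation: with $Q = G_1 G_2 \cdots G_{n-1}$ and using $R = Q^T (H^{(0)} - \tau I)$, we can write
\[ H^{(1)} = RQ + \tau I = Q^T (H^{(0)} - \tau I) Q + \tau I = Q^T H^{(0)} Q, \tag{15.13} \]
and we know that $H^{(1)}$ has the same eigenvalues as $H^{(0)}$.
The convergence properties of the nonsymmetric QR algorithm are almost as nice as those of its symmetric counterpart [93, Chapter 2].
Proposition 15.16. The nonsymmetric QR algorithm with Wilkinson shifts converges quadratically toward the Schur decomposition.
As in the symmetric case there are numerous refinements of the algorithm sketched above; see, e.g., [42, Chapter 7], [93, Chapter 2].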
In particular, one usually uses implicit double shifts to avoid complex arithmetic.
Given the eigenvalues, selected eigenvectors can be computed by inverse iteration with the upper Hessenberg matrix and the computed eigenvalues as shifts.
15.7 Sparse Matrices
In many applications, a very small proportion of the elements of a matrix are nonzero. Then the matrix is called sparse. It is quite common that less than 1% of the matrix elements are nonzero.
In the numerical solution of an eigenvalue problem for a sparse matrix, usually an iterative method is employed. This is because the transformations to compact form described in Section 15.3 would completely destroy the sparsity, which leads to excessive storage requirements. In addition, the computational complexity of the reduction to compact form is often much too high.
In Sections 15.2 and 15.8 we describe a couple of methods for solving numerically the eigenvalue (and singular value) problem for a large sparse matrix. Here we give a brief description of one possible method for storing a sparse matrix.
To take advantage of sparseness of the matrix, only the nonzero elements should be stored. We describe briefly one storage scheme for sparse matrices, compressed row storage.
Example 15.17. Let
\[
A =
\begin{pmatrix}
0.6667 & 0 & 0 & 0.2887 \\
0 & 0.7071 & 0.4082 & 0.2887 \\
0.3333 & 0 & 0.4082 & 0.2887 \\
0.6667 & 0 & 0 & 0
\end{pmatrix}.
\]
In compressed row storage, the nonzero entries are stored in a vector, here called val (we round the elements in the table to save space here), along with the corresponding column indices in a vector colind of equal length:
    val      0.67  0.29  0.71  0.41  0.29  0.33  0.41  0.29  0.67
    colind   1     4     2     3     4     1     3     4     1
    rowptr   1     3     6     9     10
The vector rowptr points to the positions in val that are occupied by the first element in each row.
The compressed row storage scheme is convenient for multiplying $y = Ax$.
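The three vectors of Example 15.17 can be reproduced with a short loop. The following is only a sketch of how compressed row storage might be built from a full matrix; the variable names follow the example, while the loop itself and the test vector x are assumptions made for illustration. The last lines form the product row by row as a consistency check.

```matlab
% Build compressed row storage (val, colind, rowptr) for the matrix of
% Example 15.17. Sketch for illustration; not an efficient implementation.
A = [0.6667 0      0      0.2887
     0      0.7071 0.4082 0.2887
     0.3333 0      0.4082 0.2887
     0.6667 0      0      0     ];
[m,n] = size(A);
val = [];  colind = [];  rowptr = zeros(1,m+1);
for i = 1:m
    rowptr(i) = length(val) + 1;      % position in val of the first element of row i
    j = find(A(i,:));                 % column indices of the nonzeros in row i
    val = [val A(i,j)];
    colind = [colind j];
end
rowptr(m+1) = length(val) + 1;        % position after the end of val

% Consistency check: row-by-row product using only the three vectors
x = (1:n)';                           % arbitrary test vector (assumption)
y = zeros(m,1);
for i = 1:m
    idx = rowptr(i):rowptr(i+1)-1;
    y(i) = val(idx)*x(colind(idx));
end
disp(colind), disp(rowptr), disp(norm(y - A*x))
```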
The extra entry in the rowptr vector that points to the (nonexistent) position after the end of the val vector is used to make the code for multiplying $y = Ax$ simple.
Multiplication y = Ax for sparse A
    function y=Ax(val,colind,rowptr,x)
    % Compute y = A * x, with A in compressed row storage
    m=length(rowptr)-1;
    for i=1:m
        a=val(rowptr(i):rowptr(i+1)-1);
        y(i)=a*x(colind(rowptr(i):rowptr(i+1)-1));
    end
    y=y';
It can be seen that compressed row storage is inconvenient for multiplying $y = A^T z$. However, there is an analogous compressed column storage scheme that, naturally, is well suited for this.
Compressed row (column) storage for sparse matrices is relevant in programming languages like Fortran and C, where the programmer must handle the sparse storage explicitly [27]. MATLAB has a built-in storage scheme for sparse matrices, with overloaded matrix operations. For instance, for a sparse matrix A, the MATLAB statement y=A*x implements sparse matrix-vector multiplication, and internally MATLAB executes a code analogous to the one above.
In a particular application, different sparse matrix storage schemes can influence the performance of matrix operations, depending on the structure of the matrix. In [39], a comparison is made of sparse matrix algorithms for information retrieval.
15.8 The Arnoldi and Lanczos Methods
The QR method can be used to compute the eigenvalue and singular value decompositions of medium-size matrices. (What a medium-size matrix is depends on the available computing power.) Often in data mining and pattern recognition the matrices are very large and sparse. However, the eigenvalue, singular value, and Schur decompositions of sparse matrices are usually dense: almost all elements are nonzero.
Example 15.18.
The Schur decomposition of the link graph matrix in Example 1.3,
\[
P =
\begin{pmatrix}
0 & \frac{1}{3} & 0 & 0 & 0 & 0 \\
\frac{1}{3} & 0 & 0 & 0 & 0 & 0 \\
0 & \frac{1}{3} & 0 & 0 & \frac{1}{3} & \frac{1}{2} \\
\frac{1}{3} & 0 & 0 & 0 & \frac{1}{3} & 0 \\
\frac{1}{3} & \frac{1}{3} & 0 & 0 & 0 & \frac{1}{2} \\
0 & 0 & 1 & 0 & \frac{1}{3} & 0
\end{pmatrix},
\]
was computed in MATLAB: [U,R]=schur(P), with the result
    U =
       -0.0000   -0.4680   -0.0722   -0.0530    0.8792   -0.0000
       -0.0000   -0.4680   -0.0722   -0.3576   -0.2766   -0.7559
       -0.5394    0.0161    0.3910    0.6378    0.0791   -0.3780
       -0.1434   -0.6458   -0.3765    0.3934   -0.3509    0.3780
       -0.3960   -0.2741    0.6232   -0.4708   -0.1231    0.3780
       -0.7292    0.2639   -0.5537   -0.2934    0.0773   -0.0000
    R =
        0.9207    0.2239   -0.2840    0.0148   -0.1078    0.3334
             0    0.3333    0.1495    0.3746   -0.3139    0.0371
             0         0   -0.6361   -0.5327   -0.0181   -0.0960
             0         0         0   -0.3333   -0.1850    0.1751
             0         0         0         0   -0.2846   -0.2642
             0         0         0         0         0    0.0000
We see that almost all elements of the orthogonal matrix are nonzero.
Therefore, since the storage requirements become prohibitive, it is usually out of the question to use the QR method. Instead one uses methods that do not transform the matrix itself but rather use it as an operator, i.e., to compute matrix-vector products $y = Ax$. We have already described one such method in Section 15.2, the power method, which can be used to compute the largest eigenvalue and the corresponding eigenvector. Essentially, in the power method we compute a sequence of vectors, $Ax_0, A^2x_0, A^3x_0, \ldots$, that converges toward the eigenvector. However, as soon as we have computed one new power, i.e., we have gone from $y_{k-1} = A^{k-1}x$ to $y_k = A^k x$, we throw away $y_{k-1}$ and all the information that was contained in the earlier approximations of the eigenvector.
The idea in a Krylov subspace method is to use the information in the sequence of vectors $x_0, Ax_0, A^2x_0, \ldots, A^{k-1}x_0$, organized in a subspace, the Krylov subspace,
\[ \mathcal{K}_k(A, x_0) = \mathrm{span}\{x_0, Ax_0, A^2x_0, \ldots, A^{k-1}x_0\}, \]
and to extract as good an approximation of the eigenvector as possible from this subspace.
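The difference between keeping only the last vector and keeping the whole subspace can be illustrated numerically. The sketch below rests on assumptions that are not from the text: a symmetric sparse test matrix with known largest eigenvalue, an arbitrary starting vector, and a plain QR factorization to obtain an orthonormal basis of the Krylov subspace (a numerically naive construction, used here only for illustration). The eigenvalue estimate from the subspace is taken as the largest eigenvalue of the projected matrix and compared with the Rayleigh quotient of the last power iterate.

```matlab
% Power method versus the Krylov subspace spanned by the same vectors (sketch).
% ASSUMED test problem: symmetric second-difference matrix of order n.
n = 100;  k = 10;
e = ones(n,1);
A = spdiags([-e 2*e -e], -1:1, n, n);
lam_max = 2 - 2*cos(n*pi/(n+1));          % largest eigenvalue, known in closed form
x0 = (1:n)';  x0 = x0/norm(x0);           % arbitrary starting vector (assumption)
K = zeros(n,k);  K(:,1) = x0;
for j = 2:k
    v = A*K(:,j-1);
    K(:,j) = v/norm(v);                   % scaling does not change the subspace
end
[Q,~] = qr(K,0);                          % orthonormal basis of the Krylov subspace
M = Q'*A*Q;  M = (M + M')/2;              % projected matrix, symmetrized against rounding
theta = max(eig(M));                      % best estimate available in the subspace
xp = K(:,k);                              % last power iterate (normalized)
rho = xp'*A*xp;                           % its Rayleigh quotient
fprintf('error, power method: %.2e   error, Krylov subspace: %.2e\n', ...
        lam_max - rho, lam_max - theta);
```

Since the last power iterate lies in the subspace, the subspace estimate is never worse than the Rayleigh quotient of that single vector.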
In Chapter 7 we have already described the Lanczos bidiagonalization method, which is a Krylov subspace method that can be used for solving least squares problems approximately. In Section 15.8.3 we will show that it can also be used for computing an approximation of some of the singular values and vectors of a matrix. But first we present the Arnoldi method and its application to the problem of computing a partial Schur decomposition of a large and sparse matrix.
15.8.1 The Arnoldi Method and the Schur Decomposition
Assume that $A \in \mathbb{R}^{n \times n}$ is large, sparse, and nonsymmetric and that we want to compute the Schur decomposition (Theorem 15.5) $A = URU^T$, where $U$ is orthogonal and $R$ is upper triangular. Our derivation of the Arnoldi method will be analogous to that in Chapter 7 of the LGK bidiagonalization method. Thus we will start from the existence of an orthogonal similarity reduction to upper Hessenberg form (here $n = 6$):
\[
V^T A V = H =
\begin{pmatrix}
\times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times \\
0 & \times & \times & \times & \times & \times \\
0 & 0 & \times & \times & \times & \times \\
0 & 0 & 0 & \times & \times & \times \\
0 & 0 & 0 & 0 & \times & \times
\end{pmatrix}.
\tag{15.14}
\]
In principle this can be computed using Householder transformations as in Section 15.3, but since $A$ is sparse, this would cause the fill-in of the zero elements. Instead we will show that columns of $V$ and $H$ can be computed in a recursive way, using only matrix-vector products (as in the LGK bidiagonalization method).
Rewriting (15.14) in the form
\[
AV = \begin{pmatrix} Av_1 & Av_2 & \ldots & Av_j & \cdots \end{pmatrix}
\tag{15.15}
\]
\[
= \begin{pmatrix} v_1 & v_2 & \ldots & v_j & v_{j+1} & \cdots \end{pmatrix}
\begin{pmatrix}
h_{11} & h_{12} & \cdots & h_{1j} & \cdots \\
h_{21} & h_{22} & \cdots & h_{2j} & \cdots \\
 & h_{32} & \cdots & h_{3j} & \cdots \\
 & & \ddots & \vdots & \\
 & & & h_{j+1,j} & \\
 & & & & \ddots
\end{pmatrix}
\tag{15.16}
\]
and reading off the columns one by one, we see that the first is
\[ Av_1 = h_{11}v_1 + h_{21}v_2, \]
and it can be written in the form
\[ h_{21}v_2 = Av_1 - h_{11}v_1. \]
Therefore, since $v_1$ and $v_2$ are orthogonal, we have $h_{11} = v_1^T A v_1$, and $h_{21}$ is determined from the requirement that $v_2$ has Euclidean length 1.
Similarly, the $j$th column of (15.15)–(15.16) is
\[ Av_j = \sum_{i=1}^{j} h_{ij}v_i + h_{j+1,j}v_{j+1}, \]
which can be written
\[ h_{j+1,j}v_{j+1} = Av_j - \sum_{i=1}^{j} h_{ij}v_i. \tag{15.17} \]
Now, with $v_1, v_2, \ldots, v_j$ given, we can compute $v_{j+1}$ from (15.17) if we prescribe that it is orthogonal to the previous vectors. This gives the equations
\[ h_{ij} = v_i^T A v_j, \qquad i = 1, 2, \ldots, j. \]
The element $h_{j+1,j}$ is obtained from the requirement that $v_{j+1}$ has length 1. Thus we can compute the columns of $V$ and $H$ using the following recursion:
Arnoldi method
1. Starting vector $v_1$, satisfying $\|v_1\|_2 = 1$.
2. for $j = 1, 2, \ldots$
   (a) $h_{ij} = v_i^T A v_j$, $i = 1, 2, \ldots, j$.
   (b) $v = Av_j - \sum_{i=1}^{j} h_{ij}v_i$.
   (c) $h_{j+1,j} = \|v\|_2$.
   (d) $v_{j+1} = (1/h_{j+1,j})\, v$.
3. end
Obviously, in step $j$ only one matrix-vector product $Av_j$ is needed.
For a large sparse matrix it is out of the question, mainly for storage reasons, to perform many steps in the recursion. Assume that $k$ steps have been performed, where $k \ll n$, and define
\[
V_k = \begin{pmatrix} v_1 & v_2 & \cdots & v_k \end{pmatrix}, \qquad
H_k =
\begin{pmatrix}
h_{11} & h_{12} & \cdots & h_{1,k-1} & h_{1k} \\
h_{21} & h_{22} & \cdots & h_{2,k-1} & h_{2k} \\
 & h_{32} & \cdots & h_{3,k-1} & h_{3k} \\
 & & \ddots & \vdots & \vdots \\
 & & & h_{k,k-1} & h_{kk}
\end{pmatrix} \in \mathbb{R}^{k \times k}.
\]
We can now write the first $k$ steps of the recursion in matrix form:
\[ AV_k = V_k H_k + h_{k+1,k} v_{k+1} e_k^T, \tag{15.18} \]
where $e_k^T = \begin{pmatrix} 0 & 0 & \ldots & 0 & 1 \end{pmatrix} \in \mathbb{R}^{1 \times k}$. This is called the Arnoldi decomposition.
After $k$ steps of the recursion we have performed $k$ matrix-vector multiplications. The following proposition shows that we have retained all the information produced during those steps (in contrast to the power method).
Proposition 15.19. The vectors $v_1, v_2, \ldots, v_k$ are an orthonormal basis in the Krylov subspace $\mathcal{K}_k(A, v_1) = \mathrm{span}\{v_1, Av_1, \ldots, A^{k-1}v_1\}$.
Proof. The orthogonality of the vectors follows by construction (or is verified by direct computation). The second part can be proved by induction.
The question now arises of how well we can approximate eigenvalues and eigenvectors from the Krylov subspace.
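Before turning to that question, it may help to see the recursion in executable form. The following MATLAB sketch follows steps (a) through (d) of the Arnoldi method literally and then checks the Arnoldi decomposition (15.18) numerically. The sparse nonsymmetric test matrix, the starting vector, and the number of steps are assumptions chosen for illustration; no safeguard against breakdown ($h_{j+1,j} = 0$) or loss of orthogonality is included.

```matlab
% Arnoldi recursion, k steps, with a numerical check of (15.18). Sketch only.
% ASSUMED test problem: a sparse nonsymmetric tridiagonal matrix.
n = 200;  k = 20;
e = ones(n,1);
A = spdiags([-e 2*e -0.5*e], -1:1, n, n);
V = zeros(n,k+1);  H = zeros(k+1,k);
V(:,1) = e/sqrt(n);                        % starting vector with unit norm
for j = 1:k
    w = A*V(:,j);                          % the only matrix-vector product in step j
    H(1:j,j) = V(:,1:j)'*w;                % (a) h_ij = v_i'*A*v_j, i = 1,...,j
    w = w - V(:,1:j)*H(1:j,j);             % (b)
    H(j+1,j) = norm(w);                    % (c)
    V(:,j+1) = w/H(j+1,j);                 % (d)
end
Vk = V(:,1:k);  Hk = H(1:k,1:k);
ek = zeros(k,1);  ek(k) = 1;
res = norm(A*Vk - Vk*Hk - H(k+1,k)*V(:,k+1)*ek');
fprintf('residual of the Arnoldi decomposition: %.2e\n', res);
fprintf('||Vk''*Vk - I|| = %.2e\n', norm(Vk'*Vk - eye(k)));
```

The matrix A enters only through the products A*V(:,j), so any routine that applies the operator to a vector could take its place.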
Note that if $Z_k$ were an eigenspace (see (15.4)), then we would have $AZ_k = Z_k M$ for some matrix $M \in \mathbb{R}^{k \times k}$. Therefore, to see how much $V_k$ deviates from being an eigenspace, we can check how large the residual $AV_k - V_k M$ is for some matrix $M$. Luckily, there is a recipe for choosing the optimal $M$ for any given $V_k$.
Theorem 15.20. Let $V_k \in \mathbb{R}^{n \times k}$ have orthonormal columns, and define $R(M) = AV_k - V_k M$, where $M \in \mathbb{R}^{k \times k}$. Then
\[ \min_M \|R(M)\|_F = \min_M \|AV_k - V_k M\|_F \]
has the solution $M = V_k^T A V_k$.
Proof. See, e.g., [93, Theorem 4.2.6].
From the Arnoldi decomposition (15.18) we immediately get the optimal matrix
\[ M = V_k^T (V_k H_k + h_{k+1,k} v_{k+1} e_k^T) = H_k, \]
because $v_{k+1}$ is orthogonal to the previous vectors. It follows, again from the Arnoldi decomposition, that the optimal residual is given by
\[ \min_M \|R(M)\|_F = \|AV_k - V_k H_k\|_F = |h_{k+1,k}|, \]
so the residual norm comes for free in the Arnoldi recursion.
Assuming that $V_k$ is a good enough approximation of an eigenspace, how can we compute an approximate partial Schur decomposition $AU_k = U_k R_k$? Let
\[ H_k = Z_k \hat{R}_k Z_k^T \]
be the Schur decomposition of $H_k$. Then, from $AV_k \approx V_k H_k$ we get the approximate partial Schur decomposition
\[ A\hat{U}_k \approx \hat{U}_k \hat{R}_k, \qquad \hat{U}_k = V_k Z_k. \]
It follows that the eigenvalues of $\hat{R}_k$ are approximations of the eigenvalues of $A$.
Example 15.21. We computed the largest eigenvalue of the matrix $A_{100}$ defined in Example 15.10 using the power method and the Arnoldi method. The errors in the approximation of the eigenvalue are given in Figure 15.3. It is seen that the Krylov subspace holds much more information about the eigenvalue than is carried by the only vector in the power method.
The basic Arnoldi method sketched above has two problems, both of which can be dealt with efficiently:
• In exact arithmetic the $v_j$ vectors are orthogonal, but in floating point arithmetic orthogonality is lost as the iterations proceed.
  Orthogonality is repaired by explicitly reorthogonalizing the vectors. This can be done in every step of the algorithm or selectively, when nonorthogonality has been detected.
• The amount of work and the storage requirements increase as the iterations proceed, and one may run out of memory before sufficiently good approximations have been computed. This can be remedied by restarting the Arnoldi procedure. A method has been developed that, given an Arnoldi decomposition of dimension $k$, reduces it to an Arnoldi decomposition of smaller dimension $k_0$, and in this reduction purges unwanted eigenvalues. This implicitly restarted Arnoldi method [64] has been implemented in the MATLAB function eigs.
Figure 15.3. The power and Arnoldi methods for computing the largest eigenvalue of $A_{100}$. The relative error in the eigenvalue approximation for the power method (dash-dotted line) and the Arnoldi method (dash-× line). (Horizontal axis: iteration, 0 to 15; logarithmic vertical axis, $10^{-3}$ to $10^{0}$.)
15.8.2 Lanczos Tridiagonalization
If the Arnoldi procedure is applied to a symmetric matrix $A$, then, due to symmetry, the upper Hessenberg matrix $H_k$ becomes tridiagonal. A more economical symmetric version of the algorithm can be derived, starting from an orthogonal tridiagonalization (15.5) of $A$, which we write in the form
\[
AV = \begin{pmatrix} Av_1 & Av_2 & \ldots & Av_n \end{pmatrix} = VT
= \begin{pmatrix} v_1 & v_2 & \ldots & v_n \end{pmatrix}
\begin{pmatrix}
\alpha_1 & \beta_1 & & & & \\
\beta_1 & \alpha_2 & \beta_2 & & & \\
 & \beta_2 & \alpha_3 & \beta_3 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & \beta_{n-2} & \alpha_{n-1} & \beta_{n-1} \\
 & & & & \beta_{n-1} & \alpha_n
\end{pmatrix}.
\]
By identifying column $j$ on the left- and right-hand sides and rearranging the equation, we get
\[ \beta_j v_{j+1} = Av_j - \alpha_j v_j - \beta_{j-1} v_{j-1}, \]
and we can use this in a recursive reformulation of the equation $AV = VT$. The coefficients $\alpha_j$ and $\beta_j$ are determined from the requirements that the vectors are orthogonal and normalized.
Below we give a basic version of the Lanczos tridiagonalization method that generates a Lanczos decomposition,
\[ AV_k = V_k T_k + \beta_k v_{k+1} e_k^T, \]
where $T_k$ consists of the $k$ first rows and columns of $T$.
Lanczos tridiagonalization
1. Put $\beta_0 = 0$ and $v_0 = 0$, and choose a starting vector $v_1$, satisfying $\|v_1\|_2 = 1$.
2. for $j = 1, 2, \ldots$
   (a) $\alpha_j = v_j^T A v_j$.
   (b) $v = Av_j - \alpha_j v_j - \beta_{j-1} v_{j-1}$.
   (c) $\beta_j = \|v\|_2$.
   (d) $v_{j+1} = (1/\beta_j)\, v$.
3. end
Again, in the recursion the matrix $A$ is not transformed but is used only in matrix-vector multiplications, and in each iteration only one matrix-vector product need be computed. This basic Lanczos tridiagonalization procedure suffers from the same deficiencies as the basic Arnoldi procedure, and the problems can be solved using the same methods.
The MATLAB function eigs checks if the matrix is symmetric, and if this is the case, then the implicitly restarted Lanczos tridiagonalization method is used.
15.8.3 Computing a Sparse SVD
The LGK bidiagonalization method was originally formulated [41] for the computation of the SVD. It can be used for computing a partial bidiagonalization (7.11),
\[ AZ_k = P_{k+1} B_{k+1}, \]
where $B_{k+1}$ is bidiagonal, and the columns of $Z_k$ and $P_{k+1}$ are orthonormal. Based on this decomposition, approximations of the singular values and the singular vectors can be computed in a similar way as using the tridiagonalization in the preceding section. In fact, it can be proved (see, e.g., [4, Chapter 6.3.3]) that the LGK bidiagonalization procedure is equivalent to applying Lanczos tridiagonalization to the symmetric matrix
\[ \begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}, \tag{15.19} \]
with a particular starting vector, and therefore implicit restarts can be applied.
The MATLAB function svds implements the Lanczos tridiagonalization method for the matrix (15.19), with implicit restarts.
15.9 Software
A rather common mistake in many areas of computing is to underestimate the costs of developing software.
Therefore, it would be very unwise not to take advantage of existing software, especially when it is developed by world experts and is available free of charge.

15.9.1 LAPACK

LAPACK is a linear algebra package that can be accessed and downloaded from Netlib at http://www.netlib.org/lapack/. We quote from the Web page description:

"LAPACK is written in Fortran 77 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.

"LAPACK routines are written so that as much as possible of the computation is performed by calls to the Basic Linear Algebra Subprograms (BLAS). . . . Highly efficient machine-specific implementations of the BLAS are available for many modern high-performance computers. . . . Alternatively, the user can download ATLAS to automatically generate an optimized BLAS library for the architecture."

The basic dense matrix functions in MATLAB are built on LAPACK.

Alternative language interfaces to LAPACK (or translations/conversions of LAPACK) are available in Fortran 95, C, C++, and Java.

ScaLAPACK, a parallel version of LAPACK, is also available from Netlib at http://www.netlib.org/scalapack/. This package is designed for message passing parallel computers and can be used on any system that supports MPI.
15.9.2 Software for Sparse Matrices

As mentioned earlier, the eigenvalue and singular value functions in MATLAB are based on the Lanczos and Arnoldi methods with implicit restarts [64]. These algorithms are taken from ARPACK at http://www.caam.rice.edu/software/ARPACK/. From the Web page:

"The package is designed to compute a few eigenvalues and corresponding eigenvectors of a general $n$ by $n$ matrix $A$. It is most appropriate for large sparse or structured matrices $A$ where structured means that a matrix-vector product $w \leftarrow Av$ requires order $n$ rather than the usual order $n^2$ floating point operations."

An overview of algorithms and software for eigenvalue and singular value computations can be found in the book [4]. Additional software for dense and sparse matrix computations can be found at http://www.netlib.org/linalg/.

15.9.3 Programming Environments

We used MATLAB in this book as a vehicle for describing algorithms. Among other commercially available software systems, we would like to mention Mathematica, and statistics packages like SAS® and SPSS®,⁴¹ which have facilities for matrix computations and data and text mining.

⁴¹ http://www.wolfram.com/, http://www.sas.com/, and http://www.spss.com/.

Bibliography

[1] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen. LAPACK Users’ Guide, 3rd ed. SIAM, Philadelphia, 1999.
[2] ANSI/IEEE 754. Binary Floating Point Arithmetic. IEEE, New York, 1985.
[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, Addison-Wesley, New York, 1999.
[4] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, eds. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM, Philadelphia, 2000.
[5] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H.
van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia, 1994.
[6] B. Bergeron. Bioinformatics Computing. Prentice–Hall, New York, 2002.
[7] P. Berkhin. A survey on PageRank computing. Internet Math., 2:73–120, 2005.
[8] M. Berry and M. Browne. Email surveillance using non-negative matrix factorization. Comput. Math. Organization Theory, 11:249–264, 2005.
[9] M. W. Berry, S. T. Dumais, and G. W. O’Brien. Using linear algebra for intelligent information retrieval. SIAM Rev., 37:573–595, 1995.
[10] M. J. A. Berry and G. Linoff. Mastering Data Mining. The Art and Science of Customer Relationship Management. John Wiley, New York, 2000.
[11] M. W. Berry, ed. Computational Information Retrieval. SIAM, Philadelphia, 2001.
[12] M. W. Berry and M. Browne. Understanding Search Engines. Mathematical Modeling and Text Retrieval, 2nd ed. SIAM, Philadelphia, 2005.
[13] M. W. Berry, M. Browne, A. Langville, V. P. Pauca, and R. J. Plemmons. Algorithms and Applications for Approximate Nonnegative Matrix Factorization. Technical report, Department of Computer Science, University of Tennessee, 2006.
[14] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.
[15] Å. Björck. The calculation of least squares problems. Acta Numer., 13:1–51, 2004.
[16] K. Blom and A. Ruhe. A Krylov subspace method for information retrieval. SIAM J. Matrix Anal. Appl., 26:566–582, 2005.
[17] V. D. Blondel, A. Gajardo, M. Heymans, P. Senellart, and P. Van Dooren. A measure of similarity between graph vertices: Applications to synonym extraction and web searching. SIAM Rev., 46:647–666, 2004.
[18] C. Boutsidis and E. Gallopoulos. On SVD-Based Initialization for Nonnegative Matrix Factorization. Technical Report HPCLAB-SCG-6/08-05, University of Patras, Patras, Greece, 2005.
[19] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine.
Comput. Networks ISDN Syst., 30:107–117, 1998.
[20] J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. PNAS, 101:4164–4169, 2004.
[21] M. C. Burl, L. Asker, P. Smyth, U. Fayyad, P. Perona, L. Crumpler, and J. Aubele. Learning to recognize volcanoes on Venus. Machine Learning, 30:165–195, 1998.
[22] P. A. Businger and G. H. Golub. Linear least squares solutions by Householder transformations. Numer. Math., 7:269–276, 1965.
[23] R. Chellappa, C. L. Wilson, and S. Sirohey. Human and machine recognition of faces: A survey. Proc. IEEE, 83:705–740, 1995.
[24] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, London, 2000.
[25] K. J. Cios, W. Pedrycz, and R. W. Swiniarski. Data Mining. Methods for Knowledge Discovery. Kluwer, Boston, 1998.
[26] J. M. Conroy, J. D. Schlesinger, D. P. O’Leary, and J. Goldstein. Back to basics: CLASSY 2006. In DUC 02 Conference Proceedings, 2006. Available at http://duc.nist.gov/pubs.html.
[27] T. A. Davis. Direct Methods for Sparse Linear Systems. Fundamentals of Algorithms 2. SIAM, Philadelphia, 2006.
[28] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci., 41:391–407, 1990.
[29] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[30] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42:143–175, 2001.
[31] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, New York, 2001.
[32] L. Eldén. Partial least squares vs. Lanczos bidiagonalization I: Analysis of a projection method for multiple regression. Comput. Statist. Data Anal., 46:11–31, 2004.
[33] L. Eldén. Numerical linear algebra in data mining. Acta Numer., 15:327–384, 2006.
[34] L. Eldén, L.
Wittmeyer-Koch, and H. Bruun Nielsen. Introduction to Numerical Computation—Analysis and MATLAB Illustrations. Studentlitteratur, Lund, 2004.
[35] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds. Advances in Knowledge Discovery and Data Mining. AAAI Press/The MIT Press, Menlo Park, CA, 1996.
[36] J. H. Fowler and S. Jeon. The Authority of Supreme Court Precedent: A Network Analysis. Technical report, Department of Political Science, University of California, Davis, 2005.
[37] Y. Gao and G. Church. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinform., 21:3970–3975, 2005.
[38] J. T. Giles, L. Wo, and M. W. Berry. GTP (General Text Parser) software for text mining. In Statistical Data Mining and Knowledge Discovery, H. Bozdogan, ed., CRC Press, Boca Raton, FL, 2003, pp. 455–471.
[39] N. Goharian, A. Jain, and Q. Sun. Comparative analysis of sparse matrix algorithms for information retrieval. J. System. Cybernet. Inform., 1, 2003.
[40] G. H. Golub and C. Greif. An Arnoldi-type algorithm for computing pagerank. BIT, 46:759–771, 2006.
[41] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. SIAM J. Numer. Anal. Ser. B, 2:205–224, 1965.
[42] G. H. Golub and C. F. Van Loan. Matrix Computations, 3rd ed. Johns Hopkins Press, Baltimore, 1996.
[43] D. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. Kluwer, Boston, 1998.
[44] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proc., 30th International Conference on Very Large Databases, Morgan Kaufmann, 2004, pp. 576–587.
[45] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2001.
[46] D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, Cambridge, MA, 2001.
[47] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
Data Mining, Inference and Prediction. Springer, New York, 2001.
[48] T. H. Haveliwala and S. D. Kamvar. An Analytical Comparison of Approaches to Personalizing PageRank. Technical report, Computer Science Department, Stanford University, Stanford, CA, 2003.
[49] M. Hegland. Data mining techniques. Acta Numer., 10:313–355, 2001.
[50] N. J. Higham. Accuracy and Stability of Numerical Algorithms, 2nd ed. SIAM, Philadelphia, 2002.
[51] I. C. F. Ipsen and S. Kirkland. Convergence analysis of a PageRank updating algorithm by Langville and Meyer. SIAM J. Matrix Anal. Appl., 27:952–967, 2006.
[52] E. R. Jessup and J. H. Martin. Taking a new look at the latent semantic analysis approach to information retrieval. In Computational Information Retrieval, M. W. Berry, ed., SIAM, Philadelphia, 2001, pp. 121–144.
[53] S. D. Kamvar, T. H. Haveliwala, and G. H. Golub. Adaptive methods for the computation of pagerank. Linear Algebra Appl., 386:51–65, 2003.
[54] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Exploiting the Block Structure of the Web for Computing PageRank. Technical report, Computer Science Department, Stanford University, Stanford, CA, 2003.
[55] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Extrapolation methods for accelerating PageRank computations. In Proc., 12th International World Wide Web Conference, Budapest, 2003, pp. 261–270.
[56] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. Assoc. Comput. Mach., 46:604–632, 1999.
[57] A. N. Langville and C. D. Meyer. Deeper inside PageRank. Internet Math., 1:335–380, 2005.
[58] A. N. Langville and C. D. Meyer. A survey of eigenvector methods for web information retrieval. SIAM Rev., 47:135–161, 2005.
[59] A. N. Langville and C. D. Meyer. Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ, 2006.
[60] L. De Lathauwer, B. De Moor, and J. Vandewalle.
A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21:1253–1278, 2000.
[61] C. L. Lawson and R. J. Hanson. Solving Least Squares Problems. Classics in Appl. Math. 15. SIAM, Philadelphia, 1995. Revised republication of work first published in 1974 by Prentice–Hall.
[62] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86:2278–2324, Nov. 1998.
[63] D. Lee and H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, Oct. 1999.
[64] R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK Users’ Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, Philadelphia, 1998.
[65] R. Lempel and S. Moran. Salsa: The stochastic approach for link-structure analysis. ACM Trans. Inform. Syst., 19:131–160, 2001.
[66] O. Mangasarian and W. Wolberg. Cancer diagnosis via linear programming. SIAM News, 23:1,18, 1990.
[67] I. Mani. Automatic Summarization. John Benjamins, Amsterdam, 2001.
[68] Matlab User’s Guide. Mathworks, Inc., Natick, MA, 1996.
[69] J. Mena. Data Mining Your Website. Digital Press, Boston, 1999.
[70] C. D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, Philadelphia, 2000.
[71] C. Moler. The world’s largest matrix computation. Matlab News and Notes, Oct. 2002, pp. 12–13.
[72] J. L. Morrison, R. Breitling, D. J. Higham, and D. R. Gilbert. Generank: Using search engine technology for the analysis of microarray experiment. BMC Bioinform., 6:233, 2005.
[73] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994.
[74] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Working Papers, Stanford, CA, 1998.
[75] C. C. Paige and M. Saunders.
LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Software, 8:43–71, 1982.
[76] H. Park, M. Jeon, and J. Ben Rosen. Lower dimensional representation of text data in vector space based information retrieval. In Computational Information Retrieval, M. W. Berry, ed., SIAM, Philadelphia, 2001, pp. 3–23.
[77] H. Park, M. Jeon, and J. B. Rosen. Lower dimensional representation of text data based on centroids and least squares. BIT, 43:427–448, 2003.
[78] V. P. Pauca, J. Piper, and R. Plemmons. Nonnegative matrix factorization for spectral data analysis. Linear Algebra Appl., 416:29–47, 2006.
[79] Y. Saad. Numerical Methods for Large Eigenvalue Problems. Manchester University Press, Manchester, UK, 1992.
[80] Y. Saad. Iterative Methods for Sparse Linear Systems, 2nd ed. SIAM, Philadelphia, 2003.
[81] G. Salton, C. Yang, and A. Wong. A vector-space model for automatic indexing. Comm. Assoc. Comput. Mach., 18:613–620, 1975.
[82] B. Savas. Analyses and Test of Handwritten Digit Algorithms. Master’s thesis, Mathematics Department, Linköping University, 2002.
[83] J. D. Schlesinger, J. M. Conroy, M. E. Okurowski, H. T. Wilson, D. P. O’Leary, A. Taylor, and J. Hobbs. Understanding machine performance in the context of human performance for multi-document summarization. In DUC 02 Conference Proceedings, 2002. Available at http://duc.nist.gov/pubs.html.
[84] S. Serra-Capizzano. Jordan canonical form of the Google matrix: A potential contribution to the PageRank computation. SIAM J. Matrix Anal. Appl., 27:305–312, 2005.
[85] F. Shahnaz, M. Berry, P. Pauca, and R. Plemmons. Document clustering using nonnegative matrix factorization. J. Inform. Proc. Management, 42:373–386, 2006.
[86] P. Simard, Y. Le Cun, and J. S. Denker. Efficient pattern recognition using a new transformation distance. In Advances in Neural Information Processing Systems 5, J. D. Cowan, S. J. Hanson, and C. L. Giles, eds., Morgan Kaufmann, San Francisco, 1993, pp.
50–58.
[87] P. Y. Simard, Y. A. Le Cun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition—tangent distance and tangent propagation. Internat. J. Imaging System Tech., 11:181–194, 2001.
[88] L. Sirovich and M. Kirby. Low dimensional procedures for the characterization of human faces. J. Optical Soc. Amer. A, 4:519–524, 1987.
[89] M. Sjöström and S. Wold. SIMCA: A pattern recognition method based on principal component models. In Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal, eds., North-Holland, Amsterdam, 1980, pp. 351–359.
[90] P. Smaragdis and J. Brown. Non-negative matrix factorization for polyphonic music transcription. In Proc., IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003, pp. 177–180.
[91] A. Smilde, R. Bro, and P. Geladi. Multi-way Analysis: Applications in the Chemical Sciences. John Wiley, New York, 2004.
[92] G. W. Stewart. Matrix Algorithms: Basic Decompositions. SIAM, Philadelphia, 1998.
[93] G. W. Stewart. Matrix Algorithms Volume II: Eigensystems. SIAM, Philadelphia, 2001.
[94] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, Boston, 1990.
[95] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Comput., 12:1247–1283, 2000.
[96] M. Totty and M. Mangalindan. As Google becomes Web’s gatekeeper, sites fight to get in. Wall Street Journal, 39, Feb. 26, 2003.
[97] L. N. Trefethen and D. B. Bau, III. Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[98] L. R. Tucker. The extension of factor analysis to three-dimensional matrices. In Contributions to Mathematical Psychology, H. Gulliksen and N. Frederiksen, eds., Holt, Rinehart and Winston, New York, 1964, pp. 109–127.
[99] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31:279–311, 1966.
[100] M. A. Turk and A. P. Pentland. Eigenfaces for recognition. J.
Cognitive Neurosci., 3:71–86, 1991.
[101] G. van den Bergen. Collision Detection in Interactive 3D Environments. Morgan Kaufmann, San Francisco, 2004.
[102] M. A. O. Vasilescu. Human motion signatures: Analysis, synthesis, recognition. In Proc., International Conference on Pattern Recognition (ICPR ’02), Quebec City, Canada, 2002.
[103] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: Tensorfaces. In Proc., 7th European Conference on Computer Vision (ECCV ’02), Copenhagen, Denmark, Lecture Notes in Computer Science 2350, Springer-Verlag, New York, 2002, pp. 447–460.
[104] M. A. O. Vasilescu and D. Terzopoulos. Multilinear image analysis for facial recognition. In Proc., International Conference on Pattern Recognition (ICPR ’02), Quebec City, Canada, 2002, pp. 511–514.
[105] M. A. O. Vasilescu and D. Terzopoulos. Multilinear subspace analysis of image ensembles. In Proc., IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’03), Madison, WI, 2003, pp. 93–99.
[106] P. Å. Wedin. Perturbation theory for pseudoinverses. BIT, 13:344–354, 1973.
[107] J. H. Wilkinson. Global convergence of tridiagonal QR algorithm with origin shifts. Linear Algebra Appl., 1:409–420, 1968.
[108] I. H. Witten and E. Frank. Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.
[109] H. Wold. Soft modeling by latent variables: The nonlinear iterative partial least squares approach. In Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett, J. Gani, ed., Academic Press, London, 1975.
[110] S. Wold, A. Ruhe, H. Wold, and W. J. Dunn, III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J. Sci. Stat. Comput., 5:735–743, 1984.
[111] S. Wold, M. Sjöström, and L. Eriksson. PLS-regression: A basic tool of chemometrics. Chemometrics Intell. Lab.
Systems, 58:109–130, 2001.
[112] S. Wolfram. The Mathematica Book, 4th ed. Cambridge University Press, London, 1999.
[113] D. Zeimpekis and E. Gallopoulos. Design of a MATLAB toolbox for term-document matrix generation. In Proc., Workshop on Clustering High Dimensional Data and Its Applications, I. S. Dhillon, J. Kogan, and J. Ghosh, eds., Newport Beach, CA, 2005, pp. 38–48.
[114] H. Zha. Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In Proc., 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002, pp. 113–120.