Talk:Principal components analysis

From Wikipedia, the free encyclopedia

This article is within the scope of WikiProject Statistics, which collaborates to improve Wikipedia's coverage of statistics. If you would like to participate, please visit the project page.

WikiProject Mathematics
This article is within the scope of WikiProject Mathematics, which collaborates on articles related to mathematics.
Mathematics rating: Start Class High Priority  Field: Probability and statistics
One of the 500 most frequently viewed mathematics articles.
Please update this rating as the article progresses, or if the rating is inaccurate. Please also add comments to suggest improvements to the article.
This article may be too technical for a general audience.
Please help improve this article by providing more context and better explanations of technical details to make it more accessible, without removing technical details.

The article seems terribly cluttered. In particular, I dislike the table of symbols. Sboehringer

Split for readability?

Would it be worth putting the algorithm in a separate article, and maintaining this article as a discussion of PCA theory? 134.225.217.52 (talk) —Preceding comment was added at 02:33, 25 April 2008 (UTC)

Question on reduced-space data matrix

The article states: and then obtaining the reduced-space data matrix Y by projecting X down into the reduced space defined by only the first L singular vectors, WL:

\mathbf{Y}=\mathbf{W_L}^T\mathbf{X} = \mathbf{\Sigma_L}\mathbf{V_L}^T

I believe that the correct formula is:

\mathbf{Y}=\mathbf{X}\mathbf{V_L} = \mathbf{W_L}\mathbf{\Sigma_L}

Can anyone verify this? —Preceding unsigned comment added by 216.113.168.141 (talk • contribs)

Afraid not. The way things are set up in the article, the data matrix X, of size M x N, consists of N column vectors, each representing a different sampling event; with each sampling made up of measurements of M different variables, so giving the matrix M different rows.
With the reduced space, we want to find a smaller set of L new variables, which for each sampling preserves as much of the information as possible out of the original M variables.
So we're looking for an L x N matrix, with the same number of columns (the same number of samples), but a smaller number of rows (so each sample is described by fewer variables).
Matrix W_L is an M x L matrix, so W_L Σ_L is also an M x L matrix - not the shape we're looking for. But Σ_L V_L^T is the desired L x N shape.
Hope this helps. -- Jheald 11:02, 17 June 2006 (UTC).
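The shape argument above can be checked numerically. This is only a minimal sketch, assuming NumPy and the article's convention that X is M x N with one column per sample; the sizes, the random data and the names `W_L`, `Y` and `Y_alt` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, L = 5, 40, 2                      # variables, samples, dimensions kept (illustrative)

X = rng.standard_normal((M, N))         # one column per sample
X -= X.mean(axis=1, keepdims=True)      # centre each variable (row)

# Thin SVD: X = W @ diag(s) @ Vt
W, s, Vt = np.linalg.svd(X, full_matrices=False)

W_L = W[:, :L]                          # M x L, first L left singular vectors
Y = W_L.T @ X                           # L x N reduced-space data
Y_alt = np.diag(s[:L]) @ Vt[:L, :]      # Sigma_L V_L^T, also L x N

print(W_L.shape, Y.shape, Y_alt.shape)
```

With samples stored as rows instead, the same projection would read X V_L = W_L Σ_L, which is the formula proposed in the question.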


Yes, that clarifies. Thanks Jheald! I thought that X row vectors were the sampling events and the column vectors were the variables -- since the definition of X is in fact the transpose of what I thought, then everything makes sense. -- 12:33, 26 June 2006

Separate articles on arg max and arg min notations

We probably need a small article on the arg max and arg min notations.

Missing crucial details

The article seems to be missing crucial details. I can't see where the actual dimension reduction is happening. Is the idea that you have several samples of the measurement vector x and you use these to estimate the expectations? 130.188.8.9 16:49, 20 Aug 2003 (UTC)

- There should now be a clue. However, the article still needs work.

Plural versus singular title

Principal components analysis is better known as Principal component analysis (singular). This should be the main title and the plural form a synonym referring to this page (unfortunately I do not know how to do it).

I've always heard it with the plural. I have a PhD in statistics. I'm not saying the singular could never be used, but the plural is certainly the one that's frequently heard. Michael Hardy 21:18, 22 Mar 2004 (UTC)
The only monograph solely dedicated to PCA is, to my knowledge, the one by Jolliffe, and it is titled "Principal component analysis". The naming issue is discussed in its introduction, differently from what you indicate. Then again, naming issues are conventions and vary across the globe. Sboehringer
Google says: "Principal component analysis": 103,000 hits, "Principal components analysis": 46,300 hits. MH 13:48, 25 Mar 2004 (UTC)
I have that monograph and you are correct. It seems, however, that the analysis elucidates the principal components, plural, and so unless one is only interested in one principal component at a time, the plural appears to be more appropriate.
In various scientific papers and books I have seen it spelled Principal Component Analysis. But as long as it refers to the same content, I won't lose any sleep over it.
Until now I have never seen it in the plural, not in scientific papers either. Do you ever write a plural before "analysis"? "Houses analysis", "cars analysis", "components analysis"... I think the plural is wrong, but I'm not a native English speaker. Anoko moonlight 13:21, 30 July 2007 (UTC)
House prices analysis? Car sales analysis? Jheald 14:49, 30 July 2007 (UTC)
Okay, reasons to change it to singular: Primary PCA journal uses singular, more Google hits with singular, more Web of Science hits with singular. Reasons to keep it as plural: some Wikipedia user says "the plural appears to be more appropriate." Well, I think that's enough for a change! 76.69.33.144 (talk) 15:20, 29 February 2008 (UTC)

Article needs serious improvement

Moving Michael Hardy's comments to Talk:

This article needs some serious revamping, to say the least. One cannot assume without loss of generality that the expectation is zero. If the expectation were observable, one could subtract it from x and get something with zero expectation, and so no generality would be lost by this assumption. In practice the expectation is never observable, and one must consider the probability distribution of the difference between x and an estimate, based on data, of the expectation of x.

Excuse me, but that is absurd. If the mean were observable, then one could simply subtract the mean from X, getting something with zero mean, and then indeed no generality would be lost by assuming that. In practice, one must use a data-based and therefore uncertain estimate of the mean, and one must therefore consider the probability distribution of the difference between X and the estimate of the mean of X.

If I may respond --- PCA is a technique that is applied to empirical data sets. PCA eigendecomposes the maximum likelihood covariance matrix. Indeed, there is a distribution of PCA decompositions about the "true" decomposition that you would get in the infinite data limit. But, that does not make it absurd. Or rather, no more absurd than any other maximum likelihood estimate. Any ML technique will have a variance around the estimate from infinite data.
Are you objecting because ML is not mentioned in the article? Or is it something else? -- hike395 04:39, 5 May 2004 (UTC)
Something else. Several something elses. It doesn't seem like that good an article. I'll probably drastically edit it within a few months; it's on my list. Michael Hardy 16:31, 5 May 2004 (UTC)

PCR and PLS?

Would it be redundant to include some discussion of principal components regression? I don't think so, but I don't feel qualified to explain it.

It would also be nice to have a piece on Partial Least Squares. Geladi and Kowalski Analytica Chimica Acta 185 (1986) 1-17 may serve as a starting point.

I disagree --- PLS and PCR are both forms of linear regression, which is supervised learning. PCA is density estimation, which is unsupervised learning. Very different sorts of algorithms --- hike395 04:35, 22 Mar 2005 (UTC)
Principal components regression is used when the predictive variables are correlated, that is, cov(x_i, x_j) ≠ 0 for some i ≠ j. When this happens, we are in the presence of multicollinearity, which reduces the power of the inference. PCA is applied to the independent variables, and a regression model is then fitted to the chosen principal components. The new estimated parameters are biased but uncorrelated, and the variance of the new model is lower.
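For anyone who wants to see the procedure just described in concrete form, here is a rough sketch, assuming NumPy. Note that it stores observations as rows, which is the usual regression layout and the opposite of the article's convention; the data, the number of components `k` and the names `T`, `gamma` and `beta_pcr` are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2                                   # observations, components kept (assumed)

# Three predictors; the third is nearly a copy of the first (collinear).
z = rng.standard_normal((n, 2))
X = np.column_stack([z[:, 0], z[:, 1], z[:, 0] + 0.01 * rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0, 1.0]) + 0.1 * rng.standard_normal(n)

# Centre predictors and response.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# PCA of the predictors via SVD; rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = Xc @ Vt[:k].T                               # scores on the first k components

# The scores are uncorrelated, so least squares decouples per component.
gamma = (T.T @ yc) / (s[:k] ** 2)

# Map the coefficients back to the original predictors.
beta_pcr = Vt[:k].T @ gamma
print(beta_pcr)
```

Dropping the smallest component removes the direction along which the two collinear predictors cannot be told apart, which is where the variance reduction (and the bias) comes from.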

PCA & Least Squares

Is PCA the same as a least squares fit? (Furthermore, is either the same as finding the principal moment of inertia of an n-dimensional body?) —BenFrantzDale 23:53, August 3, 2005 (UTC)

No. A least-squares fit minimizes (the squares of) the residuals, the vertical distances from the fit line (hyperplane) to the data. PCA minimizes the orthogonal distances from the data to the hyperplane. (Or something like that; I don't really know what I'm talking about.) As for moments of inertia, well, physics isn't exactly my area of expertise. —Caesura(t) 18:44, 14 December 2005 (UTC)
Yes. PCA is equivalent to finding the principal axes of inertia for N point masses in m dimensions, and then throwing all but l of the new transformed co-ordinates away. It's also mathematically the same problem as Total Least Squares (errors in all variables), rather than Ordinary Least Squares (errors only in y, not x), if you can scale it so the errors in all the variables are uncorrelated and the same size. You're then finding the best l dimensional hyperplane that your data ought to sit on through the m dimensional space. The real power tool behind all of this to get a feel for is Singular Value Decomposition. PCA is just SVD applied to your data. -- Jheald 19:40, 12 January 2006 (UTC).
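The difference between the two fits can be seen on a toy 2-D data set. This is a sketch only, assuming NumPy; the data are synthetic and the helpers `orth_ss` and `vert_ss` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(300)
y = 0.5 * x + 0.5 * rng.standard_normal(300)

P = np.column_stack([x, y])
P -= P.mean(axis=0)                       # both fitted lines pass through the centroid

# Ordinary least squares: minimises vertical residuals.
ols_slope = np.polyfit(P[:, 0], P[:, 1], 1)[0]

# Total least squares / first principal component: minimises orthogonal residuals.
_, _, Vt = np.linalg.svd(P, full_matrices=False)
d = Vt[0]                                 # direction of largest variance
tls_slope = d[1] / d[0]

def orth_ss(slope):
    """Sum of squared orthogonal distances to the line through the origin."""
    u = np.array([1.0, slope]) / np.hypot(1.0, slope)
    resid = P - np.outer(P @ u, u)
    return float((resid ** 2).sum())

def vert_ss(slope):
    """Sum of squared vertical distances to the line through the origin."""
    return float(((P[:, 1] - slope * P[:, 0]) ** 2).sum())

print(ols_slope, tls_slope)
```

Each line wins under its own criterion: the OLS slope gives the smaller `vert_ss`, the principal-component slope the smaller `orth_ss`.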

Derivation of PCA

Shouldn't the constraint that we are looking for the maximum variance appear somewhere in that derivation ? I cannot understand it clearly as it is right now. --Raistlin 12:49, 24 August 2005 (UTC)

It is my understanding that the first principal component is the least squares fit to a multidimensional configuration of points, which happens to also be the axis of maximum variance. The second principal component is also a least squares fit to the configuration, with the additional constraint that it must be orthogonal to the first principal component. The third, fourth, fifth, etc, principal components are also least squares fits, except that they are each constrained to be orthogonal to all of the principal components before them. 24.221.60.71 05:03, 21 May 2007 (UTC)

Exactly right. The more of the variance that can be put into the first n components, i.e. the n-component subspace fitted, the less is the variance (sum of squares) of the points' residuals orthogonal to that subspace. Jheald 15:46, 21 May 2007 (UTC)
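The trade-off between captured variance and residual sum of squares can be verified directly. A minimal sketch, assuming NumPy and columns as samples; the data and the names `captured`, `residual` and `total` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, n = 4, 100, 2                                   # variables, samples, components kept

X = rng.standard_normal((M, N)) * np.array([[3.0], [2.0], [1.0], [0.5]])
X -= X.mean(axis=1, keepdims=True)                    # centre each variable

W, s, _ = np.linalg.svd(X, full_matrices=False)
W_n = W[:, :n]                                        # basis of the fitted n-dim subspace

X_hat = W_n @ (W_n.T @ X)                             # projection onto that subspace
captured = float((X_hat ** 2).sum())                  # sum of squares inside the subspace
residual = float(((X - X_hat) ** 2).sum())            # sum of squares orthogonal to it
total = float((X ** 2).sum())

print(captured, residual, total)
```

The total is fixed by the data, so maximising `captured` and minimising `residual` are the same problem; the residual equals the sum of the discarded squared singular values.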

Conjugate transpose

and * T represents the conjugate transpose operation.

Why conjugate transpose instead of a normal transpose ? Does it even work with complex numbers ? Taw 04:18, 31 December 2005 (UTC)

As you probably know, conjugate transpose is a generalization of plain old transpose that allows these operations to work on complex numbers instead of just real numbers. If the source data X consists entirely of real numbers, then the conjugate operation is completely transparent, since the conjugate of a real number is the number itself. But if the source data includes complex numbers, then the conjugate operation is absolutely essential for the matrix operations to yield meaningful results. As far as I can tell, it does work on complex numbers. As an example where you might have complex numbers as source data, you might want to use PCA on the Fourier components of a real, discrete-time signal, which are in general complex. -- Metacomet 18:59, 1 January 2006 (UTC)
I have added a motivation paragraph at Conjugate_transpose#Motivation to try to show why it is so natural for the conjugate transpose to turn up, whenever the matrix you're transposing includes complex numbers. Hope it's helpful. -- Jheald 20:14, 12 January 2006 (UTC).
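A small numerical illustration of why the conjugate matters, assuming NumPy and random complex data; the names `C` and `C_plain` are invented for the comparison.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 3, 50
B = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
B -= B.mean(axis=1, keepdims=True)          # centre each (complex) variable

C = B @ B.conj().T / (N - 1)                # conjugate transpose: Hermitian
C_plain = B @ B.T / (N - 1)                 # plain transpose: complex symmetric only

evals = np.linalg.eigvalsh(C)               # real eigenvalues of the Hermitian matrix
print(evals)
```

With the conjugate transpose the covariance matrix is Hermitian, so its eigenvalues are real and non-negative and can be read as variances. With the plain transpose that property is lost.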

Computation -- surely this is not the right way to go ?

The section on computation looks to make a real meal of things, IMO; and to be pretty dubious too, as regards its numerical analysis. As soon as you square the data matrix, you're going to reduce the accuracy of your SVD from double precision to single precision.

Is there any reason to prefer either of the methods in the text, compared to choosing which bits of the SVD you actually want to keep, and then just wheeling out R-SVD ? (Which I imagine is quicker, too). -- Jheald 19:05, 12 January 2006 (UTC).

I agree that this article is unreadable. The lengthy "PCA algorithm" section is one of the main reasons - it is too long, and it doesn't agree with the equations in the introduction (where did we divide by N-1? why? what about the empirical standard deviations?). It doesn't even say what the output of the algorithm is, AFAICT. A5 13:32, 6 March 2006 (UTC)
I am working on improving the algorithm section to make it more readable. In the end, the section will still be quite long, because the algorithm is rather complicated and I think it is important to include enough detail so that people can actually implement it in software. After I have completed this upgrade, please make specific suggestions for further improvements. -- Metacomet 21:37, 9 March 2006 (UTC)
I am done for now. There is still more work to do, but it's a good start. Please provide comments and suggestions for improvement. Thanks. -- Metacomet 23:12, 9 March 2006 (UTC)
The improvement I would suggest is to delete the whole entire section completely, starting from the table, and then everything following it; and instead tell people to use SVD.
A standard SVD routine will be better written, better tested, faster, and more numerically stable.
IMO it is totally irresponsible for the article to be suggesting inefficient homespun routines, actually leading people away from the standard SVD routines. -- Jheald 00:03, 10 March 2006 (UTC).
I'm no expert Jheald, but I don't see what you're so worried about. Algorithms for SVD that I have seen on the WWW basically consist of the same algorithm that is listed on this PCA page, only done twice, once for left handed eigenvalues, once for right. Is there some other algorithm for SVD that is much preferable? --Chinasaur 08:40, 25 May 2006 (UTC)
I'm sort of an expert - I have a PhD in computer science, in algorithms, although numerical algorithms are not my specific thing - and yeah, actually, the algorithms for SVD that you'll find in a package like LAPACK, R or Mathematica actually are different from the one described here. They avoid computing the covariance matrix for the reason Jheald suggested. ProfessorSpice 01:07, 29 June 2007 (UTC)
I am really glad that you took some time to carefully review the work that I did and make some thoughtful recommendations. Thanks for the constructive feedback. Oh yes, that is sarcasm, in case you were wondering. -- Metacomet 00:48, 10 March 2006 (UTC)
"...totally irresponsible..." Don't you think that is just a wee bit of hyperbole?
"...homespun routines..." Are you referring to calculating the mean, the standard deviation, or the covariance? No, that can't be right, those are well-known and well-established procedures from statistics. Or perhaps eigenvectors and eigenvalues? Hmmmm, those are standard routines in linear algebra. Sorting the basis vectors by energy content and keeping only the ones with the highest contribution? No, that's also a standard concept called the 80-20 rule (or Pareto's principle). I guess I just don't understand what you mean by homespun routines....
-- Metacomet 01:36, 10 March 2006 (UTC)


BTW, I am pretty sure that dividing by N-1 is correct, which means the introduction needs to be fixed, not the algorithm. The reason the algorithm needs to divide by N-1 is that it is computing the expected value of the product, not the product itself. -- Metacomet 21:50, 9 March 2006 (UTC)
I don't know anything about maths, but all the pages about the covariance matrix use N, so maybe N-1 is not so correct...? -- IC 18:48, 18 November 2006 (GMT+1)
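Both normalisations are easy to compare. A sketch assuming NumPy, whose `np.cov` treats rows as variables by default and divides by N-1 unless `bias=True` is passed; the data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 3, 10
X = rng.standard_normal((M, N))             # rows are variables, columns are observations
B = X - X.mean(axis=1, keepdims=True)       # subtract the estimated mean of each variable

C_unbiased = B @ B.T / (N - 1)              # sample covariance (mean estimated from data)
C_ml = B @ B.T / N                          # maximum-likelihood / population form

print(np.allclose(C_unbiased, np.cov(X)))
print(np.allclose(C_ml, np.cov(X, bias=True)))
```

The two matrices differ only by the scalar factor N/(N-1), so they have the same eigenvectors; only the eigenvalues are rescaled. The choice therefore does not change the principal directions.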

A mathematical derivation with eigenvalues and eigenvectors is OK but such methods should not be called algorithms. The practical computation should be SVD. Squaring the matrix to get the covariance is harmful. By the way, homegrown SVD is harmful as well and I must support Jheald on both counts. A professional implementation should use SVD from some efficient and stable library, just as one should never write matrix-matrix multiplication except in a college homework. LAPACK is a de-facto standard for all that. (There might be justified exceptions such as programming something exotic with small memory.) Even if some engineering textbooks might have algorithms like here with the covariance created explicitly, their authors are obviously not professional numerical analysts or developers or they do not care about numerical aspects. Jmath666 07:28, 16 March 2007 (UTC)

I would like to support the suggestion of using an SVD function rather than a generic eigensolver. While the algorithm described is a mathematically correct way of doing PCA, not all mathematically correct algorithms are equally good. Two big issues for numerical algorithms are accuracy (how much does it lose in round-off error?) and efficiency. People spend their lives worrying about these issues, and those are the people who write programs like LAPACK, SciPy, R, MATLAB or Mathematica. Since the person implementing PCA is going to install some package like this to compute the eigenvalues anyway, s/he might as well use the SVD function in the package instead of computing the covariance matrix and then using the eigensolver function. SVD is more "special purpose" - it takes B, and computes the decomposition directly, without going through the step of computing BB^T. The SVD function will certainly be more accurate (as Jheald said, computing the covariance matrix loses digits to round-off error) and I believe almost always more efficient. And it makes the PCA implementation shorter and easier to do! Just to drag in a real big-gun reference, Golub and Van Loan's textbook Matrix Computations gets 14,000+ citations on Google Scholar, and it recommends against computing the covariance matrix when computing SVD, in Section 8.3. ProfessorSpice 01:07, 29 June 2007 (UTC)
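The loss of digits from squaring can be shown on a tiny textbook-style matrix (a Läuchli-type example, chosen here for illustration and not taken from the article). A sketch assuming NumPy in double precision; the exact singular values of `A` are sqrt(2 + eps^2) and eps.

```python
import numpy as np

eps = 1e-9
A = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])

# Route 1: SVD applied directly to the data matrix.
s_svd = np.linalg.svd(A, compute_uv=False)

# Route 2: form the squared matrix first, then take eigenvalues.
G = A.T @ A                                   # 1 + eps**2 rounds to exactly 1.0
evals = np.linalg.eigvalsh(G)
s_eig = np.sqrt(np.clip(evals[::-1], 0.0, None))

print("direct SVD  :", s_svd)
print("via A^T A   :", s_eig)
print("stored A^T A:", G)
```

Once `G` has been formed, every entry is exactly 1.0 and the information about eps is gone, so no eigensolver can recover the small singular value from it. The direct SVD still returns it to good relative accuracy.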

You all make a very strong case for SVD, however I hope we can all agree that the SVD article lacks any sort of decent step by step explanation of an algorithm to produce it. Until there is, I will use the method outlined here, as LAPACK is not an option for me. Please don't complain about what's up here until you know there is something better elsewhere on Wikipedia, right now that's just not the case. Jordyhoyt 10:46, 1 August 2007 (UTC)

Eigenvector/eigenvalue ordering

Under "Find the eigenvectors and eigenvalues of the covariance matrix", the article says "The eigenvalues and eigenvectors are ordered and paired." But then the next section says to order the columns by decreasing eigenvalue. Maybe I'm misunderstanding the previous section but this seems contradictory. 71.199.186.28 (talk) 00:06, 30 March 2008 (UTC)

I've used PCA for classification successfully without reordering at this stage. It may be that it is simply trying to preempt the later stage discussed. 134.225.217.52 (talk) —Preceding comment was added at 02:20, 25 April 2008 (UTC)
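One way to read "ordered and paired" is that each eigenvalue stays attached to its own eigenvector column, and the sort by decreasing eigenvalue is a separate step. A minimal sketch of that reading, assuming NumPy, whose `np.linalg.eigh` returns eigenvalues in ascending order; the data are made up.

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((4, 60)) * np.array([[1.0], [3.0], [0.5], [2.0]])
B -= B.mean(axis=1, keepdims=True)
C = B @ B.T / (B.shape[1] - 1)               # 4 x 4 sample covariance

evals, evecs = np.linalg.eigh(C)             # paired: evals[i] belongs to evecs[:, i]

order = np.argsort(evals)[::-1]              # indices for decreasing eigenvalue
evals = evals[order]
evecs = evecs[:, order]                      # permute columns with the same indices

print(evals)
```

Applying the same index array to both keeps the pairs intact. Sorting only one of the two would silently mismatch them.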

Simplification

Could someone put one sentence at the top explaining this in layman's terms? It looks to me like a very fancy and statistically smart way to average a whole heap of data into some sort of dataset common to all of the data -- is this at all a correct impression? --Fastfission 04:07, 28 January 2006 (UTC)

Cov Matrix (Contradiction)

If one is dealing with an MxN data set, i.e. N factors and M observations of each, the resulting cov matrix will be NxN, not MxM.

It seems like everything from the mean vector subtraction to the covariance matrix calculation is done as if the data are organized as M rows of variables and N columns of observations. This is not properly explained in the "organizing the data" section, and is kind of opposite what most people would expect. I'm inclined to reverse everything. --Chinasaur 22:39, 19 May 2006 (UTC)
Yeah, this whole covariance matrix thing seems completely wrong. It states:
\mathbf{C} = { 1 \over N } \mathbf{B} \cdot \mathbf{B}^{*}
And this is inconsistent on two levels. First of all the covariance matrix of B is NxN, not MxM, as others have stated above. This is in direct contradiction to what this section of the article states, and to what the "organizing data section" states. Secondly, assuming that each data set is in a column (so 3 datasets of 5 points each is organized into a 5x3 [M=5, N=3] matrix), the covariance matrix is NOT { 1 \over N } \mathbf{B} \cdot \mathbf{B}^{*} but is actually { 1 \over N } \mathbf{B}^{*} \cdot \mathbf{B}. So the equation given above is for the transpose of B. And in any case, that's not even the covariance matrix of the transpose of B, the 1/N is wrong, it should be 1/(N-1). Unfortunately I don't know enough about it to make the correction, and the Wikipedia article that I came to to learn about it is quite inadequate. Anyways, I will be putting a contradiction tag on this article because of this. This is a very, very poorly written article, and the original author really deserves a sound spanking. Once I actually do find a correct, concise source of information regarding PCA, I'll be redoing it. --JCipriani 22:50, 13 April 2007 (UTC)
This article seems taken from some confused engineering textbooks that make it too complicated because they try to be elementary and try to teach other things at the same time. I had to wade through the mess myself trying to learn about PCA not too long ago but I never found an acceptable source. In fact, it is very simple: PCA is the spectral decomposition of the sample covariance matrix. It is best computed by SVD. It can be proved that the eigenvectors have certain optimal properties regarding the variance. It is very short, really. The Karhunen-Loeve decomposition is something a bit else (see Loeve Probability theory ISBN 0-387-90262-7) and it is done in advanced graduate courses in probability theory; but once you know that you can just say that "PCA is KL decomposition with covariance replaced by the sample covariance". All else is crud. I may write it up one day if I have the time. If you want to give it a shot these are on the clearer side: Holmes et al ISBN 0-521-55142-0, and Liang, Y. C. et al Proper orthogonal decomposition and its applications. I Theory, Journal of Sound and Vibration, 252 (2002) 527--544. Jmath666 04:51, 22 April 2007 (UTC)
Should the sample covariance matrix be used here instead of the population covariance? It seems the calculation should be:
\mathbf{C} = { 1 \over N - 1 } \mathbf{B} \cdot \mathbf{B}^{T}
p.484 of David Lay's Linear Algebra and its Applications, 3rd ed ISBN 0-201-70970-8 and p.5 of the paper "A Tutorial on Principal Component Analysis" by Lindsay Smith support this. When you are performing PCA it's typically on a sample of the population, right? Zhroth (talk) 16:21, 19 February 2008 (UTC)
Comments: An outer product between matrices is not a commonly used mathematical term. The outer product is usually understood as an operation between two vectors. In any case, using "outer product" here only makes the description seem more formal and complicated.
—Preceding unsigned comment added by 68.147.165.202 (talk) 18:46, 8 March 2008 (UTC)

Cov Matrix size

The size of the cov matrix C is still unclear. From the section “Find the eigenvectors and eigenvalues of the covariance matrix” on, it is considered to be NxN, while in the section “Find the covariance matrix” it is MxM, which I think is the right size, since the matrix B is MxN. 133.6.156.71 12:07, 6 June 2006 (UTC)

Shouldn't it read "inner product" instead of "outer product", as the outer product B \cdot B^* would make C an M\times N\times N\times M tensor?

Outer product is right; it is just that they switched the meaning of the dimensions, as some of the previous comments have indicated. —Preceding unsigned comment added by 74.192.1.156 (talk) 11:46, 18 October 2007 (UTC)
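The two candidate sizes correspond to the two possible products. A short sketch, assuming NumPy and the article's layout of M variables in rows and N observations in columns; `C` and `G` are names chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
M, N = 3, 8
B = rng.standard_normal((M, N))
B -= B.mean(axis=1, keepdims=True)          # centre each variable

C = B @ B.T / (N - 1)                       # M x M: covariance between variables
G = B.T @ B / (N - 1)                       # N x N: products between observations

# The nonzero eigenvalues of the two matrices coincide.
top_C = np.sort(np.linalg.eigvalsh(C))[::-1]
top_G = np.sort(np.linalg.eigvalsh(G))[::-1][:M]

print(C.shape, G.shape)
print(top_C, top_G)
```

So with variables in rows the covariance matrix is M x M. Transposing the data layout swaps the roles of the two products, which may be where the NxN claims come from.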

This isn't really working!

The first point I wondered about is "Calculate the empirical mean". I think the mean is not calculated in the right way. The mean is calculated over each dimension M. Isn't that distorting the data? I think you have to take the mean over each observation (N-vector).
The second point is the size, first of the covariance matrix and then of the eigenvalue matrix. By calculating the eigenvalues you get one for each variable in the data set. So the size of this matrix should be MxM, and to reach this result the covariance matrix must have the same size.
... Does anybody have an idea how it really works?

Subtracting the mean of the observation is nonsense. If I have features X1 and X2, where X1 is on the order of 10^20 and X2 is on the order of 10^-20, subtracting the mean of the observation will just turn X2 into a hugely negative value and leave X1 at about half its size. Subtracting the mean of the dimension makes sense because you are trying to shift the problem back to the origin (if it were plotted). —Preceding unsigned comment added by 74.192.1.156 (talk) 11:50, 18 October 2007 (UTC)
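A toy example of the two ways of centring, assuming NumPy and two variables on very different scales (milder than 10^20 and 10^-20 so the numbers stay readable); the array `wrong` is named that way only to mark the per-observation variant.

```python
import numpy as np

# Rows are variables, columns are observations.
X = np.array([[1.0e3, 1.2e3, 0.8e3, 1.0e3],       # variable 1, large scale
              [1.0e-3, 3.0e-3, 2.0e-3, 2.0e-3]])  # variable 2, small scale

# Per-variable centring: subtract each row's mean (what the article does).
B = X - X.mean(axis=1, keepdims=True)

# Per-observation centring: subtract each column's mean.
wrong = X - X.mean(axis=0, keepdims=True)

print(B)
print(wrong)
```

After per-variable centring each variable has zero mean and keeps its own variation. After per-observation centring the small variable is swamped: it becomes roughly minus half of the large one in every column.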

What's the difference between PCA and ICA

Just wondering. This isn't clear to me from these articles. --137.215.6.53 12:18, 3 August 2006 (UTC)


Principal components analysis versus exploratory factor analysis

I suggest including a subsection discussing the differences between PCA and exploratory factor analysis. My experience working in Stat Lab is that students/clients get them confused. Perhaps a description of the differences between PCA and EFA may be included. This can be added to common factor analyses. Below is my understanding of the differences. I did not want to use "greek" symbols so that it may perhaps be more accessible to non-mathematicians. What do you think?

Exploratory factor analysis (EFA) and principal component analysis (PCA) may differ in their utility. The goal in using EFA is factor structure interpretation and also in data reduction (reducing a large set of variables to a smaller set of new variables); whereas, the goal for PCA is usually only data reduction.

EFA is used to determine the number and the nature of latent factors which may account for a large part of the correlations among a large number of measured variables. On the other hand, PCA is used to reduce scores on a large set of observed (or measured) variables to a smaller set of linear composites of the original (or observed) variables that retain as much information as possible from the original (or observed) variables. That is, the components (linear combinations of the observed items) serve as reduced set of the observed variables.

Moreover, the core theoretical assumptions are different for both methods. EFA is based on the common factor model (FA), whereas, PCA is not.

1. Common and unique variances

Common Factor Model (FA): Factors are latent variables that explain the covariances (or correlations) among the observed variables (items). That is, each observed item is a linear equation of the common factors (i.e., single or multiple latent factors) and one unique factor (latent construct affiliated with the observed variable). The latent factors are viewed as the causes of the observed variables.
Note: Total variance of variable = common variance + unique variance (in which, unique variance = specific + error variance).
Principal Components (PCA): In contrast, PCA does not distinguish between common or unique variances. The components are estimated to represent the variances of the observed variables in as economical a fashion as possible (i.e., in as small a number of dimensions as possible), and no latent (or common) variables underlying the observed variables need to be invoked. Instead, the principal components are optimally weighted sums of the observed variables (i.e., components are linear combinations of the observed items). So, in a sense, the observed variables are the causes of the composite variables.

2. Reproduction of observed variables

FA: Underlying factor structure tries to reproduce the correlations among the items
PCA: Composites reproduce the variances of observed variables

3. Assumption concerning communalities & the matrix type.

FA: Assumes that a variable's variance is composed of common variance and unique variance. For this reason, we analyze the matrix of correlations among measured variables with communality estimates (i.e., proportion of variance accounted for in each variable by the rest of the variables) on the main diagonal. This matrix is called R_reduced.
Note: Principal Axis factoring (PAF) = principal component analysis on R_reduced.
PCA: There is no place for unique variance and all variance is common. Hence, we analyze the matrix of correlations (R_xx) among measured variables with 1.0s (representing all of the variance of the observed variables) on the main diagonal. The variance of each measured variable is entirely accounted for by the linear combination of principal components.
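The difference between the two matrices in point 3 can be made concrete. A rough sketch, assuming NumPy, observations in rows, and squared multiple correlations as the communality estimates (one common initial choice, used here as an assumption; iterated estimates would differ). The one-factor data and the names `R`, `smc` and `R_reduced` are made up.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 500, 4
f = rng.standard_normal((n, 1))                         # one hypothetical common factor
X = f @ np.array([[0.8, 0.7, 0.6, 0.5]]) + 0.6 * rng.standard_normal((n, p))

R = np.corrcoef(X, rowvar=False)                        # R_xx, with 1.0s on the diagonal

# Squared multiple correlation of each variable with all the others.
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))

R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)                        # communalities on the diagonal

pca_evals = np.linalg.eigvalsh(R)[::-1]                 # PCA: decompose R_xx
paf_evals = np.linalg.eigvalsh(R_reduced)[::-1]         # one PAF step: decompose R_reduced

print(pca_evals, pca_evals.sum())
print(paf_evals, paf_evals.sum())
```

The PCA eigenvalues add up to the number of variables (all variance is analysed), while the eigenvalues of the reduced matrix add up to the total estimated common variance, which is smaller.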

See also factor analysis

(Please bear with me, I am new to using Wikipedia.)

RicoStatGuy 15:53, Sept 30, 2006(UTC)

Orthogonality of components

According to this PDF, the eigenvectors of a covariance matrix are orthogonal. The eigenvectors of an arbitrary matrix are not necessarily orthogonal, as seen in the leading picture on the eigenvector page. So what gives? Why are these eigenvectors necessarily orthogonal? —Ben FrantzDale 14:44, 7 September 2006 (UTC)

According to Symmetric matrix, "Another way of stating the spectral theorem is that the eigenvectors of a symmetric matrix are orthogonal." That explains that. 128.113.54.151 20:00, 7 September 2006 (UTC)
If the multiplicity of every eigenvalue of the covariance matrix is 1, then the eigenvectors will by necessity be orthogonal.
If there exists an eigenvalue of the covariance matrix with multiplicity greater than 1, say of dimension r, then this corresponds to an r-dimensional subspace of R^n (n being the dimension of the covariance matrix). Then the corresponding eigenvectors can be in principle any basis of this subspace. But generally speaking, the basis is chosen to be orthogonal.
So to answer the question, in some cases they must be orthogonal, and in some cases they do not all have to be, but are usually chosen to be so.
On a side note, all software packages I am aware of will return orthogonal eigenvectors in the multiplicity case. I suspect that this is because the algorithms implicitly force this by recursively projecting R^n into the nullspace of the most recent eigenvector, or something equivalent. Baccyak4H (talk) 17:56, 20 November 2006 (UTC)
Actually, 128.113.54.151 is exactly correct. Because covariance matrices are symmetric, they are necessarily normal. The complex spectral theorem tells us that in ALL cases a normal operator on a complex vector space has an orthonormal basis of eigenvectors. In fact, the theorem tells us that such an orthonormal basis exists if and only if the operator is normal. If we restrict ourselves to the reals, the Real Spectral Theorem tells us that a matrix has an orthonormal collection of eigenvectors if and only if it is self-adjoint. Covariance matrices are self-adjoint, so again the theorem holds. The statement above by Baccyak4H that in some cases the eigenvectors do not have to be orthogonal is incorrect when we are talking about covariance matrices. His assertion that an arbitrary collection of eigenvectors can be reshaped into an orthogonal collection of eigenvectors is also incorrect. 167.206.189.3 20:44, 19 June 2007 (UTC)
No, Baccyak4H is correct. As you say, it is always possible to find a basis of orthonormal eigenvectors for a real symmetric matrix. However if two eigenvectors u1 and u2 share the same common eigenvalue λ, then any arbitrary linear combinations v1 = α1 u1 + β1 u2 and v2 = α2 u1 + β2 u2 are also eigenvectors with the same eigenvalue. So yes, it is possible to find vectors u1 and u2 which are orthogonal, but if they share an eigenvalue then one can also find an infinite number of pairs of valid eigenvectors v1 and v2 which are not orthogonal. Jheald 21:06, 19 June 2007 (UTC)
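A concrete instance of the repeated-eigenvalue case, as a sketch assuming NumPy; the matrix and the coefficients of the linear combinations are chosen arbitrarily for illustration.

```python
import numpy as np

# Symmetric matrix with eigenvalue 2 repeated twice.
C = np.diag([2.0, 2.0, 1.0])

u1 = np.array([1.0, 0.0, 0.0])              # orthogonal eigenvectors for eigenvalue 2
u2 = np.array([0.0, 1.0, 0.0])

v1 = 1.0 * u1 + 1.0 * u2                    # arbitrary linear combinations
v2 = 1.0 * u1 + 2.0 * u2

print(C @ v1, 2.0 * v1)                     # v1 is still an eigenvector for eigenvalue 2
print(C @ v2, 2.0 * v2)                     # so is v2
print(v1 @ v2)                              # but v1 and v2 are not orthogonal

# A symmetric eigensolver nevertheless returns an orthonormal set.
evals, evecs = np.linalg.eigh(C)
print(evecs.T @ evecs)
```

Both statements in the thread hold here: an orthonormal eigenbasis always exists for a symmetric matrix, and within a repeated eigenvalue's subspace non-orthogonal eigenvectors exist as well.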

[edit] Abdi

The following comes from my talk page:

" Hi Brusegadi, I deleted some references in principal component analysis related to an author called Abdi because they are neither standard references nor really related to the basic and strong theory expected of the article. Unless you can prove otherwise, I will take steps such that these references would not appear ever again. Can't you understand that this author is self-promoting or may be he is someone dear to you or you are Abdi. Mind your words! PCA_Hero "

PCA_Hero, the reason why I deleted your edit is because it looked like a case of blanking. You could have been able to tell that by the message I left on your anon talk page (note that this is the standard message prescribed by wiki to blanking vandals). I see much blanking from anons. Had you provided an explanation like the one given in my talk page AFTER everything had happened in the ARTICLE'S talk page I would NOT have deleted that. Something we can both learn is WP:AGF since I accused you of blanking and you accused me of promoting spam. The difference is that when you look at your contributions, you see very few (as of today) but when you look at mine you see many and from a broad range of topics. I will also state that I do not appreciate your tone! Mind OUR assumptions! Brusegadi 19:51, 15 November 2006 (UTC)

[edit] Rows and columns

I think our convention for the Data matrix is probably the wrong way round. At the end of the day, it would probably be more natural if our "principal components" vector was a column vector.

I also think that confusion between the two conventions is one of the things that has been making the article more difficult than it needs to be.

I propose to go ahead and make this change, unless anyone thinks it's a bad idea? Jheald 16:53, 31 January 2007 (UTC)


[edit] Percent variance??

Presumably one wants to compare the sum of the leading eigenvalues to the sum of all eigenvalues. The example of comparing to a threshold of 90% doesn't make much sense otherwise.
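As a rough illustration of that comparison, here is a minimal Python/NumPy sketch. The eigenvalues are hypothetical and the 90% threshold is just the figure mentioned above.

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, sorted in descending order.
eigvals = np.array([5.0, 3.0, 1.5, 0.5])

# Fraction of total variance captured by the first k components, k = 1..4.
cumulative_fraction = np.cumsum(eigvals) / eigvals.sum()

# Smallest k whose cumulative fraction reaches the threshold.
threshold = 0.90
k = int(np.searchsorted(cumulative_fraction, threshold) + 1)
```

With these made-up values the cumulative fractions are 0.5, 0.8, 0.95 and 1.0, so three components are needed to reach 90%.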

[edit] Cumulative energy

This term for the contributions of components seems to be from some other field. Is there a better general term for this? Shyamal 08:25, 12 March 2007 (UTC)

[edit] Terminology used

Many users of PCA expect certain terminology such as the decomposition into "loadings" and "scores". The term loading itself is never used in the article and this can be confusing. The following is a mechanical statement for PCA in Matlab.

For a square matrix A (such as the covariance matrix below) we can use eigenvalue decomposition to produce
1) an eigenvector matrix V whose columns are eigenvectors and
2) a diagonal eigenvalue matrix D such that
A*V=V*D
and A=V*D*inv(V)

Depending on the algorithm, the elements of D may be in ascending, descending or unsorted order, but the elements of D and the columns of V may be sorted consistently without changing these identities. Matlab's eig() function, for instance, typically returns the D values in ascending order for a symmetric matrix, but descending order is often preferred.

If X consists of samples in rows and variables in columns, then X'X is proportional to the covariance matrix if X is mean centered (it equals the covariance matrix multiplied by n-1, for n samples).

PCA can be done on the covariance matrix, or using X'X even without mean centering.

Cov=X'X

Cov now is a square matrix with the dimension being the number of variables or columns

[V,D]=eig(Cov) in Matlab will now give the Loadings in V

The scores can be obtained with

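Scores=X * V(:,m-k+1:m)

for k components, where m is the number of variables; with D in ascending order, the last k columns of V correspond to the k largest eigenvalues.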

It can be verified that

X ≈ Scores * Loadings'

Instead of using the Covariance matrix X'X, one can also compute PCA using just the X matrix. Here the Singular Value Decomposition
algorithm (SVD) may be used. In Matlab

[U,S,V]=svd(X)

The V here matches the V (loadings) obtained by eigenvalue decomposition, up to the sign and ordering of its columns (svd returns the singular values in descending order), and the Scores are now equal to U*S

Hope someone can use the above suitably formatted in the article with explanation of the terms scores and loadings. Shyamal 09:15, 12 March 2007 (UTC)
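For anyone who wants to check the two routes against each other, here is a hedged sketch in Python with NumPy rather than Matlab. The data are randomly generated for illustration, and the names loadings and scores simply follow the terminology above; none of this is taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples (rows) by 4 variables (columns),
# with a different spread per variable so the eigenvalues are well separated.
X = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)                   # mean-centre each column

k = 2                                    # number of components to keep

# Route 1: eigendecomposition of X'X (eigh returns ascending eigenvalues).
eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]        # re-sort to descending
eigvals, V = eigvals[order], V[:, order]
loadings = V[:, :k]
scores = X @ loadings
X_approx = scores @ loadings.T           # rank-k approximation of X

# Route 2: SVD of X itself (singular values come out descending).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores_svd = U[:, :k] * s[:k]            # U*S restricted to k columns

# The two routes agree up to the sign of each column.
same_up_to_sign = np.allclose(np.abs(scores), np.abs(scores_svd))
squared_singular_values_match = np.allclose(s**2, eigvals)
```

The absolute values are compared because each eigenvector is only defined up to sign, so the two routes may flip individual columns.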

[edit] Merge POD and PCA

These seem to be just different terms used in different circles/applications for the same thing. Jmath666 01:53, 16 March 2007 (UTC)

Agree. Shyamal 08:48, 16 March 2007 (UTC)
Agree. Algorithms 16:57, 7 June 2007 (UTC)
Disagree. - I would find it very confusing. Though I wouldn't object the other way around if POD is really the same thing. --MatthewKarlsen 17:21, 16 July 2007 (UTC)

[edit] Request information on how to choose how many components to retain

Could someone include information on the proper method for choosing the number of components to retain? I've done some searching and haven't found any 'rules'. Part of my interest is that in MBH98 they retain only the first PC, but they apparently did incorrect centering. If it is correctly centered, then a similar result is achieved by including the first 4 PCs. http://en.wikipedia.org/wiki/Hockey_stick_controversy It might be useful for individuals familiar with PCA to add some details to the above link.

LetterRip 10:49, 22 March 2007 (UTC)

In the past, we usually used PCA to try to represent 95% of the variability involved. I was involved in software metrics for a while and we'd collect a number of metrics that measured similar, but not quite the same, items. Using PCA, we could reduce the measures from 20ish to maybe 4 and account for 95% or more of the variability. This way we could have a 95% confidence saying things like "modules with this level of complexity have a far higher rate of bugs than modules with that level of complexity." Tangurena 04:15, 18 September 2007 (UTC)
  • Of course after I post the question I find a good reference :)

"Component retention in principal component analysis with application to cDNA microarray data"

"Many methods, both heuristic and statistically based, have been proposed to determine the number k, that is, the number of "meaningful" components. Some methods can be easily computed while others are computationally intensive. Methods include (among others): the broken stick model, the Kaiser-Guttman test, Log-Eigenvalue (LEV) diagram, Velicer's Partial Correlation Procedure, Cattell's SCREE test, cross-validation, bootstrapping techniques, cumulative percentage of total of variance, and Bartlett's test for equality of eigenvalues. For a description of these and other methods see [[7], Section 2.8] and [[9], Section 6.1]. For convenience, a brief overview of the techniques considered in this paper is given in the appendices.

Most techniques either suffer from an inherent subjectivity or have a tendency to under estimate or over estimate the true dimension of the data [20]. Ferré [21] concludes that there is no ideal solution to the problem of dimensionality in a PCA, while Jolliffe [9] notes "... it remains true that attempts to construct rules having more sound statistical foundations seem, at present, to offer little advantage over simpler rules in most circumstances." A comparison of the accuracy of certain methods based on real and simulated data can be found in [20-24]."

http://www.biology-direct.com/content/2/1/2

LetterRip 11:11, 22 March 2007 (UTC)
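Two of the simpler rules named in the quoted passage can be sketched in a few lines. This is an illustration in Python with NumPy under stated assumptions: the eigenvalues are made up, and the rules are written in the form they are commonly described (Kaiser-Guttman as "eigenvalue above the average", broken stick as a comparison with the expected piece sizes of a randomly broken unit stick). Conventions differ between sources, so treat it as a sketch rather than a reference implementation.

```python
import numpy as np

# Hypothetical eigenvalues of a 4-variable correlation matrix (they sum to 4).
eigvals = np.array([2.4, 1.2, 0.3, 0.1])
p = eigvals.size
fractions = eigvals / eigvals.sum()

# Kaiser-Guttman: keep components whose eigenvalue exceeds the average
# eigenvalue (which is 1 for a correlation matrix).
k_kaiser = int(np.sum(eigvals > eigvals.mean()))

# Broken stick: expected share of the j-th largest piece when a unit stick
# is broken at random into p pieces, b_j = (1/p) * sum_{i=j}^{p} 1/i.
broken_stick = np.array([np.sum(1.0 / np.arange(j, p + 1)) / p
                         for j in range(1, p + 1)])

# Keep leading components for as long as they beat the broken-stick share.
exceeds = fractions > broken_stick
k_broken_stick = p if exceeds.all() else int(np.argmin(exceeds))
```

For these particular values both rules happen to retain two components; on real data they frequently disagree, which is part of the subjectivity the quoted paper describes.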

[edit] Question concerning subsection "Convert the source data to z-scores"

Is it correct to transform normalized source data using a PCA which is based on the covariance matrix? Would it not be necessary to use a PCA based on correlation matrix instead (which corresponds to the covariance matrix of normalized source data)?

The covariance matrix based on z-scores is the correlation matrix. Technically, one might need to worry about whether the covariance matrix is computed from the empirical moments or from the "unbiased estimators", which differ by a factor of n/(n-1). There is a related chapter in Jolliffe. —Preceding unsigned comment added by Dfarrar (talkcontribs) 14:16, 6 March 2008 (UTC)
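Both statements in this reply can be checked with a short Python/NumPy sketch. The data are randomly generated for illustration only, with samples in rows.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 100 samples (rows) by 3 variables with different scales.
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])

# z-scores using the sample standard deviation (divisor n - 1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_z = np.cov(Z, rowvar=False)       # np.cov also divides by n - 1
corr_of_x = np.corrcoef(X, rowvar=False)
matches = np.allclose(cov_of_z, corr_of_x)

# Mixing divisors (n for the z-scores, n - 1 for the covariance) leaves
# a constant factor of n / (n - 1) on every entry.
n = X.shape[0]
Z_pop = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
factor = np.cov(Z_pop, rowvar=False)[0, 0]
```

Since the factor is the same for every entry, it rescales the eigenvalues but leaves the eigenvectors, and hence the principal directions, unchanged.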

What is h in the z-scores section? —Preceding unsigned comment added by 216.184.13.6 (talk) 17:45, 30 September 2007 (UTC)

[edit] Not necessarily orthogonal

I removed the bit about 'assumption that the principal components are orthogonal'. I believe that two things got mixed up:

If the noise is not white, principal components are not orthogonal, such that PCA is not optimal, or canonical, or anything. If the distribution of the noise is known, you may apply a linear transform to whiten the noise or (equivalently) apply a Generalized SVD or Restricted SVD.

I believe ICA applies to non-normal variables. If the noise is jointly normally distributed, then a covariance of zero implies independence (See http://en.wikipedia.org/wiki/Normally_distributed_and_uncorrelated_does_not_imply_independent). Even though PCA is not optimal or canonical or anything, I believe this means that PCA does find uncorrelated, and hence independent variables in this case. Hence, I think it should be concluded that PCA is a form of ICA for jointly normal variables. —Preceding unsigned comment added by 130.89.67.57 (talk) 14:41, 17 March 2008 (UTC)

[edit] Fixed basis?

What does this sentence from the Details section mean: "Unlike other linear transforms, PCA does not have a fixed set of basis vectors. Its basis vectors depend on the data set." A linear transformation does not possess a basis at all. Does it mean that there is no standard choice of basis, with respect to which to compute the coefficients (matrix) of the linear transformation? 137.22.3.172 (talk) 18:46, 20 March 2008 (UTC)

Good point. I removed that sentence. —Ben FrantzDale (talk) 00:36, 21 March 2008 (UTC)
Presumably what was meant was that PCA cannot be represented by a particular fixed matrix operator.
If it did have a fixed matrix operator, e.g. like a Fourier transform, you could use SVD to identify a particular characteristic set of "input" basis directions, a set of "output" basis directions, and a corresponding set of scalings.
But a PCA is not that kind of a transformation. (It is not linear in the data). Jheald (talk) 09:09, 21 March 2008 (UTC)
But PCA might be considered as approximately linear in the data if the sample size is thought large enough for the covariance matrix to be essentially fixed, and if the effects on only a limited number of data points are considered. Melcombe (talk) 17:43, 22 April 2008 (UTC)
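A small numerical illustration of why the basis depends on the data, and why PCA as a whole is not a linear map of the data, might look like the Python/NumPy sketch below. The two data sets and the helper leading_projector are hypothetical, introduced only for this example.

```python
import numpy as np

def leading_projector(X):
    """Projector onto the first principal direction of X (samples in rows)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    w = Vt[0]                    # first right singular vector
    return np.outer(w, w)        # sign-independent representation

# Two hypothetical 2-D data sets: A varies mostly along x, B mostly along y.
A = np.array([[-2.0, 0.1], [-1.0, -0.1], [1.0, 0.1], [2.0, -0.1]])
B = np.array([[0.1, -2.0], [-0.1, -1.0], [0.1, 1.0], [-0.1, 2.0]])

P_A = leading_projector(A)
P_B = leading_projector(B)
P_sum = leading_projector(A + B)

# The basis is computed from the data, so it differs between data sets,
# and the result for A + B is not the sum of the results for A and B.
basis_depends_on_data = not np.allclose(P_A, P_B)
not_additive = not np.allclose(P_sum, P_A + P_B)
```

The projector w w^T is used instead of w itself because the sign of a principal direction is arbitrary. Once the basis has been computed from a given data set, projecting onto it is of course an ordinary linear operation.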