Problematic results #2

yue-wu · 2014-10-03T05:48:13Z

I made a toy example to test your code, but I guess it is somewhat incorrect. The following is the code that I used.

# under ipython

from matplotlib import pyplot
from sklearn.decomposition import PCA
from pyIPCA import CCIPCA, Skocaj_IPCA, Hall_IPCA
import numpy as np

# make toy data

data = np.random.rand(10000, 10) * 100

# use sklearn pca

ncomp = 2
pca = PCA(n_components=ncomp)
pca.fit(data)
data_pca = pca.transform(data)
pyplot.scatter(data_pca[:, 0], data_pca[:, 1])
pyplot.title('Sklearn-PCA')
pyplot.show()

# use CCIPCA

ipca = CCIPCA(n_components=ncomp)
ipca.fit(data)
idata_pca = ipca.transform(data)
pyplot.scatter(idata_pca[:, 0], idata_pca[:, 1])
pyplot.title('CCIPCA')
pyplot.show()

# use Skocaj_IPCA

ipca = Skocaj_IPCA(n_components=ncomp)
ipca.fit(data)
idata_pca = ipca.transform(data)
pyplot.scatter(idata_pca[:, 0], idata_pca[:, 1])
pyplot.title('Skocaj_IPCA')
pyplot.show()

# use Hall_IPCA

ipca = Hall_IPCA(n_components=ncomp)
ipca.fit(data)
idata_pca = ipca.transform(data)
pyplot.scatter(idata_pca[:, 0], idata_pca[:, 1])
pyplot.title('Hall_IPCA')
pyplot.show()

It seems that neither CCIPCA nor Skocaj_IPCA works properly: after transformation, their centers are far away from the origin (0,0), and their shapes look more like an oval than a circle.
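The visual impression from the scatter plots can also be put into numbers. The following is a minimal sketch, not part of pyIPCA: `projection_summary` and `reference_pca` are hypothetical helper names, and the reference PCA is a plain NumPy SVD rather than any of the incremental methods. It reports how far the projected cloud sits from the origin and how elongated it is.

```python
import numpy as np


def projection_summary(projected):
    """Return (center, axis_ratio) of a 2-D point cloud.

    axis_ratio is the ratio of the longest to the shortest standard
    deviation of the cloud, independent of its orientation.
    A value near 1.0 means a round cloud; larger values mean an oval.
    """
    projected = np.asarray(projected, dtype=float)
    center = projected.mean(axis=0)
    # Eigenvalues of the 2x2 covariance are the variances along the
    # cloud's own principal axes (returned in ascending order).
    eigvals = np.linalg.eigvalsh(np.cov(projected, rowvar=False))
    axis_ratio = float(np.sqrt(eigvals[-1] / eigvals[0]))
    return center, axis_ratio


def reference_pca(data, n_components=2):
    """Batch PCA via SVD of the mean-centered data (NumPy only)."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
data = rng.random((10000, 10)) * 100

center, axis_ratio = projection_summary(reference_pca(data))
print("center:", center)
print("axis ratio:", axis_ratio)
```

Passing `data_pca` or `idata_pca` from the code above to `projection_summary` would give the same two numbers for each method, which makes the comparison less dependent on plot scaling.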

By the way the Skocaj_IPCA often invokes the following warning on my machine:
RuntimeWarning: invalid value encountered in divide
explained_variance_.sum())
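The warning text suggests a division by `explained_variance_.sum()`, but the thread does not confirm the cause. As a sketch under that assumption, the snippet below shows how such a ratio produces an invalid value when the sum is zero, and how `np.errstate` can turn the warning into an exception so the failing line shows up in a traceback. `variance_ratio` is a hypothetical stand-in, not pyIPCA code.

```python
import numpy as np


def variance_ratio(explained_variance):
    """Normalize variances to fractions of the total (hypothetical helper)."""
    explained_variance = np.asarray(explained_variance, dtype=float)
    # Raise FloatingPointError instead of emitting a RuntimeWarning.
    with np.errstate(divide="raise", invalid="raise"):
        return explained_variance / explained_variance.sum()


print(variance_ratio([3.0, 1.0]))  # [0.75 0.25]

try:
    # 0.0 / 0.0 is the classic source of "invalid value encountered".
    variance_ratio([0.0, 0.0])
except FloatingPointError as exc:
    print("caught:", exc)
```

Wrapping the `ipca.fit(data)` call in the same `with np.errstate(invalid="raise"):` context should point to the exact line inside the library where the invalid value first appears.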

Many thanks for your contributions to sklearn.

Rex

kevinhughes27 · 2014-10-03T11:41:15Z

Hmm, you are right, something does look off there. It's been a while since I worked on this, but I remember something about the last component being off with some of the methods. When working with real data the last dimension is usually useless and simply orthogonal to the others, so I think these incremental methods might not bother with getting it correct. Maybe re-read the papers and see if they mention it.

yue-wu · 2014-10-03T18:50:36Z

Thank you for your efforts. At least Hall_IPCA works fine. I will use it to compute PCA for 3M samples. If I find anything wrong, I will let you know. By the way, I later noticed that pylearn2 (still under development) also has its own online PCA, but I guess it uses yet another method. For your interest, here is the link to the project: http://deeplearning.net/software/pylearn2/index.html

kevinhughes27 · 2014-10-03T19:08:03Z

Cool, thanks for the link!

If you find any problems feel free to send a patch my way!

yue-wu · 2014-10-03T21:14:14Z

Thank you again. It seems that incremental PCA only solves the problem of a possible memory shortage; it is still not a good idea to ask a cluster to use only one core to compute PCA. I am now reading http://mdp-toolkit.sourceforge.net/tutorial/parallel.html. It seems that they provide a way to perform parallel PCA estimation.
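One common way to make PCA both memory-bounded and parallel is to accumulate per-chunk sums that can be merged in any order, then diagonalize the resulting covariance once. The sketch below illustrates that idea with NumPy only; it is not MDP's implementation, and `chunk_stats`, `merge_stats`, and `pca_from_stats` are hypothetical names. Here the chunks are processed in a loop, but each `chunk_stats` call is independent and could run on a separate core or node.

```python
import numpy as np


def chunk_stats(chunk):
    """Sufficient statistics of one chunk: row count, column sums, X^T X."""
    chunk = np.asarray(chunk, dtype=float)
    return chunk.shape[0], chunk.sum(axis=0), chunk.T @ chunk


def merge_stats(a, b):
    """Combine the statistics of two chunks; the order does not matter."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def pca_from_stats(stats, n_components):
    """Recover mean, principal directions, and variances from merged stats."""
    n, total, scatter = stats
    mean = total / n
    # Sample covariance from accumulated sums:
    # (X^T X - n * mean mean^T) / (n - 1).
    cov = (scatter - n * np.outer(mean, mean)) / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns ascending eigenvalues; keep the largest ones first.
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order].T, eigvals[order]


rng = np.random.default_rng(0)
data = rng.random((10000, 10)) * 100

stats = None
for chunk in np.array_split(data, 8):
    part = chunk_stats(chunk)  # this step could run on a separate worker
    stats = part if stats is None else merge_stats(stats, part)

mean, components, variances = pca_from_stats(stats, n_components=2)
projected = (data - mean) @ components.T
print(projected.mean(axis=0), variances)
```

This one-pass covariance formula can lose precision when the mean is very large compared with the spread of the data, so it is a starting point rather than a drop-in replacement for a tested library.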
