I am doing a text classification task in R, and I have obtained a document-term matrix of size 22490 by 120,000 (only 4 million non-zero entries, i.e. less than 1% of the entries). Now I want to reduce the dimensionality using PCA (Principal Component Analysis). Unfortunately, R cannot handle this huge matrix, so I have stored the sparse matrix in a file in Matrix Market format, hoping to use some other technique to do PCA.

So could anyone give me some hints on useful libraries (in any programming language) that can do PCA on this large-scale matrix with ease, or on how to do a longhand PCA myself, in other words, **calculate the covariance matrix first, and then calculate the eigenvalues and eigenvectors of the covariance matrix**.

What I want is to **calculate all PCs (120,000), and choose only the top N PCs which account for 90% of the variance**. Obviously, in this case I have to set a threshold a priori to zero out some very tiny variance values (in the covariance matrix); otherwise the covariance matrix will not be sparse, and its size would be 120,000 by 120,000, which is impossible to handle on a single machine. Also, the loadings (eigenvectors) will be extremely large and should be stored in sparse format.

Thanks very much for any help!

Note: I am using a machine with 24 GB RAM and 8 CPU cores.

The Python toolkit scikit-learn has a few PCA variants, of which `RandomizedPCA` can handle sparse matrices in any of the formats supported by `scipy.sparse`. `scipy.io.mmread` should be able to parse the Matrix Market format (I never tried it, though).

Disclaimer: I'm on the scikit-learn development team.

**EDIT**: the sparse matrix support in `RandomizedPCA` has been deprecated in scikit-learn 0.14; `TruncatedSVD` should be used in its stead. See the documentation for details.
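A minimal sketch of that pipeline: a small random sparse matrix stands in for the real 22490 x 120,000 document-term matrix, it is round-tripped through Matrix Market format, and `TruncatedSVD` picks the smallest prefix of components reaching 90% of the *captured* variance (relative to the components actually computed, not the full spectrum):

```python
import numpy as np
import scipy.sparse as sp
from scipy.io import mmread, mmwrite
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the 22490 x 120,000 document-term matrix.
X = sp.random(200, 1000, density=0.01, format="csr", random_state=0)

# Round-trip through Matrix Market format, as in the question.
mmwrite("dtm.mtx", X)
X = mmread("dtm.mtx").tocsr()

# TruncatedSVD works directly on the sparse matrix: no densifying,
# no explicit covariance matrix. Compute more components than you
# expect to need, then keep the smallest prefix reaching the cut-off.
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)

# Cumulative share of the variance captured by the computed components.
cum = np.cumsum(svd.explained_variance_ratio_)
n_keep = int(np.searchsorted(cum / cum[-1], 0.90)) + 1
X_top = X_reduced[:, :n_keep]
print(X_top.shape)
```

Note that `explained_variance_ratio_` is measured against the total variance of `X`, so dividing by `cum[-1]` only tells you the share of the variance the 100 computed components captured; with 120,000 columns you would need to experiment with `n_components` to actually reach 90% of the full variance.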

Instead of running PCA, you could try Latent Dirichlet Allocation (LDA), which decomposes the document-word matrix into a document-topic matrix and a topic-word matrix. Here is a link to an R implementation: http://cran.r-project.org/web/packages/lda/ - there are quite a few other implementations out there if you Google, though.

With LDA you need to specify a fixed number of topics (similar to principal components) in advance. A potentially better alternative is HDP-LDA (http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-r21.tgz), which learns the number of topics that form a good representation of your corpus.

If you can fit your dataset in memory (which it seems like you can), then you also shouldn't have a problem running the LDA code.
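To illustrate the decomposition (using scikit-learn's `LatentDirichletAllocation` rather than the R package linked above, and a toy count matrix standing in for the real data):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for the document-term matrix; LDA expects
# non-negative counts, not tf-idf weights.
counts = sp.random(100, 500, density=0.05, format="csr", random_state=0)
counts.data = np.ceil(counts.data * 5)

# n_components plays the role of the fixed number of topics.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topic = lda.fit_transform(counts)   # document-topic matrix
topic_word = lda.components_            # topic-word matrix

print(doc_topic.shape, topic_word.shape)
```

Each document is then represented by its 20-dimensional row of `doc_topic`, which is the dimensionality reduction you were after.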

As a number of people on the scicomp forum pointed out, there should be no need to compute all 120k principal components. Algorithms like power iteration (http://en.wikipedia.org/wiki/Power_iteration) calculate the largest eigenvalues of a matrix, and LDA algorithms will converge to a minimum-description-length representation of the data given the number of topics specified.
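For reference, power iteration is only a few lines; this sketch recovers the dominant eigenpair of a small symmetric matrix whose largest eigenvalue is 3:

```python
import numpy as np

def power_iteration(A, n_iter=200, seed=0):
    """Approximate the dominant eigenvalue/eigenvector of a square matrix A."""
    rng = np.random.RandomState(seed)
    v = rng.rand(A.shape[0])
    for _ in range(n_iter):
        w = A @ v
        v = w / np.linalg.norm(w)   # renormalize each step
    return v @ (A @ v), v           # Rayleigh quotient, eigenvector

# Symmetric 2x2 example with eigenvalues 3 and 1.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
lam, v = power_iteration(A)
print(round(lam, 6))  # ≈ 3.0
```

Production eigensolvers for sparse matrices (e.g. `scipy.sparse.linalg.eigsh`, which uses Lanczos-type iteration) build on the same idea and only ever touch the matrix through matrix-vector products, which is what makes the top-N-components computation feasible without forming the covariance matrix.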

In R, `big.PCA` from the `bigpca` package (http://cran.r-project.org/web/packages/bigpca/bigpca.pdf) does the job.
