I have a very large matrix (about 500000 * 20000) containing data that I would like to analyze with PCA. I'm using the ParallelColt library, and I have tried both singular value decomposition and eigenvalue decomposition to get the eigenvectors and eigenvalues of the covariance matrix. But both methods exhaust the heap and I get "OutOfMemory" errors...

Even with SparseDoubleMatrix2D (the data are very sparse) the errors remain, so I ask you: how can I solve this problem?

Should I change libraries?

You can compute PCA with Oja's rule: it's an iterative algorithm that improves an estimate of the principal components one vector at a time. It's slower than the usual PCA, but requires you to store only one vector in memory. It's also very numerically stable.
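To make the idea concrete, here is a minimal sketch of Oja's rule for the first principal component only. It assumes the rows are already mean-centered, and the class name, method names, learning rate and epoch count are all made up for illustration. The explicit renormalization after each step is an added safeguard, not part of the rule itself. The rows sit in a small array here; with the real data you would read them one at a time from disk and call `update` on each.

```java
import java.util.Random;

/** Sketch of Oja's rule for the first principal component (all names are illustrative). */
final class OjaSketch {

    /** One Oja step for a single mean-centered row x: w += eta * y * (x - y * w), with y = w . x. */
    static void update(double[] w, double[] x, double eta) {
        double y = 0.0;
        for (int j = 0; j < w.length; j++) {
            y += w[j] * x[j];
        }
        for (int j = 0; j < w.length; j++) {
            w[j] += eta * y * (x[j] - y * w[j]);
        }
        normalize(w); // safeguard: keep the estimate at unit length
    }

    /** Scales w in place to unit Euclidean length. */
    static void normalize(double[] w) {
        double sumSq = 0.0;
        for (double v : w) {
            sumSq += v * v;
        }
        double norm = Math.sqrt(sumSq);
        for (int j = 0; j < w.length; j++) {
            w[j] /= norm;
        }
    }

    /** Runs several passes over the rows; only w (one vector) is kept between rows. */
    static double[] firstComponent(double[][] rows, int epochs, double eta, long seed) {
        int d = rows[0].length;
        Random rnd = new Random(seed);
        double[] w = new double[d];
        for (int j = 0; j < d; j++) {
            w[j] = rnd.nextGaussian(); // random starting direction
        }
        normalize(w);
        for (int e = 0; e < epochs; e++) {
            for (double[] x : rows) {
                update(w, x, eta);
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Tiny centered data set whose variance lies almost entirely along (1, 1).
        double[][] rows = {
            {2, 2}, {-2, -2}, {1, 1}, {-1, -1}, {0.1, -0.1}, {-0.1, 0.1}
        };
        double[] w = firstComponent(rows, 200, 0.05, 42L);
        System.out.println(w[0] + " " + w[1]);
    }
}
```

The estimate should end up close to the direction (1, 1)/sqrt(2), up to sign. Further components would need a deflation step or a generalization such as Sanger's rule, which this sketch does not cover.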

I'm not sure that changing libraries will help. You're going to need doubles (8 bytes per). I don't know what the dimension of the covariance matrix would be in this case, but switching libraries won't change the underlying calculations much.
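As a rough back-of-the-envelope check, assuming the 20000 columns are the variables (so the covariance matrix would be 20000 x 20000) and that everything is stored densely as 8-byte doubles, the sizes can be estimated like this. The class and method names are only for illustration.

```java
/** Rough dense-storage estimate for the matrices in the question (illustrative names). */
final class PcaMemoryEstimate {

    /** Bytes needed to hold a dense rows x cols matrix of doubles, ignoring object overhead. */
    static long denseBytes(long rows, long cols) {
        return rows * cols * Double.BYTES;
    }

    public static void main(String[] args) {
        long n = 500_000L; // observations (rows), taken from the question
        long d = 20_000L;  // variables (columns), assumed from the question

        System.out.println("data matrix bytes:       " + denseBytes(n, d));
        System.out.println("covariance matrix bytes: " + denseBytes(d, d));
    }
}
```

That works out to about 80 GB for the dense data matrix and about 3.2 GB for the covariance matrix alone, before any workspace the decomposition itself allocates. A sparse data matrix takes far less, but the covariance matrix of sparse data is generally not sparse.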

What is the -Xmx setting when you run? What about the perm gen size? Perhaps you can increase them.
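If you're not sure what limit the JVM is actually running with, a small check along these lines will tell you. The class name and the 3.2 GB figure (a dense 20000 x 20000 matrix of doubles) are assumptions for illustration; `Runtime.maxMemory()` is the standard call.

```java
/** Prints the heap limit the JVM is running with (illustrative helper, not from any library). */
final class HeapCheck {

    /** Maximum amount of memory the JVM will attempt to use, in bytes. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** True if a structure of requiredBytes could fit under the given limit at all. */
    static boolean fits(long requiredBytes, long maxBytes) {
        return requiredBytes <= maxBytes;
    }

    public static void main(String[] args) {
        long max = maxHeapBytes();
        long covarianceBytes = 20_000L * 20_000L * Double.BYTES; // assumed dense covariance matrix
        System.out.println("max heap (MB): " + max / (1024 * 1024));
        System.out.println("dense covariance could fit: " + fits(covarianceBytes, max));
    }
}
```

Run it with the same options as your real job, for example `java -Xmx4g HeapCheck`, to see the effect of the setting.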

Does the algorithm halt immediately or does it run for a while? If it's the latter, you can attach to the process using Visual VM 1.3.3 (download and install all the plugins). It'll let you see what's happening on the heap, threads, etc. Could help you ferret out the root cause.

A Google search for "Java eigenvalue of large matricies" turned up this library from Google. If you scroll down in the comments, I wonder if a block Lanczos eigenvalue analysis might help. It might be enough if you can get by with a subset of the eigenvalues.

These SVM implementations claim to be useful for large datasets:

http://www.support-vector-machines.org/SVM_soft.html

I don't think you can ask for more than 2GB of heap on a 32-bit JVM:

http://www.theserverside.com/discussions/thread.tss?thread_id=26347

According to Oracle, to get a larger heap you'll need a 64-bit JVM running on a 64-bit OS:

http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#gc_heap_32bit

I built some sparse, incremental algorithms for just this sort of problem. Conveniently, they're built on top of Colt.

See the HallMarshalMartin class in the trickl-cluster library below. You can feed it chunks of rows at a time, so it should solve your memory issues.

The code is available under the GPL. I'm afraid I've only just released it, so it's short on documentation, but hopefully it's fairly self-explanatory. There are JUnit tests that should help with usage.
