I would like to read column and row names from a text file and build a sparse matrix from them (my current algorithm is described below). I have a working solution, but it is slow for a text file with over 3,000,000 entries.

Does anyone have any suggestions for a faster algorithm than the one I describe below?

First, I start with a text file which provides column and row names, separated by a space. For example:

```
aaaa 11111 22222 33333 bbbb 11111 22222 cccc 11111
```

where `{aaaa,bbbb,cccc}` are 4-character column names and `{11111,22222,33333}` are 5-character row names. Each column name is followed by the row names that have an entry in that column.

Second, I load this text file into `R` using the `scan` function:

```
char_vec <- scan(file = "textFile.txt", what = "character")
```

which reads the contents of the file into a character vector.

Third, I find all of the possible column names and row names:

```
c_names <- unique(char_vec[nchar(char_vec) == 4])
r_names <- unique(char_vec[nchar(char_vec) == 5])
```

Fourth, I create a sparse matrix from the data:

```
library(Matrix)

createMatrix <- function(char_vec, c_names, r_names)
{
  # start from an all-zero sparse matrix
  mySparseMatrix <- Matrix(0, nrow = length(r_names), ncol = length(c_names),
                           sparse = TRUE)
  for (i1 in seq_along(char_vec))
  {
    # a column name sets the current column
    if (char_vec[i1] %in% c_names)
    {
      c_index <- match(char_vec[i1], c_names)
    }
    # a row name marks an entry in the current column
    if (char_vec[i1] %in% r_names)
    {
      r_index <- match(char_vec[i1], r_names)
      mySparseMatrix[r_index, c_index] <- 1
    }
  }
  colnames(mySparseMatrix) <- c_names
  rownames(mySparseMatrix) <- r_names
  return(mySparseMatrix)
}
```

This gives this output:

```
      aaaa bbbb cccc
11111    1    1    1
22222    1    1    .
33333    1    .    .
```

To show how slowly this algorithm runs, I padded out the character vector (in an unrealistic manner, but I think it serves its purpose as an example):

```
char_vec <- rep(c("aaaa", "11111", "22222", "33333", "bbbb", "11111", "22222", "cccc", "11111"), 1000)
```

and then ran:

```
system.time(createMatrix(char_vec, c_names, r_names))
```

Output:

```
   user  system elapsed
   9.89    0.00    9.94
```

I have profiled the function using:

```
Rprof("createMatrixOut.out")
z <- createMatrix(char_vec, c_names, r_names)
Rprof(NULL)
```

and displayed a subset of the output using:

```
summaryRprof("createMatrixOut.out")$by.total[1:10,]
```

Output:

```
                  total.time total.pct self.time self.pct
"createMatrix"          8.08    100.00      0.08     0.99
"[<-"                   7.96     98.51      0.08     0.99
"replCmat4"             7.40     91.58      0.04     0.50
"as"                    5.64     69.80      0.04     0.50
"asMethod"              5.06     62.62      0.16     1.98
"standardGeneric"       4.68     57.92      0.24     2.97
"new"                   4.52     55.94      0.02     0.25
"initialize"            4.40     54.46      0.04     0.50
"callNextMethod"        4.24     52.48      0.08     0.99
".Call"                 4.12     50.99      0.60     7.43
```
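One reading of this profile is that almost all of the time goes into `[<-` on the sparse matrix (`replCmat4` and the `as`/`new` calls underneath it), so each single-element assignment is expensive. A minimal sketch that avoids this is to compute all row and column indices first and call `sparseMatrix()` once. The function name `createMatrixVec` is hypothetical, and the sketch assumes that the first element of `char_vec` is a column name and that column names are exactly the 4-character entries:

```r
library(Matrix)

# Hypothetical vectorised variant: build index vectors, then construct once.
createMatrixVec <- function(char_vec, c_names, r_names) {
  is_col <- nchar(char_vec) == 4
  # for every position, the most recent column name seen so far
  # (assumes char_vec starts with a column name)
  col_of <- char_vec[is_col][cumsum(is_col)]
  i <- match(char_vec[!is_col], r_names)
  j <- match(col_of[!is_col], c_names)
  # drop repeated (row, column) pairs, since sparseMatrix() would sum them
  keep <- !duplicated(cbind(i, j))
  sparseMatrix(i = i[keep], j = j[keep], x = rep(1, sum(keep)),
               dims = c(length(r_names), length(c_names)),
               dimnames = list(r_names, c_names))
}

# usage with the padded example vector from above
char_vec <- rep(c("aaaa", "11111", "22222", "33333", "bbbb", "11111",
                  "22222", "cccc", "11111"), 1000)
c_names <- unique(char_vec[nchar(char_vec) == 4])
r_names <- unique(char_vec[nchar(char_vec) == 5])
m <- createMatrixVec(char_vec, c_names, r_names)
```

Because the matrix is built in one call, there is no per-element modification of the sparse structure; whether this is fast enough for 3,000,000 entries would need to be timed on the real data.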

I have changed the structure of the data: instead of storing them in a character vector, I create a list:

```
> lst
$aaaa
[1] "11111" "22222" "33333"
$bbbb
[1] "11111" "22222"
$cccc
[1] "11111"
```

It is then much faster to iterate through this list:

```
createMatrix2 <- function(char_vec, c_names, r_names)
{
  # create list: one element per column name, holding its unique row names
  lst <- list()
  for (i1 in seq_along(char_vec))
  {
    if (nchar(char_vec[i1]) == 4)
    {
      cn <- char_vec[i1]
    } else {
      if (!(char_vec[i1] %in% lst[[cn]])) {
        lst[[cn]] <- c(lst[[cn]], char_vec[i1])
      }
    }
  }
  # create empty matrix
  mySparseMatrix <- Matrix(0, nrow = length(r_names), ncol = length(c_names),
                           sparse = TRUE)
  # fill the matrix
  for (cn in names(lst)) {
    c_index <- match(cn, c_names)
    for (rn in lst[[cn]]) {
      r_index <- match(rn, r_names)
      mySparseMatrix[r_index, c_index] <- 1
    }
  }
  # names and return
  colnames(mySparseMatrix) <- c_names
  rownames(mySparseMatrix) <- r_names
  return(mySparseMatrix)
}

> system.time(createMatrix(char_vec, c_names, r_names))
   user  system elapsed
   9.60    0.00   10.36
> system.time(createMatrix2(char_vec, c_names, r_names))
   user  system elapsed
   0.06    0.00    0.06
```
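If the list representation is kept, the two filling loops could also be replaced by a single `sparseMatrix()` call. The following is only a sketch: `listToSparse` is a hypothetical helper name, and it assumes `lst` is a named list whose elements hold the unique row names of each column, as printed above (repeated row names within one column would be summed rather than set to 1):

```r
library(Matrix)

# Hypothetical helper: turn a named list of row names into a sparse 0/1 matrix.
listToSparse <- function(lst, c_names, r_names) {
  rows <- unlist(lst, use.names = FALSE)
  i <- match(rows, r_names)
  # repeat each column name once per row name it owns
  j <- match(rep(names(lst), lengths(lst)), c_names)
  sparseMatrix(i = i, j = j, x = rep(1, length(i)),
               dims = c(length(r_names), length(c_names)),
               dimnames = list(r_names, c_names))
}

lst <- list(aaaa = c("11111", "22222", "33333"),
            bbbb = c("11111", "22222"),
            cccc = "11111")
m2 <- listToSparse(lst, names(lst), c("11111", "22222", "33333"))
```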
