I am trying to speed up a sparse matrix-vector product using open mp, the code is as follows:

```
void zAx(double * z, double * data, long * colind, long * row_ptr, double * x, int M){
long i, j, ckey;
int chunk = 1000;
//int * counts[8]={0};
#pragma omp parallel num_threads(8)
{
#pragma omp for private(ckey,j,i) schedule(static,chunk)
for (i=0; i<M; i++ ){
z[i]=0;
for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
j = colind[ckey];
z[i] += data[ckey]*x[j];
}
}
}
}
```

Now, this code runs fine, and produces the correct result, it however only gives me a speed up of ~30%. I have checked to see that the threads are all getting about the same number of non-zero elements (they are), and the matrix is fairly large (300,000 x 300,000), so I'm hoping the overhead isn't the only issue. I've also tried running with different chunk sizes and thread numbers, and I get similar performance.

Is there something else I could try to get a bit of extra speed out of this? Or something I'm obviously doing wrong?

Cheers.

Edit: Just commented out '//int * counts[8]={0}', because it was leftover from counting the work allocation. Not needed

Edit2 (more details):

Okay so I timed a loop of calling this 5000 times and got the average times:

- seq: 0.0036 (seconds?)
- 2 threads: 0.002613
- 4 threads: 0.002308
- 8 threads: 0.002384

The matrix is of size: **303544x303544** and has: **2122980** non-zero elements.

With a much smaller matrix 30000x30000 I get times that go more like

- seq 0.000303
- 8 threads 0.000078

So it seems the large size may be my issue.

I can't write this in a comment so I'll do it as an answer. I think it is the problem but not 100% sure.

Shared variables across threads can cause issues. I do not think it is a problem here but might be. Usually only when writing, but if there are no locks then it will just result in corrupt data. Not sure if OpenMP does any locking internally or not.

Chances are your threads are stalling due to locks which is the only reason why multi-threading runs much slower in proportion to a single thread. That or it is not your code at all. It is best if you test it on a small data set that is in memory with no potentially for bottlenecks(so all you're doing is processing the data and timing *only* the zAx function).

0.3M^2 = 90B. Which means you are definitely going to have issues with either paging or loading the files. (and if you are using int's that 4's times the size)

A better approach might be to operate on X amount of the matrix while the disk loads Y amount in parallel. By choosing X and Y correctly you will not have much reduction in speed. If you are loading 8GB, processing, then loading 8GB more you have to wait each time to load the data.

You can make the processing intelligent by choosing X and Y = (8GB - X) by monitoring the time the processing and file loading are doing nothing.

To check and see if the disk access is the problem try using a smaller dataset and time only zAx and see if that helps. If it does then it's the disk. If not, it's the code.

Welcome to the wonderful world of memory-bound problems. To relieve you from your pain, I would like to inform you, that sparse matrix-vector multiplication is one of the many things that cannot be effectively parallelised or even vectorised on a single multi-core chip, unless *all* data could fit in the last level cache or the memory bus is *really really wide*.

Why? Simply because the ratio of computation to memory access is extremely low. For each iteration of the inner loop you fetch once the column index into `j`

(8 bytes), the matrix element into `data`

(8 bytes), the value of the vector element (8 bytes) and the previous value of the result (since compilers rarely optimise access to shared variables) (8 bytes). Then you perform 2 very fast floating point operations (FLOPs) and perform a store (although the `+=`

operator is translated to a single instruction, it is still a "fetch-modify-write" one). In total you load 32 bytes and you perform 2 FLOPs over them. This makes 1/16 FLOPs per byte.

A modern **SSE-capable CPU core** can perform 4 double-precision FLOPs/cycle, which often results to something like 8 GFLOPS per CPU core (assuming basic frequency of 2 GHz). With AVX the number is doubled, so you get up to 16 GFLOPS per core on a 2 GHz Intel Sandy/Ivy Bridge or AMD equivalent. In order to saturate this processing power with data, given the 1/16 FLOPs/byte, you would need at least 128 GiB/s of memory bandwidth.

A **high-end Nehalem-EX processor** like Xeon X7560 runs at 2,26 GHz (9,04 GFLOPS/core) and its shared L3 cache (L1 and L2 caches are per-core) delivers about 275 GiB/s. At 9,04 GFLOPS/core, you need 144,64 GiB/s per core in order to feed the inner loop of your `zAx`

routine. This means that in the ideal case, the L3 cache of this CPU cannot feed more than 2 fully vectorised multiplication kernels.

Without SSE vectorisation the FLOPS rate is twice as lower for double precision, so one could expect the problem to scale up to 4 threads. Things get extremely bad once your problem becomes larger than the L3 cache as the memory bus delivers about ten times less bandwidth.

Try the following version of the inner loop just to see if the compiler is smart enough to follow the relaxed memory view of OpenMP:

```
#pragma omp for private(ckey,j) schedule(static,chunk)
for (i=0; i<M; i++){
double zi = 0.0;
for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
j = colind[ckey];
zi += data[ckey]*x[j];
}
z[i] = zi;
}
```

Unfortunately there is nothing more that you can do. Sparse matrix-vector multiplication scales with the number of CPU sockets, not with the number of CPU cores. You would need a multi-socket system with separate memory controllers, e.g. any system with more than one (post-)Nehalem or AMD64 processors.

Edit: optimisation hint. Do you really need `long`

to store the column index and the row pointers? With 2122980 non-zero elements you would be pretty fine using `int`

instead. It would save loading 4 bytes per element in the inner loop and another 4 bytes per row in the outer loop.

Similar Questions

preface: As the matlab guiderules state, Usually, when one wants to efficiently populate a sparse matrix in matlab, he should create a vector of indexes into the matrix and a vector of values he wants

I am trying to find an efficient way to retrieve a list / vector / array of the non-zero upper triangular elements of a sparse matrix in R. For example: library(igraph) Gmini <- as.directed(graph.

I've a Sparse matrix in CSR Sparse format in python and I want to import it to MATLAB. MATLAB does not have a CSR Sparse format. It has only 1 Sparse format for all kind of matrices. Since the matrix

I would like to know how I can efficiently read local mp3 files with NAudio? I tried several methods to read mp3 and play it back using a WaveOut but it is very slow sometimes. For my current test cas

In numpy if you want to calculate the sinus of each entry of a matrix (elementise) then a = numpy.arange(0,27,3).reshape(3,3) numpy.sin(a) will get the job done! If you want the power let's say to 2

I would like to know the best way to check if a scipy sparse matrix, if CSC or CSR. Right now I'm using. rows, cols = X.shape() indptr = X.indptr() if len(indptr) == cols + 1: print csc else: print

I'm involved in the resolution of a system of the type Ax = b, where A is a square sparse matrix, x is the vector of the unknows (I have to compute it) and b is a vector of all zeros excpet for the la

What do you think? What would be faster and how much faster: Doing sparse matrix (CSR) multiplication (with a vector) on the GPU or the CPU (multithreaded)?

I'm writing a program in Python using scipy's spsolve to solve a linear equation using a sparse matrix (csr_matrix). The matrices are fairly large (M=90826x90826, b=90826x1) and are hard to check by h

I create a sparse matrix in scala breeze, ie using http://www.scalanlp.org/api/breeze/linalg/CSCMatrix.html. Now I want to get a column slice from it. How to do this? Edit: there are some further requ

Given a sparse binary matrix A (csr, coo, whatever) I want to make a plot such that I can see the position (i,j) = white in the figure if A(i,j) = 1, and (i,j) = black if A(i,j) = 0; For a dense numpy

So I have this matrix here, and it is of size 13 x 8198. (I have called it 'blah'). This is a sparse matrix, in that, most of its entries are 0. When I do an imagesc(blah), I get the following image:

I'm working with large sparse matrices that are not exactly very sparse and I'm always wondering how much sparsity is required for storage of a matrix as sparse to be beneficial? We know that sparse r

I need to insert the zero elements in any sparse matrix in the Matrix Market format (but already without the headers). The first column is the number of the ROW, the second columns is the number of th

I read through this page on assigning a sparse matrix. Unfortunately, I do not understand it. Can anyone help me out with an example? For instance, how should I assign the following 10 by 8 sparse mat

How can you visualize the sparsity pattern of a large sparse matrix? The matrix is too large to fit in memory as a dense array, so I have it in csr_matrix format. When I try pylab's matshow with it, I

I am trying to use the CUSP library. I am reading .txt files which are basically sparse COO representation. I am using CUSP to convert into CSR format. When I print the matrix with cusp::print() it pr

I would appreciate any help, to understand following behavior when slicing a lil_matrix (A) from the scipy.sparse package. Actually, I would like to extract a submatrix based on an arbitrary index lis

In MATLAB, it is very convenient to create a pentadiagonal sparse matrix using commands like this: I = eye(m); % create identity matrix e = ones(m,1); % create an array of all 1's T = spdiags([e -4*e

I have a very large and sparse matrix of size 180GB(text , 30k * 3M) containing only the entries and no additional data. I have to do matrix multiplication , inversion and some similar linear algebra

I'm trying to create a sparse vector class in C++, like so: template<typename V, V Default> class SparseVector { ... } Internally, it will be represented by an std::map<int, V> (where V

If I am using the sparse.lil_matrix format, how can I remove a column from the matrix easily and efficiently?

When trying to directly set the data attribute of a sparse lil_matrix, I encounter very unexpected behavior. Can someone explain what is going on in the following simple example? My particular use ca

I have made a simple class called Vector3. It's a 3 dimensional vector with some basic math implementions. Now i want to be able to rotate this single vector, but i get an exception. I have this: pri

I am implementing parallel dot product in open MP I have this code: #include <stdio.h> #include <stdlib.h> #include <string.h> #include <time.h> #include <math.h> #includ

I'm currently working on a scipy sparse csr matrix. I would like to delete all rows in the matrix that contain 0 in the data array of the matrix (the data array is the 1s and 2s you can see in the exa

I am using cusp for sparse matrix multiplication. From the resultant matrix i need the max value without copying the matrix from device memory to host memory. I am planning to wrap the resultant matri

In order to compute the product between 2 matrices A and B (nxm dimension) in a parallel mode, I have the following restrictions: the server sends to each client a number of rows from matrix A, and a

I have two large square sparse matrices, A & B, and need to compute the following: A * B^-1 in the most efficient way. I have a feeling that the answer involves using scipy.sparse, but can't for t

I'm trying to implement the functionality of MATLAB function sparse. Insert a value in sparse matrix at a specific index such that: If a value with same index is already present in the matrix, then th

I've read several topics, but I'm lost. I'm quite new to this. I want to store huge sparse matrix and have several idea's but can choose between them. Here's my needs: Adjacency matrix of approx. 50

I have a big sparse matrix. I want to take log4 for all element in that sparse matrix. I try to use numpy.log() but it doesn't work with matrices. I can also take logarithm row by row. Then I crush

I have a list of sparse vectors (in R). I need to convert this list to a sparse matrix. Doing it via a for-loop takes a long time. sm<-spMatrix(length(tc2),n.col) for(i in 1:length(tc2)){ sm[i,]<

Is there a well-vectorized way to take the product of all the nonzero elements in each column of a sparse matrix in octave (or matlab) (returning a row-vector of products)?

I want to create a custom optimized matrix operation (a smart kronecker product based on what I know about the sparse matrices i'm using) using MathNet.numerics for csharp. Is there an accessor to get

It's possible to represent every sparse matrix by a two dimensional array. to do so, should save rows, columns and non-zero elements in it. A L-diagonal matrix ($n$ by $n$) is given, what is the bigge

I am trying to make my control algorithm more efficient since my matrices are sparse. Currently, I am doing conventional matrix-vector multiplications in Simulink/xPC for a real-time application. I ca

I'm trying to transpose a sparse matrix in c++. I'm struggling with the traversal of the new transposed matrix. I want to enter everything from the first row of the matrix to the first column of the n

Question: How can I split 1 sparse matrix into 2, based on the values in a list? That is, I have a sparse matrix X: >>print type(X) <class 'scipy.sparse.csr.csr_matrix'> that I visualize

I have a MxN Matrix and would like to convert into a vector MNx1 with all the elements of the row from the Matrix as the elements of the Vector. I tried using reshape but I was not successful. Here i

How can I create a sparse matrix from a list of dimension names? Suppose you have this matrix edgelist in a data frame: from to weight 1 4 a 1 2 5 b 2 3 6 c 3 It can be created like this: from <-

What is the state of the art for fastest linear solver for sparse, positive semi definite and strictly diagonally dominant matrix with N varies from ~700 to ~3000, and about a 1/16 of the matrix is no

I'm fairly new to MATLAB. Normal matrix multiplication of a M x K matrix by an K x N matrix -- C = A * B -- has c_ij = sum(a_ik * b_kj, k = 1:K). What if I want this to be instead c_ij = sum(op(a_ik,

It seems to me that in SAGE the only difference between creating a dense matrix and a sparse matrix is by the flag passed to the constructor (sparse = True). In particular, this means that if I want

I have n documents in MongoDB containing a scipy sparse vector, stored as a pickle object and initially created with scipy.sparse.lil. The vectors are all of the same size, say p x 1. What I need to

I ve got a sparse representation of a square matrix in matlab. I am trying to perform eigen analysis calculating its eigen value and eigen vector n-1 times by removing its time a row and a column. My

I have the matrix like this Table1 = [A B ; C D ; E F] and the vector: V = [a ; b ; c] How to get the multiplication of second column of matrix M to get the answer as below? ans =[aB ; bD; cF] Curr

I have downloaded and included UJMP (Universal Java Matrix Package) library to my project for generating sparse matrix. But I could not find any documentation about functions of the library, how to cr

type(A) <class 'scipy.sparse.csc.csc_matrix'> A.shape (8529, 60877) print A[0,:] (0, 25) 1.0 (0, 7422) 1.0 (0, 26062) 1.0 (0, 31804) 1.0 (0, 41602) 1.0 (0, 43791) 1.0 print A[1,:] (0, 7044) 1.0

I have a large scipy sparse symmetric matrix which I need to condense by taking the sum of blocks to make a new smaller matrix. For example, for a 4x4 sparse matrix A I will like to make a 2x2 matrix