I'm writing a lexer generator as a spare-time project, and I'm wondering how to go about table compression. The tables in question are 2D arrays of `short` and are very sparse. One dimension is always 256 (the input characters); the other varies in size according to the number of states in the lexer.

The basic requirements for the compression are:

- The data should be accessible without decompressing the full data set, in constant O(1) time.
- The compressed table should be reasonably fast to compute.

I understand the row displacement method, which is what I currently have implemented. It might just be my naive implementation, but what I have is horrendously slow to generate, although quite fast to access. I suppose I could make generation faster using some established string-searching algorithm, such as one of the algorithms found here.
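For context, this is roughly the access pattern row displacement gives you (a simplified sketch in Java, not my actual code; the field names are illustrative). Each state's 256-entry row is packed at some per-state offset into one flat array, with a parallel "owner" array recording which state actually wrote each slot so that overlapping rows can be told apart:

```java
// Sketch of row-displacement lookup. Rows of the sparse table are
// overlapped into one flat array; owner[] disambiguates shared slots.
class DisplacedTable {
    final short[] packed;      // overlapped rows
    final short[] owner;       // which state owns each packed slot (-1 = empty)
    final int[] displacement;  // per-state row offset into packed[]

    DisplacedTable(short[] packed, short[] owner, int[] displacement) {
        this.packed = packed;
        this.owner = owner;
        this.displacement = displacement;
    }

    // O(1) access: two array reads, no decompression.
    short get(int state, int ch) {
        int idx = displacement[state] + ch;
        return owner[idx] == state ? packed[idx] : 0; // 0 = no transition
    }
}
```

The slow part is not this lookup but finding displacements during generation: for each row you have to search for an offset where its non-zero entries don't collide with slots already taken, which naively is quadratic in the table size.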

I suppose an option would be to use a `Dictionary`, but that feels like cheating, and I would like the fast access times I could get from straight arrays with some established algorithm. Perhaps I'm worrying needlessly about this.

From what I can gather, flex does not use this algorithm for its lexing tables. Instead it seems to use something called *row/column equivalence*, which I haven't really been able to find any explanation for.

I would really like to know how this row/column equivalence algorithm that flex uses works, or if there is any other good option that I should consider for this task.
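My best guess so far is that "equivalence" means merging characters whose transition columns are identical across all states into shared classes, so the table shrinks from `numStates × 256` to `numStates × numClasses` while keeping O(1) access. A sketch of that guess (in Java; names and layout are mine, not flex's):

```java
// Guess at column-equivalence compression: characters with identical
// columns share one equivalence class; lookup stays two array reads.
import java.util.HashMap;
import java.util.Map;

class EquivClasses {
    final int[] charToClass = new int[256]; // 256-entry char -> class map
    final short[][] compressed;             // numStates x numClasses

    EquivClasses(short[][] full) {          // full is numStates x 256
        int numStates = full.length;
        Map<String, Integer> seen = new HashMap<>();
        for (int c = 0; c < 256; c++) {
            // Key each character by its full column of transitions.
            StringBuilder key = new StringBuilder();
            for (int s = 0; s < numStates; s++)
                key.append(full[s][c]).append(',');
            charToClass[c] = seen.computeIfAbsent(key.toString(), k -> seen.size());
        }
        compressed = new short[numStates][seen.size()];
        for (int c = 0; c < 256; c++)
            for (int s = 0; s < numStates; s++)
                compressed[s][charToClass[c]] = full[s][c];
    }

    short next(int state, int ch) {         // still O(1): two lookups
        return compressed[state][charToClass[ch]];
    }
}
```

In a typical lexer, huge groups of characters (all digits, all lowercase letters, everything illegal) behave identically in every state, so the number of classes is usually tiny compared to 256.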

**Edit:** To clarify what this data actually is: it is state-transition information for the lexer. The data needs to be stored in a compressed format *in memory*, since the state tables can potentially be huge. The actual values will also be accessed directly from this memory, without decompressing the tables. I have a working solution using row displacement, but it's murderously slow to compute - in part due to my silly implementation.

Perhaps my implementation of the row displacement method will make it clearer how this data is accessed. It's a bit verbose and I hope it's OK that I've put it on pastebin instead of here.

The data is very sparse. For each state it is usually a big run of zeroes followed by a few shorts. It would be trivial to run-length encode it, for instance, but that would spoil the constant-time access.

**Flex** apparently has two pairs of tables: `base` and `default` for the first pair, and `next` and `check` for the second. These tables seem to index one another in ways I don't understand. The dragon book attempts to explain this, but as is often the case with that tome of arcane knowledge, what it says is lost on lesser minds such as mine.

What's unclear is whether your table has the "sequence property" in one dimension or the other.

The sequence property naturally occurs in human language, since a word is composed of many letters and the same sequences of letters are likely to appear again later on. It's also very common in binary programs, source code, etc.

On the other hand, sampled data, such as raw audio or seismic values, does not exhibit the sequence property. It can still be compressed, but using another model (such as a simple delta model followed by entropy coding).

If your data has the sequence property in either of the two dimensions, then you can use a common compression algorithm, which will give you both speed and reliability. You just need to feed it input that is "sequence friendly" (i.e. choose your dimension accordingly).

If speed is a concern for you, have a look at this C# implementation of a fast compressor which is also a very fast decompressor: https://github.com/stangelandcl/LZ4Sharp

This paper, http://www.syst.cs.kumamoto-u.ac.jp/~masato/cgi-bin/rp/files/p606-tarjan.pdf, describes a method for compressing sparse tables, and might be of interest.

Are your tables known beforehand, so that you just need an efficient way to store and access them?

I'm not really familiar with the problem domain, but if your table has a fixed size along one axis (256), would an array of size 256, where each element is a variable-length vector, work? Do you want to be able to pick out an element given an (x, y) pair?

Another cool solution that I've always wanted to use for something is a perfect hash table, http://burtleburtle.net/bob/hash/perfect.html, where you generate a hash function from your data, so you get minimal space requirements and O(1) lookups (i.e. no collisions).
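As a toy illustration of the idea (in Java, and deliberately much simpler than the algorithm on that page): pack each (state, char) pair into one integer key, then search for a seed that makes a simple multiplicative hash collision-free over exactly your key set, and size the value table to match:

```java
// Toy "perfect hash by seed search": find a seed under which a simple
// hash maps every key to a distinct slot, then store values directly.
class ToyPerfectHash {
    final int seed, mask;
    final int[] keys;   // stored key per slot, -1 = empty
    final short[] vals;

    ToyPerfectHash(int[] ks, short[] vs) {
        int size = Integer.highestOneBit(Math.max(4, ks.length * 2)) * 2;
        mask = size - 1;
        int s = 1;
        outer:
        while (true) {                       // retry seeds until collision-free
            boolean[] used = new boolean[size];
            for (int k : ks) {
                int h = hash(k, s) & mask;
                if (used[h]) { s++; continue outer; }
                used[h] = true;
            }
            break;
        }
        seed = s;
        keys = new int[size];
        java.util.Arrays.fill(keys, -1);
        vals = new short[size];
        for (int i = 0; i < ks.length; i++) {
            int h = hash(ks[i], seed) & mask;
            keys[h] = ks[i];
            vals[h] = vs[i];
        }
    }

    static int hash(int k, int s) {
        int h = k * 0x9E3779B1 + s;
        return h ^ (h >>> 16);
    }

    short get(int state, int ch) {
        int k = (state << 8) | ch;           // pack (state, char) into one key
        int h = hash(k, seed) & mask;
        return keys[h] == k ? vals[h] : 0;   // 0 = no transition
    }
}
```

The real generators on that page produce *minimal* perfect hashes (table size exactly equal to the number of keys) and guarantee the search terminates; this sketch only conveys the shape of the lookup.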

None of these solutions employ any type of compression, though; they just minimize the amount of space wasted.
