I want to read column and row names from a text file and build a sparse matrix from them (my current algorithm is described below). I have a working solution, but it is slow for a text file with over 3,000,000 entries.

Does anyone have any suggestions for a faster algorithm than the one I describe below?

First, I start with a text file which provides column and row names, separated by a space. For example:

```
aaaa 11111 22222 33333 bbbb 11111 22222 cccc 11111
```

where `{aaaa,bbbb,cccc}` are 4-character column names and `{11111,22222,33333}` are 5-character row names. Each column name is followed by the row names that belong to it.

Second, I load this text file into `R` using the `scan` function:

```
char_vec <- scan(file = "textFile.txt", what = "character")
```

which converts the textFile information into a character vector.
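For anyone who wants to reproduce this without creating `textFile.txt`, here is a minimal sketch that reads the same sample line from an in-memory string through the `text` argument of `scan`. The `quiet = TRUE` flag is only there to suppress the "Read N items" message.

```r
# Read the sample data from a string instead of a file
char_vec <- scan(text = "aaaa 11111 22222 33333 bbbb 11111 22222 cccc 11111",
                 what = "character", quiet = TRUE)

# A character vector with one element per whitespace-separated token
print(char_vec)
```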

Third, I find all of the possible column names and row names:

```
c_names <- unique(char_vec[nchar(char_vec) == 4])
r_names <- unique(char_vec[nchar(char_vec) == 5])
```

Fourth, I create a sparse matrix from the data:

```
library(Matrix)

createMatrix <- function(char_vec, c_names, r_names)
{
  mySparseMatrix <- Matrix(0, nrow = length(r_names), ncol = length(c_names),
                           sparse = TRUE)
  for (i1 in seq_along(char_vec))
  {
    # A column name sets the current column for the row names that follow
    if (char_vec[i1] %in% c_names)
    {
      c_index <- match(char_vec[i1], c_names)
    }
    # A row name marks the cell (row, current column)
    if (char_vec[i1] %in% r_names)
    {
      r_index <- match(char_vec[i1], r_names)
      mySparseMatrix[r_index, c_index] <- 1
    }
  }
  colnames(mySparseMatrix) <- c_names
  rownames(mySparseMatrix) <- r_names
  return(mySparseMatrix)
}
```

This produces the following output:

```
      aaaa bbbb cccc
11111    1    1    1
22222    1    1    .
33333    1    .    .
```

To show how slow this algorithm is, I padded out the character vector (in an unrealistic manner, but I think it serves its purpose as an example):

```
char_vec <- rep(c("aaaa", "11111", "22222", "33333", "bbbb", "11111", "22222", "cccc", "11111"), 1000)
```

and then ran:

```
system.time(createMatrix(char_vec, c_names, r_names))
```

Output:

```
   user  system elapsed
   9.89    0.00    9.94
```

I have profiled the function using:

```
Rprof("createMatrixOut.out")
z <- createMatrix(char_vec, c_names, r_names)
Rprof(NULL)
```

and displayed a subset of the output using:

```
summaryRprof("createMatrixOut.out")$by.total[1:10,]
```

Output:

```
                  total.time total.pct self.time self.pct
"createMatrix"          8.08    100.00      0.08     0.99
"[<-"                   7.96     98.51      0.08     0.99
"replCmat4"             7.40     91.58      0.04     0.50
"as"                    5.64     69.80      0.04     0.50
"asMethod"              5.06     62.62      0.16     1.98
"standardGeneric"       4.68     57.92      0.24     2.97
"new"                   4.52     55.94      0.02     0.25
"initialize"            4.40     54.46      0.04     0.50
"callNextMethod"        4.24     52.48      0.08     0.99
".Call"                 4.12     50.99      0.60     7.43

I have changed the structure of the data: instead of storing it in a character vector, I create a list:

```
> lst
$aaaa
[1] "11111" "22222" "33333"
$bbbb
[1] "11111" "22222"
$cccc
[1] "11111"
```

It is then much faster to iterate through this list:

```
createMatrix2 <- function(char_vec, c_names, r_names)
{
  # create list: one entry per column name, holding its unique row names
  lst <- list()
  for (i1 in seq_along(char_vec))
  {
    if (nchar(char_vec[i1]) == 4)
    {
      cn <- char_vec[i1]
    } else {
      if (!(char_vec[i1] %in% lst[[cn]])) {
        lst[[cn]] <- c(lst[[cn]], char_vec[i1])
      }
    }
  }
  # create empty matrix
  mySparseMatrix <- Matrix(0, nrow = length(r_names), ncol = length(c_names),
                           sparse = TRUE)
  # fill the matrix: one assignment per distinct (row, column) pair
  for (cn in names(lst)) {
    c_index <- match(cn, c_names)
    for (rn in lst[[cn]]) {
      r_index <- match(rn, r_names)
      mySparseMatrix[r_index, c_index] <- 1
    }
  }
  # names and return
  colnames(mySparseMatrix) <- c_names
  rownames(mySparseMatrix) <- r_names
  return(mySparseMatrix)
}

> system.time(createMatrix(char_vec, c_names, r_names))
   user  system elapsed
   9.60    0.00   10.36
> system.time(createMatrix2(char_vec, c_names, r_names))
   user  system elapsed
   0.06    0.00    0.06
```
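The list version still assigns one cell at a time, so on real data with many distinct (row, column) pairs the fill loop may become the bottleneck again. As a possible follow-up, here is a sketch that turns a list of this shape into the sparse matrix with a single `sparseMatrix` call. It assumes each list element contains no duplicate row names, which holds for the list built above; the names `m2`, `i`, and `j` are only illustrative.

```r
library(Matrix)

# The list from the example: column name -> its row names
lst <- list(aaaa = c("11111", "22222", "33333"),
            bbbb = c("11111", "22222"),
            cccc = "11111")

c_names <- names(lst)
all_rows <- unlist(lst, use.names = FALSE)
r_names <- unique(all_rows)

i <- match(all_rows, r_names)              # row index of every entry
j <- rep(seq_along(lst), lengths(lst))     # column index of every entry

m2 <- sparseMatrix(i = i, j = j, x = 1,
                   dims = c(length(r_names), length(c_names)),
                   dimnames = list(r_names, c_names))
print(m2)
```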

