Is it possible to perform a fold in the State monad in constant stack and heap space? Or is a different functional technique a better fit to my problem?

The next sections describe the problem and a motivating use case. I'm using Scala, but solutions in Haskell are welcome too.

**`State` Monad Fills the Heap**

Assume Scalaz 7. Consider a monadic fold in the State monad. To avoid stack overflows, we'll trampoline the fold.

```
import scalaz._
import Scalaz._
import scalaz.std.iterable._
import Free.Trampoline

type TrampolinedState[S, B] = StateT[Trampoline, S, B] // monad type constructor
type S = Int // state is an integer
type M[B] = TrampolinedState[S, B] // our trampolined state monad
type R = Int // or some other monoid

val col: Iterable[R] = largeIterableofRs() // defined elsewhere

val (count, sum): (S, R) = col.foldLeftM[M, R](Monoid[R].zero) {
  (acc: R, x: R) => StateT[Trampoline, S, R] {
    s: S => Trampoline.done {
      (s + 1, Monoid[R].append(acc, x))
    }
  }
} run 0 run

// In Scalaz 7, foldLeftM is implemented in terms of foldRight, which in turn
// is a reversed.foldLeft. This pulls the whole collection into memory and kills
// the heap. Ignore this heap overflow. We could reimplement foldLeftM to avoid
// this overflow or use a foldRightM instead.
// Our real issue is the heap used by the unexecuted State mobits.
```

For a large collection `col`, this will fill the heap.

I believe that during the fold, a closure (a State mobit) is created for each value in the collection (the `x: R` parameter), filling the heap. None of those can be evaluated until `run 0` is executed, providing the initial state.
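To make the deferral concrete, here is a minimal sketch using a hand-rolled `State` (an assumption for illustration; it is not the Scalaz type, and it uses a plain strict `foldLeft` rather than `foldLeftM`). The names `DeferredStateDemo`, `step`, and `stepsExecuted` are hypothetical. It only shows that folding builds the whole computation before any step executes:

```scala
object DeferredStateDemo {
  // Hand-rolled State for illustration only (not scalaz.State).
  final case class State[S, A](run: S => (S, A)) {
    def flatMap[B](f: A => State[S, B]): State[S, B] =
      State[S, B] { s =>
        val (s1, a) = run(s)
        f(a).run(s1)
      }
  }

  var stepsExecuted = 0 // instrumentation only, to observe when steps run

  // One fold step: bump the counter held in the state, add x to the accumulator.
  def step(acc: Int, x: Int): State[Int, Int] =
    State[Int, Int] { n =>
      stepsExecuted += 1
      (n + 1, acc + x)
    }

  // Folding allocates one flatMap layer (and closure) per element up front.
  val program: State[Int, Int] =
    (1 to 1000).foldLeft(State[Int, Int](s => (s, 0))) { (m, x) =>
      m.flatMap(acc => step(acc, x))
    }

  val stepsBeforeRun: Int = stepsExecuted // no step has executed yet
  val (count, sum) = program.run(0)       // supplying the state runs every step
  val stepsAfterRun: Int = stepsExecuted
}
```

In this sketch all 1000 layers exist on the heap before `run(0)` is called, which is the O(n) behaviour described above.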

Can this O(n) heap usage be avoided?

More specifically, can the initial state be provided before the fold so that the State monad can execute during each bind, rather than nesting closures for later evaluation?

Or can the fold be constructed such that it is executed lazily after the State monad is `run`? In this way, the next `x: R` closure would not be created until after the previous ones have been evaluated and made suitable for garbage collection.

Or is there a better functional paradigm for this sort of work?

But perhaps I'm using the wrong tool for the job. The evolution of an example use case follows. Am I wandering down the wrong path here?

Consider reservoir sampling, i.e., picking `k` items uniformly at random, in one pass, from a collection too large to fit in memory. In Scala, such a function might be

```
def sample[A](col: TraversableOnce[A])(k: Int): Vector[A]
```

and, if pimped onto the `TraversableOnce` type, could be used like this

```
val tenRandomInts = (Int.MinValue to Int.MaxValue) sample 10
```

The work done by `sample` is essentially a `fold`:

```
def sample[A](col: Traversable[A])(k: Int): Vector[A] = {
  // Build `update(k)` once so that any state it carries persists across elements.
  col.foldLeft(Vector.empty[A])(update[A](k))
}
```

However, `update` is stateful; it depends on `n`, the number of items already seen. (It also depends on an RNG, but for simplicity I assume that is global and stateful. The techniques used to handle `n` would extend trivially.) So how to handle this state?

The impure solution is simple and runs with constant stack and heap.

```
/* Impure version of update function */
def update[A](k: Int) = new Function2[Vector[A], A, Vector[A]] {
  var n = 0 // number of items seen so far
  def apply(sample: Vector[A], x: A): Vector[A] = {
    n += 1
    algorithmR(k, n, sample, x)
  }
}

def algorithmR[A](k: Int, n: Int, sample: Vector[A], x: A): Vector[A] = {
  if (sample.size < k) {
    sample :+ x // must keep first k elements
  } else {
    val r = rand.nextInt(n) + 1 // for simplicity, rand is global/stateful
    if (r <= k)
      sample.updated(r - 1, x) // sample is 0-indexed
    else
      sample
  }
}
```

But what about a purely functional solution? `update` must take `n` as an additional parameter and return the new value along with the updated sample. We could include `n` in the implicit state, the fold accumulator, e.g.,

```
(col.foldLeft((0, Vector.empty[A]))(update(k)(_: (Int, Vector[A]), _: A)))._2
```
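For reference, a self-contained sketch of that tuple-accumulator version might look like the following. The object name `PureSample` and the explicit `rand` parameter are assumptions made so the block runs on its own; the RNG is still stateful, as in the simplification above, so only the count `n` is threaded purely:

```scala
import scala.util.Random

object PureSample {
  // Algorithm R step, same logic as the impure version; `rand` is passed in
  // explicitly here (an assumption) instead of being a global.
  def algorithmR[A](rand: Random, k: Int, n: Int, sample: Vector[A], x: A): Vector[A] =
    if (sample.size < k) sample :+ x // must keep first k elements
    else {
      val r = rand.nextInt(n) + 1
      if (r <= k) sample.updated(r - 1, x) else sample
    }

  // The count `n` travels in the accumulator next to the sample.
  def update[A](rand: Random, k: Int, state: (Int, Vector[A]), x: A): (Int, Vector[A]) = {
    val (n, sample) = state
    (n + 1, algorithmR(rand, k, n + 1, sample, x))
  }

  // A strict left fold over an iterator visits each element once and keeps
  // only the current accumulator.
  def sample[A](col: Iterator[A], k: Int, rand: Random): Vector[A] =
    col.foldLeft((0, Vector.empty[A]))((st, x) => update(rand, k, st, x))._2
}
```

For example, `PureSample.sample(Iterator.range(0, 100000), 10, new Random(42))` returns ten distinct values from the range.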

But that obscures the intent; we only really intend to accumulate the sample vector. This problem seems ready made for the State monad and a monadic left fold. Let's try again.

We'll use Scalaz 7, with these imports

```
import scalaz._
import Scalaz._
import scalaz.std.iterable._
```

and operate over an `Iterable[A]`, since Scalaz doesn't support monadic folding of a `Traversable`.

`sample` is now defined

```
// sample using State monad
def sample[A](col: Iterable[A])(k: Int): Vector[A] = {
  type M[B] = State[Int, B]

  // foldLeftM is implemented using foldRight, which must reverse `col`, blowing
  // the heap for large `col`. Ignore this issue for now.
  // foldLeftM could be implemented differently or we could switch to
  // foldRightM, implemented using foldLeft.
  col.foldLeftM[M, Vector[A]](Vector())(update(k)(_: Vector[A], _: A)) eval 0
}
```

where update is

```
// update using State monad
def update[A](k: Int) = {
  (acc: Vector[A], x: A) => State[Int, Vector[A]] {
    n => (n + 1, algorithmR(k, n + 1, acc, x)) // algR same as impure solution
  }
}
```

Unfortunately, this blows the stack on a large collection.

So let's trampoline it. `sample` is now

```
// sample using trampolined State monad
def sample[A](col: Iterable[A])(k: Int): Vector[A] = {
  import Free.Trampoline

  type TrampolinedState[S, B] = StateT[Trampoline, S, B]
  type M[B] = TrampolinedState[Int, B]

  // Same caveat about foldLeftM using foldRight and blowing the heap
  // applies here. Ignore for now. This solution blows the heap anyway;
  // let's fix that issue first.
  col.foldLeftM[M, Vector[A]](Vector())(update(k)(_: Vector[A], _: A)) eval 0 run
}
```

where update is

```
// update using trampolined State monad
def update[A](k: Int) = {
  (acc: Vector[A], x: A) => StateT[Trampoline, Int, Vector[A]] {
    n => Trampoline.done { (n + 1, algorithmR(k, n + 1, acc, x)) }
  }
}
```

This fixes the stack overflow, but still blows the heap for very large collections (or very small heaps). One anonymous function per value in the collection is created during the fold (I believe to close over each `x: A` parameter), consuming the heap before the trampoline is even run. (FWIW, the State version has this issue too; the stack overflow just surfaces first with smaller collections.)

> Our real issue is the heap used by the unexecuted State mobits.

No, it is not. The real issue is that the collection doesn't fit in memory and that `foldLeftM` and `foldRightM` force the entire collection. A side effect of the impure solution is that you are freeing memory as you go. In the "purely functional" solution, you're not doing that anywhere.

Your use of `Iterable` ignores a crucial detail: what kind of collection `col` actually is, how its elements are created and how they are expected to be discarded. And so, necessarily, does `foldLeftM` on `Iterable`. It is likely too strict, and you are forcing the entire collection into memory. For example, if it is a `Stream`, then as long as you are holding on to `col` all the elements forced so far will be in memory. If it's some other kind of lazy `Iterable` that doesn't memoize its elements, then the fold is still too strict.

I tried your first example with an `EphemeralStream` and did not see any significant heap pressure, even though it will clearly have the same "unexecuted State mobits". The difference is that an `EphemeralStream`'s elements are weakly referenced and its `foldRight` doesn't force the entire stream.

I suspect that if you used `Foldable.foldr`, then you would not see the problematic behaviour since it folds with a function that is lazy in its *second argument*. When you call the fold, you want it to return a suspension that looks something like this immediately:

```
Suspend(() => head |+| tail.foldRightM(...))
```

When the trampoline resumes the first suspension and runs up to the next suspension, all of the allocations between suspensions will become available to be freed by the garbage collector.

Try the following:

```
def foldM[M[_]: Monad, A, B](a: A, bs: Iterable[B])(f: (A, B) => M[A]): M[A] =
  if (bs.isEmpty) Monad[M].point(a)
  else Monad[M].bind(f(a, bs.head))(fax => foldM(fax, bs.tail)(f))

val MS = StateT.stateTMonadState[Int, Trampoline]
import MS._

foldM[M, R, Int](Monoid[R].zero, col) {
  (x, r) => modify(_ + 1) map (_ => Monoid[R].append(x, r))
} run 0 run
```

This will run in constant heap for a trampolined monad `M`, but will overflow the stack for a non-trampolined monad.
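The same shape can be sketched without Scalaz, using the standard library's `scala.util.control.TailCalls` as the trampoline. This is only an illustration under assumptions: `LazyStateFold` and the `TrampolinedState` alias are hypothetical names, the state monad is modelled as a bare function `S => TailRec[(S, A)]`, and the input is a mutable `Iterator`, so the resulting program can only be run once:

```scala
import scala.util.control.TailCalls._

object LazyStateFold {
  // State over a trampoline, modelled as a plain function (illustrative only).
  type TrampolinedState[S, A] = S => TailRec[(S, A)]

  // The recursive call sits inside the flatMap continuation, so the next
  // step is created only after the previous one has run.
  def foldM[S, A, B](a: A, bs: Iterator[B])(f: (A, B) => TrampolinedState[S, A]): TrampolinedState[S, A] =
    s =>
      if (!bs.hasNext) done((s, a))
      else {
        val b = bs.next()
        tailcall(f(a, b)(s)).flatMap { case (s1, a1) =>
          tailcall(foldM(a1, bs)(f)(s1))
        }
      }

  // Count the elements in the state while summing them in the accumulator.
  val program: TrampolinedState[Int, Long] =
    foldM[Int, Long, Int](0L, Iterator.range(0, 1000000)) {
      (acc, x) => n => done((n + 1, acc + x))
    }

  val (count, sum) = program(0).result
}
```

Because each suspension is resumed before the next element is read, the intermediate steps become garbage as the trampoline advances.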

But **the real problem is that `Iterable` is not a good abstraction for data that are too large to fit in memory.** Sure, you can write an imperative side-effecty program where you explicitly discard elements after each iteration or use a lazy right fold. That works well until you want to compose that program with another one. And I'm assuming that the whole reason you're investigating doing this in a `State` monad to begin with is to gain compositionality.

So what can you do? Here are some options:

- Make use of `Reducer`, `Monoid`, and composition thereof, then run in an imperative explicitly-freeing loop (or a trampolined lazy right fold) as the *last step*, after which composition is not possible or expected.
- Use `Iteratee` composition and monadic `Enumerator`s to feed them.
- Write compositional stream transducers with Scalaz-Stream.

The last of these options is the one that I would use and recommend in the general case.

Using `State`, or any similar monad, isn't a good approach to the problem. Using `State` is condemned to blow the stack/heap on large collections. Consider a value `x: State[A,B]` constructed from a large collection (for example by folding over it). Then `x` can be evaluated on different values of the initial state `A`, yielding different results. So `x` needs to retain all the information contained in the collection. And in a pure setting, `x` can't forget some of that information to avoid blowing the stack/heap, so anything that is computed remains in memory until the whole monadic value is freed, which happens only after the result is evaluated. So the memory consumption of `x` is proportional to the size of the collection.
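A tiny sketch of why the value must retain the collection: below, a state action is modelled as a plain function (an assumption for illustration, not Scalaz's `State`; `RerunDemo` is a hypothetical name). The folded `program` is run from two different initial states, so every element has to stay reachable from it:

```scala
object RerunDemo {
  // A state action as a plain function (illustrative only).
  type State[S, A] = S => (S, A)

  val xs = Vector(1, 2, 3)

  // Each layer of the fold captures one element of `xs`.
  val program: State[Int, Int] =
    xs.foldLeft[State[Int, Int]](s => (s, 0)) { (m, x) =>
      s0 => {
        val (s, acc) = m(s0)
        (s + 1, acc + x * s) // the result depends on the state seen at this step
      }
    }

  val fromZero = program(0)  // (3, 8)
  val fromTen  = program(10) // (13, 68)
}
```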

I believe a fitting approach to this problem is to use functional **iteratees/pipes/conduits**. This concept (referred to under these three names) was invented to process large collections of data with constant memory consumption, and to describe such processes using simple combinators.
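As a rough idea of what such a consumer looks like, here is a deliberately stripped-down sketch. It is not the API of Scalaz iteratees or of any pipes/conduit library; `SinkSketch`, `Sink`, `feed`, and `summing` are hypothetical names. A sink says what to do with the next input and what to return when the input ends, and its state is carried as an ordinary argument:

```scala
object SinkSketch {
  // A sink knows how to continue on the next input and what to return at the end.
  final case class Sink[A, R](onNext: A => Sink[A, R], onEnd: () => R)

  // Drive a sink from an iterator; only the current sink is referenced at any time.
  def feed[A, R](sink: Sink[A, R], input: Iterator[A]): R = {
    var current = sink
    while (input.hasNext) current = current.onNext(input.next())
    current.onEnd()
  }

  // A sink that sums its inputs, with the running total as its state.
  def summing(total: Long): Sink[Int, Long] =
    Sink[Int, Long](x => summing(total + x), () => total)
}
```

With this shape, `SinkSketch.feed(SinkSketch.summing(0L), Iterator.range(1, 101))` yields `5050`, and each intermediate sink can be collected as soon as the next one is produced.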

I tried to use Scalaz's `Iteratees`, but it seems this part isn't mature yet; it suffers from stack overflows just as `State` does (or perhaps I'm not using it right; the code is available here, if anybody is interested).

However, it was simple using my (still a bit experimental) *scala-conduit* library (**disclaimer:** I'm the author):

```
import conduit._
import conduit.Pipe._

object Run extends App {
  // Define a sampling function as a sink: It consumes
  // data of type `A` and produces a vector of samples.
  def sampleI[A](k: Int): Sink[A, Vector[A]] =
    sampleI[A](k, 0, Vector())

  // Create a sampling sink with a given state. It requests
  // a value from the upstream conduit. If there is one,
  // update the state and continue (the first argument to `requestF`).
  // If not, return the current sample (the second argument).
  // The `Finalizer` part isn't important for our problem.
  private def sampleI[A](k: Int, n: Int, sample: Vector[A]):
      Sink[A, Vector[A]] =
    requestF((x: A) => sampleI(k, n + 1, algorithmR(k, n + 1, sample, x)),
             (_: Any) => sample)(Finalizer.empty)

  // The sampling algorithm copied from the question.
  val rand = new scala.util.Random()

  def algorithmR[A](k: Int, n: Int, sample: Vector[A], x: A): Vector[A] = {
    if (sample.size < k) {
      sample :+ x // must keep first k elements
    } else {
      val r = rand.nextInt(n) + 1 // for simplicity, rand is global/stateful
      if (r <= k)
        sample.updated(r - 1, x) // sample is 0-indexed
      else
        sample
    }
  }

  // Construct an iterable of all `short` values, pipe it into our sampling
  // function, and run the combined pipe.
  {
    print(runPipe(Util.fromIterable(Short.MinValue to Short.MaxValue) >->
          sampleI(10)))
  }
}
```

**Update:** It'd be possible to solve the problem using `State`, but we need to implement a custom fold specifically for `State` that knows how to do it in constant space:

```
import scala.collection._
import scala.language.higherKinds
import scalaz._
import Scalaz._
import scalaz.std.iterable._

object Run extends App {
  // Folds in a state monad over a foldable
  def stateFold[F[_],E,S,A](xs: F[E],
                            f: (A, E) => State[S,A],
                            z: A)(implicit F: Foldable[F]): State[S,A] =
    State[S,A]((s: S) => F.foldLeft[E,(S,A)](xs, (s, z))((p, x) => f(p._2, x)(p._1)))

  // Sample a lazy collection view
  def sampleS[F[_],A](k: Int, xs: F[A])(implicit F: Foldable[F]):
      State[Int,Vector[A]] =
    stateFold[F,A,Int,Vector[A]](xs, update(k), Vector())

  // update using State monad
  def update[A](k: Int) = {
    (acc: Vector[A], x: A) => State[Int, Vector[A]] {
      n => (n + 1, algorithmR(k, n + 1, acc, x)) // algR same as impure solution
    }
  }

  def algorithmR[A](k: Int, n: Int, sample: Vector[A], x: A): Vector[A] = ...

  {
    print(sampleS(10, (Short.MinValue to Short.MaxValue)).eval(0))
  }
}
```
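The key move in `stateFold` is that the fold runs inside the state function, so the initial state is already in hand when the first element is visited. A dependency-free sketch of that idea follows; the hand-rolled `State`, the object name `StateFoldSketch`, and the use of an `Iterator` instead of a `Foldable` are all assumptions for illustration:

```scala
object StateFoldSketch {
  // Hand-rolled State for illustration only (not scalaz.State).
  final case class State[S, A](run: S => (S, A))

  // The strict left fold executes only once a state `s0` is supplied, and
  // each step's action is run immediately rather than being stored.
  def stateFold[E, S, A](xs: Iterator[E], z: A)(f: (A, E) => State[S, A]): State[S, A] =
    State[S, A] { s0 =>
      xs.foldLeft((s0, z)) { case ((s, acc), x) => f(acc, x).run(s) }
    }

  // Count elements in the state while summing them in the accumulator.
  val countAndSum: State[Int, Long] =
    stateFold[Int, Int, Long](Iterator.range(0, 1000000), 0L) { (acc, x) =>
      State[Int, Long](n => (n + 1, acc + x))
    }

  val result: (Int, Long) = countAndSum.run(0)
}
```

Since an `Iterator` can be traversed only once, this particular `countAndSum` can be run only once; a re-traversable source would lift that restriction.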
