Suppose I have the following frequency table.

```
> print(dat)
   V1    V2
1   1 11613
2   2  6517
3   3  2442
4   4   687
5   5   159
6   6    29
# V1 = Score
# V2 = Frequency
```

How can I efficiently compute the mean and standard deviation? The expected results are mean = 1.66 and SD = 0.87. Replicating each score by its frequency takes too long to compute.

I might be missing something, but this seems to work very quickly, even substituting millions in the frequency column:

```
dset <- data.frame(V1=1:6,V2=c(11613,6517,2442,687,159,29))
mean(rep(dset$V1,dset$V2))
#[1] 1.664102
sd(rep(dset$V1,dset$V2))
#[1] 0.8712242
```

Mean is easy. SD is a little trickier (you can't just reuse `fastmean()` because there's an n - 1 in the denominator):

```
> dat <- data.frame(freq=seq(6),value=runif(6)*100)
> fastmean <- function(dat) {
+ with(dat, sum(freq*value)/sum(freq) )
+ }
> fastmean(dat)
[1] 55.78302
>
> fastRMSE <- function(dat) {
+ mu <- fastmean(dat)
+ with(dat, sqrt(sum(freq*(value-mu)^2)/(sum(freq)-1) ) )
+ }
> fastRMSE(dat)
[1] 34.9316
>
> # To test
> expanded <- with(dat, rep(value,freq) )
> mean(expanded)
[1] 55.78302
> sd(expanded)
[1] 34.9316
```

Note that `fastRMSE` calculates `sum(freq)` twice; eliminating this duplication would probably yield another minor speed boost.
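For instance, the two statistics could be folded into a single function that computes `sum(freq)` only once (a hypothetical `fastStats`, offered as a sketch rather than a benchmarked implementation):

```r
# Hypothetical variant of fastmean/fastRMSE that computes sum(freq) once.
fastStats <- function(dat) {
  total <- sum(dat$freq)                       # computed a single time
  mu    <- sum(dat$freq * dat$value) / total   # weighted mean
  s     <- sqrt(sum(dat$freq * (dat$value - mu)^2) / (total - 1))
  c(mean = mu, sd = s)
}

# The question's frequency table:
dat <- data.frame(freq = c(11613, 6517, 2442, 687, 159, 29), value = 1:6)
fastStats(dat)
# mean ~ 1.664102, sd ~ 0.8712242
```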

**Benchmarking**

```
> microbenchmark(
+ fastmean(dat),
+ mean( with(dat, rep(value,freq) ) )
+ )
Unit: microseconds
                               expr    min      lq median     uq    max
1                     fastmean(dat) 12.433 13.5335 14.776 15.398 23.921
2 mean(with(dat, rep(value, freq))) 21.225 22.3990 22.714 23.406 86.434
> dat <- data.frame(freq=seq(60),value=runif(60)*100)
> microbenchmark(
+ fastmean(dat),
+ mean( with(dat, rep(value,freq) ) )
+ )
Unit: microseconds
                               expr    min     lq  median      uq     max
1                     fastmean(dat) 13.177 14.544 15.8860 17.2905  54.983
2 mean(with(dat, rep(value, freq))) 42.610 48.659 49.8615 50.6385 151.053
> dat <- data.frame(freq=seq(600),value=runif(600)*100)
> microbenchmark(
+ fastmean(dat),
+ mean( with(dat, rep(value,freq) ) )
+ )
Unit: microseconds
                               expr      min       lq    median       uq       max
1                     fastmean(dat)   15.706   17.489   25.8825   29.615    79.113
2 mean(with(dat, rep(value, freq))) 1827.146 2283.551 2534.7210 2884.933 26196.923
```

The replicating solution appears to be O(N^2) *in the number of entries.*

The `fastmean` solution appears to have a fixed cost of about 12 µs, after which it scales beautifully.

**More benchmarking**

Comparison with the dot product:

```
dat <- data.frame(freq=seq(600),value=runif(600)*100)
dbaupp <- function(dat) {
  total.count <- sum(dat$freq)
  as.vector(dat$freq %*% dat$value) / total.count
}
microbenchmark(
  fastmean(dat),
  mean( with(dat, rep(value,freq) ) ),
  dbaupp(dat)
)
Unit: microseconds
                               expr     min       lq   median       uq       max
1                       dbaupp(dat)  20.162  21.6875  25.6010  31.3475   104.054
2                     fastmean(dat)  14.680  16.7885  20.7490  25.1765    94.423
3 mean(with(dat, rep(value, freq))) 489.434 503.6310 514.3525 583.2790 30130.302
```

How about:

```
> m = sum(dat$V1 * dat$V2) / sum(dat$V2)
> m
[1] 1.664102
> sigma = sqrt(sum((dat$V1 - m)**2 * dat$V2) / (sum(dat$V2)-1))
> sigma
[1] 0.8712242
```

No replication here.

The following code doesn't use replication, and it uses R built-ins (especially for the dot product) as much as possible, so it is probably more efficient than solutions that use `sum(V1 * V2)`. (Edit: this is possibly false: @gsk3's solution seems to be about 1.5 to 2 times faster in my testing.)

The definition of the mean (or expectation) is `sum(n * freq(n)) / total.count`, where `n` is the "score", `freq(n)` is the frequency of `n`, and `total.count` is just `sum(freq(n))`. The sum in the numerator is precisely the dot product of the scores with the frequencies.

The dot product in R is `%*%` (it returns a matrix, but this can basically be treated as a vector for most purposes):

```
> total.count <- sum(dat$V2)
> mean <- dat$V1 %*% dat$V2 / total.count
> mean
[,1]
[1,] 1.664102
```

There is a formula at the end of this section of the Wikipedia article, which translates to the following code:

```
> sqrt(dat$V1^2 %*% dat$V2 / total.count - mean^2)
[,1]
[1,] 0.8712039
```
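The slight difference from `sd()`'s 0.8712242 arises because this formula gives the *population* standard deviation (denominator `n`); if the sample SD is wanted, Bessel's correction rescales it. A sketch using the question's data:

```r
dat <- data.frame(V1 = 1:6, V2 = c(11613, 6517, 2442, 687, 159, 29))
total.count <- sum(dat$V2)
m <- as.vector(dat$V1 %*% dat$V2) / total.count

# Population SD via the E[X^2] - E[X]^2 shortcut:
pop.sd <- sqrt(as.vector(dat$V1^2 %*% dat$V2) / total.count - m^2)

# Bessel's correction converts it to the sample SD that sd() reports:
samp.sd <- pop.sd * sqrt(total.count / (total.count - 1))
# samp.sd ~ 0.8712242
```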
