Some thoughts on data visualisation after reading JC's recent blogpost (http://jeelabs.org/2011/03/14/gas-flow-measurement-artifacts/)

The problem here is incorrect display of calculated data. Let's look at the raw data and figure out what is actually going on.

What actually happens is that we get a pulse each time a fixed amount of gas (or kWh) has been consumed. The pulses are detected by watching light pulses or the rotations of an analog meter. This is fundamentally different from what we are used to, which is measuring every x seconds: here the amount per measurement is constant and the time between measurements varies. So we have to be careful when calculating averages and rates.

We could plot our data directly in a graph depicting totals. The only thing we can know for sure is the total consumption at the times the data is measured. The actual consumption is some continuous line that never decreases and passes through those data points. We get one possible graph by connecting the dots with straight lines (red). Another option would be connecting the points with a smooth line by means of some curve fitting algorithm (green):
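As a minimal sketch of the totals graph: the pulse timestamps and the amount per pulse below are made-up values for illustration, not JC's data, and the function names are hypothetical. The sketch builds the known data points and the straight-line estimate between them (the red line):

```python
from bisect import bisect_right

# Assumption: amount of gas represented by one pulse (made-up value).
AMOUNT_PER_PULSE = 0.01  # m^3


def cumulative_totals(pulse_times, amount=AMOUNT_PER_PULSE):
    """Return (time, total) pairs, counted from the first pulse.

    These are the only points of the consumption curve we know for sure.
    """
    return [(t, i * amount) for i, t in enumerate(pulse_times)]


def interpolate_total(points, t):
    """Straight-line estimate of the total at time t.

    Assumes the times in `points` are strictly increasing.
    """
    times = [p[0] for p in points]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the measured range")
    i = bisect_right(times, t) - 1
    if i == len(points) - 1:
        return points[-1][1]  # exactly on the last pulse
    (t0, v0), (t1, v1) = points[i], points[i + 1]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


pulses = [0, 20, 45, 70, 400, 425]  # hypothetical timestamps in seconds
points = cumulative_totals(pulses)
```

Any value returned between two pulses is only an estimate; the true curve merely has to pass through `points`.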

However, the total consumption is generally not what we are interested in. Much more interesting is the consumption rate, which is the derivative of the consumption graph.

What we could do is calculate a rate for every data point by dividing our standard amount by the time between that pulse and the previous one. This is what JC shows in his graph. You could connect these data points by straight lines to obtain the top (blue) graph:
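A rough sketch of that per-pulse calculation, again with made-up timestamps and a made-up amount of 0.01 m^3 per pulse (`pulse_rates` is a hypothetical helper name):

```python
def pulse_rates(pulse_times, amount=0.01):
    """Rate attributed to each pulse: amount / time since the previous pulse.

    Returns (time, rate) pairs, with the rate in amount units per second.
    The first pulse has no predecessor, so it gets no rate.
    """
    return [
        (t1, amount / (t1 - t0))
        for t0, t1 in zip(pulse_times, pulse_times[1:])
    ]


pulses = [0, 20, 45, 70, 400, 425]  # hypothetical timestamps in seconds
for t, rate in pulse_rates(pulses):
    # Convert from per second to per hour for readability.
    print(f"t={t:4d} s  rate={rate * 3600:.2f} m^3/h")
```

Each rate is attached to the moment the pulse arrives, even though it describes the whole interval before it.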

However, this is by no means the derivative of the first graph, so connecting the data points this way does not represent what is actually going on. If we differentiate the first graph (the one with straight lines), we get horizontal plateaus, one per frame between two pulses, all with equal area. This area corresponds exactly to our standard amount. Thus, a better way to plot the data is the second graph in the picture above (red).
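To spell out why the plateaus have equal area, write A for the standard amount and t_{i-1}, t_i for two consecutive pulse times (these symbols are introduced here for illustration only). With straight lines between the points of the first graph, the slope in that frame and the area under it are:

```latex
r_i = \frac{A}{t_i - t_{i-1}} \quad \text{for } t_{i-1} < t < t_i,
\qquad
\int_{t_{i-1}}^{t_i} r_i \, dt = r_i \,(t_i - t_{i-1}) = A .
```

A long frame therefore gives a low, wide plateau and a short frame a high, narrow one, but each encloses the same amount.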

The thing with these graphs is that we don't have a clue what the consumption did between the measured data points. We assumed linear interpolation, which results in horizontal plateaus in the consumption rate, but the true curve could be anything. All we know is that the line in the first graph must pass through all data points. For the second graph, that means the true rate must average out to the horizontal plateau in every frame, where a frame is the interval between two consecutive pulses. One possible actual consumption rate may be this one:

This is a wavy line, but it kind of wraps around the horizontal plateaus. In every frame, the area above the plateau equals the area below it, so it averages out nicely.
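To illustrate, here is a small sketch of one such candidate: a plateau with a single sine wave laid over it. The sine shape, the `depth` parameter and the numbers are arbitrary assumptions; the only point is that the area per frame still comes out at the standard amount.

```python
import math


def wavy_rate(t, t0, t1, amount=0.01, depth=0.5):
    """One hypothetical 'true' rate inside the frame [t0, t1].

    A full sine period is laid over the plateau, so the part above the
    plateau cancels the part below it. With depth < 1 the rate never
    goes negative.
    """
    plateau = amount / (t1 - t0)
    phase = 2 * math.pi * (t - t0) / (t1 - t0)
    return plateau * (1 + depth * math.sin(phase))


def frame_area(t0, t1, steps=1000, amount=0.01, depth=0.5):
    """Midpoint-rule integral of wavy_rate over one frame."""
    dt = (t1 - t0) / steps
    total = sum(
        wavy_rate(t0 + (i + 0.5) * dt, t0, t1, amount, depth)
        for i in range(steps)
    )
    return total * dt


# Hypothetical frame from t=70 s to t=400 s: the area is still one pulse.
print(frame_area(70, 400))
```

Any other shape would do just as well, as long as its area over the frame equals the standard amount.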

So, how to draw our data? We have to look at what is going on. In JC's case, it is gas consumption, which presumably does not vary continuously. The heater is either on or off. Since the standard amount of gas is pretty small in comparison to total consumption, the data points are very dense when the heater is on (a data point every 20 - 30 seconds or so, judging from JC's graph).

When we have a frame of low consumption rate surrounded by high rates, the rate is not likely to have been low that entire time. The heater might still have been on after the previous point and switched off somewhere in the interval. However, that can't have been very far into the interval; otherwise it would have generated another high data point pretty quickly.
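This argument can be turned into a rough bound. The sketch below assumes an idealized heater that, while on, burns at a constant rate producing one pulse every `on_interval` seconds; that model, the function name and the numbers are assumptions for illustration. Since a frame contains exactly one standard amount, the heater cannot have been on for longer than `on_interval` within it.

```python
def max_on_seconds(on_interval, frame_length):
    """Upper bound on the heater's on-time inside one frame, in seconds.

    Assumption: while on, the heater burns at a constant rate that yields
    one pulse every `on_interval` seconds. A frame holds exactly one pulse
    worth of gas, so the on-time cannot exceed `on_interval`, nor the
    length of the frame itself.
    """
    if on_interval <= 0 or frame_length <= 0:
        raise ValueError("intervals must be positive")
    return min(on_interval, frame_length)


frame = 330        # hypothetical low-rate frame, in seconds
on_interval = 25   # hypothetical pulse spacing while the heater is on
bound = max_on_seconds(on_interval, frame)
print(f"heater on for at most {bound} s of a {frame} s frame "
      f"({bound / frame:.0%})")
```

How that on-time is split between the start and the end of the frame remains unknown.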

So, high data points are fairly accurate, since they follow each other quickly. The low data points are less accurate: they may be elevated by consumption at the beginning or at the end of the frame. How much, we just don't know.

In the end, the best graph we can make is the red one.