Statistical patch to gnuplot
This is a short introduction to the new 'stats' command for gnuplot. The
source can be downloaded as a patch from the gnuplot patch tracker. The
documentation patch can be found in the same place.
From the help (for the complete documentation of the 'stats' command, follow the link):
The `stats` command calculates basic summary statistics for a data set,
displays them in human-readable form and (optionally) makes them available
as gnuplot variables.
Syntax:
stats {<ranges>}
{"<datafile>" {datafile-modifiers}}
{[no]output} {variables[=prefix]}
Permissible data file modifiers are `index`, `every`, and `using`, all of
which behave exactly as for the `plot` command. Up to two columns can be
specified with `using`, and inline transformations are available (same as
for `plot`)...
The following variables are either defined or printed to the screen/file. In the
names below, * stands for the column, x or y (for example mean_y):
records : number of valid records found
invalid : number of invalid records found
blank : number of blank lines found
blocks : number of data blocks in the file (separated by double blank lines)
mean_* : mean
stddev_* : standard deviation
sumx_* : sum of all values
sumx2_* : sum of the squares of all values
min_* : minimal value
min_pos_* : position of the minimum value in file
lo_quartile_* : lower quartile (defined at 25%)
median_* : median
up_quartile_* : upper quartile (defined at 75%)
max_* : maximum value
max_pos_* : position of the maximum value in file
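The optional prefix makes it possible to keep the statistics of several files at the same time. The following is a minimal sketch, assuming the prefix is prepended verbatim to each variable name (as the minmin_x and maxmax_x names in the examples below suggest). The file names 'a.dat' and 'b.dat' and the prefixes a_ and b_ are hypothetical, chosen only for illustration.

```gnuplot
# Hypothetical two-column data files; replace with your own.
# Each stats call stores its results under its own prefix (assumed behaviour).
stats 'a.dat' u 1:2 var=a_ noout
stats 'b.dat' u 1:2 var=b_ noout

# The y-column means of both files are now available side by side.
print "mean of a.dat: ", a_mean_y
print "mean of b.dat: ", b_mean_y
print "difference in sigma: ", b_stddev_y - a_stddev_y
```

Without a prefix, the second stats call would presumably overwrite the variables set by the first.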
Examples
The following examples show how the patch can be used.
Linear regression
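Simple linear regression. (In this particular case the result is essentially what
the fit command would give.) The example uses the slope and intercept variables,
which stats defines for the two selected columns.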
unset key
set xrange [0:20]
set yrange [0:15]
set multiplot layout 2,2
# One panel per x/y column pair; stats sets slope and intercept each time
stats "anscombe" u 1:2 var noout
plot "anscombe" u 1:2, slope*x + intercept
stats "anscombe" u 3:4 var noout
plot "anscombe" u 3:4, slope*x + intercept
stats "anscombe" u 5:6 var noout
plot "anscombe" u 5:6, slope*x + intercept
stats "anscombe" u 7:8 var noout
plot "anscombe" u 7:8, slope*x + intercept
unset multiplot
The data file for this plot can be found here.
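Since this example is compared with the fit command, a minimal sketch of that route may be useful. It assumes the same "anscombe" file; the function name f and the parameters a and b are names chosen here and are not defined by the patch.

```gnuplot
# Straight-line model; a and b are hypothetical parameter names.
f(x) = a*x + b
a = 1.0; b = 1.0                      # starting values for the fit

# Least-squares fit of the first column pair with the standard fit command
fit f(x) "anscombe" using 1:2 via a, b

plot "anscombe" using 1:2, f(x)
```

If both approaches work as expected, a and b should come out close to slope and intercept for the same columns. The stats route needs no model function or starting values, which is what makes it convenient for the simple linear case.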
Using standard deviations, minimum, maximum and the like
# This first part up to 'set yrange [0:2]' is just to generate some data
set samples 50
set table 'stats1.dat'
plot [0:10] 0.5+rand(0)
unset table
set samples 200
set table 'stats2.dat'
plot [0:10] 0.5+rand(0)
unset table
set yrange [0:2]
unset key
set multiplot layout 2,2
# Plotting the minimum and maximum ranges with a shaded background
stats 'stats2.dat' u 1:2 var
set label 1 gprintf("Minimum = %g", min_y) at 2, min_y-0.2
set label 2 gprintf("Maximum = %g", max_y) at 2, max_y+0.2
plot min_y with filledcurves y1=mean_y lt 1 lc rgb "#bbbbdd", \
max_y with filledcurves y1=mean_y lt 1 lc rgb "#bbddbb", \
'stats2.dat' u 1:2 w p pt 7 lt 1 ps 1
# Plotting the range of the standard deviation with a shaded background
stats 'stats2.dat' u 1:2 var
set label 1 gprintf("Mean = %g", mean_y) at 2, min_y-0.15
set label 2 gprintf("Sigma = %g", stddev_y) at 2, min_y-0.3
plot mean_y-stddev_y with filledcurves y1=mean_y+stddev_y lt 1 lc rgb "#bbbbdd", \
mean_y w l lt 3, 'stats2.dat' u 1:2 w p pt 7 lt 1 ps 1
# Removing points based on the standard deviation
stats 'stats2.dat' u 1:2 var
set label 1 gprintf("Mean = %g", mean_y) at 2, min_y-0.15
set label 2 gprintf("Sigma = %g", stddev_y) at 2, min_y-0.3
plot mean_y w l lt 3, mean_y+stddev_y w l lt 3, mean_y-stddev_y w l lt 3, \
'stats2.dat' u 1:(abs($2-mean_y) < stddev_y ? $2 : 1/0) w p pt 7 lt 1 ps 1
# Automatically adding an arrow at a position that depends on the min/max
stats 'stats1.dat' u 1:2 var
stats 'stats1.dat' u 1:2 every ::(min_pos_y-1)::(min_pos_y-1) var=min
stats 'stats1.dat' u 1:2 every ::(max_pos_y-1)::(max_pos_y-1) var=max
set arrow 1 from minmin_x, minmin_y-0.2 to minmin_x, minmin_y-0.02 lw 0.5
set arrow 2 from maxmax_x, maxmax_y+0.2 to maxmax_x, maxmax_y+0.02 lw 0.5
set label 1 'Minimum' at minmin_x, minmin_y-0.3 centre
set label 2 'Maximum' at maxmax_x, maxmax_y+0.3 centre
plot 'stats1.dat' u 1:2 w p pt 6
unset multiplot
The first plot shades the range covered by the data, from its minimum to its maximum, on
either side of the mean. The second plot shows the band of one standard deviation around the
mean, the third plots only those points that lie within one sigma of the mean, and in the
fourth we place arrows and labels at positions derived from the statistics of the data set.
Note that the last two stats commands use the every keyword to select a range of records,
just as with plot. Here it picks out the single record at the minimum or maximum, so that its
x position becomes available as a variable (minmin_x or maxmax_x).
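The filter used in the third plot generalises to any multiple of sigma. The sketch below assumes 'stats2.dat' has been generated as above; the variable n is a hypothetical name introduced here for the scale factor.

```gnuplot
# n is a hypothetical scale factor; n = 1 reproduces the filter of the third plot.
n = 2
stats 'stats2.dat' u 1:2 var noout

# Points further than n*sigma from the mean evaluate to 1/0 (undefined)
# and are therefore skipped by plot.
plot 'stats2.dat' u 1:(abs($2-mean_y) < n*stddev_y ? $2 : 1/0) w p pt 7
```

The 1/0 idiom is the usual gnuplot way of dropping a point inside a using expression, so no preprocessing of the data file is needed.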
Whisker-and-box plots
# Again, the first part, up to 'unset table', only generates data
set samples 50
set table 'whisker.dat'
# Four data blocks, each with its own random offset a and spread b
a = 1.0+rand(0); b = 0.5+rand(0)*0.5
plot a+rand(0)*b
a = 1.0+rand(0); b = 0.5+rand(0)*0.5
plot a+rand(0)*b
a = 1.0+rand(0); b = 0.5+rand(0)*0.5
plot a+rand(0)*b
a = 1.0+rand(0); b = 0.5+rand(0)*0.5
plot a+rand(0)*b
unset table
set print 'w.dat'
stats 'whisker.dat' u 2 i 0 var noout
print lo_quartile_x, min_x, max_x, up_quartile_x, median_x
stats 'whisker.dat' u 2 i 1 var noout
print lo_quartile_x, min_x, max_x, up_quartile_x, median_x
stats 'whisker.dat' u 2 i 2 var noout
print lo_quartile_x, min_x, max_x, up_quartile_x, median_x
stats 'whisker.dat' u 2 i 3 var noout
print lo_quartile_x, min_x, max_x, up_quartile_x, median_x
set print
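set xrange [-1:4]
set boxwidth 0.5
# First candlestick: box from lower to upper quartile, whiskers from minimum to maximum.
# Second candlestick: all four values equal the median, which draws the median line.
plot 'w.dat' using 0:1:2:3:4 with candlesticks whiskerbars 0.5 lw 2, \
'' using 0:5:5:5:5 with candlesticks notitle lw 2 lc rgb "#008800"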
Note that the index keyword selects the data block to process, just as it does
for plot.
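For readers on gnuplot 4.6 or later, where a stats command is part of gnuplot itself, the four repeated stats/print pairs can be written as a loop. This is a sketch under that assumption and does not use the syntax of the patch described here: the built-in command always stores its results in variables, by default with the prefix STATS_, and for a single column it omits the _x suffix.

```gnuplot
# Assumes gnuplot 4.6 or later (built-in stats, 'do for' loops).
# 'whisker.dat' is the file generated above with four data blocks.
set print 'w.dat'
do for [i=0:3] {
    stats 'whisker.dat' using 2 index i nooutput
    # Same column order as before: box low, whisker low, whisker high, box high, median
    print STATS_lo_quartile, STATS_min, STATS_max, STATS_up_quartile, STATS_median
}
set print
```

The resulting 'w.dat' has the same layout as in the example above, so the candlesticks plot command can be reused unchanged.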