So you have data you’re tracking every day. Maybe a lot of data. Maybe tracking it a lot of ways. You might even have graphs of the data over time, so that you can spot trends and anomalies. And then you find that you’ve got too many graphs…
It’s time to let the data tell you when it needs attention.
Get your data into a format (through a table, a view, or a subquery; we’ll say you have it in
measurements) where you have records that have the thing you’re measuring (we’ll call it
aspectID), when you measured it (we’ll use
date, although this will work for monthly measurements, weekly, quarterly, yearly, hourly, or anything else), and what its measurement was then (
What management calls a trend, statistics calls a slope. If your data happens to fall in a line, the slope is how much is goes up (or down) each unit of time. −2 dollars per day. +100 page views per hour. +5 accounts per month. Probably your data doesn’t line up nicely for you though.
Some flavors of SQL will make this pretty easy to get, through linear regression functions. Oracle PL/SQL, for instance:
SELECT aspectID, REGR_SLOPE(value, TO_NUMBER(TO_CHAR(date, 'J')) AS slope FROM measurements;
while it’s a little less obvious in flavors without those built-in functions, such as Microsoft T‑SQL:
SELECT aspectID, ( (COUNT(1) * SUM(CAST([date] AS FLOAT) * [value])) - (SUM(CAST([date] AS FLOAT)) * SUM([value])) ) / ( (COUNT(1) * SUM(POWER(CAST([date] AS FLOAT), 2))) - (POWER(SUM(CAST([date] AS FLOAT)), 2)) ) AS slope FROM measurements WHERE [value] IS NOT NULL GROUP BY aspectID;
Some additional things you may want to do:
- restrict (in the
WHEREclause) the data to the recent past (30 days or 90 days for dailies, perhaps, or 72 hours for hourlies, etc.)
- scale the slopes (something that’s going from 1 to 100 is probably more interesting than something that’s going from 50,001 to 50,100), by dividing them by the
MAX(value)for the period. This will give you a more useful sorting column than the “raw” slope, for some problem spaces.
The other thing you’ll often want to know is when is the current value “spiking” (up or down). A brief spike may not have a big impact on the linear regression, but it’s still a point of interest. Oracle PL/SQL first again:
SELECT aspectID, CASE WHEN STDDEV(value) = 0 THEN 0 ELSE (MAX(value) KEEP (DENSE_RANK LAST ORDER BY date) - AVG(value)) / STDDEV(value) END AS spike FROM measurements GROUP BY aspectID;
“Spike” here is the number of standard deviations above (or below) average the most recent data point is. I like to look for instances that are more than two standard deviations from the current average (which should mean I’m getting hits 5% of the time for each measurement, or hits on 5% of the metrics with any given run) by adding
WHERE CASE WHEN STDDEV(value) = 0 THEN 0 ELSE ABS(MAX(value) KEEP (DENSE_RANK LAST ORDER BY date) - AVG(value)) / STDDEV(value) END >= 2
(Note the addition of
ABS there—you want sudden drops as well as sudden increases.) If you put that into a view, your
WHERE clause will be a little prettier. Restrict the time scale as needed, just like with slopes.
Getting to the most recent value as well as aggregations of the values in Microsoft T-SQL is a little more cumbersome, but I don’t find either particularly more readable than the other:
SELECT m.aspectID, CASE WHEN STDEV(m.value) = 0 THEN 0 ELSE (lastm.value - AVG(m.value)) / STDEV(m.value) END AS spike FROM measurements m JOIN (SELECT aspectID, [value], ROW_NUMBER() OVER (PARTITION BY aspectID ORDER BY MAX([date]) DESC) AS rank FROM #measurements GROUP BY aspectID, [value] ) lastm ON lastm.aspectID = m.aspectID WHERE m.[value] IS NOT NULL AND lastm.rank = 1 GROUP BY m.aspectID, lastm.value;
And those are the base tools. Get them into a view or schedule a stored procedure to run periodically and email you the metrics that trip your thresholds. And please let me know if you have better (more readable) approaches in these or other SQL flavors, or if you’ve put these kinds of measurements to interesting uses.