Spikes, Slopes, and SQL

So you have data you’re tracking every day. Maybe a lot of data. Maybe tracking it a lot of ways. You might even have graphs of the data over time, so that you can spot trends and anomalies. And then you find that you’ve got too many graphs…

It’s time to let the data tell you when it needs attention.

Get your data into a format (through a table, a view, or a subquery; we’ll say you have it in measurements) where you have records that have the thing you’re measuring (we’ll call it aspectID), when you measured it (we’ll use date, although this will work for monthly measurements, weekly, quarterly, yearly, hourly, or anything else), and what its measurement was then (value).

A graph of data points and simple linear regression

Linear regression

What management calls a trend, statistics calls a slope. If your data happens to fall in a line, the slope is how much is goes up (or down) each unit of time. −2 dollars per day. +100 page views per hour. +5 accounts per month. Probably your data doesn’t line up nicely for you though.

Some flavors of SQL will make this pretty easy to get, through linear regression functions. Oracle PL/SQL, for instance:

SELECT aspectID,
       REGR_SLOPE(value, TO_NUMBER(TO_CHAR(date, 'J')) AS slope
   FROM measurements;

while it’s a little less obvious in flavors without those built-in functions, such as Microsoft T‑SQL:

SELECT aspectID,
       ( (COUNT(1) * SUM(CAST([date] AS FLOAT) * [value])) -
         (SUM(CAST([date] AS FLOAT)) * SUM([value]))
       ) /
       ( (COUNT(1) * SUM(POWER(CAST([date] AS FLOAT), 2))) -
         (POWER(SUM(CAST([date] AS FLOAT)), 2))
       ) AS slope
   FROM measurements
   WHERE [value] IS NOT NULL
   GROUP BY aspectID;

Some additional things you may want to do:

  • restrict (in the WHERE clause) the data to the recent past (30 days or 90 days for dailies, perhaps, or 72 hours for hourlies, etc.)
  • scale the slopes (something that’s going from 1 to 100 is probably more interesting than something that’s going from 50,001 to 50,100), by dividing them by the MAX(value) for the period. This will give you a more useful sorting column than the “raw” slope, for some problem spaces.

The other thing you’ll often want to know is when is the current value “spiking” (up or down). A brief spike may not have a big impact on the linear regression, but it’s still a point of interest. Oracle PL/SQL first again:

SELECT aspectID,
       CASE
          WHEN STDDEV(value) = 0 THEN 0
          ELSE (MAX(value) KEEP (DENSE_RANK LAST ORDER BY date)
                   -  AVG(value)) /
               STDDEV(value)
       END AS spike
   FROM measurements
   GROUP BY aspectID;

“Spike” here is the number of standard deviations above (or below) average the most recent data point is. I like to look for instances that are more than two standard deviations from the current average (which should mean I’m getting hits 5% of the time for each measurement, or hits on 5% of the metrics with any given run) by adding

WHERE CASE
         WHEN STDDEV(value) = 0 THEN 0
         ELSE ABS(MAX(value) KEEP (DENSE_RANK LAST ORDER BY date)
                  -  AVG(value)) /
              STDDEV(value)
      END >= 2

(Note the addition of ABS there—you want sudden drops as well as sudden increases.) If you put that into a view, your WHERE clause will be a little prettier. Restrict the time scale as needed, just like with slopes.

Getting to the most recent value as well as aggregations of the values in Microsoft T-SQL is a little more cumbersome, but I don’t find either particularly more readable than the other:

SELECT m.aspectID,
       CASE
          WHEN STDEV(m.value) = 0 THEN 0
          ELSE (lastm.value - AVG(m.value)) / STDEV(m.value)
       END AS spike
   FROM measurements m
      JOIN (SELECT aspectID,
                   [value],
                   ROW_NUMBER()
                      OVER (PARTITION BY aspectID
                            ORDER BY MAX([date]) DESC) AS rank
               FROM #measurements
               GROUP BY aspectID,
                        [value]
           ) lastm ON lastm.aspectID = m.aspectID
   WHERE m.[value] IS NOT NULL
      AND lastm.rank = 1
   GROUP BY m.aspectID,
            lastm.value;

And those are the base tools. Get them into a view or schedule a stored procedure to run periodically and email you the metrics that trip your thresholds. And please let me know if you have better (more readable) approaches in these or other SQL flavors, or if you’ve put these kinds of measurements to interesting uses.

—jhunterj

One thought on “Spikes, Slopes, and SQL

Leave a Reply

Your email address will not be published. Required fields are marked *

*


− two = 5

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>