Transforming mostly static data to date ranges

A cityscape showing flat peaks and troughs

For data that has flat plateaus and canyons, like this cityscape. Derived from a CC-BY-2.0 image by abdallah.

I recently had to transform a set of data from typical date & measurement to a more compact date range & measurement format. The data in this case was very static: as the date incremented, the measurement was much more likely to be remain the same than it was to change. So storing the starting date and ending date for each measurement takes less space than storing each date’s measurement separately. Sure, it makes some subsequent  queries more convoluted, but let’s say that you found this post because you also need a similar transformation.

I stumbled at first by subconsciously assuming that a measurement would not repeat once its range was ended. This assumption works if your measurements never decrease (or if they never increase), say for the total number of copies of a book printed. They’re printed in batches, and most days no new copies are printed. If your data does meet that criterion, the query is simple (and should be portable from Microsoft Transact-SQL, where I wrote it):

SELECT d.[group],
       MIN(d.[date]) AS [start_date],
       MAX(d.[date]) AS [end_date],
       d.[measurement]
   FROM mydata d
   GROUP BY d.[group], d.[measurement]
   ORDER BY d.[group], MIN(d.[date])

So for data like this:

group date measurement
Beta 2013-03-01 12
Beta 2013-03-02 12
Beta 2013-03-12 12
Beta 2013-03-13 18
Beta 2013-03-14 18
Beta 2013-03-31 18

this generates the desired output:

group start_date end_date measurement
Beta 2013-03-01 2013-03-12 12
Beta 2013-03-13 2013-03-31 18

Unfortunately, my data did not meet this criterion, and my results from that query had overlapping ranges, which was quite incorrect:

group date measurement
Alpha 2013-03-01 12
Alpha 2013-03-07 12
Alpha 2013-03-08 15
Alpha 2013-03-11 15
Alpha 2013-03-12 12
Alpha 2013-03-13 18
Alpha 2013-03-22 18
Alpha 2013-03-23 21
Alpha 2013-03-27 21
Alpha 2013-03-28 18
Alpha 2013-03-31 18

yields:

group start_date end_date measurement
Alpha 2013-03-01 2013-03-12 12
Alpha 2013-03-08 2013-03-11 15
Alpha 2013-03-13 2013-03-31 18
Alpha 2013-03-23 2013-03-27 21

I solved it by counting off my start dates and end dates, and then for each group and measurement placing the first start date with the first end date, the second with the second, and so on. This still seems like overkill; if you know an optimization I missed, please add your comment below!

In Microsoft’s Transact SQL, counting off like that involves the OVER clause and CASE expressions against some LEFT JOINs. I LEFT JOIN the table against itself twice, once on the previous date and once on the subsequent date. Finding the NULLs on those joins (in the OVER + CASE constructs) allows me to count the starts and ends of each block of measurements. I also need a future date to move all of the “middle” dates out of order to the end—and they get thrown out by the WHERE COALESCE([start_date], [end_date]) IS NOT NULL part later.

/* any date well after all of the dates in the database will do */
DECLARE @futuredate DATE = '2100-01-01';

SELECT t.[group],
       MIN(t.[start_date]) AS [start_date],
       MIN(t.[end_date]) AS [end_date],
       t.[measurement]
   FROM (SELECT d.[group],
                CASE 
                   WHEN d2.[date] IS NULL THEN 
                      ROW_NUMBER()
                         OVER (PARTITION BY d.[group]
                               ORDER BY CASE
                                           WHEN d2.[date] IS NULL THEN d.[date]
                                           ELSE @futuredate
                                        END)
                   ELSE NULL
                END AS start_seq,
                CASE 
                   WHEN d3.[date] IS NULL THEN 
                      ROW_NUMBER()
                         OVER (PARTITION BY d.[group]
                               ORDER BY CASE
                                           WHEN d3.[date] IS NULL THEN d.[date]
                                           ELSE @futuredate
                                        END)
                   ELSE NULL
                END AS end_seq,
                CASE
                   WHEN d2.[date] IS NULL THEN d.[date]
                   ELSE NULL
                END AS [start_date],
                CASE
                   WHEN d3.[date] IS NULL THEN d.[date]
                   ELSE NULL
                END AS [end_date],
                d.[measurement]

            FROM mydata d
               LEFT JOIN mydata d2 ON d2.[group] = d.[group]
                                      AND d2.[date] = DATEADD(DD, -1, d.[date])
                                      AND d2.[measurement] = d.[measurement]
               LEFT JOIN mydata d3 ON d3.[group] = d.[group]
                                      AND d3.[date] = DATEADD(DD, 1, d.[date])
                                      AND d3.[measurement] = d.[measurement]

        ) t
   WHERE COALESCE([start_date], [end_date]) IS NOT NULL
   GROUP BY [group], COALESCE([start_seq], [end_seq]), [measurement]
   ORDER BY [group], [start_date]

which gives me the correct results:

group start_date end_date measurement
Alpha 2013-03-01 2013-03-07 12
Alpha 2013-03-08 2013-03-11 15
Alpha 2013-03-12 2013-03-12 12
Alpha 2013-03-13 2013-03-22 18
Alpha 2013-03-23 2013-03-27 21
Alpha 2013-03-28 2013-03-31 18
Beta 2013-03-01 2013-03-12 12
Beta 2013-03-13 2013-03-31 18

I’m still making an assumption, that a measurement was stored in the data for every date. If the measurements are not taken that rigorously, you will need to accommodate that gap with a generated table of the appropriate dates to join against. If you’re up against that, let me know and I’ll do a follow-up post.

—jhunterj

Leave a Reply

Your email address will not be published. Required fields are marked *

*


three − = 2

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>