<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>jhunterj.com &#187; sql</title>
	<atom:link href="http://jhunterj.com/tag/sql/feed/" rel="self" type="application/rss+xml" />
	<link>http://jhunterj.com</link>
	<description>J. Hunter Johnson—I&#039;m just this geek you (should) know.</description>
	<lastBuildDate>Sat, 14 Mar 2015 12:10:39 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Transforming mostly static data to date ranges</title>
		<link>http://jhunterj.com/2013/04/06/transforming-mostly-static-data-to-date-ranges/</link>
		<comments>http://jhunterj.com/2013/04/06/transforming-mostly-static-data-to-date-ranges/#comments</comments>
		<pubDate>Sat, 06 Apr 2013 13:59:27 +0000</pubDate>
		<dc:creator><![CDATA[Hunter]]></dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[transact-sql]]></category>

		<guid isPermaLink="false">http://jhunterj.com/?p=288</guid>
		<description><![CDATA[I recently had to transform a set of data from typical date &#38; measurement to a more compact date range &#38; measurement format. The data in this case was very static: as the date incremented, the measurement was much more <a class="more-link" href="http://jhunterj.com/2013/04/06/transforming-mostly-static-data-to-date-ranges/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div id="attachment_293" style="width: 310px" class="wp-caption alignright"><a href="http://www.flickr.com/photos/husseinabdallah/4035530069/" target="_blank"><img class="size-medium wp-image-293" alt="A cityscape showing flat peaks and troughs" src="http://jhunterj.com/wp-content/uploads/2013/04/FlatPeaksAndTroughs-300x228.jpg" width="300" height="228" /></a><p class="wp-caption-text">For data that has flat plateaus and canyons, like this cityscape. Derived from a <a href="http://creativecommons.org/licenses/by/2.0/" target="_blank">CC-BY-2.0</a> image by abdallah.</p></div>
<p>I recently had to transform a set of data from typical date &amp; measurement to a more compact date range &amp; measurement format. The data in this case was very static: as the date incremented, the measurement was much more likely to remain the same than it was to change. So storing the starting date and ending date for each measurement takes less space than storing each date&#8217;s measurement separately. Sure, it makes some subsequent queries more convoluted, but let&#8217;s say that you found this post because you also need a similar transformation.</p>
<p>I stumbled at first by subconsciously assuming that a measurement would not repeat once its range was ended. This assumption works if your measurements never decrease (or if they never increase), say for the total number of copies of a book printed. They&#8217;re printed in batches, and most days no new copies are printed. If your data does meet that criterion, the query is simple (and should be portable from Microsoft Transact-SQL, where I wrote it):</p>
<pre>SELECT d.[group],
       MIN(d.[date]) AS [start_date],
       MAX(d.[date]) AS [end_date],
       d.[measurement]
   FROM mydata d
   GROUP BY d.[group], d.[measurement]
   ORDER BY d.[group], MIN(d.[date])</pre>
<p>So for data like this:</p>
<table border="1">
<tbody>
<tr>
<td>group</td>
<td>date</td>
<td>measurement</td>
</tr>
<tr>
<td>Beta</td>
<td>2013-03-01</td>
<td>12</td>
</tr>
<tr>
<td>Beta</td>
<td>2013-03-02</td>
<td>12</td>
</tr>
<tr>
<td style="text-align: center;" colspan="3">…</td>
</tr>
<tr>
<td>Beta</td>
<td>2013-03-12</td>
<td>12</td>
</tr>
<tr>
<td>Beta</td>
<td>2013-03-13</td>
<td>18</td>
</tr>
<tr>
<td>Beta</td>
<td>2013-03-14</td>
<td>18</td>
</tr>
<tr>
<td style="text-align: center;" colspan="3">…</td>
</tr>
<tr>
<td>Beta</td>
<td>2013-03-31</td>
<td>18</td>
</tr>
</tbody>
</table>
<p>this generates the desired output:</p>
<table border="1">
<tbody>
<tr>
<td>group</td>
<td>start_date</td>
<td>end_date</td>
<td>measurement</td>
</tr>
<tr>
<td>Beta</td>
<td>2013-03-01</td>
<td>2013-03-12</td>
<td>12</td>
</tr>
<tr>
<td>Beta</td>
<td>2013-03-13</td>
<td>2013-03-31</td>
<td>18</td>
</tr>
</tbody>
</table>
<p>Unfortunately, my data did not meet this criterion, and my results from that query had overlapping ranges, which was quite incorrect:</p>
<table border="1">
<tbody>
<tr>
<td>group</td>
<td>date</td>
<td>measurement</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-01</td>
<td>12</td>
</tr>
<tr>
<td style="text-align: center;" colspan="3">…</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-07</td>
<td>12</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-08</td>
<td>15</td>
</tr>
<tr>
<td style="text-align: center;" colspan="3">…</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-11</td>
<td>15</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-12</td>
<td>12</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-13</td>
<td>18</td>
</tr>
<tr>
<td style="text-align: center;" colspan="3">…</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-22</td>
<td>18</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-23</td>
<td>21</td>
</tr>
<tr>
<td style="text-align: center;" colspan="3">…</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-27</td>
<td>21</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-28</td>
<td>18</td>
</tr>
<tr>
<td style="text-align: center;" colspan="3">…</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-31</td>
<td>18</td>
</tr>
</tbody>
</table>
<p>yields:</p>
<table border="1">
<tbody>
<tr>
<td>group</td>
<td>start_date</td>
<td>end_date</td>
<td>measurement</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-01</td>
<td>2013-03-12</td>
<td>12</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-08</td>
<td>2013-03-11</td>
<td>15</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-13</td>
<td>2013-03-31</td>
<td>18</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-23</td>
<td>2013-03-27</td>
<td>21</td>
</tr>
</tbody>
</table>
<p>I solved it by counting off my start dates and end dates, and then for each group and measurement placing the first start date with the first end date, the second with the second, and so on. This still seems like overkill; if you know an optimization I missed, please add your comment below!</p>
<p>In Microsoft&#8217;s Transact-SQL, counting off like that involves the <a title="OVER Clause (Transact-SQL)" href="http://msdn.microsoft.com/en-us/library/ms189461(v=sql.90).aspx" target="_blank">OVER clause</a> and <a title="CASE (Transact-SQL)" href="http://msdn.microsoft.com/en-us/library/ms181765(v=sql.90).aspx" target="_blank">CASE expressions</a> against some <a title="Using Outer Joins" href="http://msdn.microsoft.com/en-us/library/ms187518(v=sql.90).aspx" target="_blank">LEFT JOINs</a>. I LEFT JOIN the table against itself twice, once on the previous date and once on the subsequent date. Finding the NULLs on those joins (in the OVER + CASE constructs) lets me count the starts and ends of each block of measurements. I also need a future date to push all of the &#8220;middle&#8221; dates out of order to the end; they get thrown out later by the <code>WHERE COALESCE([start_date], [end_date]) IS NOT NULL</code> clause.</p>
<pre>/* any date well after all of the dates in the database will do */
DECLARE @futuredate DATE = '2100-01-01';

SELECT t.[group],
       MIN(t.[start_date]) AS [start_date],
       MIN(t.[end_date]) AS [end_date],
       t.[measurement]
   FROM (SELECT d.[group],
                CASE 
                   WHEN d2.[date] IS NULL THEN 
                      ROW_NUMBER()
                         OVER (PARTITION BY d.[group]
                               ORDER BY CASE
                                           WHEN d2.[date] IS NULL THEN d.[date]
                                           ELSE @futuredate
                                        END)
                   ELSE NULL
                END AS start_seq,
                CASE 
                   WHEN d3.[date] IS NULL THEN 
                      ROW_NUMBER()
                         OVER (PARTITION BY d.[group]
                               ORDER BY CASE
                                           WHEN d3.[date] IS NULL THEN d.[date]
                                           ELSE @futuredate
                                        END)
                   ELSE NULL
                END AS end_seq,
                CASE
                   WHEN d2.[date] IS NULL THEN d.[date]
                   ELSE NULL
                END AS [start_date],
                CASE
                   WHEN d3.[date] IS NULL THEN d.[date]
                   ELSE NULL
                END AS [end_date],
                d.[measurement]

            FROM mydata d
               LEFT JOIN mydata d2 ON d2.[group] = d.[group]
                                      AND d2.[date] = DATEADD(DD, -1, d.[date])
                                      AND d2.[measurement] = d.[measurement]
               LEFT JOIN mydata d3 ON d3.[group] = d.[group]
                                      AND d3.[date] = DATEADD(DD, 1, d.[date])
                                      AND d3.[measurement] = d.[measurement]

        ) t
   WHERE COALESCE([start_date], [end_date]) IS NOT NULL
   GROUP BY [group], COALESCE([start_seq], [end_seq]), [measurement]
   ORDER BY [group], [start_date]</pre>
<p>which gives me the correct results:</p>
<table border="1">
<tbody>
<tr>
<td>group</td>
<td>start_date</td>
<td>end_date</td>
<td>measurement</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-01</td>
<td>2013-03-07</td>
<td>12</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-08</td>
<td>2013-03-11</td>
<td>15</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-12</td>
<td>2013-03-12</td>
<td>12</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-13</td>
<td>2013-03-22</td>
<td>18</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-23</td>
<td>2013-03-27</td>
<td>21</td>
</tr>
<tr>
<td>Alpha</td>
<td>2013-03-28</td>
<td>2013-03-31</td>
<td>18</td>
</tr>
<tr>
<td>Beta</td>
<td>2013-03-01</td>
<td>2013-03-12</td>
<td>12</td>
</tr>
<tr>
<td>Beta</td>
<td>2013-03-13</td>
<td>2013-03-31</td>
<td>18</td>
</tr>
</tbody>
</table>
<p>I&#8217;m still making one assumption: that a measurement was stored in the data for every date. If the measurements are not taken that rigorously, you will need to accommodate the gaps with a generated table of the appropriate dates to join against. If you&#8217;re up against that, let me know and I&#8217;ll do a follow-up post.</p>
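<p>As a rough sketch of that generated-table idea (the date bounds, and joining it back to <code>mydata</code>, are illustrative assumptions here, not something I&#8217;ve run against real data), a recursive common table expression can produce one row per calendar date:</p>
<pre>/* Sketch: generate every date in the range, then LEFT JOIN the
   measurements so missing dates surface as NULLs to be filled in.
   MAXRECURSION 0 lifts the default 100-level recursion cap for
   ranges longer than 100 days. */
DECLARE @first DATE = '2013-03-01';
DECLARE @last DATE = '2013-03-31';

WITH all_dates AS (
   SELECT @first AS [date]
   UNION ALL
   SELECT DATEADD(DD, 1, [date])
      FROM all_dates
      WHERE [date] &lt; @last
)
SELECT a.[date], d.[group], d.[measurement]
   FROM all_dates a
      LEFT JOIN mydata d ON d.[date] = a.[date]
   OPTION (MAXRECURSION 0);</pre>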
<p style="text-align: right;">—jhunterj</p>
]]></content:encoded>
			<wfw:commentRss>http://jhunterj.com/2013/04/06/transforming-mostly-static-data-to-date-ranges/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Decoding SQL&#8217;s Decode</title>
		<link>http://jhunterj.com/2013/02/05/decoding-sqls-decode/</link>
		<comments>http://jhunterj.com/2013/02/05/decoding-sqls-decode/#comments</comments>
		<pubDate>Tue, 05 Feb 2013 12:32:29 +0000</pubDate>
		<dc:creator><![CDATA[Hunter]]></dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[decode]]></category>
		<category><![CDATA[oracle]]></category>
		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://jhunterj.com/?p=205</guid>
<description><![CDATA[I came across a SQL function I was not familiar with: decode. I looked it up and immediately replaced it with a CASE expression (in addition to other code cleanup). I&#8217;m afraid I don&#8217;t understand the existence of Oracle SQL&#8217;s <a class="more-link" href="http://jhunterj.com/2013/02/05/decoding-sqls-decode/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
<content:encoded><![CDATA[<p>I came across a SQL function I was not familiar with: decode. I looked it up and immediately replaced it with a CASE expression (in addition to other code cleanup). I&#8217;m afraid I don&#8217;t understand the existence of Oracle SQL&#8217;s decode() function:</p>
<pre>decode(expression,
       search, result
       [, search, result]...
       [, default]
      )</pre>
<p>seems functionally equivalent to</p>
<pre>CASE expression 
   WHEN search THEN result
   [WHEN search THEN result]...
   [ELSE default]
END</pre>
<p><div style="width: 260px" class="wp-caption alignright"><a href="http://commons.wikimedia.org/wiki/File:Gorilla_Scratching_Head.jpg"><img alt="Gorilla Scratching Head" src="http://upload.wikimedia.org/wikipedia/commons/0/08/Gorilla_Scratching_Head.jpg" width="250" /></a><p class="wp-caption-text">By Steven Straiton (originally posted to Flickr as Gorilla) [<a href="http://creativecommons.org/licenses/by/2.0">CC-BY-2.0</a>], via Wikimedia Commons</p></div>except the CASE version</p>
<ul>
<li>has no limitation of &#8220;only&#8221; 255 total expression + search + result + default parameters</li>
<li>is easier to read</li>
<li>works in PL/SQL context</li>
<li>is ANSI-compliant</li>
</ul>
<p>The only &#8220;benefit&#8221; to decode I could find is that the decode function will attempt to convert all of the results to the type of the first result, while CASE just errors if you mix types. To me, that benefit would simply encourage sloppy coding and keep you from noticing buggy code as quickly.</p>
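<p>For a concrete illustration of the equivalence (the <code>status</code> column, its values, and the <code>accounts</code> table are all made up for this example):</p>
<pre>/* Oracle's function spelling */
SELECT decode(status, 1, 'active',
                      2, 'retired',
                      'unknown')
   FROM accounts;

/* the ANSI CASE spelling */
SELECT CASE status
          WHEN 1 THEN 'active'
          WHEN 2 THEN 'retired'
          ELSE 'unknown'
       END
   FROM accounts;</pre>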
<p>So, what&#8217;s the point? Are there coders out there who &#8220;get&#8221; the function syntax more readily than the CASE syntax, so it helps their coding efficiency? Or is there a benefit I&#8217;m missing?</p>
<p style="text-align: right;">—jhunterj</p>
]]></content:encoded>
			<wfw:commentRss>http://jhunterj.com/2013/02/05/decoding-sqls-decode/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
