I recently had to transform a set of data from typical date & measurement to a more compact date range & measurement format. The data in this case was very static: as the date incremented, the measurement was much more likely to be remain the same than it was to change. So storing the starting date and ending date for each measurement takes less space than storing each date’s measurement separately. Sure, it makes some subsequent queries more convoluted, but let’s say that you found this post because you also need a similar transformation.
I stumbled at first by subconsciously assuming that a measurement would not repeat once its range was ended. This assumption works if your measurements never decrease (or if they never increase), say for the total number of copies of a book printed. They’re printed in batches, and most days no new copies are printed. If your data does meet that criterion, the query is simple (and should be portable from Microsoft Transact-SQL, where I wrote it):
SELECT d.[group], MIN(d.[date]) AS [start_date], MAX(d.[date]) AS [end_date], d.[measurement] FROM mydata d GROUP BY d.[group], d.[measurement] ORDER BY d.[group], MIN(d.[date])
So for data like this:
group | date | measurement |
Beta | 2013-03-01 | 12 |
Beta | 2013-03-02 | 12 |
… | ||
Beta | 2013-03-12 | 12 |
Beta | 2013-03-13 | 18 |
Beta | 2013-03-14 | 18 |
… | ||
Beta | 2013-03-31 | 18 |
this generates the desired output:
group | start_date | end_date | measurement |
Beta | 2013-03-01 | 2013-03-12 | 12 |
Beta | 2013-03-13 | 2013-03-31 | 18 |
Unfortunately, my data did not meet this criterion, and my results from that query had overlapping ranges, which was quite incorrect:
group | date | measurement |
Alpha | 2013-03-01 | 12 |
… | ||
Alpha | 2013-03-07 | 12 |
Alpha | 2013-03-08 | 15 |
… | ||
Alpha | 2013-03-11 | 15 |
Alpha | 2013-03-12 | 12 |
Alpha | 2013-03-13 | 18 |
… | ||
Alpha | 2013-03-22 | 18 |
Alpha | 2013-03-23 | 21 |
… | ||
Alpha | 2013-03-27 | 21 |
Alpha | 2013-03-28 | 18 |
… | ||
Alpha | 2013-03-31 | 18 |
yields:
group | start_date | end_date | measurement |
Alpha | 2013-03-01 | 2013-03-12 | 12 |
Alpha | 2013-03-08 | 2013-03-11 | 15 |
Alpha | 2013-03-13 | 2013-03-31 | 18 |
Alpha | 2013-03-23 | 2013-03-27 | 21 |
I solved it by counting off my start dates and end dates, and then for each group and measurement placing the first start date with the first end date, the second with the second, and so on. This still seems like overkill; if you know an optimization I missed, please add your comment below!
In Microsoft’s Transact SQL, counting off like that involves the OVER clause and CASE expressions against some LEFT JOINs. I LEFT JOIN the table against itself twice, once on the previous date and once on the subsequent date. Finding the NULLs on those joins (in the OVER + CASE constructs) allows me to count the starts and ends of each block of measurements. I also need a future date to move all of the “middle” dates out of order to the end—and they get thrown out by the WHERE COALESCE([start_date], [end_date]) IS NOT NULL
part later.
/* any date well after all of the dates in the database will do */ DECLARE @futuredate DATE = '2100-01-01'; SELECT t.[group], MIN(t.[start_date]) AS [start_date], MIN(t.[end_date]) AS [end_date], t.[measurement] FROM (SELECT d.[group], CASE WHEN d2.[date] IS NULL THEN ROW_NUMBER() OVER (PARTITION BY d.[group] ORDER BY CASE WHEN d2.[date] IS NULL THEN d.[date] ELSE @futuredate END) ELSE NULL END AS start_seq, CASE WHEN d3.[date] IS NULL THEN ROW_NUMBER() OVER (PARTITION BY d.[group] ORDER BY CASE WHEN d3.[date] IS NULL THEN d.[date] ELSE @futuredate END) ELSE NULL END AS end_seq, CASE WHEN d2.[date] IS NULL THEN d.[date] ELSE NULL END AS [start_date], CASE WHEN d3.[date] IS NULL THEN d.[date] ELSE NULL END AS [end_date], d.[measurement] FROM mydata d LEFT JOIN mydata d2 ON d2.[group] = d.[group] AND d2.[date] = DATEADD(DD, -1, d.[date]) AND d2.[measurement] = d.[measurement] LEFT JOIN mydata d3 ON d3.[group] = d.[group] AND d3.[date] = DATEADD(DD, 1, d.[date]) AND d3.[measurement] = d.[measurement] ) t WHERE COALESCE([start_date], [end_date]) IS NOT NULL GROUP BY [group], COALESCE([start_seq], [end_seq]), [measurement] ORDER BY [group], [start_date]
which gives me the correct results:
group | start_date | end_date | measurement |
Alpha | 2013-03-01 | 2013-03-07 | 12 |
Alpha | 2013-03-08 | 2013-03-11 | 15 |
Alpha | 2013-03-12 | 2013-03-12 | 12 |
Alpha | 2013-03-13 | 2013-03-22 | 18 |
Alpha | 2013-03-23 | 2013-03-27 | 21 |
Alpha | 2013-03-28 | 2013-03-31 | 18 |
Beta | 2013-03-01 | 2013-03-12 | 12 |
Beta | 2013-03-13 | 2013-03-31 | 18 |
I’m still making an assumption, that a measurement was stored in the data for every date. If the measurements are not taken that rigorously, you will need to accommodate that gap with a generated table of the appropriate dates to join against. If you’re up against that, let me know and I’ll do a follow-up post.
—jhunterj