Hive - Sliding Last Not Null Value

Post Image

Hive supports windowing features over columns according to current row. You can see the documantation about windowing functions in Hive at LanguageManual WindowingAndAnalytics. Those functions are not going be the topic in this article.

Assume that, you have a dataset contains null value on some columns and you want to fill those null values with most recent not null value of same column. Let me give an example dataset to explain it;

Here is my table;

CREATE TABLE IF NOT EXISTS my_table  
(
`date` string,
 number int,
 estimate int,
 client string
) row format delimited fields terminated by ',';

Lets have a dataset like;

hive> select * from my_table;  
OK  
2016-08-01       3       10      A  
2016-08-02       NULL    10      NULL  
2016-08-03       5       10      A  
2016-08-04       NULL    10      NULL  
2016-08-05       NULL    10      NULL  
2016-08-06       2       10      A  
Time taken: 0.06 seconds, Fetched: 6 row(s)  

And we want to have a result like;

OK  
2016-08-01       3       10      A  
2016-08-02       3       10      A  
2016-08-03       5       10      A  
2016-08-04       5       10      A  
2016-08-05       5       10      A  
2016-08-06       2       10      A  
Time taken: 19.177 seconds, Fetched: 6 row(s)  

As you can see, NULL values are replaced with most recent not null values of same column before current row.

Normally with SQL, you can calculate at row level. If you need to calculate something by using column level according to current row, this is called as sliding. For example, retrieveing previous columns before current row and sum them and count them etc. Retreiveing last not null value before current row is what we need to solve the problem desribed in this article. LAST_VALUE function is what we need. Here is the function description;

LAST_VALUE(<field>,<ignore nulls as boolean>)  

So, the query we need is to have asked result;

SELECT  
    `date`,
    COALESCE(number, LAST_VALUE(number, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
    estimate,
    COALESCE(client, LAST_VALUE(client, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
FROM my_table;  

COALESCE function provide us to make sure to have last not NULL value of the field. If the value of a field is null, last value which is not null before the current row will be selected.

This query result will be as we want;

OK  
2016-08-01       3       10      A  
2016-08-02       3       10      A  
2016-08-03       5       10      A  
2016-08-04       5       10      A  
2016-08-05       5       10      A  
2016-08-06       2       10      A  
Time taken: 19.177 seconds, Fetched: 6 row(s)