Hive - Windowing - Range and Rows

Post Image

Range and rows are types of windowing in hive. While RANGE is dealing with content of the column which is ordered by, ROWS is about row numbers. With ROWS, you can window on preceeding and following rows. With RANGE, you window the content in a range.

I was facing an issue with hive windowing functions. It was a simple scenario, but my hive query never ends. My window was like;

select  
    user_id,
    LAST_VALUE(location)over w,
    event_time
from my_users  
window as (user_id by category order by event_time)  

This does pin last location per user.

According to Hive Windowdowing Function Tutorial;

When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.  

It means, the query above, it is using RANGE as default. I didn't know that. Here is the version of the example above how hive runs as;

select  
    user_id,
    LAST_VALUE(location)over w,
    event_time
from my_users  
window as (user_id by category order by event_time RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  

Strangely, RANGE does thing so so slower than the way of ROWS does. In most cases, you probably need ROWS and RANGE may be the one which is being used as default because you didn't noticed it. I don't technically know the detail why it is slower. This was actually the issue I was facing. I changed this to;

select  
    user_id,
    LAST_VALUE(location)over w,
    event_time
from my_users  
window as (user_id by category order by event_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  

With this small change, the query lasted 1 hour,lasts 120 seconds now.