Hive - Sampling Volume 1: percent_rank

Sampling data operation is something you need to feed your other environments than production something like test environment, or for unittesting etc. Because of your data is probably so huge amount as you don't want to use the same scale for other environments, you will look for sampling the data and scaling it down without loosing the scale as much as possible.

Good sampling is actually kind of something hard to achieve. Because, you need to sample it without breaking the scale of actual data as much as possible. I say as much as possible, becuase, it depends what you are doing. What I mean is; you need to define the columns which the scaling will be on. For example, Let's assume you want to sample your data by scaling it on a field something like category field. You can sample it by limiting all rows or limiting each category group with a strict size by help of rank windowing function or as limiting each category group a percent over group size by help of percent_rank windowing function. I think the third one which is percent sampling on some fields is better way of doing sampling.

Let's see that on the example below; Assume we have a data like that;

Category Product Name Product Description
Casual Black Shoe This is shoe
Electronic Washing Machine This is waching machine
Electronic TV This is tv
Casual Blue Trousers This is trousers
Casual White Shoe This is shoe
Casual Blue Shoe This is shoe
Limiting all results (not suggested)

Let's assume you think, this data is so big and you want to scale it for your unittesting. For example lets scale it down %50. It can be done by limiting all results like rownum*100/max(rownum), but this won't be the proper way of doing it. Let's see the result if you do that;

Category Product Name Product Description
Casual Black Shoe This is shoe
Electronic Washing Machine This is waching machine
Electronic TV This is tv

It seems, you have more electronic type products than casual ones. This doesn't seem good scaling to me honestly.

Limiting each category by a strict size (not suggested)

This means, we are going to take only n rows of each category. Let's assume this n is 2.

SELECT  
  p.category,
  p.product_name,
  p.product_description
FROM (  
  SELECT
    p.category,
    p.product_name,
    p.product_description,
    rank() over w1 rnk
  FROM products p
  window w1 AS ( PARTITION BY p.category ORDER BY p.product_id)
)t1
WHERE t1.rnk<2  

This will result like;

Category Product Name Product Description
Casual Black Shoe This is shoe
Electronic Washing Machine This is waching machine
Electronic TV This is tv
Casual Blue Trousers This is trousers

There is same amount of casual products as electroinc ones. It doesn't seem good scaling at all. Becuase There are double times more casual products than electronic products.

Limiting each category by a percentage(suggested)

I think this is the best way of doing it. Becuase percentage give it with always proper scale.

Let's assume we want to scale each category group down as %50 percent. percent_rank windowing function is what we are looking for.

SELECT  
  p.category,
  p.product_name,
  p.product_description
FROM (  
  SELECT
    p.category,
    p.product_name,
    p.product_description,
   (percent_rank() OVER w1) * 100 AS percent_rnk
  FROM products p
  window w1 AS ( PARTITION BY p.category ORDER BY p.product_id)
)t1
WHERE t1.percent_rnk<50  

This will result like;

Category Product Name Product Description
Casual Black Shoe This is shoe
Electronic Washing Machine This is waching machine
Casual Blue Trousers This is trousers

As you see, scale still remains between categories. There are still double times casual products than electronic ones.