Hive - Unit-testing

Testing is really important in software development. It gives you confidence that your code actually works. Rather than lost time, it is really a gain, though that is a topic that deserves its own article.

Unit testing is one of the toughest parts of the coding process, and when you are dealing with data it gets even harder, so many people give up on this challenging phase. But if you are a good developer - by which I mean you are addicted to producing quality code that cannot easily break right after you say "I am done" - you must love this phase, because it is the phase that makes you sure about what you did. I said this is another topic, but I couldn't keep myself from talking about it :-) .

As I said, when a test needs data, it is hard to prepare that test data, often called provided data or initial data. (There are tools that make this easier; I will talk about them in another article.) The complexity of the calculations determines how hard preparing the test data will be: if your script accesses many tables, you will need test data for all of those tables in each of your cases.

Testing application code is one thing, but I didn't think we could test our Hive scripts. Those scripts carry a lot of our business logic, and the fact that they went untested really bothered me. Too many simple mistakes occurred in our scripts, and that cost us a lot: we release the application, realize the problem in production, and the fix needs a new release or a hotfix, which costs us an all-nighter. Our scripts were tested manually, but that was exhausting and guaranteed nothing. Making mistakes is the nature of this job; it has almost nothing to do with being a good developer. Good developers also make simple mistakes - we are not machines, right? What makes us bad developers is not using machines to test and make sure our work is really done; I mean unit testing.

So we searched for a way to unit-test Hive scripts and found something. It is a zero-installation tool: you just import a Java library and write your tests. You don't need any external infrastructure - no Hadoop, HDFS, or Hive installation is required. This is really cool, because it allowed us to unit-test our Hive scripts and run them anywhere. The library has some problems as of today, but it works very well. It is HiveRunner.

OK, enough talking about the importance of testing; let's prepare an example.

All we need is to add this library to our pom.xml, along with a JUnit dependency:

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <!-- any JUnit 4.x release works with HiveRunner 3.x -->
            <version>4.12</version>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>com.klarna</groupId>
            <artifactId>hiverunner</artifactId>
            <version>3.0.0</version>
            <scope>test</scope>
        </dependency>

Note: If you are using the hive-exec library anywhere else, make sure its version is 1.2 or lower. HiveRunner does not work with version 2.1 or newer.
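If another dependency pulls in a newer hive-exec transitively, one way to keep it at 1.2 is to pin it explicitly. This is just a sketch using Maven's dependencyManagement; the exact 1.2.x patch version here is an assumption:

```xml
<!-- Sketch: pin hive-exec to a 1.2.x release so a transitive 2.x version
     cannot override it (the patch version below is an assumption). -->
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
        </dependency>
    </dependencies>
</dependencyManagement>
```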

Now let's put our Hive table creation scripts under main/resources/hive/create:

-- school.hql

use mydatabase;

CREATE TABLE IF NOT EXISTS school  
(
  school_id   BIGINT,
  school_name        string
) ROW FORMAT delimited fields terminated BY ',';
-- student.hql

use mydatabase;

CREATE TABLE IF NOT EXISTS student  
(
  student_id BIGINT,
  name       string,
  lastname   string,
  birth_date string,
  school_id  BIGINT
) ROW FORMAT delimited fields terminated BY ',';
-- student_count_report.hql

use mydatabase;

CREATE TABLE IF NOT EXISTS student_count_report  
(
  school_name  string,
  student_count BIGINT
) ROW FORMAT delimited fields terminated BY ',';

We will execute all of these scripts to create the Hive tables we need.

And let's create our report execution script under main/resources/hive/report:

-- execute_student_count_report.hql

use mydatabase;

INSERT INTO TABLE student_count_report  
SELECT  
  school.school_name,
  count(student.student_id) as cnt
FROM school  
LEFT JOIN student on student.school_id = school.school_id  
GROUP BY school.school_name;  

After we fill the tables with test data, we will execute this script and read the result from the report result table.

Here are my test data files, under test/resources/StudentCountReportTest/successCase1:

school.csv

1,Cumhuriyet İlköğretim Okulu  
2,Atatürk Lisesi  
3,Samsun Anadolu Lisesi  

student.csv

1,ali,yılmaz,21-09-1991,1  
2,mehmet,yılmaz,22-07-1991,1  
3,veli,kal,23-08-1990,1  
4,şaban,yaşar,24-05-1990,1  
5,ahmet,güngör,25-03-1990,1  
6,seda,akyüz,26-07-1990,2  
7,büşra,kilit,27-01-1990,2  
8,ali,kıymık,28-06-1990,3  
9,veysel,bolu,29-05-1990,3  
10,ahmet,dal,21-09-1990,3  

Now it is time to write our test:

package dal.ahmet.hive.unittest;

import com.klarna.hiverunner.HiveShell;  
import com.klarna.hiverunner.StandaloneHiveRunner;  
import com.klarna.hiverunner.annotations.HiveSQL;  
import org.junit.Assert;  
import org.junit.Before;  
import org.junit.Test;  
import org.junit.runner.RunWith;

import java.io.File;  
import java.io.FileNotFoundException;  
import java.net.URL;  
import java.util.Arrays;  
import java.util.List;


/**
 * Created by ahmetdal on 02/09/16.
 */


@RunWith(StandaloneHiveRunner.class)
public class StudentCountReportTest {

    private String database = "mydatabase";

    @HiveSQL(files = {})
    private HiveShell hiveShell;


    @Before
    public void setUp() throws FileNotFoundException {


        // Create the database
        hiveShell.execute("CREATE DATABASE IF NOT EXISTS " + database + ";");


        // Create Tables
        hiveShell.execute(getResourceFile("hive/create/school.hql"));
        hiveShell.execute(getResourceFile("hive/create/student.hql"));
        hiveShell.execute(getResourceFile("hive/create/student_count_report.hql"));
    }

    private void insertInto(String table, String resourcePath) {
        String delimiter = ",";
        String nullValue = "\\N";

        hiveShell.insertInto(database, table)
                .withAllColumns()
                .addRowsFromDelimited(
                        new File(
                                getClass()
                                        .getClassLoader()
                                        .getResource(resourcePath)
                                        .getPath()
                        ),
                        delimiter,
                        nullValue
                )
                .commit();
    }

    private File getResourceFile(String resourcePath) throws FileNotFoundException {
        URL url = getClass().getClassLoader().getResource(resourcePath);
        if (null == url) {
            throw new FileNotFoundException(resourcePath);
        }
        return new File(url.getPath());
    }


    @Test
    public void testSuccessCase1() throws FileNotFoundException {

        // Insert test data
        insertInto("school", "StudentCountReportTest/successCase1/school.csv");
        insertInto("student", "StudentCountReportTest/successCase1/student.csv");

        // Execute the report
        hiveShell.execute(getResourceFile("hive/report/execute_student_count_report.hql"));


        // Check and assert the result
        List<Object[]> actualResultSet = hiveShell.executeStatement("select * from student_count_report order by school_name");


        List<Object[]> expectedResultSet = Arrays.asList(
                new Object[]{"Atatürk Lisesi", 2L},
                new Object[]{"Cumhuriyet İlköğretim Okulu", 5L},
                new Object[]{"Samsun Anadolu Lisesi", 3L}
        );

        Assert.assertArrayEquals(expectedResultSet.toArray(), actualResultSet.toArray());
    }


}

Note: Another important problem with the tool is that it does not support abstraction: you can't move the hiveShell field with its @HiveSQL annotation into an abstract test class. As a workaround, we added a getter method for it and used the getter in the abstract class.
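The workaround looks roughly like this (a sketch; the class names AbstractHiveTest and ConcreteHiveTest are made up for illustration):

```java
import com.klarna.hiverunner.HiveShell;
import com.klarna.hiverunner.StandaloneHiveRunner;
import com.klarna.hiverunner.annotations.HiveSQL;
import org.junit.runner.RunWith;

// Sketch of the getter workaround: the @HiveSQL-annotated field must stay in
// the concrete test class, so the abstract base only sees it through a getter.
abstract class AbstractHiveTest {

    // Subclasses supply the shell; the base class never owns the annotated field.
    protected abstract HiveShell getHiveShell();

    protected void createDatabase(String database) {
        getHiveShell().execute("CREATE DATABASE IF NOT EXISTS " + database + ";");
    }
}

@RunWith(StandaloneHiveRunner.class)
public class ConcreteHiveTest extends AbstractHiveTest {

    @HiveSQL(files = {})
    private HiveShell hiveShell;

    @Override
    protected HiveShell getHiveShell() {
        return hiveShell;
    }
}
```

This keeps shared setup logic in one place while each concrete test still carries its own annotated field.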

Let me explain the code. By default, hiveShell starts automatically. If you want to set some variables before it starts, set autoStart to false in the annotation, set your variables in setUp, and then start hiveShell manually. I didn't need that here.
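For reference, the manual-start variant looks roughly like this (a sketch; the hivevar key and value are made-up examples):

```java
import com.klarna.hiverunner.HiveShell;
import com.klarna.hiverunner.StandaloneHiveRunner;
import com.klarna.hiverunner.annotations.HiveSQL;
import org.junit.Before;
import org.junit.runner.RunWith;

// Sketch: autoStart = false delays shell startup so variables can be set first.
@RunWith(StandaloneHiveRunner.class)
public class ManualStartTest {

    // autoStart = false: the shell will not start until start() is called
    @HiveSQL(files = {}, autoStart = false)
    private HiveShell hiveShell;

    @Before
    public void setUp() {
        // Set conf/variables before startup (the key and value are illustrative)
        hiveShell.setHiveConfValue("hivevar:report_date", "2016-09-02");
        // Then start the shell manually
        hiveShell.start();
    }
}
```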

  • First we create the database used in our scripts (setUp).
  • We execute the create scripts to create all the tables this case needs (setUp).
  • We insert school and student test data from the school.csv and student.csv files (testSuccessCase1).
  • We execute the script that calculates and fills the student_count_report table (testSuccessCase1).
  • We retrieve the student_count_report result set and assert on it (testSuccessCase1).

In my test data files there are 10 students: 5 at "Cumhuriyet İlköğretim Okulu", 2 at "Atatürk Lisesi" and 3 at "Samsun Anadolu Lisesi". You can see my expectations above, and if you run the test you will see this case pass. At least I hope so :-)

You can check out and play with the example from this article at https://github.com/javrasya/blog-examples/tree/master/hive/unittest