Refactor test suite to be more readable? #175

pmeier · 2022-01-20T09:52:17Z

While working on #174, I also touched the test suite. It contains some ginormous tests that are hard to parse because they do so many things at once:

data/test/test_datapipe.py

Lines 382 to 426 in c06066a

    
    def test_line_reader_iterdatapipe(self) -> None:
        text1 = "Line1\nLine2"
        text2 = "Line2,1\nLine2,2\nLine2,3"

        # Functional Test: read lines correctly
        source_dp = IterableWrapper([("file1", io.StringIO(text1)), ("file2", io.StringIO(text2))])
        line_reader_dp = source_dp.readlines()
        expected_result = [("file1", line) for line in text1.split("\n")] + [
            ("file2", line) for line in text2.split("\n")
        ]
        self.assertEqual(expected_result, list(line_reader_dp))

        # Functional Test: strip new lines for bytes
        source_dp = IterableWrapper(
            [("file1", io.BytesIO(text1.encode("utf-8"))), ("file2", io.BytesIO(text2.encode("utf-8")))]
        )
        line_reader_dp = source_dp.readlines()
        expected_result_bytes = [("file1", line.encode("utf-8")) for line in text1.split("\n")] + [
            ("file2", line.encode("utf-8")) for line in text2.split("\n")
        ]
        self.assertEqual(expected_result_bytes, list(line_reader_dp))

        # Functional Test: do not strip new lines
        source_dp = IterableWrapper([("file1", io.StringIO(text1)), ("file2", io.StringIO(text2))])
        line_reader_dp = source_dp.readlines(strip_newline=False)
        expected_result = [
            ("file1", "Line1\n"),
            ("file1", "Line2"),
            ("file2", "Line2,1\n"),
            ("file2", "Line2,2\n"),
            ("file2", "Line2,3"),
        ]
        self.assertEqual(expected_result, list(line_reader_dp))

        # Reset Test:
        source_dp = IterableWrapper([("file1", io.StringIO(text1)), ("file2", io.StringIO(text2))])
        line_reader_dp = LineReader(source_dp, strip_newline=False)
        n_elements_before_reset = 2
        res_before_reset, res_after_reset = reset_after_n_next_calls(line_reader_dp, n_elements_before_reset)
        self.assertEqual(expected_result[:n_elements_before_reset], res_before_reset)
        self.assertEqual(expected_result, res_after_reset)

        # __len__ Test: length isn't implemented since it cannot be known ahead of time
        with self.assertRaisesRegex(TypeError, "has no len"):
            len(line_reader_dp)

I was wondering if there is a reason for that. Can't we split this into multiple smaller tests? With pytest, the following class in the test module is equivalent to the test above:

class TestLineReader:
    @pytest.fixture
    def text1(self):
        return "Line1\nLine2"

    @pytest.fixture
    def text2(self):
        return "Line2,1\nLine2,2\nLine2,3"

    def test_functional_read_lines_correctly(self, text1, text2):
        source_dp = IterableWrapper([("file1", io.StringIO(text1)), ("file2", io.StringIO(text2))])
        line_reader_dp = source_dp.readlines()
        expected_result = [("file1", line) for line in text1.split("\n")] + [
            ("file2", line) for line in text2.split("\n")
        ]
        assert expected_result == list(line_reader_dp)

    def test_functional_strip_new_lines_for_bytes(self, text1, text2):
        source_dp = IterableWrapper(
            [("file1", io.BytesIO(text1.encode("utf-8"))), ("file2", io.BytesIO(text2.encode("utf-8")))]
        )
        line_reader_dp = source_dp.readlines()
        expected_result_bytes = [("file1", line.encode("utf-8")) for line in text1.split("\n")] + [
            ("file2", line.encode("utf-8")) for line in text2.split("\n")
        ]
        assert expected_result_bytes == list(line_reader_dp)

    def test_functional_do_not_strip_newlines(self, text1, text2):
        source_dp = IterableWrapper([("file1", io.StringIO(text1)), ("file2", io.StringIO(text2))])
        line_reader_dp = source_dp.readlines(strip_newline=False)
        expected_result = [
            ("file1", "Line1\n"),
            ("file1", "Line2"),
            ("file2", "Line2,1\n"),
            ("file2", "Line2,2\n"),
            ("file2", "Line2,3"),
        ]
        assert expected_result == list(line_reader_dp)

    def test_reset(self, text1, text2):
        source_dp = IterableWrapper([("file1", io.StringIO(text1)), ("file2", io.StringIO(text2))])
        line_reader_dp = LineReader(source_dp, strip_newline=False)
        expected_result = [
            ("file1", "Line1\n"),
            ("file1", "Line2"),
            ("file2", "Line2,1\n"),
            ("file2", "Line2,2\n"),
            ("file2", "Line2,3"),
        ]

        n_elements_before_reset = 2
        res_before_reset, res_after_reset = reset_after_n_next_calls(line_reader_dp, n_elements_before_reset)
        assert expected_result[:n_elements_before_reset] == res_before_reset
        assert expected_result == res_after_reset

    def test_len(self, text1, text2):
        source_dp = IterableWrapper([("file1", io.StringIO(text1)), ("file2", io.StringIO(text2))])
        line_reader_dp = LineReader(source_dp, strip_newline=False)

        with pytest.raises(TypeError, match="has no len"):
            len(line_reader_dp)

This is a lot more readable, since we now actually have 5 separate test cases that can fail individually. Plus, while writing this I found that the reset and __len__ parts of the original test depended on the "do not strip new lines" part: the reset part reuses its expected_result and the __len__ part reuses a line_reader_dp defined earlier, rather than defining them themselves.

pmeier · 2022-01-20T10:10:17Z

Or even more readable:

class TestLineReader:
    @pytest.fixture
    def files_with_text(self):
        return [
            ("file1", "Line1\nLine2"),
            ("file2", "Line2,1\nLine2,2\nLine2,3"),
        ]

    def make_str_dp(self, files_with_text):
        return IterableWrapper([(file, io.StringIO(text)) for file, text in files_with_text])

    def make_bytes_dp(self, files_with_text):
        return IterableWrapper([(file, io.BytesIO(text.encode("utf-8"))) for file, text in files_with_text])

    def test_functional_read_lines_correctly(self, files_with_text):
        line_reader_dp = self.make_str_dp(files_with_text).readlines()

        expected = []
        for file, text in files_with_text:
            expected.extend((file, line) for line in text.splitlines())

        assert expected == list(line_reader_dp)

    def test_functional_strip_new_lines_for_bytes(self, files_with_text):
        line_reader_dp = self.make_bytes_dp(files_with_text).readlines()

        expected = []
        for file, text in files_with_text:
            expected.extend((file, line.encode("utf-8")) for line in text.splitlines())

        assert expected == list(line_reader_dp)

    def test_functional_do_not_strip_newlines(self, files_with_text):
        line_reader_dp = self.make_str_dp(files_with_text).readlines(strip_newline=False)

        expected = []
        for file, text in files_with_text:
            expected.extend((file, line) for line in text.splitlines(keepends=True))

        assert expected == list(line_reader_dp)

    def test_reset(self, files_with_text):
        line_reader_dp = LineReader(self.make_str_dp(files_with_text))

        expected = []
        for file, text in files_with_text:
            expected.extend((file, line) for line in text.splitlines())

        n_elements_before_reset = 2
        res_before_reset, res_after_reset = reset_after_n_next_calls(line_reader_dp, n_elements_before_reset)

        assert expected[:n_elements_before_reset] == res_before_reset
        assert expected == res_after_reset

    def test_len(self, files_with_text):
        line_reader_dp = LineReader(self.make_str_dp(files_with_text))

        with pytest.raises(TypeError, match="has no len"):
            len(line_reader_dp)

ejguan · 2022-01-20T14:35:55Z

I like this idea!
cc: @NivekT Do you want to incorporate this into your PR pytorch/pytorch#70215?

pmeier · 2022-01-20T14:41:09Z

Ah, that might be an issue. In PyTorch core you cannot rely on pytest, so if you want to have this there, you need to adapt what I proposed a little:

With unittest, each test case needs to inherit from unittest.TestCase or one of its subclasses.
@pytest.fixture's are not available. A workaround might be to store the files_with_text in a class constant and access it from there.
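
A minimal sketch of that adaptation, assuming plain unittest. To keep it runnable without torchdata, read_lines is a hypothetical stand-in for the LineReader DataPipe (it is not the library's implementation), and FILES_WITH_TEXT and make_str_source are illustrative names:

```python
import io
import unittest


def read_lines(source, strip_newline=True):
    """Hypothetical stand-in for LineReader; NOT torchdata's implementation."""
    for name, stream in source:
        for line in stream:
            if strip_newline:
                # Strip with the same type as the line (str or bytes).
                line = line.rstrip("\n" if isinstance(line, str) else b"\n")
            yield name, line


class TestLineReader(unittest.TestCase):
    # Class constant that takes the place of the files_with_text fixture.
    FILES_WITH_TEXT = [
        ("file1", "Line1\nLine2"),
        ("file2", "Line2,1\nLine2,2\nLine2,3"),
    ]

    def make_str_source(self):
        # Fresh streams on every call, so tests do not share read positions.
        return [(file, io.StringIO(text)) for file, text in self.FILES_WITH_TEXT]

    def test_functional_read_lines_correctly(self):
        expected = []
        for file, text in self.FILES_WITH_TEXT:
            expected.extend((file, line) for line in text.splitlines())

        self.assertEqual(expected, list(read_lines(self.make_str_source())))

    def test_functional_do_not_strip_newlines(self):
        expected = []
        for file, text in self.FILES_WITH_TEXT:
            expected.extend((file, line) for line in text.splitlines(keepends=True))

        result = list(read_lines(self.make_str_source(), strip_newline=False))
        self.assertEqual(expected, result)
```

In the real test module the stand-in would be replaced by the actual DataPipe calls from the snippets above.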

ejguan · 2022-01-20T14:44:31Z

@pytest.fixture's are not available. A workaround might be to store the files_with_text in a class constant and access it from there.

I believe we can use setUpClass for this case.
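
Roughly like this, as a sketch only. The class and test names are made up for illustration, and the DataPipe itself is left out so the snippet runs with the standard library alone. One thing to watch: setUpClass runs once per class, so it should hold only the immutable text, while the stateful streams are rebuilt in setUp before every test.

```python
import io
import unittest


class TestLineReaderSetup(unittest.TestCase):  # illustrative name
    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class: keep only immutable data here.
        cls.files_with_text = [
            ("file1", "Line1\nLine2"),
            ("file2", "Line2,1\nLine2,2\nLine2,3"),
        ]

    def setUp(self):
        # Runs before every test: streams are stateful, so rebuild them.
        self.source = [(file, io.StringIO(text)) for file, text in self.files_with_text]

    def test_lines_match_text(self):
        result = [
            (file, line)
            for file, stream in self.source
            for line in stream.read().splitlines()
        ]
        expected = [
            ("file1", "Line1"),
            ("file1", "Line2"),
            ("file2", "Line2,1"),
            ("file2", "Line2,2"),
            ("file2", "Line2,3"),
        ]
        self.assertEqual(expected, result)

    def test_streams_start_at_beginning(self):
        # Would fail if the streams were shared and already consumed elsewhere.
        for _, stream in self.source:
            self.assertEqual(0, stream.tell())
```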

NivekT · 2022-01-20T15:38:54Z

Thanks for the suggestion! I think this is cleaner than what we have. It will take quite a bit of manual refactoring of each DataPipe to get there.

I am wondering if we can do something even better: a standard template for testing a DataPipe with less manual code writing (maybe just specifying the inputs), similar to what OpInfo does in PyTorch core.
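
One possible shape for such a template, purely as a sketch: a table of cases that a single unittest method loops over with subTest. PipeCase, CASES, and read_lines are hypothetical names invented here; read_lines is a stand-in for the real DataPipe, and none of this mirrors how OpInfo is actually implemented.

```python
import io
import unittest
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


def read_lines(source, strip_newline=True):
    """Hypothetical stand-in for LineReader; NOT torchdata's implementation."""
    for name, stream in source:
        for line in stream:
            if strip_newline:
                line = line.rstrip("\n" if isinstance(line, str) else b"\n")
            yield name, line


@dataclass
class PipeCase:
    """Hypothetical spec: everything needed to check one DataPipe behavior."""
    name: str
    pipe_fn: Callable[..., Any]
    make_input: Callable[[], Any]  # factory, so every run gets fresh streams
    expected: List[Any]
    kwargs: Dict[str, Any] = field(default_factory=dict)


FILES = [("file1", "Line1\nLine2"), ("file2", "Line2,1\nLine2,2\nLine2,3")]


def str_source():
    return [(f, io.StringIO(t)) for f, t in FILES]


def bytes_source():
    return [(f, io.BytesIO(t.encode("utf-8"))) for f, t in FILES]


CASES = [
    PipeCase("strip_str", read_lines, str_source,
             [(f, line) for f, t in FILES for line in t.splitlines()]),
    PipeCase("strip_bytes", read_lines, bytes_source,
             [(f, line.encode("utf-8")) for f, t in FILES for line in t.splitlines()]),
    PipeCase("keep_newlines", read_lines, str_source,
             [(f, line) for f, t in FILES for line in t.splitlines(keepends=True)],
             kwargs={"strip_newline": False}),
]


class TestPipeCases(unittest.TestCase):
    def test_functional(self):
        for case in CASES:
            # subTest reports each case separately and keeps going on failure.
            with self.subTest(case=case.name):
                result = list(case.pipe_fn(case.make_input(), **case.kwargs))
                self.assertEqual(case.expected, result)
```

Adding a new behavior to check then means adding one entry to CASES instead of writing another test method.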

erip · 2022-01-21T19:25:44Z

FWIW, we've started something similar in torchtext. See here if you're interested.

NivekT added the Better Engineering label Apr 11, 2023
