Skip to content

Fix ResourceWarnings during unittest #5264

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
May 21, 2020
Merged

Fix ResourceWarnings during unittest #5264

merged 16 commits into from
May 21, 2020

Conversation

lfiedler
Copy link
Contributor

@lfiedler lfiedler commented Apr 6, 2020

This PR fixes the issue where files were not closed in a timely manner, after some objects were writing to them during tests using unittest #5230

Description

I ran through possible sources for the ResourceWarnings being raised. Classes directly pointed out by the warnings were Language, Tagger, Vectors. There the methods to_disk used write functions that did not close properly after write when used within tests (unittest and pytest as it turned out). Tests were implemented by catching the warnings and checking whether they still occur.

Having had a closer look, the classes Pipes and EntityLinker seemed to have implemented similar versions of to_disk, so I took care of them as well.

For each case I wrote unit tests (unittest, pytest) and fixed the affected code in a minimal possible way, so that existing unit tests still run green.

Side notes:

  • In Vectors the numpy method save is used to persist numpy arrays to disk. Reading the source code (numpy versions 1.18 and 1.15) this should close the file descriptor. However, after wrapping the affected line into a context manager, the ResourceWarnings vor Vectors disappeared. I left a comment about this at the corresponding line I changed.
  • As mention in the discussion of Warnings when saving a spacy model during unittest #5230 ResourceWarnings are swallowed by pytest. They can be made visible in pytest as again simply by catching them using warnings.

Tests were run using both, pytest and unittest.

Looking forward for your feedback. If there is anything that should be improved in this PR I'd be happy to do so.

Thanks for your great work!

Types of change

Bug fix

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added feat / pipeline Feature: Processing pipeline and components feat / serialize Feature: Serialization, saving and loading labels Apr 6, 2020
@lfiedler
Copy link
Contributor Author

lfiedler commented Apr 7, 2020

Ok, the pipelines for python 3.5 both failed but from python 3.6 onwards they succeeded.

I think this is what happened: https://docs.python.org/3/library/os.path.html#os.path.exists

Changed in version 3.6: Accepts a path-like object.

First of all the logs produced by the broken Azure Pipelines:

/opt/hostedtoolcache/Python/3.5.9/x64/lib/python3.5/site-packages/spacy/tests/regression/test_issue5230.py:94: in write_obj_and_catch_warnings
    obj.to_disk(d)
pipes.pyx:1399: in spacy.pipeline.pipes.EntityLinker.to_disk
    ???
/opt/hostedtoolcache/Python/3.5.9/x64/lib/python3.5/site-packages/spacy/util.py:645: in to_disk
    writer(path / key)
pipes.pyx:1395: in spacy.pipeline.pipes.EntityLinker.to_disk.lambda34
    ???
kb.pyx:318: in spacy.kb.KnowledgeBase.dump
    ???
kb.pyx:449: in spacy.kb.Writer.__init__
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path = PosixPath('/tmp/tmpiu_j7xu1/kb')

    def exists(path):
        """Test whether a path exists.  Returns False for broken symbolic links"""
        try:
>           os.stat(path)
E           TypeError: argument should be string, bytes or integer, not PosixPath

/opt/hostedtoolcache/Python/3.5.9/x64/lib/python3.5/genericpath.py:19: TypeError

For Windows it is almost the same message (WindowsPath instead of PosixPath).
The error was thrown during running my unit tests (both, pytest and unittest).

When looking at class EntityLinker this error seems to be raised when trying to dump its KnowledgeBase object (which happens in the line just before my edits). From the same line one can see that a Path object is passed to KnowledgeBase.dump which ends up in the __init__ of some Writer object. This reads:

cdef class Writer:
    def __init__(self, object loc):
        if path.exists(loc):
            assert not path.isdir(loc), "%s is directory." % loc
        if isinstance(loc, Path):
            loc = bytes(loc)
        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
        self._fp = fopen(<char*>bytes_loc, 'wb')
        if not self._fp:
            raise IOError(Errors.E146.format(path=loc))
        fseek(self._fp, 0, 0)

    def close(self):
    ...

I guess, swapping the first two if-statements could fix this.
Please confirm.

Note, that I did not try compiling spacy in my local copy for python 3.5 (it took me too long to install another VS). I.e. I did not directly check, if I could reproduce this without the changes I did. So I could as well be wrong about this.

@svlandeg
Copy link
Member

svlandeg commented Apr 8, 2020

Sure, it's not always possible to test on any platform! That's why we have the CI :-)

I think swapping the two if statements would make sense - feel free to try. Commiting to your original branch will trigger the CI to run the tests again.

@svlandeg svlandeg changed the title Fix Issue 5230 Fix ResourceWarnings during unittest Apr 10, 2020
@svlandeg svlandeg added the ⚠️ wip Work in progress label Apr 10, 2020
@lfiedler
Copy link
Contributor Author

Ah ok, I see :)

@svlandeg
Copy link
Member

I've added the "Work in progress" label, so feel free to "play around" ;-)
You can ping me here once the tests succeed and you feel like you're done - then we can review :-)

@lfiedler
Copy link
Contributor Author

I think the changes can be reviewed now. These are the things I changed in addition those mentioned above:

  • both kb.Writer.__init__ and kb.Reader.__init__ did instance checks of parameter loc after using them as input in os.path. As it happens, loc can be an instance of Path and this throws an exception in Python 3.5 (and even segfaults pytest). Fixed this by pulling the instance checks upfront in the according methods.
  • added unit test for instantiating kb.Writer
  • added unit test for loading and dumping an instance of kb.KnowledgeBase

Unfortunately I was not able to provide a unit test for this for kb.Reader that works under Windows. For some reason a file that was written to could not be released properly in order to open it with the Reader instance within the same scope provided by the context manager. I.e. the following did not work under Windows using Python 3.7:

with make_tempdir as d:
    path = d / "test" 
    with path.open("w") as f:
        f.write("test")
    reader = Reader(path)

But maybe I was using it wrong. Instead I provided a unit test using a Writer instance (KnowledgeBase), hoping that this is sufficient.

@svlandeg svlandeg removed the ⚠️ wip Work in progress label Apr 12, 2020
@svlandeg svlandeg added the bug Bugs and behaviour differing from documentation label Apr 15, 2020
@svlandeg svlandeg requested a review from honnibal May 1, 2020 11:26
@svlandeg svlandeg linked an issue May 19, 2020 that may be closed by this pull request
@honnibal honnibal merged commit 93c4d13 into explosion:master May 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and behaviour differing from documentation feat / pipeline Feature: Processing pipeline and components feat / serialize Feature: Serialization, saving and loading
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Warnings when saving a spacy model during unittest
3 participants