-
Notifications
You must be signed in to change notification settings - Fork 66
[SHARE-739][Improvement] Check quality of OAI sources #660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SHARE-739][Improvement] Check quality of OAI sources #660
Conversation
ba7a1ef
to
134f789
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! 🍚
Since earliest_date
is used for the default fullharvest
start date, I think it might be better if the "incorrect" values were commented out or replaced with more correct dates.
configs: | ||
- base_url: http://oaktrust.library.tamu.edu/dspace-oai/request | ||
disabled: false | ||
earliest_date: null | ||
earliest_date: 2004-11-10 20:56:05 # their earliestDatestamp is most recent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol
tests/share/test_sources.py
Outdated
|
||
|
||
@pytest.mark.django_db | ||
class TestSources: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these tests be disabled by default? I think it's great we have them, but I suspect they take a while and it doesn't seem necessary to run them every time anyone makes a PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was thinking that too. Especially since they could fail and have nothing to do with the PR. They should be run fairly frequently though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be a nightly task that we run in prod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, that would be the best
I did include the ones that looked they were displaying the published date but could definitely comment those out. The published date will always be before any records though. Thoughts? Latestedu.oaktrust.mods: 2017-05-16T09:19:07 Error?edu.scholarsarchiveosu.mods: 0011-01-01T00:00:00Z Publishedgov.nodc: 1996-10-09 |
If there's an easy way to get something closer to the actual earliest datestamp for the "published" ones, I feel like that'd be nicer. If it'd require wading through their OAI feed until you find it, though, not worth it. I think those first three should just have |
I'll remove the published dates. We are going to email all of these people and I wanted the test to alert us that they fixed their earliestDatestamp. They will most likely respond to the email though. |
5522fe0
to
d8e871d
Compare
89bb1f5
to
d4bfe2d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could a tiny admin page be added for this as well?
share/tasks.py
Outdated
} | ||
|
||
|
||
def get_field_from_identify(response, field): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than cutting these up into various methods, consider consolidating it into a class.
The OAI one can then just subclass the existing one and provide the same interface.
And then can you have a single task that just runs all of them.
share/tasks.py
Outdated
response_elapsed_time=response_elapsed_time, | ||
response_exception=response_exception, | ||
) | ||
source_stat.save() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't have to save a model that you've made with create
share/tasks.py
Outdated
|
||
|
||
@celery.task(bind=True) | ||
def get_source_stats(self, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a future improvement we could, optionally, tie this into the harvest task.
share/models/sources.py
Outdated
return '{}: {}'.format(self.config.source.long_title, self.config.label) | ||
|
||
|
||
class SourceStatOAI(SourceStat): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just double checking, you realize this is using Django's multi-table inheritance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:( Yeah...it doesn't make any sense to do it that way. Didn't really think that through...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pinging @cwisecarver for his opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to codify the conversation that @chrisseto and I had. Multi-table inheritance is basically the root of all evil (in most situations) because it forces an implicit join. I'd say if you're going to use these extra fields more often then not I'd put them on the original model and make them nullable. If you're not going to use them very often or they're going to be duplicated often I'd put them in a separate model and FK from the original model to the new model so they can be re-used. If you need them to be independent the FK should point the other way, which is basically the same thing as multi-table inheritance except that it's not going to do the join implicitly.
391772e
to
8a7d76b
Compare
share/util/source_stat.py
Outdated
logger = logging.getLogger(__name__) | ||
|
||
|
||
class SourceStatus(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the ()
s here.
except Exception as e: | ||
logger.warning('Exception received from source: %s', e) | ||
return (None, e) | ||
return (r, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there something in particular I should be looking at?
share/tasks.py
Outdated
|
||
@celery.shared_task(bind=True) | ||
def get_source_stats(self, oai, config_id): | ||
if oai: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense to just load the config here and just check if it's oai?
share/tasks.py
Outdated
get_source_stats.apply_async((True, config['id'])) | ||
|
||
non_oai_sourceconfigs = SourceConfig.objects.filter( | ||
disabled=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
source__is_delete=False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying we should get stats on disabled source configs?
share/util/source_stat.py
Outdated
return (r, None) | ||
|
||
def get_source_stats(self, config_id): | ||
sourceconfig = SourceConfig.objects.get(pk=config_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the entire class is built around the source config, it might be a bit nicer to just pass it into __init__
and then just reference it as self.source_config
or whatever name you prefer.
share/util/source_stat.py
Outdated
} | ||
|
||
def get_field_from_identify(self, response, field): | ||
parsed = etree.fromstring(response.content, parser=etree.XMLParser(recover=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if there's a great way to do this, might just be worth adding a todo. It would be interesting to see what sources provide invalid XML. IE fail without recover=True
share/util/source_stat.py
Outdated
def get_source_stats(self, config_id): | ||
|
||
# Known incorrect baseUrl: | ||
incorrect_base_urls = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a class level variable.
share/util/source_stat.py
Outdated
} | ||
|
||
# Known incorrect earliestDatestamp (all emailed): | ||
incorrect_earliest_datestamp = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also a class level variable.
share/models/sources.py
Outdated
base_url_source = models.TextField(blank=True, null=True) | ||
base_urls_match = models.BooleanField(default=False) | ||
|
||
objects = NaturalKeyManager('config.label') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @aaxelb, I don't think this is a valid use or use case of NaturalKeyManager
. I think the value needs to be unique?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, natural keys should be unique. If config
were a one-to-one field, this would work (except it should be 'config__label'
). But it looks like there could be many SourceStats per SourceConfig?
__all__ = ('SourceStat',) | ||
|
||
|
||
class SourceStat(models.Model): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a grade
or ok
field, so we can filter for sources that are having issues?
0222438
to
7fa61da
Compare
Update base urls Add tests for sources Add source stat model Save source stat information Update sources who have corrected their earliestDatestamp Catch and log all source exceptions Add admin page for source stats Use one model for source stats Don't save the model instance twice Refactor task logic Add grade field
7fa61da
to
70ff10e
Compare
I had to run
./manage.py loadsources --overwrite
to update the existing source configsPurpose
Check quality of OAI sources and make sure they all provide links back to the source.
Changes
earliest_date
values