[SHARE-739][Improvement] Check quality of OAI sources #660

laurenbarker · 2017-05-17T21:22:00Z

I had to run ./manage.py loadsources --overwrite to update the existing source configs

Purpose

Check quality of OAI sources and make sure they all provide links back to the source.

Changes

Add comment to source config if DSpace or Digital Commons/Bepress
Update base urls
Add earliest_date values
Add task that stores stats for sources
Display on admin

aaxelb

Nice! 🍚

Since earliest_date is used for the default fullharvest start date, I think it might be better if the "incorrect" values were commented out or replaced with more correct dates.

aaxelb · 2017-05-18T13:39:18Z

share/sources/edu.oaktrust/source.yaml

 configs:
 - base_url: http://oaktrust.library.tamu.edu/dspace-oai/request
  disabled: false
-  earliest_date: null
+  earliest_date: 2004-11-10 20:56:05  # their earliestDatestamp is most recent


aaxelb · 2017-05-18T13:41:05Z

tests/share/test_sources.py

+
+
+@pytest.mark.django_db
+class TestSources:


Should these tests be disabled by default? I think it's great we have them, but I suspect they take a while and it doesn't seem necessary to run them every time anyone makes a PR.

Yeah I was thinking that too. Especially since they could fail and have nothing to do with the PR. They should be run fairly frequently though?

It could be a nightly task that we run in prod?

Good idea, that would be the best

laurenbarker · 2017-05-18T13:51:25Z

I did include the ones that looked like they were displaying the published date, but I could definitely comment those out. The published date will always be before any records though. Thoughts?

Latest

edu.oaktrust.mods: 2017-05-16T09:19:07

Error?

edu.scholarsarchiveosu.mods: 0011-01-01T00:00:00Z
edu.uwashington.mods: 2083-03-01T08:00:00Z

Published

gov.nodc: 1996-10-09
org.philpapers: 1990-01-01T00:00:00Z
org.ttu.mods: 1989-05-01T05:00:00Z
es.csic.mods: 1989-11-28T07:40:02Z
edu.umich.mods: 1983-01-01T05:00:00Z
edu.udc.mods: 1981-09-01T13:37:51Z
be.ghent.mods: 1970-01-01T00:00:01Z
edu.citeseerx: 1970-01-01
br.pcurio: 1970-01-01
ch.cern: 1969-12-31T23:00:00Z
edu.vtech.mods: 1900-02-02T05:00:00Z
edu.icpsr: 01-01-1900
pt.rcaap: 1900-01-01T00:00:00Z
org.ucescholarship: 1885-05-01
com.nature: 1869-11-04

aaxelb · 2017-05-18T14:00:42Z

If there's an easy way to get something closer to the actual earliest datestamp for the "published" ones, I feel like that'd be nicer. If it'd require wading through their OAI feed until you find it, though, not worth it.

I think those first three should just have null (edit: just saw you already did this for the two clearly broken ones), unless you already found a reasonable value (is the 2004 date for oaktrust right?)

laurenbarker · 2017-05-18T14:08:40Z

There isn't an easy way to get the earliestDatestamp that I know of
The error ones aren't included
2004 is oaktrust

I'll remove the published dates. We are going to email all of these people and I wanted the test to alert us that they fixed their earliestDatestamp. They will most likely respond to the email though.
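The alerting idea above could be sketched roughly as follows. This is only an illustration, not the test from the PR: the dictionary holds three of the values listed earlier in this thread, and the helper name `datestamp_was_fixed` is hypothetical.

```python
# Known-incorrect earliestDatestamp values, copied from the list in this thread.
KNOWN_INCORRECT_EARLIEST_DATESTAMP = {
    'edu.oaktrust.mods': '2017-05-16T09:19:07',
    'edu.scholarsarchiveosu.mods': '0011-01-01T00:00:00Z',
    'edu.uwashington.mods': '2083-03-01T08:00:00Z',
}


def datestamp_was_fixed(label, reported):
    """Return True when a source no longer reports its known-bad value.

    Hypothetical helper: `label` is a source config label and `reported`
    is the earliestDatestamp string from the source's Identify response.
    """
    known_bad = KNOWN_INCORRECT_EARLIEST_DATESTAMP.get(label)
    if known_bad is None:
        # Not on the list, so there is nothing to alert about.
        return False
    return reported != known_bad
```

A check like this would start returning True for a source as soon as its Identify response changes, which is the signal that the source config can be updated.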

chrisseto

Could a tiny admin page be added for this as well?

chrisseto · 2017-06-05T20:51:13Z

share/tasks.py

+}
+
+
+def get_field_from_identify(response, field):


Rather than cutting these up into various methods, consider consolidating it into a class.

The OAI one can then just subclass the existing one and provide the same interface.

And then you can have a single task that just runs all of them.

chrisseto · 2017-06-05T20:52:07Z

share/tasks.py

+        response_elapsed_time=response_elapsed_time,
+        response_exception=response_exception,
+    )
+    source_stat.save()


You don't have to save a model that you've made with create

chrisseto · 2017-06-05T20:54:23Z

share/tasks.py

+
+
+@celery.task(bind=True)
+def get_source_stats(self, *args, **kwargs):


As a future improvement we could, optionally, tie this into the harvest task.

chrisseto · 2017-06-05T20:58:08Z

share/models/sources.py

+        return '{}: {}'.format(self.config.source.long_title, self.config.label)
+
+
+class SourceStatOAI(SourceStat):


Just double checking, you realize this is using Django's multi-table inheritance?

:( Yeah...it doesn't make any sense to do it that way. Didn't really think that through...

Pinging @cwisecarver for his opinion.

Just to codify the conversation that @chrisseto and I had. Multi-table inheritance is basically the root of all evil (in most situations) because it forces an implicit join. I'd say if you're going to use these extra fields more often than not, I'd put them on the original model and make them nullable. If you're not going to use them very often, or they're going to be duplicated often, I'd put them in a separate model and FK from the original model to the new model so they can be re-used. If you need them to be independent, the FK should point the other way, which is basically the same thing as multi-table inheritance except that it's not going to do the join implicitly.
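To make the "implicit join" concrete, here is a minimal sketch of the table shape that multi-table inheritance produces, written as plain SQL through the standard-library `sqlite3` module rather than as Django code. The table and column names are simplified assumptions (Django would prefix the app label, and `earliest_datestamp` is a hypothetical column); only the parent/child layout is the point.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE sourcestat (
        id INTEGER PRIMARY KEY,
        response_elapsed_time REAL
    );
    -- Child table: its primary key is also a foreign key to the parent row.
    CREATE TABLE sourcestatoai (
        sourcestat_ptr_id INTEGER PRIMARY KEY REFERENCES sourcestat (id),
        earliest_datestamp TEXT
    );
    INSERT INTO sourcestat (id, response_elapsed_time) VALUES (1, 0.5);
    INSERT INTO sourcestatoai (sourcestat_ptr_id, earliest_datestamp)
        VALUES (1, '2004-11-10T20:56:05Z');
""")

# Loading one "child" object needs columns from both tables, so every
# query for the child model has to join against the parent table.
row = conn.execute("""
    SELECT s.id, s.response_elapsed_time, o.earliest_datestamp
    FROM sourcestatoai AS o
    JOIN sourcestat AS s ON s.id = o.sourcestat_ptr_id
""").fetchone()
```

Putting the extra fields on the original model as nullable columns avoids the second table, and therefore the join, entirely.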

chrisseto · 2017-06-09T13:22:53Z

share/util/source_stat.py

+logger = logging.getLogger(__name__)
+
+
+class SourceStatus():


You don't need the ()s here.

chrisseto · 2017-06-09T13:23:40Z

share/util/source_stat.py

+        except Exception as e:
+            logger.warning('Exception received from source: %s', e)
+            return (None, e)
+        return (r, None)


https://blog.golang.org/error-handling-and-go

Is there something in particular I should be looking at?

chrisseto · 2017-06-09T13:32:51Z

share/tasks.py

+
+@celery.shared_task(bind=True)
+def get_source_stats(self, oai, config_id):
+    if oai:


Would it make more sense to just load the config here and just check if it's oai?

chrisseto · 2017-06-09T13:33:53Z

share/tasks.py

+        get_source_stats.apply_async((True, config['id']))
+
+    non_oai_sourceconfigs = SourceConfig.objects.filter(
+        disabled=False,


source__is_delete=False?

Are you saying we should get stats on disabled source configs?

chrisseto · 2017-06-09T13:35:38Z

share/util/source_stat.py

+        return (r, None)
+
+    def get_source_stats(self, config_id):
+        sourceconfig = SourceConfig.objects.get(pk=config_id)


As the entire class is built around the source config, it might be a bit nicer to just pass it into __init__ and then just reference it as self.source_config or whatever name you prefer.
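A rough sketch of that shape, assuming nothing about the real class beyond its name: the method body is a placeholder, and `SimpleNamespace` stands in for a `SourceConfig` instance so the example runs on its own.

```python
from types import SimpleNamespace


class SourceStatus:
    """Sketch of a stats collector built around one source config."""

    def __init__(self, source_config):
        # Keep the config on the instance so every method can use it.
        self.source_config = source_config

    def get_source_stats(self):
        # Placeholder body: the real method would request the source and
        # record the response; here we only show access to the config.
        return {'label': self.source_config.label}


# Stand-in for a SourceConfig row; only the `label` attribute is assumed.
config = SimpleNamespace(label='edu.oaktrust.mods')
stats = SourceStatus(config).get_source_stats()
```

With this layout the caller decides how the config is loaded, and the methods no longer need a `config_id` argument.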

chrisseto · 2017-06-09T13:39:22Z

share/util/source_stat.py

+    }
+
+    def get_field_from_identify(self, response, field):
+        parsed = etree.fromstring(response.content, parser=etree.XMLParser(recover=True))


Not sure if there's a great way to do this; it might just be worth adding a todo. It would be interesting to see which sources provide invalid XML, i.e. which ones fail to parse without recover=True.
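One possible way to collect that information is to attempt a strict parse first and log any failure before falling back to the recovering parser. The sketch below uses the standard-library `xml.etree.ElementTree` for the strict attempt instead of lxml, and the function name `is_well_formed` is hypothetical.

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger(__name__)


def is_well_formed(content, label):
    """Return True if `content` parses as XML without any recovery."""
    try:
        ET.fromstring(content)
    except ET.ParseError as e:
        # Record which source sent XML that only a recovering parser accepts.
        logger.warning('Invalid XML from %s: %s', label, e)
        return False
    return True
```

The result could be stored alongside the other stats so that sources with malformed responses can be filtered for later.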

chrisseto · 2017-06-09T13:39:36Z

share/util/source_stat.py

+    def get_source_stats(self, config_id):
+
+        # Known incorrect baseUrl:
+        incorrect_base_urls = {


This should be a class level variable.

chrisseto · 2017-06-09T13:39:48Z

share/util/source_stat.py

+        }
+
+        # Known incorrect earliestDatestamp (all emailed):
+        incorrect_earliest_datestamp = {


Also a class level variable.
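A small sketch of what moving these to the class level looks like. The entry shown is a placeholder, since the real contents of `incorrect_base_urls` are not visible in this thread, and the method name is hypothetical.

```python
class SourceStatus:
    # Defined once on the class and shared by every instance, instead of
    # being rebuilt on each call to get_source_stats().
    # The entry below is a placeholder, not a value from the PR.
    incorrect_base_urls = {
        'org.example': 'http://example.org/oai',
    }

    def base_url_is_known_incorrect(self, label):
        # Instances read the class attribute through `self` as usual.
        return label in self.incorrect_base_urls


a, b = SourceStatus(), SourceStatus()
shared = a.incorrect_base_urls is b.incorrect_base_urls
```

Both instances see the same dictionary object, so the known-incorrect list lives in one place and is easy to find and update.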

chrisseto · 2017-06-09T13:42:39Z

share/models/sources.py

+    base_url_source = models.TextField(blank=True, null=True)
+    base_urls_match = models.BooleanField(default=False)
+
+    objects = NaturalKeyManager('config.label')


ping @aaxelb, I don't think this is a valid use or use case of NaturalKeyManager. I think the value needs to be unique?

Yeah, natural keys should be unique. If config were a one-to-one field, this would work (except it should be 'config__label'). But it looks like there could be many SourceStats per SourceConfig?

chrisseto · 2017-06-09T13:44:01Z

share/models/sources.py

+__all__ = ('SourceStat',)
+
+
+class SourceStat(models.Model):


Could you add a grade or ok field, so we can filter for sources that are having issues?

Update base urls
Add tests for sources
Add source stat model
Save source stat information
Update sources who have corrected their earliestDatestamp
Catch and log all source exceptions
Add admin page for source stats
Use one model for source stats
Don't save the model instance twice
Refactor task logic
Add grade field

laurenbarker force-pushed the feature/oai-links branch 2 times, most recently from ba7a1ef to 134f789 Compare May 18, 2017 13:34

aaxelb reviewed May 18, 2017


laurenbarker force-pushed the feature/oai-links branch 2 times, most recently from 5522fe0 to d8e871d Compare May 19, 2017 15:03

laurenbarker force-pushed the feature/oai-links branch 2 times, most recently from 89bb1f5 to d4bfe2d Compare June 1, 2017 17:22

chrisseto suggested changes Jun 5, 2017


laurenbarker force-pushed the feature/oai-links branch from 391772e to 8a7d76b Compare June 8, 2017 17:38

chrisseto suggested changes Jun 9, 2017


laurenbarker force-pushed the feature/oai-links branch from 0222438 to 7fa61da Compare June 26, 2017 14:57

laurenbarker force-pushed the feature/oai-links branch from 7fa61da to 70ff10e Compare June 26, 2017 15:05

laurenbarker added 2 commits June 26, 2017 11:11

Update sources that have fixed their earliest datestamp

00a3bc4

Resolve migration conflicts

c810b0d

chrisseto merged commit 92192d2 into CenterForOpenScience:develop Jun 26, 2017


