Skip to content

[SHARE-739][Improvement] Check quality of OAI sources #660

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

laurenbarker
Copy link
Contributor

@laurenbarker laurenbarker commented May 17, 2017

I had to run ./manage.py loadsources --overwrite to update the existing source configs

Purpose

Check quality of OAI sources and make sure they all provide links back to the source.

Changes

  • Add comment too source config if DSpace or Digital Commons/Bepress
  • Update base urls
  • Add earliest_date values
  • Add task that stores stats for sources
  • Display on admin

screen shot 2017-06-13 at 2 07 14 pm

@laurenbarker laurenbarker force-pushed the feature/oai-links branch 2 times, most recently from ba7a1ef to 134f789 Compare May 18, 2017 13:34
Copy link
Contributor

@aaxelb aaxelb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 🍚

Since earliest_date is used for the default fullharvest start date, I think it might be better if the "incorrect" values were commented out or replaced with more correct dates.

configs:
- base_url: http://oaktrust.library.tamu.edu/dspace-oai/request
disabled: false
earliest_date: null
earliest_date: 2004-11-10 20:56:05 # their earliestDatestamp is most recent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol



@pytest.mark.django_db
class TestSources:
Copy link
Contributor

@aaxelb aaxelb May 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these tests be disabled by default? I think it's great we have them, but I suspect they take a while and it doesn't seem necessary to run them every time anyone makes a PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking that too. Especially since they could fail and have nothing to do with the PR. They should be run fairly frequently though?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be a nightly task that we run in prod?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, that would be the best

@laurenbarker
Copy link
Contributor Author

laurenbarker commented May 18, 2017

I did include the ones that looked they were displaying the published date but could definitely comment those out. The published date will always be before any records though. Thoughts?

Latest

edu.oaktrust.mods: 2017-05-16T09:19:07

Error?

edu.scholarsarchiveosu.mods: 0011-01-01T00:00:00Z
edu.uwashington.mods: 2083-03-01T08:00:00Z

Published

gov.nodc: 1996-10-09
org.philpapers: 1990-01-01T00:00:00Z
org.ttu.mods: 1989-05-01T05:00:00Z
es.csic.mods: 1989-11-28T07:40:02Z
edu.umich.mods: 1983-01-01T05:00:00Z
edu.udc.mods: 1981-09-01T13:37:51Z
be.ghent.mods: 1970-01-01T00:00:01Z
edu.citeseerx: 1970-01-01
br.pcurio: 1970-01-01
ch.cern: 1969-12-31T23:00:00Z
edu.vtech.mods: 1900-02-02T05:00:00Z
edu.icpsr: 01-01-1900
pt.rcaap: 1900-01-01T00:00:00Z
org.ucescholarship: 1885-05-01
com.nature: 1869-11-04

@aaxelb
Copy link
Contributor

aaxelb commented May 18, 2017

If there's an easy way to get something closer to the actual earliest datestamp for the "published" ones, I feel like that'd be nicer. If it'd require wading through their OAI feed until you find it, though, not worth it.

I think those first three should just have null (edit: just saw you already did this for the two clearly broken ones), unless you already found a reasonable value (is the 2004 date for oaktrust right?)

@laurenbarker
Copy link
Contributor Author

  • There isn't an easy way to get the earliestDatestamp that I know of
  • The error ones aren't included
  • 2004 is oaktrust

I'll remove the published dates. We are going to email all of these people and I wanted the test to alert us that they fixed their earliestDatestamp. They will most likely respond to the email though.

@laurenbarker laurenbarker force-pushed the feature/oai-links branch 2 times, most recently from 5522fe0 to d8e871d Compare May 19, 2017 15:03
@laurenbarker laurenbarker force-pushed the feature/oai-links branch 2 times, most recently from 89bb1f5 to d4bfe2d Compare June 1, 2017 17:22
Copy link
Member

@chrisseto chrisseto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could a tiny admin page be added for this as well?

share/tasks.py Outdated
}


def get_field_from_identify(response, field):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than cutting these up into various methods, consider consolidating it into a class.

The OAI one can then just subclass the existing one and provide the same interface.

And then can you have a single task that just runs all of them.

share/tasks.py Outdated
response_elapsed_time=response_elapsed_time,
response_exception=response_exception,
)
source_stat.save()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't have to save a model that you've made with create

share/tasks.py Outdated


@celery.task(bind=True)
def get_source_stats(self, *args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a future improvement we could, optionally, tie this into the harvest task.

return '{}: {}'.format(self.config.source.long_title, self.config.label)


class SourceStatOAI(SourceStat):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checking, you realize this is using Django's multi-table inheritance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:( Yeah...it doesn't make any sense to do it that way. Didn't really think that through...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinging @cwisecarver for his opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to codify the conversation that @chrisseto and I had. Multi-table inheritance is basically the root of all evil (in most situations) because it forces an implicit join. I'd say if you're going to use these extra fields more often then not I'd put them on the original model and make them nullable. If you're not going to use them very often or they're going to be duplicated often I'd put them in a separate model and FK from the original model to the new model so they can be re-used. If you need them to be independent the FK should point the other way, which is basically the same thing as multi-table inheritance except that it's not going to do the join implicitly.

logger = logging.getLogger(__name__)


class SourceStatus():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the ()s here.

except Exception as e:
logger.warning('Exception received from source: %s', e)
return (None, e)
return (r, None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there something in particular I should be looking at?

share/tasks.py Outdated

@celery.shared_task(bind=True)
def get_source_stats(self, oai, config_id):
if oai:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to just load the config here and just check if it's oai?

share/tasks.py Outdated
get_source_stats.apply_async((True, config['id']))

non_oai_sourceconfigs = SourceConfig.objects.filter(
disabled=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source__is_delete=False?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying we should get stats on disabled source configs?

return (r, None)

def get_source_stats(self, config_id):
sourceconfig = SourceConfig.objects.get(pk=config_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the entire class is built around the source config, it might be a bit nicer to just pass it into __init__ and then just reference it as self.source_config or whatever name you prefer.

}

def get_field_from_identify(self, response, field):
parsed = etree.fromstring(response.content, parser=etree.XMLParser(recover=True))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if there's a great way to do this, might just be worth adding a todo. It would be interesting to see what sources provide invalid XML. IE fail without recover=True

def get_source_stats(self, config_id):

# Known incorrect baseUrl:
incorrect_base_urls = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a class level variable.

}

# Known incorrect earliestDatestamp (all emailed):
incorrect_earliest_datestamp = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a class level variable.

base_url_source = models.TextField(blank=True, null=True)
base_urls_match = models.BooleanField(default=False)

objects = NaturalKeyManager('config.label')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @aaxelb, I don't think this is a valid use or use case of NaturalKeyManager. I think the value needs to be unique?

Copy link
Contributor

@aaxelb aaxelb Jun 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, natural keys should be unique. If config were a one-to-one field, this would work (except it should be 'config__label'). But it looks like there could be many SourceStats per SourceConfig?

__all__ = ('SourceStat',)


class SourceStat(models.Model):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a grade or ok field, so we can filter for sources that are having issues?

Update base urls

Add tests for sources

Add source stat model

Save source stat information

Update sources who have corrected their earliestDatestamp

Catch and log all source exceptions

Add admin page for source stats

Use one model for source stats

Don't save the model instance twice

Refactor task logic

Add grade field
@chrisseto chrisseto merged commit 92192d2 into CenterForOpenScience:develop Jun 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants