Observations from a reconciliation process #2083

@susannaanas

Description

Describe the bug
I wrote a memo of problems I ran into during a reconciliation process, along with enhancement ideas. I will break it down into proper issues after finishing the project and after any comments, corrections and removals.

Desktop (please complete the following information):

  • OS: MacOS 10.13.6
  • Browser Version: Version 75.0.3770.100 (Official Build) (64-bit)
  • JRE or JDK Version: Java(TM) SE Runtime Environment (build 1.8.0_111-b14)

OpenRefine (please complete the following information):

  • Version 3.2-SNAPSHOT [8d89a2a]

Datasets
https://drive.google.com/open?id=1_i-Gy2Q584I9V0ktiGdp2ME9qt5vvedk

Additional context

Updating OR in the reconciliation process
When a Wikidata import had to be stopped to make a correction, the OpenRefine database and Wikidata got out of sync. The OpenRefine items that had already been uploaded were not updated with the information that had been added to Wikidata. On another occasion, they were not updated even though the update ran normally. To my knowledge, the new QID should be matched to the OR item; if that is not the intended behavior, updating the reconciliation data would be an enhancement request. The new items are nevertheless created in Wikidata, so a new matching process is needed afterwards.

After this interrupted Wikidata upload I was able to identify the record number where the upload had stopped. I would have liked to select records by record number range in order to correct them.
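The kind of selection I mean can be sketched outside OpenRefine. This is only an illustration under assumptions: the rows are plain dictionaries, the key column is called `id` (a made-up name), and a blank key cell is taken to continue the previous record. It is not how OpenRefine implements records internally.

```python
def group_into_records(rows, key="id"):
    """Group rows into records: a row with a blank key cell continues the previous record."""
    records = []
    for row in rows:
        if row.get(key) or not records:
            records.append([row])      # non-blank key (or very first row) starts a new record
        else:
            records[-1].append(row)    # blank key: the row belongs to the current record
    return records


def select_record_range(rows, first, last, key="id"):
    """Return all rows of the records numbered first..last (0-based, inclusive)."""
    records = group_into_records(rows, key)
    return [row for record in records[first:last + 1] for row in record]


# Hypothetical data: three records, two of them spanning two rows.
rows = [
    {"id": "A", "value": "1"},
    {"id": "",  "value": "2"},
    {"id": "B", "value": "3"},
    {"id": "C", "value": "4"},
    {"id": "",  "value": "5"},
]

selected = select_record_range(rows, 1, 2)
print([row["value"] for row in selected])
```

With a selection like this, everything from the record where the upload stopped onwards could be picked out and corrected in one go.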

Matching by external IDs
It should be possible to match more explicitly on an external identifier. In some cases it is fine to have a fallback to a regular match, but it should be possible to switch that off or control it. As it is, it is difficult to know which matches are certain and which are good guesses. This could also be solved by indicating the type of match in the UI.

The type matching mechanism could be repurposed to serve this need. One could select any property, such as subclass of or an external ID, against which to reconcile the cell value, which is not necessarily a label.
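To make the request concrete, here is a rough sketch of matching on an external identifier with a fallback that can be switched off, and with the type of match reported back. Everything in it is hypothetical: the `reconcile` function, the `allow_fallback` flag, the candidate structure and the placeholder IDs are illustrations only, not the reconciliation API.

```python
def reconcile(cell_value, candidates, id_property, allow_fallback=False):
    """Return (qid, match_type); match_type is 'external_id', 'label_guess' or None."""
    # First pass: exact match on the chosen external identifier property.
    for candidate in candidates:
        if cell_value in candidate.get("external_ids", {}).get(id_property, []):
            return candidate["qid"], "external_id"
    # Optional second pass: fall back to a case-insensitive label comparison.
    if allow_fallback:
        for candidate in candidates:
            if candidate["label"].casefold() == cell_value.casefold():
                return candidate["qid"], "label_guess"
    return None, None


# Placeholder candidates; the QIDs and the property ID are made up.
candidates = [
    {"qid": "Q-one", "label": "Example Museum", "external_ids": {"P-ext": ["12345"]}},
    {"qid": "Q-two", "label": "12345", "external_ids": {}},
]

print(reconcile("12345", candidates, "P-ext"))
print(reconcile("example museum", candidates, "P-ext"))
print(reconcile("example museum", candidates, "P-ext", allow_fallback=True))
```

Returning the match type alongside the result is what would let the UI show which matches are certain and which are good guesses.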

Popups
The popup matching interface is somewhat limited. If the item is not in the first set of suggestions in the flyout, there is no way to retrieve more.

Selecting an option (the item to reconcile with, or any of the options) in the popup makes the interface jump to the first page of the table. The checkboxes do not refresh the page in a similar manner. This slows down work considerably. I know this is a known bug, and I will give a thumbs-up to the issue about fixing it.

The new hover interface forces the user to use the popup for matching and the page to refresh after matching. Perhaps the hovercard could be repositioned so that it does not cover the checkboxes.

The file gets slow over time (I wonder if it gets corrupted). The popups become impossible to use because they do not populate, and reconciling becomes impossible. A workaround could be to import much of the information about the reconciliation options, which is now only shown dynamically in the popups, into the file in the data retrieval phase.

Multiple rows in a record

Key columns with values in multiple rows cause misery and sorrow. It is unclear how records with multiple rows should be handled.
1. The key column needs to be filled down to be able to add the multiple properties to the item (I think). If any of the property values is declared within the schema rather than in the data, it is added for every split row, and so multiple times to one item.
2. When using the “Choose new match” popup for such a filled-down item with multiple rows, it is unclear whether a new value is created multiple times (once for each row in the record), once for the whole record, multiple times for all records containing the same value, or multiple times for all rows containing the same value. The result is not displayed in any way afterwards, so errors cannot be tracked.
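The duplication described in point 1 can be modelled with a small sketch. It reproduces the behavior as I observed it, not OpenRefine's actual schema evaluation; the column names, the property names and the `deduplicate` step are assumptions made for illustration.

```python
def fill_down(rows, key="item"):
    """Copy the last non-blank key value into the blank key cells below it."""
    filled, last = [], None
    for row in rows:
        if row.get(key):
            last = row[key]
        filled.append({**row, key: last})
    return filled


def statements_per_row(rows, data_property, constant_statement):
    """Naive generation: every row yields its data statement plus the schema constant."""
    statements = []
    for row in rows:
        statements.append((row["item"], data_property, row["value"]))
        statements.append((row["item"],) + constant_statement)
    return statements


def deduplicate(statements):
    """Drop repeated statements while keeping the original order."""
    return list(dict.fromkeys(statements))


# Hypothetical record "photo-1" spans two rows; "photo-2" has one row.
rows = [
    {"item": "photo-1", "value": "a"},
    {"item": "",        "value": "b"},
    {"item": "photo-2", "value": "c"},
]

filled = fill_down(rows)
naive = statements_per_row(filled, "depicts", ("instance of", "photograph"))
unique = deduplicate(naive)
print(len(naive), len(unique))
```

In the naive output the constant statement appears once per row of `photo-1`, which is the repeated value I saw; collapsing identical statements per item before upload would avoid it.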

It is unclear how to manage new items as values for properties in the schema: how does one give the new value items the properties they need in order to be used as values in the schema? It seems each should be added as an additional item in the schema, but the order in which data is read and created is not clear. This has resulted in errors, but I need to dig into my archives to find the specifics.

→ It is unclear how to manage multiple items in one import schema.

Miscellaneous

There could be visual interfaces for geolocations. Maybe the distance between the compared geocoordinates could even be shown instead of the mysterious score.
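For the distance, something like the haversine great-circle formula would already be more readable than a score. A minimal sketch, assuming coordinates in decimal degrees and a spherical Earth with a mean radius of 6371.0088 km; the function name is made up and this says nothing about how the current score is computed.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point rounding before taking asin.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


# One degree of latitude along a meridian is roughly 111 km.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))
```

Showing this distance next to each candidate would make it obvious when two coordinates refer to the same place.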

Labels

  • reconciliation: Related to the reconciliation operations and other features
  • reconciliation API design: when changing the reconciliation API is required to improve a workflow
  • wikibase: Related to wikidata/wikibase integration
