
Deduplication

Joana Maia edited this page Jun 27, 2025 · 4 revisions

🔀 Merge Profiles - Deduplicating profiles

Principles behind deduplication

Profile (people and organizations) deduplication is the process of identifying and merging multiple representations of the same individual or entity within a dataset. In a system like the Community Data Platform, duplicates arise because data is collected from multiple sources and the same entity often appears under inconsistent identities.

The deduplication process is guided by the following core principles:

  • Accuracy over Aggressiveness: Merging profiles should be conservative to avoid incorrect merges. A false positive (merging two different people or organizations) is more damaging than leaving a duplicate unmerged.
  • Traceability: Every merge should be recorded so it can be audited or reversed.
  • Source Priority: Data from more reliable sources (e.g. GitHub, LinkedIn) should take precedence when merging conflicting profile fields.
  • Minimal Data Loss: When merging, no important data should be discarded unless it is known to be redundant or superseded. Instead, data should be merged into one profile.
  • Automation with Oversight: Deduplication can be automated with confidence thresholds, but manual review must always remain available for edge cases.
  • Automation First: Wherever feasible, we favor automated processes over manual intervention to ensure scalability, consistency, and efficiency.

Deduplication goal

The goal of deduplication is to ensure that each person or entity is represented by a single, unified profile that aggregates all related activity and metadata. This enhances:

  • Data Quality: Fewer duplicates, redundancies and inconsistencies.
  • Analytics Accuracy: More reliable metrics and insights.

A well-merged profile serves as the single source of truth for a person's identity and engagement history across platforms.

Deduplication concepts

Primary profile

The primary profile is the one the system keeps when merging duplicates. Its own data takes precedence: it keeps everything it initially had and inherits any data points from the secondary profile that did not already exist on the primary one.
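A minimal sketch of this precedence rule, assuming profiles are flat maps of string fields. The Profile type and the mergeFields function are hypothetical names used for illustration, not the platform's actual API.

```typescript
// Hypothetical profile shape: a flat map of optional string fields.
type Profile = Record<string, string | undefined>;

// Keep every value the primary already has; copy from the secondary only
// the fields that are missing on the primary.
function mergeFields(primary: Profile, secondary: Profile): Profile {
  const merged: Profile = { ...primary };
  for (const [key, value] of Object.entries(secondary)) {
    if (merged[key] === undefined && value !== undefined) {
      merged[key] = value;
    }
  }
  return merged;
}

// Example data is invented for illustration.
const mergedProfile = mergeFields(
  { displayName: "Ada L.", location: undefined },
  { displayName: "Ada Lovelace", location: "London", timezone: "Europe/London" },
);
// displayName stays "Ada L."; location and timezone come from the secondary.
```

The real merge also has to handle identities, activities, and source priority, which this sketch leaves out.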

When the system detects duplicated profiles, it suggests one primary and one secondary profile, and the secondary is merged into the primary. These are the criteria for choosing the primary profile:

  • First priority: The member with more identities becomes the primary.
  • Second priority: If both members have the same number of identities, the member with the higher activity count is marked as the primary.
  • Equal case: If both criteria are equal, the order remains unchanged.
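The three criteria above can be sketched as a small selection function. The MemberSummary shape and the choosePrimary name are assumptions made for this example.

```typescript
// Hypothetical summary of the two values the criteria compare.
interface MemberSummary {
  id: string;
  identityCount: number;
  activityCount: number;
}

// Returns [primary, secondary] following the stated priority order.
function choosePrimary(
  a: MemberSummary,
  b: MemberSummary,
): [MemberSummary, MemberSummary] {
  // First priority: more identities wins.
  if (a.identityCount !== b.identityCount) {
    return a.identityCount > b.identityCount ? [a, b] : [b, a];
  }
  // Second priority: higher activity count wins.
  if (a.activityCount !== b.activityCount) {
    return a.activityCount > b.activityCount ? [a, b] : [b, a];
  }
  // Equal case: keep the incoming order.
  return [a, b];
}

const [primary, secondary] = choosePrimary(
  { id: "m1", identityCount: 2, activityCount: 40 },
  { id: "m2", identityCount: 2, activityCount: 75 },
);
// primary.id is "m2": identities tie, so the activity count decides.
```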

Secondary Profile

Secondary profiles are the ones that get merged into the primary profile. They might have overlapping or partial info (a GitHub handle here, a Twitter profile there), and we pull in anything useful that’s not already in the primary profile.

Once merged, these profiles are hidden or removed, but their data sticks around if it's relevant.

Similarity and Confidence Thresholds

The similarity score ranges from 0 to 1.

The edit distance is calculated using the Levenshtein distance algorithm from the fast-levenshtein library. It measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another. The system finds the smallest edit distance between any verified identity of the primary member and any verified identity of the similar member.
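As an illustration of the smallest-edit-distance step, the sketch below uses a hand-written Levenshtein function so it stays self-contained; the platform itself uses the fast-levenshtein library. The smallestEditDistance helper and the example identity values are hypothetical.

```typescript
// Classic dynamic-programming Levenshtein distance, keeping one row at a time.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Smallest distance between any verified identity value of the primary
// member and any verified identity value of the similar member.
// Returns Infinity when either list is empty.
function smallestEditDistance(primary: string[], similar: string[]): number {
  let best = Infinity;
  for (const p of primary) {
    for (const s of similar) {
      best = Math.min(best, levenshtein(p, s));
    }
  }
  return best;
}

const distance = smallestEditDistance(["jdoe", "john.doe"], ["jdoe1", "johnny"]);
// "jdoe" and "jdoe1" differ by a single insertion, so distance is 1.
```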

Fixed Similarity Scores:

  • 0.2: When there are clashing identities (username identities in same platform with different values) or when the identity length is less than the smallest edit distance and no additional confidence factors are found.
  • 0.95: When a primary member's verified identity matches a similar member's unverified identity.
  • 0.98: When there are verified↔unverified email matches between members.

Calculated Similarity Scores:

  • Bumped score: Starts from either a provided score or the default high confidence score (0.9). For each additional confidence factor found (location, organization, languages, programming languages, timezone), a bump factor of Math.floor((1 - currentScore) / 5) is added. The final score is capped at 1.0.
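The bump loop can be sketched as follows. One caveat: taken literally on a 0-1 scale, Math.floor((1 - currentScore) / 5) evaluates to 0, so the score would never move. The sketch therefore adds a scale parameter, which is purely an assumption of this example: scale = 1 reproduces the formula exactly as written, while a larger scale floors at a finer granularity. The granularity actually used is not stated on this page.

```typescript
const DEFAULT_HIGH_CONFIDENCE = 0.9;

// additionalFactors: how many extra confidence factors matched
// (location, organization, languages, programming languages, timezone).
// scale: hypothetical rounding granularity; 1 means "formula as written".
function bumpScore(
  additionalFactors: number,
  startingScore: number = DEFAULT_HIGH_CONFIDENCE,
  scale: number = 1,
): number {
  let score = startingScore;
  for (let i = 0; i < additionalFactors; i++) {
    const bump = Math.floor(((1 - score) / 5) * scale) / scale;
    score = Math.min(1, score + bump); // cap at 1.0
  }
  return score;
}

const literal = bumpScore(3); // formula as written: the bump is 0, score stays 0.9
const finer = bumpScore(3, DEFAULT_HIGH_CONFIDENCE, 100); // assumed two-decimal flooring
```

Whatever the granularity, each bump is a fraction of the remaining gap to 1, so the score rises more slowly as it approaches the cap and never exceeds it.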

Manual Merges

If a duplicated profile hasn't been identified by the system, users can still merge the profiles manually via the UI.

Merge Suggestions

Since we are collecting millions of profiles, relying only on manual deduplication would not scale. The system therefore has an automatic mechanism that detects duplicates and marks them as merge suggestions. Merge suggestions can then be reviewed manually by a user or handled automatically by an LLM agent. Each suggestion can be accepted or ignored, depending on whether it is correct.

Unmerging process

Unmerging lets users revert a previous merge that turned out to be incorrect or premature.

  • Each merged profile retains a history of which profiles were merged into it.
  • When unmerging, the user has the option to restore all original profiles to their pre-merge state, OR unmerge a single identity and its associated activities into a new profile. The user is always given the opportunity to preview the unmerge operation, to make sure that the resulting profiles are correct.
  • Once the unmerge operation is triggered, most of the static data will be moved synchronously, and the majority of activity updates will occur asynchronously.
  • Unmerging can only be done manually, and is only available to admins to prevent creating incorrect data.
  • All unmerge events are logged for auditing.

Manual deduplication process

Manual deduplication lets users manually merge two or more profiles if the duplicates weren't found as suggestions or if the user wants to merge one of the presented suggestions themselves.

  • Users can select multiple profiles and trigger a manual merge.
    • The system will suggest a primary and secondary profile. However, the user has the ability to switch the primary profile in case the suggestion is not the best in terms of data quality.
  • When merging, the user is given the opportunity to preview the merge operation, to make sure the resulting profile is correct.
  • Once the merge operation is triggered, most of the static data will be moved synchronously, and the majority of activity updates will occur asynchronously. In addition, the system will keep a backup of the original profiles, in case an unmerge operation is needed in the future.
  • All merge events are logged for auditing.

Automatic deduplication / Merge suggestions process

The system uses an automated process to create suggestions for possible duplicated profiles and to automatically merge these profiles based on the confidence score and each profile's data.

  • Merge suggestions are generated every 2 hours for the profiles created since the previous run.
  • Merge suggestions presented to the user have a minimum similarity score of 0.75. Admins can review, accept, or reject merge suggestions.
  • To limit manual effort, an LLM reviews the open suggestions every Monday and automatically merges or rejects them.
    • For members, the LLM only reviews suggestions above 0.8 similarity score.
    • For organizations, the LLM only reviews suggestions above 0.75 similarity score.
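The thresholds above can be captured in a short sketch. The constant and function names are hypothetical, and the sketch assumes that "minimum of 0.75" is inclusive while "above" is a strict comparison; the page does not spell out either boundary.

```typescript
type EntityType = "member" | "organization";

// Minimum similarity for a suggestion to be shown to users (assumed inclusive).
const SUGGESTION_MIN_SIMILARITY = 0.75;

// The LLM only reviews suggestions above these scores (assumed strict).
const LLM_REVIEW_THRESHOLD: Record<EntityType, number> = {
  member: 0.8,
  organization: 0.75,
};

function isShownToUsers(similarity: number): boolean {
  return similarity >= SUGGESTION_MIN_SIMILARITY;
}

function isReviewedByLlm(type: EntityType, similarity: number): boolean {
  return isShownToUsers(similarity) && similarity > LLM_REVIEW_THRESHOLD[type];
}

// A 0.78 suggestion is visible to users in both cases, but under these
// assumptions only the organization one is picked up by the weekly LLM review.
const memberReviewed = isReviewedByLlm("member", 0.78);
const organizationReviewed = isReviewedByLlm("organization", 0.78);
```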

Algorithm to generate people merge suggestions

[Figure: People merge suggestions]

Algorithm to generate organization merge suggestions

[Figure: Organization merge suggestions]

Technical implementation

Audit Logs

Since merge and unmerge operations are risky and update critical data, the system records which user triggered each action and which profile fields were updated.

These records are stored in Postgres, in auditLogs and auditLogAction.

Merge Actions

Whenever there is a merge or unmerge operation, the system stores a backup of both profiles along with the progress of the operation, since some of its steps run asynchronously. This is particularly useful for giving the user better feedback on the operation's status.

This is stored in Postgres, in mergeActions.

Merge Suggestions

The system has a scheduled operation to generate merge suggestions for newly created profiles. To keep track of the last run, the tenants table in Postgres stores memberMergeSuggestionsLastGeneratedAt and organizationMergeSuggestionsLastGeneratedAt.

Each generated suggestion is then stored in two Postgres tables: memberToMergeRaw and memberToMerge for members, and organizationToMergeRaw and organizationToMerge for organizations.

The raw tables store all suggestions regardless of similarity score. The ToMerge tables hold the suggestions that are shown to users, which are the ones that meet the 0.75 similarity score threshold. The LLM reads suggestions only from the ToMerge tables. Whenever two profiles are merged, the suggestion is removed from both the ToMerge and raw tables. A suggestion marked as rejected or ignored is also removed from these tables.

Rejected Merges

The memberNoMerge and organizationNoMerge tables in Postgres store the profile pairs that should not be merged because an existing merge suggestion for them was rejected. Once a pair is in these tables, it will never be proposed as a merge suggestion again.
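A sketch of how rejected pairs could be excluded when generating new suggestions. It assumes a rejection applies regardless of which profile is listed first; the pairKey and filterRejected helpers are hypothetical, and an in-memory Set stands in for the noMerge tables.

```typescript
type CandidatePair = [string, string];

// Order-independent key, so (a, b) and (b, a) are treated as the same pair.
function pairKey(a: string, b: string): string {
  return a < b ? `${a}:${b}` : `${b}:${a}`;
}

// Drop every candidate whose pair was previously rejected.
function filterRejected(
  candidates: CandidatePair[],
  rejectedKeys: Set<string>,
): CandidatePair[] {
  return candidates.filter(([a, b]) => !rejectedKeys.has(pairKey(a, b)));
}

// Invented ids for illustration.
const rejectedKeys = new Set<string>([pairKey("m1", "m2")]);
const remaining = filterRejected(
  [
    ["m2", "m1"], // previously rejected, listed in the opposite order
    ["m1", "m3"],
  ],
  rejectedKeys,
);
// Only the ["m1", "m3"] candidate remains.
```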

LLM

The system currently uses the Claude 3.5 Sonnet model via Amazon Bedrock to decide automatically whether a suggestion should be merged or rejected. For every decision, the system stores the prompt and the result in llmPromptHistory and llmSuggestionVerdicts, respectively.

Components

[Figure: Deduplication components]

People merges

[Figure: People merge]

Organization merges

[Figure: Organization merge]

Merge Suggestions and Automatic Merging

[Figure: Merge suggestions and automatic merging]

Data Schema

[Figure: Merges data schema]
