Deduplication
Profile (people and organizations) deduplication is the process of identifying and merging multiple representations of the same individual or entity within a dataset. In a system like Community Data Platform, profile duplication can arise due to multiple data sources and inconsistent identities from the collected entities.
The deduplication process is guided by the following core principles:
- Accuracy over Aggressiveness: Merging profiles should be conservative to avoid incorrect merges. False positives are more damaging than duplicates.
- Traceability: Every merge should be recorded so it can be audited or reversed.
- Source Priority: Data from more reliable sources (e.g. GitHub, LinkedIn) should take precedence when merging conflicting profile fields.
- Minimal Data Loss: When merging, no important data should be discarded unless it is known to be redundant or superseded. Instead, data should be merged into one profile.
- Automation with Oversight: Deduplication can be automated with confidence thresholds, but always supports manual review for edge cases.
- Automation First: Wherever feasible, we favor automated processes over manual intervention to ensure scalability, consistency, and efficiency.
The goal of deduplication is to ensure that each person or entity is represented by a single, unified profile that aggregates all related activity and metadata. This enhances:
- Data Quality: Fewer duplicates, redundancies and inconsistencies.
- Analytics Accuracy: More reliable metrics and insights.
A well-merged profile serves as the single source of truth for a person's identity and engagement history across platforms.
The primary profile is the one the system keeps when merging duplicates. It retains all the data it initially had and inherits every data point linked to the secondary profile that did not already exist on the primary one.
When the system detects duplicated profiles, it suggests one primary and one secondary, where the secondary is merged into the primary one. These are the criteria for choosing the primary profile:
- First priority: The member with more identities comes first.
- Second priority: If both members have the same number of identities, the member with the higher activity count is marked as the primary one.
- Equal case: If both criteria are equal, the order remains unchanged.
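The selection criteria above can be sketched as a small comparison function. This is only an illustration: the `ProfileSummary` shape and the `choosePrimary` name are assumptions made for the example, not identifiers taken from the platform's code.

```typescript
// Hypothetical shape: only the two fields the selection criteria mention.
interface ProfileSummary {
  id: string
  identityCount: number
  activityCount: number
}

// Returns [primary, secondary].
// 1) More identities wins. 2) On a tie, more activities wins.
// 3) If both are equal, the input order is kept.
function choosePrimary(
  a: ProfileSummary,
  b: ProfileSummary,
): [ProfileSummary, ProfileSummary] {
  if (a.identityCount !== b.identityCount) {
    return a.identityCount > b.identityCount ? [a, b] : [b, a]
  }
  if (a.activityCount !== b.activityCount) {
    return a.activityCount > b.activityCount ? [a, b] : [b, a]
  }
  return [a, b]
}
```

Because the function only proposes an order, a caller can still swap the pair afterwards, which matches the manual override described for the merge UI.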
Secondary profiles are the ones that get merged into the primary profile. They might have overlapping or partial info (a GitHub handle here, a Twitter profile there), and we pull in anything useful that's not already in the primary profile.
Once merged, these profiles are hidden or removed, but their data sticks around if it's relevant.
The similarity score is in a range of 0-1.
The edit distance is calculated using the Levenshtein distance algorithm from the fast-levenshtein library. It measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another. The system finds the smallest edit distance between any verified identity of the primary member and any verified identity of the similar member.
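To make the "smallest edit distance" idea concrete, here is a minimal, self-contained sketch. The platform uses the fast-levenshtein library; the example reimplements the classic dynamic-programming algorithm instead so that it runs without dependencies. The function names and the plain string arrays standing in for verified identities are assumptions made for illustration.

```typescript
// Classic dynamic-programming Levenshtein distance, keeping only two rows.
function levenshtein(a: string, b: string): number {
  if (a === b) return 0
  if (a.length === 0) return b.length
  if (b.length === 0) return a.length

  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution (or match)
      )
    }
    prev = curr
  }
  return prev[b.length]
}

// Smallest distance between any verified identity of the primary member
// and any verified identity of the similar member.
// Returns Infinity when either list is empty.
function smallestEditDistance(primary: string[], similar: string[]): number {
  let smallest = Infinity
  for (const p of primary) {
    for (const s of similar) {
      smallest = Math.min(smallest, levenshtein(p, s))
    }
  }
  return smallest
}
```

For example, `smallestEditDistance(['jdoe', 'john.doe'], ['jdoe1', 'johnny'])` is driven by the closest pair, `jdoe` and `jdoe1`, which differ by a single insertion.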
Fixed Similarity Scores:
- 0.2: When there are clashing identities (username identities in same platform with different values) or when the identity length is less than the smallest edit distance and no additional confidence factors are found.
- 0.95: When a primary member's verified identity matches a similar member's unverified identity.
- 0.98: When there are verified↔unverified email matches between members.
Calculated Similarity Scores:
- Bumped score: The bumped score starts from either a provided score or the default high confidence score (0.9). Then, for each additional confidence factor (location, organization, languages, programming languages, timezone), it adds a bump factor calculated as Math.floor((1 - currentScore) / 5), with the final score capped at 1.0.
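The bumping rule can be illustrated with a short sketch. One caveat: evaluated literally in JavaScript, `Math.floor((1 - 0.9) / 5)` is `0`, so the actual implementation presumably scales or rounds the value differently. The example below therefore assumes an unfloored bump of `(1 - currentScore) / 5`, recomputed after each matching factor. Both that choice and the `bumpScore` name are assumptions, not the platform's code.

```typescript
// Default starting point named in the text.
const HIGH_CONFIDENCE_SCORE = 0.9

// matchingFactors: how many of the additional confidence factors
// (location, organization, languages, programming languages, timezone) matched.
// ASSUMPTION: the bump is (1 - currentScore) / 5 without flooring,
// recomputed after every factor. The final score is capped at 1.0.
function bumpScore(matchingFactors: number, startFrom: number = HIGH_CONFIDENCE_SCORE): number {
  let score = startFrom
  for (let i = 0; i < matchingFactors; i++) {
    score += (1 - score) / 5
  }
  return Math.min(score, 1)
}
```

Under this assumption every extra matching factor closes a fifth of the remaining gap to 1.0, so the score keeps rising but never exceeds the cap.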
If a duplicated profile hasn't been identified by the system, the user can still merge the profiles manually via the UI.
Since we are collecting millions of profiles, relying on manual processes alone for deduplication wouldn't scale. Therefore, the system has an automatic mechanism that detects duplicates and marks them as merge suggestions. Merge suggestions can then be reviewed manually by a user or handled automatically by an LLM agent. Each suggestion can be accepted or ignored, depending on whether it is correct.
Unmerging allows the users to revert a previously merged profile in case it was incorrect or premature.
- Each merged profile retains a history of which profiles were merged into it.
- When unmerging, the user has the option to restore all original profiles to their pre-merge state, OR unmerge a single identity and its associated activities into a new profile. The user is always given the opportunity to preview the unmerge operation, to make sure that the resulting profiles are correct.
- Once the unmerge operation is triggered, most of the static data will be moved synchronously, and the majority of activity updates will occur asynchronously.
- Unmerging can only be done manually, and is only available to admins to prevent creating incorrect data.
- All unmerge events are logged for auditing.
Manual deduplication lets users manually merge two or more profiles if the duplicates weren't found as suggestions or if the user wants to merge one of the presented suggestions themselves.
- Users can select multiple profiles and trigger a manual merge.
- The system will suggest a primary and secondary profile. However, the user has the ability to switch the primary profile in case the suggestion is not the best in terms of data quality.
- When merging, the user is given the opportunity to preview the merge operation, to make sure the resulting profile is correct.
- Once the merge operation is triggered, most of the static data will be moved synchronously, and the majority of activity updates will occur asynchronously. In addition, the system will keep a backup of the original profiles, in case an unmerge operation is needed in the future.
- All merge events are logged for auditing.
The system uses an automated process to create suggestions for possible duplicated profiles and to automatically merge these profiles based on the confidence score and each profile's data.
- Merge suggestions are generated every 2 hours for profiles created since the last time suggestions were generated.
- Merge suggestions presented to the user have a minimum similarity score of 0.75. Admins can review, accept, or reject merge suggestions.
- To avoid a huge manual effort, the system also has an LLM that reviews the open suggestions every Monday and automatically merges or rejects them.
- For members, the LLM only reviews suggestions above 0.8 similarity score.
- For organizations, the LLM only reviews suggestions above 0.75 similarity score.
- The manual merge process is handled by the entity-merging-worker service.
- The merge suggestions process is handled by the merge-suggestions-worker service.
- The member merge process consists of eight main steps:
- Recording action as an audit log. -> captureApiChange
- Recording merge action to show the user the progress of the operation. -> updateMergeActionState
- Move identities to the primary profile. -> moveIdentitiesBetweenMembers
- Move static fields to the primary profile. -> update
- Move organizations data to the primary profile -> moveOrgsBetweenMembers
- Update activities to reflect the new relationships and affiliations (memberId and organizationId). -> updateActivities
- Sync new member activity aggregates. -> syncMember
- Delete secondary member. -> deleteMember
- The organization merge process consists of eight main steps:
- Recording action as an audit log. -> captureApiChange
- Recording merge action to show the user the progress of the operation. -> updateMergeActionState
- Move identities to the primary profile. -> moveIdentitiesBetweenOrganizations
- Move static fields to the primary profile. -> update
- Move members to the primary profile -> moveMembersBetweenOrganizations
- Update activities to reflect the new relationships and affiliations (memberId and organizationId). -> updateActivities
- Sync new organization activity aggregates. -> syncOrganization
- Delete secondary organization. -> deleteOrganization
- The workflow to merge members is orchestrated by Temporal in this file -> finishMemberMerging
- The workflow to merge organizations is orchestrated by Temporal in this file -> finishOrganizationMerging
- The workflow to unmerge members is orchestrated by Temporal in this file -> finishMemberUnmerging
- The workflow to unmerge organizations is orchestrated by Temporal in this file -> finishOrganizationUnmerging
- The scheduled job to generate member merge suggestions is orchestrated by Temporal in this file -> generateMemberMergeSuggestions
- The scheduled job to generate organization merge suggestions is orchestrated by Temporal in this file -> generateOrganizationMergeSuggestions
- The scheduled job to automatically merge member suggestions is orchestrated by Temporal in this file -> mergeMembersWithLLM
- The scheduled job to automatically merge organization suggestions is orchestrated by Temporal in this file -> mergeOrganizationsWithLLM
- The calculateSimilarity method generates a similarity score for two different member profiles.
- The calculateSimilarity method generates a similarity score for two different organization profiles.
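The eight member merge steps listed above run in a fixed order, which can be sketched as a simple pipeline. The step names come from the list. Everything else is an assumption made for illustration: the `MergeStep` signature, the `mergeMembers` function, and the handlers being synchronous. The real steps are orchestrated by Temporal, and part of the work runs asynchronously.

```typescript
// Step names mirror the member merge list; the handler signature is hypothetical.
type MergeStep = (primaryId: string, secondaryId: string) => void

const MEMBER_MERGE_ORDER = [
  'captureApiChange', // audit log
  'updateMergeActionState', // progress shown to the user
  'moveIdentitiesBetweenMembers',
  'update', // static fields
  'moveOrgsBetweenMembers',
  'updateActivities',
  'syncMember', // activity aggregates
  'deleteMember', // secondary profile
] as const

type MemberMergeStepName = (typeof MEMBER_MERGE_ORDER)[number]

// Runs every step in order and returns the names of the completed steps.
// If a handler throws, later steps (including deleteMember) are not run.
function mergeMembers(
  primaryId: string,
  secondaryId: string,
  handlers: Record<MemberMergeStepName, MergeStep>,
): MemberMergeStepName[] {
  const completed: MemberMergeStepName[] = []
  for (const name of MEMBER_MERGE_ORDER) {
    handlers[name](primaryId, secondaryId)
    completed.push(name)
  }
  return completed
}
```

Keeping the deletion of the secondary member as the last step means a failure earlier in the sequence leaves both profiles in place, which fits the conservative principles stated at the top of this page.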
Since merge and unmerge operations are risky and update critical data, the system keeps a record of the user who triggered the action and of the fields that were updated in the profiles.
This is stored in Postgres, in auditLogs and auditLogAction.
Whenever there is a merge or unmerge operation, the system stores a backup of both profiles and the progress of the operation, since some of the steps run asynchronously. This is particularly useful for giving better feedback to the user.
This is stored in Postgres, in mergeActions.
The system has a scheduled operation to generate merge suggestions for newly created profiles. To keep track of the last run, the tenants table in Postgres stores memberMergeSuggestionsLastGeneratedAt and organizationMergeSuggestionsLastGeneratedAt.
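As a rough illustration of how a last-run timestamp limits each scheduled run to new profiles, consider the sketch below. It works on in-memory values only. The `CreatedProfile` shape and the `profilesSince` function are assumptions, and treating a missing timestamp as "process everything" is also an assumption.

```typescript
// Hypothetical minimal profile shape for this example.
interface CreatedProfile {
  id: string
  createdAt: Date
}

// lastGeneratedAt stands in for memberMergeSuggestionsLastGeneratedAt
// (or the organization equivalent). null means no run has happened yet.
function profilesSince(
  profiles: CreatedProfile[],
  lastGeneratedAt: Date | null,
): CreatedProfile[] {
  if (lastGeneratedAt === null) return profiles
  const since = lastGeneratedAt.getTime()
  return profiles.filter((p) => p.createdAt.getTime() > since)
}
```

After a run completes, the stored timestamp would be advanced so that the next run only picks up profiles created in the meantime.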
Then, for each suggestion generated, the system stores it in two tables in Postgres: memberToMergeRaw and memberToMerge for members, and organizationToMergeRaw and organizationToMerge for organizations.
The raw tables store all suggestions regardless of the similarity score. The ToMerge tables hold the suggestions that are shown to users, which are those above the 0.75 similarity score threshold. The LLM only reads suggestions from the ToMerge tables. In addition, whenever two profiles are merged, the suggestion is removed from both the ToMerge and raw tables. If a suggestion is marked as rejected or ignored, it is also removed from these tables.
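The split between raw suggestions, suggestions shown to users, and suggestions reviewed by the LLM can be sketched with two filters. The thresholds are the ones given on this page (0.75 for display, 0.8 for member LLM review, 0.75 for organization LLM review). The type and function names, and the choice of an inclusive display threshold with exclusive LLM thresholds, are assumptions made for illustration.

```typescript
type EntityType = 'member' | 'organization'

// Hypothetical in-memory stand-in for a row in a raw suggestions table.
interface RawSuggestion {
  type: EntityType
  primaryId: string
  secondaryId: string
  similarity: number // 0..1
}

const DISPLAY_THRESHOLD = 0.75
const LLM_THRESHOLD: Record<EntityType, number> = {
  member: 0.8,
  organization: 0.75,
}

// Suggestions that would be exposed to users (ToMerge).
// ASSUMPTION: "minimum of 0.75" is read as inclusive.
function toMergeCandidates(raw: RawSuggestion[]): RawSuggestion[] {
  return raw.filter((s) => s.similarity >= DISPLAY_THRESHOLD)
}

// Subset of ToMerge suggestions the scheduled LLM review would consider.
// ASSUMPTION: "above" is read as strictly greater than.
function llmReviewQueue(toMerge: RawSuggestion[]): RawSuggestion[] {
  return toMerge.filter((s) => s.similarity > LLM_THRESHOLD[s.type])
}
```

With these readings, a member suggestion scored 0.78 is visible to admins but left for manual review, while an organization suggestion with the same score is also eligible for LLM review.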
The memberNoMerge and organizationNoMerge tables in Postgres store the profiles that should not be merged because they were marked as rejected in an existing merge suggestion. Once profiles are in these tables, they will never be proposed as a merge suggestion again.
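One way to picture the effect of the NoMerge tables is as a block list applied before suggestions are created. The sketch below is an illustration under assumptions: it treats each entry as an unordered pair of profile ids, and the `pairKey` and `excludeNoMerge` names are made up for the example.

```typescript
// Order-independent key for a pair of profile ids.
// ASSUMPTION: (a, b) and (b, a) denote the same rejected pair.
function pairKey(a: string, b: string): string {
  return a < b ? `${a}|${b}` : `${b}|${a}`
}

// Drops every candidate pair that was previously rejected.
function excludeNoMerge(
  candidates: Array<[string, string]>,
  noMerge: Array<[string, string]>,
): Array<[string, string]> {
  const blocked = new Set(noMerge.map(([a, b]) => pairKey(a, b)))
  return candidates.filter(([a, b]) => !blocked.has(pairKey(a, b)))
}
```

Checking the block list before scoring, rather than after, would also avoid spending similarity calculations on pairs that can never become suggestions.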
The system currently uses the Claude 3.5 Sonnet model from Amazon Bedrock to decide automatically whether a suggestion should be merged or rejected. For every decision, the system stores the prompt and the result in llmPromptHistory and llmSuggestionVerdicts respectively.
People merges
Organization merges
Merge Suggestions and Automatic Merging