
Data Analysis of Potential Duplicate Authors #10438

@RayBB

Description

Proposal

The attached CSV lists 100k+ authors that have exactly the same name as another author and an ID that differs from that author's ID by only one. They are likely duplicates created by a race condition. As of 2024, this problem rarely occurs, but we still need to fix the old instances.

Here's what someone should do:

  1. For every author, get the number of works. If the author has only one work (most seem to), also get the title of that work. Upload the result here as a CSV like the one already attached, but with these two additional fields.
  2. From that, produce a list of all pairs of authors that have identical names, works with identical (case-insensitive) titles, and IDs that are off by one.
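
Step 2 could be sketched roughly as follows. This is only an illustration, not an agreed implementation: the row fields `author_key`, `name`, `work_count`, and `title` are hypothetical names for the columns of the CSV from step 1, author keys are assumed to look like `/authors/OL123A`, and only authors with exactly one work are compared, since step 1 only records a title for those.

```python
import re

# Assumed key shape: "/authors/OL<number>A" (or just "OL<number>A").
_ID_RE = re.compile(r"OL(\d+)A$")


def author_number(key):
    """Return the numeric part of an author key such as '/authors/OL123A'."""
    match = _ID_RE.search(key)
    if match is None:
        raise ValueError(f"unexpected author key: {key!r}")
    return int(match.group(1))


def find_candidate_pairs(rows):
    """Return (key, next_key) pairs that look like duplicate authors.

    `rows` is an iterable of dicts with the hypothetical fields
    author_key, name, work_count, and title.
    """
    # Index single-work authors by the numeric part of their ID.
    by_number = {}
    for row in rows:
        if row["work_count"] == 1 and row["title"]:
            by_number[author_number(row["author_key"])] = row

    pairs = []
    for number in sorted(by_number):
        first = by_number[number]
        second = by_number.get(number + 1)  # the ID exactly one higher
        if second is None:
            continue
        same_name = first["name"] == second["name"]
        same_title = first["title"].casefold() == second["title"].casefold()
        if same_name and same_title:
            pairs.append((first["author_key"], second["author_key"]))
    return pairs
```

Names are compared exactly and titles with `str.casefold()`, matching the criteria in the list above. Any further normalization (whitespace, punctuation) would be a separate decision.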

From there, staff can decide whether we want to do an automated merge.

To work on this, please use the data dumps; do not call the API.
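
As a starting point for step 1, here is a minimal sketch of counting works per author from a works dump. It assumes each dump line has five tab-separated columns (type, key, revision, last modified, JSON record) and that a work's JSON lists its authors as `{"author": {"key": "/authors/OL...A"}}` entries. Please verify both against the actual dump before relying on it. The function name `summarize_works` is hypothetical.

```python
import json
from collections import defaultdict


def summarize_works(lines):
    """Map each author key to (work_count, title).

    `title` is the title of the author's only work, or None when the
    author has more than one work. Assumes five tab-separated columns
    per line, with the JSON record in the last column.
    """
    counts = defaultdict(int)
    first_title = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t", 4)
        if len(parts) != 5 or parts[0] != "/type/work":
            continue  # skip malformed lines and non-work records
        work = json.loads(parts[4])

        # Collect distinct author keys; tolerate a bare string key too.
        keys = set()
        for entry in work.get("authors") or []:
            author = entry.get("author") if isinstance(entry, dict) else None
            if isinstance(author, dict):
                author = author.get("key")
            if isinstance(author, str):
                keys.add(author)

        for key in keys:
            counts[key] += 1
            first_title.setdefault(key, work.get("title", ""))

    return {
        key: (count, first_title[key] if count == 1 else None)
        for key, count in counts.items()
    }
```

Because the function accepts any iterable of lines, a compressed dump can be streamed through it with `gzip.open(path, "rt", encoding="utf-8")` without loading the whole file. The per-author dictionaries still need to fit in memory.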

CSV of authors, in case you can't see the file on Slack:

ids_next_to_each_other.csv

Justification

No response

Instructions for Contributors

Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue, and again each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.

Metadata

Labels

  • Lead: @RayBB. Issues overseen by Ray (Onboarding & Documentation Lead). [manages]
  • Needs: Help. Issues, typically substantial ones, that need a dedicated developer to take them on. [managed]
  • Needs: Response. Issues which require feedback from lead.
  • Type: Feature Request. Issue describes a feature or enhancement we'd like to implement. [managed]
