As UBC’s institutional repository, cIRcle’s mandate is to preserve and promote published and unpublished research from the UBC community. Faculty research is an important part of this; cIRcle currently has more than 8,800 items in its UBC Faculty and Research Publications collection, with over 42,000 author names listed. However, many of these names are simply different forms of a single author’s name. Having different versions of a single author’s name can make it harder to find their work.
When new publications are added to cIRcle, they can come from a variety of sources and in many different formats. cIRcle’s metadata standards guide how author names are entered for each individual publication, but any review of author name variants occurs as a separate process as staffing and expertise allow. In 2021, cIRcle set out to make it easier to find UBC faculty research by exploring ways to identify authors under a single version of their name and eliminate confusing variants.
This blog post will share some highlights about our experience developing an author name management workflow and challenges we faced when trying to automate the process using OpenRefine, a data cleaning tool. Although the content and processes of each institutional repository may be unique, author name management is an ongoing and ever-present challenge for all.
Name Disambiguation vs. Reconciliation
Author name management can include both disambiguation and reconciliation. Name disambiguation is needed when a single form of name represents more than one individual author. For example, the author name ‘Parry, J. D.’ could represent both ‘Parry, Jessica D.’ and ‘Parry, Joseph Dean’. Name reconciliation is needed when a single author’s name is captured in varying forms. For example, items with the author names ‘Smith, John,’ ‘Smith, J. A.,’ and ‘Smith, John A.’ could all have been authored by one person. Both situations, multiple authors under a single form of name and multiple forms of name for a single author, cause problems with access and discoverability in a repository like cIRcle. Kansas State University Library’s Author Disambiguation Defined guide offers a useful explanation of author name disambiguation and its effects on research impact and dissemination.
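To make the distinction concrete, here is a minimal Python sketch that groups name forms by a rough matching key (surname plus first initial). It is an illustration only, not cIRcle’s actual workflow: the function names `candidate_key` and `group_candidates`, the key itself, and the assumption that every name is in ‘Surname, Given names’ form are all hypothetical. A group with more than one form is only a candidate for reconciliation. Because the key cannot tell one person from another, the same group may also hide a disambiguation problem, so each one still needs human review.

```python
from collections import defaultdict


def candidate_key(name: str) -> str:
    """Build a rough matching key: lowercase surname plus first initial.

    Assumes 'Surname, Given names' form (an assumption made for this sketch).
    """
    surname, _, given = name.partition(",")
    given = given.strip()
    initial = given[0].lower() if given else ""
    return f"{surname.strip().lower()}|{initial}"


def group_candidates(names):
    """Return groups of distinct name forms that share a key, for manual review."""
    groups = defaultdict(set)
    for name in names:
        groups[candidate_key(name)].add(name.strip())
    # Keep only keys that have more than one distinct form of name.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Example names taken from the discussion above.
names = [
    "Smith, John",
    "Smith, J. A.",
    "Smith, John A.",
    "Parry, J. D.",
    "Parry, Jessica D.",
    "Parry, Joseph Dean",
]

for key, forms in group_candidates(names).items():
    print(key, forms)
```

Both the Smith and the Parry names end up in a single group each, even though the Smith forms may be one person and the Parry forms are two. The key can surface candidates, but it cannot decide between reconciliation and disambiguation.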
Preliminary workflows and challenges
During our initial exploratory phase, the cIRcle Specialist identified opportunities for automating the metadata analysis and remediation workflows by using metadata cleaning tools like OpenRefine. OpenRefine is an open-source application used for data cleanup and transformation, or “data wrangling,” and is widely used in libraries to clean, organize, and analyze complex data (Williams, 2018). Not surprisingly, both the 2020 and 2022 OpenRefine user surveys showed that librarians make up the largest group of OpenRefine users (Fauconnier, 2022).
While working with our data in OpenRefine, we faced considerable challenges. The first was figuring out how to get the remediation work done in OpenRefine back into cIRcle’s DSpace repository system. The second involved the time commitment. Metadata remediation can be automated in many cases, but this particular project proved too complex to automate because of the level of manual review and reference required. Ideally, a name reconciliation project would be straightforward and would allow for all work to be done in OpenRefine, without requiring any external reference or comparison with records elsewhere. However, the nature of cIRcle – with records dating back to 2008 and archived via different ingestion streams, each with its own metadata particularities – meant that any reconciliation work would require regular reference to cIRcle and other sites, such as academic journal sites and UBC departmental pages. Determining whether ‘Johnson, Kate C.’ and ‘Johnson, Katherine C.’ are the same person would require review of publication topics, co-authors, and any noted affiliations.
Revised workflows and collaborative efforts
Once we determined that this work had to be done manually, we concluded that remediating author names in cIRcle would have to be done on an ongoing basis, rather than as part of a defined project. With this in mind, and with the help of UBC Library’s Technical Services (TS) team, we redesigned our workflow for identifying the names that would (or might) need remediating. The TS team reviews all new submissions to cIRcle to ensure that each item’s metadata aligns with UBC Library and broader metadata standards. TS also checks the Library of Congress (LC) Name Authorities for an authorized form of name for all UBC Faculty authors. Although ORCID is sometimes used for name authority work, it was not considered appropriate for this project. The rationale for this decision may be discussed in later posts. Because the TS team routinely reviews author names in cIRcle, we tapped into their knowledge of name authorities and metadata standards to identify UBC Faculty author names in cIRcle that need disambiguation and/or remediation. Using a shared Google Spreadsheet, the TS team records any UBC Faculty author names needing remediation that they come across during their review work.
Since the shared Google Spreadsheet went live in January 2023, the TS team has identified over 180 UBC Faculty author names that need remediating, and more are added each week. More than 160 of these were cases where multiple forms of a single author’s name existed in cIRcle. Using this spreadsheet, the cIRcle Specialist has been able to remediate nearly all of these names and identify a single, unique form of name for all future publications by these UBC Faculty authors in cIRcle. Through this work, we are making it easier for the UBC community to find all publications by a given UBC Faculty author in a single search.
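Conceptually, the remediation step amounts to a lookup from each variant to the single form of name chosen for that author. The short Python sketch below illustrates that idea only. The mapping contents, the `remediate` function, and the list-of-names record layout are hypothetical, and it does not show how changes are actually written back to cIRcle’s DSpace repository.

```python
def remediate(authors, preferred_forms):
    """Replace known variants with the preferred form of name.

    Names not in the mapping are left unchanged. If two variants on the
    same record map to one form, that form is kept only once.
    """
    seen = set()
    result = []
    for author in authors:
        fixed = preferred_forms.get(author, author)
        if fixed not in seen:
            seen.add(fixed)
            result.append(fixed)
    return result


# Hypothetical mapping of variants to the chosen form, as a reviewer might
# record it after confirming that the variants belong to one person.
preferred_forms = {
    "Smith, John": "Smith, John A.",
    "Smith, J. A.": "Smith, John A.",
}

print(remediate(["Smith, J. A.", "Johnson, Kate C."], preferred_forms))
```

The mapping is only as good as the manual review behind it: a variant belongs in the table only once someone has confirmed that it refers to the same person.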
Next steps in author name management
The cIRcle Office is currently in the planning stages of a major repository migration, from DSpace version 5.6 to DSpace 7. DSpace 7 offers new author name management tools, such as the use of entities in the data model and integrations with ORCID for identifying authors. These new features may change how we manage author names in cIRcle. For now, with the current workflow in place, cIRcle hopes to clean up as many author names as possible to make the migration process easier and cleaner.
To see this work in progress, search our collections and notice how different author names are formatted and how that might affect your search results. Find a name that needs remediating? Contact us to let us know what you’ve found and we’ll take care of fixing it.
Kansas State University Library. (n.d.). Author Disambiguation Defined. https://guides.lib.k-state.edu/c.php?g=181705&p=4492841
Fauconnier, Sandra. (2022, June 28). OpenRefine’s 2022 user survey: the results are in! OpenRefine. https://openrefine.org/blog/2022/06/28/2022-survey-results
OpenRefine. (n.d.). OpenRefine. https://openrefine.org/
Williams, Mita. (2018, October 24). OpenRefine for Librarians. https://librarian.aedileworks.com/2019/04/23/open-refine-for-librarians/
What’s in a Name? Author Name Management in cIRcle © 2023 by Kelly Gauvin is licensed under CC BY-NC 4.0