globbestael/DedupEndNote


DedupEndNote

Deduplication of EndNote RIS files:

  • deduplicate one file: produces a new RIS file with the unique records
  • deduplicate two files (NEW-RECORDS and OLD-RECORDS): deduplicates both files and produces a RIS file with the unique records from NEW-RECORDS
  • mark the duplicates of one file: produces a RIS file with the Label field containing the ID of the duplicate record

DedupEndNote is available at http://dedupendnote.nl:9777

Actions

  • Export one or two EndNote databases as RIS file(s)
  • Upload the file(s)
  • Choose the action
  • Download the result file (RIS)
  • Import the result file into a new EndNote database

Building your own version

DedupEndNote is a Java web application (Java 17, Spring Boot 2.7, fat jar). It can be started locally with:

    java -jar DedupEndNote-[VERSION].jar

and the application will be available at

    http://localhost:9777

Why DedupEndNote?

Deduplication in EndNote misses many duplicate records. Building and maintaining a Journals List within EndNote can partly solve this problem, but there remain many cases where EndNote is too unforgiving when comparing records. Some bibliographic databases offer deduplication for their own databases (OVID: Medline and EMBASE), but this does not help PubMed, Cochrane or Web of Science users.

DedupEndNote deduplicates an EndNote RIS file and writes a new RIS file with the unique records, which can be imported into a new EndNote database. It is more forgiving than EndNote itself when comparing records, and tests have shown that it identifies many more duplicates (see below under "Performance").

The program has been tested on EndNote databases with records from:

  • CINAHL (EBSCOHost)
  • Cochrane Library (Trials)
  • EMBASE (OVID)
  • Medline (OVID)
  • PsycINFO (OVID)
  • PubMed
  • Scopus
  • Web of Science

The program has been tested with files of up to 50,000 records.

What does DedupEndNote do?

1. Deduplicate

Each pair of records is compared in 5 different ways. The general rule is:

| Comparison | Result | Action |
|---|---|---|
| 1 ... 5 | YES (or insufficient data for comparison) | go to next comparison if present, else mark the records as duplicates |
| 1 ... 5 | NO | stop comparisons for this pair of records |

The following comparisons are used (in this order, chosen for performance reasons):

  1. Publication year: Are they at most 1 year apart?
  • Preprocessing: publication years before 1900 are removed (see insufficient data)
  • Insufficient data: Records without a publication year are compared to all records unless they have been identified as a duplicate.
  2. Starting page or DOI: Are they the same?
    If the starting pages are different or one or both are absent, the DOIs are compared.
  • Preprocessing: Article number is treated as a starting page if starting page itself is empty or contains "-".
  • Preprocessing: Starting pages are compared only for number: "S123" and "123" are considered the same.
  • Preprocessing: In DOIs 'http://dx.doi.org/', 'http://doi.org/', ... are left out. URL- and HTML-encoded DOIs are decoded ('10.1002/(SICI)1098-1063(1998)8:6&lt;627::AID-HIPO5&gt;3.0.CO;2-X' becomes '10.1002/(SICI)1098-1063(1998)8:6<627::AID-HIPO5>3.0.CO;2-X'). DOIs are lowercased.
  • Insufficient data: If one or both DOIs are missing and one or both of the starting pages are missing, the answer is YES. This is important because of PubMed ahead of print publications.
  3. Authors: Is the Jaro-Winkler similarity of the authors > 0.67?
  • Preprocessing: The author "Anonymous," is treated as no author.
  • Preprocessing: Group author names are removed. "Author" names which contain "consortium", "grp", "group", "nct" or "study" are considered group author names.
  • Preprocessing: First names are reduced to initials ("Moorthy, Ranjith K." to "Moorthy, R. K.").
  • Preprocessing: All authors from each record are joined by "; ".
  • Insufficient data: If one or both records have no authors, the answer is YES (except if one of the records is a reply (see below) and one of the records has no starting page or DOI).
  4. Title: Is the Jaro-Winkler similarity of (one of) the normalized titles > 0.9?
    The fields Original publication (OP), Short Title (ST), Title (TI) and sometimes Book section (T3, see below) are treated as titles. Because the Jaro-Winkler similarity algorithm puts a heavy penalty on differences at the beginning of a string, the normalized titles are also reversed.
  • Preprocessing: The titles are normalized (converted to lower case, text between "<...>" removed, all characters which are not letters or numbers are replaced by a space character, ...).
  • Insufficient data: If one of the records is a reply (see below), the titles are not compared / the answer is YES (but the Jaro-Winkler similarity of the authors should be > 0.75 and the comparison between the journals is more strict).
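The title comparison can be sketched as follows. This is a minimal illustration, not DedupEndNote's actual code: the class and method names are hypothetical, the Jaro-Winkler function is a textbook version (prefix scale 0.1, prefix length capped at 4), and the sketch assumes that a pair passes when either the forward or the reversed normalized titles exceed the 0.9 threshold. Only the normalization steps named above are applied, and only one title per record is compared.

```java
import java.util.Locale;

class TitleSimilaritySketch {

    // Lower case, drop "<...>", replace every non-letter/non-digit by a space.
    // Collapsing runs of spaces is an extra assumption of this sketch.
    static String normalize(String title) {
        String s = title.toLowerCase(Locale.ROOT);
        s = s.replaceAll("<[^>]*>", "");
        s = s.replaceAll("[^\\p{L}\\p{N}]", " ");
        return s.trim().replaceAll("\\s+", " ");
    }

    // Textbook Jaro similarity.
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(b.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int transposed = 0;
        int k = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) continue;
            while (!bMatched[k]) k++;
            if (a.charAt(i) != b.charAt(k)) transposed++;
            k++;
        }
        double m = matches;
        return (m / a.length() + m / b.length() + (m - transposed / 2.0) / m) / 3.0;
    }

    // Jaro-Winkler: Jaro plus a bonus for a common prefix of up to 4 characters.
    static double jaroWinkler(String a, String b) {
        double jaro = jaro(a, b);
        int max = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < max && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    // Assumption: YES if the forward OR the reversed normalized titles are similar.
    static boolean similarTitles(String t1, String t2) {
        String n1 = normalize(t1);
        String n2 = normalize(t2);
        String r1 = new StringBuilder(n1).reverse().toString();
        String r2 = new StringBuilder(n2).reverse().toString();
        return jaroWinkler(n1, n2) > 0.9 || jaroWinkler(r1, r2) > 0.9;
    }
}
```

With this version, `jaroWinkler("MARTHA", "MARHTA")` is about 0.961. Reversing the strings moves the prefix bonus to the end of the title, which is why a difference at the start of a title does not have to rule out a match.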

Reply: a publication is considered a reply if the title (field TI) contains "reply", or contains "author(...)respon(...)", or is nothing but "response" (all case insensitive).
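The reply rule can be expressed as a single case-insensitive pattern. The sketch below is one reading of the rule, with a hypothetical class name; it treats "author(...)respon(...)" as "author", any text, then "respon".

```java
import java.util.regex.Pattern;

class ReplyDetectionSketch {

    // Three alternatives: contains "reply", contains "author...respon",
    // or consists of nothing but "response" (surrounding spaces allowed).
    private static final Pattern REPLY = Pattern.compile(
            ".*reply.*|.*author.*respon.*|\\s*response\\s*",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean isReply(String title) {
        return title != null && REPLY.matcher(title).matches();
    }
}
```

Under this reading, "Reply from the authors", "Authors' response" and "Response" count as replies, while "Response to chemotherapy in lymphoma" does not.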

T3 field: Especially EMBASE (OVID) uses this field for (1) Conference title (majority of cases), (2) an alternative journal title, and (3) original (non English) title. Case 1 (identified as containing a number or "Annual", "Conference", "Congress", "Meeting" or "Society") is skipped. All other T3 fields are treated as Journals and as titles.
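A minimal sketch of the conference test for T3 fields is shown below. The class name is hypothetical, and the sketch assumes that the keywords are matched case-insensitively and anywhere in the field, which the description above does not specify.

```java
import java.util.regex.Pattern;

class T3FieldSketch {

    // A digit or one of the listed keywords marks the field as a conference title.
    private static final Pattern CONFERENCE = Pattern.compile(
            "\\d|annual|conference|congress|meeting|society",
            Pattern.CASE_INSENSITIVE);

    // true: skip the field; false: treat it as a journal and as a title.
    static boolean looksLikeConference(String t3) {
        return t3 != null && CONFERENCE.matcher(t3).find();
    }
}
```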

  5. ISSN or Journal: Are they the same (ISSN) or similar (Journal)?
    The fields Journal / Book Title (T2), Alternate Journal (J2) and sometimes Book section (T3, see above) are treated as journals, ISBNs as ISSNs. All ISSNs and journal titles (including abbreviations) in the records are used. Abbreviated and full journal titles are compared in a sensible way (see examples below). If the ISSNs are different or one or both records have no ISSN, the journals are compared.
  • Preprocessing: ISSNs are normalized (dashes are removed, lowercased). For ISBN-10 the first 9 digits are used, for ISBN-13 the 9 digits starting at position 4.
  • Preprocessing: Journal titles of the form "Zhonghua wai ke za zhi [Chinese journal of surgery]" or "Zhonghua wei chang wai ke za zhi = Chinese journal of gastrointestinal surgery" or "The Canadian Journal of Neurological Sciences / Le Journal Canadien Des Sciences Neurologiques" are split into 2 journal titles.
  • Preprocessing: the journal titles are normalized (hyphens, dots and apostrophes are replaced with space, end part between round or square brackets is removed, initial article is removed, ...).

If two records get 5 YES answers, they are considered duplicates. Only the first record of a set of duplicate records is copied to the output file.
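The general rule (stop at the first NO, treat insufficient data as YES) can be sketched as a short-circuiting loop over the comparisons. The code below is an illustration under simplifying assumptions, not the program's implementation: the `Rec` record and all method names are hypothetical, only the first two of the five comparisons are included, and starting pages and DOIs are assumed to be preprocessed already.

```java
import java.util.List;
import java.util.function.BiPredicate;

class ComparisonCascadeSketch {

    // Hypothetical minimal record; null means "field absent".
    record Rec(String id, Integer year, String startPage, String doi) {}

    // Comparison 1: at most 1 year apart; a missing year counts as YES.
    static boolean yearsMatch(Rec a, Rec b) {
        if (a.year() == null || b.year() == null) return true;
        return Math.abs(a.year() - b.year()) <= 1;
    }

    // Comparison 2: same starting page, else same DOI;
    // missing DOI(s) together with missing page(s) counts as YES.
    static boolean pageOrDoiMatch(Rec a, Rec b) {
        boolean pagesPresent = a.startPage() != null && b.startPage() != null;
        if (pagesPresent && a.startPage().equals(b.startPage())) return true;
        boolean doisPresent = a.doi() != null && b.doi() != null;
        if (doisPresent) return a.doi().equalsIgnoreCase(b.doi());
        return !pagesPresent;
    }

    // The remaining comparisons (authors, title, ISSN/journal) would be appended here.
    static final List<BiPredicate<Rec, Rec>> COMPARISONS = List.of(
            ComparisonCascadeSketch::yearsMatch,
            ComparisonCascadeSketch::pageOrDoiMatch);

    static boolean isDuplicate(Rec a, Rec b) {
        for (BiPredicate<Rec, Rec> comparison : COMPARISONS) {
            if (!comparison.test(a, b)) return false; // NO: stop comparing this pair
        }
        return true; // every comparison answered YES
    }
}
```

Putting the cheapest comparison first means that most pairs are rejected before the more expensive string similarity comparisons run, which matches the remark above that the order was chosen for performance reasons.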

2. Enrich the records

When writing the output file (except in Mark Mode), the following fields can be changed:

  • Author (AU):
    • if the (only) author is "Anonymous", the author is omitted
  • DOI (DO):
    • the DOIs of the removed duplicate records are copied to the saved record and deduplicated. The DOI field is important for finding the full text in EndNote.
    • DOIs of the form "10.1038/ctg.2014.12", "http://dx.doi.org/10.1038/ctg.2014.12", ... are rewritten in the prescribed form "https://doi.org/10.1038/ctg.2014.12". DOIs of this form are clickable links in EndNote.
  • Publication year (PY):
    • if the saved record has no value for its Publication year but one of the removed duplicate records has, the first not empty Publication year of the duplicates is copied to the saved record.
  • Starting page (SP) and Article Number (C7):
    • the article number from field C7 is put in the Pages field (SP) if the Pages field is empty or does not contain a "-", overwriting the Pages field content.
    • the article number field (C7) is omitted
    • if the saved record has no value for its Pages field (e.g. PubMed ahead of print publications) but one of the removed duplicate records has, the first not empty pages of the duplicates are copied to the saved record.
    • the Pages field gets an unabbreviated form: e.g. "482-91" is rewritten as "482-491".
    • if the ending page is the same as the starting page, only the starting page is written ("192" instead of "192-192").
  • Title (TI):
    • If the publication is a reply, the title is replaced with the longest title from the duplicates (e.g. "Reply from the authors" is replaced by "Coagulation parameters and portal vein thrombosis in cirrhosis Reply")

The output file is a new RIS file which can be imported into a new EndNote database.

DedupEndNote is slower than EndNote in deduplicating records because its comparisons are more time-consuming. EndNote can deduplicate an EndNote database of ca. 15,000 records in less than 5 seconds. DedupEndNote needs around 20 seconds to deduplicate the export file in RIS format (115MB).

Performance

Data are from:

  • [SRA] Rathbone, J., Carter, M., Hoffmann, T. et al. Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module. Syst Rev 4, 6 (2015). https://doi.org/10.1186/2046-4053-4-6
    The data sets are available at https://osf.io/dyvnj/
  • [McKeown] McKeown, S., Mir, Z.M. Considerations for conducting systematic reviews: evaluating the performance of different methods for de-duplicating references. Syst Rev 10, 38 (2021). https://doi.org/10.1186/s13643-021-01583-y
  • [BIG_SET] Own test database for DedupEndNote on portal vein thrombosis (52,828 records, with 4923 records validated)

| Name | Tool | True pos | False neg | Sensitivity | True neg | False pos | Specificity | Accuracy |
|---|---|---|---|---|---|---|---|---|
| SRA: Cytology screening (1856 rec) | EndNote X9 | 885 | 518 | 63.1% | 452 | 1 | 99.8% | 72.0% |
| | SRA-DM | 1265 | 139 | 90.1% | 452 | 0 | 100.0% | 92.5% |
| | DedupEndNote | 1359 | 61 | 95.7% | 436 | 0 | 100.0% | 96.8% |
| SRA: Haematology (1415 rec) | EndNote | 159 | 87 | 64.6% | 1165 | 4 | 99.7% | 93.6% |
| | SRA-DM | 208 | 38 | 84.6% | 1169 | 0 | 100.0% | 97.3% |
| | DedupEndNote | 222 | 14 | 94.1% | 1179 | 0 | 100.0% | 99.0% |
| SRA: Respiratory (1988 rec) | EndNote X9 | 410 | 391 | 51.2% | 1185 | 2 | 99.8% | 80.2% |
| | SRA-DM | 674 | 125 | 84.4% | 1189 | 0 | 100.0% | 93.7% |
| | DedupEndNote | 766 | 34 | 95.7% | 1188 | 0 | 100.0% | 97.8% |
| SRA: Stroke (1292 rec) | EndNote X9 | 372 | 134 | 73.5% | 784 | 2 | 99.7% | 89.5% |
| | SRA-DM | 426 | 81 | 84.0% | 785 | 0 | 100.0% | 93.7% |
| | DedupEndNote | 503 | 7 | 98.6% | 782 | 0 | 100.0% | 99.5% |
| McKeown (3130 rec) | OVID | 1982 | 90 | 95.7% | 1058 | 0 | 100.0% | 97.1% |
| | EndNote | 1541 | 531 | 74.4% | 850 | 208 | 80.3% | 76.4% |
| | Mendeley | 1877 | 195 | 90.6% | 1041 | 17 | 98.4% | 93.2% |
| | Zotero | 1473 | 599 | 71.1% | 1038 | 20 | 98.1% | 80.2% |
| | Covidence | 1952 | 120 | 94.2% | 1056 | 2 | 99.8% | 96.1% |
| | Rayyan | 2023 | 49 | 97.6% | 1006 | 52 | 95.1% | 96.8% |
| | DedupEndNote | 2010 | 62 | 97.0% | 1058 | 0 | 100.0% | 98.0% |
| BIG_SET (4923 rec) | DedupEndNote | 3685 | 271 | 93.1% | 966 | 1 | 99.9% | 94.5% |
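
The table does not define its percentage columns. The figures are consistent with the usual definitions, sketched below with hypothetical names: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / all records.

```java
class DedupMetricsSketch {

    // Share of the real duplicates that were found, in percent.
    static double sensitivity(int truePos, int falseNeg) {
        return 100.0 * truePos / (truePos + falseNeg);
    }

    // Share of the unique records that were correctly kept, in percent.
    static double specificity(int trueNeg, int falsePos) {
        return 100.0 * trueNeg / (trueNeg + falsePos);
    }

    // Share of all records that were classified correctly, in percent.
    static double accuracy(int truePos, int falseNeg, int trueNeg, int falsePos) {
        return 100.0 * (truePos + trueNeg) / (truePos + falseNeg + trueNeg + falsePos);
    }
}
```

For the DedupEndNote row of the McKeown set (2010, 62, 1058, 0) these functions give about 97.0, 100.0 and 98.0, as in the table.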

Limitations

  • Input file size: The maximum size of the input file is limited to 150MB.
  • Input file format: Only EndNote RIS files are supported (at present).
  • Input file encoding: The program assumes that the input file is encoded as UTF-8.
  • The program uses a bibliographic point of view: an article or conference abstract that has been published in more than one (issue of a) journal is not considered a duplicate publication.
  • If authors AND (all) titles AND (all) journal names for a record use a non-Latin script, results for this record may be inaccurate.
  • Each input file must be an export from ONE EndNote database: the ID fields are used internally for identifying the records, so they have to be unique. When comparing 2 files the ID fields may be common between the 2 files.
  • The program has been developed and tested for biomedical databases (PubMed, EMBASE, ...) and some general databases (Web of Science, Scopus). Deduplicating records from other databases is not guaranteed to work.
  • Records for each publication year are compared to records from the same and the following year: a record from 2016 is compared to the records from 2015 (when treating the records from 2015) and from 2016 and 2017 (when treating the records from 2016). A PubMed ahead-of-print record from 2013 and a corresponding record from 2017 (when it was 'officially' published) will not be compared, and so will not be identified as duplicates.
  • Bibliographic databases are not always very accurate in the starting page of a publication. Because starting page is part of the comparisons, DedupEndNote misses the duplicates when bibliographic databases don't agree on the starting page (and one or both records have no DOIs).
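
The year-by-year pairing described in the limitation on publication years can be sketched as follows. The names are hypothetical, and records without a publication year (which are compared to all records) are left out of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class YearWindowSketch {

    // Hypothetical minimal record: an ID and a publication year.
    record Rec(String id, int year) {}

    // Returns the candidate pairs ("idA~idB") that would be handed to the comparisons.
    static List<String> candidatePairs(List<Rec> records) {
        Map<Integer, List<Rec>> byYear = new TreeMap<>();
        for (Rec r : records) {
            byYear.computeIfAbsent(r.year(), y -> new ArrayList<>()).add(r);
        }
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<Integer, List<Rec>> entry : byYear.entrySet()) {
            List<Rec> current = entry.getValue();
            List<Rec> next = byYear.getOrDefault(entry.getKey() + 1, List.of());
            for (int i = 0; i < current.size(); i++) {
                // Same year: each pair once.
                for (int j = i + 1; j < current.size(); j++) {
                    pairs.add(current.get(i).id() + "~" + current.get(j).id());
                }
                // Following year.
                for (Rec n : next) {
                    pairs.add(current.get(i).id() + "~" + n.id());
                }
            }
        }
        return pairs;
    }
}
```

With records A (2013), B (2014), C (2017) and D (2017), the candidate pairs are A~B and C~D. A and C are never paired, which is the ahead-of-print case described above.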
