How to Deduplicate BibTeX Entries?

I have been writing my dissertation recently. It combines several publications from my PhD career, so part of the writing process involves copy-pasting those papers into a single document.

Like any good academic, I typeset my publications with LaTeX and use BibTeX to incorporate citations into the documents. My bibliography collection is fairly ad hoc: during each writing project, I search for related work to cite in my paper. Unlike most others, I create a separate bib file for each BibTeX entry, named after the citation key. For example, I would have bib/ndn-tlv.bib for a BibTeX entry named "ndn-tlv", and bib/Mininet.bib for the "Mininet" entry. This allows me to find available citation keys with a quick glance over the bib/ directory. My build process then concatenates these small bib files into ref.bib as an input to BibTeX.
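A minimal sketch of such a concatenation step, assuming every file in bib/ ends in .bib and holds exactly one entry. The function name concatBibFiles is hypothetical; the actual build process is not shown in this article and may well be a one-line shell command instead.

```php
<?php
// Hypothetical helper: join all single-entry .bib files in $dir into one database.
// Returns the number of files that were combined.
function concatBibFiles(string $dir, string $output): int {
  $files = glob(rtrim($dir, '/') . '/*.bib') ?: array();
  sort($files); // keep the order stable between builds
  $chunks = array();
  foreach ($files as $file) {
    $chunks[] = trim(file_get_contents($file));
  }
  // Separate entries with a blank line for readability.
  file_put_contents($output, implode("\n\n", $chunks) . "\n");
  return count($files);
}

// Example call, only when a bib/ directory exists next to this script.
if (is_dir('bib')) {
  concatBibFiles('bib', 'ref.bib');
}
```

Because each file is named after its citation key, two files with the same name can never coexist in one directory, which is what makes the "copy everything together" approach described next work.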

My dissertation combines all my publications, and thus needs the union of the BibTeX entries from those publications. To make this union, I can copy all these single-entry bib files into the same directory. If two previous papers cited the same reference, their bib files should have the same name, and only one copy would be left in the combined directory.

Except that this assumption holds only if I cited the same reference with the same citation key every time. And so I discovered a citation appearing twice in my dissertation:

[Figure: duplicate references in dissertation]

While I can fix this one now that I have spotted it, it makes me think: how can I find duplicate citations in my LaTeX document and remove them?

A few Internet searches led me to a useful command: bibexport. This command is part of TeX Live 2015, and comes with the texlive-bibtex-extra package on Ubuntu. It reads an aux file, one of the many temporary files used by LaTeX, and generates a BibTeX database that contains only the citations used in the LaTeX document. After all, I am not concerned with whether there are duplicates in my entire BibTeX database (which contains a list of all RFCs and is very large); I only care about duplicate citations appearing in my dissertation.
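To see why the aux file is enough to identify the cited entries: with classic BibTeX, LaTeX writes a \citation{...} line into the aux file for each \cite command. The sketch below lists those keys. It is only an illustration, not how bibexport is implemented, and the function name citationKeysFromAux is hypothetical. It also assumes the classic BibTeX workflow; other bibliography backends may record citations differently.

```php
<?php
// Hypothetical helper: list the distinct citation keys recorded in an aux file.
// Assumes classic BibTeX-style lines such as \citation{key1,key2}.
function citationKeysFromAux(string $auxContents): array {
  preg_match_all('/^\\\\citation\{([^}]*)\}/m', $auxContents, $matches);
  $keys = array();
  foreach ($matches[1] as $group) {
    // One \citation line may carry several comma-separated keys.
    foreach (explode(',', $group) as $key) {
      $key = trim($key);
      if ($key !== '') {
        $keys[] = $key;
      }
    }
  }
  // Drop repeats while keeping first-seen order.
  return array_values(array_unique($keys));
}

// Example call, only when the aux file exists.
if (is_file('dissertation.aux')) {
  print_r(citationKeysFromAux(file_get_contents('dissertation.aux')));
}
```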

I ran bibexport dissertation.aux:

[Screenshot: bibexport console window]

It writes a bibexport.bib file, and the duplicate entries that I spotted earlier are of course in this file:

[Screenshot: bibexport result]

Are there any other duplicate citations in my dissertation? I wrote a PHP script to answer this question. Source code and an online demo are at the end of this article.

The script takes four steps:

  1. Extract titles from the exported BibTeX database bibexport.bib.
  2. Normalize titles: delete everything except letters and whitespace, and collapse consecutive whitespace into a single space. Only the first 255 characters of each title are kept, due to a limitation of the levenshtein function used in the next step.
  3. Compute the edit distance between every pair of titles. Edit distance measures how dissimilar two strings are: the smaller the edit distance between two titles, the more likely they are duplicates. I use PHP's built-in levenshtein function for this calculation.
  4. Sort pairs of titles by increasing edit distance. Print the sorted pairs.
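To make steps 2 and 3 concrete, here is a small worked example. The three titles are made up for illustration and are not taken from my bibliography; the normalization follows the rules listed above.

```php
<?php
// Step 2: keep only letters and single spaces, lowercase, cap at 255 characters.
function normalizeTitle($title) {
  $title = preg_replace('/[^a-z\s]/i', '', $title);
  $title = preg_replace('/\s+/', ' ', $title);
  return substr(strtolower(trim($title)), 0, 255);
}

// Made-up titles: the first two differ only in braces, case, and punctuation.
$a = normalizeTitle('{NDN} Packet Format: A {TLV} Approach');
$b = normalizeTitle('NDN packet format -- a TLV approach.');
$c = normalizeTitle('Mininet: Rapid Prototyping');

// Step 3: identical normalized titles have distance 0; unrelated ones do not.
echo levenshtein($a, $b), "\n";
echo levenshtein($a, $c), "\n";
```

The first pair normalizes to the same string, so its distance is 0 and it sorts to the top of the output. The unrelated pair gets a much larger distance and sinks toward the bottom.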

It shows the following output:

[Screenshot: bibdedup.php output]

The script finds the duplicate citation that I spotted earlier. It also tells me that my dissertation does not contain any other duplicate citations.

PHP script: Deduplicate BibTeX entries

<?php
require_once 'vendor/autoload.php';

$parser = new RenanBr\BibTexParser\Parser();
$listener = new RenanBr\BibTexParser\Listener();
$parser->addListener($listener);
$parser->parseFile('bibexport.bib');
$entries = $listener->export();

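// Reduce a title to lowercase letters separated by single spaces, so that
// differences in case, punctuation, and BibTeX braces are ignored.
function normalizeTitle($title) {
  $title = preg_replace('/\s/', ' ', $title);      // tabs and newlines become spaces
  $title = preg_replace('/[^a-z ]/i', '', $title); // drop everything except letters and spaces
  $title = preg_replace('/\s+/', ' ', $title);     // collapse runs of spaces
  // Keep only the first 255 characters because of the levenshtein limitation.
  return substr(strtolower(trim($title)), 0, 255);
}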

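// Map each citation key to its normalized title.
$titles = array();
foreach ($entries as $entry) {
  if (!isset($entry['title'])) {
    continue; // skip entries that have no title field
  }
  $titles[$entry['citation-key']] = normalizeTitle($entry['title']);
}
count($titles) > 0 or die('No BibTeX entry found.');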

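// Compute the edit distance for every unordered pair of entries.
$matrix = array();
foreach ($titles as $k1=>$t1) {
  foreach ($titles as $k2=>$t2) {
    // Visit each pair once, and never compare an entry with itself.
    if (strcmp($k1, $k2) >= 0) {
      continue;
    }
    $matrix[$k1."\n".$k2] = levenshtein($t1, $t2);
  }
}

// Print pairs from most similar to least similar.
asort($matrix);
foreach ($matrix as $keys=>$cost) {
  list($k1, $k2) = explode("\n", $keys);
  printf("%4d %s => %s\n     %s => %s\n\n", $cost, $k1, $titles[$k1], $k2, $titles[$k2]);
}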
?>

To use this script:

  1. Build your LaTeX document with pdflatex or latex.
  2. Run bibexport paper.aux; substitute "paper" with whatever your document is called.
  3. Copy this script and save it as bibdedup.php.
  4. Get Composer, and install the BibTeX parser library by running composer require renanbr/bibtex-parser:0.5.0 in the same directory as this script.
  5. Move bibexport.bib to the same directory as this script.
  6. Run php bibdedup.php | less.
  7. Look at the top few lines to see if there are any duplicates.

Online Demo: Deduplicate BibTeX entries

Input is limited to 256KB. Output is limited to the first 500 entries.
