Duplicate work detection

A work whose reference metadata is a candidate for a BOWI assignment, called the candidate work, is flagged as a duplicate of an existing work when their titles, subtitles and contributors are equivalent.

If only one of the two compared works defines a subtitle, they are not considered equivalent.

Title equivalence

Titles are considered equivalent when their normalized forms are equal ignoring case.

A title’s normalized form is its trimmed value with diacritics, punctuation and special characters removed, and with every run of whitespace replaced by a single space.
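The normalization and comparison described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names `normalize` and `titles_equivalent` are hypothetical, and the sketch assumes that "punctuation and special characters" means every character that is neither a letter, a digit nor whitespace, and that such characters are deleted rather than replaced by a space.

```python
import re
import unicodedata


def normalize(value: str) -> str:
    """Return the normalized form of a title (sketch under stated assumptions)."""
    # Decompose accented characters, then drop the combining marks (diacritics).
    decomposed = unicodedata.normalize("NFKD", value)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: anything that is not a letter, digit or whitespace is removed.
    cleaned = "".join(c for c in without_marks if c.isalnum() or c.isspace())
    # Replace every run of whitespace by a single space, then trim both ends.
    return re.sub(r"\s+", " ", cleaned).strip()


def titles_equivalent(a: str, b: str) -> bool:
    """Titles are equivalent when their normalized forms are equal ignoring case."""
    return normalize(a).casefold() == normalize(b).casefold()
```

With this sketch, `titles_equivalent("Boléro", "  BOLERO ")` is true, while `titles_equivalent("Boléro", "Bolero II")` is false.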

Subtitle equivalence

Subtitles are considered equivalent when their normalized forms are equal ignoring case.

A subtitle’s normalized form is its trimmed value with diacritics, punctuation and special characters removed, and with every run of whitespace replaced by a single space.
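Because a subtitle may be absent, the comparison has one more case than the title comparison. The sketch below is illustrative only: `subtitles_equivalent` is a hypothetical name, an undefined subtitle is modelled as `None`, and the `normalize` parameter stands for any function that produces the normalized form described above. The text does not say what happens when neither work defines a subtitle; the sketch assumes the two works are then equivalent on this criterion.

```python
from typing import Callable, Optional


def subtitles_equivalent(
    a: Optional[str],
    b: Optional[str],
    normalize: Callable[[str], str],
) -> bool:
    """Compare two optional subtitles (sketch; None means 'not defined')."""
    # Assumption: two works that both lack a subtitle do not differ on subtitles.
    if a is None and b is None:
        return True
    # Only one of the two works defines a subtitle: not equivalent.
    if a is None or b is None:
        return False
    # Both defined: compare normalized forms, ignoring case.
    return normalize(a).casefold() == normalize(b).casefold()
```

Passing the normalization function in keeps the sketch independent of how diacritics and punctuation are actually stripped.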

Contributor list equivalence

Two contributor lists are considered equivalent when every contributor in each list has an equivalent contributor in the other list.
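The rule is symmetric: each list must be covered by the other. A minimal sketch, assuming a hypothetical function name `contributor_lists_equivalent` and a caller-supplied `equivalent` predicate that decides whether two contributors are equivalent:

```python
from typing import Callable, Sequence, TypeVar

C = TypeVar("C")


def contributor_lists_equivalent(
    left: Sequence[C],
    right: Sequence[C],
    equivalent: Callable[[C, C], bool],
) -> bool:
    """True when every contributor of each list has a match in the other list."""

    def covered(source: Sequence[C], target: Sequence[C]) -> bool:
        # Every element of source needs at least one equivalent element in target.
        return all(any(equivalent(s, t) for t in target) for s in source)

    return covered(left, right) and covered(right, left)
```

Read this way, the rule does not require the lists to have the same length or order: a list containing the same contributor twice is still equivalent to a list containing that contributor once.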

Contributor equivalence

Contributor equivalence rules, in priority order. The rules are mutually exclusive: only the first applicable rule is used.

1. If both contributor ISNIs are defined, contributors are considered equivalent when the ISNIs are equal, regardless of other information.
2. If both contributor IPIs are defined, contributors are considered equivalent when the IPIs are equal, regardless of contributor names.
3. Otherwise, contributors are considered equivalent when their names are equivalent.
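The priority order can be sketched as a chain of early returns. This is an illustration under assumptions: the `Contributor` class and its field names are hypothetical, the name is simplified to a plain string, the identifier values are placeholders, and name comparison is delegated to a caller-supplied `names_equivalent` predicate. The sketch reads "exclusive" as meaning that once a rule applies, lower-priority rules are not consulted, so two contributors with different ISNIs are not equivalent even if their IPIs or names agree.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Contributor:
    # Hypothetical model: the name is simplified to a plain string here.
    name: str
    isni: Optional[str] = None
    ipi: Optional[str] = None


def contributors_equivalent(
    a: Contributor,
    b: Contributor,
    names_equivalent: Callable[[str, str], bool],
) -> bool:
    # Rule 1: both ISNIs defined, so the ISNI comparison alone decides.
    if a.isni is not None and b.isni is not None:
        return a.isni == b.isni
    # Rule 2: both IPIs defined, so the IPI comparison alone decides.
    if a.ipi is not None and b.ipi is not None:
        return a.ipi == b.ipi
    # Rule 3: fall back to name equivalence.
    return names_equivalent(a.name, b.name)
```

For example, if only one of the two contributors has an ISNI, rule 1 does not apply and the comparison falls through to the IPIs, then to the names.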

Contributor name equivalence

Contributor names are considered equivalent when they have the same type and when their variants match.

A name type can be either Pseudonym or RealName.

For each (lastname, firstname) pair, two name variants are built:

- “<firstname> <lastname>”, e.g. “Antonio Vivaldi”
- “<first character of firstname> <lastname>”, e.g. “A Vivaldi”

If any variant of one name matches any variant of the other, the contributor names are considered equivalent.

Name variants match when their normalized forms are equal ignoring case.

A contributor name’s normalized form is its trimmed value with diacritics, punctuation and special characters removed, and with every run of whitespace replaced by a single space.
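Putting the name rules together, a possible sketch looks like this. All names (`NameType`, `ContributorName`, `variants`, `names_equivalent`) are hypothetical, each contributor name is assumed to carry a single (lastname, firstname) pair, and "punctuation and special characters" is assumed to mean every character that is neither a letter, a digit nor whitespace.

```python
import re
import unicodedata
from dataclasses import dataclass
from enum import Enum
from typing import Set


class NameType(Enum):
    PSEUDONYM = "Pseudonym"
    REAL_NAME = "RealName"


@dataclass(frozen=True)
class ContributorName:
    name_type: NameType
    lastname: str
    firstname: str


def normalize(value: str) -> str:
    # Strip diacritics, then (assumption) drop non-alphanumeric, non-space characters.
    decomposed = unicodedata.normalize("NFKD", value)
    kept = "".join(
        c for c in decomposed
        if not unicodedata.combining(c) and (c.isalnum() or c.isspace())
    )
    # Collapse whitespace runs and trim.
    return re.sub(r"\s+", " ", kept).strip()


def variants(name: ContributorName) -> Set[str]:
    """Build the two variants, normalized and case-folded for comparison."""
    full = f"{name.firstname} {name.lastname}"
    initial = f"{name.firstname[:1]} {name.lastname}"
    return {normalize(full).casefold(), normalize(initial).casefold()}


def names_equivalent(a: ContributorName, b: ContributorName) -> bool:
    # Same type is required, and at least one variant must be shared.
    return a.name_type == b.name_type and not variants(a).isdisjoint(variants(b))
```

Under this reading, “Antonio Vivaldi” and “A. Vivaldi” are equivalent when both are of the same type, because they share the variant “A Vivaldi”. The same reasoning makes any two names with the same last name and the same first initial equivalent, which follows from the rule as written.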