How to obtain all persistent author IDs from the dblp.xml?

dblp uses two kinds of author identifiers: one is more stable, the other more transient. By iterating over all elements in the dblp.xml and collecting these identifiers, you can obtain a list of all persistent author IDs used in dblp.

The first (more stable and preferred) kind of identifier is the key of a "Home Page" record. These records are pseudo-records in the dblp.xml that (for historical reasons) have the tag type "www" and carry a "title" element with the text content "Home Page". They represent the author profile pages in dblp. All their keys start with "homepages/", and they usually do not change unless an author profile needs to be split or merged. You may also find "www" pseudo-records carrying a "crossref" element. These are remnants of an author profile that has been merged with another one. The given crossref key points to the record of the merged profile.
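As a minimal sketch of this first approach, the following Python snippet streams over a dblp-style file and collects the keys of the "Home Page" records, keeping the merge remnants separately. The function name collect_profile_keys and the small inline sample (with made-up keys and names) are assumptions for illustration only. Note also that Python's standard-library parser does not load the DTD that defines the named character entities used in the real dblp.xml, so the real file will most likely need a DTD-aware parser; the filtering logic would stay the same.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature stand-in for dblp.xml (keys and names are made up).
SAMPLE = b"""<dblp>
<article key="journals/example/Doe20">
  <author>Jane Doe 0001</author>
  <title>An Example Article.</title>
</article>
<www key="homepages/00/1234">
  <author>Jane Doe 0001</author>
  <author>Jane A. Doe</author>
  <title>Home Page</title>
</www>
<www key="homepages/99/5678">
  <title>Home Page</title>
  <crossref>homepages/00/1234</crossref>
</www>
<www key="www/org/example">
  <title>Some Other Web Page</title>
</www>
</dblp>"""


def collect_profile_keys(source):
    """Return (profile_keys, merged) for a dblp-style XML stream.

    profile_keys: keys of "Home Page" records without a crossref.
    merged: maps the key of a merge remnant to the key it points to.
    """
    profile_keys = []
    merged = {}
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 1:
            # Only direct children of the root element are records.
            continue
        if elem.tag == "www":
            key = elem.get("key", "")
            crossref = elem.findtext("crossref")
            if crossref is not None:
                merged[key] = crossref
            elif (key.startswith("homepages/")
                  and elem.findtext("title") == "Home Page"):
                profile_keys.append(key)
        # Drop the finished record's children to keep memory use low.
        elem.clear()
    return profile_keys, merged


keys, merged = collect_profile_keys(io.BytesIO(SAMPLE))
print(keys)
print(merged)
```

Treating every "www" record with a "crossref" element as a merge remnant is a simplification made for this sketch; a stricter filter may be appropriate for the real data.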

The second (more transient and ambiguous) kind of author ID is the text content of the "author" (and "editor") elements in the dblp.xml. That is, the whole string of the name (using HTML entities for non-ASCII characters), including any "magic" four-digit number we might have added to disambiguate entities. The exact same case-sensitive string always refers to the same author entity. An author entity, however, may have more than one string associated with it (aliases). All strings referring to the same author entity are collected in the "author" elements of its "Home Page" pseudo-record.
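To illustrate how the two kinds of identifiers relate, here is a minimal sketch that builds a lookup table from every name string listed in a "Home Page" record to that record's key, so that all aliases of one author resolve to the same persistent ID. The function name build_alias_index and the inline sample data are hypothetical, and skipping records that carry a "crossref" element is an assumption of this sketch. As with any use of Python's standard-library parser, the real dblp.xml will most likely need a DTD-aware parser to resolve its named character entities.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature stand-in for dblp.xml (keys and names are made up).
SAMPLE = b"""<dblp>
<article key="journals/example/Doe20">
  <author>Jane A. Doe</author>
  <title>An Example Article.</title>
</article>
<www key="homepages/00/1234">
  <author>Jane Doe 0001</author>
  <author>Jane A. Doe</author>
  <title>Home Page</title>
</www>
</dblp>"""


def build_alias_index(source):
    """Map each exact, case-sensitive name string to its profile key."""
    alias_to_key = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "www":
            continue
        is_profile = (elem.findtext("title") == "Home Page"
                      and elem.find("crossref") is None)
        if is_profile:
            for author in elem.findall("author"):
                # itertext() also covers names containing nested markup.
                name = "".join(author.itertext())
                alias_to_key[name] = elem.get("key")
        elem.clear()
    return alias_to_key


index = build_alias_index(io.BytesIO(SAMPLE))
# Both aliases resolve to the same persistent profile key.
print(index["Jane Doe 0001"] == index["Jane A. Doe"])
# Lookups are case-sensitive, so a lowercased name is not found.
print(index.get("jane doe 0001"))
```

With such an index, the "author" or "editor" string found on any publication record can be translated into the more stable "homepages/" key by an exact string lookup.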

a service of Schloss Dagstuhl - Leibniz Center for Informatics