Adding pageNumber elements to an Internet Archive-generated scandata.xml file

Symptom

After uploading a book to the Internet Archive (IA), some tasks are started on IA’s servers to generate all the metadata files and derived formats. As explained in the blog post Scandata.xml, on the Columbia University wiki, the generated bookid_scandata.xml file lacks pageNumber elements. As a result, it is not possible to directly access (or link to) a page number in IA’s online reader using a URL like:

Instead, one can only access page 10 of the book with the IA id bookid using the following link (note the extra n before the page number):

Apart from being counterintuitive, this causes other problems, such as not being able to link the entries in the table of contents of a book listed on Open Library.

Cause

This is caused by missing pageNumber children inside page elements:

A modified page element looks like this:
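As a rough illustration of the change, the Python sketch below appends a pageNumber child to a single page element. The sample element is hypothetical: the leafNum attribute, the pageType child and the values used are assumptions for illustration, not copied from a real scandata.xml file.

```python
import xml.etree.ElementTree as ET

# Hypothetical page element as it might appear in a generated scandata.xml.
# The leafNum attribute and pageType child are assumptions for illustration.
ORIGINAL = '<page leafNum="17"><pageType>Normal</pageType></page>'


def with_page_number(page_xml: str, number: int) -> str:
    """Return the page element with a pageNumber child appended."""
    page = ET.fromstring(page_xml)
    child = ET.SubElement(page, "pageNumber")
    child.text = str(number)
    # encoding="unicode" yields a str without an XML declaration.
    return ET.tostring(page, encoding="unicode")


modified = with_page_number(ORIGINAL, 10)
print(modified)
```

The only difference between the original and the modified element is the extra pageNumber child at the end.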

Solutions

The blog post mentions the use of a simple XSL stylesheet which, together with the original scandata.xml file and an XSLT processor, adds the appropriate pageNumber elements to each page parent.
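For readers without an XSLT processor at hand, the same idea can be sketched with Python's standard library. This is not the stylesheet from the blog post. It assumes that pages carry a leafNum attribute, that the file uses no XML namespace, and that page numbers run consecutively from a chosen leaf. The function names and the first_numbered_leaf and first_page_number parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_page_numbers(root, first_numbered_leaf=0, first_page_number=1):
    """Add a pageNumber child to every page element that lacks one.

    Assumption: numbering is consecutive, starting with first_page_number
    at the leaf whose leafNum equals first_numbered_leaf.
    Returns the number of elements added.
    """
    added = 0
    for page in root.iter("page"):
        if page.find("pageNumber") is not None:
            continue  # keep page numbers that are already present
        leaf_attr = page.get("leafNum")
        if leaf_attr is None:
            continue  # cannot derive a number without a leaf index
        leaf = int(leaf_attr)
        if leaf < first_numbered_leaf:
            continue  # covers and front matter stay unnumbered
        number = first_page_number + (leaf - first_numbered_leaf)
        ET.SubElement(page, "pageNumber").text = str(number)
        added += 1
    return added


def process_file(src, dst, **kwargs):
    """Read a scandata XML file, add page numbers, write the result to dst."""
    tree = ET.parse(src)
    count = add_page_numbers(tree.getroot(), **kwargs)
    tree.write(dst, encoding="utf-8", xml_declaration=True)
    return count


# Small hypothetical document to exercise the function.
SAMPLE = """<book><pageData>
<page leafNum="0"><pageType>Cover</pageType></page>
<page leafNum="1"><pageType>Normal</pageType></page>
<page leafNum="2"><pageType>Normal</pageType><pageNumber>2</pageNumber></page>
</pageData></book>"""

root = ET.fromstring(SAMPLE)
added = add_page_numbers(root, first_numbered_leaf=1)
print(added, "pageNumber element(s) added")
```

Real books rarely start numbering at leaf 0, so the offset parameters would need to be chosen per book. Pages that already have a pageNumber are left untouched.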

The bash script below does the same, and can be run directly from the web, given only the IA book id:

So for a book with the id originofspecies00darwuoft, the following would add the pageNumber elements to its originofspecies00darwuoft_scandata.xml file:

You can also use it with a file you have downloaded:

Afterwards, the new script-generated XML file should be uploaded to your book resource page on the Internet Archive (with the name bookid_scandata.xml), replacing the original scandata XML file. The Internet Archive then triggers tasks that make the desired URLs available.
