New Zealand dictionary biography online free

About this Website

The Dictionary of New Zealand Biography (1940) by Guy Hardy Scholefield is available on the NZ History website as two PDFs (Volume 1, Volume 2). Surprisingly, the dictionary does not include any index of names, and while a text search is possible (thanks to OCR), quickly locating specific entries is difficult.

I decided to index the Dictionary so that names could be matched against searches on my 🌳 Ancestor Search Helper site. This project was also a way to try out various AI tools in my workflow.

Copyright

My understanding is that the Dictionary is released under the Creative Commons Attribution-NonCommercial 3.0 New Zealand Licence (according to the NZ History website footer).

I offer this indexed version of Scholefield's Dictionary of New Zealand Biography as a freely available resource, and I believe that this is consistent with the spirit of the Creative Commons licence.

Indexing Notes

Despite my initial expectations that the indexing would be straightforward, I hit an early hurdle: simply copying and pasting the text of the PDFs appeared to work, but on closer inspection many entries were garbled, because the original OCR did not separate the columns of text correctly.

A fresh OCR was performed on the PDF files with WebAssembly PDF Viewer and Editor. This produced plain TXT files containing 50 pages of PDF at a time, which could be combined into volumes 1 and 2.

The new OCR output was largely acceptable, apart from a few pages where the scan was skewed and the columns again became confused. These pages were rectified manually.

Cleanup proceeded with removing lines containing single capitalised words (ie, 'MURTON', the index at the top of each page) and removing the printed page number.
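As a rough illustration of this cleanup step, here is a minimal Python sketch (the actual work may have been done differently). It assumes that a running header is a line holding exactly one all-capitals word and that a printed page number is a line holding only digits. The function name and both patterns are hypothetical.

```python
import re

# Assumption: a running header is one all-caps word alone on a line,
# optionally starting with "Mc" (eg, MURTON or McDONNELL).
HEADER_RE = re.compile(r"(?:Mc)?[A-Z][A-Z'-]+")
# Assumption: a printed page number is a line of one to four digits.
PAGE_NUMBER_RE = re.compile(r"\d{1,4}")


def strip_page_furniture(text):
    """Drop running headers and printed page numbers, keep everything else."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if HEADER_RE.fullmatch(stripped) or PAGE_NUMBER_RE.fullmatch(stripped):
            continue  # page furniture, not biography text
        kept.append(line)
    return "\n".join(kept)
```

Because the pattern must match the whole line, an entry heading such as 'MURTON, JOHN ...' is left alone; only a line consisting of the single word is removed.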

The next step was to index the entries by name and place them into a CSV (spreadsheet) file. A simple PHP script was written with the aid of GPT-4. Each entry was identified by finding capitalised words at the beginning of a line (eg, SAVAGE, MICHAEL JOSEPH):

  • The surname and forenames were distinguished by the comma (SURNAME, FORENAME) and used in the primary index columns.
  • The end of the name was identified by brackets or lowercase letters.
  • Allowances were made for the lowercase c used in, eg, 'McDONNELL'.
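The rules above can be sketched with a single regular expression. This is a Python approximation, not the original PHP script; the names WORD, ENTRY_RE and parse_entry_heading are hypothetical, and the requirement that each name word has at least two capital letters is my own assumption to avoid matching ordinary sentences that start with 'A' or 'I'.

```python
import re

# One name word: capitals with optional apostrophes or hyphens,
# optionally prefixed by "Mc" (the lowercase c allowance).
WORD = r"(?:Mc)?[A-Z][A-Z'\-]+"

ENTRY_RE = re.compile(
    rf"^(?P<surname>{WORD}(?: {WORD})*)"          # SURNAME (may be several words)
    rf"(?:, (?P<forenames>{WORD}(?: {WORD})*))?"  # optional ", FORENAMES"
    r"(?=\s*\(|\s+[a-z]|\s*$)"                    # name ends at a bracket or lowercase text
)


def parse_entry_heading(line):
    """Return (surname, forenames) if the line starts an entry, else None."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    return match.group("surname"), match.group("forenames") or ""
```

A line such as 'SAVAGE, MICHAEL JOSEPH (...' yields the pair ('SAVAGE', 'MICHAEL JOSEPH'), while a line of ordinary prose returns None. One-word names come back with an empty forenames field, which is where manual review would still be needed.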

A fair bit of manual editing was done for Māori chiefs, who were often recorded with one-word names, and/or multiple aliases. Aliases were indexed into a separate 'Also Known As' column.

The PDF page numbers were inserted into the scanned TXT file by the OCR software, and these were recognised by a preg_match call and saved against each entry, before the page number marker was removed from the text of the biography.
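A sketch of the idea in Python, where re.match plays the role that preg_match plays in PHP. The exact marker written by the OCR software is not recorded here, so the '=== Page N ===' format below is purely hypothetical, as is the function name.

```python
import re

# HYPOTHETICAL marker format; substitute whatever the OCR tool actually emits.
PAGE_MARKER_RE = re.compile(r"^=== Page (\d+) ===$")


def tag_lines_with_pages(text):
    """Pair every text line with the PDF page it came from, dropping the markers."""
    current_page = None  # stays None until the first marker is seen
    tagged = []
    for line in text.splitlines():
        match = PAGE_MARKER_RE.match(line.strip())
        if match:
            current_page = int(match.group(1))
            continue  # the marker itself is not part of any biography
        tagged.append((current_page, line))
    return tagged
```

With each line carrying its page number, the page for an entry is simply the page of the line on which its heading was detected.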

Further cleanup included:

  • Re-combining hyphen-ated words split across lines
  • Removing the original linebreaks to turn the text of each biography into continuous text, while preserving the original paragraphs
  • Fixing common OCR errors via search-and-replace, such as 'tlle' for 'the' and ''/Vanganui' for 'Wanganui'
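These three cleanup steps could look something like the following Python sketch. It assumes that paragraphs are separated by blank lines in the OCR output, which may not hold for every page; the function name and the OCR_FIXES table are hypothetical, and only the 'tlle' fix from the list above is included.

```python
import re

# Known OCR misreadings and their corrections (extend as more are found).
OCR_FIXES = {"tlle": "the"}


def clean_biography(text):
    """Rejoin split words, unwrap lines within paragraphs, fix known OCR errors."""
    # 1. Rejoin words hyphenated across a line break: "set-\ntlement" -> "settlement".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # 2. Unwrap lines, treating blank lines as paragraph boundaries (assumption).
    paragraphs = re.split(r"\n\s*\n", text)
    unwrapped = [" ".join(paragraph.split()) for paragraph in paragraphs]
    text = "\n\n".join(p for p in unwrapped if p)
    # 3. Whole-word search-and-replace for known OCR errors.
    for wrong, right in OCR_FIXES.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text
```

One caveat with step 1: a genuinely hyphenated compound that happens to break at the end of a line loses its hyphen too, so the results still need a manual check.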

Inspection of the resulting CSV revealed that hyphenated names and very long names which spanned multiple lines had often not been detected correctly, requiring manual fixes.

Lastly, each entry was given a unique 'handle' for URL purposes (eg, william-bayly-2).
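One way to generate such handles, sketched in Python. The forenames-then-surname order and the numeric suffix follow the 'william-bayly-2' example; leaving the first occurrence without a suffix and stripping macrons are my own assumptions, and make_handle is a hypothetical name.

```python
import re
import unicodedata


def make_handle(forenames, surname, seen):
    """Build a URL-safe handle; `seen` counts how often each base has been used."""
    name = f"{forenames} {surname}".strip()
    # Strip diacritics (eg, macrons) so the handle is plain ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Lowercase, and collapse every run of other characters into one hyphen.
    base = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    seen[base] = seen.get(base, 0) + 1
    # Assumption: the first occurrence has no suffix, later ones get -2, -3, ...
    return base if seen[base] == 1 else f"{base}-{seen[base]}"
```

The same dictionary must be passed to every call, since it is what keeps the handles unique across the whole index.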

The final output is a pretty clean CSV file, from which the content of this site is retrieved.

Closing notes

The majority of entries have at least a date of death and often a date of birth, but the date formatting is too inconsistent to reliably index with an algorithm.

The text of the Introduction was straightforward to convert from plain text to HTML, with the exception of the long table of sources in the Bibliography section, which was difficult to transcribe accurately without a lot of manual correction. Eventually I was able to use the Claude 3 'Opus' model, which transcribed screenshots of the Bibliography pages directly into HTML with excellent accuracy.

About Me

I'm Luke Howison, a web developer based in Lower Hutt. I'm building a suite of free digital research tools.