Report on comparative search results
following overProof correction of 10 million NLA newspaper articles


Background

In June 2021, the National Library of Australia contracted Project Computing Pty Ltd to automatically correct at least 60% of the OCR errors in the digitised text of 10 million newspaper articles from the Trove repository using overProof [paper]. The "first" 10 million articles without any existing human text corrections (that is, those with the lowest article ids) were processed.

The majority of these articles were of category "Article" (7.4m), with 1.4m categorised as "Advertising" and 1.2m as "Detailed Lists, Results, Guides". Articles came from 32 different newspaper titles with the most common being The Argus (3.8m), Sydney Morning Herald (2.8m), The Hobart Mercury (1.2m), The Adelaide Advertiser (0.7m) and The Canberra Times (0.6m). There were 6.37 billion words in these 10 million articles, of which 1.65 billion were corrected.

The main reason to use overProof is to improve search recall on the digitised newspaper text: with an estimated OCR word-error rate of around 25%, removing 60% or more of the errors reduces the error rate to around 10%, which yields a very significant improvement in search recall, particularly for longer words and multi-word searches. A secondary benefit is improved readability of the text.

The corrected text has now replaced the publicly searchable version in the NLA newspapers database. "Before" and "after" versions of the text were stored on the overProof servers, and these versions are used by the comparative search engine (below) to show the improvement in recall when searching the corrected text.

Summary

This is a comparison of search recall, not word correction rates. For example, if an uncorrected document contains one instance of the correctly OCR'ed word Leichardt as well as two erroneous OCRs of the same word, a search for Leichardt will still find the document. Correcting the other two instances may improve relevance ranking and some phrase-search recall, but it will not change the recall of a single-word search for Leichardt.

For the public, it is the improvement in recall rather than the word correction rate that matters most, because articles containing no correctly OCR'ed instance of the target words cannot easily be found.
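
As a minimal sketch of that distinction (assuming, purely for illustration, that OCR errors strike word instances independently, which real OCR does not guarantee), the chance of a single-word search finding a document rises quickly with the number of instances of the word it contains, so many word corrections do not translate into newly findable articles:

    # Toy model: a document is findable by a single-word search if at least one
    # instance of the word was OCR'ed (or corrected) correctly.  The error rates
    # below are the report's rough estimates; instance counts are illustrative.

    def p_document_found(word_error_rate: float, instances: int) -> float:
        """Probability that at least one of `instances` occurrences is correct."""
        return 1.0 - word_error_rate ** instances

    for n in (1, 2, 3):
        before = p_document_found(0.25, n)  # ~25% word-error rate before correction
        after = p_document_found(0.10, n)   # ~10% after overProof correction
        print(f"{n} instance(s): findable before={before:.3f}, after={after:.3f}")

In this toy model, documents containing several instances of a word are usually findable even before correction, which is why document-level recall and word correction rate are different measures.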

For an analysis of word correction rates using overProof, refer to the NLA trial output that compared overProof run on uncorrected text with human-corrected text, which demonstrated overProof's raw word correction rate of 69%.

Recall improvement is a function of the number of search words and the character count of those words: every word in a multi-word phrase must be readable (or corrected) for the article to be found, and longer words offer more opportunities for character errors. A rough indicator of the improvement, under a deliberately simplified model, is sketched below.
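
The following back-of-envelope sketch assumes independent character errors; the per-character rates are illustrative values chosen so that the word-error rate on a typical six-letter word is roughly 25% before and 10% after correction, as estimated above. It shows how the chance of a phrase occurrence surviving OCR intact falls with the number and length of the search words:

    # Rough indicator of recall improvement for a multi-word phrase search,
    # assuming independent character errors.  A phrase occurrence is findable
    # only if every word in it is error free.  Rates here are illustrative.

    def p_word_ok(char_error_rate: float, word_length: int) -> float:
        """Probability a word of `word_length` characters has no OCR error."""
        return (1.0 - char_error_rate) ** word_length

    def p_phrase_ok(char_error_rate: float, word_lengths: list[int]) -> float:
        """Probability every word in the phrase is error free."""
        p = 1.0
        for length in word_lengths:
            p *= p_word_ok(char_error_rate, length)
        return p

    lengths = [8, 5]                      # e.g. a two-word name search
    before = p_phrase_ok(0.045, lengths)  # ~25% word-error rate on a 6-letter word
    after = p_phrase_ok(0.017, lengths)   # ~10% word-error rate on a 6-letter word
    print(f"phrase findable before={before:.2f}, after={after:.2f}, "
          f"rough improvement={after / before:.1f}x")

Under this simplification, the longer the words and the more words in the phrase, the larger the relative improvement in recall.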

Consider a search for two randomly selected names, Reginald Evans and Dorothy Giles, typical of searches conducted by someone researching family history:

Before correction, a search on these 10m articles for Reginald Evans returned 8 results. After correction, the same set of articles returned 21 results. Of the additional 13 results, 11 are of value to the searcher but 2 are "false positives" created by overProof. The first false positive was a correction of Regruñid Lvmis to Reginald Evans, which overProof assessed as slightly more likely than Reginald Lyons, the reading that was in fact correct on this occasion. The second corrected ltrgtrald Ivans to Reginald Evans (Evans was correct, but ltrgtrald should have been corrected to Fitzgerald rather than Reginald).

Before correction, a search on these 10m articles for Dorothy Giles returned 10 results. After correction, the same set of articles returned 20 results; the 10 additional results can be examined by performing the search below.

Bella Lavender (aka Bella Guerin) was a prominent activist during the late 19th and early 20th centuries, and the first female graduate of the University of Melbourne. Newspaper references to her are likely to be of interest to social historians and other researchers. Automatic correction increases the number of articles found when searching for Bella Lavender from 22 to 41, with only two of the 19 newly uncovered articles being false positives (although one other is irrelevant, being the adjacent names of the horses "Bella" and "Lavender" in a form guide). A search on the uncorrected text for Bella Guerin returns just 2 articles, but on the corrected text returns 14 articles, of which 10 refer to this Bella Guerin and 4 refer to an actress in plays from the 1840s and 1850s.

A search for David Unaipon (Aboriginal inventor, author and preacher) returns just 6 articles on the uncorrected text. After correction, 18 articles are returned, of which 2 are "false positives". The 10 additional genuine articles more than double the content found in this 10m article subset and, applied to the entire newspaper corpus, represent valuable needles in the Trove haystack.

Name components are some of the most difficult words for overProof to correct because they occur in combination with a great many other names. There is little to differentiate, statistically or contextually, between "Reginald Evans" and "Reginald Lyons", for example: both have comparable "language" probability, which is why the correction of Regruñid Lvmis above is difficult, and the OCR-error model likewise finds little to choose between Evans and Lyons as the more likely reading of the OCR'ed Lvmis.
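
As a toy illustration of that ambiguity (a generic noisy-channel sketch only: the surname frequencies and the per-edit penalty below are invented, and this is not overProof's actual model), a corrector scores each candidate by combining a "language" probability with an error-model probability, and for the Lvmis example the two candidates score almost identically:

    # Toy noisy-channel scoring:
    #   score(candidate) = P_language(candidate) * P_error(ocr_word | candidate)
    # The frequencies and the edit-distance penalty are invented for illustration.

    def edit_distance(a: str, b: str) -> int:
        """Plain Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    # Hypothetical relative frequencies of the candidate surnames.
    language_prob = {"Evans": 0.0012, "Lyons": 0.0010}

    def score(ocr_word: str, candidate: str) -> float:
        # Crude error model: each edit operation is taken to be ~10x less likely.
        return language_prob[candidate] * 0.1 ** edit_distance(ocr_word, candidate)

    for candidate in ("Evans", "Lyons"):
        print(candidate, score("Lvmis", candidate))

In this sketch both candidates sit at the same edit distance from Lvmis, so the decision hinges on a tiny difference in "language" probability, which is why such corrections are occasionally wrong.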

General words tend to be easier, which is why, for example, the sample searches below for Lygon Street Carlton, kanaka trade and chocolate factory show very few false positives. Words that form part of quite common contexts, such as those making up Lygon Street Carlton, Sydney Harbour Bridge and Flinders Street Station, are typically corrected extremely reliably by overProof, even when the original OCR is severely corrupted.

The cost of automatic correction of the text on a newspaper page is trivial compared to the up-front cost of digitisation and the ongoing costs of storage and delivery. The speed of automatic correction is orders of magnitude faster than even a huge crowd-sourced team: at current rates of correction, NLA's existing newspaper corpus will take over 110 years to correct manually. By automatically correcting two-thirds of OCR errors, overProof can reduce the time needed for human corrections and, as demonstrated by this comparative search engine, allow current and future generations of searchers to uncover a lot more "gold", greatly enhancing the public utility of this magnificent resource.

Sample searches

The following table lists article counts found for sample searches before and after correction of the 10m article corpus by overProof. The meanings of the table columns are:

  Search term - the search, run as a phrase query with a slop of 2
  Original - articles matching the search in the original (uncorrected) text
  Both original and corrected - articles matching in both the original and the corrected text
  Only original - articles matching only in the original text
  Only corrected - articles matching only in the corrected text
  Corrected - articles matching the search in the corrected text
  Recall improvement - the improvement in recall going from the original to the corrected text

Notes:

  1. This search comparator uses an index built from the 10 million articles processed, not the entire NLA newspaper corpus. It demonstrates the effect of overProof correction on just the articles it processed.
  2. This search comparator uses a default configuration of a SOLR v8.9 search engine with the default SOLR word tokenizer and no word stemming (a sketch of such a slop-2 phrase query appears after these notes). This differs from NLA's custom SOLR configuration, so although search result counts will be comparable, they will not be identical.
  3. Because of the way articles were selected for processing, articles from the 19th and early 20th centuries predominate: there is comparatively little post-1940 content.
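
As a hedged sketch of how such a before-and-after comparison can be run (the field names text_original and text_corrected, the core name and the host below are assumptions for illustration, not the comparator's actual schema), a slop-2 phrase query against a SOLR select handler might look like this:

    # Sketch of a before/after recall comparison against a SOLR 8.x index.
    # Assumes hypothetical fields "text_original" and "text_corrected" holding
    # the uncorrected and overProof-corrected article text, and a local core
    # named "newspapers"; adjust to the real schema and host.
    import requests

    SOLR = "http://localhost:8983/solr/newspapers/select"

    def count(field: str, phrase: str, slop: int = 2) -> int:
        """Number of articles matching `phrase` (with the given slop) in `field`."""
        params = {
            "q": f'{field}:"{phrase}"~{slop}',  # Lucene phrase query with slop
            "rows": 0,                          # counts only, no documents
            "wt": "json",
        }
        return requests.get(SOLR, params=params).json()["response"]["numFound"]

    for phrase in ("Reginald Evans", "Dorothy Giles"):
        print(phrase, count("text_original", phrase), count("text_corrected", phrase))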

Search term (as a phrase, slop 2) | Original | Both original and corrected | Only original | Only corrected | Corrected | Recall improvement (original -> corrected)
Reginald Evans 8 8 0 13 21 x
Dorothy Giles 10 10 0 10 20 x
Bella Lavender 22 22 0 19 41 x
Bella Guerin 2 2 0 12 14 x
Miles Franklin 161 158 3 135 293 x
Henry Lawson 707 688 19 393 1081 x
Richard Mahoney 14 14 0 17 31 x
George Birchall 11 11 0 17 28 x
Dame Nellie Melba 1228 1201 27 963 2164 x
Henry Parkes 14370 14314 56 6067 20381 x
David Unaipon 6 6 0 12 18 x
Alfred Deakin 1781 1776 5 917 2693 x
Albert Einstein 38 38 0 31 69 x
Queen Victoria 33569 33371 198 19175 52546 x
Viscount Lascelles 133 133 0 50 183 x
James Cook 3294 3250 44 1473 4723 x
W L Baillieu 194 192 2 1446 1638 x
Baillieu Trust 8 8 0 13 21 x
Baillieu Education Trust 4 4 0 8 12 x
Disraeli 7815 7723 92 1465 9188 x
Leichardt 4542 4425 117 2389 6814 x
Kaiser Wilhelm 1236 1221 15 727 1948 x
WM Hughes 496 489 7 383 872 x
Prime Minister Hughes 4644 4632 12 6066 10698 x
Lloyd George Hughes 33 33 0 21 54 x
Archbishop Mannix 1352 1351 1 2365 3716 x
Lygon Street Carlton 2846 2827 19 7064 9891 x
Flinders Street Station 11464 11443 21 15016 26459 x
Sydney Harbour Bridge 430 430 0 429 859 x
Broken Hill railway 406 404 2 295 699 x
Sydney ferries 3243 3242 1 3988 7230 x
Sydney fish markets 6 6 0 9 15 x
William Street Woolloomooloo 1693 1672 21 2539 4211 x
Italy 88815 88430 385 62568 150998 x
Mount Isa 3883 3844 39 1969 5813 x
Kalgoorlie 49805 49796 9 10437 60233 x
Adelaide University 3424 3405 19 2019 5424 x
Myall Creek Station 28 28 0 26 54 x
New South Wales 507111 506665 446 334201 840866 x
Keira Street Wollongong 13 12 1 30 42 x
Mackenzie Street Bendigo 2 2 0 7 9 x
Venezia Hotel Zeehan 1 1 0 4 5 x
Imperial Hotel Queenstown 115 115 0 120 235 x
Intercolonial conference 3087 3083 4 1821 4904 x
robbery under arms 1515 1515 0 1286 2801 x
kanaka trade 68 68 0 36 104 x
pearling Torres Strait 6 6 0 4 10 x
Torres Strait pearling 2 2 0 2 4 x
British import duties 15 15 0 18 33 x
margarine butter 352 351 1 224 575 x
chocolate factory 85 85 0 99 184 x
MacRobertson's 134 129 5 199 328 x
zinc mine 234 226 8 342 568 x
Royal Australian Air Force 1300 1299 1 1310 2609 x
Chief Protector of Aborigines 135 135 0 156 291 x
Electricity Supply Association of Australia 5 5 0 9 14 x
Olympic Games Paris 53 53 0 87 140 x
Bathurst Goal 17 16 1 23 39 x


Project Computing Pty Ltd