overProof run on 10M NLA articles - comparative search results

In June 2021, the National Library of Australia contracted Project Computing Pty Ltd to automatically correct at least 60% of the OCR errors in the digitised text of 10 million newspaper articles from the Trove repository using overProof [paper]. The "first" (that is, those with the lowest article id) 10 million articles without any existing human text corrections were processed.

The majority of these articles were of category "Article" (7.4m), with 1.4m categorised as "Advertising" and 1.2m as "Detailed Lists, Results, Guides". Articles came from 32 different newspaper titles with the most common being The Argus (3.8m), Sydney Morning Herald (2.8m), The Hobart Mercury (1.2m), The Adelaide Advertiser (0.7m) and The Canberra Times (0.6m). There were 6.37 billion words in these 10 million articles, of which 1.65 billion were corrected.

The main reason to use overProof is to improve search recall on the digitised newspaper text. With an estimated OCR word-error rate of around 25%, removing 60% or more of the errors reduces the error rate to around 10%. This yields a very significant improvement in search recall, particularly for longer words and multi-word searches. A secondary benefit is improved readability of the text.

The corrected text has now replaced the publicly searchable version in the NLA newspapers database. "Before" and "after" versions of the text were stored on the overProof servers, and these versions are used by the comparative search engine (below) to show the improvement in recall when searching the corrected text.

This is a comparison of search recall, not word correction rates. For example, if an uncorrected document contains one correctly OCR'ed instance of the word Leichardt and two mis-OCR'ed instances of the same word, a search for Leichardt will still find the document. Correcting the other two instances may improve relevance ranking and some phrase search recall, but it won't change the recall of a single-word search for Leichardt.
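The distinction can be illustrated with a minimal sketch. The garbled spellings and the two toy documents below are invented for illustration and are not taken from the corpus; the point is only that a document counts towards recall as soon as one instance of the term is correct.

```python
def matches(doc_tokens, term):
    """A document matches if at least one token equals the search term."""
    return term.lower() in (t.lower() for t in doc_tokens)


def recall_count(docs, term):
    """Number of documents (not words) that match the term."""
    return sum(matches(doc, term) for doc in docs)


# Hypothetical documents: the mis-OCR'ed spellings are made up.
before_docs = [
    ["Leichardt", "Lcichardt", "Leicbardt"],  # one good instance, two bad
    ["Lcichardt", "expedition"],              # no good instance
]
after_docs = [
    ["Leichardt", "Leichardt", "Leichardt"],  # two words corrected, recall unchanged
    ["Leichardt", "expedition"],              # one word corrected, document now found
]

print(recall_count(before_docs, "Leichardt"))  # 1
print(recall_count(after_docs, "Leichardt"))   # 2
```

Three words were corrected in this toy example, but only one of those corrections changed the number of documents found.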

For the public, it is the improvement in recall rather than word correction rate that is most important, because articles with no correctly OCR'ed versions of the target words cannot be easily found.

For an analysis of word correction rates using overProof, refer to the NLA trial output that compared overProof run on uncorrected text with human-corrected text, which demonstrated overProof's raw word correction rate of 69%.

Recall improvement is a function of the number of search words and the character count of those words. A rough indicator of improvement is:

Single word searches: recall improves by between 15% and 50%
2 word phrase searches: recall improves by between 35% and 150%
3+ word phrase searches: recall improves by over 60%
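One way to see why improvement grows with the number of search words is a simple back-of-envelope model. The sketch below assumes that word errors are independent and that a phrase is found only if every word in a single occurrence is intact; both are simplifying assumptions made here, not claims about how overProof or the search index behaves. It uses the error rates quoted earlier (around 25% before correction, around 10% after removing 60% of errors).

```python
def residual_error_rate(error_rate, fraction_fixed):
    """Word-error rate left after fixing a given fraction of the errors."""
    return error_rate * (1.0 - fraction_fixed)


def phrase_intact_probability(error_rate, n_words):
    """Chance that all n words of one phrase occurrence are error-free,
    assuming independent word errors (a simplification)."""
    return (1.0 - error_rate) ** n_words


def relative_improvement(before_rate, after_rate, n_words):
    """Relative gain in the chance of an intact phrase occurrence."""
    before = phrase_intact_probability(before_rate, n_words)
    after = phrase_intact_probability(after_rate, n_words)
    return (after - before) / before


after_rate = residual_error_rate(0.25, 0.60)  # about 0.10
for n in (1, 2, 3):
    print(n, f"{relative_improvement(0.25, after_rate, n):.1%}")
```

Under these assumptions the model gives roughly 20% for one word, 44% for two words and 73% for three words, which sits inside the rough ranges listed above. Real results vary more widely because documents often contain several occurrences of a term and because error rates differ by word length and by title.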

Consider the search for two randomly selected names, Reginald Evans and Dorothy Giles, typical of a search that may be conducted by someone researching family history:

Before correction, a search on these 10m articles for Reginald Evans returned 8 results. After correction, the same set of articles returned 21 results. Of the additional 13 results, 11 are of value to the searcher but 2 are "false positives" created by overProof. The first false positive was a correction of Regruñid Lvmis to Reginald Evans, which overProof assessed as slightly more likely than Reginald Lyons; on this occasion, Reginald Lyons was the correct reading. The second corrected ltrgtrald Ivans to Reginald Evans (Evans was correct, but ltrgtrald should have been corrected to Fitzgerald rather than Reginald).

Before correction, a search on these 10m articles for Dorothy Giles returned 10 results. After correction, the same set of articles returned 20 results. Examining the 10 additional results (which you may also do by performing the search below):

5 articles would be of interest to the family historian: 1671095, 4201293, 11298952, 11357261, 11697413
1 article matches, but is unlikely to be of interest to this searcher: 5181641 (race results)
2 articles have been accurately corrected, but the phrase slop search is returning forenames and surnames of people other than Dorothy Giles: 15997930, 16097162
2 articles are "false positives" created by overProof:
- 11017334: the original OCR text of .. Joan C1I1I Doro thv .. was corrected by overProof's algorithms to the statistically and contextually most likely text of .. Joan GILES Dorothy ..., which is a good guess, partially correct, but partially wrong (as the page image shows C1I1I is really Cliff)
- 12810699: OCR of .. Doiothys, Gi'Is' Drosses, .. has been mis-corrected to .. Dorothy, Giles' Dresses, .. (Dorothy and Dresses are both good corrections, Giles' is not)
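Several of the results above depend on how a phrase search with "slop" behaves. The sketch below is a simplified model of sloppy phrase matching, written on the assumption that it approximates the Lucene/SOLR behaviour for queries without repeated terms: each term position is shifted by the term's offset in the query, and the match succeeds if the spread of the shifted positions is no more than the slop. The function name and the sample token lists are illustrative only.

```python
from itertools import product


def sloppy_phrase_match(tokens, phrase, slop):
    """Return True if the phrase terms occur within `slop` position moves."""
    tokens = [t.lower() for t in tokens]
    terms = [t.lower() for t in phrase.split()]
    # All positions at which each query term occurs in the document.
    positions = [[i for i, tok in enumerate(tokens) if tok == term]
                 for term in terms]
    if any(not p for p in positions):
        return False
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # two query terms cannot share one token position
        # Shift each position by the term's offset within the query.
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


print(sloppy_phrase_match("Joan Giles Dorothy".split(), "Dorothy Giles", 2))        # True
print(sloppy_phrase_match("Dorothy Smith Giles Brown".split(), "Dorothy Giles", 2))  # True
print(sloppy_phrase_match("Joan Giles Dorothy".split(), "Dorothy Giles", 0))        # False
```

With a slop of 2, adjacent terms in reversed order still match, as do a forename and surname separated by one other word. This is consistent with the matches on other people's names and on "GILES Dorothy" described above.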

Bella Lavender (aka Bella Guerin) was a prominent activist during the late 19th and early 20th century, and the first female graduate of Melbourne University. Newspaper references to her are likely to be of interest to social historians and other researchers. Automatic correction increases the number of articles found when searching for Bella Lavender from 22 to 41, with only two of the newly uncovered 19 articles being false positives (although one other is irrelevant, being the adjacent names of horses "Bella" and "Lavender" in a form guide). A search on uncorrected text for Bella Guerin returns just 2 articles, but on the corrected text returns 14 articles, of which 10 refer to "this" Bella Guerin and 4 refer to an actress in a play from the 1840s and 1850s.

A search for David Unaipon (Aboriginal inventor, author and preacher) returns just 6 articles on uncorrected text. After correction, 18 articles are returned, of which 2 are "false positives". The 10 genuine additional articles more than double the content found in this 10m article subset and, if the same gain applied across the entire newspaper corpus, would represent valuable needles in the Trove haystack.

Name components are some of the most difficult words for overProof to correct because they occur in combination with a great many other names. There is little to differentiate, statistically and contextually, between "Reginald Evans" and "Reginald Lyons", for example: both have comparable "language" probability. This is why the correction of Regruñid Lvmis above is difficult, as the OCR-error model also finds little to differentiate between Evans and Lyons as the more likely source of the OCR'ed Lvmis.
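The ambiguity can be made concrete with a crude stand-in for an OCR-error model. The sketch below uses plain character edit distance, which is an assumption made here for illustration and not overProof's actual error model (a real model would presumably weight confusions by how similar the glyphs look). The prior values are invented placeholders for "language" probability.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def rank_candidates(observed, priors):
    """Order candidates by edit distance, then by higher (hypothetical) prior."""
    return sorted(priors, key=lambda c: (edit_distance(observed, c), -priors[c]))


# Hypothetical language-model priors: nearly equal, as described in the text.
priors = {"evans": 0.0011, "lyons": 0.0010}

print(edit_distance("lvmis", "evans"))   # 3
print(edit_distance("lvmis", "lyons"))   # 3
print(rank_candidates("lvmis", priors))  # ['evans', 'lyons']
```

Both surnames are three substitutions away from the OCR'ed text, so in this toy model the choice rests entirely on a very small difference in prior probability.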

General words tend to be easier, which is why, for example, the sample searches below for Lygon Street Carlton, kanaka trade and chocolate factory show very few false positives. Words that form part of common contexts, such as those making up Lygon Street Carlton, Sydney Harbour Bridge and Flinders Street Station, are typically corrected extremely reliably by overProof, even when the original OCR is severely corrupted.

The cost of automatic correction of text on a newspaper page is trivial compared to the up-front cost of digitisation and the ongoing costs of storage and delivery. Automatic correction is orders of magnitude faster than even a huge crowd-sourced team: at current rates of correction, NLA's existing newspaper corpus will take over 110 years to correct manually. By automatically correcting two-thirds of OCR errors, overProof can reduce the time needed for human corrections and, as demonstrated by this comparative search engine, allow current and future generations of searchers to uncover a lot more "gold", and in so doing greatly enhance the public utility of this magnificent resource.

The following table lists article counts found for sample searches before and after correction of the 10m article corpus by overProof. The meanings of the table columns are:

Original: total count of articles found before correction by overProof
Both original and corrected: count of common articles found before and after correction by overProof
Only original: count of articles found only before correction by overProof. Some of these may have been "false positives" before correction (eg, "Franklin Mills" being incorrectly OCRed as "Franklin Miles" and hence falsely matching a search for "Miles Franklin"). Others may have been incorrectly "corrected" by overProof to text that is statistically and contextually more likely but nevertheless incorrect in this instance. Some are also related to a name being corrected to a possessive form, as the index does not use stemming (eg, the original Mannix'« which matches Mannix, is accurately corrected to Mannix's, which does not).
Only corrected: count of articles found only after correction by overProof, that is, the count of articles where the search term is found after correction but not before correction. Some of these may also be "false positives".
Corrected: total count of articles found after correction by overProof
Recall improvement: percentage increase in number of articles found after correction by overProof
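The column definitions above amount to simple set arithmetic on the article ids returned by the two searches. The following sketch shows one way to derive them; the function name and the dictionary keys are chosen here for illustration, and the ids in the usage example are synthetic values picked only to reproduce the counts in the Miles Franklin row.

```python
def compare_results(original_ids, corrected_ids):
    """Derive the table columns from two sets of matching article ids."""
    original, corrected = set(original_ids), set(corrected_ids)
    n_orig, n_corr = len(original), len(corrected)
    return {
        "original": n_orig,
        "both": len(original & corrected),
        "only_original": len(original - corrected),
        "only_corrected": len(corrected - original),
        "corrected": n_corr,
        # Percentage increase in articles found; undefined if nothing was found before.
        "recall_improvement_pct": (100.0 * (n_corr - n_orig) / n_orig
                                   if n_orig else None),
    }


# Synthetic ids that reproduce the counts of the Miles Franklin row.
row = compare_results(range(161), range(3, 296))
print(row)
```

Note that the improvement is computed from the two totals, so articles lost after correction ("Only original") reduce it just as newly found articles increase it.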

Notes:

This search comparator uses an index built from the 10 million articles processed, not the entire NLA newspaper corpus. It is demonstrating the effect of overProof correction on just the articles it processed.
This search comparator uses a default configuration of a SOLR v8.9 search engine with the default SOLR word tokenizer and no word stemming. This is different from NLA's custom SOLR configuration, so although search result counts will be comparable, they will not be identical.
Because of the way articles were selected for processing, articles from the 19th and early 20th century predominate: there is comparatively little post-1940 content.
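For readers who want to reproduce counts like those in the table against their own index, a phrase search with a slop of 2 can be expressed in SOLR's standard query syntax as a quoted phrase followed by ~2. The sketch below only builds the request URL. The host, core name and field name are assumptions made for illustration and do not describe the comparator's actual configuration.

```python
from urllib.parse import urlencode


def phrase_query_url(base_url, field, phrase, slop=2):
    """Build a SOLR select URL for a sloppy phrase search.

    rows=0 asks for no documents, only the total hit count (numFound).
    """
    query = f'{field}:"{phrase}"~{slop}'
    params = {"q": query, "rows": 0, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names.
url = phrase_query_url("http://localhost:8983/solr/articles_before",
                       "text", "Dorothy Giles")
print(url)
```

Running the same query against a "before" and an "after" index and comparing the hit counts gives the Original and Corrected columns of the table.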

Search term as a phrase, slop 2	Original	Both original and corrected	Only original	Only corrected	Corrected	Recall improvement original -> corrected
Reginald Evans	8	8	0	13	21	162.5%
Dorothy Giles	10	10	0	10	20	100.0%
Bella Lavender	22	22	0	19	41	86.4%
Bella Guerin	2	2	0	12	14	600.0%
Miles Franklin	161	158	3	135	293	82.0%
Henry Lawson	707	688	19	393	1081	52.9%
Richard Mahoney	14	14	0	17	31	121.4%
George Birchall	11	11	0	17	28	154.5%
Dame Nellie Melba	1228	1201	27	963	2164	76.2%
Henry Parkes	14370	14314	56	6067	20381	41.8%
David Unaipon	6	6	0	12	18	200.0%
Alfred Deakin	1781	1776	5	917	2693	51.2%
Albert Einstein	38	38	0	31	69	81.6%
Queen Victoria	33569	33371	198	19175	52546	56.5%
Viscount Lascelles	133	133	0	50	183	37.6%
James Cook	3294	3250	44	1473	4723	43.4%
W L Baillieu	194	192	2	1446	1638	744.3%
Baillieu Trust	8	8	0	13	21	162.5%
Baillieu Education Trust	4	4	0	8	12	200.0%
Disraeli	7815	7723	92	1465	9188	17.6%
Leichardt	4542	4425	117	2389	6814	50.0%
Kaiser Wilhelm	1236	1221	15	727	1948	57.6%
WM Hughes	496	489	7	383	872	75.8%
Prime Minister Hughes	4644	4632	12	6066	10698	130.4%
Lloyd George Hughes	33	33	0	21	54	63.6%
Archbishop Mannix	1352	1351	1	2365	3716	174.9%
Lygon Street Carlton	2846	2827	19	7064	9891	247.5%
Flinders Street Station	11464	11443	21	15016	26459	130.8%
Sydney Harbour Bridge	430	430	0	429	859	99.8%
Broken Hill railway	406	404	2	295	699	72.2%
Sydney ferries	3243	3242	1	3988	7230	122.9%
Sydney fish markets	6	6	0	9	15	150.0%
William Street Woolloomooloo	1693	1672	21	2539	4211	148.7%
Italy	88815	88430	385	62568	150998	70.0%
Mount Isa	3883	3844	39	1969	5813	49.7%
Kalgoorlie	49805	49796	9	10437	60233	20.9%
Adelaide University	3424	3405	19	2019	5424	58.4%
Myall Creek Station	28	28	0	26	54	92.9%
New South Wales	507111	506665	446	334201	840866	65.8%
Keira Street Wollongong	13	12	1	30	42	223.1%
Mackenzie Street Bendigo	2	2	0	7	9	350.0%
Venezia Hotel Zeehan	1	1	0	4	5	400.0%
Imperial Hotel Queenstown	115	115	0	120	235	104.3%
Intercolonial conference	3087	3083	4	1821	4904	58.9%
robbery under arms	1515	1515	0	1286	2801	84.9%
kanaka trade	68	68	0	36	104	52.9%
pearling Torres Strait	6	6	0	4	10	66.7%
Torres Strait pearling	2	2	0	2	4	100.0%
British import duties	15	15	0	18	33	120.0%
margarine butter	352	351	1	224	575	63.4%
chocolate factory	85	85	0	99	184	116.5%
MacRobertson's	134	129	5	199	328	144.8%
zinc mine	234	226	8	342	568	142.7%
Royal Australian Air Force	1300	1299	1	1310	2609	100.7%
Chief Protector of Aborigines	135	135	0	156	291	115.6%
Electricity Supply Association of Australia	5	5	0	9	14	180.0%
Olympic Games Paris	53	53	0	87	140	164.2%
Bathurst Goal	17	16	1	23	39	129.4%
