In June 2021, the National Library of Australia contracted Project Computing Pty Ltd to automatically correct at least 60% of the OCR errors in the digitised text of 10 million newspaper articles from the Trove repository using overProof [paper]. The "first" (that is, those with the lowest article id) 10 million articles without any existing human text corrections were processed.
The majority of these articles were of category "Article" (7.4m), with 1.4m categorised as "Advertising" and 1.2m as "Detailed Lists, Results, Guides". Articles came from 32 different newspaper titles with the most common being The Argus (3.8m), Sydney Morning Herald (2.8m), The Hobart Mercury (1.2m), The Adelaide Advertiser (0.7m) and The Canberra Times (0.6m). There were 6.37 billion words in these 10 million articles, of which 1.65 billion were corrected.
The main reason to use overProof is to improve search recall on the digitised newspaper text. With an estimated OCR word-error rate of around 25%, removing 60% or more of the errors reduces the error rate to around 10%, which yields a very significant improvement in search recall, particularly for longer words and multi-word searches. A secondary benefit is improved readability of the text.
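The arithmetic behind these figures, and one deliberately simplified view of why multi-word searches benefit most, can be sketched as follows. The independence assumption (every word is mis-OCR'ed with the same probability, independently of its neighbours) and the function names are illustrative assumptions only; they are not part of overProof or of the NLA's measurements.

```python
def residual_error_rate(error_rate: float, fraction_removed: float) -> float:
    """Word-error rate left after removing a given fraction of the errors."""
    return error_rate * (1.0 - fraction_removed)


def phrase_intact_probability(error_rate: float, n_words: int) -> float:
    """Chance that one occurrence of an n-word phrase has every word OCR'ed correctly.

    Assumes (simplistically) that word errors are independent and equally likely.
    """
    return (1.0 - error_rate) ** n_words


before = 0.25                               # estimated OCR word-error rate quoted above
after = residual_error_rate(before, 0.60)   # 60% of errors removed, leaving about 0.10

for n_words in (1, 2, 3):
    p_before = phrase_intact_probability(before, n_words)
    p_after = phrase_intact_probability(after, n_words)
    print(f"{n_words}-word phrase intact: {p_before:.2f} before, {p_after:.2f} after")
```

Under this toy model the gap between "before" and "after" widens as the phrase gets longer, which is consistent with the observation that multi-word searches gain the most.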
The corrected text has now replaced the publicly searchable version in the NLA newspapers database. "Before" and "after" versions of the text were stored on the overProof servers, and these versions are used by the comparative search engine (below) to show the improvement in recall when searching the corrected text.
This is a comparison of search recall, not word correction rates. For example, if an uncorrected document contains one correctly OCR'ed instance of the word Leichardt and two mis-OCR'ed instances of the same word, a search for Leichardt will still find the document. Correcting the other two instances may improve relevance ranking and some phrase-search recall, but it won't change recall for a single-word search for Leichardt.
For the public, it is the improvement in recall rather than word correction rate that is most important, because articles with no correctly OCR'ed versions of the target words cannot be easily found.
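The difference between word correction and recall can be illustrated with a small sketch. The article snippets below are made up for the example, and the whole-word matcher is a stand-in for a real search engine, not the one used by Trove or overProof.

```python
import re


def contains_word(text: str, word: str) -> bool:
    """Case-insensitive whole-word match, standing in for a search engine hit."""
    return re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE) is not None


def hits(docs: list[str], word: str) -> int:
    """Number of documents in which the word is found at least once."""
    return sum(contains_word(doc, word) for doc in docs)


# Hypothetical snippets: the first article has one good and two bad OCR readings,
# the second has only a bad reading.
before = [
    "Leichardt left Sydney; Lciehardt's party followed. Leiehardt wrote home.",
    "News of the Lciehardt expedition reached Brisbane.",
]
after = [
    "Leichardt left Sydney; Leichardt's party followed. Leichardt wrote home.",
    "News of the Leichardt expedition reached Brisbane.",
]

print("articles found before:", hits(before, "Leichardt"))
print("articles found after: ", hits(after, "Leichardt"))
```

Three words are corrected in this example, but only the correction in the second article changes single-word recall; the first article was already findable.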
Recall improvement is a function of number of search words and the character-count of the words. A rough indicator of improvement is:
Consider the search for two randomly selected names, Reginald Evans and Dorothy Giles, typical of a search that may be conducted by someone researching family history:
Before correction, a search on these 10m articles for Reginald Evans returned 8 results. After correction, the same set of articles returned 21 results. Of the additional 13 results, 11 are of value to the searcher but 2 are "false positives" created by overProof. The first false positive was a correction of "Regruñid Lvmis" to "Reginald Evans", which overProof assessed as slightly more likely than "Reginald Lyons", the reading that was in fact correct on this occasion. The second corrected "ltrgtrald Ivans" to "Reginald Evans" (Evans was correct, but "ltrgtrald" should have been corrected to Fitzgerald rather than Reginald).
Before correction, a search on these 10m articles for Dorothy Giles returned 10 results. After correction, the same set of articles returned 20 results. Examining the 10 additional results (which you may also do by performing the search below):
- ".. Joan C1I1I Doro thv .." was corrected by overProof's algorithms to the statistically and contextually most likely text, ".. Joan GILES Dorothy ...", which is a good guess: partially correct, but partially wrong (as the page image shows, C1I1I is really Cliff).
- ".. Doiothys, Gi'Is' Drosses, .." has been mis-corrected to ".. Dorothy, Giles' Dresses, .." (Dorothy and Dresses are both good corrections; Giles' is not).
Bella Lavender (aka Bella Guerin) was a prominent activist during the late 19th and early 20th century, and the first female graduate of Melbourne University. Newspaper references to her are likely to be of interest to social historians and other researchers. Automatic correction increases the number of articles found when searching for Bella Lavender from 22 to 41, with only two of the 19 newly uncovered articles being false positives (although one other is irrelevant, being the adjacent names of horses "Bella" and "Lavender" in a form guide). A search on uncorrected text for Bella Guerin returns just 2 articles, but on the corrected text returns 14 articles, of which 10 refer to "this" Bella Guerin and 4 refer to an actress in a play from the 1840s and 1850s.
A search for David Unaipon (Aboriginal inventor, author and preacher) returns just 6 articles on uncorrected text. After correction, 18 articles are returned, of which 2 are "false positives". The 10 genuine additional articles more than double the content found in this 10m-article subset and, if the same gain applied to the entire newspaper corpus, would represent valuable needles in the Trove haystack.
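The counts quoted in the examples above can be turned into a quick tally of how many newly found articles are genuine. This is only a bookkeeping sketch: the helper name is hypothetical, and it assumes that every article found before correction is still found afterwards (which the table of results reports for these three searches).

```python
# (search term, hits before, hits after, false positives among the new hits),
# all taken from the figures quoted in the text above.
examples = [
    ("Reginald Evans", 8, 21, 2),
    ("Bella Lavender", 22, 41, 2),
    ("David Unaipon", 6, 18, 2),
]


def genuine_new_hits(before: int, after: int, false_positives: int) -> int:
    """Newly found articles that are not false positives."""
    return (after - before) - false_positives


for term, before, after, false_positives in examples:
    new = after - before
    genuine = genuine_new_hits(before, after, false_positives)
    print(f"{term}: {new} new articles, {genuine} genuine ({genuine / new:.0%})")
```

Note that the Bella Lavender tally still includes the one form-guide article that is a correct match but irrelevant to the researcher.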
Name components are some of the most difficult words for overProof to correct because they occur combined with a great many other names: there is little to differentiate, statistically and contextually, between, for example, "Reginald Evans" and "Reginald Lyons", both having comparable "language" probability. This is why the correction of "Regruñid Lvmis" above is difficult: the OCR-error model also finds little to differentiate between Evans and Lyons as the more likely representation of the OCR'ed Lvmis.
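One way to see why the error model struggles here is to measure how far the OCR'ed string is from each candidate. The sketch below uses plain Levenshtein edit distance as a crude stand-in for an OCR-error model; overProof's actual model is not described in this document and would presumably weight character confusions by how visually similar they are.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


ocr_text = "Lvmis"
for candidate in ("Evans", "Lyons"):
    print(candidate, levenshtein(ocr_text, candidate))
```

Both candidates are the same number of edits away from the OCR'ed text, so a distance-based measure alone cannot separate them; the decision falls to the language and context probabilities, which are also close for two common surnames.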
General words tend to be easier, which is why, for example, the sample searches below for Lygon Street Carlton, kanaka trade and chocolate factory show very few false positives. Words that form part of common contexts, such as those making up Lygon Street Carlton, Sydney Harbour Bridge and Flinders Street Station, are typically corrected extremely reliably by overProof, even when the original OCR is severely corrupted.
The cost of automatic correction of text on a newspaper page is trivial compared to the up-front cost of digitisation and the ongoing costs of storage and delivery. Automatic correction is also orders of magnitude faster than even a huge crowd-sourced team: at current rates of correction, NLA's existing newspaper corpus will take over 110 years to correct manually. By automatically correcting two-thirds of OCR errors, overProof can reduce the time needed for human corrections and, as demonstrated by this comparative search engine, allow current and future generations of searchers to uncover a lot more "gold", and in so doing greatly enhance the public utility of this magnificent resource.
The following table lists article counts found for sample searches before and after correction of the 10m article corpus by overProof. The meanings of the table columns are:
Mannix'« which matches Mannix, is accurately corrected to Mannix's, which does not).
Notes:
Search term (as a phrase, slop 2) | Original | Both original and corrected | Only original | Only corrected | Corrected | Recall improvement (original -> corrected) |
---|---|---|---|---|---|---|
Reginald Evans | 8 | 8 | 0 | 13 | 21 | x |
Dorothy Giles | 10 | 10 | 0 | 10 | 20 | x |
Bella Lavender | 22 | 22 | 0 | 19 | 41 | x |
Bella Guerin | 2 | 2 | 0 | 12 | 14 | x |
Miles Franklin | 161 | 158 | 3 | 135 | 293 | x |
Henry Lawson | 707 | 688 | 19 | 393 | 1081 | x |
Richard Mahoney | 14 | 14 | 0 | 17 | 31 | x |
George Birchall | 11 | 11 | 0 | 17 | 28 | x |
Dame Nellie Melba | 1228 | 1201 | 27 | 963 | 2164 | x |
Henry Parkes | 14370 | 14314 | 56 | 6067 | 20381 | x |
David Unaipon | 6 | 6 | 0 | 12 | 18 | x |
Alfred Deakin | 1781 | 1776 | 5 | 917 | 2693 | x |
Albert Einstein | 38 | 38 | 0 | 31 | 69 | x |
Queen Victoria | 33569 | 33371 | 198 | 19175 | 52546 | x |
Viscount Lascelles | 133 | 133 | 0 | 50 | 183 | x |
James Cook | 3294 | 3250 | 44 | 1473 | 4723 | x |
W L Baillieu | 194 | 192 | 2 | 1446 | 1638 | x |
Baillieu Trust | 8 | 8 | 0 | 13 | 21 | x |
Baillieu Education Trust | 4 | 4 | 0 | 8 | 12 | x |
Disraeli | 7815 | 7723 | 92 | 1465 | 9188 | x |
Leichardt | 4542 | 4425 | 117 | 2389 | 6814 | x |
Kaiser Wilhelm | 1236 | 1221 | 15 | 727 | 1948 | x |
WM Hughes | 496 | 489 | 7 | 383 | 872 | x |
Prime Minister Hughes | 4644 | 4632 | 12 | 6066 | 10698 | x |
Lloyd George Hughes | 33 | 33 | 0 | 21 | 54 | x |
Archbishop Mannix | 1352 | 1351 | 1 | 2365 | 3716 | x |
Lygon Street Carlton | 2846 | 2827 | 19 | 7064 | 9891 | x |
Flinders Street Station | 11464 | 11443 | 21 | 15016 | 26459 | x |
Sydney Harbour Bridge | 430 | 430 | 0 | 429 | 859 | x |
Broken Hill railway | 406 | 404 | 2 | 295 | 699 | x |
Sydney ferries | 3243 | 3242 | 1 | 3988 | 7230 | x |
Sydney fish markets | 6 | 6 | 0 | 9 | 15 | x |
William Street Woolloomooloo | 1693 | 1672 | 21 | 2539 | 4211 | x |
Italy | 88815 | 88430 | 385 | 62568 | 150998 | x |
Mount Isa | 3883 | 3844 | 39 | 1969 | 5813 | x |
Kalgoorlie | 49805 | 49796 | 9 | 10437 | 60233 | x |
Adelaide University | 3424 | 3405 | 19 | 2019 | 5424 | x |
Myall Creek Station | 28 | 28 | 0 | 26 | 54 | x |
New South Wales | 507111 | 506665 | 446 | 334201 | 840866 | x |
Keira Street Wollongong | 13 | 12 | 1 | 30 | 42 | x |
Mackenzie Street Bendigo | 2 | 2 | 0 | 7 | 9 | x |
Venezia Hotel Zeehan | 1 | 1 | 0 | 4 | 5 | x |
Imperial Hotel Queenstown | 115 | 115 | 0 | 120 | 235 | x |
Intercolonial conference | 3087 | 3083 | 4 | 1821 | 4904 | x |
robbery under arms | 1515 | 1515 | 0 | 1286 | 2801 | x |
kanaka trade | 68 | 68 | 0 | 36 | 104 | x |
pearling Torres Strait | 6 | 6 | 0 | 4 | 10 | x |
Torres Strait pearling | 2 | 2 | 0 | 2 | 4 | x |
British import duties | 15 | 15 | 0 | 18 | 33 | x |
margarine butter | 352 | 351 | 1 | 224 | 575 | x |
chocolate factory | 85 | 85 | 0 | 99 | 184 | x |
MacRobertson's | 134 | 129 | 5 | 199 | 328 | x |
zinc mine | 234 | 226 | 8 | 342 | 568 | x |
Royal Australian Air Force | 1300 | 1299 | 1 | 1310 | 2609 | x |
Chief Protector of Aborigines | 135 | 135 | 0 | 156 | 291 | x |
Electricity Supply Association of Australia | 5 | 5 | 0 | 9 | 14 | x |
Olympic Games Paris | 53 | 53 | 0 | 87 | 140 | x |
Bathurst Goal | 17 | 16 | 1 | 23 | 39 | x |
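
The relationships between the table columns can be checked, and a recall-improvement factor derived, with a short sketch. It assumes that the improvement is the ratio of corrected to original article counts; that definition, and the `Row` helper, are assumptions made for illustration, since the final column shows only "x" here. Only a few rows from the table are included.

```python
from typing import NamedTuple


class Row(NamedTuple):
    term: str
    original: int
    both: int
    only_original: int
    only_corrected: int
    corrected: int

    def is_consistent(self) -> bool:
        """Each total should equal the shared hits plus the hits unique to that version."""
        return (self.original == self.both + self.only_original
                and self.corrected == self.both + self.only_corrected)

    def improvement(self) -> float:
        """Assumed definition: articles found after correction / articles found before."""
        return self.corrected / self.original


rows = [
    Row("Reginald Evans", 8, 8, 0, 13, 21),
    Row("Bella Guerin", 2, 2, 0, 12, 14),
    Row("Henry Parkes", 14370, 14314, 56, 6067, 20381),
    Row("W L Baillieu", 194, 192, 2, 1446, 1638),
]

for row in rows:
    assert row.is_consistent(), row.term
    print(f"{row.term}: {row.improvement():.2f}x")
```

The "Only original" column counts articles that were findable before correction but not after, so a non-zero value there is a cost to weigh against the gain in the "Only corrected" column.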