"The Correction Layer"

The Correction Layer

You OCR a historical text. The raw output has errors. You correct them — manually, automatically, or both. Then you extract named entities. The standard pipeline treats correction as noise removal: garbage in, clean out, proceed to analysis.

Guo and Wei show that different correction pathways produce different named entities from the same document. Not just different confidence levels — different entities. The correction itself is an interpretive act, and the choice of how to correct (which errors to fix, which tool to use, when to stop) reshapes the downstream analysis.

The through-claim: OCR correction is not pre-processing. It’s analysis. The decision to change a character from ‘f’ to ‘s’ (long-s misread) or to leave a questionable word intact carries interpretive weight that propagates to every subsequent step. Two scholars correcting the same raw OCR with different strategies will extract different historical claims from the same document.
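A minimal sketch of that claim, using entirely hypothetical text and toy rules (no real OCR engine or NER model): two correction strategies applied to the same raw OCR line yield different entity sets downstream.

```python
import re

# Hypothetical raw OCR line with long-s misread as 'f'.
RAW = "Mifs Smith vifited Bofton."

def correct_aggressive(text):
    # Assumed rule: treat every word-medial 'f' as a long-s misread.
    return re.sub(r"(?<=\w)f(?=\w)", "s", text)

def correct_conservative(text):
    # Assumed rule: fix only dictionary-backed cases; leave the rest intact.
    fixes = {"Mifs": "Miss"}
    return " ".join(fixes.get(w, w) for w in text.split())

def extract_entities(text):
    # Toy stand-in for NER: non-initial capitalized tokens.
    tokens = re.findall(r"[A-Za-z]+", text)
    return {t for t in tokens[1:] if t[0].isupper()}

# Same document, two correction pathways, two different entity sets:
# the aggressive path recovers "Boston"; the conservative path keeps "Bofton".
print(extract_entities(correct_aggressive(RAW)))
print(extract_entities(correct_conservative(RAW)))
```

The divergence here is deliberate and trivial, but it mirrors the paper's point: the choice of correction strategy, not the analysis step, determines which place name enters the record.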

The authors argue that provenance — the record of what was corrected, by whom, with what tool, and with what confidence — should be a “first-class analytical layer,” not an afterthought. The correction layer is as load-bearing as the text itself.
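What a first-class correction layer might look like in practice: a hypothetical record schema (the field names and example entries are my own illustration, not the authors') where each change carries its span, tool, confidence, and editor, so downstream claims can be audited back to the edit that enabled them.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Correction:
    """One entry in the correction layer (hypothetical schema)."""
    span: tuple          # (start, end) character offsets in the raw OCR text
    original: str        # what the OCR engine produced
    corrected: str       # what the text now reads
    tool: str            # e.g. "manual", "dictionary-lookup", "lm-rerank"
    confidence: float    # corrector's confidence, 0.0 to 1.0
    editor: str          # who or what made the change

# Example log for an illustrative document; the layer travels with the text.
log = [
    Correction((0, 4), "Mifs", "Miss", "dictionary-lookup", 0.97, "auto-v1"),
    Correction((13, 19), "Bofton", "Boston", "manual", 1.0, "scholar-A"),
]
print([asdict(c) for c in log])
```

The design choice that matters is immutability (`frozen=True`): a provenance record that can be silently rewritten is no longer provenance.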

This is the textual version of a measurement problem: the instrument that prepares the data for analysis is itself an analytical instrument, and its settings affect the result. OCR correction is not cleaning. It’s editing. And editing is interpretation.
