What text cleaning actually removes
Text copied from PDFs, Word documents, or web pages carries invisible formatting artifacts that cause problems when pasted into databases, APIs, or other documents. The most common artifacts:
- Non-breaking spaces (U+00A0): Copied from HTML where &nbsp; was used. Visually identical to a regular space but treated as a different character in string comparisons and database storage, a common cause of "text looks right but doesn't match" bugs.
- Smart quotes and typographic dashes: Word and macOS autocorrect "straight quotes" to curly “smart quotes” and -- to — (an em dash). In code contexts, these break JSON parsers, shell scripts, and any system expecting ASCII punctuation.
- Extra whitespace and line breaks: PDF text extraction often produces hyphenation artifacts (split words at line breaks), double spaces between sentences, and inconsistent paragraph spacing.
- Zero-width characters: Zero-width space (U+200B), zero-width non-joiner (U+200C), and byte-order marks (U+FEFF) are invisible but can corrupt API requests, break tokenization, and cause subtle database issues. Common in text copied from web pages and certain document formats.
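As a rough illustration of the four artifact types above, here is a minimal Python sketch of a rule-based cleaner. It is not how any particular AI cleaner works; the names PUNCTUATION_MAP, ZERO_WIDTH, and clean_text are hypothetical, and the character list covers only the code points mentioned in this section plus the common curly-quote and dash code points.

```python
import re

# Hypothetical mapping for this sketch; extend it for your own data.
PUNCTUATION_MAP = str.maketrans({
    "\u00a0": " ",   # non-breaking space -> regular space
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
})

# Mapping a code point to None tells str.translate to delete it.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0xFEFF])


def clean_text(text: str) -> str:
    """Normalize common copy-paste artifacts (a sketch, not exhaustive)."""
    text = text.translate(ZERO_WIDTH)
    text = text.translate(PUNCTUATION_MAP)
    # Rejoin words split by a hyphen at a line break ("for-\nmatted").
    # Assumption: this also merges genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into one paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "\ufeffHello\u00a0\u201cworld\u201d \u2014  for-\nmatted\u200b text"
print(clean_text(sample))
```

Note that the quote and dash replacements are applied to the whole input, which is exactly what makes them risky for text that contains code or data.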
Three things to verify manually after AI cleaning
- Intentional special characters: An AI cleaner may strip Unicode characters that look like artifacts but are intentional, such as mathematical symbols, currency signs, or technical notation. Check that domain-specific symbols survived.
- Hyphenated words from PDF extraction: PDF line-break hyphens ("for- matted") should become "formatted", but the cleaner may not detect all cases, leaving broken words in the output. Scan for unusual hyphenation.
- Quotation marks in code or data: If the text contains code examples, JSON, or CSV, smart-quote normalization could corrupt the data. Verify that any programmatic content retained its exact original punctuation.
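The three checks above can be partly automated. The following Python sketch assumes you still have the original text alongside the cleaned output; the function names (dropped_characters, suspicious_hyphens, altered_snippets) are hypothetical, and the results are candidates for human review, not verdicts.

```python
import re
import unicodedata
from collections import Counter


def dropped_characters(original: str, cleaned: str) -> dict[str, int]:
    """Non-ASCII characters that occur less often after cleaning."""
    before = Counter(ch for ch in original if ord(ch) > 127)
    after = Counter(ch for ch in cleaned if ord(ch) > 127)
    lost = before - after  # Counter subtraction keeps only positive counts
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}": count
        for ch, count in lost.items()
    }


def suspicious_hyphens(cleaned: str) -> list[str]:
    """Find 'word- word' patterns, which often mark unrepaired line breaks.

    Legitimate phrases such as "pre- and post-processing" also match,
    so treat the result as a list to scan by eye.
    """
    return re.findall(r"\w+-\s+\w+", cleaned)


def altered_snippets(cleaned: str, snippets: list[str]) -> list[str]:
    """Return the code or data snippets that no longer appear verbatim."""
    return [snippet for snippet in snippets if snippet not in cleaned]


original = "Price: 5 \u20ac for \u201cx\u201d \u2264 3"
cleaned = 'Price: 5 for "x" \u2264 3'
for label, count in dropped_characters(original, cleaned).items():
    print(label, count)
print(suspicious_hyphens("a nicely for- matted report"))
```

In this example the euro sign shows up in the dropped list next to the curly quotes, which is the kind of intentional symbol the first check is meant to catch.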
