
SuperUser reader Colen is searching for a way to extract text from PDFs while preserving the formatting:

When I copy text out of a PDF file and into a text editor, it ends up mangled in a variety of ways. Formatting like bold and italics is lost; soft line breaks within a paragraph of text are converted to hard line breaks; dashes to break a word over two lines are preserved even when they shouldn’t be; and single and double quotes are replaced with ? signs.

Ideally, I’d like to be able to copy text from a PDF and have formatting converted to HTML codes, “smart quotes” converted to " and ', and line breaks done properly.

Is there a quick and easy way for Colen (and the rest of us) to grab text without sacrificing the formatting?

The Answer

SuperUser contributor Frabjous offers a solution combined with a heavy dose of caution:

Firstly, you have to understand what a PDF is. PDFs are designed to mimic a printed page, and they are designed only as an output format, not an input format. A PDF is basically a map containing the exact location of characters (individual letters or punctuation, etc.) or images. In most cases, a PDF does not even store information about where one word ends and another begins, much less things like soft breaks vs. hard breaks.
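
To make the "map of character locations" idea concrete, here is a minimal sketch in Python of what any copy or extraction tool has to do when all it has is positioned characters. It is not how any particular PDF reader works: the Glyph record, the glyphs_to_text function, and the gap_ratio and line_tolerance thresholds are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character, roughly what a PDF page records (hypothetical model)."""
    char: str
    x: float      # left edge of the character
    y: float      # baseline height on the page
    width: float  # advance width of the character


def glyphs_to_text(glyphs, gap_ratio=0.3, line_tolerance=2.0):
    """Guess lines and word breaks from positions alone.

    gap_ratio and line_tolerance are assumed heuristics, not values
    taken from any PDF tool.
    """
    lines = []  # each entry: [baseline_y, [glyphs on that line]]
    # PDF y coordinates grow upward, so visit the top of the page first.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= line_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # A horizontal gap wider than a fraction of the previous
            # character is treated as a space between words.
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    # Every visual line ends in a newline: nothing in the positions says
    # whether the break was a soft wrap or the end of a paragraph.
    return "\n".join(out)
```

Note that the sketch has to emit a line break wherever the baseline changes, because the positions carry no hint about whether a paragraph ended there. That is the same soft-break-turned-hard-break problem described in the question.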
