This file is indexed.

/usr/share/doc/ocrmypdf/html/introduction.html is in ocrmypdf-doc 4.3.5-3.

This file is owned by root:root, with mode 0o644.

<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
  <meta charset="utf-8">
  
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  
  <title>Introduction &mdash; ocrmypdf 4.3.5 documentation</title>
  

  
  

  

  
  
    

  

  
  
    <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
  

  

  
        <link rel="index" title="Index"
              href="genindex.html"/>
        <link rel="search" title="Search" href="search.html"/>
    <link rel="top" title="ocrmypdf 4.3.5 documentation" href="index.html"/>
        <link rel="next" title="Installing additional language packs" href="languages.html"/>
        <link rel="prev" title="OCRmyPDF documentation" href="index.html"/> 

  
  <script src="_static/js/modernizr.min.js"></script>

</head>

<body class="wy-body-for-nav" role="document">

  <div class="wy-grid-for-nav">

    
    <nav data-toggle="wy-nav-shift" class="wy-nav-side">
      <div class="wy-side-scroll">
        <div class="wy-side-nav-search">
          

          
            <a href="index.html" class="icon icon-home"> ocrmypdf
          

          
          </a>

          

          
<div role="search">
  <form id="rtd-search-form" class="wy-form" action="search.html" method="get">
    <input type="text" name="q" placeholder="Search docs" />
    <input type="hidden" name="check_keywords" value="yes" />
    <input type="hidden" name="area" value="default" />
  </form>
</div>

          
        </div>

        <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
          
            
            
                <ul class="current">
<li class="toctree-l1 current"><a class="current reference internal" href="#">Introduction</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#about-ocr">About OCR</a></li>
<li class="toctree-l2"><a class="reference internal" href="#about-pdfs">About PDFs</a></li>
<li class="toctree-l2"><a class="reference internal" href="#about-pdf-a">About PDF/A</a></li>
<li class="toctree-l2"><a class="reference internal" href="#what-ocrmypdf-does">What OCRmyPDF does</a></li>
<li class="toctree-l2"><a class="reference internal" href="#limitations">Limitations</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="languages.html">Installing additional language packs</a></li>
<li class="toctree-l1"><a class="reference internal" href="cookbook.html">Cookbook</a></li>
<li class="toctree-l1"><a class="reference internal" href="security.html">PDF Security Issues</a></li>
<li class="toctree-l1"><a class="reference internal" href="errors.html">Common error messages</a></li>
</ul>

            
          
        </div>
      </div>
    </nav>

    <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">

      
      <nav class="wy-nav-top" role="navigation" aria-label="top navigation">
        <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
        <a href="index.html">ocrmypdf</a>
      </nav>


      
      <div class="wy-nav-content">
        <div class="rst-content">
          

 



          <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
           <div itemprop="articleBody">
            
  <div class="section" id="introduction">
<h1>Introduction<a class="headerlink" href="#introduction" title="Permalink to this headline"></a></h1>
<p>OCRmyPDF is a Python 3 package that adds OCR layers to PDFs.</p>
<div class="section" id="about-ocr">
<h2>About OCR<a class="headerlink" href="#about-ocr" title="Permalink to this headline"></a></h2>
<p><a class="reference external" href="https://en.wikipedia.org/wiki/Optical_character_recognition">Optical character recognition</a> is technology that converts images of typed or handwritten text, such as in a scanned document, to computer text that can be searched and copied.</p>
<p>OCRmyPDF uses <a class="reference external" href="https://github.com/tesseract-ocr/tesseract">Tesseract</a>, one of the most capable open source OCR engines, to perform OCR.</p>
</div>
<div class="section" id="about-pdfs">
<span id="raster-vector"></span><h2>About PDFs<a class="headerlink" href="#about-pdfs" title="Permalink to this headline"></a></h2>
<p>PDFs are page description files that attempt to preserve a layout exactly. They can contain <a class="reference external" href="http://vector-conversions.com/vectorizing/raster_vs_vector.html">vector graphics</a> as well as raster objects such as scanned images. Because PDFs can contain multiple pages (unlike many image formats) and can contain fonts and text, they are a good format for exchanging scanned documents.</p>
<img alt="_images/bitmap_vs_svg.svg" src="_images/bitmap_vs_svg.svg" /><p>A PDF page might contain multiple images, even if it appears to have only one.  For example, some scanners or scanning software segment pages into monochromatic text and color regions to improve the compression ratio and appearance of the page.</p>
<p>Rasterizing a PDF is the process of generating an image suitable for display or for analysis with an OCR engine.  OCR engines like Tesseract work with images, not vector objects.</p>
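<p>As a rough illustration of rasterizing, the sketch below builds a Ghostscript command line that renders a single page to a PNG image. It is not how OCRmyPDF itself invokes Ghostscript; the function names, the default of 300 DPI, and the choice of the <code class="docutils literal"><span class="pre">png16m</span></code> output device are assumptions made for this example.</p>

```python
import subprocess


def ghostscript_rasterize_args(pdf_path, png_path, page=1, dpi=300):
    """Build a Ghostscript command that renders one page of a PDF to a PNG.

    Hypothetical helper for illustration; the default values are assumptions.
    """
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER", "-dQUIET",
        "-sDEVICE=png16m",          # 24-bit RGB PNG output device
        f"-r{dpi}",                 # rendering resolution in dots per inch
        f"-dFirstPage={page}",      # render only the requested page
        f"-dLastPage={page}",
        f"-sOutputFile={png_path}",
        pdf_path,
    ]


def rasterize_page(pdf_path, png_path, page=1, dpi=300):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(
        ghostscript_rasterize_args(pdf_path, png_path, page, dpi),
        check=True,
    )
```

<p>Calling <code class="docutils literal"><span class="pre">rasterize_page("scan.pdf",</span> <span class="pre">"page1.png")</span></code> requires Ghostscript to be installed and available as <code class="docutils literal"><span class="pre">gs</span></code> on the search path.</p>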
</div>
<div class="section" id="about-pdf-a">
<h2>About PDF/A<a class="headerlink" href="#about-pdf-a" title="Permalink to this headline"></a></h2>
<p><a class="reference external" href="https://en.wikipedia.org/wiki/PDF/A">PDF/A</a> is an ISO-standardized subset of the full PDF specification that is designed for archiving (the &#8216;A&#8217; stands for Archive).  PDF/A differs from PDF primarily by omitting features that would make it difficult to read the file in the future, such as embedded JavaScript, video, audio and references to external fonts.  All fonts and resources needed to interpret the PDF must be contained within it. Because PDF/A disables JavaScript and other types of embedded content, it is probably more secure.</p>
<p>There are various conformance levels and versions, such as &#8220;PDF/A-2b&#8221;.</p>
<p>Generally speaking, the best format for scanned documents is PDF/A. Some governments and jurisdictions, US Courts in particular, <a class="reference external" href="https://pdfblog.com/2012/02/13/what-is-pdfa/">mandate the use of PDF/A</a> for scanned documents.</p>
<p>Since most people who scan documents are interested in reading them indefinitely into the future, OCRmyPDF generates PDF/A-2b by default.</p>
<p>PDF/A has a few drawbacks.  Some PDF viewers include an alert that the file is a PDF/A, which may confuse some users.  It also tends to produce larger files than PDF, because it embeds certain resources even if they are commonly available. PDF/A files can be digitally signed, but may not be encrypted, to ensure they can be read in the future.  Fortunately, converting from PDF/A to a regular PDF is trivial, and any PDF viewer can view PDF/A.</p>
</div>
<div class="section" id="what-ocrmypdf-does">
<h2>What OCRmyPDF does<a class="headerlink" href="#what-ocrmypdf-does" title="Permalink to this headline"></a></h2>
<p>OCRmyPDF analyzes each page of a PDF to determine the colorspace and resolution (DPI) needed to capture all of the information on that page without losing content.  It uses <a class="reference external" href="http://ghostscript.com/">Ghostscript</a> to rasterize the page, and then performs OCR on the rasterized image.  It is not enough to simply extract the images from each page and run OCR on them individually, since a single page may be composed of several images.  Of course, one could use Ghostscript or another PDF rasterizer and then pass the image to Tesseract.  OCRmyPDF automates this process and produces a minimally changed output file that contains the same information, colorspace and resolution.</p>
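<p>To make the notion of page resolution concrete, the sketch below computes the effective DPI of an image from its pixel dimensions and the size at which it is drawn on the page. It relies on the fact that default PDF user space units are 1/72 inch. The function name is hypothetical and this is not OCRmyPDF&#8217;s own analysis code; a real page with several images would need the calculation repeated per image.</p>

```python
def effective_dpi(pixel_width, pixel_height, width_pt, height_pt):
    """Return (horizontal_dpi, vertical_dpi) for an image drawn on a page.

    width_pt and height_pt are the drawn size in PDF points (1/72 inch).
    """
    width_in = width_pt / 72.0
    height_in = height_pt / 72.0
    return (pixel_width / width_in, pixel_height / height_in)


# A 2550 x 3300 pixel scan drawn across a US Letter page (612 x 792 points)
print(effective_dpi(2550, 3300, 612, 792))  # (300.0, 300.0)
```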
<p>The Tesseract OCR engine can output &#8216;hOCR&#8217; files, which are XML files that contain a description of the text it found on the page.  OCRmyPDF will render a new PDF that contains only the hidden text layer, and merge this with the original page.</p>
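<p>The following sketch shows what kind of information an hOCR file carries. It parses a small hand-written sample (not real Tesseract output) and extracts each word with its bounding box. It assumes the XHTML form of hOCR, so that a standard XML parser can read it, and the <code class="docutils literal"><span class="pre">hocr_words</span></code> helper is hypothetical.</p>

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of hOCR, for illustration only.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title="bbox 0 0 2550 3300">
   <span class="ocr_line" title="bbox 300 400 900 460">
    <span class="ocrx_word" title="bbox 300 400 520 460; x_wconf 93">Hello</span>
    <span class="ocrx_word" title="bbox 560 400 900 460; x_wconf 88">world</span>
   </span>
  </div>
 </body>
</html>"""


def hocr_words(hocr_text):
    """Return a list of (text, (x0, y0, x1, y1)) for each ocrx_word element."""
    root = ET.fromstring(hocr_text)
    words = []
    for element in root.iter():
        if element.get("class") != "ocrx_word":
            continue
        bbox = None
        # The title attribute holds semicolon-separated properties.
        for prop in element.get("title", "").split(";"):
            fields = prop.split()
            if fields and fields[0] == "bbox":
                bbox = tuple(int(value) for value in fields[1:5])
        text = "".join(element.itertext()).strip()
        words.append((text, bbox))
    return words


for text, bbox in hocr_words(SAMPLE_HOCR):
    print(text, bbox)
```

<p>The bounding boxes are in pixels of the rasterized image, which is why the resolution used for rasterizing matters when the text layer is positioned on the PDF page.</p>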
<p>Alternatively, OCRmyPDF can use the Tesseract OCR engine to directly output PDFs for each page, then merge them.</p>
<p>By default, OCRmyPDF will convert the file to a PDF/A.  This behavior can be disabled with the <code class="docutils literal"><span class="pre">--output-type</span> <span class="pre">pdf</span></code> argument.</p>
<p>Depending on the settings selected, OCRmyPDF may &#8220;graft&#8221; the OCR layer into the existing PDF, or reconstruct a visually equivalent new PDF.</p>
</div>
<div class="section" id="limitations">
<h2>Limitations<a class="headerlink" href="#limitations" title="Permalink to this headline"></a></h2>
<p>OCRmyPDF is limited by the Tesseract OCR engine.  As such it experiences these limitations, as do any other programs that rely on Tesseract:</p>
<ul class="simple">
<li>The OCR is not as accurate as commercial solutions such as ABBYY.</li>
<li>It is not capable of recognizing handwriting.</li>
<li>It may find gibberish and report this as OCR output.</li>
<li>If a document contains languages outside of those given in the <code class="docutils literal"><span class="pre">-l</span> <span class="pre">LANG</span></code> arguments, results may be poor.</li>
<li>It is not always good at analyzing the natural reading order of documents. For example, it may fail to recognize that a document contains two columns and join text across the columns.</li>
<li>Poor quality scans may produce poor quality OCR. Garbage in, garbage out.</li>
</ul>
<p>OCRmyPDF is also limited by the PDF specification:</p>
<ul class="simple">
<li>PDF encodes the position of text glyphs but does not encode document structure.  There is no markup that divides a document into sections, paragraphs, sentences, or even words (since blank spaces are not represented). As such, all elements of document structure, including the spaces between words, must be derived heuristically.  Some PDF viewers do a better job of this than others.</li>
</ul>
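<p>The heuristic recovery of word boundaries described above can be sketched as follows. Given only glyph positions on one line, the function inserts a space wherever the horizontal gap between neighbouring glyphs exceeds a fraction of the mean glyph width. The function name and the 0.3 threshold are assumptions chosen for this example; actual PDF viewers use their own, more elaborate, rules.</p>

```python
def glyphs_to_text(glyphs, gap_ratio=0.3):
    """Join glyphs on one line into text, guessing where the spaces belong.

    glyphs: list of (char, x0, x1) tuples giving each glyph's left and right
    edge. gap_ratio is an assumed threshold relative to the mean glyph width.
    """
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda glyph: glyph[1])
    mean_width = sum(x1 - x0 for _, x0, x1 in ordered) / len(ordered)
    pieces = [ordered[0][0]]
    for (_, _, prev_x1), (char, x0, _) in zip(ordered, ordered[1:]):
        # A gap noticeably wider than normal letter spacing suggests a space.
        if x0 - prev_x1 > gap_ratio * mean_width:
            pieces.append(" ")
        pieces.append(char)
    return "".join(pieces)


line = [("O", 0, 10), ("C", 11, 21), ("R", 22, 32), ("m", 40, 52), ("y", 53, 63)]
print(glyphs_to_text(line))  # OCR my
```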
<p>Ghostscript also imposes some limitations:</p>
<ul class="simple">
<li>PDFs containing JBIG2-encoded content will be converted to CCITT Group4 encoding, which has lower compression ratios, if Ghostscript PDF/A is enabled.</li>
</ul>
<p>OCRmyPDF is currently not designed to be used as a Python API; it is designed to be run as a command line tool. <code class="docutils literal"><span class="pre">import</span> <span class="pre">ocrmypdf</span></code> currently attempts to process the command line on <code class="docutils literal"><span class="pre">sys.argv</span></code> at import time, so it has side effects that will interfere with its use as a package. The API it presents should not be considered stable.</p>
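<p>Given that, one way to drive OCRmyPDF from a Python program is to run the command line tool as a subprocess instead of importing it. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, and passing one <code class="docutils literal"><span class="pre">-l</span></code> argument per language is an assumption that should be checked against <code class="docutils literal"><span class="pre">ocrmypdf</span> <span class="pre">--help</span></code> for the installed version.</p>

```python
import subprocess


def ocrmypdf_args(input_pdf, output_pdf, languages=(), output_type=None):
    """Build an ocrmypdf command line (hypothetical helper)."""
    args = ["ocrmypdf"]
    for language in languages:
        # Assumption: one -l argument per language.
        args += ["-l", language]
    if output_type is not None:
        # For example "pdf" to disable the default conversion to PDF/A.
        args += ["--output-type", output_type]
    args += [input_pdf, output_pdf]
    return args


def run_ocrmypdf(input_pdf, output_pdf, languages=(), output_type=None):
    """Run ocrmypdf as a separate process and return its exit code."""
    completed = subprocess.run(
        ocrmypdf_args(input_pdf, output_pdf, languages, output_type)
    )
    return completed.returncode
```

<p>Running the tool in its own process keeps the <code class="docutils literal"><span class="pre">sys.argv</span></code> handling described above out of the calling program.</p>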
</div>
</div>


           </div>
          </div>
          <footer>
  
    <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
      
        <a href="languages.html" class="btn btn-neutral float-right" title="Installing additional language packs" accesskey="n">Next <span class="fa fa-arrow-circle-right"></span></a>
      
      
        <a href="index.html" class="btn btn-neutral" title="OCRmyPDF documentation" accesskey="p"><span class="fa fa-arrow-circle-left"></span> Previous</a>
      
    </div>
  

  <hr/>

  <div role="contentinfo">
    <p>
        &copy; Copyright 2017, James R. Barlow.

    </p>
  </div>
  Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. 

</footer>

        </div>
      </div>

    </section>

  </div>
  


  

    <script type="text/javascript">
        var DOCUMENTATION_OPTIONS = {
            URL_ROOT:'./',
            VERSION:'4.3.5',
            COLLAPSE_INDEX:false,
            FILE_SUFFIX:'.html',
            HAS_SOURCE:  true
        };
    </script>
      <script type="text/javascript" src="_static/jquery.js"></script>
      <script type="text/javascript" src="_static/underscore.js"></script>
      <script type="text/javascript" src="_static/doctools.js"></script>

  

  
  
    <script type="text/javascript" src="_static/js/theme.js"></script>
  

  
  
  <script type="text/javascript">
      jQuery(function () {
          SphinxRtdTheme.StickyNav.enable();
      });
  </script>
   

</body>
</html>