I have an application where users upload PDF files, which are converted to text for further processing. Some of the uploaded files are image-only PDFs (scans), so the text conversion does not work for them. Instead of splitting every PDF into images and running OCR on all of them, I would prefer to send only those that are detected to be images. Is there a way to do this? I'm working in a Linux (Debian) environment with PHP.

UPDATE

While searching for a final solution, I have followed @Andrew's suggestion: count the number of words in the generated txt file and, if it has fewer than 10 words, proceed to the next step, converting the PDF to images for later OCR, which is what I'm working on now:

// convert any file with a pdf extension to text
$cmd = 'pdftotext -eol unix ' . escapeshellarg($uploadedfile);
shell_exec($cmd);
// move the original file to the orig directory
rename($uploadedfile, "orig/$uploadedfile");
// pdftotext writes its output next to the pdf, with a txt extension
$textfile = preg_replace('/\.pdf$/i', '.txt', $uploadedfile);
// count the words in the generated txt file
$cmd = 'wc -w < ' . escapeshellarg($textfile);
$wc = (int) shell_exec($cmd);
// proceed only if there are fewer than 10 words
if ($wc < 10)
{
    // strip the pdf extension to get the directory name
    $imgdir = preg_replace('/\.pdf$/i', '', $uploadedfile);
    mkdir($imgdir);
    // swap the pdf extension for jpg to name the images
    $imgfile = preg_replace('/\.pdf$/i', '.jpg', $uploadedfile);
    // convert the pdf to images
    $cmd = 'convert ' . escapeshellarg("orig/$uploadedfile") . ' ' . escapeshellarg("$imgdir/$imgfile");
    shell_exec($cmd);
}

The OCR step will come next...
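
The word count above shells out to wc. As a minimal sketch, the same check could be done in PHP itself once the txt file has been read into a string. The function name needsOcr and the default threshold of 10 are assumptions for illustration, not part of the original code:

```php
<?php
// Hypothetical helper (not from the original post): decide whether the text
// extracted from a PDF is too sparse to be a real text layer.
function needsOcr(string $text, int $minWords = 10): bool
{
    // Split on runs of whitespace, much like `wc -w`, and drop empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return count($words) < $minWords;
}
```

It could be called as needsOcr(file_get_contents($textfile)), which avoids one shell call per upload. The right threshold probably depends on the documents being uploaded.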

UPDATE 2: Thanks to @Mark-Setchell's suggestion, I have changed the code a little; the last part now looks like this:

// strip the pdf extension to get the directory name
$imgdir = preg_replace('/\.pdf$/i', '', $uploadedfile);
mkdir($imgdir);
// extract the images embedded in the pdf
$cmd = 'pdfimages ' . escapeshellarg("orig/$uploadedfile") . ' ' . escapeshellarg("$imgdir/$imgdir");
shell_exec($cmd);
  • Well, try to get text. If your attempt fails, then send to OCR
    – Andrew
    Commented Oct 6, 2015 at 12:34
  • OK, so I need a way to check whether there is text, or enough text, in the output file? Any suggestions? Thanks. Commented Oct 6, 2015 at 12:35
  • Don't you have an application that "converts PDF to text", as you stated in the question? Commented Oct 6, 2015 at 16:24
  • Yes, pdftotext, but this one only converts when the pdf content is text, not when the content is images of text. Commented Oct 6, 2015 at 16:56

2 Answers

You could use pdfimages from the Poppler package to list and extract all the images in their original formats, sizes and qualities:

pdfimages -list SomeFile.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   9     0 image      37    39  icc     3   8  jpeg   no       978  0   102   102  915B  21%
   9     1 smask      37    39  gray    1   8  image  no       978  0   102   102  334B  23%
   9     2 image     110   120  icc     3   8  jpeg   no       977  0   101   101 2246B 5.7%
   9     3 image     113   103  icc     3   8  jpeg   no       976  0   101   101 2951B 8.5%
  20     4 image     212   156  icc     3   8  jpeg   no       996  0   101   101 3664B 3.7%
  20     5 image      19    23  icc     3   8  jpeg   no      1003  0   103   103 1619B 123%
  20     6 smask      19    23  gray    1   8  image  no      1003  0   103   103  291B  67%
  22     7 image     212   156  icc     3   8  jpeg   no      1188  0   101   101 3579B 3.6%
  24     8 image     212   156  icc     3   8  jpeg   no      1195  0   101   101 2824B 2.8%
  25     9 image     212   156  icc     3   8  jpeg   no      1202  0   101   101 3247B 3.3%
  25    10 image     348    92  icc     3   8  jpeg   no      1209  0   101   101 5022B 5.2%
  25    11 smask     348    92  gray    1   8  jpeg   no      1209  0   101   101  754B 2.4%
  32    12 image     600   400  icc     3   8  jpeg   no      1217  0   150   150 26.9K 3.8%
  32    13 smask     600   400  gray    1   8  jpeg   no      1217  0   150   150 3090B 1.3%
  43    14 image     151   101  icc     3   8  jpeg   no      1228  0   101   102 2656B 5.8%
  43    15 image      71    84  icc     3   8  jpeg   no      1227  0   101   101 1540B 8.6%
  43    16 image     119    84  icc     3   8  jpeg   no      1226  0   101   101 1768B 5.9%
  43    17 image      83    84  icc     3   8  jpeg   no      1230  0   101   101 2082B  10%
  43    18 image     118    84  icc     3   8  jpeg   no      1229  0   101   101 2205B 7.4%
  46    19 image     170   114  icc     3   8  jpeg   no      1243  0   101   101 2594B 4.5%
  46    20 image     125    84  icc     3   8  jpeg   no      1242  0   101   101 3029B 9.6%
  46    21 image     125    84  icc     3   8  jpeg   no      1242  0   101   101 3029B 9.6%
  46    22 image     126    84  index   1   8  image  no      1244  0   101   101 5849B  55%
  48    23 image     226   234  icc     3   8  jpeg   no      1260  0   151   151 5310B 3.3%
  48    24 smask     226   234  gray    1   8  image  no      1260  0   151   151   81B 0.2%
  48    25 image     226   234  icc     3   8  jpeg   no      1259  0   151   151 10.3K 6.7%
  48    26 smask     226   234  gray    1   8  image  no      1259  0   151   151   81B 0.2%
  48    27 image     226   234  icc     3   8  jpeg   no      1259  0   151   151 10.3K 6.7%
  48    28 smask     226   234  gray    1   8  image  no      1259  0   151   151   81B 0.2%
  48    29 image     226   234  icc     3   8  jpeg   no      1264  0   151   151 4052B 2.6%
  48    30 smask     226   234  gray    1   8  image  no      1264  0   151   151  284B 0.5%
  48    31 image     109   113  index   1   8  image  no      1263  0   151   150 5066B  41%
  48    32 smask     109   113  gray    1   8  image  no      1263  0   151   150   76B 0.6%
  48    33 image     109   113  index   1   8  image  no      1262  0   151   150 5362B  44%
  48    34 smask     109   113  gray    1   8  image  no      1262  0   151   150   76B 0.6%
  48    35 image     226   234  index   1   8  image  no      1261  0   151   151 15.5K  30%
  48    36 smask     226   234  gray    1   8  image  no      1261  0   151   151  284B 0.5%
  50    37 image     156   103  icc     3   8  jpeg   no      1291  0   101   101 3625B 7.5%
  50    38 smask     156   103  gray    1   8  jpeg   no      1291  0   101   101  490B 3.0%
  50    39 image     156   103  icc     3   8  jpeg   no      1290  0   101   101 3615B 7.5%
  50    40 smask     156   103  gray    1   8  jpeg   no      1290  0   101   101  472B 2.9%
  50    41 image     157   103  icc     3   8  jpeg   no      1289  0   101   101 3254B 6.7%
  50    42 image     157   104  index   1   8  image  no      1292  0   101   101 3020B  18%
  52    43 image     181   139  icc     3   8  jpeg   no      1309  0   101   101 4407B 5.8%
  52    44 image     181   139  icc     3   8  jpeg   no      1308  0   101   101 4744B 6.3%
  52    45 image     181   139  icc     3   8  jpeg   no      1307  0   101   101 2356B 3.1%
  53    46 image     261   146  icc     3   8  jpeg   no      1320  0   151   150 6577B 5.8%
  53    47 smask     261   146  gray    1   8  image  no      1320  0   151   150  264B 0.7%
  53    48 image     261   146  icc     3   8  jpeg   no      1319  0   151   150 7406B 6.5%
  53    49 smask     261   146  gray    1   8  image  no      1319  0   151   150  264B 0.7%
  53    50 image     261   146  icc     3   8  jpeg   no      1318  0   151   150 9274B 8.1%
  53    51 smask     261   146  gray    1   8  image  no      1318  0   151   150  264B 0.7%
  53    52 image     261   146  icc     3   8  jpeg   no      1318  0   151   150 9274B 8.1%
  53    53 smask     261   146  gray    1   8  image  no      1318  0   151   150  264B 0.7%
  53    54 image     261   146  index   1   8  image  no      1323  0   151   150 6681B  18%
  53    55 smask     261   146  gray    1   8  image  no      1323  0   151   150  264B 0.7%
  53    56 image     261   146  icc     3   8  jpeg   no      1322  0   151   151 7089B 6.2%
  53    57 smask     261   146  gray    1   8  image  no      1322  0   151   151  264B 0.7%
  53    58 image     261   146  index   1   8  image  no      1321  0   151   150 6981B  18%
  53    59 smask     261   146  gray    1   8  image  no      1321  0   151   150  264B 0.7%
  58    60 image     600   556  icc     3   8  image  no      1344  0   145   145  289K  30%
  58    61 smask     600   556  gray    1   8  jpeg   no      1344  0   145   145 8055B 2.4%
  71    62 image     150   175  icc     3   8  jpeg   no      1383  0   101   101 4008B 5.1%
  71    63 image     150   174  icc     3   8  jpeg   no      1382  0   101   101 2523B 3.2%
  74    64 image     510   456  rgb     3   8  image  no      1392  0   144   144 22.9K 3.4%
  74    65 smask     510   456  gray    1   8  image  no      1392  0   144   144 1438B 0.6%
  74    66 image     443   177  rgb     3   8  image  no      1398  0   144   144 25.0K  11%
  74    67 smask     443   177  gray    1   8  image  no      1398  0   144   144  102B 0.1%

Then extract them, using "extracted" as the root of the output filenames:

pdfimages SomeFile.pdf extracted

-rw-r--r--@ 1 mark  staff      915  7 Oct 10:21 extracted-000.jpg
-rw-r--r--  1 mark  staff     4342  7 Oct 10:21 extracted-000.ppm
-rw-r--r--  1 mark  staff     4342  7 Oct 10:21 extracted-001.ppm
-rw-r--r--@ 1 mark  staff     2246  7 Oct 10:21 extracted-002.jpg
-rw-r--r--  1 mark  staff    39615  7 Oct 10:21 extracted-002.ppm
-rw-r--r--@ 1 mark  staff     2951  7 Oct 10:21 extracted-003.jpg
-rw-r--r--  1 mark  staff    34932  7 Oct 10:21 extracted-003.ppm
-rw-r--r--@ 1 mark  staff     3664  7 Oct 10:21 extracted-004.jpg
-rw-r--r--  1 mark  staff    99231  7 Oct 10:21 extracted-004.ppm
-rw-r--r--@ 1 mark  staff     1619  7 Oct 10:21 extracted-005.jpg
-rw-r--r--  1 mark  staff     1324  7 Oct 10:21 extracted-005.ppm
-rw-r--r--  1 mark  staff     1324  7 Oct 10:21 extracted-006.ppm
-rw-r--r--@ 1 mark  staff     3579  7 Oct 10:21 extracted-007.jpg
-rw-r--r--  1 mark  staff    99231  7 Oct 10:21 extracted-007.ppm
-rw-r--r--@ 1 mark  staff     2824  7 Oct 10:21 extracted-008.jpg
-rw-r--r--  1 mark  staff    99231  7 Oct 10:21 extracted-008.ppm
-rw-r--r--@ 1 mark  staff     3247  7 Oct 10:21 extracted-009.jpg
-rw-r--r--  1 mark  staff    99231  7 Oct 10:21 extracted-009.ppm
-rw-r--r--@ 1 mark  staff     5022  7 Oct 10:21 extracted-010.jpg
-rw-r--r--  1 mark  staff    96062  7 Oct 10:21 extracted-010.ppm
-rw-r--r--@ 1 mark  staff      754  7 Oct 10:21 extracted-011.jpg
-rw-r--r--  1 mark  staff    96062  7 Oct 10:21 extracted-011.ppm
-rw-r--r--@ 1 mark  staff    27539  7 Oct 10:21 extracted-012.jpg
-rw-r--r--  1 mark  staff   720015  7 Oct 10:21 extracted-012.ppm
-rw-r--r--@ 1 mark  staff     3090  7 Oct 10:21 extracted-013.jpg
-rw-r--r--  1 mark  staff   720015  7 Oct 10:21 extracted-013.ppm
-rw-r--r--@ 1 mark  staff     2656  7 Oct 10:21 extracted-014.jpg
-rw-r--r--  1 mark  staff    45768  7 Oct 10:21 extracted-014.ppm
-rw-r--r--@ 1 mark  staff     1540  7 Oct 10:21 extracted-015.jpg
-rw-r--r--  1 mark  staff    17905  7 Oct 10:21 extracted-015.ppm
-rw-r--r--@ 1 mark  staff     1768  7 Oct 10:21 extracted-016.jpg
-rw-r--r--  1 mark  staff    30002  7 Oct 10:21 extracted-016.ppm
-rw-r--r--@ 1 mark  staff     2082  7 Oct 10:21 extracted-017.jpg
-rw-r--r--  1 mark  staff    20929  7 Oct 10:21 extracted-017.ppm
-rw-r--r--@ 1 mark  staff     2205  7 Oct 10:21 extracted-018.jpg
-rw-r--r--  1 mark  staff    29750  7 Oct 10:21 extracted-018.ppm
-rw-r--r--@ 1 mark  staff     2594  7 Oct 10:21 extracted-019.jpg
-rw-r--r--  1 mark  staff    58155  7 Oct 10:21 extracted-019.ppm
-rw-r--r--@ 1 mark  staff     3029  7 Oct 10:21 extracted-020.jpg
-rw-r--r--  1 mark  staff    31514  7 Oct 10:21 extracted-020.ppm
-rw-r--r--@ 1 mark  staff     3029  7 Oct 10:21 extracted-021.jpg
-rw-r--r--  1 mark  staff    31514  7 Oct 10:21 extracted-021.ppm
-rw-r--r--  1 mark  staff    31766  7 Oct 10:21 extracted-022.ppm
-rw-r--r--@ 1 mark  staff     5310  7 Oct 10:21 extracted-023.jpg
-rw-r--r--  1 mark  staff   158667  7 Oct 10:21 extracted-023.ppm
-rw-r--r--  1 mark  staff   158667  7 Oct 10:21 extracted-024.ppm
-rw-r--r--@ 1 mark  staff    10564  7 Oct 10:21 extracted-025.jpg
-rw-r--r--  1 mark  staff   158667  7 Oct 10:21 extracted-025.ppm
-rw-r--r--  1 mark  staff   158667  7 Oct 10:21 extracted-026.ppm
-rw-r--r--@ 1 mark  staff    10564  7 Oct 10:21 extracted-027.jpg
-rw-r--r--  1 mark  staff   158667  7 Oct 10:21 extracted-027.ppm
-rw-r--r--  1 mark  staff   158667  7 Oct 10:21 extracted-028.ppm
-rw-r--r--@ 1 mark  staff     4052  7 Oct 10:21 extracted-029.jpg
-rw-r--r--  1 mark  staff   158667  7 Oct 10:21 extracted-029.ppm
-rw-r--r--  1 mark  staff   158667  7 Oct 10:21 extracted-030.ppm
-rw-r--r--  1 mark  staff    36966  7 Oct 10:21 extracted-031.ppm
-rw-r--r--  1 mark  staff    36966  7 Oct 10:21 extracted-032.ppm
-rw-r--r--  1 mark  staff    36966  7 Oct 10:21 extracted-033.ppm
-rw-r--r--  1 mark  staff    36966  7 Oct 10:21 extracted-034.ppm
-rw-r--r--  1 mark  staff   158667  7 Oct 10:21 extracted-035.ppm
-rw-r--r--  1 mark  staff   158667  7 Oct 10:21 extracted-036.ppm
-rw-r--r--@ 1 mark  staff     3625  7 Oct 10:21 extracted-037.jpg
-rw-r--r--  1 mark  staff    48219  7 Oct 10:21 extracted-037.ppm
-rw-r--r--@ 1 mark  staff      490  7 Oct 10:21 extracted-038.jpg
-rw-r--r--  1 mark  staff    48219  7 Oct 10:21 extracted-038.ppm
-rw-r--r--@ 1 mark  staff     3615  7 Oct 10:21 extracted-039.jpg
-rw-r--r--  1 mark  staff    48219  7 Oct 10:21 extracted-039.ppm
-rw-r--r--@ 1 mark  staff      472  7 Oct 10:21 extracted-040.jpg
-rw-r--r--  1 mark  staff    48219  7 Oct 10:21 extracted-040.ppm
-rw-r--r--@ 1 mark  staff     3254  7 Oct 10:21 extracted-041.jpg
-rw-r--r--  1 mark  staff    48528  7 Oct 10:21 extracted-041.ppm
-rw-r--r--  1 mark  staff    48999  7 Oct 10:21 extracted-042.ppm
-rw-r--r--@ 1 mark  staff     4407  7 Oct 10:21 extracted-043.jpg
-rw-r--r--  1 mark  staff    75492  7 Oct 10:21 extracted-043.ppm
-rw-r--r--@ 1 mark  staff     4744  7 Oct 10:21 extracted-044.jpg
-rw-r--r--  1 mark  staff    75492  7 Oct 10:21 extracted-044.ppm
-rw-r--r--@ 1 mark  staff     2356  7 Oct 10:21 extracted-045.jpg
-rw-r--r--  1 mark  staff    75492  7 Oct 10:21 extracted-045.ppm
-rw-r--r--@ 1 mark  staff     6577  7 Oct 10:21 extracted-046.jpg
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-046.ppm
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-047.ppm
-rw-r--r--@ 1 mark  staff     7406  7 Oct 10:21 extracted-048.jpg
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-048.ppm
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-049.ppm
-rw-r--r--@ 1 mark  staff     9274  7 Oct 10:21 extracted-050.jpg
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-050.ppm
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-051.ppm
-rw-r--r--@ 1 mark  staff     9274  7 Oct 10:21 extracted-052.jpg
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-052.ppm
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-053.ppm
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-054.ppm
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-055.ppm
-rw-r--r--@ 1 mark  staff     7089  7 Oct 10:21 extracted-056.jpg
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-056.ppm
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-057.ppm
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-058.ppm
-rw-r--r--  1 mark  staff   114333  7 Oct 10:21 extracted-059.ppm
-rw-r--r--  1 mark  staff  1000815  7 Oct 10:21 extracted-060.ppm
-rw-r--r--@ 1 mark  staff     8055  7 Oct 10:21 extracted-061.jpg
-rw-r--r--  1 mark  staff  1000815  7 Oct 10:21 extracted-061.ppm
-rw-r--r--@ 1 mark  staff     4008  7 Oct 10:21 extracted-062.jpg
-rw-r--r--  1 mark  staff    78765  7 Oct 10:21 extracted-062.ppm
-rw-r--r--@ 1 mark  staff     2523  7 Oct 10:21 extracted-063.jpg
-rw-r--r--  1 mark  staff    78315  7 Oct 10:21 extracted-063.ppm
-rw-r--r--  1 mark  staff   697695  7 Oct 10:21 extracted-064.ppm
-rw-r--r--  1 mark  staff   697695  7 Oct 10:21 extracted-065.ppm
-rw-r--r--  1 mark  staff   235248  7 Oct 10:21 extracted-066.ppm
-rw-r--r--  1 mark  staff   235248  7 Oct 10:21 extracted-067.ppm
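
The pdfimages -list output shown above can also be reduced to a per-page image count inside a script. The following is only a sketch: countImagesPerPage is a hypothetical helper name, and it assumes the column layout shown in the listing (page, num, type, ...), skipping smask rows:

```php
<?php
// Hypothetical helper: count rows of type "image" per page in the text
// printed by `pdfimages -list`. Header, separator and smask rows are ignored.
function countImagesPerPage(string $listOutput): array
{
    $counts = [];
    foreach (preg_split('/\R/', $listOutput) as $line) {
        // Data rows start with: page number, image number, type.
        if (preg_match('/^\s*(\d+)\s+\d+\s+(\w+)\s/', $line, $m) && $m[2] === 'image') {
            $page = (int) $m[1];
            $counts[$page] = ($counts[$page] ?? 0) + 1;
        }
    }
    return $counts;
}
```

An empty result suggests the PDF has no embedded raster images. A non-empty result does not by itself prove a page is a scan, since ordinary text pages can carry small illustrations, as the listing above shows.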
  • Thanks @Mark-Setchell, I have changed the last part of my code. However, I don't find the suggestion good enough for detection, because a PDF file can contain both text and images. In any case, even when there are only images, it is hard to determine that at script level from the output of pdfimages -list SomeFile.pdf. Commented Oct 7, 2015 at 10:16
  • I was thinking that if you generate a list of images and find none, then your PDF is going to be text - or empty. Commented Oct 7, 2015 at 10:19
  • Yes, but anyway, what do I do when I have both images and text? Commented Oct 7, 2015 at 10:20
  • To my mind, the title of your question implied you knew that your PDFs were either text or images and you wished to know which. Commented Oct 7, 2015 at 10:23
  • Sorry for the confusion, but when users upload a PDF file, it can contain only images, only text, or both... Commented Oct 7, 2015 at 10:31
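
For PDFs that mix text pages and scanned pages, as raised in the comments above, one option is to apply the word-count idea per page instead of per file. The sketch below assumes pdftotext's default behaviour of ending each page with a form-feed character (that is, it was run without -nopgbrk). pagesNeedingOcr is a hypothetical name:

```php
<?php
// Hypothetical helper: given the full text written by pdftotext, return the
// 1-based numbers of the pages whose text has fewer than $minWords words.
function pagesNeedingOcr(string $pdftotextOutput, int $minWords = 10): array
{
    // Assumption: each page ends with a form feed, so drop the final one
    // before splitting to avoid a phantom empty page at the end.
    $body = preg_replace('/\f\z/', '', $pdftotextOutput);
    $sparse = [];
    foreach (explode("\f", $body) as $i => $pageText) {
        $words = preg_split('/\s+/', $pageText, -1, PREG_SPLIT_NO_EMPTY);
        if (count($words) < $minWords) {
            $sparse[] = $i + 1;
        }
    }
    return $sparse;
}
```

Only the returned pages would then need to go to image extraction and OCR. Both pdfimages and pdftotext accept -f and -l to limit the page range.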

I have not tested this against every PDF file in the world, so there may be some false positives or false negatives, but this code works for me and does exactly what the OP wants. When the uploaded PDF has some text in it, I use a PHP text extraction library; if it is an image-only PDF, I send it off to several OCR-as-a-service endpoints. (Although not part of the OP's question: why do I send it to multiple services? Because I have not found a single one that is accurate. By using six, I usually find that the text I am searching for is recognized by at least one of them, but OCR is still hit and miss. Of course, if the PDF is text based, my search results after extraction are 100% accurate.)

$path='path to your pdf file';
// read the raw pdf bytes
$buf=file_get_contents($path);
// no /Font entry: there is no text to extract
if(strpos($buf,'/Font')===false){
 print $path.' is an image only pdf';
}else{
 // fonts but no /Image entry: text only
 if(strpos($buf,'/Image')===false){
  print $path.' is a text only pdf';
 }else{
  print $path.' is a pdf consisting of both text and images';
 }
}
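
One likely source of the false negatives mentioned above: PDFs that use compressed object streams can keep their font dictionaries inside compressed data, where strpos cannot see the /Font name. A hedged variant of the same scan, written as a function on the raw bytes, could return "unknown" in that case. The name classifyPdfBytes and the return labels are assumptions for illustration:

```php
<?php
// Hypothetical helper: classify raw PDF bytes by scanning for name tokens.
function classifyPdfBytes(string $buf): string
{
    $hasFont  = strpos($buf, '/Font') !== false;
    $hasImage = strpos($buf, '/Image') !== false;
    // If object streams are present, a missing '/Font' proves nothing:
    // the font dictionaries may simply be compressed.
    if (!$hasFont && strpos($buf, '/ObjStm') !== false) {
        return 'unknown';
    }
    if (!$hasFont) {
        return 'image-only';
    }
    return $hasImage ? 'text-and-images' : 'text-only';
}
```

A file classified as "unknown" could fall back to the word-count check on the pdftotext output.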
