I have an application where users can upload PDFs, which are converted to text for further processing.
The problem is that some of the uploaded files are image-only PDFs, so the text conversion does not work. Instead of splitting every PDF into images and running OCR on all of them, I would prefer to send only those that are detected to be images. Is there a way to do this? I'm working in a Linux (Debian) environment with PHP.
UPDATE
While searching for a final solution I have followed @Andrew's suggestion: count the number of words in the generated txt file and, if there are fewer than 10, proceed to the next step, converting the PDF to images for later OCR, which is what I'm working on now...
// convert any file with a pdf extension to text
// (escapeshellarg keeps file names with quotes or spaces from breaking the command)
$cmd = "pdftotext -eol unix " . escapeshellarg($uploadedfile);
shell_exec($cmd);
// save the original file in the orig directory
rename($uploadedfile, "orig/$uploadedfile");
// pdftotext names its output after the input, with a .txt extension, so I need that file name
$textfile = preg_replace('/\.(pdf|PDF)$/', '.txt', $uploadedfile);
// count the words in the generated txt file
$cmd = "wc -w < " . escapeshellarg($textfile);
$wc = (int) shell_exec($cmd);
// proceed if there are fewer than 10 words
if ($wc < 10)
{
    // strip the pdf extension to get the directory name
    $imgdir = preg_replace('/\.(pdf|PDF)$/', '', $uploadedfile);
    mkdir($imgdir);
    // change the pdf extension to jpg for the image files
    $imgfile = preg_replace('/\.(pdf|PDF)$/', '.jpg', $uploadedfile);
    // convert pdf to images
    $cmd = "convert " . escapeshellarg("orig/$uploadedfile") . " " . escapeshellarg("$imgdir/$imgfile");
    shell_exec($cmd);
}
The OCR step will come after that...
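The word-count check could also be done in PHP itself instead of shelling out to wc. The following is only a minimal sketch of that idea: the helper names `countWords` and `needsOcr` are hypothetical, and the default threshold of 10 is simply the value used above.

```php
<?php
// Hypothetical helpers: the names and the default threshold are illustrative only.

/**
 * Count whitespace-separated tokens, similar in spirit to `wc -w`.
 * \s also matches form feeds and newlines, so a file that contains
 * nothing but whitespace counts as zero words.
 */
function countWords(string $text): int
{
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? 0 : count($tokens);
}

/**
 * Return true when the extracted text has fewer than $minWords words,
 * i.e. the PDF is probably image-only and should be sent to OCR.
 */
function needsOcr(string $text, int $minWords = 10): bool
{
    return countWords($text) < $minWords;
}
```

It might be called as `needsOcr((string) file_get_contents($textfile))`; a missing or unreadable text file then counts as zero words and is treated as needing OCR.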
UPDATE2
Thanks to @Mark-Setchell's suggestion I've changed the code a little; the last part now looks like this:
// strip the pdf extension to get the directory name
$imgdir = preg_replace('/\.(pdf|PDF)$/', '', $uploadedfile);
mkdir($imgdir);
// extract the embedded images from the pdf
$cmd = "pdfimages " . escapeshellarg("orig/$uploadedfile") . " " . escapeshellarg("$imgdir/$imgdir");
shell_exec($cmd);
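The path handling in this last step could be gathered into one small function. This is a sketch under assumptions, not a finished implementation: the name `buildPdfimagesJob` is hypothetical, and the `orig` default just mirrors the directory used above. It uses `pathinfo()` rather than a regex to drop the extension and `escapeshellarg()` to quote both arguments.

```php
<?php
// Hypothetical helper: the function name and the 'orig' default are illustrative only.

/**
 * Build the image directory name and the pdfimages command for an uploaded PDF.
 * Nothing is executed here; the caller creates the directory and runs the command.
 */
function buildPdfimagesJob(string $uploadedfile, string $origDir = 'orig'): array
{
    // pathinfo() drops the final extension whatever its case (.pdf, .PDF, ...)
    $base = pathinfo($uploadedfile, PATHINFO_FILENAME);

    // the original PDF is expected to have been moved into $origDir already
    $source = $origDir . '/' . basename($uploadedfile);

    // pdfimages takes an "image root" and appends a counter and extension to it
    $imageRoot = $base . '/' . $base;

    return [
        'imgdir' => $base,
        'cmd'    => 'pdfimages ' . escapeshellarg($source) . ' ' . escapeshellarg($imageRoot),
    ];
}
```

A caller would then do something like `$job = buildPdfimagesJob($uploadedfile); mkdir($job['imgdir']); shell_exec($job['cmd']);`.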