How to detect PDF pages with scanned content?


Asked 9 months ago

Modified 7 months ago

Viewed 33 times

Part of AWS Collective

I'm trying to build an algorithm that can detect which pages of a PDF need AWS Textract because they contain scanned content. In my use case, some parts of a document are plain text while other parts are scanned, and the two are interspersed throughout the document. I need a way, in Python, to decide when to send a page to Textract and when an ordinary text-extraction library is enough for the plain text.
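One possible starting point is a per-page heuristic: a page with almost no extractable text layer but at least one embedded image is probably a scan. The sketch below assumes PyMuPDF (imported as `fitz`) is installed. The names `PageStats`, `needs_ocr`, `page_stats`, `pages_needing_ocr`, and the `min_chars` threshold of 20 are illustrative assumptions, not part of any library, and the threshold would need tuning on real documents.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    text_chars: int   # non-whitespace characters found in the page's text layer
    image_count: int  # embedded images referenced by the page


def needs_ocr(stats: PageStats, min_chars: int = 20) -> bool:
    """Heuristic (an assumption, not a guarantee): little or no text
    plus at least one image suggests the page is a scan."""
    return stats.text_chars < min_chars and stats.image_count > 0


def page_stats(pdf_path: str) -> List[PageStats]:
    """Collect text and image counts for every page using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the heuristic is usable without it

    stats = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()                # plain text from the text layer
            chars = len("".join(text.split()))    # ignore whitespace
            images = page.get_images(full=True)   # images referenced by this page
            stats.append(PageStats(chars, len(images)))
    return stats


def pages_needing_ocr(pdf_path: str, min_chars: int = 20) -> List[int]:
    """Return zero-based indexes of pages that look scanned."""
    return [
        i for i, s in enumerate(page_stats(pdf_path))
        if needs_ocr(s, min_chars)
    ]
```

Calling `pages_needing_ocr("document.pdf")` (a hypothetical path) would return the page indexes to route to Textract, and the remaining pages could go through normal text extraction. This heuristic would likely miss pages that mix typed text with a scanned region, and it would skip scans that already carry a hidden OCR text layer, so it is only a first filter.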

edited Sep 23, 2024 at 3:36

John Rotenstein

271k reputation, 28 gold badges, 448 silver badges, 532 bronze badges

asked Jul 25, 2024 at 15:36

Sebastian Chavarry Gutierrez

256 bronze badges
