All Questions
300 questions
1
vote
1
answer
67
views
How to extract bold text from a PDF file [closed]
I'm working on a project where I need to extract only the bold text from PDF files using Python.
At first, I tried using libraries like PyMuPDF (fitz) and pdfminer, extracting the PDF as HTML and ...
0
votes
0
answers
49
views
PyMuPDF - Extract table contents
I am trying to extract the table text from a PDF:
With the following code I get:
page 0 of page-1-ocr.pdf
Tables rowsasf 49
texysdft [['', '', 'Staatlic', 'he Fische', 'rprüfung', 'in Bayern - Prü', '...
0
votes
0
answers
45
views
Improving OCR Accuracy for Digits Inside Circles from Vector PDF Images (AutoCAD Export)
I'm working on an OCR task where I need to detect and read numbers that are inside circles. The original data comes from .dwg (AutoCAD) files, and the PDFs I use are not scanned — they’re exported ...
1
vote
0
answers
46
views
Preserve Empty Columns When Extracting Tables from PDF
I have 25–30 different types of PDF documents, each containing tables with varying structures. My ultimate goal is to extract table data from specific headings (i.e., between certain titles) and ...
4
votes
2
answers
827
views
Mistral AI OCR not returning anything useful
I am trying to extract a table from a PDF.
I was able to use the Le Chat feature of Mistral and get a super great result, but when I try to use the API to programmatically get the same result, I am ...
1
vote
2
answers
113
views
Read numbers under barcode (not barcode stripes itself) in .NET 8
I need help reading the numbers under a barcode (not the stripes) from a PDF file. My idea was to convert the page to an image and then read it.
This is an example of a barcode that is located in the top right part of ...
0
votes
0
answers
33
views
How to detect PDF pages with scanned content?
I'm trying to build an algorithm that can detect pages that need AWS Textract because they contain scanned content. The use case is that some documents have plain text content but ...
0
votes
0
answers
73
views
How to convert non-readable PDF into readable PDF with OcrMyPdf: troubles with tesseract and configparser
I'm trying to convert a scanned PDF into a readable one.
The original PDF contains text, tables, images/logos. The desired output file should be exactly the same as the original file.
I found ...
1
vote
1
answer
172
views
Embed/Insert/Add JSON OCR data generated by 'Google Cloud Vision (OCR)' inside a PDF file and make the PDF searchable
I am using the Google Cloud Vision API (OCR) to detect text in PDF files using the PHP API Library. The OCR works perfectly and I have saved the complete set of JSON output files (e.g. output-1-to-2.json)...
0
votes
1
answer
197
views
Does Datacap have a way to read searchable pdf instead of using OCR?
I'm working with searchable PDFs, so there is no need to OCR the documents. Is there a rule/action in Datacap that can read the data instead of using OCR?
1
vote
2
answers
1k
views
OCR - Azure Document Intelligence to recreate document digitally
Where I work we have lots of scanned documents. We want to digitize them without losing the general format of the document. A document can have many key-value pairs like forms, titles, paragraphs, ...
1
vote
1
answer
42
views
Definite OCR Quality [closed]
I have a low-quality English PDF file with no images and no tables, single-column and completely black and white - not even gray. I used ABBYY FineReader and it detected the text just fine and I can search ...
0
votes
1
answer
692
views
How can I extract the PDF section/chapter titles with Python?
I want to add the page titles in the PDF to an array with a loop. I have tried many ways so far but I have not succeeded. How can it be done?
I tried to do it by selecting the first lines on the page, ...
0
votes
0
answers
168
views
Extract PDF data in C# using OCR which include datatables
I have a PDF which contains tables and some header and value fields.
I want to extract this data into C# objects using OCR, so that I can insert it into a database.
The PDF data is in the form as ...
0
votes
0
answers
49
views
Issues with Extracting Tables from bank transaction PDFs
I am working on Python code for extracting tables from bank transaction PDFs (not image-based PDFs).
So far I have tried extracting tabular data using Tabula and Camelot, but these are not ...