
All Questions

1 vote
1 answer
67 views

How to extract bold text from a PDF file [closed]

I'm working on a project where I need to extract only the bold text from PDF files using Python. At first, I tried using libraries like PyMuPDF (fitz) and pdfminer, extracting the PDF as HTML and ...
Marco Floriano's user avatar
0 votes
0 answers
49 views

PyMuPDF - Extract table contents

I am trying to extract the table text of a PDF. With the following code I get: page 0 of page-1-ocr.pdf Tables rowsasf 49 texysdft [['', '', 'Staatlic', 'he Fische', 'rprüfung', 'in Bayern - Prü', '...
Marc's user avatar
  • 3,934
0 votes
0 answers
45 views

Improving OCR Accuracy for Digits Inside Circles from Vector PDF Images (AutoCAD Export)

I'm working on an OCR task where I need to detect and read numbers that are inside circles. The original data comes from .dwg (AutoCAD) files, and the PDFs I use are not scanned — they’re exported ...
Jules Angebault's user avatar
1 vote
0 answers
46 views

Preserve Empty Columns When Extracting Tables from PDF

I have 25–30 different types of PDF documents, each containing tables with varying structures. My ultimate goal is to extract table data from specific headings (i.e., between certain titles) and ...
Requiet's user avatar
  • 85
4 votes
2 answers
827 views

Mistral AI OCR not returning anything useful

I am trying to extract a table from a PDF. I was able to use the Le Chat feature of Mistral and get a super great result, but when I try to use the API to programmatically get the same result, I am ...
Shelly Liu's user avatar
1 vote
2 answers
113 views

Read numbers under barcode (not barcode stripes itself) in .NET 8

I need help reading the numbers under a barcode (not the stripes) from a PDF file. My idea was to convert the page to an image and then read it. This is an example of the barcode, which is located in the top right part of ...
Pelle Woah's user avatar
0 votes
0 answers
33 views

How to detect PDF pages with scanned content?

I'm trying to build an algorithm that can detect pages that require AWS Textract because they contain scanned content. The use case is that some documents have plain-text content but ...
Sebastian Chavarry Gutierrez's user avatar
0 votes
0 answers
73 views

How to convert non-readable PDF into readable PDF with OcrMyPdf: troubles with tesseract and configparser

I'm trying to convert a scanned PDF into a readable one. The original PDF contains text, tables, images/logos. The desired output file should be exactly the same as the original file. I found ...
eljamba's user avatar
  • 407
1 vote
1 answer
172 views

Embed/Insert/Add JSON OCR data generated by 'Google Cloud Vision (OCR)' inside a PDF file and make the PDF searchable

I am using the Google Cloud Vision API (OCR) to detect text in PDF files using the PHP API Library. The OCR works perfectly and I have saved the complete set of JSON output files (e.g. output-1-to-2.json)...
sariDon's user avatar
  • 7,991
0 votes
1 answer
197 views

Does Datacap have a way to read searchable pdf instead of using OCR?

I'm working with a searchable PDF, so there is no need to OCR the document. Is there a rule/action in Datacap that can read the data directly instead of using OCR?
user3067752's user avatar
1 vote
2 answers
1k views

OCR - Azure Document Intelligence to recreate document digitally

Where I work we have lots of scanned documents. We want to digitize them without losing the general format of the document. A document can have many key-value pairs like forms, titles, paragraphs, ...
I NN_'s user avatar
  • 185
1 vote
1 answer
42 views

Definite OCR Quality [closed]

I have a low-quality English PDF file with no image, no table, single-column and completely black and white - not even gray. I used ABBYY FineReader and it detected the text just fine and I can search ...
Ebrahim Mehri's user avatar
0 votes
1 answer
692 views

How can I extract the PDF section/chapter titles with Python?

I want to add the page titles in the PDF to an array with a loop. I have tried many ways so far but I have not succeeded. How can it be done? I tried to do it by selecting the first lines on the page, ...
gofQ's user avatar
  • 1
0 votes
0 answers
168 views

Extract PDF data in C# using OCR, including data tables

I have a PDF which contains tables and some header and value fields. I want to extract this data into C# objects using OCR so that I can insert it into a database. The PDF data is in the form as ...
Radha's user avatar
  • 81
0 votes
0 answers
49 views

Issues with Extracting Tables from bank transaction PDFs

I am working on Python code for extracting tables from bank transaction PDFs (not image-based PDFs). So far I have extracted tabular data using Tabula and Camelot, but these are not ...
Gaurav Nambiar's user avatar
