Drop an image (PNG, JPG, WEBP, BMP, TIFF) or a scanned PDF into the box above, then wait for the progress bar to complete. The tool runs Tesseract.js entirely in your browser - your file is never uploaded to any server. Copy the result or download it as a .txt file, then open it in the Diff Checker to compare two versions.
How to use this tool
Drop a file into the upload zone or click it to browse. The tool detects whether the file is an image or a PDF automatically. For PDFs, each page is rendered to a canvas and OCR'd sequentially - a progress bar shows which page is being processed. For images, recognition starts immediately. When complete, the extracted text appears in the text box below. Use the Copy or Download buttons to save the result.
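The page-by-page flow for PDFs can be sketched as a simple sequential loop. This is a minimal illustration rather than the tool's actual code: `ocrPagesSequentially`, `RecognizePage`, and `ProgressCallback` are hypothetical names, and the recognition step is passed in as a function so the sketch does not assume any particular Tesseract.js call.

```typescript
// Hypothetical signature: turns one rendered page into text.
// In the real tool this role is played by the OCR engine; here it is injected.
type RecognizePage<P> = (page: P) => Promise<string>;

// Called before each page starts so a progress bar can show "page N of M".
type ProgressCallback = (currentPage: number, totalPages: number) => void;

async function ocrPagesSequentially<P>(
  pages: readonly P[],
  recognizePage: RecognizePage<P>,
  onProgress: ProgressCallback,
): Promise<string> {
  const pageTexts: string[] = [];
  for (let i = 0; i < pages.length; i++) {
    onProgress(i + 1, pages.length);
    // Awaiting inside the loop means the next page starts only
    // after the current one has finished.
    pageTexts.push(await recognizePage(pages[i]));
  }
  // Separate pages with a blank line in the combined output.
  return pageTexts.join("\n\n");
}
```

Processing one page at a time keeps memory use bounded to a single rendered canvas and makes the progress report straightforward, at the cost of not using parallel workers.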
Tesseract.js (~25 MB including the English language model) loads from a CDN the first time you use the tool. Subsequent uses in the same browser session skip this download.
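One way a page could avoid repeating an expensive load is to memoise the loader, as in the sketch below. This is an assumption for illustration only: `loadOnce` is a hypothetical helper, and the tool may simply rely on the browser's own HTTP cache instead.

```typescript
// Wraps an expensive async loader so it runs at most once while the page is open.
function loadOnce<T>(loader: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      // Cache the promise itself so concurrent callers share one download.
      cached = loader().catch((err: unknown) => {
        cached = undefined; // allow a retry after a failed download
        throw err;
      });
    }
    return cached;
  };
}
```

Calling the returned function twice, even at the same moment, triggers the loader only once; a failed load clears the cache so the next call can try again.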
Supported file formats
| Format | Extension | Notes |
|---|---|---|
| JPEG | .jpg, .jpeg | Most common for photos and scanned documents |
| PNG | .png | Best for screenshots and documents with text on white backgrounds |
| WebP | .webp | Modern image format used by many web apps and screenshots |
| BMP | .bmp | Uncompressed Windows bitmap; large files but lossless |
| TIFF | .tif, .tiff | Standard format for high-resolution document scans |
| PDF (scanned) | .pdf | Each page rendered to image then OCR'd. Text-based PDFs: use Compare PDF instead |
If you have a text-based (non-scanned) PDF, the Compare PDF tool extracts the embedded text layer directly - much faster and more accurate than OCR.
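The formats in the table above can be expressed as a small lookup that decides whether a dropped file is an image, a PDF, or unsupported. This is a sketch under assumptions: `detectFileKind` and the constant names are hypothetical, and the tool's real detection logic is not documented here.

```typescript
type FileKind = "image" | "pdf" | "unsupported";

// Extensions and MIME types for the image formats listed in the table.
const IMAGE_EXTENSIONS: readonly string[] = ["jpg", "jpeg", "png", "webp", "bmp", "tif", "tiff"];
const IMAGE_MIME_TYPES: readonly string[] = [
  "image/jpeg", "image/png", "image/webp", "image/bmp", "image/tiff",
];

function detectFileKind(fileName: string, mimeType: string = ""): FileKind {
  const mime = mimeType.toLowerCase();
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";

  if (mime === "application/pdf" || ext === "pdf") return "pdf";
  if (IMAGE_MIME_TYPES.indexOf(mime) !== -1 || IMAGE_EXTENSIONS.indexOf(ext) !== -1) {
    return "image";
  }
  return "unsupported";
}
```

Checking both the MIME type and the extension covers files whose browser-reported type is empty, which can happen for less common formats.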
How browser-based OCR works
This tool uses Tesseract.js, a WebAssembly port of the Tesseract OCR engine. Tesseract is one of the most widely used open-source OCR engines; it was originally developed at HP, released as open source in 2005, and developed with Google's sponsorship from 2006.
The recognition pipeline has three stages:
- Image preprocessing. The image is loaded into the browser's memory. For PDFs, PDF.js renders each page to a canvas element at 2× scale (approximately 144 DPI).
- Layout analysis. Tesseract segments the image into text blocks, lines, and words.
- Text recognition. Tesseract's LSTM neural network reads each text line using the trained English model and outputs the recognised characters, which are assembled into words and lines.
All of this happens inside your browser tab, using WebAssembly for near-native performance. No data leaves your device.
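The "2× scale is approximately 144 DPI" figure follows from PDF page sizes being measured in points, where one point is 1/72 inch. The arithmetic can be sketched as below; the function names are hypothetical and the rounding is an assumption about how a canvas size might be chosen.

```typescript
// PDF page dimensions are expressed in points: 1 point = 1/72 inch.
const PDF_POINTS_PER_INCH = 72;

// Rendering at scale s produces s pixels per point, i.e. s * 72 pixels per inch.
function scaleToDpi(scale: number): number {
  return scale * PDF_POINTS_PER_INCH;
}

// Canvas size in pixels for a page of the given size in points.
function renderedSizePx(
  widthPt: number,
  heightPt: number,
  scale: number,
): { width: number; height: number } {
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// US Letter is 612 x 792 points (8.5 x 11 inches).
const dpiAt2x = scaleToDpi(2);                  // 144
const letterAt2x = renderedSizePx(612, 792, 2); // { width: 1224, height: 1584 }
```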
Getting accurate results
OCR accuracy depends almost entirely on input quality. The same engine can achieve 99% accuracy on a clean 300 DPI scan and under 70% on a blurry photo of a document. The most impactful things you can do:
- Scan at 300 DPI or higher. Below 200 DPI, character edges become too blurry for reliable recognition. Most scanner apps default to 200–300 DPI; set it to at least 300.
- Maximise contrast. Black text on white paper is ideal. Coloured backgrounds, watermarks, and faint text all reduce accuracy. Scan in greyscale or black-and-white mode if possible.
- Keep the page flat and straight. Curved pages (from book scans), skewed pages, or folded corners introduce geometric distortion. Most scanner software includes a deskew option.
- Use a consistent font. Standard serif and sans-serif fonts (Times New Roman, Arial, Helvetica) are recognised with much higher accuracy than decorative, handwritten, or stylised typefaces.
- Avoid JPEG compression artefacts. JPEG compression at low quality settings introduces blocky artefacts around text edges. PNG or TIFF at 300 DPI produces cleaner input.
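The DPI guidance above can be checked with simple arithmetic: effective DPI is the image width in pixels divided by the physical page width in inches. The sketch below uses hypothetical names (`effectiveDpi`, `rateDpi`); the 200 and 300 thresholds come from the list above, while the "marginal" label for the range in between is an assumption.

```typescript
type DpiRating = "too-low" | "marginal" | "recommended";

// Effective resolution of a scan or photo of a page of known physical width.
function effectiveDpi(imageWidthPx: number, pageWidthInches: number): number {
  return imageWidthPx / pageWidthInches;
}

// Below 200 DPI recognition becomes unreliable; 300 DPI or more is the target.
function rateDpi(dpi: number): DpiRating {
  if (dpi < 200) return "too-low";
  if (dpi < 300) return "marginal";
  return "recommended";
}

// An 8.5-inch-wide page captured 2550 pixels wide is a 300 DPI scan.
const letterScanDpi = effectiveDpi(2550, 8.5); // 300
const letterScanRating = rateDpi(letterScanDpi); // "recommended"
```

The same check explains why phone photos often disappoint: a page that fills only part of the frame has far fewer pixels per inch than the camera's resolution suggests.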
Use cases
Digitising scanned contracts and legal documents
Older contracts, agreements, and legal filings stored as scanned PDFs contain no embedded text. Extract the text with this tool, then compare two versions using the Diff Checker or Compare Documents tool to identify changes between drafts.
Extracting text from screenshots
When text is locked inside a screenshot - a software error message, a social media post, a slide from a presentation - OCR converts it to editable text in seconds without manual retyping.
Processing scanned academic papers
Many older research papers, theses, and textbooks are only available as scanned images. OCR extracts the text so it can be searched, quoted, or summarised.
Business card and invoice data extraction
Photographs of business cards, receipts, or invoices contain structured text that OCR can extract. The result can then be cleaned up manually or fed into a spreadsheet.
FAQs
Is my file kept private?
Yes. The entire OCR process runs inside your browser using WebAssembly. Your file never leaves your device and is never sent to any server.
What languages are supported?
This tool recognises English. Tesseract supports 100+ languages, but each language model is 10–20 MB. Loading multiple language packs in the browser would result in large downloads. For non-English documents, dedicated desktop OCR software (such as Adobe Acrobat, ABBYY FineReader, or Tesseract CLI with the appropriate language pack) will give better results.
How accurate is it?
For clean, high-resolution scans of printed English text (300 DPI, good contrast, standard fonts), accuracy is typically 95–99%. Blurry photos, low contrast, skewed pages, handwriting, or decorative fonts reduce accuracy. See Getting accurate results above.
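The percentages above are not tied to a specific formula on this page. One common way to quantify OCR accuracy, assumed here purely for illustration, is character accuracy: one minus the edit distance between the OCR output and a known-correct reference, divided by the reference length. The function names are hypothetical.

```typescript
// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution or match
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy in the range 0..1, measured against a trusted reference.
function characterAccuracy(reference: string, ocrOutput: string): number {
  if (reference.length === 0) return ocrOutput.length === 0 ? 1 : 0;
  const errors = editDistance(reference, ocrOutput);
  return Math.max(0, 1 - errors / reference.length);
}
```

For example, comparing the reference "hello world" (11 characters) with the OCR output "hello w0rld" gives one substitution, so the character accuracy is 1 − 1/11, or about 91%.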
Can it read handwriting?
Not reliably. Tesseract is trained primarily on printed typefaces. It may recognise some clear, printed-style handwriting but performs poorly on cursive script. For handwriting recognition, dedicated tools trained on handwritten text (such as Google Lens or Microsoft Azure AI Vision) are significantly more accurate.
Why is the text in the wrong order?
The extracted text follows the order in which Tesseract detects text blocks in the image, which may not match the natural reading order for multi-column layouts, tables, sidebars, or footnotes. Single-column documents usually come out in the correct order. For complex layouts, expect to reorder sections of the extracted text manually.
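If block positions are available, a simple heuristic can restore column order for plain multi-column pages. The sketch below is not part of this tool: `TextBlock` and `orderByColumns` are hypothetical, and the approach assumes columns do not overlap horizontally.

```typescript
// Hypothetical block shape: text plus its bounding-box position in pixels.
interface TextBlock {
  text: string;
  left: number;
  top: number;
  width: number;
}

// Groups blocks into columns by horizontal overlap, then reads each column
// top to bottom, columns left to right.
function orderByColumns(blocks: readonly TextBlock[]): TextBlock[] {
  const byLeft = blocks.slice().sort((a, b) => a.left - b.left);
  const columns: { right: number; blocks: TextBlock[] }[] = [];

  for (const block of byLeft) {
    const last = columns[columns.length - 1];
    const blockRight = block.left + block.width;
    if (last === undefined || block.left >= last.right) {
      // Starts to the right of everything seen so far: a new column.
      columns.push({ right: blockRight, blocks: [block] });
    } else {
      last.blocks.push(block);
      last.right = Math.max(last.right, blockRight);
    }
  }

  const ordered: TextBlock[] = [];
  for (const column of columns) {
    column.blocks.sort((a, b) => a.top - b.top);
    for (const block of column.blocks) ordered.push(block);
  }
  return ordered;
}
```

A full-width heading that spans both columns would merge them into one group under this heuristic, so it suits simple two-column pages rather than mixed layouts.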
My PDF has text but OCR gives garbage output
If your PDF already has an embedded text layer (i.e. it is not a scanned image), do not use this tool. Use the Compare PDF tool instead - it extracts the embedded text directly, which is faster and avoids recognition errors altogether. This OCR tool is specifically for scanned PDFs where no text layer exists.
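Whether a PDF has a text layer can be checked before choosing a tool. The sketch below assumes a document object shaped like the one PDF.js exposes (`numPages`, 1-based `getPage`, and `getTextContent` returning `items`); the interface names and `hasTextLayer` are hypothetical, and the three-page sample size is an arbitrary choice.

```typescript
// Minimal structural types modelled on the PDF.js document and page objects.
interface TextContentLike { items: readonly unknown[]; }
interface PageLike { getTextContent(): Promise<TextContentLike>; }
interface PdfDocumentLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Returns true if any of the first few pages contains embedded text items.
async function hasTextLayer(pdf: PdfDocumentLike, pagesToCheck = 3): Promise<boolean> {
  const limit = Math.min(pdf.numPages, pagesToCheck);
  for (let n = 1; n <= limit; n++) { // page numbers are 1-based
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    if (content.items.length > 0) return true;
  }
  return false;
}
```

Note that a scanned PDF that has already been run through OCR software may carry a hidden text layer, so a `true` result does not guarantee the embedded text is accurate.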