Show HN: Free API to extract PDF data

3 points by leftnode 6 hours ago

Hi HN,

Like everyone, I'm working on a product that uses LLMs to extract data from photos and documents. Part of the processing pipeline is extracting data from PDFs as raw text or a raster image.

As part of our leadgen strategy, we've opened up our REST API for processing individual pages of a PDF. The API is completely free to use anonymously, but anonymous requests are rate limited to 1 page per 30 seconds. Creating a free account removes this restriction.
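If you script against the API anonymously, a small client-side throttle keeps you under the 1 page per 30 seconds limit. This is only a sketch: the `Throttle` class is hypothetical, and how the server actually measures the window is an assumption on my part.

```python
import time


class Throttle:
    """Space calls at least `min_interval` seconds apart (client-side only)."""

    def __init__(self, min_interval=30.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time at which the previous call was allowed

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Call `wait()` right before each request; the first call returns immediately and later calls sleep only as long as needed.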

The two endpoints are:

- https://extract.dev/api/pages/extract/raster - Rasterize a page of a PDF

- https://extract.dev/api/pages/extract/text - Extract text from a page of a PDF

Both have the same request format:

    {
        "file": "https://assets.extract-cdn.com/data/hd-receipt.pdf",
        "page": 1
    }
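A minimal Python sketch of building that request with only the standard library. It assumes the endpoints accept a JSON body over POST, which the post does not state; `build_request` and `extract_page` are hypothetical helper names, and the response format is left unparsed because it is not described here (see the docs).

```python
import json
import urllib.request

BASE_URL = "https://extract.dev/api/pages/extract"


def build_request(kind, file_url, page):
    """Build a request for the 'text' or 'raster' endpoint."""
    if kind not in ("text", "raster"):
        raise ValueError(f"unknown endpoint: {kind}")
    body = json.dumps({"file": file_url, "page": page}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{kind}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the HTTP method is not given in the post
    )


def extract_page(kind, file_url, page, timeout=30):
    """Send the request and return the raw response bytes, unparsed."""
    with urllib.request.urlopen(build_request(kind, file_url, page), timeout=timeout) as resp:
        return resp.read()
```

For example, `extract_page("text", "https://assets.extract-cdn.com/data/hd-receipt.pdf", 1)` would send the body shown above to the text endpoint.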

The rest of the documentation is here: https://extract.dev/docs

Under the hood, the API uses Poppler to extract text and rasterize pages. Note that text extraction returns the actual text encoded in the PDF and does not employ an OCR model, so a scanned page with no text layer won't yield any text. Give it a spin. I'd like to hear whether this is useful or not.
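For anyone curious what the Poppler side looks like locally, here is a rough sketch using the `pdftotext` and `pdftoppm` command-line tools on a single page. These are not necessarily the commands the API runs; the helper names and the 150 DPI default are assumptions for illustration.

```python
import subprocess


def pdftotext_cmd(pdf_path, page):
    """Command that prints the text of one page to stdout ('-' means stdout)."""
    return ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"]


def pdftoppm_cmd(pdf_path, page, out_prefix, dpi=150):
    """Command that rasterizes one page to a PNG named after out_prefix."""
    return ["pdftoppm", "-f", str(page), "-l", str(page),
            "-r", str(dpi), "-png", pdf_path, out_prefix]


def extract_text(pdf_path, page):
    """Run pdftotext (Poppler must be installed) and return the page text."""
    result = subprocess.run(pdftotext_cmd(pdf_path, page),
                            check=True, capture_output=True, text=True)
    return result.stdout
```

Like the API, `pdftotext` only returns text that is actually encoded in the PDF.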