Strange behavior of getDocumentProxy's buffer when extracting text AND rendering page as image (only for some pdf) #17

Open
ndrbrt opened this issue Aug 29, 2024 · 4 comments
Labels
bug Something isn't working

Comments


ndrbrt commented Aug 29, 2024

Environment

node v20.11.1
unpdf v0.11.0

Reproduction

I got the original error in a server route of a Nuxt 3 project. Also, in the original app I performed other operations besides text/metadata extraction and image rendering.

Anyway, I prepared a new Nitro project for this issue and isolated only the error involved. You can find the repo here: https://github.com/ndrbrt/unpdf-issue

Describe the bug

First of all, I noticed the issue only for some pdfs (actually pdfs with images, but I don't know if it's something comparable to #4, nor if it only affects pdfs with images).

Error A

The original code was similar to that in server/api/error-a.ts.

If you run the dev server and open, e.g.:

You get the following error:

[nitro] [request error] [unhandled] Cannot read properties of undefined (reading 'createCanvas')
  at i.constructor._createCanvas (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1552904)
  at i.constructor.create (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1399305)
  at CachedCanvases.getCanvas (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1474861)
  at CanvasGraphics.beginGroup (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1502437)
  at CanvasGraphics.executeOperatorList (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1482511)
  at InternalRenderTask._next (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1591245)
  at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

However, as I said, if you pass some other pdfs, everything's fine, e.g.:

Working version

Now, the only way I was able to solve the problem is as in server/api/working.ts: I copied the original buffer before it was passed to getDocumentProxy and then passed the copied buffer to renderPageAsImage. You can see that both requests succeed:
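
The copy-before-use workaround can be illustrated without unpdf at all. The sketch below is not the code from server/api/working.ts; `consumeByTransfer` is a hypothetical stand-in for any consumer that takes ownership of its input by transferring the underlying ArrayBuffer. Whether getDocumentProxy really does this internally is an assumption that nothing in this thread confirms. It needs Node 17+ for the global `structuredClone`.

```typescript
// Hypothetical stand-in for a consumer that transfers (and thereby detaches)
// the ArrayBuffer behind its input. This is an assumption about library
// behavior, used only to show why a copy made beforehand stays usable.
function consumeByTransfer(data: Uint8Array): number {
  const moved = structuredClone(data, {
    transfer: [data.buffer as ArrayBuffer],
  });
  return moved.byteLength;
}

// Four bytes standing in for the PDF contents ("%PDF").
const pdfBytes = new Uint8Array([0x25, 0x50, 0x44, 0x46]);

// Copy BEFORE the first consumer sees the data. Passing a typed array to the
// Uint8Array constructor allocates a new, independent ArrayBuffer.
const pdfBytesCopy = new Uint8Array(pdfBytes);

// First consumer (think: text extraction) takes ownership of the original.
const bytesSeenByFirstConsumer = consumeByTransfer(pdfBytes);

// The original is now detached and reports a length of zero.
const originalLengthAfter = pdfBytes.byteLength;

// Second consumer (think: image rendering) still gets intact data.
const bytesSeenBySecondConsumer = consumeByTransfer(pdfBytesCopy);

console.log({
  bytesSeenByFirstConsumer,
  originalLengthAfter,
  bytesSeenBySecondConsumer,
});
```

If this is what happens, the copy has to be made before the first call, because copying an already detached buffer yields nothing.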

Error B

I also tried another approach in server/api/error-b.ts, passing a new Uint8Array(buffer) directly to renderPageAsImage. This way, if you open:

You get this error:

[nitro] [request error] [unhandled] Unable to deserialize cloned data.
  at LoopbackPort.postMessage (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1573782)
  at MessageHandler.sendWithPromise (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1514035)
  at ./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1561726
  at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

Interestingly, in this case, if you repeat the request disabling text extraction (note the query param), it works:
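
One thing worth checking for Error B is whether `new Uint8Array(buffer)` actually copies anything. The sketch below uses only standard JavaScript. It assumes `buffer` in server/api/error-b.ts is an ArrayBuffer, which may not be the case since that file is not shown here. Under that assumption the constructor creates a view over the same memory, so the "copy" is detached together with the original.

```typescript
const source = new Uint8Array([1, 2, 3, 4]);
const arrayBuffer = source.buffer as ArrayBuffer;

// Constructed from an ArrayBuffer: a VIEW that shares memory with `source`.
const view = new Uint8Array(arrayBuffer);

// Constructed from a typed array: a COPY backed by a new ArrayBuffer.
const copyFromTypedArray = new Uint8Array(source);

// ArrayBuffer.prototype.slice also allocates a new buffer.
const copyFromSlice = new Uint8Array(arrayBuffer.slice(0));

// A write through `source` is visible in the view but not in the copies.
source[0] = 99;
const viewSeesWrite = view[0] === 99;
const copiesAreIsolated =
  copyFromTypedArray[0] === 1 && copyFromSlice[0] === 1;

// Transferring the shared buffer detaches every view over it.
structuredClone(arrayBuffer, { transfer: [arrayBuffer] });
const viewLengthAfterTransfer = view.byteLength;
const copyLengthAfterTransfer = copyFromTypedArray.byteLength;

console.log({
  viewSeesWrite,
  copiesAreIsolated,
  viewLengthAfterTransfer,
  copyLengthAfterTransfer,
});
```

If `buffer` is already a Uint8Array or a Node.js Buffer, the constructor copies and this explanation does not apply.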

Additional context

I did not use the official PDF.js build, because I couldn't get it to work. I still tried using the default build from unpdf and everything worked fine until I noticed the mentioned problem.

Logs

No response

@ndrbrt ndrbrt added the bug Something isn't working label Aug 29, 2024
@johannschopplich
Collaborator

Hi there!
Thanks for the thorough issue description. One question: how did you deploy the app? Canvas support is only available on Node.js deploy targets.

Author

ndrbrt commented Oct 2, 2024

Hi @johannschopplich, I deployed the app on Vercel using the default config as in https://nuxt.com/deploy/vercel.
(It works the same way both on Vercel and locally)

Collaborator

johannschopplich commented Oct 2, 2024

I see. It's probably not gonna work on Vercel, since the canvas module requires Node.js bindings.

For your other examples: Please use the official PDF.js build, because canvas support has been stripped from the serverless build (which unpdf uses by default).
Can you please follow the renderPageAsImage guide to set up the pdfjs-dist build used together with canvas?

import { configureUnPDF, renderPageAsImage } from "unpdf";

await configureUnPDF({
  // Use the official PDF.js build
  pdfjs: () => import("pdfjs-dist"),
});

const result = await renderPageAsImage(pdf, 1, {
  canvas: () => import("canvas"),
});

Author

ndrbrt commented Oct 10, 2024

Actually I did try to use pdfjs-dist, but it resulted in an error.

yarn add pdfjs-dist

await configureUnPDF({
  // Use the official PDF.js build
  pdfjs: async () => await import('pdfjs-dist'),
})

 ERROR  [nuxt] [request error] [unhandled] [500] Resolving failed. Please check the provided configuration.
  at resolvePDFJSImports (./node_modules/unpdf/dist/index.mjs:33:13)
  at async configureUnPDF (./node_modules/unpdf/dist/index.mjs:179:5)
  at Object.handler (./server/api/test.ts:5:1)
  at async ./node_modules/h3/dist/index.mjs:1975:19
  at async Object.callAsync (./node_modules/unctx/dist/index.mjs:72:16)
  at async Server.toNodeHandle (./node_modules/h3/dist/index.mjs:2266:7)
