FormX async call tutorial in NodeJS
What is an async (asynchronous) call?
An async (short for asynchronous) call is used when the extractor's result isn't returned immediately as part of the HTTP response. With a slight modification to a normal FormX HTTP call, instead of receiving the extractor's result, you'll get a `job_id` (a unique string identifier). This `job_id` acts as a key for retrieving the extractor job's result later. (In this context, *job* is synonymous with *task*, referring to the specific action you've requested FormX to perform in that HTTP request.)
When should I use an async call?
Use an async call when processing large files, such as PDFs with 4 or more pages. Normal synchronous calls may time out due to HTTP limitations. This occurs because the HTTP protocol has no way to distinguish between a server legitimately taking a long time to process a large file and a server encountering an error (such as an infinite loop). To prevent indefinite waiting, the protocol enforces a timeout, returning an HTTP timeout response after a set period. Async calls circumvent this limitation by allowing the server to process large files without being constrained by the HTTP timeout.
Technical details
As with a regular FormX extractor HTTP call, the required headers, such as `X-WORKER-EXTRACTOR-ID`, must be included, along with any optional headers you'd normally use. To enable an async call, simply set the `X-WORKER-ASYNC` header to `true` (it defaults to `false`, which is why we didn't need to specify it before). If you are using form data instead of headers, just append `async: true` to the form data in the extract HTTP request.

Reference: headers, formdata
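The steps above can be sketched as follows. This is a minimal illustration of enabling the async mode via the `X-WORKER-ASYNC` header; the token and extractor ID values are placeholders you must replace, and the request body is omitted.

```javascript
// Sketch: request options for an async extract call using headers.
// Placeholder credential values; replace with your own before use.
const options = {
  method: 'POST',
  headers: {
    'accept': 'application/json',
    'X-WORKER-TOKEN': 'replace with your worker access token',
    'X-WORKER-EXTRACTOR-ID': 'replace with your extractor ID',
    'X-WORKER-ASYNC': 'true' // defaults to 'false' when omitted
  }
  // body: attach the document to extract here (e.g. a file stream)
};

console.log(options.headers['X-WORKER-ASYNC']); // 'true'
```

Passing these options to your HTTP client of choice (e.g. `fetch`) yields the `job_id` response shown below instead of the extraction result.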
Here is a sample async extract response:
202 Accepted

```json
{
  "job_id": "<string>",
  "request_id": "<string>",
  "status": "ok"
}
```
Afterwards, repeatedly check on the job until it is done. The API for getting the async extraction result only requires the `X-WORKER-TOKEN` header for authorization and the `job_id` in the URL.
Here is a sample curl request:

```shell
curl --request GET \
  --url 'https://worker.formextractorai.com/v2/extract/jobs/{replace with job_id}' \
  --header 'X-WORKER-TOKEN: {replace with token}' \
  --header 'accept: application/json'
```
For intermediate and final HTTP responses, refer to the end of this tutorial.
A complete NodeJS sample
1. Create a folder named `async-example` (or your preferred name) for your project.
2. Open a terminal in the newly created folder.
3. Run `npm init` in the terminal. (Note: `npm` comes with the NodeJS installation.)
4. Keep pressing Enter for all prompts until you reach `test command:`.
5. For `test command:`, type `node index.js`.
6. Continue pressing Enter to finish the initialization.
7. Verify that `package.json` has been created in the folder.
8. Create a new file named `index.js` in the same folder.
9. In the terminal, run `npm install --save node-fetch@2 form-data` to install dependencies.
10. Open `index.js` in a text editor.
11. Paste the code below into `index.js`:
```javascript
const WORKER_TOKEN = 'replace with your worker access token';
const EXTRACTOR_ID = 'replace with your extractor ID';
const PDF_FILENAME = 'replace with PDF filename in the same folder as this index.js';
const ENDPOINT = 'worker.formextractorai.com'; // or 'sg-gcp.worker.formextractorai.com'
const EXTRACT_URL = `https://${ENDPOINT}/v2/extract`;
const WAIT_TIME = 1000; // time interval between get requests in milliseconds

const fs = require('fs');
const FormData = require('form-data');
const fetch = require('node-fetch');

async function performExtraction() {
  const formData = new FormData();
  formData.append('extractor_id', EXTRACTOR_ID);
  formData.append('async', 'true');
  formData.append('image', fs.createReadStream(PDF_FILENAME));

  const extractOptions = {
    method: 'POST',
    headers: {
      'accept': 'application/json',
      'X-WORKER-TOKEN': WORKER_TOKEN
    },
    body: formData
  };

  const response = await fetch(EXTRACT_URL, extractOptions);
  const json = await response.json();
  console.log('Response from async extract HTTP call:', json);

  if (json.status === 'ok') {
    return json.job_id;
  }
  throw new Error('Extraction failed');
}

async function getResult(jobID) {
  const getOptions = {
    method: 'GET',
    headers: {
      'accept': 'application/json',
      'X-WORKER-TOKEN': WORKER_TOKEN
    }
  };

  while (true) {
    const response = await fetch(`${EXTRACT_URL}/jobs/${jobID}`, getOptions);
    const json = await response.json();
    console.log('Response from async get result HTTP call:', json);

    if (json.status === 'ok') {
      return json;
    } else if (json.status !== 'pending') {
      throw new Error('Unexpected job status');
    }
    await new Promise(resolve => setTimeout(resolve, WAIT_TIME));
  }
}

async function main() {
  const jobID = await performExtraction();
  const result = await getResult(jobID);
  console.log('Final result:', result);
  // Process 'result' here
}

main();
```
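The sample polls at a fixed `WAIT_TIME` interval, which is fine for short jobs. For very large files you may prefer exponential backoff, so that early polls are quick but later ones don't hammer the server. Below is a minimal sketch of such a delay schedule; `backoffDelay` is a hypothetical helper, not part of the FormX API.

```javascript
// Sketch: exponential backoff delay schedule for polling, capped at a maximum.
// Doubles the base delay on each attempt: 1000, 2000, 4000, ... up to maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 16000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Delays for attempts 0..5: 1000, 2000, 4000, 8000, 16000, 16000
console.log([0, 1, 2, 3, 4, 5].map(a => backoffDelay(a)));
```

To use it, keep an attempt counter in the polling loop and pass `backoffDelay(attempt)` to `setTimeout` instead of the fixed `WAIT_TIME`.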
Remember to replace all the `const` values at the beginning. The PDF filename should include the `.pdf` extension, and the file should be placed in the project folder. Here is the project structure:
```
async-example/
├─ node_modules/
├─ index.js
├─ your_pdf_file.pdf
├─ package.json
└─ package-lock.json
```
Run the code with `npm run test` in the terminal. Here is the expected result:
- After the first extract HTTP call:

```
Response from async extract HTTP call: {
  status: 'ok',
  job_id: '<string>',
  request_id: '<string>'
}
```
- While the PDF is being processed, this response should be returned repeatedly:

```
Response from async get result HTTP call: { status: 'pending', job_id: '<string>' }
```
- Finally, the code will exit once the actual result is obtained:

```
Response from async get result HTTP call: {
  status: 'ok',
  metadata: {
    extractor_id: ...,
    request_id: ...,
    usage: <number of pages>,
    job_id: ...
  },
  documents: [
    {
      extractor_id: ...,
      metadata: [Object],
      data: [Object],
      detailed_data: [Object]
    },
    ...
  ]
}
```
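Once the final response arrives, the extracted fields live under each entry of `documents`. Below is a minimal sketch of consuming a result shaped like the response above; the `result` object here is fabricated for illustration, and the `total` field is a hypothetical extracted value, not a guaranteed FormX field name.

```javascript
// Hypothetical final result mirroring the response shape above.
const result = {
  status: 'ok',
  metadata: { usage: 2 },
  documents: [
    { extractor_id: 'ext-1', data: { total: '42.00' } },
    { extractor_id: 'ext-1', data: { total: '13.50' } }
  ]
};

// Each entry in `documents` corresponds to one extracted document,
// so processing the result is a matter of iterating over that array.
const totals = result.documents.map(doc => doc.data.total);
console.log(totals); // [ '42.00', '13.50' ]
```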