This tutorial and its sample source code explain how to extract data from a PDF document, and in particular how to extract borderless tables from a PDF to CSV, using the Document Parser functionality of the PDF.co Web API. The source code converts a PDF table to CSV in Java. Users can also parse JPG and PNG images to extract tables, barcodes, values, and fields from orders, invoices, statements, and other documents.

Features of Document Parser API Endpoint

The PDF.co Web API provides tools to convert any PDF document or scanned image to CSV, XML, or JSON. The API classifies incoming documents automatically: users can call the document classifier endpoint to detect and sort a document’s class based on their custom keyword-based rules or on AI.

The built-in document parser templates parse English-language invoices into their invoice ID, dates, tax, total, and other items. Users select the required template with the templateId parameter.

The PDF.co Web API provides a secure platform for users to upload sensitive information. The platform transmits user data over encrypted connections to ensure security. The PDF.co website describes its security protocols in detail.

Endpoint Parameters to Extract PDF Table to CSV in Java

Following are the Document Parser API endpoint parameters for converting the PDF to CSV:

  1. url: It is a required parameter that provides the URL to the source file. The PDF.co platform supports URLs from Dropbox, Google Drive, and built-in file storage of PDF.co.
  2. httpusername: It is an optional parameter and provides an HTTP auth user name to access the source URL if required.
  3. httppassword: It is an optional parameter and provides an HTTP auth password to access the source URL if needed.
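  4. templateId: It is a required parameter that sets the Id of the document parser template to use, unless the template code is passed directly through the template parameter, as the first sample below does.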
  5. template: It is an optional parameter. The users can provide the document parser template code using this parameter directly.
  6. inline: It is an optional parameter. The users can set it to true to return data as inline or false to return the link to an output file.
  7. outputFormat: It is an optional parameter. The user can set this parameter to generate the output in any required format, including CSV, XML, or JSON.
  8. password: It is an optional parameter that must be a string. It provides the password for the PDF file if required.
  9. async: It is an optional parameter that helps run the processes asynchronously. It returns the JobId to check the state of the background job.
  10. name: It is an optional parameter and must be a string. It provides the name of the generated output file after successful code execution.
  11. expiration: It is an optional parameter that sets the expiration time for the output link.
  12. profiles: It is an optional parameter and must be a string. This parameter helps in setting additional configurations and extra options.
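
The parameters above travel as fields of a JSON request body. The following is a minimal sketch of assembling such a body with only the Java standard library. The class name ParserRequestBody, the placeholder URL and template Id, and the choice of boolean values for inline and async are assumptions made for illustration; they are not taken from the PDF.co documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of any PDF.co SDK): builds a flat JSON body by hand.
class ParserRequestBody {

    // Escapes the characters that must be escaped inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Serializes a flat map: strings are quoted, booleans and numbers are written bare.
    static String toJson(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":");
            Object v = e.getValue();
            if (v instanceof Boolean || v instanceof Number) {
                sb.append(v);
            } else {
                sb.append('"').append(escape(String.valueOf(v))).append('"');
            }
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("url", "https://example.com/invoice.pdf"); // placeholder source URL
        fields.put("templateId", "1");                        // placeholder template Id
        fields.put("outputFormat", "CSV");
        fields.put("inline", false);      // assumed boolean: return a link to the output file
        fields.put("async", false);       // assumed boolean: run synchronously
        fields.put("name", "result.csv"); // name of the generated output file
        System.out.println(toJson(fields));
    }
}
```

The escape step matters mostly for the template parameter, because a YAML template spans several lines and contains quotes that cannot be placed in a JSON string unescaped. A JSON library such as Gson, which the samples below use, performs this escaping automatically.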

How to Use Java to Convert PDF to CSV

The following source code shows how to extract a borderless table from a PDF document and save it as CSV using the Document Parser API endpoint. The Java sample takes the sample PDF file and a custom template for parsing, extracts the data the template describes, and converts it to CSV. The resulting CSV file is returned to the user after data extraction.

Template YAML Code

Following is the template YAML code for PDF parsing:

templateName: Multipage Table Test
templateVersion: 4
templatePriority: 0
detectionRules:
  keywords:
  - Sample document with multi-page table
objects:
- name: total
  objectType: field
  fieldProperties:
    fieldType: macros
    expression: TOTAL{{Spaces}}({{Number}})
    regex: true
    dataType: decimal
- name: table1
  objectType: table
  tableProperties:
    start:
      expression: Item{{Spaces}}Description{{Spaces}}Price
      regex: true
    end:
      expression: TOTAL{{Spaces}}{{Number}}
      regex: true
    row:
      expression: '{{LineStart}}{{Spaces}}(?<itemNo>{{Digits}}){{Spaces}}(?<description>{{SentenceWithSingleSpaces}}){{Spaces}}(?<price>{{Number}}){{Spaces}}(?<qty>{{Digits}}){{Spaces}}(?<extPrice>{{Number}})'
      regex: true
    columns:
    - name: itemNo
      dataType: integer
    - name: description
      dataType: string
    - name: price
      dataType: decimal
    - name: qty
      dataType: integer
    - name: extPrice
      dataType: decimal
    multipage: true
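
To see how the row expression in the template above splits one table line into named columns, the sketch below rebuilds it as a plain Java regular expression. The macro expansions (for example {{Spaces}} as one or more whitespace characters) are assumptions chosen for illustration, since the template does not show how PDF.co defines them, and the class name RowExpressionDemo is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it only imitates the template's row expression locally.
class RowExpressionDemo {
    // Assumed stand-ins for the template macros; PDF.co's real definitions may differ.
    static final String SPACES = "\\s+";
    static final String DIGITS = "\\d+";
    static final String NUMBER = "\\d+(?:\\.\\d+)?";
    static final String SENTENCE = "\\S+(?: \\S+)*"; // words separated by single spaces

    // Same shape as the template's row expression, with "^" standing in for {{LineStart}}.
    static final Pattern ROW = Pattern.compile(
            "^" + SPACES
            + "(?<itemNo>" + DIGITS + ")" + SPACES
            + "(?<description>" + SENTENCE + ")" + SPACES
            + "(?<price>" + NUMBER + ")" + SPACES
            + "(?<qty>" + DIGITS + ")" + SPACES
            + "(?<extPrice>" + NUMBER + ")");

    // Returns the five captured columns, or null when the line is not a table row.
    static String[] parseRow(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group("itemNo"), m.group("description"), m.group("price"),
            m.group("qty"), m.group("extPrice")
        };
    }

    public static void main(String[] args) {
        String[] cols = parseRow("  1   Item 1   10.00   1   10.00");
        System.out.println(String.join(",", cols)); // prints 1,Item 1,10.00,1,10.00
    }
}
```

Each named group in the expression corresponds to one entry under columns in the template, which is how the parser knows which captured text belongs to which column and data type.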

Sample Code in Java

Following is the sample code to parse the PDF file:

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.gson.JsonPrimitive;
import okhttp3.*;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
 
public class Main {
    // Get your own API Key by registering at https://app.pdf.co
    final static String API_KEY = "**********************";
    public static void main(String[] args) throws IOException {
        // Source PDF file
        // You can also upload your own file into PDF.co and use it as url. Check "Upload File" samples for code snippets: https://github.com/bytescout/pdf-co-api-samples/tree/master/File%20Upload/    
        final String SourceFileUrl = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/document-parser/MultiPageTable.pdf";
        // PDF document password. Leave empty for unprotected documents.
        final String Password = "";
        // Destination CSV file name
        final Path DestinationFile = Paths.get(".\\result.csv");
        final String outputFormat = "CSV";
        // Template text. Use Document Parser (https://pdf.co/document-parser, https://app.pdf.co/document-parser)
        // to create templates.
        // Read template from file:
        String templateText = new String(Files.readAllBytes(Paths.get(".\\MultiPageTable-template1.yml")), StandardCharsets.UTF_8);
        // Create HTTP client instance
        OkHttpClient webClient = new OkHttpClient();
        // PARSE UPLOADED PDF DOCUMENT
        ParseDocument(webClient, API_KEY, DestinationFile, Password, SourceFileUrl, templateText, outputFormat);
    }

    public static void ParseDocument(OkHttpClient webClient, String apiKey, Path destinationFile,
                                     String password, String uploadedFileUrl, String templateText, String outputFormat) throws IOException {
        // Prepare POST request body in JSON format
        JsonObject jsonBody = new JsonObject();
        jsonBody.add("url", new JsonPrimitive(uploadedFileUrl));
        jsonBody.add("template", new JsonPrimitive(templateText));
        jsonBody.add("outputFormat", new JsonPrimitive(outputFormat));
        RequestBody body = RequestBody.create(MediaType.parse("application/json"), jsonBody.toString());

        // Prepare request to `Document Parser` API
        Request request = new Request.Builder()
                .url("https://api.pdf.co/v1/pdf/documentparser")
                .addHeader("x-api-key", apiKey) // (!) Set API Key
                .addHeader("Content-Type", "application/json")
                .post(body)
                .build();

        // Execute request
        Response response = webClient.newCall(request).execute();
        if (response.code() == 200) {
            // Parse JSON response
            JsonObject json = new JsonParser().parse(response.body().string()).getAsJsonObject();
            boolean error = json.get("error").getAsBoolean();
            if (!error) {
                // Get URL of generated CSV file
                String resultFileUrl = json.get("url").getAsString();
                // Download CSV file
                downloadFile(webClient, resultFileUrl, destinationFile.toFile());
                System.out.printf("Generated CSV file saved as \"%s\" file.", destinationFile.toString());
            } else {
                // Display service reported error
                System.out.println(json.get("message").getAsString());
            }
        } else {
            // Display request error
            System.out.println(response.code() + " " + response.message());
        }
    }

    public static void downloadFile(OkHttpClient webClient, String url, File destinationFile) throws IOException {
        // Prepare request
        Request request = new Request.Builder()
                .url(url)
                .build();
        // Execute request
        Response response = webClient.newCall(request).execute();
        byte[] fileBytes = response.body().bytes();
        // Save downloaded bytes to file
        OutputStream output = new FileOutputStream(destinationFile);
        output.write(fileBytes);
        output.flush();
        output.close();
        response.close();
    }
}

Sample PDF with Multi-Page Table for PDF to CSV Parsing

Below is the source PDF file for parsing:

Source File for PDF to CSV Parsing

Output CSV File

Below is the CSV output of the above code and the source file:

total,tableNames,tables
450.00,table1,"itemNo,description,price,qty,extPrice
1,Item 1,10.00,1,10.00
2,Item 2,10.00,1,10.00
3,Item 3,10.00,1,10.00
4,Item 4,10.00,1,10.00
5,Item 5,10.00,1,10.00
6,Item 6,10.00,1,10.00
7,Item 7,10.00,1,10.00
8,Item 8,10.00,1,10.00
9,Item 9,10.00,1,10.00
10,Item 10,10.00,1,10.00
11,Item 11,10.00,1,10.00
12,Item 12,10.00,1,10.00
13,Item 13,10.00,1,10.00
14,Item 14,10.00,1,10.00
15,Item 15,10.00,1,10.00
16,Item 16,10.00,1,10.00
17,Item 17,10.00,1,10.00
18,Item 18,10.00,1,10.00
19,Item 19,10.00,1,10.00
20,Item 20,10.00,1,10.00
21,Item 21,10.00,1,10.00
22,Item 22,10.00,1,10.00
23,Item 23,10.00,1,10.00
24,Item 24,10.00,1,10.00
25,Item 25,10.00,1,10.00
26,Item 26,10.00,1,10.00
27,Item 27,10.00,1,10.00
28,Item 28,10.00,1,10.00
29,Item 29,10.00,1,10.00
30,Item 30,10.00,1,10.00
31,Item 31,10.00,1,10.00
32,Item 32,10.00,1,10.00
33,Item 33,10.00,1,10.00
34,Item 34,10.00,1,10.00
35,Item 35,10.00,1,10.00
36,Item 36,10.00,1,10.00
37,Item 37,10.00,1,10.00
38,Item 38,10.00,1,10.00
39,Item 39,10.00,1,10.00
40,Item 40,10.00,1,10.00
41,Item 41,10.00,1,10.00
42,Item 42,10.00,1,10.00
43,Item 43,10.00,1,10.00
44,Item 44,10.00,1,10.00
45,Item 45,10.00,1,10.00
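
In the output above, the tables column holds an entire table as a single quoted field that spans many lines, so splitting the file line by line will not recover it. The following is a minimal sketch of a quote-aware reader, assuming the output follows the common CSV quoting conventions (quoted fields may contain commas and line breaks, and a doubled quote stands for a literal quote). The class name QuotedCsvReader and the shortened sample data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits CSV text into records while respecting quoted fields.
class QuotedCsvReader {

    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // commas and newlines stay inside the field
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);
            }
        }
        // Flush the last record when the text does not end with a newline.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the same shape as the output above.
        String csv = "total,tableNames,tables\n"
                + "20.00,table1,\"itemNo,description,price,qty,extPrice\n"
                + "1,Item 1,10.00,1,10.00\n"
                + "2,Item 2,10.00,1,10.00\"\n";
        List<List<String>> records = parse(csv);
        String table = records.get(1).get(2); // the nested table as one field
        for (String row : table.split("\n")) {
            System.out.println(row);
        }
    }
}
```

Once the nested table is isolated as one field, its rows can be split on line breaks and commas as ordinary CSV.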

Demo – Convert PDF Table to CSV

Below is a demonstration of how the above code works.

Demo - Convert PDF Table to CSV

Step-by-Step Guide To Extract PDF Form to CSV Using Document Parser

Following is the step-by-step guide to explain converting complete PDF form to CSV:

  1. The code imports the packages and libraries it needs to make the API request, parse the JSON response, and read and write files.
  2. It then declares and initializes API_KEY, which users can get by signing up for or logging into a PDF.co account. The API key is required to make requests to the API endpoints.
  3. Next, the code sets the values it works with: the source file URL, the destination file, the file password, the template text, and the outputFormat value. Users can substitute their own values here to customize the code. The sample uses a PDF.co sample file as the source.
  4. The template is a YAML file that contains all the information needed to extract data from the document, such as the keywords, objects, and expressions. Users can use the built-in templates provided by the API or customize them to get the required output.
  5. The outputFormat parameter specifies the format of the result. In this scenario, the format is set to CSV.
  6. The sample code then assembles the JSON payload and sends the API POST request. A successful request returns a JSON response containing the URL of the generated CSV file, which the code downloads and stores locally as result.csv, i.e., the destination file.
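
The request assembly described in step 6 can also be sketched without OkHttp, using the java.net.http client built into Java 11 and later. This is only an illustrative alternative to the samples in this tutorial: the class name ParserRequestSketch, the placeholder API key, and the placeholder body values are assumptions, while the endpoint URL and header names are the ones the sample code uses.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch: builds (but does not send) the Document Parser request.
class ParserRequestSketch {

    static HttpRequest build(String apiKey, String jsonBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://api.pdf.co/v1/pdf/documentparser"))
                .header("x-api-key", apiKey)               // API key header, as in the samples
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder payload; a real call needs a reachable file URL and a valid template.
        String body = "{\"url\":\"https://example.com/sample.pdf\","
                + "\"templateId\":\"1\",\"outputFormat\":\"CSV\"}";
        HttpRequest request = build("YOUR_API_KEY", body);
        System.out.println(request.method() + " " + request.uri());
        // To send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Keeping request construction in its own method makes it possible to inspect the URL and headers before any network call is made.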

Sample Code to Convert PDF Form to CSV

Below is the sample code to detect and extract a borderless table using a template Id:

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.gson.JsonPrimitive;
import okhttp3.*;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class Main {
    // Get your own API Key by registering at https://app.pdf.co
    final static String API_KEY = "*************";
    public static void main(String[] args) throws IOException {
        // Source PDF file
        // You can also upload your own file into PDF.co and use it as url. Check "Upload File" samples for code snippets: https://github.com/bytescout/pdf-co-api-samples/tree/master/File%20Upload/    
        final String SourceFileUrl = "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf";
        // PDF document password. Leave empty for unprotected documents.
        final String Password = "";
        // Destination CSV file name
        final Path DestinationFile = Paths.get(".\\result.csv");
        final String outputFormat = "CSV";
        // Id of the Document Parser template to use. Use Document Parser
        // (https://pdf.co/document-parser, https://app.pdf.co/document-parser) to create templates.
        final String templateId = "6181";
        // Create HTTP client instance
        OkHttpClient webClient = new OkHttpClient();
        // PARSE UPLOADED PDF DOCUMENT
        ParseDocument(webClient, API_KEY, DestinationFile, Password, SourceFileUrl, outputFormat, templateId);
    }
    public static void ParseDocument(OkHttpClient webClient, String apiKey, Path destinationFile,
                                     String password, String uploadedFileUrl, String outputFormat, String templateId) throws IOException {

        // Prepare POST request body in JSON format
        JsonObject jsonBody = new JsonObject();
        jsonBody.add("url", new JsonPrimitive(uploadedFileUrl));
        jsonBody.add("outputFormat", new JsonPrimitive(outputFormat));
        jsonBody.add("templateId", new JsonPrimitive(templateId));
        RequestBody body = RequestBody.create(MediaType.parse("application/json"), jsonBody.toString());

        // Prepare request to `Document Parser` API
        Request request = new Request.Builder()
                .url("https://api.pdf.co/v1/pdf/documentparser")
                .addHeader("x-api-key", apiKey) // (!) Set API Key
                .addHeader("Content-Type", "application/json")
                .post(body)
                .build();

        // Execute request
        Response response = webClient.newCall(request).execute();
        if (response.code() == 200) {
            // Parse JSON response
            JsonObject json = new JsonParser().parse(response.body().string()).getAsJsonObject();
            boolean error = json.get("error").getAsBoolean();
            if (!error) {
                // Get URL of generated CSV file
                String resultFileUrl = json.get("url").getAsString();
                // Download CSV file
                downloadFile(webClient, resultFileUrl, destinationFile.toFile());
                System.out.printf("Generated CSV file saved as \"%s\" file.", destinationFile.toString());
            } else {

                // Display service reported error
                System.out.println(json.get("message").getAsString());
            }
        } else {

            // Display request error
            System.out.println(response.code() + " " + response.message());
        }
    }
    public static void downloadFile(OkHttpClient webClient, String url, File destinationFile) throws IOException {
        // Prepare request
        Request request = new Request.Builder()
                .url(url)
                .build();

        // Execute request
        Response response = webClient.newCall(request).execute();
        byte[] fileBytes = response.body().bytes();
        // Save downloaded bytes to file
        OutputStream output = new FileOutputStream(destinationFile);
        output.write(fileBytes);
        output.flush();
        output.close();
        response.close();
    }
}

Source PDF File for CSV Extraction

Below is the screenshot of the source PDF file.

Source PDF File for CSV Extraction

Output File in CSV Format

Below is the CSV output of the above code:

Company Name,Invoice No,Date,Total Due,tableNames,tables
ACME Inc.,,,,Table Items,"Column1,Column2
ACME Inc.,
""1540 Long Street, Jacksonville, 32099"",ACME Phone 352-200-0371 Fax 904-787-9468,"

Demo

Below is the gif to demonstrate the working of the code: