PDF to Text API - Extract PDF Data to Text Format with PDF.co API Platform

Why use PDF to Text API?

Retains the Original Format and Layout of Text Objects

The PDF.co API platform retains the original layout and format of the source text objects, so text extracted from a PDF can come out better structured than with regular PDF to Text tools.

Damaged and Scanned Text Support

The PDF.co engine supports damaged and scanned text with the help of our built-in OCR (Optical Character Recognition).

Web API Supports Multiple Languages

You can call the PDF.co platform to extract text from PDF from programming languages such as PHP, JavaScript, .NET and ASP.NET, C#, Java, Visual Basic, and many others. Find source code samples in our API documentation.

Business Automation Platforms Integrations

If you are not a developer, you can also easily automate your PDF operations via popular business automation platforms: Zapier, Make, Airtable, Salesforce, Google Apps Script, and 300+ more.

PDF to Text API Sample & Demo

We’ll use this sample file to demonstrate text extraction from PDF.

We will use the code snippets below, which cover popular programming languages, to convert the sample PDF file into a plain text file.

The final plain text output looks like this:

Before we extract text from PDF using the code, let us first check the /v1/pdf/convert/to/text parameters and their values.

Endpoint

URL:

https://api.pdf.co/v1/pdf/convert/to/text

Method:

POST

Parameter	Description

url	required. Link to the source file.

lang	optional. English by default. Sets the OCR (image to text extraction) language to use when a scanned PDF is detected or when the input is a PNG or JPG image. Supported values include eng, spa, deu, fra, jpn, chi_sim, chi_tra, kor. You can also specify two languages to be used on the same page, for example: eng+deu, jpn+kor or other combinations.

inline	optional. Must be one of: true to return the data inline, or false to return a link to an output file (default).

unwrap	optional. Unwraps lines into a single line within table cells when lineGrouping is enabled. Must be one of: true or false.

pages	optional. Comma-separated list of page indices (or ranges) to process. IMPORTANT: the very first page starts with 0 (zero). To set a range use the dash -, for example: 0, 2-5, 7-.

rect	optional. Defines coordinates for extraction, e.g. 51.8, 114.8, 235.5, 204.0. Must be a string.

encrypt	optional. Enables encryption for the output file. Must be one of: true or false.

async	optional. Set to true to run processing asynchronously; the response then returns a jobId to use with job/check. Must be one of: true or false.

name	optional. Output file name.

profiles	optional. Sets a custom configuration. Must be a string. See profiles examples here.

lineGrouping	optional. Groups lines within table cells. Set to 1 to enable grouping. Must be a string.
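To make the pages, lang, and rect formats above more concrete, here is a minimal Python sketch. The helper names expand_pages and build_payload are hypothetical and not part of the PDF.co API. Clamping an open-ended range to the last page is an assumption made for illustration only, since the text above does not say how the service handles out-of-range indices.

```python
import json


def expand_pages(spec, page_count):
    """Expand a 0-based pages string such as "0, 2-5, 7-" into page indices.

    Illustration only: the service parses this string itself. Clamping to
    the last page is an assumption, not documented behaviour.
    """
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # An open-ended range like "7-" runs to the last page.
            end = int(end_text) if end_text.strip() else page_count - 1
        else:
            start = end = int(part)
        indices.extend(range(start, min(end, page_count - 1) + 1))
    return indices


def build_payload(url, pages=None, lang=None, rect=None, inline=False):
    """Assemble a JSON-serialisable request body (hypothetical helper)."""
    payload = {"url": url, "inline": inline}
    if pages is not None:
        payload["pages"] = pages  # sent as a string, e.g. "0, 2-5, 7-"
    if lang is not None:
        payload["lang"] = lang  # one language, or two joined with "+"
    if rect is not None:
        # rect must be a string, so numeric coordinates are joined here.
        payload["rect"] = ", ".join(str(value) for value in rect)
    return payload


payload = build_payload(
    "https://example.com/document.pdf",  # placeholder URL
    pages="0, 2-5, 7-",
    lang="eng+deu",
    rect=(51.8, 114.8, 235.5, 204.0),
)
print(json.dumps(payload, indent=2))
```

Optional parameters are left out of the body when they are not set, so the defaults described in the table apply.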

Now we are ready to write some code.

cURL Code Snippet for Text Extraction from PDF

curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY' \
--data-raw '{
"url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf",
"inline": false
}'
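For readers who prefer Python, the same request can be sketched with the standard library alone. This is a rough equivalent of the cURL call above, not the official PDF.co sample. The function names build_request and send are hypothetical, and the sketch assumes the service replies with a JSON body, whose fields are not described here.

```python
import json
import urllib.request

API_URL = "https://api.pdf.co/v1/pdf/convert/to/text"
SAMPLE_PDF = (
    "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/"
    "cloud-api/pdf-to-text/sample.pdf"
)


def build_request(api_key, source_url, inline=False):
    """Build the POST request without sending it (hypothetical helper)."""
    body = json.dumps({"url": source_url, "inline": inline}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,
        },
        method="POST",
    )


def send(request):
    """Send the request and parse the reply, assuming it is JSON."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("YOUR_API_KEY", SAMPLE_PDF)
print(request.get_method(), request.full_url)
# With a real API key in place of YOUR_API_KEY, the call would be:
# result = send(request)
```

Keeping request construction separate from sending makes it possible to inspect the headers and body before any network call is made.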

This sample code and other cURL source code samples are available here.

Now let’s see how to extract text from PDF in action.

The sample code for PDF to Text in JavaScript is available here.

The sample code for PDF to Text in PHP is available here.

The sample code for PDF to Text in Python is available here.

The sample code for PDF to Text in Java is available here.

The sample code for PDF to Text in C# is available here.

NOTE: Use the PDF.co Document Classifier to identify the source of a document. You can easily create and maintain classification rules with the desktop-based Classifier Testing Tool (see the details here).