ppu-pdf

Easily extract text from digital PDF and Scanned PDF files with coordinate and font size included, and optionally group text by lines.

⚠️ Warning: Version 3 uses mupdf for better support in non-browser environment. While version 2 uses pdfjs and focuses only on extraction in digital pdf, you can still uses it here: https://www.npmjs.com/package/ppu-pdf-headless

Features

Text Extraction: Retrieve all text content from a PDF.
Coordinate Data: Get precise bounding box and dimension information for each text element.
Line Grouping: Merge individual text tokens into coherent lines.
Scanned PDF Detection: Determine if a PDF appears to be scanned or digitally generated.
Support for PDF Scan via Tesseract OCR is underway

Installation

Using Bun is recommended

Install the package via npm:

npm install ppu-pdf

Or using Yarn:

yarn add ppu-pdf

Bun:

bun add ppu-pdf

Usage

Below is an example of how to use the library with Bun:

import { PdfReader } from "ppu-pdf";

const pdfReader = new PdfReader({ verbose: false });
const file = Bun.file("./src/assets/opposite-expectation.pdf");

const buffer = await file.arrayBuffer();
const pdf = pdfReader.open(buffer);

// remember it's a map
const texts = await pdfReader.getTexts(pdf);
const page0texts = texts.get(0);
console.log("texts: ", page0texts);

pdfReader.destroy(pdf);

const isScanned = pdfReader.isScanned(texts);
console.log("is pdf scanned: ", isScanned);

`PdfReaderOptions`

Configuration options for PdfReader, allowing customization of PDF text extraction behavior.

Option	Type	Default Value	Description
`verbose`	`boolean`	`false`	Enables logging for debugging purposes.
`excludeFooter`	`boolean`	`true`	Excludes detected footer text from the extracted content.
`excludeHeader`	`boolean`	`true`	Excludes detected header text from the extracted content.
`raw`	`boolean`	`false`	If `true`, returns raw text without additional processing.
`headerFromHeightPercentage`	`number`	`0.02`	Defines the height percentage from the top used to identify header text.
`footerFromHeightPercentage`	`number`	`0.95`	Defines the height percentage from the bottom used to identify footer text.
`mergeCloseTextNeighbor`	`boolean`	`true`	Merges text elements that are close to each other into a single entity.
`simpleSortAlgorithm`	`boolean`	`false`	Uses a simplified sorting algorithm for text positioning.
`scale`	`number`	`1`	The pdf document scale

Usage Example:

const reader = new PdfReader({ verbose: true, excludeFooter: false });

These options allow fine-tuned control over how text is extracted and processed from PDFs.

Method Documentation

`PdfReader` Class

Constructor: `constructor(options?: Partial<PdfReaderOptions>)`

Creates an instance of PdfReader.

Parameters:
- options (optional): Partial options to override the defaults. Refer to the PdfReaderOptions interface for available options.

Method: `open(filename: string | ArrayBuffer): Promise<PDFDocumentProxy>`

Opens a PDF document.

Parameters:
- filename: The path to the PDF file or an ArrayBuffer containing the PDF data.
Returns: A promise that resolves with the PDFDocumentProxy.

Method: `getTexts(pdf: PDFDocumentProxy): Promise<PageTexts>`

Extracts the text content from the PDF document.

Parameters:
- pdf: The PDFDocumentProxy instance.
Returns: A promise that resolves with a Map of page numbers to their corresponding PdfTexts.

Sample return:

// Map (1) starting index from 0
{
  "0": {
    "words": [
      {
        "text": "Opposite Expectation: How to See the World as Two-Sided",
        "bbox": {
          "x0": 72,
          "y0": 84,
          "x1": 464,
          "y1": 99
        },
        "dimension": {
          "width": 392,
          "height": 15
        },
        "metadata": {
          "writing": "horizontal",
          "font": {
            "name": "AAAAAA+Arial-BoldItalicMT",
            "family": "sans-serif",
            "weight": "bold",
            "style": "italic",
            "size": 14
          },
          "pageNum": 0
        },
        "id": 0
      }
    ]
  }
}

Method: `getLinesFromTexts(pageTexts: PageTexts): PageLines`

Retrieves line information from the page texts.

Parameters:
- pageTexts: A Map of page numbers to their corresponding PdfTexts.
Returns: A Map of page numbers to an array of PdfLine objects.

Sample return:

// Map (1) starting index from 0
{
  "0": [
    {
      "bbox": {
        "x0": 72,
        "y0": 84,
        "x1": 464,
        "y1": 99
      },
      "averageFontSize": 14,
      "dimension": {
        "width": 392,
        "height": 15
      },
      "words": [
        {
          "text": "Opposite Expectation: How to See the World as Two-Sided",
          "bbox": {
            "x0": 72,
            "y0": 84,
            "x1": 464,
            "y1": 99
          },
          "dimension": {
            "width": 392,
            "height": 15
          },
          "metadata": {
            "writing": "horizontal",
            "font": {
              "name": "AAAAAA+Arial-BoldItalicMT",
              "family": "sans-serif",
              "weight": "bold",
              "style": "italic",
              "size": 14
            },
            "pageNum": 0
          },
          "id": 0
        }
      ],
      "text": "Opposite Expectation: How to See the World as Two-Sided"
    }
  ]
}

Method: `getCompactLinesFromTexts(pageTexts: PageTexts, algorithm: PdfCompactLineAlgorithm = "middleY"): CompactPageLines`

Retrieves a compact representation of line information from the page texts using the specified algorithm.

Parameters:
- pageTexts: A Map of page numbers to their corresponding PdfTexts.
- algorithm: An optional PdfCompactLineAlgorithm specifying the method for grouping lines. Defaults to middleY.
Returns: A Map of page numbers to an array of CompactPdfLine objects, where the line extraction method depends on the chosen algorithm.

Sample return:

// Map (1) starting index from 0
{
  "0": [
    {
      "bbox": {
        "x0": 72,
        "y0": 83.13183584999996,
        "x1": 461.4900053795799,
        "y1": 97.13183534999996
      },
      "words": [
        {
          "text": "Opposite Expectation: How to See the World as Two-Sided",
          "bbox": {
            "x0": 72,
            "y0": 83.13183584999996,
            "x1": 461.4900053795799,
            "y1": 97.13183534999996
          }
        }
      ],
      "text": "Opposite Expectation: How to See the World as Two-Sided"
    }
  ]
}

Method: `isScanned(pageTexts: PageTexts, options?: PdfScannedThreshold): boolean`

Determines whether the PDF appears to be a scanned document.

Parameters:
- pageTexts: A Map of page numbers to their corresponding PdfTexts.
- options (optional): Thresholds for scanned detection. Defaults to CONSTANT.WORDS_PER_PAGE_THRESHOLD and CONSTANT.TEXT_LENGTH_THRESHOLD.
Returns: true if the PDF is considered scanned; otherwise, false.

Contributing

Contributions are welcome! If you would like to contribute, please follow these steps:

Fork the Repository: Create your own fork of the project.
Create a Feature Branch: Use a descriptive branch name for your changes.
Implement Changes: Make your modifications, add tests, and ensure everything passes.
Submit a Pull Request: Open a pull request to discuss your changes and get feedback.

Running Tests

This project uses Bun for testing. To run the tests locally, execute:

bun test

Ensure that all tests pass before submitting your pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Support

If you encounter any issues or have suggestions, please open an issue in the repository.

Happy coding!

Scripts

Recommended development environment is in linux-based environment. Library template: https://github.com/aquapi/lib-template

All script sources and usage.

Build

Emit .js and .d.ts files to lib.

Publish

Move package.json, README.md to lib and publish the package.

Bench

Run files that ends with .bench.ts extension.

To run a specific file.

bun bench index # Run bench/index.bench.ts

To run the benchmark in node, add a --node parameter

bun bench --node

bun bench --node index # Run bench/index.bench.ts with node

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
bench		bench
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
bun.lock		bun.lock
package.json		package.json
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ppu-pdf

Features

Installation

Usage

`PdfReaderOptions`

Usage Example:

Method Documentation

`PdfReader` Class

Constructor: `constructor(options?: Partial<PdfReaderOptions>)`

Method: `open(filename: string | ArrayBuffer): Promise<PDFDocumentProxy>`

Method: `getTexts(pdf: PDFDocumentProxy): Promise<PageTexts>`

Method: `getLinesFromTexts(pageTexts: PageTexts): PageLines`

Method: `getCompactLinesFromTexts(pageTexts: PageTexts, algorithm: PdfCompactLineAlgorithm = "middleY"): CompactPageLines`

Method: `isScanned(pageTexts: PageTexts, options?: PdfScannedThreshold): boolean`

Contributing

Running Tests

License

Support

Scripts

Build

Publish

Bench

About

Releases

Packages

Languages

License

PT-Perkasa-Pilar-Utama/ppu-pdf

Folders and files

Latest commit

History

Repository files navigation

ppu-pdf

Features

Installation

Usage

PdfReaderOptions

Usage Example:

Method Documentation

PdfReader Class

Constructor: constructor(options?: Partial<PdfReaderOptions>)

Method: open(filename: string | ArrayBuffer): Promise<PDFDocumentProxy>

Method: getTexts(pdf: PDFDocumentProxy): Promise<PageTexts>

Method: getLinesFromTexts(pageTexts: PageTexts): PageLines

Method: getCompactLinesFromTexts(pageTexts: PageTexts, algorithm: PdfCompactLineAlgorithm = "middleY"): CompactPageLines

Method: isScanned(pageTexts: PageTexts, options?: PdfScannedThreshold): boolean

Contributing

Running Tests

License

Support

Scripts

Build

Publish

Bench

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

`PdfReaderOptions`

`PdfReader` Class

Constructor: `constructor(options?: Partial<PdfReaderOptions>)`

Method: `open(filename: string | ArrayBuffer): Promise<PDFDocumentProxy>`

Method: `getTexts(pdf: PDFDocumentProxy): Promise<PageTexts>`

Method: `getLinesFromTexts(pageTexts: PageTexts): PageLines`

Method: `getCompactLinesFromTexts(pageTexts: PageTexts, algorithm: PdfCompactLineAlgorithm = "middleY"): CompactPageLines`

Method: `isScanned(pageTexts: PageTexts, options?: PdfScannedThreshold): boolean`

Packages