Docpler™

From classic search and previews to big-data analytics and AI data prep — extract text, tables, and metadata from every document, with structure preserved. Cloud-native and lightweight, exposed cleanly via REST API and JSON.

REST API · JSON Output · Cloud Native · Dublin Core

Document Text Extraction

Wildly different documents, one interface, clean data out.

Search, RAG, data pipelines — they all assume clean text. Real-world documents don't cooperate. HWP, secured PDFs, every flavor of office format, archive files — formats keep coming.

Docpler takes that variety on the input side and gives you clean data on the output side. Large files are extracted quickly and accurately, tables stay tables, and metadata follows the Dublin Core standard.

Built cloud-native, so it runs lightly inside containers — with health checks and uptime monitoring built in.

A developer-friendly extraction engine

Fast extraction, standard interfaces, and a lightweight architecture that fits container environments.

Fast & accurate

Large documents are processed without errors and without slowing down. Speed does not come at the cost of accuracy.

REST API

Call it from virtually any language or automation tool. Standard HTTP is enough — no special SDK required.
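
As a rough illustration of the "standard HTTP is enough" point, here is a minimal Python sketch that uploads a file using only the standard library. The `/extract` path, the `file` form field, and the base URL are assumptions made for this example; they are not documented Docpler names, so check the actual API reference before relying on them.

```python
import json
import urllib.request
import uuid


def build_extract_request(base_url: str, filename: str, payload: bytes) -> urllib.request.Request:
    """Build a multipart/form-data POST request for a document upload.

    The "/extract" path and the "file" field name are hypothetical.
    """
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        payload,
        f"--{boundary}--".encode(),
        b"",
    ])
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/extract",  # hypothetical endpoint path
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Accept": "application/json",
        },
    )


def extract(base_url: str, filename: str, payload: bytes, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_extract_request(base_url, filename, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the upload logic easy to check without a running server.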

Table extraction

Tables come out structured, not flattened. Output lands in shapes that map directly to CSV, Excel, or your database.

JSON output

Every extraction result is JSON. Integration with downstream systems is natural, and scripting against the output stays simple.

Cloud native

Optimized for container environments with health checks and uptime monitoring built in. Drops cleanly into Kubernetes.

Extensive format support

Office docs, iWork, PDF, EPUB, and major archive formats — all handled by one engine. Metadata follows the Dublin Core standard.

Every document format we support

Office docs, iWork, general documents, and archive formats — one engine handles them all.

Need a format we don't list? It can be added for your domain.

Word

.doc · .docx · .dot · .dotx

office

PowerPoint

.ppt · .pptx · .pps · .ppsx

office

Excel

.xls · .xlsx · .xlt · .xltx

office

HWP

Hangul Word Processor

office

HWP Office

office

Pages

iwork

Numbers

iwork

Keynote

iWork '13 excluded

iwork

PDF

RTF

EPUB

Compression

ZIP · GZ · XZ · BZIP2 · 7Z · TAR · CPIO · AR

Standard metadata, plus structured text extraction

File metadata is extracted using the standardized DC (Dublin Core) vocabulary. On top of that, content is extracted in a structured form that matches the document's characteristics.

{
  "name": "한글문서파일형식3.0_HWPML_revision1.2.hwp",
  "basename": "한글문서파일형식3.0_HWPML_revision1.2",
  "ext": "hwp",
  "mimeType": "application/x-hwp-v5",
  "metadata": {
      "dc:title": "개요",
      "dc:creator": "heyzard",
      "dcterms:created": "2014-10-04T05:49:27Z",
      "dcterms:modified": "2014-11-05T08:22:30Z"
  },
  "content": {
      "text": "개요 저작권 (주)한글과컴퓨터(이하 ‘한컴’)는 문서 형식의 개방성과 표준화에 대하여 적극 찬성합니다. 한컴은 ᄒᆞᆫ글 97의 문서 형식을 무상으로 지원한 바 있으며, ᄒᆞᆫ글 2002~2010 문서의 XML 형식은 HwpML에 대해서도 문서 형식을 공개한 바 있습니다. 개방형 문서 표준화 및 코드 관련 위원회에도 적극적으로 참여하여 파일 형식의 표준화와 개방성을 위해 노력해 왔습니다. 또한, 한컴오피스에서 기록물 장기보존 표준 포맷인 PDF/A-1의 지원과 ISO 국제 문서 형식인 ODF와 OOXML 파일 형식의 불러오기와 저장하기를 적극적으로 지원하였습니다. 본 문서를 열람하고자 하는 자라면 누구에게나 제공되는 것이며, 본 문서를 열람하는 것 외에 복사, 배포, 게재 및 본 문서에 기재되어 있는 내용을 사용하고자 하는 자는 한글과컴퓨터의 본 저작권을 충분히 인식하고 동의하여야 합니다. 본 문서를 누구나 열람, 복사, 배포, 게재 및 ..."
  },
  "rendering-options": {
      "output": "json"
  }
}
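
A result shaped like the sample above can be filtered down to its Dublin Core fields with a few lines of Python. This is a sketch that assumes only the key prefixes (`dc:`, `dcterms:`) and the timestamp format visible in the samples; the function names are illustrative.

```python
from datetime import datetime, timezone

# Prefixes of the Dublin Core keys that appear in the sample outputs.
DC_PREFIXES = ("dc:", "dcterms:")


def dublin_core_fields(result: dict) -> dict:
    """Return only the Dublin Core entries of an extraction result's metadata."""
    metadata = result.get("metadata", {})
    return {key: value for key, value in metadata.items() if key.startswith(DC_PREFIXES)}


def parse_timestamp(value: str) -> datetime:
    """Parse timestamps such as "2014-10-04T05:49:27Z" into aware UTC datetimes."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
```

Format-specific keys such as `application` or `cp:revision` are left out, so downstream code sees a uniform set of fields across formats.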

{
  "name": "sample.ppt",
  "basename": "sample",
  "ext": "ppt",
  "mimeType": "application/vnd.ms-powerpoint",
  "metadata": {
      "application": "Microsoft Macintosh PowerPoint",
      "pageCount": 3,
      "dc:title": "Apache Tika Overview",
      "dc:creator": "John Doe",
      "dcterms:created": "2023-09-28T03:40:39Z",
      "dcterms:modified": "2023-10-02T01:42:38Z",
      "cp:revision": "13"
  },
  "content": [
      {
          "master-content": "Apache Tika Sample Document",
          "content": "Apache Tika Overview Tika",
          "notes": "Apache Tika 에 대해서 들어보신 적 있나요? * Footer (Tika) Header (Tika)"
      },
      {
          "master-content": "",
          "content": "What is Apache Tika Java based Text Extraction Tool Apache POI based TikaServer",
          "notes": "Apache Tika 에 대한 설명 * Footer (Tika) Header (Tika)"
      },
      {
          "master-content": "",
          "content": "Tika Features Feature Summary 다양한 파일 지원 Powerpoint Excel Word Media 파일 지원 Video Audio",
          "notes": "Apache Tika 에 대한 특징. 아래와 같은 것이 잇음. 여러  파일  지원 음성 등도 지원 * Footer (Tika) Header (Tika)"
      }
  ],
  "rendering-options": {
      "structured": false,
      "output": "json",
      "ignore-headers-and-footers": false
  }
}

{
  "name": "sample.doc",
  "basename": "sample",
  "ext": "doc",
  "mimeType": "application/msword",
  "metadata": {
      "application": "Microsoft Office Word",
      "pageCount": 2,
      "dc:creator": "영일 박",
      "dcterms:created": "2023-10-02T05:00:00Z",
      "dcterms:modified": "2023-10-15T11:43:00Z",
      "cp:revision": "3"
  },
  "content": "Apache Tika Author: ..   Overview  What is Apache Tika  Apache Tika is text extraction tool.   Feature  아래와 같은 기능들이 있습니다.     기능 명  설명  비고   Text Extraction  · Microsoft Office Files  · OpenOffice Files  한컴은 아래한글만 지원합니다.   Rendering Format  · Text  · HTML  · XML  출력 방식  Supported Formats  다음과 같은 포맷을 지원합니다. · Docx  · Doc  · Ppt  · pptx",
  "rendering-options": {
      "output": "json",
      "ignore-headers-and-footers": true
  }
}

{
  "name": "sample.xls",
  "basename": "sample",
  "ext": "xls",
  "mimeType": "application/vnd.ms-excel",
  "metadata": {
      "application": "Microsoft Macintosh Excel",
      "dc:creator": "John Doe",
      "dcterms:created": "2023-09-26T08:17:49Z",
      "dcterms:modified": "2023-09-27T23:21:37Z",
      "spreadsheet:sheetCount": 2,
      "spreadsheet:totalRowCount": 4
  },
  "content": [
      {
          "sheet-name": "train",
          "tables": [
              {
                  "data": [
                      [
                          "PassengerId",
                          "Survived",
                          "Pclass",
                          "Name",
                          "Gender",
                          "Age",
                          "SibSp",
                          "Parch",
                          "Ticket",
                          "Fare",
                          "Cabin",
                          "Embarked"
                      ],
                      [
                          "1.0",
                          "0.0",
                          "3.0",
                          "Braund, Mr. Owen Harris",
                          "male",
                          "22.0",
                          "1.0",
                          "0.0",
                          "A/5 21171",
                          "7.25",
                          null,
                          "S"
                      ]
                  ]
              }
          ]
      },
      {
          "sheet-name": "ground-truth",
          "tables": [
              {
                  "data": [
                      [
                          "PassengerId",
                          "Survived"
                      ],
                      [
                          "893.0",
                          "1.0"
                      ]
                  ]
              }
          ]
      }
  ],
  "rendering-options": {
      "structured": true,
      "output": "json"
  }
}
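
The structured spreadsheet output above maps to CSV quite directly. The following Python sketch assumes the `content` / `sheet-name` / `tables` / `data` layout shown in the sample; the function name and the `sheet#index` key scheme are choices made for this example.

```python
import csv
import io


def sheet_tables_to_csv(result: dict) -> dict:
    """Convert every table in a structured spreadsheet result to CSV text.

    Returns a dict keyed by "<sheet-name>#<table index>".
    """
    csv_by_table = {}
    for sheet in result["content"]:
        for index, table in enumerate(sheet["tables"]):
            buffer = io.StringIO()
            writer = csv.writer(buffer, lineterminator="\n")
            for row in table["data"]:
                # JSON null cells become empty CSV fields.
                writer.writerow(["" if cell is None else cell for cell in row])
            csv_by_table[f"{sheet['sheet-name']}#{index}"] = buffer.getvalue()
    return csv_by_table
```

Values that contain commas, such as passenger names, are quoted by the `csv` module, so the rows survive a round trip.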

{
  "name": "genai-transparency-pdf.pdf",
  "basename": "genai-transparency-pdf",
  "ext": "pdf",
  "mimeType": "application/pdf",
  "createdAt": "2023-09-27T17:13:25Z",
  "lastModifiedAt": "2023-09-27T17:13:55Z",
  "metadata": {
      "application": "Adobe InDesign 18.5 (Macintosh)",
      "dc:language": "en-US",
      "pdf:version": "1.7",
      "pdf:encrypted": false,
      "pdf:pageCount": 20,
      "pdf:hasXfa": false,
      "pdf:hasXmp": true,
      "pdf:hasCollection": false,
      "pdf:trapped": false,
      "pdf:hasMarkedContent": false,
      "pdf:containsDamagedFont": false,
      "pdf:containsNonEmbeddedFont": false,
      "pdf:3dAnnotationsCount": 0,
      "pdf:allowsPrinting": true,
      "pdf:allowsModifyingContents": true,
      "pdf:allowsFormFieldEntry": true,
      "pdf:allowsContentAccessibility": true,
      "pdf:allowsExtractingContent": true,
      "pdf:allowsDocumentAssembly": true,
      "pdf:allowsModifyingAnnotations": true,
      "pdf:allowsDegradedPrinting": true
  },
  "content": {
      "text": "Building Generative AI Responsibly Contents Introduction 1 An overview of generative AI 2 What are generative AI features? 2 What are best practices for developing generative AI experiences? 2 How Meta is building generative AI features responsibly 5 Step 1: Develop the generative AI foundation model 5 Details on our generative AI systems 5 Data used to train our generative AI models 5 Step 2: Determine use case 6 Features that ..."
  },
  "rendering-options": {
      "structured": true,
      "output": "json"
  }
}

{
  "name": "Metamorphosis-jackson.epub",
  "basename": "Metamorphosis-jackson",
  "ext": "epub",
  "mimeType": "application/epub+zip",
  "metadata": {
      "dc:title": "Metamorphosis",
      "dc:creator": "Franz Kafka",
      "dc:description": "The Metamorphosis is a novella by Franz Kafka, first published in 1915. It has been cited as one of the seminal works of fiction of the 20th century and is studied in colleges and universities across the Western world. The story begins with a traveling salesman, Gregor Samsa, waking to find himself transformed (metamorphosed) into a large, monstrous insect-like creature. The cause of Samsa's transformation is never revealed, and Kafka never did give an explanation. The rest of Kafka's novella deals with Gregor's attempts to adjust to his new condition as he deals with being burdensome to his parents and sister, who are repulsed by the horrible, verminous creature Gregor has become. (Wikipedia)<div style=\"text-align: center;\"><p>Download: <br /><strong><a href=\"\">PDF</a> | <a href=\"\">EPUB</a> | <a href=\"\">MOBI</a> | <a href=\"\">3-file Zip</a></strong></p></div>",
      "dc:language": "en",
      "dc:identifier": "http://metamorphosiskafka.pressbooks.com",
      "dc:publisher": "PressBooks.com",
      "epub:version": "2.0",
      "epub:rendition:layout": "reflowable"
  },
  "content": {
      "text": "Metamorphosis Metamorphosis Franz Kafka PressBooks.com The PressBooks version of The Metamorphosis, by Franz Kafka. This book was produced using PressBooks.com, a simple book production tool that creates PDF, EPUB and MOBI. For more information, visit: pressbooks.com. This book is adapted from the Project Gutenberg version. It is in the public domain, and is free for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  You may copy it, give it away or re-use it as you like. This book was produced using PressBooks.com. Contents CHAPTER I CHAPTER II CHAPTER III CHAPTER I One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. ... dreams and good intentions, as soon as they reached their destination Grete was the first to get up and stretch out her young body."
  },
  "rendering-options": {
      "structured": true,
      "output": "json"
  }
}
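
The samples above show `content` in several shapes: a plain string, an object with a `text` field, a list of slides, and a list of sheets. A consumer that only needs plain text (for search indexing, for instance) might normalize them as in the sketch below. It covers only the shapes that appear in these samples and is not a statement of the full output schema.

```python
def content_to_text(content) -> str:
    """Flatten the content shapes seen in the samples into one plain-text string."""
    if isinstance(content, str):
        # e.g. the Word sample: content is already a string
        return content
    if isinstance(content, dict):
        # e.g. the HWP, PDF, and EPUB samples: {"text": "..."}
        return content.get("text", "")

    parts = []
    for item in content:
        if "tables" in item:
            # Spreadsheet sheet: join the non-null cells of each row.
            for table in item["tables"]:
                for row in table["data"]:
                    parts.append(" ".join(cell for cell in row if cell is not None))
        else:
            # Presentation slide: master content, body, then speaker notes.
            for key in ("master-content", "content", "notes"):
                if item.get(key):
                    parts.append(item[key])
    return "\n".join(parts)
```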

Case Studies

Hundreds of thousands of HWP tables, converted to Excel for the database

Through RightStack's HWP SDK, hundreds of thousands of HWP tables were classified by type and converted into Excel data. The customer used this to complete the database migration of years of unstructured data, building the foundation for analytics and forecasting.

Big Data · Data Migration

Search-engine integration to unify internal document search

A clean Elasticsearch integration delivered document search without the cost of proprietary licenses. Front-desk staff could search posts, the internal knowledge base, attachments, and other documents in one place, and respond to customers faster.

Search Engine · Elasticsearch
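
To give a sense of how extraction results might feed a search index, the Python sketch below turns results into the newline-delimited body expected by the Elasticsearch bulk API. The index name, the document field names, and the use of the file name as the document ID are assumptions for illustration, not details of the deployment described in this case study. Only results whose `content` is a string or a `{"text": ...}` object are handled.

```python
import json


def to_bulk_body(results: list, index_name: str = "documents") -> str:
    """Build an Elasticsearch bulk request body (NDJSON) from extraction results."""
    lines = []
    for result in results:
        content = result.get("content", "")
        body = content if isinstance(content, str) else content.get("text", "")
        metadata = result.get("metadata", {})

        # Action line, followed by the document source line.
        action = {"index": {"_index": index_name, "_id": result["name"]}}
        source = {
            "filename": result["name"],
            "mime_type": result.get("mimeType"),
            "title": metadata.get("dc:title"),
            "author": metadata.get("dc:creator"),
            "body": body,
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

The returned string would be sent to the `_bulk` endpoint with the `application/x-ndjson` content type.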

Real-time PII detection on user-uploaded documents

Delivered as an SDK, detection runs at the moment of upload. It was added to the existing board features without disrupting the user experience, while still meeting the privacy-protection policy. The service operator gets a dashboard to track detection trends, and detected items are exposed for false-positive review.

Privacy Filtering · Realtime
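
The general idea of scanning extracted text at upload time can be sketched with simple pattern matching. The Python example below is a simplified stand-in and not the detection logic used in this project: the two regular expressions (an email address and a 6-digit, hyphen, 7-digit number in the style of a Korean resident registration number) are assumptions, and real detection needs validation and review steps to limit false positives.

```python
import re

# Illustrative patterns only; production detectors need stricter validation.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "kr_resident_number": re.compile(r"\b\d{6}-\d{7}\b"),
}


def find_pii(text: str) -> list:
    """Return candidate PII matches in extracted text, ordered by position."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({"type": label, "value": match.group(), "start": match.start()})
    return sorted(findings, key=lambda finding: finding["start"])
```

Returning the match position alongside the value lets a reviewer jump to the surrounding text when checking for false positives.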

PII detection across documents in the public-disclosure system

For systems that need to analyze documents in real time or in batch to detect PII, the text extraction tool — distributed as an SDK — proved a clean fit.

Privacy Filtering