RightStack
Docpler™

From classic search and previews to big-data analytics and AI data prep — extract text, tables, and metadata from every document, with structure preserved. Cloud-native and lightweight, exposed cleanly via REST API and JSON.

REST API · JSON Output · Cloud Native · Dublin Core
Document Text Extraction

Wildly different documents, one interface, clean data out.

Search, RAG, data pipelines — they all assume clean text. Real-world documents don't cooperate. HWP, secured PDFs, every flavor of office format, archive files — formats keep coming.

Docpler takes that variety on the input side and gives you clean data on the output side. Large files are extracted quickly and accurately, tables stay tables, and metadata follows the Dublin Core standard.

Built cloud-native, so it runs lightly inside containers — with health checks and uptime monitoring built in.

A developer-friendly extraction engine

Fast extraction, standard interfaces, and a lightweight architecture that fits container environments.

Fast & accurate

Large documents are processed without errors or slowdowns. Speed does not come at the cost of accuracy.

REST API

Call it from virtually any language or automation tool. Standard HTTP is enough — no special SDK required.
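As a rough illustration of the "standard HTTP is enough" point, the sketch below builds a multipart upload request using only the Python standard library. The host `docpler.example.com`, the `/extract` path, and the `file` form field are placeholders invented for this example, not documented Docpler values, so check the actual API reference before relying on them.

```python
import io
import uuid
import urllib.request

# Hypothetical endpoint: host and path are placeholders, not a documented Docpler URL.
EXTRACT_URL = "http://docpler.example.com/extract"


def build_extract_request(filename: str, data: bytes,
                          url: str = EXTRACT_URL) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one document."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    # The form field name "file" is an assumption made for this sketch.
    body.write(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode("utf-8")
    )
    body.write(b"Content-Type: application/octet-stream\r\n\r\n")
    body.write(data)
    body.write(f"\r\n--{boundary}--\r\n".encode())

    return urllib.request.Request(
        url,
        data=body.getvalue(),
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_extract_request("report.docx", b"...document bytes...")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=30)
```

The request is only constructed here, not sent. The same shape of request could be produced by curl or any HTTP client, which is the point of not needing an SDK.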

Table extraction

Tables come out structured, not flattened. Output lands in shapes that map directly to CSV, Excel, or your database.
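To make the CSV mapping concrete, here is a minimal sketch that writes one extracted table to CSV text. It assumes a table arrives as a list of rows, each row a list of cell strings. That shape is an assumption for illustration, since this page does not show Docpler's table schema, and the sample rows are made up.

```python
import csv
import io
from typing import List

# Assumed shape: a table is a list of rows, each row a list of cell strings.
# The real Docpler table schema may differ.
Table = List[List[str]]


def table_to_csv(table: Table) -> str:
    """Serialize one extracted table to CSV text, quoting cells where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        ["Site", "Inspected", "Note"],
        ["Plant A", "2014-10-04", "OK"],
        ["Plant B", "2014-11-05", "Re-check, panel 3"],
    ]
    print(table_to_csv(sample), end="")
```

Because the rows stay separate from the surrounding text, cells containing commas or line breaks are quoted by the `csv` module rather than silently splitting into extra columns.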

JSON output

Every extraction result is JSON. Integration with downstream systems is natural, and scripting against the output stays simple.

Cloud native

Optimized for container environments with health checks and uptime monitoring built in. Drops cleanly into Kubernetes.
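A health check of this kind is typically polled by an orchestrator or a small script. The sketch below shows one way to do that from Python. The URL `http://localhost:8080/health` is a placeholder, because the actual health-check path and port are not stated on this page.

```python
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical health URL: the real path and port are not given on this page.
HEALTH_URL = "http://localhost:8080/health"


def is_healthy(url: str = HEALTH_URL, timeout: float = 2.0,
               opener: Callable = urllib.request.urlopen) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


# Usage idea: exit 0 when healthy and 1 otherwise, so a script can back an
# exec-style container probe:
#     raise SystemExit(0 if is_healthy() else 1)
```

The `opener` parameter exists only so the function can be exercised without a running service. In Kubernetes an HTTP probe configured in the pod spec would normally do this job directly.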

Extensive format support

Office docs, iWork, PDF, EPUB, and major archive formats — all handled by one engine. Metadata follows the Dublin Core standard.

Every document format we support

Office docs, iWork, general documents, and archive formats — one engine handles them all.

Need a format we don't list? It can be added for your domain.

Word

.doc · .docx · .dot · .dotx

office

PowerPoint

.ppt · .pptx · .pps · .ppsx

office

Excel

.xls · .xlsx · .xlt · .xltx

office

HWP

Hangul Word Processor

office

HWP Office

office

Pages

iwork

Numbers

iwork

Keynote

iWork '13 excluded

iwork

PDF

RTF

EPUB

Compression

ZIP · GZ · XZ · BZIP2 · 7Z · TAR · CPIO · AR

archive

Standard metadata, plus structured text extraction

File metadata is reported using the standardized Dublin Core (DC) vocabulary, with keys such as dc:title and dcterms:created. Alongside it, the document's content is extracted in a structured form suited to the document type.

{
  "name": "한글문서파일형식3.0_HWPML_revision1.2.hwp",
  "basename": "한글문서파일형식3.0_HWPML_revision1.2",
  "ext": "hwp",
  "mimeType": "application/x-hwp-v5",
  "metadata": {
      "dc:title": "개요",
      "dc:creator": "heyzard",
      "dcterms:created": "2014-10-04T05:49:27Z",
      "dcterms:modified": "2014-11-05T08:22:30Z"
  },
  "content": {
      "text": "개요 저작권 (주)한글과컴퓨터(이하 ‘한컴’)는 문서 형식의 개방성과 표준화에 대하여 적극 찬성합니다. 한컴은 ᄒᆞᆫ글 97의 문서 형식을 무상으로 지원한 바 있으며, ᄒᆞᆫ글 2002~2010 문서의 XML 형식은 HwpML에 대해서도 문서 형식을 공개한 바 있습니다. 개방형 문서 표준화 및 코드 관련 위원회에도 적극적으로 참여하여 파일 형식의 표준화와 개방성을 위해 노력해 왔습니다. 또한, 한컴오피스에서 기록물 장기보존 표준 포맷인 PDF/A-1의 지원과 ISO 국제 문서 형식인 ODF와 OOXML 파일 형식의 불러오기와 저장하기를 적극적으로 지원하였습니다. 본 문서를 열람하고자 하는 자라면 누구에게나 제공되는 것이며, 본 문서를 열람하는 것 외에 복사, 배포, 게재 및 본 문서에 기재되어 있는 내용을 사용하고자 하는 자는 한글과컴퓨터의 본 저작권을 충분히 인식하고 동의하여야 합니다. 본 문서를 누구나 열람, 복사, 배포, 게재 및 ..."
  },
  "rendering-options": {
      "output": "json"
  }
}
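A result like the sample above can be consumed with any JSON parser. The sketch below pulls out a few fields using only the key names visible in that sample. The helper name `summarize` and the choice of fields are ours, and other results may carry more or fewer keys.

```python
import json
from datetime import datetime


def summarize(result_json: str) -> dict:
    """Pick out a few fields from one extraction result."""
    result = json.loads(result_json)
    meta = result.get("metadata", {})
    created = meta.get("dcterms:created")
    return {
        "file": result["name"],
        "mime_type": result.get("mimeType"),
        "title": meta.get("dc:title"),
        "creator": meta.get("dc:creator"),
        # Timestamps in the sample are ISO 8601 in UTC with a trailing "Z".
        "created": datetime.strptime(created, "%Y-%m-%dT%H:%M:%S%z") if created else None,
        "text_length": len(result.get("content", {}).get("text", "")),
    }


if __name__ == "__main__":
    # Shortened stand-in for a real result, using the same key names as the sample.
    sample = json.dumps({
        "name": "example.hwp",
        "mimeType": "application/x-hwp-v5",
        "metadata": {"dc:creator": "heyzard", "dcterms:created": "2014-10-04T05:49:27Z"},
        "content": {"text": "..."},
    })
    print(summarize(sample))
```

Missing keys fall back to `None` or an empty string, so the helper does not assume that every document carries every Dublin Core field.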

Case Studies

KESCO (Korea Electrical Safety Corporation)

Hundreds of thousands of HWP tables, converted to Excel for the database

Through RightStack's HWP SDK, hundreds of thousands of HWP tables were classified by type and converted into Excel data. The customer used this to complete the database migration of years of unstructured data, building the foundation for analytics and forecasting.

Big Data · Data Migration

Korea Tourism Organization

Search-engine integration to unify internal document search

A clean Elasticsearch integration delivered document-search capabilities without the cost of expensive proprietary licenses. Front-desk staff could search posts, the internal knowledge base, attachments, and other documents in one place, and respond to customers faster.

Search Engine · Elasticsearch

Korea Tourism Organization

Real-time PII detection on user-uploaded documents

Delivered as an SDK so that detection runs at the moment of upload. It was applied to the existing board features without disrupting the user experience, while still meeting the privacy-protection policy. The service operator gets a dashboard to track detection trends, and detected items are exposed for false-positive review.

Privacy Filtering · Real-time

Korea Copyright Commission

PII detection across documents in the public-disclosure system

For systems that need to analyze documents in real time or in batch to detect PII, the text extraction tool — distributed as an SDK — proved a clean fit.

Privacy Filtering