HWP, PDF, Office docs, images, archive files — taken in through one interface and delivered as clean data. Search, previews, big-data analytics, AI data prep, PII detection — all run on the same pipeline.
HWP / PDF / Office | REST API · JSON | Cloud Native | Big Data Ready
Document Text Extraction
Wildly different documents, one stream of clean data.
Search, RAG, and data pipelines all assume clean text. Real documents don't cooperate: HWP, secured PDFs, every flavor of office format, and archive files keep arriving.
This solution takes that variety in through one interface and delivers clean data on the other side. Large files are extracted quickly and accurately, tables stay tables, and metadata follows the Dublin Core standard.
Our extraction engine (Docpler) sits on top of data engineering and operations know-how, so the pipeline carries through to search indexing, RAG, PII detection, and big-data ingestion — owned end to end.
From extraction to ingestion, on one pipeline
Diverse formats, standard interfaces, and an operations-aware integrated solution.
Extensive format support
HWP (Hangul), PDF, Word, Excel, PowerPoint, iWork, EPUB, RTF, plus archive formats like ZIP, GZ, 7Z, and TAR — all handled by one engine.
Tables & structure preserved
Tables come out structured — directly loadable into CSV, Excel, or your database without a flattening step.
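As a rough illustration of loading a structured table without a flattening step, the sketch below writes one extracted table to CSV. The JSON shape (a `rows` list of cell lists, header first) and the name `table_to_csv` are assumptions made for this example, not the documented output schema.

```python
import csv
import io


def table_to_csv(table: dict) -> str:
    """Serialize one extracted table to CSV text.

    Assumes a hypothetical shape: {"rows": [[cell, ...], ...]},
    with the header as the first row.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table["rows"]:
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical extraction result for a two-column table.
sample_table = {"rows": [["item", "amount"], ["paper, A4", "1200"], ["toner", "85"]]}
csv_text = table_to_csv(sample_table)
```

Because cells arrive as discrete values, a cell that contains a comma is quoted by the CSV writer rather than splitting the row.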
REST API · JSON output
Standard HTTP and JSON, callable from virtually any language or automation tool. Integration stays clean with no special SDK required.
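A minimal sketch of what calling such an API could look like using only the Python standard library. The endpoint URL, the `filename` query parameter, and the response fields (`text`, `tables`) are all hypothetical placeholders; the real request and response contract should be taken from the product's API reference.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint; substitute the URL of your own deployment.
EXTRACT_URL = "https://extractor.example.com/v1/extract"


def build_extract_request(filename: str, payload: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one document."""
    query = urllib.parse.urlencode({"filename": filename})
    return urllib.request.Request(
        f"{EXTRACT_URL}?{query}",
        data=payload,
        headers={
            "Content-Type": "application/octet-stream",
            "Accept": "application/json",
        },
        method="POST",
    )


def parse_extract_response(body: bytes) -> dict:
    """Decode a JSON response body into a dict."""
    return json.loads(body.decode("utf-8"))


request = build_extract_request("report.hwp", b"...document bytes...")
# Sending would be: urllib.request.urlopen(request, timeout=30).read()
result = parse_extract_response(b'{"text": "hello", "tables": []}')
```

The same request can be reproduced from any HTTP client, which is the practical meaning of "no special SDK required".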
Dublin Core metadata
File metadata is extracted in the standardized DC (Dublin Core) format, plugging cleanly into asset-management and classification systems.
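To show what a Dublin Core record might look like downstream, the sketch below reduces an arbitrary metadata dict to the 15 elements of the Dublin Core Metadata Element Set. The sample field names and the `dc:` key prefix are assumptions for illustration; the actual JSON keys returned by the service may differ.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_dublin_core(metadata: dict) -> dict:
    """Keep only Dublin Core elements, keyed with a 'dc:' prefix.

    Empty values and keys outside the element set are dropped.
    """
    return {
        f"dc:{name}": metadata[name]
        for name in DC_ELEMENTS
        if metadata.get(name)
    }


# Hypothetical metadata record; "pages" is not a Dublin Core element.
record = {
    "title": "Q3 report",
    "creator": "Finance",
    "format": "application/pdf",
    "pages": 12,
}
dc_record = to_dublin_core(record)
```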
Cloud-native operations
Optimized for container environments with health checks and uptime monitoring built in. Drops cleanly into Kubernetes.
Search · analytics · privacy integration
Output flows naturally to Elasticsearch indexing, RAG preprocessing, or PII detection and masking, so pipelines connect end to end.
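As one possible hand-off to search indexing, the sketch below turns extraction results into a newline-delimited body for the Elasticsearch `_bulk` API (an action line followed by a source line per document). The input fields (`id`, `filename`, `text`) and the index name are hypothetical; only the bulk body layout comes from Elasticsearch itself.

```python
import json


def to_bulk_ndjson(index: str, docs: list[dict]) -> str:
    """Build a request body for the Elasticsearch _bulk API.

    Each document becomes an action line followed by a source line,
    and the body ends with a newline as the API requires.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(
            json.dumps(
                {"filename": doc["filename"], "text": doc["text"]},
                ensure_ascii=False,  # keep Korean text readable in the payload
            )
        )
    return "\n".join(lines) + "\n"


# Hypothetical extraction results.
extracted = [
    {"id": "1", "filename": "a.hwp", "text": "first document"},
    {"id": "2", "filename": "b.pdf", "text": "second document"},
]
bulk_body = to_bulk_ndjson("documents", extracted)
```

The resulting string would be sent with `Content-Type: application/x-ndjson` to the cluster's `_bulk` endpoint.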
What's inside this solution
Our own extraction engine combined with data engineering know-how.
Hundreds of thousands of HWP tables, converted to Excel for the database
Through RightStack's HWP SDK, hundreds of thousands of HWP tables were classified by type and converted into Excel data. The customer used this to complete the database migration of years of unstructured data, building the foundation for analytics and forecasting.
Big Data | Data Migration
Search-engine integration to unify internal document search
A clean Elasticsearch integration delivered document search without the cost of proprietary licenses. Front-desk staff could search posts, the internal knowledge base, attachments, and other documents in one place, and respond to customers faster.
Search Engine | Elasticsearch
Real-time PII detection on user-uploaded documents
Delivered as an SDK, detection runs at the moment of upload. It was applied to the existing board features without disrupting the user experience while still meeting privacy-protection policy. The service operator gets a dashboard to track detection trends, with detected items exposed for false-positive review.
Privacy Filtering | Realtime
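The upload-time flow can be pictured with a deliberately naive stand-in: run patterns over the extracted text, record each hit with its position so it can be reviewed for false positives, and mask the spans. The two regular expressions below (an email pattern and the 6-digit, hyphen, 7-digit layout of a Korean resident registration number) are simplified assumptions for illustration, not the SDK's detection logic.

```python
import re

# Deliberately naive patterns, for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "rrn": re.compile(r"\b\d{6}-\d{7}\b"),  # resident registration number layout
}


def detect_pii(text: str) -> list[dict]:
    """Return each match with its type and character span, in text order."""
    findings = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({"type": kind, "start": match.start(), "end": match.end()})
    return sorted(findings, key=lambda f: f["start"])


def mask_pii(text: str) -> str:
    """Replace every detected span with asterisks of the same length."""
    chars = list(text)
    for finding in detect_pii(text):
        for i in range(finding["start"], finding["end"]):
            chars[i] = "*"
    return "".join(chars)


sample = "Contact kim@example.com, ID 900101-1234567."
findings = detect_pii(sample)
masked = mask_pii(sample)
```

Keeping the type and span of each finding, rather than only the masked text, is what makes a false-positive review queue possible.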
PII detection across documents in the public-disclosure system
For systems that need to analyze documents in real time or in batch to detect PII, the text extraction tool — distributed as an SDK — proved a clean fit.