Korean encoding mojibake explained

Why Korean text breaks, and how to infer the encoding mismatch from the mojibake alone.

1. Core concept: encoding mismatch

Korean mojibake almost always comes from one cause. Program A writes bytes using encoding X, and program B reads the same bytes believing they are encoding Y. A Hangul syllable takes 2 bytes in EUC-KR/CP949 and 3 bytes in UTF-8, so when those bytes are misread as a 1-byte encoding (Latin-1, Windows-1252), a single syllable appears as 2–3 stray Western glyphs.
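
The mismatch can be reproduced in a few lines. This is a minimal sketch for illustration only, not this tool's implementation; the helper name misread is made up for the example.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text the way program A would, then decode it the way program B would."""
    raw = text.encode(written_as)   # program A writes bytes using encoding X
    return raw.decode(read_as)      # program B reads them believing they are encoding Y


broken = misread("안녕", written_as="euc-kr", read_as="latin-1")
print(broken)        # ¾È³ç  (4 Western glyphs for 2 syllables)

# Reversing the two steps recovers the original, because Latin-1 maps
# every byte value to exactly one character and back.
print(broken.encode("latin-1").decode("euc-kr"))   # 안녕
```

The round trip only works because no byte was dropped or replaced along the way; once a step loses bytes, the reverse path is gone too.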

2. The relevant encodings

  • EUC-KR — Standardised 1987. 2,350 precomposed Hangul. Default on legacy Korean systems and early Korean web pages.
  • CP949 — Microsoft's extension of EUC-KR, adding the remaining 8,822 Hangul syllables (11,172 in total). Default on Korean Windows and inside old HWP / Excel files. In practice a superset of EUC-KR; bytes in the extension area fail under a strict EUC-KR decoder.
  • UTF-8 — Variable-length Unicode encoding. Default on the modern web and on macOS/Linux, and increasingly common in Windows 10+ applications (the legacy system code page on Korean Windows is still CP949). Hangul takes 3 bytes per syllable.
  • ISO-8859-1 / Windows-1252 — 1-byte Western encodings. They cannot store Korean, but they map nearly every byte value to some glyph, so they carry the raw bytes through and are a frequent link in the mojibake chain.
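
The byte counts and the superset relationship above can be checked with Python's built-in codecs. This is a sketch under the assumption that Python's euc_kr decoder is strict about CP949 extension bytes (CPython's is); the helper decodes_as is a name invented for this example.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes decode cleanly under the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


syllable = "안"                          # one of the 2,350 syllables in EUC-KR
print(len(syllable.encode("euc-kr")))   # 2
print(len(syllable.encode("cp949")))    # 2
print(len(syllable.encode("utf-8")))    # 3

# "똠" is outside the 2,350 precomposed syllables, so CP949 stores it
# in the extension area.
extension_bytes = "똠".encode("cp949")
print(decodes_as(extension_bytes, "cp949"))    # True
print(decodes_as(extension_bytes, "euc-kr"))   # False

# Plain EUC-KR bytes are also valid CP949.
print("안녕".encode("euc-kr") == "안녕".encode("cp949"))   # True
```

When the source is labelled EUC-KR but came from Korean Windows, decoding as CP949 is usually the safer choice, since it accepts everything EUC-KR does.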

3. Common mojibake patterns

  • ¾È³çÇϼ¼¿ä — "안녕하세요" written as EUC-KR then decoded as Latin-1. Most common in ZIP filenames.
  • ì•ˆë…•í•˜ì„¸ìš” — the same text written as UTF-8, then decoded as Latin-1 (shown here as Windows-1252 renders it). Frequent in DB dumps and email headers.
  • %EC%95%88%EB%85%95%ED%95%98%EC%84%B8%EC%9A%94 — URL percent-encoding; appears in server logs and Referer headers.
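  • ㅇㅏㄴㄴㅕㅇ — decomposed Jamo (NFD). Seen when iOS Safari uploads filenames to servers. This is a normalisation difference rather than an encoding error, and NFC normalisation recomposes it.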
  • ����� — U+FFFD replacement characters. The original bytes are already lost and cannot be recovered.
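
Three of the patterns above are not byte-level mismatches at all, and each has its own check. The sketch below uses only the Python standard library; it is an illustration of the idea, not how this tool classifies input.

```python
import unicodedata
from urllib.parse import unquote

# URL percent-encoding: the bytes are intact, only escaped.
print(unquote("%EC%95%88%EB%85%95", encoding="utf-8"))   # 안녕

# Decomposed Jamo: NFD splits each syllable into conjoining Jamo
# (U+1100 block), and NFC puts them back together.
decomposed = unicodedata.normalize("NFD", "안녕")
print(len(decomposed))                                     # 6
print(unicodedata.normalize("NFC", decomposed) == "안녕")   # True

# U+FFFD: its presence means some bytes were already thrown away.
damaged = "\ufffd\ufffd\ufffd"
print("\ufffd" in damaged)                                 # True
```

Note that NFC recomposes the conjoining Jamo that NFD produces; standalone compatibility Jamo typed as separate letters are a different case and are left as they are.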

4. Why we show 5 candidates instead of 1

Many real mojibake chains are two steps deep (A → B → C), and distinct original bytes can produce the same broken output, so committing to a single "most likely" answer is often wrong. This tool tries 7–8 encoding-pair combinations plus double pairs, scores each result as "Hangul syllable ratio + ASCII health − U+FFFD penalty − Latin extended penalty", and shows the top 5. Candidate #2 is often the right one.

5. ZIP filename recovery

ZIPs produced by Korean Windows store inner filenames as CP949. unzip on macOS and Linux defaults to UTF-8, producing ¾È³ç-style names. Paste a filename into this tool and pick the "EUC-KR → Latin-1" candidate. For batch recovery use unar, 7z, or the planned ZIP filename bulk fixer.
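
For scripted recovery, the same unwind can be done with Python's zipfile module. One difference to be aware of: zipfile decodes names that lack the UTF-8 flag as CP437, not Latin-1, so the pair to reverse is "CP949 → CP437". The sketch below assumes the archive really came from Korean Windows; recover_name and list_recovered_names are names invented for this example.

```python
import zipfile


def recover_name(name: str, flag_bits: int, real_encoding: str = "cp949") -> str:
    """Undo zipfile's CP437 fallback for a name that was really stored as CP949."""
    if flag_bits & 0x800:           # bit 11 set: the name is flagged as UTF-8 already
        return name
    try:
        return name.encode("cp437").decode(real_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name                 # leave it alone rather than guess


def list_recovered_names(path: str) -> list[str]:
    """Return the repaired member names of the archive at path."""
    with zipfile.ZipFile(path) as archive:
        return [recover_name(info.filename, info.flag_bits)
                for info in archive.infolist()]


# Simulate what zipfile would show for a CP949 name without the UTF-8 flag.
garbled = "안녕.txt".encode("cp949").decode("cp437")
print(recover_name(garbled, flag_bits=0))   # 안녕.txt
```

On Python 3.11 and later, passing metadata_encoding="cp949" to zipfile.ZipFile when reading should make the manual unwind unnecessary.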

6. DB dumps & CSV

Korean rows stored while MySQL still defaulted to latin1 show up as ì•ˆë…• when a UTF-8 client reads them. The top candidate here ("UTF-8 → Latin-1") is exactly the unwind. Korean Excel often saves CSVs in CP949; opening them in an editor that assumes UTF-8 garbles them for the same reason.
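
Both cases can be handled in a few lines of Python. The sketch below tries Windows-1252 before Latin-1 because MySQL's latin1 behaves like Windows-1252 for the printable range; the function name unwind_utf8_as_latin1 and the sample CSV content are made up for this example.

```python
import csv
import io


def unwind_utf8_as_latin1(text: str) -> str:
    """Reverse 'UTF-8 bytes decoded as a 1-byte Western encoding'."""
    for single_byte in ("cp1252", "latin-1"):
        try:
            return text.encode(single_byte).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text                       # not this kind of mojibake; return unchanged


print(unwind_utf8_as_latin1("ì•ˆë…•"))   # 안녕

# A CSV saved by Korean Excel: decode it as CP949 instead of UTF-8.
cp949_csv = "인사,답\n안녕,하세요\n".encode("cp949")
rows = list(csv.reader(io.StringIO(cp949_csv.decode("cp949"))))
print(rows)   # [['인사', '답'], ['안녕', '하세요']]
```

For a file on disk, the equivalent is to pass encoding="cp949" when opening it, then write it back out as UTF-8.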

7. Email subjects (MIME encoded-word)

Subjects wrapped as =?UTF-8?B?...?= or =?EUC-KR?B?...?= follow the RFC 2047 encoded-word syntax. If the mail client mishandles it, the subject appears broken; decode the Base64 part first, then paste the result here.
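
Python's email.header module can do the Base64 step and the charset decode together. In the sketch below, treating EUC-KR labels as CP949 is an assumption (it is safe because CP949 accepts every EUC-KR byte sequence), and decode_subject is a name invented for this example.

```python
import base64
from email.header import decode_header

# Assumption: mail labelled EUC-KR may contain CP949 extension bytes,
# so decode it with the superset.
LENIENT = {"euc-kr": "cp949", "ks_c_5601-1987": "cp949"}


def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a Subject header value."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            name = (charset or "ascii").lower()
            parts.append(chunk.decode(LENIENT.get(name, name), errors="replace"))
        else:
            parts.append(chunk)       # plain, unencoded text
    return "".join(parts)


print(decode_subject("=?UTF-8?B?7JWI64WV?="))   # 안녕

euc_word = base64.b64encode("안녕".encode("euc-kr")).decode("ascii")
print(decode_subject(f"=?EUC-KR?B?{euc_word}?="))   # 안녕
```

If the decoded result is still garbled, the charset label in the header was probably wrong, and the decoded bytes are what should go through the candidate search.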

8. When recovery is impossible

  • U+FFFD is already present — bytes that were replaced are gone for good.
  • String is too short — with 2–3 characters several candidates tie and no single winner emerges. Paste a longer fragment including surrounding context when possible.
  • Scanned images or PDFs — this tool only fixes byte-level encoding. OCR errors need a separate language-model-based corrector.
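
The first two cases can be demonstrated and checked mechanically. The sketch below shows two different syllables collapsing to the same U+FFFD output, then a simple pre-check; the function name recovery_outlook and the min_length threshold of 4 are assumptions for illustration, not values from this tool.

```python
# Two different EUC-KR syllables, decoded as UTF-8 with replacement,
# end up as the same two U+FFFD characters: the difference is gone.
a = "세".encode("euc-kr").decode("utf-8", errors="replace")
b = "요".encode("euc-kr").decode("utf-8", errors="replace")
print(a == b == "\ufffd\ufffd")   # True


def recovery_outlook(text: str, min_length: int = 4) -> str:
    """Rough pre-check before attempting any repair."""
    if "\ufffd" in text:
        return "lossy"            # some bytes were already replaced
    if len(text) < min_length:
        return "too short"        # candidates are likely to tie
    return "worth trying"


print(recovery_outlook(a))               # lossy
print(recovery_outlook("¾È"))            # too short
print(recovery_outlook("¾È³çÇϼ¼¿ä"))    # worth trying
```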

Written by 김지광 (operator) · bal.pe.kr micro SaaS
