1. Core concept: encoding mismatch
Korean mojibake almost always comes from one cause: program A writes bytes using encoding X, and program B reads the same bytes believing they are encoding Y. A Hangul syllable takes 2 bytes in EUC-KR/CP949 and 3 bytes in UTF-8, so when those bytes are misread as a 1-byte encoding (Latin-1, Windows-1252), a single syllable appears as 2–3 stray Western glyphs.
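The mismatch can be reproduced in a few lines. This is a minimal Python sketch that relies only on the standard codecs (`euc-kr`, `utf-8`, `latin-1`, `cp1252`); the helper name `misread` is illustrative and not part of this tool.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text the way program A would, then decode it the way program B does."""
    return text.encode(written_as).decode(read_as)

original = "안녕하세요"

# 2 bytes per syllable in EUC-KR, so 5 syllables become 10 Western glyphs.
euckr_broken = misread(original, "euc-kr", "latin-1")

# 3 bytes per syllable in UTF-8, so 5 syllables become 15 glyphs.
utf8_broken = misread(original, "utf-8", "cp1252")

# Reversing the two steps restores the text as long as no byte was dropped.
restored = misread(euckr_broken, "latin-1", "euc-kr")

print(euckr_broken, len(euckr_broken), len(utf8_broken), restored)
```

Because Latin-1 assigns a character to every byte value, the round trip through it is lossless, which is why this kind of damage is usually reversible.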
2. The relevant encodings
- EUC-KR: Standardised in 1987. Covers 2,350 precomposed Hangul syllables. Default on legacy Korean systems and early Korean web pages.
- CP949: Microsoft's extension of EUC-KR that adds the remaining 8,822 Hangul syllables, for 11,172 in total. Default on Korean Windows and inside old HWP / Excel files. Practically a superset of EUC-KR; characters in the extension area fail under a strict EUC-KR decoder.
- UTF-8: Variable-length Unicode encoding. Default on the modern web and on macOS/Linux, and increasingly the default for applications on Windows 10+ (the legacy system code page on Korean Windows is still CP949). Hangul takes 3 bytes per syllable.
- ISO-8859-1 / Windows-1252: 1-byte Western encodings. They cannot represent Korean, but because nearly every byte value maps to some character they act as a pass-through that "carries the raw bytes", and they are a frequent link in the mojibake chain.
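The differences between these encodings can be checked directly. The sketch below assumes Python's built-in `euc-kr` codec behaves as a strict EUC-KR decoder, and uses the syllable 똠 as an assumed example of a character that exists only in the CP949 extension area; `decodes_as` is a hypothetical helper.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

common = "안"  # one of the 2,350 syllables covered by EUC-KR
rare = "똠"    # assumed example of a CP949-extension-only syllable

for ch in (common, rare):
    cp949_bytes = ch.encode("cp949")
    print(
        ch,
        cp949_bytes.hex(),                  # 2 bytes in CP949
        len(ch.encode("utf-8")),            # 3 bytes in UTF-8
        decodes_as(cp949_bytes, "euc-kr"),  # does a strict EUC-KR decoder accept it?
    )
```

When the encoding of Korean legacy data is labelled EUC-KR, decoding it as CP949 is usually the safer choice, since CP949 accepts everything EUC-KR does.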
3. Common mojibake patterns
- ¾È³çÇϼ¼¿ä: "안녕하세요" written as EUC-KR then decoded as Latin-1. Most common in ZIP filenames.
- ì•ˆë…•í•˜ì„¸ìš”: the same text written as UTF-8 then decoded as Latin-1 (shown here as Windows-1252 renders it). Frequent in DB dumps and email headers.
- %EC%95%88%EB%85%95%ED%95%98%EC%84%B8%EC%9A%94: URL percent-encoding; appears in server logs and Referer headers.
- ㅇㅏㄴㄴㅕㅇ: decomposed Jamo (NFD). Seen when iOS Safari uploads filenames to servers.
- �����: U+FFFD replacement characters. The original bytes are already lost and cannot be recovered.
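The percent-encoded and decomposed-Jamo patterns are not encoding mismatches, so they can be undone without guessing. The following is a small sketch using only the Python standard library; the function names are illustrative.

```python
import unicodedata
from urllib.parse import unquote

def percent_decode(value: str) -> str:
    """Undo URL percent-encoding, assuming the escaped bytes are UTF-8."""
    return unquote(value, encoding="utf-8", errors="strict")

def compose_jamo(value: str) -> str:
    """Recombine decomposed (NFD) Jamo into precomposed syllables (NFC)."""
    return unicodedata.normalize("NFC", value)

def has_lost_bytes(value: str) -> bool:
    """True if the text already contains U+FFFD replacement characters."""
    return "\ufffd" in value

print(percent_decode("%EC%95%88%EB%85%95%ED%95%98%EC%84%B8%EC%9A%94"))

decomposed = unicodedata.normalize("NFD", "안녕")  # 6 conjoining Jamo
print(len(decomposed), compose_jamo(decomposed))
```

Note that NFC only recombines conjoining Jamo (U+1100 block). Text typed with standalone compatibility Jamo is left unchanged.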
4. Why we show 5 candidates instead of 1
Many real mojibake chains are 2 steps deep (A → B → C), and distinct original bytes can produce the same broken output. Pinning a single "most likely" answer is often wrong. This tool runs 7–8 encoding pair combinations plus double pairs, scores each result by "Hangul syllable ratio + ASCII health − U+FFFD penalty − Latin extended penalty", and shows the top 5. It is genuinely common for candidate #2 to be the right one.
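The ranking idea can be sketched as follows. This is not the tool's implementation: the encoding list, the weights, and the names `score` and `candidates` are assumptions chosen for illustration, and only single-step pairs are tried (no double pairs).

```python
from itertools import permutations

# Assumed encoding list; the tool's actual pair set may differ.
ENCODINGS = ("utf-8", "cp949", "latin-1", "cp1252")

def score(text: str) -> float:
    """Hangul ratio + ASCII health - U+FFFD penalty - Latin extended penalty."""
    if not text:
        return float("-inf")
    hangul = sum("\uac00" <= ch <= "\ud7a3" for ch in text)
    ascii_ok = sum(ch.isascii() and ch.isprintable() for ch in text)
    replaced = text.count("\ufffd")
    latin_ext = sum("\u0080" <= ch <= "\u024f" for ch in text)
    # Illustrative weights, not the tool's tuned values.
    return (hangul + 0.5 * ascii_ok - 2.0 * replaced - latin_ext) / len(text)

def candidates(garbled: str, top: int = 5) -> list[tuple[float, str, str]]:
    """Try every (misread_as, written_as) pair and return the best-scoring results."""
    seen: dict[str, tuple[float, str]] = {}
    for misread_as, written_as in permutations(ENCODINGS, 2):
        try:
            fixed = garbled.encode(misread_as).decode(written_as)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the input
        if fixed != garbled and fixed not in seen:
            seen[fixed] = (score(fixed), f"{written_as} -> {misread_as}")
    ranked = sorted(
        ((s, label, text) for text, (s, label) in seen.items()),
        key=lambda item: item[0],
        reverse=True,
    )
    return ranked[:top]

for s, label, text in candidates("¾È³çÇϼ¼¿ä"):
    print(f"{s:+.2f}  {label:20}  {text}")
```

Even in this small sketch a wrong pair can decode to plausible-looking Hangul and tie with the correct answer, which illustrates why several candidates are shown rather than one.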
5. ZIP filename recovery
ZIPs produced by Korean Windows store inner filenames as CP949. unzip on macOS and Linux defaults to UTF-8, producing ¾È³ç-style names. Paste a filename into this tool and pick the "EUC-KR → Latin-1" candidate. For batch recovery use unar, 7z, or the planned ZIP filename bulk fixer.
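For scripted recovery, one possible approach is sketched below. It relies on the fact that Python's `zipfile` decodes names without the UTF-8 flag as CP437, so re-encoding to CP437 returns the raw bytes; treating those bytes as CP949 is an assumption about where the archive came from, and `recovered_name` and `list_recovered` are hypothetical helpers.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8

def recovered_name(info: zipfile.ZipInfo, legacy: str = "cp949") -> str:
    """Return the entry name, re-decoding legacy names as CP949."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # already decoded correctly
    try:
        # zipfile decoded the raw bytes as CP437; undo that, then decode as CP949.
        return info.filename.encode("cp437").decode(legacy)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return info.filename  # not CP949 after all; keep the name unchanged

def list_recovered(path: str) -> list[str]:
    """List every entry name in the archive with legacy names repaired."""
    with zipfile.ZipFile(path) as archive:
        return [recovered_name(info) for info in archive.infolist()]
```

On Python 3.11 and later, passing `metadata_encoding="cp949"` to `zipfile.ZipFile` when reading achieves a similar result without the manual round trip.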
6. DB dumps & CSV
Korean rows stored while MySQL still defaulted to latin1 show up as ì•ˆë…• when a UTF-8 client reads them. The top candidate here ("UTF-8 → Latin-1") is exactly the unwind. Korean Excel often saves CSVs in CP949; opening them in a UTF-8 editor garbles the text through the same kind of mismatch.
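Both cases can be handled in a script along these lines. This is a sketch under stated assumptions: the dump is taken to have been displayed through Windows-1252 (which MySQL's latin1 closely follows), the CSV is taken to be CP949, and all function names are illustrative.

```python
import csv
import io

def single_byte_bytes(text: str) -> bytes:
    """Map each character back to the single byte it was decoded from."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # cp1252 leaves five byte values undefined; those arrive as C1 controls.
            # Characters above U+00FF that cp1252 lacks still raise here.
            out += ch.encode("latin-1")
    return bytes(out)

def unwind_utf8_as_latin1(text: str) -> str:
    """Reverse the 'UTF-8 written, Latin-1 read' chain."""
    return single_byte_bytes(text).decode("utf-8")

def read_cp949_csv(raw: bytes) -> list[list[str]]:
    """Parse a CSV exported by Korean Excel, assuming it is CP949."""
    return list(csv.reader(io.StringIO(raw.decode("cp949"), newline="")))

broken = "안녕".encode("utf-8").decode("cp1252")
print(broken, "->", unwind_utf8_as_latin1(broken))
print(read_cp949_csv("이름,도시\r\n철수,서울\r\n".encode("cp949")))
```

The per-character fallback matters because some Hangul syllables contain UTF-8 bytes that Windows-1252 does not define, so a plain `encode("cp1252")` on the whole string can fail.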
7. Email subjects (MIME encoded-word)
Subjects wrapped as =?UTF-8?B?...?= or =?EUC-KR?B?...?= follow the RFC 2047 encoded-word syntax. If the mail client mishandles it, the subject appears broken; decode the Base64 part first, then paste the result here.
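Decoding the encoded-word can also be done with Python's standard `email.header` module. In this sketch, mapping the `euc-kr` and `ks_c_5601-1987` labels to CP949 is an assumption (Korean mailers often send CP949 under those labels), and `decode_subject` is a hypothetical helper.

```python
from email.header import decode_header

# Assumption: Korean mail labelled with these charsets is often really CP949.
KOREAN_LABELS = {"euc-kr", "ks_c_5601-1987"}

def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a Subject header value."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, str):
            parts.append(chunk)  # header had no encoded-words
            continue
        codec = (charset or "ascii").lower()
        if codec in KOREAN_LABELS:
            codec = "cp949"
        try:
            parts.append(chunk.decode(codec, errors="replace"))
        except LookupError:
            # Unknown charset label: keep the raw bytes visible for later repair.
            parts.append(chunk.decode("latin-1"))
    return "".join(parts)

print(decode_subject("=?UTF-8?B?7JWI64WV7ZWY7IS47JqU?="))
print(decode_subject("=?EUC-KR?B?vsiz5w==?="))
```

If the decoded subject still looks broken, the declared charset was probably wrong, and the decoded text is what should be pasted into the tool.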
8. When recovery is impossible
- U+FFFD is already present — bytes that were replaced are gone for good.
- String is too short — with 2–3 characters several candidates tie and no single winner emerges. Paste a longer fragment including surrounding context when possible.
- Scanned images or PDFs — this tool only fixes byte-level encoding. OCR errors need a separate language-model-based corrector.
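The first case above can be demonstrated in a few lines: once a lenient decoder has substituted U+FFFD, different original bytes collapse into the same character. This is a minimal Python sketch with illustrative helper names.

```python
REPLACEMENT = "\ufffd"

def lossy_decode(data: bytes) -> str:
    """Decode as UTF-8 the way a lenient reader does, replacing invalid bytes."""
    return data.decode("utf-8", errors="replace")

def is_recoverable(text: str) -> bool:
    """A string that already contains U+FFFD has lost some of its original bytes."""
    return REPLACEMENT not in text

# Two different bytes end up as the same replacement character,
# so nothing in the output says which byte was there originally.
a = lossy_decode(b"\xbe")
b = lossy_decode(b"\xbf")
print(a == b, is_recoverable(lossy_decode("안녕".encode("cp949"))))
```

The practical consequence is to go back to the earliest copy of the data, before any lenient decoding was applied, rather than trying to repair text that already contains U+FFFD.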