Technical walkthrough
How YomiNinja Reads Japanese Text From Your Screen
From hotkey press to Yomitan popup — every step of the pipeline, explained.
The process
Four steps from text to definition
YomiNinja's pipeline typically completes in under 500 milliseconds from trigger to overlay display.
A capture is triggered
Two modes: manual hotkey (press a key, get one capture) or Auto OCR (continuous monitoring that fires automatically when screen content changes). Auto OCR uses frame comparison to detect when new text has appeared — typically when a dialogue box updates — and immediately initiates a new capture without requiring any input.
The capture region is either the entire selected application window, or a user-defined OCR Template — a saved rectangle that covers just the dialogue box. Templates are faster and more accurate because they eliminate noise from other screen elements.
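The frame-comparison trigger can be sketched as a per-pixel diff between consecutive captures. This is an illustrative approximation, not YomiNinja's actual comparison logic; the thresholds are invented parameters you would tune in practice.

```typescript
// Illustrative sketch (not YomiNinja's implementation): decide whether a
// new OCR capture should fire by counting how many pixels changed between
// two RGBA frames of the same dimensions.
function frameChanged(
  prev: Uint8Array,        // previous frame, RGBA bytes
  curr: Uint8Array,        // current frame, RGBA bytes
  pixelThreshold = 16,     // per-channel delta that counts as "changed"
  changedRatio = 0.005,    // fraction of pixels that must change to trigger
): boolean {
  let changed = 0;
  const pixels = curr.length / 4;
  for (let i = 0; i < curr.length; i += 4) {
    // Compare only the RGB channels; ignore alpha.
    if (
      Math.abs(curr[i] - prev[i]) > pixelThreshold ||
      Math.abs(curr[i + 1] - prev[i + 1]) > pixelThreshold ||
      Math.abs(curr[i + 2] - prev[i + 2]) > pixelThreshold
    ) {
      changed++;
    }
  }
  return changed / pixels >= changedRatio;
}
```

When the dialogue box advances, a burst of pixels changes inside the OCR Template region and the diff crosses the ratio threshold, firing a capture without any user input.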
Screenshot is sent to the OCR engine
The captured region is passed to the selected OCR engine via YomiNinja's gRPC backend. The engine processes the image and returns two things: the recognized text (the Japanese characters), and character-level bounding box coordinates — the pixel positions of each individual character in the image.
This character-level positioning is what makes hover-to-lookup possible. YomiNinja knows not just what was recognized, but exactly where each character appears on screen.
Local engines (PaddleOCR, MangaOCR, Apple Vision) process the image entirely on your machine. Cloud engines (Google Cloud Vision, Google Lens) send the capture to an external API and receive results back — slightly slower, but often more accurate on difficult fonts.
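The result returned by the engine can be pictured as text plus one bounding box per character. The field names below are illustrative, not YomiNinja's actual gRPC schema:

```typescript
// Hypothetical OCR result shape — field names are illustrative, not
// YomiNinja's actual gRPC message definitions.
interface CharBox {
  char: string;    // one recognized character
  x: number;       // left edge in pixels, relative to the captured region
  y: number;       // top edge in pixels
  width: number;
  height: number;
}

interface OcrResult {
  text: string;      // the full recognized string
  boxes: CharBox[];  // one bounding box per character
}

// Example: the two characters of 「本屋」 recognized side by side.
const result: OcrResult = {
  text: "本屋",
  boxes: [
    { char: "本", x: 120, y: 48, width: 32, height: 32 },
    { char: "屋", x: 152, y: 48, width: 32, height: 32 },
  ],
};
```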
A transparent overlay is rendered above the game
YomiNinja creates a transparent Electron window positioned above the game window in the display stack. This window has no visible chrome — it's invisible except for the text it renders.
Using the bounding box coordinates from the OCR result, YomiNinja places each character at the exact pixel position where it appears in the game. The overlay text sits precisely on top of the original game text.
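That placement step amounts to converting each bounding box into absolute CSS within the overlay document. This is a minimal sketch under two assumptions that are not confirmed by the source: that the overlay window is aligned with the capture origin (hence the offset parameters), and that the text is rendered with a transparent color so the game's own glyphs stay visible underneath.

```typescript
// Hedged sketch of overlay placement: convert a character's bounding box
// (pixels within the captured region) into absolute CSS for the overlay.
interface Box { x: number; y: number; width: number; height: number }

function overlayStyle(box: Box, offsetX = 0, offsetY = 0): Record<string, string> {
  return {
    position: "absolute",
    left: `${box.x + offsetX}px`,
    top: `${box.y + offsetY}px`,
    width: `${box.width}px`,
    height: `${box.height}px`,
    // Assumed technique: transparent text keeps the game's glyphs visible
    // while the real text node remains hoverable for Yomitan.
    color: "transparent",
  };
}
```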
Because the overlay is a real HTML document with actual text nodes (not an image), your cursor can interact with it normally. The underlying game receives mouse and keyboard input through the transparent areas. The overlay only intercepts input on the text characters themselves.
Borderless Windowed mode is required because exclusive fullscreen bypasses the operating system's display compositor entirely. Without the compositor, no transparent window can layer above a fullscreen application.
Hover a word — Yomitan fires instantly
Yomitan and 10ten Reader are Chrome extensions embedded inside YomiNinja's Electron Chromium context. They're pre-installed — no browser or Web Store required.
When you move your cursor over any character in the overlay, Yomitan's hover detection triggers on the underlying text node — exactly as it does on a Japanese webpage. It reads the full word (performing Japanese segmentation to find word boundaries), queries your imported dictionary files, and displays a popup with:
- Kanji and kana reading
- Part of speech and conjugation information
- Definitions from JMdict and any other imported dictionaries
- Pitch accent pattern (if you've imported a pitch accent dictionary)
- Frequency ranking (if you've imported a frequency list)
The popup appears within one frame. No network request is made at lookup time — all dictionary data is stored locally from the files you imported during setup.
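The word-boundary step can be illustrated with a greedy longest-match lookup against a dictionary. Yomitan's real engine also deinflects conjugated forms and scores candidates; this sketch (with a toy three-entry dictionary) only matches the longest entry starting at the hovered character:

```typescript
// Minimal longest-match sketch of hover lookup. Yomitan's actual engine is
// more sophisticated (deinflection, multiple dictionaries, scoring); this
// only finds the longest dictionary entry starting at the hovered index.
const dictionary = new Map<string, string>([
  ["食べる", "to eat"],
  ["食べ", "eating (stem)"],
  ["本", "book"],
]);

function lookupAt(
  text: string,
  index: number,
  maxLen = 10,
): [string, string] | null {
  // Try the longest candidate first, shrinking until a dictionary hit.
  for (let len = Math.min(maxLen, text.length - index); len > 0; len--) {
    const candidate = text.slice(index, index + len);
    const gloss = dictionary.get(candidate);
    if (gloss !== undefined) return [candidate, gloss];
  }
  return null;
}
```

Hovering the 食 in 本を食べる should therefore resolve to the whole word 食べる, not the single character.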
Comparison
OCR vs text hooking: how they differ technically
Text-hooking tools like Textractor work by injecting a DLL into the game's process and intercepting the function call where the game sends text to the display system — capturing the raw string before it's rendered to pixels. This is highly accurate but requires a compatible hook for each specific game engine.
YomiNinja reads pixels — it never touches the game process at all. This difference has several practical consequences:
| | OCR (YomiNinja) | Text Hooking (Textractor) |
|---|---|---|
| Game compatibility | Any game with visible text | Only games with a compatible hook |
| Accuracy | High, with occasional recognition errors | Exact — reads the source string |
| Process access | None — only screen capture | DLL injection into game process |
| Emulator support | Yes — reads from emulator window | Generally no |
| Anti-cheat risk | Minimal (screen capture only) | Higher (process injection) |
| Setup complexity | Low — select window, run | Varies — finding the correct hook can be difficult |
The practical recommendation: try Textractor or LunaTranslator first for visual novels on supported engines. Use YomiNinja for everything else — JRPGs, action games, emulated games, and any title where text hooking fails.
Integration
WebSocket output: connecting to external tools
Every time YomiNinja's OCR produces a result, it broadcasts the recognized text over a local WebSocket server (default port: 7331). Any application or script that connects to this WebSocket receives the OCR output in real time.
This is how Anki mining pipelines work: a texthooker page (running in your browser) connects to the WebSocket, receives each captured sentence, displays it as hoverable text, and lets Yomitan export it to Anki. The same interface can be used for:
- Logging game dialogue to a file
- Piping text to external TTS engines
- Feeding a second-screen display showing the current sentence
- Custom applications or scripts built on top of YomiNinja's output
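A consumer only needs a WebSocket client pointed at the default port. The message format below is an assumption — YomiNinja may send plain text or JSON, so the parser accepts both; inspect a live message before relying on specific fields. Node 22+ ships a global `WebSocket` client, so no extra package is needed.

```typescript
// Hedged sketch of a WebSocket consumer. The payload shape is an
// assumption, not YomiNinja's documented schema: we accept either a plain
// string or a JSON object with a `text` field.
function extractText(raw: string): string {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed === "string") return parsed;
    if (parsed && typeof parsed.text === "string") return parsed.text;
  } catch {
    // Not JSON — treat the payload as plain text.
  }
  return raw;
}

// Usage (Node 22+ global WebSocket client), against the default port:
// const ws = new WebSocket("ws://localhost:7331");
// ws.onmessage = (ev) => console.log(extractText(String(ev.data)));
```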
Try it
See the pipeline in action
Download YomiNinja free and follow the setup guide to go from zero to your first Yomitan popup in under ten minutes.