Scanning Documents and Importing Native Files

Use your PC to Master Japanese and Chinese

Using Optical Character Recognition

At present, most text exists or is available only in printed form on paper. Traditional methods of processing this text involve annotating it (or a copy) with a pencil, or typing it into a word processor. However, typing is exceedingly slow and error-prone, hence the need for an automatic method of converting printed text to electronic text. This process is called optical character recognition(11-2) (OCR). It involves scanning a page to produce an electronic image of the page, then processing the image to recognize the text. The recognized text is then checked against the original image, errors are corrected, and the text is saved in a standard native file format. Although full-page black-and-white scanned images are quite large (2-3 megabytes), the resulting text is only about 1/1000 of that size.
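The size comparison above can be checked with a little arithmetic. The following is only a sketch: the page size (US Letter), the 400 dpi resolution, the 1 bit per pixel depth, and the two bytes per character are assumptions chosen for illustration, and the function name is hypothetical.

```python
def bilevel_image_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of a scanned page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Assumed page: US Letter (8.5 x 11 in) scanned at 400 dpi, 1 bit per pixel.
image_bytes = bilevel_image_bytes(8.5, 11, 400)

# The "about 1/1000" ratio quoted in the text.
text_bytes = image_bytes // 1000

# Assumption: native Asian encodings use two bytes per character.
chars = text_bytes // 2

print(image_bytes, text_bytes, chars)  # prints: 1870000 1870 935
```

Under these assumptions the image is a little under 2 megabytes and the text is under 2 kilobytes, or roughly a thousand characters. Because the pixel count grows with the square of the resolution, doubling the dpi quadruples the image size.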

Although OCR programs for Roman-alphabet languages are fairly sophisticated, Asian language OCR is still in its infancy. The technical challenges include distinguishing among the characters of a set roughly 200 times larger, and the need to achieve essentially perfect accuracy. Consequently, Asian language OCR currently involves considerable human interaction to produce acceptable results.
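To see why accuracy must be essentially perfect, consider how per-character accuracy translates into errors per page. This is a rough sketch only: the figure of 1000 characters per page and the accuracy values are assumptions for illustration, not measurements of any OCR program, and errors are treated as independent.

```python
def expected_errors(chars_per_page, accuracy):
    """Average number of misrecognized characters on a page."""
    return chars_per_page * (1.0 - accuracy)

def error_free_page_probability(chars_per_page, accuracy):
    """Chance that every character on the page is recognized correctly,
    assuming each character is recognized independently."""
    return accuracy ** chars_per_page

CHARS_PER_PAGE = 1000  # assumed page length

for accuracy in (0.99, 0.999, 0.9999):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    clean = error_free_page_probability(CHARS_PER_PAGE, accuracy)
    print(f"accuracy {accuracy}: about {errors:.1f} errors per page, "
          f"{clean:.1%} chance of an error-free page")
```

Even at 99 percent accuracy, a page of this length would average about ten errors, each of which must be found and corrected by hand.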

Image contrast and brightness are critical to error-free recognition, so scanning parameters must be adjusted to yield the best possible image for recognition. Sometimes a compromise is required. After the recognition process is run, the correction process verifies the recognition results against the original image, and corrects whatever errors in zoning and character recognition are noticed. The verification and correction process can be time-consuming. If poor image quality causes a large number of errors, it may be easier to re-scan and reprocess a page than to correct the errors.

To use an Asian language OCR(11-2) program to import scanned text:

Launch the OCR program. If the OCR program supports acquiring a page from a scanner, use the File | Acquire or Scan selections to scan the page. Otherwise, launch the scanner program separately.
Position the page in the scanner to avoid noticeable image skew (horizontal lines should be horizontal). A little skew is acceptable. Zoom the image and set the scan contrast and brightness for the sharpest and clearest image possible, and set the resolution to about 400 dpi (higher resolution costs more processing time and system capacity). Create the final image or save a bitmap file in a format that the OCR program can open.
Use the Recognize command to see what the software makes of the acquired image. Correct any incorrect zoning first, then correct incorrect characters. It is essential to fix errors at this stage if you plan to run the resulting text through a translation or annotation program: small errors in characters cause big errors in translation, because the program becomes hopelessly confused.
Save Japanese files as Shift-JIS(D-7), and Chinese files as BigFive(D-1) or GuoBiao(D-4). See Importing Native Files(7-4).
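For readers who post-process OCR output with their own scripts, the final saving step can be sketched in Python. This is an illustration only and is not part of any OCR program: the function name save_native and the mapping table are hypothetical, and the mapping assumes that BigFive corresponds to Python's big5 codec and GuoBiao to its gb2312 codec.

```python
import os
import tempfile

# Assumed mapping from the format names used above to Python codec names.
NATIVE_CODECS = {
    "Shift-JIS": "shift_jis",  # Japanese
    "BigFive": "big5",         # Chinese, traditional characters
    "GuoBiao": "gb2312",       # Chinese, simplified characters
}

def save_native(text, path, format_name):
    """Write text to path in the named native encoding.

    Raises UnicodeEncodeError if the text contains a character that the
    chosen encoding cannot represent.
    """
    with open(path, "w", encoding=NATIVE_CODECS[format_name], newline="") as out:
        out.write(text)

# Demo: save a short Japanese string, then read back the raw bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page1.txt")
    save_native("日本語のテキスト", path, "Shift-JIS")
    with open(path, "rb") as f:
        saved = f.read()

print(saved == "日本語のテキスト".encode("shift_jis"))  # prints: True
```

Choosing the wrong format for the language is reported as an error rather than silently producing a damaged file, which is usually the safer behavior for text that will later be translated or annotated.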

Importing Native Files

You can import scanned files or plain text files from native word processors directly into Smart Characters, or run them through a translation or annotation step before importing them. If you add native fonts to Smart Characters, you can open the files in their native code spaces. See the File Format(3-2) dialog and Use Other Fonts(8-5).

Select File | Open | Interpret File As to open a file in a native code space and review it for accuracy. See Interpreting Native Formats(7-1). If you have installed the ScAnnotate Automatic Annotator(11-2), select the Translate | Add Annotations(3-35) command to create an annotated file in the Combined(4-9) symbol set. If you have installed a full translation program, select the Translate | Translate Window(3-35) command to launch it and create a translation of the original document.
Select File | Open | Convert File From to convert a text file in a native code space to the Combined symbol set for hand annotation.
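The difference between interpreting and converting a file can be illustrated outside Smart Characters with a short Python sketch. None of this reflects how Smart Characters itself is implemented: the function names are hypothetical, the codec names assume the same mapping as the native formats named earlier, and UTF-8 stands in for the Combined symbol set purely for illustration.

```python
CANDIDATE_CODECS = ("shift_jis", "big5", "gb2312")  # assumed codec names

def interpret_as(data, codec):
    """Read the bytes in the given native code space without changing them."""
    return data.decode(codec)

def plausible_codecs(data, codecs=CANDIDATE_CODECS):
    """List the code spaces in which the bytes decode without error.

    A clean decode does not prove the guess is right, so the result
    still has to be reviewed by eye.
    """
    ok = []
    for codec in codecs:
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        ok.append(codec)
    return ok

def convert_from(data, src_codec, dst_codec="utf-8"):
    """Re-encode the bytes from a native code space into another one."""
    return data.decode(src_codec).encode(dst_codec)

sample = "中文".encode("big5")  # stand-in for the contents of a native file
print(interpret_as(sample, "big5"))
print(plausible_codecs(sample))
print(convert_from(sample, "big5").decode("utf-8"))
```

Interpreting leaves the stored bytes untouched and only changes how they are read, whereas converting writes new bytes in the target symbol set. That is why reviewing the interpreted file for accuracy comes first.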

Apropos Customer Service home page 617-648-2041
Last Modified: March 23, 1996