JPH06119391A - Chinese character string extraction system - Google Patents

Chinese character string extraction system

Info

Publication number
JPH06119391A
Authority
JP
Japan
Prior art keywords
character
kanji
character string
data
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
JP4271533A
Other languages
Japanese (ja)
Inventor
Kazuhiro Noguchi
和宏 野口
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Solution Innovators Ltd
Original Assignee
NEC Solution Innovators Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Solution Innovators Ltd
Priority to JP4271533A
Publication of JPH06119391A
Legal status: Withdrawn

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PURPOSE: To increase the speed of converting a character count into a byte count in a Chinese character data management system. CONSTITUTION: An extraction processing managing part 1-1 receives from a host device a pointer indicating a target character string, an extraction start character position in the character string, and an extraction character count, and extracts a Chinese character string from Chinese character string data by using a start byte position and an extraction byte count. A Chinese character data managing part 1-2 fetches the Chinese character string data indicated by the pointer. A character attribute table preparing part 1-5 scans the Chinese character string data from the left, segments the data one byte at a time, and prepares in memory a character attribute table consisting of the attribute of each Chinese character code (such as one-byte data, first half of a two-byte full-size Chinese character, second half of a two-byte full-size Chinese character, or half-size Chinese character) and the number of bytes constituting the segmented character. A character number/byte number conversion processing part 1-6 outputs the extraction start byte number and the extraction byte number by referring to the character attribute table based on the extraction start character position and the extraction character count.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a kanji character string extraction method, and more particularly to a method for converting a character count into a byte count in a kanji data management system.

[0002]

[Prior Art] Conventionally, when a specific character string is extracted from a kanji character string in a kanji data management system, the general method of converting a character count into a byte count is to compare character codes from the beginning of the character string and count up.

[0003]

[Problem to Be Solved by the Invention] With the conventional kanji character string extraction method described above, when data is extracted from within a character string multiple times, the counting must be repeated every time, so obtaining the byte position takes a long processing time.

[0004] An object of the present invention is to shorten the processing time by using a character string attribute management table for every pass after the first counting pass.

[0005]

[Means for Solving the Problem] The first invention is a kanji character string extraction method in a kanji data management system that extracts a specific character string from a kanji character string, characterized by comprising: an extraction processing management unit that receives from a host device a pointer indicating the target character string, an extraction start character position within the character string, and an extraction character count, outputs a first instruction, a second instruction, and a third instruction, and extracts the kanji character string from the kanji character string data using a start byte position and an extraction byte count; a kanji data management unit that fetches the kanji character string data indicated by the pointer in response to the first instruction; a character attribute table creation unit that, in response to the second instruction, scans the kanji character string data from the left, cuts out the data one byte at a time, and creates in memory a character attribute table consisting of the attribute of each kanji code (such as 1-byte data, first half of a 2-byte full-width kanji, second half of a 2-byte full-width kanji, half-width kanji, kanji-in code, or kanji-out code) and the number of bytes constituting the cut-out character; and a character count/byte count conversion processing unit that, in response to the third instruction, refers to the character attribute table based on the extraction start character position and the extraction character count and outputs the extraction start byte count and the extraction byte count.

[0006]

[Embodiment] Next, an embodiment of the present invention will be described with reference to the drawings.

[0007] FIG. 1 is a block diagram showing an embodiment of the present invention.

[0008] 1. When character string extraction processing starts, the extraction processing management unit 1-1 is passed, from a host device (not shown), a pointer indicating the target character string, the extraction start character position within the character string, and the extraction character count. The extraction processing management unit 1-1 fetches the kanji character string data indicated by the specified pointer via the kanji data management unit 1-2, and instructs the character attribute table creation unit 1-5, via the character attribute management unit 1-3, to create a character attribute table.

[0009] 2. As shown in FIG. 2, the character attribute table creation unit 1-5 scans the kanji character string 2-1 from the left, cuts out the data one byte at a time, and creates the character attribute table 2-2 in memory via the kanji code determination unit 1-4.

[0010] 3. The kanji code determination unit 1-4 returns the attributes of the kanji code it is given (including 1-byte ANK codes). The returned data are the kanji attribute (1-byte data, first half of a 2-byte full-width kanji, second half of a 2-byte full-width kanji, half-width kanji, kanji-in code, or kanji-out code) and the number of bytes constituting the character (1 byte each for 1-byte characters and for the first and second halves of a 2-byte full-width kanji; 2 bytes for half-width kanji and kanji shift codes).
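The attribute and byte-count pairs listed above can be sketched as a small lookup. This is a minimal illustration, not the patented implementation: the shift code values `KANJI_IN` and `KANJI_OUT`, the function name `judge`, and its `in_kanji` / `second_half` parameters are all assumptions, since the specification does not name a concrete encoding. The toy encoding has no rule for recognizing half-width kanji, so that attribute appears only in the table.

```python
from enum import Enum


class Attr(Enum):
    ONE_BYTE = "1-byte data (ANK)"
    FULL_FIRST = "full-width kanji, first half"
    FULL_SECOND = "full-width kanji, second half"
    HALF_KANJI = "half-width kanji"
    KANJI_IN = "kanji-in code"
    KANJI_OUT = "kanji-out code"


# Byte counts as stated in the paragraph above.
BYTE_COUNT = {
    Attr.ONE_BYTE: 1,
    Attr.FULL_FIRST: 1,
    Attr.FULL_SECOND: 1,
    Attr.HALF_KANJI: 2,
    Attr.KANJI_IN: 2,
    Attr.KANJI_OUT: 2,
}

# Hypothetical 2-byte shift codes; the specification gives no values.
KANJI_IN = b"\x1bK"
KANJI_OUT = b"\x1bH"


def judge(data: bytes, offset: int, in_kanji: bool, second_half: bool):
    """Return (attribute, byte count) for the code starting at `offset`."""
    pair = data[offset:offset + 2]
    if pair == KANJI_IN:
        attr = Attr.KANJI_IN
    elif pair == KANJI_OUT:
        attr = Attr.KANJI_OUT
    elif not in_kanji:
        attr = Attr.ONE_BYTE
    elif second_half:
        attr = Attr.FULL_SECOND
    else:
        attr = Attr.FULL_FIRST
    return attr, BYTE_COUNT[attr]
```

The caller is assumed to track whether the scan is currently inside a kanji run and which half of a full-width kanji it has reached.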

[0011] 4. The character attribute table 2-2 is structured as shown in FIG. 3 and consists of a character attribute area 3-1 and a character length area 3-2. One byte of the character attribute table corresponds to one character, and the table length equals the length of the kanji character string. The character attribute area 3-1 stores the kanji attribute returned by the kanji code determination unit 1-4, and the character length area 3-2 stores the character byte count returned by the kanji code determination unit 1-4. When a kanji-in code is reported, a bit indicating the start of kanji-in data is set in the kanji attribute of the character that follows, and the byte count of the kanji shift code is added to that character's byte count. When a kanji-out code is reported, a bit indicating the start of kanji-out data is set in the kanji attribute of the immediately preceding character, and the byte count of the kanji shift code is added to that character's byte count.
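A minimal sketch of how such a table could be built, assuming a toy encoding: hypothetical 2-byte shift codes, 1-byte ANK characters outside kanji mode, and 2-byte full-width kanji inside it. The attribute values, flag bits, and the name `build_table` are illustrative and do not come from the specification. Half-width kanji are omitted, and the input is assumed to be well-formed.

```python
KANJI_IN = b"\x1bK"    # hypothetical 2-byte kanji-in shift code
KANJI_OUT = b"\x1bH"   # hypothetical 2-byte kanji-out shift code

# Illustrative attribute values (low nibble) and flag bits (high nibble).
ANK, FULL_FIRST, FULL_SECOND = 1, 2, 3
FLAG_IN, FLAG_OUT = 0x10, 0x20


def build_table(data: bytes):
    """Scan well-formed `data` left to right.

    Returns (attrs, lengths): the character attribute area and the
    character length area, one entry per character position.
    """
    attrs, lengths = [], []
    in_kanji = False
    pending_in = False  # a kanji-in code is waiting for the next character
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if pair == KANJI_IN:
            in_kanji, pending_in = True, True
            i += 2
            continue
        if pair == KANJI_OUT:
            in_kanji = False
            if attrs:  # fold the code into the immediately preceding entry
                attrs[-1] |= FLAG_OUT
                lengths[-1] += len(KANJI_OUT)
            i += 2
            continue
        first = len(attrs)  # index of the entry about to be added
        if in_kanji:
            # A full-width kanji occupies two positions of one byte each.
            attrs += [FULL_FIRST, FULL_SECOND]
            lengths += [1, 1]
            i += 2
        else:
            attrs.append(ANK)
            lengths.append(1)
            i += 1
        if pending_in:  # fold the kanji-in code into the following entry
            attrs[first] |= FLAG_IN
            lengths[first] += len(KANJI_IN)
            pending_in = False
    return attrs, lengths
```

Because the shift codes are folded into neighbouring entries, the lengths of a well-formed string sum to its total byte length.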

[0012] 5. After the character attribute table 2-2 has been created, the extraction processing management unit 1-1 calls the character count/byte count conversion processing unit 1-6 to obtain the start byte position and the extraction byte count of the actual character string to be extracted.

[0013] 6. Based on the extraction start character position (the start is also expressed as a character count) and the extraction character count passed from the extraction processing management unit 1-1, the character count/byte count conversion processing unit 1-6 uses the character attribute table 2-2 as follows. It adds up the entries of the character length area 3-2 for the extraction start character count (a character count corresponds to a number of table bytes in the character attribute table 2-2) and returns the sum as the extraction start byte count. Likewise, it adds up the entries of the character length area 3-2 from the extraction start position onward for the extraction character count and returns the sum as the extraction byte count.
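The two sums in this step can be sketched as follows. The function name `chars_to_bytes` and the 0-based start position are assumptions, since the specification does not state the indexing convention. The sample `lengths` list is an illustrative character length area for a string of two ANK characters, two full-width kanji wrapped in 2-byte shift codes, and one more ANK character.

```python
def chars_to_bytes(lengths, start_char, char_count):
    """Convert a character position and count into a byte offset and count.

    `lengths` is the character length area: one byte count per
    character position. `start_char` is assumed to be 0-based.
    """
    # Sum the entries before the start position to get the start byte.
    start_byte = sum(lengths[:start_char])
    # Sum the entries covered by the extraction to get its byte count.
    byte_count = sum(lengths[start_char:start_char + char_count])
    return start_byte, byte_count


lengths = [1, 1, 3, 1, 1, 3, 1]
print(chars_to_bytes(lengths, 2, 4))  # (2, 8)
```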

[0014] 7. When obtaining the extraction start byte count in step 6, the character count/byte count conversion processing unit 1-6 checks whether the character at the extraction start character position is the second half of a full-width character. The check examines whether the content of the character attribute area 3-1 of the character attribute table 2-2 is "second half of a full-width kanji". If the extraction start character position is the second half of a full-width kanji, the position is corrected to one character earlier (the first half of the full-width kanji).
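A minimal sketch of this start-position correction. The attribute encoding (type in the low nibble, value `3` for "second half of a full-width kanji") and the name `correct_start` are assumptions made for illustration. Whether the extraction character count should also change after the correction is not stated in the text, so the sketch leaves it alone.

```python
FULL_SECOND = 3   # illustrative value for "full-width kanji, second half"
TYPE_MASK = 0x0F  # illustrative: low nibble holds the type, high nibble flags


def correct_start(attrs, start_char):
    """Move the start back by one if it falls on the second half of a kanji."""
    if attrs[start_char] & TYPE_MASK == FULL_SECOND:
        return start_char - 1
    return start_char
```

For an attribute area such as `[1, 1, 0x12, 3, 2, 0x23, 1]`, a requested start of 3 would be corrected to 2, while a start of 2 is left unchanged.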

[0015] 8. When the character at the extraction start character position is a 2-byte character and the kanji-in shift code bit is not set in the attribute of that start character, the character count/byte count conversion processing unit 1-6 returns a status indicating that a kanji-in shift code must be set. Also, if during the processing of step 6 no kanji-out shift code is detected between the detection of a 2-byte character and the last extracted character, it returns a status indicating that a kanji-out shift code must be set. If the kanji-out shift code bit is set in the attribute of the last extracted character, the byte count of the extracted characters is increased by 2 bytes.
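The two status checks can be sketched as below. The attribute values, flag bits, and the name `shift_status` are assumptions for illustration. The sketch covers only the status decision and does not model the 2-byte adjustment for a final kanji-out bit.

```python
FULL_FIRST, FULL_SECOND = 2, 3    # illustrative attribute values
FLAG_IN, FLAG_OUT = 0x10, 0x20    # illustrative shift-code flag bits
TYPE_MASK = 0x0F


def is_two_byte(attr):
    """True if the entry belongs to a 2-byte (full-width) character."""
    return (attr & TYPE_MASK) in (FULL_FIRST, FULL_SECOND)


def shift_status(attrs, start_char, char_count):
    """Return (need_kanji_in, need_kanji_out) for the extracted span."""
    span = attrs[start_char:start_char + char_count]
    if not span:
        return False, False
    # A kanji-in code is needed if the span starts on a 2-byte character
    # whose attribute does not carry the kanji-in bit.
    need_in = is_two_byte(span[0]) and not (span[0] & FLAG_IN)
    # A kanji-out code is needed if a 2-byte character was seen and no
    # kanji-out bit followed it before the end of the span.
    open_kanji = False
    for attr in span:
        if is_two_byte(attr):
            open_kanji = True
        if attr & FLAG_OUT:
            open_kanji = False
    return need_in, open_kanji
```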

[0016] 9. The extraction processing management unit 1-1 extracts the kanji character string through the kanji data management unit 1-2, using the start byte position and the extraction byte count of the actual target character string. If a status has been set in step 8, a kanji shift code is added according to that status.
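A sketch of the final slice, with shift codes added when the status flags call for them. The shift code values and the name `extract_bytes` are hypothetical, and the two booleans stand in for the status returned in step 8.

```python
KANJI_IN = b"\x1bK"    # hypothetical 2-byte kanji-in shift code
KANJI_OUT = b"\x1bH"   # hypothetical 2-byte kanji-out shift code


def extract_bytes(data, start_byte, byte_count, need_in, need_out):
    """Slice the requested bytes and add any missing shift codes."""
    chunk = data[start_byte:start_byte + byte_count]
    if need_in:
        chunk = KANJI_IN + chunk    # span starts inside a kanji run
    if need_out:
        chunk = chunk + KANJI_OUT   # span ends inside a kanji run
    return chunk
```

With the shift codes restored, the extracted piece can be interpreted on its own, without the text that surrounded it.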

[0017] The processing above converts character counts into byte counts. When extraction is performed multiple times, the conversion requires only step 5 onward, which speeds up processing.
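The reuse described here can be sketched as an object that keeps the table and serves any number of extractions from it. The class name `Extractor` is hypothetical. As a variation that is not in the specification, the sketch stores running totals of the character length area (prefix sums) instead of re-adding the entries on every request. The character length area is assumed to have been built already.

```python
from itertools import accumulate


class Extractor:
    """Serve repeated extractions from one string using a cached table."""

    def __init__(self, data, lengths):
        self.data = data
        # offsets[k] is the byte offset at which character position k starts.
        self.offsets = [0, *accumulate(lengths)]

    def extract(self, start_char, char_count):
        """Return the bytes for `char_count` positions from `start_char`."""
        start = self.offsets[start_char]
        end = self.offsets[start_char + char_count]
        return self.data[start:end]


data = b"AB\x1bK\x30\x21\x30\x22\x1bHC"
ex = Extractor(data, [1, 1, 3, 1, 1, 3, 1])
print(ex.extract(0, 2))  # b'AB'
```

Each call after construction is a pair of list lookups, so the initial scan of the string is paid only once.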

[0018]

[Effect of the Invention] As described above, the present invention performs the comparison and counting of kanji codes by means of a character attribute table, which has the effect of speeding up repeated conversions between character counts and byte counts.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention.

FIG. 2 is a diagram showing the relationship between a kanji character string and the character attribute table in the present embodiment.

FIG. 3 is a diagram showing a configuration example of the character attribute table in the present embodiment.

[Explanation of Symbols]

1-1 Extraction processing management unit
1-2 Kanji data management unit
1-3 Character attribute management unit
1-4 Kanji code determination unit
1-5 Character attribute table creation unit
1-6 Character count/byte count conversion processing unit
2-1 Kanji character string
2-2 Character attribute table
3-1 Character attribute area
3-2 Character length area

Claims (1)

[Claims]

[Claim 1] A kanji character string extraction method in a kanji data management system that extracts a specific character string from a kanji character string, characterized by comprising: an extraction processing management unit that receives from a host device a pointer indicating the target character string, an extraction start character position within the character string, and an extraction character count, outputs a first instruction, a second instruction, and a third instruction, and extracts the kanji character string from the kanji character string data using a start byte position and an extraction byte count; a kanji data management unit that fetches the kanji character string data indicated by the pointer in response to the first instruction; a character attribute table creation unit that, in response to the second instruction, scans the kanji character string data from the left, cuts out the data one byte at a time, and creates in memory a character attribute table consisting of the attribute of each kanji code (such as 1-byte data, first half of a 2-byte full-width kanji, second half of a 2-byte full-width kanji, half-width kanji, kanji-in code, or kanji-out code) and the number of bytes constituting the cut-out character; and a character count/byte count conversion processing unit that, in response to the third instruction, refers to the character attribute table based on the extraction start character position and the extraction character count and outputs the extraction start byte count and the extraction byte count.
JP4271533A 1992-10-09 1992-10-09 Chinese character string extraction system Withdrawn JPH06119391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP4271533A JPH06119391A (en) 1992-10-09 1992-10-09 Chinese character string extraction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP4271533A JPH06119391A (en) 1992-10-09 1992-10-09 Chinese character string extraction system

Publications (1)

Publication Number Publication Date
JPH06119391A true JPH06119391A (en) 1994-04-28

Family

ID=17501394

Family Applications (1)

Application Number Title Priority Date Filing Date
JP4271533A Withdrawn JPH06119391A (en) 1992-10-09 1992-10-09 Chinese character string extraction system

Country Status (1)

Country Link
JP (1) JPH06119391A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100351584B1 (en) * 2000-07-05 2002-09-05 주식회사 팔만시스템 System of proofreading a Chinese character by contrasting one by one

Similar Documents

Publication Publication Date Title
US7984076B2 (en) Document processing apparatus, document processing method, document processing program and recording medium
JP3333549B2 (en) Document search method
US9734140B2 (en) Method, apparatus and computer program for model-driven message parsing
JPH06119391A (en) Chinese character string extraction system
JP4584359B2 (en) Unicode converter
JPH0969785A (en) Method and device for data compression
JP2967275B2 (en) Kana-Kanji conversion device
JPS6383833A (en) Retrieving method for character string
JP2833871B2 (en) Alien name data judgment method
JPH09114854A (en) Document retrieving system
JPS61251984A (en) Device for recognizing multi-font type character
JPH043243A (en) Kana to kanji converter
JP2503259B2 (en) How to determine full-width and half-width characters
JPH0440554A (en) Character data processor
JPH04205551A (en) Description conversion system
JPH07141347A (en) Method for segmenting japanese character string
JPH05210629A (en) Display control system
JPH11149476A (en) System and method for extracting similar data
JPH07225763A (en) Document processor
JPH0752451B2 (en) Information retrieval device
JPH07152858A (en) Method and system for management of character recognition ofplurality of document format images with common data type
JPH0796639A (en) Printer
JPS62176354A (en) Facsimile transmission system
JPH035818A (en) Code conversion method
JPH01277961A (en) Character conversion system

Legal Events

Date Code Title Description
A300 Withdrawal of application because of no request for examination

Free format text: JAPANESE INTERMEDIATE CODE: A300

Effective date: 20000104