JPH06119391A

JPH06119391A - Kanji character string extraction system

Info

Publication number: JPH06119391A
Application number: JP4271533A
Authority: JP
Inventors: Kazuhiro Noguchi; 和宏野口
Original assignee: NEC Solution Innovators Ltd
Current assignee: NEC Solution Innovators Ltd
Priority date: 1992-10-09
Filing date: 1992-10-09
Publication date: 1994-04-28

Abstract

PURPOSE: To speed up conversion from a character count to a byte count in a kanji data management system. CONSTITUTION: An extraction processing management unit 1-1 receives, from a host device, a pointer indicating the target character string together with the extraction start character position and the number of characters to extract within that string, and extracts a kanji character string from kanji character string data using a start byte position and an extraction byte count. A kanji data management unit 1-2 fetches the kanji character string data indicated by the pointer. A character attribute table creation unit 1-5 scans the kanji character string data from the left, cuts out the data one byte at a time, and creates in memory a character attribute table consisting of kanji code attributes (such as 1-byte data, first half of a 2-byte full-width kanji, second half of a 2-byte full-width kanji, and half-width kanji) and the number of constituent bytes of each cut-out character. A character count/byte count conversion processing unit 1-6 outputs the extraction start byte count and the extraction byte count by referring to the character attribute table based on the extraction start character position and the number of characters to extract.

Description

[Detailed Description of the Invention]

[0001]

[Field of Industrial Application] The present invention relates to a kanji character string extraction method, and more particularly to a method of converting a character count into a byte count in a kanji data management system.

[0002]

[Prior Art] Conventionally, when a specific character string is extracted from a kanji character string in a kanji data management system, the character count is generally converted into a byte count by comparing character codes and counting up from the beginning of the string.

[0003]

[Problem to Be Solved by the Invention] With the conventional kanji character string extraction method described above, when data is extracted from the same character string multiple times, the data must be counted up again each time, so finding the byte position takes considerable processing time.

[0004] An object of the present invention is to shorten processing time by using a character string attribute management table for every pass after the first counting pass.

[0005]

[Means for Solving the Problem] The first invention is a kanji character string extraction method in a kanji data management system that extracts a specific character string from a kanji character string, comprising: an extraction processing management unit that receives, from a host device, a pointer indicating the target character string together with the extraction start character position and the number of characters to extract within that string, outputs a first instruction, a second instruction, and a third instruction, and extracts the kanji character string from the kanji character string data using a start byte position and an extraction byte count; a kanji data management unit that, on the first instruction, fetches the kanji character string data indicated by the pointer; a character attribute table creation unit that, on the second instruction, scans the kanji character string data from the left, cuts out the data one byte at a time, and creates in memory a character attribute table consisting of kanji code attributes (such as 1-byte data, first half of a 2-byte full-width kanji, second half of a 2-byte full-width kanji, half-width kanji, kanji-in code, or kanji-out code) and the number of constituent bytes of each cut-out character; and a character count/byte count conversion processing unit that, on the third instruction, refers to the character attribute table on the basis of the extraction start character position and the number of characters to extract and outputs the extraction start byte count and the extraction byte count.

[0006]

[Embodiment] An embodiment of the present invention will now be described with reference to the drawings.

[0007] FIG. 1 is a block diagram showing an embodiment of the present invention.

[0008] 1. When character string extraction starts, the extraction processing management unit 1-1 receives from a host device (not shown) a pointer indicating the target character string, the extraction start character position within the string, and the number of characters to extract. The extraction processing management unit 1-1 fetches the kanji character string data indicated by the pointer via the kanji data management unit 1-2, and instructs the character attribute table creation unit 1-5, via the character attribute management unit 1-3, to create a character attribute table.

[0009] 2. As shown in FIG. 2, the character attribute table creation unit 1-5 scans the kanji character string 2-1 from the left, cuts out the data one byte at a time, and creates the character attribute table 2-2 in memory via the kanji code determination unit 1-4.

[0010] 3. The kanji code determination unit 1-4 returns the attribute of the kanji code it is given (1-byte ANK codes included). The returned data consists of the kanji attribute (1-byte data, first half of a 2-byte full-width kanji, second half of a 2-byte full-width kanji, half-width kanji, kanji-in code, or kanji-out code) and the number of constituent bytes of the character (1 for 1-byte data and for each of the first and second halves of a 2-byte full-width kanji; 2 for half-width kanji and kanji shift codes).
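The value returned by the kanji code determination unit in step 3 can be sketched as a small lookup. This is a minimal Python illustration, not the patent's implementation: the names `Attr`, `BYTE_COUNT`, and `judge` are hypothetical, and because the text does not say how a raw code is classified, the sketch starts from an attribute that has already been determined.

```python
from enum import Enum


class Attr(Enum):
    """Kanji attributes listed in step 3 (names are illustrative)."""
    ONE_BYTE = "1-byte data"
    FULL_FIRST = "2-byte full-width kanji, first half"
    FULL_SECOND = "2-byte full-width kanji, second half"
    HALF_KANJI = "half-width kanji"
    KANJI_IN = "kanji-in code"
    KANJI_OUT = "kanji-out code"


# Constituent byte counts as stated in step 3: 1 for 1-byte data and for
# each half of a full-width kanji, 2 for half-width kanji and shift codes.
BYTE_COUNT = {
    Attr.ONE_BYTE: 1,
    Attr.FULL_FIRST: 1,
    Attr.FULL_SECOND: 1,
    Attr.HALF_KANJI: 2,
    Attr.KANJI_IN: 2,
    Attr.KANJI_OUT: 2,
}


def judge(attr: Attr) -> tuple[Attr, int]:
    """Return the (attribute, constituent byte count) pair for one unit."""
    return attr, BYTE_COUNT[attr]
```

For example, `judge(Attr.HALF_KANJI)` yields the half-width kanji attribute together with a byte count of 2.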

[0011] 4. The character attribute table 2-2 is structured as shown in FIG. 3 and consists of a character attribute area 3-1 and a character length area 3-2. One byte of the character attribute table corresponds to one character, and the table length equals the length of the kanji character string. The character attribute area 3-1 stores the kanji attribute returned by the kanji code determination unit 1-4, and the character length area 3-2 stores the number of constituent bytes returned by the kanji code determination unit 1-4. When a kanji-in code is reported, a bit indicating the start of kanji-in data is set in the kanji attribute of the character that follows, and the byte count of the kanji shift code is added to that character's constituent byte count. When a kanji-out code is reported, a bit indicating the start of kanji-out data is set in the kanji attribute of the immediately preceding character, and the byte count of the kanji shift code is added to that character's constituent byte count.
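The table construction in step 4, including the way shift codes are folded into the neighbouring character, might look like the following Python sketch. It rests on several assumptions that are not in the text: the `Entry` record, the string attribute names, the `build_table` function, and the input format (a list of already classified `(attribute, byte_count)` pairs) are all hypothetical, and the shift code length of 2 bytes is taken from step 3.

```python
from dataclasses import dataclass

SHIFT_LEN = 2  # byte length of a kanji shift code, as given in step 3


@dataclass
class Entry:
    attr: str                # character attribute area 3-1
    length: int              # character length area 3-2 (bytes)
    kanji_in: bool = False   # bit: kanji-in data starts at this character
    kanji_out: bool = False  # bit: a kanji-out code follows this character


def build_table(units):
    """Build the table from (attribute, byte_count) pairs in scan order."""
    table = []
    pending_in = False
    for attr, nbytes in units:
        if attr == "kanji_in":
            # Remember the shift code; it is charged to the next character.
            pending_in = True
        elif attr == "kanji_out":
            # Charge the shift code to the immediately preceding character.
            if table:
                table[-1].kanji_out = True
                table[-1].length += SHIFT_LEN
        else:
            entry = Entry(attr, nbytes)
            if pending_in:
                entry.kanji_in = True
                entry.length += SHIFT_LEN
                pending_in = False
            table.append(entry)
    return table
```

Because every shift code is charged to an adjacent entry, the lengths in the table still add up to the byte length of the scanned data, which is what lets the later steps turn character positions into byte offsets by summation alone.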

[0012] 5. After the character attribute table 2-2 has been created, the extraction processing management unit 1-1 calls the character count/byte count conversion processing unit 1-6 to obtain the start byte position and extraction byte count of the actual string to be extracted.

[0013] 6. Using the character attribute table 2-2, and based on the extraction start position (the start is also expressed as a character count) and the number of characters to extract passed from the extraction processing management unit 1-1, the character count/byte count conversion processing unit 1-6 sums the character length area 3-2 of the table over the extraction start character count (a character count corresponds to a number of table bytes in the character attribute table 2-2) and returns the result as the extraction start byte count. Likewise, it sums the character length area 3-2 from the extraction start position onward over the number of characters to extract and returns the result as the extraction byte count.
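The two summations in step 6 can be illustrated with a short Python sketch. The function name `chars_to_bytes` and the plain list of lengths are hypothetical, and the sketch assumes a 0-based start position so that the start byte count is the sum of the lengths of all entries before it; the text does not state which indexing convention is used.

```python
def chars_to_bytes(lengths, start_char, char_count):
    """Convert a character range into (start byte count, extraction byte count).

    lengths    : character length area, one entry per character
    start_char : 0-based extraction start character position (assumed)
    char_count : number of characters to extract
    """
    # Sum the length area over the characters that precede the start.
    start_byte = sum(lengths[:start_char])
    # Sum the length area over the characters being extracted.
    byte_count = sum(lengths[start_char:start_char + char_count])
    return start_byte, byte_count
```

With a length area of `[1, 3, 3, 1]`, extracting two characters starting at position 1 gives a start byte count of 1 and an extraction byte count of 6.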

[0014] 7. When determining the extraction start byte count in step 6, the character count/byte count conversion processing unit 1-6 checks whether the character at the extraction start character position is the second half of a full-width character. It does so by checking whether the corresponding entry in the character attribute area 3-1 of the character attribute table 2-2 is "second half of a full-width kanji". If the extraction start character position falls on the second half of a full-width kanji, the position is corrected to one character earlier (the first half of that full-width kanji).
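The start position correction in step 7 amounts to a single table lookup. The sketch below is an illustration under assumptions: `adjust_start` and the attribute strings are hypothetical names, the position is taken to be 0-based, and only the start position is moved because the text does not say whether the number of characters to extract is adjusted as well.

```python
def adjust_start(attrs, start_char):
    """Move the start back by one if it falls on the second half of a
    full-width kanji, so that extraction begins on the first half.

    attrs      : character attribute area, one entry per character
    start_char : 0-based extraction start character position (assumed)
    """
    if start_char > 0 and attrs[start_char] == "full_second":
        return start_char - 1
    return start_char
```

Given the attribute area `["one_byte", "full_first", "full_second", "one_byte"]`, a requested start of 2 is corrected to 1, while starts of 0, 1, and 3 are returned unchanged.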

[0015] 8. When the character at the extraction start character position is a 2-byte character and the kanji-in shift code bit is not set in that character's attribute, the character count/byte count conversion processing unit 1-6 returns a status indicating that a kanji-in shift code must be set. Also, if during the processing of step 6 no kanji-out shift code is detected between the point where a 2-byte character is detected and the last extracted character, it returns a status indicating that a kanji-out shift code must be set. If the kanji-out shift code bit is set in the attribute of the last extracted character, the extraction byte count is increased by 2 bytes.
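The two status checks in step 8 can be sketched as follows. This is a hedged illustration only: `shift_status`, the tuple layout `(attribute, kanji_in_bit, kanji_out_bit)`, and the set `DOUBLE_BYTE` are hypothetical, and treating both halves of a full-width kanji and half-width kanji as "2-byte characters" is an assumption. The byte count adjustment mentioned at the end of step 8 is deliberately left out of the sketch.

```python
# Attributes treated as 2-byte characters (an assumption of this sketch).
DOUBLE_BYTE = {"full_first", "full_second", "half_kanji"}


def shift_status(table, start_char, char_count):
    """Return (need_kanji_in, need_kanji_out) for the extracted range.

    table : list of (attribute, kanji_in_bit, kanji_out_bit) tuples
    """
    span = table[start_char:start_char + char_count]
    if not span:
        return False, False

    # A kanji-in code is needed when the range starts on a 2-byte
    # character whose attribute does not carry the kanji-in bit.
    first_attr, first_in, _ = span[0]
    need_in = first_attr in DOUBLE_BYTE and not first_in

    # A kanji-out code is needed when a 2-byte character was seen and no
    # kanji-out bit was found from there through the last character.
    open_run = False
    for attr, _in_bit, out_bit in span:
        if attr in DOUBLE_BYTE:
            open_run = True
        if out_bit:
            open_run = False
    return need_in, open_run
```

A range that begins in the middle of a kanji run reports that a kanji-in code must be added, and a range that ends before the run's kanji-out code reports that a kanji-out code must be added; step 9 then uses these flags when assembling the extracted string.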

[0016] 9. The extraction processing management unit 1-1 extracts the kanji character string from the kanji data management unit 1-2 using the start byte position and extraction byte count of the actual string to be extracted. If a status was set in step 8, it adds the kanji shift codes that the status calls for.

[0017] The processing above converts character counts into byte counts. When extraction is performed multiple times, only step 5 and the subsequent steps need to be repeated, which speeds up processing.
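The reuse described above, where the table is built once and only the conversion is repeated, can be sketched in Python as follows. Everything here is illustrative: the `Extractor` class and `toy_scan` function are hypothetical, and `toy_scan` uses a made-up toy encoding (any byte of 0x80 or above starts a 2-byte character) rather than the kanji codes and shift codes of the embodiment.

```python
def toy_scan(data: bytes):
    """Toy stand-in for steps 1 to 4: return per-character byte lengths.

    Assumes a made-up encoding in which a byte >= 0x80 begins a
    2-byte character and any other byte is a 1-byte character.
    """
    lengths, i = [], 0
    while i < len(data):
        n = 2 if data[i] >= 0x80 else 1
        lengths.append(n)
        i += n
    return lengths


class Extractor:
    def __init__(self, data: bytes, scan=toy_scan):
        self.data = data
        # The scan runs once; later extractions only read the table.
        self.lengths = scan(data)

    def extract(self, start_char: int, char_count: int) -> bytes:
        """Steps 5 onward: convert characters to bytes, then slice."""
        start = sum(self.lengths[:start_char])
        size = sum(self.lengths[start_char:start_char + char_count])
        return self.data[start:start + size]
```

Creating one `Extractor` for a string and calling `extract` several times scans the data only once, so each additional extraction costs a summation over the length table instead of a fresh comparison of character codes from the start of the string.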

[0018]

[Effect of the Invention] As described above, the present invention replaces the compare-and-count processing of kanji codes with a lookup in the character attribute table, which speeds up repeated conversions from character counts to byte counts.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention.

FIG. 2 is a diagram showing the relationship between a kanji character string and the character attribute table in the embodiment.

FIG. 3 is a diagram showing a configuration example of the character attribute table in the embodiment.

[Explanation of Symbols]

1-1 Extraction processing management unit
1-2 Kanji data management unit
1-3 Character attribute management unit
1-4 Kanji code determination unit
1-5 Character attribute table creation unit
1-6 Character count/byte count conversion processing unit
2-1 Kanji character string
2-2 Character attribute table
3-1 Character attribute area
3-2 Character length area

Claims

[Claims]

[Claim 1] A kanji character string extraction method in a kanji data management system that extracts a specific character string from a kanji character string, characterized by comprising: an extraction processing management unit that receives, from a host device, a pointer indicating the target character string together with the extraction start character position and the number of characters to extract within that string, outputs a first instruction, a second instruction, and a third instruction, and extracts the kanji character string from the kanji character string data using a start byte position and an extraction byte count; a kanji data management unit that, on the first instruction, fetches the kanji character string data indicated by the pointer; a character attribute table creation unit that, on the second instruction, scans the kanji character string data from the left, cuts out the data one byte at a time, and creates in memory a character attribute table consisting of kanji code attributes (such as 1-byte data, first half of a 2-byte full-width kanji, second half of a 2-byte full-width kanji, half-width kanji, kanji-in code, or kanji-out code) and the number of constituent bytes of each cut-out character; and a character count/byte count conversion processing unit that, on the third instruction, refers to the character attribute table on the basis of the extraction start character position and the number of characters to extract and outputs the extraction start byte count and the extraction byte count.