JP6528927B2

JP6528927B2 - Document processing apparatus and program

Info

Publication number: JP6528927B2
Application number: JP2014167569A
Authority: JP
Inventors: 鶴慶銭
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2014-08-20
Filing date: 2014-08-20
Publication date: 2019-06-12
Anticipated expiration: 2034-08-20
Also published as: WO2016027476A1; JP2016045566A

Description

The present invention relates to a document processing apparatus and a program.

Patent Document 1 discloses a technique in which a first distribution of the gaps between adjacent character regions and a second distribution of the distances between the centroids of the character regions are each divided in two, and a first and a second degree of separation are obtained by discriminant analysis. The font used for the character string is determined by comparing the first degree of separation with the second, and a threshold for detecting inter-word spaces is set according to the determined font. When the gap between character regions or the distance between centroids is equal to or greater than the threshold, an inter-word space is detected between the two corresponding characters.

Patent Document 2 discloses a technique that decides whether to insert a space character based on results such as whether the characters are evenly justified, whether the character string is a heading, and a character-type determination such as Japanese or Western text.

Patent Document 3 discloses a technique that determines whether adjacent characters belong to the same word by using a determination based on English notation rules, a determination based on whether the original document data contains space characters, a determination based on whether the adjacent characters belong to the same character string object, and a determination based on the spacing between the character string objects that contain the adjacent characters.

Patent Document 1: JP 2013-097561 A
Patent Document 2: JP 2008-171400 A
Patent Document 3: JP 2012-008965 A

When inter-word positions are detected by discriminant analysis or a similar method and space character codes are inserted into document data of a space-delimited language that has no space character codes between words, over-insertion occurs: a space character code is inserted at a position where none should be.

An object of the present invention is to provide a document processing apparatus and a program that suppress over-insertion, in which a space character code is inserted at a position where none should be, compared with the case where inter-word positions are detected by discriminant analysis or a similar method and space character codes are inserted into document data of a space-delimited language that has no space character codes between words.

The present invention according to claim 1 is a document processing apparatus comprising:
receiving means for receiving document data;
acquisition means for acquiring a character string based on character codes included in the document data received by the receiving means;
first creation means for creating a character spacing list in which the character spacings, each being the distance between two adjacent characters in the character string acquired by the acquisition means, are arranged in order of size;
second creation means for creating a change amount list indicating, for each character spacing in the character spacing list, the amount of change relative to the preceding and following character spacings;
determining means for excluding, from the candidates for a first threshold, the character spacing in the character spacing list that corresponds to the maximum value in the change amount list when that character spacing is less than the average of the character spacings in the character spacing list or is equal to or less than the character spacing located at the center of the character spacing list, and for determining as the first threshold either the character spacing corresponding to the next largest value in the change amount list after the value of the excluded character spacing, or the character spacing corresponding to the maximum value in the change amount list, when that character spacing is equal to or greater than the average of the character spacings in the character spacing list or is greater than the character spacing located at the center of the character spacing list; and
insertion means for inserting a space character code between characters of the character string whose character spacing is equal to or greater than the first threshold.
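The threshold selection recited in claim 1 can be illustrated with a minimal Python sketch. It is only one possible reading of the claim, not the implementation described in the patent. The function names are hypothetical, the list is assumed to be sorted in ascending order, the spacing "corresponding to" a change amount is assumed to be the larger spacing of the pair, and the center of the list is taken with `statistics.median`. Because the exclusion and acceptance conditions in the claim overlap, the sketch applies the exclusion condition first.

```python
from statistics import mean, median


def first_threshold(gaps):
    """Return a spacing threshold, or None when every candidate is excluded."""
    if len(gaps) < 2:
        return None
    ordered = sorted(gaps)  # character spacing list (ascending order is an assumption)
    # Change amount list: difference between each spacing and the one before it.
    deltas = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    avg, mid = mean(ordered), median(ordered)
    # Visit candidates from the largest change amount downward.
    for i in sorted(range(len(deltas)), key=lambda k: deltas[k], reverse=True):
        candidate = ordered[i + 1]  # assumption: the larger spacing of the pair
        if candidate < avg or candidate <= mid:
            continue  # excluded from the threshold candidates
        return candidate
    return None


def insert_spaces(text, gaps, threshold):
    """Insert a space wherever the spacing before a character reaches the threshold."""
    if not text:
        return text
    out = [text[0]]
    for ch, gap in zip(text[1:], gaps):
        if gap >= threshold:
            out.append(" ")
        out.append(ch)
    return "".join(out)


# Hypothetical spacings for the ten characters of "thisisapen".
gaps = [1, 1, 1, 5, 1, 5, 5, 1, 1]
print(insert_spaces("thisisapen", gaps, first_threshold(gaps)))  # this is a pen
```

In this example the largest jump in the sorted list is from 1 to 5, and 5 lies above both the mean and the median, so it is accepted as the threshold. When all spacings are equal, every candidate is excluded and the sketch returns `None`.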

The present invention according to claim 2 is a document processing apparatus comprising:
receiving means for receiving document data;
acquisition means for acquiring a character string based on character codes included in the document data received by the receiving means;
first creation means for creating a character spacing list in which the character spacings, each being the distance between two adjacent characters in the character string acquired by the acquisition means, are arranged in order of size;
second creation means for creating a change amount list indicating, for each character spacing in the character spacing list, the amount of change relative to the preceding and following character spacings;
determining means for determining, as a first threshold, the character spacing in the character spacing list corresponding to the maximum value in the change amount list;
insertion means for inserting a space character code between characters of the character string whose character spacing is equal to or greater than the first threshold; and
judging means for judging that no space character code needs to be inserted into the character string when the standard deviation of the character spacings of the character string acquired by the acquisition means is equal to or less than a first predetermined value,
wherein the insertion means does not insert a space character code into a character string for which the judging means has judged that no space character code needs to be inserted.

The present invention according to claim 3 is the document processing apparatus according to claim 2, wherein, when the character string acquired by the acquisition means includes a character spacing whose deviation is equal to or less than a second predetermined value, the judging means sets the deviation of that character spacing to 0 and recalculates the standard deviation of the character string, and judges that no space character code needs to be inserted into the character string when the recalculated standard deviation is equal to or less than the first predetermined value.
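The standard deviation check of claims 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure. The function name and both limit parameters are hypothetical. The sketch uses the population standard deviation and treats "deviation" as the absolute difference from the mean; this part of the text fixes neither choice.

```python
from math import sqrt


def needs_space_insertion(gaps, std_limit, dev_limit=0.0):
    """Return False when the spacings are uniform enough that no space is needed."""
    if not gaps:
        return False
    avg = sum(gaps) / len(gaps)
    devs = [g - avg for g in gaps]
    std = sqrt(sum(d * d for d in devs) / len(devs))  # population standard deviation
    if std <= std_limit:  # claim 2: first predetermined value
        return False
    # Claim 3: treat small deviations as zero, then recompute the standard deviation.
    clipped = [0.0 if abs(d) <= dev_limit else d for d in devs]
    recalculated = sqrt(sum(d * d for d in clipped) / len(clipped))
    return recalculated > std_limit


print(needs_space_insertion([1, 1, 1, 1], std_limit=0.5))                 # False
print(needs_space_insertion([1, 1, 1, 5, 1, 5, 5, 1, 1], std_limit=0.5))  # True
```

Setting `dev_limit` above zero lets small fluctuations in spacing, for example those caused by kerning, be ignored before the string is judged.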

The present invention according to claim 4 is a document processing apparatus comprising:
receiving means for receiving document data;
acquisition means for acquiring a character string based on character codes included in the document data received by the receiving means;
first creation means for creating a character spacing list in which the character spacings, each being the distance between two adjacent characters in the character string acquired by the acquisition means, are arranged in order of size;
second creation means for creating a change amount list indicating, for each character spacing in the character spacing list, the amount of change relative to the preceding and following character spacings;
determining means for determining, as a first threshold, the character spacing in the character spacing list corresponding to the maximum value in the change amount list;
insertion means for inserting a space character code between characters of the character string whose character spacing is equal to or greater than the first threshold;
storage means for storing words of a space-delimited language, the character strings of the document data being written in that language; and
dividing means for dividing the character string acquired by the acquisition means at the space character codes inserted by the insertion means,
wherein, when a character string divided by the dividing means does not match any word stored in the storage means, the first creation means creates a character spacing list in which the character spacings of the divided character string are arranged in order of size,
the second creation means creates a change amount list indicating, for each character spacing in the character spacing list of the divided character string, the amount of change relative to the preceding and following character spacings,
the determining means determines, as a second threshold, the character spacing in the character spacing list corresponding to the maximum value in the change amount list of the divided character string, and
the insertion means inserts a space character code between characters of the divided character string whose character spacing is equal to or greater than the second threshold determined by the determining means.
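The two-pass procedure of claim 4 can be sketched in Python as follows. It is a rough illustration, not the apparatus itself. All names are hypothetical, the dictionary is a plain set standing in for the storage means, the sorted order is assumed to be ascending, and the threshold is assumed to be the larger spacing at the largest jump. The sketch also assumes that a string with no jump at all is left undivided.

```python
def largest_jump_threshold(gaps):
    """Spacing just above the largest jump in the sorted spacing list, or None."""
    ordered = sorted(gaps)
    deltas = [b - a for a, b in zip(ordered, ordered[1:])]
    if not deltas or max(deltas) == 0:
        return None  # assumption: no jump means nothing to split
    return ordered[deltas.index(max(deltas)) + 1]


def split_at(text, gaps, threshold):
    """Split text (and its spacings) wherever a spacing reaches the threshold."""
    pieces, start = [], 0
    for i, gap in enumerate(gaps):  # gaps[i] lies between text[i] and text[i + 1]
        if gap >= threshold:
            pieces.append((text[start:i + 1], gaps[start:i]))
            start = i + 1
    pieces.append((text[start:], gaps[start:]))
    return pieces


def segment(text, gaps, dictionary):
    """First pass with the first threshold, second pass on pieces not in the dictionary."""
    first = largest_jump_threshold(gaps)
    if first is None:
        return [text]
    words = []
    for piece, piece_gaps in split_at(text, gaps, first):
        second = None if piece in dictionary else largest_jump_threshold(piece_gaps)
        if second is None:
            words.append(piece)
        else:
            words.extend(p for p, _ in split_at(piece, piece_gaps, second))
    return words


# Hypothetical spacings in which one word gap (9) is much wider than the others (3).
print(segment("thisisapen", [1, 1, 1, 3, 1, 9, 3, 1, 1], {"this", "is", "a", "pen"}))
# ['this', 'is', 'a', 'pen']
```

The first pass splits only at the widest gap and yields "thisis" and "apen". Neither is a dictionary word, so each is split again with its own second threshold.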

The present invention according to claim 5 is a document processing apparatus comprising:
receiving means for receiving document data;
acquisition means for acquiring a character string based on character codes included in the document data received by the receiving means;
first creation means for creating a character spacing list in which the character spacings, each being the distance between two adjacent characters in the character string acquired by the acquisition means, are arranged in order of size;
second creation means for creating a change amount list indicating, for each character spacing in the character spacing list, the amount of change relative to the preceding and following character spacings;
determining means for determining, as a first threshold, the character spacing in the character spacing list corresponding to the maximum value in the change amount list; and
insertion means for inserting a space character code between characters of the character string whose character spacing is equal to or greater than the first threshold,
wherein, when the acquired character string already contains space character codes and the number of those space character codes is equal to or greater than a predetermined proportion of the number of character spacings in the acquired character string, the first creation means deletes all space character codes contained in the character string and then creates the character spacing list.
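The preprocessing step of claim 5 can be sketched as below. The `Glyph` type, the function name, and the default ratio are hypothetical; the claim says only that the proportion is predetermined. The sketch also assumes that the number of character spacings is counted on the string as acquired, space glyphs included.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    left: float   # x coordinate of the left edge
    right: float  # x coordinate of the right edge


def strip_existing_spaces(glyphs, ratio=0.3):
    """Drop all space glyphs when they make up at least `ratio` of the character intervals."""
    intervals = len(glyphs) - 1
    spaces = sum(1 for g in glyphs if g.char == " ")
    if spaces > 0 and intervals > 0 and spaces >= ratio * intervals:
        return [g for g in glyphs if g.char != " "]
    return list(glyphs)


# "t h i s": 3 spaces over 6 intervals reaches the ratio, so the spaces are removed.
spaced = [Glyph(c, i * 2, i * 2 + 1) for i, c in enumerate("t h i s")]
print("".join(g.char for g in strip_existing_spaces(spaced)))  # this
```

After the space glyphs are removed, the character spacing list would be rebuilt from the coordinates of the remaining glyphs, so the threshold is determined from the geometry alone.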

The present invention according to claim 6 is a document processing apparatus comprising:
first creation means for creating a character spacing list by sorting by size the character spacings, each being the distance between two adjacent characters in a received character string;
second creation means for creating a change amount list indicating, for each character spacing in the character spacing list, the amount of change relative to the preceding and following character spacings; and
insertion means for excluding, from the candidates for a first threshold, the character spacing in the character spacing list that corresponds to the maximum value in the change amount list when that character spacing is less than the average of the character spacings in the character spacing list or is equal to or less than the character spacing located at the center of the character spacing list, and for inserting a space character code between the characters having either the character spacing corresponding to the next largest value in the change amount list after the value of the excluded character spacing, or the character spacing corresponding to the maximum value in the change amount list, when that character spacing is equal to or greater than the average of the character spacings in the character spacing list or is greater than the character spacing located at the center of the character spacing list.

The present invention according to claim 7 is a program for causing a computer to execute:
a step of receiving document data;
a step of acquiring a character string based on character codes included in the received document data;
a step of creating a character spacing list in which the character spacings, each being the distance between two adjacent characters in the acquired character string, are arranged in order of size;
a step of creating a change amount list indicating, for each character spacing in the character spacing list, the amount of change relative to the preceding and following character spacings;
a step of excluding, from the candidates for a threshold, the character spacing in the character spacing list that corresponds to the maximum value in the change amount list when that character spacing is less than the average of the character spacings in the character spacing list or is equal to or less than the character spacing located at the center of the character spacing list, and of determining as the threshold either the character spacing corresponding to the next largest value in the change amount list after the value of the excluded character spacing, or the character spacing corresponding to the maximum value in the change amount list, when that character spacing is equal to or greater than the average of the character spacings in the character spacing list or is greater than the character spacing located at the center of the character spacing list; and
a step of inserting a space character code between characters of the character string whose character spacing is equal to or greater than the threshold.

According to the invention of claim 1 or claim 6, it is possible to provide a document processing apparatus that suppresses over-insertion, in which a space character code is inserted at a position where none should be, compared with the case where inter-word positions are detected by discriminant analysis or a similar method and space character codes are inserted into document data of a space-delimited language that has no space character codes between words. Furthermore, according to the invention of claim 1 or claim 6, when space character codes are inserted between the words of document data that has none, it is possible to provide a document processing apparatus that detects the insertion positions with higher accuracy than when they are detected by discriminant analysis or a similar method.

According to the invention of claim 2, it is possible to provide a document processing apparatus that can judge, based on the standard deviation of the character spacings of a character string, whether the character string requires insertion of space character codes. Furthermore, according to the invention of claim 2, it is possible to provide a document processing apparatus that suppresses over-insertion, in which a space character code is inserted at a position where none should be, compared with the case where inter-word positions are detected by discriminant analysis or a similar method and space character codes are inserted into document data of a space-delimited language that has no space character codes between words.

According to the invention of claim 3, it is possible to provide a document processing apparatus that can judge, based on the standard deviation of a character string, whether the character string requires insertion of space character codes.

According to the invention of claim 4, it is possible to provide a document processing apparatus that can insert space character codes at the positions where a character string is divided into words in document data written in a space-delimited language. Furthermore, according to the invention of claim 4, it is possible to provide a document processing apparatus that suppresses over-insertion, in which a space character code is inserted at a position where none should be, compared with the case where inter-word positions are detected by discriminant analysis or a similar method and space character codes are inserted into document data of a space-delimited language that has no space character codes between words.

According to the invention of claim 5, for a character string that already contains space character codes amounting to a predetermined proportion or more of its number of character spacings, it is possible to provide a document processing apparatus that detects space insertion positions with higher accuracy than when they are detected by discriminant analysis or a similar method. Furthermore, according to the invention of claim 5, it is possible to provide a document processing apparatus that suppresses over-insertion, in which a space character code is inserted at a position where none should be, compared with the case where inter-word positions are detected by discriminant analysis or a similar method and space character codes are inserted into document data of a space-delimited language that has no space character codes between words.

According to the invention of claim 7, it is possible to provide a program that suppresses over-insertion, in which a space character code is inserted at a position where none should be, compared with the case where inter-word positions are detected by discriminant analysis or a similar method and space character codes are inserted into document data of a space-delimited language that has no space character codes between words. Furthermore, according to the invention of claim 7, when space character codes are inserted between the words of document data that has none, it is possible to provide a program that detects the insertion positions with higher accuracy than when they are detected by discriminant analysis or a similar method.

A diagram showing the configuration of the document processing system in the first embodiment of the present invention.
A block diagram showing the hardware configuration of the document processing server 30 in the first embodiment of the present invention.
A block diagram showing the functional configuration of the document processing server 30 in the first embodiment of the present invention.
A diagram showing an example of document data and the character spacings of a character string in the first embodiment of the present invention.
A diagram showing an example of the method of calculating the standard deviation in the first embodiment of the present invention.
A diagram showing an example of the character spacing list, the first derivative list, and the insertion positions of space character codes in the first embodiment of the present invention.
A diagram showing an example of the character spacing list, the first derivative list, and the insertion positions of space character codes in the first embodiment of the present invention.
A flowchart showing the processing of the document processing server 30 in the first embodiment of the present invention.
A diagram showing the functional configuration of the document processing server 30a in the second embodiment of the present invention.
A diagram showing an example of document processing in the second embodiment of the present invention.
A flowchart showing the processing of the document processing server 30a in the second embodiment of the present invention.
A diagram showing the functional configuration of the document processing server 30b in the third embodiment of the present invention.
A diagram showing an example of document data and the discriminant analysis method in the third embodiment of the present invention.
A diagram showing an example of document processing in the third embodiment of the present invention.
A flowchart showing the processing of the document processing server 30b in the third embodiment of the present invention.

Next, embodiments of the present invention will be described in detail with reference to the drawings.

[First Embodiment]
FIG. 1 is a diagram showing the system configuration of a document processing system according to an embodiment of the present invention.

In the document processing system of the first embodiment of the present invention, as shown in FIG. 1, a terminal device 10, an image forming apparatus 20, and a document processing server (document processing apparatus) 30 are connected to one another via a network 40. The terminal device 10 generates document data and transmits it to the document processing server 30 via the network 40. The document processing server 30 receives the document data transmitted from the terminal device 10 and performs the processing described later on it. The image forming apparatus 20 is a so-called multifunction device having multiple functions such as printing, scanning, copying, and facsimile.

The document data in this embodiment is described as being created in a format that includes character codes, such as PDF (Portable Document Format), and written in English, which is a space-delimited language. Space-delimited writing means dividing a sentence into word or phrase units according to a fixed policy and leaving a gap at each division; in other words, a space is placed at each word boundary in the document.

The document data in this embodiment places characters according to character codes and coordinate information, and thereby expresses the gaps between characters without containing any space character codes.

Next, FIG. 2 shows the hardware configuration of the document processing server 30 in the document processing system of this embodiment.

As shown in FIG. 2, the document processing server 30 has a CPU 11, a memory 12, a storage device 13 such as a hard disk drive (HDD), a communication interface (IF) 14 that transmits and receives data via the network 40, and a user interface (UI) device 15 that includes a touch panel or liquid crystal display and a keyboard. These components are connected to one another via a control bus 16.

The CPU 11 executes predetermined processing based on a control program stored in the memory 12 or the storage device 13 and thereby controls the operation of the document processing server 30. Although in this embodiment the CPU 11 is described as reading and executing a control program stored in the memory 12 or the storage device 13, the program may instead be stored on a storage medium such as a CD-ROM and provided to the CPU 11.

FIG. 3 is a diagram showing the functional configuration of the document processing server 30, which is realized when the CPU 11 executes the control program stored in the memory 12 or the storage device 13.

As shown in FIG. 3, the document processing server 30 in this embodiment is composed of a document data receiving unit 301, a character string acquisition unit 302, a standard deviation calculation unit 303, a space insertion judgment unit 304, a character spacing list creation unit 305, a first derivative list creation unit 306, a threshold determination unit 307, and a space insertion unit 308.

The document data receiving unit 301 receives document data transmitted from the terminal device 10 or the image forming apparatus 20 via the network 40.

For example, as shown in FIG. 4(A), it receives document data containing character strings 101 to 103.

The character string acquisition unit 302 acquires character strings based on the character codes included in the document data received by the document data receiving unit 301. In doing so, the character string acquisition unit 302 acquires one character string for each line of the document data.

For example, as shown in FIG. 4A, the character string acquisition unit 302 acquires the character string 102 by extracting the character code and coordinate information of each character in "thisisapen", the character string 102 included in the document data. Note that the acquired string contains no blank character codes between the words.

The character string acquisition unit 302 also acquires the character spacing, that is, the distance between two adjacent characters in the acquired character string. Specifically, as shown in FIG. 4B, the circumscribed rectangle of each character is obtained from the coordinate information extracted by the character string acquisition unit 302, and, for each pair of adjacent characters, the distance between the x coordinate of the right edge of the left character's circumscribed rectangle and the x coordinate of the left edge of the right character's circumscribed rectangle is taken as the character spacing of that pair. For example, in the character string 102, as shown in FIG. 4B, the character string acquisition unit 302 detects the distance between the right edge of the circumscribed rectangle of "t" and the left edge of the circumscribed rectangle of "h" as the character spacing between "t" and "h". The character string acquisition unit 302 treats the remaining characters of the character string 102 in the same way and, as shown in FIG. 4C, detects the character spacing between every pair of adjacent characters in the character string 102.
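The spacing measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the character string acquisition unit 302: the `CharBox` record, the function name `character_gaps`, and the coordinates are all hypothetical, and the sketch assumes the boxes belong to one line and are already ordered left to right.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    """Hypothetical record: a character plus the x-extent of its circumscribed rectangle."""
    char: str
    left: float
    right: float


def character_gaps(boxes: List[CharBox]) -> List[float]:
    """Spacing of each adjacent pair: left edge of the right-hand box
    minus right edge of the left-hand box."""
    return [nxt.left - cur.right for cur, nxt in zip(boxes, boxes[1:])]


# Illustrative coordinates only; they are not taken from FIG. 4.
boxes = [
    CharBox("t", 0, 5),
    CharBox("h", 7, 13),
    CharBox("i", 16, 18),
    CharBox("s", 21, 26),
]
print(character_gaps(boxes))  # [2, 3, 3]
```

A string of n characters yields n - 1 spacings, so a single character or an empty line yields an empty list.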

標準偏差算出部３０３は、文字列取得部３０２により取得された文字列の文字間隔の標準偏差を算出する。 The standard deviation calculation unit 303 calculates the standard deviation of the character spacing of the character string acquired by the character string acquisition unit 302.

Further, when the character spacings of the character string acquired by the character string acquisition unit 302 include a character spacing whose deviation is equal to or less than a set value B (a second predetermined value), the standard deviation calculation unit 303 recalculates the standard deviation of the character string with the deviation of that character spacing treated as 0.

For example, as shown in FIG. 5A, when the character spacings between adjacent characters of "Example" in the character string 101 are "0, 0, 6, 6, 6, 6 (pixels)", the average character spacing of the character string 101 is "4 (pixels)".

Here, the character spacing between "E" and "x" and the character spacing between "x" and "a" are both "0 (pixels)", and the average for the character string 101 is "4 (pixels)", so the deviation of each of these two character spacings is "-4". Because this deviation is equal to or less than the set value B, which is "-2", the standard deviation calculation unit 303 treats the deviations of the character spacing between "E" and "x" and of the character spacing between "x" and "a" as 0. Specifically, it changes these two character spacings to "4 (pixels)", converting the character spacings of the character string 101 to "4, 4, 6, 6, 6, 6".

The standard deviation calculation unit 303 then calculates the standard deviation based on the deviations of the converted character spacings of the character string 101. Specifically, as shown in FIG. 5B, the standard deviation calculation unit 303 obtains each deviation by subtracting the average of the character spacings before conversion from the corresponding character spacing after conversion, divides the sum of the squared deviations by the number of character spacings in the character string 101, and takes the square root of the result. This yields a standard deviation from which the influence of the spacings between "E" and "x" and between "x" and "a", which are unlikely to be word boundaries in a space-delimited language, has been removed.
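The calculation for the character string 101 can be checked with a short Python sketch. The function name `adjusted_std` and its parameter names are assumptions made for illustration; the default of -2 for set value B is the value used in the example above. The sketch folds the "calculate, then recalculate" steps into a single pass and assumes at least one character spacing is supplied.

```python
import math
from typing import List


def adjusted_std(gaps: List[float], set_value_b: float = -2.0) -> float:
    """Population standard deviation of the spacings, with every deviation
    at or below set_value_b treated as zero."""
    mean = sum(gaps) / len(gaps)  # average of the original spacings
    deviations = [g - mean for g in gaps]
    # Tightly packed pairs (large negative deviation) are neutralised.
    clipped = [0.0 if d <= set_value_b else d for d in deviations]
    return math.sqrt(sum(d * d for d in clipped) / len(gaps))


gaps_101 = [0, 0, 6, 6, 6, 6]          # spacings of "Example" from FIG. 5A
print(round(adjusted_std(gaps_101), 2))  # 1.63
```

Here the deviations are -4, -4, 2, 2, 2, 2; the two -4 values become 0, the sum of squares is 16, and the square root of 16/6 is about 1.63.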

The blank insertion determination unit 304 determines, based on the standard deviation calculated by the standard deviation calculation unit 303, whether a blank character code needs to be inserted into the acquired character string.

Specifically, when the standard deviation calculated by the standard deviation calculation unit 303 is larger than a set value A (a first predetermined value), the blank insertion determination unit 304 determines that a blank character code needs to be inserted into the character string; when the standard deviation is equal to or less than the set value A, it determines that no blank character code needs to be inserted. For example, as shown in FIG. 5B, the standard deviation of the character string 101 is 1.63, which does not exceed the set value A of 2, so the blank insertion determination unit 304 determines that no blank character code needs to be inserted into the character string 101.

When the blank insertion determination unit 304 determines that a blank character code needs to be inserted into the character string acquired by the character string acquisition unit 302, the character spacing list creation unit 305 creates a character spacing list in which the character spacings of that character string, each being the distance between two adjacent characters, are arranged in order of size. In the present embodiment, the character spacing list creation unit 305 arranges the character spacings in ascending order.

一次微分リスト作成部３０６は、文字間隔リスト作成部３０５により作成された文字間隔リストにおける各文字間隔を一次微分することにより、一次微分リスト（変化量リスト）を作成する。ここで、一次微分リストとは、文字間隔リストにおける各文字間隔の前後の文字間隔に対する変化量を示すリストである。 The first derivative list creation unit 306 creates a first derivative list (variation amount list) by first differentiating each character interval in the character interval list generated by the character interval list generation unit 305. Here, the primary differential list is a list showing the amount of change with respect to the character spacing before and after each character spacing in the character spacing list.

When the character spacing in the character spacing list that corresponds to the maximum value in the first derivative list is equal to or greater than the average of the character spacings in the character spacing list, the threshold determination unit 307 determines that character spacing as the threshold.

The blank insertion unit 308 inserts a blank character code into the acquired character string between every pair of characters whose character spacing is equal to or greater than the threshold determined by the threshold determination unit 307. The blank insertion unit 308 does not perform this insertion for a character string that the blank insertion determination unit 304 has determined does not need a blank character code. Further, when it has inserted a blank character code into a character string, the blank insertion unit 308 transmits the document data containing the modified character string to the terminal device 10 or the image forming apparatus 20 via the network 40.

Hereinafter, specific examples of the processing in the character spacing list creation unit 305, the first derivative list creation unit 306, the threshold determination unit 307, and the blank insertion unit 308 will be described in detail with reference to FIGS. 6 and 7.

まず、文字列１０２に対する処理について、図６を参照して詳細に説明する。 First, the process for the character string 102 will be described in detail with reference to FIG.

First, as shown in FIG. 6A, the character spacing list creation unit 305 sorts the character spacings of the character string 102, "2, 3, 3, 7, 3, 6, 7, 4, 3 (pixels)", in ascending order from the left to create the character spacing list "2, 3, 3, 3, 3, 4, 6, 7, 7 (pixels)".

Then, as shown in FIG. 6B, the first derivative list creation unit 306 creates a first derivative list by taking the first-order difference of the character spacings in the character spacing list of the character string 102. Specifically, for the character spacing list "2, 3, 3, 3, 3, 4, 6, 7, 7 (pixels)", the first derivative list creation unit 306 calculates the difference between each character spacing and its neighbor and uses these differences as the amounts of change, which gives the first derivative list "1, 0, 0, 0, 1, 2, 1, 0". As shown in FIG. 6B, each value in the first derivative list is associated with the right-hand (larger) of the two character spacings from which it was calculated.
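The two lists for the character string 102 can be reproduced with the following Python sketch. The function name is hypothetical; the only convention taken from the text is that the i-th difference belongs to the right-hand entry of the pair, i.e. to position i + 1 of the sorted list.

```python
from typing import List, Tuple


def spacing_and_difference_lists(gaps: List[float]) -> Tuple[List[float], List[float]]:
    """Return the ascending character spacing list and its first-order differences."""
    spacing_list = sorted(gaps)
    # diff_list[i] is associated with spacing_list[i + 1] (the larger neighbour).
    diff_list = [b - a for a, b in zip(spacing_list, spacing_list[1:])]
    return spacing_list, diff_list


spacing_list, diff_list = spacing_and_difference_lists([2, 3, 3, 7, 3, 6, 7, 4, 3])
print(spacing_list)  # [2, 3, 3, 3, 3, 4, 6, 7, 7]
print(diff_list)     # [1, 0, 0, 0, 1, 2, 1, 0]
```

The largest difference, 2, sits at index 5 of `diff_list` and therefore points at `spacing_list[6]`, which is 6.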

Next, as shown in FIG. 6B, the threshold determination unit 307 refers to the first derivative list of the character string 102 and detects "2" as its maximum value. The threshold determination unit 307 then refers to the character spacing list of the character string 102 and detects "6 (pixels)" as the character spacing corresponding to the "2" in the first derivative list. Because the average of the character spacing list of the character string 102 is about "4.22 (pixels)" and the detected character spacing "6 (pixels)" is not less than this average, the threshold determination unit 307 determines "6 (pixels)" as the threshold.

Then, as shown in FIG. 6C, the blank insertion unit 308 inserts a blank character code into the character string 102 between every pair of characters whose character spacing is equal to or greater than "6 (pixels)", the threshold determined by the threshold determination unit 307. Specifically, blank character codes are inserted between the "s" of "this" and the "i" of "is", between the "s" of "is" and "a", and between "a" and the "p" of "pen".

次に、文字列１０３に対する処理について、図７を参照して詳細に説明する。 Next, the process for the character string 103 will be described in detail with reference to FIG.

First, as shown in FIG. 7A, the character spacing list creation unit 305 sorts the character spacings of the character string 103, "0, 0, 0, 7, 3, 6, 7, 4, 3 (pixels)", in ascending order to create the character spacing list "0, 0, 0, 3, 3, 4, 6, 7, 7 (pixels)".

Then, as shown in FIG. 7A, the first derivative list creation unit 306 creates the first derivative list "0, 0, 3, 0, 1, 2, 1, 0" from the character spacing list "0, 0, 0, 3, 3, 4, 6, 7, 7 (pixels)" of the character string 103.

Next, the threshold determination unit 307 refers to the first derivative list of the character string 103 and detects "3" as its maximum value. The threshold determination unit 307 then refers to the character spacing list of the character string 103 and detects "3 (pixels)" as the character spacing corresponding to the "3" in the first derivative list. Here, as shown in FIG. 7A, the average of the character spacing list of the character string 103 is about "3.33 (pixels)", and the detected "3 (pixels)" is less than this average, so the threshold determination unit 307 excludes the "3" in the first derivative list and the detected character spacing "3 (pixels)" from the threshold candidates.

Next, as shown in FIG. 7B, the threshold determination unit 307 detects "2", the next largest value after "3" in the first derivative list, and detects "6 (pixels)" as the character spacing in the character spacing list that corresponds to it. Because the detected character spacing "6 (pixels)" is equal to or greater than the average "3.33 (pixels)" of the character spacing list of the character string 103, the threshold determination unit 307 determines "6 (pixels)" as the threshold.

Then, as shown in FIG. 7C, the blank insertion unit 308 inserts a blank character code into the character string 103 between every pair of characters whose character spacing is equal to or greater than "6 (pixels)", the threshold determined by the threshold determination unit 307. Specifically, blank character codes are inserted between the "s" of "this" and the "i" of "is", between the "s" of "is" and "a", and between "a" and the "p" of "pen".
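The threshold search with its fallback to the next largest difference, followed by the insertion step, can be sketched as below. The function names are hypothetical. Two points are assumptions not stated in the text: ties between equal differences are resolved in favour of the smaller spacing, and a difference of zero is never accepted as a candidate. The sketch also assumes a non-empty string with one spacing per adjacent pair.

```python
from typing import List, Optional


def choose_threshold(gaps: List[float]) -> Optional[float]:
    """Pick the spacing that follows the largest jump in the sorted list,
    skipping candidates that fall below the mean spacing."""
    spacing_list = sorted(gaps)
    if len(spacing_list) < 2:
        return None
    mean = sum(spacing_list) / len(spacing_list)
    # Each difference is paired with the right-hand (larger) spacing.
    candidates = [(b - a, b) for a, b in zip(spacing_list, spacing_list[1:])]
    # Largest difference first; sorted() is stable, so ties keep list order.
    for diff, spacing in sorted(candidates, key=lambda c: c[0], reverse=True):
        if diff > 0 and spacing >= mean:  # assumption: zero jumps never qualify
            return spacing
    return None


def insert_spaces(text: str, gaps: List[float], threshold: float) -> str:
    """Insert a space before every character whose preceding gap reaches the threshold."""
    out = [text[0]]
    for ch, gap in zip(text[1:], gaps):
        if gap >= threshold:
            out.append(" ")
        out.append(ch)
    return "".join(out)


gaps_102 = [2, 3, 3, 7, 3, 6, 7, 4, 3]
gaps_103 = [0, 0, 0, 7, 3, 6, 7, 4, 3]
for gaps in (gaps_102, gaps_103):
    t = choose_threshold(gaps)
    print(t, insert_spaces("thisisapen", gaps, t))  # 6 this is a pen
```

For the character string 103 the first candidate (difference 3, spacing 3) is rejected because 3 is below the mean of about 3.33, and the search moves on to the difference 2 with spacing 6, matching the walkthrough for FIG. 7.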

In the present embodiment, the threshold determination unit 307 is described as determining the threshold when the character spacing in the character spacing list that corresponds to the maximum value in the first derivative list is equal to or greater than the average of the character spacings in the list. Alternatively, the threshold determination unit 307 may determine that character spacing as the threshold when it is larger than the character spacing located at the center of the character spacing list.

次に、第１の実施形態における文書処理サーバ３０の処理を、図８のフローチャートを参照して説明する。 Next, processing of the document processing server 30 in the first embodiment will be described with reference to the flowchart of FIG.

まず、文書データ受付部３０１は、端末装置１０または画像形成装置２０からネットワーク４０を介して文書データを受け付ける（ステップＳ１０１）。 First, the document data accepting unit 301 accepts document data from the terminal device 10 or the image forming apparatus 20 via the network 40 (step S101).

次に、文字列取得部３０２は、受け付けた文書データの文字コード及び文字の座標情報に基づいて文字列を取得する（ステップＳ１０２）。この時、文字列取得部３０２は、取得された文字列の文字間隔を取得する。 Next, the character string acquisition unit 302 acquires a character string based on the character code of the received document data and the coordinate information of the character (step S102). At this time, the character string acquisition unit 302 acquires the character spacing of the acquired character string.

また、標準偏差算出部３０３は、取得された文字列における文字間隔の標準偏差を算出する（ステップＳ１０３）。 Also, the standard deviation calculation unit 303 calculates the standard deviation of the character spacing in the acquired character string (step S103).

そして、空白挿入判定部３０４は、算出された標準偏差が設定値Ａ以下であるか否かを判定する（ステップＳ１０４）。算出された標準偏差が設定値Ａ以下である場合には（ステップＳ１０４においてｙｅｓ）、処理を終了する。 Then, the blank insertion determination unit 304 determines whether the calculated standard deviation is less than or equal to the set value A (step S104). If the calculated standard deviation is less than or equal to the set value A (yes in step S104), the process ends.

また、算出された標準偏差が設定値Ａより大きい場合には（ステップＳ１０４においてｎｏ）、文字間隔リスト作成部３０５は、取得された文字列の文字間隔を小さい順に並べた文字間隔リストを作成する（ステップＳ１０５）。 If the calculated standard deviation is larger than the set value A (No in step S104), the character interval list creation unit 305 creates a character interval list in which the character intervals of the acquired character string are arranged in ascending order. (Step S105).

次に、一次微分リスト作成部３０６は、作成された文字間隔リストの各文字間隔を一次微分することにより一次微分リストを作成する（ステップＳ１０６）。 Next, the first derivative list creating unit 306 creates a first derivative list by first differentiating each character interval of the created character interval list (step S106).

Then, when the character spacing in the character spacing list that corresponds to the maximum value in the first derivative list is equal to or greater than the average of the character spacings in the character spacing list, the threshold determination unit 307 determines that character spacing as the threshold (step S107).

Then, the blank insertion unit 308 inserts a blank character code between every two characters whose character spacing is equal to or greater than the determined threshold (step S108).

[Second Embodiment]
Next, a second embodiment of the present invention will be described in detail with reference to the drawings.

In the second embodiment, after the processing of the first embodiment has been performed, it is determined whether each character string into which the blank insertion unit 308 has inserted a blank character code is a word of the language in which the document data is written, and, when the character string is not a word, the processing of inserting a blank character code is performed again. Components that are the same as in the first embodiment are given the same reference numerals, and their description is omitted.

第２の実施形態における文書処理システムは、図１に示される第１の実施形態の文書処理システムにおいて、文書処理サーバ３０が文書処理サーバ３０ａに置き換えられている。なお、文書処理サーバ３０ａのハードウェア構成は、第１の実施形態と同じ構成であるため説明を省略する。 The document processing system according to the second embodiment is the document processing system according to the first embodiment shown in FIG. 1, in which the document processing server 30 is replaced with a document processing server 30a. The hardware configuration of the document processing server 30a is the same as that of the first embodiment, and thus the description thereof is omitted.

次に、文書処理サーバ３０ａの機能構成について、図９を参照して詳細に説明する。 Next, the functional configuration of the document processing server 30a will be described in detail with reference to FIG.

As shown in FIG. 9, the document processing server 30a according to the second embodiment adds a character string division unit 309, a storage unit 310, and a word determination unit 311 to the document processing server 30 of the first embodiment.

After blank character codes have been inserted into the character string acquired by the character string acquisition unit 302 by the same processing as in the first embodiment, the character string division unit 309 divides the character string at the blank character codes inserted by the blank insertion unit 308.

The storage unit 310 stores words of the space-delimited language in which the document data is written. In the present embodiment, the storage unit 310 stores English words in advance by registering the data of an English word dictionary.

The word determination unit 311 determines whether each character string produced by the character string division unit 309 matches a word stored in the storage unit 310. In the present embodiment, based on the character codes and coordinate information acquired by the character string acquisition unit 302, the word determination unit 311 compares the order of the character codes with the order of the characters in the words stored in the storage unit 310 to determine whether they match. When the word determination unit 311 determines that the divided character strings match words stored in the storage unit 310, it transmits the document data in which the blank character codes have been inserted to the terminal device 10 or the image forming apparatus 20 via the network 40.

また、文字間隔リスト作成部３０５は、文字列分割部３０９によって分割された文字列が、記憶部３１０に記憶された単語と一致しないと単語判定部３１１により判定された場合に、当該分割された文字列の文字間隔を、大きさの順に並べた文字間隔リストを作成する。 In addition, the character interval list creation unit 305 is divided when the word determination unit 311 determines that the character string divided by the character string division unit 309 does not match the word stored in the storage unit 310. Create a character spacing list that arranges the character spacing of strings in order of size.

一次微分リスト作成部３０６は、文字列分割部３０９により分割された文字列の文字間隔リストにおける各文字間隔を一次微分することにより一次微分リストを作成する。 The first derivative list creating unit 306 creates a first derivative list by first differentiating each character interval in the character interval list of the character strings divided by the character string dividing unit 309.

閾値決定部３０７は、文字列分割部３０９により分割された文字列の一次微分リストにおける最大値に対応する文字間隔リストの文字間隔を第２の閾値として決定する。 The threshold determination unit 307 determines the character spacing of the character spacing list corresponding to the maximum value in the first derivative list of the character strings divided by the character string division unit 309 as the second threshold.

空白挿入部３０８は、文字列分割部３０９により分割された文字列に対して、閾値決定部３０７により決定された第２の閾値以上の文字間隔の文字間に空白の文字コードを挿入する The blank insertion unit 308 inserts a blank character code between characters of a character interval equal to or larger than the second threshold determined by the threshold determination unit 307 into the character string divided by the character string division unit 309.

For example, when the character string acquisition unit 302 acquires the character string 104 "thisisapen" from document data such as that shown in FIG. 10A, the document processing server 30 first performs the same processing as in the first embodiment.

Specifically, as shown in FIG. 10B, the character spacing list creation unit 305 arranges the character spacings of the character string 104 in ascending order to create the character spacing list "2, 2, 3, 3, 3, 4, 8, 8, 18 (pixels)".

そして、一次微分リスト作成部３０６は、図１０（Ｂ）に示されるように、文字間隔リスト「２、２、３、３、３、４、８、８、１８（ピクセル）」の一次微分リスト「０、１、０、０、１、４、０、１０」を作成する。 Then, as shown in FIG. 10B, the first derivative list creating unit 306 generates a first derivative list of character interval lists “2, 2, 3, 3, 3, 4, 8, 8, 18 (pixels)”. Create "0, 1, 0, 0, 1, 4, 0, 10".

そして、閾値決定部３０７は、一次微分リストにおける最大値「１０」を検出し、これに対応する文字間隔「１８（ピクセル）」を検出する。そして、文字間隔リストの平均値は「約５．６７（ピクセル）」であるため、文字間隔「１８（ピクセル）」を第１の閾値として決定する。 Then, the threshold value determination unit 307 detects the maximum value “10” in the first derivative list, and detects the character spacing “18 (pixels)” corresponding to this. Then, since the average value of the character spacing list is “about 5.67 (pixels)”, the character spacing “18 (pixels)” is determined as the first threshold.

Next, as shown in FIG. 10B, the blank insertion unit 308 inserts a blank character code into the character string 104 between the "s" of "is" and "a", the only pair whose character spacing is equal to or greater than "18 (pixels)", the character spacing determined as the first threshold.

Next, as shown in FIG. 10C, the character string division unit 309 divides the character string 104, into which the blank character code has been inserted, into "thisis" as a character string 104a and "apen" as a character string 104b. The word determination unit 311 then refers to the English word dictionary data stored in the storage unit 310 and determines whether the words "thisis" and "apen" exist.

Because neither "thisis" nor "apen" exists in the English word dictionary, the character spacing list creation unit 305 creates a character spacing list for each of the character string 104a and the character string 104b, as shown in FIG. 10C.

次に、一次微分リスト作成部３０６は、図１０（Ｃ）に示されるように、文字列１０４ａ及び文字列１０４ｂそれぞれの文字間隔リストにおける各文字間隔を一次微分することにより一次微分リストを作成する。 Next, as shown in FIG. 10C, the first derivative list creating unit 306 creates a first derivative list by first differentiating each character interval in the character interval list of each of the character string 104a and the character string 104b. .

そして、閾値決定部３０７は、文字列１０４ａ及び文字列１０４ｂそれぞれの一次微分リストにおける最大値に対応する文字間隔リストの文字間隔をそれぞれの文字列の第２の閾値として決定する。 Then, the threshold determination unit 307 determines the character spacing of the character spacing list corresponding to the maximum value in each of the first derivative lists of the character string 104 a and the character string 104 b as the second threshold of each character string.

例えば、図１０（Ｃ）に示されるように、閾値決定部３０７は、文字列１０４ａの一次微分リストにおける最大値「５」を検出し、これに対応する文字間隔リストの文字間隔「８」を文字列１０４ａの第２の閾値として決定する。また、閾値決定部３０７は、図１０（Ｃ）に示されるように、文字列１０４ｂの一次微分リストにおける最大値「４」を検出し、これに対応する文字間隔リストの文字間隔「８」を文字列１０４ｂの第２の閾値として決定する。 For example, as shown in FIG. 10C, the threshold value determination unit 307 detects the maximum value "5" in the first derivative list of the character string 104a, and sets the character interval "8" of the corresponding character interval list. It is determined as the second threshold of the character string 104a. Also, as shown in FIG. 10C, the threshold value determination unit 307 detects the maximum value "4" in the first derivative list of the character string 104b, and sets the character interval "8" of the corresponding character interval list. It is determined as the second threshold of the character string 104b.

そして、空白挿入部３０８は、文字列１０４ａ及び文字列１０４ｂに対して、閾値決定部３０７により決定された第２の閾値以上の文字間隔の文字間に空白の文字コードを挿入する。例えば、図１０（Ｃ）に示されるように、空白挿入部３０８は、文字列１０４ａにおいては、「ｔｈｉｓ」の「ｓ」と「ｉｓ」の「ｉ」との間に空白の文字コードを挿入する。また、空白挿入部３０８は、図１０（Ｃ）に示されるように、文字列１０４ｂにおいては、「ａ」と「ｐ」との間に空白の文字コードを挿入する。 Then, the blank insertion unit 308 inserts a blank character code between the characters of the character spacing equal to or larger than the second threshold determined by the threshold determination unit 307 in the character string 104 a and the character string 104 b. For example, as shown in FIG. 10C, in the character string 104a, the space insertion unit 308 inserts a space character code between “s” of “this” and “i” of “is”. Do. Further, as shown in FIG. 10C, the blank insertion unit 308 inserts a blank character code between “a” and “p” in the character string 104 b.

次に、第２の実施形態における文書処理サーバ３０ａの処理を図１１のフローチャートを参照して説明する。なお、第１の実施形態と同じ処理については説明を省略する。 Next, the processing of the document processing server 30a in the second embodiment will be described with reference to the flowchart of FIG. The description of the same processing as that of the first embodiment is omitted.

First, the character string division unit 309 divides the character string, into which the blank insertion unit 308 has inserted blank character codes between characters whose character spacing is equal to or greater than the first threshold, at each inserted blank character code (step S201).

次に、単語判定部３１１は、文字列分割部３０９により分割された文字列が、記憶部３１０に記憶された単語と一致するか否かを判定する（ステップＳ２０２）。文字列分割部３０９により分割された文字列が、記憶部３１０に記憶された単語と一致する場合（ステップＳ２０２においてｎｏ）、処理を終了する。 Next, the word determination unit 311 determines whether the character string divided by the character string division unit 309 matches the word stored in the storage unit 310 (step S202). If the character string divided by the character string division unit 309 matches the word stored in the storage unit 310 (No in step S202), the process ends.

If it is determined that the divided character string does not match any word stored in the storage unit 310 (yes in step S202), the character interval list creation unit 305 creates a character interval list in which the character intervals of the divided character string are arranged in ascending order (step S203).

The first derivative list creation unit 306 then creates a first derivative list by taking the first derivative of each character interval in the character interval list created in step S203 (step S204).

Next, the threshold determination unit 307 determines, as the second threshold, the character interval in the character interval list that corresponds to the maximum value in the first derivative list created in step S204 (step S205).

The blank insertion unit 308 then inserts a blank character code into the divided character string between characters whose character interval is equal to or greater than the threshold determined in step S205 (step S206). The processing then returns to step S201, and steps S201 to S206 are repeated.
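The loop of steps S201 to S206 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names (`jump_threshold`, `split_at`, `segment`), the representation of a string as characters plus a parallel list of gaps, the `depth` limit, and the sample gap values are all assumptions made for the example. The sample gaps were chosen so that the two sub-strings reproduce the maxima "5" and "4" and the second threshold "8" quoted for FIG. 10(C).

```python
from typing import List, Set, Tuple


def jump_threshold(gaps: List[int]) -> int:
    """Return the gap that follows the largest jump in the sorted gap list."""
    ordered = sorted(gaps)
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    # Ties resolve to the first (smallest) position; this is an assumption.
    k = max(range(len(diffs)), key=diffs.__getitem__)
    return ordered[k + 1]


def split_at(chars: str, gaps: List[int], threshold: int) -> List[Tuple[str, List[int]]]:
    """Cut chars wherever gaps[i] (between chars[i] and chars[i+1]) >= threshold."""
    pieces, start = [], 0
    for i, g in enumerate(gaps):
        if g >= threshold:
            pieces.append((chars[start:i + 1], gaps[start:i]))
            start = i + 1
    pieces.append((chars[start:], gaps[start:]))
    return pieces


def segment(chars: str, gaps: List[int], threshold: int,
            words: Set[str], depth: int = 2) -> List[str]:
    """Split at the threshold (S201), check the dictionary (S202), and
    re-split unknown pieces with their own threshold (S203 to S206)."""
    out: List[str] = []
    for piece, inner in split_at(chars, gaps, threshold):
        known = piece.lower() in words
        # Stop on a known word, when passes are used up, or when gaps are flat.
        if known or depth <= 1 or len(set(inner)) < 2:
            out.append(piece)
        else:
            out.extend(segment(piece, inner, jump_threshold(inner), words, depth - 1))
    return out


# Hypothetical data: "thisis" and "apen" are separated by a gap of 14.
dictionary = {"this", "is", "a", "pen"}
gaps = [2, 3, 3, 8, 3, 14, 8, 4, 3]
print(segment("thisisapen", gaps, jump_threshold(gaps), dictionary))
# prints ['this', 'is', 'a', 'pen']
```

With `depth=2` the sketch performs the two insertion passes described for the second embodiment; a larger `depth` approximates a recursive repetition of the same procedure.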

The second embodiment has been described as performing the blank character code insertion twice. Alternatively, each time the blank insertion unit 308 inserts a blank character code into a character string, it may be determined whether each resulting character string is a word of the language in which the document data is written, and, if it is not a word, the insertion of blank character codes may be repeated recursively by the same method as in the second embodiment.

The second embodiment has also been described on the assumption that the words stored in the storage unit 310 are English words. If the document data is written in another language that separates words with spaces, words of that language may be stored instead. If the document data is written in a plurality of languages, the storage unit 310 may store words of each of those languages.

[Third Embodiment]
Next, a third embodiment of the present invention will be described in detail with reference to the drawings.

The third embodiment addresses the case where blank character codes are already present in the character string before the blank character code insertion of the first embodiment is performed, and determines whether those blank character codes have been over-inserted. In the third embodiment as well, description of configurations identical to those of the first embodiment is omitted.

The document processing system of the third embodiment is the document processing system of the first embodiment shown in FIG. 1 with the document processing server 30 replaced by a document processing server 30b. The hardware configuration of the document processing server 30b is the same as in the first embodiment, so its description is omitted.

Next, the functional configuration of the document processing server 30b in the third embodiment will be described in detail with reference to FIG. 12. Components identical to those of the first embodiment are given the same reference numerals, and their description is omitted.

As shown in FIG. 12, the document processing server 30b of the third embodiment adds a discriminant analysis unit 312 and an over-insertion determination unit 313 to the document processing server 30 of the first embodiment.

When the blank insertion determination unit 304 determines that blank character codes need to be inserted into the character string acquired by the character string acquisition unit 302, the discriminant analysis unit 312 determines a discriminant analysis threshold using the discriminant analysis method.

When the blank insertion unit 308 has inserted blank character codes into the acquired character string based on the discriminant analysis threshold, the over-insertion determination unit 313 determines whether the number of those blank character codes is equal to or greater than a predetermined ratio of the number of character intervals in the acquired character string. In the present embodiment, the predetermined ratio is set to 40%.

If the number of blank character codes is 40% or more of the number of character intervals in the acquired character string, the over-insertion determination unit 313 instructs the character interval list creation unit 305 to delete all blank character codes already contained in the character string and then create a character interval list.

A specific example of the processing in the third embodiment will now be described in detail with reference to FIGS. 13 and 14.

For example, the document data reception unit 301 receives document data such as that shown in FIG. 13(A), and the character string acquisition unit 302 acquires the character string 105 as shown in FIG. 13(B). Next, as shown in FIG. 13(C), the discriminant analysis unit 312 creates a histogram of the character intervals of the character string 105.

The discriminant analysis unit 312 then applies the discriminant analysis method to the histogram and classifies the histogram values into two groups, with the discriminant analysis threshold as the boundary. For example, as shown in FIG. 13(C), it calculates "3 (pixels)" as the discriminant analysis threshold and classifies the histogram values into character intervals greater than "3 (pixels)" and character intervals of "3 (pixels)" or less.
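The discriminant analysis method is commonly known as Otsu's method: it picks the boundary that maximises the between-class variance of the two groups. The source does not give the formula, so the following Python sketch is only an assumption about how such a threshold could be computed. `otsu_threshold` is a hypothetical name, and the sample list assumes that the histogram of FIG. 13(C) is built from the same twelve character intervals that are listed for FIG. 14(C).

```python
from typing import List


def otsu_threshold(gaps: List[int]) -> int:
    """Return t maximising the between-class variance of {g <= t} and {g > t}."""
    values = sorted(set(gaps))
    if len(values) < 2:
        raise ValueError("need at least two distinct gap values")
    n = len(gaps)
    best_t, best_score = values[0], -1.0
    for t in values[:-1]:  # the largest value would leave the upper group empty
        low = [g for g in gaps if g <= t]
        high = [g for g in gaps if g > t]
        w0, w1 = len(low) / n, len(high) / n
        mu0, mu1 = sum(low) / len(low), sum(high) / len(high)
        score = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Character intervals of string 105 (values as listed for FIG. 14(C)).
gaps_105 = [1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 6]
t = otsu_threshold(gaps_105)
print(t)                                # prints 3
print(sum(1 for g in gaps_105 if g > t))  # prints 5
```

Under these assumptions the boundary falls at 3 pixels and five intervals exceed it, which agrees with the figures described in the text.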

The blank insertion unit 308 then inserts a blank character code between characters whose character interval is greater than the discriminant analysis threshold determined by the discriminant analysis unit 312. For example, as shown in FIG. 14(A), in the character string 105 it inserts blank character codes between the "a" and "i" of "Failed", between the "l" and "e" of "Failed", between the "d" of "Failed" and the "e" of "example", between the "x" and "a" of "example", and between the "a" and "m" of "example".

Next, as shown in FIG. 14(B), the character string 105 has 12 character intervals, and 5 blank character codes were inserted at character intervals greater than the discriminant analysis threshold. The number of blank character codes is therefore about 41.67% of the number of character intervals in the character string 105, so the over-insertion determination unit 313 determines that the number of blank character codes inserted into the character string 105 is equal to or greater than the predetermined ratio of the number of character intervals in the character string 105.
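The over-insertion test is a simple ratio check. A minimal Python sketch follows; the function name `is_over_inserted` and its parameter names are hypothetical, and integer arithmetic is used so that the boundary case of exactly 40% is not affected by floating-point rounding.

```python
def is_over_inserted(num_blanks: int, num_intervals: int, percent: int = 40) -> bool:
    """True when blanks make up `percent` % or more of the character intervals."""
    if num_intervals <= 0:
        return False  # nothing to compare against (an assumption of this sketch)
    return num_blanks * 100 >= percent * num_intervals


# String 105: 5 blanks over 12 intervals is about 41.67 %, which is >= 40 %.
print(is_over_inserted(5, 12))  # prints True
print(is_over_inserted(4, 12))  # prints False
```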

When the over-insertion determination unit 313 determines that the number of blank character codes inserted into the character string is equal to or greater than the predetermined ratio of the number of character intervals in the character string 105, it instructs the character interval list creation unit 305 to delete all blank character codes contained in the character string 105 and then create a character interval list for the character string 105.

The character interval list creation unit 305 then deletes all blank character codes contained in the character string 105 and creates a character interval list for the character string 105 in the same manner as in the first embodiment. For example, as shown in FIG. 14(C), it creates "1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 6" as the character interval list of the character string 105.

As in the first embodiment, the first derivative list creation unit 306 creates a first derivative list by taking the first derivative of each character interval in the character interval list of the character string 105. For example, as shown in FIG. 14(C), it creates "0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 2" as the first derivative list of the character string 105.

Next, the threshold determination unit 307 detects "2" as the maximum value in the first derivative list of the character string 105 and determines the corresponding entry "6 (pixels)" in the character interval list as the threshold.

The blank insertion unit 308 then inserts a blank character code between characters whose character interval is "6 (pixels)" or more, as shown in FIG. 14(D). Specifically, it inserts a blank character code between the "d" of "Failed" and the initial "e" of "example".
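The worked example for the character string 105 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, and the per-position gap values are invented so that their sorted form equals the list given for FIG. 14(C) and the gaps larger than 3 fall at the five positions named for FIG. 14(A). The source does not state the actual position of each small gap.

```python
from typing import List


def first_differences(ordered: List[int]) -> List[int]:
    """Difference between each sorted gap and the one before it."""
    return [b - a for a, b in zip(ordered, ordered[1:])]


def insert_blanks(chars: str, gaps: List[int], threshold: int) -> str:
    """Insert a space wherever the gap before a character is >= threshold."""
    if not chars:
        return ""
    out = [chars[0]]
    for ch, gap in zip(chars[1:], gaps):  # gaps[i] lies between chars[i] and chars[i+1]
        if gap >= threshold:
            out.append(" ")
        out.append(ch)
    return "".join(out)


chars = "Failedexample"
# Hypothetical gaps for F-a, a-i, i-l, l-e, e-d, d-e, e-x, x-a, a-m, m-p, p-l, l-e.
gaps = [2, 4, 1, 4, 2, 6, 3, 4, 4, 2, 1, 2]

ordered = sorted(gaps)              # [1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 6]
diffs = first_differences(ordered)  # [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 2]
k = diffs.index(max(diffs))         # position of the largest jump
threshold = ordered[k + 1]          # gap just after that jump

print(threshold)                             # prints 6
print(insert_blanks(chars, gaps, threshold))  # prints Failed example
```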

Next, the processing of the document processing server 30b in the third embodiment will be described with reference to the flowchart of FIG. 15. Processing identical to that of the first embodiment is given the same step numbers, and its description is omitted.

First, when the document data reception unit 301 has received document data and the blank insertion determination unit 304 has determined that the standard deviation of the character intervals of the character string acquired by the character string acquisition unit 302 is greater than the set value A, the discriminant analysis unit 312 determines a discriminant analysis threshold using the discriminant analysis method (step S301).

Next, the blank insertion unit 308 inserts a blank character code between characters whose character interval is greater than the threshold determined in step S301 (step S302).

The over-insertion determination unit 313 then determines whether the number of blank character codes inserted in step S302 is 40% or more of the number of character intervals in the acquired character string (step S303). If the number of inserted blank character codes is less than 40% of the number of character intervals in the acquired character string (no in step S303), the processing ends.

If the number of blank character codes inserted in step S302 is 40% or more of the number of character intervals in the acquired character string (yes in step S303), the over-insertion determination unit 313 instructs the character interval list creation unit 305 to delete all inserted blank character codes and create a character interval list (step S304).

When the over-insertion determination unit 313 has instructed that a character interval list be created, the document processing server 30b inserts blank character codes into the character string from which all blank character codes have been deleted, using the same processing as in the first embodiment (steps S105 to S108).

[Modification]
The third embodiment has been described as determining whether the number of blank character codes inserted based on the discriminant analysis threshold is equal to or greater than a predetermined ratio of the number of character intervals in the character string. However, when the character string of the document data received by the document data reception unit 301 already contains blank character codes, the insertion of blank character codes based on the discriminant analysis threshold may be skipped, and it may instead be determined whether the number of blank character codes in the acquired character string is equal to or greater than the predetermined ratio of the number of character intervals in the character string.

In the above description, the document processing server 30 has a different configuration in each of the first, second, and third embodiments; however, the document processing server 30 may include some or all of the configurations of the first, second, and third embodiments.

The first to third embodiments have been described for document data written in English, but the present invention is equally applicable to document data written in any language that separates words with spaces, such as German, French, Korean, or Vietnamese. Japanese is not generally written with spaces between words, but the present invention is likewise applicable to document data in which, for example, hiragana text is written with spaces between words.

Furthermore, the first to third embodiments have been described for the case where blanks are inserted into document data containing character codes and coordinate information, but the present invention is also applicable to inserting blanks into document data obtained by scanning a paper document and applying OCR (optical character recognition) processing to the scanned data.

Adopting the configuration of the present invention makes it possible to insert blank character codes at appropriate positions in document data, which contributes to improving the accuracy of subsequent processing such as translation. In addition to suppressing over-insertion, in some cases it is also possible to suppress under-insertion, in which a blank character code is not inserted at a position where one should be inserted.

10 terminal device
11 CPU
12 memory
13 storage device
14 communication IF
15 UI device
16 control bus
20 image forming apparatus
30, 30a, 30b document processing server
40 network
101 to 105 character strings
301 document data reception unit
302 character string acquisition unit
303 standard deviation calculation unit
304 blank insertion determination unit
305 character interval list creation unit
306 first derivative list creation unit
307 threshold determination unit
308 blank insertion unit
309 character string division unit
310 storage unit
311 word determination unit
312 discriminant analysis unit
313 over-insertion determination unit

Claims

A document processing apparatus comprising:
reception means for receiving document data;
acquisition means for acquiring a character string based on character codes contained in the document data received by the reception means;
first creation means for creating a character interval list in which character intervals, each being the distance between two adjacent characters in the character string acquired by the acquisition means, are arranged in order of size;
second creation means for creating a change amount list indicating the amount of change of each character interval in the character interval list relative to the adjacent character interval;
determination means for, when the character interval in the character interval list corresponding to the maximum value in the change amount list is less than the average of the character intervals in the character interval list or is equal to or less than the character interval located at the center of the character interval list, excluding that character interval from the candidates for a first threshold, and, when the character interval in the character interval list corresponding to the next largest value in the change amount list after the value of the excluded character interval, or the character interval in the character interval list corresponding to the maximum value in the change amount list, is equal to or greater than the average of the character intervals in the character interval list or is greater than the character interval located at the center of the character interval list, determining that character interval as the first threshold; and
insertion means for inserting a blank character code into the character string between characters whose character interval is equal to or greater than the first threshold.

A document processing apparatus comprising:
reception means for receiving document data;
acquisition means for acquiring a character string based on character codes contained in the document data received by the reception means;
first creation means for creating a character interval list in which character intervals, each being the distance between two adjacent characters in the character string acquired by the acquisition means, are arranged in order of size;
second creation means for creating a change amount list indicating the amount of change of each character interval in the character interval list relative to the adjacent character interval;
determination means for determining, as a first threshold, the character interval in the character interval list corresponding to the maximum value in the change amount list;
insertion means for inserting a blank character code into the character string between characters whose character interval is equal to or greater than the first threshold; and
judgment means for judging that no blank character code needs to be inserted into the character string acquired by the acquisition means when the standard deviation of the character intervals of that character string is equal to or less than a first predetermined value,
wherein the insertion means does not insert a blank character code into a character string for which the judgment means has judged that no blank character code needs to be inserted.

The document processing apparatus according to claim 2, wherein, when the character string acquired by the acquisition means includes a character interval whose deviation is equal to or less than a second predetermined value, the judgment means sets the deviation of that character interval to 0 and recalculates the standard deviation of the character string, and, when the recalculated standard deviation is equal to or less than the first predetermined value, judges that no blank character code needs to be inserted into the character string.

A document processing apparatus comprising:
reception means for receiving document data;
acquisition means for acquiring a character string based on character codes contained in the document data received by the reception means;
first creation means for creating a character interval list in which character intervals, each being the distance between two adjacent characters in the character string acquired by the acquisition means, are arranged in order of size;
second creation means for creating a change amount list indicating the amount of change of each character interval in the character interval list relative to the adjacent character interval;
determination means for determining, as a first threshold, the character interval in the character interval list corresponding to the maximum value in the change amount list; and
insertion means for inserting a blank character code into the character string between characters whose character interval is equal to or greater than the first threshold,
wherein the character string of the document data is written in a language that separates words with spaces,
the apparatus further comprising:
storage means for storing words of the language; and
division means for dividing the character string acquired by the acquisition means at the blank character codes inserted by the insertion means,
wherein the first creation means, when a character string produced by the division means does not match any word stored in the storage means, creates a character interval list in which the character intervals of the divided character string are arranged in order of size,
the second creation means creates a change amount list indicating the amount of change of each character interval in the character interval list of the divided character string relative to the adjacent character interval,
the determination means determines, as a second threshold, the character interval in the character interval list corresponding to the maximum value in the change amount list of the divided character string, and
the insertion means inserts a blank character code into the divided character string between characters whose character interval is equal to or greater than the second threshold determined by the determination means.

A document processing apparatus comprising:
reception means for receiving document data;
acquisition means for acquiring a character string based on character codes contained in the document data received by the reception means;
first creation means for creating a character interval list in which character intervals, each being the distance between two adjacent characters in the character string acquired by the acquisition means, are arranged in order of size;
second creation means for creating a change amount list indicating the amount of change of each character interval in the character interval list relative to the adjacent character interval;
determination means for determining, as a first threshold, the character interval in the character interval list corresponding to the maximum value in the change amount list; and
insertion means for inserting a blank character code into the character string between characters whose character interval is equal to or greater than the first threshold,
wherein, when the acquired character string already contains blank character codes and the number of those blank character codes is equal to or greater than a predetermined ratio of the number of character intervals in the acquired character string, the first creation means deletes all blank character codes contained in the character string and then creates the character interval list.

A document processing apparatus comprising:
first creation means for creating a character interval list by sorting, in order of size, character intervals each being the distance between two adjacent characters in a received character string;
second creation means for creating a change amount list indicating the amount of change of each character interval in the character interval list relative to the adjacent character interval; and
insertion means for, when the character interval in the character interval list corresponding to the maximum value in the change amount list is less than the average of the character intervals in the character interval list or is equal to or less than the character interval located at the center of the character interval list, excluding that character interval from the candidates for a first threshold, and, when the character interval in the character interval list corresponding to the next largest value in the change amount list after the value of the excluded character interval, or the character interval in the character interval list corresponding to the maximum value in the change amount list, is equal to or greater than the average of the character intervals in the character interval list or is greater than the character interval located at the center of the character interval list, inserting a blank character code between the characters separated by that character interval.

A program for causing a computer to execute:
a step of receiving document data;
a step of acquiring a character string based on character codes contained in the received document data;
a step of creating a character interval list in which character intervals, each being the distance between two adjacent characters in the acquired character string, are arranged in order of size;
a step of creating a change amount list indicating the amount of change of each character interval in the character interval list relative to the adjacent character interval;
a step of, when the character interval in the character interval list corresponding to the maximum value in the change amount list is less than the average of the character intervals in the character interval list or is equal to or less than the character interval located at the center of the character interval list, excluding that character interval from the candidates for a threshold, and, when the character interval in the character interval list corresponding to the next largest value in the change amount list after the value of the excluded character interval, or the character interval in the character interval list corresponding to the maximum value in the change amount list, is equal to or greater than the average of the character intervals in the character interval list or is greater than the character interval located at the center of the character interval list, determining that character interval as the threshold; and
a step of inserting a blank character code into the character string between characters whose character interval is equal to or greater than the threshold.
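The threshold-determination step recited in the first apparatus claim and in the program claim can be sketched in Python as follows. This is an illustrative reading, not a definitive implementation: `choose_threshold` and `use_center` are hypothetical names, the "character interval located at the center" is assumed to be the element at index `len // 2` of the sorted list, and repeating the exclusion for more than one candidate is an assumption that generalises the single "next largest value" named in the claims.

```python
from statistics import mean
from typing import List, Optional


def choose_threshold(gaps: List[float], use_center: bool = False) -> Optional[float]:
    """Pick the gap after the largest jump, skipping candidates that are too small.

    use_center=False: a candidate must be >= the mean gap.
    use_center=True:  a candidate must be >  the gap at the center of the list.
    """
    ordered = sorted(gaps)
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    if not diffs:
        return None  # fewer than two gaps: no jump to inspect
    average = mean(ordered)
    center = ordered[len(ordered) // 2]  # assumed definition of the center gap
    # Visit jumps from largest to smallest; equal jumps keep their list order.
    for k in sorted(range(len(diffs)), key=lambda i: -diffs[i]):
        candidate = ordered[k + 1]
        accepted = candidate > center if use_center else candidate >= average
        if accepted:
            return candidate
    return None


# Hypothetical gaps: the largest jump (1 -> 5) points at 5, which is below the
# mean of about 5.29, so it is excluded and the next jump (6 -> 9) is used.
print(choose_threshold([1, 5, 5, 5, 6, 6, 9]))  # prints 9
```

Excluding candidates below the mean or the center gap keeps an unusually tight letter pair at the low end of the list from being mistaken for the boundary between intra-word and inter-word spacing.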