CN104750666A

CN104750666A - Text character encoding mode identification method and system

Info

Publication number: CN104750666A
Application number: CN201510107921.2A
Authority: CN
Inventors: 段垚
Original assignee: MAINBO EDUCATION TECHNOLOGY Co Ltd
Current assignee: MAINBO EDUCATION TECHNOLOGY Co Ltd
Priority date: 2015-03-12
Filing date: 2015-03-12
Publication date: 2015-07-01
Anticipated expiration: 2035-03-12
Also published as: CN104750666B

Abstract

The invention discloses a method and system for identifying the character encoding of a text, and belongs to the technical field of character encoding. In the method, the text to be identified is decoded separately under each of N candidate character encodings (N ≥ 2), yielding one decoded character string per encoding. The occurrence probability of each decoded string is then calculated from the occurrence probabilities of its individual characters, and the encoding whose decoded string has the highest occurrence probability is determined to be the character encoding of the text. The method and system effectively improve the accuracy of character encoding identification and are particularly suitable for short texts.

Description

A method and system for identifying the character encoding of a text

Technical field

The present invention relates to the technical field of character encoding, and in particular to a method and system for identifying the character encoding of a text.

Background art

In computer information processing, text data can be represented in many different character encodings. Some of them can represent the characters of all commonly used writing systems in the world, for example UTF-8, UTF-16 and UTF-32 (UTF stands for UCS Transformation Format, and UCS for Universal Character Set). Most character encodings, however, are script-specific: they favor, or can only represent, the characters of one or a few writing systems. For example, GB2312 and GB18030 are mainly used for Simplified Chinese, Big5 for Traditional Chinese, Shift-JIS for Japanese, ISO-8859-1 for Latin characters, and ISO-8859-5 for Cyrillic characters (such as Russian). A script-specific encoding is used almost exclusively for text in its own language: GB18030, for instance, covers the characters of all commonly used writing systems in the world, yet is used almost only for Simplified Chinese text. The UTF family, by contrast, is used to encode text in all kinds of scripts. UTF encodings are gradually replacing script-specific encodings, but the latter still exist in large numbers and will remain in use for a long time.

In practice, much text data does not indicate, or does not correctly indicate, the character encoding it uses; examples include some web pages, file names in zip archives, ID3 metadata in mp3 files, and text carried by two-dimensional (QR) codes. Such text is usually handled in one of two ways: (1) assuming a default character encoding, or (2) identifying the character encoding of the text. The former is error-prone, so the latter has received more attention and is widely used. Existing identification methods still have problems, however, chiefly a low recognition accuracy for short texts (from a few characters to a few dozen characters).

Shanjian Li et al. of Netscape proposed a character encoding identification method in the paper "A composite approach to language/encoding detection". Its main idea exploits two facts: many character encodings are tied to a particular writing system, and in East Asian scripts (such as Chinese) the frequently used characters make up only a small fraction of all characters. For example, a text of unknown encoding is decoded as GB2312 (which amounts to assuming that the text is mainly Simplified Chinese), and the ratio of frequently used to rarely used Chinese characters in the result is counted; if it matches the ratio found in natural Chinese text, the text is probably encoded in GB2312. Other character encodings are treated similarly. To compare quantitatively how likely each encoding is, the method defines a "confidence" formula for each character encoding, and the encoding with the highest confidence is the one the text most probably uses. For the East Asian encodings the formulas are:

confidence = (frequent-character ratio of the text) / (frequent-character ratio in natural language)

frequent-character ratio = (number of occurrences of frequent characters) / (total number of characters - number of occurrences of frequent characters)

The frequent characters are defined as the 512 most frequently used characters of the writing system in question, which can be obtained by counting existing natural-language text.

This method is fairly effective for longer texts (such as web pages), but its discrimination is often insufficient for very short texts (as short as a few characters). A likely reason is that the confidence does not change significantly as the number of characters grows but tends toward a constant, which means the method does not make full use of the information implicit in the text.
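The behavior of the prior-art confidence measure can be illustrated with a minimal Python sketch. The frequent-character set and the natural-language ratio below are hypothetical stand-ins (the paper uses the 512 most frequent characters of each script); the point is only that repeating the same text a hundred times leaves the confidence unchanged.

```python
def frequent_ratio(text, frequent_chars):
    """Occurrences of frequent characters divided by occurrences of all other characters."""
    frequent = sum(1 for ch in text if ch in frequent_chars)
    others = len(text) - frequent
    if others == 0:
        return float("inf") if frequent else 0.0
    return frequent / others


def confidence(text, frequent_chars, typical_ratio):
    """Confidence as defined by the two formulas above."""
    return frequent_ratio(text, frequent_chars) / typical_ratio


# Hypothetical stand-ins, not values from the paper or the patent.
FREQUENT = set("的一是不了")
TYPICAL_RATIO = 1.0

short_text = "的一X"
long_text = short_text * 100

# The ratio is length-independent, so more text adds no evidence.
print(confidence(short_text, FREQUENT, TYPICAL_RATIO))
print(confidence(long_text, FREQUENT, TYPICAL_RATIO))
```

Both calls return the same value, which is the weakness the paragraph above describes: the measure cannot become more decisive as evidence accumulates.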

Summary of the invention

In view of the defects of the prior art, the object of the present invention is to provide a method and system for identifying the character encoding of a text that have a wider scope of application and a higher accuracy.

To achieve the above object, the present invention adopts the following technical solution:

A method for identifying the character encoding of a text comprises the following steps:

(1) decoding the text to be identified separately under each of N known character encodings to obtain the decoded character string corresponding to each encoding, where N ≥ 2;

(2) calculating the occurrence probability of the character string corresponding to each character encoding from the occurrence probability of each character in the decoded string, and determining the character encoding whose string has the highest occurrence probability to be the character encoding of the text to be identified.

Further, in the method described above, calculating in step (2) the occurrence probability of the character string corresponding to each character encoding from the occurrence probability of each character in the decoded string comprises:

1) determining the occurrence probability of each character in the decoded string;

2) multiplying the occurrence probabilities of the characters, or adding the logarithms of the occurrence probabilities of the characters, to obtain the occurrence probability of the character string corresponding to each character encoding, where the logarithm takes the occurrence probability of a character as its argument and has a base greater than 1.

Alternatively, in the method described above, the calculation in step (2) comprises:

1. determining the occurrence probability of each character in the decoded string;

2. multiplying the occurrence probabilities of the characters, or adding the logarithms of the occurrence probabilities of the characters, to obtain a preliminary probability of the character string corresponding to each character encoding, where the logarithm takes the occurrence probability of a character as its argument and has a base greater than 1;

3. when the preliminary probability is obtained by multiplication, multiplying the preliminary probability of each string by the occurrence probability of the character encoding corresponding to that string to obtain the occurrence probability of the string; when the preliminary probability is obtained by adding logarithms, adding to the preliminary probability of each string the logarithm of the occurrence probability of the corresponding character encoding to obtain the occurrence probability of the string. The occurrence probability of each character encoding is obtained from statistics on the character encodings used by natural text in different regions.
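The two scoring variants can be written compactly in symbols. The notation is introduced here for illustration and does not appear in the original text: s_e = c_1 ... c_n is the string obtained by decoding under candidate encoding e, p_e(c) is the occurrence probability of character c under e, and P(e) is the occurrence probability of encoding e.

```latex
\begin{aligned}
\text{first variant:}\quad  & \log P(s_e) = \sum_{i=1}^{n} \log p_e(c_i) \\
\text{second variant:}\quad & \log P(s_e) = \log P(e) + \sum_{i=1}^{n} \log p_e(c_i) \\
\text{decision:}\quad       & \hat{e} = \arg\max_{e} \, \log P(s_e)
\end{aligned}
```

Because a logarithm with a base greater than 1 is strictly increasing, ranking the candidates by the sum of logarithms gives the same result as ranking them by the product of probabilities.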

Further, in the method described above, if in step (2) a character in a decoded string fails to decode, the occurrence probability of that string is set to 0, or the occurrence probability of that character is set to a preset value that is far smaller than the average occurrence probability of all characters under the corresponding character encoding, or far smaller than the occurrence probability of any character under that encoding.

Further, in the method described above, the occurrence probability of each character in the decoded string is determined in step (2) as follows:

The probability with which each character of the decoded string occurs in natural text that uses the corresponding character encoding is taken as the occurrence probability of that character.

If the character encoding corresponding to the string grades its characters by how commonly they are used, the occurrence probability of a character in the decoded string equals the occurrence probability of the characters of the grade to which it belongs.

Further, in the method described above, the occurrence probability of each character in the decoded string is determined in step (2) by setting it according to factors relevant to that character, the factors including the position of the character in the string, the characters adjacent to it, the probability with which the character is used in natural text that uses the corresponding character encoding, the user's preference settings, the source of the text to be identified, and the language setting of the device that decodes the text.

To achieve the above object, an embodiment of the present invention further provides a system for identifying the character encoding of a text, comprising:

a text decoding module, configured to decode the text to be identified separately under each of N known character encodings to obtain the decoded character string corresponding to each encoding, where N ≥ 2;

a character encoding identification module, configured to calculate the occurrence probability of the character string corresponding to each character encoding from the occurrence probability of each character in the decoded string, and to determine the character encoding whose string has the highest occurrence probability to be the character encoding of the text to be identified.

Further, in the system described above, the character encoding identification module comprises:

a first character occurrence probability determining unit, configured to determine the occurrence probability of each character in the decoded string;

a first string occurrence probability calculating unit, configured to multiply the occurrence probabilities of the characters, or add the logarithms of the occurrence probabilities of the characters, to obtain the occurrence probability of the character string corresponding to each character encoding, where the logarithm takes the occurrence probability of a character as its argument and has a base greater than 1.

Further, in the system described above, the character encoding identification module comprises:

a second character occurrence probability determining unit, configured to determine the occurrence probability of each character in the decoded string;

a string preliminary probability calculating unit, configured to multiply the occurrence probabilities of the characters, or add the logarithms of the occurrence probabilities of the characters, to obtain a preliminary probability of the character string corresponding to each character encoding, where the logarithm takes the occurrence probability of a character as its argument and has a base greater than 1;

a second string occurrence probability calculating unit, configured to calculate the occurrence probability of the character string corresponding to each character encoding: when the preliminary probability is obtained by multiplication, the preliminary probability of each string is multiplied by the occurrence probability of the character encoding corresponding to that string; when the preliminary probability is obtained by adding logarithms, the logarithm of the occurrence probability of the corresponding character encoding is added to the preliminary probability of each string. The occurrence probability of each character encoding is obtained from statistics on the character encodings used by natural text in different regions.

The beneficial effect of the present invention is as follows: the method and system make full use of the statistical information of each character encoding in natural text and therefore effectively improve the accuracy of character encoding identification; they are particularly suitable for identifying the character encoding of short texts.

Brief description of the drawings

Fig. 1 is a flowchart of a method for identifying the character encoding of a text provided in a specific embodiment of the invention;

Fig. 2 is a structural block diagram of a system for identifying the character encoding of a text provided in a specific embodiment of the invention.

Detailed description

The present invention is described in further detail below with reference to the drawings and embodiments.

Fig. 1 shows the flowchart of a method for identifying the character encoding of a text in a specific embodiment of the invention. As the figure shows, the method mainly comprises the following two steps:

Step S100: decoding the text to be identified separately under each of N known character encodings to obtain the decoded character string corresponding to each encoding, where N ≥ 2;

Step S200: calculating the occurrence probability of the character string corresponding to each character encoding from the occurrence probability of each character in the decoded string, and determining the character encoding whose string has the highest occurrence probability to be the character encoding of the text to be identified.

In this embodiment, the text to be identified is a text that does not indicate, or does not correctly indicate, the character encoding it uses, that is, a text whose character encoding needs to be identified. The text is first decoded under each of the known character encodings, giving one decoded character string per encoding. In other words, the known character encodings are the candidates, and the encoding finally identified for the text is one of them. The known character encodings include, but are not limited to, the Unicode encodings (UTF-8, UTF-16, UTF-32, etc.), GB2312 and GB18030.

In practice, when the text to be identified is decoded under a known character encoding, the text may be decoded directly into the code points of that encoding, and the resulting code point sequence used as the decoded string. In GB18030, for example, a character is represented by 1, 2 or 4 bytes, and the numerical value of that combination of 1, 2 or 4 bytes (conventionally read with the high byte first) is the GB18030 code point. The character "啊", for example, is represented by two bytes, 0xB0 (first byte) and 0xA1 (second byte), so its code point is 0x0000B0A1 (written here as a 32-bit integer); the character "a" is represented by the single byte 0x61, so its code point is 0x00000061. The code point sequence obtained by decoding the text as GB18030 is thus the decoded string.
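The byte-to-code-point rule just described can be sketched in a few lines of Python. The helper name `gb_code_point` is hypothetical; it only combines the bytes of one already-delimited character, high byte first, and does not split a byte stream into characters.

```python
def gb_code_point(char_bytes: bytes) -> int:
    """Combine the 1, 2 or 4 bytes of one GB18030 character, high byte first."""
    if len(char_bytes) not in (1, 2, 4):
        raise ValueError("a GB18030 character occupies 1, 2 or 4 bytes")
    return int.from_bytes(char_bytes, "big")


# The two examples from the text; the 08X format mirrors the 32-bit notation.
print(f"0x{gb_code_point(bytes([0xB0, 0xA1])):08X}")
print(f"0x{gb_code_point(b'a'):08X}")
```

Python's built-in `gb18030` codec produces the same two bytes for the character U+554A, which is a convenient way to cross-check the example.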

To meet compatibility requirements, the code point sequence obtained by decoding the text under a given encoding may further be converted into a sequence of Universal Character Set (UCS) code points, which then serves as the decoded string.

To better support cross-language and cross-platform processing, the UCS code point sequence may additionally be encoded in some UTF encoding (UTF-16, UTF-32, etc.), and the encoded sequence used as the decoded string.
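As an illustration of these alternative representations, the following sketch uses Python's built-in `gb18030` codec (standing in for the mapping table mentioned in the text) to obtain UCS code points, and then re-encodes them as big-endian UTF-16. The variable names are illustrative only.

```python
raw = bytes([0xB0, 0xA1, 0x61])        # one two-byte GB character followed by "a"

text = raw.decode("gb18030")           # codec performs the GB-to-UCS mapping
ucs_code_points = [ord(ch) for ch in text]
utf16_units = text.encode("utf-16-be") # UTF-16 without a byte order mark

print([f"U+{cp:04X}" for cp in ucs_code_points])
print(utf16_units.hex(" "))
```

Any of the three forms (native code points, UCS code points, or a UTF-encoded sequence) can feed the probability calculation, as long as the character probability tables are keyed the same way.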

In practice, any of the above forms may be chosen as the decoded string according to actual needs. Converting code points of GB and similar encodings into UCS code points generally requires looking up a mapping table, which is often large and slow to query. If the occurrence probability of a character can be determined directly from the value of its GB code point, no conversion to UCS code points is needed and processing is faster; the drawback is that the code for determining character occurrence probabilities must then be implemented separately for each character encoding, which is more complex.

In practice it is usually unnecessary to try every character encoding in the world, because a given computer or user is normally exposed to only a few of them.

After the text to be identified has been decoded, the occurrence probability of each decoded string is calculated from the occurrence probabilities of its characters, and the character encoding corresponding to the string with the highest occurrence probability is determined to be the character encoding of the text.

This embodiment provides the following two ways of calculating the occurrence probability of the character string corresponding to each character encoding from the occurrence probabilities of the characters in the decoded string:

The steps of the first way are as follows:

1) determining the occurrence probability of each character in the decoded string;

2) multiplying the occurrence probabilities of the characters, or adding the logarithms of the occurrence probabilities of the characters, to obtain the occurrence probability of the character string corresponding to each character encoding.
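A minimal end-to-end sketch of the first way, in Python, may make the procedure concrete. Everything numeric here is hypothetical: the toy probability table, the floor value for unlisted characters, and the choice of two candidates. Real tables would come from corpus statistics for each encoding. Logarithms are summed rather than probabilities multiplied, which also avoids floating-point underflow on longer strings.

```python
import math
import string

FLOOR = 1e-6  # hypothetical probability for characters missing from a table


def toy_table():
    """Hypothetical character probabilities; real ones come from corpus statistics."""
    table = {ch: 0.03 for ch in string.ascii_lowercase}
    table["é"] = 0.005
    return table


# One table per candidate encoding (the same toy table is reused for brevity).
CHAR_PROB = {"utf-8": toy_table(), "latin-1": toy_table()}


def string_log_prob(text, table):
    """Sum of natural logarithms of the per-character probabilities."""
    return sum(math.log(table.get(ch, FLOOR)) for ch in text)


def identify(data, candidates=("utf-8", "latin-1")):
    """Return the candidate whose decoded string scores highest, or None."""
    scores = {}
    for encoding in candidates:
        try:
            text = data.decode(encoding)          # step S100
        except UnicodeDecodeError:
            continue                              # undecodable: probability 0
        scores[encoding] = string_log_prob(text, CHAR_PROB[encoding])
    return max(scores, key=scores.get) if scores else None  # step S200


print(identify("café".encode("utf-8")))    # bytes 63 61 66 C3 A9
print(identify("café".encode("latin-1")))  # bytes 63 61 66 E9
```

In the first call the Latin-1 reading contains two characters absent from the toy table and is penalized by the floor value; in the second call the bytes are not valid UTF-8, so only the Latin-1 candidate remains.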

The steps of the second way are as follows:

1. determining the occurrence probability of each character in the decoded string;

2. multiplying the occurrence probabilities of the characters, or adding the logarithms of the occurrence probabilities of the characters, to obtain a preliminary probability of the character string corresponding to each character encoding;

3. when the preliminary probability is obtained by multiplication, multiplying the preliminary probability of each string by the occurrence probability of the character encoding corresponding to that string to obtain the occurrence probability of the string; when the preliminary probability is obtained by adding logarithms, adding to the preliminary probability of each string the logarithm of the occurrence probability of the corresponding character encoding to obtain the occurrence probability of the string.

Different countries and regions tend to use different character encodings, so the occurrence probability of each character encoding can be determined with reference to statistics on the encodings used by natural text in each country or region. For example, when the region/language of the computer is set to "Mainland China/Simplified Chinese", or the text to be identified originates from Mainland China, or the user's preference is set to "Simplified Chinese", the statistics might give an occurrence probability of 0.5 for the GB encodings and 0.4 for UTF-8, with the remainder going to other encodings. If the above factors instead indicate "Europe" or "Germany", the occurrence probability of ISO-8859-1 might be estimated at 0.7 and that of UTF-8 at 0.2, with the remainder going to other encodings. These figures have not been verified in practice and serve only as an illustration. In practice it is not necessary to gather statistics over all natural text. For each specific application, a better approach is to gather statistics from text samples of the relevant region and field: to identify the encoding of file names in zip files, for example, one can count the file names in a number of zip files from that region; to identify the encoding of text in QR codes, one can count the text of a number of QR codes from that region. Strictly speaking, the sum of the occurrence probabilities of the N character encodings is usually less than 1, because the N encodings normally cover only the more common encodings rather than every possible one. In the calculation, however, the sum may also be fixed at 1, meaning that the character encoding of the text to be identified must be one of the N.
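The encoding-level probability can be folded into the score as sketched below, in Python. The locale keys and codec names are hypothetical labels chosen for this example, and the numbers are the unverified illustrative figures from the paragraph above, not measured statistics.

```python
import math

# Illustrative (unverified) figures from the text, keyed by hypothetical locale tags.
ENCODING_PRIOR = {
    "zh-CN": {"gb18030": 0.5, "utf-8": 0.4},
    "de-DE": {"latin-1": 0.7, "utf-8": 0.2},
}


def final_log_prob(preliminary_log_prob, encoding, locale):
    """Add the log of the encoding's occurrence probability to the preliminary score."""
    prior = ENCODING_PRIOR[locale].get(encoding, 0.0)
    if prior == 0.0:
        return float("-inf")  # encoding not considered for this locale
    return preliminary_log_prob + math.log(prior)


# With identical character-level evidence, the locale prior breaks the tie.
tie = -10.0
print(final_log_prob(tie, "gb18030", "zh-CN") > final_log_prob(tie, "utf-8", "zh-CN"))
```

For short texts the character-level evidence is weak, so this term can decide the outcome; for long texts it is dominated by the sum over characters.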

In this embodiment, an occurrence probability is the likelihood that a given event occurs; the occurrence probability of a character, for example, is the probability that the character appears in the decoded string. All logarithms mentioned take the occurrence probability of a character as their argument and have a base greater than 1.

For some character encodings, not every binary sequence is valid; an invalid sequence may cause decoding to fail or may produce characters that should not appear in normal text (such as the UCS code point U+0000). When this happens, the encoding in question is extremely unlikely to be the correct encoding of the text. To handle this, when the occurrence probability of each string is calculated in this embodiment and a character in a decoded string fails to decode, the occurrence probability of that string is set to 0, or the occurrence probability of that character is set to a preset value that is far smaller than the average occurrence probability of all characters under the corresponding character encoding, or far smaller than the occurrence probability of any character under that encoding. In other words, when a decoded string contains a wrongly decoded character, either the occurrence probability of the string is set directly to 0, or the occurrence probability of the erroneous character is set to a number far smaller than that of ordinary characters; in effect, the character encoding that produced the decoding error is eliminated as a candidate.
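Both error-handling options can be sketched in Python as follows. The sketch assumes that decoding with `errors="replace"` is an acceptable way to mark undecodable bytes (Python substitutes U+FFFD), and it treats U+0000 as an error marker as well; the preset value and the floor are hypothetical. A genuine U+FFFD in the input would be penalized too, which is a simplification.

```python
import math

ERROR_PROB = 1e-12   # hypothetical preset value, far below any real character probability
FLOOR = 1e-6         # hypothetical probability for characters missing from the table
ERROR_MARKERS = ("\ufffd", "\x00")


def log_prob_with_errors(text, table, strict=False):
    """Score a decoded string; strict=True sets the whole string's probability to 0."""
    total = 0.0
    for ch in text:
        if ch in ERROR_MARKERS:
            if strict:
                return float("-inf")          # log of probability 0
            total += math.log(ERROR_PROB)     # heavy but finite penalty
        else:
            total += math.log(table.get(ch, FLOOR))
    return total


# The byte E9 alone is not valid UTF-8, so the last character becomes U+FFFD.
decoded = b"caf\xe9".decode("utf-8", errors="replace")
print(log_prob_with_errors(decoded, {}, strict=True))
print(log_prob_with_errors(decoded, {}, strict=False))
```

The finite penalty keeps the candidate in the ranking, which can matter when every candidate produces at least one error; the strict option removes it outright.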

In this embodiment, the occurrence probability of each character in the decoded string (step 1) of the first way and step 1. of the second way) can be determined in various ways; the following are provided:

Mode one: the probability with which each character of the decoded string occurs in natural text that uses the corresponding character encoding is taken as the occurrence probability of that character. For the GB encodings, for example, a representative number of Simplified Chinese web pages can be analyzed to obtain the occurrence probability of each character, and a character-to-probability mapping table built from the result; this table is then used to determine the occurrence probability of each character decoded under a GB encoding. For each specific application, of course, a better approach is to gather statistics from text samples of the relevant field: to identify the encoding of file names in zip files, for example, one can count the file names in a large number of real-world zip files; to identify the encoding of text in QR codes, one can count the text of a large number of QR codes.

Mode two: if the character encoding corresponding to the string grades its characters by how commonly they are used, the occurrence probability of a character in the decoded string equals the occurrence probability of the characters of its grade. GB18030-2000 is such an encoding: it contains 27533 Chinese characters in total, including the 6768 characters of GB2312, which are the most commonly used ones; the rest are comparatively rare. After decoding as GB18030-2000, whether a character falls within the GB2312 range can be determined from the numerical value of its code point. Suppose the occurrence probabilities of the GB2312 Chinese characters sum to 0.90 and are uniformly distributed (which of course is not realistic); their average occurrence probability is then 0.90/6768 = 0.000133. Suppose further that the occurrence probabilities of the rare Chinese characters outside GB2312 sum to 0.01; the average occurrence probability of a rare character is then 0.01/(27533-6768) = 4.8 × 10^-7. This approach is rather coarse, but because the program need not carry a character-to-probability mapping table, the resulting program can be kept small.
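The grade-based arithmetic can be reproduced with a short Python sketch. The character counts and probability masses are the figures used in the worked arithmetic above; the byte-range test for "inside GB2312" (first byte 0xB0 to 0xF7, second byte 0xA1 to 0xFE) is an assumption of this sketch, an approximation of the GB2312 Chinese-character area rather than something stated in the text.

```python
GB2312_HANZI = 6768        # count as stated in the text
GB18030_HANZI = 27533      # count as stated in the text
COMMON_MASS = 0.90         # assumed total probability of GB2312 characters
RARE_MASS = 0.01           # assumed total probability of the remaining characters

p_common = COMMON_MASS / GB2312_HANZI
p_rare = RARE_MASS / (GB18030_HANZI - GB2312_HANZI)


def in_gb2312_hanzi_area(code_point):
    """Approximate range test on a two-byte GB code point (assumption of this sketch)."""
    high, low = code_point >> 8, code_point & 0xFF
    return 0xB0 <= high <= 0xF7 and 0xA1 <= low <= 0xFE


def grade_probability(code_point):
    """Return the average probability of the grade the code point falls in."""
    return p_common if in_gb2312_hanzi_area(code_point) else p_rare


print(f"{p_common:.6f}")   # average probability of a common character
print(f"{p_rare:.1e}")     # average probability of a rare character
```

No lookup table is needed: one range comparison per character is enough, which is the size advantage the paragraph above points out.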

Mode three: the occurrence probability of a character is set according to factors relevant to that character, including the position of the character in the string, the characters adjacent to it, the probability with which the character is used in natural text that uses the corresponding character encoding, the user's preference settings, the source of the text to be identified, and the language setting of the device that decodes the text. For example, text in the UTF family of encodings sometimes uses a special character, the byte order mark (BOM, UCS code point U+FEFF), to indicate its character encoding. If, after a text is decoded under some UTF encoding, the first character is the byte order mark, this strongly suggests that the text uses that UTF encoding. The occurrence probability of the byte order mark can also be obtained from statistics on natural text: if, say, 20% of all UTF-8 texts in practice start with a byte order mark, then the probability that the byte order mark appears at the beginning of a UTF-8 text is 20%.

As another example, ISO-8859-1 is a single-byte encoding (code points 0 to 255) consisting mainly of two parts. Code points less than or equal to 127 are identical to ASCII and mainly comprise unaccented Latin letters (that is, the English alphabet), digits, punctuation and so on. Code points from 128 to 255 comprise accented Latin letters such as é, which are used mainly in Western European languages such as French, German and Italian. These languages share a feature: accented Latin letters almost never occur consecutively and are always surrounded by unaccented letters. In other words, when the previous character is an accented Latin letter, the probability that the current character is also an accented Latin letter is extremely low. The specific probabilities can again be obtained from statistics on natural text. A rough estimate is that, in text that uses ISO-8859-1 (other than English), the occurrence probability of a space is 0.14, the average occurrence probability of each English letter is 0.0149, and the average occurrence probability of each accented Latin letter is 0.00156 when the previous character is an English letter, or 0.0000156 when it is not. For ISO-8859-1 it is necessary to take the previous character into account, because the encoding has a high utilization rate (a large proportion of its code positions encode commonly used characters), so text in another encoding can easily be mistaken for ISO-8859-1. When text in another encoding is decoded as ISO-8859-1, consecutive accented Latin letters are quite likely to appear; assigning a very low occurrence probability to consecutive accented Latin letters therefore effectively avoids this misidentification.
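The adjacency rule for ISO-8859-1 can be sketched in Python as below. The four probabilities are the rough estimates quoted above; the probability for all other characters (`P_OTHER`) and the way accented letters are detected (alphabetic characters in the range 0xC0 to 0xFF) are assumptions of this sketch.

```python
import math

P_SPACE = 0.14
P_ENGLISH_LETTER = 0.0149
P_ACCENTED_AFTER_LETTER = 0.00156
P_ACCENTED_OTHERWISE = 0.0000156
P_OTHER = 0.001  # hypothetical value for digits, punctuation and the rest


def is_english_letter(ch):
    return ch.isascii() and ch.isalpha()


def is_accented_letter(ch):
    # Alphabetic characters in the upper half of ISO-8859-1 (assumed definition).
    return 0xC0 <= ord(ch) <= 0xFF and ch.isalpha()


def latin1_log_prob(text):
    """Score a string decoded as ISO-8859-1, conditioning on the previous character."""
    total, previous = 0.0, ""
    for ch in text:
        if ch == " ":
            p = P_SPACE
        elif is_english_letter(ch):
            p = P_ENGLISH_LETTER
        elif is_accented_letter(ch):
            after_letter = is_english_letter(previous)
            p = P_ACCENTED_AFTER_LETTER if after_letter else P_ACCENTED_OTHERWISE
        else:
            p = P_OTHER
        total += math.log(p)
        previous = ch
    return total


# Two GB2312-encoded Chinese characters misread as ISO-8859-1 give four accented letters in a row.
misread = "中文".encode("gb2312").decode("latin-1")
print(misread, latin1_log_prob(misread))
print("café", latin1_log_prob("café"))
```

The misread string is charged the low probability four times, while a genuine word with one accented letter after an English letter is charged the higher one, so the ISO-8859-1 candidate is much less attractive for the Chinese bytes.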

The language setting of the device that decodes the text, the source of the text, and the user's preference settings are taken into account for the following reason: some character encodings, in particular the UTF family, do not correspond to a single writing system. In different countries and regions they are used to encode entirely different scripts, so there is no uniform "character occurrence probability" for them. A reasonable approach is therefore to use localization information. A computer whose region/language is set to Simplified Chinese, for instance, is very likely to receive and process Simplified Chinese text; text obtained from a website whose address is located in Mainland China is also likely to be Simplified Chinese; and a user may set Simplified Chinese as the preferred language in some software (such as a browser), in which case the text received and processed by that software is also likely to be Simplified Chinese. In these cases, when identifying the character encoding, it can be assumed that the occurrence probabilities of characters under the UTF encodings match those of natural Simplified Chinese text. Other scripts are handled similarly. It is also possible to sample UTF-encoded text worldwide and derive global character occurrence probabilities for the UTF encodings, but this does not make full use of localization information, so the accuracy of encoding identification is somewhat lower.

An embodiment of the present invention also provides a system for identifying a text character encoding mode. As shown in Figure 2, the system comprises a text decoding module 100 and a character encoding mode identification module 200.

The text decoding module 100 is configured to decode the text to be identified separately according to N known character encoding modes, obtaining the decoded character string corresponding to each character encoding mode; N >= 2.

The character encoding mode identification module 200 is configured to calculate the occurrence probability of the character string corresponding to each character encoding mode from the occurrence probability of each character in the decoded character string, and to determine the character encoding mode corresponding to the character string with the highest occurrence probability as the character encoding mode of the text to be identified.
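
The division of work between the two modules can be sketched as two functions. This is a minimal sketch under stated assumptions: the function names, the `FLOOR` value for characters missing from a table, and the toy tables are all hypothetical, and logarithms are summed instead of multiplying probabilities.

```python
import math

FLOOR = 1e-9  # assumed probability for characters absent from a table


def decode_text(data, encodings):
    """Module 100: decode the same bytes under every candidate encoding."""
    return {enc: data.decode(enc, errors="replace") for enc in encodings}


def identify_encoding(decoded, tables):
    """Module 200: score each decoded string and pick the most probable one."""
    scores = {}
    for enc, text in decoded.items():
        table = tables[enc]
        scores[enc] = sum(math.log10(table.get(ch, FLOOR)) for ch in text)
    best = max(scores, key=scores.get)
    return best, scores


# Toy tables, not real statistics.
tables = {
    "gb2312": {"中": 0.01, "文": 0.008},
    "big5": {"中": 0.01, "文": 0.008},
}
data = "中文".encode("gb2312")
best, scores = identify_encoding(decode_text(data, ["gb2312", "big5"]), tables)
```

Here the bytes decode to the two common characters under GB2312 but to other characters under Big5, so the GB2312 string receives the higher score.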

In one embodiment of the present invention, the character encoding mode identification module 200 may comprise a first character occurrence probability determining unit 201 and a first character string occurrence probability calculating unit 202.

The first character occurrence probability determining unit 201 is configured to determine the occurrence probability of each character in the decoded character string.

The first character string occurrence probability calculating unit 202 is configured to multiply the occurrence probabilities of the characters together, or to add the logarithms of the occurrence probabilities of the characters, to obtain the occurrence probability of the character string corresponding to each character encoding mode.

In another embodiment of the present invention, the character encoding mode identification module 200 may comprise a second character occurrence probability determining unit 203, a character string preliminary probability calculating unit 204, and a second character string occurrence probability calculating unit 205.

The second character occurrence probability determining unit 203 is configured to determine the occurrence probability of each character in the decoded character string.

The character string preliminary probability calculating unit 204 is configured to multiply the occurrence probabilities of the characters together, or to add the logarithms of the occurrence probabilities of the characters, to obtain the preliminary probability of the character string corresponding to each character encoding mode.

The second character string occurrence probability calculating unit 205 is configured to calculate the occurrence probability of the character string corresponding to each character encoding mode. When the preliminary probability is obtained by multiplying the occurrence probabilities of the characters, the preliminary probability of each character string is multiplied by the occurrence probability of the character encoding mode corresponding to that character string to obtain the occurrence probability of the character string. When the preliminary probability is obtained by adding the logarithms of the occurrence probabilities of the characters, the logarithm of the occurrence probability of the character encoding mode corresponding to the character string is added to the preliminary probability to obtain the occurrence probability of the character string. The occurrence probability of each character encoding mode is set by the user.

For a better understanding of the present invention, the method of the present invention is further described below with reference to a specific embodiment.

Embodiment one

In this embodiment, the text to be identified is denoted T, and the candidate character encoding modes (the known character encoding modes in step S100) are denoted E1 and E2. The specific steps for identifying this text with the identification method provided by the present invention are as follows:

Step S10: decode the binary sequence of the text T to be identified according to character encodings E1 and E2 respectively, obtaining the corresponding decoded character strings; the character string obtained by decoding with E1 is denoted S_E1, and the character string obtained by decoding with E2 is denoted S_E2.

Step S11: set the occurrence probabilities of character encoding modes E1 and E2, that is, the probabilities that the character encoding mode of the text to be identified is E1 or E2. In this embodiment, the occurrence probability of E1 is set to P_E1 = 0.9 and the occurrence probability of E2 is set to P_E2 = 0.1.

Step S12: for each character string obtained by decoding in step S10, determine the occurrence probability of each character in the string, multiply the occurrence probabilities of the characters together to obtain the preliminary probability of the string, and then multiply the preliminary probability by the occurrence probability of the character encoding mode corresponding to the string to obtain the occurrence probability of the string. If a character fails to decode, its occurrence probability is set to 0 or to a very small number.
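
The handling of decoding errors in step S12 might be sketched as follows. This sketch assumes that Python's `errors="replace"` handler is used, so that an undecodable byte sequence appears as U+FFFD in the decoded string; the constant values and the function name are hypothetical.

```python
DECODE_ERROR_PROBABILITY = 1e-12  # assumed "very small number"; 0.0 rejects outright
REPLACEMENT = "\ufffd"            # what errors="replace" emits for bad bytes


def char_probabilities(data, encoding, table, unknown=1e-6):
    """Per-character probabilities, with a tiny value for decoding errors."""
    text = data.decode(encoding, errors="replace")
    probs = []
    for ch in text:
        if ch == REPLACEMENT:
            # The byte sequence was invalid under this encoding.
            probs.append(DECODE_ERROR_PROBABILITY)
        else:
            # `unknown` is an assumed default for characters not in the table.
            probs.append(table.get(ch, unknown))
    return probs
```

Note that a genuine U+FFFD already present in the text would be treated as an error by this sketch, and that a probability of exactly 0 cannot be used with the logarithmic form without special handling, since the logarithm of 0 is undefined.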

In this embodiment, suppose the character string S_E1 contains two characters whose occurrence probabilities are 0.001 and 0.0001 respectively; the occurrence probability P_S_E1 of S_E1 is then 0.9 × 0.001 × 0.0001 = 9 × 10^-8. Suppose S_E2 contains four characters whose occurrence probabilities are 0.01, 0.001, 0.001, and 0.001 respectively; the occurrence probability P_S_E2 of S_E2 is then 0.1 × 0.01 × 0.001 × 0.001 × 0.001 = 1 × 10^-12.

In practical applications, because the computed occurrence probability of a character string is always a very small number, floating-point computation easily underflows when the number of characters is large. The equivalent logarithmic representation can therefore be used: the logarithms of the occurrence probabilities of the characters are added to obtain the preliminary probability of the character string, and the logarithm of the occurrence probability of the character encoding mode corresponding to the character string is then added to the preliminary probability to obtain the occurrence probability of the character string.
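
The numerical problem that motivates the logarithmic form can be seen in a short sketch. The figures are assumptions chosen for illustration: 100 characters, each with an occurrence probability of 0.0001.

```python
import math

char_probs = [1e-4] * 100  # assumed: 100 characters, probability 0.0001 each

# Direct product: the true value 1e-400 is too small for a double,
# so the result collapses to 0.0 and all candidates would look alike.
product = math.prod(char_probs)

# Logarithmic form: the sum stays a perfectly ordinary number.
log_sum = sum(math.log10(p) for p in char_probs)
```

The product evaluates to zero, whereas the sum of logarithms is about -400 and can still be compared between candidate encodings.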

For this embodiment, lg(P_S_E1) = lg(0.001) + lg(0.0001) + lg(0.9) = -7.05, and lg(P_S_E2) = lg(0.01) + lg(0.001) + lg(0.001) + lg(0.001) + lg(0.1) = -12.

Step S13: the character encoding corresponding to the character string with the highest occurrence probability is determined as the character encoding most likely used by the text. Because P_S_E1 is greater than P_S_E2, the most probable character encoding of text T is E1.
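
The figures of this embodiment can be reproduced with a few lines of code. The sketch below uses only the probabilities given in steps S11 and S12; the variable names are chosen for the example.

```python
import math

# Occurrence probabilities of the encodings (step S11).
priors = {"E1": 0.9, "E2": 0.1}

# Occurrence probabilities of the characters of S_E1 and S_E2 (step S12).
char_probs = {
    "E1": [0.001, 0.0001],
    "E2": [0.01, 0.001, 0.001, 0.001],
}

# Product form: encoding probability times the product of character probabilities.
product_scores = {e: priors[e] * math.prod(ps) for e, ps in char_probs.items()}

# Logarithmic form: sum of base-10 logarithms.
log_scores = {
    e: math.log10(priors[e]) + sum(math.log10(p) for p in ps)
    for e, ps in char_probs.items()
}

# Step S13: choose the encoding whose string has the highest probability.
best = max(log_scores, key=log_scores.get)
```

Both forms rank the candidates identically, since the logarithm is monotonically increasing, and both select E1.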

If more than two candidate character encodings are to be identified, the above method need only be extended slightly, so no further illustration is given.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A method for identifying a text character encoding mode, comprising the following steps:

2. The method for identifying a text character encoding mode according to claim 1, characterized in that: in step (2), calculating the occurrence probability of the character string corresponding to each character encoding mode from the occurrence probability of each character in the decoded character string comprises:

3. The method for identifying a text character encoding mode according to claim 1, characterized in that: in step (2), calculating the occurrence probability of the character string corresponding to each character encoding mode from the occurrence probability of each character in the decoded character string comprises:

4. The method for identifying a text character encoding mode according to any one of claims 1 to 3, characterized in that: in step (2), if a character in the decoded character string fails to decode, the occurrence probability of the corresponding character string is set to 0, or the occurrence probability of said character is set to a set value, said set value being much smaller than the mean of the occurrence probabilities of all characters in the character encoding mode corresponding to said character, or much smaller than the occurrence probability of any character in the character encoding mode corresponding to said character.

5. The method for identifying a text character encoding mode according to claim 4, characterized in that: in step (2), the occurrence probability of each character in the decoded character string is determined as follows:

6. The method for identifying a text character encoding mode according to claim 4, characterized in that: in step (2), the occurrence probability of each character in the decoded character string is determined as follows:

7. The method for identifying a text character encoding mode according to claim 4, characterized in that: in step (2), the occurrence probability of each character in the decoded character string is determined by setting the occurrence probability of the character according to factors related to each character, said factors comprising the position of the character in the character string, the characters adjacent to the character, the probability that the character is used in natural text adopting the corresponding character encoding mode, user preference settings, the source of the text to be identified, and the language setting of the device that decodes the text to be identified.

8. A system for identifying a text character encoding mode, comprising:

9. The system for identifying a text character encoding mode according to claim 8, characterized in that: said character encoding mode identification module comprises:

10. The system for identifying a text character encoding mode according to claim 8, characterized in that: said character encoding mode identification module comprises: