CN104750666A - Text character encoding mode identification method and system - Google Patents

Text character encoding mode identification method and system

Info

Publication number
CN104750666A
Authority
CN
China
Prior art keywords
character
probability
occurrence
character string
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510107921.2A
Other languages
Chinese (zh)
Other versions
CN104750666B (en)
Inventor
段垚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MAINBO EDUCATION TECHNOLOGY Co Ltd
Original Assignee
MAINBO EDUCATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MAINBO EDUCATION TECHNOLOGY Co Ltd
Priority to CN201510107921.2A
Publication of CN104750666A
Application granted
Publication of CN104750666B
Active
Anticipated expiration

Landscapes

  • Character Discrimination (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a text character encoding mode identification method and system and belongs to the technical field of character encoding. The identification method comprises the following steps: a text to be identified is decoded separately according to N types of character encoding modes, so that one decoded character string is obtained for each of the N character encoding modes, where N is not smaller than two; the occurrence probability of the character string corresponding to each character encoding mode is calculated from the occurrence probability of each character in the decoded character string, and the character encoding mode corresponding to the character string with the highest occurrence probability is determined to be the character encoding mode of the text to be identified. The method and system effectively improve the accuracy of character encoding mode identification and are particularly applicable to identifying the character encoding mode of short texts.

Description

Method and system for identifying the character encoding of a text
Technical field
The present invention relates to the technical field of character encoding, and in particular to a method and system for identifying the character encoding of a text.
Background
In computer information processing, text data can be represented in many different character encodings. Some character encodings can represent the characters of all commonly used writing systems in the world, for example UTF-8, UTF-16 and UTF-32 (UTF stands for UCS Transformation Format, and UCS stands for Universal Character Set). Most character encodings, however, are script-specific: they favor, or can only represent, the characters of one or a few writing systems. For example, GB2312 and GB18030 are mainly used for Simplified Chinese, Big5 for Traditional Chinese, Shift-JIS for Japanese, ISO-8859-1 for Latin characters, and ISO-8859-5 for Cyrillic characters (such as Russian). Script-specific encodings are used almost exclusively for text in their particular language; GB18030, for instance, also covers the characters of all commonly used writing systems in the world, yet it is used almost exclusively for Simplified Chinese text. The UTF encodings, by contrast, are used for text in all kinds of scripts. UTF encodings are gradually replacing script-specific encodings, but the latter are still widespread and will remain in use for a long time.
In computer information processing, a great deal of text data does not declare, or does not correctly declare, the character encoding it uses, for example some web pages, file names in zip archives, ID3 metadata in mp3 files, and the text carried by QR codes. Such text is usually handled in one of two ways: (1) assume a default character encoding, or (2) identify the character encoding of the text. The first approach is error-prone, so the second has received more attention and is widely used. However, existing character encoding identification methods still have problems, chiefly a low recognition accuracy for short texts (a few characters to a few dozen characters).
Shanjian Li et al. of Netscape proposed a character encoding identification method in the paper "A composite approach to language/encoding detection". Its main idea exploits two facts: many character encodings are tied to a particular writing system, and in East Asian scripts (such as Chinese) the frequently used characters make up only a small fraction of all characters. For example, a text of unknown encoding is decoded as GB2312 (which amounts to assuming that the text is mainly Simplified Chinese), and the ratio of frequent to non-frequent Chinese characters in the result is counted. If the ratio matches that of natural Chinese text, the text is probably encoded in GB2312. Other character encodings are processed in the same way. To compare quantitatively how likely each encoding is, the method defines a "confidence" formula for each character encoding; the encoding with the highest confidence is the one the text most probably uses. For the East Asian character encodings the formulas are:
Confidence = frequent-character ratio of the text / frequent-character ratio in natural language
Frequent-character ratio = number of occurrences of frequent characters / (total number of characters - number of occurrences of frequent characters)
Frequent characters are defined as the 512 most frequently used characters of the writing system, which can be obtained by collecting statistics over existing natural-language text.
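The two prior-art formulas above can be sketched in Python as follows. This is only an illustration of the formulas as stated here, not the implementation from the cited paper; the function names and the idea of passing the frequent characters as a set are assumptions.

```python
def frequent_char_ratio(text, frequent_chars):
    """Occurrences of frequent characters divided by occurrences of all other characters."""
    frequent = sum(1 for ch in text if ch in frequent_chars)
    others = len(text) - frequent
    # With no non-frequent characters the ratio is unbounded; report infinity.
    return float("inf") if others == 0 else frequent / others


def confidence(text, frequent_chars, natural_ratio):
    """Confidence = ratio measured in the text / ratio observed in natural language."""
    return frequent_char_ratio(text, frequent_chars) / natural_ratio
```

Because the result is a ratio of counts, it does not become more decisive as the text grows, which matches the weakness discussed next.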
This method is fairly effective for longer texts (such as web pages) but often discriminates poorly on very short texts (for example, texts only a few characters long). A possible reason is that the confidence does not change significantly as the number of characters grows but instead tends toward a constant, which means the method does not make full use of the information implicit in the text.
Summary of the invention
In view of the defects of the prior art, the object of the present invention is to provide a method and system for identifying the character encoding of a text that have a wider scope of application and higher accuracy.
To achieve this object, the present invention adopts the following technical solution:
A method for identifying the character encoding of a text, comprising the following steps:
(1) decoding the text to be identified according to each of N known character encodings, obtaining one decoded character string per character encoding, where N ≥ 2;
(2) calculating the occurrence probability of the character string corresponding to each character encoding from the occurrence probability of each character in the decoded character string, and determining the character encoding corresponding to the character string with the highest occurrence probability to be the character encoding of the text to be identified.
Further, in the method described above, calculating in step (2) the occurrence probability of the character string corresponding to each character encoding from the occurrence probability of each character in the decoded character string comprises:
1) determining the occurrence probability of each character in the decoded character string;
2) multiplying the occurrence probabilities of the characters, or adding the logarithms of the occurrence probabilities of the characters, to obtain the occurrence probability of the character string corresponding to each character encoding, where each logarithm takes the occurrence probability of a character as its argument and has a base greater than 1.
Further, in the method described above, calculating in step (2) the occurrence probability of the character string corresponding to each character encoding from the occurrence probability of each character in the decoded character string comprises:
1. determining the occurrence probability of each character in the decoded character string;
2. multiplying the occurrence probabilities of the characters, or adding the logarithms of the occurrence probabilities of the characters, to obtain a preliminary probability of the character string corresponding to each character encoding, where each logarithm takes the occurrence probability of a character as its argument and has a base greater than 1;
3. when the preliminary probability was obtained by multiplying the occurrence probabilities of the characters, multiplying the preliminary probability of each character string by the occurrence probability of the character encoding corresponding to that character string to obtain the occurrence probability of the character string; when the preliminary probability was obtained by adding the logarithms of the occurrence probabilities of the characters, adding to the preliminary probability of each character string the logarithm of the occurrence probability of the character encoding corresponding to that character string to obtain the occurrence probability of the character string. The occurrence probability of each character encoding is obtained from statistics on the character encodings used by natural text in different regions.
Further, in the method described above, in step (2), if a character in a decoded character string fails to decode, the occurrence probability of that character string is set to 0, or the occurrence probability of that character is set to a preset value that is far smaller than the average occurrence probability of all characters in the corresponding character encoding, or far smaller than the occurrence probability of any character in that character encoding.
Further, in the method described above, in step (2), the occurrence probability of each character in the decoded character string is determined as follows:
The probability with which each character of the decoded character string occurs in natural text that uses the corresponding character encoding is taken as the occurrence probability of that character.
Further, in the method described above, in step (2), the occurrence probability of each character in the decoded character string is determined as follows:
If the character encoding corresponding to the character string grades its characters by how common they are, the occurrence probability of a character in the decoded character string equals the occurrence probability of the characters in the grade to which that character belongs.
Further, in the method described above, in step (2), the occurrence probability of each character in the decoded character string is determined as follows: the occurrence probability of each character is set according to factors related to that character, the factors including the position of the character in the character string, the characters adjacent to it, the probability with which the character is used in natural text that uses the corresponding character encoding, the user's preference settings, the source of the text to be identified, and the language setting of the device that decodes the text to be identified.
To achieve the above object, an embodiment of the present invention further provides a system for identifying the character encoding of a text, comprising:
a text decoding module, configured to decode the text to be identified according to each of N known character encodings, obtaining one decoded character string per character encoding, where N ≥ 2;
a character encoding identification module, configured to calculate the occurrence probability of the character string corresponding to each character encoding from the occurrence probability of each character in the decoded character string, and to determine the character encoding corresponding to the character string with the highest occurrence probability to be the character encoding of the text to be identified.
Further, in the system described above, the character encoding identification module comprises:
a first character occurrence probability determining unit, configured to determine the occurrence probability of each character in the decoded character string;
a first character string occurrence probability calculating unit, configured to multiply the occurrence probabilities of the characters, or add the logarithms of the occurrence probabilities of the characters, to obtain the occurrence probability of the character string corresponding to each character encoding, where each logarithm takes the occurrence probability of a character as its argument and has a base greater than 1.
Further, in the system described above, the character encoding identification module comprises:
a second character occurrence probability determining unit, configured to determine the occurrence probability of each character in the decoded character string;
a character string preliminary probability calculating unit, configured to multiply the occurrence probabilities of the characters, or add the logarithms of the occurrence probabilities of the characters, to obtain a preliminary probability of the character string corresponding to each character encoding, where each logarithm takes the occurrence probability of a character as its argument and has a base greater than 1;
a second character string occurrence probability calculating unit, configured to calculate the occurrence probability of the character string corresponding to each character encoding: when the preliminary probability was obtained by multiplying the occurrence probabilities of the characters, the preliminary probability of each character string is multiplied by the occurrence probability of the character encoding corresponding to that character string; when the preliminary probability was obtained by adding the logarithms of the occurrence probabilities of the characters, the logarithm of the occurrence probability of the character encoding corresponding to that character string is added to the preliminary probability of each character string. The occurrence probability of each character encoding is obtained from statistics on the character encodings used by natural text in different regions.
The beneficial effect of the present invention is as follows: the method and system of the present invention make full use of the statistical information of the characters of each character encoding in natural text, and therefore effectively improve the accuracy of character encoding identification; they are particularly suitable for identifying the character encoding of short texts.
Brief description of the drawings
Fig. 1 is a flow chart of a method for identifying the character encoding of a text provided in an embodiment of the present invention;
Fig. 2 is a structural block diagram of a system for identifying the character encoding of a text provided in an embodiment of the present invention.
Embodiment
The present invention is described in further detail below with reference to the drawings and an embodiment.
Fig. 1 shows the flow chart of a method for identifying the character encoding of a text in an embodiment of the present invention. As the figure shows, the method mainly comprises the following two steps:
Step S100: decode the text to be identified according to each of N known character encodings, obtaining one decoded character string per character encoding, where N ≥ 2;
Step S200: calculate the occurrence probability of the character string corresponding to each character encoding from the occurrence probability of each character in the decoded character string, and determine the character encoding corresponding to the character string with the highest occurrence probability to be the character encoding of the text to be identified.
In this embodiment, the text to be identified is a text that does not declare, or does not correctly declare, the character encoding it uses, that is, a text whose character encoding needs to be identified. The text to be identified is first decoded with each of several known character encodings, giving one decoded character string per encoding. The known character encodings are thus the candidate encodings, and the character encoding finally identified for the text is one of them. The known character encodings include, but are not limited to, the Unicode encodings (UTF-8, UTF-16, UTF-32, etc.), GB2312 and GB18030.
In practice, when the text to be identified is decoded according to the known character encodings to obtain one character string per encoding, the text can simply be decoded into code points of the corresponding encoding, and the code point sequence used as the decoded character string. In GB18030, for example, a character is represented by 1, 2 or 4 bytes, and the numerical value of that combination of 1, 2 or 4 bytes (generally read with the high byte first) is the GB18030 code point. For example, the character "啊" is represented by two bytes, 0xB0 (first byte) and 0xA1 (second byte), so its code point is 0x0000B0A1 (written here as a 32-bit integer); the character "a" is represented by the single byte 0x61, so its code point is 0x00000061. The code point sequence obtained by decoding the text to be identified as GB18030 is therefore the decoded character string.
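The code point computation in the GB18030 example can be sketched in Python as follows. The helper name `gb_code_point` is hypothetical, and the sketch assumes the bytes of a single character have already been isolated; it does not split a byte stream into characters.

```python
def gb_code_point(char_bytes):
    """Combine the 1, 2 or 4 bytes of one GB18030 character, high byte first."""
    if len(char_bytes) not in (1, 2, 4):
        raise ValueError("a GB18030 character occupies 1, 2 or 4 bytes")
    return int.from_bytes(char_bytes, "big")


# The two characters used in the example above.
two_byte = gb_code_point(b"\xb0\xa1")   # 0x0000B0A1
one_byte = gb_code_point(b"a")          # 0x00000061
```

Python's built-in `gb18030` codec maps the same two bytes to the UCS character U+554A, which corresponds to the conversion to UCS code points described in the following paragraph.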
To meet compatibility requirements, the code point sequence obtained by decoding the text to be identified under a given encoding can also be converted to a sequence of Universal Character Set (UCS) code points, and the UCS code point sequence used as the decoded character string.
To better meet cross-language and cross-platform requirements, the UCS code point sequence can further be encoded with a UTF encoding (UTF-16, UTF-32, etc.), and the resulting encoded sequence used as the decoded character string.
In practice, any of the above forms can be used as the decoded character string, as needed. Converting code points of an encoding such as GB to UCS code points generally requires looking up a mapping table, which is often large and slow to query. If the occurrence probability of a character can be determined directly from the value of its GB code point, no conversion to a UCS code point is needed and processing is more efficient; however, the code that determines character occurrence probabilities must then be implemented separately for each character encoding, which is more complex.
In practice, it is usually unnecessary to try every character encoding in the world, because a given computer or computer user normally encounters only a few kinds of character encoding.
After the text to be identified has been decoded, the occurrence probability of each character string is calculated from the occurrence probability of each character in the decoded character string, and the character encoding corresponding to the character string with the highest occurrence probability is determined to be the character encoding of the text to be identified.
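Steps S100 and S200 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, Python codec names stand in for the candidate encodings, and `toy_char_prob` is an invented stand-in for real per-character statistics.

```python
import math


def decode_candidates(data, encodings):
    """Step S100: decode the bytes under every candidate; None marks a failed decode."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


def identify_encoding(data, encodings, char_prob):
    """Step S200: score each decoded string and return the most probable encoding."""
    best_name, best_score = None, -math.inf
    for name, text in decode_candidates(data, encodings).items():
        if text is None:
            continue  # a failed decode is treated as probability 0
        score = sum(math.log(char_prob(name, ch)) for ch in text)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def toy_char_prob(encoding, ch):
    """Invented probabilities for demonstration only (same values for every encoding)."""
    if "\u4e00" <= ch <= "\u9fff":
        return 1e-3  # common CJK ideograph block
    if ch.isascii() and ch.isprintable():
        return 1e-2
    return 1e-7  # anything else is treated as rare
```

For instance, calling `identify_encoding(data, ["gb18030", "utf-8", "latin-1"], toy_char_prob)` on a two-character Chinese string encoded as GB18030 selects `gb18030`, because the other decodings either fail or consist of characters that the toy model treats as rare.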
This embodiment provides the following two ways of calculating the occurrence probability of the character string corresponding to each character encoding from the occurrence probability of each character in the decoded character string.
The steps of the first way are as follows:
1) determine the occurrence probability of each character in the decoded character string;
2) multiply the occurrence probabilities of the characters, or add the logarithms of the occurrence probabilities of the characters, to obtain the occurrence probability of the character string corresponding to each character encoding.
The steps of the second way are as follows:
1. determine the occurrence probability of each character in the decoded character string;
2. multiply the occurrence probabilities of the characters, or add the logarithms of the occurrence probabilities of the characters, to obtain a preliminary probability of the character string corresponding to each character encoding;
3. when the preliminary probability was obtained by multiplying the occurrence probabilities of the characters, multiply the preliminary probability of each character string by the occurrence probability of the character encoding corresponding to that character string to obtain the occurrence probability of the character string; when the preliminary probability was obtained by adding the logarithms of the occurrence probabilities of the characters, add to the preliminary probability of each character string the logarithm of the occurrence probability of the character encoding corresponding to that character string to obtain the occurrence probability of the character string.
Different countries and regions tend to use different character encodings, so the occurrence probability of each character encoding can be determined with reference to statistics on the character encodings used by natural text in different countries and regions. For example, when the region/language of the computer is set to "Mainland China/Simplified Chinese", or the text to be identified originates from Mainland China, or the user's preference is set to "Simplified Chinese", the occurrence probability of the GB encodings can be estimated from statistics as 0.5 and that of UTF-8 as 0.4, with the remainder belonging to other encodings. If the above factors are "Europe" or "Germany", the occurrence probability of ISO-8859-1 can be estimated as 0.7 and that of UTF-8 as 0.2, with the remainder belonging to other encodings. These figures have not been verified in practice and serve only as an example. In practice, it is not necessary to collect statistics over all natural text. For each specific application, a better approach is to collect statistics over text samples from the relevant region and the relevant field. For example, to identify the character encoding of file names in zip files, the file names in a number of zip files from that region can be counted; to identify the character encoding of the text in QR codes, the text of a number of QR codes from that region can be counted. Strictly speaking, the occurrence probabilities of the N character encodings usually sum to less than 1, because the N encodings normally cannot cover every possible character encoding, only the more common ones. In the calculation, however, the sum of the occurrence probabilities of the N character encodings can also be set to 1, which expresses that the character encoding of the text to be identified must be one of the N.
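Step 3 in its logarithmic form, together with the optional rescaling of the encoding probabilities so that they sum to 1, might look as follows. The locale keys, the codec names chosen to stand for the "GB" and ISO-8859-1 encodings, and the function names are assumptions; the numbers are the unverified example figures from the paragraph above.

```python
import math

# Illustrative encoding probabilities per locale (example figures, not measured data).
ENCODING_PRIORS = {
    "zh-CN": {"gb18030": 0.5, "utf-8": 0.4},
    "de-DE": {"iso-8859-1": 0.7, "utf-8": 0.2},
}


def normalized_priors(priors):
    """Rescale the probabilities of the N candidates so that they sum to 1."""
    total = sum(priors.values())
    return {name: p / total for name, p in priors.items()}


def final_log_score(preliminary_log_prob, encoding, priors):
    """Step 3, log form: preliminary log probability plus log of the encoding probability."""
    return preliminary_log_prob + math.log(priors[encoding])
```

Rescaling does not change which encoding wins, since it adds the same constant to every candidate's log score.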
In this embodiment, an occurrence probability is the likelihood that a given event occurs; for example, the occurrence probability of a character is the probability that this character appears in the decoded character string. All logarithms mentioned take the occurrence probability of a character as their argument and have a base greater than 1.
For some character encodings, not every binary sequence is valid, and an invalid binary sequence may cause decoding to fail or produce characters that should not appear in normal text (such as the UCS code point U+0000). When this happens, the character encoding in question is extremely unlikely to be the correct encoding of the text. To handle this, when the occurrence probability of each character string is calculated in this embodiment, if a character in a decoded character string fails to decode, the occurrence probability of that character string is set to 0, or the occurrence probability of that character is set to a preset value that is far smaller than the average occurrence probability of all characters in the corresponding character encoding, or far smaller than the occurrence probability of any character in that character encoding. In other words, when a decoded character string contains a character that failed to decode, the occurrence probability of the string can be set directly to 0, or the occurrence probability of the faulty character can be set to a number far smaller than that of ordinary characters; in effect, this eliminates the corresponding character encoding as a candidate.
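Both treatments of a decoding error can be sketched as follows. The function name and parameters are hypothetical. The sketch relies on Python's `errors="replace"` handler, which substitutes U+FFFD for undecodable bytes, so a genuinely encoded U+FFFD would also be counted as an error; that simplification is an assumption of this example.

```python
def char_probs_with_errors(data, encoding, char_prob, error_prob=None):
    """Return per-character probabilities, or None when the string probability is 0.

    error_prob=None -> any decoding error rules the encoding out (probability 0).
    error_prob=x    -> each faulty character gets the very small preset value x.
    """
    text = data.decode(encoding, errors="replace")
    probs = []
    for ch in text:
        # U+FFFD marks undecodable bytes; U+0000 should not occur in normal text.
        if ch == "\ufffd" or ch == "\u0000":
            if error_prob is None:
                return None
            probs.append(error_prob)
        else:
            probs.append(char_prob(ch))
    return probs
```

Choosing a small non-zero `error_prob` keeps an encoding in the running when every candidate produces at least one error, whereas the strict variant would reject them all.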
In this embodiment, the occurrence probability of each character in the decoded character string, as required by steps 1) and 1. above, can be determined in various ways. This embodiment provides the following:
Way one: the probability with which each character of the decoded character string occurs in natural text that uses the corresponding character encoding is taken as the occurrence probability of that character. For the GB encodings, for example, a representative set of Simplified Chinese web pages can be analyzed to obtain the occurrence probability of each character in them, and a mapping table from character to probability can then be built and used to determine the occurrence probability of each character obtained by decoding with a GB character encoding. For each specific application, of course, a better approach is to collect statistics over text samples from the relevant field. For example, to identify the character encoding of file names in zip files, the file names in a large number of real-world zip files can be counted; to identify the character encoding of the text in QR codes, the text of a large number of QR codes can be counted.
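Building the character-to-probability mapping table of way one from sample text can be sketched as follows. The function names are hypothetical, and the `floor` value returned for characters never seen in the samples is an assumption that the source does not specify.

```python
from collections import Counter


def build_char_prob_table(samples):
    """Count every character in the sample texts and convert counts to probabilities."""
    counts = Counter()
    for text in samples:
        counts.update(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def lookup(table, ch, floor=1e-9):
    """Occurrence probability of ch; unseen characters get a small floor value."""
    return table.get(ch, floor)
```

A table of this kind would be built once per candidate encoding, from text that is typical of that encoding and of the application at hand.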
Way two: if the character encoding corresponding to the character string grades its characters by how common they are, the occurrence probability of a character in the decoded character string equals the occurrence probability of the characters in the grade to which that character belongs. GB18030-2000 is such a character encoding: it contains 27533 Chinese characters in total, among them the 6768 characters contained in GB2312, which are the most commonly used Chinese characters, while the others are comparatively rare. After decoding as GB18030-2000, the numerical value of a character's code point shows whether the character falls within the GB2312 range. Assuming that the occurrence probabilities of the GB2312 Chinese characters sum to 0.90 and are uniformly distributed (which is of course unrealistic), their average occurrence probability is 0.90/6768 = 0.000133. Assuming that the occurrence probabilities of the uncommon Chinese characters outside GB2312 sum to 0.1, the average occurrence probability of an uncommon character is 0.1/(27533-6768) = 4.8 × 10^-6. This method is rather coarse, but because the program does not need to carry a mapping table from character to probability, it suits applications that require a small program size.
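The arithmetic of way two can be checked with a short sketch that uses the character counts given in the text and takes the two probability masses as 0.90 and 0.1. The byte range used to recognize the GB2312 Chinese-character area (first byte 0xB0 to 0xF7, second byte 0xA1 to 0xFE, that is 72 × 94 = 6768 code positions) and all names are assumptions of this example; code points of non-Chinese characters are not handled.

```python
GB2312_HANZI = 6768          # count given in the text
GB18030_2000_HANZI = 27533   # count given in the text
COMMON_MASS = 0.90           # assumed total probability of GB2312 characters
RARE_MASS = 0.10             # assumed total probability of the remaining characters

COMMON_PROB = COMMON_MASS / GB2312_HANZI
RARE_PROB = RARE_MASS / (GB18030_2000_HANZI - GB2312_HANZI)


def is_gb2312_hanzi(code_point):
    """True if a two-byte GB code point lies in the GB2312 Chinese-character rows."""
    high, low = code_point >> 8, code_point & 0xFF
    return 0xB0 <= high <= 0xF7 and 0xA1 <= low <= 0xFE


def tier_probability(code_point):
    """Occurrence probability of a Chinese character, decided only by its grade."""
    return COMMON_PROB if is_gb2312_hanzi(code_point) else RARE_PROB
```

The grade is decided from the code point alone, so no per-character table has to be stored.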
Mode three: the probability of occurrence of a character is set according to factors associated with that character. These factors include the position of the character in the string, the characters adjacent to it, the probability with which the character is used in natural text in the corresponding character encoding mode, user preference settings, the source of the text to be identified, and the language setting of the device decoding the text. For example, text in a UTF-series encoding sometimes begins with a special character, the byte order mark (BOM, UCS code point U+FEFF), to indicate the encoding in use. If the first character obtained after decoding a text with some UTF encoding is the byte order mark, this strongly suggests that the text uses that UTF encoding. The probability of occurrence of the byte order mark can also be obtained from statistics on natural text: if, say, 20% of all UTF-8 texts in practice begin with a byte order mark, then the probability that the byte order mark appears at the start of a UTF-8 text is 20%.
As another example, ISO-8859-1 is a single-byte encoding (code points from 0 to 255) with two main parts. Code points less than or equal to 127 are identical to ASCII and mainly cover unaccented Latin letters (that is, the English alphabet), digits, punctuation and so on. Code points from 128 to 255 cover accented Latin letters such as é, used mainly in Western European languages such as French, German and Italian. These languages share a feature: accented Latin letters almost never occur consecutively and are nearly always surrounded by unaccented letters. In other words, when the previous character is an accented Latin letter, the probability that the current character is also an accented Latin letter is extremely low. The specific probabilities can again be obtained from statistics on natural text. A rough estimate is that, in text using ISO-8859-1 (other than English), the probability of occurrence of a space is 0.14, the average probability of occurrence of each English letter is 0.0149, and the average probability of occurrence of each accented Latin letter is 0.00156 when the previous character is an English letter, or 0.0000156 when it is not. Taking the previous character into account is necessary for ISO-8859-1 because its code space is densely used (a large proportion of its code positions encode common characters), so text in another encoding is easily mistaken for ISO-8859-1 during identification. When text in another encoding is decoded as ISO-8859-1, runs of consecutive accented Latin letters are likely to appear; assigning a very low probability of occurrence to consecutive accented Latin letters therefore effectively prevents this misidentification.
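The ISO-8859-1 rule can be illustrated with a minimal Python sketch. The four estimates for spaces, English letters and letters with diacritics (such as é) are the rough figures given in the text; the fallback value `P_OTHER`, the function names, and the choice to treat "letter at code point 0xC0 or above" as the test for a letter with a diacritic are assumptions made for the example.

```python
import string
from typing import Optional

P_SPACE = 0.14                   # rough estimate from the text
P_ASCII_LETTER = 0.0149          # average per English letter, from the text
P_ACCENT_AFTER_LETTER = 0.00156  # previous character is an English letter
P_ACCENT_OTHERWISE = 0.0000156   # previous character is anything else
P_OTHER = 0.001                  # hypothetical fallback, not from the text


def is_accented_letter(ch: str) -> bool:
    """True for letters in the upper half of ISO-8859-1, e.g. U+00E9."""
    return 0xC0 <= ord(ch) <= 0xFF and ch.isalpha()


def latin1_char_probability(prev: Optional[str], ch: str) -> float:
    """Probability of ch given the previous character (None at string start)."""
    if ch == " ":
        return P_SPACE
    if ch in string.ascii_letters:
        return P_ASCII_LETTER
    if is_accented_letter(ch):
        if prev is not None and prev in string.ascii_letters:
            return P_ACCENT_AFTER_LETTER
        return P_ACCENT_OTHERWISE
    return P_OTHER
```

For instance, the GB2312 bytes of a Chinese character decoded as ISO-8859-1 yield two adjacent upper-half letters, each of which receives the lower probability, so the ISO-8859-1 candidate is penalized heavily for such input.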
The language setting of the device decoding the text, the source of the text and the user preference settings are taken into account for the following reason: some character encodings, in particular the UTF series, do not correspond to a single writing system. Different countries and regions use them to encode entirely different scripts, so there is no single consistent "probability of occurrence of a character". A reasonable approach is therefore to use localization information. For example, a computer whose region/language is set to Simplified Chinese is very likely to receive and process Simplified Chinese text; text obtained from a website whose address is located in mainland China is also likely to be Simplified Chinese; and a user may set Simplified Chinese as the preferred language in some software (such as a browser), in which case the text that software receives and processes is also likely to be Simplified Chinese. In these cases, when identifying the character encoding, the probability of occurrence of characters under a UTF encoding can be assumed to follow that of characters in Simplified Chinese natural text. Other scripts are handled similarly. Although one could also sample UTF-encoded text worldwide to obtain global probabilities of occurrence of characters under UTF encodings, doing so does not make full use of localization information, and the accuracy of character encoding identification is therefore somewhat lower.
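A possible way to apply such localization information is sketched below in Python. Everything in it is hypothetical: the table contents are made-up numbers, the locale tags are illustrative, and the order in which the hints are consulted (user preference, then text source, then device setting) is an assumption, since the text does not rank them.

```python
from typing import Dict, Optional

# Toy tables with made-up numbers; real tables would come from corpus statistics.
CHAR_TABLES: Dict[str, Dict[str, float]] = {
    "zh-CN": {"的": 0.04, "一": 0.015},
    "fr-FR": {"e": 0.12, "é": 0.02},
    "global": {"e": 0.05, "的": 0.005},
}


def select_char_table(user_pref: Optional[str] = None,
                      source_hint: Optional[str] = None,
                      device_locale: Optional[str] = None) -> Dict[str, float]:
    """Return the table for the first hint that has one, else the global table."""
    for hint in (user_pref, source_hint, device_locale):
        if hint in CHAR_TABLES:
            return CHAR_TABLES[hint]
    return CHAR_TABLES["global"]
```

The global table is used only when no hint matches, reflecting the text's observation that worldwide statistics are usable but somewhat less accurate.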
An embodiment of the present invention also provides a system for identifying the character encoding mode of a text. As shown in Figure 2, the system comprises a text decoding module 100 and a character encoding mode identification module 200.
The text decoding module 100 is configured to decode the text to be identified according to each of N known character encoding modes, obtaining the decoded character string corresponding to each character encoding mode; N ≥ 2.
The character encoding mode identification module 200 is configured to calculate the probability of occurrence of the character string corresponding to each character encoding mode from the probability of occurrence of each character in the decoded string, and to determine the character encoding mode corresponding to the string with the highest probability of occurrence as the character encoding mode of the text to be identified.
In one embodiment of the present invention, the character encoding mode identification module 200 may comprise a first character occurrence probability determining unit 201 and a first character string occurrence probability computing unit 202.
The first character occurrence probability determining unit 201 is configured to determine the probability of occurrence of each character in the decoded string.
The first character string occurrence probability computing unit 202 is configured to multiply the probabilities of occurrence of the characters, or to add the logarithms of those probabilities, to obtain the probability of occurrence of the string corresponding to each character encoding mode.
In another embodiment of the present invention, the character encoding mode identification module 200 may comprise a second character occurrence probability determining unit 203, a character string preliminary probability computing unit 204 and a second character string occurrence probability computing unit 205.
The second character occurrence probability determining unit 203 is configured to determine the probability of occurrence of each character in the decoded string.
The character string preliminary probability computing unit 204 is configured to multiply the probabilities of occurrence of the characters, or to add the logarithms of those probabilities, to obtain the preliminary probability of the string corresponding to each character encoding mode.
The second character string occurrence probability computing unit 205 is configured to calculate the probability of occurrence of the string corresponding to each character encoding mode. When the preliminary probability was obtained by multiplying the probabilities of occurrence of the characters, the preliminary probability of each string is multiplied by the probability of occurrence of the character encoding mode corresponding to that string to obtain the probability of occurrence of the string. When the preliminary probability was obtained by adding the logarithms of the probabilities of occurrence of the characters, the logarithm of the probability of occurrence of the corresponding character encoding mode is added to the preliminary probability of each string to obtain the probability of occurrence of the string. The probability of occurrence of each character encoding mode is set by the user.
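The two computation paths handled by unit 205 can be written compactly. The notation is introduced here for illustration and is not the patent's own: the string decoded under encoding E consists of characters c_1 … c_n, p(c_i) is the probability of occurrence of character c_i, P(E) is the probability of occurrence of the encoding, and b is the base of the logarithm.

```latex
P(S_E) = P(E)\prod_{i=1}^{n} p(c_i)
\qquad\Longleftrightarrow\qquad
\log_b P(S_E) = \log_b P(E) + \sum_{i=1}^{n} \log_b p(c_i), \qquad b > 1
```

Because the logarithm is strictly increasing for b > 1, the encoding that maximizes the product also maximizes the sum of logarithms, so both paths select the same encoding.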
For a better understanding of the present invention, the method is further described below with reference to a specific embodiment.
Embodiment one
In this embodiment, the text to be identified is denoted T, and the candidate character encoding modes (the known character encoding modes of step S100) are denoted E1 and E2. The specific steps for identifying the encoding of this text with the identification method provided by the invention are as follows:
Step S10: decode the binary sequence of the text T with character encodings E1 and E2 respectively, obtaining the corresponding decoded strings; the string obtained by decoding with E1 is denoted S_E1 and the string obtained by decoding with E2 is denoted S_E2.
Step S11: set the probabilities of occurrence of the character encoding modes E1 and E2, that is, the probabilities that the character encoding mode of the text to be identified is E1 or E2. In this embodiment the probability of occurrence of E1 is P_E1 = 0.9 and that of E2 is P_E2 = 0.1.
Step S12: for each string obtained in step S10, determine the probability of occurrence of each character in the string, multiply these probabilities together to obtain the preliminary probability of the string, and then multiply the preliminary probability by the probability of occurrence of the character encoding mode corresponding to the string to obtain the probability of occurrence of the string. If a character fails to decode, its probability of occurrence is set to 0 or to a very small number.
In this embodiment, suppose the string S_E1 contains two characters whose probabilities of occurrence are 0.001 and 0.0001; the probability of occurrence P_S_E1 of S_E1 is then 0.9 × 0.001 × 0.0001 = 9 × 10^-8. Suppose S_E2 contains four characters whose probabilities of occurrence are 0.01, 0.001, 0.001 and 0.001; the probability of occurrence P_S_E2 of S_E2 is then 0.1 × 0.01 × 0.001 × 0.001 × 0.001 = 1 × 10^-12.
In practice, because the computed probabilities of occurrence of strings are all very small numbers, the computation easily underflows on a computer when the string contains many characters. The equivalent logarithmic representation can therefore be used: the logarithms of the probabilities of occurrence of the characters are added to obtain the preliminary probability of the string, and the logarithm of the probability of occurrence of the corresponding character encoding mode is then added to the preliminary probability to obtain the probability of occurrence of the string.
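The numerical problem is easy to reproduce. The following Python sketch uses a hypothetical string of 400 characters, each with probability 0.001: the plain product collapses to zero in double-precision floating point, while the sum of base-10 logarithms stays finite.

```python
import math

char_probs = [0.001] * 400   # hypothetical: 400 characters, each with p = 0.001

product = 1.0
for p in char_probs:
    product *= p             # 1e-1200 is far below the smallest positive float64

log_score = sum(math.log10(p) for p in char_probs)   # roughly -1200, still finite

print(product)               # prints 0.0
print(log_score)
```

Once two candidate strings both evaluate to 0.0 they can no longer be ranked, whereas their logarithmic scores remain comparable.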
For this embodiment, lg(P_S_E1) = lg(0.001) + lg(0.0001) + lg(0.9) = -7.05 and lg(P_S_E2) = lg(0.01) + lg(0.001) + lg(0.001) + lg(0.001) + lg(0.1) = -12.
Step S13: determine the character encoding corresponding to the string with the highest probability of occurrence as the character encoding the text most likely uses. Because P_S_E1 is greater than P_S_E2, the most probable character encoding of the text T is E1.
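The arithmetic of Embodiment one can be checked with a short Python sketch. The probabilities are those given in steps S11 and S12; the names `PRIORS`, `CHAR_PROBS`, `product_score` and `log_score` are introduced for the example only.

```python
import math

PRIORS = {"E1": 0.9, "E2": 0.1}                 # step S11
CHAR_PROBS = {
    "E1": [0.001, 0.0001],                      # two characters in S_E1
    "E2": [0.01, 0.001, 0.001, 0.001],          # four characters in S_E2
}


def product_score(enc: str) -> float:
    """Prior times the product of the character probabilities (step S12)."""
    return PRIORS[enc] * math.prod(CHAR_PROBS[enc])


def log_score(enc: str) -> float:
    """Equivalent base-10 logarithmic form."""
    return math.log10(PRIORS[enc]) + sum(math.log10(p) for p in CHAR_PROBS[enc])


best = max(PRIORS, key=log_score)               # step S13
print(best)                                     # prints E1
```

Both scoring functions rank E1 above E2, as expected from the equivalence of the two forms.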
Identifying among more than two candidate character encodings requires only a slight extension of the above method and is therefore not described further.
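One way such an extension to N candidates might look is sketched below in Python, using the logarithmic form. This is an illustrative sketch, not the patent's implementation: `identify_encoding`, `toy_char_prob` and `LOG_P_DECODE_ERROR` are hypothetical names, the toy character model uses made-up values, and treating the replacement character U+FFFD (produced by `errors="replace"`) as the sign of a decoding failure is an assumption.

```python
import math
from typing import Callable, Dict, Optional

LOG_P_DECODE_ERROR = -12.0   # hypothetical "very small" log10 probability


def identify_encoding(data: bytes,
                      priors: Dict[str, float],
                      char_prob: Callable[[str, str], float]) -> Optional[str]:
    """Return the candidate encoding whose decoded string scores highest."""
    best_enc, best_score = None, -math.inf
    for enc, prior in priors.items():
        text = data.decode(enc, errors="replace")   # bad bytes become U+FFFD
        score = math.log10(prior)
        for ch in text:
            if ch == "\ufffd":
                score += LOG_P_DECODE_ERROR         # penalize decoding failures
            else:
                score += math.log10(char_prob(enc, ch))
        if score > best_score:
            best_enc, best_score = enc, score
    return best_enc


def toy_char_prob(enc: str, ch: str) -> float:
    """Made-up model: ASCII is common, CJK ideographs plausible, the rest rare."""
    if ch.isascii():
        return 0.01
    if "\u4e00" <= ch <= "\u9fff":
        return 0.001
    return 0.000001
```

A call such as `identify_encoding("中文编码".encode("utf-8"), {"utf-8": 0.4, "gb18030": 0.3, "latin-1": 0.3}, toy_char_prob)` selects `"utf-8"`, because the wrong candidates decode the same bytes into more characters, each of them improbable or invalid.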
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (10)

1. A method for identifying the character encoding mode of a text, comprising the following steps:
(1) decoding the text to be identified according to each of N known character encoding modes to obtain the decoded character string corresponding to each character encoding mode, where N ≥ 2;
(2) calculating the probability of occurrence of the character string corresponding to each character encoding mode from the probability of occurrence of each character in the decoded string, and determining the character encoding mode corresponding to the string with the highest probability of occurrence as the character encoding mode of the text to be identified.
2. The method for identifying the character encoding mode of a text according to claim 1, characterized in that, in step (2), calculating the probability of occurrence of the character string corresponding to each character encoding mode from the probability of occurrence of each character in the decoded string comprises:
1) determining the probability of occurrence of each character in the decoded string;
2) multiplying the probabilities of occurrence of the characters, or adding the logarithms of the probabilities of occurrence of the characters, to obtain the probability of occurrence of the string corresponding to each character encoding mode; the logarithm takes the probability of occurrence of the character as its argument and has a base greater than 1.
3. The method for identifying the character encoding mode of a text according to claim 1, characterized in that, in step (2), calculating the probability of occurrence of the character string corresponding to each character encoding mode from the probability of occurrence of each character in the decoded string comprises:
1) determining the probability of occurrence of each character in the decoded string;
2) multiplying the probabilities of occurrence of the characters, or adding the logarithms of the probabilities of occurrence of the characters, to obtain the preliminary probability of the string corresponding to each character encoding mode; the logarithm takes the probability of occurrence of the character as its argument and has a base greater than 1;
3) when the preliminary probability is obtained by multiplying the probabilities of occurrence of the characters, multiplying the preliminary probability of each string by the probability of occurrence of the character encoding mode corresponding to that string to obtain the probability of occurrence of each string; when the preliminary probability is obtained by adding the logarithms of the probabilities of occurrence of the characters, adding to the preliminary probability of each string the logarithm of the probability of occurrence of the character encoding mode corresponding to that string to obtain the probability of occurrence of each string; the probability of occurrence of each character encoding mode is obtained from statistics on the character encoding modes used by natural text in different regions.
4. The method for identifying the character encoding mode of a text according to any one of claims 1 to 3, characterized in that, in step (2), if a character in the decoded string fails to decode, the probability of occurrence of the corresponding string is set to 0, or the probability of occurrence of that character is set to a preset value, the preset value being much smaller than the mean probability of occurrence of all characters in the character encoding mode corresponding to that character, or much smaller than the probability of occurrence of any character in the character encoding mode corresponding to that character.
5. The method for identifying the character encoding mode of a text according to claim 4, characterized in that, in step (2), the probability of occurrence of each character in the decoded string is determined as follows:
the probability with which each character of the decoded string occurs in natural text in the corresponding character encoding mode is taken as the probability of occurrence of that character.
6. The method for identifying the character encoding mode of a text according to claim 4, characterized in that, in step (2), the probability of occurrence of each character in the decoded string is determined as follows:
if the character encoding mode corresponding to the string grades its characters by how common they are, the probability of occurrence of a character in the decoded string equals the probability of occurrence of a character of the grade to which that character belongs.
7. The method for identifying the character encoding mode of a text according to claim 4, characterized in that, in step (2), the probability of occurrence of each character in the decoded string is determined as follows: the probability of occurrence of a character is set according to factors associated with that character, the factors including the position of the character in the string, the characters adjacent to the character, the probability with which the character is used in natural text in the corresponding character encoding mode, user preference settings, the source of the text to be identified, and the language setting of the device decoding the text to be identified.
8. A system for identifying the character encoding mode of a text, comprising:
a text decoding module, configured to decode the text to be identified according to each of N known character encoding modes to obtain the decoded character string corresponding to each character encoding mode, where N ≥ 2;
a character encoding mode identification module, configured to calculate the probability of occurrence of the character string corresponding to each character encoding mode from the probability of occurrence of each character in the decoded string, and to determine the character encoding mode corresponding to the string with the highest probability of occurrence as the character encoding mode of the text to be identified.
9. The system for identifying the character encoding mode of a text according to claim 8, characterized in that the character encoding mode identification module comprises:
a first character occurrence probability determining unit, configured to determine the probability of occurrence of each character in the decoded string;
a first character string occurrence probability computing unit, configured to multiply the probabilities of occurrence of the characters, or to add the logarithms of the probabilities of occurrence of the characters, to obtain the probability of occurrence of the string corresponding to each character encoding mode; the logarithm takes the probability of occurrence of the character as its argument and has a base greater than 1.
10. The system for identifying the character encoding mode of a text according to claim 8, characterized in that the character encoding mode identification module comprises:
a second character occurrence probability determining unit, configured to determine the probability of occurrence of each character in the decoded string;
a character string preliminary probability computing unit, configured to multiply the probabilities of occurrence of the characters, or to add the logarithms of the probabilities of occurrence of the characters, to obtain the preliminary probability of the string corresponding to each character encoding mode; the logarithm takes the probability of occurrence of the character as its argument and has a base greater than 1;
a second character string occurrence probability computing unit, configured to calculate the probability of occurrence of the string corresponding to each character encoding mode: when the preliminary probability is obtained by multiplying the probabilities of occurrence of the characters, the preliminary probability of each string is multiplied by the probability of occurrence of the character encoding mode corresponding to that string to obtain the probability of occurrence of each string; when the preliminary probability is obtained by adding the logarithms of the probabilities of occurrence of the characters, the logarithm of the probability of occurrence of the character encoding mode corresponding to that string is added to the preliminary probability of each string to obtain the probability of occurrence of each string; wherein the probability of occurrence of each character encoding mode is obtained from statistics on the character encoding modes used by natural text in different regions.
CN201510107921.2A 2015-03-12 2015-03-12 A kind of recognition methods of text character codes mode and system Active CN104750666B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510107921.2A CN104750666B (en) 2015-03-12 2015-03-12 A kind of recognition methods of text character codes mode and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510107921.2A CN104750666B (en) 2015-03-12 2015-03-12 A kind of recognition methods of text character codes mode and system

Publications (2)

Publication Number Publication Date
CN104750666A true CN104750666A (en) 2015-07-01
CN104750666B CN104750666B (en) 2018-08-07

Family

ID=53590378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510107921.2A Active CN104750666B (en) 2015-03-12 2015-03-12 A kind of recognition methods of text character codes mode and system

Country Status (1)

Country Link
CN (1) CN104750666B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030128296A1 (en) * 2002-01-04 2003-07-10 Chulhee Lee Video display apparatus with separate display means for textual information
CN101034391A (en) * 2007-04-26 2007-09-12 北京立通无限科技有限公司 Method and apparatus for confirming text stream character set
CN101526963A (en) * 2009-04-17 2009-09-09 深圳华为通信技术有限公司 Method for identifying web page coding, device and terminal equipment
CN102194503A (en) * 2010-03-12 2011-09-21 腾讯科技(深圳)有限公司 Player and character code detection method and device for subtitle file
CN104360988A (en) * 2014-10-17 2015-02-18 北京锐安科技有限公司 Method and device for identifying coding mode of Chinese characters

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760364B (en) * 2016-02-22 2018-09-04 深圳市茁壮网络股份有限公司 A kind of character set detection method and device
CN105760364A (en) * 2016-02-22 2016-07-13 深圳市茁壮网络股份有限公司 Character set detection method and device
WO2017166430A1 (en) * 2016-03-28 2017-10-05 深圳Tcl新技术有限公司 Caption display method and device
CN106569939A (en) * 2016-10-28 2017-04-19 上海斐讯数据通信技术有限公司 Multilateral language analysis system and multilateral language analysis method for control script programs
CN106569939B (en) * 2016-10-28 2020-06-12 北京数科网维技术有限责任公司 Control script program multi-country character analysis system and multi-country character analysis method
CN108108267B (en) * 2016-11-25 2021-06-22 北京国双科技有限公司 Data recovery method and device
CN108108267A (en) * 2016-11-25 2018-06-01 北京国双科技有限公司 The restoration methods and device of data
CN109582930B (en) * 2017-09-29 2022-12-20 北京金山安全软件有限公司 Sliding input decoding method and device and electronic equipment
CN109582930A (en) * 2017-09-29 2019-04-05 北京金山安全软件有限公司 Sliding input decoding method and device and electronic equipment
CN108197087A (en) * 2018-01-18 2018-06-22 北京奇安信科技有限公司 Character code recognition methods and device
CN108197087B (en) * 2018-01-18 2021-11-16 奇安信科技集团股份有限公司 Character code recognition method and device
CN110704629A (en) * 2018-07-09 2020-01-17 北京京东尚科信息技术有限公司 Method and device for determining character set
CN110704629B (en) * 2018-07-09 2024-06-18 北京京东尚科信息技术有限公司 Method and device for determining character set
CN109359274A (en) * 2018-09-14 2019-02-19 阿里巴巴集团控股有限公司 The method, device and equipment that the character string of a kind of pair of Mass production is identified
CN109359274B (en) * 2018-09-14 2023-05-02 蚂蚁金服(杭州)网络技术有限公司 Method, device and equipment for identifying character strings generated in batch
CN113255398A (en) * 2020-02-10 2021-08-13 百度在线网络技术(北京)有限公司 Interest point duplicate determination method, device, equipment and storage medium
CN113255398B (en) * 2020-02-10 2023-08-18 百度在线网络技术(北京)有限公司 Point of interest weight judging method, device, equipment and storage medium
CN111679830A (en) * 2020-06-03 2020-09-18 中国银行股份有限公司 File coding format detection method and device

Also Published As

Publication number Publication date
CN104750666B (en) 2018-08-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant