CN104750663A - Identification method and device for text messy codes in page - Google Patents

Info

Publication number
CN104750663A
Authority
CN
China
Prior art keywords
text
coded format
characteristic information
page
mess code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310737443.4A
Other languages
Chinese (zh)
Other versions
CN104750663B (en)
Inventor
丁世远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Singapore Holdings Pte Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201310737443.4A
Publication of CN104750663A
Application granted
Publication of CN104750663B
Legal status: Active

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention provides a method and a device for identifying garbled text in a page. The method includes: obtaining a first encoding format of a first text to be identified in the page; converting the first text into a second text in a second encoding format according to the correspondence between characters of the second encoding format and characters of other encoding formats; converting the second text into a third text according to the correspondence between characters of the second encoding format and characters of the first encoding format; and determining, according to the third text and the first text, whether garbled text exists in the first text. The method and device do not require operating personnel to take part in the identification process, are simple to operate, and have a high accuracy rate, thereby improving the efficiency and reliability of garbled-text identification.

Description

Method and device for identifying garbled text in a page
[Technical Field]
The present application relates to World Wide Web (Web) page processing technology, and in particular to a method and a device for identifying garbled text in a page.
[Background]
A World Wide Web (Web) page may contain display blocks, called page elements, each composed of one or more HyperText Markup Language (HTML) tags, for example text, labels, hyperlinks, buttons, input boxes, and drop-down boxes. For reasons such as the way the Web page is parsed, text in the Web page may appear garbled. In the prior art, operating personnel have to inspect Web pages one by one to find out whether the text in a page is garbled.
However, this existing way of identifying garbled text takes a long time and is error-prone, which lowers both the efficiency and the reliability of identification.
[Summary of the Invention]
Various aspects of the present application provide a method and a device for identifying garbled text in a page, so as to improve the efficiency and reliability of garbled-text identification.
One aspect of the present application provides a method for identifying garbled text in a page, comprising:
obtaining a first encoding format of a first text to be identified in the page;
converting the first text into a second text according to the correspondence between characters of a second encoding format and characters of other encoding formats, the encoding format of the second text being the second encoding format;
converting the second text into a third text according to the correspondence between characters of the second encoding format and characters of the first encoding format; and
determining, according to the third text and the first text, whether garbled text exists in the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the second encoding format comprises the Unicode encoding format.
In the aspect above and any possible implementation thereof, an implementation is further provided in which determining, according to the third text and the first text, whether garbled text exists in the first text comprises:
comparing the third text with the first text; and
if the third text is inconsistent with the first text, determining that garbled text exists in the first text; or
if the third text is consistent with the first text, determining that no garbled text exists in the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which comparing the third text with the first text comprises:
extracting characteristic information of the third text and characteristic information of the first text;
comparing the characteristic information of the third text with the characteristic information of the first text; and
if the characteristic information of the third text differs from that of the first text, concluding that the third text is inconsistent with the first text; or
if the characteristic information of the third text is identical to that of the first text, concluding that the third text is consistent with the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the characteristic information comprises an MD5 value.
Another aspect of the present application provides a device for identifying garbled text in a page, comprising:
an acquiring unit, configured to obtain a first encoding format of a first text to be identified in the page;
a converting unit, configured to convert the first text into a second text according to the correspondence between characters of a second encoding format and characters of other encoding formats, the encoding format of the second text being the second encoding format;
the converting unit being further configured to convert the second text into a third text according to the correspondence between characters of the second encoding format and characters of the first encoding format; and
a determining unit, configured to determine, according to the third text and the first text, whether garbled text exists in the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the second encoding format comprises the Unicode encoding format.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the determining unit is specifically configured to:
compare the third text with the first text; and
if the third text is inconsistent with the first text, determine that garbled text exists in the first text; or
if the third text is consistent with the first text, determine that no garbled text exists in the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the determining unit is specifically configured to:
extract characteristic information of the third text and characteristic information of the first text;
compare the characteristic information of the third text with the characteristic information of the first text; and
if the characteristic information of the third text differs from that of the first text, conclude that the third text is inconsistent with the first text; or
if the characteristic information of the third text is identical to that of the first text, conclude that the third text is consistent with the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the characteristic information comprises an MD5 value.
As can be seen from the above technical solutions, the embodiments of the present application obtain the first encoding format of the first text to be identified in the page, convert the first text into a second text in the second encoding format according to the correspondence between characters of the second encoding format and characters of other encoding formats, and then convert the second text into a third text according to the correspondence between characters of the second encoding format and characters of the first encoding format. Whether garbled text exists in the first text can then be determined from the third text and the first text. Operating personnel do not need to take part in the identification process, operation is simple, and accuracy is high, which improves the efficiency and reliability of garbled-text identification.
[Brief Description of the Drawings]
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for identifying garbled text in a page according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a device for identifying garbled text in a page according to another embodiment of the present application.
[Detailed Description]
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
It can be understood that the page referred to in the present application may be a web page written in HyperText Markup Language (HTML), and may also be called a Web page.
It should be noted that the terminal referred to in the embodiments of the present application may include, but is not limited to, a mobile phone, a personal digital assistant (PDA), a wireless handheld device, a wireless netbook, a portable computer, a personal computer (PC), an MP3 player, an MP4 player, and the like.
In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" herein generally indicates an "or" relationship between the objects before and after it.
Fig. 1 is a schematic flowchart of a method for identifying garbled text in a page according to an embodiment of the present application. As shown in Fig. 1, the method proceeds as follows.
101. Obtain the first encoding format of the first text to be identified in the page.
The first encoding format may be any text encoding available in the prior art, for example GBK, UTF-8, or GB2312; this embodiment places no particular limitation on it.
GBK is one of the computer encoding standards for Chinese characters. Its full name is "Chinese Internal Code Specification". "GBK" is formed from the pinyin initials of "national standard" (GB) and "extension" (K), so it may also be called the Chinese character national standard extension code.
UTF is the abbreviation of "UCS Transformation Format", that is, a transformation format of the Unicode character set.
Optionally, in one possible implementation of this embodiment, in step 101 the first encoding format of the first text to be identified in the page may be obtained from information related to the page.
For example, from the META tag of the page, "<meta http-equiv="Content-Type" content="text/html; charset=gb2312">", the first encoding format of the first text to be identified in the page can be determined to be GB2312.
As another example, from the definition in the Cascading Style Sheets (CSS) file of the page, "@charset "UTF-8"", the first encoding format of the first text to be identified in the page can be determined to be UTF-8.
As yet another example, the first encoding format of the first text to be identified in the page may be obtained according to the website to which the page belongs. For instance, Baidu uses GB2312 encoding and Google uses UTF-8 encoding.
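The three ways of obtaining the first encoding format described above might be sketched in Python as follows. This is only an illustrative sketch: the function name declared_encoding, the two regular expressions, and the site_default parameter are assumptions made for the example and do not come from the application.

```python
import re

# Hypothetical patterns for this sketch; real pages may need a full HTML/CSS parser.
META_CHARSET = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE)
CSS_CHARSET = re.compile(rb'@charset\s+"([\w-]+)"', re.IGNORECASE)


def declared_encoding(raw, site_default=None):
    """Return the encoding name declared in raw page or CSS bytes, lowercased."""
    match = META_CHARSET.search(raw) or CSS_CHARSET.search(raw)
    if match:
        return match.group(1).decode("ascii").lower()
    # No declaration found: fall back to an encoding known for the page's website.
    return site_default
```

The raw bytes are searched rather than decoded text, because the encoding is not yet known at this point.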
102. Convert the first text into a second text according to the correspondence between characters of the second encoding format and characters of other encoding formats, the encoding format of the second text being the second encoding format.
Optionally, in one possible implementation of this embodiment, the second encoding format may include, but is not limited to, the Unicode encoding format. Unicode is variously rendered in Chinese as "universal code", "international code", "unified code", or "single code". It defines a unique code (that is, an integer) for each character rather than for each glyph, for example a unique binary code.
During the conversion, if a character in the first text has a corresponding character in the second encoding format, it is converted into that character. If a character in the first text has no corresponding character in the second encoding format, a preconfigured operation may be performed, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
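A minimal Python sketch of this first conversion is given below, assuming the first text is available as raw bytes and the second encoding format is Unicode (a Python str). The function name to_unicode and the parameter on_unmappable are hypothetical. The standard codec error handlers "ignore" and "replace" are used here to stand in for the two preconfigured operations mentioned above (discarding the character, or substituting a replacement).

```python
def to_unicode(first_text: bytes, first_encoding: str, on_unmappable: str = "replace") -> str:
    """Convert the first text (bytes in first_encoding) into the second text (Unicode).

    on_unmappable is passed to the codec as its error handler:
      "ignore"  drops bytes that have no Unicode counterpart;
      "replace" substitutes U+FFFD for them.
    """
    return first_text.decode(first_encoding, errors=on_unmappable)
```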
103. Convert the second text into a third text according to the correspondence between characters of the second encoding format and characters of the first encoding format.
During the conversion, if a character in the second text has a corresponding character in the first encoding format, it is converted into that character. If a character in the second text has no corresponding character in the first encoding format, a preconfigured operation may be performed, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
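The reverse conversion can be sketched in the same way, again assuming that the second text is a Python str and that the third text is produced as bytes in the first encoding format. The name from_unicode and its parameters are hypothetical and are not taken from the application.

```python
def from_unicode(second_text: str, first_encoding: str, on_unmappable: str = "replace") -> bytes:
    """Convert the second text (Unicode) into the third text (bytes in first_encoding).

    Characters with no counterpart in first_encoding are dropped ("ignore")
    or replaced with "?" ("replace"), depending on the error handler.
    """
    return second_text.encode(first_encoding, errors=on_unmappable)
```

Text that was valid in the first encoding format survives the two conversions unchanged, whereas a replacement or dropped character leaves a visible difference for the comparison in the next step.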
104. Determine, according to the third text and the first text, whether garbled text exists in the first text.
Optionally, in one possible implementation of this embodiment, in step 104 the third text may be compared with the first text. If the third text is inconsistent with the first text, it can be determined that garbled text exists in the first text; if the third text is consistent with the first text, it can be determined that no garbled text exists in the first text.
Specifically, many methods can be used to compare the two texts, that is, the third text and the first text.
For example, the two texts may be matched character by character to judge whether their characters are consistent one by one.
As another example, characteristic information of the third text and of the first text may be extracted, for example a Message Digest Algorithm 5 (MD5) value, and the characteristic information of the third text may then be compared with that of the first text. If the two are not identical, the third text is inconsistent with the first text; if the two are identical, the third text is consistent with the first text.
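Putting steps 102 to 104 together, both comparison strategies might look like the following Python sketch. The function names md5_hex and has_garbled_text, the use_md5 flag, and the choice of the "replace" error handler are assumptions made for illustration; hashlib.md5 is the standard-library MD5 implementation.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Characteristic information of a text: its MD5 digest in hexadecimal."""
    return hashlib.md5(data).hexdigest()


def has_garbled_text(first_text: bytes, first_encoding: str, use_md5: bool = False) -> bool:
    """Round-trip first_text through Unicode and report whether it changed."""
    second_text = first_text.decode(first_encoding, errors="replace")   # step 102
    third_text = second_text.encode(first_encoding, errors="replace")   # step 103
    if use_md5:
        # Compare characteristic information instead of the texts themselves.
        return md5_hex(third_text) != md5_hex(first_text)
    return third_text != first_text                                     # step 104
```

For instance, GBK bytes checked against a declared encoding of "gbk" round-trip unchanged, while the same bytes checked against "utf-8" do not. A check of this kind can only flag bytes that have no valid mapping under the declared encoding.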
It should be noted that steps 101 to 104 may be performed by an identification device, for example a Web page editor. The device may be located in a local client to perform offline identification, or in a network-side server to perform online identification; this embodiment places no limitation on this.
It can be understood that the client may be an application installed on a terminal, or a web page in a browser. Any form capable of implementing page processing is acceptable; this embodiment places no limitation on this.
The existing identification method requires operating personnel to inspect Web pages one by one to find out whether the text in a page is garbled. However, manually checking pages for garbled text tends to cause two problems.
First, efficiency is very low. For a website of even moderate size, the sub-pages alone number in the hundreds of thousands, and operating personnel cannot check them one by one.
Second, manual inspection easily misses garbled text in a page. For example, when a page contains a lot of text and only a little of it is garbled, operating personnel can hardly spot it with the naked eye.
With the technical solution provided by this embodiment, no operating personnel need to take part, operation is simple, and accuracy is high.
In this embodiment, the first encoding format of the first text to be identified in the page is obtained; the first text is converted into a second text in the second encoding format according to the correspondence between characters of the second encoding format and characters of other encoding formats; and the second text is converted into a third text according to the correspondence between characters of the second encoding format and characters of the first encoding format. Whether garbled text exists in the first text can then be determined from the third text and the first text. Operating personnel do not need to take part in the identification process, operation is simple, and accuracy is high, which improves the efficiency and reliability of garbled-text identification.
In addition, the technical solution provided by the present application can automatically identify garbled text in a page and offers good real-time performance.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of actions. However, a person skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in another order or simultaneously. A person skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.
Fig. 2 is a schematic structural diagram of a device for identifying garbled text in a page according to another embodiment of the present application. As shown in Fig. 2, the device of this embodiment may include an acquiring unit 21, a converting unit 22, and a determining unit 23. The acquiring unit 21 is configured to obtain the first encoding format of the first text to be identified in the page. The converting unit 22 is configured to convert the first text into a second text according to the correspondence between characters of the second encoding format and characters of other encoding formats, the encoding format of the second text being the second encoding format. The converting unit 22 is further configured to convert the second text into a third text according to the correspondence between characters of the second encoding format and characters of the first encoding format. The determining unit 23 is configured to determine, according to the third text and the first text, whether garbled text exists in the first text.
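The division into three units could be mirrored in code roughly as follows. This is a sketch under stated assumptions, not the device of the application: the class and method names, the charset regular expression, the default encoding, and the "replace" error handler are all hypothetical choices for illustration.

```python
import re


class AcquiringUnit:
    """Obtains the first encoding format from a charset declaration in the page."""
    _CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE)

    def first_encoding(self, page: bytes, default: str = "utf-8") -> str:
        match = self._CHARSET.search(page)
        return match.group(1).decode("ascii").lower() if match else default


class ConvertingUnit:
    """Performs both conversions; Unicode (str) plays the second encoding format."""

    def to_second(self, first_text: bytes, encoding: str) -> str:
        return first_text.decode(encoding, errors="replace")

    def to_third(self, second_text: str, encoding: str) -> bytes:
        return second_text.encode(encoding, errors="replace")


class DeterminingUnit:
    """Decides whether garbled text exists by comparing first and third texts."""

    def is_garbled(self, first_text: bytes, third_text: bytes) -> bool:
        return first_text != third_text


class GarbledTextRecognizer:
    """Wires the three units together, in the order of steps 101 to 104."""

    def __init__(self) -> None:
        self.acquiring = AcquiringUnit()
        self.converting = ConvertingUnit()
        self.determining = DeterminingUnit()

    def check(self, page: bytes, first_text: bytes) -> bool:
        encoding = self.acquiring.first_encoding(page)
        second_text = self.converting.to_second(first_text, encoding)
        third_text = self.converting.to_third(second_text, encoding)
        return self.determining.is_garbled(first_text, third_text)
```

Keeping the units separate means the comparison strategy (direct matching or an MD5 value) can be changed inside the determining unit without touching the other two.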
The first encoding format may be any text encoding available in the prior art, for example GBK, UTF-8, or GB2312; this embodiment places no particular limitation on it.
GBK is one of the computer encoding standards for Chinese characters. Its full name is "Chinese Internal Code Specification". "GBK" is formed from the pinyin initials of "national standard" (GB) and "extension" (K), so it may also be called the Chinese character national standard extension code.
UTF is the abbreviation of "UCS Transformation Format", that is, a transformation format of the Unicode character set.
Optionally, in one possible implementation of this embodiment, the acquiring unit 21 may obtain the first encoding format of the first text to be identified in the page from information related to the page.
For example, from the META tag of the page, "<meta http-equiv="Content-Type" content="text/html; charset=gb2312">", the acquiring unit 21 can determine that the first encoding format of the first text to be identified in the page is GB2312.
As another example, from the definition in the Cascading Style Sheets (CSS) file of the page, "@charset "UTF-8"", the acquiring unit 21 can determine that the first encoding format of the first text to be identified in the page is UTF-8.
As yet another example, the acquiring unit 21 may obtain the first encoding format of the first text to be identified in the page according to the website to which the page belongs. For instance, Baidu uses GB2312 encoding and Google uses UTF-8 encoding.
Optionally, in one possible implementation of this embodiment, the second encoding format may include, but is not limited to, the Unicode encoding format. Unicode is variously rendered in Chinese as "universal code", "international code", "unified code", or "single code". It defines a unique code (that is, an integer) for each character rather than for each glyph, for example a unique binary code.
Specifically, when the converting unit 22 performs the first conversion, if a character in the first text has a corresponding character in the second encoding format, it is converted into that character. If a character in the first text has no corresponding character in the second encoding format, a preconfigured operation may be performed, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
Specifically, when the converting unit 22 performs the second conversion, if a character in the second text has a corresponding character in the first encoding format, it is converted into that character. If a character in the second text has no corresponding character in the first encoding format, a preconfigured operation may be performed, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
Optionally, in one possible implementation of this embodiment, the determining unit 23 may be specifically configured to compare the third text with the first text. If the third text is inconsistent with the first text, it can be determined that garbled text exists in the first text; if the third text is consistent with the first text, it can be determined that no garbled text exists in the first text.
Specifically, the determining unit 23 may use many methods to compare the two texts, that is, the third text and the first text.
For example, the determining unit 23 may match the two texts character by character to judge whether their characters are consistent one by one.
As another example, the determining unit 23 may extract characteristic information of the third text and of the first text, for example a Message Digest Algorithm 5 (MD5) value, and then compare the characteristic information of the third text with that of the first text. If the two are not identical, the third text is inconsistent with the first text; if the two are identical, the third text is consistent with the first text.
It should be noted that the device for identifying garbled text in a page provided by this embodiment, for example a Web page editor, may be located in a local client to perform offline identification, or in a network-side server to perform online identification; this embodiment places no limitation on this.
It can be understood that the client may be an application installed on a terminal, or a web page in a browser. Any form capable of implementing page processing is acceptable; this embodiment places no limitation on this.
The existing identification approach requires operating personnel to inspect Web pages one by one to find out whether the text in a page is garbled. However, manually checking pages for garbled text tends to cause two problems.
First, efficiency is very low. For a website of even moderate size, the sub-pages alone number in the hundreds of thousands, and operating personnel cannot check them one by one.
Second, manual inspection easily misses garbled text in a page. For example, when a page contains a lot of text and only a little of it is garbled, operating personnel can hardly spot it with the naked eye.
With the technical solution provided by this embodiment, no operating personnel need to take part, operation is simple, and accuracy is high.
In this embodiment, the acquiring unit obtains the first encoding format of the first text to be identified in the page; the converting unit converts the first text into a second text in the second encoding format according to the correspondence between characters of the second encoding format and characters of other encoding formats, and then converts the second text into a third text according to the correspondence between characters of the second encoding format and characters of the first encoding format. The determining unit can then determine, according to the third text and the first text, whether garbled text exists in the first text. Operating personnel do not need to take part in the identification process, operation is simple, and accuracy is high, which improves the efficiency and reliability of garbled-text identification.
In addition, the technical solution provided by the present application can automatically identify garbled text in a page and offers good real-time performance.
Those skilled in the art can be well understood to, and for convenience and simplicity of description, the system of foregoing description, the specific works process of device and unit, with reference to the corresponding process in preceding method embodiment, can not repeat them here.
In several embodiments that the application provides, should be understood that, disclosed system, apparatus and method, can realize by another way.Such as, device embodiment described above is only schematic, such as, the division of described unit, be only a kind of logic function to divide, actual can have other dividing mode when realizing, such as multiple unit or assembly can in conjunction with or another system can be integrated into, or some features can be ignored, or do not perform.Another point, shown or discussed coupling each other or direct-coupling or communication connection can be by some interfaces, and the indirect coupling of device or unit or communication connection can be electrical, machinery or other form.
The described unit illustrated as separating component or can may not be and physically separates, and the parts as unit display can be or may not be physical location, namely can be positioned at a place, or also can be distributed in multiple network element.Some or all of unit wherein can be selected according to the actual needs to realize the object of the present embodiment scheme.
In addition, each functional unit in each embodiment of the application can be integrated in a processing unit, also can be that the independent physics of unit exists, also can two or more unit in a unit integrated.Above-mentioned integrated unit both can adopt the form of hardware to realize, and the form that hardware also can be adopted to add SFU software functional unit realizes.
The above-mentioned integrated unit realized with the form of SFU software functional unit, can be stored in a computer read/write memory medium.Above-mentioned SFU software functional unit is stored in a storage medium, comprising some instructions in order to make a computer equipment (can be personal computer, server, or the network equipment etc.) or processor (processor) perform the part steps of method described in each embodiment of the application.And aforesaid storage medium comprises: USB flash disk, portable hard drive, ROM (read-only memory) (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disc or CD etc. various can be program code stored medium.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (10)

1. A method for identifying garbled text (messy codes) in a page, characterized by comprising:
obtaining a first encoding format of a first text to be identified in the page;
converting the first text into a second text according to a correspondence between characters of a second encoding format and characters of other encoding formats, wherein the encoding format of the second text is the second encoding format;
converting the second text into a third text according to a correspondence between the characters of the second encoding format and characters of the first encoding format; and
determining, according to the third text and the first text, whether garbled text exists in the first text.
2. The method according to claim 1, characterized in that the second encoding format comprises the Unicode encoding format.
3. The method according to claim 1, characterized in that determining, according to the third text and the first text, whether garbled text exists in the first text comprises:
comparing the third text with the first text; and
if the third text is inconsistent with the first text, determining that garbled text exists in the first text; or
if the third text is consistent with the first text, determining that no garbled text exists in the first text.
4. The method according to claim 3, characterized in that comparing the third text with the first text comprises:
extracting characteristic information of the third text and characteristic information of the first text;
comparing the characteristic information of the third text with the characteristic information of the first text; and
if the characteristic information of the third text differs from the characteristic information of the first text, the third text is inconsistent with the first text; or
if the characteristic information of the third text is identical to the characteristic information of the first text, the third text is consistent with the first text.
5. The method according to any one of claims 1 to 4, characterized in that the characteristic information comprises an MD5 value.
6. A device for identifying garbled text (messy codes) in a page, characterized by comprising:
an acquiring unit, configured to obtain a first encoding format of a first text to be identified in the page;
a converting unit, configured to convert the first text into a second text according to a correspondence between characters of a second encoding format and characters of other encoding formats, wherein the encoding format of the second text is the second encoding format;
the converting unit being further configured to convert the second text into a third text according to a correspondence between the characters of the second encoding format and characters of the first encoding format; and
a determining unit, configured to determine, according to the third text and the first text, whether garbled text exists in the first text.
7. The device according to claim 6, characterized in that the second encoding format comprises the Unicode encoding format.
8. The device according to claim 6, characterized in that the determining unit is specifically configured to:
compare the third text with the first text; and
if the third text is inconsistent with the first text, determine that garbled text exists in the first text; or
if the third text is consistent with the first text, determine that no garbled text exists in the first text.
9. The device according to claim 8, characterized in that the determining unit is specifically configured to:
extract characteristic information of the third text and characteristic information of the first text;
compare the characteristic information of the third text with the characteristic information of the first text; and
if the characteristic information of the third text differs from the characteristic information of the first text, the third text is inconsistent with the first text; or
if the characteristic information of the third text is identical to the characteristic information of the first text, the third text is consistent with the first text.
10. The device according to any one of claims 6 to 9, characterized in that the characteristic information comprises an MD5 value.
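
As a non-limiting illustration of the method of claims 1 to 5, the round-trip check might be sketched in Python as follows. The function names `has_garbled_text` and `md5_hex`, the treatment of the first text as raw bytes, and the `errors="replace"` policy are assumptions made for this sketch and are not taken from the application.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return an MD5 digest, used here as the 'characteristic information'."""
    return hashlib.md5(data).hexdigest()


def has_garbled_text(first_text: bytes, first_encoding: str) -> bool:
    """Round-trip the first text through Unicode and compare MD5 values.

    first_text     -- raw bytes of the text to be identified (assumed input form)
    first_encoding -- the first encoding format obtained for the page
    """
    # First text -> second text (Unicode). Bytes that the first encoding
    # cannot decode become U+FFFD instead of raising an exception.
    second_text = first_text.decode(first_encoding, errors="replace")

    # Second text -> third text, back in the first encoding. Characters the
    # first encoding cannot represent are replaced rather than raising.
    third_text = second_text.encode(first_encoding, errors="replace")

    # Differing characteristic information means the round trip was lossy,
    # which this sketch treats as garbled text being present.
    return md5_hex(third_text) != md5_hex(first_text)


if __name__ == "__main__":
    gbk_bytes = "中文".encode("gbk")
    print(has_garbled_text(gbk_bytes, "gbk"))    # encoding matches the bytes
    print(has_garbled_text(gbk_bytes, "utf-8"))  # encoding does not match
```

Under these assumptions, GBK-encoded bytes declared as GBK pass the check, while the same bytes declared as UTF-8 are flagged. A byte sequence that happens to be valid in the declared encoding would pass regardless, so the sketch only catches bytes that the declared encoding cannot represent.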
CN201310737443.4A 2013-12-27 2013-12-27 Identification method and device for text messy codes in page Active CN104750663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310737443.4A CN104750663B (en) 2013-12-27 2013-12-27 Identification method and device for text messy codes in page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310737443.4A CN104750663B (en) 2013-12-27 2013-12-27 Identification method and device for text messy codes in page

Publications (2)

Publication Number Publication Date
CN104750663A true CN104750663A (en) 2015-07-01
CN104750663B CN104750663B (en) 2019-05-28

Family

ID=53590375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310737443.4A Active CN104750663B (en) 2013-12-27 2013-12-27 Identification method and device for text messy codes in page

Country Status (1)

Country Link
CN (1) CN104750663B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279247A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Expression library generation method and device
CN106598689A (en) * 2016-12-20 2017-04-26 绿金在线电子商务有限公司 Universal Chinese coding method
CN108271041A (en) * 2016-12-30 2018-07-10 北京国双科技有限公司 Mess code treating method and apparatus
CN110728115A (en) * 2018-07-17 2020-01-24 珠海金山办公软件有限公司 Disordered code identification method and device for document content and electronic equipment
CN111259628A (en) * 2020-02-18 2020-06-09 北京金堤科技有限公司 Webpage information extraction method and device, electronic equipment and storage medium
CN113595683A (en) * 2021-07-07 2021-11-02 西安震有信通科技有限公司 Conversion processing method, device, terminal and medium based on various encoding files
CN115348232A (en) * 2022-08-10 2022-11-15 中国建设银行股份有限公司 Decoding method, apparatus, electronic device, medium, and product

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101110072A (en) * 2007-08-21 2008-01-23 无敌科技(西安)有限公司 Device and method for automatic identifying literal code
CN101551792A (en) * 2008-04-03 2009-10-07 鸿富锦精密工业(深圳)有限公司 Messy code recovery system and method
JP2010128672A (en) * 2008-11-26 2010-06-10 Kyocera Corp Electronic apparatus and character conversion method
CN103150293A (en) * 2011-12-06 2013-06-12 富泰华工业(深圳)有限公司 Electronic device with messy code recovery function and messy code recovery method

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279247A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Expression library generation method and device
CN106598689A (en) * 2016-12-20 2017-04-26 绿金在线电子商务有限公司 Universal Chinese coding method
CN108271041A (en) * 2016-12-30 2018-07-10 北京国双科技有限公司 Mess code treating method and apparatus
CN108271041B (en) * 2016-12-30 2021-01-22 北京国双科技有限公司 Method and device for processing messy codes
CN110728115A (en) * 2018-07-17 2020-01-24 珠海金山办公软件有限公司 Disordered code identification method and device for document content and electronic equipment
CN110728115B (en) * 2018-07-17 2024-01-26 珠海金山办公软件有限公司 Document content messy code identification method and device and electronic equipment
CN111259628A (en) * 2020-02-18 2020-06-09 北京金堤科技有限公司 Webpage information extraction method and device, electronic equipment and storage medium
CN113595683A (en) * 2021-07-07 2021-11-02 西安震有信通科技有限公司 Conversion processing method, device, terminal and medium based on various encoding files
CN115348232A (en) * 2022-08-10 2022-11-15 中国建设银行股份有限公司 Decoding method, apparatus, electronic device, medium, and product
CN115348232B (en) * 2022-08-10 2024-04-19 中国建设银行股份有限公司 Decoding method, decoding device, electronic equipment, medium and product

Also Published As

Publication number Publication date
CN104750663B (en) 2019-05-28

Similar Documents

Publication Publication Date Title
CN104750663A (en) Identification method and device for text messy codes in page
TWI636452B (en) Method and system of voice recognition
US11055373B2 (en) Method and apparatus for generating information
CN112015430A (en) JavaScript code translation method and device, computer equipment and storage medium
CN101526963A (en) Method for identifying web page coding, device and terminal equipment
CN102436454A (en) Input method switching method and system for browser
CN104866509A (en) Page element positioning method and device
CN104063401A (en) Webpage style address merging method and device
CN112269862B (en) Text role labeling method, device, electronic equipment and storage medium
CN110704608A (en) Text theme generation method and device and computer equipment
CN111460835B (en) Auxiliary translation method and device and electronic equipment
CN113657088A (en) Interface document analysis method and device, electronic equipment and storage medium
CN112906361A (en) Text data labeling method and device, electronic equipment and storage medium
CN111159394A (en) Text abstract generation method and device
CN113177407A (en) Data dictionary construction method and device, computer equipment and storage medium
CN112989043A (en) Reference resolution method and device, electronic equipment and readable storage medium
CN117195886A (en) Text data processing method, device, equipment and medium based on artificial intelligence
CN109710634B (en) Method and device for generating information
CN102063416A (en) Method and system for embedding double-byte fonts into PDF file
CN113886748A (en) Method, device and equipment for generating editing information and outputting information of webpage content
CN113742501A (en) Information extraction method, device, equipment and medium
CN111401009A (en) Digital expression symbol recognition conversion method, device, server and storage medium
CN112329434A (en) Text information identification method and device, electronic equipment and storage medium
CN115965018B (en) Training method of information generation model, information generation method and device
CN105353948A (en) Information processing method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240402

Address after: Singapore

Patentee after: Alibaba Singapore Holdings Ltd.

Country or region after: Singapore

Address before: P.O. Box 847, Fourth Floor, Capital Building, Grand Cayman, Cayman Islands

Patentee before: ALIBABA GROUP HOLDING Ltd.

Country or region before: Cayman Islands
