WO2020082562A1 - Symbol identification method, apparatus, device, and storage medium - Google Patents

Symbol identification method, apparatus, device, and storage medium

Info

Publication number
WO2020082562A1
Authority
WO
WIPO (PCT)
Prior art keywords
preset
character
dictionary
target
word segmentation
Prior art date
Application number
PCT/CN2018/122832
Other languages
French (fr)
Chinese (zh)
Inventor
周罡
王彬
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020082562A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/24: Character recognition characterised by the processing or recognition method
    • G06V30/248: Character recognition characterised by the processing or recognition method involving plural approaches, e.g. verification by template match; Resolving confusion among similar patterns, e.g. "O" versus "Q"
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/19: Recognition using electronic means
    • G06V30/196: Recognition using electronic means using sequential comparisons of the image signals with a plurality of references
    • G06V30/1983: Syntactic or structural pattern recognition, e.g. symbolic string recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Definitions

  • the present application relates to the field of text recognition technology, and in particular, to a character recognition method, device, equipment, and storage medium.
  • Optical Character Recognition (OCR) mainly uses electronic equipment, such as a scanner or a digital camera, to examine characters printed on paper, determines their shapes by detecting dark and bright patterns, and then uses a character recognition method to translate the shapes into computer text.
  • for printed characters, the text in the paper document is converted optically into a black-and-white dot-matrix image file, and recognition software converts the text in the image into a text format for further editing and processing by word processing software.
  • however, when probability and statistics methods are used during character recognition, the recognition speed is often low.
  • the main purpose of this application is to propose a character recognition method, device, equipment and storage medium, aiming to improve the efficiency of text recognition.
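  • the character recognition method includes the following steps:
  • the text to be recognized is obtained;
  • a word segmentation tool pre-stored in the first preset area is called, and the text to be recognized is divided into a plurality of reference characters of a preset length by the word segmentation tool;
  • the reference character divided by the word segmentation tool is obtained, the corresponding preset dictionary is searched for in the second preset area according to the target length of the reference character, and it is determined whether the reference character is stored in the preset dictionary;
  • when the reference character is not stored in the preset dictionary, the reference character that is not stored is filtered by a fuzzy matching algorithm to obtain a target character, and the target character is displayed.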
  • the present application also proposes a character recognition device, the character recognition device includes:
  • an acquisition module for acquiring the text to be recognized;
  • a calling module for calling a word segmentation tool pre-stored in the first preset area, and dividing the text to be recognized into a plurality of reference characters of a preset length by the word segmentation tool;
  • a searching module for obtaining the reference character divided by the word segmentation tool, searching for the corresponding preset dictionary in the second preset area according to the target length of the reference character, and determining whether the reference character is stored in the preset dictionary;
  • a filtering module for filtering, when the reference character is not stored in the preset dictionary, the reference character that is not stored by the fuzzy matching algorithm to obtain a target character, and displaying the target character.
  • the present application also proposes a device including: a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, the computer-readable instructions being configured to implement the steps of the character recognition method as described above.
  • the present application also proposes a storage medium having computer-readable instructions stored on it, which, when executed by a processor, implement the steps of the character recognition method described above.
  • in this application, the text to be recognized is acquired and the word segmentation tool is called, so that the word segmentation tool divides the text to be recognized into a plurality of characters of a preset length, and the corresponding preset dictionary is looked up according to that length to determine whether each character is stored in it. A character that is not stored in the preset dictionary indicates a recognition abnormality. In this case, the unstored character is screened by the fuzzy matching algorithm to obtain a target character, so that text recognition is achieved through the fuzzy matching algorithm and the efficiency of text recognition is improved.
  • FIG. 1 is a schematic diagram of a device structure of a hardware operating environment involved in an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a first embodiment of a character recognition method of the present application
  • FIG. 3 is a schematic flowchart of a second embodiment of a character recognition method of this application.
  • FIG. 4 is a schematic flowchart of a third embodiment of a character recognition method of this application.
  • FIG. 5 is a schematic diagram of function modules of the first embodiment of the character recognition device of the present application.
  • FIG. 1 is a schematic diagram of a device structure of a hardware operating environment involved in an embodiment of the present application.
  • the device may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
  • the communication bus 1002 is used to implement connection communication between these components.
  • the user interface 1003 may include a display and an input unit such as keys, and may optionally also include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory or a stable memory (non-volatile memory), such as disk storage.
  • the memory 1005 may optionally be a storage device independent of the foregoing processor 1001.
  • the device structure shown in FIG. 1 does not constitute a limitation on the device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
  • the memory 1005 as a storage medium may include an operating system, a network communication module, a user interface module, and computer-readable instructions.
  • the network interface 1004 is mainly used to connect to an external network and perform data communication with other network devices;
  • the user interface 1003 is mainly used to connect to user devices and perform data communication with the device;
  • the device of this application uses the processor 1001 to invoke the computer-readable instructions stored in the memory 1005 and to execute the character recognition method provided by the embodiments of the present application.
  • FIG. 2 is a schematic flowchart of a first embodiment of a character recognition method of the present application.
  • the character recognition method includes the following steps:
  • Step S10: obtain the text to be recognized.
  • the historical recognition text is first obtained through OCR, and the historical recognition text is used as the text to be recognized.
  • the document to be recognized is input into the computer through an input device.
  • the input device can be a scanner or another device that achieves the same function.
  • the inclination angle of the document is measured, the layout of the document is analyzed, and the selected text field is analyzed.
  • typesetting is confirmed, text lines are divided in horizontal and vertical layouts, the text images in each line are separated, and punctuation marks are distinguished, so that the images are preprocessed; each processed text image is then handed over to the recognition module for recognition.
  • the layout analysis is the overall analysis of the text image: it sorts out all the text blocks in the document and distinguishes the text paragraphs, their typesetting order, and the areas occupied by images and tables.
  • the domain boundary of each text block (the start and end coordinates of the domain in the image), the attributes within the domain (horizontal or vertical layout), and the connection relationships between text blocks are provided as a data structure to the recognition module for automatic recognition; text areas are recognized directly, table areas undergo dedicated table analysis and recognition, and image areas are compressed or simply stored.
  • line segmentation is the process of cutting a large image into lines and then separating individual characters from the image lines.
  • the text images sorted out from the scanned document are converted by the computer into the standard codes of the text.
  • the strokes, feature points, projection information, and regional distribution of points of the text are analyzed to provide the top-10 results for each character recognized in the text, and the top-1 result is selected from them as the basic text.
  • for example, the recognition result "I am a person from Zhongyuan" is the basic text, and the basic text is used as the text to be recognized, thereby achieving the initial recognition of the document.
  • Step S20: a word segmentation tool pre-stored in the first preset area is invoked, and the word segmentation tool is used to divide the text to be recognized into a plurality of reference characters of a preset length.
  • a word segmentation tool is provided, and the text to be recognized is analyzed with this tool.
  • the word segmentation tool may be jieba, SnowNLP, THULAC, NLPIR, or another word segmentation tool, which is not limited in this embodiment.
  • the word segmentation tool is used to divide the text to be recognized into phrases of a preset word length. For example, the tool divides "I am a person from Zhongyuan" into "I", "am" and "Zhongyuan person", or into "I am", "Zhongyuan" and "person".
  • the preset length may be the number of words; for example, "I am" is a character with a length of 2 and "person" is a character with a length of 1, so that different segmentation rules can be applied and the precision of word segmentation is improved.
  • the phrases with a length of 2, that is, "I am" and "Zhongyuan", are listed so that these phrases can be analyzed; phrases that meet other rules may also be listed.
  • this embodiment does not limit this.
  • the text to be recognized is divided into phrases with a length of 2, thereby improving the efficiency of text recognition.
  • Step S30: obtain the reference character divided by the word segmentation tool, search for the corresponding preset dictionary in the second preset area according to the target length of the reference character, and determine whether the reference character is stored in the preset dictionary.
  • the reference characters are the phrases produced by the word segmentation tool; for example, "I am a person from Zhongyuan" is divided into phrases such as "I am", "Zhongyuan" and "person". The first preset area and the second preset area distinguish the storage address of the word segmentation tool from the storage address of the preset dictionary.
  • the preset dictionary is a dictionary classified according to a preset field, for example, a dictionary of word length 2, a dictionary of word length 3, and so on. The dictionary of word length 2 contains phrases such as "China", and the dictionary of word length 3 contains phrases such as "Chinese person", so that commonly used phrases are classified and managed by word length.
  • the preset dictionary can be used to check whether the target phrase after word segmentation is a common phrase.
  • the phrases with a length of 2 after word segmentation include "I am" and "Zhongyuan".
  • "I am" and "Zhongyuan" are looked up in the dictionary of length 2; a phrase that is not found indicates a recognition abnormality. For example, if "Zhongyuan" is not found but "I am" is found, the recognition of "I am" is normal and the recognition of "Zhongyuan" is abnormal.
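  • the lookup in step S30 can be illustrated with a minimal sketch, assuming each preset dictionary is an in-memory set keyed by phrase length. The function names and the sample phrases are hypothetical; "中园" is only an assumed stand-in for the misrecognized "Zhongyuan" of the example.

```python
from collections import defaultdict


def build_dictionaries(phrases):
    """Group known phrases into one dictionary (a set) per phrase length."""
    by_length = defaultdict(set)
    for phrase in phrases:
        by_length[len(phrase)].add(phrase)
    return dict(by_length)


def is_stored(reference, dictionaries):
    """Return True if the reference character is in the dictionary for its length."""
    return reference in dictionaries.get(len(reference), set())


# Hypothetical common phrases, not taken from the application.
dictionaries = build_dictionaries(["我是", "中国", "中国人"])

for reference in ["我是", "中园"]:
    status = "normal" if is_stored(reference, dictionaries) else "abnormal"
    print(reference, status)
```

  • a reference character reported as abnormal is the one that would be passed on to the fuzzy matching of step S40.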
  • Step S40: when the reference character is not stored in the preset dictionary, the reference character that is not stored is filtered by a fuzzy matching algorithm to obtain a target character, and the target character is displayed.
  • the characters that are not stored are screened by a fuzzy matching algorithm, namely the BK-tree (Burkhard-Keller tree) algorithm proposed by Burkhard and Keller.
  • the fuzzy matching algorithm relies on the edit distance between two strings, that is, the minimum number of editing operations required to convert one string into the other. The smaller the edit distance, the more similar the two strings; when the edit distance is 0, the two strings are equal, and character recognition is achieved on this basis.
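  • the edit distance described above can be sketched as follows, assuming the usual Levenshtein definition in which insertion, deletion, and replacement each count as one operation. The function name is our own; the application does not give an implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # replace, or keep when equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("FAME", "GATE"))
print(edit_distance("GAME", "ACM"))
```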
  • in this embodiment, the text to be recognized is acquired and the word segmentation tool is invoked, so that the word segmentation tool divides the text to be recognized into a plurality of characters of a preset length, and the corresponding preset dictionary is found according to that length to determine whether each character is stored in it. A character that is not stored in the preset dictionary indicates a recognition abnormality; in this case, the unstored character is screened by the fuzzy matching algorithm to obtain a target character, so that text recognition is achieved through the fuzzy matching algorithm and the efficiency of text recognition is improved.
  • before step S20, the method further includes:
  • Step S201: receive a tool writing instruction, extract the word segmentation tool and the word segmentation writing address information from the tool writing instruction, and write the word segmentation tool into the first preset area according to the word segmentation writing address information and save it.
  • the word segmentation tool is first written into the preset area, and after the text to be recognized is obtained, the word segmentation tool in the preset area is called to process the text to be recognized.
  • the word segmentation tool may be a small program or another form of word segmentation tool, which is not limited in this embodiment.
  • the tool writing instruction may be issued through a writing platform interface or through a data serial port, which is not limited in this embodiment.
  • step S20 includes:
  • Step S202: a word segmentation tool pre-stored in the first preset area is called, the word segmentation tool compares the text to be recognized with keywords of each preset length, and each keyword found in the text to be recognized is extracted according to the comparison result.
  • the word segmentation tool may be provided with various keywords, and each keyword in the text to be recognized is identified by comparing the text with each keyword. For example, when the text to be recognized "Wuhan scenery is good" is segmented, it is compared with each keyword to obtain the keywords "Wuhan", "scenery" and "good", which completes the processing of the text to be recognized.
  • the word segmentation tool is written in advance according to the write instruction and is then used to segment the text to be recognized, thereby achieving more detailed text recognition.
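  • the keyword comparison in step S202 can be sketched with forward maximum matching, a common dictionary-based segmentation technique that we substitute here for illustration. The application does not say that its tool works this way, and the function name, keyword set, and Chinese sample text (an assumed rendering of "Wuhan scenery is good") are hypothetical.

```python
def segment(text, keywords):
    """Split text by repeatedly taking the longest keyword that starts at the cursor."""
    max_len = max((len(k) for k in keywords), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a keyword matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in keywords:
                tokens.append(piece)
                i += size
                break
    return tokens


keywords = {"武汉", "风景", "好"}
print(segment("武汉风景好", keywords))
```

  • characters that match no keyword fall out as single-character tokens, which a later dictionary lookup can then treat like any other reference character.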
  • a third embodiment of the character recognition method of the present application is proposed based on the first embodiment or the second embodiment.
  • the description here is based on the first embodiment; before step S30, the method further includes:
  • Step S301: receive a dictionary writing instruction, extract the preset dictionary and the dictionary writing address information from the dictionary writing instruction, and write the preset dictionary into the second preset area according to the dictionary writing address information.
  • the preset dictionary needs to be written first: the write instruction is received, the preset dictionary in the write instruction is extracted, and the preset dictionary is saved in a preset area. Because the word segmentation tool was saved earlier, the storage address of the word segmentation tool and the storage address of the preset dictionary can be placed in different areas and labeled with different identifiers, that is, distinguished as the first preset area and the second preset area, to achieve effective data management.
  • step S30 includes:
  • Step S302: obtain the reference character divided by the word segmentation tool, and search a preset address relationship mapping table for the corresponding storage address according to the target length of the reference character.
  • the storage address is the storage address of a preset dictionary.
  • multiple dictionaries are stored in the database, such as a dictionary of length 2 and a dictionary of length 3; other types of dictionaries are also stored.
  • the dictionaries can be stored at different storage addresses, and the preset address relationship mapping table is built from the correspondence between each storage address and the word length of its dictionary, so that the address of the corresponding dictionary can be found from the character length. For example, when the reference character length is 2, the address of the dictionary of length 2 is looked up in the preset address relationship mapping table, which achieves effective address management.
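  • the address relationship mapping table of step S302 can be sketched as a two-level lookup, assuming the table maps a word length to a storage address and the storage maps that address to a dictionary. The address strings, variable names, and dictionary contents are hypothetical placeholders.

```python
# Hypothetical storage: each address holds one preset dictionary.
STORAGE = {
    "addr_len2": {"我是", "中国"},
    "addr_len3": {"中国人"},
}

# Hypothetical preset address relationship mapping table: word length -> address.
ADDRESS_TABLE = {2: "addr_len2", 3: "addr_len3"}


def find_dictionary(reference):
    """Resolve the dictionary for a reference character through the address table."""
    address = ADDRESS_TABLE.get(len(reference))
    if address is None:
        return None  # no dictionary is registered for this length
    return STORAGE.get(address)


print(find_dictionary("中园"))
print(find_dictionary("中"))
```

  • keeping the indirection in one table means a dictionary can be moved to another address without touching the lookup code.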
  • Step S303: search for the corresponding preset dictionary in the preset area according to the storage address, extract the characteristic information of the reference character, compare it with the characteristic information of the characters in the dictionary found, and judge from the comparison result whether the reference character is stored in the dictionary.
  • the characteristic information may be the area distribution of the points of the reference character, the geometric distribution of each point, or another form of characteristic information, which is not limited in this embodiment.
  • step S40 includes:
  • Step S401: when the reference character is not stored in the preset dictionary, a target character whose edit distance is less than the target length of the reference character is found in the preset dictionary through the fuzzy matching algorithm, and the target character is displayed.
  • the BK-tree algorithm is used to find words whose edit distance is not greater than the length of the word. For example, if "Zhongyuan" is not in the dictionary, the word found in the BK-tree within that edit distance may be "China". Here the edit distance from string A to string B is the minimum number of steps required to change A into B. For example, FAME to GATE takes two steps (two replacements), and GAME to ACM takes three steps (deleting G and E and adding C). The filtered "China" is displayed as the target character, so that text recognition is achieved through the fuzzy matching algorithm and the accuracy of text recognition is improved.
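  • a BK-tree search of the kind named above can be sketched as follows. This is a minimal illustration under our own assumptions: the class name, the sample dictionary, and the threshold of "edit distance less than the word length" (taken from step S401) are choices made for the example, not the application's implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance: insert, delete, and replace each cost one step."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


class BKTree:
    """Each node is (word, children); children are keyed by distance to the node."""

    def __init__(self):
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return  # word is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, max_distance):
        """Return sorted (distance, word) pairs within max_distance of query."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = edit_distance(query, word)
            if d <= max_distance:
                results.append((d, word))
            # Triangle inequality: only these subtrees can contain matches.
            for k, child in children.items():
                if d - max_distance <= k <= d + max_distance:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for phrase in ["中国", "中间", "我是", "人民"]:  # hypothetical length-2 dictionary
    tree.add(phrase)

query = "中园"  # assumed stand-in for the misrecognized "Zhongyuan"
print(tree.search(query, max_distance=len(query) - 1))
```

  • the pruning test in `search` is what distinguishes the tree from a linear scan: subtrees whose edge label lies outside the window around the current distance are never visited.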
  • the method further includes: establishing an initial recognition list of each initial recognition character in the text to be recognized, and the step S401 includes:
  • Step S402: when the reference character does not exist in the preset dictionary, a target character whose edit distance is less than the target length of the reference character is found in the preset dictionary through the fuzzy matching algorithm.
  • the text image that is sorted out from the scanned text is converted into the standard code of the text by the computer.
  • the strokes, feature points, and projection information of the text are analyzed to provide the top-10 results for each character recognized in the text, and the top-10 results of each character are established as the initial recognition list corresponding to that character.
  • Step S403: judge the number of target characters; when there is more than one, judge whether each target character exists in the initial recognition list, and display the target characters whose characters exist in the initial recognition list.
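  • one possible reading of step S403 is sketched below, assuming a target character is kept only when every one of its characters appears in the initial recognition list of the corresponding position. This interpretation, the function name, and the shortened candidate lists are our assumptions; the application does not spell out the comparison.

```python
def filter_by_initial_lists(targets, initial_lists):
    """Keep targets whose characters all appear in the per-position candidate lists."""
    kept = []
    for word in targets:
        if len(word) != len(initial_lists):
            continue  # cannot be aligned position by position
        if all(ch in options for ch, options in zip(word, initial_lists)):
            kept.append(word)
    return kept


# Hypothetical fuzzy-matching output for one misrecognized reference character.
targets = ["中国", "中间"]

# Hypothetical OCR candidates per position (shortened from the top-10 lists).
initial_lists = [["中", "申"], ["园", "国", "囤"]]

print(filter_by_initial_lists(targets, initial_lists))
```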
  • the solution provided by this embodiment adds fuzzy matching to text recognition: it finds similar characters according to the edit distance and uses the selected characters as the target characters, thereby improving the accuracy of text recognition.
  • the present application further provides a character recognition device.
  • FIG. 5 is a schematic diagram of functional modules of a first embodiment of a character recognition device of the present application.
  • the character recognition device includes:
  • the obtaining module 10 is configured to obtain the text to be recognized.
  • the calling module 20 is configured to call a word segmentation tool pre-stored in the first preset area, and use the word segmentation tool to divide the text to be recognized into a plurality of reference characters of a preset length.
  • the searching module 30 is configured to obtain the reference characters divided by the word segmentation tool, search for the corresponding preset dictionary in the second preset area according to the target length of the reference character, and determine whether the reference character is stored in the preset dictionary.
  • the filtering module 40 is configured to filter the non-existing reference characters through the fuzzy matching algorithm to obtain the target characters and display the target characters when the reference characters are not stored in the preset dictionary.
  • the present application also proposes a device including: a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, the computer-readable instructions being configured to implement the steps of the character recognition method as described above.
  • an embodiment of the present application further provides a storage medium, and the computer-readable storage medium may be a non-volatile readable storage medium.
  • the storage medium of the present application stores computer readable instructions, and the computer readable instructions are executed by the processor to perform the steps of the character recognition method as described above.
  • for the method implemented when the computer-readable instructions are executed, refer to the embodiments of the character recognition method of this application; details are not described herein again.
  • the methods in the above embodiments can be implemented by means of software plus a necessary general hardware platform, and can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions that enable an intelligent terminal device (which can be a mobile phone, computer, terminal device, air conditioner, or network terminal device, etc.) to execute the method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Discrimination (AREA)

Abstract

A big data processing-based symbol identification method, apparatus, device, and storage medium, the method comprising: acquiring a text to be identified (S10); calling a pre-stored word segmenting tool from a first preset area so that the word segmenting tool divides the text to be identified into a plurality of reference symbols of a preset length (S20); according to a target length for the reference symbols, looking for a corresponding preset dictionary in a second preset area, and determining whether the reference symbols are present in the preset dictionary (S30); and when the reference symbols are not present in the preset dictionary, screening a target symbol from the reference symbols that are not present by means of a fuzzy match algorithm (S40).
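
The four steps in the abstract can be tied together in a minimal end-to-end sketch. Everything here is assumed for illustration: the fixed-width split stands in for a real word segmentation tool, a linear scan with a threshold of "edit distance less than the phrase length" stands in for the fuzzy match, and the names and sample data are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Recursive Levenshtein distance, adequate for the short phrases used here."""
    if not a or not b:
        return len(a) + len(b)
    return min(
        edit_distance(a[1:], b) + 1,                     # delete from a
        edit_distance(a, b[1:]) + 1,                     # insert into a
        edit_distance(a[1:], b[1:]) + (a[0] != b[0]),    # replace or keep
    )


def recognize(text, dictionaries, length=2):
    """Return the reference characters of text, corrected where a close match exists."""
    # S20: naive fixed-width split into reference characters of the preset length.
    references = [text[i:i + length] for i in range(0, len(text), length)]
    output = []
    for ref in references:
        # S30: pick the dictionary for this length and test membership.
        known = dictionaries.get(len(ref), set())
        if ref in known:
            output.append(ref)
            continue
        # S40: keep dictionary entries closer than the phrase length, take the nearest.
        candidates = [w for w in known if edit_distance(ref, w) < len(ref)]
        if candidates:
            output.append(min(candidates, key=lambda w: (edit_distance(ref, w), w)))
        else:
            output.append(ref)  # nothing close enough; leave the text unchanged
    return output


# Hypothetical dictionaries keyed by phrase length.
dictionaries = {1: {"人"}, 2: {"我是", "中国"}}
print(recognize("我是中园人", dictionaries))  # S10: text as produced by OCR
```

A production system would replace the linear scan with an index such as a BK-tree so that most dictionary entries are never compared.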

Description

Character recognition method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 201811254944.6, entitled "Character recognition method, apparatus, device, and storage medium" and filed with the China Patent Office on October 25, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of text recognition, and in particular to a character recognition method, apparatus, device, and storage medium.
Background
At present, optical character recognition (OCR) mainly uses an electronic device, such as a scanner or a digital camera, to examine characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses a character recognition method to translate the shapes into computer text. For printed characters, the text in a paper document is optically converted into a black-and-white dot-matrix image file, and recognition software converts the text in the image into a text format for further editing by word processing software. However, when a probabilistic statistical method is used for character recognition, as is usually the case, recognition is often slow.
Summary of the Invention
The main purpose of this application is to propose a character recognition method, apparatus, device, and storage medium, aiming to improve the efficiency of text recognition.
To achieve the above purpose, the present application provides a character recognition method comprising the following steps:
acquiring text to be recognized;
calling a word segmentation tool pre-stored in a first preset area, and dividing the text to be recognized into a plurality of reference characters of preset lengths by means of the word segmentation tool;
acquiring the reference characters produced by the word segmentation tool, searching a second preset area for the preset dictionary corresponding to the target length of each reference character, and determining whether the reference character is stored in the preset dictionary;
when the reference character is not stored in the preset dictionary, screening the missing reference character by means of a fuzzy matching algorithm to obtain a target character, and displaying the target character.
In addition, to achieve the above purpose, the present application further proposes a character recognition apparatus comprising:
an acquisition module, configured to acquire text to be recognized;
a calling module, configured to call a word segmentation tool pre-stored in a first preset area and to divide the text to be recognized into a plurality of reference characters of preset lengths by means of the word segmentation tool;
a search module, configured to acquire the reference characters produced by the word segmentation tool, search a second preset area for the preset dictionary corresponding to the target length of each reference character, and determine whether the reference character is stored in the preset dictionary;
a screening module, configured to, when the reference character is not stored in the preset dictionary, screen the missing reference character by means of a fuzzy matching algorithm to obtain a target character, and display the target character.
In addition, to achieve the above purpose, the present application further proposes a device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the computer-readable instructions being configured to implement the steps of the character recognition method described above.
In addition, to achieve the above purpose, the present application further proposes a storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the character recognition method described above.
In the character recognition method proposed in this application, the text to be recognized is acquired and a word segmentation tool is called to divide it into a plurality of characters of preset lengths. The preset dictionary corresponding to each character's length is looked up, and it is determined whether the character is stored in that dictionary. If the character is not stored in the preset dictionary, the character has been recognized abnormally; in that case a target character is screened out for the missing character by a fuzzy matching algorithm. Text recognition is thus realized through the fuzzy matching algorithm, and its efficiency is improved.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a device in the hardware operating environment involved in the embodiments of the present application;
FIG. 2 is a schematic flowchart of a first embodiment of the character recognition method of the present application;
FIG. 3 is a schematic flowchart of a second embodiment of the character recognition method of the present application;
FIG. 4 is a schematic flowchart of a third embodiment of the character recognition method of the present application;
FIG. 5 is a schematic diagram of the functional modules of a first embodiment of the character recognition apparatus of the present application.
The realization of the purpose, the functional characteristics, and the advantages of the present application will be further described in conjunction with the embodiments and with reference to the drawings.
Detailed Description
It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a device in the hardware operating environment involved in the embodiments of the present application.
As shown in FIG. 1, the device may include a processor 1001 (for example, a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 implements connection and communication between these components. The user interface 1003 may include a display and an input unit such as keys; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as disk storage. The memory 1005 may optionally also be a storage device independent of the processor 1001.
A person skilled in the art will understand that the device structure shown in FIG. 1 does not limit the device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in FIG. 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module, and computer-readable instructions.
In the device shown in FIG. 1, the network interface 1004 is mainly used to connect to an external network and exchange data with other network devices; the user interface 1003 is mainly used to connect to user equipment and exchange data with it; the device of the present application calls, through the processor 1001, the computer-readable instructions stored in the memory 1005 and executes the character recognition method provided by the embodiments of the present application.
Based on the above hardware structure, embodiments of the character recognition method of the present application are proposed.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of the first embodiment of the character recognition method of the present application.
In the first embodiment, the character recognition method includes the following steps:
Step S10: acquire the text to be recognized.
It should be noted that, in this embodiment, historical recognition text is first obtained through OCR and used as the text to be recognized. In a specific implementation, the document to be recognized is input into a computer through an input device, which may be a scanner or any other device with the same function. An image of a simple printed document is scanned, the tilt angle at which the document was placed is measured, layout analysis is performed on the document, the typesetting of the selected text regions is confirmed, and horizontally and vertically typeset text lines are segmented, so that the text image of each line is separated and punctuation marks are distinguished. After this preprocessing, each individual character image is extracted and handed to the recognition module. Layout analysis is the overall analysis of the text image: it extracts all text blocks in the document and distinguishes the text paragraphs and their typesetting order, as well as the regions occupied by images and tables.
The boundary of each text block (the start and end coordinates of the region in the image), the attributes of the region (horizontal or vertical typesetting), and the connection relationships between text blocks are passed to the recognition module as a data structure for automatic recognition. Text regions are recognized directly, table regions undergo dedicated table analysis and recognition, and image regions are compressed or simply stored. Line and character segmentation is the process of first cutting a large image into lines and then separating individual characters from each line image.
It should be noted that, during recognition, each character image extracted from the scanned text is converted by the computer from a graphic into a standard character code. The strokes, feature points, projection information, and regional distribution of points of the character are analyzed to produce the top-10 candidate results for each recognized character, and the top-1 result of each character is selected to form the base text. For example, after the Chinese text "我是中国人" ("I am Chinese") is recognized by OCR, the top-1 result "我是中园人" (in which "国" has been misread as "园") is taken as the base text, and the base text is used as the text to be recognized. This completes the initial recognition of the document.
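The selection of the base text from ranked candidates can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the candidate lists are invented, they are shortened to three entries per character instead of ten, and the name `base_text` is hypothetical.

```python
# Hypothetical OCR output: one ranked candidate list per character position,
# best guess first. The source describes top-10 lists; three entries are
# shown here for brevity, and all candidate values are made up.
candidates = [
    ["我", "找", "戒"],
    ["是", "昰", "足"],
    ["中", "申", "由"],
    ["园", "国", "圆"],  # top-1 is wrong here; the correct "国" is ranked second
    ["人", "入", "八"],
]


def base_text(ranked):
    """Join the top-1 candidate of every position into the base text."""
    return "".join(position[0] for position in ranked)


text_to_recognize = base_text(candidates)
print(text_to_recognize)
```

Keeping the full ranked lists around, rather than only the top-1 string, is what later allows a misread character to be corrected from its own candidates.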
Step S20: call the word segmentation tool pre-stored in the first preset area, and divide the text to be recognized into a plurality of reference characters of preset lengths by means of the word segmentation tool.
In this embodiment, a word segmentation tool is provided and used to analyze the text to be recognized. The word segmentation tool may be, for example, jieba, SnowNLP, THULAC, or NLPIR, or another word segmentation tool, which is not limited in this embodiment. The tool divides the text to be recognized into phrases of preset lengths; for example, it may divide "我是中园人" into "我", "是", and "中园人", or into "我是", "中园", and "人". For Chinese text, the preset length may be the number of characters: "我是" has length 2 and "人" has length 1. Word segmentation under different rules is thus possible, which improves segmentation accuracy.
It should be noted that, to improve recognition efficiency, this embodiment lists the phrases with a length of 2 or more, namely "我是" and "中国", so that the phrases can be analyzed. Phrases meeting other rules may also be listed, which is not limited in this embodiment. In this embodiment, the text to be recognized is divided into phrases of length 2, thereby improving the efficiency of text recognition.
Step S30: acquire the reference characters produced by the word segmentation tool, search the second preset area for the preset dictionary corresponding to the target length of each reference character, and determine whether the reference character is stored in the preset dictionary.
It should be noted that the reference characters are the phrases produced by the word segmentation tool; for example, "我是中园人" is divided into phrases of length 2, giving "我是", "中园", and "人". The terms first preset area and second preset area distinguish the storage address of the word segmentation tool from the storage address of the preset dictionary.
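The length-2 split used in the example above can be sketched as follows. This is a simplified stand-in for a real word segmentation tool such as jieba, assuming plain fixed-length chunking; the function names `split_fixed` and `group_by_length` are hypothetical.

```python
from collections import defaultdict


def split_fixed(text, length=2):
    """Cut text into consecutive chunks of `length` characters.

    The last chunk may be shorter, as with "人" in the example.
    """
    return [text[i:i + length] for i in range(0, len(text), length)]


def group_by_length(reference_chars):
    """Group reference characters by their target length."""
    groups = defaultdict(list)
    for ref in reference_chars:
        groups[len(ref)].append(ref)
    return dict(groups)


refs = split_fixed("我是中园人")
print(refs)
print(group_by_length(refs))
```

Grouping by length prepares the next step, in which each reference character is checked against the dictionary that holds words of the same length.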
In this embodiment, the preset dictionaries are dictionaries classified by a preset field, here the word length: a dictionary for words of length 2, a dictionary for words of length 3, and so on. The length-2 dictionary contains, for example, "中国" ("China"), and the length-3 dictionary contains, for example, "中国人" ("Chinese person"). Commonly used phrases are thus classified by length, which makes them easier to manage.
In a specific implementation, the preset dictionary is used to check whether a segmented phrase is a common phrase. In this embodiment, the phrases of length 2 after segmentation are "我是" and "中园"; both are looked up in the length-2 dictionary. If a phrase is not found, its recognition is abnormal. Here "我是" is found, so its recognition is normal, whereas "中园" is not found, so its recognition is abnormal.
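A minimal sketch of this length-keyed lookup follows. The dictionary contents and the names `PRESET_DICTIONARIES`, `is_known`, and `find_abnormal` are assumptions made for illustration; a length-1 dictionary is included only so that "人" can be checked as well.

```python
# Hypothetical preset dictionaries, keyed by word length.
PRESET_DICTIONARIES = {
    1: {"我", "是", "人"},
    2: {"我是", "中国", "中文", "家园"},
    3: {"中国人"},
}


def is_known(reference_char):
    """Look the reference character up in the dictionary for its length."""
    dictionary = PRESET_DICTIONARIES.get(len(reference_char), set())
    return reference_char in dictionary


def find_abnormal(reference_chars):
    """Return the reference characters missing from their dictionaries."""
    return [ref for ref in reference_chars if not is_known(ref)]


print(find_abnormal(["我是", "中园", "人"]))
```

Because each dictionary is a set, the membership test is a constant-time hash lookup on average, so only the few abnormal phrases need the more expensive fuzzy matching step.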
Step S40: when the reference character is not stored in the preset dictionary, screen the missing reference character by means of a fuzzy matching algorithm to obtain a target character, and display the target character.
In this embodiment, characters not stored in the dictionary are screened by a fuzzy matching algorithm. The fuzzy matching algorithm is the BK-tree (Burkhard-Keller tree) algorithm proposed by Burkhard and Keller, which is based on edit distance: the minimum number of edit operations required to convert one string into another. The smaller the edit distance, the more similar the two strings; when the edit distance is 0, the two strings are equal. Character recognition is realized on this basis.
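The edit distance itself can be computed with the standard dynamic-programming recurrence. The sketch below assumes the usual insertion, deletion, and substitution operations, each with cost 1; the function name `edit_distance` is chosen for illustration.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(edit_distance("中园", "中国"))
print(edit_distance("中国", "中国"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.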
With the above solution, this embodiment acquires the text to be recognized and calls the word segmentation tool to divide it into a plurality of characters of preset lengths. The preset dictionary corresponding to each character's length is looked up, and it is determined whether the character is stored in that dictionary. If the character is not stored in the preset dictionary, the character has been recognized abnormally; in that case a target character is screened out for the missing character by the fuzzy matching algorithm. Text recognition is thus realized through the fuzzy matching algorithm, and its efficiency is improved.
Further, as shown in FIG. 3, a second embodiment of the character recognition method of the present application is proposed based on the first embodiment. In this embodiment, before step S20, the method further includes:
Step S201: receive a tool writing instruction, extract the word segmentation tool and the segmentation tool write address information from the instruction, and write the word segmentation tool into the first preset area according to the write address information and save it.
It can be understood that, to enable comparison and analysis of the text to be recognized, the word segmentation tool is first written into the preset area. After the text to be recognized is acquired, the word segmentation tool in the preset area is called to analyze the text in finer detail. The word segmentation tool may be a small program or take another form, which is not limited in this embodiment.
It should be noted that the tool writing instruction may be issued through a write operation on a writing platform interface or through a data serial port, which is not limited in this embodiment.
Further, step S20 includes:
Step S202: call the word segmentation tool pre-stored in the first preset area, compare the text to be recognized with keywords of each preset length by means of the word segmentation tool, extract the target keywords of each preset length from the text to be recognized according to the comparison result, and use the target keywords as the reference characters of the preset lengths.
In a specific implementation, the word segmentation tool may hold a set of keywords and recognize the keywords contained in the text to be recognized by comparing the text against them. For example, when the text "武汉风景好" ("the scenery in Wuhan is good") is segmented, it is compared with the keywords to obtain "武汉" ("Wuhan"), "风景" ("scenery"), and "好" ("good"), which completes the processing of the text to be recognized.
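One common way to compare text against a keyword set is forward maximum matching, sketched below. The source does not say which matching strategy the segmentation tool uses, so this technique is swapped in for illustration; the keyword set and the name `segment` are hypothetical.

```python
# Hypothetical keyword set held by the segmentation tool.
KEYWORDS = {"武汉", "风景", "好"}


def segment(text, keywords):
    """Forward maximum matching: at each position take the longest keyword
    that matches, falling back to a single character if none does."""
    max_len = max(len(keyword) for keyword in keywords)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in keywords or size == 1:
                pieces.append(piece)
                i += size
                break
    return pieces


print(segment("武汉风景好", KEYWORDS))
```

The single-character fallback guarantees progress on text that contains no keyword at all, which matters for OCR output where misread characters are expected.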
In the solution provided by this embodiment, a writing instruction is received, the word segmentation tool is written in advance according to the writing instruction, and the text to be recognized is segmented by the word segmentation tool, which enables finer-grained text recognition.
Further, as shown in FIG. 4, a third embodiment of the character recognition method of the present application is proposed based on the first or second embodiment. This embodiment is described on the basis of the first embodiment. Before step S30, the method further includes:
Step S301: receive a dictionary writing instruction, extract the preset dictionary and the dictionary write address information from the instruction, and write the preset dictionary into the second preset area according to the dictionary write address information.
It should be noted that, to improve recognition accuracy, the preset dictionary must first be written: the writing instruction is received, the preset dictionary is extracted from it, and the preset dictionary is saved in a preset area. Because the word segmentation tool has already been saved, the storage address of the word segmentation tool and the storage address of the preset dictionary can be kept in different areas with different identification labels, that is, they are distinguished as the first preset area and the second preset area, which allows the data to be managed effectively.
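The two writing instructions can be pictured with the following sketch, which models each preset area as a labeled in-memory mapping. The `WriteInstruction` structure, the area labels, and the addresses are all assumptions; the source does not specify an instruction format.

```python
from dataclasses import dataclass


@dataclass
class WriteInstruction:
    payload: object  # the word segmentation tool or a preset dictionary
    address: str     # write address carried in the instruction


# Hypothetical storage: one mapping per labeled preset area.
storage = {"first_preset_area": {}, "second_preset_area": {}}


def handle_write(instruction, area):
    """Extract payload and address from the instruction and save the payload."""
    storage[area][instruction.address] = instruction.payload


def toy_segmenter(text):
    """Stand-in segmentation tool: fixed chunks of two characters."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


handle_write(WriteInstruction(toy_segmenter, "segmenter"), "first_preset_area")
handle_write(WriteInstruction({"我是", "中国"}, "dict/len2"), "second_preset_area")

tool = storage["first_preset_area"]["segmenter"]
print(tool("我是中园人"))
```

Keeping the tool and the dictionaries under separate labels means either can be replaced without touching the other.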
Further, step S30 includes:
Step S302: acquire the reference characters produced by the word segmentation tool, and look up the corresponding storage address in a preset address relationship mapping table according to the target length of each reference character.
It should be noted that the storage address is the storage address of a preset dictionary. The database holds multiple dictionaries, such as a length-2 dictionary and a length-3 dictionary, as well as dictionaries of other forms. To manage them, the dictionaries are stored at different storage addresses, and the correspondence between each storage address and the word length of its dictionary is recorded in the preset address relationship mapping table. Given the length of a character, the address of the corresponding dictionary can be found in the table. For example, when the reference character has length 2, the address at which the length-2 dictionary is stored is looked up in the preset address relationship mapping table. The addresses are thus managed effectively.
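The two-stage lookup, from length to address and from address to dictionary, can be sketched as follows. The address strings, dictionary contents, and names are hypothetical, and plain Python mappings stand in for the database.

```python
# Hypothetical address relationship mapping table: word length -> storage address.
ADDRESS_MAP = {2: "dict/len2", 3: "dict/len3"}

# Hypothetical second preset area: storage address -> dictionary contents.
SECOND_PRESET_AREA = {
    "dict/len2": {"我是", "中国", "中文", "家园"},
    "dict/len3": {"中国人"},
}


def dictionary_for(reference_char):
    """Resolve the dictionary for a reference character via its target length.

    Returns None when no dictionary is registered for that length.
    """
    address = ADDRESS_MAP.get(len(reference_char))
    if address is None:
        return None
    return SECOND_PRESET_AREA.get(address)


print("中园" in dictionary_for("中园"))
print("中国人" in dictionary_for("中国人"))
```

The indirection through the mapping table lets a dictionary be moved to a new address by updating one table entry.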
Step S303: search the preset area for the corresponding preset dictionary according to the storage address, extract the feature information of the reference character, compare it with the feature information of the characters in the dictionary found, and determine from the comparison result whether the reference character is stored in the dictionary.
To judge whether a reference character has been recognized accurately, the reference character is compared with the characters in the dictionary to determine whether the dictionary contains it. If the dictionary does not contain the reference character, the reference character is abnormal; if it does, the reference character has been recognized correctly. For example, when the dictionary is checked for "我是", "中园", and "人", it contains "我是" but not "中园", so "中园" is judged to be abnormal.
In a specific implementation, the feature information extracted from the reference character may be the regional distribution of its points, the geometric distribution of the points, or another form of feature information, which is not limited in this embodiment.
Further, step S40 includes:
Step S401: when the reference character is not stored in the preset dictionary, use the fuzzy matching algorithm to find, in the preset dictionary, target characters whose edit distance is less than the target length of the reference character, and display the target characters.
In a specific implementation, the BK-tree algorithm is used to find words whose edit distance is not greater than the length of the word. For example, if "中园" does not exist, the BK-tree search may return "中国". The edit distance from string A to string B is the minimum number of steps needed to turn A into B using only insertion, deletion, and substitution. For example, FAME to GATE takes two steps (two substitutions), and GAME to ACM takes three steps (deleting G and E, then adding C). The screened result "中国" is displayed as the target character, so that text recognition is realized through the fuzzy matching algorithm and its accuracy is improved.
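A BK-tree query can be sketched as follows. This is a minimal illustration with an invented word list; the class and function names are hypothetical. The search threshold follows the "edit distance less than the target length" reading of step S401, so a length-2 reference character is searched within distance 1.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class BKTree:
    """Each node is (word, children), where children maps an edit distance
    to the child node lying at exactly that distance from the word."""

    def __init__(self, words):
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            distance = edit_distance(word, node[0])
            if distance == 0:
                return  # word already stored
            child = node[1].get(distance)
            if child is None:
                node[1][distance] = (word, {})
                return
            node = child

    def query(self, word, max_distance):
        """Return all stored words within max_distance of word."""
        found = []
        stack = [self.root] if self.root is not None else []
        while stack:
            term, children = stack.pop()
            distance = edit_distance(word, term)
            if distance <= max_distance:
                found.append(term)
            # The triangle inequality limits which subtrees can hold matches.
            for key, child in children.items():
                if distance - max_distance <= key <= distance + max_distance:
                    stack.append(child)
        return found


tree = BKTree(["中国", "中文", "家园", "我是", "风景"])
reference = "中园"
print(sorted(tree.query(reference, len(reference) - 1)))
```

The pruning test is what makes the tree worthwhile: subtrees whose distance key falls outside the allowed band are skipped without computing any further edit distances.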
Further, after step S10, the method further includes building an initial recognition list from the initially recognized characters in the text to be recognized, and step S401 includes:
Step S402: when the reference character is not stored in the preset dictionary, use the fuzzy matching algorithm to find, in the preset dictionary, target characters whose edit distance is less than the target length of the reference character.
It should be noted that, when text is recognized through OCR, each character image extracted from the scanned text is converted by the computer from a graphic into a standard character code. The strokes, feature points, projection information, and regional distribution of points of the character are analyzed to produce the top-10 candidate results for each recognized character, and the top-10 results of each character form the initial recognition list for that character.
Step S403: determine the number of target characters; when there is more than one, determine whether each target character is present in the initial recognition list, and display the target characters that correspond to characters present in the initial recognition list.
It should be noted that screening with the BK-tree may return several words. In the example above, the search may return not only "中国" but also "中文" and "家园". In that case, the word whose changed character appeared in the earlier top-10 list is selected from the screened words and displayed as the target character, which improves the accuracy of text recognition.
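This tie-breaking step can be sketched as follows, assuming the matches have the same length as the reference character so that changed positions can be compared one by one. The candidate lists are invented and shortened, and the name `pick_by_initial_list` is hypothetical.

```python
def pick_by_initial_list(reference, matches, initial_lists):
    """Keep the matches whose changed characters all appear in the OCR
    candidate list recorded for the same position.

    initial_lists[i] is the ranked candidate list for reference[i].
    """
    kept = []
    for match in matches:
        if len(match) != len(reference):
            continue  # positions cannot be aligned one to one
        changed = [
            (i, new) for i, (old, new) in enumerate(zip(reference, match))
            if old != new
        ]
        if changed and all(new in initial_lists[i] for i, new in changed):
            kept.append(match)
    return kept


# Hypothetical initial recognition lists for the two positions of "中园"
# (the source describes top-10 lists; three entries are shown for brevity).
initial_lists = [["中", "申", "由"], ["园", "国", "圆"]]

print(pick_by_initial_list("中园", ["中国", "中文", "家园"], initial_lists))
```

Here "中国" survives because "国" was among the OCR candidates for the second position, while "文" and "家" were never candidates at their positions.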
In the solution provided by this embodiment, the fuzzy matching algorithm is incorporated into text recognition: similar characters are found according to edit distance, and the screened characters are used as the target characters, which improves the accuracy of text recognition.
The present application further provides a character recognition apparatus.
Referring to FIG. 5, FIG. 5 is a schematic diagram of the functional modules of the first embodiment of the character recognition apparatus of the present application.
In the first embodiment of the character recognition apparatus of the present application, the character recognition apparatus includes:
an acquisition module 10, configured to acquire the text to be recognized.
Note that in this embodiment, historical recognition text is first obtained through OCR and used as the text to be recognized. In a specific implementation, the document to be recognized is input into the computer through an input device, which may be a scanner or any other device providing the same function. The image of a simple printed document is scanned and then preprocessed: the skew angle at which the document was placed is measured, layout analysis is performed on the document, the typesetting of the selected text regions is confirmed, horizontally and vertically typeset text lines are segmented, the character images in each line are separated, and punctuation marks are distinguished. Each processed character image is then passed to the recognition module. Layout analysis is the overall analysis of the text image: it picks out all the text blocks in the document and distinguishes the text paragraphs, their typesetting order, and the regions occupied by images and tables. The region boundaries of each text block (the start and end coordinates of the region in the image), the attributes of the region (horizontal or vertical typesetting), and the connection relationships among the text blocks are provided to the recognition module as a data structure for automatic recognition. Text regions are recognized directly, table regions undergo dedicated table analysis and recognition, and image regions are compressed or simply stored. Line and character segmentation is the process of first cutting the full image into lines and then separating individual characters from each line image.
Note that when the text is recognized, each character image segmented from the scanned text is converted by the computer into the standard character code and analyzed according to the character's strokes, feature points, projection information, and the regional distribution of points, yielding the top-10 results for every character recognized in the text. The top-1 result of each character is selected to form the base text. For example, if the Chinese text "我是中国人" ("I am Chinese") is recognized through OCR and the top-1 results give "我是中园人" (with "园" in place of "国"), then "我是中园人" is the base text. The base text is used as the text to be recognized, which completes the initial recognition of the document.
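The formation of the base text from per-character top-10 results can be sketched as below. This is a minimal illustration, not the disclosed implementation; the function name `base_text_from_top10` and the candidate lists (shortened to a few entries each) are hypothetical.

```python
def base_text_from_top10(top10_lists):
    """Join the top-1 candidate of every character position into the base text."""
    return "".join(candidates[0] for candidates in top10_lists)


# Hypothetical OCR output: one ranked candidate list per character position.
initial_recognition_lists = [
    ["我", "找", "战"],
    ["是", "足", "星"],
    ["中", "申", "巾"],
    ["园", "国", "围"],  # the OCR engine ranked "园" above the correct "国"
    ["人", "入", "八"],
]
print(base_text_from_top10(initial_recognition_lists))  # 我是中园人
```

The full lists are kept as the initial recognition lists so that later steps can consult the lower-ranked candidates.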
A calling module 20, configured to call the word segmentation tool pre-stored in a first preset area and to divide the text to be recognized into a plurality of reference characters of preset lengths using the word segmentation tool.
In this embodiment, a word segmentation tool is provided, and the text to be recognized is analyzed with it. The word segmentation tool may be, for example, jieba, SnowNLP, THULAC, or NLPIR, or another word segmentation tool; this embodiment places no restriction on the choice. The tool divides the text to be recognized into phrases of preset lengths. For example, the recognized text "我是中园人" may be divided into "我", "是", and "中园人", or into "我是", "中园", and "人". For Chinese text, the preset length may be the number of characters: "我是" has length 2 and "人" has length 1. Segmentation under different rules is thus supported, which improves segmentation precision.
Note that, to improve recognition efficiency, this embodiment lists the phrases whose length is at least 2, namely "我是" and "中园", so that these phrases can be analyzed. Phrases meeting other rules may also be listed; this embodiment places no restriction on this. In this embodiment, the text to be recognized is divided into phrases of length 2 as an example, which improves the efficiency of text recognition.
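The division into length-2 phrases and the listing of phrases to be checked can be sketched as follows. This is only an illustration: fixed-size chunking is used here as a simple stand-in for a real word segmentation tool such as jieba, and the function names `chunk_by_length` and `phrases_to_check` are hypothetical.

```python
def chunk_by_length(text, size=2):
    """Split text into consecutive pieces of `size` characters.
    Stand-in for a real segmenter; the last piece may be shorter."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def phrases_to_check(segments, min_len=2):
    """List only the segments long enough to be looked up as phrases."""
    return [s for s in segments if len(s) >= min_len]


segments = chunk_by_length("我是中园人")
print(segments)                    # ['我是', '中园', '人']
print(phrases_to_check(segments))  # ['我是', '中园']
```

A production system would replace `chunk_by_length` with the configured segmentation tool; the filtering step stays the same.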
A lookup module 30, configured to obtain the reference characters produced by the word segmentation tool, to look up the corresponding preset dictionary in a second preset area according to the target length of each reference character, and to determine whether the reference character is present in the preset dictionary.
Note that the reference characters are the phrases obtained after segmentation by the word segmentation tool. For example, "我是中园人" segmented with a target length of 2 yields "我是", "中园", and the remaining single character "人".
In this embodiment, the preset dictionaries are dictionaries classified by a preset field, for example a dictionary of words of length 2, a dictionary of words of length 3, and so on. The length-2 dictionary contains words such as "中国" ("China"), and the length-3 dictionary contains words such as "中国人" ("Chinese person"). Common phrases are thus classified by word length, which makes them easier to manage.
In a specific implementation, the preset dictionary is used to check whether a target phrase produced by segmentation is a common phrase. In this embodiment, the length-2 phrases after segmentation are "我是" and "中园", and each is looked up in the length-2 dictionary. A phrase that is absent indicates a recognition anomaly: if "中园" is not found but "我是" is, then "我是" was recognized normally and "中园" was recognized abnormally.
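The length-classified dictionaries and the anomaly check can be sketched as follows. This is a minimal illustration under assumptions: the function names `build_length_dictionaries` and `find_anomalies` and the tiny vocabulary are hypothetical, and in-memory sets stand in for dictionaries stored in the second preset area.

```python
from collections import defaultdict


def build_length_dictionaries(words):
    """Group common words into one set per word length."""
    by_length = defaultdict(set)
    for word in words:
        by_length[len(word)].add(word)
    return by_length


def find_anomalies(phrases, dictionaries):
    """Return the phrases missing from the dictionary of their own length."""
    return [p for p in phrases if p not in dictionaries.get(len(p), set())]


# Hypothetical vocabulary of common words.
dictionaries = build_length_dictionaries(["我是", "中国", "中文", "家园", "中国人"])
print(find_anomalies(["我是", "中园"], dictionaries))  # ['中园']
```

Keeping one set per length means each lookup touches only the words that could possibly match the phrase.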
A filtering module 40, configured to, when the reference character is not present in the preset dictionary, filter the absent reference character through a fuzzy matching algorithm to obtain a target character, and to display the target character.
In this embodiment, characters absent from the dictionary are filtered through a fuzzy matching algorithm. The fuzzy matching algorithm is the BK-tree (Burkhard-Keller tree) algorithm proposed by Burkhard and Keller. It relies on the edit distance between two strings, that is, the minimum number of edit operations required to transform one string into the other. The smaller the edit distance, the more similar the two strings; an edit distance of 0 means the two strings are equal. Characters are recognized on this basis.
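A BK-tree lookup over edit distance can be sketched as follows. This is an illustrative sketch rather than the disclosed implementation: the names `levenshtein` and `BKTree`, the choice of Levenshtein distance as the edit distance, and the sample vocabulary are assumptions made for the example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, children), with children keyed by edit distance."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})
        for word in it:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, max_dist):
        """Return sorted (distance, word) pairs within max_dist of query."""
        results, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only subtrees keyed within
            # [d - max_dist, d + max_dist] can hold matches.
            for key, child in children.items():
                if d - max_dist <= key <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree(["中国", "中文", "家园", "我是", "中国人"])
print(tree.search("中园", max_dist=1))
```

With this vocabulary the query "中园" returns "中国", "中文", and "家园", each at distance 1. The bound of 1 corresponds to an edit distance smaller than the phrase length of 2.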
Through the above solution, this embodiment acquires the text to be recognized and calls the word segmentation tool, which divides the text into a plurality of characters of preset lengths. The corresponding preset dictionary is looked up according to the characters of each preset length, and it is determined whether the characters are present in the preset dictionary. Characters absent from the preset dictionary indicate a recognition anomaly; in that case, the absent characters are filtered through the fuzzy matching algorithm to obtain the target characters. Character recognition is thus achieved through the fuzzy matching algorithm, and recognition efficiency is improved.
In addition, to achieve the above object, the present application further provides a device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the computer-readable instructions being configured to implement the steps of the character recognition method described above.
In addition, an embodiment of the present application further provides a storage medium; the computer-readable storage medium may be a non-volatile readable storage medium.
Computer-readable instructions are stored on the storage medium of the present application, and when executed by a processor they implement the steps of the character recognition method described above.
For the method implemented when the computer-readable instructions are executed, refer to the embodiments of the character recognition method of the present application; details are not repeated here.
Note that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. On this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a computer-readable storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause an intelligent terminal device (which may be a mobile phone, a computer, a terminal device, an air conditioner, a network terminal device, or the like) to execute the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A character recognition method, characterized in that the character recognition method comprises:
    acquiring text to be recognized;
    calling a word segmentation tool pre-stored in a first preset area, and dividing the text to be recognized into a plurality of reference characters of preset lengths through the word segmentation tool;
    obtaining the reference characters produced by the word segmentation tool, looking up a corresponding preset dictionary in a second preset area according to a target length of the reference characters, and determining whether the reference characters are present in the preset dictionary;
    when the reference characters are not present in the preset dictionary, filtering the absent reference characters through a fuzzy matching algorithm to obtain target characters, and displaying the target characters.
  2. The character recognition method according to claim 1, characterized in that, before the calling a word segmentation tool pre-stored in a first preset area and dividing the text to be recognized into a plurality of reference characters of preset lengths through the word segmentation tool, the method comprises:
    receiving a tool write instruction, extracting the word segmentation tool and word segmentation write address information from the tool write instruction, and writing the word segmentation tool into the first preset area according to the word segmentation write address information and saving it.
  3. The character recognition method according to claim 1, characterized in that the calling a word segmentation tool pre-stored in a first preset area and dividing the text to be recognized into a plurality of reference characters of preset lengths through the word segmentation tool comprises:
    calling the word segmentation tool pre-stored in the first preset area, comparing the text to be recognized with keywords of each preset length through the word segmentation tool, extracting target keywords of each preset length from the text to be recognized according to the comparison result, and using the target keywords as the reference characters of the preset lengths.
  4. The character recognition method according to claim 1, characterized in that, before the obtaining the reference characters produced by the word segmentation tool, looking up a corresponding preset dictionary in a second preset area according to a target length of the reference characters, and determining whether the reference characters are present in the preset dictionary, the method further comprises:
    receiving a dictionary write instruction, extracting the preset dictionary and dictionary write address information from the dictionary write instruction, and writing the preset dictionary into the second preset area according to the dictionary write address information.
  5. The character recognition method according to claim 1, characterized in that the obtaining the reference characters produced by the word segmentation tool, looking up a corresponding preset dictionary in a second preset area according to a target length of the reference characters, and determining whether the reference characters are present in the preset dictionary comprises:
    obtaining the reference characters produced by the word segmentation tool, and looking up a corresponding storage address in a preset address relationship mapping table according to the target length of the reference characters;
    looking up the corresponding preset dictionary in the preset area according to the storage address, extracting feature information of the reference characters, comparing the feature information with feature information of the characters in the dictionary found, and determining according to the comparison result whether the reference characters are present in the dictionary.
  6. The character recognition method according to claim 1, characterized in that the, when the reference characters are not present in the preset dictionary, filtering the absent reference characters through a fuzzy matching algorithm to obtain target characters, and displaying the target characters comprises:
    when the reference characters are not present in the preset dictionary, finding in the preset dictionary, through the fuzzy matching algorithm, target characters whose edit distance is less than the target length corresponding to the reference characters, and displaying the target characters.
  7. The character recognition method according to claim 6, characterized in that, after the acquiring text to be recognized, the method further comprises:
    building an initial recognition list for each initially recognized character in the text to be recognized;
    the, when the reference characters are not present in the preset dictionary, finding in the preset dictionary, through the fuzzy matching algorithm, target characters whose edit distance is less than the target length corresponding to the reference characters, and displaying the target characters comprises:
    when the reference characters are not present in the preset dictionary, finding in the preset dictionary, through the fuzzy matching algorithm, target characters whose edit distance is less than the target length corresponding to the reference characters;
    determining the number of the target characters, and, when there is more than one, determining whether the target characters are present in the initial recognition list, and displaying the target characters that correspond to characters present in the initial recognition list.
  8. A character recognition apparatus, characterized in that the character recognition apparatus comprises:
    an acquisition module, configured to acquire text to be recognized;
    a calling module, configured to call a word segmentation tool pre-stored in a first preset area and to divide the text to be recognized into a plurality of reference characters of preset lengths through the word segmentation tool;
    a lookup module, configured to obtain the reference characters produced by the word segmentation tool, to look up a corresponding preset dictionary in a second preset area according to a target length of the reference characters, and to determine whether the reference characters are present in the preset dictionary;
    a filtering module, configured to, when the reference characters are not present in the preset dictionary, filter the absent reference characters through a fuzzy matching algorithm to obtain target characters, and to display the target characters.
  9. A device, characterized in that the device comprises a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the computer-readable instructions being configured to implement the following steps:
    acquiring text to be recognized;
    calling a word segmentation tool pre-stored in a first preset area, and dividing the text to be recognized into a plurality of reference characters of preset lengths through the word segmentation tool;
    obtaining the reference characters produced by the word segmentation tool, looking up a corresponding preset dictionary in a second preset area according to a target length of the reference characters, and determining whether the reference characters are present in the preset dictionary;
    when the reference characters are not present in the preset dictionary, filtering the absent reference characters through a fuzzy matching algorithm to obtain target characters, and displaying the target characters.
  10. The device according to claim 9, characterized in that the computer-readable instructions are further configured to implement the following steps:
    receiving a tool write instruction, extracting the word segmentation tool and word segmentation write address information from the tool write instruction, and writing the word segmentation tool into the first preset area according to the word segmentation write address information and saving it.
  11. The device according to claim 9, characterized in that the computer-readable instructions are further configured to implement the following steps:
    calling the word segmentation tool pre-stored in the first preset area, comparing the text to be recognized with keywords of each preset length through the word segmentation tool, extracting target keywords of each preset length from the text to be recognized according to the comparison result, and using the target keywords as the reference characters of the preset lengths.
  12. The device according to claim 9, characterized in that the computer-readable instructions are further configured to implement the following steps:
    receiving a dictionary write instruction, extracting the preset dictionary and dictionary write address information from the dictionary write instruction, and writing the preset dictionary into the second preset area according to the dictionary write address information.
  13. The device according to claim 9, characterized in that the computer-readable instructions are further configured to implement the following steps:
    obtaining the reference characters produced by the word segmentation tool, and looking up a corresponding storage address in a preset address relationship mapping table according to the target length of the reference characters;
    looking up the corresponding preset dictionary in the preset area according to the storage address, extracting feature information of the reference characters, comparing the feature information with feature information of the characters in the dictionary found, and determining according to the comparison result whether the reference characters are present in the dictionary.
  14. The device according to claim 9, characterized in that the computer-readable instructions are further configured to implement the following steps:
    when the reference characters are not present in the preset dictionary, finding in the preset dictionary, through the fuzzy matching algorithm, target characters whose edit distance is less than the target length corresponding to the reference characters, and displaying the target characters.
  15. The device according to claim 14, characterized in that the computer-readable instructions are further configured to implement the following steps:
    building an initial recognition list for each initially recognized character in the text to be recognized;
    when the reference characters are not present in the preset dictionary, finding in the preset dictionary, through the fuzzy matching algorithm, target characters whose edit distance is less than the target length corresponding to the reference characters;
    determining the number of the target characters, and, when there is more than one, determining whether the target characters are present in the initial recognition list, and displaying the target characters that correspond to characters present in the initial recognition list.
  16. A storage medium, characterized in that computer-readable instructions are stored on the storage medium, and the computer-readable instructions, when executed by a processor, implement the following steps:
    acquiring text to be recognized;
    calling a word segmentation tool pre-stored in a first preset area, and dividing the text to be recognized into a plurality of reference characters of preset lengths through the word segmentation tool;
    obtaining the reference characters produced by the word segmentation tool, looking up a corresponding preset dictionary in a second preset area according to a target length of the reference characters, and determining whether the reference characters are present in the preset dictionary;
    when the reference characters are not present in the preset dictionary, filtering the absent reference characters through a fuzzy matching algorithm to obtain target characters, and displaying the target characters.
  17. The storage medium according to claim 16, characterized in that the computer-readable instructions are further configured to implement the following steps:
    calling the word segmentation tool pre-stored in the first preset area, comparing the text to be recognized with keywords of each preset length through the word segmentation tool, extracting target keywords of each preset length from the text to be recognized according to the comparison result, and using the target keywords as the reference characters of the preset lengths.
  18. The storage medium according to claim 16, characterized in that the computer-readable instructions are further configured to implement the following steps:
    obtaining the reference characters produced by the word segmentation tool, and looking up a corresponding storage address in a preset address relationship mapping table according to the target length of the reference characters;
    looking up the corresponding preset dictionary in the preset area according to the storage address, extracting feature information of the reference characters, comparing the feature information with feature information of the characters in the dictionary found, and determining according to the comparison result whether the reference characters are present in the dictionary.
  19. The storage medium according to claim 16, characterized in that the computer-readable instructions are further configured to implement the following steps:
    when the reference characters are not present in the preset dictionary, finding in the preset dictionary, through the fuzzy matching algorithm, target characters whose edit distance is less than the target length corresponding to the reference characters, and displaying the target characters.
  20. The storage medium according to claim 19, characterized in that the computer-readable instructions are further configured to implement the following steps:
    building an initial recognition list for each initially recognized character in the text to be recognized;
    when the reference characters are not present in the preset dictionary, finding in the preset dictionary, through the fuzzy matching algorithm, target characters whose edit distance is less than the target length corresponding to the reference characters;
    determining the number of the target characters, and, when there is more than one, determining whether the target characters are present in the initial recognition list, and displaying the target characters that correspond to characters present in the initial recognition list.
PCT/CN2018/122832 2018-10-25 2018-12-21 Symbol identification method, apparatus, device, and storage medium WO2020082562A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811254944.6 2018-10-25
CN201811254944.6A CN109657738B (en) 2018-10-25 2018-10-25 Character recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020082562A1 true WO2020082562A1 (en) 2020-04-30

Family

ID=66110077

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/122832 WO2020082562A1 (en) 2018-10-25 2018-12-21 Symbol identification method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN109657738B (en)
WO (1) WO2020082562A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582169A (en) * 2020-05-08 2020-08-25 腾讯科技(深圳)有限公司 Image recognition data error correction method, device, computer equipment and storage medium
CN111897958A (en) * 2020-07-16 2020-11-06 邓桦 Ancient poetry classification method based on natural language processing
CN112347765A (en) * 2020-10-10 2021-02-09 清华大学 Entity labeling method, module and device based on dictionary matching
CN112667831A (en) * 2020-12-25 2021-04-16 上海硬通网络科技有限公司 Material storage method and device and electronic equipment
CN113408270A (en) * 2021-06-10 2021-09-17 广州三七极创网络科技有限公司 Variant text recognition method and device and electronic equipment
CN113420564A (en) * 2021-06-21 2021-09-21 国网山东省电力公司物资公司 Hybrid matching-based electric power nameplate semantic structuring method and system
CN113625884A (en) * 2020-05-07 2021-11-09 顺丰科技有限公司 Input word recommendation method and device, server and storage medium
CN113761913A (en) * 2021-08-23 2021-12-07 南京优飞保科信息技术有限公司 Method and system for processing dialect text
CN113988068A (en) * 2021-12-29 2022-01-28 深圳前海硬之城信息技术有限公司 Word segmentation method, device, equipment and storage medium of BOM text
CN114386407A (en) * 2021-12-23 2022-04-22 北京金堤科技有限公司 Word segmentation method and device for text

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633660B (en) * 2019-08-30 2022-05-31 盈盛智创科技(广州)有限公司 Document identification method, device and storage medium
CN110738202A (en) * 2019-09-06 2020-01-31 平安科技(深圳)有限公司 Character recognition method, device and computer readable storage medium
CN111241365B (en) * 2019-12-23 2023-06-30 望海康信(北京)科技股份公司 Table picture analysis method and system
CN111860657A (en) * 2020-07-23 2020-10-30 中国建设银行股份有限公司 Image classification method and device, electronic equipment and storage medium
CN112560791B (en) * 2020-12-28 2022-08-09 苏州科达科技股份有限公司 Recognition model training method, recognition method and device and electronic equipment
CN112949446B (en) * 2021-02-25 2023-04-18 山东英信计算机技术有限公司 Object identification method, device, equipment and medium
CN113743102B (en) * 2021-08-18 2023-09-01 百度在线网络技术(北京)有限公司 Method and device for recognizing characters and electronic equipment
CN116580402B (en) * 2023-05-26 2024-06-25 读书郎教育科技有限公司 Text recognition method and device for dictionary pen

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104991889A (en) * 2015-06-26 2015-10-21 江苏科技大学 Fuzzy word segmentation based non-multi-character word error automatic proofreading method
CN105068994A (en) * 2015-08-13 2015-11-18 易保互联医疗信息科技(北京)有限公司 Natural language processing method and system for drug information
CN107622044A (en) * 2016-07-13 2018-01-23 阿里巴巴集团控股有限公司 Segmenting method, device and the equipment of character string
CN108304484A (en) * 2017-12-29 2018-07-20 北京城市网邻信息技术有限公司 Key word matching method and device, electronic equipment and readable storage medium storing program for executing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100476800C (en) * 2007-06-22 2009-04-08 腾讯科技(深圳)有限公司 Method and system for cutting index participle
JP5716328B2 (en) * 2010-09-14 2015-05-13 株式会社リコー Information processing apparatus, information processing method, and information processing program

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113625884A (en) * 2020-05-07 2021-11-09 顺丰科技有限公司 Input word recommendation method and device, server and storage medium
CN111582169A (en) * 2020-05-08 2020-08-25 腾讯科技(深圳)有限公司 Image recognition data error correction method, device, computer equipment and storage medium
CN111582169B (en) * 2020-05-08 2023-10-10 腾讯科技(深圳)有限公司 Image recognition data error correction method, device, computer equipment and storage medium
CN111897958A (en) * 2020-07-16 2020-11-06 邓桦 Ancient poetry classification method based on natural language processing
CN111897958B (en) * 2020-07-16 2024-03-12 邓桦 Ancient poetry classification method based on natural language processing
CN112347765B (en) * 2020-10-10 2022-06-07 清华大学 Entity labeling method, module and device based on dictionary matching
CN112347765A (en) * 2020-10-10 2021-02-09 清华大学 Entity labeling method, module and device based on dictionary matching
CN112667831A (en) * 2020-12-25 2021-04-16 上海硬通网络科技有限公司 Material storage method and device and electronic equipment
CN113408270A (en) * 2021-06-10 2021-09-17 广州三七极创网络科技有限公司 Variant text recognition method and device and electronic equipment
CN113420564B (en) * 2021-06-21 2022-11-22 国网山东省电力公司物资公司 Hybrid matching-based electric power nameplate semantic structuring method and system
CN113420564A (en) * 2021-06-21 2021-09-21 国网山东省电力公司物资公司 Hybrid matching-based electric power nameplate semantic structuring method and system
CN113761913A (en) * 2021-08-23 2021-12-07 南京优飞保科信息技术有限公司 Method and system for processing dialect text
CN113761913B (en) * 2021-08-23 2024-02-23 南京优飞保科信息技术有限公司 Method and system for processing speech operation text
CN114386407A (en) * 2021-12-23 2022-04-22 北京金堤科技有限公司 Word segmentation method and device for text
CN113988068A (en) * 2021-12-29 2022-01-28 深圳前海硬之城信息技术有限公司 Word segmentation method, device, equipment and storage medium of BOM text

Also Published As

Publication number Publication date
CN109657738A (en) 2019-04-19
CN109657738B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
WO2020082562A1 (en) Symbol identification method, apparatus, device, and storage medium
WO2020015067A1 (en) Data acquisition method, device, equipment and storage medium
WO2020119116A1 (en) Medical insurance auditing method, apparatus and device based on data analysis, and storage medium
WO2020233089A1 (en) Test case generating method and apparatus, terminal, and computer readable storage medium
WO2020073495A1 (en) Artificial intelligence-based reexamination method, apparatus, and device, and storage medium
WO2011021907A2 (en) Metadata tagging system, image searching method and device, and method for tagging a gesture thereof
WO2021051558A1 (en) Knowledge graph-based question and answer method and apparatus, and storage medium
WO2020186777A1 (en) Image retrieval method, apparatus and device, and computer-readable storage medium
WO2020251233A1 (en) Method, apparatus, and program for obtaining abstract characteristics of image data
WO2021012489A1 (en) Telephone platform log query method, terminal device, storage medium and apparatus
WO2021215620A1 (en) Device and method for automatically generating domain-specific image caption by using semantic ontology
WO2010123168A1 (en) Database management method and system
WO2016099019A1 (en) System and method for classifying patent documents
WO2020087704A1 (en) Credit information management method, apparatus, and device, and storage medium
WO2010137814A2 (en) Method of providing by-viewpoint patent map and system thereof
WO2019024485A1 (en) Data sharing method and device and computer readable storage medium
WO2020082766A1 (en) Association method and apparatus for input method, device and readable storage medium
WO2021003956A1 (en) Product information management method, apparatus and device, and storage medium
WO2020253113A1 (en) Invoice recording method, device, apparatus, and computer storage medium
WO2021012490A1 (en) Service relay switching method and apparatus, terminal device, and storage medium
WO2018086371A1 (en) Laptop, smart terminal, and method for creating content index for laptop
WO2016088954A1 (en) Spam classifying method, recording medium for implementing same, and spam classifying device
WO2014148784A1 (en) Linguistic model database for linguistic recognition, linguistic recognition device and linguistic recognition method, and linguistic recognition system
WO2024019226A1 (en) Method for detecting malicious urls
WO2023018150A1 (en) Method and device for personalized search of visual media

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 18937751; country of ref document: EP; kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 18937751; country of ref document: EP; kind code of ref document: A1