WO2021139349A1 - Graph neural network-based text error correction method, apparatus and device, and storage medium - Google Patents


Info

Publication number
WO2021139349A1
WO2021139349A1 PCT/CN2020/124828 CN2020124828W WO2021139349A1 WO 2021139349 A1 WO2021139349 A1 WO 2021139349A1 CN 2020124828 W CN2020124828 W CN 2020124828W WO 2021139349 A1 WO2021139349 A1 WO 2021139349A1
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
confusion
combination
sound
text
Prior art date
Application number
PCT/CN2020/124828
Other languages
French (fr)
Chinese (zh)
Inventor
颜泽龙
王健宗
吴天博
程宁
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021139349A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Definitions

  • A second aspect of the application provides a graph neural network-based text error correction device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps: acquiring a medical service corpus, and establishing a similar-shape confusion corpus set and a similar-sound confusion corpus set according to the medical service corpus and a preset dictionary; establishing, based on a preset graph neural network, a similar-shape confusion structure graph of the similar-shape confusion corpus set and a similar-sound confusion structure graph of the similar-sound confusion corpus set; sequentially performing a graph convolution operation and a graph attention calculation on the two structure graphs to obtain a confusion corpus structure graph; and acquiring a text corpus to be tested, extracting its character vectors with a preset vector extractor, calculating basic similarity probabilities between the character vectors and the confusion corpus structure graph, and modifying the text corpus to be tested according to the basic similarity probabilities to obtain a target text corpus.
  • A third aspect of the application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to execute the same steps: establishing the similar-shape and similar-sound confusion corpus sets, establishing the two confusion structure graphs, obtaining the confusion corpus structure graph through graph convolution and graph attention, and modifying the text corpus to be tested according to the basic similarity probabilities to obtain the target text corpus.
  • The embodiments of the application provide a graph neural network-based text error correction method, apparatus, device and storage medium. A confusion corpus structure graph of the medical service corpus is generated by the preset graph neural network. When correcting the text corpus to be tested, the server directly calculates the basic similarity probabilities against the corpus confusion structure matrix corresponding to the confusion corpus structure graph and determines the corrected target text corpus from their values. The solution can be applied in the field of smart healthcare, improves the efficiency of text error correction of the text corpus to be tested, and thereby promotes the construction of smart cities.
  • the program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the text error correction device 500 based on the graph neural network. Further, the processor 510 may be configured to communicate with the storage medium 530, and execute a series of instruction operations in the storage medium 530 on the text error correction device 500 based on the graph neural network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the field of artificial intelligence, and is applied to the field of smart medical treatment. Disclosed are a graph neural network-based text error correction method, apparatus and device, and a storage medium, used for avoiding a large amount of data calculation when a medical service system performs text error correction of a text corpus to be tested, and improving text error correction efficiency. The graph neural network-based text error correction method comprises: according to a medical service corpus, establishing a similar shape confusion corpus set and a similar sound confusion corpus set; on the basis of a preset graph neural network, establishing a similar shape confusion structure graph and a similar sound confusion structure graph; sequentially performing a graph convolution operation and a graph attention calculation on the similar shape confusion structure graph and the similar sound confusion structure graph, and obtaining a confusion corpus structure graph; using a preset vector extractor to extract character vectors of a text corpus to be tested, and according to basic similarity probabilities between the character vectors and the confusion corpus structure graph, changing the text corpus to be tested, and obtaining a target text corpus.

Description

Text error correction method, apparatus, device and storage medium based on graph neural network
This application claims priority to Chinese patent application No. 202010926425.0, filed with the Chinese Patent Office on September 7, 2020 and entitled "Text Error Correction Method, Apparatus, Device and Storage Medium Based on Graph Neural Network", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence, and in particular to a graph neural network-based text error correction method, apparatus, device and storage medium.
Background
In the field of artificial intelligence, Chinese error correction is a checking and optimization step in natural language processing; the stronger the error correction capability, the more accurate the natural language processing system. Specifically, Chinese error correction corrects the errors in a text containing various errors and restores the correct standard text. With the development of science and technology, Chinese error correction is widely used in scenarios such as speech recognition and social networks. In medical scenarios, when a doctor enters patient information with an external keyboard or an external voice receiver, typing on the keyboard can produce pinyin errors or adjacent-key typing errors, and speech-to-text conversion through the voice receiver can produce wrongly converted similar-shape or similar-sound characters. These errors carry risk in medical scenarios, especially errors made when a doctor records a patient's condition or treatment plan: they hinder the patient's treatment and the tracking of the condition, and can also aggravate tension between doctors and patients, impeding the improvement of the medical system and the progress of medical technology. In the existing technology, text is corrected through a large amount of computation and checking by the computer.
However, the inventors realized that when existing technology is used to correct the text corpus to be tested, the computer must perform a large amount of data computation, which takes a lot of time and makes text error correction of the text corpus to be tested inefficient.
Summary of the invention
The main purpose of this application is to solve the technical problem that, when existing technology performs text error correction on a text corpus to be tested, the computer must carry out a large amount of data computation, which makes text error correction inefficient.
To achieve the above objective, a first aspect of this application provides a graph neural network-based text error correction method, comprising: acquiring a medical service corpus, and establishing a similar-shape confusion corpus set and a similar-sound confusion corpus set according to the medical service corpus and a preset dictionary; establishing, based on a preset graph neural network, a similar-shape confusion structure graph of the similar-shape confusion corpus set and a similar-sound confusion structure graph of the similar-sound confusion corpus set; sequentially performing a graph convolution operation and a graph attention calculation on the similar-shape confusion structure graph and the similar-sound confusion structure graph to obtain a confusion corpus structure graph; and acquiring a text corpus to be tested, extracting character vectors of the text corpus to be tested with a preset vector extractor, calculating basic similarity probabilities between the character vectors and the confusion corpus structure graph, and modifying the text corpus to be tested according to the basic similarity probabilities to obtain a target text corpus.
A second aspect of this application provides a graph neural network-based text error correction device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps: acquiring a medical service corpus, and establishing a similar-shape confusion corpus set and a similar-sound confusion corpus set according to the medical service corpus and a preset dictionary; establishing, based on a preset graph neural network, a similar-shape confusion structure graph of the similar-shape confusion corpus set and a similar-sound confusion structure graph of the similar-sound confusion corpus set; sequentially performing a graph convolution operation and a graph attention calculation on the similar-shape confusion structure graph and the similar-sound confusion structure graph to obtain a confusion corpus structure graph; and acquiring a text corpus to be tested, extracting character vectors of the text corpus to be tested with a preset vector extractor, calculating basic similarity probabilities between the character vectors and the confusion corpus structure graph, and modifying the text corpus to be tested according to the basic similarity probabilities to obtain a target text corpus.
A third aspect of this application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to execute the following steps: acquiring a medical service corpus, and establishing a similar-shape confusion corpus set and a similar-sound confusion corpus set according to the medical service corpus and a preset dictionary; establishing, based on a preset graph neural network, a similar-shape confusion structure graph of the similar-shape confusion corpus set and a similar-sound confusion structure graph of the similar-sound confusion corpus set; sequentially performing a graph convolution operation and a graph attention calculation on the similar-shape confusion structure graph and the similar-sound confusion structure graph to obtain a confusion corpus structure graph; and acquiring a text corpus to be tested, extracting character vectors of the text corpus to be tested with a preset vector extractor, calculating basic similarity probabilities between the character vectors and the confusion corpus structure graph, and modifying the text corpus to be tested according to the basic similarity probabilities to obtain a target text corpus.
A fourth aspect of this application provides a graph neural network-based text error correction apparatus, comprising: an acquisition module, configured to acquire a medical service corpus and establish a similar-shape confusion corpus set and a similar-sound confusion corpus set according to the medical service corpus and a preset dictionary; an establishing module, configured to establish, based on a preset graph neural network, a similar-shape confusion structure graph of the similar-shape confusion corpus set and a similar-sound confusion structure graph of the similar-sound confusion corpus set; a calculation module, configured to sequentially perform a graph convolution operation and a graph attention calculation on the similar-shape confusion structure graph and the similar-sound confusion structure graph to obtain a confusion corpus structure graph; and a modification module, configured to acquire a text corpus to be tested, extract character vectors of the text corpus to be tested with a preset vector extractor, calculate basic similarity probabilities between the character vectors and the confusion corpus structure graph, and modify the text corpus to be tested according to the basic similarity probabilities to obtain a target text corpus.
In the technical solution provided by this application, a medical service corpus is acquired, and a similar-shape confusion corpus set and a similar-sound confusion corpus set are established according to the medical service corpus and a preset dictionary; a similar-shape confusion structure graph of the similar-shape confusion corpus set and a similar-sound confusion structure graph of the similar-sound confusion corpus set are established based on a preset graph neural network; a graph convolution operation and a graph attention calculation are sequentially performed on the two structure graphs to obtain a confusion corpus structure graph; and a text corpus to be tested is acquired, its character vectors are extracted with a preset vector extractor, basic similarity probabilities between the character vectors and the confusion corpus structure graph are calculated, and the text corpus to be tested is modified according to the basic similarity probabilities to obtain a target text corpus. In the embodiments of this application, the confusion corpus structure graph of the medical service corpus is generated by the preset graph neural network. When correcting the text corpus to be tested, the server directly calculates the basic similarity probabilities between the character vectors and the corpus confusion structure matrix corresponding to the confusion corpus structure graph, and determines the corrected target text corpus from the values of the basic similarity probabilities. The solution can be applied in the field of smart healthcare, improves the efficiency of text error correction of the text corpus to be tested, and thereby promotes the construction of smart cities.
Description of the drawings
FIG. 1 is a schematic diagram of one embodiment of the graph neural network-based text error correction method in the embodiments of this application;
FIG. 2 is a schematic diagram of another embodiment of the graph neural network-based text error correction method in the embodiments of this application;
FIG. 3 is a schematic diagram of one embodiment of the graph neural network-based text error correction apparatus in the embodiments of this application;
FIG. 4 is a schematic diagram of another embodiment of the graph neural network-based text error correction apparatus in the embodiments of this application;
FIG. 5 is a schematic diagram of another embodiment of the graph neural network-based text error correction device in the embodiments of this application.
Detailed description
The embodiments of this application provide a graph neural network-based text error correction method, apparatus, device and storage medium. A confusion corpus structure graph of the medical service corpus is generated by a preset graph neural network. When correcting the text corpus to be tested, the server directly calculates the basic similarity probabilities against the corpus confusion structure matrix corresponding to the confusion corpus structure graph and determines the corrected target text corpus from their values. The solution can be applied in the field of smart healthcare, improves the efficiency of text error correction of the text corpus to be tested, and thereby promotes the construction of smart cities.
The terms "first", "second", "third", "fourth" and the like (if any) in the description, claims and drawings of this application are used to distinguish similar objects and not necessarily to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in an order other than that illustrated or described here. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.
For ease of understanding, the specific flow of the embodiments of this application is described below. Referring to FIG. 1, one embodiment of the graph neural network-based text error correction method in the embodiments of this application includes:
101. Acquire a medical service corpus, and establish a similar-shape confusion corpus set and a similar-sound confusion corpus set according to the medical service corpus and a preset dictionary;
It can be understood that the execution subject of this application may be a graph neural network-based text error correction apparatus, or a terminal or a server, which is not specifically limited here. The embodiments of this application are described with the server as the execution subject.
First, the server needs to collect a large amount of medical service corpus. Here the medical service corpus refers to business vocabulary commonly used in medical scenarios, such as disease names and terms for disease treatment solutions. Collecting the medical service corpus brings the corrected Chinese text closer to actual usage and improves recognition in the scenario. A similar-shape confusion corpus set and a similar-sound confusion corpus set are then established according to the medical service corpus. The similar-shape confusion corpus set is the set of terms whose character shapes are similar to those of the medical service corpus; for example, for the medical service term 双目 ("both eyes"), a similar-shape confusion term is 双日. The similar-sound confusion corpus set is the set of terms whose pronunciation is easily confused with that of the medical service corpus; for example, for the medical service term 处理 (chǔlǐ, "treatment"), a similar-sound confusion term is 助理 (zhùlǐ, "assistant").
It should be noted that both the similar-shape confusion corpus set and the similar-sound confusion corpus set are established based on the preset dictionary. The preset dictionary is a standard reference work of characters and words, recording a large number of both.
102. Establish, based on a preset graph neural network, a similar-shape confusion structure graph of the similar-shape confusion corpus set and a similar-sound confusion structure graph of the similar-sound confusion corpus set;
After obtaining the similar-shape confusion corpus set and the similar-sound confusion corpus set, the server establishes the similar-shape confusion structure graph and the similar-sound confusion structure graph through the preset graph neural network. A graph neural network (GNN) is a neural network that operates directly on a graph structure, where a graph is a data structure consisting of vertices and edges. For example, a graph G can be described by a node set V and an edge set E as G = (V, E). Edges are determined by whether a directional dependency exists between nodes, and may be directed or undirected. In this application, the nodes of graph G are the terms in the medical service corpus and in the preset dictionary, and the edges connecting the nodes are the relationships between those terms, which may be a similar-shape confusion relationship or a similar-sound confusion relationship. If there is no edge between two nodes, no such relationship exists between the corresponding terms.
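The graph description G = (V, E) above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names `build_confusion_graph` and `adjacency_matrix` are hypothetical, edges are treated as undirected, and the pairs are taken from the 双目 example in this description. The source does not specify how the preset graph neural network stores its graph.

```python
from collections import defaultdict

def build_confusion_graph(pairs):
    """Build an undirected graph from (term, confusable_term) pairs.

    Nodes are terms; an edge means the two terms are confusable.
    """
    adjacency = defaultdict(set)
    for term, confusable in pairs:
        adjacency[term].add(confusable)
        adjacency[confusable].add(term)
    return dict(adjacency)

def adjacency_matrix(graph):
    """Return (ordered node list, 0/1 adjacency matrix) for the graph."""
    nodes = sorted(graph)
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for node, neighbours in graph.items():
        for neighbour in neighbours:
            matrix[index[node]][index[neighbour]] = 1
    return nodes, matrix

# Similar-shape pairs from the description's example (assumed input format).
shape_pairs = [("双目", "双日"), ("双目", "双耳")]
shape_graph = build_confusion_graph(shape_pairs)
nodes, matrix = adjacency_matrix(shape_graph)
```

Two terms with no edge between them, such as 双日 and 双耳 here, have a zero entry in the matrix, which matches the statement that no confusion relationship exists between them.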
103. Sequentially perform a graph convolution operation and a graph attention calculation on the similar-shape confusion structure graph and the similar-sound confusion structure graph to obtain a confusion corpus structure graph;
Through step 102 the server obtains the similar-shape confusion structure graph and the similar-sound confusion structure graph of the medical service corpus, but these two graphs do not show at a glance whether a medical service term has both similar-shape and similar-sound confusion terms. The server therefore uses the graph convolution operation and the graph attention calculation to compute the confusion corpus structure graph, which is the combination of the similar-shape confusion structure graph and the similar-sound confusion structure graph. It is obtained through the distribution of information across different convolutional layers and the assignment of different weights.
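As a rough illustration of how a graph convolution followed by an attention-weighted combination could merge the two graphs, consider the sketch below. The passage above gives no formulas, so everything specific here is an assumption: the symmetric normalisation D^-1/2 (A + I) D^-1/2 is the one commonly used for graph convolution, the attention is a simple softmax over dot products with a hypothetical `query` vector, and the one-hot features and identity weight are placeholders.

```python
import math

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a 0/1 adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]

def graph_convolution(adj, features, weight):
    """One layer: ReLU(normalised adjacency x node features x weight)."""
    out = matmul(matmul(normalize_adjacency(adj), features), weight)
    return [[max(0.0, value) for value in row] for row in out]

def attention_combine(h_shape, h_sound, query):
    """Per node, softmax-weight the two graph views and sum them."""
    combined = []
    for row_shape, row_sound in zip(h_shape, h_sound):
        scores = [sum(q * v for q, v in zip(query, row))
                  for row in (row_shape, row_sound)]
        top = max(scores)
        exps = [math.exp(score - top) for score in scores]
        total = sum(exps)
        w_shape, w_sound = exps[0] / total, exps[1] / total
        combined.append([w_shape * a + w_sound * b
                         for a, b in zip(row_shape, row_sound)])
    return combined

# Term 0 is confusable by shape with term 1 and by sound with term 2.
shape_adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
sound_adj = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot placeholders
weight = features          # identity weight, hypothetical
query = [1.0, 1.0, 1.0]    # hypothetical attention query

h_shape = graph_convolution(shape_adj, features, weight)
h_sound = graph_convolution(sound_adj, features, weight)
confusion_matrix = attention_combine(h_shape, h_sound, query)
```

In this toy case the combined row for term 0 carries non-zero weight for both term 1 and term 2, so a single matrix records the similar-shape and the similar-sound relationship at once, which is the property the combined graph is meant to provide.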
104. Acquire a text corpus to be tested, extract character vectors of the text corpus to be tested with a preset vector extractor, calculate basic similarity probabilities between the character vectors and the confusion corpus structure graph, and modify the text corpus to be tested according to the basic similarity probabilities to obtain a target text corpus.
After obtaining the confusion corpus structure graph through calculation, the server can correct text corpus. First, the server acquires the text corpus to be tested and uses the preset vector extractor to extract the character vectors from it. The server then calculates the basic similarity probabilities between the character vectors and the corpus confusion structure matrix in the confusion corpus structure graph, and modifies the text corpus to be tested according to the values of the basic similarity probabilities to obtain the target text corpus.
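The correction step can be sketched as follows. This is a hedged illustration, not the method as claimed: the passage does not define how the basic similarity probability is computed, so the sketch assumes a softmax over dot products between a character vector and candidate rows of the confusion structure matrix. The function names, the two-dimensional vectors and the candidate table are all hypothetical.

```python
import math

def similarity_probabilities(char_vector, candidate_rows):
    """Softmax over dot products between a vector and each candidate row."""
    names = list(candidate_rows)
    scores = [sum(a * b for a, b in zip(char_vector, candidate_rows[name]))
              for name in names]
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return {name: value / total for name, value in zip(names, exps)}

def correct_term(char_vector, candidate_rows):
    """Return the candidate with the highest similarity probability."""
    probabilities = similarity_probabilities(char_vector, candidate_rows)
    return max(probabilities, key=probabilities.get)

# Hypothetical rows of the confusion structure matrix for two confusable terms.
candidate_rows = {"双目": [1.0, 0.0], "双日": [0.0, 1.0]}
# Hypothetical context-aware vector extracted for a written "双日".
observed_vector = [0.9, 0.1]

probabilities = similarity_probabilities(observed_vector, candidate_rows)
corrected = correct_term(observed_vector, candidate_rows)
```

Because only the terms connected in the confusion graph need to be scored, the comparison stays small for each character, which is consistent with the stated goal of avoiding a large amount of computation.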
Referring to FIG. 2, another embodiment of the graph neural network-based text error correction method in the embodiments of this application includes:
201. Acquire a medical service corpus, and establish a similar-shape confusion corpus set and a similar-sound confusion corpus set according to the medical service corpus and a preset dictionary;
Specifically, the server first acquires the medical service corpus and uses a preset similarity function to calculate the basic glyph similarity between the medical service corpus and the standard corpus in the preset dictionary. Next, the server selects the target glyph similarities, namely the basic glyph similarities greater than a similarity threshold, takes the standard corpus corresponding to the target glyph similarities as the similar-shape confusion corpus of the medical service corpus, combines the medical service corpus with the similar-shape confusion corpus into similar-shape confusion combinations, and generates the similar-shape confusion corpus set from these combinations. The server then uses a preset fuzzy matching algorithm to convert the medical service corpus into corpus phonetic symbols and selects the target phonetic symbols among them, the target phonetic symbols being those that contain easily confused finals and/or initials. Finally, the server converts the target phonetic symbols into near-sound phonetic symbols, queries the preset dictionary for standard corpus whose standard phonetic symbols are identical to the near-sound phonetic symbols, takes that standard corpus as the similar-sound confusion corpus of the medical service corpus, combines the medical service corpus with the similar-sound confusion corpus into similar-sound confusion combinations, and generates the similar-sound confusion corpus set from these combinations.
It should be noted that the medical service corpus here consists of words or phrases commonly used in medical scenarios, and the preset dictionary is a standard lexicon recording a large number of characters, words and phrases. The standard corpus in the preset dictionary includes but is not limited to the medical service corpus. The number of medical service terms and the number of standard terms in the preset dictionary are each at least 1000; beyond that, this application does not limit either number, and both can be set according to the actual situation.
After obtaining the medical business corpus and the standard corpus in the preset dictionary, the server calculates the basic glyph similarity between them with the preset similarity function. Each medical business corpus entry is compared with every standard corpus entry, so the calculation yields multiple basic glyph similarities. From these, the server selects the target glyph similarities, i.e. those whose value is greater than the similarity threshold, and takes the corresponding standard corpus entries as the shape-similar confusion corpus of the medical business corpus entry. For example, suppose the medical business corpus entry is 双目 (both eyes) and the standard corpus entries are 双日, 双耳 (both ears), 右耳 (right ear) and 左眼 (left eye), for which the preset similarity function gives basic glyph similarities of 0.86, 0.78, 0.46 and 0.13 respectively. With a similarity threshold of 0.58, the standard corpus entries 双日 and 双耳, corresponding to 0.86 and 0.78, are taken as the shape-similar confusion corpus of 双目. A medical business corpus entry can thus correspond to several shape-similar confusion corpus entries. Combining the medical business corpus entry with each of its shape-similar confusion corpus entries gives the shape-similar confusion combinations, and integrating these combinations gives the shape-similar confusion corpus set.
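The threshold selection in the example above can be sketched as follows. This is a minimal illustration only: the similarity values are copied from the example, the preset similarity function itself is not specified in the text, and the names `select_shape_confusions` and `scores_for_shuangmu` are hypothetical.

```python
def select_shape_confusions(scores, threshold):
    """Return the standard entries whose glyph similarity exceeds the threshold."""
    return [entry for entry, score in scores.items() if score > threshold]


# Similarity values taken from the worked example; how they are computed
# (the "preset similarity function") is not described and is not modeled here.
scores_for_shuangmu = {"双日": 0.86, "双耳": 0.78, "右耳": 0.46, "左眼": 0.13}

confusions = select_shape_confusions(scores_for_shuangmu, threshold=0.58)

# Each shape-similar confusion combination pairs the medical entry with one confusion entry.
combinations = [("双目", entry) for entry in confusions]
```

Here `combinations` holds the shape-similar confusion combinations for 双目; collecting such lists over all medical business corpus entries would give the shape-similar confusion corpus set.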
The server uses the preset fuzzy matching algorithm to convert the medical business corpus into corpus phonetic symbols; that is, the algorithm converts each medical business corpus entry into its pinyin. The server then selects the target phonetic symbols among the corpus phonetic symbols and converts them into near-sound phonetic symbols. The target phonetic symbols are those containing easily confused finals and/or initials, for example easily confused consonants (b/p), front and back nasal finals (en/eng), and flat-tongue and retroflex initials (z/zh). The server then queries the preset dictionary for standard corpus entries whose standard phonetic symbols are identical to the near-sound phonetic symbols and takes them as the near-sound confusion corpus of the medical business corpus entry. For example, the medical business corpus entry 增生 (hyperplasia) is converted into the corpus phonetic symbols zeng sheng; the confusable pairs involved are z/zh, sh/s and eng/en; and the near-sound confusion corpus entries that can be found in the preset dictionary are 正盛 (zheng sheng), 真身 (zhen shen) and 政审 (zheng shen). The server then combines the medical business corpus entry with each near-sound confusion corpus entry to obtain the near-sound confusion combinations, and integrates these combinations to obtain the near-sound confusion corpus set.
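One way to read the near-sound lookup is sketched below. It assumes toneless pinyin, uses only the confusable pairs named in the text, and relies on a tiny hand-written dictionary; the helper names and `toy_dictionary` are hypothetical, and the actual preset fuzzy matching algorithm is not specified.

```python
from itertools import product

# Confusable initials and finals taken from the examples in the text.
CONFUSABLE_INITIALS = [("z", "zh"), ("s", "sh"), ("b", "p")]
CONFUSABLE_FINALS = [("en", "eng")]


def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in ("zh", "ch", "sh"):
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    if syllable[0] in "bpmfdtnlgkhjqxrzcsyw":
        return syllable[0], syllable[1:]
    return "", syllable  # syllable without an initial, e.g. "an"


def variants(part, pairs):
    """Return the part itself plus its confusable counterpart, if any."""
    out = {part}
    for a, b in pairs:
        if part == a:
            out.add(b)
        elif part == b:
            out.add(a)
    return out


def near_sound_syllables(syllable):
    """All syllables reachable by swapping a confusable initial and/or final."""
    initial, final = split_syllable(syllable)
    return {
        i + f
        for i, f in product(variants(initial, CONFUSABLE_INITIALS),
                            variants(final, CONFUSABLE_FINALS))
    }


def near_sound_entries(pinyin, dictionary):
    """Dictionary entries whose pinyin is a near-sound variant of `pinyin`."""
    options = [sorted(near_sound_syllables(s)) for s in pinyin.split()]
    candidates = {" ".join(combo) for combo in product(*options)}
    return sorted(word for word, reading in dictionary.items() if reading in candidates)


# Hypothetical stand-in for the preset dictionary (entry -> toneless pinyin).
toy_dictionary = {
    "正盛": "zheng sheng",
    "真身": "zhen shen",
    "政审": "zheng shen",
    "双目": "shuang mu",
}

found = near_sound_entries("zeng sheng", toy_dictionary)  # pinyin of 增生
```

With this toy dictionary, `found` contains 正盛, 真身 and 政审, matching the example; a real system would need a full pinyin converter and a complete table of confusable initials and finals.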
202. Extract a first business corpus and a second business corpus from the medical business corpus, and combine the first business corpus with the second business corpus to obtain a combination to be detected;
The server extracts a first business corpus and a second business corpus from the medical business corpus. The first business corpus and the second business corpus here correspond to corpus nodes in a graph structure, and combining them corresponds to connecting the two corpus nodes by an edge. A combination to be detected therefore comprises two corpus nodes and one edge.
203. Determine the first position element at the position coordinates of the combination to be detected according to the combination to be detected and the shape-similar confusion combinations, and determine the basic shape-similar confusion matrix from the first position element;
The server first determines whether the combination to be detected, formed by the first business corpus and the second business corpus on the two corpus nodes, is a shape-similar confusion combination. If it is, the first position element corresponding to the position coordinates of the combination to be detected is set to a first threshold, which here is 1. If it is not, the first position element corresponding to the position coordinates of the combination to be detected is set to a second threshold, which here is 0. The server establishes an initial shape-similar confusion matrix according to the position coordinates of the combinations to be detected, and fills it with the first position elements at the corresponding position coordinates to obtain the basic shape-similar confusion matrix.
For example, suppose the first business corpus is 一 and the second business corpus is 亿, and the combination to be detected formed from them corresponds to the position coordinates (1, 2) and (2, 1). The server determines whether the combination to be detected is a shape-similar confusion combination. If it is, the first position element at those position coordinates is marked as 1; if it is not, the first position element is marked as 0. The basic shape-similar confusion matrix can then be established from the position coordinates of the combinations to be detected and the values of the first position elements.
204. Determine the second position element at the position coordinates of the combination to be detected according to the combination to be detected and the near-sound confusion combinations, and determine the basic near-sound confusion matrix from the second position element;
The server first determines whether the combination to be detected, formed by the first business corpus and the second business corpus on the two corpus nodes, is a near-sound confusion combination. If it is, the second position element corresponding to the position coordinates of the combination to be detected is set to a third threshold, which here is 1. If it is not, the second position element corresponding to the position coordinates of the combination to be detected is set to a fourth threshold, which here is 0. The server establishes an initial near-sound confusion matrix according to the position coordinates of the combinations to be detected, and fills it with the second position elements at the corresponding position coordinates to obtain the basic near-sound confusion matrix.
For example, suppose the first business corpus is 牛 and the second business corpus is 刘, and the combination to be detected formed from them corresponds to the position coordinates (1, 2) and (2, 1). The server determines whether the combination to be detected is a near-sound confusion combination. If it is, the second position element at those position coordinates is marked as 1; if it is not, the second position element is marked as 0. The basic near-sound confusion matrix can then be established from the position coordinates of the combinations to be detected and the values of the second position elements.
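The construction of the two basic confusion matrices can be sketched with a single helper. This is a minimal sketch: `build_confusion_matrix` and the sample entries are hypothetical, and the code uses 0-based indices, so the coordinates (1, 2) and (2, 1) in the text correspond to `[0][1]` and `[1][0]`.

```python
def build_confusion_matrix(entries, confusion_pairs):
    """Build a symmetric 0/1 matrix: 1 where two entries form a confusion combination."""
    index = {entry: i for i, entry in enumerate(entries)}
    n = len(entries)
    matrix = [[0] * n for _ in range(n)]  # initial matrix, every position element is 0
    for a, b in confusion_pairs:
        if a in index and b in index:
            i, j = index[a], index[b]
            matrix[i][j] = 1  # position (i, j)
            matrix[j][i] = 1  # and its mirror (j, i)
    return matrix


# Corpus nodes; the pairs below stand in for the confusion combinations.
entries = ["牛", "刘", "双目", "双日"]
sound_matrix = build_confusion_matrix(entries, [("牛", "刘")])
shape_matrix = build_confusion_matrix(entries, [("双目", "双日")])
```

The same node ordering is used for both matrices, so row `i` refers to the same corpus node in the shape-similar and near-sound views.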
205. Use the preset graph neural network to generate the shape-similar confusion structure graph of the basic shape-similar confusion matrix and the near-sound confusion structure graph of the basic near-sound confusion matrix;
After obtaining the basic shape-similar confusion matrix and the basic near-sound confusion matrix, the server converts both matrices into graphs through the preset graph neural network, thereby obtaining the shape-similar confusion structure graph and the near-sound confusion structure graph.
206. Perform a graph convolution operation and a graph attention calculation in sequence on the shape-similar confusion structure graph and the near-sound confusion structure graph to obtain the confusion corpus structure graph;
Specifically, the server first performs a graph convolution calculation on the shape-similar confusion structure graph and uses a first calculation formula to calculate the adjacent shape-similar corpus information. The first calculation formula is $f(A_p, H_p^l) = A_p' H_p^l W_p^l$, where $f(A_p, H_p^l)$ denotes the adjacent shape-similar corpus information, $A_p$ denotes the basic shape-similar confusion matrix in the shape-similar confusion structure graph, $H_p^l$ denotes the first hyperparameter of the $l$-th convolutional layer, $A_p'$ denotes the regularized matrix of the basic shape-similar confusion matrix, and $W_p^l$ denotes the second hyperparameter of the $l$-th convolutional layer. Next, the server performs a graph convolution calculation on the near-sound confusion structure graph and uses a second calculation formula to calculate the adjacent near-sound corpus information. The second calculation formula is $f(A_s, H_s^l) = A_s' H_s^l W_s^l$, where $f(A_s, H_s^l)$ denotes the adjacent near-sound corpus information, $A_s$ denotes the basic near-sound confusion matrix in the near-sound confusion structure graph, $H_s^l$ denotes the third hyperparameter of the $l$-th convolutional layer, $A_s'$ denotes the regularized matrix of the basic near-sound confusion matrix, and $W_s^l$ denotes the fourth hyperparameter of the $l$-th convolutional layer. The server then uses a third calculation formula to perform a graph attention calculation on the adjacent shape-similar corpus information and the adjacent near-sound corpus information to obtain the corpus confusion structure matrix. The third calculation formula is:
Figure PCTCN2020124828-appb-000001
where
Figure PCTCN2020124828-appb-000002
denotes the corpus confusion structure matrix,
Figure PCTCN2020124828-appb-000003
denotes the corpus information in the $i$-th row of the $l$-th convolutional layer of the adjacent shape-similar corpus information or the adjacent near-sound corpus information, $i$ is a positive integer, $k$ denotes the information marker with $k \in \{s, p\}$,
Figure PCTCN2020124828-appb-000004
denotes the weight of the $i$-th corpus information of the $l$-th convolutional layer of the adjacent shape-similar corpus information or the adjacent near-sound corpus information, $w_a$ denotes [definition missing], and $\beta$ denotes the hyperparameter that controls the graph attention weights. Finally, the server uses the preset graph neural network to generate the confusion corpus structure graph from the corpus confusion structure matrix.
Once the shape-similar confusion structure graph and the near-sound confusion structure graph have been generated, the server could check the medical business corpus for shape-similar confusion corpus and for near-sound confusion corpus separately. To detect both kinds of confusion corpus at the same time, the server needs to combine the two confusion structure graphs through the calculation formulas.
First, the server performs a graph convolution operation on the shape-similar confusion structure graph and on the near-sound confusion structure graph separately. Specifically, a convolution calculation over the shape-similar confusion structure graph extracts the adjacent shape-similar corpus information, and a convolution calculation over the near-sound confusion structure graph extracts the adjacent near-sound corpus information. Note that the graph structure has different levels during the graph convolution calculation; the server convolves the convolutional layers of the same level, and the result is the adjacent shape-similar corpus information of that level. In addition, because this application does not limit the number of medical business corpus entries, the number of rows of the shape-similar corpus confusion matrix formed from the medical business corpus and the standard corpus in the preset dictionary may be a very large positive integer. To simplify the calculation, the server can regularize the basic near-sound confusion matrix and thereby reduce its number of rows. The graph convolution calculation for the near-sound confusion structure graph follows the same principle as that for the shape-similar confusion structure graph and is therefore not repeated here.
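A single graph convolution of the form f(A, H) = A'HW can be sketched in plain Python as follows. The sketch makes two assumptions that the text does not confirm: A' is taken to be the symmetric normalization D^(-1/2)(A + I)D^(-1/2) that is common in graph convolutional networks, and H and W are treated as a node feature matrix and a weight matrix respectively (the text calls both hyperparameters). The function names are hypothetical.

```python
import math


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def normalize_adjacency(a):
    """Assumed form of A': D^(-1/2) (A + I) D^(-1/2), with D the degree matrix of A + I."""
    n = len(a)
    a_hat = [[a[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def graph_convolution(a, h, w):
    """One layer of f(A, H) = A' H W."""
    return matmul(matmul(normalize_adjacency(a), h), w)


# Two corpus nodes that form a confusion combination, with identity features and weights.
a = [[0, 1], [1, 0]]
h = [[1.0, 0.0], [0.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
out = graph_convolution(a, h, w)
```

In this toy case each node's output row is the average of its own features and its neighbour's, which illustrates how a convolution mixes information between confusable entries. The same function would be applied to the shape-similar and near-sound matrices in turn.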
After calculating the graph convolutions of the shape-similar confusion structure graph and the near-sound confusion structure graph separately, the server performs the graph attention calculation on the adjacent shape-similar corpus information and the adjacent near-sound corpus information to obtain the corpus confusion structure matrix. In other words, the adjacent confusion corpus information calculated by each convolutional layer is accumulated to give the corpus confusion structure matrix. The server then converts the corpus confusion structure matrix through the preset graph neural network to obtain the confusion corpus structure graph.
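The third calculation formula is available only as an image, so the following is a sketch of one plausible attention scheme, not the formula of this application. It assumes a row-wise softmax over the two channels (shape-similar p and near-sound s), with scores given by the dot product of a vector `w_a` with each row and divided by `beta`. The function name and all values are hypothetical.

```python
import math


def attention_combine(f_p, f_s, w_a, beta):
    """Combine two channels row by row with softmax attention weights (assumed form)."""
    combined = []
    for row_p, row_s in zip(f_p, f_s):
        score_p = sum(w * x for w, x in zip(w_a, row_p)) / beta
        score_s = sum(w * x for w, x in zip(w_a, row_s)) / beta
        top = max(score_p, score_s)  # subtract the max for numerical stability
        e_p, e_s = math.exp(score_p - top), math.exp(score_s - top)
        alpha_p, alpha_s = e_p / (e_p + e_s), e_s / (e_p + e_s)
        combined.append([alpha_p * xp + alpha_s * xs for xp, xs in zip(row_p, row_s)])
    return combined


# One node whose two channels score equally, so each gets weight 0.5.
mixed = attention_combine([[1.0, 0.0]], [[0.0, 1.0]], w_a=[1.0, 1.0], beta=1.0)
```

Under this assumed form, a smaller `beta` sharpens the weights towards the higher-scoring channel and a larger `beta` flattens them, which is consistent with β being described as controlling the graph attention weights.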
207. Obtain the text corpus to be tested, use a preset vector extractor to extract the character vectors of the text corpus to be tested, calculate the basic similarity probability between the character vectors and the confusion corpus structure graph, and modify the text corpus to be tested according to the basic similarity probability to obtain the target text corpus.
Specifically, the server first obtains the text corpus to be tested and uses the preset vector extractor to extract the character vectors in the text corpus to be tested. The server then calculates the basic similarity probability between the corpus confusion structure matrix of the confusion corpus structure graph and the character vectors. Finally, the server selects the target similarity probability, i.e. the basic similarity probability with the largest value, takes the confusion text corpus corresponding to the target similarity probability as the modified text corpus, and replaces the text corpus to be tested with the modified text corpus to obtain the target text corpus. The confusion text corpus is the corpus in the confusion corpus structure graph.
The server obtains the text corpus to be tested, which here refers to text typed in by a doctor or text converted from speech, and then extracts character vectors from it using the preset vector extractor. The preset vector extractor is a BERT (Bidirectional Encoder Representations from Transformers) extractor. The BERT extractor further increases the generalization ability of the word vector extractor and fully describes character-level, word-level, sentence-level and even inter-sentence relationship features, and thereby extracts the character vectors in the text corpus to be tested.
The server modifies the text corpus to be tested by calculating the basic similarity probability between the character vectors and the confusion structure matrix, and thereby obtains the target text corpus. The basic similarity probability is calculated in a fully connected layer. The server selects the target similarity probability, i.e. the basic similarity probability with the largest value, and takes the confusion text corpus corresponding to it as the modified text corpus, where the confusion text corpus is the corpus in the confusion corpus structure graph, i.e. a corpus node of that graph. The text corpus to be tested is then replaced with the modified text corpus to obtain the target text corpus.
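The selection of the most probable correction can be sketched as below. The sketch assumes that the fully connected layer scores each corpus node by the dot product of the character vector with that node's row of the confusion structure matrix, followed by a softmax; the text does not spell this out. The vectors are small hand-written stand-ins rather than BERT outputs, and the names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def correct_character(char_vector, node_entries, node_matrix):
    """Return the corpus node with the highest similarity probability, and that probability."""
    logits = [sum(c * x for c, x in zip(char_vector, row)) for row in node_matrix]
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return node_entries[best], probabilities[best]


# Hypothetical 2-dimensional node representations and character vector.
node_entries = ["双目", "双日", "双耳"]
node_matrix = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
char_vector = [2.0, 0.1]

replacement, probability = correct_character(char_vector, node_entries, node_matrix)
```

Here `replacement` is the entry that would replace the text corpus to be tested. A production system would also need a rule for leaving the text unchanged when the best candidate is the original entry or its probability is low, which the text does not describe.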
In the embodiments of this application, the confusion corpus structure graph of the medical business corpus is generated through the preset graph neural network. When correcting errors in the text corpus to be tested, the server directly calculates the basic similarity probability between the character vectors of the text and the corpus confusion structure matrix corresponding to the confusion corpus structure graph, and determines the corrected target text corpus from the value of the basic similarity probability. This solution can be applied in the field of smart healthcare; it improves the efficiency of error correction for the text corpus to be tested and thereby promotes the construction of smart cities.
The graph neural network-based text error correction method in the embodiments of this application has been described above; the graph neural network-based text error correction device in the embodiments of this application is described below. Referring to FIG. 3, one embodiment of the graph neural network-based text error correction device in the embodiments of this application includes:
an acquisition module 301, configured to obtain the medical business corpus and establish the shape-similar confusion corpus set and the near-sound confusion corpus set according to the medical business corpus and the preset dictionary; an establishment module 302, configured to establish, based on the preset graph neural network, the shape-similar confusion structure graph of the shape-similar confusion corpus set and the near-sound confusion structure graph of the near-sound confusion corpus set; a calculation module 303, configured to perform a graph convolution operation and a graph attention calculation in sequence on the shape-similar confusion structure graph and the near-sound confusion structure graph to obtain the confusion corpus structure graph; and a modification module 304, configured to obtain the text corpus to be tested, use the preset vector extractor to extract the character vectors of the text corpus to be tested, calculate the basic similarity probability between the character vectors and the confusion corpus structure graph, and modify the text corpus to be tested according to the basic similarity probability to obtain the target text corpus.
Referring to FIG. 4, another embodiment of the graph neural network-based text error correction device in the embodiments of this application includes:
an acquisition module 301, configured to obtain the medical business corpus and establish the shape-similar confusion corpus set and the near-sound confusion corpus set according to the medical business corpus and the preset dictionary; an establishment module 302, configured to establish, based on the preset graph neural network, the shape-similar confusion structure graph of the shape-similar confusion corpus set and the near-sound confusion structure graph of the near-sound confusion corpus set; a calculation module 303, configured to perform a graph convolution operation and a graph attention calculation in sequence on the shape-similar confusion structure graph and the near-sound confusion structure graph to obtain the confusion corpus structure graph; and a modification module 304, configured to obtain the text corpus to be tested, use the preset vector extractor to extract the character vectors of the text corpus to be tested, calculate the basic similarity probability between the character vectors and the confusion corpus structure graph, and modify the text corpus to be tested according to the basic similarity probability to obtain the target text corpus.
Optionally, the acquisition module 301 may be further specifically configured to: obtain the medical business corpus and use the preset similarity function to calculate the basic glyph similarity between the medical business corpus and the standard corpus in the preset dictionary; select the target glyph similarities, i.e. the basic glyph similarities greater than the similarity threshold, take the standard corpus corresponding to the target glyph similarities as the shape-similar confusion corpus of the medical business corpus, combine the medical business corpus with the shape-similar confusion corpus into shape-similar confusion combinations, and generate the shape-similar confusion corpus set from these combinations; use the preset fuzzy matching algorithm to convert the medical business corpus into corpus phonetic symbols and select the target phonetic symbols among them, where the target phonetic symbols include easily confused finals and/or initials; and convert the target phonetic symbols into near-sound phonetic symbols, query the preset dictionary for standard corpus whose standard phonetic symbols are identical to the near-sound phonetic symbols, take that standard corpus as the near-sound confusion corpus of the medical business corpus, combine the medical business corpus with the near-sound confusion corpus into near-sound confusion combinations, and generate the near-sound confusion corpus set from these combinations.
Optionally, the establishment module 302 includes: a combining unit 3021, configured to extract a first business corpus and a second business corpus from the medical business corpus and combine them to obtain a combination to be detected; a first determining unit 3022, configured to determine the first position element at the position coordinates of the combination to be detected according to the combination to be detected and the shape-similar confusion combinations, and determine the basic shape-similar confusion matrix from the first position element; a second determining unit 3023, configured to determine the second position element at the position coordinates of the combination to be detected according to the combination to be detected and the near-sound confusion combinations, and determine the basic near-sound confusion matrix from the second position element; and a generating unit 3024, configured to use the preset graph neural network to generate the shape-similar confusion structure graph of the basic shape-similar confusion matrix and the near-sound confusion structure graph of the basic near-sound confusion matrix.
Optionally, the first determining unit 3022 may be further specifically configured to: determine whether the combination to be detected is a shape-similar confusion combination; if it is, obtain the position coordinates of the combination to be detected and mark the first position element corresponding to the position coordinates as the first threshold; if it is not, obtain the position coordinates of the combination to be detected and mark the first position element corresponding to the position coordinates as the second threshold; and establish the initial shape-similar confusion matrix from the position coordinates of the combinations to be detected and enter the first position elements into it to obtain the basic shape-similar confusion matrix.
Optionally, the second determining unit 3023 may be further specifically configured to: determine whether the combination to be detected is a near-sound confusion combination; if it is, obtain the position coordinates of the combination to be detected and mark the second position element corresponding to the position coordinates as the third threshold; if it is not, obtain the position coordinates of the combination to be detected and mark the second position element corresponding to the position coordinates as the fourth threshold; and establish the initial near-sound confusion matrix from the position coordinates of the combinations to be detected and enter the second position elements into it to obtain the basic near-sound confusion matrix.
Optionally, the calculation module 303 may be further configured to: perform a graph convolution calculation on the near-form confusion structure map and calculate the adjacent near-form corpus information with a first calculation formula, the first calculation formula being $f(A_p, H_p^l) = A_p' H_p^l W_p^l$, where $f(A_p, H_p^l)$ denotes the adjacent near-form corpus information, $A_p$ denotes the basic near-form confusion matrix in the near-form confusion structure map, $H_p^l$ denotes the first hyperparameter of the $l$-th convolutional layer, $A_p'$ denotes the regularized matrix of the basic near-form confusion matrix, and $W_p^l$ denotes the second hyperparameter of the $l$-th convolutional layer; perform a graph convolution calculation on the near-sound confusion structure map and calculate the adjacent near-sound corpus information with a second calculation formula, the second calculation formula being $f(A_s, H_s^l) = A_s' H_s^l W_s^l$, where $f(A_s, H_s^l)$ denotes the adjacent near-sound corpus information, $A_s$ denotes the basic near-sound confusion matrix in the near-sound confusion structure map, $H_s^l$ denotes the third hyperparameter of the $l$-th convolutional layer, $A_s'$ denotes the regularized matrix of the basic near-sound confusion matrix, and $W_s^l$ denotes the fourth hyperparameter of the $l$-th convolutional layer; and perform a graph attention calculation on the adjacent near-form corpus information and the adjacent near-sound corpus information with a third calculation formula to obtain the corpus confusion structure matrix, the third calculation formula being:
Figure PCTCN2020124828-appb-000005
where
Figure PCTCN2020124828-appb-000006
denotes the corpus confusion structure matrix;
Figure PCTCN2020124828-appb-000007
denotes the corpus information in the $i$-th row of the $l$-th convolutional layer of the adjacent near-form corpus information or the adjacent near-sound corpus information, where $i$ is a positive integer and $k$ is the information marker, with $k \in \{s, p\}$;
Figure PCTCN2020124828-appb-000008
denotes the weight of the $i$-th piece of corpus information of the $l$-th convolutional layer of the adjacent near-form corpus information or the adjacent near-sound corpus information; $w_a$ is a parameter of the formula (its definition is not stated in the text); and $\beta$ denotes the hyperparameter that controls the graph attention weights. A preset graph neural network is then used to generate the confusion corpus structure map from the corpus confusion structure matrix.
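As a rough illustration of the two graph convolutions and the attention step, the sketch below uses plain Python lists. Several points are assumptions rather than content of the application: the regularized matrix $A'$ is computed here as the symmetric normalization $D^{-1/2}(A + I)D^{-1/2}$, which is one common choice; and because the third calculation formula survives only as a figure reference, `attention_combine` uses a generic per-row softmax weighting with $w_a$ and $\beta$ that may differ from the formula in the figure. All function and variable names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Assumed regularization: D^{-1/2} (A + I) D^{-1/2}."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    degree = [sum(row) for row in with_loops]
    return [[with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def graph_convolution(adj, h, w):
    """First and second formulas: f(A, H) = A' H W."""
    return matmul(matmul(normalize_adjacency(adj), h), w)


def attention_combine(f_p, f_s, w_a, beta):
    """Assumed attention: softmax over the two graphs, computed row by row."""
    combined = []
    for row_p, row_s in zip(f_p, f_s):
        scores = [sum(a * x for a, x in zip(w_a, row)) / beta
                  for row in (row_p, row_s)]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha_p, alpha_s = exps[0] / total, exps[1] / total
        combined.append([alpha_p * x + alpha_s * y for x, y in zip(row_p, row_s)])
    return combined


identity = [[1.0, 0.0], [0.0, 1.0]]
a_p = [[0, 1], [1, 0]]  # two characters that are near-form confusable
a_s = [[0, 0], [0, 0]]  # no near-sound confusion between them
f_p = graph_convolution(a_p, identity, identity)
f_s = graph_convolution(a_s, identity, identity)
combined = attention_combine(f_p, f_s, w_a=[0.0, 0.0], beta=1.0)
```

With `w_a` set to zero both graphs receive equal weight, so each row of `combined` is the plain average of the corresponding rows of `f_p` and `f_s`; a smaller `beta` would sharpen the weighting toward the higher-scoring graph.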
Optionally, the modification module 304 may be further configured to: obtain the text corpus to be tested and extract its character vectors with a preset vector extractor; calculate the basic similarity probabilities between the corpus confusion structure matrix of the confusion corpus structure map and the character vectors; and select the target similarity probability, namely the basic similarity probability with the largest value, take the confusion text corpus corresponding to the target similarity probability as the modified text corpus, and replace the text corpus to be tested with the modified text corpus to obtain the target text corpus, where the confusion text corpus is a corpus in the confusion corpus structure map.
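The selection step can be illustrated with a short sketch. The text does not state how the basic similarity probability is computed, so the dot product followed by a softmax below is an assumption, as are the function names, the candidate list, and the toy vectors.

```python
import math


def similarity_probabilities(char_vector, candidate_matrix):
    """Assumed scoring: softmax over dot products with each candidate row."""
    scores = [sum(a * b for a, b in zip(char_vector, row)) for row in candidate_matrix]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def correct_text(char_vectors, candidates, candidate_matrix):
    """Replace each character with the candidate of highest probability."""
    corrected = []
    for vector in char_vectors:
        probs = similarity_probabilities(vector, candidate_matrix)
        best = max(range(len(probs)), key=probs.__getitem__)
        corrected.append(candidates[best])
    return "".join(corrected)


# Hypothetical data: one row of the confusion structure matrix per candidate.
candidates = ["是", "四"]
candidate_matrix = [[1.0, 0.0], [0.0, 1.0]]
char_vectors = [[0.9, 0.1], [0.2, 0.8]]  # stand-ins for extractor output
result = correct_text(char_vectors, candidates, candidate_matrix)
```

In practice the character vectors would come from the preset vector extractor and the candidate rows from the corpus confusion structure matrix; only the argmax-and-replace logic is taken from the description.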
In this embodiment of the present application, a preset graph neural network generates the confusion corpus structure map of the medical service corpus. When correcting the text corpus to be tested, the server directly calculates the basic similarity probabilities against the corpus confusion structure matrix corresponding to the confusion corpus structure map and uses their values to determine the target text corpus. This solution can be applied in the field of smart healthcare; it improves the efficiency of text error correction for the text corpus to be tested and thereby promotes the construction of smart cities.
Figures 3 and 4 above describe the graph neural network-based text error correction apparatus of the embodiments of the present application in detail from the perspective of modular functional entities. The graph neural network-based text error correction device of the embodiments is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a graph neural network-based text error correction device provided by an embodiment of the present application. The graph neural network-based text error correction device 500 may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) 510, a memory 520, and one or more storage media 530 (for example, one or more mass storage devices) that store application programs 533 or data 532. The memory 520 and the storage medium 530 may provide transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the graph neural network-based text error correction device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute, on the graph neural network-based text error correction device 500, the series of instruction operations in the storage medium 530.
The graph neural network-based text error correction device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the device structure shown in FIG. 5 does not limit the graph neural network-based text error correction device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application also provides a graph neural network-based text error correction device. The device includes a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the graph neural network-based text error correction method of the foregoing embodiments.
The present application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the following steps:
obtaining a medical service corpus, and establishing a near-form confusion corpus set and a near-sound confusion corpus set according to the medical service corpus and a preset dictionary; establishing, based on a preset graph neural network, a near-form confusion structure map of the near-form confusion corpus set and a near-sound confusion structure map of the near-sound confusion corpus set; performing a graph convolution operation and then a graph attention calculation on the near-form confusion structure map and the near-sound confusion structure map to obtain a confusion corpus structure map; and obtaining a text corpus to be tested, extracting character vectors of the text corpus to be tested with a preset vector extractor, calculating basic similarity probabilities between the character vectors and the confusion corpus structure map, and modifying the text corpus to be tested according to the basic similarity probabilities to obtain a target text corpus.
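The first of these steps, for the near-sound case, might look like the sketch below: each corpus item is converted to a phonetic symbol, confusable initials or finals are swapped to obtain near-sound symbols, and the dictionary is searched for entries carrying those symbols. The lists of confusable initials and finals, the toy pinyin dictionary, and all names are illustrative assumptions; the application does not enumerate them.

```python
# Assumed fuzzy-pinyin pairs, written as (longer form, shorter form).
CONFUSABLE_INITIALS = [("zh", "z"), ("ch", "c"), ("sh", "s"), ("n", "l")]
CONFUSABLE_FINALS = [("ing", "in"), ("eng", "en"), ("ang", "an")]

# Toy dictionary (hypothetical): character -> toneless pinyin.
PINYIN = {"是": "shi", "四": "si", "林": "lin", "零": "ling", "大": "da"}


def near_sound_variants(pinyin):
    """Swap confusable initials and finals to get near-sound symbols."""
    variants = set()
    for first, second in CONFUSABLE_INITIALS:
        # Check the first member before the second so "sh" wins over "s".
        if pinyin.startswith(first):
            variants.add(second + pinyin[len(first):])
        elif pinyin.startswith(second):
            variants.add(first + pinyin[len(second):])
    for first, second in CONFUSABLE_FINALS:
        if pinyin.endswith(first):
            variants.add(pinyin[:-len(first)] + second)
        elif pinyin.endswith(second):
            variants.add(pinyin[:-len(second)] + first)
    variants.discard(pinyin)
    return variants


def build_near_sound_set(corpus_items, dictionary):
    """Pair each corpus item with dictionary entries that sound nearly alike."""
    pairs = set()
    for item in corpus_items:
        pinyin = dictionary.get(item)
        if pinyin is None:
            continue  # no phonetic symbol available for this item
        variants = near_sound_variants(pinyin)
        for candidate, candidate_pinyin in dictionary.items():
            if candidate != item and candidate_pinyin in variants:
                pairs.add((item, candidate))
    return pairs


near_sound_set = build_near_sound_set(["是", "林", "大"], PINYIN)
```

A near-form confusion corpus set could be built the same way by replacing the phonetic lookup with a glyph similarity function and a similarity threshold.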
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A graph neural network-based text error correction method, comprising:
    obtaining a medical service corpus, and establishing a near-form confusion corpus set and a near-sound confusion corpus set according to the medical service corpus and a preset dictionary;
    establishing, based on a preset graph neural network, a near-form confusion structure map of the near-form confusion corpus set and a near-sound confusion structure map of the near-sound confusion corpus set;
    performing a graph convolution operation and then a graph attention calculation on the near-form confusion structure map and the near-sound confusion structure map to obtain a confusion corpus structure map;
    obtaining a text corpus to be tested, extracting character vectors of the text corpus to be tested with a preset vector extractor, calculating basic similarity probabilities between the character vectors and the confusion corpus structure map, and modifying the text corpus to be tested according to the basic similarity probabilities to obtain a target text corpus.
  2. The graph neural network-based text error correction method according to claim 1, wherein the obtaining a medical service corpus, and establishing a near-form confusion corpus set and a near-sound confusion corpus set according to the medical service corpus and a preset dictionary comprises:
    obtaining the medical service corpus, and calculating, with a preset similarity function, basic glyph similarities between the medical service corpus and standard corpora in the preset dictionary;
    selecting target glyph similarities, namely the basic glyph similarities greater than a similarity threshold, taking the standard corpus corresponding to each target glyph similarity as a near-form confusion corpus of the medical service corpus, combining the medical service corpus and the near-form confusion corpus into a near-form confusion combination, and generating the near-form confusion corpus set from the near-form confusion combinations;
    converting the medical service corpus into corpus phonetic symbols with a preset fuzzy matching algorithm, and selecting target phonetic symbols from the corpus phonetic symbols, the target phonetic symbols being those having confusable finals and/or initials;
    converting the target phonetic symbols into near-sound phonetic symbols, querying the preset dictionary for standard corpora whose standard phonetic symbols are the same as the near-sound phonetic symbols, taking those standard corpora as near-sound confusion corpora of the medical service corpus, combining the medical service corpus and the near-sound confusion corpus into a near-sound confusion combination, and generating the near-sound confusion corpus set from the near-sound confusion combinations.
  3. The graph neural network-based text error correction method according to claim 2, wherein the establishing, based on a preset graph neural network, a near-form confusion structure map of the near-form confusion corpus set and a near-sound confusion structure map of the near-sound confusion corpus set comprises:
    extracting a first service corpus and a second service corpus from the medical service corpus, and combining the first service corpus and the second service corpus to obtain a combination to be detected;
    determining, according to the combination to be detected and the near-form confusion combination, a first position element at the position coordinates of the combination to be detected, and determining a basic near-form confusion matrix from the first position element;
    determining, according to the combination to be detected and the near-sound confusion combination, a second position element at the position coordinates of the combination to be detected, and determining a basic near-sound confusion matrix from the second position element;
    generating, with the preset graph neural network, the near-form confusion structure map from the basic near-form confusion matrix and the near-sound confusion structure map from the basic near-sound confusion matrix.
  4. The graph neural network-based text error correction method according to claim 3, wherein the determining, according to the combination to be detected and the near-form confusion combination, a first position element at the position coordinates of the combination to be detected, and determining a basic near-form confusion matrix from the first position element comprises:
    determining whether the combination to be detected is the near-form confusion combination;
    if the combination to be detected is the near-form confusion combination, obtaining the position coordinates of the combination to be detected, and marking the first position element corresponding to the position coordinates with a first threshold;
    if the combination to be detected is not the near-form confusion combination, obtaining the position coordinates of the combination to be detected, and marking the first position element corresponding to the position coordinates with a second threshold;
    establishing an initial near-form confusion matrix from the position coordinates of the combination to be detected, and entering the first position element into the initial near-form confusion matrix to obtain the basic near-form confusion matrix.
  5. The graph neural network-based text error correction method according to claim 3, wherein the determining, according to the combination to be detected and the near-sound confusion combination, a second position element at the position coordinates of the combination to be detected, and determining a basic near-sound confusion matrix from the second position element comprises:
    determining whether the combination to be detected is the near-sound confusion combination;
    if the combination to be detected is the near-sound confusion combination, obtaining the position coordinates of the combination to be detected, and marking the second position element corresponding to the position coordinates with a third threshold;
    if the combination to be detected is not the near-sound confusion combination, obtaining the position coordinates of the combination to be detected, and marking the second position element corresponding to the position coordinates with a fourth threshold;
    establishing an initial near-sound confusion matrix from the position coordinates of the combination to be detected, and entering the second position element into the initial near-sound confusion matrix to obtain the basic near-sound confusion matrix.
  6. The graph neural network-based text error correction method according to claim 3, wherein the performing a graph convolution operation and then a graph attention calculation on the near-form confusion structure map and the near-sound confusion structure map to obtain a confusion corpus structure map comprises:
    performing a graph convolution calculation on the near-form confusion structure map, and calculating adjacent near-form corpus information with a first calculation formula, the first calculation formula being $f(A_p, H_p^l) = A_p' H_p^l W_p^l$, where $f(A_p, H_p^l)$ denotes the adjacent near-form corpus information, $A_p$ denotes the basic near-form confusion matrix in the near-form confusion structure map, $H_p^l$ denotes the first hyperparameter of the $l$-th convolutional layer, $A_p'$ denotes the regularized matrix of the basic near-form confusion matrix, and $W_p^l$ denotes the second hyperparameter of the $l$-th convolutional layer;
    performing a graph convolution calculation on the near-sound confusion structure map, and calculating adjacent near-sound corpus information with a second calculation formula, the second calculation formula being $f(A_s, H_s^l) = A_s' H_s^l W_s^l$, where $f(A_s, H_s^l)$ denotes the adjacent near-sound corpus information, $A_s$ denotes the basic near-sound confusion matrix in the near-sound confusion structure map, $H_s^l$ denotes the third hyperparameter of the $l$-th convolutional layer, $A_s'$ denotes the regularized matrix of the basic near-sound confusion matrix, and $W_s^l$ denotes the fourth hyperparameter of the $l$-th convolutional layer;
    performing a graph attention calculation on the adjacent near-form corpus information and the adjacent near-sound corpus information with a third calculation formula to obtain a corpus confusion structure matrix, the third calculation formula being:
    Figure PCTCN2020124828-appb-100001
    where
    Figure PCTCN2020124828-appb-100002
    denotes the corpus confusion structure matrix;
    Figure PCTCN2020124828-appb-100003
    denotes the corpus information in the $i$-th row of the $l$-th convolutional layer of the adjacent near-form corpus information or the adjacent near-sound corpus information, where $i$ is a positive integer and $k$ is the information marker, with $k \in \{s, p\}$;
    Figure PCTCN2020124828-appb-100004
    denotes the weight of the $i$-th piece of corpus information of the $l$-th convolutional layer of the adjacent near-form corpus information or the adjacent near-sound corpus information; $w_a$ is a parameter of the formula (its definition is not stated in the text); and $\beta$ denotes the hyperparameter that controls the graph attention weights;
    generating, with the preset graph neural network, the confusion corpus structure map from the corpus confusion structure matrix.
  7. The graph neural network-based text error correction method according to any one of claims 1-6, wherein the obtaining a text corpus to be tested, extracting character vectors of the text corpus to be tested with a preset vector extractor, calculating basic similarity probabilities between the character vectors and the confusion corpus structure map, and modifying the text corpus to be tested according to the basic similarity probabilities to obtain a target text corpus comprises:
    obtaining the text corpus to be tested, and extracting the character vectors in the text corpus to be tested with the preset vector extractor;
    calculating the basic similarity probabilities between the corpus confusion structure matrix of the confusion corpus structure map and the character vectors;
    selecting a target similarity probability, namely the basic similarity probability with the largest value, taking the confusion text corpus corresponding to the target similarity probability as a modified text corpus, and replacing the text corpus to be tested with the modified text corpus to obtain the target text corpus, the confusion text corpus being a corpus in the confusion corpus structure map.
  8. A graph neural network-based text error correction device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    obtaining a medical service corpus, and establishing a near-form confusion corpus set and a near-sound confusion corpus set according to the medical service corpus and a preset dictionary;
    establishing, based on a preset graph neural network, a near-form confusion structure map of the near-form confusion corpus set and a near-sound confusion structure map of the near-sound confusion corpus set;
    performing a graph convolution operation and then a graph attention calculation on the near-form confusion structure map and the near-sound confusion structure map to obtain a confusion corpus structure map;
    obtaining a text corpus to be tested, extracting character vectors of the text corpus to be tested with a preset vector extractor, calculating basic similarity probabilities between the character vectors and the confusion corpus structure map, and modifying the text corpus to be tested according to the basic similarity probabilities to obtain a target text corpus.
  9. The graph neural network-based text error correction device according to claim 8, wherein, when the processor executes the computer-readable instructions to implement the obtaining a medical service corpus, and establishing a near-form confusion corpus set and a near-sound confusion corpus set according to the medical service corpus and a preset dictionary, the following steps are included:
    obtaining the medical service corpus, and calculating, with a preset similarity function, basic glyph similarities between the medical service corpus and standard corpora in the preset dictionary;
    selecting target glyph similarities, namely the basic glyph similarities greater than a similarity threshold, taking the standard corpus corresponding to each target glyph similarity as a near-form confusion corpus of the medical service corpus, combining the medical service corpus and the near-form confusion corpus into a near-form confusion combination, and generating the near-form confusion corpus set from the near-form confusion combinations;
    converting the medical service corpus into corpus phonetic symbols with a preset fuzzy matching algorithm, and selecting target phonetic symbols from the corpus phonetic symbols, the target phonetic symbols being those having confusable finals and/or initials;
    converting the target phonetic symbols into near-sound phonetic symbols, querying the preset dictionary for standard corpora whose standard phonetic symbols are the same as the near-sound phonetic symbols, taking those standard corpora as near-sound confusion corpora of the medical service corpus, combining the medical service corpus and the near-sound confusion corpus into a near-sound confusion combination, and generating the near-sound confusion corpus set from the near-sound confusion combinations.
  10. The graph neural network-based text error correction device according to claim 9, wherein, when the processor executes the computer-readable instructions to implement the establishing, based on a preset graph neural network, a near-form confusion structure map of the near-form confusion corpus set and a near-sound confusion structure map of the near-sound confusion corpus set, the following steps are included:
    extracting a first service corpus and a second service corpus from the medical service corpus, and combining the first service corpus and the second service corpus to obtain a combination to be detected;
    determining, according to the combination to be detected and the near-form confusion combination, a first position element at the position coordinates of the combination to be detected, and determining a basic near-form confusion matrix from the first position element;
    determining, according to the combination to be detected and the near-sound confusion combination, a second position element at the position coordinates of the combination to be detected, and determining a basic near-sound confusion matrix from the second position element;
    generating, with the preset graph neural network, the near-form confusion structure map from the basic near-form confusion matrix and the near-sound confusion structure map from the basic near-sound confusion matrix.
  11. The graph neural network-based text error correction device according to claim 10, wherein, when the processor executes the computer-readable instructions to implement the determining, according to the combination to be detected and the near-form confusion combination, a first position element at the position coordinates of the combination to be detected, and determining a basic near-form confusion matrix from the first position element, the following steps are included:
    determining whether the combination to be detected is the near-form confusion combination;
    if the combination to be detected is the near-form confusion combination, obtaining the position coordinates of the combination to be detected, and marking the first position element corresponding to the position coordinates with a first threshold;
    if the combination to be detected is not the near-form confusion combination, obtaining the position coordinates of the combination to be detected, and marking the first position element corresponding to the position coordinates with a second threshold;
    establishing an initial near-form confusion matrix from the position coordinates of the combination to be detected, and entering the first position element into the initial near-form confusion matrix to obtain the basic near-form confusion matrix.
  12. The graph neural network-based text error correction device according to claim 10, wherein, when the processor executes the computer-readable instructions to determine, according to the combination to be detected and the near-sound confusion combination, a second position element at the position coordinates of the combination to be detected, and to determine a basic near-sound confusion matrix from the second position element, the following steps are included:
    judging whether the combination to be detected is the near-sound confusion combination;
    if the combination to be detected is the near-sound confusion combination, acquiring the position coordinates of the combination to be detected, and marking the second position element corresponding to the position coordinates as a third threshold;
    if the combination to be detected is not the near-sound confusion combination, acquiring the position coordinates of the combination to be detected, and marking the second position element corresponding to the position coordinates as a fourth threshold;
    establishing an initial near-sound confusion matrix from the position coordinates of the combination to be detected, and entering the second position element into the initial near-sound confusion matrix to obtain a basic near-sound confusion matrix.
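The matrix construction recited in claims 11 and 12 can be illustrated with a minimal sketch. It assumes that the position coordinates of a combination are the row and column indices of its two corpus items, and that the first/third threshold is 1 and the second/fourth threshold is 0; the claims do not fix these values. The function name `build_confusion_matrix` and the sample characters are hypothetical.

```python
import numpy as np


def build_confusion_matrix(corpus, confusion_pairs, hit_value=1.0, miss_value=0.0):
    """Mark element (i, j) with hit_value when corpus[i] and corpus[j] form a
    known confusion combination, and with miss_value otherwise.

    hit_value / miss_value stand in for the claimed thresholds (assumed 1 and 0).
    The corpus items are assumed to be unique.
    """
    index = {item: i for i, item in enumerate(corpus)}
    known = {frozenset(pair) for pair in confusion_pairs}
    # "Initial" matrix: every position starts at the miss value.
    matrix = np.full((len(corpus), len(corpus)), miss_value, dtype=float)
    for first, i in index.items():
        for second, j in index.items():
            if i != j and frozenset((first, second)) in known:
                matrix[i, j] = hit_value
    return matrix


# Illustrative data only (not taken from the application).
corpus = ["已", "己", "在", "再"]
A_p = build_confusion_matrix(corpus, [("已", "己")])  # near-form matrix
A_s = build_confusion_matrix(corpus, [("在", "再")])  # near-sound matrix
```

Because a confusion combination is treated as unordered here, both matrices come out symmetric, which is convenient if they are later used as adjacency matrices of an undirected graph.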
  13. The graph neural network-based text error correction device according to claim 10, wherein, when the processor executes the computer-readable instructions to sequentially perform the graph convolution operation and the graph attention calculation on the near-form confusion structure map and the near-sound confusion structure map to obtain the confusion corpus structure map, the following steps are further included:
    performing a graph convolution calculation on the near-form confusion structure map, and calculating adjacent near-form corpus information using a first calculation formula, the first calculation formula being: f(A_p, H_p^l) = A_p' H_p^l W_p^l, where f(A_p, H_p^l) denotes the adjacent near-form corpus information, A_p denotes the basic near-form confusion matrix in the near-form confusion structure map, H_p^l denotes the first hyperparameter of the l-th convolutional layer, A_p' denotes the regularized matrix of the basic near-form confusion matrix, and W_p^l denotes the second hyperparameter of the l-th convolutional layer;
    performing a graph convolution calculation on the near-sound confusion structure map, and calculating adjacent near-sound corpus information using a second calculation formula, the second calculation formula being: f(A_s, H_s^l) = A_s' H_s^l W_s^l, where f(A_s, H_s^l) denotes the adjacent near-sound corpus information, A_s denotes the basic near-sound confusion matrix in the near-sound confusion structure map, H_s^l denotes the third hyperparameter of the l-th convolutional layer, A_s' denotes the regularized matrix of the basic near-sound confusion matrix, and W_s^l denotes the fourth hyperparameter of the l-th convolutional layer;
    performing a graph attention calculation on the adjacent near-form corpus information and the adjacent near-sound corpus information using a third calculation formula to obtain a corpus confusion structure matrix, the third calculation formula being:
    Figure PCTCN2020124828-appb-100005
    where
    Figure PCTCN2020124828-appb-100006
    denotes the corpus confusion structure matrix,
    Figure PCTCN2020124828-appb-100007
    denotes the corpus information in the i-th row of the l-th convolutional layer of the adjacent near-form corpus information or the adjacent near-sound corpus information, i is a positive integer, k denotes the information marker, and k ∈ {s, p},
    Figure PCTCN2020124828-appb-100008
    denotes the weight of the i-th corpus information of the l-th convolutional layer of the adjacent near-form corpus information or the adjacent near-sound corpus information, w_a is a parameter whose definition is missing from the source text, and β denotes the hyperparameter that controls the graph attention weight;
    generating, with the preset graph neural network, the confusion corpus structure map of the corpus confusion structure matrix.
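The following sketch illustrates the two graph convolution formulas of claim 13 and one possible attention step. Two points are assumptions rather than content of the claim: the "regularized matrix" A' is taken to be the symmetric normalization D^(-1/2)(A + I)D^(-1/2) commonly used in graph convolutional networks, and, since the third formula is available here only as a figure, the attention step is an assumed softmax over the two graphs with β used as a temperature. All function names and the random test data are hypothetical.

```python
import numpy as np


def normalize_adjacency(A):
    """Assumed form of A': add self-loops, then apply D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_convolution(A, H, W):
    """f(A, H^l) = A' H^l W^l for one layer."""
    return normalize_adjacency(A) @ H @ W


def attention_combine(f_p, f_s, w_a, beta=1.0):
    """Assumed attention: per-row softmax over the near-form (p) and
    near-sound (s) outputs, with scores w_a . f_k[i] / beta."""
    scores = np.stack([f_p @ w_a, f_s @ w_a], axis=1) / beta  # shape (n, 2)
    scores -= scores.max(axis=1, keepdims=True)               # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha[:, :1] * f_p + alpha[:, 1:] * f_s


# Illustrative data only: 3 nodes, 4-dimensional node information.
rng = np.random.default_rng(0)
A_p = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
A_s = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W_p, W_s = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
w_a = rng.normal(size=4)

f_p = graph_convolution(A_p, H, W_p)   # adjacent near-form corpus information
f_s = graph_convolution(A_s, H, W_s)   # adjacent near-sound corpus information
C = attention_combine(f_p, f_s, w_a, beta=3.0)  # corpus confusion structure matrix
```

Under this assumed form, a larger β flattens the attention weights toward an even mix of the two graphs, while a smaller β lets one graph dominate each row.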
  14. The graph neural network-based text error correction device according to any one of claims 8-13, wherein, when the processor executes the computer-readable instructions to acquire the text corpus to be tested, extract the character vector of the text corpus to be tested using a preset vector extractor, calculate the basic similarity probability between the character vector and the confusion corpus structure map, and modify the text corpus to be tested according to the basic similarity probability to obtain the target text corpus, the following steps are included:
    acquiring the text corpus to be tested, and extracting the character vectors in the text corpus to be tested using a preset vector extractor;
    calculating the basic similarity probabilities between the corpus confusion structure matrix of the confusion corpus structure map and the character vector;
    selecting, among the basic similarity probabilities, the target similarity probability with the largest value, taking the confusion text corpus corresponding to the target similarity probability as the modified text corpus, and replacing the text corpus to be tested with the modified text corpus to obtain the target text corpus, the confusion text corpus being a corpus in the confusion corpus structure map.
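The selection step of claim 14 can be sketched as follows. The claim does not say how the basic similarity probability is computed, so this example assumes a softmax over dot products between the character vector and each row of the corpus confusion structure matrix. The function names, the vocabulary, and the vectors are hypothetical, and the vector extractor is replaced by a hand-written vector.

```python
import numpy as np


def similarity_probabilities(char_vector, node_matrix):
    """Assumed scoring: softmax over dot products with every graph node row."""
    logits = node_matrix @ char_vector
    logits = logits - logits.max()  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


def correct_character(char_vector, node_matrix, vocabulary):
    """Pick the corpus item with the largest similarity probability."""
    probs = similarity_probabilities(char_vector, node_matrix)
    best = int(np.argmax(probs))
    return vocabulary[best], float(probs[best])


# Illustrative data only: one row of node_matrix per vocabulary entry.
vocabulary = ["已", "己", "在"]
node_matrix = np.eye(3)
char_vector = np.array([0.1, 2.0, 0.3])  # stand-in for the vector extractor output

replacement, probability = correct_character(char_vector, node_matrix, vocabulary)
```

In this toy setup the second row scores highest, so the character to be tested would be replaced by the second vocabulary entry.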
  15. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    acquiring a medical business corpus, and establishing a near-form confusion corpus set and a near-sound confusion corpus set according to the medical business corpus and a preset dictionary;
    establishing, based on a preset graph neural network, a near-form confusion structure map of the near-form confusion corpus set and a near-sound confusion structure map of the near-sound confusion corpus set;
    sequentially performing a graph convolution operation and a graph attention calculation on the near-form confusion structure map and the near-sound confusion structure map to obtain a confusion corpus structure map;
    acquiring a text corpus to be tested, extracting the character vector of the text corpus to be tested using a preset vector extractor, calculating the basic similarity probability between the character vector and the confusion corpus structure map, and modifying the text corpus to be tested according to the basic similarity probability to obtain a target text corpus.
  16. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    acquiring the medical business corpus, and calculating the basic glyph similarity between the medical business corpus and the standard corpus in the preset dictionary using a preset similarity function;
    screening out the target glyph similarities, namely the basic glyph similarities greater than a similarity threshold, taking the standard corpus corresponding to each target glyph similarity as the near-form confusion corpus of the medical business corpus, combining the medical business corpus and the near-form confusion corpus into a near-form confusion combination, and generating the near-form confusion corpus set from the near-form confusion combinations;
    converting the medical business corpus into corpus phonetic symbols using a preset fuzzy matching algorithm, and screening out the target phonetic symbols among the corpus phonetic symbols, the target phonetic symbols including those with easily confused finals and/or initials;
    converting each target phonetic symbol into a near-sound phonetic symbol, querying the preset dictionary for the standard corpus whose standard phonetic symbol is identical to the near-sound phonetic symbol, taking that standard corpus as the near-sound confusion corpus of the medical business corpus, combining the medical business corpus and the near-sound confusion corpus into a near-sound confusion combination, and generating the near-sound confusion corpus set from the near-sound confusion combinations.
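The two screening procedures of claim 16 can be sketched on toy data. The claim leaves the similarity function and the fuzzy matching algorithm open, so this example assumes a sequence-similarity ratio over stroke-code strings for glyph similarity and a small table of commonly confused pinyin initials and finals for near-sound matching. The stroke codes (h, s, p, n for horizontal, vertical, left-falling, right-falling), the pinyin table, and all function names are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical toy dictionary: stroke-code string per character.
STROKES = {"未": "hhspn", "末": "hhspn", "木": "hspn", "天": "hhpn"}

# Hypothetical toy dictionary: toneless pinyin split into (initial, final).
PINYIN = {"是": ("sh", "i"), "四": ("s", "i"), "你": ("n", "i"), "里": ("l", "i")}

# Assumed confusable initials and finals (a typical fuzzy-pinyin table).
FUZZY_INITIALS = {"zh": "z", "z": "zh", "ch": "c", "c": "ch",
                  "sh": "s", "s": "sh", "n": "l", "l": "n"}
FUZZY_FINALS = {"ing": "in", "in": "ing", "eng": "en", "en": "eng",
                "ang": "an", "an": "ang"}


def near_form_pairs(char, strokes, threshold=0.9):
    """Pair char with every dictionary entry whose glyph similarity exceeds the threshold."""
    return [(char, other) for other in strokes
            if other != char
            and SequenceMatcher(None, strokes[char], strokes[other]).ratio() > threshold]


def near_sound_variants(initial, final):
    """Swap a confusable initial or final to obtain near-sound phonetic symbols."""
    variants = set()
    if initial in FUZZY_INITIALS:
        variants.add((FUZZY_INITIALS[initial], final))
    if final in FUZZY_FINALS:
        variants.add((initial, FUZZY_FINALS[final]))
    return variants


def near_sound_pairs(char, pinyin):
    """Pair char with every dictionary entry whose pinyin equals a near-sound variant."""
    variants = near_sound_variants(*pinyin[char])
    return [(char, other) for other, symbol in pinyin.items()
            if other != char and symbol in variants]


form_pairs = near_form_pairs("未", STROKES)
sound_pairs = near_sound_pairs("是", PINYIN)
```

A production system would need a real glyph or stroke database and a pinyin converter; the hard-coded dictionaries only keep the sketch self-contained.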
  17. The computer-readable storage medium according to claim 16, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    extracting a first business corpus and a second business corpus from the medical business corpus, and combining the first business corpus and the second business corpus to obtain a combination to be detected;
    determining, according to the combination to be detected and the near-form confusion combination, a first position element at the position coordinates of the combination to be detected, and determining a basic near-form confusion matrix from the first position element;
    determining, according to the combination to be detected and the near-sound confusion combination, a second position element at the position coordinates of the combination to be detected, and determining a basic near-sound confusion matrix from the second position element;
    generating, with the preset graph neural network, the near-form confusion structure map of the basic near-form confusion matrix and the near-sound confusion structure map of the basic near-sound confusion matrix.
  18. The computer-readable storage medium according to claim 17, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    judging whether the combination to be detected is the near-form confusion combination;
    if the combination to be detected is the near-form confusion combination, acquiring the position coordinates of the combination to be detected, and marking the first position element corresponding to the position coordinates as a first threshold;
    if the combination to be detected is not the near-form confusion combination, acquiring the position coordinates of the combination to be detected, and marking the first position element corresponding to the position coordinates as a second threshold;
    establishing an initial near-form confusion matrix from the position coordinates of the combination to be detected, and entering the first position element into the initial near-form confusion matrix to obtain a basic near-form confusion matrix.
  19. The computer-readable storage medium according to claim 17, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    judging whether the combination to be detected is the near-sound confusion combination;
    if the combination to be detected is the near-sound confusion combination, acquiring the position coordinates of the combination to be detected, and marking the second position element corresponding to the position coordinates as a third threshold;
    if the combination to be detected is not the near-sound confusion combination, acquiring the position coordinates of the combination to be detected, and marking the second position element corresponding to the position coordinates as a fourth threshold;
    establishing an initial near-sound confusion matrix from the position coordinates of the combination to be detected, and entering the second position element into the initial near-sound confusion matrix to obtain a basic near-sound confusion matrix.
  20. A graph neural network-based text error correction apparatus, comprising:
    an acquisition module, configured to acquire a medical business corpus, and establish a near-form confusion corpus set and a near-sound confusion corpus set according to the medical business corpus and a preset dictionary;
    an establishment module, configured to establish, based on a preset graph neural network, a near-form confusion structure map of the near-form confusion corpus set and a near-sound confusion structure map of the near-sound confusion corpus set;
    a calculation module, configured to sequentially perform a graph convolution operation and a graph attention calculation on the near-form confusion structure map and the near-sound confusion structure map to obtain a confusion corpus structure map;
    a modification module, configured to acquire a text corpus to be tested, extract the character vector of the text corpus to be tested using a preset vector extractor, calculate the basic similarity probability between the character vector and the confusion corpus structure map, and modify the text corpus to be tested according to the basic similarity probability to obtain a target text corpus.
PCT/CN2020/124828 2020-09-07 2020-10-29 Graph neural network-based text error correction method, apparatus and device, and storage medium WO2021139349A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010926425.0A CN112016303B (en) 2020-09-07 2020-09-07 Text error correction method, device, equipment and storage medium based on graph neural network
CN202010926425.0 2020-09-07

Publications (1)

Publication Number Publication Date
WO2021139349A1 (en) 2021-07-15

Family

ID=73515410

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124828 WO2021139349A1 (en) 2020-09-07 2020-10-29 Graph neural network-based text error correction method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN112016303B (en)
WO (1) WO2021139349A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800752B (en) * 2020-12-31 2023-12-01 科大讯飞股份有限公司 Error correction method, apparatus, device and storage medium
CN113505583B (en) * 2021-05-27 2023-07-18 山东交通学院 Emotion reason clause pair extraction method based on semantic decision graph neural network
CN113938708B (en) * 2021-10-14 2024-04-09 咪咕文化科技有限公司 Live audio error correction method, device, computing equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10198435B2 (en) * 2015-11-17 2019-02-05 Samsung Electronics Co., Ltd. Apparatus and method for generating translation model, apparatus and method for automatic translation
CN109766538A (en) * 2018-11-21 2019-05-17 北京捷通华声科技股份有限公司 A kind of text error correction method, device, electronic equipment and storage medium
CN110489760A (en) * 2019-09-17 2019-11-22 达而观信息科技(上海)有限公司 Based on deep neural network text auto-collation and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence
CN111062376A (en) * 2019-12-18 2020-04-24 厦门商集网络科技有限责任公司 Text recognition method based on optical character recognition and error correction tight coupling processing
CN111241814B (en) * 2019-12-31 2023-04-28 中移(杭州)信息技术有限公司 Error correction method and device for voice recognition text, electronic equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114048321A (en) * 2021-08-12 2022-02-15 湖南达德曼宁信息技术有限公司 Multi-granularity text error correction data set generation method, device and equipment
CN114676684A (en) * 2022-03-17 2022-06-28 平安科技(深圳)有限公司 Text error correction method and device, computer equipment and storage medium
CN114676684B (en) * 2022-03-17 2024-02-02 平安科技(深圳)有限公司 Text error correction method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112016303B (en) 2024-01-19
CN112016303A (en) 2020-12-01

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20912848; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20912848; Country of ref document: EP; Kind code of ref document: A1)