CN114386423A

CN114386423A - Text duplicate removal method and device, electronic equipment and storage medium

Info

Publication number: CN114386423A
Application number: CN202210056358.0A
Authority: CN
Inventors: 罗圣西; 马骏; 王少军
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2022-01-18
Filing date: 2022-01-18
Publication date: 2022-04-22
Anticipated expiration: 2042-01-18
Also published as: CN114386423B

Abstract

An embodiment of the invention provides a text deduplication method and apparatus, an electronic device and a storage medium, and relates to the technical field of artificial intelligence. The text deduplication method comprises: obtaining a text to be deduplicated; generating an entity word number text from the text to be deduplicated by using a preset entity word list; generating a character fragment list from the entity word number text; performing a deduplication operation on the character fragment list to obtain a first result list; and generating deduplication result information from the first result list. The embodiment takes into account that entity words containing repeated characters may appear in the text to be deduplicated. Without affecting the deduplication of other characters, it prevents the repeated characters inside entity words from being removed during deduplication, which would degrade spoken language intention recognition, so that intention recognition information can be obtained more accurately and the robustness and extensibility of spoken language intention recognition are improved.

Description

Text duplicate removal method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a text duplicate removal method and device, electronic equipment and a storage medium.

Background

In recent years, breakthroughs in artificial intelligence technology for speech signal processing have driven the rapid development of speech recognition, which converts speech into a text sequence for subsequent processing and is now widely used in speech input methods, voice assistants, smart speakers, translators and the like. For example, conversation robots are a rapidly developing research direction in artificial intelligence, and task-oriented conversation robots based on voice interaction are widely applied in fields such as mobile communication, finance and government affairs. One of the main purposes of such robots is to recognize the intention expressed in human speech. Compared with conventional written text, spoken language contains more meaningless repeated characters, so to ensure the intention recognition effect these characters are usually deduplicated.

However, in the related art, text deduplication mostly serves intention recognition performed by classifying the text transcribed from speech, and the presence of specific entity names is rarely considered. If text deduplication is applied uniformly without considering entities, the expression of an entity may be damaged, which negatively affects subsequent intention recognition and makes the recognized intention inaccurate.

Disclosure of Invention

The main purpose of the embodiments of the invention is to provide a text deduplication method and apparatus, an electronic device and a storage medium that can obtain intention recognition information more accurately, so as to improve the robustness and extensibility of spoken language intention recognition.

In order to achieve the above object, a first aspect of an embodiment of the present invention provides a text deduplication method, including:

acquiring a text to be deduplicated;

generating an entity word number text according to the text to be deduplicated by using a preset entity word list;

generating a character fragment list according to the entity word number text;

carrying out duplication removal operation on the character fragment list to obtain a first result list;

and generating duplication removing result information according to the first result list.

In some embodiments, the preset entity word list includes at least one entity word; the generating an entity word number text according to the text to be deduplicated by using a preset entity word list comprises the following steps:

searching the entity words in the preset entity word list one by one in the text to be deduplicated;

if the entity words are found, acquiring preset replacement characters;

and replacing the entity words appearing in the text to be deduplicated with the preset replacement characters and the replacement word numbers to generate the entity word number text, wherein the preset replacement characters and the replacement word numbers are stored in a number dictionary list.

In some embodiments, before searching for the corresponding entity word in the preset entity word list one by one in the text to be deduplicated, the method further includes:

pre-training an entity word extraction model, the entity word extraction model comprising: a natural language processing model;

and generating the preset entity word list by utilizing the entity word extraction model according to a spoken language intention recognition application scene, wherein the spoken language intention recognition application scene comprises one or more of financial spoken language recognition, mobile communication spoken language recognition and government affair spoken language recognition.

In some embodiments, the generating a list of character fragments from the entity word number text comprises:

acquiring the length of a character segment;

traversing the entity word number text, and sliding and intercepting a character segment corresponding to the length of the character segment;

and obtaining the character fragment list according to the character fragments.

In some embodiments, the performing a deduplication operation on the character fragment list to obtain a first result list includes:

storing the character segments obtained from the character segment list in a result list one by one;

when the number of stored character segments is larger than the character segment length, judging whether the current character segment is the same as the stored character segment in the result list;

if the current character segment is the same as the stored character segment, keeping the current character segment in the result list and deleting from the result list the character segments that precede the current character segment, equal in number to the character segment length, until the character segment list is traversed, and obtaining a corresponding first result list.

In some embodiments, the generating deduplication result information from the first result list comprises:

acquiring preset replacement characters in the first result list one by one;

acquiring entity words corresponding to the preset replacing characters according to the numbering dictionary list;

and replacing the preset replacing characters in the first result list with the entity words to generate the duplication eliminating result information.

In some embodiments, after the generating the duplicate removal result information according to the first result list, further includes:

taking the duplication elimination result information as the text to be duplicated;

gradually updating the character segment length to a preset character segment length;

and obtaining the duplication removing result information corresponding to the preset character segment length.

To achieve the above object, a second aspect of the present invention provides a text deduplication apparatus, comprising:

the text to be deduplicated acquiring module is used for acquiring a text to be deduplicated;

the entity word number text generation module is used for generating an entity word number text according to the text to be deduplicated by utilizing a preset entity word list;

the character fragment list generating module is used for generating a character fragment list according to the entity word number text;

the first result list generating module is used for carrying out duplication removing operation on the character fragment list to obtain a first result list;

and the duplicate removal result information generating module is used for generating duplicate removal result information according to the first result list.

To achieve the above object, a third aspect of the present invention provides an electronic apparatus comprising:

at least one memory;

at least one processor;

at least one program;

the programs are stored in a memory and a processor executes the at least one program to implement the method of the invention as described in the above first aspect.

To achieve the above object, a fourth aspect of the present invention proposes a storage medium which is a computer-readable storage medium storing computer-executable instructions for causing a computer to execute:

a method as described in the first aspect above.

According to the text deduplication method and apparatus, the electronic device and the storage medium, a text to be deduplicated is obtained, an entity word number text is generated from the text to be deduplicated by using a preset entity word list, a character fragment list is generated from the entity word number text, a deduplication operation is performed on the character fragment list to obtain a first result list, and deduplication result information is generated from the first result list. The embodiment takes into account that entity words containing repeated characters may appear in the text to be deduplicated. By replacing the entity words using the preset entity word list, without affecting the deduplication of other characters, it prevents the repeated characters inside entity words from being removed during deduplication, which would degrade recognition based on the deduplication result information. Intention recognition information can therefore be obtained more accurately, which improves the robustness and extensibility of spoken language intention recognition.

Drawings

Fig. 1 is a flowchart of a text deduplication method according to an embodiment of the present invention.

Fig. 2 is another flowchart of a text deduplication method provided in the embodiment of the present invention.

Fig. 3 is another flowchart of a text deduplication method according to an embodiment of the present invention.

Fig. 4 is another flowchart of a text deduplication method provided in the embodiment of the present invention.

Fig. 5 is another flowchart of a text deduplication method provided in an embodiment of the present invention.

Fig. 6 is another flowchart of a text deduplication method provided in an embodiment of the present invention.

Fig. 7 is a block diagram of a text deduplication apparatus according to an embodiment of the present invention.

Fig. 8 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.

First, several terms used in the present application are explained:

Natural language processing (NLP): put simply, a computer accepts a user's input in natural language and internally processes and computes on it using human-defined algorithms, so as to simulate human understanding of natural language and return the result the user expects.

In the related art, text deduplication mostly serves intention recognition performed by classifying the text transcribed from speech, and the presence of specific entity names is rarely considered. For example, in a financial business scenario, product teams often introduce novel product names, many of which contain repeated characters. If text deduplication is applied uniformly without considering entities, the expression of an entity may be damaged, which negatively affects spoken language intention recognition and makes the recognized intention inaccurate.

Based on this, the embodiments of the invention provide a text deduplication method and apparatus, an electronic device and a storage medium. The text deduplication method includes obtaining a spoken text to be deduplicated, generating an entity word number text from the spoken text to be deduplicated by using a preset entity word list, generating a character fragment list from the entity word number text, performing a deduplication operation on the character fragment list to obtain a first result list, and generating deduplication result information from the first result list. The embodiment takes into account that entity words containing repeated characters may appear in the spoken text to be deduplicated. Without affecting the deduplication of other characters, it prevents the repeated characters inside entity words from being removed during deduplication, which would degrade spoken language intention recognition, so that intention recognition information can be obtained more accurately and the robustness and extensibility of spoken language intention recognition are improved.

The text deduplication method and apparatus, the electronic device and the storage medium provided by the embodiments of the present invention are described in detail through the following embodiments. The text deduplication method is described first.

The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.

The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

The embodiment of the invention provides a text duplicate removal method, which relates to the technical field of artificial intelligence, in particular to the technical field of data mining. The text deduplication method provided by the embodiment of the invention can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, smart watch, or the like; the server can be an independent server, and can also be a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and artificial intelligence platform and the like; the software may be an application or the like that implements a text deduplication method, but is not limited to the above form.

The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

Fig. 1 is an alternative flowchart of a text deduplication method provided in an embodiment of the present invention, and the method in fig. 1 may include, but is not limited to, steps S101 to S105.

Step S101, acquiring a text to be deduplicated.

In an embodiment, the text to be deduplicated may be a spoken text to be deduplicated: spoken voice information is acquired, and text recognition is then performed on it to generate the text to be deduplicated. For example, in fields such as mobile communication, finance and government affairs, spoken voice information is acquired through a voice input method, a voice assistant, a smart speaker, a translator or the like, and text recognition is then performed on it to obtain the spoken text to be deduplicated.

Step S102, generating an entity word number text according to the text to be deduplicated by using a preset entity word list.

In one embodiment, a notable characteristic of the text to be deduplicated obtained above is that, compared with conventional written text, it contains more meaningless repeated characters; to ensure the intention recognition effect, such characters are usually deduplicated. However, in a financial business scenario for example, product teams often introduce novel product names, many of which contain repeated characters. If text deduplication is applied uniformly without considering specific entity names, the expression of an entity may be damaged, which negatively affects intention recognition and makes the recognized spoken intention inaccurate. Therefore, this embodiment takes specific entity names into account.

In one embodiment, the preset entity word list includes entity words and their corresponding word numbers. The list may be generated per application field, or a preset word list database may be established in which the entity words appearing in different application fields are collected and updated. Each entity word is stored with a different word number, which makes it easier to select the corresponding entity word.

In an embodiment, the entity words for different spoken language intention recognition application scenarios may be obtained either by manual collection or by learning-based extraction using a machine learning method. This embodiment takes extraction by machine learning as an example. Referring to fig. 2, the step of generating the preset entity word list includes, but is not limited to, steps S210 to S220.

Step S210, training an entity word extraction model in advance.

In one embodiment, an entity word extraction model is constructed. The model is a natural language processing model whose processing may include word segmentation, part-of-speech tagging, stemming, named entity recognition and the like. The entity word extraction model is trained on a large number of training samples so that it can identify and extract entity words in a text.

Step S220, generating a preset entity word list by utilizing the entity word extraction model according to the application scene of the spoken language intention recognition.

In one embodiment, the spoken language intention recognition application scenarios include financial spoken language recognition, mobile communication spoken language recognition and government affairs spoken language recognition. Each application scenario has its own entity words, which may overlap with or differ from those of other scenarios, and an entity word can be retrieved by its word number.

In an embodiment, referring to fig. 3, step S102 includes, but is not limited to, steps S1021 to S1022.

Step S1021, searching corresponding entity words in a preset entity word list one by one in the text to be deduplicated.

In an embodiment, each entity word in the preset entity word list is searched in the to-be-deduplicated spoken language text one by one, and whether the entity word in the preset entity word list exists in the to-be-deduplicated spoken language text is judged.

Step S1022, if the entity word is found, acquiring a preset replacement character, representing the entity word appearing in the spoken language text to be deduplicated as the preset replacement character and the replacement word number, and generating an entity word number text.

In an embodiment, if an entity word in a preset entity word list exists in a to-be-deduplicated spoken text, the existing entity word in the to-be-deduplicated spoken text is replaced by a preset replacement character and a replacement word number, and an entity word number text is generated, wherein the preset replacement character and the replacement word number are stored in a number dictionary list, and the number dictionary list includes a corresponding relationship between the preset replacement character and the entity word.

In an embodiment, the preset replacement character may be a special character such as "#", "-" or "%". Each preset replacement character is stored in the number dictionary list together with a different replacement word number, and the replacement word number makes it easy to look up the entry in the number dictionary list. In this embodiment, the replacement word number may be the word number of the entity word in the preset entity word list. For example, if the first entity word in the preset entity word list is found in the spoken text to be deduplicated, that entity word is replaced with "#1" in the text; similarly, if the entity word is the fifth entity word in the preset entity word list, it is replaced with "#5". It can be understood that the preset replacement character given here is only an example and is not specifically limited; any character convenient for replacement may be set as the preset replacement character according to actual requirements.
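The replacement in steps S1021 to S1022 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `number_entities`, the choice of "#" as the preset replacement character, and the product names "Gold Plan" and "LaLa Card" are all hypothetical, and the sketch assumes that neither the text nor the entity words already contain the replacement character.

```python
def number_entities(text, entity_words, marker="#"):
    """Replace each entity word found in `text` with marker + word number.

    The word number is the 1-based position of the entity word in
    `entity_words`. Returns the entity word number text and a number
    dictionary mapping each placeholder (e.g. "#1") to its entity word.
    """
    number_dict = {}
    for number, word in enumerate(entity_words, start=1):
        if word in text:  # search the entity words one by one
            placeholder = f"{marker}{number}"
            text = text.replace(word, placeholder)
            number_dict[placeholder] = word
    return text, number_dict


# "LaLa Card" is a made-up product name containing a repeated "La".
numbered_text, number_dict = number_entities(
    "I want a LaLa Card", ["Gold Plan", "LaLa Card"]
)
print(numbered_text)  # I want a #2
print(number_dict)    # {'#2': 'LaLa Card'}
```

Keeping the placeholder-to-word mapping in a dictionary is one straightforward way to play the role of the number dictionary list when the entity words are restored later.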

Step S103, generating a character fragment list according to the entity word number text.

In one embodiment, because repeated characters can have different lengths, the spoken text to be deduplicated is deduplicated separately for each repetition length. For example, "I want I want" repeats a two-character unit in the original language, whereas "kay" (a single character repeated in the original language) is a one-character repetition.

In an embodiment, referring to fig. 4, step S103 includes, but is not limited to, steps S1031 to S1033.

Step S1031, a character segment length is acquired.

In one embodiment, the character segment length ranges from 1 to N, where N is a preset maximum character segment length that may be chosen according to the requirements of the practical application. The character segment length represents the length of the repeated unit; for example, the character segment length for "I want I want" is 2, and the character segment length for "kay" is 1.

Step S1032, traversing the entity word number text, and sliding and intercepting the character segment corresponding to the length of the character segment.

In one embodiment, for each character segment length, character segments of that length are intercepted by sliding over the spoken text to be deduplicated one character at a time. Take a four-character text meaning "I want to eat", glossed character by character as "I", "want", "eat", "meal": when the character segment length is 1, the intercepted character segments are "I", "want", "eat" and "meal"; when the character segment length is 2, the intercepted character segments are "I want", "want eat" and "eat meal"; and so on for other character segment lengths.

In an embodiment, if the spoken text to be deduplicated contains a replaced entity word, the preset replacement character and the replacement word number are intercepted as a whole, and that whole counts as a single character of length 1. For example, suppose "today" is the first entity word in the preset entity word list and the spoken text to be deduplicated means "I will eat today" (in the original word order: "I", "today", "will", "eat", "meal"). The entity word number text generated by the replacement in step S102 is then "I #1 will eat meal", and when the character segment length is 2 the intercepted character segments are: "I #1", "#1 will", "will eat" and "eat meal".
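The sliding interception of step S1032, including the rule that a placeholder counts as one character, can be sketched as follows. The helper names `tokenize` and `sliding_segments` are hypothetical, and the sketch assumes the placeholder has the form "#" followed by digits; single letters stand in for individual characters of the original text.

```python
import re


def tokenize(numbered_text, marker="#"):
    """Split the text into units: a placeholder such as "#1" is one unit,
    and every other character is its own unit."""
    pattern = re.escape(marker) + r"\d+|."
    return re.findall(pattern, numbered_text, flags=re.DOTALL)


def sliding_segments(units, length):
    """Slide a window of `length` units over the text, one unit at a time."""
    return [tuple(units[i:i + length]) for i in range(len(units) - length + 1)]


units = tokenize("a#1bcd")
print(units)                       # ['a', '#1', 'b', 'c', 'd']
print(sliding_segments(units, 2))
# [('a', '#1'), ('#1', 'b'), ('b', 'c'), ('c', 'd')]
```

Representing each segment as a tuple of units, rather than as a plain string, is a design choice of this sketch: it keeps a multi-character placeholder intact without any special-case string handling.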

Step S1033, a character segment list is obtained according to the character segments.

In one embodiment, the character segment corresponding to the length of the character segment is stored in a corresponding list of character segments to facilitate subsequent deduplication operations.

Step S104, carrying out duplication elimination operation on the character fragment list to obtain a first result list.

In an embodiment, the deduplication operation is performed on the obtained character fragment list to obtain a first result list, specifically: traversing the character segment list according to the length of the character segments to obtain repeated character segments in the character segment list, and then removing the repeated character segments in the corresponding character segment list to obtain a corresponding first result list.

In one embodiment, referring to fig. 5, the step of traversing to remove the repeated character segments in the corresponding character segment list includes, but is not limited to, steps S510 to S530:

step S510, storing the character segments obtained from the character segment list in the result list one by one.

In one embodiment, for the character segment list with the character segment length i, the character segment list is first traversed, and each character segment is stored in the result list one by one.

Step S520, when the storage times is greater than the length of the character segment, determining whether the current character segment is the same as the stored character segment.

In one embodiment, when the storage times is greater than the length of the character segment, it is determined whether the character segment in the current character segment list is the same as the stored character segment in the result list.

Step S530, if the current character segment is the same as the stored character segment, retaining the current character segment in the first result list, and deleting the character segments in the result list, which are the same as the character segments in length before the current character segment, until the traversal of the character segment list is completed, so as to obtain the corresponding first result list.

In an embodiment, once the number of character segments traversed exceeds the character segment length i, the current character segment is compared with the character segment stored in the result list i positions before it. If the two are the same, the current character segment is a repetition, so the i character segments preceding the current character segment are removed from the result list and the current character segment is kept.

For example, suppose the spoken text to be deduplicated is "I want I want" (four characters in the original language, glossed here as "I", "want", "I", "want") and the current character segment length i is 2. The character segment list is then "I want", "want I", "I want", and the result list is built as follows:

the result list obtained by the first traversal is: "I want";

the result list obtained by the second traversal is: "I want", "want I";

the result list obtained by the third traversal is: "I want", "want I", "I want";

At the third traversal the number of stored character segments exceeds i, so the current character segment "I want" is compared with the character segment stored i = 2 positions before it, which is also "I want". The two are the same, so the current character segment is a repetition. The i character segments stored before the current character segment, namely the first two items "I want" and "want I", are deleted from the result list, and the result list keeps only one "I want". Processing continues in this way until the character segment list has been traversed, and the corresponding first result list is obtained.
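Steps S510 to S530 can be sketched as follows. This is an illustrative reading of the procedure, not a definitive implementation; the function name `dedupe_segments` is hypothetical, and the worked example is mirrored with one letter per character ("abab" for a repeated two-character unit).

```python
def dedupe_segments(segments, length):
    """Remove repeated units of `length` characters from a segment list.

    Each segment is stored in the result list in turn. Once more than
    `length` segments are stored, the current segment is compared with the
    one stored `length` positions before it; if they match, the `length`
    segments before the current one are deleted and the current one is kept.
    """
    result = []
    for segment in segments:
        result.append(segment)
        if len(result) > length and result[-1 - length] == segment:
            # Drop the `length` segments that precede the current segment.
            del result[-1 - length:-1]
    return result


# Segments of length 2 taken from "abab": the unit "ab" is repeated.
print(dedupe_segments(["ab", "ba", "ab"], 2))              # ['ab']
# Segments of length 2 taken from "ababcd".
print(dedupe_segments(["ab", "ba", "ab", "bc", "cd"], 2))  # ['ab', 'bc', 'cd']
```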

Step S105, generating deduplication result information according to the first result list.

In one embodiment, the de-duplication result information corresponding to the length of the character segment is generated according to the first result list and the corresponding number dictionary list.

In an embodiment, the method includes obtaining a de-duplicated spoken text according to the first result list, and includes the following specific steps: traversing each character segment in the first result list, if the character segment is the first character segment, directly taking the first character segment as an initial character sequence, and for each subsequent character segment, if the penultimate character of the character segment does not contain a preset replacement character (such as "# 1"), namely the character segment is not terminated by an entity word, splicing the last character of the character segment to the rear of the current existing character sequence; if the penultimate character of the segment contains a preset replacement character (such as "# 1"), the preset replacement character plus the number behind the preset replacement character are spliced to the back of the existing character sequence, and after the first result list is traversed, a new de-duplicated spoken language text can be obtained, wherein the spoken language text has the repeated characters with corresponding lengths removed.
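The splicing described above can be sketched as follows. The function name `join_segments` is hypothetical, and the sketch assumes each segment is held as a tuple of units in which a placeholder such as "#1" is already a single unit. Under that assumption the penultimate-character check becomes unnecessary, because appending the last unit covers both the ordinary case and the entity word case.

```python
def join_segments(result_list):
    """Rebuild the text from a first result list of overlapping segments."""
    if not result_list:
        return ""
    # The first segment is taken whole as the initial character sequence.
    units = list(result_list[0])
    # Every later segment contributes only its last unit.
    for segment in result_list[1:]:
        units.append(segment[-1])
    return "".join(units)


print(join_segments([("a", "b"), ("b", "#1"), ("#1", "c")]))  # ab#1c
```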

In an embodiment, the preset replacement characters in the first result list are acquired one by one, the entity word corresponding to each preset replacement character is looked up in the correspondence between preset replacement characters and entity words stored in the numbering dictionary list, and the preset replacement characters are then replaced by those entity words. In other words, the preset replacement characters in the de-duplicated spoken text are replaced with the corresponding entity words in the preset entity word list according to the relationship between the preset replacement characters and the replacement word numbers in the numbering dictionary list, which yields the deduplication result information corresponding to the character segment length and completes the deduplication operation.
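As an illustration of this restoration step, the sketch below maps each placeholder back to its entity word. It assumes the numbering dictionary list can be modeled as a Python `dict` from replacement word number to entity word; the function name `restore_entities` is hypothetical.

```python
import re


def restore_entities(text, numbering):
    """Replace every placeholder "#k" with the entity word numbered k."""
    def lookup(match):
        number = int(match.group(1))
        # Placeholders without an entry are left untouched.
        return numbering.get(number, match.group(0))

    return re.sub(r"#(\d+)", lookup, text)


numbering = {1: "crushed ice"}
print(restore_entities("I want to eat #1", numbering))  # I want to eat crushed ice
```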

In addition, in an embodiment, since the repeated characters have different lengths, the length of the character segment is gradually increased from 1 to N, and the above-mentioned deduplication step is repeated for each length of the character segment, so as to obtain deduplication result information in which deduplication is completed for the repeated characters with different lengths. Referring to fig. 6, the text deduplication method of the present embodiment further includes, but is not limited to, step S610 to step S630.

Step S610, using the deduplication result information as the text to be deduplicated.

Step S620, successively updating the character segment length to a preset character segment length.

Step S630, the deduplication result information corresponding to the preset character segment length is obtained.

In an embodiment, the character segment length is increased step by step from 1 to N, and the deduplication result information from the previous pass is used as the text to be deduplicated for the next pass. In other words, duplicate characters are removed for every length from the minimum character segment length up to the preset maximum character segment length.

For example, in a specific application scenario, the following example is used to describe the text deduplication method.

S1, assuming that the spoken language text to be deduplicated is:

"Yi Bian-she felt that I want to eat crushed ice and crushed ice today";

S2, because "crushed ice" is a relatively novel product name that contains the repeated character "crushed", deduplication must take the presence of specific entity names into account, so that the expression of the entity is not damaged and spoken language intention recognition is not adversely affected. The present embodiment assumes that the preset entity word is "crushed ice" and that its sequence number in the preset entity word list is 1.

In step S102, an entity word number text is generated from the to-be-deduplicated spoken text by using the preset entity word list. The entity word number text is:

"Yi En Yi felt that I want to eat # 1 today";

S3, assuming that the maximum character segment length is 2, the character segment length is first set to 1 for deduplication:

S31, the obtained character fragment list is:

"kaki", "I", "ju", "D", "day", "I", "O", "I", "1";

S32, the first result list obtained by de-duplicating the character fragment list with the character fragment length of 1 is:

"kazio", "I", "ju", "jin", "day", "I", "to" "# 1";

S33, the deduplication result information generated according to the first result list is:

"Yi En Do I feel I want to eat #1 today";

S4, the character segment length is then set to 2:

S41, the obtained character fragment list is:

"kayime", "I feel", "jin", "today", "heaven me", "I want", "eat # 1";

S42, the first result list obtained by de-duplicating the character fragment list with the character fragment length of 2 is:

"kame", "I feel", "D-jin", "today", "I-day", "I-am # 1";

S43, the deduplication result information generated according to the first result list is:

"Yi En felt me today to eat # 1";

S5, the preset replacement characters in the de-duplicated spoken text are replaced with the corresponding entity words in the preset entity word list according to the relationship between the preset replacement characters and the replacement word numbers in the numbering dictionary list, so as to obtain the deduplication result information and complete the deduplication operation.

The above-obtained deduplication result information is: "Yi Zhen me felt that I want to eat crushed ice today".
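The flow of steps S2 to S5 can be traced end to end with a short sketch. Everything named here is an assumption made for illustration: the function `dedup_text`, the English sample sentence, and the choice to represent the text as a token list in which each placeholder "#k" is a single token. It is not the embodiment's implementation.

```python
import re


def dedup_text(text, entity_words, max_length):
    """Mask entity words, deduplicate for lengths 1..max_length, then restore."""
    # S2: replace each entity word found in the text with "#<number>".
    numbering = {}
    for number, word in enumerate(entity_words, start=1):
        if word in text:
            text = text.replace(word, f"#{number}")
            numbering[number] = word

    # Treat every placeholder as one unit and every other character as one unit.
    tokens = re.findall(r"#\d+|.", text, flags=re.DOTALL)

    # S3/S4: repeat the deduplication for each character segment length.
    for length in range(1, max_length + 1):
        fragments = [tuple(tokens[i:i + length])
                     for i in range(len(tokens) - length + 1)]
        if not fragments:
            break  # the text is shorter than the segment length
        result = []
        for frag in fragments:
            if len(result) >= length and result[-length] == frag:
                del result[-length:]  # drop the earlier copy of the repeat
            result.append(frag)
        # First segment whole, then the last unit of every later segment.
        tokens = list(result[0]) + [frag[-1] for frag in result[1:]]

    # S5: restore entity words; multi-character tokens are placeholders.
    restored = []
    for tok in tokens:
        if len(tok) > 1:
            restored.append(numbering.get(int(tok[1:]), tok))
        else:
            restored.append(tok)
    return "".join(restored)


print(dedup_text("i i wwant a cookie", ["cookie"], 2))  # i want a cookie
print(dedup_text("i i wwant a cookie", [], 2))          # i want a cokie
```

The second call shows what the masking is for: without the entity word list, the doubled "o" inside "cookie" is removed along with the genuine repetitions.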

According to the technical scheme provided by the embodiment of the invention, the spoken language text to be deduplicated is obtained, the preset entity word list is utilized to generate the entity word number text according to the spoken language text to be deduplicated, the character fragment list is generated according to the entity word number text, the duplication removing operation is carried out on the character fragment list to obtain the first result list, and the duplication removing result information is generated according to the first result list. In the embodiment, the situation that entity words containing repeated characters appear in the to-be-deduplicated spoken language text is considered, on the basis of not influencing the deduplication effect of other characters, the situation that the spoken language intention recognition effect is poor due to the fact that the repeated characters in the entity words are also removed in deduplication is avoided, intention recognition information can be obtained more accurately, and robustness and expandability of spoken language intention recognition are improved.

An embodiment of the present invention further provides a spoken language text deduplication device, which can implement the text deduplication method, and with reference to fig. 7, the device includes:

a to-be-deduplicated text acquisition module 710 configured to acquire a to-be-deduplicated text;

an entity word number text generating module 720, configured to generate an entity word number text according to the to-be-deduplicated spoken text by using a preset entity word list;

the character fragment list generating module 730 is configured to generate a character fragment list according to the entity word number text;

a first result list generating module 740, configured to perform a deduplication operation on the character segment list to obtain a first result list;

and a deduplication result information generating module 750 configured to generate deduplication result information according to the first result list.

In an embodiment, the to-be-deduplicated text obtaining module 710 is further configured to obtain spoken language voice information, and then perform text recognition on the spoken language voice information to generate the to-be-deduplicated spoken language text.

In an embodiment, the entity word numbering text generating module 720 is further configured to search the to-be-deduplicated spoken text for each entity word in the preset entity word list one by one, determine whether that entity word exists in the to-be-deduplicated text, and, if the entity word is found, represent the entity word appearing in the to-be-deduplicated spoken text as a preset replacement character followed by a replacement word number, so as to generate the entity word numbering text.
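A possible shape for this module's core operation is sketched below. It assumes that "#" is the preset replacement character, that the replacement word number is the entity word's 1-based position in the preset entity word list, and that the text contains no literal "#"; the name `number_entities` is hypothetical.

```python
def number_entities(text, entity_words):
    """Replace entity words with "#<number>" and record the numbering dictionary."""
    numbering = {}
    for number, word in enumerate(entity_words, start=1):
        # Search the text for each entity word in turn.
        if word in text:
            text = text.replace(word, f"#{number}")
            numbering[number] = word
    return text, numbering


masked, numbering = number_entities("I want to eat crushed ice", ["crushed ice"])
print(masked)      # I want to eat #1
print(numbering)   # {1: 'crushed ice'}
```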

In an embodiment, the character fragment list generating module 730 is further configured to obtain a character fragment length, traverse the entity word number text while sliding a window over it to intercept character fragments of that length, and finally obtain the character fragment list from the intercepted character fragments.
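The sliding interception can be sketched as below. Consistent with the worked example, in which a placeholder such as "# 1" counts as one position, this sketch assumes each "#<number>" placeholder is a single unit; the name `make_fragments` and the sample text are hypothetical.

```python
import re

# One unit is either a whole placeholder ("#" plus digits) or a single character.
UNIT = re.compile(r"#\d+|.", re.DOTALL)


def make_fragments(numbered_text, length):
    """Slide a window of `length` units over the entity word number text."""
    units = UNIT.findall(numbered_text)
    return ["".join(units[i:i + length])
            for i in range(len(units) - length + 1)]


print(make_fragments("abc#1", 1))  # ['a', 'b', 'c', '#1']
print(make_fragments("abc#1", 2))  # ['ab', 'bc', 'c#1']
```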

In an embodiment, the first result list generating module 740 is further configured to store the character segments of the character segment list in the result list one by one and, once the number of stored segments exceeds the character segment length, determine whether the current character segment is the same as a stored character segment. If it is, the current character segment is kept and the character segments that precede it in the result list, equal in number to the character segment length, are deleted from the result list. This continues until the character segment list has been fully traversed, and the corresponding first result list is obtained. In an embodiment, once the number of traversals exceeds the character segment length i, it is judged whether the current character segment is the same as the character segment stored in the result list i positions earlier. If so, the current character segment is a repeat, and the i character segments preceding it are removed from the result list while the current character segment is kept.

In an embodiment, the deduplication result information generating module 750 is further configured to generate deduplication result information corresponding to the length of the character segment according to the first result list and the corresponding number dictionary list.

In addition, in an embodiment, since the repeated characters have different lengths, the length of the character segment is gradually increased from 1 to N, and for each length of the character segment, the above-mentioned deduplication operation is repeated according to the above-mentioned module, so as to obtain deduplication result information in which deduplication is completed for the repeated characters of different lengths.

The related art does not consider that names of special service entities may appear during spoken language intention recognition. For this situation, the embodiment adopts a strategy in which the name of a special service entity is first replaced with a preset replacement character (for example, a special character), deduplication is then performed, and the special character is finally restored to the name of the service entity. The spoken language intention recognition scheme of this embodiment can therefore take into account entity words containing repeated characters in the text to be deduplicated, and avoids the poor recognition effect caused by also removing the repeated characters inside entity words during deduplication. The scheme has higher robustness, and its logic is clear, concise, and easy to understand, which gives it higher scalability.

The specific implementation of the spoken text deduplication device of this embodiment is substantially the same as the specific implementation of the text deduplication method, and is not described herein again.

An embodiment of the present invention further provides an electronic device, including:

at least one memory;

at least one processor;

at least one program;

the at least one program is stored in the memory, and the processor executes the at least one program to implement the text deduplication method described above. The electronic device can be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a vehicle-mounted computer, and the like.

Referring to fig. 8, fig. 8 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:

the processor 801 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the embodiment of the present invention;

the memory 802 may be implemented as a ROM (read-only memory), a static memory device, a dynamic memory device, or a RAM (random access memory). The memory 802 may store an operating system and other application programs. When the technical solution provided by the embodiments of the present disclosure is implemented by software or firmware, the relevant program code is stored in the memory 802 and is called by the processor 801 to execute the text deduplication method according to the embodiments of the present disclosure;

an input/output interface 803 for realizing information input and output;

the communication interface 804 is used for realizing communication interaction between the device and other devices, and can realize communication in a wired manner (such as USB, network cable, and the like) or in a wireless manner (such as mobile network, WIFI, bluetooth, and the like); and

a bus 805 that transfers information between the various components of the device (e.g., the processor 801, memory 802, input/output interfaces 803, and communication interface 804);

wherein the processor 801, the memory 802, the input/output interface 803 and the communication interface 804 are communicatively connected to each other within the device via a bus 805.

The embodiment of the invention also provides a storage medium which is a computer-readable storage medium, and the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used for causing a computer to execute the text deduplication method.

According to the text duplication eliminating method, the text duplication eliminating device, the electronic equipment and the storage medium, the text to be duplicated is obtained, the preset entity word list is used for generating the entity word numbered text according to the text to be duplicated, the character segment list is generated according to the entity word numbered text, duplication eliminating operation is carried out on the character segment list to obtain the first result list, and duplication eliminating result information is generated according to the first result list. In the embodiment, the situation that entity words containing repeated characters appear in the text to be deduplicated is considered, on the basis of not influencing the deduplication effect of other characters, the situation that the spoken language intention recognition effect is poor due to the fact that the repeated characters in the entity words are also removed during deduplication is avoided, intention recognition information can be obtained more accurately, and robustness and expandability of spoken language intention recognition are improved.

The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The embodiments described herein are intended to illustrate the technical solutions of the embodiments of the present invention more clearly and do not limit those technical solutions. Those skilled in the art will know that, as technology evolves and new application scenarios emerge, the technical solutions provided in the embodiments of the present invention are also applicable to similar technical problems.

It will be appreciated by those skilled in the art that the solutions shown in fig. 1-6 are not intended to limit the embodiments of the present invention, and may include more or less steps than those shown, or some steps in combination, or different steps.

The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.

The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and are not intended to limit the scope of the embodiments of the invention. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present invention are intended to be within the scope of the claims of the embodiments of the present invention.

Claims

1. A method for text deduplication, comprising:

acquiring a text to be deduplicated;

generating a character fragment list according to the entity word number text;

2. The text deduplication method of claim 1, wherein the preset entity term list contains at least one entity term; the generating an entity word number text according to the text to be deduplicated by using a preset entity word list comprises the following steps:

if the entity words are found, acquiring preset replacement characters;

3. The text deduplication method according to claim 2, wherein before searching the text to be deduplicated for the corresponding entity words in the preset entity word list one by one, the method further comprises:

4. The text deduplication method of any one of claims 1 to 3, wherein the generating of the character fragment list according to the entity word number text comprises:

acquiring the length of a character segment;

and obtaining the character fragment list according to the character fragments.

5. The text deduplication method of claim 4, wherein the performing deduplication operations on the character fragment list to obtain a first result list comprises:

6. The text deduplication method of claim 2, wherein the generating deduplication result information according to the first result list comprises:

acquiring preset replacement characters in the first result list one by one;

7. The text deduplication method of claim 4, wherein after the generating deduplication result information according to the first result list, further comprising:

8. A text deduplication apparatus, comprising:

9. An electronic device, comprising:

at least one memory;

at least one processor;

at least one program;

the programs are stored in a memory, and a processor executes the at least one program to implement:

the method of any one of claims 1 to 7.

10. A storage medium that is a computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform:

the method of any one of claims 1 to 7.