CN118227806A - Risk control method, apparatus, medium and device for similarity retrieval of variant text - Google Patents

Info

Publication number
CN118227806A
Authority
CN
China
Prior art keywords
text
risk
variant
character
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410384521.5A
Other languages
Chinese (zh)
Inventor
张江滨
赵智源
祝慧佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202410384521.5A
Publication of CN118227806A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The specification discloses a risk control method, apparatus, medium and device for similarity retrieval of variant text. An identified risk text is obtained and varied to produce a variant text of the risk text. The risk text and the variant text are input into an extraction model to be trained, the text features of each are determined, the similarity between those text features is determined, and the extraction model is trained with the objective that the similarity be greater than a preset similarity threshold. The trained extraction model is then used to determine text features of the risk samples in a database, and risk control is performed according to those text features, so that variants of risk text can be identified, prevented and controlled.

Description

Risk control method, apparatus, medium and device for similarity retrieval of variant text
Technical Field
The specification relates to the field of computer technology, and in particular to a risk control method, apparatus, medium and device for similarity retrieval of variant text.
Background
With the rapid development of technologies such as big data, cloud computing and artificial intelligence, the volume of information on internet platforms has grown explosively. To avoid risks such as personal information leakage caused by users publishing illegal information through an internet platform, risk control needs to be applied to the information that users publish through the platform.
In the prior art, information sent by users on an internet platform is generally imported into a database, and information that matches illegal information in the database is subjected to risk control by means of similarity search and matching.
To this end, the present specification provides a risk control method, apparatus, medium and device for similarity retrieval of variant text.
Disclosure of Invention
The present specification provides a risk control method, apparatus, medium and device for similarity retrieval of variant text, to at least partially solve the above problems in the prior art.
The technical solution adopted in the specification is as follows:
The specification provides a risk control method for similarity retrieval of variant text, comprising:
acquiring an identified risk text;
varying the risk text to obtain a variant text of the risk text, wherein the variant text has the same semantics as the risk text;
inputting the risk text and the variant text into an extraction model to be trained, determining the text features of the risk text and of the variant text respectively, determining the similarity between the text features of the risk text and the text features of the variant text, and training the extraction model with the objective that the similarity be greater than a preset similarity threshold;
determining text features of risk samples in a database through the trained extraction model, and performing risk control according to the text features of the risk samples.
Optionally, varying the risk text to obtain a variant text of the risk text specifically comprises:
determining a text fragment to be varied in the risk text;
varying the determined text fragment to obtain the variant text of the risk text.
Optionally, determining the text fragment to be varied in the risk text specifically comprises:
performing word segmentation on the risk text, determining the part of speech of each word segment in the risk text, and determining, according to the part of speech of each word segment, the word segment to be varied as the text fragment to be varied; or
performing word segmentation on the risk text, and determining, according to a preset keyword dictionary, the word segments that match the keyword dictionary as the text fragment to be varied.
Optionally, varying the determined text fragment of the risk text specifically comprises:
screening out at least some of the characters of the text fragment to be varied;
for each screened character, varying the character in the risk text according to a homophone of the character or the pinyin of the character.
Optionally, varying the determined text fragment of the risk text specifically comprises:
screening out at least some of the characters of the text fragment to be varied;
for each screened character, determining at least one of a visually similar character, the corresponding text in another language, a pictogram corresponding to the character, and the text obtained by splitting the character, as the representation corresponding to the character;
varying the character in the risk text according to the representation corresponding to the character.
Optionally, inputting the risk text and the variant text into the extraction model to be trained and determining the text features of the risk text and of the variant text respectively specifically comprises:
inputting the risk text and the variant text, each as input data, into the extraction model to be trained;
for each character in the input data, determining the image data of the character and the pinyin data of the character through a fusion layer of the extraction model to be trained;
determining the image feature of the image data of the character and the pinyin feature of the pinyin data of the character;
fusing the image feature of the character, the pinyin feature of the character and the character itself to obtain the fusion feature of the character;
splicing the fusion features of the characters according to the order of the characters in the input data, and inputting the spliced fusion features into a coding layer and a decoding layer to obtain the text features of the input data.
Optionally, splicing the fusion features of the characters according to the order of the characters in the input data specifically comprises:
adding a position code to each character in the input data according to the order of the characters in the input data;
splicing the position code and the fusion feature of each character as the position fusion feature of that character, and splicing the position fusion features of the characters according to the order of the characters in the input data.
Optionally, performing risk control according to the text features of the risk samples specifically comprises:
in response to a service request carrying a text to be sent, inputting the text to be sent into the trained extraction model to obtain the text features of the text to be sent;
matching the text features of the text to be sent against the text features of the risk samples, and determining the risk control policy for the service request according to the matching result.
The specification provides a risk control apparatus for similarity retrieval of variant text, comprising:
an acquisition module, configured to acquire an identified risk text;
a variation module, configured to vary the risk text to obtain a variant text of the risk text, wherein the variant text has the same semantics as the risk text;
a training module, configured to input the risk text and the variant text into an extraction model to be trained, determine the text features of the risk text and of the variant text respectively, determine the similarity between the text features of the risk text and the text features of the variant text, and train the extraction model with the objective that the similarity be greater than a preset similarity threshold;
a risk control module, configured to determine text features of risk samples in a database through the trained extraction model, and to perform risk control according to the text features of the risk samples.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above risk control method for similarity retrieval of variant text.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above risk control method for similarity retrieval of variant text when executing the program.
At least one of the technical solutions adopted in this specification can achieve the following beneficial effects:
In the risk control method for similarity retrieval of variant text provided by the specification, an identified risk text is obtained and varied to produce a variant text that has the same semantics as the risk text. The risk text and the variant text are input into an extraction model to be trained, the text features of each are determined, the similarity between those text features is determined, and the extraction model is trained with the objective that the similarity be greater than a preset similarity threshold. The trained extraction model is then used to determine text features of the risk samples in a database, and risk control is performed according to those text features.
Because the extraction model is trained on the risk text together with its variants, the model becomes aware of variant information when extracting the text features of risk text, which makes it possible to identify, prevent and control variants of risk text.
Drawings
The accompanying drawings described here are provided for a further understanding of the specification and form a part of it. The exemplary embodiments of the specification and their description serve to explain the specification and do not unduly limit it. In the drawings:
FIG. 1 is a schematic flow chart of a risk control method for similarity retrieval of variant text according to an embodiment of the present specification;
FIG. 2 is a schematic diagram of multi-feature fusion provided in the present specification;
FIG. 3 is a schematic flow chart of risk control provided in the present specification;
FIG. 4 is a schematic diagram of a risk control apparatus for similarity retrieval of variant text provided in the present specification;
FIG. 5 is a schematic structural diagram of the electronic device corresponding to FIG. 1 provided in the present specification.
Detailed Description
To make the objects, technical solutions and advantages of the present specification clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of the present specification. All other embodiments obtained by a person of ordinary skill in the art from the embodiments herein without creative effort fall within the scope of protection of the present application.
The technical solutions provided by the embodiments of the present specification are described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of a risk control method for similarity retrieval of variant text according to an embodiment of the present specification, which includes the following steps:
S100: Obtain an identified risk text.
Because risk control on text generally involves processing large volumes of text, and extracting text features places high demands on computing power, in the embodiments of this specification the method may be executed by a server. The specification does not limit which device executes the method; devices such as a personal computer, a mobile terminal or a server may extract text features and perform risk control on text. For convenience, the following description takes the server as the executing entity.
In one or more embodiments of the present specification, the server may obtain identified risk texts from a public database or from the databases of relevant institutions or organizations, for training the extraction model. The following steps describe the training of the extraction model using one risk text as an example.
S102: Vary the risk text to obtain a variant text of the risk text, where the variant text has the same semantics as the risk text.
In one or more embodiments of the present specification, the server varies the obtained risk text to obtain a variant text that has the same semantics as the risk text.
Specifically, for each risk text, the server determines the text fragments to be varied in the risk text, and varies those fragments to obtain the variant text of the risk text.
Further, in one or more embodiments of the present specification, the server performs word segmentation on the risk text, determines the part of speech of each word segment, and determines, according to the part of speech, the word segments to be varied as the text fragments to be varied. This is the part-of-speech-based method of locating text fragments to be varied. Alternatively, the server performs word segmentation on the risk text and determines, according to a preset keyword dictionary, the word segments that match the dictionary as the text fragments to be varied. This is the keyword-based method of locating text fragments to be varied. The specification does not limit how the text fragments to be varied are determined; a method may be chosen according to the situation, or the two methods may be combined.
It should be noted that the text fragments to be varied are determined from the risk text, and that the variation does not change the original meaning of the risk text.
Locating text fragments to be varied based on part of speech. This approach identifies and locates words of specific parts of speech in the text, such as nouns, verbs and adjectives, so that the located words can be varied. For example, for the risk text "maritime Royal lottery online", part-of-speech-based locating may yield "maritime" (noun) and "Royal" (noun).
Locating text fragments to be varied based on keywords. This approach focuses on key knowledge points or key information points (key points for short) in the text, which may contain sensitive or violating content. It typically uses natural language processing (NLP) techniques, such as key point identification and key point extraction, to identify the key information in the text so that it can be varied. For example, in the financial field, particular financial terms or operations may be of interest; in the medical field, particular drug names or treatment options may be of interest. For the risk text "maritime Royal lottery online", keyword-based locating may yield "lottery" (sensitive word).
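The two locating methods can be illustrated with a minimal sketch. It assumes the risk text has already been segmented and tagged by some external tool; the function names, the tag values and the keyword dictionary are hypothetical and are not taken from the specification.

```python
from typing import List, Set, Tuple


def locate_by_pos(tagged: List[Tuple[str, str]], target_pos: Set[str]) -> List[int]:
    """Indices of word segments whose part-of-speech tag is in target_pos."""
    return [i for i, (_, pos) in enumerate(tagged) if pos in target_pos]


def locate_by_keywords(segments: List[str], keyword_dict: Set[str]) -> List[int]:
    """Indices of word segments that appear in the preset keyword dictionary."""
    return [i for i, seg in enumerate(segments) if seg in keyword_dict]


# Hypothetical, already segmented and tagged risk text (the tags are made up).
tagged = [("maritime", "n"), ("Royal", "n"), ("lottery", "n"), ("online", "v")]
segments = [word for word, _ in tagged]

pos_hits = locate_by_pos(tagged, {"n"})                  # part-of-speech-based
keyword_hits = locate_by_keywords(segments, {"lottery"})  # keyword-based
combined = sorted(set(pos_hits) | set(keyword_hits))      # both methods combined
```

Taking the union of the two index sets is only one possible way of combining the methods; the specification leaves the combination open.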
Further, in one or more embodiments of the present specification, the server may screen out at least some of the characters of the text fragment to be varied and, for each screened character, vary the character in the risk text according to a homophone of the character or the pinyin of the character. Text rendered from a dialect pronunciation of the character, such as Cantonese or the Beijing dialect, may also be used as a variant of the character.
For example, in the risk text "maritime Royal lottery online", the variant of "maritime" may be a homophone of it or the pinyin "hai", and the variant of "Royal" may be the homophone "yellow".
Alternatively, the server may screen out at least some of the characters of the text fragment to be varied and, for each screened character, determine at least one of the following as the representation of the character's variant: a visually similar character, the corresponding text in another language, a pictogram corresponding to the character, and the text obtained by splitting the character. An example of a visually similar character is varying "ji" to a look-alike character. An example of text in another language is varying the character for "up" to the English word "up". An example of splitting is varying "lottery" to "take tickets". Adjacent screened characters may also be merged: if the merged form corresponds to an existing text, it may be used as a variant representation, for example a variant of "people" is "from". A pictogram likewise counts as variant text.
The character in the risk text is then varied according to the representation corresponding to the character.
For example, in the risk text "maritime Royal lottery online", the variant of "lottery" may be "take tickets".
In one or more embodiments of the present specification, the server may determine, for at least some of the characters in the text fragment to be varied, their homophones and pinyin as well as their representations (visually similar characters, corresponding text in other languages, corresponding pictograms, and text obtained by splitting the character), and then vary the text fragment according to at least one of the homophones, the pinyin and the representations. The homophones, pinyin and representations of these characters may also be freely combined into variation combinations, and the text fragment varied according to those combinations.
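The free combination of per-character variants can be sketched as a Cartesian product over the candidate forms of each screened character. The lookup tables below are hypothetical and tiny, and the Chinese characters are an assumed reconstruction of the running example (the translated text does not preserve the original characters); a real system would need far larger homophone, pinyin and alternative-form resources.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical per-character variant tables; the entries are illustrative only.
HOMOPHONES: Dict[str, List[str]] = {"皇": ["黄"]}   # same pronunciation
PINYIN: Dict[str, List[str]] = {"海": ["hai"]}      # romanized form
OTHER_FORMS: Dict[str, List[str]] = {"上": ["up"]}  # text in another language


def char_variants(ch: str) -> List[str]:
    """The character itself followed by every known variant form."""
    return [ch] + HOMOPHONES.get(ch, []) + PINYIN.get(ch, []) + OTHER_FORMS.get(ch, [])


def generate_variants(text: str, span: Tuple[int, int]) -> List[str]:
    """Vary only the characters inside span = (start, end), combining forms freely."""
    start, end = span
    options = [char_variants(c) if start <= i < end else [c]
               for i, c in enumerate(text)]
    variants = {"".join(combo) for combo in product(*options)}
    variants.discard(text)  # the unchanged text is not a variant
    return sorted(variants)


variants = generate_variants("海上皇冠", (0, 3))
```

Each generated string keeps the characters outside the located fragment untouched, which is one simple way to respect the requirement that the variation not change the meaning of the risk text.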
S104: Input the risk text and the variant text into an extraction model to be trained, determine the text features of the risk text and of the variant text respectively, determine the similarity between the two sets of text features, and train the extraction model with the objective that the similarity be greater than a preset similarity threshold.
In one or more embodiments of the present specification, the server inputs the risk text and the variant text into the extraction model to be trained, determines the text features of each, determines the similarity between the text features of the risk text and those of the variant text, and trains the extraction model with the objective that the similarity be greater than the preset similarity threshold.
Specifically, varying one risk text may yield several variant texts. Taking one risk text and one of its variant texts as an example, the server inputs both into the extraction model to be trained, determines the text features of each, determines the similarity between them, and trains the extraction model with the objective that the similarity be greater than the preset similarity threshold. The extraction model may adopt the structure of a Bidirectional Encoder Representations from Transformers (BERT) model; the specific model structure may be set according to the actual situation and is not limited in this specification.
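The training objective can be made concrete with a small sketch. The specification does not name a similarity measure or a loss function, so cosine similarity and a hinge-style loss are assumptions here, as are the function names and the threshold value.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def positive_pair_loss(risk_feat: Sequence[float],
                       variant_feat: Sequence[float],
                       threshold: float = 0.9) -> float:
    """Shortfall of the similarity below the threshold; zero once it is reached."""
    return max(0.0, threshold - cosine_similarity(risk_feat, variant_feat))


loss_aligned = positive_pair_loss([1.0, 0.0], [1.0, 0.0])     # features agree
loss_orthogonal = positive_pair_loss([1.0, 0.0], [0.0, 1.0])  # features disagree
```

Minimizing such a loss pushes the features of a risk text and its variant together until their similarity reaches the threshold, after which the pair no longer contributes to the gradient.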
S106: Determine text features of the risk samples in the database through the trained extraction model, and perform risk control according to the text features of the risk samples.
In one or more embodiments of the present specification, the server determines the text features of the risk samples in the database through the trained extraction model and performs risk control according to those text features.
Specifically, the server extracts the text features of the obtained risk texts through the trained extraction model and uses them as the text features of the risk samples in the database.
In response to a service request carrying a text to be sent, the server inputs that text into the trained extraction model and obtains the text features of the text to be sent.
The server then matches the text features of the text to be sent against the text features of the risk samples in the database, and determines the risk control policy for the service request according to the matching result.
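A minimal sketch of the matching step follows. It assumes that features are plain vectors, that matching means the best cosine similarity reaching a threshold, and that the database is a list scanned linearly; none of these choices is stated in the specification, and a production system would more likely use an approximate nearest-neighbor index.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_risk_sample(query: Sequence[float],
                      risk_features: List[Sequence[float]],
                      threshold: float = 0.85) -> Optional[Tuple[int, float]]:
    """Return (index, similarity) of the closest risk sample, or None if no match."""
    best_idx, best_sim = None, -1.0
    for i, feat in enumerate(risk_features):
        sim = cosine(query, feat)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    if best_idx is not None and best_sim >= threshold:
        return best_idx, best_sim
    return None


risk_db = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical stored risk features
hit = match_risk_sample([0.9, 0.1], risk_db)  # close to the first risk sample
miss = match_risk_sample([1.0, 1.0], risk_db)  # equally far from both samples
```

The returned value would then drive the risk control policy, for example intercepting the request on a hit and letting it proceed or escalating it on a miss.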
In the risk control method for similarity retrieval of variant text shown in FIG. 1, an identified risk text is obtained and varied to produce a variant text that has the same semantics as the risk text. The risk text and the variant text are input into an extraction model to be trained, the text features of each are determined, the similarity between those text features is determined, and the extraction model is trained with the objective that the similarity be greater than a preset similarity threshold. The trained extraction model is then used to determine text features of the risk samples in a database, and risk control is performed according to those text features.
Because the extraction model is trained on the risk text together with its variants, the model becomes aware of variant information when extracting the text features of risk text, which makes it possible to identify, prevent and control variants of risk text.
In addition, in one or more embodiments of the present specification, before matching the text features of the text to be sent against the text features of the risk samples in the database, the server inputs the obtained risk texts into the trained extraction model to obtain their text features, and stores those text features in the database as the text features of the risk samples, to be matched against the text features of the text to be sent carried by a service request.
In one or more embodiments of the present specification, the server may input the risk text and the variant text, each as input data, into the extraction model to be trained. For each character in the input data, the image data and the pinyin data of the character are determined through a fusion layer of the extraction model. The image feature of the image data and the pinyin feature of the pinyin data are then determined, and the image feature, the pinyin feature and the character itself are fused to obtain the fusion feature of the character. The fusion features of the characters are spliced according to the order of the characters in the input data, and the spliced fusion features are input into a coding layer and a decoding layer to obtain the text features of the input data.
For example, if the text is "crown", its variant may be a crown pictogram. When the crown pictogram is converted into image data, the image still directly shows a crown pattern, from which the character can be recognized as "crown".
FIG. 2 is a schematic diagram of multi-feature fusion provided in the present specification. In FIG. 2, taking the text "festival happy" as an example, each character, such as "festival", "day" and "happy", is fused with its own image feature and pinyin feature at the fusion layer and then spliced with its position code.
Fusing the text feature, image feature and pinyin feature of each character allows the trained extraction model to perceive more variant information.
In one or more embodiments of the present specification, when splicing the fusion features of the characters according to the order of the characters in the input data, the server may add a position code to each character according to that order, splice the position code and the fusion feature of each character as the position fusion feature of that character, and then splice the position fusion features of the characters according to the order of the characters in the input data.
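The fusion and splicing steps can be sketched with plain lists. The specification does not say how the three features are fused or how the position code is computed, so vector concatenation and a two-dimensional sinusoidal position code are assumptions made only for illustration, as are all names below.

```python
import math
from typing import List

Vector = List[float]


def fuse_character(char_emb: Vector, image_feat: Vector, pinyin_feat: Vector) -> Vector:
    """Fuse the character, image and pinyin features (concatenation is an assumption)."""
    return char_emb + image_feat + pinyin_feat


def position_code(index: int) -> Vector:
    """Hypothetical two-dimensional sinusoidal code for the character position."""
    return [math.sin(index), math.cos(index)]


def build_sequence(fused_chars: List[Vector]) -> List[Vector]:
    """Splice each position code with its fusion feature, keeping character order."""
    return [position_code(i) + fused for i, fused in enumerate(fused_chars)]


# Two characters with made-up one-dimensional character and pinyin features
# and a two-dimensional image feature.
fused = fuse_character([1.0], [2.0, 3.0], [4.0])
sequence = build_sequence([fused, fused])
```

The resulting ordered list of position fusion features is what would be passed on to the coding and decoding layers.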
In one or more embodiments of the present specification, when the server determines the risk control policy for the service request according to the matching result and the result is a non-match, the server may send the text to be sent carried by the service request, together with a prompt message, to an auditor. The prompt message asks the auditor to review the text to be sent, and the server determines the risk control policy for the service request according to the review result returned by the auditor.
FIG. 3 is a schematic flow chart of risk control provided in the present specification.
In one or more embodiments of the present specification, as shown in FIG. 3, when a violating user publishes and spreads violating information through an internet platform (for example, risk information that may cause adverse consequences or negative effects for normal users of the platform), the user typically publishes the information in batches, in order to spread it widely on the platform and obtain illegal gains. When such information first appears on the platform, the published batch hits the deduplication stage of the risk control system. That stage detects a batch of repeated information but cannot itself determine whether the information is violating, so the batch is sent to an auditor together with a prompt to review it. If the auditor determines that the information is violating and contains risk content, risk control is applied to the batch, the batch is input into the trained extraction model, and the extracted text features are stored in the database as text features of risk samples, so that the server can automatically detect and control the information the next time it appears. If the auditor determines that the information is not violating, no risk control needs to be applied to it.
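The audit feedback loop described above can be sketched as follows. The auditor decision and the feature extractor are passed in as callables because the specification does not define their interfaces; the function name and the stand-in lambdas are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def handle_audited_batch(texts: List[str],
                         is_violation: Callable[[str], bool],
                         extract: Callable[[str], Vector],
                         risk_db: List[Vector]) -> List[str]:
    """Store features of auditor-confirmed texts as risk samples; return those texts."""
    confirmed = [t for t in texts if is_violation(t)]
    for text in confirmed:
        risk_db.append(extract(text))  # becomes a risk sample for later matching
    return confirmed


# Stand-ins for the auditor and the trained extraction model (illustration only).
risk_db: List[Vector] = []
confirmed = handle_audited_batch(
    ["a", "bb"],
    is_violation=lambda t: len(t) > 1,
    extract=lambda t: [float(len(t))],
    risk_db=risk_db,
)
```

Texts the auditor clears are simply left out of the database, which matches the statement that no risk control is applied to them.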
In one or more embodiments of the present specification, when training the extraction model, the server may take different risk texts, determine the variant texts of each, and extract the text features of the different risk texts and of their variant texts. The server may then calculate the similarity between different risk texts, between a risk text and the variant texts of a different risk text, and between the variant texts of different risk texts, and train the extraction model with the objective that these similarities be smaller than a preset value. For example, the similarity between the text features of risk text A and the text features of a variant text of risk text B is calculated, and the extraction model is trained with the objective that this similarity be smaller than the preset value, where risk text A and risk text B have different semantics.
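For pairs with different semantics the objective runs the other way. The sketch below again assumes cosine similarity and a hinge-style penalty, neither of which is specified in the source; the preset value of 0.3 is arbitrary.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def negative_pair_loss(feat_a: Sequence[float],
                       feat_b_variant: Sequence[float],
                       preset_value: float = 0.3) -> float:
    """Amount by which the similarity exceeds the preset value; zero if below it."""
    return max(0.0, cosine_similarity(feat_a, feat_b_variant) - preset_value)


loss_separated = negative_pair_loss([1.0, 0.0], [0.0, 1.0])  # already dissimilar
loss_collapsed = negative_pair_loss([1.0, 0.0], [1.0, 0.0])  # wrongly identical
```

Combined with a loss that pulls a risk text toward its own variants, this penalty keeps semantically different risk texts apart in feature space.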
In one or more embodiments of the present specification, the server splices the risk text with its variant text. Because varying a risk text may yield several variant texts, splicing may likewise yield several pieces of spliced data.
Specifically, the server joins the risk text and one of its variant texts end to end and uses the result as spliced data.
For example, for the risk text "maritime Royal lottery online" with the variant text "maritime Royal lottery online tickets online", the spliced data is "maritime Royal lottery online, maritime Royal lottery online tickets online". When the spliced data is input into the extraction model to be trained, it takes the form "[CLS] maritime Royal lottery online [SEP] maritime Royal yellow online tickets online [SEP]".
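Building the spliced input can be sketched in a few lines. The sketch assumes the BERT-style special tokens `[CLS]` and `[SEP]` are written directly into a string; in practice a tokenizer would normally insert them, and the function names here are hypothetical.

```python
from typing import List


def build_pair_input(risk_text: str, variant_text: str,
                     cls: str = "[CLS]", sep: str = "[SEP]") -> str:
    """Join a risk text and one variant end to end with the special tokens."""
    return f"{cls}{risk_text}{sep}{variant_text}{sep}"


def build_all_pairs(risk_text: str, variant_texts: List[str]) -> List[str]:
    """One piece of spliced data per variant text of the same risk text."""
    return [build_pair_input(risk_text, v) for v in variant_texts]


pairs = build_all_pairs("risk text", ["variant one", "variant two"])
```

One risk text with several variants therefore produces several pieces of spliced data, as the description notes.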
The spliced data is input into the extraction model to be trained, the text features of each piece of spliced data are determined, the similarity between the text features is determined, and the extraction model is trained with the objective that the similarity be greater than a preset similarity threshold.
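The splicing step described above can be sketched as follows. This is an illustrative sketch only: the function names `splice_pair` and `splice_all` are hypothetical, and the marker strings default to the plain "CLS" and "SEP" used in the example of this specification (BERT-style tokenizers usually write them as `[CLS]` and `[SEP]`).

```python
def splice_pair(risk_text, variant_text, cls_token="CLS", sep_token="SEP"):
    """Concatenate a risk text and one of its variant texts end to end,
    in the 'CLS <risk> SEP <variant> SEP' input form."""
    return f"{cls_token} {risk_text} {sep_token} {variant_text} {sep_token}"


def splice_all(risk_text, variant_texts):
    """One risk text may have several variant texts, so splicing
    yields one piece of spliced data per variant text."""
    return [splice_pair(risk_text, variant) for variant in variant_texts]
```

For instance, `splice_all("A", ["A1", "A2"])` produces two pieces of spliced data, one for each variant text of the risk text.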
Of course, the server may also extract the text features of the spliced data obtained by splicing one risk text with its variant text, extract the text features of the spliced data obtained by splicing another risk text with its variant text, calculate the similarity between the text features of the two pieces of spliced data, and train the extraction model to be trained with the objective that this similarity be smaller than a preset value. For example, if one variant text of risk text C is D, the spliced data is [C, D]; if one variant text of risk text E is F, the spliced data is [E, F]. The text features of [C, D] and [E, F] are extracted, the similarity between the text features of the two pieces of spliced data is calculated, and the extraction model to be trained is trained with the objective that the similarity be smaller than the preset value, where the semantics of risk text C and the semantics of risk text E are not the same.
In one or more embodiments of the present specification, the server may also determine the text segments to be varied in the risk text by localization based on an adversarial model.
Localization based on an adversarial model uses a machine learning model to counter variant text. It typically involves training a model to distinguish normal text from variant text, and then using that model to locate the text segments in a text that are likely to be varied. The adversarial model learns from both normal text and variant text during training so that it identifies variant text better. Such a method may use deep learning models, such as a BERT model or a Generative Pre-trained Transformer (GPT) model, which can capture deep semantic information of text and thereby counter variant text.
The above is the wind control (that is, risk control) method for variant text similar retrieval provided by one or more embodiments of the present specification. Based on the same idea, the present specification also provides a corresponding wind control device for variant text similar retrieval, as shown in fig. 4.
FIG. 4 is a schematic diagram of a wind control device for variant text similar search provided in the present specification, specifically including:
An obtaining module 400, configured to obtain the identified risk text;
a variant module 402, configured to variant the risk text to obtain variant text of the risk text, where the semantics of the variant text and the semantics of the risk text are the same;
The training module 404 is configured to input the risk text and the variant text into an extraction model to be trained, determine text features of the risk text and the variant text, determine a similarity between the text features of the risk text and the variant text, and train the extraction model with the similarity being greater than a preset similarity threshold;
and the wind control module 406 is configured to determine text features of the risk sample in the database through the trained extraction model, and perform wind control according to the text features of the risk sample.
Optionally, the variant module 402 is specifically configured to determine a text segment to be variant in the risk text, and variant the determined text segment to be variant of the risk text to obtain variant text of the risk text.
Optionally, the variant module 402 is further configured to segment the risk text, determine a part of speech of each word segment in the risk text, determine a word segment to be variant from each word segment according to the part of speech of each word segment, as a text segment to be variant, or segment the risk text, and determine a word segment matched with the keyword dictionary in each word segment according to a preset keyword dictionary, as the text segment to be variant.
Optionally, the variant module 402 is further configured to screen at least part of characters from the characters of the text segment to be variant, and for each character screened, variant the character in the risk text according to homonyms of the character or pinyin of the character.
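The dictionary-based selection of text segments and the homophone or pinyin variation can be sketched as below. This is a toy sketch under stated assumptions: the input is assumed to be already segmented into words, the tables `HOMOPHONES` and `PINYIN`, the sample words and all function names are hypothetical, and every character that has a table entry is varied, whereas the specification only requires screening at least part of the characters.

```python
# Hypothetical lookup tables for illustration; a real system would need
# a full homophone dictionary and a pinyin converter.
HOMOPHONES = {"彩": ["采"], "票": ["漂"]}
PINYIN = {"彩": "cai", "票": "piao"}


def find_segments(words, keyword_dict):
    """Return the word segments that match the preset keyword dictionary."""
    return [w for w in words if w in keyword_dict]


def vary_segment(segment, mode="homophone"):
    """Vary each character of a segment by a homophone or by its pinyin.
    Characters without a table entry are left unchanged."""
    out = []
    for ch in segment:
        if mode == "homophone" and ch in HOMOPHONES:
            out.append(HOMOPHONES[ch][0])
        elif mode == "pinyin" and ch in PINYIN:
            out.append(PINYIN[ch])
        else:
            out.append(ch)
    return "".join(out)


def make_variants(words, keyword_dict):
    """Build one variant text per variation mode from a segmented text."""
    targets = set(find_segments(words, keyword_dict))
    variants = []
    for mode in ("homophone", "pinyin"):
        variants.append(
            "".join(vary_segment(w, mode) if w in targets else w for w in words)
        )
    return variants
```

With the hypothetical segmented text `["在线", "彩票"]` and keyword dictionary `{"彩票"}`, only the second word is varied and the first is kept as it is.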
Optionally, the variant module 402 is further configured to screen at least part of characters from the characters of the text segment to be variant, determine, for each character screened, at least one of a word near the character, a text corresponding to the character in other languages, a pictogram corresponding to the character, and a text after splitting the character, as a representation form corresponding to the character, and variant the character in the risk text according to the representation form corresponding to the character.
Optionally, the training module 404 is specifically configured to: input the risk text and the variant text, each as input data, into the extraction model to be trained; for each character in the input data, determine the image data of the character and the pinyin data of the character through a fusion layer of the extraction model to be trained; determine the image features of the image data of the character and the pinyin features of the pinyin data of the character; perform feature fusion on the image features of the image data of the character, the pinyin features of the pinyin data of the character, and the character itself to obtain the fusion feature of the character; and splice the fusion features of the characters according to the order of the characters in the input data and input the spliced fusion features into an encoding and decoding layer to obtain the text features of the input data.
Optionally, the training module 404 is further configured to add a position code to each character according to the order of the characters in the input data, splice the position code and the fusion feature of each character to obtain the position fusion feature of that character, and splice the position fusion features of the characters according to the order of the characters in the input data.
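The data flow of the fusion layer and the position code can be sketched as follows. This is a sketch of the data flow only, under several assumptions: fusion is done by simple concatenation (the specification does not state how features are fused), the position code is the character index as a single number, and the three feature extractors are passed in as hypothetical callables rather than being real image or pinyin encoders.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def fuse_character(image_feat: Sequence[float],
                   pinyin_feat: Sequence[float],
                   char_feat: Sequence[float]) -> Vector:
    """Fuse the image, pinyin and character features of one character.
    Concatenation is an assumption made for this sketch."""
    return [*image_feat, *pinyin_feat, *char_feat]


def build_encoder_input(text: str,
                        image_fn: Callable[[str], Sequence[float]],
                        pinyin_fn: Callable[[str], Sequence[float]],
                        char_fn: Callable[[str], Sequence[float]]) -> List[Vector]:
    """Return the position fusion features of all characters, in input order.

    Each row is [position code, fused features...]; the rows together are
    what would be passed on to the encoding and decoding layer.
    """
    rows = []
    for index, ch in enumerate(text):
        fused = fuse_character(image_fn(ch), pinyin_fn(ch), char_fn(ch))
        rows.append([float(index), *fused])  # position code spliced in front
    return rows
```

Passing the extractors as arguments keeps the sketch independent of any particular glyph renderer or pinyin library; in a trained model they would be learned sub-networks.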
Optionally, the wind control module 406 is specifically configured to input, in response to a service request carrying a text to be sent, the text to be sent into a trained extraction model to obtain text features of the text to be sent, match the text features of the text to be sent with the text features of the risk sample, and determine a wind control policy of the service request according to a matching result.
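The matching step can be sketched as a nearest-neighbor comparison between the features of the text to be sent and the stored features of the risk samples. This is a minimal sketch: cosine similarity, the two thresholds `block_at` and `review_at`, and the policy names "block", "review" and "pass" are assumptions for illustration, since the specification does not fix the matching measure or the set of wind control strategies.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def choose_policy(query_feat, risk_feats, block_at=0.9, review_at=0.7):
    """Match the features of the text to be sent against the risk samples
    and pick a (hypothetical) policy from the best similarity found."""
    best = max((cosine(query_feat, r) for r in risk_feats), default=0.0)
    if best >= block_at:
        return "block", best
    if best >= review_at:
        return "review", best
    return "pass", best
```

A linear scan is used here for clarity; with a large database of risk samples, an approximate nearest-neighbor index would typically replace the loop.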
The present specification also provides a computer readable storage medium storing a computer program operable to perform a method of wind control of variant text similarity retrieval as provided in figure 1 above.
The present specification also provides a schematic structural diagram of the electronic device shown in fig. 5. At the hardware level, as shown in fig. 5, the electronic device includes a processor, an internal bus, a network interface, a memory, and a nonvolatile storage, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the nonvolatile storage into the memory and then runs it to implement the wind control method for variant text similar retrieval described above with respect to fig. 1.
Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the source code to be compiled must also be written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logical method flow can be readily obtained simply by lightly logic-programming the method flow in one of the hardware description languages described above and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, or the form of logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or the means for performing the various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units. Of course, when implementing the present specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the corresponding description of the method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (11)

1. A method of wind control for variant text similarity retrieval, comprising:
Acquiring an identified risk text;
performing variant on the risk text to obtain variant text of the risk text, wherein the semantics of the variant text and the semantics of the risk text are the same;
inputting the risk text and the variant text into an extraction model to be trained, respectively determining text characteristics of the risk text and the variant text, determining similarity between the text characteristics of the risk text and the variant text, and training the extraction model with the similarity being larger than a preset similarity threshold;
And determining text characteristics of the risk samples in the database through the trained extraction model, and performing wind control according to the text characteristics of the risk samples.
2. The method according to claim 1, wherein the risk text is subjected to variant, so as to obtain variant text of the risk text, and the method specifically comprises the following steps:
Determining text fragments to be varied in the risk text;
And carrying out variant on the determined text fragments of the risk text to be variant to obtain variant text of the risk text.
3. The method according to claim 2, wherein determining text segments to be varied in the risk text, in particular comprises:
Performing word segmentation on the risk text, determining the part of speech of each word segment in the risk text, and determining, from the word segments and according to the part of speech of each word segment, the word segment to be varied as the text segment to be varied; or
Performing word segmentation on the risk text, and determining, according to a preset keyword dictionary, the word segments that match the keyword dictionary as the text segment to be varied.
4. The method according to claim 2, wherein varying the determined text segment to be varied of the risk text specifically comprises:
screening at least part of characters from the characters of the text segment to be varied;
and aiming at each screened character, carrying out variation on the character in the risk text according to homophones of the character or pinyin of the character.
5. The method according to claim 2, wherein varying the determined text segment to be varied of the risk text specifically comprises:
screening at least part of characters from the characters of the text segment to be varied;
for each screened character, determining at least one of a near word of the character, a text corresponding to the character in other languages, a pictogram corresponding to the character and a text after splitting the character as a representation form corresponding to the character;
And carrying out variation on the character in the risk text according to the expression form corresponding to the character.
6. The method according to claim 1, wherein inputting the risk text and the variant text into an extraction model to be trained, determining text characteristics of the risk text and the variant text, respectively, comprises:
respectively inputting the risk text and the variant text as input data into an extraction model to be trained;
For each character in the input data, determining the image data of the character and the pinyin data of the character through a fusion layer of the extraction model to be trained;
Determining the image characteristics of the image data of the character and the pinyin characteristics of the pinyin data of the character;
carrying out feature fusion on the image features of the image data of the character, the pinyin features of the pinyin data of the character and the character to obtain fusion features of the character;
And splicing the fusion characteristics of the characters according to the sequence of the characters in the input data, and inputting the spliced fusion characteristics into a coding layer and a decoding layer to obtain the text characteristics of the input data.
7. The method of claim 6, wherein stitching the fusion features of the characters in the input data in the order of the characters in the input data, specifically comprises:
Adding position codes to each character in the input data according to the sequence of each character in the input data;
Splicing the position code and the fusion feature of each character to obtain the position fusion feature of that character, and splicing the position fusion features of the characters according to the order of the characters in the input data.
8. The method according to claim 1, wherein the wind control is performed according to the text characteristics of the risk sample, specifically comprising:
responding to a service request carrying a text to be sent, and inputting the text to be sent into a trained extraction model to obtain text characteristics of the text to be sent;
and matching the text characteristics of the text to be sent with the text characteristics of the risk sample, and determining the wind control strategy of the service request according to the matching result.
9. A wind control device for variant text similarity retrieval, comprising:
The acquisition module is used for acquiring the identified risk text;
the variant module is used for carrying out variant on the risk text to obtain variant text of the risk text, wherein the semantics of the variant text and the semantics of the risk text are the same;
The training module is used for inputting the risk text and the variant text into an extraction model to be trained, respectively determining the text characteristics of the risk text and the variant text, determining the similarity between the text characteristics of the risk text and the variant text, and training the extraction model with the similarity being larger than a preset similarity threshold;
And the wind control module is used for determining text characteristics of the risk samples in the database through the trained extraction model, and performing wind control according to the text characteristics of the risk samples.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-8.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-8 when executing the program.
CN202410384521.5A 2024-03-29 2024-03-29 Wind control method, device, medium and equipment for variant text similar retrieval Pending CN118227806A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410384521.5A CN118227806A (en) 2024-03-29 2024-03-29 Wind control method, device, medium and equipment for variant text similar retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410384521.5A CN118227806A (en) 2024-03-29 2024-03-29 Wind control method, device, medium and equipment for variant text similar retrieval

Publications (1)

Publication Number Publication Date
CN118227806A true CN118227806A (en) 2024-06-21

Family

ID=91508385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410384521.5A Pending CN118227806A (en) 2024-03-29 2024-03-29 Wind control method, device, medium and equipment for variant text similar retrieval

Country Status (1)

Country Link
CN (1) CN118227806A (en)

Similar Documents

Publication Publication Date Title
CN110263158B (en) Data processing method, device and equipment
CN115952272B (en) Method, device and equipment for generating dialogue information and readable storage medium
CN113221555B (en) Keyword recognition method, device and equipment based on multitasking model
CN107402945A (en) Word stock generating method and device, short text detection method and device
CN117076650B (en) Intelligent dialogue method, device, medium and equipment based on large language model
CN117591661B (en) Question-answer data construction method and device based on large language model
CN112417093B (en) Model training method and device
CN111611393A (en) Text classification method, device and equipment
CN112597301A (en) Voice intention recognition method and device
CN113887206B (en) Model training and keyword extraction method and device
CN116340467A (en) Text processing method, text processing device, electronic equipment and computer readable storage medium
CN117669598A (en) Safe and intelligent question-answering method and device and related equipment
CN116186330B (en) Video deduplication method and device based on multi-mode learning
CN110688460B (en) Risk identification method and device, readable storage medium and electronic equipment
CN116662657A (en) Model training and information recommending method, device, storage medium and equipment
CN118227806A (en) Wind control method, device, medium and equipment for variant text similar retrieval
CN114676257A (en) Conversation theme determining method and device
CN115017905A (en) Model training and information recommendation method and device
CN115658891B (en) Method and device for identifying intention, storage medium and electronic equipment
CN117744837A (en) Model training and text detection method and device, storage medium and equipment
CN115017915B (en) Model training and task execution method and device
CN117369783B (en) Training method and device for security code generation model
CN118069824A (en) Risk identification method and device, storage medium and electronic equipment
CN117079646B (en) Training method, device, equipment and storage medium of voice recognition model
CN117390163A (en) Fact verification method, device, medium and equipment based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination